**About Book Crossing Dataset**

This dataset was compiled by Cai-Nicolas Ziegler in 2004 and comprises three tables: users, books and ratings. Explicit ratings are expressed on a scale from 1 to 10 (higher values denote higher appreciation), and an implicit rating is expressed by 0.

Reference: http://www2.informatik.uni-freiburg.de/~cziegler/BX/ 

**Objective**

This project entails building a Book Recommender System for users based on user-based and item-based collaborative filtering approaches.

#### Execute the cell below to load the datasets

In [463]:
import numpy as np
import pandas as pd

In [464]:
# Load the data. on_bad_lines="warn" (pandas >= 1.3) skips malformed rows and reports them;
# it replaces the error_bad_lines=False option, which was removed in pandas 2.0.
books = pd.read_csv("books.csv", sep=";", on_bad_lines="warn", encoding="latin-1")
books.columns = ['ISBN', 'bookTitle', 'bookAuthor', 'yearOfPublication', 'publisher', 'imageUrlS', 'imageUrlM', 'imageUrlL']

users = pd.read_csv("users.csv", sep=";", on_bad_lines="warn", encoding="latin-1")
users.columns = ['userID', 'Location', 'Age']

ratings = pd.read_csv("ratings.csv", sep=";", on_bad_lines="warn", encoding="latin-1")
ratings.columns = ['userID', 'ISBN', 'bookRating']

b'Skipping line 6452: expected 8 fields, saw 9\nSkipping line 43667: expected 8 fields, saw 10\nSkipping line 51751: expected 8 fields, saw 9\n'
b'Skipping line 92038: expected 8 fields, saw 9\nSkipping line 104319: expected 8 fields, saw 9\nSkipping line 121768: expected 8 fields, saw 9\n'
b'Skipping line 144058: expected 8 fields, saw 9\nSkipping line 150789: expected 8 fields, saw 9\nSkipping line 157128: expected 8 fields, saw 9\nSkipping line 180189: expected 8 fields, saw 9\nSkipping line 185738: expected 8 fields, saw 9\n'
b'Skipping line 209388: expected 8 fields, saw 9\nSkipping line 220626: expected 8 fields, saw 9\nSkipping line 227933: expected 8 fields, saw 11\nSkipping line 228957: expected 8 fields, saw 10\nSkipping line 245933: expected 8 fields, saw 9\nSkipping line 251296: expected 8 fields, saw 9\nSkipping line 259941: expected 8 fields, saw 9\nSkipping line 261529: expected 8 fields, saw 9\n'


### Check the number of records and features in each dataset

In [465]:
books.shape

(271360, 8)

In [466]:
users.shape

(278858, 3)

In [467]:
ratings.shape

(1149780, 3)

## Exploring books dataset

In [468]:
books.head()

Unnamed: 0,ISBN,bookTitle,bookAuthor,yearOfPublication,publisher,imageUrlS,imageUrlM,imageUrlL
0,0195153448,Classical Mythology,Mark P. O. Morford,2002,Oxford University Press,http://images.amazon.com/images/P/0195153448.0...,http://images.amazon.com/images/P/0195153448.0...,http://images.amazon.com/images/P/0195153448.0...
1,0002005018,Clara Callan,Richard Bruce Wright,2001,HarperFlamingo Canada,http://images.amazon.com/images/P/0002005018.0...,http://images.amazon.com/images/P/0002005018.0...,http://images.amazon.com/images/P/0002005018.0...
2,0060973129,Decision in Normandy,Carlo D'Este,1991,HarperPerennial,http://images.amazon.com/images/P/0060973129.0...,http://images.amazon.com/images/P/0060973129.0...,http://images.amazon.com/images/P/0060973129.0...
3,0374157065,Flu: The Story of the Great Influenza Pandemic...,Gina Bari Kolata,1999,Farrar Straus Giroux,http://images.amazon.com/images/P/0374157065.0...,http://images.amazon.com/images/P/0374157065.0...,http://images.amazon.com/images/P/0374157065.0...
4,0393045218,The Mummies of Urumchi,E. J. W. Barber,1999,W. W. Norton & Company,http://images.amazon.com/images/P/0393045218.0...,http://images.amazon.com/images/P/0393045218.0...,http://images.amazon.com/images/P/0393045218.0...


### Drop the last three columns (image URLs), which are not required for the analysis

In [469]:
books.drop(columns=['imageUrlS', 'imageUrlM', 'imageUrlL'], inplace=True)

In [470]:
books.head()

Unnamed: 0,ISBN,bookTitle,bookAuthor,yearOfPublication,publisher
0,0195153448,Classical Mythology,Mark P. O. Morford,2002,Oxford University Press
1,0002005018,Clara Callan,Richard Bruce Wright,2001,HarperFlamingo Canada
2,0060973129,Decision in Normandy,Carlo D'Este,1991,HarperPerennial
3,0374157065,Flu: The Story of the Great Influenza Pandemic...,Gina Bari Kolata,1999,Farrar Straus Giroux
4,0393045218,The Mummies of Urumchi,E. J. W. Barber,1999,W. W. Norton & Company


**yearOfPublication**

### Check unique values of yearOfPublication


In [471]:
books.yearOfPublication.unique()

array([2002, 2001, 1991, 1999, 2000, 1993, 1996, 1988, 2004, 1998, 1994,
       2003, 1997, 1983, 1979, 1995, 1982, 1985, 1992, 1986, 1978, 1980,
       1952, 1987, 1990, 1981, 1989, 1984, 0, 1968, 1961, 1958, 1974,
       1976, 1971, 1977, 1975, 1965, 1941, 1970, 1962, 1973, 1972, 1960,
       1966, 1920, 1956, 1959, 1953, 1951, 1942, 1963, 1964, 1969, 1954,
       1950, 1967, 2005, 1957, 1940, 1937, 1955, 1946, 1936, 1930, 2011,
       1925, 1948, 1943, 1947, 1945, 1923, 2020, 1939, 1926, 1938, 2030,
       1911, 1904, 1949, 1932, 1928, 1929, 1927, 1931, 1914, 2050, 1934,
       1910, 1933, 1902, 1924, 1921, 1900, 2038, 2026, 1944, 1917, 1901,
       2010, 1908, 1906, 1935, 1806, 2021, '2000', '1995', '1999', '2004',
       '2003', '1990', '1994', '1986', '1989', '2002', '1981', '1993',
       '1983', '1982', '1976', '1991', '1977', '1998', '1992', '1996',
       '0', '1997', '2001', '1974', '1968', '1987', '1984', '1988',
       '1963', '1956', '1970', '1985', '1978', '1973', '1980'

The output above shows some incorrect entries in this field. The publisher names 'DK Publishing Inc' and 'Gallimard' appear to have been loaded as yearOfPublication because of errors in the CSV file that shifted the columns of those rows.

In addition, the same years appear both as integers and as strings (for example 2000 and '2000'). The following steps fix both problems.
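One way to locate such entries without reading the whole array is to coerce the column to numbers and look at what fails to parse. The snippet below is a minimal sketch on a made-up toy column (the values are illustrative, not taken from `books.csv`):

```python
import pandas as pd

# Toy column mimicking the mix described above: ints, numeric strings,
# and a stray publisher name. These values are hypothetical.
years = pd.Series([2002, "1999", 0, "DK Publishing Inc", "2000"], dtype=object)

# errors="coerce" turns anything non-numeric into NaN instead of raising.
as_numbers = pd.to_numeric(years, errors="coerce")

# Entries that failed to parse are the ones that need manual inspection.
bad_rows = years[as_numbers.isna()]
print(bad_rows.tolist())  # ['DK Publishing Inc']
```

The same call also unifies `2000` and `'2000'` into a single numeric value, which is why a plain `pd.to_numeric` is enough once the bad rows are gone.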

### Check the rows having 'DK Publishing Inc' as yearOfPublication

In [472]:
books.loc[books['yearOfPublication'] == 'DK Publishing Inc']

Unnamed: 0,ISBN,bookTitle,bookAuthor,yearOfPublication,publisher
209538,078946697X,"DK Readers: Creating the X-Men, How It All Beg...",2000,DK Publishing Inc,http://images.amazon.com/images/P/078946697X.0...
221678,0789466953,"DK Readers: Creating the X-Men, How Comic Book...",2000,DK Publishing Inc,http://images.amazon.com/images/P/0789466953.0...


In [473]:
books.loc[books['yearOfPublication'].isin(['DK Publishing Inc','Gallimard'])]

Unnamed: 0,ISBN,bookTitle,bookAuthor,yearOfPublication,publisher
209538,078946697X,"DK Readers: Creating the X-Men, How It All Beg...",2000,DK Publishing Inc,http://images.amazon.com/images/P/078946697X.0...
220731,2070426769,"Peuple du ciel, suivi de 'Les Bergers\"";Jean-M...",2003,Gallimard,http://images.amazon.com/images/P/2070426769.0...
221678,0789466953,"DK Readers: Creating the X-Men, How Comic Book...",2000,DK Publishing Inc,http://images.amazon.com/images/P/0789466953.0...


### Drop the rows having `'DK Publishing Inc'` and `'Gallimard'` as `yearOfPublication`

In [474]:
books.shape

(271360, 5)

In [475]:
books = books[~books['yearOfPublication'].isin(['DK Publishing Inc', 'Gallimard'])].copy()  # copy so later column assignments act on an independent frame

In [476]:
books.shape

(271357, 5)

### Change the datatype of yearOfPublication to 'int'

In [477]:
books.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 271357 entries, 0 to 271359
Data columns (total 5 columns):
ISBN                 271357 non-null object
bookTitle            271357 non-null object
bookAuthor           271356 non-null object
yearOfPublication    271357 non-null object
publisher            271355 non-null object
dtypes: object(5)
memory usage: 12.4+ MB


In [478]:
books['yearOfPublication'] = pd.to_numeric(books['yearOfPublication'])

In [479]:
books.dtypes

ISBN                 object
bookTitle            object
bookAuthor           object
yearOfPublication     int64
publisher            object
dtype: object

### Drop NaNs in `'publisher'` column


In [480]:
books.isna().sum()

ISBN                 0
bookTitle            0
bookAuthor           1
yearOfPublication    0
publisher            2
dtype: int64

In [481]:
books.dropna(subset=['publisher'],inplace=True)

In [482]:
books.isna().sum()

ISBN                 0
bookTitle            0
bookAuthor           1
yearOfPublication    0
publisher            0
dtype: int64

In [483]:
books.dropna(inplace=True)

In [484]:
books.isna().sum()

ISBN                 0
bookTitle            0
bookAuthor           0
yearOfPublication    0
publisher            0
dtype: int64

## Exploring Users dataset

In [485]:
print(users.shape)
users.head()

(278858, 3)


Unnamed: 0,userID,Location,Age
0,1,"nyc, new york, usa",
1,2,"stockton, california, usa",18.0
2,3,"moscow, yukon territory, russia",
3,4,"porto, v.n.gaia, portugal",17.0
4,5,"farnborough, hants, united kingdom",


### Get all unique values in ascending order for column `Age`

In [486]:
users.sort_values(by=['Age'])['Age'].unique()

array([  0.,   1.,   2.,   3.,   4.,   5.,   6.,   7.,   8.,   9.,  10.,
        11.,  12.,  13.,  14.,  15.,  16.,  17.,  18.,  19.,  20.,  21.,
        22.,  23.,  24.,  25.,  26.,  27.,  28.,  29.,  30.,  31.,  32.,
        33.,  34.,  35.,  36.,  37.,  38.,  39.,  40.,  41.,  42.,  43.,
        44.,  45.,  46.,  47.,  48.,  49.,  50.,  51.,  52.,  53.,  54.,
        55.,  56.,  57.,  58.,  59.,  60.,  61.,  62.,  63.,  64.,  65.,
        66.,  67.,  68.,  69.,  70.,  71.,  72.,  73.,  74.,  75.,  76.,
        77.,  78.,  79.,  80.,  81.,  82.,  83.,  84.,  85.,  86.,  87.,
        88.,  89.,  90.,  91.,  92.,  93.,  94.,  95.,  96.,  97.,  98.,
        99., 100., 101., 102., 103., 104., 105., 106., 107., 108., 109.,
       110., 111., 113., 114., 115., 116., 118., 119., 123., 124., 127.,
       128., 132., 133., 136., 137., 138., 140., 141., 143., 146., 147.,
       148., 151., 152., 156., 157., 159., 162., 168., 172., 175., 183.,
       186., 189., 199., 200., 201., 204., 207., 20

The Age column has some invalid entries: NaN, 0, and implausibly high values of 100 and above.

### Ages below 5 and above 90 are implausible for book ratings, so replace them with NaN

In [487]:
users['Age'] = users['Age'].mask(users['Age'] < 5)  # ages below 5 become NaN

In [488]:
users.sort_values(by=['Age'])['Age'].unique()

array([  5.,   6.,   7.,   8.,   9.,  10.,  11.,  12.,  13.,  14.,  15.,
        16.,  17.,  18.,  19.,  20.,  21.,  22.,  23.,  24.,  25.,  26.,
        27.,  28.,  29.,  30.,  31.,  32.,  33.,  34.,  35.,  36.,  37.,
        38.,  39.,  40.,  41.,  42.,  43.,  44.,  45.,  46.,  47.,  48.,
        49.,  50.,  51.,  52.,  53.,  54.,  55.,  56.,  57.,  58.,  59.,
        60.,  61.,  62.,  63.,  64.,  65.,  66.,  67.,  68.,  69.,  70.,
        71.,  72.,  73.,  74.,  75.,  76.,  77.,  78.,  79.,  80.,  81.,
        82.,  83.,  84.,  85.,  86.,  87.,  88.,  89.,  90.,  91.,  92.,
        93.,  94.,  95.,  96.,  97.,  98.,  99., 100., 101., 102., 103.,
       104., 105., 106., 107., 108., 109., 110., 111., 113., 114., 115.,
       116., 118., 119., 123., 124., 127., 128., 132., 133., 136., 137.,
       138., 140., 141., 143., 146., 147., 148., 151., 152., 156., 157.,
       159., 162., 168., 172., 175., 183., 186., 189., 199., 200., 201.,
       204., 207., 208., 209., 210., 212., 219., 22

In [489]:
users.Age.isna().sum()

111644

In [491]:
users['Age'] = users['Age'].mask(users['Age'] > 90)  # ages above 90 become NaN

In [492]:
users.sort_values(by=['Age'])['Age'].unique()

array([ 5.,  6.,  7.,  8.,  9., 10., 11., 12., 13., 14., 15., 16., 17.,
       18., 19., 20., 21., 22., 23., 24., 25., 26., 27., 28., 29., 30.,
       31., 32., 33., 34., 35., 36., 37., 38., 39., 40., 41., 42., 43.,
       44., 45., 46., 47., 48., 49., 50., 51., 52., 53., 54., 55., 56.,
       57., 58., 59., 60., 61., 62., 63., 64., 65., 66., 67., 68., 69.,
       70., 71., 72., 73., 74., 75., 76., 77., 78., 79., 80., 81., 82.,
       83., 84., 85., 86., 87., 88., 89., 90., nan])

In [493]:
users.Age.isna().sum()

112074

In [495]:
users.loc[users['Age']>90].sum()

userID      0.0
Location    0.0
Age         0.0
dtype: float64
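The two threshold steps above can also be written as a single range check. This is a minimal sketch on a toy Series (the ages are made up); it is an alternative formulation, not the code the notebook ran:

```python
import numpy as np
import pandas as pd

# Hypothetical ages covering both invalid ends and an existing NaN.
ages = pd.Series([0.0, 4.0, 5.0, 34.0, 90.0, 91.0, 244.0, np.nan])

# Series.between is inclusive on both ends by default, so 5 and 90 are kept.
# Series.where keeps values where the condition is True and writes NaN elsewhere.
cleaned = ages.where(ages.between(5, 90))
print(cleaned.tolist())  # [nan, nan, 5.0, 34.0, 90.0, nan, nan, nan]
```

Existing NaNs stay NaN because `between` evaluates to False for them.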

### Replace null values in column `Age` with mean

In [496]:
users['Age'] = users['Age'].fillna(users['Age'].mean())

In [497]:
users.Age.isna().sum()

0

No NaN values remain.

### Change the datatype of `Age` to `int`

In [498]:
users.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 278858 entries, 0 to 278857
Data columns (total 3 columns):
userID      278858 non-null int64
Location    278858 non-null object
Age         278858 non-null float64
dtypes: float64(1), int64(1), object(1)
memory usage: 6.4+ MB


In [499]:
users['Age'] = users.Age.astype('int64')

In [500]:
users.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 278858 entries, 0 to 278857
Data columns (total 3 columns):
userID      278858 non-null int64
Location    278858 non-null object
Age         278858 non-null int64
dtypes: int64(2), object(1)
memory usage: 6.4+ MB


In [501]:
print(sorted(users.Age.unique()))

[5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90]
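Note that `astype('int64')` truncates the fractional part, so a mean-imputed age is rounded down rather than to the nearest integer. The sketch below illustrates the difference on a toy Series with made-up ages; whether rounding matters for the notebook's data is left open:

```python
import numpy as np
import pandas as pd

# Hypothetical ages; the mean of the known values is 104 / 3, about 34.67.
ages = pd.Series([20.0, 30.0, 54.0, np.nan])
filled = ages.fillna(ages.mean())

truncated = filled.astype("int64")         # drops the fraction: 34.67 -> 34
rounded = filled.round().astype("int64")   # rounds first:       34.67 -> 35

print(truncated.iloc[-1], rounded.iloc[-1])  # 34 35
```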


## Exploring the Ratings Dataset

### Check the shape

In [502]:
ratings.shape

(1149780, 3)

In [503]:
n_users = users.shape[0]
n_books = books.shape[0]

In [507]:
n_users, n_books

(278858, 271354)

In [504]:
ratings.head(5)

Unnamed: 0,userID,ISBN,bookRating
0,276725,034545104X,0
1,276726,0155061224,5
2,276727,0446520802,0
3,276729,052165615X,3
4,276729,0521795028,6


### The ratings dataset should contain only books that exist in the books dataset. Drop the remaining rows

In [508]:
ratings[ratings.ISBN.isin(books.ISBN.values)].count()

userID        1031129
ISBN          1031129
bookRating    1031129
dtype: int64

In [509]:
ratings[~ratings.ISBN.isin(books.ISBN.values)].count()

userID        118651
ISBN          118651
bookRating    118651
dtype: int64

In [510]:
ratings = ratings[ratings.ISBN.isin(books.ISBN.values)]

In [511]:
ratings.shape

(1031129, 3)

### The ratings dataset should contain only ratings from users that exist in the users dataset. Drop the remaining rows

In [512]:
ratings[ratings.userID.isin(users.userID.values)].count()

userID        1031129
ISBN          1031129
bookRating    1031129
dtype: int64

In [513]:
ratings[~ratings.userID.isin(users.userID.values)].count()

userID        0
ISBN          0
bookRating    0
dtype: int64

In [514]:
ratings = ratings[ratings.userID.isin(users.userID.values)]

In [515]:
ratings.shape

(1031129, 3)
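The `isin` filters above answer "which ratings have a matching key". A left merge with `indicator=True` gives the same split and labels each row, which can be handy for inspecting the unmatched ones. This is a minimal sketch on toy frames (all values are hypothetical), and it assumes the key is unique in the lookup table, since duplicate keys would duplicate rating rows:

```python
import pandas as pd

# Toy stand-ins for books and ratings.
books_toy = pd.DataFrame({"ISBN": ["A", "B"]})
ratings_toy = pd.DataFrame(
    {"userID": [1, 1, 2], "ISBN": ["A", "C", "B"], "bookRating": [5, 7, 0]}
)

# indicator=True adds a "_merge" column: "both" or "left_only" for a left join.
checked = ratings_toy.merge(books_toy[["ISBN"]], on="ISBN", how="left", indicator=True)

orphans = checked[checked["_merge"] == "left_only"]               # ratings for unknown books
kept = checked[checked["_merge"] == "both"].drop(columns="_merge")

print(orphans["ISBN"].tolist())  # ['C']
```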

### Keep only explicit ratings (1-10) and drop the 0s (implicit ratings) in column `bookRating`

In [516]:
ratings.head()

Unnamed: 0,userID,ISBN,bookRating
0,276725,034545104X,0
1,276726,0155061224,5
2,276727,0446520802,0
3,276729,052165615X,3
4,276729,0521795028,6


In [517]:
sorted(ratings.bookRating.unique())


[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10]

In [518]:
ratings[ratings.bookRating != 0].count()

userID        383838
ISBN          383838
bookRating    383838
dtype: int64

In [519]:
ratings = ratings[ratings.bookRating != 0]

In [520]:
ratings.shape

(383838, 3)

### Find out which rating has been given the highest number of times

In [521]:
ratings.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 383838 entries, 1 to 1149778
Data columns (total 3 columns):
userID        383838 non-null int64
ISBN          383838 non-null object
bookRating    383838 non-null int64
dtypes: int64(2), object(1)
memory usage: 11.7+ MB


In [522]:
ratings.groupby(['bookRating']).count()

Unnamed: 0_level_0,userID,ISBN
bookRating,Unnamed: 1_level_1,Unnamed: 2_level_1
1,1481,1481
2,2375,2375
3,5118,5118
4,7617,7617
5,45355,45355
6,31687,31687
7,66401,66401
8,91803,91803
9,60776,60776
10,71225,71225


Rating 8 has been given most often (91,803 times), followed by rating 10 (71,225 times).
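The most frequent rating can also be read off programmatically instead of scanning the table. A minimal sketch on a toy Series with made-up ratings:

```python
import pandas as pd

# Hypothetical explicit ratings.
toy_ratings = pd.Series([8, 8, 10, 7, 8, 10], name="bookRating")

counts = toy_ratings.value_counts()   # occurrences of each rating value
most_common = counts.idxmax()         # the rating with the largest count

print(most_common, counts.loc[most_common])  # 8 3
```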

### **Collaborative Filtering Based Recommendation Systems**

### For more accurate results, only consider users who have rated at least 100 books

In [523]:
ratings.head()

Unnamed: 0,userID,ISBN,bookRating
1,276726,0155061224,5
3,276729,052165615X,3
4,276729,0521795028,6
8,276744,038550120X,7
16,276747,0060517794,9


In [525]:
ratings.groupby(['userID'],as_index=False).count()

Unnamed: 0,userID,ISBN,bookRating
0,8,7,7
1,9,1,1
2,12,1,1
3,14,3,3
4,16,1,1
5,17,4,4
6,19,1,1
7,22,1,1
8,26,2,2
9,32,1,1


In [526]:
ratings.columns

Index(['userID', 'ISBN', 'bookRating'], dtype='object')

In [527]:
ratings2 = ratings.groupby(['userID'],as_index=False).count()

In [528]:
ratings2.columns

Index(['userID', 'ISBN', 'bookRating'], dtype='object')

In [529]:
ratings2[ratings2['ISBN'] >= 100]

Unnamed: 0,userID,ISBN,bookRating
481,2033,129,129
508,2110,103,103
554,2276,196,196
967,4017,154,154
1055,4385,212,212
1294,5582,132,132
1457,6242,134,134
1461,6251,217,217
1539,6543,174,174
1548,6575,233,233


In [530]:
ratings3 = ratings2[ratings2['ISBN'] >= 100]

In [531]:
ratings.columns

Index(['userID', 'ISBN', 'bookRating'], dtype='object')

In [532]:
ratings4 = ratings[ratings.userID.isin(ratings3.userID.values)]

In [533]:
ratings4.nunique()

userID          449
ISBN          66572
bookRating       10
dtype: int64
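The count, threshold and `isin` steps above can be collapsed into one filter with a grouped `transform`, which broadcasts each user's rating count back onto that user's rows. The sketch below uses a toy frame and a threshold of 3 in place of 100; the names `toy` and `min_ratings` are illustrative:

```python
import pandas as pd

# Toy ratings: user 1 has three ratings, user 2 has one, user 3 has two.
toy = pd.DataFrame(
    {
        "userID": [1, 1, 1, 2, 3, 3],
        "ISBN": list("abcdef"),
        "bookRating": [5, 6, 7, 8, 9, 10],
    }
)
min_ratings = 3  # stands in for the threshold of 100 used above

# transform("count") returns one value per original row: that user's rating count.
per_row_count = toy.groupby("userID")["ISBN"].transform("count")
active = toy[per_row_count >= min_ratings]

print(active["userID"].unique().tolist())  # [1]
```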

In [None]:
# Now Build the full data set at User, Book and Rating level for all the users that have rated at least 100 Books

In [535]:
# Attach book metadata to every rating of the active users (inner join on ISBN)
ratings_user_set = ratings4.merge(books, how='inner', on=['ISBN'])

In [536]:
ratings_user_set.head()

Unnamed: 0,userID,ISBN,bookRating,bookTitle,bookAuthor,yearOfPublication,publisher
0,277427,002542730X,10,Politically Correct Bedtime Stories: Modern Ta...,James Finn Garner,1994,John Wiley & Sons Inc
1,11676,002542730X,6,Politically Correct Bedtime Stories: Modern Ta...,James Finn Garner,1994,John Wiley & Sons Inc
2,12538,002542730X,10,Politically Correct Bedtime Stories: Modern Ta...,James Finn Garner,1994,John Wiley & Sons Inc
3,52584,002542730X,10,Politically Correct Bedtime Stories: Modern Ta...,James Finn Garner,1994,John Wiley & Sons Inc
4,110934,002542730X,6,Politically Correct Bedtime Stories: Modern Ta...,James Finn Garner,1994,John Wiley & Sons Inc


In [537]:
ratings_user_set.nunique()

userID                 449
ISBN                 66572
bookRating              10
bookTitle            61648
bookAuthor           29474
yearOfPublication       89
publisher             6027
dtype: int64

### Generating ratings matrix from explicit ratings


#### Note: since NaNs cannot be handled by training algorithms, replace these by 0, which indicates absence of ratings

In [442]:
ratings4.isna().sum()

userID        0
ISBN          0
bookRating    0
dtype: int64

In [443]:
books.isna().sum()

ISBN                 0
bookTitle            0
bookAuthor           0
yearOfPublication    0
publisher            0
dtype: int64

In [444]:
users.isna().sum()

userID      0
Location    0
Age         0
dtype: int64

In [558]:
# Build the books DataFrame restricted to the books that appear in ratings_user_set
books_df = books[books['ISBN'].isin(ratings_user_set['ISBN'])]

### Generate the predicted ratings using SVD with 50 singular values

In [538]:
# The ratings matrix needs one row per user and one column per book.
# The ratings arrive in long format (one row per rating), so pivot
# ratings_user_set into the wide matrix R_df.
# Missing ratings are filled with 0.0, which makes the matrix sparse.
R_df = ratings_user_set.pivot(index='userID', columns='ISBN', values='bookRating').fillna(0)
R_df.tail()

ISBN,0000913154,0001046438,000104687X,0001047213,0001047973,000104799X,0001048082,0001053736,0001053744,0001055607,...,B000092Q0A,B00009EF82,B00009NDAN,B0000DYXID,B0000T6KHI,B0000VZEJQ,B0000X8HIE,B00013AX9E,B0001I1KOG,B000234N3A
userID,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
274061,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
274301,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,10.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
275970,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
277427,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
278418,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [539]:
R_df.shape

(449, 66572)
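To get a feel for how sparse such a user-by-book matrix is, the share of non-zero cells can be computed directly. This is a minimal sketch on a toy long-format frame with made-up ratings; it assumes each (userID, ISBN) pair occurs only once, because `pivot` raises an error on duplicate pairs:

```python
import pandas as pd

# Toy long-format ratings: 3 users, 3 books, 4 ratings.
toy = pd.DataFrame(
    {
        "userID": [1, 1, 2, 3],
        "ISBN": ["a", "b", "a", "c"],
        "bookRating": [5, 8, 10, 7],
    }
)

# Same reshaping as above: one row per user, one column per book, 0 for "no rating".
R_toy = toy.pivot(index="userID", columns="ISBN", values="bookRating").fillna(0)

# Fraction of cells that hold an actual rating.
density = (R_toy != 0).to_numpy().mean()
print(R_toy.shape, round(density, 3))  # (3, 3) 0.444
```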

In [540]:
from scipy.sparse.linalg import svds
# Singular value decomposition
# k: number of singular values and vectors to compute; must satisfy 1 <= k < min(R_df.shape)
# U and Vt hold the singular vectors; sigma holds the k singular values
# svds expects an array or sparse matrix, so pass the underlying NumPy array
U, sigma, Vt = svds(R_df.to_numpy(), k=50)

In [541]:
# Convert the 1-D array of singular values into a diagonal matrix
sigma = np.diag(sigma)

In [542]:
all_user_predicted_ratings = np.dot(np.dot(U, sigma), Vt) 
preds_df = pd.DataFrame(all_user_predicted_ratings, columns = R_df.columns)

In [543]:
#predictions of book ratings
preds_df.head()

ISBN,0000913154,0001046438,000104687X,0001047213,0001047973,000104799X,0001048082,0001053736,0001053744,0001055607,...,B000092Q0A,B00009EF82,B00009NDAN,B0000DYXID,B0000T6KHI,B0000VZEJQ,B0000X8HIE,B00013AX9E,B0001I1KOG,B000234N3A
0,0.025341,-0.002146,-0.001431,-0.002146,-0.002146,0.002971,-0.00392,0.007035,0.007035,0.012316,...,0.00018,0.000226,0.042081,-0.016804,-0.080028,0.004746,0.028314,0.00012,-0.001693,0.067503
1,-0.010012,-0.003669,-0.002446,-0.003669,-0.003669,0.001075,0.00144,-0.0035,-0.0035,0.001612,...,-0.000363,0.000403,0.008142,0.001104,-0.029224,0.000999,0.002363,-0.000242,2.9e-05,-0.013059
2,-0.015054,-0.015457,-0.010304,-0.015457,-0.015457,0.007281,-0.014033,0.011941,0.011941,0.011796,...,-0.000455,0.001907,0.047982,0.005737,0.117859,0.006945,0.003119,-0.000304,0.009009,-0.057692
3,-0.021499,0.035602,0.023735,0.035602,0.035602,0.030307,0.024215,-0.001053,-0.001053,0.067579,...,0.002971,0.009912,0.086248,-0.008818,0.016154,0.028848,-0.000125,0.001981,0.031201,-0.046664
4,0.002077,-0.007965,-0.00531,-0.007965,-0.007965,0.002947,0.003057,0.000231,0.000231,0.00608,...,0.00212,0.001597,-0.012181,0.00942,0.673459,0.002591,-0.008229,0.001413,0.004918,0.047773


In [544]:
pd.DataFrame(all_user_predicted_ratings).shape

(449, 66572)
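The prediction matrix above is a rank-k approximation of the ratings matrix. The sketch below shows the same reconstruction on a small random matrix, using NumPy's dense `np.linalg.svd` truncated to k components as a stand-in for `svds` (an assumption made to keep the example small). It also checks the known property that the Frobenius-norm error of the truncated reconstruction equals the root of the sum of squares of the discarded singular values:

```python
import numpy as np

rng = np.random.default_rng(0)
R = rng.integers(0, 11, size=(6, 8)).astype(float)  # toy 6-user x 8-book matrix

# Dense SVD; NumPy returns the singular values sorted in descending order.
U, s, Vt = np.linalg.svd(R, full_matrices=False)

k = 2  # number of singular values to keep (50 in the notebook)
R_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

err = np.linalg.norm(R - R_k)             # Frobenius norm of the residual
expected = np.sqrt(np.sum(s[k:] ** 2))    # energy in the discarded components
print(np.isclose(err, expected))
```

A larger k lowers the reconstruction error but also reproduces more of the zeros that stand for missing ratings, so k is a tuning choice rather than a fixed value.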

### Take a particular user_id

### Let's find the recommendations for the user with id `2110`

#### Note: Execute the cells below to load the variables

In [545]:
userID = 2110

In [546]:
user_id = 2  # 0-based positional row in R_df and preds_df (iloc[2] is the third row); R_df.index.get_loc(userID) returns the position for a given userID
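Because `preds_df` is built without the user IDs as its index, its rows have to be addressed by position. A minimal sketch, on a toy matrix with made-up values, of looking that position up from the ID instead of hard-coding it, or of avoiding positions altogether by carrying the index over:

```python
import pandas as pd

# Toy ratings matrix indexed by userID; the ratings are hypothetical.
R_toy = pd.DataFrame(
    [[0.0, 5.0], [7.0, 0.0], [0.0, 9.0]],
    index=pd.Index([2033, 2110, 2276], name="userID"),
    columns=["isbn_a", "isbn_b"],
)

# Position of a user in the matrix, derived from the ID.
row_pos = R_toy.index.get_loc(2110)
print(row_pos)  # 1

# Alternative: keep the userID index on the prediction frame and use .loc.
preds_toy = pd.DataFrame(R_toy.to_numpy(), index=R_toy.index, columns=R_toy.columns)
by_label = preds_toy.loc[2110]
```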

### Get the predicted ratings for userID `2110` and sort them in descending order

In [547]:
sorted_user_predictions=preds_df.iloc[user_id].sort_values(ascending=False)
sorted_user_predictions.head()

ISBN
0316666343    1.015397
059035342X    0.778665
0345350499    0.697309
0440214041    0.665439
044021145X    0.663549
Name: 2, dtype: float64

### Create a dataframe named `user_data` containing the books that userID `2110` has explicitly rated

In [548]:
user_data=ratings_user_set[ratings_user_set["userID"]==2110].drop(labels=["bookTitle","bookAuthor","yearOfPublication","publisher"],axis=1)

In [549]:
user_data.head()

Unnamed: 0,userID,ISBN,bookRating
1516,2110,0060987529,7
1533,2110,0064472779,8
1535,2110,0140022651,10
1536,2110,0142302163,8
1537,2110,0151008116,5


In [550]:
user_data.shape

(103, 3)

### Combine `user_data` and the corresponding book data (`book_data`) in a single dataframe named `user_full_info`

In [551]:
book_data=books.merge(user_data, how='inner', on=['ISBN']).drop(labels=["userID","bookRating"],axis=1)

In [552]:
book_data.shape

(103, 5)

In [553]:
book_data.head()

Unnamed: 0,ISBN,bookTitle,bookAuthor,yearOfPublication,publisher
0,0151008116,Life of Pi,Yann Martel,2002,Harcourt
1,015216250X,So You Want to Be a Wizard: The First Book in ...,Diane Duane,2001,Magic Carpet Books
2,0064472779,All-American Girl,Meg Cabot,2003,HarperTrophy
3,0345307674,Return of the Jedi (Star Wars),James Kahn,1983,Del Rey Books
4,0671527215,Hitchhikers's Guide to the Galaxy,Douglas Adams,1984,Pocket


In [555]:
user_full_info=pd.merge(user_data,book_data,how='inner',on='ISBN')

In [556]:
user_full_info.head()

Unnamed: 0,userID,ISBN,bookRating,bookTitle,bookAuthor,yearOfPublication,publisher
0,2110,0060987529,7,Confessions of an Ugly Stepsister : A Novel,Gregory Maguire,2000,Regan Books
1,2110,0064472779,8,All-American Girl,Meg Cabot,2003,HarperTrophy
2,2110,0140022651,10,Journey to the Center of the Earth,Jules Verne,1965,Penguin Books
3,2110,0142302163,8,The Ghost Sitter,Peni R. Griffin,2002,Puffin Books
4,2110,0151008116,5,Life of Pi,Yann Martel,2002,Harcourt


### Get top 10 recommendations for above given userID from the books not already rated by that user

In [559]:
# Collect all books the user hasn't rated, with their predicted ratings, in a DataFrame called recommendations
predictions = sorted_user_predictions.rename('Predictions').reset_index()
recommendations = (
    books_df[~books_df['ISBN'].isin(user_full_info['ISBN'])]
    .merge(predictions, how='left', on='ISBN')
)

In [560]:
# Sort the recommendations in descending order and get the top 10 Recommendations for the user 2110
recommendations.sort_values('Predictions', ascending = False).iloc[0:10,:]

Unnamed: 0,ISBN,bookTitle,bookAuthor,yearOfPublication,publisher,Predictions
237,0316666343,The Lovely Bones: A Novel,Alice Sebold,2002,"Little, Brown",1.015397
1164,0345350499,The Mists of Avalon,MARION ZIMMER BRADLEY,1987,Del Rey,0.697309
1322,0440214041,The Pelican Brief,John Grisham,1993,Dell,0.665439
252,044021145X,The Firm,John Grisham,1992,Bantam Dell Publishing Group,0.663549
292,0312195516,The Red Tent (Bestselling Backlist),Anita Diamant,1998,Picador USA,0.64284
10050,0345318862,Golem in the Gears (Xanth Novels (Paperback)),PIERS ANTHONY,1986,Del Rey,0.639465
2545,0345313151,Bearing an Hourglass (Incarnations of Immortal...,Piers Anthony,1991,Del Rey Books,0.631446
3287,0380752891,"Man from Mundania (Xanth Trilogy, No 12)",Piers Anthony,1990,Harper Mass Market Paperbacks,0.629143
19741,051511605X,Undue Influence,Steven Paul Martini,1995,Jove Books,0.617955
4570,043936213X,Harry Potter and the Sorcerer's Stone (Book 1),J. K. Rowling,2001,Scholastic,0.614288


In [562]:
# Test to see if any of the recommendations given already exist for the user 2110. Result should be false
ratings4[ratings4["userID"]==2110].ISBN.isin(recommendations['ISBN']).unique()

array([False])
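The filter, sort and slice steps above can be wrapped in a small helper. This is a minimal sketch: the function name `top_n_unrated` and the toy scores are hypothetical, and it works on a Series of predicted scores indexed by ISBN rather than on the merged book frame:

```python
import pandas as pd


def top_n_unrated(predictions: pd.Series, rated_isbns, n: int = 10) -> pd.Series:
    """Return the n highest predicted scores among items not in rated_isbns."""
    unrated = predictions[~predictions.index.isin(list(rated_isbns))]
    return unrated.sort_values(ascending=False).head(n)


# Toy predicted scores indexed by ISBN.
preds = pd.Series({"a": 0.9, "b": 0.4, "c": 0.7, "d": 0.1})

top = top_n_unrated(preds, rated_isbns=["a"], n=2)
print(top.index.tolist())  # ['c', 'b']
```

Checking that none of the returned ISBNs appear in the user's rated list gives the same kind of sanity check as the last cell above, but on the final top-n result.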