**About the Book-Crossing Dataset**

This dataset was compiled by Cai-Nicolas Ziegler in 2004 and comprises three tables: users, books and ratings. Explicit ratings are expressed on a scale from 1 to 10 (higher values denote higher appreciation), and an implicit rating is expressed by 0.

Reference: http://www2.informatik.uni-freiburg.de/~cziegler/BX/ 

**Objective**

This project entails building a Book Recommender System for users based on user-based and item-based collaborative filtering approaches.

#### Execute the below cell to load the datasets

In [1]:
import pandas as pd
import numpy as np
import os
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from scipy.stats import zscore


import warnings
warnings.filterwarnings('ignore')
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

In [2]:
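# Loading data
# Note: error_bad_lines was deprecated in pandas 1.3 and removed in 2.0;
# on newer versions pass on_bad_lines="warn" (or "skip") instead.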
books = pd.read_csv("books.csv", sep=";", error_bad_lines=False, encoding="latin-1")
books.columns = ['ISBN', 'bookTitle', 'bookAuthor', 'yearOfPublication', 'publisher', 'imageUrlS', 'imageUrlM', 'imageUrlL']

users = pd.read_csv('users.csv', sep=';', error_bad_lines=False, encoding="latin-1")
users.columns = ['userID', 'Location', 'Age']

ratings = pd.read_csv('ratings.csv', sep=';', error_bad_lines=False, encoding="latin-1")
ratings.columns = ['userID', 'ISBN', 'bookRating']

b'Skipping line 6452: expected 8 fields, saw 9\nSkipping line 43667: expected 8 fields, saw 10\nSkipping line 51751: expected 8 fields, saw 9\n'
b'Skipping line 92038: expected 8 fields, saw 9\nSkipping line 104319: expected 8 fields, saw 9\nSkipping line 121768: expected 8 fields, saw 9\n'
b'Skipping line 144058: expected 8 fields, saw 9\nSkipping line 150789: expected 8 fields, saw 9\nSkipping line 157128: expected 8 fields, saw 9\nSkipping line 180189: expected 8 fields, saw 9\nSkipping line 185738: expected 8 fields, saw 9\n'
b'Skipping line 209388: expected 8 fields, saw 9\nSkipping line 220626: expected 8 fields, saw 9\nSkipping line 227933: expected 8 fields, saw 11\nSkipping line 228957: expected 8 fields, saw 10\nSkipping line 245933: expected 8 fields, saw 9\nSkipping line 251296: expected 8 fields, saw 9\nSkipping line 259941: expected 8 fields, saw 9\nSkipping line 261529: expected 8 fields, saw 9\n'
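
The `Skipping line` messages above come from `error_bad_lines=False`, which tells pandas to drop rows whose field count does not match the header. A minimal sketch of the equivalent call on newer pandas versions (1.3 and later), using a small in-memory sample; the sample rows and the name `sample_books` are invented for illustration:

```python
import io
import pandas as pd

# Hypothetical three-column sample; the second data row has an extra field
sample = "ISBN;bookTitle;year\n111;Book A;2000\n222;Book;B;2001\n333;Book C;1999\n"

# on_bad_lines="skip" drops malformed rows silently; "warn" also reports them
sample_books = pd.read_csv(io.StringIO(sample), sep=";", on_bad_lines="skip")
print(sample_books)
```

Only the two well-formed rows survive, which mirrors what happens to the skipped lines of `books.csv`.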


### Check the number of records and features in each dataset

In [3]:
ratings.info()
users.info()
books.info()
books.describe()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1149780 entries, 0 to 1149779
Data columns (total 3 columns):
userID        1149780 non-null int64
ISBN          1149780 non-null object
bookRating    1149780 non-null int64
dtypes: int64(2), object(1)
memory usage: 26.3+ MB
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 278858 entries, 0 to 278857
Data columns (total 3 columns):
userID      278858 non-null int64
Location    278858 non-null object
Age         168096 non-null float64
dtypes: float64(1), int64(1), object(1)
memory usage: 6.4+ MB
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 271360 entries, 0 to 271359
Data columns (total 8 columns):
ISBN                 271360 non-null object
bookTitle            271360 non-null object
bookAuthor           271359 non-null object
yearOfPublication    271360 non-null object
publisher            271358 non-null object
imageUrlS            271360 non-null object
imageUrlM            271360 non-null object
imageUrlL            271357 no

Unnamed: 0,ISBN,bookTitle,bookAuthor,yearOfPublication,publisher,imageUrlS,imageUrlM,imageUrlL
count,271360,271360,271359,271360,271358,271360,271360,271357
unique,271360,242135,102023,202,16807,271044,271044,271041
top,674003934,Selected Poems,Agatha Christie,2002,Harlequin,http://images.amazon.com/images/P/074321689X.0...,http://images.amazon.com/images/P/067100316X.0...,http://images.amazon.com/images/P/042515601X.0...
freq,1,27,632,13903,7535,2,2,2


## Exploring books dataset

In [4]:
books.head()
users.head()
ratings.head()

Unnamed: 0,ISBN,bookTitle,bookAuthor,yearOfPublication,publisher,imageUrlS,imageUrlM,imageUrlL
0,0195153448,Classical Mythology,Mark P. O. Morford,2002,Oxford University Press,http://images.amazon.com/images/P/0195153448.0...,http://images.amazon.com/images/P/0195153448.0...,http://images.amazon.com/images/P/0195153448.0...
1,0002005018,Clara Callan,Richard Bruce Wright,2001,HarperFlamingo Canada,http://images.amazon.com/images/P/0002005018.0...,http://images.amazon.com/images/P/0002005018.0...,http://images.amazon.com/images/P/0002005018.0...
2,0060973129,Decision in Normandy,Carlo D'Este,1991,HarperPerennial,http://images.amazon.com/images/P/0060973129.0...,http://images.amazon.com/images/P/0060973129.0...,http://images.amazon.com/images/P/0060973129.0...
3,0374157065,Flu: The Story of the Great Influenza Pandemic...,Gina Bari Kolata,1999,Farrar Straus Giroux,http://images.amazon.com/images/P/0374157065.0...,http://images.amazon.com/images/P/0374157065.0...,http://images.amazon.com/images/P/0374157065.0...
4,0393045218,The Mummies of Urumchi,E. J. W. Barber,1999,W. W. Norton & Company,http://images.amazon.com/images/P/0393045218.0...,http://images.amazon.com/images/P/0393045218.0...,http://images.amazon.com/images/P/0393045218.0...


Unnamed: 0,userID,Location,Age
0,1,"nyc, new york, usa",
1,2,"stockton, california, usa",18.0
2,3,"moscow, yukon territory, russia",
3,4,"porto, v.n.gaia, portugal",17.0
4,5,"farnborough, hants, united kingdom",


Unnamed: 0,userID,ISBN,bookRating
0,276725,034545104X,0
1,276726,0155061224,5
2,276727,0446520802,0
3,276729,052165615X,3
4,276729,0521795028,6


### Drop the last three columns, which contain image URLs and are not required for the analysis

In [5]:
# Drop all three image URL columns in one call
books.drop(columns=["imageUrlS", "imageUrlM", "imageUrlL"], inplace=True)

In [6]:
books.head()

Unnamed: 0,ISBN,bookTitle,bookAuthor,yearOfPublication,publisher
0,0195153448,Classical Mythology,Mark P. O. Morford,2002,Oxford University Press
1,0002005018,Clara Callan,Richard Bruce Wright,2001,HarperFlamingo Canada
2,0060973129,Decision in Normandy,Carlo D'Este,1991,HarperPerennial
3,0374157065,Flu: The Story of the Great Influenza Pandemic...,Gina Bari Kolata,1999,Farrar Straus Giroux
4,0393045218,The Mummies of Urumchi,E. J. W. Barber,1999,W. W. Norton & Company


**yearOfPublication**

### Check unique values of yearOfPublication


In [7]:
books.yearOfPublication.unique()

array([2002, 2001, 1991, 1999, 2000, 1993, 1996, 1988, 2004, 1998, 1994,
       2003, 1997, 1983, 1979, 1995, 1982, 1985, 1992, 1986, 1978, 1980,
       1952, 1987, 1990, 1981, 1989, 1984, 0, 1968, 1961, 1958, 1974,
       1976, 1971, 1977, 1975, 1965, 1941, 1970, 1962, 1973, 1972, 1960,
       1966, 1920, 1956, 1959, 1953, 1951, 1942, 1963, 1964, 1969, 1954,
       1950, 1967, 2005, 1957, 1940, 1937, 1955, 1946, 1936, 1930, 2011,
       1925, 1948, 1943, 1947, 1945, 1923, 2020, 1939, 1926, 1938, 2030,
       1911, 1904, 1949, 1932, 1928, 1929, 1927, 1931, 1914, 2050, 1934,
       1910, 1933, 1902, 1924, 1921, 1900, 2038, 2026, 1944, 1917, 1901,
       2010, 1908, 1906, 1935, 1806, 2021, '2000', '1995', '1999', '2004',
       '2003', '1990', '1994', '1986', '1989', '2002', '1981', '1993',
       '1983', '1982', '1976', '1991', '1977', '1998', '1992', '1996',
       '0', '1997', '2001', '1974', '1968', '1987', '1984', '1988',
       '1963', '1956', '1970', '1985', '1978', '1973', '1980'

The output above shows some incorrect entries in this field. It looks like the publisher names 'DK Publishing Inc' and 'Gallimard' have been incorrectly loaded as yearOfPublication due to errors in the CSV file.

Also, some of the entries are strings, while the same years appear as numbers elsewhere. We will fix these issues in the steps that follow.
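
One way to locate such entries without reading the whole array is to coerce the column to numbers and look at what fails to parse. The following is a minimal sketch on an invented sample (`years_raw` is a hypothetical stand-in for `books.yearOfPublication`), not the approach taken in the cells below:

```python
import pandas as pd

# Hypothetical sample mimicking the mixed content of yearOfPublication
years_raw = pd.Series([2002, "2000", "DK Publishing Inc", 0, "Gallimard"])

# errors="coerce" turns anything that is not a number into NaN
years_num = pd.to_numeric(years_raw, errors="coerce")

# Entries that failed to parse are the ones to inspect or drop
bad_entries = years_raw[years_num.isna()]
print(bad_entries.tolist())
```

The numeric strings such as `"2000"` are converted along the way, so the same call also deals with the mixed string and integer years.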

### Check the rows having 'DK Publishing Inc' as yearOfPublication

In [8]:
books[books["yearOfPublication"] == 'DK Publishing Inc']

Unnamed: 0,ISBN,bookTitle,bookAuthor,yearOfPublication,publisher
209538,078946697X,"DK Readers: Creating the X-Men, How It All Beg...",2000,DK Publishing Inc,http://images.amazon.com/images/P/078946697X.0...
221678,0789466953,"DK Readers: Creating the X-Men, How Comic Book...",2000,DK Publishing Inc,http://images.amazon.com/images/P/0789466953.0...


### Drop the rows having `'DK Publishing Inc'` and `'Gallimard'` as `yearOfPublication`

In [9]:
indexNames = books[books["yearOfPublication"].isin(['DK Publishing Inc','Gallimard']) ].index
print(indexNames)
books.drop(indexNames , inplace=True)
books[books["yearOfPublication"].isin(['DK Publishing Inc','Gallimard']) ]
books.yearOfPublication.unique()

Int64Index([209538, 220731, 221678], dtype='int64')


Unnamed: 0,ISBN,bookTitle,bookAuthor,yearOfPublication,publisher


array([2002, 2001, 1991, 1999, 2000, 1993, 1996, 1988, 2004, 1998, 1994,
       2003, 1997, 1983, 1979, 1995, 1982, 1985, 1992, 1986, 1978, 1980,
       1952, 1987, 1990, 1981, 1989, 1984, 0, 1968, 1961, 1958, 1974,
       1976, 1971, 1977, 1975, 1965, 1941, 1970, 1962, 1973, 1972, 1960,
       1966, 1920, 1956, 1959, 1953, 1951, 1942, 1963, 1964, 1969, 1954,
       1950, 1967, 2005, 1957, 1940, 1937, 1955, 1946, 1936, 1930, 2011,
       1925, 1948, 1943, 1947, 1945, 1923, 2020, 1939, 1926, 1938, 2030,
       1911, 1904, 1949, 1932, 1928, 1929, 1927, 1931, 1914, 2050, 1934,
       1910, 1933, 1902, 1924, 1921, 1900, 2038, 2026, 1944, 1917, 1901,
       2010, 1908, 1906, 1935, 1806, 2021, '2000', '1995', '1999', '2004',
       '2003', '1990', '1994', '1986', '1989', '2002', '1981', '1993',
       '1983', '1982', '1976', '1991', '1977', '1998', '1992', '1996',
       '0', '1997', '2001', '1974', '1968', '1987', '1984', '1988',
       '1963', '1956', '1970', '1985', '1978', '1973', '1980'

### Change the datatype of yearOfPublication to 'int'

In [10]:
books.yearOfPublication.astype(int).dtypes

# Assign the converted Series back to the column
books["yearOfPublication"] = books["yearOfPublication"].astype(int)



dtype('int32')

In [11]:
books.info()
books.describe()
books.dtypes

<class 'pandas.core.frame.DataFrame'>
Int64Index: 271357 entries, 0 to 271359
Data columns (total 5 columns):
ISBN                 271357 non-null object
bookTitle            271357 non-null object
bookAuthor           271356 non-null object
yearOfPublication    271357 non-null int32
publisher            271355 non-null object
dtypes: int32(1), object(4)
memory usage: 11.4+ MB


Unnamed: 0,yearOfPublication
count,271357.0
mean,1959.760817
std,257.994226
min,0.0
25%,1989.0
50%,1995.0
75%,2000.0
max,2050.0


ISBN                 object
bookTitle            object
bookAuthor           object
yearOfPublication     int32
publisher            object
dtype: object

### Drop NaNs in `'publisher'` column


In [12]:
books.info()
books.isnull().sum().sum()
books.isna().sum().sum()
books.publisher.isnull().sum()
books.describe()
books.shape
#books = books.publisher.dropna()
books.dropna(subset=['publisher'],inplace=True)

<class 'pandas.core.frame.DataFrame'>
Int64Index: 271357 entries, 0 to 271359
Data columns (total 5 columns):
ISBN                 271357 non-null object
bookTitle            271357 non-null object
bookAuthor           271356 non-null object
yearOfPublication    271357 non-null int32
publisher            271355 non-null object
dtypes: int32(1), object(4)
memory usage: 11.4+ MB


3

3

2

Unnamed: 0,yearOfPublication
count,271357.0
mean,1959.760817
std,257.994226
min,0.0
25%,1989.0
50%,1995.0
75%,2000.0
max,2050.0


(271357, 5)

In [13]:
books.isnull().sum().sum()
books.isna().sum().sum()
books.publisher.isnull().sum()

1

1

0

## Exploring Users dataset

In [14]:
print(users.shape)
users.head()

(278858, 3)


Unnamed: 0,userID,Location,Age
0,1,"nyc, new york, usa",
1,2,"stockton, california, usa",18.0
2,3,"moscow, yukon territory, russia",
3,4,"porto, v.n.gaia, portugal",17.0
4,5,"farnborough, hants, united kingdom",


### Get all unique values in ascending order for column `Age`

In [15]:
#users.Age.unique()
#users.sort_values('Age', ascending=[True])
sorted(users['Age'].unique())


[nan,
 0.0,
 1.0,
 2.0,
 3.0,
 4.0,
 5.0,
 6.0,
 7.0,
 8.0,
 9.0,
 10.0,
 11.0,
 12.0,
 13.0,
 14.0,
 15.0,
 16.0,
 17.0,
 18.0,
 19.0,
 20.0,
 21.0,
 22.0,
 23.0,
 24.0,
 25.0,
 26.0,
 27.0,
 28.0,
 29.0,
 30.0,
 31.0,
 32.0,
 33.0,
 34.0,
 35.0,
 36.0,
 37.0,
 38.0,
 39.0,
 40.0,
 41.0,
 42.0,
 43.0,
 44.0,
 45.0,
 46.0,
 47.0,
 48.0,
 49.0,
 50.0,
 51.0,
 52.0,
 53.0,
 54.0,
 55.0,
 56.0,
 57.0,
 58.0,
 59.0,
 60.0,
 61.0,
 62.0,
 63.0,
 64.0,
 65.0,
 66.0,
 67.0,
 68.0,
 69.0,
 70.0,
 71.0,
 72.0,
 73.0,
 74.0,
 75.0,
 76.0,
 77.0,
 78.0,
 79.0,
 80.0,
 81.0,
 82.0,
 83.0,
 84.0,
 85.0,
 86.0,
 87.0,
 88.0,
 89.0,
 90.0,
 91.0,
 92.0,
 93.0,
 94.0,
 95.0,
 96.0,
 97.0,
 98.0,
 99.0,
 100.0,
 101.0,
 102.0,
 103.0,
 104.0,
 105.0,
 106.0,
 107.0,
 108.0,
 109.0,
 110.0,
 111.0,
 113.0,
 114.0,
 115.0,
 116.0,
 118.0,
 119.0,
 123.0,
 124.0,
 127.0,
 128.0,
 132.0,
 133.0,
 136.0,
 137.0,
 138.0,
 140.0,
 141.0,
 143.0,
 146.0,
 147.0,
 148.0,
 151.0,
 152.0,
 156.0,
 157.0,
 159.0,


The Age column has some invalid entries: NaN, 0, and implausibly high values of 100 and above.

### Values below 5 and above 90 do not make much sense for book ratings, so replace them with NaNs

In [16]:
users.isnull().sum().sum()
users.isna().sum().sum()
users.Age.isnull().sum()
users.info()

110762

110762

110762

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 278858 entries, 0 to 278857
Data columns (total 3 columns):
userID      278858 non-null int64
Location    278858 non-null object
Age         168096 non-null float64
dtypes: float64(1), int64(1), object(1)
memory usage: 6.4+ MB


In [17]:
# Rows with implausible ages (below 5 or above 90)
invalid_age = (users['Age'] < 5) | (users['Age'] > 90)
users[invalid_age].count()
# Use .loc to avoid chained assignment
users.loc[invalid_age, 'Age'] = np.nan
users.isnull().sum().sum()
users.isna().sum().sum()
users.Age.isnull().sum()
users[(users['Age'] < 5) | (users['Age'] > 90)].count()


userID      1312
Location    1312
Age         1312
dtype: int64

112074

112074

112074

userID      0
Location    0
Age         0
dtype: int64
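
The same masking can also be written without an explicit assignment. This is a minimal sketch on invented ages, using `Series.between` (inclusive on both ends by default) together with `Series.where`; the names `ages` and `cleaned` are hypothetical:

```python
import numpy as np
import pandas as pd

# Invented sample covering missing, too-low, valid and too-high ages
ages = pd.Series([np.nan, 0, 4, 5, 34, 90, 91, 159])

# where() keeps values for which the condition holds and sets the rest to NaN;
# NaN never satisfies between(), so existing NaNs stay NaN
cleaned = ages.where(ages.between(5, 90))
print(cleaned.tolist())
```

Only 5, 34 and 90 survive here, matching the rule that values below 5 and above 90 become NaN.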

### Replace null values in column `Age` with mean

In [18]:
# Assign the filled Series back instead of calling fillna in place on a column selection
users['Age'] = users['Age'].fillna(users['Age'].mean())
users.isnull().sum().sum()
users.isna().sum().sum()
users.Age.isnull().sum()

0

0

0

### Change the datatype of `Age` to `int`

In [19]:
users["Age"]=users["Age"].astype('int64',errors='ignore')

In [20]:
users.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 278858 entries, 0 to 278857
Data columns (total 3 columns):
userID      278858 non-null int64
Location    278858 non-null object
Age         278858 non-null int64
dtypes: int64(2), object(1)
memory usage: 6.4+ MB


In [21]:
print(sorted(users.Age.unique()))

[5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90]
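
Note that casting a float column to `int` drops the fractional part rather than rounding it, so ages imputed with a non-integer mean are truncated. A minimal sketch on invented values (the real mean of `Age` is not shown in this notebook, so the numbers below are only illustrative):

```python
import numpy as np
import pandas as pd

# Invented sample: the mean of 18, 17 and 69 is about 34.67
ages = pd.Series([18, np.nan, 17, np.nan, 69])
filled = ages.fillna(ages.mean())

truncated = filled.astype("int64")        # drops the fractional part
rounded = filled.round().astype("int64")  # rounds to the nearest integer first

print(truncated.tolist())
print(rounded.tolist())
```

Whether truncation or rounding is preferable depends on the analysis; the point is only that the two give different imputed ages.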


## Exploring the Ratings Dataset

### Check the shape

In [22]:
ratings.shape

(1149780, 3)

In [23]:
n_users = users.shape[0]
n_books = books.shape[0]

In [24]:
n_users
n_books
ratings.head(5)

278858

271355

Unnamed: 0,userID,ISBN,bookRating
0,276725,034545104X,0
1,276726,0155061224,5
2,276727,0446520802,0
3,276729,052165615X,3
4,276729,0521795028,6


### The ratings dataset should contain only books that exist in our books dataset. Drop the remaining rows

In [25]:
ratings.bookRating.value_counts()

0     716109
8     103736
10     78610
7      76457
9      67541
5      50974
6      36924
4       8904
3       5996
2       2759
1       1770
Name: bookRating, dtype: int64

In [26]:
ratings.head()
users.head()
books.head()

#ratings_books_df = pd.merge(ratings, books.drop_duplicates(['ISBN']), on="ISBN", how="left") 
ratings =   ratings[ratings.ISBN.isin(books.ISBN)]
ratings.shape
ratings.head()
ratings.info()

Unnamed: 0,userID,ISBN,bookRating
0,276725,034545104X,0
1,276726,0155061224,5
2,276727,0446520802,0
3,276729,052165615X,3
4,276729,0521795028,6


Unnamed: 0,userID,Location,Age
0,1,"nyc, new york, usa",34
1,2,"stockton, california, usa",18
2,3,"moscow, yukon territory, russia",34
3,4,"porto, v.n.gaia, portugal",17
4,5,"farnborough, hants, united kingdom",34


Unnamed: 0,ISBN,bookTitle,bookAuthor,yearOfPublication,publisher
0,0195153448,Classical Mythology,Mark P. O. Morford,2002,Oxford University Press
1,0002005018,Clara Callan,Richard Bruce Wright,2001,HarperFlamingo Canada
2,0060973129,Decision in Normandy,Carlo D'Este,1991,HarperPerennial
3,0374157065,Flu: The Story of the Great Influenza Pandemic...,Gina Bari Kolata,1999,Farrar Straus Giroux
4,0393045218,The Mummies of Urumchi,E. J. W. Barber,1999,W. W. Norton & Company


(1031130, 3)

Unnamed: 0,userID,ISBN,bookRating
0,276725,034545104X,0
1,276726,0155061224,5
2,276727,0446520802,0
3,276729,052165615X,3
4,276729,0521795028,6


<class 'pandas.core.frame.DataFrame'>
Int64Index: 1031130 entries, 0 to 1149778
Data columns (total 3 columns):
userID        1031130 non-null int64
ISBN          1031130 non-null object
bookRating    1031130 non-null int64
dtypes: int64(2), object(1)
memory usage: 31.5+ MB


### The ratings dataset should contain only ratings from users who exist in the users dataset. Drop the remaining rows

In [27]:
#ratings_books_users_df = pd.merge(ratings_books_df, users.drop_duplicates(['userID']), on="userID", how="left") 
ratings =   ratings[ratings.userID.isin(users.userID)]
ratings.shape
ratings.head()
ratings.info()

(1031130, 3)

Unnamed: 0,userID,ISBN,bookRating
0,276725,034545104X,0
1,276726,0155061224,5
2,276727,0446520802,0
3,276729,052165615X,3
4,276729,0521795028,6


<class 'pandas.core.frame.DataFrame'>
Int64Index: 1031130 entries, 0 to 1149778
Data columns (total 3 columns):
userID        1031130 non-null int64
ISBN          1031130 non-null object
bookRating    1031130 non-null int64
dtypes: int64(2), object(1)
memory usage: 31.5+ MB
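
Both filters follow the same pattern: keep a rating only if its key appears in the reference table. A minimal sketch on toy frames (all names and values here are invented) that applies the two conditions in one step:

```python
import pandas as pd

toy_books = pd.DataFrame({"ISBN": ["A1", "B2"]})
toy_users = pd.DataFrame({"userID": [1, 2]})
toy_ratings = pd.DataFrame({
    "userID": [1, 2, 3, 1],
    "ISBN": ["A1", "B2", "A1", "C3"],
    "bookRating": [5, 0, 8, 7],
})

# A rating is kept only if both its book and its user are known
keep = toy_ratings["ISBN"].isin(toy_books["ISBN"]) & toy_ratings["userID"].isin(toy_users["userID"])
filtered = toy_ratings[keep]
print(filtered)
```

The row from user 3 and the row for book `C3` are dropped, since neither key exists in the reference tables.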


### Consider only ratings from 1 to 10 and leave out the 0s in column `bookRating`

In [28]:
ratings.bookRating.value_counts()
indexNames_bookRating0 = ratings[ratings["bookRating"] == 0].index
#indexNames.info()
print(indexNames_bookRating0)

ratings.drop(indexNames_bookRating0 , inplace=True)
ratings.shape
#books[books["yearOfPublication"].isin(['DK Publishing Inc','Gallimard']) ]
#books.yearOfPublication.unique()

0     647291
8      91804
10     71225
7      66401
9      60776
5      45355
6      31687
4       7617
3       5118
2       2375
1       1481
Name: bookRating, dtype: int64

Int64Index([      0,       2,       5,      10,      11,      12,      13,
                 14,      15,      17,
            ...
            1149764, 1149765, 1149766, 1149767, 1149768, 1149769, 1149770,
            1149772, 1149774, 1149776],
           dtype='int64', length=647291)


(383839, 3)

### Find out which rating has been given the highest number of times

In [29]:
ratings.bookRating.value_counts().head(1)

8    91804
Name: bookRating, dtype: int64

### **Collaborative Filtering Based Recommendation Systems**

### For more accurate results, only consider users who have rated at least 100 books

In [30]:
# Count explicit ratings per user
ratings_100_books = ratings["userID"].value_counts().to_frame().reset_index()
ratings_100_books.columns = ['userID', 'counts']
# Note: `> 100` keeps users with more than 100 ratings;
# use `>= 100` to also include users with exactly 100
ratings_100_books = ratings_100_books[ratings_100_books['counts'] > 100]

# Keep only the ratings given by those users
ratings = ratings[ratings.userID.isin(ratings_100_books.userID)]

ratings_100_books.shape
ratings.shape
ratings.head()

(440, 2)

(102369, 3)

Unnamed: 0,userID,ISBN,bookRating
1456,277427,002542730X,10
1458,277427,003008685X,8
1461,277427,0060006641,10
1465,277427,0060542128,7
1474,277427,0061009059,9


### Generating ratings matrix from explicit ratings


#### Note: since NaNs cannot be handled by training algorithms, replace these by 0, which indicates absence of ratings

In [31]:
ratings.isna().sum()

userID        0
ISBN          0
bookRating    0
dtype: int64

In [32]:
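from sklearn.model_selection import train_test_split
trainDF, tempDF = train_test_split(ratings, test_size=0.2, random_state=100)
# Create a copy of tempDF as testDF
testDF = tempDF.copy()
# Intended to blank out the ratings in tempDF. Note: the column is named
# `bookRating`, so `tempDF.rating = np.nan` only sets an attribute and leaves
# the ratings untouched (the outputs below still show them); masking them
# would require tempDF['bookRating'] = np.nan
tempDF.rating = np.nan
tempDF.head()
# Remove missing values in testDF
testDF = testDF.dropna()
testDF.head()
# Recombine trainDF and tempDF into ratings
ratings = pd.concat([trainDF, tempDF]).reset_index()
ratings.shape
ratings.head()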


Unnamed: 0,userID,ISBN,bookRating
525415,127200,0879737964,3
456788,109901,0380727501,9
104281,23902,0395510821,8
1139570,274061,067082982X,10
449941,107784,088166247X,10


Unnamed: 0,userID,ISBN,bookRating
525415,127200,0879737964,3
456788,109901,0380727501,9
104281,23902,0395510821,8
1139570,274061,067082982X,10
449941,107784,088166247X,10


(102369, 4)

Unnamed: 0,index,userID,ISBN,bookRating
0,63716,12982,0385147635,8
1,426668,101851,459000982X,10
2,377654,91113,0312974256,9
3,1035227,247429,087123579X,10
4,104674,23902,0706400674,7


In [33]:
ratings.shape
# Fill missing values with 0.0 (the resulting matrix is mostly zeros, i.e. sparse)
R_df = ratings.pivot(index = 'userID', columns ='ISBN', values = 'bookRating').fillna(0)
R_df.shape
R_df.tail()


(102369, 4)

(440, 66074)

ISBN,0000913154,0001046438,000104687X,0001047213,0001047973,000104799X,0001048082,0001053736,0001053744,0001055607,...,B000092Q0A,B00009EF82,B00009NDAN,B0000DYXID,B0000T6KHI,B0000VZEJQ,B0000X8HIE,B00013AX9E,B0001I1KOG,B000234N3A
userID,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
274061,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
274301,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,10.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
275970,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
277427,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
278418,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
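
To make the shape of this matrix concrete, here is a minimal sketch of the same pivot on a toy ratings table (all values invented). One caveat worth knowing: `DataFrame.pivot` raises a `ValueError` if a (userID, ISBN) pair occurs more than once, in which case `pivot_table` with an aggregation function is a common alternative.

```python
import pandas as pd

toy = pd.DataFrame({
    "userID": [1, 1, 2, 3],
    "ISBN": ["A1", "B2", "A1", "C3"],
    "bookRating": [8, 5, 10, 7],
})

# Rows are users, columns are books; unrated pairs become 0
R_toy = toy.pivot(index="userID", columns="ISBN", values="bookRating").fillna(0)
print(R_toy)
```

Each row of `R_toy` is one user's rating vector, which is the form the SVD step expects.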


### Generate the predicted ratings using SVD with 50 singular values

In [34]:
from scipy.sparse.linalg import svds
# Singular value decomposition:
# compute the largest k singular values/vectors of the ratings matrix.
# k: number of singular values and vectors to compute; must satisfy 1 <= k < min(R_df.shape)
U, sigma, Vt = svds(R_df.values, k = 50)
# Turn the vector of singular values into a diagonal matrix
sigma = np.diag(sigma)
# With k = 50: U has shape (440, 50), sigma (50, 50) and Vt (50, 66074)
# No user means were subtracted before the decomposition, so none are added back here
# Reconstruct the rank-50 approximation of the ratings matrix
all_user_predicted_ratings = np.dot(np.dot(U, sigma), Vt)

len(all_user_predicted_ratings)
len(all_user_predicted_ratings[0])

# Same columns (ISBNs) as R_df; rows are in the same order as R_df.index
preds_df = pd.DataFrame(all_user_predicted_ratings, columns = R_df.columns)

preds_df.shape
R_df.shape

# Predicted ratings
preds_df.head()

440

66074

(440, 66074)

(440, 66074)

ISBN,0000913154,0001046438,000104687X,0001047213,0001047973,000104799X,0001048082,0001053736,0001053744,0001055607,...,B000092Q0A,B00009EF82,B00009NDAN,B0000DYXID,B0000T6KHI,B0000VZEJQ,B0000X8HIE,B00013AX9E,B0001I1KOG,B000234N3A
0,0.025435,-0.002179,-0.001453,-0.002179,-0.002179,0.002992,-0.003815,0.007085,0.007085,0.012476,...,0.000208,0.000274,0.04212,-0.016812,-0.07944,0.004883,0.02804,0.000138,-0.001522,0.067883
1,-0.009927,-0.003617,-0.002411,-0.003617,-0.003617,0.001039,0.001498,-0.003512,-0.003512,0.001624,...,-0.00036,0.000395,0.007998,0.001164,-0.028259,0.001008,0.002274,-0.00024,2.6e-05,-0.01293
2,-0.014924,-0.015591,-0.010394,-0.015591,-0.015591,0.007366,-0.014016,0.011928,0.011928,0.012008,...,-0.000433,0.001967,0.048691,0.005677,0.118079,0.006984,0.003151,-0.000288,0.009096,-0.058054
3,-0.02102,0.035453,0.023636,0.035453,0.035453,0.030357,0.024524,-0.001135,-0.001135,0.067559,...,0.003022,0.009995,0.088258,-0.008757,0.015976,0.028634,0.000253,0.002015,0.031009,-0.047275
4,0.002035,-0.008156,-0.005438,-0.008156,-0.008156,0.003119,0.002917,0.000222,0.000222,0.006312,...,0.002145,0.001677,-0.011525,0.009334,0.673907,0.002657,-0.00819,0.00143,0.00508,0.047187
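
The reconstruction above is a rank-k approximation of the ratings matrix. As a minimal sketch of what that means, the following compares `svds` with a truncated full SVD from NumPy on a small random matrix; the matrix, the seed and `k = 2` are assumptions chosen only for illustration:

```python
import numpy as np
from scipy.sparse.linalg import svds

# Small invented "ratings" matrix with values 0..10
rng = np.random.default_rng(0)
R = rng.integers(0, 11, size=(6, 8)).astype(float)

k = 2  # must satisfy 1 <= k < min(R.shape)
U, s, Vt = svds(R, k=k)
approx = U @ np.diag(s) @ Vt  # rank-k reconstruction

# Reference: keep the k largest singular triplets of the full SVD
U_f, s_f, Vt_f = np.linalg.svd(R, full_matrices=False)
best = U_f[:, :k] @ np.diag(s_f[:k]) @ Vt_f[:k, :]

print(np.allclose(approx, best, atol=1e-6))
```

The two reconstructions agree up to numerical tolerance, so `svds` simply avoids computing the singular triplets that would be discarded anyway.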


### Take a particular user_id

### Let's find the recommendations for the user with ID `2110`

#### Note: Execute the below cells to get the variables loaded

In [35]:
userID=2110


In [36]:


# userID is an integer column, so compare against the integer ID rather than the string '2110'
ratings[ratings["userID"] == userID]


Unnamed: 0,index,userID,ISBN,bookRating
283,14463,2110,0345317580,10
1917,14606,2110,1565111575,10
3350,14576,2110,0679805265,10
4051,14462,2110,0345314255,10
5074,14582,2110,0743486625,10
7543,14565,2110,0671532642,8
7614,14601,2110,093317490X,7
8001,14513,2110,0441000150,10
8123,14579,2110,068808527X,8
12904,14467,2110,0345375580,10


In [37]:
user_id = 2 #2nd row in ratings matrix and predicted matrix
ratings.iloc[0:1,:]

Unnamed: 0,index,userID,ISBN,bookRating
0,63716,12982,0385147635,8


### Get the predicted ratings for userID `2110` and sort them in descending order

In [38]:
ratings[ratings['userID']==userID].sort_values('bookRating',ascending=False).shape
ratings[ratings['userID']==userID].sort_values('bookRating',ascending=False)

(103, 4)

Unnamed: 0,index,userID,ISBN,bookRating
283,14463,2110,0345317580,10
61938,14568,2110,0671695304,10
77059,14548,2110,059046678X,10
75491,14459,2110,0345283929,10
74533,14608,2110,1570420564,10
73955,14466,2110,0345362276,10
73629,14457,2110,0345260627,10
68223,14552,2110,0590629794,10
65273,14593,2110,0812505042,10
56471,14461,2110,0345307674,10
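
The cell above sorts the ratings the user actually gave. The predicted ratings live in `preds_df`, whose rows follow the order of `R_df.index` but carry a plain 0..n-1 index, so the user's row position has to be looked up first. A minimal sketch with toy stand-ins (`R_toy`, `preds_toy` and all values are invented):

```python
import pandas as pd

# Toy stand-ins for R_df and preds_df
R_toy = pd.DataFrame(
    [[8.0, 0.0, 5.0], [0.0, 10.0, 0.0]],
    index=pd.Index([2110, 2276], name="userID"),
    columns=pd.Index(["A1", "B2", "C3"], name="ISBN"),
)
preds_toy = pd.DataFrame([[7.5, 1.2, 4.9], [0.3, 9.1, 2.2]], columns=R_toy.columns)

# Row position of the user in the ratings matrix
user_row = R_toy.index.get_loc(2110)

# Predicted ratings for that user, highest first
sorted_preds = preds_toy.iloc[user_row].sort_values(ascending=False)
print(sorted_preds)
```

An alternative design is to build `preds_df` with `index=R_df.index`, after which `preds_df.loc[userID]` works directly.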


### Create a dataframe named `user_data` containing the books that userID `2110` has explicitly rated

In [39]:
user_data = ratings[ratings['userID']==userID]

In [40]:
user_data.shape
user_data.head()

(103, 4)

Unnamed: 0,index,userID,ISBN,bookRating
283,14463,2110,0345317580,10
1917,14606,2110,1565111575,10
3350,14576,2110,0679805265,10
4051,14462,2110,0345314255,10
5074,14582,2110,0743486625,10


In [41]:
user_data.shape

(103, 4)

### Combine `user_data` and the corresponding book data (`book_data`) in a single dataframe named `user_full_info`

In [42]:
book_data = books[books.ISBN.isin(user_data.ISBN)]

In [43]:
book_data.shape

(103, 5)

In [44]:
book_data.head()

Unnamed: 0,ISBN,bookTitle,bookAuthor,yearOfPublication,publisher
246,0151008116,Life of Pi,Yann Martel,2002,Harcourt
904,015216250X,So You Want to Be a Wizard: The First Book in ...,Diane Duane,2001,Magic Carpet Books
1000,0064472779,All-American Girl,Meg Cabot,2003,HarperTrophy
1302,0345307674,Return of the Jedi (Star Wars),James Kahn,1983,Del Rey Books
1472,0671527215,Hitchhikers's Guide to the Galaxy,Douglas Adams,1984,Pocket


In [45]:
user_full_info = pd.merge(user_data, book_data, on='ISBN', how='outer')
user_full_info.shape

(103, 8)

In [46]:
user_full_info.head()

Unnamed: 0,index,userID,ISBN,bookRating,bookTitle,bookAuthor,yearOfPublication,publisher
0,14463,2110,0345317580,10,Magic Kingdom for Sale - Sold! (Magic Kingdom ...,Terry Brooks,1990,Del Rey Books
1,14606,2110,1565111575,10,Return of the Jedi: The Original Radio Drama,Anthony Daniels,1996,Highbridge Audio
2,14576,2110,0679805265,10,Long Shot (Three Investigators Crimebusters (P...,Megan Stine,1993,Random House Children's Books
3,14462,2110,0345314255,10,Sword of Shannara,Terry Brooks,1995,Del Rey Books
4,14582,2110,0743486625,10,Damnation Alley,Roger Zelazny,2004,I Books


### Get top 10 recommendations for above given userID from the books not already rated by that user

In [47]:
user_full_info.head(10)

Unnamed: 0,index,userID,ISBN,bookRating,bookTitle,bookAuthor,yearOfPublication,publisher
0,14463,2110,0345317580,10,Magic Kingdom for Sale - Sold! (Magic Kingdom ...,Terry Brooks,1990,Del Rey Books
1,14606,2110,1565111575,10,Return of the Jedi: The Original Radio Drama,Anthony Daniels,1996,Highbridge Audio
2,14576,2110,0679805265,10,Long Shot (Three Investigators Crimebusters (P...,Megan Stine,1993,Random House Children's Books
3,14462,2110,0345314255,10,Sword of Shannara,Terry Brooks,1995,Del Rey Books
4,14582,2110,0743486625,10,Damnation Alley,Roger Zelazny,2004,I Books
5,14565,2110,0671532642,8,RESTAURNT END UNIV (Hitchhiker's Trilogy (Pape...,Douglas Adams,1984,Pocket
6,14601,2110,093317490X,7,The Yucatan: A Guide to the Land of Maya Myste...,Antoinette May,1993,Wide World Publishing
7,14513,2110,0441000150,10,Quantum Leap: The Wall (Quantum Leap),Ashley McConnell,1994,Ace Books
8,14579,2110,068808527X,8,Close Friends,Peter Jenkins,1989,Harpercollins
9,14467,2110,0345375580,10,The Elf Queen of Shannara (Heritage of Shannar...,Terry Brooks,1993,Del Rey Books


In [48]:
user_full_info.head(10)
preds_df.shape
ratings.shape
book_data.shape
preds_df.head()
book_data.head()
ratings.head()

Unnamed: 0,index,userID,ISBN,bookRating,bookTitle,bookAuthor,yearOfPublication,publisher
0,14463,2110,0345317580,10,Magic Kingdom for Sale - Sold! (Magic Kingdom ...,Terry Brooks,1990,Del Rey Books
1,14606,2110,1565111575,10,Return of the Jedi: The Original Radio Drama,Anthony Daniels,1996,Highbridge Audio
2,14576,2110,0679805265,10,Long Shot (Three Investigators Crimebusters (P...,Megan Stine,1993,Random House Children's Books
3,14462,2110,0345314255,10,Sword of Shannara,Terry Brooks,1995,Del Rey Books
4,14582,2110,0743486625,10,Damnation Alley,Roger Zelazny,2004,I Books
5,14565,2110,0671532642,8,RESTAURNT END UNIV (Hitchhiker's Trilogy (Pape...,Douglas Adams,1984,Pocket
6,14601,2110,093317490X,7,The Yucatan: A Guide to the Land of Maya Myste...,Antoinette May,1993,Wide World Publishing
7,14513,2110,0441000150,10,Quantum Leap: The Wall (Quantum Leap),Ashley McConnell,1994,Ace Books
8,14579,2110,068808527X,8,Close Friends,Peter Jenkins,1989,Harpercollins
9,14467,2110,0345375580,10,The Elf Queen of Shannara (Heritage of Shannar...,Terry Brooks,1993,Del Rey Books


(440, 66074)

(102369, 4)

(103, 5)

ISBN,0000913154,0001046438,000104687X,0001047213,0001047973,000104799X,0001048082,0001053736,0001053744,0001055607,...,B000092Q0A,B00009EF82,B00009NDAN,B0000DYXID,B0000T6KHI,B0000VZEJQ,B0000X8HIE,B00013AX9E,B0001I1KOG,B000234N3A
0,0.025435,-0.002179,-0.001453,-0.002179,-0.002179,0.002992,-0.003815,0.007085,0.007085,0.012476,...,0.000208,0.000274,0.04212,-0.016812,-0.07944,0.004883,0.02804,0.000138,-0.001522,0.067883
1,-0.009927,-0.003617,-0.002411,-0.003617,-0.003617,0.001039,0.001498,-0.003512,-0.003512,0.001624,...,-0.00036,0.000395,0.007998,0.001164,-0.028259,0.001008,0.002274,-0.00024,2.6e-05,-0.01293
2,-0.014924,-0.015591,-0.010394,-0.015591,-0.015591,0.007366,-0.014016,0.011928,0.011928,0.012008,...,-0.000433,0.001967,0.048691,0.005677,0.118079,0.006984,0.003151,-0.000288,0.009096,-0.058054
3,-0.02102,0.035453,0.023636,0.035453,0.035453,0.030357,0.024524,-0.001135,-0.001135,0.067559,...,0.003022,0.009995,0.088258,-0.008757,0.015976,0.028634,0.000253,0.002015,0.031009,-0.047275
4,0.002035,-0.008156,-0.005438,-0.008156,-0.008156,0.003119,0.002917,0.000222,0.000222,0.006312,...,0.002145,0.001677,-0.011525,0.009334,0.673907,0.002657,-0.00819,0.00143,0.00508,0.047187


Unnamed: 0,ISBN,bookTitle,bookAuthor,yearOfPublication,publisher
246,0151008116,Life of Pi,Yann Martel,2002,Harcourt
904,015216250X,So You Want to Be a Wizard: The First Book in ...,Diane Duane,2001,Magic Carpet Books
1000,0064472779,All-American Girl,Meg Cabot,2003,HarperTrophy
1302,0345307674,Return of the Jedi (Star Wars),James Kahn,1983,Del Rey Books
1472,0671527215,Hitchhikers's Guide to the Galaxy,Douglas Adams,1984,Pocket


Unnamed: 0,index,userID,ISBN,bookRating
0,63716,12982,0385147635,8
1,426668,101851,459000982X,10
2,377654,91113,0312974256,9
3,1035227,247429,087123579X,10
4,104674,23902,0706400674,7
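
The cells above inspect the inputs but stop short of producing the recommendations themselves. One possible way to finish the step is sketched below on toy data: take the user's row of predicted ratings, drop the books already rated, keep the highest scores and attach the book details. The helper `recommend_books`, the column name `predictedRating` and all sample values are hypothetical and not part of the original notebook.

```python
import pandas as pd

def recommend_books(preds_row, rated_isbns, books, n=10):
    """Return the n highest-scored books the user has not rated yet."""
    # Remove books the user already rated (ignore ISBNs missing from the row)
    candidates = preds_row.drop(labels=rated_isbns, errors="ignore")
    top = candidates.sort_values(ascending=False).head(n)
    # Turn the Series into a two-column frame: ISBN, predictedRating
    top = top.rename("predictedRating").rename_axis("ISBN").reset_index()
    # Attach title and other metadata
    return top.merge(books, on="ISBN", how="left")

toy_books = pd.DataFrame({
    "ISBN": ["A1", "B2", "C3", "D4"],
    "bookTitle": ["Alpha", "Beta", "Gamma", "Delta"],
})
preds_row = pd.Series({"A1": 7.5, "B2": 1.2, "C3": 4.9, "D4": 3.3})

recs = recommend_books(preds_row, rated_isbns=["A1"], books=toy_books, n=2)
print(recs)
```

With the notebook's own variables, the inputs would be the user's row of `preds_df`, `user_data.ISBN` and `books`, assuming the row has been located via `R_df.index` as the matrix rows are ordered by userID.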
