# Introduction

Recommender systems are used everywhere on the Internet. Given the enormous amount of data available online and the limited time and attention people have, these systems play a crucial part in surfacing relevant content and maximizing the satisfaction of Internet users. The intended output of this project is an algorithm that takes a book title and returns a group of recommended books.

The dataset consists of three tables: book information, book ratings, and user information.
Book information includes the title, book ID (ISBN), author, year, publisher, and links to cover images. Book ratings include the user ID, book ID, and rating. User information includes the user ID, age, and location.

I chose two methods: Matrix Factorization (here, truncated SVD) and K-Means clustering, an unsupervised algorithm. Both are commonly used for recommender systems. At the end, I compare the results of the two methods to determine whether they are consistent.

The results show a slight overlap between the two methods, but the differences between them are substantial. The consistency could be improved if more factors were considered, such as age, location, and author.

After all, the effectiveness of a recommender system is typically measured by the number of user clicks. Such click data can be extremely helpful in improving performance. A good recommender system tends to be one that has been in use long enough to predict trends in users' behavior.



In [228]:
import pandas as pd
import numpy as np

# on_bad_lines='warn' (pandas >= 1.3) skips malformed rows and reports them;
# it replaces error_bad_lines=False, which was removed in pandas 2.0.
books = pd.read_csv("BX-Books.csv", on_bad_lines='warn', sep=';', encoding='latin-1')
ratings = pd.read_csv("BX-Book-Ratings.csv", on_bad_lines='warn', sep=';', encoding='latin-1')
readers = pd.read_csv("BX-Users.csv", on_bad_lines='warn', sep=';', encoding='latin-1')

b'Skipping line 6452: expected 8 fields, saw 9\nSkipping line 43667: expected 8 fields, saw 10\nSkipping line 51751: expected 8 fields, saw 9\n'
b'Skipping line 92038: expected 8 fields, saw 9\nSkipping line 104319: expected 8 fields, saw 9\nSkipping line 121768: expected 8 fields, saw 9\n'
b'Skipping line 144058: expected 8 fields, saw 9\nSkipping line 150789: expected 8 fields, saw 9\nSkipping line 157128: expected 8 fields, saw 9\nSkipping line 180189: expected 8 fields, saw 9\nSkipping line 185738: expected 8 fields, saw 9\n'
b'Skipping line 209388: expected 8 fields, saw 9\nSkipping line 220626: expected 8 fields, saw 9\nSkipping line 227933: expected 8 fields, saw 11\nSkipping line 228957: expected 8 fields, saw 10\nSkipping line 245933: expected 8 fields, saw 9\nSkipping line 251296: expected 8 fields, saw 9\nSkipping line 259941: expected 8 fields, saw 9\nSkipping line 261529: expected 8 fields, saw 9\n'
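The `Skipping line ...` messages come from rows that contain more fields than the header declares. In pandas 1.3 and later, this behavior is requested through the `on_bad_lines` parameter of `read_csv`. A minimal sketch on made-up, in-memory data follows; the `raw` string and the `toy` frame are assumptions for illustration, not part of the dataset.

```python
import io
import pandas as pd

# Toy semicolon-separated data; the last data row has one field too many.
raw = "ISBN;Book-Title\n111;First Book\n222;Second Book\n333;Bad;Row\n"

# on_bad_lines="skip" silently drops rows with too many fields;
# on_bad_lines="warn" would drop them and emit a warning instead.
toy = pd.read_csv(io.StringIO(raw), sep=";", on_bad_lines="skip")
print(toy)
```

Only the two well-formed rows survive, which is why the row counts of the loaded tables can be slightly lower than the line counts of the raw files.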


In [229]:
books.head()

Unnamed: 0,ISBN,Book-Title,Book-Author,Year-Of-Publication,Publisher,Image-URL-S,Image-URL-M,Image-URL-L
0,195153448,Classical Mythology,Mark P. O. Morford,2002,Oxford University Press,http://images.amazon.com/images/P/0195153448.0...,http://images.amazon.com/images/P/0195153448.0...,http://images.amazon.com/images/P/0195153448.0...
1,2005018,Clara Callan,Richard Bruce Wright,2001,HarperFlamingo Canada,http://images.amazon.com/images/P/0002005018.0...,http://images.amazon.com/images/P/0002005018.0...,http://images.amazon.com/images/P/0002005018.0...
2,60973129,Decision in Normandy,Carlo D'Este,1991,HarperPerennial,http://images.amazon.com/images/P/0060973129.0...,http://images.amazon.com/images/P/0060973129.0...,http://images.amazon.com/images/P/0060973129.0...
3,374157065,Flu: The Story of the Great Influenza Pandemic...,Gina Bari Kolata,1999,Farrar Straus Giroux,http://images.amazon.com/images/P/0374157065.0...,http://images.amazon.com/images/P/0374157065.0...,http://images.amazon.com/images/P/0374157065.0...
4,393045218,The Mummies of Urumchi,E. J. W. Barber,1999,W. W. Norton &amp; Company,http://images.amazon.com/images/P/0393045218.0...,http://images.amazon.com/images/P/0393045218.0...,http://images.amazon.com/images/P/0393045218.0...


# The author, year, publisher, and image columns are not used in this analysis, so they are dropped.

In [230]:
columns = ['Book-Author','Year-Of-Publication','Publisher', 'Image-URL-S', 'Image-URL-M', 'Image-URL-L']
books = books.drop(columns, axis = 1)
books.head()

Unnamed: 0,ISBN,Book-Title
0,195153448,Classical Mythology
1,2005018,Clara Callan
2,60973129,Decision in Normandy
3,374157065,Flu: The Story of the Great Influenza Pandemic...
4,393045218,The Mummies of Urumchi


In [231]:
ratings.head()

Unnamed: 0,User-ID,ISBN,Book-Rating
0,276725,034545104X,0
1,276726,0155061224,5
2,276727,0446520802,0
3,276729,052165615X,3
4,276729,0521795028,6


In [232]:
ratings = pd.merge(ratings, books)
ratings.head()

Unnamed: 0,User-ID,ISBN,Book-Rating,Book-Title
0,276725,034545104X,0,Flesh Tones: A Novel
1,2313,034545104X,5,Flesh Tones: A Novel
2,6543,034545104X,0,Flesh Tones: A Novel
3,8680,034545104X,5,Flesh Tones: A Novel
4,10314,034545104X,9,Flesh Tones: A Novel
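When `pd.merge` is called without an `on` argument, it joins on the columns the two frames share (here `ISBN`) and defaults to an inner join, so ratings whose ISBN does not appear in `books` are dropped. The sketch below spells this out on made-up frames; `toy_ratings` and `toy_books` are hypothetical names and values.

```python
import pandas as pd

toy_ratings = pd.DataFrame({
    "User-ID": [1, 2, 3],
    "ISBN": ["A", "A", "Z"],   # "Z" has no match in toy_books
    "Book-Rating": [5, 0, 7],
})
toy_books = pd.DataFrame({"ISBN": ["A", "B"], "Book-Title": ["Alpha", "Beta"]})

# Inner join on the shared key, written explicitly.
merged = pd.merge(toy_ratings, toy_books, on="ISBN", how="inner")
print(merged)
```

Passing `on` and `how` explicitly changes nothing in the result here, but it guards against an accidental join on an extra shared column.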


In [233]:
ratings.shape

(1031136, 4)

In [234]:
ratings_count = ratings.groupby(by = ['Book-Title'])['Book-Rating'].count().reset_index()
ratings_count = ratings_count.rename(columns = {'Book-Rating': 'Ratings Count'})
ratings_count.head()

Unnamed: 0,Book-Title,Ratings Count
0,A Light in the Storm: The Civil War Diary of ...,4
1,Always Have Popsicles,1
2,Apple Magic (The Collector's series),1
3,"Ask Lily (Young Women of Faith: Lily Series, ...",1
4,Beyond IBM: Leadership Marketing and Finance ...,1


In [235]:
ratings_count[ratings_count['Ratings Count'] >= 50].shape

(2444, 2)

# To make the results more meaningful, remove unpopular books from the data set. Only books with 50 or more ratings are considered, which leaves 2444 books.
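The popularity filter can also be written in a single pass with `groupby(...).transform("count")`, which attaches each book's rating count to every one of its rows. This is an alternative sketch on made-up data, not the code used in this notebook; `toy` and `MIN_RATINGS` are names introduced here.

```python
import pandas as pd

toy = pd.DataFrame({
    "Book-Title": ["A", "A", "A", "B", "C", "C"],
    "Book-Rating": [5, 0, 8, 7, 3, 9],
})
MIN_RATINGS = 2  # the notebook uses 50; lowered to suit the toy data

# transform("count") returns one count per row, aligned with the original index.
counts = toy.groupby("Book-Title")["Book-Rating"].transform("count")
popular = toy[counts >= MIN_RATINGS]
print(sorted(popular["Book-Title"].unique()))
```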

In [236]:
ratings = pd.merge(ratings, ratings_count)

In [237]:
ratings.head()

Unnamed: 0,User-ID,ISBN,Book-Rating,Book-Title,Ratings Count
0,276725,034545104X,0,Flesh Tones: A Novel,60
1,2313,034545104X,5,Flesh Tones: A Novel,60
2,6543,034545104X,0,Flesh Tones: A Novel,60
3,8680,034545104X,5,Flesh Tones: A Novel,60
4,10314,034545104X,9,Flesh Tones: A Novel,60


In [238]:
ratings_pop = ratings[ratings['Ratings Count'] >= 50]
ratings_pop.head()

Unnamed: 0,User-ID,ISBN,Book-Rating,Book-Title,Ratings Count
0,276725,034545104X,0,Flesh Tones: A Novel,60
1,2313,034545104X,5,Flesh Tones: A Novel,60
2,6543,034545104X,0,Flesh Tones: A Novel,60
3,8680,034545104X,5,Flesh Tones: A Novel,60
4,10314,034545104X,9,Flesh Tones: A Novel,60


In [239]:
readers.head()

Unnamed: 0,User-ID,Location,Age
0,1,"nyc, new york, usa",
1,2,"stockton, california, usa",18.0
2,3,"moscow, yukon territory, russia",
3,4,"porto, v.n.gaia, portugal",17.0
4,5,"farnborough, hants, united kingdom",


In [240]:
readers = readers[readers['Age'] > 0].drop(columns = 'Location')
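The `Age > 0` filter does two things at once: it removes non-positive ages and it removes readers whose age is missing, because a comparison with `NaN` evaluates to `False`. A small sketch on a made-up frame (`toy_readers` is hypothetical) shows the effect:

```python
import numpy as np
import pandas as pd

toy_readers = pd.DataFrame({
    "User-ID": [1, 2, 3, 4],
    "Location": ["a", "b", "c", "d"],
    "Age": [np.nan, 18.0, 0.0, 17.0],
})

# NaN > 0 is False, so rows with a missing age are dropped as well.
kept = toy_readers[toy_readers["Age"] > 0].drop(columns="Location")
print(kept)
```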

In [241]:
data = pd.merge(ratings_pop, readers)

In [242]:
data.head()

Unnamed: 0,User-ID,ISBN,Book-Rating,Book-Title,Ratings Count,Age
0,2313,034545104X,5,Flesh Tones: A Novel,60,23.0
1,2313,0812533550,9,Ender's Game (Ender Wiggins Saga (Paperback)),249,23.0
2,2313,0679745580,8,In Cold Blood (Vintage International),55,23.0
3,2313,0399146431,5,The Bonesetter's Daughter,384,23.0
4,2313,0060173289,9,Divine Secrets of the Ya-Ya Sisterhood : A Novel,130,23.0


In [243]:
data_pivot = data.pivot_table(index = 'User-ID', columns = 'Book-Title', values = 'Book-Rating').fillna(0)

In [244]:
data_pivot.head()

Book-Title,10 Lb. Penalty,16 Lighthouse Road,1984,1st to Die: A Novel,2010: Odyssey Two,204 Rosewood Lane,2061: Odyssey Three,24 Hours,2nd Chance,3rd Degree,...,YOU BELONG TO ME,Year of Wonders,You Belong To Me,You Shall Know Our Velocity,Young Wives,Zen and the Art of Motorcycle Maintenance: An Inquiry into Values,Zoya,"\O\"" Is for Outlaw""","\Surely You're Joking, Mr. Feynman!\"": Adventures of a Curious Character""",stardust
User-ID,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
19,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
42,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
44,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
51,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
56,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
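The pivot table is mostly zeros because each reader has rated only a handful of books. Note that after `fillna(0)` a book the reader never rated and a book rated 0 look the same. The sketch below reproduces the step on made-up data and measures how many cells hold a nonzero rating; `toy` and `density` are names introduced for this example.

```python
import pandas as pd

toy = pd.DataFrame({
    "User-ID": [1, 1, 2, 3],
    "Book-Title": ["A", "B", "A", "C"],
    "Book-Rating": [5, 8, 0, 9],
})

# Rows are users, columns are titles; missing (user, book) pairs become 0.
pivot = toy.pivot_table(index="User-ID", columns="Book-Title",
                        values="Book-Rating").fillna(0)

# Fraction of cells that hold a nonzero rating.
density = (pivot.values != 0).mean()
print(pivot)
print(round(density, 2))
```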


In [245]:
X = data_pivot.T

In [246]:
from sklearn.decomposition import TruncatedSVD

SVD = TruncatedSVD(n_components = 10, random_state = 0).fit_transform(X)
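Because `X` is the transposed pivot table, each row is a book, and `fit_transform` returns one low-dimensional vector per book. A minimal sketch on a random stand-in matrix illustrates the shapes involved; `toy_X`, its size, and `n_components=3` are assumptions for illustration.

```python
import numpy as np
from sklearn.decomposition import TruncatedSVD

rng = np.random.default_rng(0)
# Hypothetical item-by-user matrix: 6 books (rows) rated by 8 users (columns).
toy_X = rng.integers(0, 11, size=(6, 8)).astype(float)

svd = TruncatedSVD(n_components=3, random_state=0)
embedding = svd.fit_transform(toy_X)  # one 3-dimensional vector per book

print(embedding.shape)
print(svd.explained_variance_ratio_.sum())
```

The sum of `explained_variance_ratio_` gives a rough sense of how much of the rating variance the chosen number of components retains.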

In [285]:
from sklearn.cluster import KMeans

kmeans = KMeans(n_clusters = 50, random_state = 0).fit(SVD)
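K-Means assigns each book vector to the nearest of `n_clusters` centroids, and the assignment is exposed as `labels_`. The sketch below runs it on two clearly separated groups of made-up 2-D points; `toy_embedding` is hypothetical and far smaller than the real SVD output.

```python
import numpy as np
from sklearn.cluster import KMeans

# Two well-separated hypothetical groups of 2-D book embeddings.
toy_embedding = np.array([
    [0.0, 0.1], [0.1, 0.0], [0.05, 0.05],
    [5.0, 5.1], [5.1, 5.0], [4.9, 5.0],
])

# n_init is set explicitly because its default differs across scikit-learn versions.
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(toy_embedding)
labels = km.labels_
print(labels)
```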

In [286]:
BookList = list(data_pivot.columns)

In [319]:
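for cluster in range(5):
    print("\nCluster #{}".format(cluster))
    # Titles of all books assigned to this cluster
    cluster_books = [BookList[bookID] for bookID in np.where(kmeans.labels_ == cluster)[0]]
    # Show up to three titles; slicing avoids an IndexError for small clusters
    for title in cluster_books[:3]:
        print(title)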


Cluster #0
Divine Secrets of the Ya-Ya Sisterhood: A Novel
Girl with a Pearl Earring
House of Sand and Fog

Cluster #1
2061: Odyssey Three
A Second Chicken Soup for the Woman's Soul (Chicken Soup for the Soul Series)
A Spell for Chameleon (Xanth Novels (Paperback))

Cluster #2
American Gods
Angela's Ashes: A Memoir
Animal Farm

Cluster #3
Harry Potter and the Chamber of Secrets (Book 2)
Harry Potter and the Goblet of Fire (Book 4)
Harry Potter and the Prisoner of Azkaban (Book 3)

Cluster #4
Carrie
Desperation
Dolores Claiborne


# Books in the same cluster can be recommended together.
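Turning cluster membership into a recommendation for a given title could look like the sketch below. The helper `recommend_from_cluster` is a hypothetical function, and `toy_titles` and `toy_labels` are small stand-ins for `BookList` and `kmeans.labels_`.

```python
import numpy as np

def recommend_from_cluster(title, titles, labels):
    """Return every other title that shares a cluster label with `title`."""
    labels = np.asarray(labels)
    idx = titles.index(title)  # raises ValueError if the title is unknown
    same = np.where(labels == labels[idx])[0]
    return [titles[i] for i in same if i != idx]

# Stand-in data: a few titles with illustrative cluster labels.
toy_titles = ["Carrie", "Desperation", "American Gods", "Dolores Claiborne", "Animal Farm"]
toy_labels = [4, 4, 2, 4, 2]

recs = recommend_from_cluster("Carrie", toy_titles, toy_labels)
print(recs)
```

With the real data, the same call would be made with `BookList` and `kmeans.labels_`, and the result could be truncated or ranked if the cluster is large.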

In [373]:
BookList.index('Carrie')

340

In [374]:
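# Correlation between the SVD vectors of every pair of books
corr = np.corrcoef(SVD)
corrbook = corr[BookList.index('Carrie')]  # row 340, the index found above
bookLs = data_pivot.columns
# Keep strongly correlated titles, excluding the book itself (correlation 1.0)
list(bookLs[(corrbook > 0.97) & (corrbook < 1.0)])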

['BAG OF BONES : A NOVEL',
 'Bag of Bones',
 'Cold Fire',
 'Desperation',
 'Different Seasons',
 'Dolores Claiborne',
 "Everything's Eventual : 14 Dark Tales",
 'Eyes of the Dragon',
 'Four Past Midnight',
 "Gerald's Game",
 'Hearts In Atlantis',
 'Insomnia',
 'Misery',
 'Needful Things: The Last Castle Rock Story',
 'Nightmares &amp; Dreamscapes',
 'Rose Madder',
 'Skeleton Crew',
 'Strangers',
 'The Bachman Books: Rage, the Long Walk, Roadwork, the Running Man',
 'The Dark Half',
 'The Dead Zone',
 'The Green Mile: The Complete Serial Novel',
 'The Regulators',
 'The Tommyknockers']

# The books above are highly correlated with "Carrie". There is some overlap with the K-Means results: "Desperation" and "Dolores Claiborne" also appear in Cluster 4.
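One way to put a number on the overlap between the two methods is the Jaccard index of their recommendation sets. The sketch below is illustrative only: the function `jaccard` and the two short placeholder lists are assumptions made for the example, not results from this notebook.

```python
def jaccard(a, b):
    """Share of titles common to both sets (0 = disjoint, 1 = identical)."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

# Placeholder recommendation lists for one query book.
from_kmeans = ["Book A", "Book B", "Book C"]
from_correlation = ["Book A", "Book B", "Book D", "Book E"]

overlap = jaccard(from_kmeans, from_correlation)
print(round(overlap, 2))
```

Averaging this score over many query books would give a single consistency measure for the two approaches.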