# COLLABORATIVE FILTERING - Finding Similar Books and Movies

We'll start by loading the Goodreads dataset. Using Pandas, we can quickly load the rating and item files and relate them through the shared item ID, so that we can work with book names instead of IDs. (In a real production job, you'd stick with IDs and resolve the names only at the display layer, which is more efficient. Working with names makes it easier to see what's going on for now.)

In [1]:
import pandas as pd
import numpy as np

import warnings
warnings.simplefilter('ignore')

### Load the data set of book ratings

In [2]:
pathToRatings = 'https://raw.githubusercontent.com/diksha-cl/Data-files/master/bookratings.csv'
ratings = pd.read_csv(pathToRatings)
ratings.head(n=5)

Unnamed: 0,userId,itemId,rating
0,22,264,2
1,1138,264,5
2,1160,264,3
3,1217,264,3
4,1572,264,3


In [3]:
print("Number of ratings:", ratings.shape[0])
print("Unique users:", ratings['userId'].unique().size)
print("Unique books:", ratings['itemId'].unique().size)

Number of ratings: 212395
Unique users: 3000
Unique books: 1891


### Load the item/book details

In [4]:
pathToDetails = 'https://raw.githubusercontent.com/diksha-cl/Data-files/master/bookInfo.csv'
items = pd.read_csv(pathToDetails)

In [5]:
items.sample(n=5)

Unnamed: 0,itemId,title,details
1420,2019,"The Maze of Bones (The 39 Clues, #1)",Rick Riordan
1070,1321,"One Piece, Volume 01: Romance Dawn (One Piece,...","Eiichirō Oda, Andy Nakatani"
737,616,Plain Truth,Jodi Picoult
822,1843,Prep,Curtis Sittenfeld
685,1325,"The Likeness (Dublin Murder Squad, #2)",Tana French
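
The introduction talks about merging ratings with item details so that titles can stand in for IDs. The notebook itself looks titles up on demand, but a merge could be sketched roughly as follows. The `toy_ratings` and `toy_items` frames are made-up stand-ins, not the real Goodreads files.

```python
import pandas as pd

# Toy stand-ins for the ratings and items tables (values are illustrative only)
toy_ratings = pd.DataFrame({
    "userId": [22, 1138, 22],
    "itemId": [1, 1, 2],
    "rating": [2, 5, 4],
})
toy_items = pd.DataFrame({
    "itemId": [1, 2],
    "title": ["The Hunger Games", "Harry Potter and the Philosopher's Stone"],
})

# Attach the title to every rating row by joining on the shared itemId column.
# how="left" keeps every rating even if an item has no entry in toy_items.
toy_merged = pd.merge(toy_ratings, toy_items, on="itemId", how="left")
print(toy_merged)
```

The same call on the real `ratings` and `items` frames should work as long as both carry an `itemId` column, as the outputs above suggest.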


# Build the Pivot Table of ratings

In [6]:
pivotTable = ratings.pivot_table(index=['userId'],columns=['itemId'],values='rating')
pivotTable.shape

(3000, 1891)

In [7]:
pivotTable.sample(n=5)

userId \ itemId,1,2,3,4,5,6,7,8,10,11,...,2998,3105,3132,3150,3231,3345,3384,3422,3436,7373
33851,,,,,,,,,,,...,,,,,,,,,,
41320,,,,4.0,,,,3.0,,5.0,...,,,,,,,,,,
50660,,5.0,,,,,4.0,,,,...,,,,,,,,,,
10823,,,,,,,5.0,,,,...,,,,,,,,,,5.0
34476,,,,,5.0,3.0,,,,,...,,,,,,,,,,
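
To see what the pivot step does on a scale small enough to read, here is a minimal sketch with invented ratings. Each user becomes a row, each item a column, and any book a user has not rated shows up as `NaN`.

```python
import pandas as pd

# Four invented ratings from three users on two items
toy_ratings = pd.DataFrame({
    "userId": [1, 1, 2, 3],
    "itemId": [10, 20, 10, 20],
    "rating": [4, 5, 3, 2],
})

# Rows are users, columns are items, cells hold the rating (NaN when missing)
toy_pivot = toy_ratings.pivot_table(index="userId", columns="itemId", values="rating")
print(toy_pivot)

# Fraction of user/item cells that have no rating at all
sparsity = toy_pivot.isna().to_numpy().mean()
print("Share of empty cells:", sparsity)
```

The real table is far sparser: 212,395 ratings spread over 3000 x 1891 cells means that, if each user rated each book at most once, only about 3.7% of the cells hold a value.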


## Compute the correlation matrix of all books with each other

In [8]:
corrTable = pivotTable.corr(min_periods=250)

In [9]:
#View the corrtable
corrTable.head()

itemId,1,2,3,4,5,6,7,8,10,11,...,2998,3105,3132,3150,3231,3345,3384,3422,3436,7373
1,1.0,0.234702,0.388496,0.031843,-0.015047,0.18635,-0.008045,0.066032,0.090484,0.245843,...,,,,,,,,,,
2,0.234702,1.0,0.135253,0.119011,0.068083,0.150993,0.278121,0.01233,0.251825,0.165599,...,,,,,,,,,,
3,0.388496,0.135253,1.0,0.008118,-0.090979,0.170944,0.02874,0.116441,0.174361,0.278934,...,,,,,,,,,,
4,0.031843,0.119011,0.008118,1.0,0.333149,,0.123229,0.273784,0.296222,0.172932,...,,,,,,,,,,
5,-0.015047,0.068083,-0.090979,0.333149,1.0,,0.164837,0.397155,0.15106,0.048222,...,,,,,,,,,,
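
The `NaN` entries in the table come from `min_periods`: `DataFrame.corr` compares two columns using only the users who rated both books, and returns `NaN` when fewer than `min_periods` such users exist. A small sketch with invented ratings (and a threshold of 3 instead of 250) illustrates this.

```python
import numpy as np
import pandas as pd

# Invented ratings: A and B share four raters, A and C share only one
toy_pivot = pd.DataFrame(
    {
        "A": [5, 4, 1, 2, np.nan],
        "B": [4, 5, 2, 1, np.nan],
        "C": [np.nan, np.nan, np.nan, 3, 4],
    },
    index=pd.Index([1, 2, 3, 4, 5], name="userId"),
)

# Pearson correlation between item columns, requiring at least 3 shared raters
toy_corr = toy_pivot.corr(min_periods=3)
print(toy_corr)
```

Here A and B get a real correlation, while the A/C pair is left as `NaN` because a single shared rater says nothing reliable about similarity.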


### For a given item, find other items whose ratings are highly correlated. 

In [10]:
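def itemsFromIDs(items, IDlist):
    # Collect the matching rows for each ID, preserving the order of IDlist
    frames = [items[items.itemId == itemId] for itemId in IDlist]
    if not frames:
        # No IDs requested: return an empty frame with the same columns
        return pd.DataFrame(columns=items.columns)
    return pd.concat(frames, axis=0).reset_index(drop=True)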

In [11]:
itemsFromIDs(items, [1,2])

Unnamed: 0,itemId,title,details
0,1,The Hunger Games,Suzanne Collins
1,2,Harry Potter and the Philosopher's Stone,"J.K. Rowling, Mary GrandPré"


In [13]:
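def relatedRecos(itemName):
    # Look up the ID of the book with exactly this title (first match wins)
    itemID = items[items.title == itemName]["itemId"].iloc[0]

    # Correlations of this book with every other book
    my_corr = corrTable.loc[itemID]

    # Drop pairs with too few shared raters (NaN) and keep the 10 highest.
    # The book itself correlates 1.0 with itself, so it normally comes first.
    top10 = my_corr.dropna().sort_values(ascending=False)[:10]
    top10itemIDs = list(top10.index)

    return itemsFromIDs(items, top10itemIDs)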

In [14]:
itemName = 'Harry Potter and the Deathly Hallows' 

top10Recos = relatedRecos(itemName)
top10Recos

Unnamed: 0,itemId,title,details
0,25,Harry Potter and the Deathly Hallows,"J.K. Rowling, Mary GrandPré"
1,27,Harry Potter and the Half-Blood Prince (Harry ...,"J.K. Rowling, Mary GrandPré"
2,24,Harry Potter and the Goblet of Fire,"J.K. Rowling, Mary GrandPré"
3,21,Harry Potter and the Order of the Phoenix,"J.K. Rowling, Mary GrandPré"
4,18,Harry Potter and the Prisoner of Azkaban,"J.K. Rowling, Mary GrandPré, Rufus Beck"
5,23,Harry Potter and the Chamber of Secrets,"J.K. Rowling, Mary GrandPré"
6,2,Harry Potter and the Philosopher's Stone,"J.K. Rowling, Mary GrandPré"
7,10,Pride and Prejudice,Jane Austen
8,17,"Catching Fire (The Hunger Games, #2)",Suzanne Collins
9,26,"The Da Vinci Code (Robert Langdon, #2)",Dan Brown


In [15]:
itemName = 'Of Mice and Men'

top10Recos = relatedRecos(itemName)
top10Recos

Unnamed: 0,itemId,title,details
0,32,Of Mice and Men,John Steinbeck
1,14,Animal Farm,George Orwell
2,58,The Adventures of Huckleberry Finn,"Mark Twain, John Seelye, Guy Cardwell"
3,15,The Diary of a Young Girl,"Anne Frank, Eleanor Roosevelt, B.M. Mooyaart-D..."
4,29,Romeo and Juliet,"William Shakespeare, Robert Jackson"
5,28,Lord of the Flies,William Golding
6,4,To Kill a Mockingbird,Harper Lee
7,8,The Catcher in the Rye,J.D. Salinger
8,7,The Hobbit,J.R.R. Tolkien
9,5,The Great Gatsby,F. Scott Fitzgerald


In [16]:
itemName = 'The Kite Runner'

top10Recos = relatedRecos(itemName)
top10Recos

Unnamed: 0,itemId,title,details
0,11,The Kite Runner,Khaled Hosseini
1,67,A Thousand Splendid Suns,Khaled Hosseini
2,31,The Help,Kathryn Stockett
3,33,Memoirs of a Geisha,Arthur Golden
4,57,The Secret Life of Bees,Sue Monk Kidd
5,3,"Twilight (Twilight, #1)",Stephenie Meyer
6,46,Water for Elephants,Sara Gruen
7,1,The Hunger Games,Suzanne Collins
8,15,The Diary of a Young Girl,"Anne Frank, Eleanor Roosevelt, B.M. Mooyaart-D..."
9,26,"The Da Vinci Code (Robert Langdon, #2)",Dan Brown


In [17]:
itemName = 'The Hunger Games'

top10Recos = relatedRecos(itemName)
top10Recos

Unnamed: 0,itemId,title,details
0,1,The Hunger Games,Suzanne Collins
1,17,"Catching Fire (The Hunger Games, #2)",Suzanne Collins
2,20,"Mockingjay (The Hunger Games, #3)",Suzanne Collins
3,3,"Twilight (Twilight, #1)",Stephenie Meyer
4,12,"Divergent (Divergent, #1)",Veronica Roth
5,73,"The Host (The Host, #1)",Stephenie Meyer
6,64,My Sister's Keeper,Jodi Picoult
7,52,"Eclipse (Twilight, #3)",Stephenie Meyer
8,69,"Insurgent (Divergent, #2)",Veronica Roth
9,53,"Eragon (The Inheritance Cycle, #1)",Christopher Paolini
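
In each result above, the first row is the queried book itself, since every book correlates perfectly with itself. If that row is not wanted, one option is to drop it before ranking. The helper below is a hypothetical sketch, not part of the notebook, and is shown on a tiny invented correlation table.

```python
import numpy as np
import pandas as pd

def related_excluding_self(corr_table, item_id, n=10):
    """Hypothetical helper: top-n correlated item IDs, leaving out item_id itself."""
    # Remove the item's own entry, then discard pairs without a valid correlation
    scores = corr_table.loc[item_id].drop(labels=item_id).dropna()
    return list(scores.sort_values(ascending=False).head(n).index)

# Invented 3x3 correlation table keyed by item ID
toy_corr = pd.DataFrame(
    [[1.0, 0.6, 0.2], [0.6, 1.0, np.nan], [0.2, np.nan, 1.0]],
    index=[1, 2, 3],
    columns=[1, 2, 3],
)

print(related_excluding_self(toy_corr, 1, n=2))
print(related_excluding_self(toy_corr, 2))
```

The returned IDs could then be passed to `itemsFromIDs` to get titles, in the same way `relatedRecos` does.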


### Try to find recommendations based on your favorite books

In [18]:
def searchForItems(items, searchStr):
    df = items[items['title'].str.contains(searchStr, case=False)]
    return list(df['title'])

In [19]:
searchForItems(items, "Lost")

['Wild: From Lost to Found on the Pacific Crest Trail',
 'The Lost Symbol (Robert Langdon, #3)',
 'Paradise Lost',
 'The Lost World (Jurassic Park, #2)',
 'The Lost Colony (Artemis Fowl, #5)',
 'The Lost Hero (The Heroes of Olympus, #1)',
 'City of Lost Souls (The Mortal Instruments, #5)']
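
One caveat with `searchForItems`: `str.contains` treats the search string as a regular expression by default, so a query such as `"(Twilight"` would raise an error, and titles that are missing would produce `NaN` in the mask. A hedged variant that matches the text literally could look like this; `search_titles_literal` and `toy_items` are illustrative names, not part of the notebook.

```python
import pandas as pd

def search_titles_literal(items, search_str):
    """Hypothetical variant of searchForItems that matches search_str literally."""
    # regex=False disables pattern syntax; na=False treats missing titles as non-matches
    mask = items["title"].str.contains(search_str, case=False, regex=False, na=False)
    return list(items.loc[mask, "title"])

# Invented titles, including one missing value
toy_items = pd.DataFrame(
    {"title": ["Twilight (Twilight, #1)", "Paradise Lost", None]}
)

print(search_titles_literal(toy_items, "(twilight"))
print(search_titles_literal(toy_items, "lost"))
```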

#### Netflix Recommendation Engine (much, much more sophisticated)

1. Explicit behavior (such as the ratings you give).
2. Implicit behavior (how many sittings it took you to finish a movie, what time you watched it).
3. Tags describing the movie content (crime thriller, adventure, female leads, and so on). All of these tags are generated by human taggers.


#### Amazon Recommendation Engine

1. Purchased shopping carts: real money from real people spent on real items, which makes for powerful data, and a lot of it.
2. Items added to carts but abandoned.
3. Online pricing experiments (A/B testing, etc.), where the same products are offered at different prices and the results are compared.
4. Wishlists: what's on yours specifically, and in aggregate they can be treated like another stream of basket-analysis data.
5. Referral sites (where you came in from can hint at other items of interest).
6. Dwell times (how long before you click back and pick a different item).
7. Ratings by you or by those in your social network or buying circles. If you rate things you like, you get more of what you like, and if you confirm with the "I already own it" button, they build a very complete profile of you.
8. Demographic information (your shipping address, etc.): they know what is popular in your general area for your kids, yourself, your spouse, etc.
9. User segmentation: did you buy 3 books in separate months for a toddler? You likely have a kid or more.
10. Direct-marketing click-through data: did you get an email from them and click through? They know which email it was, what you clicked through on, and whether you bought it as a result.
11. Click paths in a session: what you viewed, regardless of whether it went in your cart.
12. Number of times you viewed an item before the final purchase.

#### Market Basket Analysis
See what is there in your basket and provide recommendations based on what people with a similar basket might additionally buy.

1. For example, if the basket has milk, then possibly eggs, bread, and cereal.
2. If the basket has beer, then maybe peanuts, etc.
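
The basket idea can be sketched with nothing more than pair counting. The baskets below are invented, and real market basket analysis usually relies on measures such as support, confidence, and lift rather than raw counts, so treat this only as an illustration of the principle.

```python
from collections import Counter
from itertools import combinations

# Invented purchase baskets
baskets = [
    {"milk", "eggs", "bread"},
    {"milk", "bread", "cereal"},
    {"beer", "peanuts"},
    {"milk", "eggs"},
    {"beer", "peanuts", "bread"},
]

# Count how often each unordered pair of items appears in the same basket
pair_counts = Counter()
for basket in baskets:
    for pair in combinations(sorted(basket), 2):
        pair_counts[pair] += 1

def suggest(item, n=2):
    """Return up to n items most often bought together with `item`."""
    scores = Counter()
    for (a, b), count in pair_counts.items():
        if a == item:
            scores[b] += count
        elif b == item:
            scores[a] += count
    return [other for other, _ in scores.most_common(n)]

print(suggest("beer", n=1))
print(suggest("milk"))
```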

### Try to find recommendations based on your favorite movies