# Content-Based Filtering model  
## Approach 1: Recommendation through Description of the Content
In this approach, an item is suggested to the user based on its **description**. The description goes deeper into the product details, i.e. title, summary, taglines, and genre, and so provides much more information about the item. These details are stored as text (strings), so it is important to convert them into a numerical representation before items can be compared.

**Term Frequency-Inverse Document Frequency (TF-IDF)**
TF-IDF is a feature-extraction technique widely used in Information Retrieval and Natural Language Processing (NLP).

Term Frequency: the number of times a word occurs in the current document divided by the total number of words in that document. A word that occurs more often gets a higher weight; dividing by the document length normalizes the count so that long documents are not favored.

![TF](https://cdn-images-1.medium.com/max/800/1*5s81Q5RYPUSxDqPr0gG4qw.png)

Inverse Document Frequency: the total number of documents divided by the number of documents that contain the word. It measures how rare the word is: the fewer documents the word occurs in, the higher its IDF. This gives a higher score to rare terms.

![IDF](https://cdn-images-1.medium.com/max/800/1*BwOL05kcXjty9ctYUvntbQ.png)

TF-IDF
Finally, TF-IDF measures how important a word is to a document within a corpus. The importance increases proportionally to the number of times the word appears in the document, but is offset by how frequently the word appears across the corpus.

![TFIDF](https://cdn-images-1.medium.com/max/800/1*XP-YYe-mDQWVEDJWQG0-rg.png)
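The definitions above can be checked by hand on a tiny corpus. The following is a minimal sketch, assuming the plain textbook form (term count divided by document length, natural log of the document ratio); the three toy documents are hypothetical and not taken from the dataset.

```python
import math
from collections import Counter

# Toy corpus (hypothetical, for illustration only).
docs = [
    "the odyssey",
    "the tao of pooh",
    "the te of piglet",
]
tokenized = [d.split() for d in docs]
n_docs = len(tokenized)

# Document frequency: number of documents that contain each term.
df = Counter(term for tokens in tokenized for term in set(tokens))

def tf(term, tokens):
    # Term count divided by the document length.
    return tokens.count(term) / len(tokens)

def idf(term):
    # Natural log of (total documents / documents containing the term).
    return math.log(n_docs / df[term])

def tfidf(term, tokens):
    return tf(term, tokens) * idf(term)

common = tfidf("the", tokenized[0])    # "the" occurs in every document
rare = tfidf("odyssey", tokenized[0])  # "odyssey" occurs in one document
print(common, rare)
```

A word that appears in every document gets a score of zero, while a word unique to one document is weighted up. Note that scikit-learn's `TfidfVectorizer` uses a smoothed IDF and L2-normalizes each row by default, so its numbers will differ from this hand computation.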

In [1]:
import numpy as np
import pandas as pd
import sklearn
from nltk.corpus import stopwords
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.metrics.pairwise import linear_kernel
from Data_cleaning import *

In [2]:
#import books, users, ratings clean data
books, users, ratings = get_clean_data()

b'Skipping line 6452: expected 8 fields, saw 9\nSkipping line 43667: expected 8 fields, saw 10\nSkipping line 51751: expected 8 fields, saw 9\n'
b'Skipping line 92038: expected 8 fields, saw 9\nSkipping line 104319: expected 8 fields, saw 9\nSkipping line 121768: expected 8 fields, saw 9\n'
b'Skipping line 144058: expected 8 fields, saw 9\nSkipping line 150789: expected 8 fields, saw 9\nSkipping line 157128: expected 8 fields, saw 9\nSkipping line 180189: expected 8 fields, saw 9\nSkipping line 185738: expected 8 fields, saw 9\n'
b'Skipping line 209388: expected 8 fields, saw 9\nSkipping line 220626: expected 8 fields, saw 9\nSkipping line 227933: expected 8 fields, saw 11\nSkipping line 228957: expected 8 fields, saw 10\nSkipping line 245933: expected 8 fields, saw 9\nSkipping line 251296: expected 8 fields, saw 9\nSkipping line 259941: expected 8 fields, saw 9\nSkipping line 261529: expected 8 fields, saw 9\n'

In [3]:
books.head()

Unnamed: 0,isbn,title,author,pub_year,publisher,url_s,url_m,url_l
0,195153448,Classical Mythology,Mark P. O. Morford,2002.0,Oxford University Press,http://images.amazon.com/images/P/0195153448.0...,http://images.amazon.com/images/P/0195153448.0...,http://images.amazon.com/images/P/0195153448.0...
1,2005018,Clara Callan,Richard Bruce Wright,2001.0,HarperFlamingo Canada,http://images.amazon.com/images/P/0002005018.0...,http://images.amazon.com/images/P/0002005018.0...,http://images.amazon.com/images/P/0002005018.0...
2,60973129,Decision in Normandy,Carlo D'Este,1991.0,HarperPerennial,http://images.amazon.com/images/P/0060973129.0...,http://images.amazon.com/images/P/0060973129.0...,http://images.amazon.com/images/P/0060973129.0...
3,374157065,Flu: The Story of the Great Influenza Pandemic...,Gina Bari Kolata,1999.0,Farrar Straus Giroux,http://images.amazon.com/images/P/0374157065.0...,http://images.amazon.com/images/P/0374157065.0...,http://images.amazon.com/images/P/0374157065.0...
4,393045218,The Mummies of Urumchi,E. J. W. Barber,1999.0,W. W. Norton & Company,http://images.amazon.com/images/P/0393045218.0...,http://images.amazon.com/images/P/0393045218.0...,http://images.amazon.com/images/P/0393045218.0...


In [4]:
ratings.head()

Unnamed: 0,user,isbn,rating
0,276725,034545104X,0
1,276726,0155061224,5
2,276727,0446520802,0
3,276729,052165615X,3
4,276729,0521795028,6


In [5]:
#number of ratings received by each book
usersPerIsbn = ratings['isbn'].value_counts()
usersPerIsbn

0971880107    2502
0316666343    1295
0385504209     883
0060928336     732
0312195516     723
044023722X     647
0679781587     639
0142001740     615
067976402X     614
0671027360     586
0446672211     585
059035342X     571
0316601950     568
0375727345     552
044021145X     529
0452282152     526
0440214041     523
0804106304     519
0440211727     517
0345337662     506
0060930535     494
0440226430     482
0312278586     474
0743418174     470
0671021001     468
0345370775     466
0446605239     465
0156027321     462
0440241073     456
0671003755     446
              ... 
8775600382       1
086051286X       1
3442433061       1
0374520011       1
8481304123       1
037319062X       1
8420428701       1
0552125075       1
0945466242       1
0441009891       1
0733534104       1
0142004073       1
0871312212       1
0743212088       1
0385601689       1
345272552        1
0394731115       1
1570820341       1
0062503227       1
1569244405       1
1858600049       1
0964483807  

Keeping only books that received more than 10 ratings from users

In [6]:
books_10 = books[books['isbn'].isin(usersPerIsbn[usersPerIsbn>10].index)]

In [7]:
books_10.shape

(15452, 8)

There are only **15,452** books remaining.

Using only the book title (the `title` column) for TF-IDF

In [8]:
stopwords_list = stopwords.words('english')

In [9]:
vectorizer = TfidfVectorizer(analyzer='word')
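The `stopwords_list` built in the previous cell is not passed to the vectorizer, so common words such as "the" and "of" stay in the vocabulary. The sketch below shows the effect of the `stop_words` parameter on a hypothetical three-title sample; it uses scikit-learn's built-in `'english'` list so that it runs without the NLTK corpus, and passing `stop_words=stopwords_list` would be the analogous call for the NLTK list.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical sample of titles (not the full books_10 frame).
titles = ["The Tao of Pooh", "The Te of Piglet", "The Odyssey"]

# Same settings as the notebook cell: no stop-word filtering.
plain = TfidfVectorizer(analyzer='word').fit(titles)

# With stop-word filtering enabled.
filtered = TfidfVectorizer(analyzer='word', stop_words='english').fit(titles)

print(sorted(plain.vocabulary_))
print(sorted(filtered.vocabulary_))
```

Enabling the filter would change the matrix shapes reported in this notebook, so the original cells are left as they are.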

In [10]:
#build book-title tfidf matrix
tfidf_matrix = vectorizer.fit_transform(books_10['title'])

In [11]:
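# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() is available from 1.0
tfidf_feature_name = vectorizer.get_feature_names_out()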

In [12]:
tfidf_matrix.shape

(15452, 10972)

In [13]:
# compute the cosine similarity matrix using sklearn's linear_kernel
# (note: this variable name shadows the cosine_similarity function imported above)
cosine_similarity = linear_kernel(tfidf_matrix, tfidf_matrix)
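`linear_kernel` only computes dot products. It yields cosine similarity here because `TfidfVectorizer` L2-normalizes every row by default (`norm='l2'`), and the dot product of two unit vectors is their cosine. A small check with two hypothetical vectors:

```python
import numpy as np

# Two hypothetical, unnormalized term-weight vectors.
a = np.array([1.0, 2.0, 0.0, 2.0])
b = np.array([2.0, 0.0, 1.0, 2.0])

# Cosine similarity from its definition.
cosine = a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

# L2-normalize first, then take a plain dot product
# (the operation linear_kernel performs on each pair of rows).
a_unit = a / np.linalg.norm(a)
b_unit = b / np.linalg.norm(b)
dot_of_units = a_unit @ b_unit

print(cosine, dot_of_units)
```

If the vectorizer were created with `norm=None`, the dot product would no longer be a cosine and `cosine_similarity` from scikit-learn would be the safer choice.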

In [14]:
books_10 = books_10.reset_index(drop=True)

In [15]:
indices = pd.Series(books_10['title'].index)

In [16]:
#Function to get the most similar books
def recommend(index, method):
    idx = indices[index]
    # Get the pairwise similarity scores of all books against that book,
    # sort them in descending order and keep the top 5
    # (position 0 is skipped because it is expected to be the book itself)
    similarity_scores = list(enumerate(method[idx]))
    similarity_scores = sorted(similarity_scores, key=lambda x: x[1], reverse=True)
    similarity_scores = similarity_scores[1:6]

    #Get the books index
    books_index = [i[0] for i in similarity_scores]

    #Return the top 5 most similar books using integer-location based indexing (iloc)
    return books_10['title'].iloc[books_index]
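Slicing `[1:6]` assumes the queried book sorts into position 0. When another book has identical text (for example, two editions with the same title), both score 1.0 and the queried book can land in position 1 and be returned as its own recommendation. The following sketch, with a hypothetical helper name and a toy similarity matrix, excludes the query explicitly instead:

```python
import numpy as np

def top_k_similar(sim_matrix, query_idx, k=5):
    """Return positions of the k most similar rows, never the query itself."""
    scores = np.asarray(sim_matrix[query_idx], dtype=float).copy()
    scores[query_idx] = -np.inf                 # exclude the query explicitly
    order = np.argsort(-scores, kind="stable")  # indices, highest score first
    return order[:k].tolist()

# Hypothetical 4x4 similarity matrix; rows 0 and 1 are exact duplicates.
sim = np.array([
    [1.0, 1.0, 0.2, 0.5],
    [1.0, 1.0, 0.2, 0.5],
    [0.2, 0.2, 1.0, 0.1],
    [0.5, 0.5, 0.1, 1.0],
])
print(top_k_similar(sim, query_idx=1, k=2))
```

The returned positions can be passed to `.iloc` in the same way as `books_index` in the function above.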

In [17]:
#input the index of the book
recommend(1000, cosine_similarity)

11488                                          The Odyssey
14633                                          The Odyssey
4000                                           Killer Hair
9422                             The Odyssey (Classics S.)
4174     Longitude: The True Story of a Lone Genius Who...
Name: title, dtype: object

In [18]:
books_10.iloc[1000]

isbn                                                076790351X
title        Beethoven's Hair : An Extraordinary Historical...
author                                          RUSSELL MARTIN
pub_year                                                  2001
publisher                                             Broadway
url_s        http://images.amazon.com/images/P/076790351X.0...
url_m        http://images.amazon.com/images/P/076790351X.0...
url_l        http://images.amazon.com/images/P/076790351X.0...
Name: 1000, dtype: object

Using the title, author and publisher (the `title`, `author` and `publisher` columns) as content for TF-IDF

In [19]:
books_10['all_content'] = books_10['title'] + books_10['author'] + books_10['publisher']
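Concatenating the columns with `+` inserts no separator, so the last word of the title fuses with the first word of the author, and likewise for the publisher. The sketch below uses one row from the `books.head()` output and the same token pattern that `TfidfVectorizer` applies by default to show the difference a space makes:

```python
import re

# Values taken from the books.head() output shown earlier.
title, author, publisher = "Clara Callan", "Richard Bruce Wright", "HarperFlamingo Canada"

# Default token pattern of TfidfVectorizer: words of two or more characters.
token_pattern = re.compile(r"(?u)\b\w\w+\b")

fused = (title + author + publisher).lower()            # what the cell above builds
spaced = " ".join([title, author, publisher]).lower()   # with a separator

fused_tokens = token_pattern.findall(fused)
spaced_tokens = token_pattern.findall(spaced)
print(fused_tokens)
print(spaced_tokens)
```

Fused tokens such as `callanrichard` are unique to a single book and match nothing else, which may partly explain why the combined vocabulary is so much larger than the title-only one. Adding a separator (for instance `books_10['title'] + ' ' + books_10['author'] + ' ' + books_10['publisher']`) would change the shapes and recommendations reported in this notebook.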

In [20]:
tfidf_all_content = vectorizer.fit_transform(books_10['all_content'])

In [21]:
tfidf_all_content.shape

(15452, 25806)

In [22]:
# compute the cosine similarity matrix using sklearn's linear_kernel
cosine_similarity_all_content = linear_kernel(tfidf_all_content, tfidf_all_content)

In [23]:
recommend(33, cosine_similarity_all_content)

2576                                       The Tao of Pooh
5983                                      The Te of Piglet
14378                      Tao Te Ching (Penguin Classics)
13886    The Eye of the World : Book One of 'The Wheel ...
13869    The Return of the King (Lord of the Rings (Pap...
Name: title, dtype: object

### Book description
To give the TF-IDF matrix more detail to work with, we further scraped book descriptions from a publicly available online API. Descriptions were scraped for the *15,452* books retained above.

In [24]:
books_n = pd.read_csv('books_n_description.csv')

In [25]:
books_wd = books_n[books_n['description'].notnull()].copy()

In [26]:
# only retain records whose description is longer than 5 characters
books_wd = books_n[books_n['description'].notnull()].copy()
books_wd = books_wd[books_wd['description'].map(len) >5]

In [27]:
books_wd.reset_index(drop=True, inplace=True)

In [28]:
books_wd.drop(columns=['Unnamed: 0'], inplace =True)

In [29]:
books_wd.head()

Unnamed: 0,isbn,title,author,pub_year,publisher,categories,description
0,2005018,Clara Callan,Richard Bruce Wright,2001.0,HarperFlamingo Canada,Actresses,"In a small town in Canada, Clara Callan reluct..."
1,374157065,Flu: The Story of the Great Influenza Pandemic...,Gina Bari Kolata,1999.0,Farrar Straus Giroux,Medical,"Describes the great flu epidemic of 1918, an o..."
2,399135782,The Kitchen God's Wife,Amy Tan,1991.0,Putnam Pub Group,Fiction,A Chinese immigrant who is convinced she is dy...
3,440234743,The Testament,John Grisham,1999.0,Dell,Fiction,"A suicidal billionaire, a burnt-out Washington..."
4,452264464,Beloved (Plume Contemporary Fiction),Toni Morrison,1994.0,Plume,Fiction,Staring unflinchingly into the abyss of slaver...


Using book `description` for TFIDF

In [30]:
tfidf_des = vectorizer.fit_transform(books_wd['description'])

### Cosine Similarity

In [31]:
from sklearn.metrics.pairwise import linear_kernel

# compute the cosine similarity matrix using sklearn's linear_kernel
cosine_sim_des = linear_kernel(tfidf_des, tfidf_des)

In [32]:
indices_n = pd.Series(books_wd['isbn'])

In [33]:
inddict = indices_n.to_dict()

In [34]:
#changing the selection of books from index to isbn
inddict = dict((v,k) for k,v in inddict.items())
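The three cells above build a lookup from ISBN to row position by creating a Series, converting it to a dict and inverting it. As a rough equivalent, the same mapping can be built in one pass; the list below is a hypothetical stand-in for `books_wd['isbn']` in row order.

```python
# Hypothetical ISBN column in row order (values borrowed from outputs in this notebook).
isbns = ["0440224624", "0385512147", "067100669X"]

# ISBN -> row position in a single pass.
isbn_to_pos = {isbn: pos for pos, isbn in enumerate(isbns)}

print(isbn_to_pos["067100669X"])
```

As with the inverted dict, a repeated ISBN keeps only its last position, and the keys must be strings that match the ISBN exactly, including any leading zero.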

In [48]:
def recommend_cosine(isbn):
    idx = inddict[isbn]
    # Get the pairwise similarity scores of all books against that book,
    # sort them in descending order and keep the top 5
    # (position 0 is skipped because it is expected to be the book itself)
    similarity_scores = list(enumerate(cosine_sim_des[idx]))
    similarity_scores = sorted(similarity_scores, key=lambda x: x[1], reverse=True)
    similarity_scores = similarity_scores[1:6]

    #Get the books index
    books_index = [i[0] for i in similarity_scores]

    #Return the top 5 most similar books using integer-location based indexing (iloc)
    return books_wd.iloc[books_index]

In [49]:
recommend_cosine("067100669X")

Unnamed: 0,isbn,title,author,pub_year,publisher,categories,description
600,0440224624,The Loop,Nicholas Evans,1999.0,Dell Publishing Company,Fiction,"The author of the number-one best-seller, The ..."
1192,0385512147,Bringing Elizabeth Home : A Journey of Faith a...,ED SMART,2003.0,Doubleday,Religion,The parents of kidnapping victim Elizabeth Sma...
570,0345387007,Tangle Box,TERRY BROOKS,1995.0,Del Rey,Fiction,Relates a tale about the magical kingdom of La...
525,1573225789,The Color of Water: A Black Man's Tribute to H...,James McBride,1997.0,Riverhead Books,Biography & Autobiography,An African American man describes life as the ...
66,044651747X,Puerto Vallarta Squeeze,Robert James Waller,1995.0,Warner Books,Fiction,The author of the blockbuster The Bridges of M...


---

### Euclidean Distance

```
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import euclidean_distances

v = TfidfVectorizer()
X = v.fit_transform(your_documents)
D = euclidean_distances(X)
```

Now `D[i, j]` is the Euclidean distance between document vectors `X[i]` and `X[j]`.
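For L2-normalized rows, which is what `TfidfVectorizer` produces by default, Euclidean distance and cosine similarity are tied together by the identity d² = 2 − 2·cos. Sorting by smallest distance therefore gives the same order as sorting by largest cosine similarity. The sketch below checks this on hypothetical random unit vectors rather than the book data:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical non-negative rows, L2-normalized like default TF-IDF output.
X = rng.random((6, 10))
X /= np.linalg.norm(X, axis=1, keepdims=True)

cos = X @ X.T                                # cosine similarity of unit vectors
diff = X[:, None, :] - X[None, :, :]
dist = np.sqrt((diff ** 2).sum(axis=-1))     # pairwise Euclidean distance

# Identity for unit vectors: d^2 = 2 - 2 * cos
assert np.allclose(dist ** 2, 2 - 2 * cos)

# Rankings relative to row 0: most similar first vs. closest first.
by_cosine = np.argsort(-cos[0], kind="stable")
by_distance = np.argsort(dist[0], kind="stable")
print(by_cosine, by_distance)
```

This is a likely reason the Euclidean recommender returns the same books as the cosine recommender in this notebook.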

In [38]:
from sklearn.metrics.pairwise import euclidean_distances

In [39]:
D = euclidean_distances(tfidf_des)

In [40]:
def recommend_euclidean_distance(isbn):
    ind = inddict[isbn]
    distance = list(enumerate(D[ind]))
    distance = sorted(distance, key=lambda x: x[1])
    distance = distance[1:6]
    #Get the books index
    books_index = [i[0] for i in distance]

    #Return the top 5 closest books using integer-location based indexing (iloc)
    return books_wd.iloc[books_index]

In [41]:
recommend_euclidean_distance("067100669X")

Unnamed: 0,isbn,title,author,pub_year,publisher,categories,description
600,0440224624,The Loop,Nicholas Evans,1999.0,Dell Publishing Company,Fiction,"The author of the number-one best-seller, The ..."
1192,0385512147,Bringing Elizabeth Home : A Journey of Faith a...,ED SMART,2003.0,Doubleday,Religion,The parents of kidnapping victim Elizabeth Sma...
570,0345387007,Tangle Box,TERRY BROOKS,1995.0,Del Rey,Fiction,Relates a tale about the magical kingdom of La...
525,1573225789,The Color of Water: A Black Man's Tribute to H...,James McBride,1997.0,Riverhead Books,Biography & Autobiography,An African American man describes life as the ...
66,044651747X,Puerto Vallarta Squeeze,Robert James Waller,1995.0,Warner Books,Fiction,The author of the blockbuster The Bridges of M...


---

### Pearson's Correlation

In [42]:
from scipy.stats import pearsonr
tfidf_des_array = tfidf_des.toarray()

In [43]:
def recommend_pearson(isbn):
    ind = inddict[isbn]
    # Pearson correlation between the target book's TF-IDF vector and every book's vector
    correlation = []
    for i in range(len(tfidf_des_array)):
        correlation.append(pearsonr(tfidf_des_array[ind], tfidf_des_array[i])[0])
    correlation = list(enumerate(correlation))
    # Sort by correlation (highest first) and skip position 0, expected to be the book itself
    sorted_corr = sorted(correlation, reverse=True, key=lambda x: x[1])[1:6]
    books_index = [i[0] for i in sorted_corr]
    return books_wd.iloc[books_index]
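Calling `pearsonr` once per book is slow on a large matrix. Pearson correlation is the cosine similarity of mean-centered vectors, so all correlations against one row can be computed in a single matrix product. The sketch below uses a hypothetical helper name and a small random matrix standing in for `tfidf_des_array`, and compares the result with NumPy's own `corrcoef`:

```python
import numpy as np

def pearson_to_all(matrix, query_idx):
    """Pearson correlation between one row and every row of a dense 2-D array."""
    centered = matrix - matrix.mean(axis=1, keepdims=True)  # subtract each row's mean
    norms = np.linalg.norm(centered, axis=1)
    return centered @ centered[query_idx] / (norms * norms[query_idx])

# Hypothetical dense matrix standing in for tfidf_des_array.
rng = np.random.default_rng(0)
M = rng.random((5, 8))

fast = pearson_to_all(M, query_idx=2)
reference = np.corrcoef(M)[2]   # row 2 of NumPy's row-wise correlation matrix
print(np.allclose(fast, reference))
```

A constant row (for example, all zeros) has zero norm after centering and would need special handling. Because TF-IDF rows are sparse, their means are close to zero and centering changes them very little, which may be why the Pearson recommender agrees with the cosine one here.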

In [46]:
recommend_pearson('067100669X')

Unnamed: 0,isbn,title,author,pub_year,publisher,categories,description
600,0440224624,The Loop,Nicholas Evans,1999.0,Dell Publishing Company,Fiction,"The author of the number-one best-seller, The ..."
1192,0385512147,Bringing Elizabeth Home : A Journey of Faith a...,ED SMART,2003.0,Doubleday,Religion,The parents of kidnapping victim Elizabeth Sma...
570,0345387007,Tangle Box,TERRY BROOKS,1995.0,Del Rey,Fiction,Relates a tale about the magical kingdom of La...
525,1573225789,The Color of Water: A Black Man's Tribute to H...,James McBride,1997.0,Riverhead Books,Biography & Autobiography,An African American man describes life as the ...
66,044651747X,Puerto Vallarta Squeeze,Robert James Waller,1995.0,Warner Books,Fiction,The author of the blockbuster The Bridges of M...


---

### Comparison of 3 recommenders

In [47]:
#Target book
books_wd.loc[books_wd['isbn'] == "067100669X"]

Unnamed: 0,isbn,title,author,pub_year,publisher,categories,description
1163,067100669X,Tears of Rage,John Walsh,1998.0,Pocket,True Crime,The author relates the story of his son's abdu...


In [50]:
recommend_cosine("067100669X")

Unnamed: 0,isbn,title,author,pub_year,publisher,categories,description
600,0440224624,The Loop,Nicholas Evans,1999.0,Dell Publishing Company,Fiction,"The author of the number-one best-seller, The ..."
1192,0385512147,Bringing Elizabeth Home : A Journey of Faith a...,ED SMART,2003.0,Doubleday,Religion,The parents of kidnapping victim Elizabeth Sma...
570,0345387007,Tangle Box,TERRY BROOKS,1995.0,Del Rey,Fiction,Relates a tale about the magical kingdom of La...
525,1573225789,The Color of Water: A Black Man's Tribute to H...,James McBride,1997.0,Riverhead Books,Biography & Autobiography,An African American man describes life as the ...
66,044651747X,Puerto Vallarta Squeeze,Robert James Waller,1995.0,Warner Books,Fiction,The author of the blockbuster The Bridges of M...


In [51]:
recommend_euclidean_distance("067100669X")

Unnamed: 0,isbn,title,author,pub_year,publisher,categories,description
600,0440224624,The Loop,Nicholas Evans,1999.0,Dell Publishing Company,Fiction,"The author of the number-one best-seller, The ..."
1192,0385512147,Bringing Elizabeth Home : A Journey of Faith a...,ED SMART,2003.0,Doubleday,Religion,The parents of kidnapping victim Elizabeth Sma...
570,0345387007,Tangle Box,TERRY BROOKS,1995.0,Del Rey,Fiction,Relates a tale about the magical kingdom of La...
525,1573225789,The Color of Water: A Black Man's Tribute to H...,James McBride,1997.0,Riverhead Books,Biography & Autobiography,An African American man describes life as the ...
66,044651747X,Puerto Vallarta Squeeze,Robert James Waller,1995.0,Warner Books,Fiction,The author of the blockbuster The Bridges of M...


In [52]:
recommend_pearson('067100669X')

Unnamed: 0,isbn,title,author,pub_year,publisher,categories,description
600,0440224624,The Loop,Nicholas Evans,1999.0,Dell Publishing Company,Fiction,"The author of the number-one best-seller, The ..."
1192,0385512147,Bringing Elizabeth Home : A Journey of Faith a...,ED SMART,2003.0,Doubleday,Religion,The parents of kidnapping victim Elizabeth Sma...
570,0345387007,Tangle Box,TERRY BROOKS,1995.0,Del Rey,Fiction,Relates a tale about the magical kingdom of La...
525,1573225789,The Color of Water: A Black Man's Tribute to H...,James McBride,1997.0,Riverhead Books,Biography & Autobiography,An African American man describes life as the ...
66,044651747X,Puerto Vallarta Squeeze,Robert James Waller,1995.0,Warner Books,Fiction,The author of the blockbuster The Bridges of M...


### Pros:

* Unlike Collaborative Filtering, if the items have sufficient descriptions, we avoid the “new item problem”.
* Content representations are varied, which opens up different approaches such as text-processing techniques, the use of semantic information, and inference.
* It is easy to make a more transparent system: we use the same content to explain the recommendations.


### Cons:
* Content-based recommender systems tend toward over-specialization: they recommend items similar to those already consumed, with a tendency to create a “filter bubble”.