# Content-Based Filtering model  
Content-based filtering uses the **descriptions or attributes** of items a user has interacted with to recommend similar items. It depends only on that user's **previous choices**, not on other users' ratings, which lets it **avoid the item cold-start problem**. For textual items such as articles, news and books, the **category** or **raw text** is an easy starting point for building **item profiles** and **user profiles**.

For example, if I watch movies of a particular genre, I will be recommended other movies of that genre. The title, year of release, director and cast also help identify similar movies.

## Approach 2: Using Rated Content to Recommend
In this approach the user has already **rated** some products, and those ratings express the **user's preference**. From them we predict an **item score** for every product and recommend the highest-scoring ones.

Usually the `rating table` (user ratings) and the `item profile` (book genres) are the only material we have.

* `rating table`: user-to-book relationship
* `item profile`: attribute-to-book relationship  
![](https://www.analyticsvidhya.com/wp-content/uploads/2015/08/81.png)

Then we build the `user profile` to understand which attributes each user actually prefers.
* `user profile`: user-to-attribute relationship  
![](https://www.analyticsvidhya.com/wp-content/uploads/2015/08/91.png)

With the `user profile` in hand, we can compute every item score, i.e. the predicted user preference, from the `user profile` and the `item profile`.  
![](https://www.analyticsvidhya.com/wp-content/uploads/2015/08/121.png)
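
Before the real data, here is a minimal sketch of the three tables on toy data. The book names, genres and ratings are made up for illustration, and the user profile is taken as the average rating per attribute, which is the same idea the notebook applies later.

```python
import pandas as pd

# Hypothetical toy data: 3 books, 2 genres, 1 user.
item_profile = pd.DataFrame(
    {"Fiction": [1.0, 1.0, 0.0], "Medical": [0.0, 0.0, 1.0]},
    index=["book_a", "book_b", "book_c"],
)
ratings = pd.Series({"book_a": 8.0, "book_c": 4.0})  # book_b is unrated

# User profile: average rating the user gave to each attribute.
rated = item_profile.loc[ratings.index]
weighted = rated.mul(ratings, axis=0)
user_profile = weighted.sum() / rated.sum()

# Item score: dot product of every item profile with the user profile.
item_score = item_profile.dot(user_profile)

print(user_profile.to_dict())  # {'Fiction': 8.0, 'Medical': 4.0}
print(item_score.to_dict())    # {'book_a': 8.0, 'book_b': 8.0, 'book_c': 4.0}
```

The unrated `book_b` gets the same score as `book_a` because both share the one attribute the user rated highly.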

Let's go through the code!

In [1]:
import pandas as pd
import numpy as np
from tqdm import tqdm
from sklearn.preprocessing import MinMaxScaler
from merge_data import *

b'Skipping line 6452: expected 8 fields, saw 9\nSkipping line 43667: expected 8 fields, saw 10\nSkipping line 51751: expected 8 fields, saw 9\n'
b'Skipping line 92038: expected 8 fields, saw 9\nSkipping line 104319: expected 8 fields, saw 9\nSkipping line 121768: expected 8 fields, saw 9\n'
b'Skipping line 144058: expected 8 fields, saw 9\nSkipping line 150789: expected 8 fields, saw 9\nSkipping line 157128: expected 8 fields, saw 9\nSkipping line 180189: expected 8 fields, saw 9\nSkipping line 185738: expected 8 fields, saw 9\n'
b'Skipping line 209388: expected 8 fields, saw 9\nSkipping line 220626: expected 8 fields, saw 9\nSkipping line 227933: expected 8 fields, saw 11\nSkipping line 228957: expected 8 fields, saw 10\nSkipping line 245933: expected 8 fields, saw 9\nSkipping line 251296: expected 8 fields, saw 9\nSkipping line 259941: expected 8 fields, saw 9\nSkipping line 261529: expected 8 fields, saw 9\n'

In [2]:
# import the book CSV (books with a description and more than 10 user ratings)
df_n_des = pd.read_csv('books_n_description.csv', index_col=0)

In [3]:
df_n_des.head()

| | isbn | title | author | pub_year | publisher | categories | description |
|---|---|---|---|---|---|---|---|
| 0 | 2005018 | Clara Callan | Richard Bruce Wright | 2001.0 | HarperFlamingo Canada | Actresses | In a small town in Canada, Clara Callan reluct... |
| 1 | 374157065 | Flu: The Story of the Great Influenza Pandemic... | Gina Bari Kolata | 1999.0 | Farrar Straus Giroux | Medical | Describes the great flu epidemic of 1918, an o... |
| 2 | 399135782 | The Kitchen God's Wife | Amy Tan | 1991.0 | Putnam Pub Group | Fiction | A Chinese immigrant who is convinced she is dy... |
| 3 | 440234743 | The Testament | John Grisham | 1999.0 | Dell | Fiction | A suicidal billionaire, a burnt-out Washington... |
| 4 | 452264464 | Beloved (Plume Contemporary Fiction) | Toni Morrison | 1994.0 | Plume | Fiction | Staring unflinchingly into the abyss of slaver... |


In [4]:
df_n_des.shape

(15452, 7)

In [5]:
# copy the books whose categories cell is not null into books_wd
books_wd = df_n_des[df_n_des['categories'].notnull()].copy()

# keep only books whose category string is longer than 1 character
books_wd = books_wd[books_wd['categories'].map(len) > 1]

In [6]:
books_wd.shape

(1466, 7)

We have **1,466** books in total

### Item profile

In [7]:
df_item = books_wd[['isbn','categories']]

In [8]:
df_item.head()

| | isbn | categories |
|---|---|---|
| 0 | 2005018 | Actresses |
| 1 | 374157065 | Medical |
| 2 | 399135782 | Fiction |
| 3 | 440234743 | Fiction |
| 4 | 452264464 | Fiction |


In [9]:
# one-hot encoding for category
df_genre = pd.get_dummies(df_item['categories'])

Let's treat all books as having unit weight.  
For a binary representation, we normalize by dividing each attribute indicator by the square root of the number of attributes the book has, so that every book vector has length 1.  

In [10]:
# normalize: divide each row by the square root of its number of attributes
df_genre_normalized = df_genre.div(np.sqrt(df_genre.sum(axis=1)), axis=0)
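
To see what the normalization step does, here is a small sketch on a made-up binary matrix in which one book carries four attributes. The book and attribute names are hypothetical. In this notebook every book has exactly one category after one-hot encoding, so the normalized values stay at 1.0; the effect only shows up when a book has several attributes.

```python
import numpy as np
import pandas as pd

# Hypothetical binary attribute matrix; book_y carries four attributes.
binary = pd.DataFrame(
    [[1, 0, 0, 0], [1, 1, 1, 1]],
    index=["book_x", "book_y"],
    columns=["Fiction", "Travel", "History", "Humor"],
)

n_attrs = binary.sum(axis=1)                        # attributes per book
normalized = binary.div(np.sqrt(n_attrs), axis=0)   # row-wise scaling
row_norms = np.sqrt((normalized ** 2).sum(axis=1))  # Euclidean length per book

print(normalized.loc["book_y"].tolist())  # [0.5, 0.5, 0.5, 0.5]
print(row_norms.tolist())                 # [1.0, 1.0]
```

After scaling, a book with many attributes no longer outweighs a book with a single attribute when the vectors are multiplied with a user profile.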

In [11]:
#create item profile
df_item = pd.concat([df_item, df_genre_normalized], axis=1)

df_item.drop(columns='categories', inplace=True)

In [12]:
df_item.sort_values('isbn', inplace=True)

In [13]:
df_item.set_index('isbn', inplace=True)

In [14]:
df_item.head()

| isbn | Accidents | Action and adventure | Actors | Actresses | Adoptees | Adventure stories | Affirmations | African American fiction | African American men | African American psychologists | ... | Ryan, Jack (Fictitious character) | Savich, Dillon (Fictitious character) | Science fiction | Self-Help | Social Science | Star Trek fiction | Travel | Travelers' writings | True Crime | Yorkshire (England) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0002005018 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 000648302X | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 000649840X | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 0020264763 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 0020264801 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |


In [15]:
df_item.shape

(1466, 158)

In `df_item`, there are **1,466** books and **158** genres.

Let's load the merged dataframe of `books`, `ratings` and `users`, and keep the records that match the 1,466 books in `df_item`.

In [16]:
df_merged = merge_data_frame()

In [17]:
df_merged.head()

| | user | isbn | rating | location | age | country | province | title | author | pub_year | publisher | url_s | url_m | url_l |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 276725 | 034545104X | 0 | tyler, texas, usa | NaN | usa | texas | Flesh Tones: A Novel | M. J. Rose | 2002.0 | Ballantine Books | http://images.amazon.com/images/P/034545104X.0... | http://images.amazon.com/images/P/034545104X.0... | http://images.amazon.com/images/P/034545104X.0... |
| 1 | 2313 | 034545104X | 5 | cincinnati, ohio, usa | 23.0 | usa | ohio | Flesh Tones: A Novel | M. J. Rose | 2002.0 | Ballantine Books | http://images.amazon.com/images/P/034545104X.0... | http://images.amazon.com/images/P/034545104X.0... | http://images.amazon.com/images/P/034545104X.0... |
| 2 | 6543 | 034545104X | 0 | strafford, missouri, usa | 34.0 | usa | missouri | Flesh Tones: A Novel | M. J. Rose | 2002.0 | Ballantine Books | http://images.amazon.com/images/P/034545104X.0... | http://images.amazon.com/images/P/034545104X.0... | http://images.amazon.com/images/P/034545104X.0... |
| 3 | 8680 | 034545104X | 5 | st. charles county, missouri, usa | 2.0 | usa | missouri | Flesh Tones: A Novel | M. J. Rose | 2002.0 | Ballantine Books | http://images.amazon.com/images/P/034545104X.0... | http://images.amazon.com/images/P/034545104X.0... | http://images.amazon.com/images/P/034545104X.0... |
| 4 | 10314 | 034545104X | 9 | beaverton, oregon, usa | NaN | usa | oregon | Flesh Tones: A Novel | M. J. Rose | 2002.0 | Ballantine Books | http://images.amazon.com/images/P/034545104X.0... | http://images.amazon.com/images/P/034545104X.0... | http://images.amazon.com/images/P/034545104X.0... |


In [18]:
# the 10 most active users and the number of ratings each has given
# (value_counts already sorts in descending order)
df_merged['user'].value_counts().head(10)

11676     11144
198711     6456
153662     5814
98391      5779
35859      5646
212898     4289
278418     3996
76352      3329
110973     2971
235105     2943
Name: user, dtype: int64

In [19]:
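# keep only ratings of books that have an item profile;
# .copy() avoids pandas' SettingWithCopyWarning when columns are added later
df_final = df_merged[df_merged['isbn'].isin(df_item.index)].copy()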

In [36]:
#create isbn and category dict
isbn_cat_dict = pd.Series(books_wd['categories'].values,index=books_wd['isbn']).to_dict()

#create isbn and description dict
isbn_des_dict = pd.Series(books_wd['description'].values,index=books_wd['isbn']).to_dict()

In [39]:
df_final['category'] = df_final['isbn'].map(isbn_cat_dict)
df_final['description'] = df_final['isbn'].map(isbn_des_dict)



In [40]:
df_final.shape

(72821, 16)

In [42]:
df_final = df_final[['user','rating','isbn','title','author','publisher','pub_year','category','description']]

In [78]:
df_final.head()

| | user | rating | isbn | title | author | publisher | pub_year | category | description |
|---|---|---|---|---|---|---|---|---|---|
| 501 | 276746 | 0 | 449006522 | Manhattan Hunt Club | JOHN SAUL | Ballantine Books | 2002.0 | Fiction | When college student Jeff Converse is wrongly ... |
| 502 | 278026 | 8 | 449006522 | Manhattan Hunt Club | JOHN SAUL | Ballantine Books | 2002.0 | Fiction | When college student Jeff Converse is wrongly ... |
| 503 | 243 | 6 | 449006522 | Manhattan Hunt Club | JOHN SAUL | Ballantine Books | 2002.0 | Fiction | When college student Jeff Converse is wrongly ... |
| 504 | 645 | 0 | 449006522 | Manhattan Hunt Club | JOHN SAUL | Ballantine Books | 2002.0 | Fiction | When college student Jeff Converse is wrongly ... |
| 505 | 2010 | 0 | 449006522 | Manhattan Hunt Club | JOHN SAUL | Ballantine Books | 2002.0 | Fiction | When college student Jeff Converse is wrongly ... |


There are **72,821** ratings of the **1,466** books.

### Rating dataframe

In [44]:
rating = pd.pivot_table(df_final, values='rating', index=['isbn'], columns = ['user'])

In [45]:
rating.sort_index(axis=1, inplace=True)

In [46]:
rating.head()

| isbn | 10 | 100004 | 100009 | 10001 | 100029 | 100035 | 10005 | 100053 | 100066 | 100067 | ... | 99885 | 99894 | 999 | 9991 | 99946 | 99955 | 99963 | 99973 | 99996 | 99997 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0002005018 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 000648302X | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 000649840X | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 0020264763 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 0020264801 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |


### User Profile

In [47]:
# get all user IDs
users_no = rating.columns

In [48]:
#create an empty dataframe
df_users = pd.DataFrame(columns = df_item.columns)

In [49]:
df_users

| | Accidents | Action and adventure | Actors | Actresses | Adoptees | Adventure stories | Affirmations | African American fiction | African American men | African American psychologists | ... | Ryan, Jack (Fictitious character) | Savich, Dillon (Fictitious character) | Science fiction | Self-Help | Social Science | Star Trek fiction | Travel | Travelers' writings | True Crime | Yorkshire (England) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|


In [50]:
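for i in tqdm(range(len(users_no))):
    # weight every book's attribute vector by this user's rating of the book
    working_df = df_item.mul(rating.iloc[:,i], axis=0)
    # zeros (attribute absent, or a rating of 0) must not count towards the mean
    working_df.replace(0, np.nan, inplace=True)
    # user profile = mean rating per attribute
    df_users.loc[users_no[i]] = working_df.mean(axis=0)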

100%|██████████| 22950/22950 [08:05<00:00, 47.30it/s]
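
The loop above takes about eight minutes for 22,950 users. As a rough sketch, the same per-attribute mean can be written as two matrix products. The function names and the toy data below are made up for illustration, and the equivalence with the loop is only checked on this toy example, so treat it as a starting point rather than a drop-in replacement. It also builds dense books-by-users matrices, which costs memory.

```python
import numpy as np
import pandas as pd

def user_profiles_loop(df_item, rating):
    """Per-user loop, mirroring the cell above (reference for comparison)."""
    rows = {}
    for user in rating.columns:
        working = df_item.mul(rating[user], axis=0).replace(0, np.nan)
        rows[user] = working.mean(axis=0)
    return pd.DataFrame(rows).T

def user_profiles_vectorized(df_item, rating):
    """Same per-attribute mean, computed with two matrix products."""
    r = rating.reindex(df_item.index).fillna(0.0)
    # Sum of rating * attribute weight per (user, attribute).
    numerator = r.T.dot(df_item)
    # Number of non-zero products that went into each sum.
    counts = (r != 0).astype(float).T.dot((df_item != 0).astype(float))
    # Where nothing was rated, the count becomes NaN and so does the profile.
    return numerator / counts.where(counts > 0)

# Hypothetical toy data: 3 books, 2 genres, 2 users.
toy_items = pd.DataFrame(
    {"Fiction": [1.0, 1.0, 0.0], "Medical": [0.0, 0.0, 1.0]},
    index=["b1", "b2", "b3"],
)
toy_rating = pd.DataFrame(
    {"u1": [8.0, 6.0, np.nan], "u2": [0.0, np.nan, 4.0]},
    index=["b1", "b2", "b3"],
)

slow = user_profiles_loop(toy_items, toy_rating)
fast = user_profiles_vectorized(toy_items, toy_rating)
print(fast)
```

On the toy data both versions give `u1` a Fiction preference of 7.0 (the mean of 8 and 6) and leave attributes the user never rated as `NaN`.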


In [54]:
#user profile
df_users.head()

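| | Accidents | Action and adventure | Actors | Actresses | Adoptees | Adventure stories | Affirmations | African American fiction | African American men | African American psychologists | ... | Ryan, Jack (Fictitious character) | Savich, Dillon (Fictitious character) | Science fiction | Self-Help | Social Science | Star Trek fiction | Travel | Travelers' writings | True Crime | Yorkshire (England) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 10 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 100004 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 100009 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 10001 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 100029 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |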


### IDF
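**Inverse document frequency (IDF)** measures how rare an attribute is across all books: an attribute shared by most books says little about any particular book and gets a low weight, while a rare attribute gets a high one.  
Multiplying each book vector element-wise by the IDF vector gives the **weighted scores** of each book.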

In [103]:
# number of books that carry each attribute
document_frequency = (df_item > 0).sum()

In [104]:
idf = (len(books_wd)/document_frequency).apply(np.log) #log inverse of DF

In [108]:
# scale every attribute column by its IDF to get the weighted item profile
idf_df_item = df_item.mul(idf, axis=1)
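
A quick sketch of what the IDF weighting does on made-up numbers: four hypothetical books, three of them tagged `Fiction` and one tagged `Travel`. The names are illustrative only; the formula is the same natural-log inverse document frequency used above.

```python
import numpy as np
import pandas as pd

# Hypothetical item profile: "Fiction" is common, "Travel" is rare.
items = pd.DataFrame(
    {"Fiction": [1.0, 1.0, 1.0, 0.0], "Travel": [0.0, 0.0, 0.0, 1.0]},
    index=["b1", "b2", "b3", "b4"],
)

document_frequency = (items > 0).sum()         # books carrying each attribute
idf = np.log(len(items) / document_frequency)  # natural log of inverse DF
weighted_items = items.mul(idf, axis=1)        # scale each attribute column

print(idf.round(3).to_dict())  # {'Fiction': 0.288, 'Travel': 1.386}
```

The rare attribute ends up with the larger weight, so matching on `Travel` moves a book's score more than matching on `Fiction`.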

### Make Prediction

In [None]:
#make an empty dataframe
df_predict = pd.DataFrame()

In [110]:
# predict each user's score for every book from the IDF-weighted item profile
for i in tqdm(range(len(users_no))):
    working_df = idf_df_item.mul(df_users.iloc[i], axis=1)
    df_predict[users_no[i]] = working_df.sum(axis=1)

100%|██████████| 22950/22950 [04:18<00:00, 88.80it/s] 
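
The prediction loop multiplies the weighted item profile by one user profile at a time and sums over attributes, which is a matrix product in disguise. Below is a small sketch on hypothetical data comparing the two. Missing preferences are filled with 0, which matches the loop because `sum` skips `NaN` values by default. Variable names here are illustrative, not the notebook's.

```python
import numpy as np
import pandas as pd

# Hypothetical IDF-weighted item profile (books x attributes).
idf_items = pd.DataFrame(
    {"Fiction": [0.5, 0.5, 0.0], "Medical": [0.0, 0.0, 2.0]},
    index=["b1", "b2", "b3"],
)
# Hypothetical user profile (users x attributes); NaN = no known preference.
users = pd.DataFrame(
    {"Fiction": [7.0, np.nan], "Medical": [np.nan, 4.0]},
    index=["u1", "u2"],
)

# Loop version, one column of predictions per user.
loop_pred = pd.DataFrame(
    {u: idf_items.mul(users.loc[u], axis=1).sum(axis=1) for u in users.index}
)

# Matrix version: (books x attributes) . (attributes x users).
fast_pred = idf_items.dot(users.fillna(0.0).T)

print(fast_pred)
```

Both give `u1` a score of 3.5 for the two Fiction books and 0.0 for the Medical one.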


In [111]:
# predicted scores of all users for all books
df_predict.head()

| isbn | 10 | 100004 | 100009 | 10001 | 100029 | 100035 | 10005 | 100053 | 100066 | 100067 | ... | 99885 | 99894 | 999 | 9991 | 99946 | 99955 | 99963 | 99973 | 99996 | 99997 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0002005018 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 000648302X | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 000649840X | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 0020264763 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 0020264801 | 0.0 | 0.0 | 2.777364 | 3.471705 | 0.0 | 2.777364 | 0.0 | 3.124534 | 2.198746 | 2.430193 | ... | 0.0 | 0.0 | 2.430193 | 0.0 | 1.909438 | 0.0 | 0.0 | 2.083023 | 3.008811 | 2.893087 |


In [129]:
def recommender(user_no):

    # predicted scores of this user for all books
    user_predicted_rating = df_predict[user_no]

    # combine the predicted scores with the book details
    user_rating_book = pd.concat([user_predicted_rating,books_wd.set_index('isbn')], axis=1)

    # books the user has already rated
    already_read = df_final[df_final['user'].isin([user_no])]['isbn']

    # drop the books the user has already read
    all_rec = user_rating_book[~user_rating_book.index.isin(already_read)]

    # top 10 remaining books by predicted score
    return all_rec.sort_values(by=[user_no], ascending=False).iloc[0:10]

In [130]:
recommender('33570')


| | 33570 | title | author | pub_year | publisher | categories | description |
|---|---|---|---|---|---|---|---|
| 0449005569 | 2.430193 | Love: A User's Guide | CLARE NAYLOR | 1999.0 | Ballantine Books | Fiction | When she catches the eye of London's handsomes... |
| 0449005410 | 2.430193 | Horse Heaven (Ballantine Reader's Circle) | Jane Smiley | 2001.0 | Ballantine Books | Fiction | A novel set in the world of thoroughbred racin... |
| 0449130703 | 2.430193 | The Number of the Beast | Robert Heinlein | 1989.0 | Fawcett Books | Fiction | The unusual adventures of four geniuses who ar... |
| 0449130509 | 2.430193 | Winterbourne | Susan Carroll | 1987.0 | Fawcett | Fiction | Beloved author Susan Carroll took the romance ... |
| 0449006689 | 2.430193 | Murder in Havana (Truman, Margaret, Capital Cr... | Margaret Truman | 2002.0 | Fawcett Books | Fiction | Asked to investigate an American pharmaceutica... |
| 0449006344 | 2.430193 | Angel Falls | KRISTIN HANNAH | 2001.0 | Ballantine Books | Fiction | Liam will do anything to break his wife out of... |
| 080411787X | 2.430193 | Acts of Love | JUDITH MICHAEL | 1997.0 | Ivy Books | Fiction | A collection of letters written by a young act... |
| 0804117934 | 2.430193 | The Silent Cry (William Monk Novels (Paperback)) | Anne Perry | 1998.0 | Ivy Books | Fiction | Victorian-era criminal investigator William Mo... |
| 0449004503 | 2.430193 | Death Rounds | PETER CLEMENT | 1999.0 | Fawcett | Fiction | The author, a former emergency room physician ... |
| 0441003257 | 2.430193 | Good Omens | Neil Gaiman | 1996.0 | Ace Books | Fiction | When the armies of Heaven and Hell decide it's... |


In [65]:
#books Read by user 33570
df_final[df_final['user'].isin(['33570'])].sort_values(by=['rating'], ascending=False)

| | user | rating | isbn | title | author | publisher | pub_year | category | description |
|---|---|---|---|---|---|---|---|---|---|
| 520 | 33570 | 8 | 449006522 | Manhattan Hunt Club | JOHN SAUL | Ballantine Books | 2002.0 | Fiction | When college student Jeff Converse is wrongly ... |
| 188024 | 33570 | 6 | 553583468 | Whisper of Evil (Hooper, Kay. Evil Trilogy.) | Kay Hooper | Bantam Books | 2002.0 | Fiction | As a series of grisly murders terrorizes the s... |


### Pros
* **User independence**: collaborative filtering needs other users' ratings to find similar users before it can make a suggestion. A content-based method only has to analyze the items and the user's own profile.
* **Transparency**: a collaborative method recommends an item because some unknown users share your taste, whereas a content-based method can tell you which features a recommendation is based on.
* **No item cold start**: unlike collaborative filtering, new items can be suggested before they have been rated by a substantial number of users.
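
To make the transparency point concrete, here is a small sketch that breaks one predicted score down into per-attribute contributions. The profile values and attribute weights are invented for illustration.

```python
import pandas as pd

# Hypothetical user profile and IDF-weighted vector of one candidate book.
user_profile = pd.Series({"Fiction": 7.0, "Travel": 0.0, "True Crime": 9.0})
item_vector = pd.Series({"Fiction": 0.5, "Travel": 0.0, "True Crime": 2.0})

# Each attribute's share of the score is simply preference * weight.
contributions = (user_profile * item_vector).sort_values(ascending=False)
score = contributions.sum()
top_reason = contributions.index[0]

print(contributions.to_dict())  # {'True Crime': 18.0, 'Fiction': 3.5, 'Travel': 0.0}
print(score, top_reason)        # 21.5 True Crime
```

The largest contribution names the feature that drove the recommendation, which is the kind of explanation a collaborative method cannot give directly.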

### Cons
* **Limited content analysis**: if the content does not contain enough information to discriminate between items, the recommendations will not be precise.
* **Over-specialization**: a content-based method offers limited novelty, since it has to match the features of the profile against those of the items. A perfect content-based filter may never suggest anything surprising.
* **New user**: when there is not enough information to build a solid profile for a user, recommendations cannot be made reliably.


Reference  
https://www.analyticsvidhya.com/blog/2015/08/beginners-guide-learn-content-based-recommender-systems/  
https://towardsdatascience.com/learning-to-make-recommendations-745d13883951  
http://findoutyourfavorite.blogspot.com/2012/04/content-based-filtering.html