# Content-Based Filtering model  
Content-based filtering uses the **descriptions or attributes** of items a user has interacted with to recommend similar items. It depends only on that user's **previous choices**, not on other users' ratings, so it **avoids the cold-start problem for new items**. For textual items such as articles, news and books, it is simple to use the item's **category** or **raw text** to build **item profiles** and **user profiles**.

For example, if I watch movies of a particular genre, I will be recommended other movies of that genre. The title, year of release, director and cast are also helpful in identifying similar movie content.

## Approach 2: Using Rated Content to Recommend
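In this approach, the attributes of each product are known and users have already **rated** some products. From these ratings we infer the **user's preference** for each attribute, predict an **item score** for every product, and recommend the products with the highest scores.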

Usually the `rating table` (user ratings) and the `item profile` (book genres) are the only material we have.

* `rating table`: user-to-book relationship
* `item profile`: attribute-to-book relationship  
![](https://www.analyticsvidhya.com/wp-content/uploads/2015/08/81.png)

Then we create the `user profile` so that we can understand which attributes each user actually prefers.
* `user profile`: user-to-attribute relationship  
![](https://www.analyticsvidhya.com/wp-content/uploads/2015/08/91.png)

With the `user profile`, we can compute a score for every item, i.e. the user's predicted preference, by combining the `user profile` with the `item profile`.  
![](https://www.analyticsvidhya.com/wp-content/uploads/2015/08/121.png)
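Before moving to the real data, the three tables can be illustrated on a toy example. This is only a minimal sketch: the book names, attributes and ratings below are made up, and the averaging rule mirrors what the notebook does later rather than being the only possible choice.

```python
import numpy as np
import pandas as pd

# Toy item profile (hypothetical data): rows are books, columns are attributes.
item_profile = pd.DataFrame(
    {"Fiction": [1, 1, 0], "Medical": [0, 0, 1]},
    index=["book_a", "book_b", "book_c"],
)

# Toy rating table for a single user; NaN means the book was not rated.
ratings = pd.Series([8.0, 6.0, np.nan], index=["book_a", "book_b", "book_c"])

# User profile: mean rating the user gave to books carrying each attribute.
weighted = item_profile.mul(ratings, axis=0).replace(0, np.nan)
user_profile = weighted.mean(axis=0)

# Item score: multiply item and user profiles attribute by attribute, sum per book.
item_score = item_profile.mul(user_profile, axis=1).sum(axis=1)

print(user_profile)
print(item_score)
```

The user has only rated fiction, so the `Medical` preference stays undefined (NaN) and `book_c` gets a score of 0.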

Let's go through the code!

In [50]:
import pandas as pd
import numpy as np
from tqdm import tqdm
from merge_data import merge_data_frame

In [2]:
# load the books CSV: books that have a description and more than 10 user ratings
df_n_des = pd.read_csv('books_n_description.csv', index_col=0)

In [3]:
df_n_des.head()

Unnamed: 0,isbn,title,author,pub_year,publisher,categories,description
0,2005018,Clara Callan,Richard Bruce Wright,2001.0,HarperFlamingo Canada,Actresses,"In a small town in Canada, Clara Callan reluct..."
1,374157065,Flu: The Story of the Great Influenza Pandemic...,Gina Bari Kolata,1999.0,Farrar Straus Giroux,Medical,"Describes the great flu epidemic of 1918, an o..."
2,399135782,The Kitchen God's Wife,Amy Tan,1991.0,Putnam Pub Group,Fiction,A Chinese immigrant who is convinced she is dy...
3,440234743,The Testament,John Grisham,1999.0,Dell,Fiction,"A suicidal billionaire, a burnt-out Washington..."
4,452264464,Beloved (Plume Contemporary Fiction),Toni Morrison,1994.0,Plume,Fiction,Staring unflinchingly into the abyss of slaver...


In [8]:
df_n_des.shape

(15452, 7)

In [7]:
# keep only books whose categories cell is not null
books_wd = df_n_des[df_n_des['categories'].notnull()].copy()

# keep only books whose categories string is longer than 1 character
books_wd = books_wd[books_wd['categories'].map(len) > 1]

In [8]:
books_wd.shape

(1466, 7)

After filtering, **1,466** books remain.

### Item profile

In [21]:
df_item = books_wd[['isbn','categories']]

In [22]:
df_item.head()

Unnamed: 0,isbn,categories
0,2005018,Actresses
1,374157065,Medical
2,399135782,Fiction
3,440234743,Fiction
4,452264464,Fiction


In [23]:
# one-hot encoding for category
df_genre = pd.get_dummies(df_item['categories'])

Let's give every book unit weight.  
For this binary representation, we normalize each book vector by dividing every entry by the square root of the number of attributes the book has, so that each book vector has unit length.  
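The effect of this normalization is easier to see on books with more than one attribute. The sketch below uses a hypothetical multi-attribute table (`toy` and its values are made up for illustration).

```python
import numpy as np
import pandas as pd

# Hypothetical binary item profile: book_b carries four attributes.
toy = pd.DataFrame(
    {
        "Fiction": [1, 1, 0],
        "Travel": [0, 1, 0],
        "Medical": [0, 1, 1],
        "History": [0, 1, 0],
    },
    index=["book_a", "book_b", "book_c"],
)

# Number of attributes per book.
n_attributes = toy.sum(axis=1)

# Divide each row by the square root of its attribute count.
toy_normalized = toy.div(np.sqrt(n_attributes), axis=0)

# After normalization every book vector has Euclidean length 1.
row_norms = np.sqrt((toy_normalized ** 2).sum(axis=1))

print(toy_normalized)
print(row_norms)
```

A book with four attributes ends up with 0.5 in each of them, while a single-attribute book keeps 1.0. In this notebook each book appears to have exactly one category, so the normalized values stay at 1.0.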

In [24]:
# normalize: divide each row by the square root of its number of attributes
df_genre_normalized = df_genre.div(np.sqrt(df_genre.sum(axis=1)), axis=0)

In [25]:
#create item profile
df_item = pd.concat([df_item, df_genre_normalized], axis=1)

df_item.drop(columns='categories', inplace=True)

In [26]:
df_item.sort_values('isbn', inplace=True)

In [27]:
df_item.set_index('isbn', inplace=True)

In [28]:
df_item.head()

Unnamed: 0_level_0,Accidents,Action and adventure,Actors,Actresses,Adoptees,Adventure stories,Affirmations,African American fiction,African American men,African American psychologists,...,"Ryan, Jack (Fictitious character)","Savich, Dillon (Fictitious character)",Science fiction,Self-Help,Social Science,Star Trek fiction,Travel,Travelers' writings,True Crime,Yorkshire (England)
isbn,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
0002005018,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
000648302X,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
000649840X,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
0020264763,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
0020264801,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [29]:
df_item.shape

(1466, 158)

In `df_item`, there are **1,466** books and **158** genres.

Let's load the merged dataframe of `books`, `ratings` and `users`, and keep only the records that match the 1,466 books in `df_item`.

In [32]:
df_merged = merge_data_frame()

In [33]:
df_merged.head()

Unnamed: 0,user,isbn,rating,location,age,country,province,title,author,pub_year,publisher,url_s,url_m,url_l
0,276725,034545104X,0,"tyler, texas, usa",,usa,texas,Flesh Tones: A Novel,M. J. Rose,2002.0,Ballantine Books,http://images.amazon.com/images/P/034545104X.0...,http://images.amazon.com/images/P/034545104X.0...,http://images.amazon.com/images/P/034545104X.0...
1,2313,034545104X,5,"cincinnati, ohio, usa",23.0,usa,ohio,Flesh Tones: A Novel,M. J. Rose,2002.0,Ballantine Books,http://images.amazon.com/images/P/034545104X.0...,http://images.amazon.com/images/P/034545104X.0...,http://images.amazon.com/images/P/034545104X.0...
2,6543,034545104X,0,"strafford, missouri, usa",34.0,usa,missouri,Flesh Tones: A Novel,M. J. Rose,2002.0,Ballantine Books,http://images.amazon.com/images/P/034545104X.0...,http://images.amazon.com/images/P/034545104X.0...,http://images.amazon.com/images/P/034545104X.0...
3,8680,034545104X,5,"st. charles county, missouri, usa",2.0,usa,missouri,Flesh Tones: A Novel,M. J. Rose,2002.0,Ballantine Books,http://images.amazon.com/images/P/034545104X.0...,http://images.amazon.com/images/P/034545104X.0...,http://images.amazon.com/images/P/034545104X.0...
4,10314,034545104X,9,"beaverton, oregon, usa",,usa,oregon,Flesh Tones: A Novel,M. J. Rose,2002.0,Ballantine Books,http://images.amazon.com/images/P/034545104X.0...,http://images.amazon.com/images/P/034545104X.0...,http://images.amazon.com/images/P/034545104X.0...


In [35]:
# the 10 most active users and the number of ratings each has given
df_merged['user'].value_counts().head(10)

11676     11144
198711     6456
153662     5814
98391      5779
35859      5646
212898     4289
278418     3996
76352      3329
110973     2971
235105     2943
Name: user, dtype: int64

In [37]:
df_final = df_merged[df_merged['isbn'].isin(df_item.index)]

In [38]:
df_final.shape

(72821, 14)

In [39]:
df_final.head()

Unnamed: 0,user,isbn,rating,location,age,country,province,title,author,pub_year,publisher,url_s,url_m,url_l
501,276746,449006522,0,"fort worth, ,",,,,Manhattan Hunt Club,JOHN SAUL,2002.0,Ballantine Books,http://images.amazon.com/images/P/0449006522.0...,http://images.amazon.com/images/P/0449006522.0...,http://images.amazon.com/images/P/0449006522.0...
502,278026,449006522,8,"east orange, new jersey, usa",56.0,usa,new jersey,Manhattan Hunt Club,JOHN SAUL,2002.0,Ballantine Books,http://images.amazon.com/images/P/0449006522.0...,http://images.amazon.com/images/P/0449006522.0...,http://images.amazon.com/images/P/0449006522.0...
503,243,449006522,6,"arden hills, minnesota, usa",,usa,minnesota,Manhattan Hunt Club,JOHN SAUL,2002.0,Ballantine Books,http://images.amazon.com/images/P/0449006522.0...,http://images.amazon.com/images/P/0449006522.0...,http://images.amazon.com/images/P/0449006522.0...
504,645,449006522,0,"ottawa, ontario, canada",,canada,ontario,Manhattan Hunt Club,JOHN SAUL,2002.0,Ballantine Books,http://images.amazon.com/images/P/0449006522.0...,http://images.amazon.com/images/P/0449006522.0...,http://images.amazon.com/images/P/0449006522.0...
505,2010,449006522,0,"colfax, illinois, usa",,usa,illinois,Manhattan Hunt Club,JOHN SAUL,2002.0,Ballantine Books,http://images.amazon.com/images/P/0449006522.0...,http://images.amazon.com/images/P/0449006522.0...,http://images.amazon.com/images/P/0449006522.0...


There are **72,821** ratings of the **1,466** books.

### Rating dataframe

In [45]:
rating = pd.pivot_table(df_final, values='rating', index=['isbn'], columns = ['user'])

In [46]:
rating.sort_index(axis=1, inplace=True)

In [47]:
rating.head()

user,10,100004,100009,10001,100029,100035,10005,100053,100066,100067,...,99885,99894,999,9991,99946,99955,99963,99973,99996,99997
isbn,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
0002005018,,,,,,,,,,,...,,,,,,,,,,
000648302X,,,,,,,,,,,...,,,,,,,,,,
000649840X,,,,,,,,,,,...,,,,,,,,,,
0020264763,,,,,,,,,,,...,,,,,,,,,,
0020264801,,,,,,,,,,,...,,,,,,,,,,


### User Profile

In [48]:
# get all user IDs (the columns of the rating table)
users_no = rating.columns

In [52]:
# create an empty dataframe with one column per category
df_users = pd.DataFrame(columns = df_item.columns)

In [53]:
df_users

Unnamed: 0,Accidents,Action and adventure,Actors,Actresses,Adoptees,Adventure stories,Affirmations,African American fiction,African American men,African American psychologists,...,"Ryan, Jack (Fictitious character)","Savich, Dillon (Fictitious character)",Science fiction,Self-Help,Social Science,Star Trek fiction,Travel,Travelers' writings,True Crime,Yorkshire (England)


In [54]:
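for i in tqdm(range(len(users_no))):
    # weight each book's category vector by this user's rating of the book
    working_df = df_item.mul(rating.iloc[:,i], axis=0)
    # treat zeros (category absent or a rating of 0) as missing so they do not dilute the mean
    working_df = working_df.replace(0, np.nan)
    # user profile = mean weighted rating per category over the books the user rated
    df_users.loc[users_no[i]] = working_df.mean(axis=0)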

100%|██████████| 22950/22950 [09:58<00:00, 38.38it/s]
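The loop above takes about ten minutes. As a possible alternative, the same per-category means can be computed with two matrix products. This is a sketch under assumptions, not the notebook's method: `build_user_profiles` is a hypothetical helper, ratings are assumed non-negative, and the toy data is made up.

```python
import numpy as np
import pandas as pd

def build_user_profiles(item_profile, rating):
    """Mean of (rating * attribute weight) per user and attribute, ignoring zeros."""
    # Align books and treat missing ratings like zero ratings, as the loop does.
    r = rating.reindex(item_profile.index).fillna(0)
    # Sum of rating * weight for every (user, attribute) pair.
    numerator = r.T.dot(item_profile)
    # Number of books that contribute a nonzero product to each pair.
    counts = (r != 0).astype(float).T.dot((item_profile != 0).astype(float))
    # Pairs without any contributing book become NaN, as with mean() over all-NaN.
    return numerator / counts.replace(0, np.nan)

# Hypothetical toy data: three books, two attributes, two users.
item = pd.DataFrame(
    {"Fiction": [1.0, 1.0, 0.0], "Medical": [0.0, 0.0, 1.0]},
    index=["book_a", "book_b", "book_c"],
)
toy_rating = pd.DataFrame(
    {"u1": [8.0, 6.0, np.nan], "u2": [0.0, np.nan, 4.0]},
    index=["book_a", "book_b", "book_c"],
)

profiles = build_user_profiles(item, toy_rating)
print(profiles)
```

With the notebook's variables, the call would be `build_user_profiles(df_item, rating)`, which returns one row per user and one column per category.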


In [56]:
#user profile
df_users.head()

Unnamed: 0,Accidents,Action and adventure,Actors,Actresses,Adoptees,Adventure stories,Affirmations,African American fiction,African American men,African American psychologists,...,"Ryan, Jack (Fictitious character)","Savich, Dillon (Fictitious character)",Science fiction,Self-Help,Social Science,Star Trek fiction,Travel,Travelers' writings,True Crime,Yorkshire (England)
10,,,,,,,,,,,...,,,,,,,,,,
100004,,,,,,,,,,,...,,,,,,,,,,
100009,,,,,,,,,,,...,,,,,,,,,,
10001,,,,,,,,,,,...,,,,,,,,,,
100029,,,,,,,,,,,...,,,,,,,,,,


### IDF
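Let's consider how common each category is among our books. Here the IDF of a category is taken as the reciprocal of its document frequency (the column sum of `df_item`), so rarer categories receive larger weights.  
Multiplying each book vector element-wise by the IDF vector gives the **weighted scores** of each book.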
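To make the weighting concrete, here is a small sketch on made-up data. It computes the reciprocal form used in this notebook and, for comparison only, the logarithmic form `log(N / df)` that is common in text retrieval. The logarithmic variant is not what the notebook uses.

```python
import numpy as np
import pandas as pd

# Hypothetical item profile: three fiction books and one medical book.
toy_item = pd.DataFrame(
    {"Fiction": [1.0, 1.0, 1.0, 0.0], "Medical": [0.0, 0.0, 0.0, 1.0]},
    index=["book_a", "book_b", "book_c", "book_d"],
)

# Document frequency: how many books carry each category.
document_frequency = toy_item.sum()

# Reciprocal weighting, as in this notebook.
idf_reciprocal = 1 / document_frequency

# Logarithmic weighting, shown only for comparison.
idf_log = np.log(len(toy_item) / document_frequency)

# Element-wise weighting of every book vector.
weighted_items = toy_item.mul(idf_reciprocal, axis=1)

print(idf_reciprocal)
print(idf_log)
print(weighted_items)
```

Both variants give the rare `Medical` category a larger weight than the common `Fiction` category. They differ in how steeply the weight falls as a category becomes more frequent.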

In [57]:
document_frequency = df_item.sum()

In [58]:
idf = 1/document_frequency

In [59]:
#make an empty dataframe
df_predict = pd.DataFrame()

In [61]:
# multiply each book vector element-wise by the IDF vector to get the weighted item profile
idf_df_item = df_item.mul(idf)

### Make Prediction

In [62]:
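# predict every user's score for every book from the IDF-weighted item profile
for i in tqdm(range(len(users_no))):
    # multiply each book's weighted category vector by the user's category preferences
    working_df = idf_df_item.mul(df_users.iloc[i], axis=1)
    # sum over categories to get one score per book (NaN preferences are skipped)
    df_predict[users_no[i]] = working_df.sum(axis=1)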

100%|██████████| 22950/22950 [04:38<00:00, 82.51it/s] 
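The prediction loop is equivalent to a single matrix product once undefined preferences are replaced by zero. The sketch below assumes both tables share the same category labels. `predict_scores` is a hypothetical helper and the toy values are made up.

```python
import numpy as np
import pandas as pd

def predict_scores(weighted_items, user_profiles):
    """Return a books x users table of predicted scores."""
    # NaN preferences contribute nothing, matching sum(axis=1) skipping NaN.
    prefs = user_profiles.astype(float).fillna(0)
    # (books x categories) . (categories x users) -> (books x users)
    return weighted_items.dot(prefs.T)

# Hypothetical IDF-weighted item profile.
toy_weighted = pd.DataFrame(
    {"Fiction": [0.5, 0.0], "Medical": [0.0, 1.0]},
    index=["book_a", "book_c"],
)

# Hypothetical user profiles; NaN means no known preference.
toy_users = pd.DataFrame(
    {"Fiction": [7.0, np.nan], "Medical": [np.nan, 4.0]},
    index=["u1", "u2"],
)

scores = predict_scores(toy_weighted, toy_users)
print(scores)
```

With the notebook's variables, this would be `predict_scores(idf_df_item, df_users)`. It also avoids inserting thousands of columns into `df_predict` one at a time.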


In [64]:
# predicted scores of all books for all users
df_predict.head()

Unnamed: 0_level_0,10,100004,100009,10001,100029,100035,10005,100053,100066,100067,...,99885,99894,999,9991,99946,99955,99963,99973,99996,99997
isbn,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
0002005018,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
000648302X,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
000649840X,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
0020264763,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
0020264801,0.0,0.0,0.007722,0.009653,0.0,0.007722,0.0,0.008687,0.006113,0.006757,...,0.0,0.0,0.006757,0.0,0.005309,0.0,0.0,0.005792,0.008366,0.008044


In [113]:
def recommender_1(user_no):
    """Return the 10 books with the highest predicted score for the given user."""

    # predicted scores of all books for this user
    user_predicted_rating = df_predict[user_no]

    # combine the predicted scores with the book details
    user_rating_book_detail = pd.concat([user_predicted_rating, books_wd.set_index('isbn')], axis=1, sort=False)

    # sort by predicted score and keep the top 10 books
    return user_rating_book_detail.sort_values(by=[user_no], ascending=False).iloc[0:10]

In [114]:
# pass a user ID (e.g. '11676') to get the top 10 book recommendations
recommender_1('11676')



Unnamed: 0,11676,title,author,pub_year,publisher,categories,description
038542471X,10.0,The Client,John Grisham,1993.0,Doubleday Books,California,
0525460543,10.0,The Prince of Egypt Collector's Edition Storybook,Walt Disney,,Dreamworks Entertainment,Bible,Recounts the Biblical story of Moses.
1558531025,10.0,Life's Little Instruction Book (Life's Little ...,H. Jackson Brown,1991.0,Thomas Nelson,Conduct of life,
2277241202,10.0,L' Alchimiste,Paul Coelho,,Editions 84,Andalusia (Spain),Conte philosophique - voyage - Espagne - Afriq...
0060972777,9.0,This Boy's Life: A Memoir,Tobias Wolff,1989.0,Perennial,"Authors, American",The author chronicles the tumultuous events of...
0553204963,9.0,James Herriots Yorkshire,James Herriot,1982.0,Bantam Doubleday Dell,Yorkshire (England),
0399148612,9.0,Without Fail (Jack Reacher Novels (Hardcover)),Lee Child,2002.0,G. P. Putnam's Sons,Military police,Hired by the Secret Service to test their shie...
0590442376,8.0,Prom Dress,Lael Littke,1989.0,Scholastic Paperbacks (Mm),Antiques,The beautiful prom dress that Robin finds in h...
067101398X,8.0,"End Game (Star Trek New Frontier, No 4)",Peter David,1997.0,Star Trek,Star Trek fiction,Captain Mackenzie Calhoun - Wearing a veneer o...
0679442790,8.0,The Reader,Bernhard Schlink,1997.0,Random House,Female offenders,Hailed for its eroticism and the the moral cla...


### Pros
* **User independence**: collaborative filtering needs other users' ratings to find similar users before it can make suggestions. A content-based method only has to analyze the items and the user's own profile.
* **Transparency**: a collaborative method recommends an item because unknown users with similar taste liked it, whereas a content-based method can tell you which item features a recommendation is based on.
* **No cold start for new items**: unlike collaborative filtering, new items can be suggested before they have been rated by a substantial number of users.
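The transparency point can be sketched in code: a predicted score is a sum of per-category contributions, so the largest contributions explain the recommendation. `explain_score` is a hypothetical helper and the data is made up. With the notebook's variables, the inputs would be `idf_df_item` and `df_users`.

```python
import numpy as np
import pandas as pd

def explain_score(weighted_items, user_profiles, isbn, user, top_n=3):
    """Return the categories that contribute most to one predicted score."""
    # Contribution of each category = item weight * user preference.
    contributions = weighted_items.loc[isbn] * user_profiles.loc[user]
    # Drop categories with no known preference and keep the largest ones.
    return contributions.dropna().sort_values(ascending=False).head(top_n)

# Hypothetical weighted item profile and user profile.
toy_weighted = pd.DataFrame(
    {"Fiction": [0.5], "Travel": [0.5], "Medical": [0.0]},
    index=["book_b"],
)
toy_users = pd.DataFrame(
    {"Fiction": [8.0], "Travel": [2.0], "Medical": [np.nan]},
    index=["u1"],
)

explanation = explain_score(toy_weighted, toy_users, "book_b", "u1")
print(explanation)
```

Here the score for `book_b` is driven mainly by the user's preference for `Fiction`, with a smaller contribution from `Travel`.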

### Cons
* **Limited content analysis**: if the content does not contain enough information to discriminate between items precisely, the recommendations will not be precise either.
* **Over-specialization**: a content-based method provides a limited degree of novelty, since it has to match the features of the profile against those of the items. A perfect content-based filter may never suggest anything surprising.
* **New user**: when there is not enough information to build a solid profile for a user, recommendations cannot be made reliably.


### References
* https://www.analyticsvidhya.com/blog/2015/08/beginners-guide-learn-content-based-recommender-systems/
* https://towardsdatascience.com/learning-to-make-recommendations-745d13883951
* http://findoutyourfavorite.blogspot.com/2012/04/content-based-filtering.html