## IBM Machine Learning Course Lab 12 - Content Based Recommenders

This is my own attempt at Lab 12 of 'Machine Learning with Python' by IBM on Coursera. It includes my own insight when solving problems. The method of analysis presented here is far more rigorous than that required by the course.



## Introduction: what are content-based recommenders?

- Creates a profile of the user based on the data they provide, for example: ratings of a movie, time spent on a webpage, clicks
- Based on the user data, as well as item data, uses statistical analysis to predict what sort of 'rating' a user would give an item. If the predicted rating is high, the item is recommended to the user

So for example: suppose that you are starting a movie streaming service like Netflix. You have a database of movies with their genres, release dates and so on.

You also have a profile on a user that has watched 5 action films, and rated them fairly highly. You can use that data to create a model that would recommend action films.


The problems with this method are:
- The model cannot capture interdependencies or hidden data: e.g. someone might like action films that feature certain actors, but if the cast is not stored in the database the model has no way of knowing this.
- It does not recommend anything out of sample that the user might also enjoy: if the user has only watched action films, the recommender will keep recommending action films, even though the user might enjoy a comedy.



In [1]:
""" importing necessary packages """
#Dataframe manipulation library
import pandas as pd
#Math functions, we'll only need the sqrt function so let's import only that
from math import sqrt
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

In [2]:
!wget -O moviedataset.zip https://s3-api.us-geo.objectstorage.softlayer.net/cf-courses-data/CognitiveClass/ML0101ENv3/labs/moviedataset.zip
print('unziping ...')
!unzip -o -j moviedataset.zip

--2020-08-04 10:29:52--  https://s3-api.us-geo.objectstorage.softlayer.net/cf-courses-data/CognitiveClass/ML0101ENv3/labs/moviedataset.zip
Resolving s3-api.us-geo.objectstorage.softlayer.net (s3-api.us-geo.objectstorage.softlayer.net)... 67.228.254.196
Connecting to s3-api.us-geo.objectstorage.softlayer.net (s3-api.us-geo.objectstorage.softlayer.net)|67.228.254.196|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 160301210 (153M) [application/zip]
Saving to: ‘moviedataset.zip’


2020-08-04 10:31:17 (1.83 MB/s) - ‘moviedataset.zip’ saved [160301210/160301210]

unziping ...
Archive:  moviedataset.zip
  inflating: links.csv               
  inflating: movies.csv              
  inflating: ratings.csv             
  inflating: README.txt              
  inflating: tags.csv                


In [64]:
""" Reading the data """

movies_df = pd.read_csv('movies.csv') #data frame for the movies

ratings_df = pd.read_csv('ratings.csv') #data frame for the ratings

print(ratings_df.head())

movies_df.head()

   userId  movieId  rating   timestamp
0       1      169     2.5  1204927694
1       1     2471     3.0  1204927438
2       1    48516     5.0  1204927435
3       2     2571     3.5  1436165433
4       2   109487     4.0  1436165496


| | movieId | title | genres |
|---|---|---|---|
| 0 | 1 | Toy Story (1995) | Adventure\|Animation\|Children\|Comedy\|Fantasy |
| 1 | 2 | Jumanji (1995) | Adventure\|Children\|Fantasy |
| 2 | 3 | Grumpier Old Men (1995) | Comedy\|Romance |
| 3 | 4 | Waiting to Exhale (1995) | Comedy\|Drama\|Romance |
| 4 | 5 | Father of the Bride Part II (1995) | Comedy |


In [65]:
""" data processing """

movies_df['year'] = movies_df.title.str.extract(r'(\(\d\d\d\d\))', expand=False)
#this looks complex, so here is what the pattern says:
#the outer ( ) define a capture 'group'; extract returns whatever that group matches
#\( and \) match literal parentheses (the backslash escapes them)
#\d means digit!
#so (\(\d\d\d\d\)) captures text of the format (dddd), e.g. (1995)
#the r prefix makes it a raw string, so Python leaves the backslashes alone

# https://www.coursera.org/lecture/python-text-mining/demonstration-regex-with-pandas-and-named-groups-wh4nJ
#useful link above ^^^

movies_df['year'] = movies_df.year.str.extract(r'(\d\d\d\d)', expand=False)
#so now we're redoing the extraction without the parentheses, effectively removing them

#we still have a problem in that the year is still stored in the 'title'. We can remove it with a replace

movies_df['title'] = movies_df.title.str.replace(r'\(\d\d\d\d\)', '', regex=True)
#regex=True is passed explicitly because newer pandas versions treat the pattern as a literal string by default
print(movies_df['title'].head())
print(movies_df.title.apply(lambda y: len(y)))
movies_df['title'] = movies_df['title'].apply(lambda x: x.strip())
print(movies_df['title'].head())
print(movies_df.title.apply(lambda y: len(y)))

movies_df.head()

#the strip() function removes leading and trailing whitespace from a string
#as you can see from the outputs, the length of each entry has decreased by 1, because the trailing
#space has been removed! (remember, the space was left over from removing the year)

#as you can see, the apply function is quite powerful, as you can literally apply a function to each
#entry in the Series

#what is interesting to note: why is there no need to go through .str here (as we did for
#replace and extract), when strip is a string function?



0                      Toy Story 
1                        Jumanji 
2               Grumpier Old Men 
3              Waiting to Exhale 
4    Father of the Bride Part II 
Name: title, dtype: object
0        10
1         8
2        17
3        18
4        28
         ..
34203    11
34204    11
34205    21
34206     5
34207    22
Name: title, Length: 34208, dtype: int64
0                      Toy Story
1                        Jumanji
2               Grumpier Old Men
3              Waiting to Exhale
4    Father of the Bride Part II
Name: title, dtype: object
0         9
1         7
2        16
3        17
4        27
         ..
34203    10
34204    10
34205    20
34206     4
34207    21
Name: title, Length: 34208, dtype: int64


| | movieId | title | genres | year |
|---|---|---|---|---|
| 0 | 1 | Toy Story | Adventure\|Animation\|Children\|Comedy\|Fantasy | 1995 |
| 1 | 2 | Jumanji | Adventure\|Children\|Fantasy | 1995 |
| 2 | 3 | Grumpier Old Men | Comedy\|Romance | 1995 |
| 3 | 4 | Waiting to Exhale | Comedy\|Drama\|Romance | 1995 |
| 4 | 5 | Father of the Bride Part II | Comedy | 1995 |
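
The two-step extraction above can be condensed. Here is a minimal sketch on a made-up three-row frame (the name `toy` and its rows are mine, not part of the lab): a single capture group placed inside escaped parentheses pulls out just the digits, and the `str` methods are chained so no `apply` is needed for the final `strip`.

```python
import pandas as pd

# Hypothetical sample; the real lab uses movies.csv
toy = pd.DataFrame({'title': ['Toy Story (1995)', 'Akira (1988)', 'No Year Here']})

# \( and \) match literal parentheses; (\d{4}) captures only the four digits
toy['year'] = toy['title'].str.extract(r'\((\d{4})\)', expand=False)

# Remove the "(yyyy)" part, then trim the leftover whitespace
toy['title'] = toy['title'].str.replace(r'\(\d{4}\)', '', regex=True).str.strip()

print(toy)
```

Titles without a year simply get a missing value in the `year` column, so it is worth checking for those before using the column.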


## Exploring the use of lambda functions and str functions a bit further

In [66]:

#is it because perhaps it has already been converted due to its previous declaration?
test_series = movies_df.movieId #series of type int
test_series = test_series.apply(lambda y: y+1)
print(test_series) #as you can see this works well!

#test_series = test_series.apply(lambda x: x.strip([2])) #this gives an error as the data type is INT
test_series = test_series.astype(str)
test_series = test_series.apply(lambda z: z.strip('2'))

#test_series = test_series.str #this ONLY works if the cells contain string types! (hence the astype method)
#^^^ surprisingly, there was no need to use the .str accessor: apply hands each element (already a Python
#str) to the lambda, so strip works directly. In fact, assigning .str first does not work, as the object it
#returns has no .apply method... so what does .str do exactly?
print(test_series)

0             2
1             3
2             4
3             5
4             6
          ...  
34203    151698
34204    151702
34205    151704
34206    151710
34207    151712
Name: movieId, Length: 34208, dtype: int64
0              
1             3
2             4
3             5
4             6
          ...  
34203    151698
34204     15170
34205    151704
34206    151710
34207     15171
Name: movieId, Length: 34208, dtype: object
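
A partial answer to the question in the comments above, as far as I understand pandas: `.str` does not convert anything. It returns an accessor object that exposes vectorised string methods, and each of those methods returns a new Series. `.apply` instead calls a plain Python function on every element, and since each element is already a `str`, methods like `strip` work without `.str`. A small sketch with made-up values:

```python
import pandas as pd

s = pd.Series([' 12 ', ' 21', '112 '])  # made-up values

# Vectorised accessor: the method is applied to the whole Series at once
via_accessor = s.str.strip()

# Element-wise: each x is an ordinary Python str
via_apply = s.apply(lambda x: x.strip())

print(via_accessor.equals(via_apply))   # do both routes give the same Series?
print(type(s.str).__name__)             # the accessor object itself
print(type(s.str.replace('1', '2', regex=False)).__name__)  # result of calling a method on it
```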


## Exploring .str

In [67]:
test_2 = movies_df.movieId
print(type(test_2)) #series
test_2 = test_2.astype(str)
print(type(test_2.str)) #interesting....
#it seems that the .str converts the series to something special..

#however if we try:
test_2 = test_2.str
print(type(test_2)) #this yields stringmethods again!
#BUT, if we actually apply on that string...
test_2 = test_2.replace('1','2')
print(type(test_2)) #it reconverts it back to a series... 
#this is a good exploration... you learnt something new :)

<class 'pandas.core.series.Series'>
<class 'pandas.core.strings.StringMethods'>
<class 'pandas.core.strings.StringMethods'>
<class 'pandas.core.series.Series'>


## Back to the content-based recommendation system...

In [68]:
# the genres are separated by an ugly | ... can we split them up?

movies_df.genres = movies_df.genres.str.split('|')
movies_df.genres.head() #as you can see, the split method came in handy: instead of
#storing the genres as one long piece of text, each row now holds a list! Could be quite useful eh?


0    [Adventure, Animation, Children, Comedy, Fantasy]
1                       [Adventure, Children, Fantasy]
2                                    [Comedy, Romance]
3                             [Comedy, Drama, Romance]
4                                             [Comedy]
Name: genres, dtype: object

In [69]:
#that being said, for content based recommendation systems what we need is a separate
#matrix that holds the genres...

#new df
df_with_genres = movies_df.copy() #remember to use copy so that you're creating a new df, and not referring
#to the old one !!!!

#we essentially want to create a new COLUMN for every genre, and fill it with 0 or 1!
for index, row in movies_df.iterrows(): #VERY POWERFUL!
    for genre in row.genres:
        #genre now refers to the actual value from the row.genres arrays!
        df_with_genres.at[index, genre] = 1 #remember, genre is a string, so what we are doing here is creating 
        # a new column, AND filling the 'index' with 1 for that genre if it exists
print(df_with_genres.head())

#now we must fill the NaN values with 0 (as content-based recos. require numerical dummy variables)

df_with_genres = df_with_genres.fillna(0)
df_with_genres.head() #we can now probably delete the 'genres' column as well!
#notice that there are so many extra columns added, because of the types of genres 

   movieId                        title  \
0        1                    Toy Story   
1        2                      Jumanji   
2        3             Grumpier Old Men   
3        4            Waiting to Exhale   
4        5  Father of the Bride Part II   

                                              genres  year  Adventure  \
0  [Adventure, Animation, Children, Comedy, Fantasy]  1995        1.0   
1                     [Adventure, Children, Fantasy]  1995        1.0   
2                                  [Comedy, Romance]  1995        NaN   
3                           [Comedy, Drama, Romance]  1995        NaN   
4                                           [Comedy]  1995        NaN   

   Animation  Children  Comedy  Fantasy  Romance  ...  Horror  Mystery  \
0        1.0       1.0     1.0      1.0      NaN  ...     NaN      NaN   
1        NaN       1.0     NaN      1.0      NaN  ...     NaN      NaN   
2        NaN       NaN     1.0      NaN      1.0  ...     NaN      NaN   
3     

Unnamed: 0,movieId,title,genres,year,Adventure,Animation,Children,Comedy,Fantasy,Romance,...,Horror,Mystery,Sci-Fi,IMAX,Documentary,War,Musical,Western,Film-Noir,(no genres listed)
0,1,Toy Story,"[Adventure, Animation, Children, Comedy, Fantasy]",1995,1.0,1.0,1.0,1.0,1.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,2,Jumanji,"[Adventure, Children, Fantasy]",1995,1.0,0.0,1.0,0.0,1.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,3,Grumpier Old Men,"[Comedy, Romance]",1995,0.0,0.0,0.0,1.0,0.0,1.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,4,Waiting to Exhale,"[Comedy, Drama, Romance]",1995,0.0,0.0,0.0,1.0,0.0,1.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,5,Father of the Bride Part II,[Comedy],1995,0.0,0.0,0.0,1.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
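
The `iterrows` loop works, but it visits all 34208 rows one at a time. One alternative (not what the lab uses) is `Series.str.get_dummies`, which builds the 0/1 columns straight from the pipe-separated string, so it has to run before the `split('|')` step. A sketch on a made-up frame:

```python
import pandas as pd

# Made-up sample with genres still stored as 'A|B|C' strings
toy = pd.DataFrame({
    'movieId': [1, 2, 3],
    'title': ['Toy Story', 'Jumanji', 'Grumpier Old Men'],
    'genres': ['Adventure|Animation|Children', 'Adventure|Children', 'Comedy|Romance'],
})

# One 0/1 column per genre; absent genres are already 0, so no fillna is needed
dummies = toy['genres'].str.get_dummies(sep='|')

toy_with_genres = pd.concat([toy, dummies], axis=1)
print(toy_with_genres)
```

The columns come out as integers in alphabetical order rather than floats in order of first appearance. That should not affect the dot products later, since pandas aligns on column names.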


In [71]:
ratings_df.head()
#we don't need timestamp, so let's drop it
ratings_df = ratings_df.drop(columns='timestamp') #columns= says to drop a column rather than a row
ratings_df

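| | userId | movieId | rating |
|---|---|---|---|
| 0 | 1 | 169 | 2.5 |
| 1 | 1 | 2471 | 3.0 |
| 2 | 1 | 48516 | 5.0 |
| 3 | 2 | 2571 | 3.5 |
| 4 | 2 | 109487 | 4.0 |
| ... | ... | ... | ... |
| 22884372 | 247753 | 49530 | 5.0 |
| 22884373 | 247753 | 69481 | 3.0 |
| 22884374 | 247753 | 74458 | 4.0 |
| 22884375 | 247753 | 76093 | 5.0 |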


In [72]:
""" creation of the active user """
#remember that the active user is the one who will get stuff recommended to them

userInput = [
            {'title':'Breakfast Club, The', 'rating':5},
            {'title':'Toy Story', 'rating':3.5},
            {'title':'Jumanji', 'rating':2},
            {'title':"Pulp Fiction", 'rating':5},
            {'title':'Akira', 'rating':4.5}
         ] 
inputMovies = pd.DataFrame(userInput)
inputMovies


| | title | rating |
|---|---|---|
| 0 | Breakfast Club, The | 5.0 |
| 1 | Toy Story | 3.5 |
| 2 | Jumanji | 2.0 |
| 3 | Pulp Fiction | 5.0 |
| 4 | Akira | 4.5 |


In [79]:
#let's get the correct IDs for the movies mentioned here (based on our initial movies database)



#Filtering out the movies by title
inputId = movies_df[movies_df['title'].isin(inputMovies['title'].tolist())]
print((movies_df['title'].isin(inputMovies['title'].tolist())))
#Then merging it so we can get the movieId. It's implicitly merging it by title.
inputMovies = pd.merge(inputId, inputMovies)
#Dropping information we won't use from the input dataframe
inputMovies = inputMovies.drop(columns=['genres', 'year'])
#Final input dataframe
#If a movie you added above isn't here, then it might not be in the original
#dataframe or it might be spelled differently, please check capitalisation.
inputMovies

0         True
1         True
2        False
3        False
4        False
         ...  
34203    False
34204    False
34205    False
34206    False
34207    False
Name: title, Length: 34208, dtype: bool


| | movieId | title | rating |
|---|---|---|---|
| 0 | 1 | Toy Story | 3.5 |
| 1 | 2 | Jumanji | 2.0 |
| 2 | 296 | Pulp Fiction | 5.0 |
| 3 | 1274 | Akira | 4.5 |
| 4 | 1968 | Breakfast Club, The | 5.0 |
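
Because `pd.merge` picks the join column implicitly here, a misspelt title silently disappears from the result. Below is a sketch of a more defensive variant using made-up data (`movies`, `user_input` and `missing` are names I chose for illustration): name the key with `on='title'`, and use `how='left'` with `indicator=True` so that unmatched titles stay visible.

```python
import pandas as pd

# Made-up stand-ins for movies_df and the user's input
movies = pd.DataFrame({'movieId': [1, 2, 296],
                       'title': ['Toy Story', 'Jumanji', 'Pulp Fiction']})
user_input = pd.DataFrame({'title': ['Toy Story', 'pulp fiction'],  # second title has the wrong case
                           'rating': [3.5, 5.0]})

# Left join keeps every user row; the '_merge' column says whether a match was found
merged = user_input.merge(movies, on='title', how='left', indicator=True)

missing = merged.loc[merged['_merge'] == 'left_only', 'title'].tolist()
print(merged)
print('Not found:', missing)
```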


In [209]:
""" now let's get the genre information """

userMovies = df_with_genres[(df_with_genres['movieId'].isin(inputMovies['movieId'].tolist()))]
userMovies
#df_with_genres.loc[df_with_genres['movieId'].isin(inputMovies['movieId'].tolist())]
#alternative way of getting the same output

Unnamed: 0,movieId,title,genres,year,Adventure,Animation,Children,Comedy,Fantasy,Romance,...,Horror,Mystery,Sci-Fi,IMAX,Documentary,War,Musical,Western,Film-Noir,(no genres listed)
0,1,Toy Story,"[Adventure, Animation, Children, Comedy, Fantasy]",1995,1.0,1.0,1.0,1.0,1.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,2,Jumanji,"[Adventure, Children, Fantasy]",1995,1.0,0.0,1.0,0.0,1.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
293,296,Pulp Fiction,"[Comedy, Crime, Drama, Thriller]",1994,0.0,0.0,0.0,1.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1246,1274,Akira,"[Action, Adventure, Animation, Sci-Fi]",1988,1.0,1.0,0.0,0.0,0.0,0.0,...,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1885,1968,"Breakfast Club, The","[Comedy, Drama]",1985,0.0,0.0,0.0,1.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [89]:
userMovies = userMovies.reset_index(drop = True) #index is reset, but we still have the column with the correct
#movie Id
userGenreTable = userMovies.drop(columns=['movieId','title','genres','year'])

userGenreTable

Unnamed: 0,Adventure,Animation,Children,Comedy,Fantasy,Romance,Drama,Action,Crime,Thriller,Horror,Mystery,Sci-Fi,IMAX,Documentary,War,Musical,Western,Film-Noir,(no genres listed)
0,1.0,1.0,1.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,1.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,1.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


## How does content-based filtering work?

<p>steps</p>
- Create an attribute matrix with the items the user has interacted with as rows and the attributes as binary dummy variables (this is the table above)
- Create a vector containing the ratings the user has given to those items
- Take the dot product of the transposed matrix and the ratings vector, giving an aggregate score for each attribute. This weighted vector is the user profile
- For each UNWATCHED movie, multiply its attribute vector pointwise with the user profile, sum the result, and normalise by dividing by the sum of the profile weights
- The output is a vector with one score per item. The higher the score, the stronger the recommendation

In [91]:
#Dot product to get weights
userProfile = userGenreTable.transpose().dot(inputMovies['rating'])
#The user profile
userProfile

Adventure             10.0
Animation              8.0
Children               5.5
Comedy                13.5
Fantasy                5.5
Romance                0.0
Drama                 10.0
Action                 4.5
Crime                  5.0
Thriller               5.0
Horror                 0.0
Mystery                0.0
Sci-Fi                 4.5
IMAX                   0.0
Documentary            0.0
War                    0.0
Musical                0.0
Western                0.0
Film-Noir              0.0
(no genres listed)     0.0
dtype: float64
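
To check the dot product by hand, here is a sketch that rebuilds a cut-down genre table (only four of the twenty genre columns, copied from the table above) together with the five ratings. Each profile entry is just the sum of the ratings of the movies carrying that genre, e.g. Adventure = 3.5 + 2.0 + 4.5 = 10.0.

```python
import pandas as pd

# Ratings in row order: Toy Story, Jumanji, Pulp Fiction, Akira, Breakfast Club
ratings = pd.Series([3.5, 2.0, 5.0, 4.5, 5.0])

# Rows in the same order as the ratings; columns are a subset of the real table
genre_table = pd.DataFrame({
    'Adventure': [1, 1, 0, 1, 0],
    'Comedy':    [1, 0, 1, 0, 1],
    'Drama':     [0, 0, 1, 0, 1],
    'Sci-Fi':    [0, 0, 0, 1, 0],
})

# (genres x movies) dot (movies,) -> one weight per genre
profile = genre_table.T.dot(ratings)
print(profile)
```

`.dot` matches the columns of the transposed table to the index labels of the ratings, which is why the earlier `reset_index(drop=True)` matters: both sides need to be labelled 0 to 4.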

In [200]:
#Now let's get the genres of every movie in our original dataframe
genreTable = df_with_genres.set_index(df_with_genres['movieId'])
#And drop the unnecessary information
genreTable = genreTable.drop(columns=['movieId', 'title', 'genres', 'year'])
genreTable.head()

recommendationTable_df = ((genreTable*userProfile).sum(axis=1))/(userProfile.sum())

""" below is for the sake of learning and exploration """
print(userProfile)
print(genreTable.head())
print(genreTable*userProfile)
print((genreTable*userProfile).sum(axis = 1))

recommendationTable_df.head()

Adventure             10.0
Animation              8.0
Children               5.5
Comedy                13.5
Fantasy                5.5
Romance                0.0
Drama                 10.0
Action                 4.5
Crime                  5.0
Thriller               5.0
Horror                 0.0
Mystery                0.0
Sci-Fi                 4.5
IMAX                   0.0
Documentary            0.0
War                    0.0
Musical                0.0
Western                0.0
Film-Noir              0.0
(no genres listed)     0.0
dtype: float64
         Adventure  Animation  Children  Comedy  Fantasy  Romance  Drama  \
movieId                                                                    
1              1.0        1.0       1.0     1.0      1.0      0.0    0.0   
2              1.0        0.0       1.0     0.0      1.0      0.0    0.0   
3              0.0        0.0       0.0     1.0      0.0      1.0    0.0   
4              0.0        0.0       0.0     1.0      0.0      1.0

pandas.core.series.Series

In [104]:
#Sort our recommendations in descending order
recommendationTable_df = recommendationTable_df.sort_values(ascending=False)
#Just a peek at the values
recommendationTable_df.head()

movieId
5018      0.748252
26093     0.734266
27344     0.720280
148775    0.685315
6902      0.678322
dtype: float64
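
The score of a movie is the sum of the profile weights of its genres divided by the sum of all weights, so it always lies between 0 and 1. Here is a sketch using the ten non-zero profile weights from above and two genre rows. It also drops the movies the user has already rated, which the lab code does not do; `watched_ids` is a name I made up.

```python
import pandas as pd

# The ten non-zero profile weights printed above
profile = pd.Series({'Adventure': 10.0, 'Animation': 8.0, 'Children': 5.5, 'Comedy': 13.5,
                     'Fantasy': 5.5, 'Drama': 10.0, 'Action': 4.5, 'Crime': 5.0,
                     'Thriller': 5.0, 'Sci-Fi': 4.5})

# Genre rows for movieId 1 (Toy Story) and 3 (Grumpier Old Men);
# Romance has weight 0 in the profile, so it is left out here
genres = pd.DataFrame(0.0, index=pd.Index([1, 3], name='movieId'), columns=profile.index)
genres.loc[1, ['Adventure', 'Animation', 'Children', 'Comedy', 'Fantasy']] = 1.0
genres.loc[3, ['Comedy']] = 1.0

# Weighted sum per movie, normalised by the total weight
scores = (genres * profile).sum(axis=1) / profile.sum()

watched_ids = [1]  # hypothetical: ids of movies the user already rated
unseen_scores = scores.drop(watched_ids, errors='ignore').sort_values(ascending=False)
print(scores.round(4))
print(unseen_scores.round(4))
```

For Toy Story this is 42.5 / 71.5, roughly 0.594, and for Grumpier Old Men 13.5 / 71.5, roughly 0.189.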

In [234]:
#The final recommendation table
print(recommendationTable_df.head(20).keys()) #keys() returns the index of the Series, i.e. the movieIds
print(recommendationTable_df.head(20).tolist())

movies_df.loc[movies_df['movieId'].isin(recommendationTable_df.head(20).keys())] #this works

#movies_df[movies_df['movieId'].isin(recommendationTable_df.head(20)).keys()] #this does not work, because
#the size of the series that you are inputting inside movies_df is not the same as the size of movies_df !

#movies_df.loc[movies_df['movieId'].isin(recommendationTable_df.head(20)).tolist()]
#it's fairly confusing as to why this does NOT work, despite the inputs to .loc being identical... (see below)
#print(movies_df['movieId'].isin(recommendationTable_df.head(20)).keys()) 
#movies_df.loc[movies_df['movieId'].isin(recommendationTable_df.head(20)).keys()]
#movies_df['movieId'].isin(recommendationTable_df.head(20).keys())
#movies_df['movieId'].isin(recommendationTable_df.head(20).tolist())
#recommendationTable_df.head(20).tolist()


""" the above is void. I've left it there to show thought progression """

Int64Index([1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19,
            20],
           dtype='int64', name='movieId')
[0.5944055944055944, 0.2937062937062937, 0.1888111888111888, 0.32867132867132864, 0.1888111888111888, 0.20279720279720279, 0.1888111888111888, 0.21678321678321677, 0.06293706293706294, 0.2727272727272727, 0.32867132867132864, 0.1888111888111888, 0.32867132867132864, 0.13986013986013987, 0.20279720279720279, 0.2097902097902098, 0.13986013986013987, 0.1888111888111888, 0.1888111888111888, 0.5314685314685315]
    movieId                           title  \
0         1                       Toy Story   
1         2                         Jumanji   
2         3                Grumpier Old Men   
3         4               Waiting to Exhale   
4         5     Father of the Bride Part II   
5         6                            Heat   
6         7                         Sabrina   
7         8                    Tom and Huck   
8         9                  

" the above is void. I've left it there to show thought progression "

In [159]:
""" why keys might be better to use than tolist and WHEN """
print(movies_df['movieId'].isin(recommendationTable_df.head(20).keys()))
print(movies_df['movieId'].isin(recommendationTable_df.head(20).tolist()))
print(movies_df['movieId'].isin(recommendationTable_df.head(20)))
print(type(movies_df['movieId'].isin(recommendationTable_df.head(20).keys())))
print(type(movies_df['movieId'].isin(recommendationTable_df.head(20).tolist())))
print(type(movies_df['movieId'].isin(recommendationTable_df.head(20))))




#they have the same attributes, so why is it that one messes up?

""" the above is void, I've left it there to show thought progression """

0        False
1        False
2        False
3        False
4        False
         ...  
34203    False
34204    False
34205    False
34206    False
34207    False
Name: movieId, Length: 34208, dtype: bool
0        False
1        False
2        False
3        False
4        False
         ...  
34203    False
34204    False
34205    False
34206    False
34207    False
Name: movieId, Length: 34208, dtype: bool
0        False
1        False
2        False
3        False
4        False
         ...  
34203    False
34204    False
34205    False
34206    False
34207    False
Name: movieId, Length: 34208, dtype: bool
<class 'pandas.core.series.Series'>
<class 'pandas.core.series.Series'>
<class 'pandas.core.series.Series'>


In [239]:
# you can find the top 20 recommended movies in many different ways:
print(movies_df.title.loc[movies_df['movieId'].isin(recommendationTable_df.head(20).keys())])


#or, without .loc

print(movies_df.title[movies_df['movieId'].isin(recommendationTable_df.head(20).keys())])

#You CANNOT use 'tolist' because that converts the recommendation table into a list of the scores..
#the scores do not exist in movies_df['movieId'], so isin was looking for values that are never there...
#an easy thing to mix up

#the other mistake was putting the .keys() call outside the .isin: that returns the index of a
#34208-long series of bools... without any filtering!!!!
#silly mistakes, but lessons learnt

#NOTE: the output below lists movieIds 1 to 20, and the scores printed in In [234] are not in descending
#order, so the sort cell (In [104]) was apparently not re-run after In [200] recomputed the table.
#Re-run the sort before this cell to get the actual top 20.

0                          Toy Story
1                            Jumanji
2                   Grumpier Old Men
3                  Waiting to Exhale
4        Father of the Bride Part II
5                               Heat
6                            Sabrina
7                       Tom and Huck
8                       Sudden Death
9                          GoldenEye
10           American President, The
11       Dracula: Dead and Loving It
12                             Balto
13                             Nixon
14                  Cutthroat Island
15                            Casino
16             Sense and Sensibility
17                        Four Rooms
18    Ace Ventura: When Nature Calls
19                       Money Train
Name: title, dtype: object
0                          Toy Story
1                            Jumanji
2                   Grumpier Old Men
3                  Waiting to Exhale
4        Father of the Bride Part II
5                               Heat
6          
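
One thing to watch: `isin` only filters, so the titles come back in `movies_df` order rather than in order of score, and `head(20)` is only the top 20 if the Series was sorted right before. Here is a sketch that avoids both issues with `nlargest` and an index lookup, on made-up scores (`movies`, `scores`, `top` and `result` are illustrative names):

```python
import pandas as pd

# Made-up stand-ins for movies_df and recommendationTable_df
movies = pd.DataFrame({'movieId': [1, 2, 3, 4],
                       'title': ['Toy Story', 'Jumanji', 'Grumpier Old Men', 'Heat']})
scores = pd.Series([0.59, 0.29, 0.19, 0.20],
                   index=pd.Index([1, 2, 3, 4], name='movieId'))

# nlargest sorts and truncates in one step, so it does not depend on an earlier sort cell
top = scores.nlargest(3)

# Join the titles onto the scores by movieId; a left join keeps the rank order of 'top'
titles = movies.set_index('movieId')['title']
result = top.rename('score').to_frame().join(titles)
print(result)
```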

## Conclusion

Nice exercise. It took me a while longer, but it was worth it.

### pros of content based filtering
- highly personalised to the user
- learns user preferences

### cons of content based filtering
- doesn't take into account what other users think of an item, so it bases its judgements solely on the attributes of the item. As such, if you don't have enough attributes it may give poor recommendations (i.e. someone might generally like action films, but there are SO many action films that it might recommend a movie just because it's an action film, not because it's good)
- data extraction takes time and is not intuitive
- determining the correct characteristics to use is not always clear
- out of sample data may never be recommended, i.e. if a user has never watched a drama film, one will not be recommended even though the user might enjoy it