# CONTENT-BASED RECOMMENDER SYSTEM USING MULTIPLE FEATURES
### PROJECT OVERVIEW
A content-based recommender system recommends items to users based on their similarity to items the user has already interacted with. Often, only the description field is vectorized and used to determine the similarity between items. However, a recommender system can be more effective when multiple input fields are combined, vectorized and used to calculate the similarity values. In this project, I use 7 different columns of The Movie Database (tmdb_5000_movies) dataset, downloaded from Kaggle (https://www.kaggle.com/datasets/tmdb/tmdb-movie-metadata) as of October 2022, to build a recommender system for the movies dataset.

### DATASET
The old version of the TMDB dataset was recently removed and replaced with a new dataset. Several fields of the new dataset contain JSON-encoded strings, so I use the data-loading function from https://www.kaggle.com/code/sohier/tmdb-format-introduction/notebook to load the data, and then write a function to extract the needed information from the JSON fields.
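
To make the JSON handling concrete, here is a minimal sketch of what happens to a single cell. The string `raw_genres` is a hypothetical value shaped like an entry of the `genres` column, not a row copied from the file.

```python
import json

# Hypothetical raw cell value, shaped like an entry of the 'genres' column
raw_genres = '[{"id": 28, "name": "Action"}, {"id": 12, "name": "Adventure"}]'

# json.loads turns the string into a list of dictionaries
parsed = json.loads(raw_genres)

# Keep only the 'name' value of each dictionary
names = [entry["name"] for entry in parsed]
print(names)
```

The same two stages (parse, then keep the `name` values) are what the loading function and the extraction function in the steps below apply to whole columns.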

#### STEP 1: IMPORTATION OF LIBRARIES AND LOADING OF DATASET

In [103]:
#importation of json and pandas
import json
import pandas as pd

In [104]:
#the load_data function
def load_tmdb_movies(path):
    df = pd.read_csv(path)
    df['release_date'] = pd.to_datetime(df['release_date']).apply(lambda x: x.date())
    json_columns = ['genres', 'keywords', 'production_countries', 'production_companies', 'spoken_languages']
    for column in json_columns:
        df[column] = df[column].apply(json.loads)
    return df

In [105]:
#application of the load data function
movies = load_tmdb_movies("tmdb_5000_movies.csv")

In [106]:
movies.head()

Unnamed: 0,budget,genres,homepage,id,keywords,original_language,original_title,overview,popularity,production_companies,production_countries,release_date,revenue,runtime,spoken_languages,status,tagline,title,vote_average,vote_count
0,237000000,"[{'id': 28, 'name': 'Action'}, {'id': 12, 'nam...",http://www.avatarmovie.com/,19995,"[{'id': 1463, 'name': 'culture clash'}, {'id':...",en,Avatar,"In the 22nd century, a paraplegic Marine is di...",150.437577,"[{'name': 'Ingenious Film Partners', 'id': 289...","[{'iso_3166_1': 'US', 'name': 'United States o...",2009-12-10,2787965087,162.0,"[{'iso_639_1': 'en', 'name': 'English'}, {'iso...",Released,Enter the World of Pandora.,Avatar,7.2,11800
1,300000000,"[{'id': 12, 'name': 'Adventure'}, {'id': 14, '...",http://disney.go.com/disneypictures/pirates/,285,"[{'id': 270, 'name': 'ocean'}, {'id': 726, 'na...",en,Pirates of the Caribbean: At World's End,"Captain Barbossa, long believed to be dead, ha...",139.082615,"[{'name': 'Walt Disney Pictures', 'id': 2}, {'...","[{'iso_3166_1': 'US', 'name': 'United States o...",2007-05-19,961000000,169.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,"At the end of the world, the adventure begins.",Pirates of the Caribbean: At World's End,6.9,4500
2,245000000,"[{'id': 28, 'name': 'Action'}, {'id': 12, 'nam...",http://www.sonypictures.com/movies/spectre/,206647,"[{'id': 470, 'name': 'spy'}, {'id': 818, 'name...",en,Spectre,A cryptic message from Bond’s past sends him o...,107.376788,"[{'name': 'Columbia Pictures', 'id': 5}, {'nam...","[{'iso_3166_1': 'GB', 'name': 'United Kingdom'...",2015-10-26,880674609,148.0,"[{'iso_639_1': 'fr', 'name': 'Français'}, {'is...",Released,A Plan No One Escapes,Spectre,6.3,4466
3,250000000,"[{'id': 28, 'name': 'Action'}, {'id': 80, 'nam...",http://www.thedarkknightrises.com/,49026,"[{'id': 849, 'name': 'dc comics'}, {'id': 853,...",en,The Dark Knight Rises,Following the death of District Attorney Harve...,112.31295,"[{'name': 'Legendary Pictures', 'id': 923}, {'...","[{'iso_3166_1': 'US', 'name': 'United States o...",2012-07-16,1084939099,165.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,The Legend Ends,The Dark Knight Rises,7.6,9106
4,260000000,"[{'id': 28, 'name': 'Action'}, {'id': 12, 'nam...",http://movies.disney.com/john-carter,49529,"[{'id': 818, 'name': 'based on novel'}, {'id':...",en,John Carter,"John Carter is a war-weary, former military ca...",43.926995,"[{'name': 'Walt Disney Pictures', 'id': 2}]","[{'iso_3166_1': 'US', 'name': 'United States o...",2012-03-07,284139100,132.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,"Lost in our world, found in another.",John Carter,6.1,2124


#### STEP 2: DATA CLEANING AND PREPARATION

In [107]:
#function used to extract the required data from the JSON fields
def get_list(x):
    names = [i['name'] for i in x]
    return names


In [108]:
#Application of the function to the json fields
features = ['genres', 'keywords', 'production_countries', 'production_companies', 'spoken_languages']
for feature in features:
    movies[feature] = movies[feature].apply(get_list)

In [109]:
movies[['genres', 'keywords', 'production_countries', 'production_companies', 'spoken_languages']]

Unnamed: 0,genres,keywords,production_countries,production_companies,spoken_languages
0,"[Action, Adventure, Fantasy, Science Fiction]","[culture clash, future, space war, space colon...","[United States of America, United Kingdom]","[Ingenious Film Partners, Twentieth Century Fo...","[English, Español]"
1,"[Adventure, Fantasy, Action]","[ocean, drug abuse, exotic island, east india ...",[United States of America],"[Walt Disney Pictures, Jerry Bruckheimer Films...",[English]
2,"[Action, Adventure, Crime]","[spy, based on novel, secret agent, sequel, mi...","[United Kingdom, United States of America]","[Columbia Pictures, Danjaq, B24]","[Français, English, Español, Italiano, Deutsch]"
3,"[Action, Crime, Drama, Thriller]","[dc comics, crime fighter, terrorist, secret i...",[United States of America],"[Legendary Pictures, Warner Bros., DC Entertai...",[English]
4,"[Action, Adventure, Science Fiction]","[based on novel, mars, medallion, space travel...",[United States of America],[Walt Disney Pictures],[English]
...,...,...,...,...,...
4798,"[Action, Crime, Thriller]","[united states–mexico barrier, legs, arms, pap...","[Mexico, United States of America]",[Columbia Pictures],[Español]
4799,"[Comedy, Romance]",[],[],[],[]
4800,"[Comedy, Drama, Romance, TV Movie]","[date, love at first sight, narration, investi...",[United States of America],"[Front Street Pictures, Muse Entertainment Ent...",[English]
4801,[],[],"[United States of America, China]",[],[English]


In [110]:
#preparation of the non json fields to be used
movies['overview'] = movies['overview'].fillna('')
movies['tagline'] = movies['tagline'].fillna('')

In [111]:
#function used to convert the non-JSON text fields to lists of words so they can be joined like the other fields
def convert(x):
    #split on whitespace; returning the result directly avoids shadowing the built-in name 'list'
    return x.split()

In [112]:
#application of the function to the non json fields
features = ['overview', 'tagline']
for feature in features:
    movies[feature] = movies[feature].apply(convert)

In [113]:
#the non json fields now converted to lists
movies[['overview', 'tagline']]

Unnamed: 0,overview,tagline
0,"[In, the, 22nd, century,, a, paraplegic, Marin...","[Enter, the, World, of, Pandora.]"
1,"[Captain, Barbossa,, long, believed, to, be, d...","[At, the, end, of, the, world,, the, adventure..."
2,"[A, cryptic, message, from, Bond’s, past, send...","[A, Plan, No, One, Escapes]"
3,"[Following, the, death, of, District, Attorney...","[The, Legend, Ends]"
4,"[John, Carter, is, a, war-weary,, former, mili...","[Lost, in, our, world,, found, in, another.]"
...,...,...
4798,"[El, Mariachi, just, wants, to, play, his, gui...","[He, didn't, come, looking, for, trouble,, but..."
4799,"[A, newlywed, couple's, honeymoon, is, upended...","[A, newlywed, couple's, honeymoon, is, upended..."
4800,"[""Signed,, Sealed,, Delivered"", introduces, a,...",[]
4801,"[When, ambitious, New, York, attorney, Sam, is...","[A, New, Yorker, in, Shanghai]"


In [114]:
#definition of the function used to combine all the selected fields into a single text field
def description(x):
    fields = ['genres', 'keywords', 'production_countries', 'production_companies', 'spoken_languages', 'overview', 'tagline']
    #join the words of each field, then join the fields with a single space
    return ' '.join(' '.join(x[field]) for field in fields)


In [115]:
#application of the function
movies['description'] = movies.apply(description, axis=1)

In [116]:
movies['description']

0       Action Adventure Fantasy Science Fiction cultu...
1       Adventure Fantasy Action ocean drug abuse exot...
2       Action Adventure Crime spy based on novel secr...
3       Action Crime Drama Thriller dc comics crime fi...
4       Action Adventure Science Fiction based on nove...
                              ...                        
4798    Action Crime Thriller united states–mexico bar...
4799    Comedy Romance     A newlywed couple's honeymo...
4800    Comedy Drama Romance TV Movie date love at fir...
4801      United States of America China  English When...
4802    Documentary obsession camcorder crush dream gi...
Name: description, Length: 4803, dtype: object

#### STEP 3: VECTORIZATION AND ESTIMATION OF COSINE SIMILARITY MATRIX

In [117]:
#I use CountVectorizer instead of TF-IDF because a word appearing across multiple documents is not noise in this case, and such words should not be down-weighted
from sklearn.feature_extraction.text import CountVectorizer

In [118]:
#vectorizing the newly created description field using the countvectorizer
count = CountVectorizer(stop_words='english')
count_matrix = count.fit_transform(movies['description'])

In [119]:
count_matrix.shape

(4803, 26281)
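
The difference between the two vectorizers can be illustrated with a small sketch. The three descriptions in `docs` are made up for illustration. The word `action` appears in every one of them, so `CountVectorizer` records it with the same count everywhere, while `TfidfVectorizer` (with its default settings) gives it a lower weight than a word that appears in only one document.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Three made-up descriptions; 'action' appears in all of them
docs = [
    "action batman gotham",
    "action comedy wedding",
    "action space alien",
]

# Raw counts: 'action' is counted once in every document
count_vec = CountVectorizer()
counts = count_vec.fit_transform(docs).toarray()
action_counts = counts[:, count_vec.vocabulary_["action"]]

# TF-IDF weights for the first document
tfidf_vec = TfidfVectorizer()
tfidf = tfidf_vec.fit_transform(docs).toarray()
action_weight = tfidf[0, tfidf_vec.vocabulary_["action"]]
batman_weight = tfidf[0, tfidf_vec.vocabulary_["batman"]]

print(action_counts)
# True: the word shared by all documents is weighted below the rare word
print(action_weight < batman_weight)
```

In this project, shared terms such as a genre or a production country are exactly the signal of interest, which is the motivation given above for keeping raw counts.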

In [120]:
#importation of the required library
from sklearn.metrics.pairwise import cosine_similarity

In [121]:
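#computation of the cosine similarity matrix
#note: this rebinds the name cosine_similarity from the imported function to the resulting matrix,
#so the function must be re-imported before this cell can be run again
cosine_similarity = cosine_similarity(count_matrix, count_matrix)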
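
As a small worked example of what each entry of this matrix represents, the sketch below computes the cosine similarity of two made-up count vectors by hand and compares it with scikit-learn. The vectors `a` and `b` and the alias `sk_cosine_similarity` are illustrative names, not part of the notebook.

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity as sk_cosine_similarity

# Two made-up count vectors over a four-word vocabulary
a = np.array([1, 1, 0, 2])
b = np.array([1, 0, 1, 2])

# Cosine similarity: dot product divided by the product of the vector lengths
manual = a.dot(b) / (np.linalg.norm(a) * np.linalg.norm(b))

# The same value from scikit-learn, which expects 2-D inputs
from_sklearn = sk_cosine_similarity(a.reshape(1, -1), b.reshape(1, -1))[0, 0]

print(round(manual, 4), round(from_sklearn, 4))
```

Here the dot product is 5 and both vectors have length √6, so the similarity is 5/6 ≈ 0.8333. A value of 1 means two descriptions use the same words in the same proportions, and 0 means they share no words.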

#### STEP 4: PREDICTIONS

In [122]:
#resetting the index of the main DataFrame and constructing a reverse mapping from title to row index
movies = movies.reset_index()
indices = pd.Series(movies.index, index=movies['title'])

In [123]:
indices[:10]

title
Avatar                                      0
Pirates of the Caribbean: At World's End    1
Spectre                                     2
The Dark Knight Rises                       3
John Carter                                 4
Spider-Man 3                                5
Tangled                                     6
Avengers: Age of Ultron                     7
Harry Potter and the Half-Blood Prince      8
Batman v Superman: Dawn of Justice          9
dtype: int64

In [124]:
#I shall now write a function that takes a movie title as input and outputs the 10 most similar movies
def get_recommendations(title, cosine_similarity=cosine_similarity):
    #getting the index of the movie that matches the title
    idx = indices[title]
    #getting the pairwise similarity scores of all movies with that movie
    sim_scores = list(enumerate(cosine_similarity[idx]))
    #sorting the movies based on the similarity scores
    sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)
    #getting the scores of the 10 most similar movies, skipping the first entry (expected to be the movie itself)
    sim_scores = sim_scores[1:11]
    #getting the movie indices
    movie_indices = [i[0] for i in sim_scores]
    #getting the top 10 most similar movies
    return movies['title'].iloc[movie_indices]

In [125]:
#applying the recommender function
get_recommendations('The Dark Knight Rises')

65                              The Dark Knight
119                               Batman Begins
1359                                     Batman
428                              Batman Returns
299                              Batman Forever
210                              Batman & Robin
3854    Batman: The Dark Knight Returns, Part 2
9            Batman v Superman: Dawn of Justice
3819                                   Defendor
1664                              Dead Man Down
Name: title, dtype: object

In [126]:
get_recommendations('Batman Forever')

210                              Batman & Robin
65                              The Dark Knight
1359                                     Batman
3                         The Dark Knight Rises
428                              Batman Returns
119                               Batman Begins
9            Batman v Superman: Dawn of Justice
3854    Batman: The Dark Knight Returns, Part 2
72                                Suicide Squad
14                                 Man of Steel
Name: title, dtype: object
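
The lookup above assumes that each title maps to exactly one row. If a title appears more than once, `indices[title]` returns a Series rather than a single integer, and an unknown title raises a bare `KeyError`. The following is a minimal sketch of a more defensive variant, shown on a made-up toy catalogue; the names `toy`, `toy_sim` and `recommend` are hypothetical and not part of the notebook above. It also assumes the DataFrame has the default integer index.

```python
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity as pairwise_cosine

# Made-up toy catalogue; 'Batman' is deliberately listed twice
toy = pd.DataFrame({
    "title": ["Batman", "Batman", "Batman Returns", "Wedding Planner"],
    "description": [
        "action crime gotham hero",
        "comedy gotham hero camp",
        "action crime gotham penguin",
        "comedy romance wedding",
    ],
})

toy_matrix = CountVectorizer().fit_transform(toy["description"])
toy_sim = pairwise_cosine(toy_matrix)

def recommend(title, data=toy, sim=toy_sim, top_n=2):
    """Return up to top_n titles most similar to the first row matching title."""
    matches = data.index[data["title"] == title]
    if len(matches) == 0:
        raise KeyError(f"Unknown title: {title!r}")
    idx = matches[0]                              # first match when titles repeat
    order = np.argsort(-sim[idx], kind="stable")  # most similar first
    order = order[order != idx][:top_n]           # drop the query movie itself
    return data["title"].iloc[order].tolist()

print(recommend("Batman"))
```

Picking the first match is only one possible policy. Disambiguating by another column such as `release_date` would be an alternative when repeated titles matter.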