# MOVIE RECOMMENDATION SYSTEM 2022
A recommendation system is an information filtering system whose main goal is to predict the rating or preference a user might give to an item. This helps create personalized content and a better product search experience. One popular use is recommending which movie a user should watch. This works because user activity and item activity are strongly dependent on each other. For example, a user who is interested in a historical documentary is more likely to be interested in another historical documentary or an educational program than in an action movie.

A recommendation system can use either of these two techniques:
* Content based filtering
* Collaborative filtering 

In content based filtering, the algorithm seeks to make recommendations based on how similar the properties or features of an item are to other items. 
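
As a rough illustration of the idea, the sketch below scores item-to-item similarity from genre features alone. The titles and the `most_similar` helper are hypothetical and not part of this notebook; the genre strings only mimic the pipe-separated format of the `genres` column in `movies.csv`, and cosine similarity is one common choice of similarity measure, assumed here.

```python
import numpy as np
import pandas as pd

# Toy catalogue (hypothetical titles); genres are pipe-separated,
# like the 'genres' column in movies.csv.
movies = pd.DataFrame({
    "title": ["Doc A", "Doc B", "Action C", "Edu D"],
    "genres": ["Documentary|History", "Documentary|History|War",
               "Action|Thriller", "Documentary"],
})

# One binary feature per genre: these are the item "properties".
features = movies["genres"].str.get_dummies(sep="|").to_numpy(dtype=float)

# Cosine similarity between every pair of items.
unit = features / np.linalg.norm(features, axis=1, keepdims=True)
similarity = unit @ unit.T

def most_similar(title, top_n=2):
    """Return the top_n titles whose genre vectors are closest to `title`."""
    idx = movies.index[movies["title"] == title][0]
    scores = pd.Series(similarity[idx], index=movies["title"]).drop(title)
    return scores.sort_values(ascending=False).head(top_n)

print(most_similar("Doc A"))
```

Here the two documentaries rank ahead of the action title for "Doc A", because they share genre features with it and the action title shares none.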

In collaborative filtering, we use similarities between users and items simultaneously to provide recommendations. This allows for serendipitous recommendations; that is, collaborative filtering models can recommend an item to user A based on the interests of a similar user B.
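
A minimal user-based sketch of this idea follows, assuming a tiny made-up rating matrix. The `cosine` and `predict` helpers are hypothetical names, and treating unrated items as 0 when computing similarity is a simplification, not necessarily what a production model would do.

```python
import numpy as np

# Toy user-item rating matrix (rows: users A, B, C; columns: four movies).
# 0 means "not rated"; all values are made up for illustration.
ratings = np.array([
    [5.0, 4.0, 0.0, 1.0],   # user A (has not rated movie 2)
    [5.0, 5.0, 4.0, 1.0],   # user B, similar tastes to A
    [1.0, 2.0, 1.0, 5.0],   # user C, different tastes
])

def cosine(u, v):
    """Cosine similarity between two rating vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def predict(user, item):
    """Similarity-weighted average of other users' ratings for `item`."""
    others = [o for o in range(len(ratings))
              if o != user and ratings[o, item] > 0]
    weights = np.array([cosine(ratings[user], ratings[o]) for o in others])
    return float(weights @ ratings[others, item] / weights.sum())

# Predicted rating of user A for the movie they have not rated.
print(predict(user=0, item=2))
```

Because user B is more similar to user A than user C is, B's rating of the unseen movie carries more weight in the prediction.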

Here we explore both methods and assess which one gives the best results. The primary commercial goal of a recommender system is to increase sales: by recommending carefully selected items, it brings relevant items to users' attention, which increases sales volume and profit for merchants.

----

<a id="cont"></a>

## Table of Contents

<a href=#one>1. Importing Packages</a>

<a href=#two>2. Loading the Data</a>

<a href=#three>3. Create Experiment with Comet</a>

<a href=#four>4. Exploratory Data Analysis (EDA)</a>

<a href=#five>5. Data Engineering</a>

<a href=#six>6. Modeling</a>

<a href=#seven>7. Model Performance</a>

<a href=#eight>8. Model Explanations</a>

<a id="one"></a>
## 1. Importing Packages
<a href=#cont>Back to Table of Contents</a>

---


In [1]:
#basic libraries for processing data
import numpy as np
import pandas as pd

#visualization libraries
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

# save experiments
from comet_ml import Experiment

<a id="two"></a>
## 2. Loading the Data
<a class="anchor" id="1.1"></a>
<a href=#cont>Back to Table of Contents</a>

---


There are six CSV files containing information about the movies. In this section we extract information from the various CSV files and add it to the main training dataset.

In [23]:
#read all the csv files
train_df= pd.read_csv('train.csv')
imdb_df=pd.read_csv('imdb_data.csv')
genome_scores_df=pd.read_csv('genome_scores.csv')
genome_tags_df=pd.read_csv('genome_tags.csv')
movies_df=pd.read_csv('movies.csv')
tags_df=pd.read_csv('tags.csv')

In [31]:
#look at the main training file
train_df.head()

| | userId | movieId | rating | timestamp |
|---|---|---|---|---|
| 0 | 5163 | 57669 | 4.0 | 1518349992 |
| 1 | 106343 | 5 | 4.5 | 1206238739 |
| 2 | 146790 | 5459 | 5.0 | 1076215539 |
| 3 | 106362 | 32296 | 2.0 | 1423042565 |
| 4 | 9041 | 366 | 3.0 | 833375837 |


In [37]:
#add the columns title_cast, budget, runtime, director and plot_keywords to the main training file from the imdb dataset.
#index the imdb data by movieId once, then look up each column by movieId
imdb_indexed = imdb_df.set_index('movieId')
for col in ['title_cast', 'budget', 'runtime', 'director', 'plot_keywords']:
    train_df[col] = train_df['movieId'].map(imdb_indexed[col])

In [43]:
#add the columns title and genres to the main training file from the movies dataset.
train_df['title'] = train_df['movieId'].map(movies_df.set_index('movieId')['title'])
train_df['genres'] = train_df['movieId'].map(movies_df.set_index('movieId')['genres'])

In [44]:
#visualize our final train dataframe
train_df.head()

| | userId | movieId | rating | timestamp | title_cast | budget | runtime | director | plot_keywords | title | genres |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 5163 | 57669 | 4.0 | 1518349992 | Elizabeth Berrington\|Rudy Blomme\|Olivier Bonjo... | $15,000,000 | 107.0 | Martin McDonagh | dwarf\|bruges\|irish\|hitman | In Bruges (2008) | Comedy\|Crime\|Drama\|Thriller |
| 1 | 106343 | 5 | 4.5 | 1206238739 | Steve Martin\|Diane Keaton\|Martin Short\|Kimberl... | $30,000,000 | 106.0 | Albert Hackett | fatherhood\|doberman\|dog\|mansion | Father of the Bride Part II (1995) | Comedy |
| 2 | 146790 | 5459 | 5.0 | 1076215539 | Tommy Lee Jones\|Will Smith\|Rip Torn\|Lara Flynn... | $140,000,000 | 88.0 | Lowell Cunningham | lingerie\|michael jackson character\|shorthaired... | Men in Black II (a.k.a. MIIB) (a.k.a. MIB 2) (... | Action\|Comedy\|Sci-Fi |
| 3 | 106362 | 32296 | 2.0 | 1423042565 | Sandra Bullock\|Regina King\|Enrique Murciano\|Wi... | $45,000,000 | 115.0 | Marc Lawrence | female protagonist\|cleave gag\|good woman\|fbi | Miss Congeniality 2: Armed and Fabulous (2005) | Adventure\|Comedy\|Crime |
| 4 | 9041 | 366 | 3.0 | 833375837 | Jeff Davis\|Heather Langenkamp\|Miko Hughes\|Matt... | $8,000,000 | 112.0 | Wes Craven | freddy krueger\|elm street\|famous director as h... | Wes Craven's New Nightmare (Nightmare on Elm S... | Drama\|Horror\|Mystery\|Thriller |
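
The column lookups in this section use `Series.map` against a frame indexed by `movieId`. An equivalent left join with `DataFrame.merge` is sketched below on toy data; the frames and values are illustrative stand-ins, not the real `train_df` and `movies_df`.

```python
import pandas as pd

# Toy stand-ins for train_df and movies_df (values are illustrative only).
ratings = pd.DataFrame({
    "userId": [1, 2, 3],
    "movieId": [10, 20, 30],
    "rating": [4.0, 3.5, 5.0],
})
movies = pd.DataFrame({
    "movieId": [10, 20],
    "title": ["Movie X", "Movie Y"],
    "genres": ["Comedy", "Drama"],
})

# how="left" keeps every rating row; movies with no metadata get NaN,
# the same outcome as .map() on a missing key.
# validate="many_to_one" raises if movieId is duplicated in the lookup table.
merged = ratings.merge(movies, on="movieId", how="left", validate="many_to_one")
print(merged)
```

A single merge brings in all metadata columns at once, and the `validate` argument guards against duplicated keys silently multiplying rating rows.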


<a id="three"></a>
## 3. Create Experiment with Comet
<a class="anchor" id="1.1"></a>
<a href=#cont>Back to Table of Contents</a>

---

In [49]:
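# Create an experiment; the API key is read from the COMET_API_KEY
# environment variable instead of being hardcoded in the notebook
import os

experiment = Experiment(
    api_key=os.environ["COMET_API_KEY"],
    project_name="edsa-movie-recommendation-system",
    workspace="stella",
)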

COMET INFO: Experiment is live on comet.ml https://www.comet.ml/stella/edsa-movie-recommendation-system/7561d45c4d164044a454a6d6cb7bee86

COMET INFO: Couldn't find a Git repository in 'C:\\Users\\Stella\\Documents\\Explore Data Science\\unsupervised learning\\Predict' nor in any parent directory. You can override where Comet is looking for a Git Patch by setting the configuration `COMET_GIT_DIRECTORY`


<a id="four"></a>
## 4. Exploratory Data Analysis (EDA)
<a class="anchor" id="1.1"></a>
<a href=#cont>Back to Table of Contents</a>

---



This section extracts insights from the training dataset. It covers:
* The shape of the training dataset
* Checking for Null values
* Unique number of users
* The distribution of ratings
* Word Clouds

In [54]:
#shape of training data
print(f'Total number of rows in dataset:  {train_df.shape[0]}')
print(f'Total number of columns in dataset:  {train_df.shape[1]}')

Total number of rows in dataset:  10000038
Total number of columns in dataset:  11




In [55]:
#checking for null values
train_df.isnull().sum()

userId                 0
movieId                0
rating                 0
timestamp              0
title_cast       2971414
budget           3519283
runtime          3020065
director         2969695
plot_keywords    2977050
title                  0
genres                 0
dtype: int64



The columns 'title_cast', 'budget', 'runtime', 'director' and 'plot_keywords' each have more than 2.9 million null values, out of roughly 10 million rows.

In [68]:
#percentage of null values for each column with missing data:
null_counts = train_df.isnull().sum()
for col in ['title_cast', 'budget', 'runtime', 'director', 'plot_keywords']:
    pct = null_counts[col] / len(train_df) * 100
    print(f'Percentage of missing values in {col} column is: {pct}')


Percentage of missing values in title_cast column is: 29.71402708669707
Percentage of missing values in budget column is: 35.19269626775418
Percentage of missing values in runtime column is: 30.200535237966097
Percentage of missing values in director column is: 29.696837152018823
Percentage of missing values in plot_keywords column is: 29.770386872529887
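
The same percentages can also be computed for every column in one step, since the mean of a boolean null mask is the fraction of missing rows. The sketch below uses a small made-up frame rather than `train_df`.

```python
import numpy as np
import pandas as pd

# Toy frame with some missing values (illustrative only).
df = pd.DataFrame({
    "rating": [4.0, 3.0, 5.0, 2.0],
    "budget": ["$1", None, None, "$2"],
    "runtime": [100.0, np.nan, 90.0, 95.0],
})

# isnull() gives booleans; their mean is the fraction of missing rows.
missing_pct = df.isnull().mean() * 100

# Show only the columns that actually have missing values, worst first.
print(missing_pct[missing_pct > 0].sort_values(ascending=False))
```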




In [70]:
# number of unique users
users = train_df['userId'].unique()
print('Number of unique users in the trainset is : ' + str(len(users)))

Number of unique users in the trainset is : 162541




In [None]:
#distribution of the ratings
#sns.factorplot was renamed to sns.catplot in newer seaborn releases
with sns.axes_style('white'):
    g = sns.catplot(x="rating", data=train_df, aspect=2.0, kind='count')
    g.set_ylabels("Total number of ratings")
print(f'Average rating in dataset: {np.mean(train_df["rating"])}')



We can note that:
* The most common rating is 4.0
* Most movie ratings are above 3.0
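
Observations like these can be checked numerically as well as read off the plot. The sketch below does so on a handful of made-up ratings; the values are illustrative and do not come from `train_df`.

```python
import pandas as pd

# Toy ratings on the same 0.5-5.0 scale (illustrative values only).
ratings = pd.Series(
    [4.0, 4.0, 3.5, 5.0, 3.0, 2.0, 4.0, 4.5, 1.0, 3.0], name="rating"
)

# Share of each rating value, in rating order.
share = ratings.value_counts(normalize=True).sort_index()
print(share)

most_common = ratings.mode().iloc[0]   # most frequent rating value
above_three = (ratings > 3.0).mean()   # fraction of ratings above 3.0
print(f"Most common rating: {most_common}, share above 3.0: {above_three:.0%}")
```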


<a id="five"></a>
## 5. Data Engineering
<a class="anchor" id="1.1"></a>
<a href=#cont>Back to Table of Contents</a>

---

<a id="six"></a>
## 6. Modeling
<a class="anchor" id="1.1"></a>
<a href=#cont>Back to Table of Contents</a>

----

<a id="seven"></a>
## 7. Model Performance
<a class="anchor" id="1.1"></a>
<a href=#cont>Back to Table of Contents</a>

---

<a id="eight"></a>
## 8. Model Explanations
<a class="anchor" id="1.1"></a>
<a href=#cont>Back to Table of Contents</a>

---

In [None]:
experiment.end()