<b>
<p>
<center>
<font size="7">
Using Machine Learning to Create a Movie Recommendation System
</font>
</center>
</p>

<p>
<center>
<font size="5">
Max, Michael, and Yaswanth
</font>
</center>
</p>

<p>
<center>
<font size="3">
DATS 6202 Machine Learning I
</font>
</center>
</p>

---




# Setup

## Google Drive

In [1]:
from google.colab import drive
import sys

# Mount Google Drive
drive.mount('/content/drive')

# Get the path of the data directory
data_dir='/content/drive/My Drive/Colab Notebooks/DATS6202_machine_learning/Group Project/Data/MovieLens/'

# Get the absolute path of the current folder
abspath_curr = '/content/drive/My Drive/Colab Notebooks/DATS6202_machine_learning/Group Project/'

# Get the absolute path of the shallow utilities folder
abspath_util_shallow = '/content/drive/My Drive/Colab Notebooks/DATS6202_machine_learning/Utilities'

# Get the absolute path of the shallow models folder
abspath_model_shallow = '/content/drive/My Drive/Colab Notebooks/DATS6202_machine_learning/Models'

Mounted at /content/drive


## Warning

In [2]:
import warnings

# Ignore warnings
warnings.filterwarnings('ignore')

## Matplotlib

In [3]:
import matplotlib.pyplot as plt
%matplotlib inline 

# Set matplotlib sizes
plt.rc('font', size=20)
plt.rc('axes', titlesize=20)
plt.rc('axes', labelsize=20)
plt.rc('xtick', labelsize=20)
plt.rc('ytick', labelsize=20)
plt.rc('legend', fontsize=20)
plt.rc('figure', titlesize=20)

## TensorFlow

In [4]:
# The magic below allows us to use tensorflow version 2.x
%tensorflow_version 2.x 
import tensorflow as tf
from tensorflow import keras

Colab only includes TensorFlow 2.x; %tensorflow_version has no effect.


## Random Seed

In [5]:
# The random seed
random_seed = 42

# Set random seed in tensorflow
tf.random.set_seed(random_seed)

# Set random seed in numpy
import numpy as np
np.random.seed(random_seed)

# Data Preprocessing

In [6]:
# Import required libraries
import pandas as pd


## Load Data

In [7]:
# Load data
# The '::' separator is longer than one character, so pandas needs the Python parsing engine
movies=pd.read_csv(data_dir+'movies.dat', sep='::', engine='python', encoding='latin-1', header=None, names=['MovieId', 'Title', 'Genres'])
ratings=pd.read_csv(data_dir+'ratings.dat', sep='::', engine='python', encoding='latin-1', header=None, names=['UserId', 'MovieId', 'Rating', 'Timestamp'])
users=pd.read_csv(data_dir+'users.dat', sep='::', engine='python', encoding='latin-1', header=None, names=['UserId', 'Gender', 'Age', 'Occupation', 'Zip-code'])
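
The `.dat` files use a two-character `::` delimiter. pandas interprets separators longer than one character as regular expressions and parses them with the Python engine. A minimal sketch on an in-memory sample, assuming two rows in the same layout as `movies.dat` (the `sample` and `movies_sample` names are illustrative only):

```python
import io
import pandas as pd

# Two rows in the '::'-delimited layout of movies.dat (illustrative sample, not read from disk)
sample = io.StringIO(
    "1::Toy Story (1995)::Animation|Children's|Comedy\n"
    "2::Jumanji (1995)::Adventure|Children's|Fantasy\n"
)

# engine='python' is stated explicitly because the separator has more than one character
movies_sample = pd.read_csv(
    sample,
    sep='::',
    engine='python',
    header=None,
    names=['MovieId', 'Title', 'Genres'],
)

print(movies_sample.shape)
print(movies_sample.loc[0, 'Title'])
```

The single `|` characters inside the genre field are left untouched because only the exact `::` sequence is treated as a column boundary.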

In [8]:
movies.head()

|   | MovieId | Title | Genres |
|---|---|---|---|
| 0 | 1 | Toy Story (1995) | Animation\|Children's\|Comedy |
| 1 | 2 | Jumanji (1995) | Adventure\|Children's\|Fantasy |
| 2 | 3 | Grumpier Old Men (1995) | Comedy\|Romance |
| 3 | 4 | Waiting to Exhale (1995) | Comedy\|Drama |
| 4 | 5 | Father of the Bride Part II (1995) | Comedy |


In [9]:
ratings.head()

|   | UserId | MovieId | Rating | Timestamp |
|---|---|---|---|---|
| 0 | 1 | 1193 | 5 | 978300760 |
| 1 | 1 | 661 | 3 | 978302109 |
| 2 | 1 | 914 | 3 | 978301968 |
| 3 | 1 | 3408 | 4 | 978300275 |
| 4 | 1 | 2355 | 5 | 978824291 |


In [10]:
users.head()

|   | UserId | Gender | Age | Occupation | Zip-code |
|---|---|---|---|---|---|
| 0 | 1 | F | 1 | 10 | 48067 |
| 1 | 2 | M | 56 | 16 | 70072 |
| 2 | 3 | M | 25 | 15 | 55117 |
| 3 | 4 | M | 45 | 7 | 2460 |
| 4 | 5 | M | 25 | 20 | 55455 |


## Merge DataFrames
- First, merge the movies and ratings data on 'MovieId'.
- Second, merge the combined movies and ratings data with the users data on 'UserId'.

In [11]:
# Merge data frames
data=pd.merge(left=movies, right=ratings, how='inner', on='MovieId')
data=pd.merge(left=data, right=users, how='inner', on='UserId')

In [12]:
data.head()

|   | MovieId | Title | Genres | UserId | Rating | Timestamp | Gender | Age | Occupation | Zip-code |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | Toy Story (1995) | Animation\|Children's\|Comedy | 1 | 5 | 978824268 | F | 1 | 10 | 48067 |
| 1 | 48 | Pocahontas (1995) | Animation\|Children's\|Musical\|Romance | 1 | 5 | 978824351 | F | 1 | 10 | 48067 |
| 2 | 150 | Apollo 13 (1995) | Drama | 1 | 5 | 978301777 | F | 1 | 10 | 48067 |
| 3 | 260 | Star Wars: Episode IV - A New Hope (1977) | Action\|Adventure\|Fantasy\|Sci-Fi | 1 | 4 | 978300760 | F | 1 | 10 | 48067 |
| 4 | 527 | Schindler's List (1993) | Drama\|War | 1 | 5 | 978824195 | F | 1 | 10 | 48067 |
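
The two-step inner merge can be illustrated on toy frames. This is a sketch under assumptions: the `*_toy` frames and their values are hypothetical, and the `validate` argument is an optional safeguard that the notebook does not use. It makes pandas raise an error if the key relationship is not what we expect (one movie to many ratings, many ratings to one user).

```python
import pandas as pd

# Toy frames reusing the notebook's column names (values are illustrative)
movies_toy = pd.DataFrame({'MovieId': [1, 2, 3], 'Title': ['A', 'B', 'C']})
ratings_toy = pd.DataFrame({'UserId': [10, 10, 20], 'MovieId': [1, 2, 1], 'Rating': [5, 3, 4]})
users_toy = pd.DataFrame({'UserId': [10, 20], 'Gender': ['F', 'M']})

# Step 1: attach movie metadata to every rating (each MovieId is unique on the left side)
merged = pd.merge(left=movies_toy, right=ratings_toy, how='inner', on='MovieId', validate='one_to_many')

# Step 2: attach user attributes to every rating (each UserId is unique on the right side)
merged = pd.merge(left=merged, right=users_toy, how='inner', on='UserId', validate='many_to_one')

print(merged.shape)  # (3, 5)
```

Because both merges are inner joins, the result has one row per rating, and movie 3, which nobody rated, drops out.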


## Encode Categorical Variables

### Genre (Multilabel Categorical Variable)

In [13]:
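from sklearn.feature_extraction.text import CountVectorizer

def token_creator(genres):
  # Split a pipe-delimited genre string (e.g. "Comedy|Drama") into a list of genre tokens
  return genres.split('|')

# Use the custom analyzer so that each genre becomes one column
cv = CountVectorizer(analyzer=token_creator)

matrix_genres = cv.fit_transform(data.Genres)

# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
df_genres=pd.DataFrame(matrix_genres.toarray(), columns=cv.get_feature_names_out())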
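
To see what this encoding produces, here is a minimal sketch on three hand-written genre strings (`genres_toy`, `cv_toy`, and the other names are illustrative). It reads the column order from the fitted `vocabulary_` attribute, which maps each token to its column index, so it does not depend on which feature-name method the installed scikit-learn version provides.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Three pipe-delimited genre strings in the MovieLens style (illustrative)
genres_toy = ["Animation|Children's|Comedy", "Comedy|Romance", "Drama"]

# A callable analyzer bypasses lowercasing and the default tokenization,
# so each genre label is kept intact as one token
cv_toy = CountVectorizer(analyzer=lambda s: s.split('|'))
matrix_toy = cv_toy.fit_transform(genres_toy)

# vocabulary_ maps token -> column index; sort the tokens by that index
columns = sorted(cv_toy.vocabulary_, key=cv_toy.vocabulary_.get)

print(columns)
print(matrix_toy.toarray())
```

Each row has a 1 in every column whose genre appears in that string and a 0 elsewhere, which is the multi-label indicator layout used for `df_genres`.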

In [14]:
df_movies=pd.concat([data, df_genres], axis=1)

In [15]:
df_movies.drop(['Genres'], axis=1, inplace=True)

In [16]:
df_movies.head()

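|   | MovieId | Title | UserId | Rating | Timestamp | Gender | Age | Occupation | Zip-code | Action | ... | Fantasy | Film-Noir | Horror | Musical | Mystery | Romance | Sci-Fi | Thriller | War | Western |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | Toy Story (1995) | 1 | 5 | 978824268 | F | 1 | 10 | 48067 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 48 | Pocahontas (1995) | 1 | 5 | 978824351 | F | 1 | 10 | 48067 | 0 | ... | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 |
| 2 | 150 | Apollo 13 (1995) | 1 | 5 | 978301777 | F | 1 | 10 | 48067 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | 260 | Star Wars: Episode IV - A New Hope (1977) | 1 | 4 | 978300760 | F | 1 | 10 | 48067 | 1 | ... | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
| 4 | 527 | Schindler's List (1993) | 1 | 5 | 978824195 | F | 1 | 10 | 48067 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |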


### Gender (Binary Categorical Variable)
- Female = 0
- Male = 1

In [17]:
df_movies['Gender']=np.where(df_movies['Gender']=='F',0,1)

## Deal with Identifiers

In [18]:
df_movies.drop(['MovieId', 'Title', 'UserId', 'Timestamp', 'Zip-code'], axis=1, inplace=True)

In [19]:
df_movies.head()

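|   | Rating | Gender | Age | Occupation | Action | Adventure | Animation | Children's | Comedy | Crime | ... | Fantasy | Film-Noir | Horror | Musical | Mystery | Romance | Sci-Fi | Thriller | War | Western |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 5 | 0 | 1 | 10 | 0 | 0 | 1 | 1 | 1 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 5 | 0 | 1 | 10 | 0 | 0 | 1 | 1 | 0 | 0 | ... | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 |
| 2 | 5 | 0 | 1 | 10 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | 4 | 0 | 1 | 10 | 1 | 1 | 0 | 0 | 0 | 0 | ... | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
| 4 | 5 | 0 | 1 | 10 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |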


## Handling Missing Data
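
A minimal sketch of a missing-value check, assuming a small illustrative frame (`df_toy` and its values are hypothetical, not the project's data). `isna().sum()` counts the missing entries in each column, and `isna().values.any()` reports whether anything is missing at all.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with a few deliberately missing entries
df_toy = pd.DataFrame({
    'Rating': [5, 4, np.nan],
    'Age': [1, 56, 25],
    'Occupation': [10, np.nan, np.nan],
})

# Number of missing values in each column
missing_per_column = df_toy.isna().sum()
print(missing_per_column)

# True if at least one value is missing anywhere in the frame
has_missing = bool(df_toy.isna().values.any())
print(has_missing)
```

The same two calls could be applied to `df_movies` to decide whether any imputation or row removal is needed before splitting.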

## Splitting into Train, Val, and Test Sets

In [20]:
from sklearn.model_selection import train_test_split

# Divide the data into training (80%) and test (20%)
df_train, df_test=train_test_split(df_movies, train_size=0.8, random_state=random_seed)

# Divide the training data into training (75%) and validation (25%),
# which gives 60% train, 20% validation, and 20% test of the full data
df_train, df_val=train_test_split(df_train, train_size=0.75, random_state=random_seed)
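
The effect of chaining the two splits can be checked on a toy frame. This sketch assumes a hypothetical 100-row frame (`df_toy`) so that the resulting set sizes read directly as percentages of the full data.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical 100-row frame so that sizes equal percentages
df_toy = pd.DataFrame({'x': np.arange(100)})

# First split: 80% kept for training and validation, 20% held out as the test set
df_rest, df_test_toy = train_test_split(df_toy, train_size=0.8, random_state=42)

# Second split: 75% of the remaining 80 rows for training, 25% for validation
df_train_toy, df_val_toy = train_test_split(df_rest, train_size=0.75, random_state=42)

print(len(df_train_toy), len(df_val_toy), len(df_test_toy))  # 60 20 20
```

Taking 75% of the remaining 80% yields 60% of the full data for training, so the validation and test sets end up the same size.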