# Phase 4 Final Project Submission

* Student name: Heath Rittler
* Student pace: Self paced
* Scheduled project review date/time: tbd
* Instructor name: Mark Barbour
* Blog post URL: https://medium.com/@heathlikethecandybar

# Introduction

## Business Case/ Summary

**Our task is to:**

Build a model that provides top 5 movie recommendations to a user, based on their ratings of other movies.

## Modelling and Approach

**Collaborative Filtering**

At minimum, your recommendation system must use collaborative filtering. If you have time, consider implementing a hybrid approach, e.g. using collaborative filtering as the primary mechanism, but using content-based filtering to address the cold start problem.
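
As a rough illustration of what collaborative filtering involves, the sketch below builds a tiny user-item matrix and scores unseen movies for one user with item-item cosine similarity. The toy ratings, the helper name `recommend_top_n`, and the choice of item-based cosine similarity are all assumptions made for illustration; this is not the model developed in this notebook.

```python
import numpy as np
import pandas as pd

# Toy ratings in the same layout as ratings.csv (values are made up)
toy_ratings = pd.DataFrame({
    "userId":  [1, 1, 1, 2, 2, 3, 3, 3],
    "movieId": [10, 20, 30, 10, 20, 20, 30, 40],
    "rating":  [5.0, 4.0, 1.0, 4.0, 5.0, 2.0, 5.0, 4.0],
})

def recommend_top_n(ratings_df, user_id, n=5):
    """Hypothetical helper: rank unseen movies by item-item cosine similarity."""
    # Rows are users, columns are movies; unrated cells become 0
    matrix = ratings_df.pivot_table(
        index="userId", columns="movieId", values="rating"
    ).fillna(0.0)

    # Cosine similarity between movie columns
    item_vectors = matrix.to_numpy().T
    norms = np.linalg.norm(item_vectors, axis=1, keepdims=True)
    norms[norms == 0] = 1.0  # avoid division by zero
    unit = item_vectors / norms
    similarity = unit @ unit.T

    # Score each movie as the similarity-weighted sum of the user's ratings
    user_ratings = matrix.loc[user_id].to_numpy()
    scores = pd.Series(similarity @ user_ratings, index=matrix.columns)

    # Keep only movies the user has not rated yet
    unseen = scores[user_ratings == 0]
    return unseen.sort_values(ascending=False).head(n)

recommendations = recommend_top_n(toy_ratings, user_id=2, n=5)
print(recommendations)
```

With only four movies in the toy data, user 2 has two unseen movies, so fewer than five recommendations come back; on the full dataset the same call would return up to `n` rows.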

## Metrics for Evaluation

### **Evaluation**

The MovieLens dataset has explicit ratings, so evaluating the model is relatively straightforward, but the choice of metric still deserves some thought. Because the ratings are ordinal, we can treat this as a regression problem, and for regression there are several candidate metrics, such as RMSE and MAE.
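
To make the difference between the two metrics concrete, here is a minimal sketch that computes RMSE and MAE with NumPy. The `actual` and `predicted` arrays are made-up values, not results from this project.

```python
import numpy as np

# Made-up true and predicted ratings on the 0.5 to 5.0 scale
actual = np.array([4.0, 3.5, 5.0, 2.0, 4.5])
predicted = np.array([3.5, 3.0, 4.0, 3.0, 4.5])

errors = predicted - actual

# RMSE squares errors before averaging, so large misses weigh more
rmse = np.sqrt(np.mean(errors ** 2))

# MAE averages absolute errors, so every miss weighs proportionally
mae = np.mean(np.abs(errors))

print(f"RMSE: {rmse:.3f}, MAE: {mae:.3f}")
```

RMSE is never smaller than MAE on the same predictions, and the gap between them grows when a few predictions are badly off.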

## Core Field Names and Definitions from Data Source

This dataset (ml-latest-small) describes 5-star rating and free-text tagging activity from MovieLens, a movie recommendation service. It contains 100836 ratings and 3683 tag applications across 9742 movies. These data were created by 610 users between March 29, 1996 and September 24, 2018. This dataset was generated on September 26, 2018.

Users were selected at random for inclusion. All selected users had rated at least 20 movies. No demographic information is included. Each user is represented by an id, and no other information is provided.

The data are contained in the following files:
* `links.csv`
* `movies.csv` 
* `ratings.csv` 
* `tags.csv`

### `Formatting and Encoding`
The dataset files are written as comma-separated values files with a single header row. Columns that contain commas (,) are escaped using double-quotes ("). These files are encoded as UTF-8. If accented characters in movie titles or tag values (e.g. Misérables, Les (1995)) display incorrectly, make sure that any program reading the data, such as a text editor, terminal, or script, is configured for UTF-8.
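
The sketch below shows how the quoting and encoding rules play out when reading with pandas. It parses an in-memory sample rather than the real file; the two rows and the id used for the second one are illustrative assumptions.

```python
import io
import pandas as pd

# Two rows in the movies.csv layout; the second title contains a comma
# and an accented character, so it is wrapped in double quotes
raw_bytes = (
    "movieId,title,genres\n"
    "1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy\n"
    '73,"Misérables, Les (1995)",Drama|War\n'
).encode("utf-8")

# Passing the encoding explicitly avoids relying on a platform default
sample = pd.read_csv(io.BytesIO(raw_bytes), encoding="utf-8")

print(sample.loc[1, "title"])
```

The quoted comma stays inside the title instead of splitting it into two fields, and the accented character survives because the bytes are decoded as UTF-8.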

**`User Ids`**
MovieLens users were selected at random for inclusion. Their ids have been anonymized. User ids are consistent between ratings.csv and tags.csv (i.e., the same id refers to the same user across the two files).

**`Movie Ids`**
Only movies with at least one rating or tag are included in the dataset. These movie ids are consistent with those used on the MovieLens web site (e.g., id 1 corresponds to the URL https://movielens.org/movies/1). Movie ids are consistent between ratings.csv, tags.csv, movies.csv, and links.csv (i.e., the same id refers to the same movie across these four data files).

### `Ratings Data File Structure (ratings.csv)`
All ratings are contained in the file ratings.csv. Each line of this file after the header row represents one rating of one movie by one user, and has the following format:

`userId,movieId,rating,timestamp`

The lines within this file are ordered first by userId, then, within user, by movieId.

Ratings are made on a 5-star scale, with half-star increments (0.5 stars - 5.0 stars).

Timestamps represent seconds since midnight Coordinated Universal Time (UTC) of January 1, 1970.
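
A minimal sketch of converting these epoch values into readable datetimes with pandas follows. The three integers are example epoch values in seconds, and the variable names are assumptions for illustration.

```python
import pandas as pd

# Example epoch values in seconds
timestamps = pd.Series([964982703, 964981247, 964982224])

# unit="s" interprets the integers as seconds since the Unix epoch;
# utc=True makes the result timezone-aware in UTC
rated_at = pd.to_datetime(timestamps, unit="s", utc=True)

print(rated_at.dt.year.tolist())
```

The same call could be applied to the `timestamp` column of a ratings or tags data frame to enable time-based grouping during exploration.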

### `Tags Data File Structure (tags.csv)`
All tags are contained in the file tags.csv. Each line of this file after the header row represents one tag applied to one movie by one user, and has the following format:

`userId,movieId,tag,timestamp`

The lines within this file are ordered first by userId, then, within user, by movieId.

Tags are user-generated metadata about movies. Each tag is typically a single word or short phrase. The meaning, value, and purpose of a particular tag is determined by each user.

Timestamps represent seconds since midnight Coordinated Universal Time (UTC) of January 1, 1970.

### `Movies Data File Structure (movies.csv)`
Movie information is contained in the file movies.csv. Each line of this file after the header row represents one movie, and has the following format:

`movieId,title,genres`

Movie titles are entered manually or imported from https://www.themoviedb.org/, and include the year of release in parentheses. Errors and inconsistencies may exist in these titles.

Genres are a pipe-separated list, and are selected from the following:

* Action
* Adventure
* Animation
* Children's
* Comedy
* Crime
* Documentary
* Drama
* Fantasy
* Film-Noir
* Horror
* Musical
* Mystery
* Romance
* Sci-Fi
* Thriller
* War
* Western
* (no genres listed)

### `Links Data File Structure (links.csv)`
Identifiers that can be used to link to other sources of movie data are contained in the file links.csv. Each line of this file after the header row represents one movie, and has the following format:

`movieId,imdbId,tmdbId`

movieId is an identifier for movies used by https://movielens.org. E.g., the movie Toy Story has the link https://movielens.org/movies/1.

imdbId is an identifier for movies used by http://www.imdb.com. E.g., the movie Toy Story has the link http://www.imdb.com/title/tt0114709/.

tmdbId is an identifier for movies used by https://www.themoviedb.org. E.g., the movie Toy Story has the link https://www.themoviedb.org/movie/862.

Use of the resources listed above is subject to the terms of each provider.
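
As a sketch of how these identifiers could be turned back into links, the example below rebuilds the Toy Story URLs quoted above. The column names `imdb_url` and `tmdb_url` are hypothetical, and padding `imdbId` to 7 digits is an assumption inferred from the `tt0114709` example. If `tmdbId` contains missing values, pandas would read it as float and it would need cleaning before this step.

```python
import pandas as pd

# One row in the links.csv layout (Toy Story, per the examples above)
links_sample = pd.DataFrame({"movieId": [1], "imdbId": [114709], "tmdbId": [862]})

# Integer parsing drops leading zeros, so pad back to 7 digits
# before adding the "tt" prefix (assumed padding width)
links_sample["imdb_url"] = (
    "http://www.imdb.com/title/tt"
    + links_sample["imdbId"].astype(str).str.zfill(7)
    + "/"
)
links_sample["tmdb_url"] = (
    "https://www.themoviedb.org/movie/" + links_sample["tmdbId"].astype(str)
)

print(links_sample.loc[0, "imdb_url"])
```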

# Data Load, Cleaning

## Importing Packages

In [45]:
import pandas as pd
import numpy as np
from matplotlib import pyplot as plt
%matplotlib inline
import seaborn as sns
import statsmodels.api as sm
from datetime import datetime

from sklearn.preprocessing import LabelEncoder, OneHotEncoder, StandardScaler, FunctionTransformer
from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV, StratifiedKFold

## Choosing Colors & Templates

## Import Data

### Ratings

In [35]:
# import ratings data file
ratings = pd.read_csv('ml-latest-small/ratings.csv')

In [36]:
# view the first 5 rows
ratings.head()

| | userId | movieId | rating | timestamp |
|---|---|---|---|---|
| 0 | 1 | 1 | 4.0 | 964982703 |
| 1 | 1 | 3 | 4.0 | 964981247 |
| 2 | 1 | 6 | 4.0 | 964982224 |
| 3 | 1 | 47 | 5.0 | 964983815 |
| 4 | 1 | 50 | 5.0 | 964982931 |


In [37]:
# view data types and record counts
ratings.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100836 entries, 0 to 100835
Data columns (total 4 columns):
 #   Column     Non-Null Count   Dtype  
---  ------     --------------   -----  
 0   userId     100836 non-null  int64  
 1   movieId    100836 non-null  int64  
 2   rating     100836 non-null  float64
 3   timestamp  100836 non-null  int64  
dtypes: float64(1), int64(3)
memory usage: 3.1 MB


In [38]:
# check for missing values within any columns
ratings.isna().sum()

userId       0
movieId      0
rating       0
timestamp    0
dtype: int64

### Tags

In [39]:
# import tags data file
tags = pd.read_csv('ml-latest-small/tags.csv')

In [40]:
# view the first 5 rows
tags.head()

| | userId | movieId | tag | timestamp |
|---|---|---|---|---|
| 0 | 2 | 60756 | funny | 1445714994 |
| 1 | 2 | 60756 | Highly quotable | 1445714996 |
| 2 | 2 | 60756 | will ferrell | 1445714992 |
| 3 | 2 | 89774 | Boxing story | 1445715207 |
| 4 | 2 | 89774 | MMA | 1445715200 |


In [41]:
# view data types and record counts
tags.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3683 entries, 0 to 3682
Data columns (total 4 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   userId     3683 non-null   int64 
 1   movieId    3683 non-null   int64 
 2   tag        3683 non-null   object
 3   timestamp  3683 non-null   int64 
dtypes: int64(3), object(1)
memory usage: 115.2+ KB


In [42]:
# check for missing values within any columns
tags.isna().sum()

userId       0
movieId      0
tag          0
timestamp    0
dtype: int64

### Movies

In [43]:
# import movies data file
movies = pd.read_csv('ml-latest-small/movies.csv')

In [44]:
# view the first 5 rows
movies.head()

| | movieId | title | genres |
|---|---|---|---|
| 0 | 1 | Toy Story (1995) | Adventure\|Animation\|Children\|Comedy\|Fantasy |
| 1 | 2 | Jumanji (1995) | Adventure\|Children\|Fantasy |
| 2 | 3 | Grumpier Old Men (1995) | Comedy\|Romance |
| 3 | 4 | Waiting to Exhale (1995) | Comedy\|Drama\|Romance |
| 4 | 5 | Father of the Bride Part II (1995) | Comedy |


In [52]:
# view data types and record counts
movies.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9742 entries, 0 to 9741
Data columns (total 3 columns):
 #   Column   Non-Null Count  Dtype 
---  ------   --------------  ----- 
 0   movieId  9742 non-null   int64 
 1   title    9742 non-null   object
 2   genres   9742 non-null   object
dtypes: int64(1), object(2)
memory usage: 228.5+ KB


In [53]:
# check for missing values within any columns
movies.isna().sum()

movieId    0
title      0
genres     0
dtype: int64

In [123]:
# split out genres
movie_genres = movies.genres.str.split("|", expand = True).add_prefix('genre_')
movie_genres.head()

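| | genre_0 | genre_1 | genre_2 | genre_3 | genre_4 | genre_5 | genre_6 | genre_7 | genre_8 | genre_9 |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Adventure | Animation | Children | Comedy | Fantasy | | | | | |
| 1 | Adventure | Children | Fantasy | | | | | | | |
| 2 | Comedy | Romance | | | | | | | | |
| 3 | Comedy | Drama | Romance | | | | | | | |
| 4 | Comedy | | | | | | | | | |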


In [124]:
# concat df back together
movies_2 = pd.concat([movies, movie_genres], axis=1)

# drop old genres column
movies_2.drop('genres', axis=1, inplace=True)

In [125]:
# view first 5 rows
movies_2.head()

| | movieId | title | genre_0 | genre_1 | genre_2 | genre_3 | genre_4 | genre_5 | genre_6 | genre_7 | genre_8 | genre_9 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | Toy Story (1995) | Adventure | Animation | Children | Comedy | Fantasy | | | | | |
| 1 | 2 | Jumanji (1995) | Adventure | Children | Fantasy | | | | | | | |
| 2 | 3 | Grumpier Old Men (1995) | Comedy | Romance | | | | | | | | |
| 3 | 4 | Waiting to Exhale (1995) | Comedy | Drama | Romance | | | | | | | |
| 4 | 5 | Father of the Bride Part II (1995) | Comedy | | | | | | | | | |


This looks good; however, I don't like that a single genre column can hold more than one value. It is hard to tell which movies are westerns, for example, because we would need to scan multiple columns. I am now going to create a single column for each genre, holding a boolean flag, so we can see each movie's genres and the overall genre distribution at a glance.

In [126]:
# viewing the first column of genres to start splitting out into our denormalized data frame
movies_2['genre_0'].value_counts()

Comedy                2779
Drama                 2226
Action                1828
Adventure              653
Crime                  537
Horror                 468
Documentary            386
Animation              298
Children               197
Thriller                84
Sci-Fi                  62
Mystery                 48
Fantasy                 42
Romance                 38
(no genres listed)      34
Western                 23
Musical                 23
Film-Noir               12
War                      4
Name: genre_0, dtype: int64
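
Note that the counts above cover only the first genre listed for each movie. As a hedged sketch of how every genre could be counted, the example below splits the pipe-separated strings and explodes them into one row per genre. `movies_sample` is a small stand-in built from rows shown in `movies.head()`, not the full data frame.

```python
import pandas as pd

# Small stand-in for the movies frame (rows taken from movies.head())
movies_sample = pd.DataFrame({
    "movieId": [1, 2, 3],
    "genres": [
        "Adventure|Animation|Children|Comedy|Fantasy",
        "Adventure|Children|Fantasy",
        "Comedy|Romance",
    ],
})

# Split each pipe-separated string into a list,
# then give every genre its own row
all_genres = movies_sample["genres"].str.split("|").explode()

# Counts now reflect every genre a movie carries, not just the first one
genre_counts = all_genres.value_counts()
print(genre_counts)
```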

In [150]:
# copy so the original movies frame is not modified in place
movies_denormal = movies.copy()

In [151]:
# comedies
movies_denormal['Comedy'] = movies['genres'].str.contains("Comedy", case=False)

In [152]:
# drama
movies_denormal['Drama'] = movies['genres'].str.contains("Drama", case=False)

In [153]:
# action
movies_denormal['Action'] = movies['genres'].str.contains("Action", case=False)

In [154]:
# adventure
movies_denormal['Adventure'] = movies['genres'].str.contains("Adventure", case=False)

In [155]:
# crime
movies_denormal['Crime'] = movies['genres'].str.contains("Crime", case=False)

In [156]:
# horror
movies_denormal['Horror'] = movies['genres'].str.contains("Horror", case=False)

In [157]:
# documentary
movies_denormal['Documentary'] = movies['genres'].str.contains("Documentary", case=False)

In [158]:
# animation
movies_denormal['Animation'] = movies['genres'].str.contains("Animation", case=False)

In [159]:
# children
movies_denormal['Children'] = movies['genres'].str.contains("Children", case=False)

In [160]:
# thriller
movies_denormal['Thriller'] = movies['genres'].str.contains("Thriller", case=False)

In [161]:
# sci-fi
movies_denormal['Sci_Fi'] = movies['genres'].str.contains("Sci-Fi", case=False)

In [162]:
# mystery
movies_denormal['Mystery'] = movies['genres'].str.contains("Mystery", case=False)

In [163]:
# fantasy
movies_denormal['Fantasy'] = movies['genres'].str.contains("Fantasy", case=False)

In [164]:
# romance
movies_denormal['Romance'] = movies['genres'].str.contains("Romance", case=False)

In [165]:
# (no genres listed)
# regex=False matches the parentheses literally instead of as a regex group
movies_denormal['(no genres listed)'] = movies['genres'].str.contains("(no genres listed)", case=False, regex=False)


In [166]:
# western
movies_denormal['Western'] = movies['genres'].str.contains("Western", case=False)

In [167]:
# musical
movies_denormal['Musical'] = movies['genres'].str.contains("Musical", case=False)

In [168]:
# film-noir
movies_denormal['Film-Noir'] = movies['genres'].str.contains("Film-Noir", case=False)

In [169]:
# war
movies_denormal['War'] = movies['genres'].str.contains("War", case=False)

In [170]:
# now let's remove our string/concatenated genres column
movies_denormal = movies_denormal.drop('genres', axis=1)
movies_denormal

| | movieId | title | Comedy | Drama | Action | Crime | Horror | Documentary | Animation | Children | ... | Sci_Fi | Mystery | Fantasy | Romance | (no genres listed) | Western | Musical | Film-Noir | War | Adventure |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | Toy Story (1995) | True | False | False | False | False | False | True | True | ... | False | False | True | False | False | False | False | False | False | True |
| 1 | 2 | Jumanji (1995) | False | False | False | False | False | False | False | True | ... | False | False | True | False | False | False | False | False | False | True |
| 2 | 3 | Grumpier Old Men (1995) | True | False | False | False | False | False | False | False | ... | False | False | False | True | False | False | False | False | False | False |
| 3 | 4 | Waiting to Exhale (1995) | True | True | False | False | False | False | False | False | ... | False | False | False | True | False | False | False | False | False | False |
| 4 | 5 | Father of the Bride Part II (1995) | True | False | False | False | False | False | False | False | ... | False | False | False | False | False | False | False | False | False | False |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 9737 | 193581 | Black Butler: Book of the Atlantic (2017) | True | False | True | False | False | False | True | False | ... | False | False | True | False | False | False | False | False | False | False |
| 9738 | 193583 | No Game No Life: Zero (2017) | True | False | False | False | False | False | True | False | ... | False | False | True | False | False | False | False | False | False | False |
| 9739 | 193585 | Flint (2017) | False | True | False | False | False | False | False | False | ... | False | False | False | False | False | False | False | False | False | False |
| 9740 | 193587 | Bungo Stray Dogs: Dead Apple (2018) | False | False | True | False | False | False | True | False | ... | False | False | False | False | False | False | False | False | False | False |


This now looks much better and will be easier to work with as we move forward with our EDA. Worst case, we can go back to one of the previous versions, the `movies` or `movies_2` data frames. Next, I am going to further denormalize our data set by joining the other files onto a single data frame.
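
For comparison, the per-genre boolean flags could also be derived in a single step with `Series.str.get_dummies`. This is a sketch of an alternative, not the approach used in the cells above; `movies_sample` and `movies_flags` are hypothetical names, and the rows are copied from `movies.head()`.

```python
import pandas as pd

# Small stand-in for the movies frame (rows taken from movies.head())
movies_sample = pd.DataFrame({
    "movieId": [1, 2, 3],
    "title": ["Toy Story (1995)", "Jumanji (1995)", "Grumpier Old Men (1995)"],
    "genres": [
        "Adventure|Animation|Children|Comedy|Fantasy",
        "Adventure|Children|Fantasy",
        "Comedy|Romance",
    ],
})

# One 0/1 column per distinct genre; cast to bool to get True/False flags
genre_flags = movies_sample["genres"].str.get_dummies(sep="|").astype(bool)

# Attach the flags and drop the original pipe-separated column
movies_flags = pd.concat(
    [movies_sample.drop(columns="genres"), genre_flags], axis=1
)

print(movies_flags.columns.tolist())
```

Because this splits on the separator rather than searching for substrings, it only creates columns for genres that actually occur in the data and needs no special handling for names containing parentheses.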

### Join datasets together

In [58]:
# joining movies onto ratings
denormal = pd.merge(
    ratings,
    movies,
    how="left",
    on='movieId'
)

denormal.head()


| | userId | movieId | rating | timestamp | title | genres |
|---|---|---|---|---|---|---|
| 0 | 1 | 1 | 4.0 | 964982703 | Toy Story (1995) | Adventure\|Animation\|Children\|Comedy\|Fantasy |
| 1 | 1 | 3 | 4.0 | 964981247 | Grumpier Old Men (1995) | Comedy\|Romance |
| 2 | 1 | 6 | 4.0 | 964982224 | Heat (1995) | Action\|Crime\|Thriller |
| 3 | 1 | 47 | 5.0 | 964983815 | Seven (a.k.a. Se7en) (1995) | Mystery\|Thriller |
| 4 | 1 | 50 | 5.0 | 964982931 | Usual Suspects, The (1995) | Crime\|Mystery\|Thriller |
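
Since a left join silently keeps rows that find no match, it can be worth checking the join as it runs. The sketch below uses the `validate` and `indicator` options of `pd.merge` on made-up sample frames; the id 999 is a deliberately unmatched, hypothetical value.

```python
import pandas as pd

ratings_sample = pd.DataFrame({
    "userId": [1, 1, 2],
    "movieId": [1, 3, 999],  # 999 is a made-up id with no movie row
    "rating": [4.0, 4.0, 3.5],
})
movies_sample = pd.DataFrame({
    "movieId": [1, 3],
    "title": ["Toy Story (1995)", "Grumpier Old Men (1995)"],
})

# validate raises MergeError if movieId is duplicated on the movies side;
# indicator adds a _merge column showing where each row found a match
checked = pd.merge(
    ratings_sample, movies_sample,
    how="left", on="movieId",
    validate="many_to_one", indicator=True,
)

# Rows marked left_only had no matching movie
unmatched = checked[checked["_merge"] == "left_only"]
print(len(unmatched))
```

On the real files the unmatched count should be zero if the movie ids are consistent across files as the dataset description states.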


In [None]:
# joining movies onto tags
# stored under a separate name so the ratings join above is not overwritten
denormal_tags = pd.merge(
    tags,
    movies,
    how="left",
    on='movieId'
)

denormal_tags.head()


# Exploratory Data Analysis