# Business Problem

A client, an independent movie company that prefers to remain anonymous, is interested in entering the streaming space. The client recognizes that this space is competitive given the services already on offer. However, based on its marketing analysis and its backlog of independent films, the client still believes there is an opportunity.

Before building the streaming service, the client has asked KBO Analytics to create a recommendation system. KBO Analytics will address the first phase of this project by building a proof of concept based on the MovieLens dataset.

# Data Understanding

The data for examining this problem comes from the following source: [MovieLens](https://grouplens.org/datasets/movielens/latest/)

Before building the recommendation system, I want to become familiar with the dataset. I will conduct exploratory data analysis (EDA) to understand the dataset's attributes, including but not limited to the following:

1. Number of Columns
2. Number of Rows
3. Column Names
4. Format of the data in each column
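The four attributes listed above can be collected in a single pass. The following is a minimal sketch, not part of the original analysis: `summarize_frame` is a hypothetical helper (it is not a pandas function), and the small `demo` frame is made-up stand-in data rather than a MovieLens file.

```python
import pandas as pd


def summarize_frame(df: pd.DataFrame) -> dict:
    """Collect basic structural attributes of a dataframe (hypothetical helper)."""
    n_rows, n_cols = df.shape
    return {
        "n_rows": n_rows,
        "n_columns": n_cols,
        "column_names": list(df.columns),
        # Map each column name to the string name of its dtype
        "dtypes": {col: str(dtype) for col, dtype in df.dtypes.items()},
    }


# Made-up stand-in data, not taken from the dataset files
demo = pd.DataFrame({"movieId": [1, 2, 3], "title": ["A", "B", "C"]})
summary = summarize_frame(demo)
print(summary)
```

The same helper could be called on each dataframe loaded below, for example `summarize_frame(df_links)`.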

In [1]:
# Importing Libraries

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
import warnings
warnings.filterwarnings('ignore')

from surprise import Reader, Dataset
from surprise.model_selection import cross_validate
from surprise.prediction_algorithms import SVD
from surprise.prediction_algorithms import KNNWithMeans, KNNBasic, KNNBaseline
from surprise.model_selection import GridSearchCV

There are four CSV files associated with MovieLens:

- *links.csv*
- *movies.csv*
- *ratings.csv*
- *tags.csv*

I will investigate each of these files to understand how to build the recommendation system.

## Links.csv ##

In [2]:
# Reading the 'links.csv' file into a dataframe

df_links = pd.read_csv('data/links.csv')

In [3]:
# Examining the first five rows of the dataframe

df_links.head()

   movieId  imdbId   tmdbId
0        1  114709    862.0
1        2  113497   8844.0
2        3  113228  15602.0
3        4  114885  31357.0
4        5  113041  11862.0


In [4]:
# Examining the last five rows of the dataframe

df_links.tail()

      movieId   imdbId    tmdbId
9737   193581  5476944  432131.0
9738   193583  5914996  445030.0
9739   193585  6397426  479308.0
9740   193587  8391976  483455.0
9741   193609   101726   37891.0


In [11]:
# Examining the dataframe

df_links.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9742 entries, 0 to 9741
Data columns (total 3 columns):
 #   Column   Non-Null Count  Dtype  
---  ------   --------------  -----  
 0   movieId  9742 non-null   int64  
 1   imdbId   9742 non-null   int64  
 2   tmdbId   9734 non-null   float64
dtypes: float64(1), int64(2)
memory usage: 228.5 KB


In [13]:
# Examining missing values in each column

df_links.isna().sum()

movieId    0
imdbId     0
tmdbId     8
dtype: int64
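The eight missing `tmdbId` values also explain why `info()` reports that column as `float64` and why the IDs display with a trailing `.0`: pandas stores an integer column containing `NaN` as floats by default. The sketch below shows one way to isolate such rows and convert the column to the nullable `Int64` dtype. The `links_demo` frame is a made-up stand-in with one gap, not the real `df_links`.

```python
import numpy as np
import pandas as pd

# Made-up stand-in for df_links; one tmdbId is deliberately missing
links_demo = pd.DataFrame(
    {
        "movieId": [1, 2, 3],
        "imdbId": [114709, 113497, 113228],
        "tmdbId": [862.0, np.nan, 15602.0],
    }
)

# Rows where the TMDB identifier is missing
missing_tmdb = links_demo[links_demo["tmdbId"].isna()]

# The nullable integer dtype keeps IDs integral and shows the gap as <NA>
links_demo["tmdbId"] = links_demo["tmdbId"].astype("Int64")

print(missing_tmdb)
print(links_demo.dtypes)
```

Whether to convert, drop, or ignore these rows depends on whether `tmdbId` is used downstream; the sketch only shows how to inspect them.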

In [14]:
# Examining dataframe for duplicate data

df_links.duplicated().sum()

0
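Called with no arguments, `duplicated()` flags a row only when every column matches an earlier row. Since `movieId` is meant to identify a movie, a stricter check restricts the comparison to that key. The following is a small sketch on made-up data (the repeated `movieId` is invented for illustration and does not come from `links.csv`).

```python
import pandas as pd

# Made-up data: movieId 2 appears twice with different imdbId values
links_demo = pd.DataFrame(
    {"movieId": [1, 2, 2], "imdbId": [114709, 113497, 999999]}
)

# Full-row check: no two rows are identical in every column
full_row_dupes = int(links_demo.duplicated().sum())

# Key-only check: the second occurrence of movieId 2 is flagged
key_dupes = int(links_demo.duplicated(subset=["movieId"]).sum())

print(full_row_dupes, key_dupes)
```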

## Movies.csv ##

In [5]:
# Reading the 'movies.csv' file into a dataframe

df_movies = pd.read_csv('data/movies.csv')

In [6]:
# Examining the first 5 rows of the dataframe

df_movies.head()

   movieId  title                               genres
0        1  Toy Story (1995)                    Adventure|Animation|Children|Comedy|Fantasy
1        2  Jumanji (1995)                      Adventure|Children|Fantasy
2        3  Grumpier Old Men (1995)             Comedy|Romance
3        4  Waiting to Exhale (1995)            Comedy|Drama|Romance
4        5  Father of the Bride Part II (1995)  Comedy


In [15]:
# Examining the last 5 rows of the dataframe

df_movies.tail()

      movieId  title                                      genres
9737   193581  Black Butler: Book of the Atlantic (2017)  Action|Animation|Comedy|Fantasy
9738   193583  No Game No Life: Zero (2017)               Animation|Comedy|Fantasy
9739   193585  Flint (2017)                               Drama
9740   193587  Bungo Stray Dogs: Dead Apple (2018)        Action|Animation
9741   193609  Andrew Dice Clay: Dice Rules (1991)        Comedy
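The rows shown above suggest two things about the format of `movies.csv`: `genres` packs several values into one pipe-delimited string, and `title` ends with the release year in parentheses. The sketch below separates both, using the first two rows from the output above as sample data. The regular expression assumes every title ends with a four-digit year in parentheses, which has not been verified for the full file.

```python
import pandas as pd

# First two rows of the movies output shown above
movies_demo = pd.DataFrame(
    {
        "movieId": [1, 2],
        "title": ["Toy Story (1995)", "Jumanji (1995)"],
        "genres": [
            "Adventure|Animation|Children|Comedy|Fantasy",
            "Adventure|Children|Fantasy",
        ],
    }
)

# Split the pipe-delimited string into a list of genres per movie
movies_demo["genre_list"] = movies_demo["genres"].str.split("|")

# Extract a trailing "(YYYY)" from the title; titles without it yield NaN
movies_demo["year"] = movies_demo["title"].str.extract(
    r"\((\d{4})\)\s*$", expand=False
)

# One row per (movie, genre) pair, then count how often each genre occurs
genre_counts = movies_demo.explode("genre_list")["genre_list"].value_counts()

print(movies_demo[["title", "year", "genre_list"]])
print(genre_counts)
```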


In [16]:
# Examining missing values in each column

df_movies.isna().sum()

movieId    0
title      0
genres     0
dtype: int64

In [17]:
# Examining dataframe for duplicate data

df_movies.duplicated().sum()

0
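Both `links.csv` and `movies.csv` end at index 9741 and show the same `movieId` values in their first and last rows, which suggests the two files share one key per movie. An outer merge with `indicator=True` is one way to confirm this. The sketch below uses made-up frames with a deliberate mismatch; on the real data, an empty `unmatched` frame would mean the keys line up.

```python
import pandas as pd

# Made-up frames: movieId 3 is only in movies, movieId 4 is only in links
movies_demo = pd.DataFrame({"movieId": [1, 2, 3], "title": ["A", "B", "C"]})
links_demo = pd.DataFrame({"movieId": [1, 2, 4], "imdbId": [10, 20, 40]})

# The _merge column records whether each key came from left, right, or both
merged = movies_demo.merge(links_demo, on="movieId", how="outer", indicator=True)

# Keep only keys that are missing from one side
unmatched = merged[merged["_merge"] != "both"]
unmatched_ids = sorted(unmatched["movieId"].tolist())

print(unmatched[["movieId", "_merge"]])
```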

## Ratings.csv ##

In [7]:
# Reading the 'ratings.csv' file into a dataframe

df_ratings = pd.read_csv('data/ratings.csv')

In [8]:
# Examining the first five rows of the dataframe
df_ratings.head()

   userId  movieId  rating  timestamp
0       1        1     4.0  964982703
1       1        3     4.0  964981247
2       1        6     4.0  964982224
3       1       47     5.0  964983815
4       1       50     5.0  964982931


In [18]:
# Examining the last five rows of the dataframe
df_ratings.tail()

        userId  movieId  rating   timestamp
100831     610   166534     4.0  1493848402
100832     610   168248     5.0  1493850091
100833     610   168250     5.0  1493848047
100834     610   168252     5.0  1493846352
100835     610   170875     3.0  1493846415
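The `timestamp` column holds Unix epoch values (seconds since midnight UTC on January 1, 1970), which are hard to read as raw integers. The sketch below converts them with `pd.to_datetime`, using the first two ratings rows from the `head()` output as sample data. The column name `rated_at` is an illustrative choice, not something defined by the dataset.

```python
import pandas as pd

# First two rows of the ratings head() output
ratings_demo = pd.DataFrame(
    {
        "userId": [1, 1],
        "movieId": [1, 3],
        "rating": [4.0, 4.0],
        "timestamp": [964982703, 964981247],
    }
)

# unit="s" tells pandas the integers are seconds since the Unix epoch (UTC)
ratings_demo["rated_at"] = pd.to_datetime(ratings_demo["timestamp"], unit="s")
first_rated_at = str(ratings_demo["rated_at"].iloc[0])

print(first_rated_at)  # 2000-07-30 18:45:03
```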


In [19]:
# Examining missing values in each column
df_ratings.isna().sum()

userId       0
movieId      0
rating       0
timestamp    0
dtype: int64

In [20]:
# Examining dataframe for duplicate data
df_ratings.duplicated().sum()

0
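A quantity often checked before building a recommender from ratings is how much of the user-by-movie grid actually contains a rating. The following is a minimal sketch on made-up ratings; applying the same lines to `df_ratings` would give the figure for the real data, which is not computed here.

```python
import pandas as pd

# Made-up ratings: 3 users, 4 distinct movies, 5 ratings
ratings_demo = pd.DataFrame(
    {
        "userId": [1, 1, 2, 3, 3],
        "movieId": [10, 20, 10, 30, 40],
        "rating": [4.0, 3.5, 5.0, 2.0, 4.5],
    }
)

n_users = ratings_demo["userId"].nunique()
n_movies = ratings_demo["movieId"].nunique()

# Share of possible (user, movie) pairs that have a rating
density = len(ratings_demo) / (n_users * n_movies)
sparsity = 1 - density

print(f"users={n_users}, movies={n_movies}, density={density:.3f}")
```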

## Tags.csv ##

In [9]:
# Reading the 'tags.csv' file into a dataframe
df_tags = pd.read_csv('data/tags.csv')

In [10]:
# Examining the first five rows of the dataframe
df_tags.head()

   userId  movieId  tag               timestamp
0       2    60756  funny            1445714994
1       2    60756  Highly quotable  1445714996
2       2    60756  will ferrell     1445714992
3       2    89774  Boxing story     1445715207
4       2    89774  MMA              1445715200


In [21]:
# Examining the last five rows of the dataframe
df_tags.tail()

      userId  movieId  tag                timestamp
3678     606     7382  for katie         1171234019
3679     606     7936  austere           1173392334
3680     610     3265  gun fu            1493843984
3681     610     3265  heroic bloodshed  1493843978
3682     610   168248  Heroic Bloodshed  1493844270


In [22]:
# Examining missing values in each column
df_tags.isna().sum()

userId       0
movieId      0
tag          0
timestamp    0
dtype: int64

In [23]:
# Examining dataframe for duplicate data
df_tags.duplicated().sum()

0
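The `tail()` output for tags shows the same phrase entered as both `heroic bloodshed` and `Heroic Bloodshed`, so an exact-match duplicate check will treat them as different tags. The sketch below normalizes case and surrounding whitespace before counting, using the last three rows of that output as sample data. The column name `tag_normalized` is an illustrative choice.

```python
import pandas as pd

# Last three rows of the tags tail() output
tags_demo = pd.DataFrame(
    {
        "userId": [610, 610, 610],
        "movieId": [3265, 3265, 168248],
        "tag": ["gun fu", "heroic bloodshed", "Heroic Bloodshed"],
    }
)

# Trim whitespace and lowercase so spelling variants collapse together
tags_demo["tag_normalized"] = tags_demo["tag"].str.strip().str.lower()

raw_unique = tags_demo["tag"].nunique()
normalized_unique = tags_demo["tag_normalized"].nunique()

print(raw_unique, normalized_unique)
```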

# Data Preparation

# Modeling

# Overall Conclusion and Recommendations

## Overall Conclusion

## Recommendations