### Description:

The dataset provided contains movie reviews given by Amazon customers. Reviews were given between May 1996 and July 2014.

### Data Dictionary:
UserID – IDs of the 4848 customers who provided ratings<br>
Movie 1 to Movie 206 – 206 movies rated by these 4848 distinct users

### Data Considerations:
- Not every user has watched every movie, so not every movie has a rating from every user. These missing values are represented by NA.
- Ratings are on a scale of -1 to 10, where -1 is the lowest rating and 10 is the highest.
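
To make the layout concrete, here is a small sketch of a frame in the same wide shape (one row per user, one column per movie, NaN where a user has not rated a movie). The user IDs, column subset, and ratings are invented for illustration; only the shape mirrors the CSV.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data in the same wide layout as the CSV (values are invented).
toy = pd.DataFrame(
    {
        "user_id": ["U1", "U2", "U3"],
        "Movie1": [5.0, np.nan, np.nan],
        "Movie2": [np.nan, 4.0, np.nan],
        "Movie3": [3.0, 5.0, np.nan],
    }
)

ratings_only = toy.drop(columns="user_id")

# Share of user/movie cells that hold no rating.
missing_share = ratings_only.isna().to_numpy().mean()
print(f"missing share: {missing_share:.2f}")
```

Running the same two lines on the real frame shows how sparse the rating matrix is before any model is built.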

### Analysis Tasks:
**1. Exploratory Data Analysis:**<br>

1a) Which movies have maximum views/ratings?<br>
1b) What is the average rating for each movie? Define the top 5 movies with the maximum ratings.<br>
1c) Define the top 5 movies with the least audience.<br>

#### 2. Recommendation Model: <br>
Some movies have not been watched by every user and therefore have no rating from those users. <br>Netflix would like to take this as an opportunity to build a machine learning recommendation algorithm that predicts ratings for each of the users.

2a) Divide the data into training and test data<br>
2b) Build a recommendation model on training data<br>
2c) Make predictions on the test data

### Dataset: 
'Amazon - Movies and TV Ratings.csv'

#### EDA:

In [None]:
#importing libraries

import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

%matplotlib inline

In [None]:
df = pd.read_csv('Amazon - Movies and TV Ratings.csv')
df.head()

In [None]:
#Transpose of the .describe() matrix for better understanding of the dataset
df.describe().T.head()

### 1a) Which movies have maximum views/ratings? COUNT RATINGS

In [None]:
#Rating count per movie out of 4848 distinct users

df.describe().T['count'].sort_values(ascending = False).head().to_frame()

#Hence the top 5 movies by number of views/ratings are shown below:
# Movie127    2313.0
# Movie140     578.0
# Movie16      320.0
# Movie103     272.0
# Movie29      243.0
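
As a possible alternative to going through `describe()`, the number of ratings per movie can be read directly with `DataFrame.count()`, which counts non-NA cells per column. The sketch below uses an invented toy frame; only the `user_id` column name is taken from the notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data (invented values) with the notebook's 'user_id' column.
toy = pd.DataFrame(
    {
        "user_id": ["U1", "U2", "U3"],
        "Movie1": [5.0, 4.0, 3.0],
        "Movie2": [np.nan, 5.0, np.nan],
        "Movie3": [2.0, np.nan, 4.0],
    }
)

# count() ignores NaN, so this is the number of ratings each movie received.
rating_counts = (
    toy.drop(columns="user_id").count().sort_values(ascending=False)
)
print(rating_counts.head())
```

This returns integer counts rather than the floats produced by `describe()`.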

In [None]:
#Sum of ratings
df.drop('user_id',axis = 1).sum().sort_values(ascending = False).head().to_frame()

### 1b) What is the average rating for each movie? Define the top 5 movies with the maximum ratings.

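There are a total of 4848 distinct customers, but each movie is rated by only some of them.<br>
Hence the average rating per movie = (sum of the ratings) / (number of times the movie has been rated). `mean()` skips NA values by default, so it computes exactly this.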

In [None]:
df.drop('user_id',axis = 1).mean().sort_values(ascending = False).head().to_frame()
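
The sketch below checks the formula on an invented toy frame: `mean()` matches sum divided by count when NaN cells are skipped. It also shows one optional refinement that is not part of the original analysis: ranking only movies with at least `min_ratings` ratings (a hypothetical threshold), since a movie with a single top rating would otherwise lead the list.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data (invented values).
toy = pd.DataFrame(
    {
        "user_id": ["U1", "U2", "U3"],
        "Movie1": [5.0, 4.0, 3.0],
        "Movie2": [np.nan, 5.0, np.nan],
        "Movie3": [2.0, np.nan, 4.0],
    }
)
ratings_only = toy.drop(columns="user_id")

# mean() skips NaN, which equals sum of ratings / number of ratings.
avg = ratings_only.mean()
manual = ratings_only.sum() / ratings_only.count()

# Optional: rank only movies with enough ratings (threshold is an assumption).
min_ratings = 2
counts = ratings_only.count()
top = avg[counts >= min_ratings].sort_values(ascending=False)
print(top)
```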

### 1c) Define the top 5 movies with the least audience.

In [None]:
df.describe().T['count'].sort_values(ascending = True).head(5).to_frame()

#### 2. Recommendation Model: <br>

In [None]:
df.head(2)

In [None]:
from surprise import Reader
from surprise import accuracy
from surprise.model_selection import train_test_split
from surprise import Dataset

**Surprise needs the data in long format with three columns, in this order: userID | MovieID | Rating**

In [None]:
melt_df = df.melt(id_vars= df.columns[0], value_vars= df.columns[1:], var_name = 'movie_name', value_name = 'rating')
melt_df.head()
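
A small sketch of what `melt` does here, on an invented toy frame: every user/movie cell becomes one row, including the NaN cells. Dropping those rows (rather than filling them) is shown as one possible way to keep only real ratings; it is a suggestion, not what the notebook does next.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data (invented values).
toy = pd.DataFrame(
    {
        "user_id": ["U1", "U2", "U3"],
        "Movie1": [5.0, 4.0, 3.0],
        "Movie2": [np.nan, 5.0, np.nan],
        "Movie3": [2.0, np.nan, 4.0],
    }
)

# Wide -> long: one row per (user, movie) pair, NaN where there is no rating.
long_df = toy.melt(id_vars="user_id", var_name="movie_name", value_name="rating")

# Keep only the pairs that actually carry a rating.
rated = long_df.dropna(subset=["rating"])
print(len(long_df), len(rated))
```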

In [None]:
reader = Reader(rating_scale=(-1,10))

data = Dataset.load_from_df(melt_df.fillna(0), reader = reader)

### 2a) Divide the data into training and test data<br>

In [None]:
train_data, test_data = train_test_split(data, test_size=0.2)
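
Conceptually, `test_size=0.2` holds out a random 20% of the (user, movie, rating) rows for evaluation. The following is a rough pandas analogue on invented data, meant only to illustrate the idea; it is not how Surprise implements the split.

```python
import pandas as pd

# Invented long-format ratings: one row per (user, movie, rating).
long_df = pd.DataFrame(
    {
        "user_id": [f"U{i}" for i in range(10)],
        "movie_name": ["Movie1", "Movie2"] * 5,
        "rating": [5, 4, 3, 5, 2, 4, 1, 5, 3, 4],
    }
)

# Hold out 20% of the rows at random; the rest is used for training.
test_df = long_df.sample(frac=0.2, random_state=0)
train_df = long_df.drop(test_df.index)
print(len(train_df), len(test_df))
```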

In [None]:
#Algorithm = Singular Value Decomposition
from surprise import SVD

In [None]:
algo = SVD()
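
Surprise's `SVD` is a matrix-factorization model that predicts a rating as global mean + user bias + item bias + dot product of a user and an item factor vector. The block below is a toy NumPy sketch of that idea trained with stochastic gradient descent. The data and hyperparameters are invented, and this is not Surprise's implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Invented (user index, item index, rating) triples.
ratings = [(0, 0, 5.0), (0, 1, 3.0), (1, 0, 4.0),
           (1, 2, 1.0), (2, 1, 2.0), (2, 2, 5.0)]
n_users, n_items, n_factors = 3, 3, 2
lr, reg, n_epochs = 0.01, 0.02, 200  # hypothetical settings for the toy

mu = np.mean([r for _, _, r in ratings])        # global mean
b_u = np.zeros(n_users)                         # user biases
b_i = np.zeros(n_items)                         # item biases
P = rng.normal(0, 0.1, (n_users, n_factors))    # user factors
Q = rng.normal(0, 0.1, (n_items, n_factors))    # item factors

def predict(u, i):
    return mu + b_u[u] + b_i[i] + Q[i] @ P[u]

def train_rmse():
    errs = [(r - predict(u, i)) ** 2 for u, i, r in ratings]
    return float(np.sqrt(np.mean(errs)))

rmse_before = train_rmse()
for _ in range(n_epochs):
    for u, i, r in ratings:
        err = r - predict(u, i)
        # Gradient steps with L2 regularization.
        b_u[u] += lr * (err - reg * b_u[u])
        b_i[i] += lr * (err - reg * b_i[i])
        p_old = P[u].copy()
        P[u] += lr * (err * Q[i] - reg * P[u])
        Q[i] += lr * (err * p_old - reg * Q[i])
rmse_after = train_rmse()
print(rmse_before, rmse_after)
```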

### 2b) Build a recommendation model on training data<br>

In [None]:
algo.fit(train_data)

In [None]:
pred = algo.test(test_data)

In [None]:
accuracy.rmse(predictions= pred)

#melt_df.fillna(0): RMSE: 0.2810
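
An RMSE this low needs care when most cells were zero-filled. RMSE is the square root of the mean squared error, so many zero targets that are predicted as zero add nothing to the numerator but inflate the denominator. The numbers below are invented to show the effect.

```python
import numpy as np

def rmse(actual, predicted):
    """Root mean squared error between two equal-length sequences."""
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    return float(np.sqrt(np.mean((actual - predicted) ** 2)))

# Invented example: 5 real ratings, all predicted (badly) as 0.
real = np.array([5.0, 4.0, 5.0, 3.0, 4.0])
rmse_real = rmse(real, np.zeros_like(real))

# The same 5 ratings plus 995 zero-filled cells that are also predicted as 0.
padded = np.concatenate([real, np.zeros(995)])
rmse_padded = rmse(padded, np.zeros_like(padded))

print(rmse_real, rmse_padded)
```

The predictions for the real ratings are equally wrong in both cases, yet the padded RMSE looks far better.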

In [None]:
#Prediction

#Sample Input
u_id, m_id, rat = ['A1CV1WROP5KTTW', 'Movie5', 5.0]

algo.predict(u_id,m_id,rat, verbose = True)


#user: A1CV1WROP5KTTW item: Movie5     r_ui = 5.00   est = 0.13   {'was_impossible': False}
#Very poor prediction: fillna(0) turns every unrated cell into a rating of 0, which pulls the estimates toward 0.

In [None]:
#Cross Validation using surprise Library

from surprise.model_selection import cross_validate

In [None]:
# cross_validate(algo, data, measures=['RMSE', 'MAE'], cv=3, verbose=True)

# Evaluating RMSE, MAE of algorithm SVD on 3 split(s).

#                   Fold 1  Fold 2  Fold 3  Mean    Std     
# RMSE (testset)    0.2834  0.2867  0.2777  0.2826  0.0037  
# MAE (testset)     0.0426  0.0431  0.0426  0.0428  0.0003  
# Fit time          31.97   32.26   32.28   32.17   0.14    
# Test time         3.81    3.81    3.29    3.64    0.25    
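
For intuition, k-fold cross-validation splits the ratings into k parts, and each part serves once as the test set while the rest is used for fitting. Below is a minimal NumPy sketch with invented ratings and a global-mean baseline standing in for the model; it illustrates the procedure only and is not Surprise's code.

```python
import numpy as np

rng = np.random.default_rng(0)

# Invented ratings; a real run would use the (user, movie, rating) rows.
ratings = np.array([5, 4, 3, 5, 2, 4, 1, 5, 3], dtype=float)
k = 3

# Shuffle the row indices and cut them into k folds.
folds = np.array_split(rng.permutation(len(ratings)), k)

fold_rmse = []
for test_idx in folds:
    train_mask = np.ones(len(ratings), dtype=bool)
    train_mask[test_idx] = False
    # Baseline "model": predict the mean of the training ratings.
    baseline = ratings[train_mask].mean()
    err = ratings[test_idx] - baseline
    fold_rmse.append(float(np.sqrt(np.mean(err ** 2))))

mean_rmse = float(np.mean(fold_rmse))
print(fold_rmse, mean_rmse)
```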

In [None]:
#Mean RMSE across the 3 folds = 0.2826. Let us try to reduce it using a cross-validation helper.

### 2c) Make predictions on the test data<br>

In [None]:
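def repeat(algo_type, frame, min_, max_):
    """Cross-validate `algo_type` on `frame`, then print one sample prediction."""
    reader = Reader(rating_scale=(min_, max_))
    data = Dataset.load_from_df(frame, reader=reader)

    algo = algo_type

    # 3-fold cross-validation; verbose=True prints the per-fold table
    print(cross_validate(algo, data, measures=['RMSE', 'MAE'], cv=3, verbose=True))

    print("#" * 10)

    # Sample input: (user ID, movie, assumed true rating)
#     u_id, m_id, rat = ['A1CV1WROP5KTTW', 'Movie5', 5.0]
    u_id, m_id, rat = ['A3R5OBKS7OM2IR', 'Movie1', 5.0]

    # verbose=True already prints the prediction, so no extra print() is needed
    algo.predict(u_id, m_id, rat, verbose=True)

    print("#" * 10)
    print()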

In [None]:
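# Optional: work on a subset (first 1000 users, first 49 movies) to speed up experiments
# df_1 = df.iloc[:1000, :50]

# melt_df_1 is used in the next cell, so it must be defined; melt df_1 instead of df to use the subset
melt_df_1 = df.melt(id_vars= df.columns[0], value_vars= df.columns[1:], var_name = 'movie_name', value_name = 'rating')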

In [None]:
repeat( SVD(), melt_df_1.fillna(0), 5, 10)

# Recorded output for melt_df_1.fillna(0) with rating_scale=(5, 10):
# Evaluating RMSE, MAE of algorithm SVD on 3 split(s).

#                   Fold 1  Fold 2  Fold 3  Mean    Std     
# RMSE (testset)    4.9884  4.9884  4.9883  4.9884  0.0000  
# MAE (testset)     4.9781  4.9781  4.9779  4.9780  0.0001  
# Fit time          31.19   31.83   31.65   31.56   0.27    
# Test time         2.74    3.06    2.73    2.84    0.15    
# {'test_rmse': array([4.98839914, 4.98839824, 4.98834103]), 'test_mae': array([4.97814332, 4.97809226, 4.977894  ]), 'fit_time': (31.189449548721313, 31.83244276046753, 31.646572589874268), 'test_time': (2.7396697998046875, 3.0581066608428955, 2.729769229888916)}
# ##########
# user: A3R5OBKS7OM2IR item: Movie1     r_ui = 5.00   est = 5.00   {'was_impossible': False}
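
An RMSE close to 5 is what one would expect from this setup: Surprise clips estimates into the rating scale by default, so with a scale of (5, 10) every zero-filled cell is estimated as at least 5. The sketch below reproduces that effect with invented numbers (995 zero-filled cells, 5 real ratings of 5) and `np.clip`; it is an illustration, not a rerun of the model.

```python
import numpy as np

# Invented targets: 995 zero-filled cells and 5 real ratings of 5.
targets = np.concatenate([np.zeros(995), np.full(5, 5.0)])

# Hypothetical unclipped estimates near 0, as a zero-dominated fit would give.
raw_est = np.full_like(targets, 0.1)

# Clipping into the (5, 10) scale pushes every estimate up to 5.
clipped = np.clip(raw_est, 5, 10)

rmse = float(np.sqrt(np.mean((targets - clipped) ** 2)))
mae = float(np.mean(np.abs(targets - clipped)))
print(rmse, mae)
```

The sample prediction of exactly 5.00 is consistent with this: it sits at the lower bound of the scale, so it says little about the model's quality.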