# _User Based Recommendation Model for Amazon_

<b>DESCRIPTION</b>

The dataset provided contains movie reviews given by Amazon customers. Reviews were given between May 1996 and July 2014.

<b> Data Dictionary </b>

- UserID: 4,848 distinct customers who provided ratings
- Movie1 to Movie206: ratings for 206 movies, one column per movie

<b>Data Considerations</b>

- Not every user has watched every movie, so many user/movie pairs have no rating. These missing values are represented by NA.
- Ratings are on a scale of -1 to 10, where -1 is the lowest rating and 10 is the best. (The ratings that actually appear in this dataset take only the values 1 to 5.)

<b>Analysis Task</b>
Exploratory Data Analysis:

    - Which movies have maximum views/ratings?
    - What is the average rating for each movie? Identify the top 5 movies with the highest ratings.
    - Identify the 5 movies with the smallest audience.
    
<b>Recommendation Model:</b> Many movies have not been watched, and therefore not rated, by a given user. Amazon would like to use this as an opportunity to build a machine learning recommendation algorithm that predicts the missing ratings for each user.

    - Divide the data into training and test data
    - Build a recommendation model on training data
    - Make predictions on the test data


## _Import Libraries & Load Data_

In [1]:
## Import the libraries
import pandas as pd
import numpy as np
#visualizations
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
#handle warnings
import warnings
warnings.filterwarnings(action='ignore',category=DeprecationWarning)
warnings.filterwarnings(action='ignore',category=FutureWarning)
#consistent sized plots
from pylab import rcParams
rcParams['figure.figsize']=12,5
rcParams['axes.labelsize']=10
rcParams['xtick.labelsize']=10
rcParams['ytick.labelsize']=10

In [2]:
#load the data
movies = pd.read_csv('Amazon - Movies and TV Ratings.csv',delimiter=',',engine='python')
movies.head()

Unnamed: 0,user_id,Movie1,Movie2,Movie3,Movie4,Movie5,Movie6,Movie7,Movie8,Movie9,...,Movie197,Movie198,Movie199,Movie200,Movie201,Movie202,Movie203,Movie204,Movie205,Movie206
0,A3R5OBKS7OM2IR,5.0,5.0,,,,,,,,...,,,,,,,,,,
1,AH3QC2PC1VTGP,,,2.0,,,,,,,...,,,,,,,,,,
2,A3LKP6WPMP9UKX,,,,5.0,,,,,,...,,,,,,,,,,
3,AVIY68KEPQ5ZD,,,,5.0,,,,,,...,,,,,,,,,,
4,A1CV1WROP5KTTW,,,,,5.0,,,,,...,,,,,,,,,,


In [3]:
#check the shape
movies.shape

(4848, 207)

_As stated in the data dictionary, the dataset has 4,848 users and 206 movies; the 207th column is the user id._

In [4]:
#check info
movies.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4848 entries, 0 to 4847
Columns: 207 entries, user_id to Movie206
dtypes: float64(206), object(1)
memory usage: 7.7+ MB


_The ratings are stored as float64 and the user id as object (string), which is suitable for the analysis that follows._

## _Exploratory Data Analysis_

### _Which movies have the most views/ratings?_

In [5]:
df = movies.melt(id_vars='user_id',var_name='movie_id',value_name='rating')

In [6]:
df.sort_values('user_id',inplace=True)

In [7]:
df.head(3)

Unnamed: 0,user_id,movie_id,rating
576418,A0047322388NOTO4N8SKD,Movie119,
847906,A0047322388NOTO4N8SKD,Movie175,
300082,A0047322388NOTO4N8SKD,Movie62,


_The reshaped dataset contains many NaN values because not every user has watched and rated every movie._

In [8]:
#check the shape
df.shape

(998688, 3)
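
_The row count follows from the reshape: melting 4,848 users across 206 movie columns gives 4848 × 206 = 998,688 (user, movie) pairs, almost all of them unrated. A minimal sketch of the same reshape on a toy frame (the `toy` data and its values are illustrative, not taken from the dataset):_

```python
import numpy as np
import pandas as pd

# Hypothetical 3-user x 2-movie wide table; NaN means "not rated".
toy = pd.DataFrame({
    "user_id": ["u1", "u2", "u3"],
    "Movie1": [5.0, np.nan, np.nan],
    "Movie2": [np.nan, 2.0, np.nan],
})

# Wide -> long: one row per (user, movie) pair, rated or not.
long_df = toy.melt(id_vars="user_id", var_name="movie_id", value_name="rating")

n_pairs = len(long_df)                           # users * movies
n_rated = int(long_df["rating"].notna().sum())   # pairs that carry a rating
sparsity = 1 - n_rated / n_pairs                 # share of pairs with no rating

print(n_pairs, n_rated, round(sparsity, 3))
```

_Applied to `df`, the same three quantities would show how sparse the rating matrix is before any modelling._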

In [9]:
# groupby users and check the number of movies rated by the users 
num_user_rating = df.groupby('user_id')['rating'].count().sort_values(ascending=False)
print('Top 10 Most Ratings by a User')
num_user_rating[:10]

Top 10 Most Ratings by a User


user_id
A2AKR3QR28W09U    6
A1ISBRQ8WUFE41    5
A137SY2CCOWTP6    5
A6GMEO3VRY51S     5
A3H82LUT1EC655    4
A3E102F6LPUF1J    4
A3EVDRB2NK2UHS    4
AD5JI9UN98JPH     3
AP3B615GM191G     3
A3CWH6VKCTJAD     3
Name: rating, dtype: int64

_The most ratings given by any single user is 6; in other words, even the most active user has watched and rated only 6 movies._

In [10]:
#most watched movies
movie_num_rating = df.groupby('movie_id')['rating'].count().sort_values(ascending=False)
print('Top 10 Most Watched/Rated Movies')
movie_num_rating[:10]

Top 10 Most Watched/Rated Movies


movie_id
Movie127    2313
Movie140     578
Movie16      320
Movie103     272
Movie29      243
Movie91      128
Movie92      101
Movie89       83
Movie158      66
Movie108      54
Name: rating, dtype: int64

In [11]:
#least watched movies
print('Top 10 Least Watched/Rated Movies')
movie_num_rating[-10:]

Top 10 Least Watched/Rated Movies


movie_id
Movie153    1
Movie154    1
Movie47     1
Movie156    1
Movie46     1
Movie45     1
Movie42     1
Movie41     1
Movie38     1
Movie1      1
Name: rating, dtype: int64
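
_The task list also asks for the average rating of each movie, which the counts above do not cover. One possible way to compute it is sketched below on toy data; the `min_ratings` threshold is an assumption added here so that a movie with a single 5-star rating does not top the list._

```python
import numpy as np
import pandas as pd

# Hypothetical long-format ratings, shaped like `df` (user_id, movie_id, rating).
toy = pd.DataFrame({
    "user_id":  ["u1", "u2", "u3", "u1", "u2", "u3"],
    "movie_id": ["MovieA", "MovieA", "MovieA", "MovieB", "MovieB", "MovieB"],
    "rating":   [5.0, 4.0, np.nan, 5.0, np.nan, np.nan],
})

# mean() and count() both skip NaN, so unrated pairs do not distort the result.
stats = toy.groupby("movie_id")["rating"].agg(avg_rating="mean", num_ratings="count")

# Assumed threshold: ignore movies with too few ratings to trust the average.
min_ratings = 2
top = (
    stats[stats["num_ratings"] >= min_ratings]
    .sort_values("avg_rating", ascending=False)
    .head(5)
)
print(top)
```

_Without a threshold, the ranking would be dominated by the many movies that have exactly one rating._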

In [12]:
df_1 = df.groupby(['user_id','movie_id'],as_index=False)['rating'].mean()
df_1['rating'].value_counts()

5.0    3659
4.0     521
1.0     363
3.0     272
2.0     185
Name: rating, dtype: int64

In [14]:
df_1.shape

(998688, 3)
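
_`df_1` still has all 998,688 rows, while the value counts above add up to only 5,000 actual ratings, so the unrated pairs (rating `NaN`) remain in the table. A hedged sketch of restricting a long table to observed ratings before modelling, on toy data (the names `toy` and `rated` are illustrative):_

```python
import numpy as np
import pandas as pd

# Hypothetical long table in the same layout as `df_1`.
toy = pd.DataFrame({
    "user_id":  ["u1", "u1", "u2", "u2"],
    "movie_id": ["Movie1", "Movie2", "Movie1", "Movie2"],
    "rating":   [5.0, np.nan, np.nan, 3.0],
})

# Keep only the pairs that actually carry a rating.
rated = toy.dropna(subset=["rating"]).reset_index(drop=True)

# Surprise's Dataset.load_from_df expects the columns in this order:
# user ids, item ids, ratings.
rated = rated[["user_id", "movie_id", "rating"]]

# The observed range is what a Reader's rating_scale should describe.
observed_scale = (rated["rating"].min(), rated["rating"].max())
print(len(rated), observed_scale)
```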

## _Recommendation Model_

In [15]:
#import the surprise package
from surprise import Dataset
from surprise import KNNWithMeans
from surprise.reader import Reader
from surprise.model_selection import train_test_split
from surprise.model_selection import cross_validate
from surprise import accuracy

In [16]:
#read the dataset from the dataframe created
reader = Reader()
data = Dataset.load_from_df(df_1,reader)

In [17]:
#split the dataset into train and test set
train_set,test_set = train_test_split(data,test_size=0.20)
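
_The split is random, so results can differ between runs unless a seed is fixed (Surprise's `train_test_split` accepts a `random_state` argument for this). The idea of a seeded 80/20 split, sketched in plain Python on hypothetical rating tuples; `split_ratings` is an illustrative helper, not part of Surprise:_

```python
import random

# Hypothetical (user, movie, rating) tuples.
ratings = [("u%d" % i, "Movie%d" % (i % 3), float(1 + i % 5)) for i in range(10)]

def split_ratings(rows, test_size=0.2, seed=42):
    """Shuffle a copy of rows with a fixed seed and cut off the test share."""
    rows = list(rows)
    random.Random(seed).shuffle(rows)
    n_test = int(round(len(rows) * test_size))
    return rows[n_test:], rows[:n_test]

train_rows, test_rows = split_ratings(ratings)
print(len(train_rows), len(test_rows))
```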

In [18]:
#use user based recommendation model based on collaborative filtering 
algo = KNNWithMeans(k=40,sim_options={'name':'cosine','user_based':True})
#fit on the train set
algo.fit(train_set)

Computing the cosine similarity matrix...
Done computing similarity matrix.


<surprise.prediction_algorithms.knns.KNNWithMeans at 0x7f1b03e2d2d0>
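
_For intuition, a user-based k-NN with means estimates a rating as the user's own mean plus a similarity-weighted average of how far the neighbours' ratings sit from their own means. The sketch below is a simplified plain-Python version of that idea on hypothetical data. It is not Surprise's implementation and leaves out details such as minimum neighbour counts and clipping to the rating scale._

```python
import math

# Hypothetical ratings: user -> {movie: rating}.
ratings = {
    "u1": {"A": 5.0, "B": 3.0},
    "u2": {"A": 4.0, "B": 2.0, "C": 5.0},
    "u3": {"A": 1.0, "B": 5.0, "C": 2.0},
}

def user_mean(u):
    """Average of all ratings given by user u."""
    vals = ratings[u].values()
    return sum(vals) / len(vals)

def cosine(u, v):
    """Cosine similarity over the movies both users rated (0.0 if none)."""
    common = ratings[u].keys() & ratings[v].keys()
    if not common:
        return 0.0
    dot = sum(ratings[u][i] * ratings[v][i] for i in common)
    norm_u = math.sqrt(sum(ratings[u][i] ** 2 for i in common))
    norm_v = math.sqrt(sum(ratings[v][i] ** 2 for i in common))
    return dot / (norm_u * norm_v)

def predict(u, item, k=40):
    """Mean-centred k-NN estimate; falls back to the user's mean without neighbours."""
    # Candidate neighbours: other users who rated the item, with positive similarity.
    candidates = [(cosine(u, v), v) for v in ratings if v != u and item in ratings[v]]
    neighbours = sorted((s, v) for s, v in candidates if s > 0)[::-1][:k]
    weight_sum = sum(s for s, _ in neighbours)
    if weight_sum == 0:
        return user_mean(u)
    offset = sum(s * (ratings[v][item] - user_mean(v)) for s, v in neighbours)
    return user_mean(u) + offset / weight_sum

est = predict("u1", "C")
fallback = predict("u1", "Z")  # nobody rated "Z", so no neighbours exist
print(round(est, 3), fallback)
```

_With at most 6 ratings per user in this dataset, very few users share rated movies, so the fallback branch is likely to be taken often._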

In [19]:
#get a prediction for any user and movie id
uid = 'A0047322388NOTO4N8SKD'
iid =  'Movie127'
predict = algo.predict(uid,iid,verbose=True)

user: A0047322388NOTO4N8SKD item: Movie127   r_ui = None   est = 5.00   {'actual_k': 0, 'was_impossible': False}


In [20]:
#run the trained model against the test set
test_predict = algo.test(test_set)

In [21]:
#get the performance score
print('User based Recommendation System Model - RMSE on Test Set')
accuracy.rmse(test_predict,verbose=True)

User based Recommendation System Model - RMSE on Test Set
RMSE: nan


nan
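
_An RMSE of `nan` is what one would expect when any true rating in the test set is missing, because a single `NaN` propagates through the mean of squared errors. Since `df_1` still contains the unrated pairs, that seems the likely cause here. A small sketch with hypothetical (true rating, estimate) pairs:_

```python
import math

def rmse(pairs):
    """Root mean squared error over (true_rating, estimate) pairs."""
    return math.sqrt(sum((t - e) ** 2 for t, e in pairs) / len(pairs))

# Hypothetical test predictions; the second pair has no true rating.
pairs = [(5.0, 4.0), (float("nan"), 5.0), (3.0, 5.0)]

with_nan = rmse(pairs)  # the NaN poisons the whole sum

# Score only the pairs for which a true rating exists.
observed = [(t, e) for t, e in pairs if not math.isnan(t)]
without_nan = rmse(observed)

print(with_nan, round(without_nan, 4))
```

_Dropping the rows with missing ratings before building the Surprise dataset would be one way to obtain a meaningful score._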

In [22]:
from collections import defaultdict

In [23]:
def get_top_n(predictions, n=10):
    """Return the top-N recommendation for each user from a set of predictions.

    Args:
        predictions(list of Prediction objects): The list of predictions, as
            returned by the test method of an algorithm.
        n(int): The number of recommendations to output for each user. Default
            is 10.

    Returns:
    A dict where keys are user (raw) ids and values are lists of tuples:
        [(raw item id, rating estimation), ...] of size n.
    """

    # First map the predictions to each user.
    top_n = defaultdict(list)
    for uid, iid, true_r, est, _ in predictions:
        top_n[uid].append((iid, est))

    # Then sort the predictions for each user and retrieve the n highest ones.
    for uid, user_ratings in top_n.items():
        user_ratings.sort(key=lambda x: x[1], reverse=True)
        top_n[uid] = user_ratings[:n]

    return top_n
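
_A quick way to sanity-check this function without a fitted model is to feed it hand-made predictions. In the sketch below, the `Prediction` namedtuple only mimics the five-field shape that the loop unpacks; its field names and all values are assumptions. The function body is repeated so the block runs on its own._

```python
from collections import defaultdict, namedtuple

# Stand-in for a prediction record: five fields, as unpacked in get_top_n.
Prediction = namedtuple("Prediction", ["uid", "iid", "r_ui", "est", "details"])

def get_top_n(predictions, n=10):
    """Same logic as the function above, repeated for a self-contained check."""
    top_n = defaultdict(list)
    for uid, iid, true_r, est, _ in predictions:
        top_n[uid].append((iid, est))
    for uid, user_ratings in top_n.items():
        user_ratings.sort(key=lambda x: x[1], reverse=True)
        top_n[uid] = user_ratings[:n]
    return top_n

# Hypothetical predictions for two users.
fake = [
    Prediction("u1", "MovieA", None, 4.1, {}),
    Prediction("u1", "MovieB", None, 4.8, {}),
    Prediction("u1", "MovieC", None, 2.0, {}),
    Prediction("u2", "MovieA", None, 3.5, {}),
]

top = get_top_n(fake, n=2)
for uid, user_ratings in top.items():
    print(uid, [iid for (iid, _) in user_ratings])
```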


In [24]:
top_n = get_top_n(test_predict, n=10)

# Print the recommended items for each user
for uid, user_ratings in top_n.items():
    print(uid, [iid for (iid, _) in user_ratings])

AODZXY5RCM4QT ['Movie58', 'Movie135', 'Movie109', 'Movie1', 'Movie75', 'Movie92', 'Movie134', 'Movie18', 'Movie37', 'Movie202']
AQ8DU6XVA3USJ ['Movie25', 'Movie1', 'Movie43', 'Movie114', 'Movie40', 'Movie29', 'Movie200', 'Movie28', 'Movie202', 'Movie46']
A2T7TZBOTP68C7 ['Movie16', 'Movie195', 'Movie63', 'Movie180', 'Movie4', 'Movie6', 'Movie193', 'Movie20', 'Movie135', 'Movie99']
AKMMCNA8BDB5C ['Movie43', 'Movie22', 'Movie9', 'Movie57', 'Movie112', 'Movie91', 'Movie100', 'Movie59', 'Movie181', 'Movie188']
A2S6UVLBE48EY2 ['Movie202', 'Movie55', 'Movie32', 'Movie71', 'Movie88', 'Movie45', 'Movie77', 'Movie61', 'Movie108', 'Movie60']
A2BG16N3K3SQ6 ['Movie157', 'Movie73', 'Movie74', 'Movie171', 'Movie10', 'Movie29', 'Movie126', 'Movie75', 'Movie28', 'Movie76']
A1Z3K3GOK50S42 ['Movie30', 'Movie54', 'Movie82', 'Movie146', 'Movie111', 'Movie4', 'Movie79', 'Movie110', 'Movie120', 'Movie165']
A3HHER92LK8DA1 ['Movie24', 'Movie20', 'Movie194', 'Movie182', 'Movie175', 'Movie122', 'Movie110', 'Movi