# Challenge Description


In today’s technology-driven world, recommender systems are socially and economically critical for ensuring that individuals can make appropriate choices about the content they engage with on a daily basis. One application where this is especially true is movie recommendation, where intelligent algorithms can help viewers find great titles from tens of thousands of options.

With this context, EDSA is challenging you to construct a recommendation algorithm based on content or collaborative filtering, capable of accurately predicting how a user will rate a movie they have not yet viewed based on their historical preferences.

Providing an accurate and robust solution to this challenge has immense economic potential, with users of the system being exposed to content they would like to view or purchase - generating revenue and platform affinity.

# Problem Statement

The evaluation metric for this competition is Root Mean Square Error (RMSE). RMSE is commonly used in regression analysis and forecasting, and measures the standard deviation of the residuals between predicted and actual observed values for a modelling process. For our task of generating user movie ratings via recommendation algorithms, the formula is given by:

\\[ RMSE = \sqrt{\frac{1}{|\hat{R}|} \sum_{\hat{r}_{ui} \in \hat{R}} (r_{ui} - \hat{r}_{ui})^2} \\]

Where \\( \hat{R} \\) is the set of recommendations generated for users and movies (so \\( |\hat{R}| \\) is their total number), with \\( r_{ui} \\) and \\( \hat{r}_{ui} \\) being the true and predicted ratings for user \\( u \\) watching movie \\( i \\) respectively.
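As a quick numerical illustration of the metric, the sketch below computes RMSE with NumPy on a handful of made-up ratings. The function name `rmse` and the sample values are illustrative assumptions, not competition data.

```python
import numpy as np

def rmse(true_ratings, predicted_ratings):
    """Root Mean Square Error between two equal-length rating sequences."""
    true_ratings = np.asarray(true_ratings, dtype=float)
    predicted_ratings = np.asarray(predicted_ratings, dtype=float)
    residuals = true_ratings - predicted_ratings
    return float(np.sqrt(np.mean(residuals ** 2)))

# Made-up ratings for four user-movie pairs (illustrative only)
r_true = [4.0, 3.0, 5.0, 2.0]
r_pred = [3.5, 3.0, 4.0, 3.0]

example_rmse = rmse(r_true, r_pred)
print(example_rmse)
```

Because the errors are squared before averaging, the two predictions that miss by a full point dominate the score, while the exact prediction contributes nothing.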

### Submission Format
For every user-movie pair in the test file, submission files should contain two columns: Id and rating. 'Id' is a concatenation of the userId and movieId given in the test file using an '_' character. 'rating' is the predicted rating for a given user-movie pair.
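To make the two-column layout concrete, here is a minimal sketch that builds the `Id` column from a toy frame. The frame `toy_test` and the placeholder ratings are assumptions made for illustration only.

```python
import pandas as pd

# Hypothetical test rows with the two identifier columns
toy_test = pd.DataFrame({"userId": [1, 1, 2], "movieId": [2011, 4144, 31]})

# Join the two identifiers with an underscore to form the 'Id' column
toy_test["Id"] = toy_test["userId"].astype(str) + "_" + toy_test["movieId"].astype(str)

# Placeholder predictions, only to show the layout of the final file
toy_test["rating"] = [3.5, 4.0, 2.5]

toy_submission = toy_test[["Id", "rating"]]
print(toy_submission)
```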

## Installing packages
Install all relevant packages here. There is no terminal, so you will need to pip install everything from a notebook cell.

You can find a list of recommended installs in the Intro to Recommender System notebook.

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session


In [None]:
# Import packages here
# Packages for data processing
import numpy as np
import pandas as pd
import datetime
import math
import random
import json
import re
import scipy as sp
from scipy.sparse import csr_matrix
from scipy.sparse.linalg import svds
from sklearn import preprocessing
from sklearn.datasets import make_blobs
from sklearn.preprocessing import MinMaxScaler, StandardScaler
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Packages for visualization
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
from wordcloud import WordCloud, STOPWORDS


# Packages for modeling
from surprise import NormalPredictor
from surprise import Reader
from surprise import Dataset
from surprise import KNNBasic, KNNWithMeans, KNNWithZScore, KNNBaseline
from surprise import BaselineOnly, SlopeOne, CoClustering
from surprise.model_selection import cross_validate
from surprise import SVD
from surprise import SVDpp
from surprise import NMF
from surprise.model_selection import train_test_split
from surprise.model_selection import GridSearchCV
import heapq

# Packages for model evaluation
from sklearn.metrics import mean_squared_error
from sklearn.metrics import mean_absolute_error
from time import time

# Package to suppress warnings
import warnings
warnings.filterwarnings("ignore")

# Packages for saving models
import pickle

## Reading in data

In [None]:
df_sample_submission = pd.read_csv('/kaggle/input/edsa-movie-recommendation-challenge/sample_submission.csv')
df_movies = pd.read_csv('/kaggle/input/edsa-movie-recommendation-challenge/movies.csv')
df_imdb = pd.read_csv('/kaggle/input/edsa-movie-recommendation-challenge/imdb_data.csv')
df_genome_scores = pd.read_csv('/kaggle/input/edsa-movie-recommendation-challenge/genome_scores.csv')
df_genome_tags = pd.read_csv('/kaggle/input/edsa-movie-recommendation-challenge/genome_tags.csv')
df_train = pd.read_csv('/kaggle/input/edsa-movie-recommendation-challenge/train.csv')
df_test = pd.read_csv('/kaggle/input/edsa-movie-recommendation-challenge/test.csv')
df_tags = pd.read_csv('/kaggle/input/edsa-movie-recommendation-challenge/tags.csv')
df_links = pd.read_csv('/kaggle/input/edsa-movie-recommendation-challenge/links.csv')

## Dataset Overview
Let's first look at the shape (number of entries/rows and columns) of the datasets in order to have a general overview.

In [None]:
# List of the imported dataframes
dfs = [df_train, df_test, df_genome_scores, df_genome_tags, df_imdb, df_links, df_movies, df_tags]
# Names of the imported datasets, in the same order
df_names = ['train', 'test', 'genome_scores', 'genome_tags',
            'imdb_data', 'links', 'movies', 'tags']
dfs_dict = {}  # maps dataset name -> [rows, columns]
for name, data in zip(df_names, dfs):  # iterate over names and dataframes together
    dfs_dict[name] = [data.shape[0], data.shape[1]]
# Build the summary table once, after the loop
df_prop = pd.DataFrame(dfs_dict,
                       index=['rows', 'columns']).transpose()
df_properties = df_prop.sort_values(by='rows', ascending=False)

df_properties  # view the final output

In [None]:
df_movies.head()

In [None]:
df_sample_submission.head()

In [None]:
df_imdb.head()

In [None]:
df_genome_scores.head()

In [None]:
df_genome_tags.head()

In [None]:
df_train.head()

In [None]:
df_test.head()

In [None]:
df_tags.head()

In [None]:
df_links.head()

### Analysis of imdb_data
The plot below is a visual representation of the different columns in the imdb_data dataset with their percentage of missing values.

A high proportion of movies have no budget, director or title cast recorded. Such high proportions of missing data largely disqualify this dataset from our current modelling task.

In [None]:
# Number and percentage of missing values in each column
total = df_imdb.isnull().sum().sort_values(ascending=False)
percent_1 = df_imdb.isnull().sum()/df_imdb.isnull().count()*100
percent_2 = (round(percent_1, 1)).sort_values(ascending=False)
missing_data = pd.concat([total, percent_2],
                         axis=1, keys=['Total', '(%) missing'])
missing_data['(%) missing'].plot(kind='barh')
plt.xlabel('(%) Missing Values')
plt.ylabel('Columns with Missing Values')
plt.title('Percentage of Missing Values per Column')
plt.show()
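The same percentages can be obtained more compactly, because the mean of a boolean column is the fraction of `True` values. The sketch below shows this on a toy frame; `toy_imdb` and its values are made up for illustration.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for imdb_data (column names and values are made up)
toy_imdb = pd.DataFrame({
    "movieId": [1, 2, 3, 4],
    "director": ["A", None, "C", None],
    "budget": [np.nan, np.nan, np.nan, "$1,000"],
})

# isnull() yields booleans, so the column mean is the fraction of missing values
missing_pct = (toy_imdb.isnull().mean() * 100).round(1).sort_values(ascending=False)
print(missing_pct)
```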

## Duplicates

In [None]:
# Counting unique users and movies in the train dataset, plus fully duplicated rows
users = df_train.userId.nunique()
items = df_train.movieId.nunique()
duplicates = df_train.duplicated().sum()
print('There are {} unique users and {} unique movies in the train dataset '
      'with {} duplicated entries'.format(users, items, duplicates))
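A row can also be a duplicate in a weaker sense: the same user rating the same movie more than once, possibly with different scores. The sketch below, on made-up data, contrasts the two checks. The frame `toy_ratings` is an assumption for illustration.

```python
import pandas as pd

# Made-up ratings: row 1 repeats row 0 exactly, row 3 repeats the pair of row 2
toy_ratings = pd.DataFrame({
    "userId": [1, 1, 2, 2, 3],
    "movieId": [10, 10, 10, 10, 20],
    "rating": [4.0, 4.0, 3.0, 3.5, 5.0],
})

# Rows that repeat an earlier row in every column
full_duplicates = int(toy_ratings.duplicated().sum())

# Rows that repeat an earlier (userId, movieId) pair, even if the rating differs
pair_duplicates = int(toy_ratings.duplicated(subset=["userId", "movieId"]).sum())

print(full_duplicates, pair_duplicates)
```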

## Merging of Datasets
Now that we have a basic understanding of the data we are working with, we merge the sets below for more in-depth analysis in the EDA section.

In [None]:
# Combining both train and movies datasets by using movieId
# as the matching column between both datasets
train_movies_df = pd.merge(df_train,
                           df_movies,
                           how='left',
                           on='movieId')

# Combining all the observations in train_movies_df with imdb_data
# using movieId as the matching column between both dataframes
movies_metadata_df = pd.merge(train_movies_df,
                              df_imdb,
                              how='left',
                              on='movieId')

movies_metadata_df.head()
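When merging on `movieId`, it can help to confirm that the left merge did not multiply rows and to count ratings that found no metadata. The sketch below uses the `validate` and `indicator` options of `pd.merge` on toy frames. The frames and their values are assumptions for illustration.

```python
import pandas as pd

# Toy stand-ins for the train and movies datasets (values are made up)
toy_train = pd.DataFrame({"userId": [1, 2, 3],
                          "movieId": [10, 20, 99],
                          "rating": [4.0, 3.5, 2.0]})
toy_movies = pd.DataFrame({"movieId": [10, 20],
                           "title": ["Movie A", "Movie B"]})

# validate raises if movieId is not unique on the right-hand side;
# indicator adds a '_merge' column recording where each row came from
merged = pd.merge(toy_train, toy_movies, how="left", on="movieId",
                  validate="many_to_one", indicator=True)

# Ratings whose movieId has no match in the movies table
unmatched = int((merged["_merge"] == "left_only").sum())
print(len(merged), unmatched)
```

A left merge with a unique key on the right keeps the number of rating rows unchanged, which is what the length check confirms.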

In [None]:
movies_ranking = movies_metadata_df[['title','rating']].groupby('title').mean().sort_values('rating', ascending=False)
movies_ranking.head()

## EDA
Discovery phase and data understanding

In [None]:
#dropping the timestamp column because we don't need it
df_train = df_train.drop('timestamp', axis = 1)

In [None]:
movie_ratings = pd.merge(df_movies, df_train)
movie_ratings.head()

### Descriptive Statistics
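The descriptive statistics of the movies dataset below do not provide any useful information, since `describe()` only summarises numeric columns and here those are identifiers rather than measurements.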

In [None]:

df_movies.describe()

In [None]:
df_train.describe()

### Data Visualizations
Distribution of rating scores in the train dataset.
In the plot below, we see that the rating scale ranges from 0.5 to 5.0 in increments of 0.5. The most prevalent ratings are 3.0 and 4.0, with 5.0 coming in third. We also see that people were less likely to give low ratings, as evidenced by the low number of ratings between 0.5 and 2.5.

In [None]:

fig, ax = plt.subplots(figsize = (14,8))
sns.countplot(x = df_train.rating)
plt.suptitle('Frequency Distribution of Rating Scores', fontsize = 20);
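The shape of the distribution can also be expressed as proportions rather than raw counts. The sketch below does this with `value_counts(normalize=True)` on a small made-up series. The values are illustrative and do not reflect the real dataset.

```python
import pandas as pd

# Made-up ratings (illustrative only)
toy_ratings = pd.Series([4.0, 3.0, 4.0, 5.0, 0.5, 4.0, 3.0, 2.5], name="rating")

# Share of each rating value, ordered along the rating scale
rating_share = toy_ratings.value_counts(normalize=True).sort_index()

# Combined share of the low ratings (0.5 to 2.5)
low_share = float(rating_share[rating_share.index <= 2.5].sum())

print(rating_share)
print(low_share)
```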

### Distribution of User Ratings

In [None]:
print(f'Average rating in the dataset: {np.mean(df_train["rating"])}')

with sns.axes_style('white'):
    # catplot replaces factorplot, which is no longer available in recent seaborn releases
    g = sns.catplot(x="rating", data=df_train, aspect=2.5, kind='count')
    g.set_ylabels("Total number of ratings")

#### Average Rating per Genre
The genres with the highest average ratings

In [None]:
# Subsetting the genres and rating columns from movie_ratings
genre_ratingdf = movie_ratings.loc[:, ['genres', 'rating']]

# Unravel the genres column: split on '|' and give each movie-genre pair its own row
genre_ratingdf = (genre_ratingdf
                  .assign(genres=genre_ratingdf['genres'].str.split('|'))
                  .explode('genres')
                  .reset_index(drop=True))

# group by genres and their mean rating score and resetting the index
genre_mean_rating = pd.DataFrame(genre_ratingdf.groupby('genres')['rating'].mean()).reset_index()

# plotting the figure
fig, ax = plt.subplots(figsize=(14, 8))
sns.barplot(x = 'genres', y = 'rating', data = genre_mean_rating)

ax.tick_params(axis='x', labelsize= 15, rotation=90)
ax.set_xlabel('Genres', fontsize= 20)
ax.set_ylabel('Average Rating Score',fontsize= 20)
plt.suptitle('Average Rating per Genre', fontsize= 25);

### Observation

The rating distribution plotted earlier shows that most ratings are 4.0, followed by 3.0, while the least frequent ratings are 0.5 and 1.5. The mean rating is around 3.5, revealing that users tend to give higher ratings to movies in general.

In [None]:
movies_ranking['No_of_ratings'] = movies_metadata_df.groupby('title')['rating'].count()

In [None]:
movies_ranking.sort_values(by=['No_of_ratings', 'rating'],
                          ascending=False).head()

In [None]:
# Set plot size
sns.set(rc={'figure.figsize':(12,9)})

# Plot Number of rating for every rating category.
sns.scatterplot(x='rating', y='No_of_ratings', data=movies_ranking)
plt.title('Number of ratings per average rating per movie')
plt.xlabel('Rating')
plt.ylabel('Number of ratings')
plt.show()

The scatterplot above suggests a positive relationship between the number of ratings a movie receives and its average rating, i.e. movies that have more ratings (views) tend to also have higher average ratings. This is consistent with the earlier observation that users tend to give higher ratings in general. The plot below similarly shows that, even for movies with 100 or more ratings (views), the average rating stays consistent around 3.5.
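A relationship like the one described above can be quantified rather than judged by eye. The sketch below computes a Pearson correlation between the number of ratings and the mean rating per movie on made-up data. The frame `toy_ratings` and the column names `mean_rating` and `n_ratings` are assumptions for illustration.

```python
import pandas as pd

# Made-up ratings for three movies (illustrative only)
toy_ratings = pd.DataFrame({
    "title": ["A", "A", "A", "A", "B", "B", "C"],
    "rating": [4.0, 4.5, 5.0, 4.5, 3.0, 4.0, 2.0],
})

# Mean rating and number of ratings per movie, computed in one pass
per_movie = toy_ratings.groupby("title")["rating"].agg(mean_rating="mean",
                                                       n_ratings="count")

# Pearson correlation between popularity and average rating
pearson_r = per_movie["mean_rating"].corr(per_movie["n_ratings"])
print(per_movie)
print(pearson_r)
```

On real data the counts are usually heavily skewed, so correlating the ranks of the two columns (`per_movie.rank()`) may give a more robust picture than the raw values.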

In [None]:
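# Average rating and number of ratings for each movie in the dataset
movie_stats = df_train.groupby('movieId')['rating'].agg(['mean', 'count'])

# Keep only movies with 100 or more ratings, to match the plot title
avg_rating = movie_stats.loc[movie_stats['count'] >= 100, 'mean']

# Plotting the results
plt.figure(figsize=(12,10))
avg_rating.plot(kind='hist')
plt.ylabel('Frequency')
plt.xlabel('Movie Rating')
plt.title('Average ratings of movies with 100 or more viewers')
plt.show()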

Below we look at the average rating for individual movie directors. From the resulting dataframe it is clear that certain directors like Quentin Tarantino, Lilly Wachowski, and Stephen King have both higher-than-average average ratings AND higher numbers of ratings (views).

In [None]:
best_director = pd.DataFrame(movies_metadata_df.groupby('director')['rating'].mean().
                             sort_values(ascending=False))
best_director['No_of_ratings'] = movies_metadata_df.groupby('director')['rating'].count()
best_director.sort_values(by=['No_of_ratings', 'rating'], ascending=False).head(10)
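An average based on one or two ratings is not very informative, so it can be useful to rank directors only once they have enough ratings. The sketch below applies such a cut-off on made-up data. The threshold `min_ratings` and the toy frame are hypothetical choices for illustration.

```python
import pandas as pd

# Made-up director ratings (illustrative only)
toy = pd.DataFrame({
    "director": ["X", "X", "X", "Y", "Z", "Z"],
    "rating": [4.0, 4.5, 5.0, 5.0, 3.0, 3.5],
})

min_ratings = 2  # hypothetical threshold, to be tuned on the real data

# Average rating and number of ratings per director
director_stats = toy.groupby("director")["rating"].agg(avg_rating="mean",
                                                       no_of_ratings="count")

# Keep directors with enough ratings, then rank by average rating
reliable = (director_stats[director_stats["no_of_ratings"] >= min_ratings]
            .sort_values("avg_rating", ascending=False))
print(reliable)
```

Here director Y has the highest average but only a single rating, so the filter drops Y from the ranking.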

The plot below shows the distribution of ratings for directors, extending the positive correlation between 'number of ratings' and 'rating' to 'directors' as well.

In [None]:
# Set plot size
sns.set(rc={'figure.figsize':(12,9)})

sns.scatterplot(x = 'rating', y = 'No_of_ratings', data = best_director).set_title('Number of ratings per average rating per director')
plt.xlabel('Ratings')
plt.ylabel('Number of Ratings')
plt.show()

**Most common Genres**

In [None]:
# Create dataframe containing only the movieId and genres
movies_genres = df_movies[['movieId', 'genres']].copy()

# Split genres separated by "|" and create a list containing the genres allocated to each movie
movies_genres.genres = movies_genres.genres.apply(lambda x: x.split('|'))

# Create expanded dataframe where each movie-genre combination is in a separate row
movies_genres = pd.DataFrame([(tup.movieId, d) for tup in movies_genres.itertuples() for d in tup.genres],
                             columns=['movieId', 'genres'])

movies_genres.head()

In [None]:
# Plot the genres from most common to least common
plot = plt.figure(figsize=(15, 10))
plt.title('Most common genres\n', fontsize=20)
sns.countplot(y="genres", data=movies_genres,
              order=movies_genres['genres'].value_counts(ascending=False).index,
              palette='Reds_r')
plt.show()

## Data Preparation

In [None]:
# Creating a small sample of the training data to evaluate our models quickly
tests = df_train.head(20000).copy()

# Loading the sample as a Surprise Dataset
reader = Reader(rating_scale=(0.5, 5))
test_data = Dataset.load_from_df(tests[['userId','movieId','rating']], reader)

# Compute similarities between users using cosine similarity
sim_options = {"name": "cosine",
               "user_based": True}

# Evaluate the user-based model with 5-fold cross-validation
user = KNNWithMeans(sim_options=sim_options)
cv = cross_validate(user, test_data, cv=5, measures=['RMSE'], verbose=True)

In [None]:
# Compute similarities between items using cosine similarity
sim_options = {"name": "cosine",
               "user_based": False}

# Instantiate the item-based KNNWithMeans algorithm
item_based = KNNWithMeans(sim_options=sim_options)

# Evaluate the model 
cv = cross_validate(item_based, test_data, cv=5, measures=['RMSE'], verbose=True)
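To give some intuition for the `cosine` option used in the two cells above, the sketch below computes a cosine similarity between two users over the movies both have rated. This is a simplified illustration under that assumption, not the library's implementation. The dict-based representation and the function name `cosine_sim_common` are hypothetical.

```python
import numpy as np

def cosine_sim_common(ratings_u, ratings_v):
    """Cosine similarity over items rated by both users.

    Each argument maps item id -> rating (hypothetical representation).
    Returns 0.0 when the two users share no rated items.
    """
    common = sorted(set(ratings_u) & set(ratings_v))
    if not common:
        return 0.0
    u = np.array([ratings_u[i] for i in common], dtype=float)
    v = np.array([ratings_v[i] for i in common], dtype=float)
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# Made-up users: a and b agree on the two movies they share, c shares none with a
user_a = {10: 4.0, 20: 3.0, 30: 5.0}
user_b = {10: 4.0, 20: 3.0, 40: 1.0}
user_c = {50: 2.0}

sim_ab = cosine_sim_common(user_a, user_b)
sim_ac = cosine_sim_common(user_a, user_c)
print(sim_ab, sim_ac)
```

Setting `user_based` to `False` applies the same idea to pairs of movies, comparing them over the users who rated both.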

## Modelling phase

Here you can apply the models outlined in the Intro to Recommender Notebook. You only need to apply one version, be it a content-based or a collaborative method.



In [None]:
# Loading the data as a Surprise Dataset (ratings range from 0.5 to 5)
reader = Reader(rating_scale=(0.5, 5))
data = Dataset.load_from_df(df_train[['userId', 'movieId', 'rating']], reader)

In [None]:
# Data split 85/15
trainset, testset = train_test_split(data, test_size=0.15)

In [None]:
co_clust = CoClustering()

In [None]:
# Fitting our trainset
co_clust.fit(trainset)

# Using the 15% testset to make predictions
predictions = co_clust.test(testset)

# Converting the list of predictions into a dataframe
test = pd.DataFrame(predictions)

In [None]:
# View the head
test.head()

In [None]:
# We want a predicted rating for every userId / movieId pair in the test set.
# itertuples is considerably faster than iterrows on a dataframe of this size.
ratings_predictions = [co_clust.predict(row.userId, row.movieId)
                       for row in df_test.itertuples(index=False)]
ratings_predictions[:5]  # preview the first few predictions

In [None]:

# Converting our prediction into a familiar format-Dataframe
df_pred=pd.DataFrame(ratings_predictions)
df_pred

In [None]:

# Renaming our predictions to original names
df_pred=df_pred.rename(columns={'uid':'userId', 'iid':'movieId','est':'rating'})
df_pred.drop(['r_ui','details'],axis=1,inplace=True)

In [None]:
# Snippet of our ratings
df_pred.head()

In [None]:
# Concatenating userId and movieId into a single Id column.
# Casting to int first avoids Ids such as '1.0_2011.0' if the values were upcast to float.
df_pred['Id'] = df_pred['userId'].astype(int).astype(str) + '_' + df_pred['movieId'].astype(int).astype(str)

In [None]:
# Keep only the two submission columns, in the order Id, rating
df_pred = df_pred[['Id', 'rating']]

In [None]:
# Shape of the prediction dataset
df_pred.shape

### Prepare Submission File
We make submissions as CSV files. The submission needs two columns: an ID column and a prediction column. Here the ID column is named 'Id' and is built from the userId and movieId in the test data, while the prediction column uses the name of the target field, 'rating'.

We will create a DataFrame with this data, and then use the dataframe's to_csv method to write our submission file. Explicitly include the argument index=False to prevent pandas from adding another column in our csv file.
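Before writing the file, a few sanity checks can catch the most common formatting mistakes. The sketch below is one possible checker, shown on made-up frames. The function `check_submission` and the list of expected Ids are hypothetical helpers, not part of the competition tooling.

```python
import pandas as pd

def check_submission(submission, expected_ids):
    """Return a list of problems found in a submission frame (empty if none)."""
    problems = []
    if list(submission.columns) != ["Id", "rating"]:
        problems.append("columns must be exactly ['Id', 'rating']")
        return problems
    if submission["rating"].isnull().any():
        problems.append("missing ratings")
    if submission["Id"].duplicated().any():
        problems.append("duplicate Ids")
    if set(submission["Id"]) != set(expected_ids):
        problems.append("Ids do not match the expected set")
    return problems

# Made-up examples: one well-formed frame and one with two problems
expected = ["1_10", "2_20"]
good = pd.DataFrame({"Id": ["1_10", "2_20"], "rating": [3.5, 4.0]})
bad = pd.DataFrame({"Id": ["1.0_10.0", "2_20"], "rating": [3.5, None]})

good_problems = check_submission(good, expected)
bad_problems = check_submission(bad, expected)
print(good_problems)
print(bad_problems)
```

In practice the expected Ids could be taken from the `Id` column of the sample submission file, assuming it lists every required user-movie pair.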

In [None]:
# Submission final csv. file
df_pred.to_csv("coClustering.csv", index=False)


### Make Submission
Hit the blue Publish button at the top of your notebook screen. It will take some time for your kernel to run. When it has finished your navigation bar at the top of the screen will have a tab for Output. This only shows up if you have written an output file (like we did in the Prepare Submission File step).

Example below of how the output would look once published

In [None]:
# initialize list of lists
data = [['tom', 10], ['nick', 15], ['juli', 14]]
  
# Create the pandas DataFrame
df = pd.DataFrame(data, columns = ['Name', 'Age'])
  
# print dataframe.
df

In [None]:
df.to_csv('my_test_output.csv', index = False)

## Collaborators
1. Kwanda Silekwa
2. Nomfundo Manyisa
3. Sihle Riti
4. Thanyani Khedzi
5. Thembinkosi Malefo
6. Ofentse Makeketlane
