
# Generating Movie Recommendations

### Unsupervised_Learning_ZM6

1. Noxolo Ngcobo
2. Mora Magakwe
3. Sandra Malope
4. Katleho Moketo
5. Shuaib Morris
6. Matthews Montle


## Introduction

Recommendation systems matter more and more as the volume of available content grows. People have limited time to sift through their options, and a recommender helps them reach a good choice quickly.

The purpose of a recommendation system is to surface content that an individual is likely to find interesting. Recommendation systems are AI-based algorithms that skim through all possible options and build a personalised list of relevant items for each user. The results are based on the user's profile, their search and browsing history, what other people with similar traits or demographics are watching, and how likely the user is to watch a given movie. This is achieved through predictive modelling and heuristics applied to the available data.

With this context, EDSA is challenging us to construct a recommendation algorithm based on content or collaborative filtering, capable of accurately predicting how a user will rate a movie they have not yet viewed based on their historical preferences.

Providing an accurate and robust solution to this challenge has immense economic potential, with users of the system being exposed to content they would like to view or purchase - generating revenue and platform affinity.


<div align="center" style="width: 900px; font-size: 80%; text-align: center; margin: 0 auto">
<img src="https://raw.githubusercontent.com/kmoketo-dev/ZM6_Unsurpervised-Learning-Predict/main/intro.jpg"
     alt="Titanic"
     style="float: center; padding-bottom=0.5em"
     width=900px/>

</div>


In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

## Installing packages
Install and import all the required packages in the cell below. There is no terminal in this environment, so any missing package has to be installed from a notebook cell with `pip install`.

A list of recommended packages can be found in the Intro to Recommender Systems notebook.

In [None]:
# Install packages here
# Packages for data processing
import numpy as np
import pandas as pd
import datetime
from sklearn import preprocessing
from sklearn.datasets import make_blobs
from sklearn.preprocessing import StandardScaler
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
import re
from scipy.sparse import csr_matrix
import scipy as sp


# Packages for visualization
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

# Packages for modeling
from surprise import Reader
from surprise import Dataset
from surprise import KNNWithMeans
from surprise import KNNBasic
from surprise.model_selection import cross_validate
from surprise.model_selection import GridSearchCV
from surprise import SVD
from surprise import SVDpp
from surprise import NMF
from surprise import SlopeOne
from surprise import CoClustering
import heapq

# Packages for model evaluation
from sklearn.metrics import mean_squared_error
from sklearn.metrics import mean_absolute_error
from time import time

# Package to suppress warnings
import warnings
warnings.filterwarnings("ignore")

# Packages for saving models
import pickle

## Reading in data

In [2]:
# df_sample_submission = pd.read_csv('/kaggle/input/edsa-movie-recommendation-challenge/sample_submission.csv')
# df_movies = pd.read_csv('/kaggle/input/edsa-movie-recommendation-challenge/movies.csv')
# df_imdb = pd.read_csv('/kaggle/input/edsa-movie-recommendation-challenge/imdb_data.csv')
# df_genome_scores = pd.read_csv('/kaggle/input/edsa-movie-recommendation-challenge/genome_scores.csv')
# df_genome_tags = pd.read_csv('/kaggle/input/edsa-movie-recommendation-challenge/genome_tags.csv')
# df_train = pd.read_csv('/kaggle/input/edsa-movie-recommendation-challenge/train.csv')
# df_test = pd.read_csv('/kaggle/input/edsa-movie-recommendation-challenge/test.csv')
# df_tags = pd.read_csv('/kaggle/input/edsa-movie-recommendation-challenge/tags.csv')
# df_links = pd.read_csv('/kaggle/input/edsa-movie-recommendation-challenge/links.csv')

In [None]:
df_sample_submission = pd.read_csv('sample_submission.csv')
df_movies = pd.read_csv('movies.csv')
df_imdb = pd.read_csv('imdb_data.csv')
df_genome_scores = pd.read_csv('genome_scores.csv')
df_genome_tags = pd.read_csv('genome_tags.csv')
df_train = pd.read_csv('train.csv')
df_test = pd.read_csv('test.csv')
df_tags = pd.read_csv('tags.csv')
df_links = pd.read_csv('links.csv')

In [None]:
df_movies.head()

In [None]:
df_imdb.head()

In [None]:
df_genome_scores.head()

In [None]:
df_genome_tags.head()

In [None]:
df_train.head()

In [None]:
df_tags.head()

In [None]:
df_links.head()

In [None]:
df_test.head()

In [None]:
df_movies.shape

In [None]:
df_train.shape

In [None]:
df_movies.describe()

In [None]:
df_train.describe()

In [None]:
dataset = pd.merge(df_train,df_movies,on = 'movieId')

In [None]:
dataset.head()

In [None]:
dataset.shape

In [None]:
dataset.nunique()

In [None]:
dataset.describe()

## EDA
Discovery phase and data understanding

In [None]:
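# Average rating per movie title.
# The full series is kept (no .head() on the assignment) so that later cells use every movie.
a = dataset.groupby('title')['rating'].mean()
a.head()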

In [None]:
# Number of ratings each movie received
b = dataset.groupby('title')['rating'].count()
b.head()

In [None]:
#making a new dataframe
new_record = pd.DataFrame()
new_record['Average_ratings']=a

In [None]:
new_record['Count of total ratings'] = b
new_record.head()
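
The two per-title statistics can also be computed in a single `groupby` pass with named aggregation. The following is a minimal sketch on a toy frame; the frame `toy` and the column names `average_rating` and `rating_count` are illustrative assumptions, not part of the notebook's data.

```python
import pandas as pd

# Toy stand-in for the merged ratings table (hypothetical values)
toy = pd.DataFrame({
    "title": ["A", "A", "B", "B", "B", "C"],
    "rating": [4.0, 5.0, 3.0, 2.0, 4.0, 1.5],
})

# Named aggregation computes the mean and the count in one groupby pass
summary = toy.groupby("title")["rating"].agg(
    average_rating="mean",
    rating_count="count",
)
print(summary)
```

Computing both columns together keeps them aligned on the same index, which avoids accidentally pairing a truncated series with a full one.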

In [None]:
# Histogram of the number of ratings per movie
plt.figure(figsize=(10,9))

new_record['Count of total ratings'].hist(bins = 100)

In [None]:
# Histogram of the average rating per movie
plt.figure(figsize =(10,4))

new_record['Average_ratings'].hist(bins = 70)

In [None]:
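# jointplot creates its own figure, so the size is set through `height`
# rather than a separate plt.figure() call (which would only leave an empty figure)
sns.jointplot(x='Average_ratings', y='Count of total ratings', data=new_record, alpha=0.4, height=8)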

In [None]:
movies_df=pd.merge(df_train, df_movies,on='movieId',how='inner')
movies_df.head()

**Genre combinations with the most ratings**

In [None]:
genre_count=movies_df['genres'].value_counts().sort_values(ascending=False)
genre_count=pd.DataFrame(genre_count)
top_genre=genre_count[0:11]
top_genre

In [None]:
# Number of ratings submitted by each user.
# rename_axis/reset_index(name=...) gives the same column names on old and new pandas versions.
user_df = (df_train['userId'].value_counts()
           .rename_axis('userId')
           .reset_index(name='count'))
user_df.head(10)

In [None]:
plt.figure(figsize=(14,7))
data = df_train['userId'].value_counts().head(10)
ax = sns.barplot(x = data.index, y = data, order= data.index, palette='CMRmap', edgecolor="black")
plt.title(f'Top 10 Users by Number of Ratings', fontsize=14)
plt.xlabel('User ID')
plt.ylabel('Number of Ratings')
plt.show()

The output above shows that the user with ID **72315** has the highest rating count, **12952**.

In [None]:
# Count how many times each rating value was given.
# rename_axis/reset_index(name=...) gives stable column names across pandas versions.
movieRating_Group = (df_train['rating'].value_counts().sort_index()
                     .rename_axis('rating').reset_index(name='count'))
fig, ax = plt.subplots(figsize=(14,7))
sns.barplot(data=movieRating_Group, x='rating', y='count', palette="CMRmap", edgecolor="black", ax=ax)
ax.set_xlabel("Rating")
# Each row of df_train is one rating, so the bars count ratings rather than distinct users
ax.set_ylabel('Number of Ratings')
ax.set_yticklabels(['{:,}'.format(int(x)) for x in ax.get_yticks().tolist()])
plt.title('Number of Ratings per Rating Value', fontsize=14)
plt.show()

**Most common Genres**

In [None]:
# Create dataframe containing only the movieId and genres
movies_genres = pd.DataFrame(df_movies[['movieId', 'genres']],
                             columns=['movieId', 'genres'])

# Split genres separated by "|" and create a list containing the genres allocated to each movie
movies_genres.genres = movies_genres.genres.apply(lambda x: x.split('|'))

# Create expanded dataframe where each movie-genre combination is in a separate row
movies_genres = pd.DataFrame([(tup.movieId, d) for tup in movies_genres.itertuples() for d in tup.genres],
                             columns=['movieId', 'genres'])

movies_genres.head()
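
The same one-row-per-genre layout can be produced with pandas' `str.split` and `explode`. This is a sketch on a toy frame (`toy_movies` and its values are made up for illustration), shown as an alternative to the list comprehension rather than a replacement for it.

```python
import pandas as pd

# Toy stand-in for df_movies[['movieId', 'genres']] (hypothetical values)
toy_movies = pd.DataFrame({
    "movieId": [1, 2],
    "genres": ["Adventure|Comedy", "Drama"],
})

# Split the pipe-separated string into a list, then give each list item its own row
long_genres = (
    toy_movies.assign(genres=toy_movies["genres"].str.split("|"))
    .explode("genres")
    .reset_index(drop=True)
)
print(long_genres)
```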

In [None]:
# Plot the genres from most common to least common
plot = plt.figure(figsize=(15, 10))
plt.title('Most common genres\n', fontsize=20)
sns.countplot(y="genres", data=movies_genres,
              order=movies_genres['genres'].value_counts(ascending=False).index)
plt.show()

In [None]:
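# Extract the release year from the end of each title, e.g. "Toy Story (1995)".
# Titles without a parsable year get the sentinel value 9999.
dates = []
for title in movies_df['title']:
    # Some titles carry a trailing space after the closing parenthesis
    year = title[-6:-2] if title[-1] == " " else title[-5:-1]
    try:
        dates.append(int(year))
    except ValueError:
        dates.append(9999)

movies_df['Year'] = dates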
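
A regular expression is another way to pull the year out of a title, and it does not depend on exact character positions. The sketch below assumes titles end with a four-digit year in parentheses, optionally followed by whitespace; the sample titles are made up.

```python
import pandas as pd

# Hypothetical titles: a normal one, one with a trailing space, one without a year
titles = pd.Series(["Toy Story (1995)", "Heat (1995) ", "Untitled Project"])

# Capture four digits inside parentheses at the end of the title,
# allowing optional trailing whitespace
year = titles.str.extract(r"\((\d{4})\)\s*$", expand=False)

# Convert to numbers and keep the 9999 sentinel for titles with no year
year = pd.to_numeric(year).fillna(9999).astype(int)
print(year.tolist())
```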

**Number of ratings by release year**

In [None]:
plt.figure(figsize=(12,10))
sns.set(style="darkgrid")
ax = sns.countplot(y="Year", data=movies_df, palette="Set2", order=movies_df['Year'].value_counts().index[0:15])

Because `movies_df` holds one row per rating, the plot counts ratings by release year: movies released in **1995** received the most ratings.

In [None]:
dataset.groupby('title')['rating'].count().sort_values(ascending=False).head()

In [None]:
ratings = pd.DataFrame(dataset.groupby('title')['rating'].mean())
ratings.head()

In [None]:
ratings['num of ratings'] = pd.DataFrame(dataset.groupby('title')['rating'].count())
ratings.head()

In [None]:
plt.figure(figsize=(10,4))
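# Two bins hide the shape of this heavy-tailed distribution; use the same bin count as the next plot
ratings['num of ratings'].hist(bins=70)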

In [None]:
plt.figure(figsize=(10,4))
ratings['rating'].hist(bins=70)

In [None]:
sns.jointplot(x='rating',y='num of ratings',data=ratings,alpha=0.5)

In [None]:
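# Parse the publication year from each title; 9999 marks titles with no parsable year
years = []

for title in df_movies['title']:
    # strip() handles titles that end with a trailing space
    yearpub_subset = title.strip()[-5:-1]
    try:
        years.append(int(yearpub_subset))
    except ValueError:
        years.append(9999)

df_movies['yearpub'] = years
# Number of movies whose year could not be parsed
print(len(df_movies[df_movies['yearpub'] == 9999]))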

In [None]:
def make_histogram(dataset, attribute, bins=25, bar_color='#3498db', edge_color='#2980b9', title='Title', xlab='X', ylab='Y', sort_index=False):
    if attribute == 'yearpub':
        dataset = dataset[dataset['yearpub'] != 9999]
        
    fig, ax = plt.subplots(figsize=(14, 7))
    ax.spines['top'].set_visible(False)
    ax.spines['right'].set_visible(False)
    ax.set_title(title, fontsize=24, pad=20)
    ax.set_xlabel(xlab, fontsize=16, labelpad=20)
    ax.set_ylabel(ylab, fontsize=16, labelpad=20)
    
    plt.hist(dataset[attribute], bins=bins, color=bar_color, ec=edge_color, linewidth=2)
    
    plt.xticks(rotation=45)
    
    
make_histogram(df_movies, 'yearpub', title='Movies per year distribution', xlab='year', ylab='Counts')

In [None]:
ratings_df = pd.DataFrame()
ratings_df['Mean_Rating'] = dataset.groupby('title')['rating'].mean().values
ratings_df['Num_Ratings'] = dataset.groupby('title')['rating'].count().values


fig, ax = plt.subplots(figsize=(14, 7))
ax.spines['top'].set_visible(False)
ax.spines['right'].set_visible(False)
ax.set_title('Rating vs. Number of Ratings', fontsize=24, pad=20)
ax.set_xlabel('Rating', fontsize=16, labelpad=20)
ax.set_ylabel('Number of Ratings', fontsize=16, labelpad=20)

plt.scatter(ratings_df['Mean_Rating'], ratings_df['Num_Ratings'], alpha=0.5)

## Data Preparation and cleaning

In [20]:
df_train['rating'].value_counts()

4.0    2652977
3.0    1959759
5.0    1445230
3.5    1270642
4.5     880516
2.0     656821
2.5     505578
1.0     311213
1.5     159731
0.5     157571
Name: rating, dtype: int64

In [23]:
# Flag users with more than 500 ratings
x = df_train['userId'].value_counts() > 500
y = x[x].index  # IDs of those users
print(y.shape)
# Keep only the ratings made by those users
users = df_train[df_train['userId'].isin(y)]
len(users)

(1551,)


1177413
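
The filter above can be read step by step on a small example. This sketch uses a made-up ratings frame and a hypothetical threshold of 2 (the notebook uses 500) so that the effect is visible.

```python
import pandas as pd

# Toy ratings table (hypothetical values)
toy_ratings = pd.DataFrame({
    "userId": [1, 1, 1, 2, 3, 3],
    "movieId": [10, 11, 12, 10, 11, 12],
    "rating": [4.0, 3.5, 5.0, 2.0, 4.5, 3.0],
})
min_ratings = 2  # hypothetical threshold for this toy data

# Ratings per user, then the IDs of users strictly above the threshold
counts = toy_ratings["userId"].value_counts()
active_ids = counts[counts > min_ratings].index

# Keep only the rows that belong to those users
active = toy_ratings[toy_ratings["userId"].isin(active_ids)]
print(len(active_ids), len(active))
```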

### Matrix Factorization-based Algorithm

In [3]:
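# Split the data: the first 80% of rows for training, the remaining 20% for testing
split = int(df_train.shape[0] * 0.80)
# Drop the unused timestamp column. drop(columns=...) returns a new frame,
# which avoids modifying a slice of df_train in place.
train_data = df_train.iloc[:split].drop(columns=['timestamp'])
test_data = df_train.iloc[split:].drop(columns=['timestamp'])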

In [4]:
#check shape
train_data.shape, test_data.shape

((8000030, 3), (2000008, 3))
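
The split above takes rows in file order. If the file happens to be ordered in some way (by user or by time, for instance, which is not verified here), a shuffled split may give a more representative hold-out set. A minimal sketch on a toy frame, with a hypothetical `random_state`:

```python
import pandas as pd

# Toy ratings table (hypothetical values)
toy = pd.DataFrame({
    "userId": range(10),
    "movieId": range(100, 110),
    "rating": [3.0] * 10,
})

# Shuffle all rows reproducibly, then cut at 80%
shuffled = toy.sample(frac=1.0, random_state=42).reset_index(drop=True)
cut = int(len(shuffled) * 0.80)
train_part = shuffled.iloc[:cut]
test_part = shuffled.iloc[cut:]
print(train_part.shape, test_part.shape)
```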

In [5]:
# Tell Surprise how to read the data frame.
# Ratings in train.csv range from 0.5 to 5.0, so the scale starts at 0.5 rather than 1.
init_reader = Reader(rating_scale=(0.5, 5))

# Create the Surprise dataset from the training frame
train_data_mf = Dataset.load_from_df(train_data[['userId', 'movieId', 'rating']], init_reader)
# Build the trainset, Surprise's internal training format
trainset = train_data_mf.build_full_trainset()


In [6]:
# Ratings range from 0.5 to 5.0, so the scale starts at 0.5 rather than 1
init_reader = Reader(rating_scale=(0.5, 5))

# Create the Surprise dataset from the held-out frame
test_data_mf = Dataset.load_from_df(test_data[['userId', 'movieId', 'rating']], init_reader)
# Wrap the held-out ratings in a Trainset so that build_testset() can later
# turn them into (user, item, rating) tuples for svd.test()
testset = test_data_mf.build_full_trainset()


In [7]:
svd = SVD(n_factors=100, biased=True, random_state=15, verbose=True)
svd.fit(trainset)

Processing epoch 0
Processing epoch 1
Processing epoch 2
Processing epoch 3
Processing epoch 4
Processing epoch 5
Processing epoch 6
Processing epoch 7
Processing epoch 8
Processing epoch 9
Processing epoch 10
Processing epoch 11
Processing epoch 12
Processing epoch 13
Processing epoch 14
Processing epoch 15
Processing epoch 16
Processing epoch 17
Processing epoch 18
Processing epoch 19


<surprise.prediction_algorithms.matrix_factorization.SVD at 0x7f91c6a80390>

In [8]:
#getting predictions of train set
train_preds = svd.test(trainset.build_testset())
train_pred_mf = np.array([pred.est for pred in train_preds])

In [9]:
#getting predictions of testset
test_preds = svd.test(testset.build_testset())

test_pred_mf = np.array([pred.est for pred in test_preds])
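
The predictions above are collected but not scored in this notebook. One common way to score rating predictions is the root mean squared error (RMSE). The helper below is a sketch on made-up numbers; `rmse`, `true_r` and `est_r` are illustrative names, not part of the notebook.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error between two equally long sequences."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

# Hypothetical true ratings and model estimates
true_r = [4.0, 3.0, 5.0, 2.0]
est_r = [3.5, 3.0, 4.0, 3.0]
score = rmse(true_r, est_r)
print(score)
```

With the objects defined in the cells above, the true ratings should be available as `pred.r_ui` on each Surprise prediction, so something like `rmse([p.r_ui for p in test_preds], test_pred_mf)` would score the held-out set.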

## Modelling phase

Here you can apply the models outlined in the Intro to Recommender notebook. You only need to apply one approach, either content-based or collaborative filtering.



### XGBoost

In [10]:
# Creating a sparse matrix
from scipy.sparse import csr_matrix
train_sparse_matrix = csr_matrix((train_data.rating.values, (train_data.userId.values, train_data.movieId.values)))

#### Features which represent the global averages

In [11]:
train_averages = dict()
# get the global average of ratings in our train set.
train_global_average = train_sparse_matrix.sum()/train_sparse_matrix.count_nonzero()
train_averages['global'] = train_global_average
train_averages

{'global': 3.5334301246370328}

In [12]:
def get_average_ratings(sparse_matrix, of_users):

    # Sum along axis 1 for per-user averages, axis 0 for per-movie averages
    ax = 1 if of_users else 0

    # ".A1" flattens the resulting matrix into a 1-D numpy array
    sum_of_ratings = sparse_matrix.sum(axis=ax).A1
    # Boolean matrix marking which (user, movie) pairs have a rating
    is_rated = sparse_matrix != 0
    # Number of ratings per user (or per movie)
    no_of_ratings = is_rated.sum(axis=ax).A1

    # Matrix shape: (max user ID + 1, max movie ID + 1)
    u, m = sparse_matrix.shape
    # Dictionary mapping each user (or movie) ID to its average rating,
    # skipping IDs that have no ratings
    average_ratings = {i: sum_of_ratings[i] / no_of_ratings[i]
                       for i in range(u if of_users else m)
                       if no_of_ratings[i] != 0}

    # Return the dictionary of average ratings
    return average_ratings
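
To see what these averages look like, here is a sketch on a tiny 3x3 matrix where rows are users, columns are movies and 0 means "not rated". The values are made up, and `np.asarray(...).ravel()` is used in place of `.A1` so the snippet does not depend on the matrix return type.

```python
import numpy as np
from scipy.sparse import csr_matrix

# Rows = users, columns = movies, 0 = not rated (hypothetical values)
dense = np.array([
    [4.0, 0.0, 5.0],
    [0.0, 3.0, 0.0],
    [2.0, 0.0, 0.0],
])
m = csr_matrix(dense)

# Global average over the stored (non-zero) ratings only
global_avg = m.sum() / m.count_nonzero()

# Per-user sums and counts along axis 1
user_sums = np.asarray(m.sum(axis=1)).ravel()
user_counts = np.asarray((m != 0).sum(axis=1)).ravel()
user_avg = {
    u: float(user_sums[u] / user_counts[u])
    for u in range(m.shape[0])
    if user_counts[u] != 0
}
print(global_avg, user_avg)
```

Dividing by the count of non-zero entries rather than by the matrix size is what keeps unrated movies from dragging the averages towards zero.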

In [13]:
# Average ratings given by a user
train_averages['user'] = get_average_ratings(train_sparse_matrix, of_users=True)
print('\nAverage rating of user 10 :',train_averages['user'][10])


Average rating of user 10 : 3.75


In [14]:
# Average rating received by each movie
train_averages['movie'] =  get_average_ratings(train_sparse_matrix, of_users=False)
print('\n Average rating of movie 15 :',train_averages['movie'][15])


 Average rating of movie 15 : 2.728359908883827


In [15]:
# get users, movies and ratings from our samples train sparse matrix
from scipy.sparse import find
train_users, train_movies, train_ratings = find(train_sparse_matrix)

#### Features which represent the top 5 similar users and 5 top similar movies

In [None]:
from datetime import datetime

In [16]:
def ratings_(train_users, train_movies, train_ratings):
    # Build one feature row per (user, movie, rating) triple.
    # Rows are collected in a list because DataFrame.append was removed in pandas 2.0
    # and copying the frame on every iteration is slow.
    rows = []
    for (user, movie, rating) in zip(train_users, train_movies, train_ratings):

        # ---- Ratings of "movie" by the users most similar to "user" ----
        user_sim = cosine_similarity(train_sparse_matrix[user], train_sparse_matrix).ravel()
        # Sort by similarity, highest first, and drop the top entry (the user itself)
        top_sim_users = user_sim.argsort()[::-1][1:]
        # Ratings the most similar users gave this movie
        top_ratings = train_sparse_matrix[top_sim_users, movie].toarray().ravel()
        # Keep the first 5 non-zero ratings, padding with the movie average if fewer exist
        top_sim_users_ratings = list(top_ratings[top_ratings != 0][:5])
        top_sim_users_ratings.extend([train_averages['movie'][movie]] * (5 - len(top_sim_users_ratings)))

        # ---- Ratings by "user" of the movies most similar to "movie" ----
        movie_sim = cosine_similarity(train_sparse_matrix[:, movie].T, train_sparse_matrix.T).ravel()
        # Sort by similarity, highest first, and drop the top entry (the movie itself)
        top_sim_movies = movie_sim.argsort()[::-1][1:]
        # Ratings this user gave the most similar movies
        top_ratings = train_sparse_matrix[user, top_sim_movies].toarray().ravel()
        # Keep the first 5 non-zero ratings, padding with the user average if fewer exist
        top_sim_movies_ratings = list(top_ratings[top_ratings != 0][:5])
        top_sim_movies_ratings.extend([train_averages['user'][user]] * (5 - len(top_sim_movies_ratings)))

        # ---- Assemble the row ----
        row = [user, movie, train_averages['global']]  # IDs and the global average
        row.extend(top_sim_users_ratings)              # 5 ratings of "movie" by similar users
        row.extend(top_sim_movies_ratings)             # 5 ratings by "user" of similar movies
        row.append(train_averages['user'][user])       # user average
        row.append(train_averages['movie'][movie])     # movie average
        row.append(rating)                             # target: the actual rating of this pair
        rows.append(row)

    return pd.DataFrame(rows)


In [17]:
final_train = ratings_(train_users, train_movies, train_ratings)

KeyboardInterrupt: 
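
The run above was interrupted: the function recomputes a full cosine-similarity vector for every one of the training ratings. One possible way to make the idea tractable is to work on a small sample and compute the user-user similarity matrix once. The sketch below illustrates the "ratings by the most similar users" feature on a made-up 4x4 matrix; `similar_user_ratings`, `k` and `fallback` are hypothetical names, and plain NumPy stands in for `cosine_similarity`.

```python
import numpy as np

# Users x movies, 0 = unrated (hypothetical values)
R = np.array([
    [5.0, 4.0, 0.0, 1.0],
    [4.0, 5.0, 3.0, 0.0],
    [1.0, 0.0, 5.0, 4.0],
    [0.0, 2.0, 4.0, 5.0],
])

# Cosine similarity between every pair of user rows, computed once
norms = np.linalg.norm(R, axis=1, keepdims=True)
user_sim = (R @ R.T) / (norms @ norms.T)

def similar_user_ratings(user, movie, k=2, fallback=0.0):
    """Ratings of `movie` by the k users most similar to `user`.

    Users who have not rated the movie are skipped; if fewer than k
    ratings are found, the list is padded with `fallback`.
    """
    order = np.argsort(user_sim[user])[::-1]  # most similar first
    rated = [float(R[v, movie]) for v in order
             if v != user and R[v, movie] != 0][:k]
    return rated + [fallback] * (k - len(rated))

feats = similar_user_ratings(user=0, movie=2, k=2, fallback=4.0)
print(feats)
```

In the notebook's setting the fallback would be the movie's average rating, mirroring the padding step in `ratings_`.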

# Generate your outputs here

### Prepare Submission File
We make submissions in CSV files. Your submissions usually have two columns: an ID column and a prediction column. The ID field comes from the test data (keeping whatever name the ID field had in that data, which for the data is the string 'Id'). The prediction column will use the name of the target field.

We will create a DataFrame with this data, and then use the dataframe's to_csv method to write our submission file. Explicitly include the argument index=False to prevent pandas from adding another column in our csv file.

In [None]:
# This is an example
## my_submission = pd.DataFrame({'id': test.Id, 'rating': test.ratings})
# you could use any filename. We choose submission here
## my_submission.to_csv('submission.csv', index=False)
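
As a sketch of the steps described above, the snippet below builds a two-column submission frame and writes it as CSV with `index=False`. It assumes, without confirmation from the competition files, that `Id` is formed by joining `userId` and `movieId` with an underscore; this should be checked against `sample_submission.csv`. The frame `toy_test` and the values in `toy_preds` are made up.

```python
import io
import pandas as pd

# Toy stand-in for the test set and its predicted ratings (hypothetical values)
toy_test = pd.DataFrame({"userId": [1, 1, 2], "movieId": [10, 20, 10]})
toy_preds = [3.5, 4.0, 2.5]

# ASSUMPTION: Id = "<userId>_<movieId>"; verify against sample_submission.csv
submission = pd.DataFrame({
    "Id": toy_test["userId"].astype(str) + "_" + toy_test["movieId"].astype(str),
    "rating": toy_preds,
})

# Write to an in-memory buffer here; pass a filename such as 'submission.csv' in practice
buffer = io.StringIO()
submission.to_csv(buffer, index=False)
csv_text = buffer.getvalue()
print(csv_text)
```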


### Make Submission
Hit the blue Publish button at the top of your notebook screen. It will take some time for your kernel to run. When it has finished your navigation bar at the top of the screen will have a tab for Output. This only shows up if you have written an output file (like we did in the Prepare Submission File step).

Example below of how the output would look once published

In [None]:
# initialize list of lists
data = [['tom', 10], ['nick', 15], ['juli', 14]]
  
# Create the pandas DataFrame
df = pd.DataFrame(data, columns = ['Name', 'Age'])
  
# print dataframe.
df

In [None]:
df.to_csv('my_test_output.csv', index = False)