# Unsupervised Learning Team JS4
We will use this Notebook to build and test various models relating to our goal.

## Our Team:
- Kwanda Silekwa
- Thembinkosi Malefo
- Sihle Riti
- Nomfundo Manyisa
- Ofentse Sabe
- Thanyi

## Introduction
The rapid growth of data collection has led to a new era of information. Data is being used to create more efficient systems, and this is where recommendation systems come into play. Recommendation systems are a type of information filtering system: they improve the quality of search results and surface items that are more relevant to the search query or related to the user's search history.

### What is a recommendation system?
A recommender system seeks to predict or filter preferences according to the user's choices. Recommender systems are used in a variety of areas, including movies, music, news, books, research articles, search queries, social tags, and products in general. Moreover, companies like Netflix and Spotify depend heavily on the effectiveness of their recommendation engines for their business and success.

![image.png](attachment:image.png)



The most popular recommendation systems in use today are content-based filtering and collaborative filtering, which draw on different information sources to make recommendations.

- Content-based filtering (CBF): makes recommendations based on user preferences for product features.
- Collaborative filtering (CF): mimics user-to-user recommendations (i.e. it relies on how other users have responded to the same items). It predicts a user's preferences as a linear, weighted combination of other users' preferences.

Both methods have limitations. CBF can recommend a new item, but it needs more data on user preferences to find the best match. CF, on the other hand, requires a large dataset of active users who have already rated the product in order to make accurate predictions. A combination of the two methods is known as a hybrid recommendation system.
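
To make the "linear, weighted combination" idea concrete, the sketch below predicts one user's rating as a similarity-weighted average of other users' ratings. The user names, the ratings, and the choice of cosine similarity are assumptions made for illustration; they are not taken from the dataset or from any particular library.

```python
import math

# Hypothetical ratings: user -> {movie: rating}. Not taken from the dataset.
ratings = {
    "alice": {"m1": 5.0, "m2": 3.0, "m3": 4.0},
    "bob": {"m1": 4.0, "m2": 2.0, "m3": 5.0, "m4": 4.0},
    "carol": {"m1": 1.0, "m2": 5.0, "m4": 2.0},
}


def cosine(u, v):
    """Cosine similarity computed over the movies both users have rated."""
    common = set(u) & set(v)
    if not common:
        return 0.0
    dot = sum(u[m] * v[m] for m in common)
    norm_u = math.sqrt(sum(u[m] ** 2 for m in common))
    norm_v = math.sqrt(sum(v[m] ** 2 for m in common))
    return dot / (norm_u * norm_v)


def predict(user, movie):
    """Similarity-weighted average of the other users' ratings for `movie`."""
    numerator = denominator = 0.0
    for other, their_ratings in ratings.items():
        if other == user or movie not in their_ratings:
            continue
        weight = cosine(ratings[user], their_ratings)
        numerator += weight * their_ratings[movie]
        denominator += abs(weight)
    return numerator / denominator if denominator else None


# alice has not rated m4; bob (more similar) gave 4.0 and carol gave 2.0
alice_m4 = predict("alice", "m4")
print(round(alice_m4, 2))
```

The estimate lands between the two known ratings and is pulled towards the rating of the more similar user.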

## Problem statement:
Construct a recommendation algorithm based on content or collaborative filtering, capable of accurately predicting how a user will rate a movie they have not yet viewed based on their historical preferences.

## Importing Libraries

In [5]:
# !pip install surprise

In [2]:
!pip install comet_ml


In [4]:
from comet_ml import Experiment

# Packages for data processing
import numpy as np
import pandas as pd
# import datetime
from sklearn import preprocessing
from sklearn.datasets import make_blobs
from sklearn.preprocessing import StandardScaler
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
import re
from scipy.sparse import csr_matrix
import scipy as sp
from ast import literal_eval
import ast
from IPython.display import FileLink
from collections import Counter

# visualisation libraries
from matplotlib import pyplot as plt
import seaborn as sns
from numpy.random import RandomState
from plotly.offline import init_notebook_mode, plot, iplot
import plotly.graph_objs as go
init_notebook_mode(connected=True)

# Packages for modeling
from surprise import Reader
from surprise import Dataset
from surprise import KNNWithMeans
from surprise import KNNBasic
from surprise.model_selection import cross_validate
from surprise.model_selection import GridSearchCV
from surprise.model_selection import train_test_split
from surprise import SVD
from surprise import SVDpp
from surprise import NMF
from surprise import SlopeOne
from surprise.accuracy import rmse
from surprise import CoClustering
from surprise import BaselineOnly
from surprise import accuracy

# Packages for model evaluation
from sklearn.metrics import mean_squared_error
from sklearn.metrics import mean_absolute_error
from time import time
from datetime import datetime

# word cloud
%matplotlib inline
import wordcloud
from wordcloud import WordCloud, STOPWORDS
sns.set()

# Kaggle requirements
import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))   
        


/kaggle/input/edsa-movie-recommendation-challenge/sample_submission.csv
/kaggle/input/edsa-movie-recommendation-challenge/movies.csv
/kaggle/input/edsa-movie-recommendation-challenge/imdb_data.csv
/kaggle/input/edsa-movie-recommendation-challenge/genome_tags.csv
/kaggle/input/edsa-movie-recommendation-challenge/genome_scores.csv
/kaggle/input/edsa-movie-recommendation-challenge/train.csv
/kaggle/input/edsa-movie-recommendation-challenge/test.csv
/kaggle/input/edsa-movie-recommendation-challenge/tags.csv
/kaggle/input/edsa-movie-recommendation-challenge/links.csv


In [5]:
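# Create an experiment; the API key is read from the COMET_API_KEY
# environment variable (assumed to be set) instead of being hard-coded
experiment = Experiment(
    api_key=os.environ.get("COMET_API_KEY"),
    project_name="unsupervised-predict-team-js4",
    workspace="kwanda9700",
)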

COMET INFO: Experiment is live on comet.ml https://www.comet.ml/kwanda9700/unsupervised-predict-team-js4/4bf4f9b4657b4d9bb728f295476dcf16



## Loading the dataset
We are going to load the dataframes we will be working with.

In [6]:
#Loading data
df_train = pd.read_csv('../input/edsa-movie-recommendation-challenge/train.csv')
df_test = pd.read_csv('../input/edsa-movie-recommendation-challenge/test.csv')
df_movies = pd.read_csv('../input/edsa-movie-recommendation-challenge/movies.csv')
# df_sample_submission = pd.read_csv('../input/edsa-movie-recommendation-challenge/sample_submission.csv')
# df_imdb = pd.read_csv('../input/edsa-movie-recommendation-challenge/imdb_data.csv')
# df_genome_tags = pd.read_csv("../input/edsa-movie-recommendation-challenge/genome_tags.csv")
# df_genome_scores = pd.read_csv("../input/edsa-movie-recommendation-challenge/genome_scores.csv")
# df_tags = pd.read_csv("../input/edsa-movie-recommendation-challenge/tags.csv")
# df_links = pd.read_csv("../input/edsa-movie-recommendation-challenge/links.csv")

In [10]:
# df_train=pd.read_csv('data/train.csv')
# df_links=pd.read_csv('data/links.csv')
# df_movies=pd.read_csv('data/movies.csv')
# df_imdb = pd.read_csv('data/imdb_data.csv')
# df_sample_submission = pd.read_csv('data/sample_submission.csv')
# df_tags=pd.read_csv('data/tags.csv')
# df_genome_scores=pd.read_csv('data/genome_scores.csv')
# df_genome_tags=pd.read_csv('data/genome_tags.csv')
# df_test=pd.read_csv('data/test.csv')

## Evaluating the data

This dataset consists of several million 5-star ratings obtained from users of the online MovieLens movie recommendation service. The MovieLens dataset has long been used by industry and academic researchers to improve the performance of explicitly-based recommender systems, and now you get to as well!

For this Predict, we'll be using a special version of the MovieLens dataset which has been enriched with additional data and resampled for fair evaluation purposes.

### Source
The data for the MovieLens dataset is maintained by the GroupLens research group in the Department of Computer Science and Engineering at the University of Minnesota. Additional movie content data was legally scraped from IMDB.

### Supplied files
- genome_scores.csv - A score mapping the strength between movies and tag-related properties.
- genome_tags.csv - User-assigned tags for genome-related scores.
- imdb_data.csv - Additional movie metadata scraped from IMDB using the links.csv file.
- links.csv - File providing a mapping between a MovieLens ID and associated IMDB and TMDB IDs.
- sample_submission.csv - Sample of the submission format for the hackathon.
- tags.csv - User-assigned tags for the movies within the dataset.
- test.csv - The test split of the dataset. Contains user and movie IDs with no rating data.
- train.csv - The training split of the dataset. Contains user and movie IDs with associated rating data.

In [7]:
print("Train data contains {} rows and {} columns".format(df_train.shape[0], df_train.shape[1]))
print("Movie data contains {} rows and {} columns".format(df_movies.shape[0], df_movies.shape[1]))
print("Test data contains {} rows and {} columns".format(df_test.shape[0], df_test.shape[1]))

Train data contains 10000038 rows and 4 columns
Movie data contains 62423 rows and 3 columns
Test data contains 5000019 rows and 2 columns


In [8]:
#viewing training data
df_train.head()

Unnamed: 0,userId,movieId,rating,timestamp
0,5163,57669,4.0,1518349992
1,106343,5,4.5,1206238739
2,146790,5459,5.0,1076215539
3,106362,32296,2.0,1423042565
4,9041,366,3.0,833375837


Train:

- userId
- movieId : Identifier for movies used
- rating : Ratings are made on a 5-star scale, with half-star increments (0.5 stars - 5.0 stars)
- timestamp : Represents seconds since midnight Coordinated Universal Time (UTC) of January 1, 1970
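
The timestamp column can be turned into a readable date. The sketch below converts the timestamp from the first training row shown above using only the standard library; on the full dataframe, `pd.to_datetime(df_train['timestamp'], unit='s')` should give the equivalent result.

```python
from datetime import datetime, timezone

# Timestamp of the first training row shown above (seconds since the Unix epoch)
ts = 1518349992

# Interpret the value as UTC, as described in the column definition
rated_at = datetime.fromtimestamp(ts, tz=timezone.utc)
print(rated_at.isoformat())
```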

In [9]:
#viewing test data
df_test.head()

Unnamed: 0,userId,movieId
0,1,2011
1,1,4144
2,1,5767
3,1,6711
4,1,7318


Tags:

- userId
- movieId : Identifier for movies used
- tag : User-generated metadata about movies. Each tag is typically a single word or short phrase. The meaning, value, and purpose of a particular tag is determined by each user.
- timestamp : Represents seconds since midnight Coordinated Universal Time (UTC) of January 1, 1970

In [10]:
#viewing movies data
df_movies.head()

Unnamed: 0,movieId,title,genres
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
1,2,Jumanji (1995),Adventure|Children|Fantasy
2,3,Grumpier Old Men (1995),Comedy|Romance
3,4,Waiting to Exhale (1995),Comedy|Drama|Romance
4,5,Father of the Bride Part II (1995),Comedy


Movies:

- movieId : Identifier for the movie

- title : Entered manually or imported from https://www.themoviedb.org/, and includes the year of release in parentheses. Errors and inconsistencies may exist in these titles.

- genres : A pipe-separated list, selected from the following:
    - Action
    - Adventure
    - Animation
    - Children's
    - Comedy
    - Crime
    - Documentary
    - Drama
    - Fantasy
    - Film-Noir
    - Horror
    - Musical
    - Mystery
    - Romance
    - Sci-Fi
    - Thriller
    - War
    - Western
    - (no genres listed)
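
Because genres arrive as a single pipe-separated string, they usually need to be split before they can be counted or used as features. A minimal sketch, using the first three titles from movies.csv shown above (on the dataframe itself, `df_movies['genres'].str.split('|')` is the likely equivalent):

```python
from collections import Counter

# Genre strings copied from the first rows of movies.csv shown above
movies = {
    "Toy Story (1995)": "Adventure|Animation|Children|Comedy|Fantasy",
    "Jumanji (1995)": "Adventure|Children|Fantasy",
    "Grumpier Old Men (1995)": "Comedy|Romance",
}

# Split each pipe-separated string into a list of genres
genre_lists = {title: genres.split("|") for title, genres in movies.items()}

# Count how many of these titles carry each genre
genre_counts = Counter(g for genres in genre_lists.values() for g in genres)

print(genre_lists["Jumanji (1995)"])
print(genre_counts.most_common(3))
```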

# Data Preprocessing

## Checking for missing values

In [12]:
#check for missing values
df_train.isnull().sum()

userId       0
movieId      0
rating       0
timestamp    0
dtype: int64

## Checking for duplicate values

In [14]:
#check duplicates
dup_bool = df_train.duplicated(['userId','movieId','rating'])

#display duplicates
print("Number of duplicate records:", dup_bool.sum())

Number of duplicate records: 0


## Creating a copy of df_train

In [15]:
df = df_train.copy()

In [16]:
#create a copy of the train data
train_df = df_train.copy()

#display top 5 records
train_df.head()

Unnamed: 0,userId,movieId,rating,timestamp
0,5163,57669,4.0,1518349992
1,106343,5,4.5,1206238739
2,146790,5459,5.0,1076215539
3,106362,32296,2.0,1423042565
4,9041,366,3.0,833375837


## Evaluating Length of Unique Values

In [17]:
# Count the unique users and unique movies in the training data
df_train['userId'].nunique(), df_train['movieId'].nunique()

(162541, 48213)
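
These counts give a rough idea of how sparse the user-item rating matrix is. The arithmetic below uses the row count and the unique counts printed above; the variable names are ours.

```python
# Figures taken from the outputs above
n_ratings = 10_000_038   # rows in train.csv
n_users = 162_541        # unique userId values
n_movies = 48_213        # unique movieId values

# Number of cells in a full user x movie matrix
possible = n_users * n_movies

# Share of cells that actually hold a rating, and the share that is empty
density = n_ratings / possible
sparsity = 1 - density

print(f"{density:.4%} of the matrix is filled")
print(f"{sparsity:.4%} of the matrix is empty")
```

Only a small fraction of all possible user and movie pairs has a rating, which is the typical situation a collaborative filtering model has to cope with.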

In [18]:
#view movies
df_movies.head()

Unnamed: 0,movieId,title,genres
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
1,2,Jumanji (1995),Adventure|Children|Fantasy
2,3,Grumpier Old Men (1995),Comedy|Romance
3,4,Waiting to Exhale (1995),Comedy|Drama|Romance
4,5,Father of the Bride Part II (1995),Comedy


## Joining Datasets

In [19]:
# Merge
df_merge1 = df_train.merge(df_movies, on = 'movieId')
df_merge1.head()

Unnamed: 0,userId,movieId,rating,timestamp,title,genres
0,5163,57669,4.0,1518349992,In Bruges (2008),Comedy|Crime|Drama|Thriller
1,87388,57669,3.5,1237455297,In Bruges (2008),Comedy|Crime|Drama|Thriller
2,137050,57669,4.0,1425631854,In Bruges (2008),Comedy|Crime|Drama|Thriller
3,120490,57669,4.5,1408228517,In Bruges (2008),Comedy|Crime|Drama|Thriller
4,50616,57669,4.5,1446941640,In Bruges (2008),Comedy|Crime|Drama|Thriller


## Collaborative Filtering
### What Is Collaborative Filtering?

Collaborative filtering is a technique that can filter out items that a user might like on the basis of reactions by similar users.

It works by searching a large group of people and finding a smaller set of users with tastes similar to a particular user. It looks at the items they like and combines them to create a ranked list of suggestions.

More precisely, it is based on the similarity in preferences, tastes and choices between two users. For example, if user A likes movies 1, 2 and 3 and user B likes movies 2, 3 and 4, they have similar interests, so user A should like movie 4 and user B should like movie 1.
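
The example with users A and B can be written out in a few lines. Jaccard similarity is used here only as a simple, assumed way to measure the overlap between liked sets; it is not the measure used by the model later in this notebook.

```python
# Movies each user likes, as in the example above
liked = {"A": {1, 2, 3}, "B": {2, 3, 4}}


def jaccard(a, b):
    """Share of movies liked by both users among all movies liked by either."""
    return len(a & b) / len(a | b)


# Two shared movies (2 and 3) out of four distinct movies
similarity = jaccard(liked["A"], liked["B"])

# Recommend to each user what the other likes and they have not seen
suggest_for_a = liked["B"] - liked["A"]
suggest_for_b = liked["A"] - liked["B"]

print(similarity, suggest_for_a, suggest_for_b)
```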

### Why Do We Consider Collaborative Filtering Over Content-Based Filtering?

For this task we prefer a collaborative filtering recommender engine over content-based filtering because it can do feature learning on its own; in other words, it can learn which features to use.

### Advantages of Collaborative filtering:

Since we consider collaborative filtering the better choice here, we list a few advantages to support the argument.

- Takes other users' ratings into consideration.
- Doesn't need to study or extract information from the recommended item.
- Adapts to the user's interests, which might change over time.

### About Collaborative Filtering Datasets:

Note that to implement this algorithm, or any recommendation algorithm, we need a dataset that is structured in a specific format. The data should contain a set of items and a set of users who have reacted to some of the items.

While working with such data, you'll mostly see it in the form of a matrix of the reactions given by a set of users to some items from a set of items. Each row contains the ratings given by one user, and each column contains the ratings received by one item, so five users and five items yield a 5 x 5 matrix.
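
As an illustration, a small matrix with five users and five items can be built from long-format ratings with a pandas pivot. The user and item labels and the ratings below are made up.

```python
import pandas as pd

# Hypothetical long-format ratings (user, item, rating); not from the dataset
long_ratings = pd.DataFrame(
    [
        ("U1", "I1", 5.0), ("U1", "I2", 4.0), ("U1", "I4", 1.0),
        ("U2", "I2", 3.0), ("U2", "I3", 2.5), ("U2", "I5", 4.5),
        ("U3", "I1", 2.0), ("U3", "I3", 5.0), ("U3", "I4", 3.5),
        ("U4", "I2", 4.0), ("U4", "I5", 1.5),
        ("U5", "I1", 3.0), ("U5", "I3", 4.0), ("U5", "I4", 2.0), ("U5", "I5", 5.0),
    ],
    columns=["userId", "itemId", "rating"],
)

# Rows are users, columns are items; NaN marks a rating that is missing
matrix = long_ratings.pivot(index="userId", columns="itemId", values="rating")
print(matrix)
```

The missing entries (NaN) are the ratings a recommender tries to estimate.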

### Loading as a Surprise Dataset
We will use the Dataset module to load the pandas dataframe that is available for this experiment. The Reader class is used to parse ratings data and to define the rating scale. By default it expects each rating on a separate line in the order user, movie and rating.

In [20]:
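# Loading as a Surprise dataset
# Reader defaults to a 1-5 rating scale, but the ratings here run from 0.5 to 5.0
reader = Reader(rating_scale=(0.5, 5.0))
data = Dataset.load_from_df(df_train[['userId', 'movieId', 'rating']], reader)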

### Splitting the data into train and test sets

In [21]:
# Data split 85/15
trainset, testset = train_test_split(data, test_size=0.15)

### Training Model
Using the base CoClustering algorithm, we call the fit() method, which trains the algorithm on the trainset, and the test() method, which returns the predictions made on the testset. We then store all our predictions in a dataframe called test.

In [22]:
co_clust = CoClustering()

In [None]:
# Fitting our trainset
co_clust.fit(trainset)

# Using the 15% testset to make predictions
predictions = co_clust.test(testset)

# Storing the predictions in a dataframe
test = pd.DataFrame(predictions)

Let us take a closer look at the predictions in the test dataframe.

In [None]:
# View the head of the predictions dataframe
test.head()
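
A common way to judge such predictions is the root mean squared error (RMSE) between the true rating (`r_ui`) and the estimate (`est`). The sketch below computes it by hand on made-up values; the `Prediction` tuple here is a simplified stand-in for Surprise's prediction objects. On the real predictions, `accuracy.rmse(predictions)` from Surprise should report the same quantity.

```python
import math
from collections import namedtuple

# Simplified stand-in for Surprise's prediction tuples; the values are made up
Prediction = namedtuple("Prediction", ["uid", "iid", "r_ui", "est"])

sample = [
    Prediction(1, 10, 4.0, 3.5),
    Prediction(1, 20, 2.0, 3.0),
    Prediction(2, 10, 5.0, 4.5),
    Prediction(2, 30, 3.0, 3.0),
]


def rmse(preds):
    """Square root of the mean squared difference between r_ui and est."""
    squared_errors = [(p.r_ui - p.est) ** 2 for p in preds]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


sample_rmse = rmse(sample)
print(round(sample_rmse, 4))
```

A lower RMSE means the estimated ratings sit closer to the ratings users actually gave.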

### Evaluate Model
Using the fitted model, we predict a rating for every userId and movieId pair in the competition test set (df_test). The predictions are collected in a list. Filling in these unknown entries of the original user-item matrix is also known as matrix completion.

Let us look at the list called ratings_predictions.

In [32]:
# Predict a rating for every userId / movieId pair in the competition test set.
# zip over the two columns avoids the overhead of iterrows() on millions of rows.
ratings_predictions = [co_clust.predict(uid, iid)
                       for uid, iid in zip(df_test['userId'], df_test['movieId'])]
ratings_predictions[:5]



We will store the list of predictions in a dataframe, which gives us the familiar dataframe format to work with.

In [33]:
# Converting our prediction into a familiar format-Dataframe
df_pred=pd.DataFrame(ratings_predictions)
df_pred


In [34]:
# Renaming our predictions to original names
df_pred=df_pred.rename(columns={'uid':'userId', 'iid':'movieId','est':'rating'})
df_pred.drop(['r_ui','details'],axis=1,inplace=True)



In [35]:
# Snippet of our ratings
df_pred.head()


In [36]:
# Concatenating userId/movieId into a single Id column.
# Building the string column-wise avoids apply(axis=1), which can upcast the integer ids
# to floats (e.g. '1.0_2011.0') on the first run; that is why the cell had to be run twice.
df_pred['Id'] = df_pred['userId'].astype(str) + '_' + df_pred['movieId'].astype(str)


In [37]:
# drop the two features from the dataset userId and movieId
df_pred.drop(['userId', 'movieId'], inplace=True, axis= 1)


## Preparing Submission
The submission for this competition has to be a csv file containing an Id column and a rating column.
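
A minimal sketch of that file layout, assuming the Id is the userId and movieId joined by an underscore as in the cell above. The rows are hypothetical, and the text is written to an in-memory buffer rather than to disk.

```python
import csv
import io

# Hypothetical (userId, movieId, predicted rating) triples
rows = [(1, 2011, 3.4), (1, 4144, 4.1), (2, 5767, 2.9)]

# In-memory buffer; swap for open("submission.csv", "w", newline="") to write a file
buffer = io.StringIO()
writer = csv.writer(buffer, lineterminator="\n")

# Header row, then one "<userId>_<movieId>,<rating>" line per prediction
writer.writerow(["Id", "rating"])
for user_id, movie_id, rating in rows:
    writer.writerow([f"{user_id}_{movie_id}", rating])

submission_text = buffer.getvalue()
print(submission_text)
```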

In [None]:
# df_pred = df_pred[['Id', 'rating']]
# df_pred.shape

In [None]:
# df_pred.to_csv("coClustering_model_base.csv", index=False)

In [None]:
experiment.end()