# Unsupervised Learning Team JS4
We will use this Notebook to build and test various models relating to our goal.

## Our Team:
- Kwanda Silekwa
- Thembinkosi Malefo
- Sihle Riti
- Nomfundo Manyisa
- Ofentse Sabe
- Thanyi

## Introduction
The rapid growth of data collection has led to a new era of information. Data is being used to create more efficient systems, and this is where recommendation systems come into play. Recommendation systems are a type of information filtering system: they improve the quality of search results and surface items that are more relevant to the search query or related to the user's search history.

### What is a recommendation system?
A recommender system seeks to predict or filter preferences according to the user’s choices. Recommender systems are used in a variety of areas including movies, music, news, books, research articles, search queries, social tags, and products in general. Moreover, companies like Netflix and Spotify depend heavily on the effectiveness of their recommendation engines for their business and success.

![image.png](attachment:image.png)



The two most popular approaches in current use are content-based filtering and collaborative filtering, which draw on different information sources to make recommendations.

- Content-based filtering (CBF) : makes recommendations based on user preferences for product features.
- Collaborative filtering (CF): mimics user-to-user recommendations (i.e. it relies on how other users have responded to the same items). 

Collaborative filtering predicts a user's preferences as a linear, weighted combination of other users' preferences.
Both methods have limitations. CBF can recommend a new item, but needs more data on user preferences to find the best match. CF, on the other hand, requires a large dataset of active users who have already rated the items in order to make accurate predictions. A combination of the two methods is known as a hybrid recommendation system.
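
To make the "weighted combination" idea concrete, here is a minimal sketch of user-based collaborative filtering on toy data. The ratings dictionary, the user names and the helper names (`cosine_sim`, `predict_rating`) are made up for illustration, and cosine similarity over co-rated movies is only one possible choice of weight. It is not the model built later in this notebook.

```python
import math

# Toy ratings: user -> {movie: rating}. All values are invented for illustration.
ratings = {
    "ann": {"A": 5.0, "B": 3.0, "C": 4.0},
    "bob": {"A": 4.0, "B": 2.0, "C": 5.0, "D": 4.0},
    "cat": {"A": 1.0, "B": 5.0, "D": 2.0},
}

def cosine_sim(u, v):
    """Cosine similarity between two users over the movies both have rated."""
    common = set(ratings[u]) & set(ratings[v])
    if not common:
        return 0.0
    dot = sum(ratings[u][m] * ratings[v][m] for m in common)
    norm_u = math.sqrt(sum(ratings[u][m] ** 2 for m in common))
    norm_v = math.sqrt(sum(ratings[v][m] ** 2 for m in common))
    return dot / (norm_u * norm_v)

def predict_rating(user, movie):
    """Similarity-weighted average of other users' ratings for `movie`."""
    num, den = 0.0, 0.0
    for other, their in ratings.items():
        if other == user or movie not in their:
            continue
        w = cosine_sim(user, other)
        num += w * their[movie]
        den += abs(w)
    # None signals that nobody comparable has rated the movie
    return num / den if den else None

pred = predict_rating("ann", "D")
print(pred)
```

In this toy example "ann" agrees more with "bob" than with "cat", so the prediction for movie "D" is pulled towards bob's rating.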

## Problem statement:
Construct a recommendation algorithm based on content or collaborative filtering, capable of accurately predicting how a user will rate a movie they have not yet viewed, based on their historical preferences.

## Importing Libraries

In [None]:
# !pip install surprise

In [None]:
# !pip install comet_ml

In [None]:
from comet_ml import Experiment

# Packages for data processing
import numpy as np
import pandas as pd
# import datetime
from sklearn import preprocessing
from sklearn.datasets import make_blobs
from sklearn.preprocessing import StandardScaler
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
import re
from scipy.sparse import csr_matrix
import scipy as sp
from ast import literal_eval
import ast
from IPython.display import FileLink
from collections import Counter

# visualisation libraries
from matplotlib import pyplot as plt
import seaborn as sns
from numpy.random import RandomState
from plotly.offline import init_notebook_mode, plot, iplot
import plotly.graph_objs as go
init_notebook_mode(connected=True)

# Packages for modeling
from surprise import Reader
from surprise import Dataset
from surprise import KNNWithMeans
from surprise import KNNBasic
from surprise.model_selection import cross_validate
from surprise.model_selection import GridSearchCV
from surprise.model_selection import train_test_split
from surprise import SVD
from surprise import SVDpp
from surprise import NMF
from surprise import SlopeOne
from surprise import CoClustering
from surprise import BaselineOnly
from surprise import accuracy

# Packages for model evaluation
from sklearn.metrics import mean_squared_error
from sklearn.metrics import mean_absolute_error
from time import time
from datetime import datetime

# word cloud
%matplotlib inline
import wordcloud
from wordcloud import WordCloud, STOPWORDS
sns.set()

# Kaggle requirements
import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))   
        


## Loading the dataset
We are going to load the dataframes we will be working with.

In [None]:
#Loading data
df_train = pd.read_csv('../input/edsa-movie-recommendation-challenge/train.csv')
df_test = pd.read_csv('../input/edsa-movie-recommendation-challenge/test.csv')
df_movies = pd.read_csv('../input/edsa-movie-recommendation-challenge/movies.csv')
df_sample_submission = pd.read_csv('../input/edsa-movie-recommendation-challenge/sample_submission.csv')
df_imdb = pd.read_csv('../input/edsa-movie-recommendation-challenge/imdb_data.csv')
df_genome_tags = pd.read_csv("../input/edsa-movie-recommendation-challenge/genome_tags.csv")
df_genome_scores = pd.read_csv("../input/edsa-movie-recommendation-challenge/genome_scores.csv")
df_tags = pd.read_csv("../input/edsa-movie-recommendation-challenge/tags.csv")
df_links = pd.read_csv("../input/edsa-movie-recommendation-challenge/links.csv")

In [None]:
# df_train=pd.read_csv('data/train.csv')
# df_links=pd.read_csv('data/links.csv')
# df_movies=pd.read_csv('data/movies.csv')
# df_imdb = pd.read_csv('data/imdb_data.csv')
# df_sample_submission = pd.read_csv('data/sample_submission.csv')
# df_tags=pd.read_csv('data/tags.csv')
# df_genome_scores=pd.read_csv('data/genome_scores.csv')
# df_genome_tags=pd.read_csv('data/genome_tags.csv')
# df_test=pd.read_csv('data/test.csv')

## Evaluating the data

This dataset consists of several million 5-star ratings obtained from users of the online MovieLens movie recommendation service. The MovieLens dataset has long been used by industry and academic researchers to improve the performance of explicitly-based recommender systems, and now you get to as well!

For this Predict, we'll be using a special version of the MovieLens dataset which has been enriched with additional data and resampled for fair evaluation purposes.

### Source
The data for the MovieLens dataset is maintained by the GroupLens research group in the Department of Computer Science and Engineering at the University of Minnesota. Additional movie content data was legally scraped from IMDB.

### Supplied files
- genome_scores.csv - A score mapping the strength between movies and tag-related properties.
- genome_tags.csv - User-assigned tags for genome-related scores.
- imdb_data.csv - Additional movie metadata scraped from IMDB using the links.csv file.
- links.csv - File providing a mapping between a MovieLens ID and associated IMDB and TMDB IDs.
- sample_submission.csv - Sample of the submission format for the hackathon.
- tags.csv - User-assigned tags for the movies within the dataset.
- test.csv - The test split of the dataset. Contains user and movie IDs with no rating data.
- train.csv - The training split of the dataset. Contains user and movie IDs with associated rating data.

In [None]:
print("Train data contains {} rows and {} columns".format(df_train.shape[0], df_train.shape[1]))
print("Movie data contains {} rows and {} columns".format(df_movies.shape[0], df_movies.shape[1]))
print("Imdb data contains {} rows and {} columns".format(df_imdb.shape[0], df_imdb.shape[1]))
print("Genome_tags data contains {} rows and {} columns".format(df_genome_tags.shape[0], df_genome_tags.shape[1]))
print("Genome_scores data contains {} rows and {} columns".format(df_genome_scores.shape[0], df_genome_scores.shape[1]))
print("Tags data contains {} rows and {} columns".format(df_tags.shape[0], df_tags.shape[1]))
print("Links data contains {} rows and {} columns".format(df_links.shape[0], df_links.shape[1]))

In [None]:
#viewing training data
df_train.head()

Train:

- userId : Identifier for the user who gave the rating
- movieId : Identifier for movies used
- rating : Ratings are made on a 5-star scale, with half-star increments (0.5 stars - 5.0 stars)
- timestamp : Seconds since midnight Coordinated Universal Time (UTC) of January 1, 1970
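
As a small illustration of the `timestamp` field, the sketch below converts Unix seconds into a UTC datetime using only the standard library. The sample value and the helper name `rating_datetime` are made up and do not come from the dataset.

```python
from datetime import datetime, timezone

def rating_datetime(ts):
    """Convert seconds since 1970-01-01 00:00:00 UTC into a timezone-aware datetime."""
    return datetime.fromtimestamp(ts, tz=timezone.utc)

sample_ts = 1_500_000_000  # hypothetical timestamp, not taken from train.csv
dt = rating_datetime(sample_ts)
print(dt.year, dt.isoformat())
```

Passing an explicit UTC timezone keeps the result independent of the local timezone of the machine running the notebook.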

In [None]:
#viewing tags data
df_tags.head()

Tags:

- userId
- movieId : Identifier for movies used
- tag : User-generated metadata about movies. Each tag is typically a single word or short phrase. The meaning, value, and purpose of a particular tag is determined by each user.
- timestamp : Seconds since midnight Coordinated Universal Time (UTC) of January 1, 1970

In [None]:
#viewing movies data
df_movies.head()

Movies:

- movieId : Identifier for the movies

- title : Entered manually or imported from https://www.themoviedb.org/, and include the year of release in parentheses. Errors and inconsistencies may exist in these titles.

- genres : A pipe-separated list, with values selected from the following:
    - Action
    - Adventure
    - Animation
    - Children's
    - Comedy
    - Crime
    - Documentary
    - Drama
    - Fantasy
    - Film-Noir
    - Horror
    - Musical
    - Mystery
    - Romance
    - Sci-Fi
    - Thriller
    - War
    - Western
    - (no genres listed)
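
Because `genres` is stored as a single pipe-separated string, it has to be split before individual genres can be counted. Here is a minimal sketch using `collections.Counter` on a few made-up rows; the genre strings and the helper name `count_genres` are illustrative and are not read from movies.csv.

```python
from collections import Counter

# Hypothetical values in the same "A|B|C" format as the genres column
sample_genres = [
    "Adventure|Animation|Comedy",
    "Comedy|Romance",
    "Drama",
    "(no genres listed)",
]

def count_genres(genre_strings):
    """Count how often each genre label appears across pipe-separated strings."""
    counts = Counter()
    for s in genre_strings:
        counts.update(s.split("|"))
    return counts

genre_counts = count_genres(sample_genres)
print(genre_counts.most_common(3))
```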

In [None]:
#viewing links data
df_links.head()

Links:

- movieId : Identifier for movies used by https://movielens.org
- imdbId : Identifier for movies used by http://www.imdb.com
- tmdbId : An identifier for movies used by https://www.themoviedb.org.

In [None]:
#viewing genome scores data
df_genome_scores.head()

In [None]:
#viewing genome tags data
df_genome_tags.head()

Genome_tags:

- tagId
- tag : tag descriptions for the tag IDs in the genome file

In [None]:
#viewing Imdb_data
df_imdb.head()

In [None]:
df_imdb.describe(include = 'all')

# Data Preparation

In [None]:
def to_list(df, column) :
    df[column] = df[column].str.split('|')
    return df[column]

In [None]:
def get_genre_list(df,column) :
    genres_list = []
    for genre in df[column].unique():
        genres_list = genres_list + genre.split("|")
        genres_list = list(set(genres_list))
    return genres_list

In [None]:
# Make a census of the genre keywords
def get_genre_labels(df,column) :
    genre_labels = set()
    # Split each pipe-separated entry of the given column into labels
    for s in df[column].str.split('|').values:
        genre_labels = genre_labels.union(set(s))
    return genre_labels

In [None]:
# Function that counts the number of times each of the genre keywords appear
def count_word(dataset, ref_col, census):
   
    keyword_count = dict()
    for s in census: 
        keyword_count[s] = 0
    for census_keywords in dataset[ref_col].str.split('|'):        
        if type(census_keywords) == float and pd.isnull(census_keywords): 
            continue        
        for s in [s for s in census_keywords if s in census]: 
            if pd.notnull(s): 
                keyword_count[s] += 1
                
    keyword_occurences = []
    for k,v in keyword_count.items():
        keyword_occurences.append([k,v])
    keyword_occurences.sort(key = lambda x:x[1], reverse = True)
    return keyword_occurences, keyword_count

In [None]:
def to_str(df, column) : 
    df[column] = [','.join(map(str, l)) for l in df[column]]
    return df[column]

In [None]:
genre_labels = get_genre_list(df_movies, 'genres')
print(genre_labels)

In [None]:
keyword_occurences, dum = count_word(df_movies, 'genres', genre_labels)
print(keyword_occurences[:5])

In [None]:
genres_list = get_genre_list(df_movies, 'genres')
print(genres_list)

### Removing the pipe between genres, title_cast and plot_keywords

In [None]:
df_movies.genres = to_list(df_movies, 'genres')

In [None]:
df_movies.genres = to_str(df_movies, 'genres')
df_movies.head()

In [None]:
df_imdb.plot_keywords = to_list(df_imdb, 'plot_keywords')

In [None]:
df_imdb.title_cast = to_list(df_imdb, 'title_cast')
df_imdb.head()

#### Before further processing, let's merge some of our data to see how it fits together.

### Merging of the df_train, df_movies and df_imdb dataframes into df_full_movies

In [None]:
# Merge the train and movies data
table1 = pd.merge(df_train, df_movies, on = ['movieId'])
# df_table1 = df_train.merge(df_movies, on='movieId')

# Viewing the 1st 5 rows
table1.head()

In [None]:
# Checking for nulls
table1.isnull().sum()

In [None]:
# Merging table1 dataframe and Imdb data
df_full_movies = pd.merge(table1, df_imdb, on='movieId')

# Viewing the 1st 5 rows
df_full_movies.head()

In [None]:
# Checking for nulls
df_full_movies.isnull().sum()

In [None]:
df_full_movies.shape

In [None]:
df_full_movies.dtypes

### Merging of the df_genome_scores and df_genome_tags into df_tags_scores

In [None]:
df_tags_scores = pd.merge(df_genome_tags , df_genome_scores , on = ['tagId'])
df_tags_scores.head()

In [None]:
# Keep a tag only if its relevance to a movie is higher than 0.80 (80%)
df_tags_scores_2 = df_tags_scores[df_tags_scores['relevance'] > 0.80]
df_tags_scores_2.head()

# Exploratory data analysis

### Cleaning of the data and Visualizing

In [None]:
colors = ['b', 'g', 'r', 'c', 'm', 'y', 'g']

##  &#128202; 1. Ratings
#### i) To see how people rated movies, let's plot the number of ratings for each rating value

In [None]:
# Creating a plot for the movie ratings
data = df_full_movies['rating'].value_counts().sort_index(ascending=False)
trace = go.Bar(x=data.index,
               y=data.values,
               marker=dict(color='#0080ff'))
layout = dict(title='Distribution of {:,} movie ratings'.format(df_full_movies.shape[0]),
              xaxis=dict(title='rating'),
              yaxis=dict(title='Count'))
fig = go.Figure(data=[trace], layout=layout)
iplot(fig)


- We can observe that most ratings are above 3, and only a small share of ratings fall below 3.

#### ii) Here is the distribution of the number of ratings given by each user. The histogram only shows users who rated at most 200 movies.

In [None]:
user_ratings = df_full_movies.groupby(by='userId')
d = user_ratings['rating'].count()
limit = 200
plt.hist(d[d<=limit], bins='fd')
plt.xlabel('number of rated movies')
plt.ylabel('number of users')
print(f'Only users with at most {limit} ratings are displayed ({len(user_ratings) - len(d[d<=limit]):,} users omitted).')
plt.show()

In [None]:
users_average = df_full_movies.groupby('userId')['rating'].mean()
items_average = df_full_movies.groupby('movieId')['rating'].mean()
plt.hist([users_average, items_average], histtype='step', density=True)
plt.xlabel('average rating for a movie / by a user')
plt.ylabel('density of movies / users')
plt.legend(['average rating given by a user', 'average rating of a movie'], loc=2)
plt.show()

##  &#128202; 2. Movies

This is a list of the 10 most rated movies:

In [None]:
movie_ratings = df_full_movies.groupby(by='movieId')
most_rated = movie_ratings['rating'].count().sort_values(ascending=False).head(10)
print(pd.merge(pd.DataFrame(most_rated), df_movies, on='movieId')[['title','rating']].rename(index=lambda x: x+1, columns={'rating': 'n. of ratings'}),'\n')

This is a list of the top rated movies (among movies with more than 20 ratings):

In [None]:
movie_ratings = df_full_movies.groupby(by='movieId')
top_rated = movie_ratings['rating'].mean().where(movie_ratings['rating'].count() > 20).sort_values(ascending=False).head(10)
print(pd.merge(pd.DataFrame(top_rated), df_movies, on='movieId')[['title','rating']].rename(index=lambda x: x+1, columns={'rating': 'average rating'}))
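
The same "average rating, but only for movies with enough ratings" filter can also be written with a single `groupby(...).agg(...)`. This is a sketch on a tiny made-up frame; the helper `top_rated`, the threshold `min_ratings` and the column names `n_ratings` and `mean_rating` are assumptions chosen for the example.

```python
import pandas as pd

# Tiny invented ratings table using the same column names as train.csv
toy = pd.DataFrame({
    "movieId": [1, 1, 1, 2, 2, 3],
    "rating":  [4.0, 5.0, 3.0, 5.0, 5.0, 5.0],
})

def top_rated(df, min_ratings):
    """Mean rating per movie, restricted to movies with more than `min_ratings` ratings."""
    stats = df.groupby("movieId")["rating"].agg(n_ratings="count", mean_rating="mean")
    stats = stats[stats["n_ratings"] > min_ratings]
    return stats.sort_values("mean_rating", ascending=False)

result = top_rated(toy, min_ratings=1)
print(result)
```

Movie 3 has a perfect average but only one rating, so the threshold drops it from the ranking.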

In [None]:
# Create a wordcloud of the movie titles
df_movies['title'] = df_movies['title'].fillna("").astype('str')
title_corpus = ' '.join(df_movies['title'])
title_wordcloud = WordCloud(stopwords=STOPWORDS, background_color='white', height=2000, width=4000).generate(title_corpus)

# Plot the wordcloud
plt.figure(figsize=(16,8))
plt.imshow(title_wordcloud)
plt.axis('off')
plt.show()

- We can observe from the word cloud that Girl, Love and Man are the most popular title words.
- Warrior, Affair and Without are among the smallest words shown, i.e. the least frequent of the words that appear in the cloud.

In [None]:
# Creating a dataframe with the number of ratings per movie
num_ratings = pd.DataFrame(df_full_movies.groupby('movieId').count()['rating']).reset_index()
df_full_movies = pd.merge(left=df_full_movies, right=num_ratings, on='movieId')
df_full_movies.rename(columns={'rating_x': 'rating', 'rating_y': 'NumberRatings'}, inplace=True)
df_full_movies.head()

In [None]:
# Dropping the duplicates in the movies
Remove_duplicates = df_full_movies.drop_duplicates('movieId')
Remove_duplicates.head()

In [None]:
# Getting the number of ratings per director
Director_ratings = Remove_duplicates.groupby('director')['NumberRatings'].sum().sort_values(ascending=False).reset_index()

# visualize the number of ratings per director
plt.figure(figsize = (14, 9.5))
sns.barplot(data = Director_ratings.head(50), y = 'director', x = 'NumberRatings', color = 'Blue')
plt.ylabel('Directors')
plt.xlabel('Number of ratings')
plt.title('Number of ratings per director\n')
#plt.xlim(0, 27)
plt.show()

In [None]:
# Number of movies per director
director_movies = Remove_duplicates.groupby('director')['title'].count().sort_values(ascending=False).reset_index()
director_movies.head()

In [None]:
# Number of movies per director
director_movies = Remove_duplicates.groupby('director')['title'].count().sort_values(ascending=False).reset_index()


# visualize the number of movies per director
plt.figure(figsize = (14, 9.5))
sns.barplot(data = director_movies.head(50), y = 'director', x = 'title', color = 'Blue')
plt.ylabel('Directors')
plt.xlabel('Number of movies released')
plt.title('Number of Movies released per director\n')
plt.xlim(0, 27)
plt.show()

##  &#128202; 3. Genres

In [None]:
df_full_movies.head()

In [None]:
print(genres_list)

In [None]:
print(genre_labels)

In [None]:
print(keyword_occurences)

In [None]:
# Define the dictionary used to produce the genre wordcloud
genres = dict()
trunc_occurences = keyword_occurences[0:18]
for s in trunc_occurences:
    genres[s[0]] = s[1]

# plot the wordcloud
genre_wordcloud = WordCloud(width=1000,height=400, background_color='black')
genre_wordcloud.generate_from_frequencies(genres)
f, ax = plt.subplots(figsize=(16, 8))
plt.imshow(genre_wordcloud, interpolation="bilinear")
plt.axis('off')
plt.show()

In [None]:
# Some preprocessing will be done on the budget column
# Removing commas
df_full_movies['budget'] = df_full_movies['budget'].str.replace(',', '')

# Remove currency signs by keeping only the digits
df_full_movies['budget'] = df_full_movies['budget'].str.extract(r'(\d+)', expand=False)

# Replace NaN with 0 on budget
df_full_movies['budget'] = df_full_movies['budget'].replace(np.nan, 0)

In [None]:
# Extracting the release year (in parentheses) from the title
df_full_movies['release_year'] = df_full_movies.title.str.extract(r'(\(\d\d\d\d\))', expand=False)

In [None]:
#Removing brackets
df_full_movies['release_year'] = df_full_movies['release_year'].str.replace('[(,)]', '', regex=True)

In [None]:
# Changing the timestamp into years
import time
df_full_movies['timestamp'] = df_full_movies['timestamp'].apply(lambda x: time.strftime('%Y', time.localtime(x)))
df_full_movies.head()

In [None]:
avg_ratings = df_full_movies.groupby(['movieId', 'title', 'genres', 'release_year'], as_index=False)['rating'].mean()
avg_ratings.head()

In [None]:
genres = avg_ratings['genres'].apply(lambda x: x.split(','))
genres.head()

In [None]:
# create a new split_data dataframe
split_data = pd.DataFrame({'genres':genres.values}, index = avg_ratings['genres'].index)

split_data['rating'] = avg_ratings['rating']
split_data['title'] = avg_ratings['title']
split_data['year'] = avg_ratings['release_year']
split_data['movieId'] = avg_ratings['movieId']

split_data.head()

In [None]:
objs = [split_data, pd.DataFrame(split_data['genres'].tolist())]
new_df = pd.concat(objs, axis=1).drop('genres', axis=1).sort_values('rating', ascending=False)
final_ratings = pd.melt(new_df, var_name='genre', value_name="genres", id_vars=['movieId','rating','title', 'year'], value_vars=[0,1,2,3,4,5,6,7,8]).sort_values('rating', ascending=False)
final_ratings = final_ratings[final_ratings.genres.notnull()].drop("genre", axis=1)
final_ratings.sort_values(by=['movieId'], inplace=True)
final_ratings.head()

In [None]:
# Total number of movies with a specific genre counted multiple times for multi genre movies
genre_count = final_ratings.groupby('genres').count()[['movieId']]
genre_count = genre_count.rename(columns = {'movieId': 'count'})
genre_count = genre_count.sort_values('count', ascending=False)

count = genre_count['count'].tolist()
genre = genre_count.index.tolist()
genre_count = pd.DataFrame({'genre': genre, 'count': count})
genre_count

In [None]:
genre_count.plot.barh(x = 'genre', y = 'count', color = 'blue')
plt.xlabel('Number of movies')
plt.ylabel('Genre')
plt.title('Number of Movies vs Genre')

plt.show()

In [None]:
avg_genre_ratings = final_ratings.groupby(['genres'], as_index=False)['rating'].mean()
avg_genre_ratings = avg_genre_ratings.sort_values(by=['rating'], ascending=False)
avg_genre_ratings

In [None]:
avg_genre_ratings.plot.barh(x = 'genres', y='rating', color = 'blue')
plt.xlabel('Rating')
plt.ylabel('Genres')
plt.title('Comparing Movie Ratings by Genre')

plt.show()

In [None]:
# sort_values returns a new dataframe, so keep the result
final_ratings = final_ratings.sort_values(by=['year'], ascending=True)
final_ratings

In [None]:
# Use numeric years so that the x-axis is a real time axis
yearly = final_ratings.dropna(subset=['year']).copy()
yearly['year'] = yearly['year'].astype(int)
rating_acrossYears = yearly.groupby(['genres', 'year'], as_index=False)['rating'].mean()

fig, ax = plt.subplots(figsize=(10, 12))
for genre, group in rating_acrossYears.groupby('genres'):
    group.plot(x='year', y='rating', ax=ax, label=genre)

plt.xlabel('Year (1995 - 2019)')
plt.xlim(1995, 2019)
plt.ylabel('Rating')
plt.title('Trends in Average Movie Ratings for Different Genres')
plt.show()

## Step 3: Build and evaluate models

In [None]:
df_train.head()

In [None]:
df_test.head()

In [None]:
def train_split(df) :
  # Ratings range from 0.5 to 5.0, so set the scale explicitly
  reader = Reader(rating_scale=(0.5, 5))
  data = Dataset.load_from_df(df[['userId', 'movieId', 'rating']], reader)
  trainset, testset = train_test_split(data, test_size=0.25, random_state=42)
  return trainset, testset


In [None]:
def SVD_train_split(df) :
  # Ratings range from 0.5 to 5.0, so set the scale explicitly
  reader = Reader(rating_scale=(0.5, 5))
  data = Dataset.load_from_df(df[['userId', 'movieId', 'rating']], reader)
  trainset, testset = train_test_split(data, test_size=.1, random_state=42)
  return trainset, testset

In [None]:
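def train_split_size(X, y) :
  # X is a surprise Trainset, y is a list of (user, item, rating) tuples
  n_train = X.n_ratings
  n_test = len(y)
  print("Size of train ratings: ", n_train, "and Size of test ratings: ", n_test)
  return n_train, n_test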

In [None]:
# def train_split_size_GridSearch(X) :
#   X = len(X)
# #   y = len(y)
#   return print("Size of train ratings: ", X)

In [None]:
def build_model(model, trainset, testset) :
  # Fit on the training set, then predict on both the training and the test ratings
  model.fit(trainset)
  trainset_pred = model.test(trainset.build_testset())
  testset_pred_test = model.test(testset)
  return model, trainset_pred, testset_pred_test


In [None]:
def build_model_GridSearch(df) :
    # Ratings range from 0.5 to 5.0, so set the scale explicitly
    reader = Reader(rating_scale=(0.5, 5))
    data = Dataset.load_from_df(df[['userId', 'movieId', 'rating']], reader=reader)
    # Hyperparameters accepted by SVD
    param_grid = {'n_epochs': [5, 10],
                  'lr_all': [0.002, 0.005],
                  'reg_all': [0.4, 0.6]}
    # Cross-validation needs at least 2 folds; fit() expects the full Dataset, not a trainset
    model = GridSearchCV(SVD, param_grid, measures=['rmse', 'mae'], cv=3)
    model.fit(data)
    return model



# reader = Reader(rating_scale=(1, 5))
# data = Dataset.load_from_df(df[['profile_id', 'content_id', 'rating']], reader)
# trainset, testset = train_test_split(data, test_size=.25, random_state=20)


# param_grid = {
#               'n_factors': [30],
#               'n_epochs': [5, 10, 20], 
#               'lr_all': [0.002, 0.006, 0.018, 0.054, 0.10, 0.15]
#              }
# algo = GridSearchCV(SVDpp, param_grid, measures=['rmse', 'mae'], cv=3)
# model = algo.fit(trainset)  // error occurs here


# from surprise import SVD
# from surprise import Dataset
# from surprise.model_selection import GridSearchCV

# # Use movielens-100K
# data = Dataset.load_builtin('ml-100k')

# param_grid = {'n_epochs': [5, 10], 'lr_all': [0.002, 0.005],
#               'reg_all': [0.4, 0.6]}
# gs = GridSearchCV(SVD, param_grid, measures=['rmse', 'mae'], cv=3)

# gs.fit(data)

# # best RMSE score
# print(gs.best_score['rmse'])

# # combination of parameters that gave the best RMSE score
# print(gs.best_params['rmse'])

In [None]:
def get_RMSE_score_GridSearch(X) :
  # X is a fitted GridSearchCV object
  best_score = X.best_score['rmse']
  best_params = X.best_params['rmse']
  print("best RMSE score: ", best_score, "and combination of parameters that gave the best RMSE score: ", best_params)
  return best_score, best_params

In [None]:
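def get_RMSE_score(X, y) :
  # accuracy.rmse returns the score; verbose=False avoids printing it twice
  train_rmse = accuracy.rmse(X, verbose=False)
  test_rmse = accuracy.rmse(y, verbose=False)
  print("Training set RMSE score: ", train_rmse, "and Testing set RMSE score: ", test_rmse)
  return train_rmse, test_rmse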
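
For reference, RMSE is the square root of the mean squared difference between the true and the predicted ratings. Here is a minimal sketch with made-up numbers; the function name `rmse` and the values are illustrative only, and the notebook itself relies on `surprise.accuracy.rmse`.

```python
import math

def rmse(y_true, y_pred):
    """Root mean squared error between two equal-length sequences of ratings."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

true_ratings = [4.0, 3.0, 5.0, 2.0]       # invented values
predicted_ratings = [3.5, 3.0, 4.0, 3.0]  # invented values
score = rmse(true_ratings, predicted_ratings)
print(score)
```

Because the errors are squared before averaging, a few large misses raise the score more than many small ones.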

In [None]:
# model.predict(1, 4144).est

In [None]:
def create_submission(df, model) :
  # df.tail()
  sub_pred = []
  for i, row in df.iterrows():
      sub_pred.append(model.predict(row["userId"], row["movieId"]).est)
  df_submission = pd.DataFrame()
  df_submission["Id"] = df["userId"].astype(str) + "_" + df["movieId"].astype(str)
  df_submission["rating"] = sub_pred
  return df_submission
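
The submission `Id` is the `userId` and the `movieId` joined by an underscore. The sketch below shows that format on a made-up frame, with a constant placeholder standing in for the model's predictions; `make_ids` is a hypothetical helper name.

```python
import pandas as pd

# Invented rows using the same column names as test.csv
toy_test = pd.DataFrame({"userId": [1, 1, 2], "movieId": [10, 20, 10]})

def make_ids(df):
    """Build 'userId_movieId' strings, one per row."""
    return df["userId"].astype(str) + "_" + df["movieId"].astype(str)

toy_submission = pd.DataFrame({
    "Id": make_ids(toy_test),
    "rating": [3.5] * len(toy_test),  # placeholder, not a real prediction
})
print(toy_submission)
```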

In [None]:
def save_sub_file(df) :
  now = datetime.now().strftime("%Y%m%d%H%M")
  # df_submission.to_csv(f"submissions/Kwanda{model}_{now}.csv", index=False)
  name = f'Kwandas_{now}.csv'
  # Write the dataframe that was passed in
  df.to_csv(name, index=False)
  return name

## Base Model

In [None]:
model = BaselineOnly()

In [None]:
trainset, testset = train_split(df_train)

In [None]:
test_size = train_split_size(trainset, testset)
test_size

In [None]:
model, trainset_pred, testset_pred = build_model(model, trainset, testset)

In [None]:
train_n_test_score = get_RMSE_score(trainset_pred, testset_pred)
train_n_test_score

In [None]:
df_submission = create_submission(df_test, model)
df_submission.head()

In [None]:
df_sub_file = save_sub_file(df_submission)

In [None]:
#Download csv
FileLink(df_sub_file)

In [None]:
#delete csv
file = "/kaggle/working/"+df_sub_file
os.remove(file)