## Download data source

* Download the data needed for this Jupyter notebook from Kaggle and store it in a new folder (the-movies-dataset) in the current directory.


* Running this cell prompts for your Kaggle username and key, both of which can be found in a freshly created API token.

* To get an API token to authenticate the data request (a Kaggle account is required):
    1. Sign in to Kaggle.
    2. Go to the 'Account' tab of your user profile and select 'Create New Token'.
    3. This will trigger the download of kaggle.json, a file containing your API credentials.

* If the folder has already been created and the files are already in it, then this cell does nothing and requires no credentials.

* Data Source Information: https://www.kaggle.com/datasets/rounakbanik/the-movies-dataset?select=movies_metadata.csv
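
Before running the download cell, it can help to see which of the CSV files are already on disk. The sketch below is only an illustration: `missing_files` is a hypothetical helper, not part of `opendatasets`, and the file names are the ones read by the later cells of this notebook.

```python
from pathlib import Path


def missing_files(folder, expected_names):
    """Return the expected file names that are not present in `folder`."""
    folder = Path(folder)
    return [name for name in expected_names if not (folder / name).is_file()]


# File names taken from the read_csv calls later in this notebook.
expected = ["movies_metadata.csv", "ratings.csv", "keywords.csv", "credits.csv"]
missing = missing_files("./the-movies-dataset", expected)
if missing:
    print("Not found yet:", missing)
else:
    print("All four files are already present.")
```

If any files are reported as not found, expect the download cell to ask for credentials.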


In [1]:
import opendatasets as od

od.download("https://www.kaggle.com/datasets/rounakbanik/the-movies-dataset")

Please provide your Kaggle credentials to download this dataset. Learn more: http://bit.ly/kaggle-creds
Your Kaggle username:

Your Kaggle Key:Downloading the-movies-dataset.zip to ./the-movies-dataset


100%|██████████| 228M/228M [00:23<00:00, 10.2MB/s] 





## Combine Raw Data

Combining certain data from the necessary csv files into a single dataframe (complete_df).

* Rows are removed from each dataframe when a column has insufficient data or no data at all.
* This row removal is done before the merge, which copies the same movie data into multiple rows, to save time and space.
* Iteration through rows of a dataframe at this level is inefficient compared to list iteration.
* This is why the dataframes are converted into lists before iteration and then back again to dataframes, so the merge function can be applied to combine the data into a single dataframe (complete_df).
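
As a minimal sketch of the list-based filtering described above: once a dataframe has been converted with `.values.tolist()`, each row is a plain Python list of strings, and rows whose list-like fields are empty (`"[]"`) can be dropped with a comprehension. The helper name `has_empty_list_field` and the sample rows are made up for illustration.

```python
def has_empty_list_field(row, column_indices):
    """True if any of the given columns ends with the empty-list text '[]'."""
    return any(row[i].endswith("[]") for i in column_indices)


# Made-up rows in the shape (genres, id, production_companies).
rows = [
    ["[{'id': 18, 'name': 'Drama'}]", "1", "[{'name': 'Studio A', 'id': 1}]"],
    ["[]", "2", "[{'name': 'Studio B', 'id': 2}]"],
    ["[{'id': 35, 'name': 'Comedy'}]", "3", "[]"],
]

kept = [row for row in rows if not has_empty_list_field(row, (0, 2))]
print(len(kept))  # 1
```

Only the first row survives, because the other two each have one empty field.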

In [1]:
import pandas as pd
import time

start_time = time.time()


pd.set_option('display.max_colwidth', None)

movies_cols = ["genres", "id", "title", "tagline", "overview", "production_companies"]
movies_df = pd.read_csv('./the-movies-dataset/movies_metadata.csv', usecols=movies_cols,
                        dtype={col: "string" for col in movies_cols})[movies_cols]
movies_df.dropna(inplace=True)
# Drop movies whose genres or production_companies field is an empty list ("[]"):
movies_lst = [row for row in movies_df.values.tolist()
              if not (row[0].endswith("[]") or row[5].endswith("[]"))]
movies_df = pd.DataFrame(movies_lst, columns=movies_cols, dtype=str)



ratings_df = pd.read_csv('./the-movies-dataset/ratings.csv', usecols = ("userId", "movieId", "rating"),
                       dtype={"userId": "string","movieId": "string","rating": "string"})[["userId", "movieId", "rating"]]
ratings_df.rename(columns={"movieId": "id"}, inplace = True)
ratings_df.dropna(inplace = True)


# Question: What if duplicate movie ids per user were removed here instead of in the cell below?
# Answer: The duplicate removal could be run here,
# but complete_list in the cell below can be iterated over with comparable complexity to remove duplicates.
# The iteration in the next cell also populates the gaps list,
# which must be built directly before the step that determines the row bounds of each user's rated movies.
# So, removing duplicates in the next cell instead of this one avoids redundant iteration.


# Question: What if the train and test rating bounds were enforced here instead of in the cell below?
# Answer: The merges below need to run before determining train and test users, because an inner merge removes rows (and so ratings) from users
# before the number of their ratings is checked against the bounds.
# Enforcing the bounds after the merges ensures that the final users are within the set train or test bounds.


keywords_df = pd.read_csv('./the-movies-dataset/keywords.csv', usecols = ("id", "keywords"), dtype={"id": "string","keywords":"string"})[["id", "keywords"]]
keywords_df.dropna(inplace = True)
keywords_lst = [row for row in keywords_df.values.tolist() if not row[1].endswith("[]")]
keywords_df = pd.DataFrame(keywords_lst, columns = ("id", "keywords"), dtype = str)


credits_df = pd.read_csv("./the-movies-dataset/credits.csv", usecols = ("cast", "id"), dtype={"cast": "string", "id": "string"})[["cast", "id"]]
credits_df.dropna(inplace = True)
credits_lst = [row for row in credits_df.values.tolist() if not row[0].endswith("[]")]
credits_df = pd.DataFrame(credits_lst, columns = ("cast", "id"), dtype = str)


# Default merge is inner: This only keeps movies that have the id existing in both dataframes.
complete_df =  pd.merge(movies_df, ratings_df, on ="id")
complete_df =  pd.merge(complete_df,keywords_df, on ="id")
complete_df  = pd.merge(complete_df,credits_df, on ="id")


complete_df.sort_values(by = 'userId', inplace = True)


# Master dataframe: For each (user id, movie id) row combination there is the combined movie data from movies_df, ratings_df, keywords_df, and credits_df for the movie id in question.
# The columns are reordered.
complete_df  = complete_df.loc[:,['userId','id','rating',"title", "genres","production_companies","keywords", "cast", "tagline", "overview" ]]



# For testing:
print("Minutes taken:", (time.time()-start_time)/60)
print(complete_df.head())



# Tested on personal machine:
# Old run with dataframe iteration (old code): 1 minute and 5.7 seconds
# New run with list conversion before iteration (current code): 37.1 seconds

Minutes taken: 0.5958173910776774
        userId    id rating               title  \
6566765      1  1246    5.0        Rocky Balboa   
6880303      1  2959    4.0      License to Wed   
2083077      1  2762    4.5  Young and Innocent   
1492304      1  1968    4.0       Fools Rush In   
2638962      1   147    4.5       The 400 Blows   

                                                                                                genres  \
6566765                                                                  [{'id': 18, 'name': 'Drama'}]   
6880303                                                                 [{'id': 35, 'name': 'Comedy'}]   
2083077                                     [{'id': 18, 'name': 'Drama'}, {'id': 80, 'name': 'Crime'}]   
1492304  [{'id': 18, 'name': 'Drama'}, {'id': 35, 'name': 'Comedy'}, {'id': 10749, 'name': 'Romance'}]   
2638962                                                                  [{'id': 18, 'name': 'Drama'}]   

                      

## Data Extraction and Selection
1. Select data from users whose number of ratings falls within certain bounds.
2. Select a random subset of this data and simplify it.
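
Both steps work on per-user slices of the rows, which are sorted by user id. The following is a minimal sketch of how duplicates can be dropped and each user's `[start, end)` slice recorded in one pass; it uses `itertools.groupby` rather than the manual loop in the cell below, and `dedup_and_bounds` is a hypothetical name chosen for this example.

```python
from itertools import groupby


def dedup_and_bounds(sorted_rows):
    """Drop repeated (user, movie) rows and return (rows, bounds).

    `sorted_rows` must already be sorted by user id (column 0).
    bounds[k] is the [start, end) slice of the k-th user's rows.
    """
    rows_no_dups, bounds = [], []
    for _, user_rows in groupby(sorted_rows, key=lambda r: r[0]):
        start = len(rows_no_dups)
        seen = set()
        for row in user_rows:
            if row[1] not in seen:  # keep only the first rating of each movie
                seen.add(row[1])
                rows_no_dups.append(row)
        bounds.append([start, len(rows_no_dups)])
    return rows_no_dups, bounds


# Made-up rows in the shape (userId, movie id, rating).
rows = [["1", "10", "4.0"], ["1", "10", "3.0"], ["1", "11", "5.0"], ["2", "10", "2.5"]]
clean, bounds = dedup_and_bounds(rows)
print(bounds)  # [[0, 2], [2, 3]]
```

Sampling users then amounts to sampling entries of `bounds` and slicing the de-duplicated list with them.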


In [2]:
import ast
import random
import time
import matplotlib.pyplot as plt
from matplotlib.pyplot import hist

start_time = time.time()


# LOOK: To make a fair comparison to the best possible implementation on the Netflix data,
# a distribution of user ratings closer to that of the Netflix data should be selected,
# perhaps a similar proportion of users in each increment of 10 ratings.
# Problem: this does not correctly represent the population of the non-Netflix dataset, so it would have weaker application!
# But another question is: is the Netflix data biased?
# Was the whole problem statement more theoretical than practical?
# How is it that the two datasets are vastly different?



# Also, since these are different datasets, a higher distribution is more extreme for this dataset than for the Netflix dataset,
# meaning there is a higher selection bias for this dataset.

# The right distribution of user ratings should be selected by trial and error.


# Note: In the Netflix data, the distribution of the number of user ratings does not change between users tested and users not tested.
# This should be mimicked with this data.



SEED_INT = 2
# Seed for consistent results across runtimes:
random.seed(SEED_INT)


def populate_names(item):
    """Extract names from the syntax of certain data entries:"""
    string  = item[1:-1]
    jsons = string.split("}, ")   
    names = ""
    index = 0
    for item in jsons:
        if(index == len(jsons)-1):
            temp_dict = ast.literal_eval(item)
            names+=str(temp_dict["name"])
        else:
            temp_dict = ast.literal_eval(item+"}")
            names+=str(str(temp_dict["name"])+" ")
        index += 1
    return names


def provide_data(row):
    """Extract data from row of complete_list:"""
    movie_data = []
    movie_data.append(int(row[0]))
    movie_data.append(int(row[1]))
    movie_data.append(float(row[2]))
    movie_data.append(row[3])  

    movie_data.append(populate_names(row[4]))
    movie_data.append(populate_names(row[5]))
    movie_data.append(populate_names(row[6]))
    movie_data.append(populate_names(row[7]))

    movie_data.append(str(row[8]))
    movie_data.append(str(row[9]))
    return movie_data
    


# The list of rows with the user id, the user's rating for the movie, and the raw data for the movie:
# Note: It is sorted by user_id.
complete_list = complete_df.values.tolist()

print("Complete number of users:", len(list(complete_df["userId"].unique()))) # 260788

# The complete list of user rows without ratings of the same movie more than once for a given user:
complete_list_no_dups = []

# Distinguish the user that the row belongs to:
last_id = complete_list[0][0]

# The set of movies that a user has rated:
# It is used to omit later ratings of a movie that the user has already rated.
movie_set = set()

# The number of rows of movie data a single user takes up for each user:
gaps = []

# Appended to gaps when all of a user's rows of movie data have been counted:
gap_len = 0


# Populate gaps and complete_list_no_dups, skipping movies that the current user has already rated:
# Note: This loop is faster than the equivalent dataframe approach,
# e.g. filtering the data by user and then removing duplicate movie ids for each user.
# That approach avoids slow dataframe iteration, but the filter step is itself slow.
for row in complete_list:
    if last_id != row[0]:
        movie_set= set()
        complete_list_no_dups.append(row)
        movie_set.add(row[1])
        gaps.append(gap_len)
        gap_len = 1
    else:
        if row[1] not in movie_set:
            complete_list_no_dups.append(row)
            gap_len+=1
            movie_set.add(row[1])
    last_id = row[0]

# Add the last gap_len:
gaps.append(gap_len)


# LOOK: 

# Why does the program filter and select users by the number of movies they rated, when a smaller number of ratings could instead be taken
# from users with more than enough ratings?

# The histogram suggests that by far the most likely bin is users with 0-9 ratings (really 1-9, since a user has to have 1 rating to be in the data),
# and the rest of the users fall somewhere in the heavily right-skewed graph.

# So if the goal is to collect users with 30-50 ratings as train users, it is more normal to take users whose exact number of ratings
# is between 30 and 50 and use all of their available ratings, because the closest-to-normal user type among users with 30 or more ratings
# is the user with the minimum number of ratings, 30. The same logic holds for any user whose total number of ratings falls between 30 and 50.
# i.e., you would rather have a user with 40 ratings use all of their ratings than have a user with 50 ratings use only 40.

# To add to this understanding, the average number of ratings per user is just under 30 (29.620),
# and the graph is heavily right skewed, meaning that a relatively small percentage of users have 30 or more ratings
# and users get more abnormal the more ratings they have above 30.



#LOOK:

# Why do train users have more ratings than test users?
# Answer: The train data is not just used with its own svd function to make train predictions.
# Another svd function takes the concatenation of train and test data and outputs test predictions.
# There is a lot of value in train users having more ratings, because the svd_full function will use those extra ratings
# for better test predictions.

# Performance was recorded for train users with 11-31 ratings, 30-50 ratings, and 50-70 ratings.
# The fact that train users with 30-50 ratings performed the best underlines the trade-off between more data to feed the model
# (particularly when svd test predictions are made on the concatenation of train and test data)
# and reasonable representation of the test users, who have fewer ratings than the train users.

# The extremely right-skewed distribution helps to explain this trade-off:
# since the data is right skewed, past a certain threshold, the more ratings a user has, the more abnormal the user is.



#LOOK: All train users have many more movie ratings than test users, so the final model incorrectly assumes that the output of the
#content-based and collaborative filtering methods is as reliable for the test data as it was in training, even though the test users have fewer movie ratings.

#Tests were made in an attempt to remove this bias by giving the test and train users similar bounds for the number of ratings (5-10)
#and introducing a new set of data with higher rating bounds, concatenated with both the test and train user inputs in their respective calls to svd_full.

#This resulted in worse performance, for a reason that is not known exactly. It could have something to do with the large number (6000) of train users with 5-10 ratings
#overwhelming the svd inputs with higher sparseness, reducing the accuracy of the svd train prediction.




#LOOK: It is important to realize that the number of ratings people enter in the user client of an
#application of this model would not directly correspond to the number of movie ratings that a user has in the
#data set. This introduces slight selection bias. However, using a range of ratings in the train data similar to that of the client application is the best compromise
#for reporting honest performance.

#LOOK: Review code... in the training svd, are all the indices of target movie ratings zeros?
#What about those indices of the train data in the testing svd? How much of a difference does it make?
#Should it be changed?



# Index in the complete_list_no_dups list:
full_index = 0 



bounds = [] 



# Populate bounds with the [start, end) row range of each user in complete_list_no_dups:
for user_index in range(len(gaps)):
    bounds.append([full_index, full_index+gaps[user_index]])
    full_index+=gaps[user_index]    



bounds_train = random.sample(bounds, 10000)




# Transformed data of the sampled users, with user ids renumbered from 0 in sampling order:
sampled_data = []


cnt = 0

for bound in bounds_train:
    for movie in complete_list_no_dups[bound[0]:bound[1]]:
        movie_data = provide_data(movie)
        movie_data[0] = cnt
        sampled_data.append(movie_data)
    cnt+=1



print("Minutes taken:", (time.time()-start_time)/60)






Complete number of users: 260788
Minutes taken: 4.4205666303634645
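
The `populate_names` helper in the cell above splits each field on `"}, "` and evaluates the pieces one at a time. A shorter alternative, sketched here under the assumption that every field is a valid Python literal (a stringified list of dicts that each have a `'name'` key), is to evaluate the whole field once. `names_from_field` is a hypothetical name, and this is not the code that produced the timings above.

```python
import ast


def names_from_field(field):
    """Join the 'name' values of a stringified list of dicts with spaces."""
    return " ".join(str(entry["name"]) for entry in ast.literal_eval(field))


print(names_from_field("[{'id': 18, 'name': 'Drama'}, {'id': 80, 'name': 'Crime'}]"))
# Drama Crime
```

Evaluating the whole list also avoids mis-splitting a name that happens to contain the text `}, `.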


## Write Data

Save the selected data to constructed_data/constructed_data_2.csv so that the cells below can run without rerunning this cell and the ones above it.
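
A minimal sketch of the same write step using `pathlib`, where `mkdir(..., exist_ok=True)` replaces the explicit existence check. The helper name `write_rows` is hypothetical, and the example writes into a temporary folder with a made-up row so that it does not touch the real output file.

```python
import csv
import tempfile
from pathlib import Path


def write_rows(path, header, rows):
    """Create the parent folder if needed, then write header + rows as CSV."""
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)  # no error if it already exists
    with path.open("w", encoding="utf-8", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(header)
        writer.writerows(rows)


with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "constructed_data" / "example.csv"
    write_rows(target, ["userId", "id", "rating"], [[0, 1246, 5.0]])
    print(target.read_text(encoding="utf-8"))
```

Passing `newline=""` when opening the file, as the cell below also does, leaves line endings to the `csv` module.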


In [3]:
import csv
import os

current_directory = os.getcwd()
final_directory = os.path.join(current_directory, 'constructed_data')
if not os.path.exists(final_directory):
   os.makedirs(final_directory)

with open("constructed_data/constructed_data_2.csv", "w", encoding="utf-8", newline='') as f:
    writer = csv.writer(f)
    writer.writerow(['userId','id','rating',"title", "genres","production_companies","keywords", "cast", "tagline", "overview"])
    writer.writerows(sampled_data)

## Read Data
This is the cell to start from if the data has already been saved to constructed_data/constructed_data_2.csv.
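
Note that `csv.reader` returns every field as a string, which is why later cells call `float(movie[2])` on the rating. If typed values are preferred at load time, a sketch along these lines could be used; `read_typed_rows` is a hypothetical helper and the sample line is illustrative.

```python
import csv
import io


def read_typed_rows(file_obj):
    """Skip the header and cast userId, id and rating; other columns stay str."""
    reader = csv.reader(file_obj)
    next(reader)  # header row
    return [[int(r[0]), int(r[1]), float(r[2]), *r[3:]] for r in reader]


sample = io.StringIO("userId,id,rating,title\n0,1246,5.0,Rocky Balboa\n")
print(read_typed_rows(sample))  # [[0, 1246, 5.0, 'Rocky Balboa']]
```

The cell below keeps everything as strings instead, so the later cells do not depend on this conversion.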

In [4]:
import csv

data_list =[]

with open("constructed_data/constructed_data_2.csv", 'r', encoding="utf-8") as f:
    csv_reader = csv.reader(f)
    data_list = list(csv_reader)

data_list = data_list[1:]


## Format and Re-sample Data

Format the data into one list per user, where each list holds a row of movie data for every movie that user rated. Then, select a subset of users for each user type (train and test).
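
As a rough sketch of this step, consecutive rows with the same user id can be grouped with `itertools.groupby`, and the test users can then be drawn from the train users, as the cell below does. The rows and sample sizes here are made up, and a local `random.Random` instance stands in for the global seed.

```python
import random
from itertools import groupby


def group_by_user(rows):
    """Group consecutive rows that share a user id (column 0)."""
    return [list(group) for _, group in groupby(rows, key=lambda r: r[0])]


# Made-up rows in the shape (userId, movie id, rating).
rows = [["0", "10", "4.0"], ["0", "11", "3.5"], ["1", "10", "2.0"], ["2", "12", "5.0"]]
users = group_by_user(rows)

rng = random.Random(2)        # local generator, same seed value as SEED_INT
train = rng.sample(users, 3)  # sample sizes here are illustrative only
test = rng.sample(train, 1)   # test users are drawn from the train users
print(len(users), all(u in train for u in test))  # 3 True
```

Because the test sample is taken from the train sample, every test user is also a train user.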

In [5]:
import random


SEED_INT = 2

random.seed(SEED_INT)

user_to_data = []

user_id = data_list[0][0]

ratings = []

for row in data_list:
    if (row[0]!=user_id):
        user_to_data.append(ratings)
        user_id = row[0]
        ratings = [row]
    else:
        ratings.append(row)



user_to_data.append(ratings)




user_to_data_train = random.sample(user_to_data, 10000)
user_to_data_test = random.sample(user_to_data_train, 1000)

## Create Features and Target values

* The train and test versions of features 1, 2, and 3 are populated, and in the final cell some subset of features 1, 2, and 3 is used to train and test the final model.
* The target values are each user's rating of one randomly selected movie that they rated. They are either train or test ratings, used to train or test the final model.
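
The collaborative-filtering part of the cell below can be summarized with the following equations. This is a reading of the code, not a separate specification, and the symbols are notation introduced here: $r_{ij}$ is user $i$'s rating of movie $j$, $\mu_j$ is movie $j$'s average over non-target ratings, $n$ is `N_VALUE`, and $T$ is the number of test users.

```latex
\begin{aligned}
X_{ij} &=
  \begin{cases}
    r_{ij} - \mu_j & \text{if user } i \text{ rated movie } j \text{ and } j \text{ is not the held-out target of a test row} \\
    0 & \text{otherwise}
  \end{cases} \\
X &= U \Sigma V^{\top}, \qquad \hat{X} = U_n \Sigma_n V_n^{\top} \\
\hat{r}_{ij} &= \hat{X}_{ij} + \mu_j \\
\mathrm{RMSE} &= \sqrt{\frac{1}{T} \sum_{t=1}^{T} \left( r_t - \hat{r}_t \right)^2}
\end{aligned}
```

Here $U_n$, $\Sigma_n$ and $V_n$ keep only the first $n$ singular vectors and values, and the sum in the last line runs over each test user's held-out target rating $r_t$ and its prediction $\hat{r}_t$.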


In [13]:
from gensim.parsing.preprocessing import remove_stopwords
from nltk.stem import WordNetLemmatizer
import nltk
nltk.download('wordnet')
import random
from ordered_set import OrderedSet
from collections import Counter
import numpy as np
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.metrics.pairwise import linear_kernel
from sklearn.cluster import KMeans
import time
from sklearn.metrics import mean_squared_error

start_time = time.time()

N_VALUE = 10

SEED_INT = 2
# Seed for consistent results across runtimes:
random.seed(SEED_INT)


# Variables needed:

# target_movie: a list with the target movie for each user in user_to_data_test.

# target_movie_index: a list of indices corresponding to the randomly selected target movie for each user in user_to_data_test.

# overall_average: the average rating among all users' non-target movie ratings.

# movies_order: an ordered set of all movies that appear in at least one user's movie ratings (including target movies).

# A dictionary from movie id to the movie's average rating. The ratings included are all non-target ratings.

# A list of each movie's average rating in the order of movies_order. The ratings included are all non-target ratings.
# If, because of this, the average rating of a movie does not exist, set it equal to the overall average.

# For each user in user_to_data_train, a list of ratings ordered by movies_order for every movie. Unrated movies take that movie's average rating.
# Then normalize the ratings by subtracting each movie's average rating.

# For each user in user_to_data_test, a list of ratings ordered by movies_order for every movie. Unrated movies take that movie's average rating.
# Set the target movie's rating at the index given by target_movie_index to the average rating for that movie.
# Then normalize the ratings by subtracting each movie's average rating.

movies_order = OrderedSet()
movie_ratings_sum_dict = dict()
movie_ratings_count_dict = dict()
overall_average = 0
cnt = 0

train_user_to_movie_to_rating = [] 

for user in user_to_data_train:
    movie_to_rating  = dict()
    for movie in user:
        movies_order.add(movie[1])
        movie_to_rating[movie[1]] = float(movie[2])
        if(movie[1] in movie_ratings_sum_dict.keys()):
            movie_ratings_sum_dict[movie[1]] += float(movie[2])
            movie_ratings_count_dict[movie[1]] += 1
        else:
            movie_ratings_sum_dict[movie[1]] = float(movie[2])
            movie_ratings_count_dict[movie[1]] = 1
        overall_average+=float(movie[2])
        cnt += 1
    train_user_to_movie_to_rating.append(movie_to_rating)


test_user_to_movie_to_rating = [] 
target_movie = []
target_rating = []

for user in user_to_data_test:
    rand_num  = random.randint(0, len(user)-1)
    index = 0
    movie_to_rating  = dict()
    for movie in user:
        movies_order.add(movie[1])
        movie_to_rating[movie[1]] = float(movie[2])
        if(index == rand_num):
            target_movie.append(movie[1])
            target_rating.append(float(movie[2]))
        else:
            if(movie[1] in movie_ratings_sum_dict.keys()):
                movie_ratings_sum_dict[movie[1]] += float(movie[2])
                movie_ratings_count_dict[movie[1]] += 1
            else:
                movie_ratings_sum_dict[movie[1]] = float(movie[2])
                movie_ratings_count_dict[movie[1]] = 1
            overall_average += float(movie[2])  # keep the running sum in step with cnt
            cnt += 1
        index+=1
    test_user_to_movie_to_rating.append(movie_to_rating)

overall_average  = overall_average/cnt

movie_ratings_avg_list = []


for movie in movies_order:
    if movie in movie_ratings_sum_dict.keys():
        movie_ratings_avg_list.append(movie_ratings_sum_dict[movie]/movie_ratings_count_dict[movie])
    else:
        movie_ratings_avg_list.append(overall_average)
    

users_to_movie_ratings_transformed = []


for i in range(len(user_to_data_train)):
    j = 0
    lst = []
    for movie in movies_order: 
        if movie in train_user_to_movie_to_rating[i].keys():
            lst.append(train_user_to_movie_to_rating[i][movie] - movie_ratings_avg_list[j])
        else:
            lst.append(0)
        j += 1
    users_to_movie_ratings_transformed.append(lst)



target_movie_index = []

for i in range(len(user_to_data_test)):
    j = 0
    lst = []
    for movie in movies_order: 
        if(target_movie[i] == movie):
            lst.append(0)
            target_movie_index.append(j)
        elif movie in test_user_to_movie_to_rating[i].keys():
            lst.append(test_user_to_movie_to_rating[i][movie] - movie_ratings_avg_list[j])
        else:
            lst.append(0)
        j += 1
    users_to_movie_ratings_transformed.append(lst)



print(movie_ratings_avg_list)

def svd_full(user_to_ratings_transformed, n, movie_ratings_avg_list):
    """
    1. Compute the SVD of user_to_ratings_transformed
    2. Truncate each factor to n components
    3. Multiply the truncated factors together: (U x S) x V
    4. Shift the values back to the original rating scale by adding each movie's average rating, and return the result
    """
    U, S, V = np.linalg.svd(user_to_ratings_transformed, full_matrices=False)
    
    # Simplify factors to n components:
    U=U[:,0:n]
    S=np.diag(S)
    S=S[0:n,0:n]
    V=V[0:n,:]

    # Reconstruct to a new array:
    US = np.dot(U,S)
    USV = np.dot(US,V)

    # This transforms USV row by row back to the original rating scale by adding each movie's average rating.
    USV = USV + np.tile(list(movie_ratings_avg_list), (USV.shape[0],1))

    # Be consistent with data structures:
    return list(USV)



svd_out_full = svd_full(users_to_movie_ratings_transformed, N_VALUE, movie_ratings_avg_list)



predictions = [svd_out_full[i+len(user_to_data_train)][target_movie_index[i]] for i in range(len(user_to_data_test))]


print()

# RMSE: np.sqrt is used instead of squared=False, which newer scikit-learn releases no longer accept.
print(np.sqrt(mean_squared_error(target_rating, predictions)))

print(time.time() - start_time)


#rmse: 0.9825798712414232
#LOOK: should try more data
#What about applying svd to each individual test case instead of all at once



[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\jackson\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


[2.537593984962406, 3.284019975031211, 3.613256113256113, 4.327841845140033, 3.250661375661376, 4.15856638609829, 4.114254792826221, 4.2597035991531405, 4.068160152526215, 4.225690276110444, 3.871347785108388, 3.404235727440147, 3.873809523809524, 3.294403892944039, 3.6690368455074336, 3.113049095607235, 3.630081300813008, 3.0224403927068724, 3.320693391115926, 3.075892857142857, 3.07, 3.193798449612403, 3.720985691573927, 3.4743896601244613, 3.7600561272217026, 2.172185430463576, 3.798573975044563, 3.604575163398693, 4.1104651162790695, 3.8623619371282922, 3.8716683119447186, 3.723650385604113, 3.7439024390243905, 2.1, 3.3493975903614457, 3.73567029843676, 3.689915966386555, 3.2131147540983607, 3.2, 2.9974048442906573, 3.4023285899094438, 3.753154574132492, 2.211060948081264, 4.089508928571429, 3.732085561497326, 4.192197566213315, 3.3893256102606535, 3.7556053811659194, 3.484004739336493, 3.941747572815534, 2.5794392523364484, 3.0112359550561796, 3.7546296296296298, 3.524590163934426