## Download data source

* Download the data needed for this Jupyter notebook from Kaggle and store it in a new folder (the-movies-dataset) in the current directory.


* When the cell below runs, it prompts for your Kaggle username and key, both of which are found in a freshly created API token.

* Instructions to get an API token to authenticate the data request (note: a Kaggle account is required):
    1. Sign in to Kaggle.
    2. Go to the 'Account' tab of your user profile and select 'Create New Token'.
    3. This triggers the download of kaggle.json, a file containing your API credentials.

* If the folder has already been created and the files are already in it, then this cell does nothing and requires no credentials.

* Data Source Information: https://www.kaggle.com/datasets/rounakbanik/the-movies-dataset?select=movies_metadata.csv
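
To see in advance whether the download cell will ask for credentials, the folder can be checked by hand. The sketch below is only an illustration: `dataset_is_present` is a hypothetical helper, not part of opendatasets, and the file list is assumed from the `read_csv` calls later in this notebook.

```python
import os


def dataset_is_present(folder, required_files):
    """Return True if folder exists and contains every file in required_files."""
    return os.path.isdir(folder) and all(
        os.path.isfile(os.path.join(folder, name)) for name in required_files
    )


# File names assumed from the read_csv calls later in this notebook.
REQUIRED_FILES = ["movies_metadata.csv", "ratings.csv", "keywords.csv", "credits.csv"]

if dataset_is_present("./the-movies-dataset", REQUIRED_FILES):
    print("Dataset already present; no credentials should be needed.")
else:
    print("Dataset missing; the download cell will ask for Kaggle credentials.")
```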


In [1]:
import opendatasets as od

od.download("https://www.kaggle.com/datasets/rounakbanik/the-movies-dataset")

Please provide your Kaggle credentials to download this dataset. Learn more: http://bit.ly/kaggle-creds
Your Kaggle username:

Your Kaggle Key:
Downloading the-movies-dataset.zip to ./the-movies-dataset


100%|██████████| 228M/228M [00:23<00:00, 10.2MB/s] 





## Combine Raw Data

Combine selected columns from the necessary CSV files into a single dataframe (complete_df).

* Rows are removed from each dataframe when a required column is missing a value or holds an empty list ("[]").
* This row removal is done before the merge, where the same movie data is copied into multiple rows, to save time and space.
* Iterating over the rows of a dataframe is inefficient compared to iterating over a list.
* This is why the dataframes are converted into lists before iteration and then back into dataframes, so that the merge function can combine the data into a single dataframe (complete_df).
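
The two steps described above, dropping rows with empty lists and inner-merging on `id`, can be tried on toy data. This is a minimal sketch with made-up values; it uses a vectorized string mask instead of the list conversion used in the cell below, and it has not been timed against that approach.

```python
import pandas as pd

# Toy stand-ins for movies_df and ratings_df; the values are made up for illustration.
movies = pd.DataFrame({
    "id": ["1", "2", "3"],
    "genres": ["[{'id': 18, 'name': 'Drama'}]", "[]", "[{'id': 35, 'name': 'Comedy'}]"],
})
ratings = pd.DataFrame({
    "userId": ["10", "10", "11"],
    "id": ["1", "2", "4"],
    "rating": ["4.0", "3.5", "5.0"],
})

# Keep only the rows whose genres entry is not an empty list.
movies = movies[~movies["genres"].str.endswith("[]")]

# Inner merge (the default): only ids present in both frames survive.
combined = pd.merge(movies, ratings, on="id")
print(combined)
```

Movie 2 is dropped for its empty genres list, movie 3 has no ratings, and the rating for movie 4 has no metadata, so only the row for movie 1 remains.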

In [1]:
import pandas as pd
import time

start_time = time.time()


pd.set_option('display.max_colwidth', None)

movies_df = pd.read_csv('./the-movies-dataset/movies_metadata.csv',usecols=("genres","id" ,"title","tagline", "overview","production_companies"),
                          dtype={'genres':"string","id":"string","title": "string", "tagline": "string","overview":"string",
                                    "production_companies" :"string"})[["genres","id" ,"title","tagline", "overview","production_companies"]]
movies_df.dropna(inplace = True)
movies_lst = [row for row in movies_df.values.tolist() if not (row[0][len(row[0])  - 2:] == "[]" or row[5][len(row[5]) - 2:] == "[]")]
movies_df = pd.DataFrame(movies_lst, columns = ("genres","id" ,"title","tagline", "overview","production_companies"), dtype = str)



ratings_df = pd.read_csv('./the-movies-dataset/ratings.csv', usecols = ("userId", "movieId", "rating"),
                       dtype={"userId": "string","movieId": "string","rating": "string"})[["userId", "movieId", "rating"]]
ratings_df.rename(columns={"movieId": "id"}, inplace = True)
ratings_df.dropna(inplace = True)


# Question: What if duplicate movie ids per user were removed here instead of in the cell below?
# Answer: The duplicate removal could be run here,...
# but the cell below already iterates over complete_list, and that pass can remove duplicates with comparable complexity.
# The iteration in the next cell also populates the gaps list,...
# which must be built directly before the step that determines the bounds of each user's rated movies.
# So, leaving duplicate removal out of this cell and running it in the next cell avoids redundant iteration.


# Question: What if the train and test rating bounds were enforced here instead of in the cell below?
# Answer: The merges below must be executed before determining the train and test users, because the merges remove rows (and therefore ratings) from users.
# Enforcing the bounds before the merges would count ratings that are later dropped.
# Enforcing the bounds after the merges ensures that the final users are within the set train or test bounds.


keywords_df = pd.read_csv('./the-movies-dataset/keywords.csv', usecols = ("id", "keywords"), dtype={"id": "string","keywords":"string"})[["id", "keywords"]]
keywords_df.dropna(inplace = True)
keywords_lst = [row for row in keywords_df.values.tolist() if not (row[1][len(row[1])  - 2:] == "[]")]
keywords_df = pd.DataFrame(keywords_lst, columns = ("id", "keywords"), dtype = str)


credits_df = pd.read_csv("./the-movies-dataset/credits.csv", usecols = ("cast", "id"), dtype={"cast": "string", "id": "string"})[["cast", "id"]]
credits_df.dropna(inplace = True)
credits_lst = [row for row in credits_df.values.tolist() if (not row[0][len(row[0])  - 2:] == "[]")]
credits_df = pd.DataFrame(credits_lst, columns = ("cast", "id"), dtype = str)


# Default merge is inner: This only keeps movies that have the id existing in both dataframes.
complete_df =  pd.merge(movies_df, ratings_df, on ="id")
complete_df =  pd.merge(complete_df,keywords_df, on ="id")
complete_df  = pd.merge(complete_df,credits_df, on ="id")


complete_df.sort_values(by = 'userId', inplace = True)


# Master dataframe: For each (user id, movie id) row combination there is the combined movie data from movies_df, ratings_df, keywords_df, and credits_df for the movie id in question.
# The columns are reordered.
complete_df  = complete_df.loc[:,['userId','id','rating',"title", "genres","production_companies","keywords", "cast", "tagline", "overview" ]]

# For testing:
print("Minutes taken:", (time.time()-start_time)/60)
print(complete_df.head())



# Tested on personal machine:
# Old run with dataframe iteration (old code): 1 minute and 5.7 seconds
# New run with list conversion before iteration (current code): 37.1 seconds

Minutes taken: 0.6275500257809957
        userId    id rating               title  \
6566765      1  1246    5.0        Rocky Balboa   
6880303      1  2959    4.0      License to Wed   
2083077      1  2762    4.5  Young and Innocent   
1492304      1  1968    4.0       Fools Rush In   
2638962      1   147    4.5       The 400 Blows   

                                                                                                genres  \
6566765                                                                  [{'id': 18, 'name': 'Drama'}]   
6880303                                                                 [{'id': 35, 'name': 'Comedy'}]   
2083077                                     [{'id': 18, 'name': 'Drama'}, {'id': 80, 'name': 'Crime'}]   
1492304  [{'id': 18, 'name': 'Drama'}, {'id': 35, 'name': 'Comedy'}, {'id': 10749, 'name': 'Romance'}]   
2638962                                                                  [{'id': 18, 'name': 'Drama'}]   

                      

## Data Extraction and Selection
1. Select data from users that have a number of ratings within a certain bounds.
2. Select a random subset of this data and simplify it.
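
The first step can be pictured as counting ratings per user and keeping the users whose count falls inside the bounds. The sketch below uses a hypothetical helper, `users_within_bounds`, on made-up ids; the cell below achieves the same selection with index ranges over a sorted list rather than a `Counter`.

```python
from collections import Counter
import random


def users_within_bounds(user_ids, lower, upper):
    """Return the users whose number of rows lies in [lower, upper]."""
    counts = Counter(user_ids)
    return [user for user, n in counts.items() if lower <= n <= upper]


# One entry per rating row; the ids are made up for illustration.
rating_rows = ["a"] * 3 + ["b"] * 7 + ["c"] * 12 + ["d"] * 5

eligible = users_within_bounds(rating_rows, 5, 10)

# A seeded generator keeps the random subset reproducible across runs.
rng = random.Random(5)
subset = rng.sample(eligible, 1)
print(eligible, subset)
```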


In [2]:
import ast
import random
import time
import matplotlib.pyplot as plt
from matplotlib.pyplot import hist

start_time = time.time()

SEED_INT = 5
# Seed for consistent results across runtimes:
random.seed(SEED_INT)


def populate_names(item):
    """Extract the "name" values from a stringified list of dicts and join them with spaces."""
    # The whole entry is parsed at once, so a name containing "}, " cannot break the parsing.
    entries = ast.literal_eval(item)
    return " ".join(str(entry["name"]) for entry in entries)


def provide_data(row):
    """Convert a row of complete_list into typed, simplified movie data."""
    movie_data = []
    movie_data.append(int(row[0]))
    movie_data.append(int(row[1]))
    movie_data.append(float(row[2]))
    movie_data.append(row[3])  

    movie_data.append(populate_names(row[4]))
    movie_data.append(populate_names(row[5]))
    movie_data.append(populate_names(row[6]))
    movie_data.append(populate_names(row[7]))

    movie_data.append(str(row[8]))
    movie_data.append(str(row[9]))
    return movie_data
    


# The list of rows with the user's id, the user's rating for the movie, and the raw data for the movie:
# Note: It is sorted by userId.
complete_list = complete_df.values.tolist()

print("Complete number of users:", len(list(complete_df["userId"].unique()))) # 260788

# The complete list of user rows, keeping only the first rating when a user rated the same movie more than once:
complete_list_no_dups = []

# Distinguishes the user that the current row belongs to:
last_id = complete_list[0][0]

# The set of movies that the current user has rated:
# It is used to omit later ratings of a movie that the user has already rated.
movie_set = set()

# The number of rows of movie data that each user takes up:
gaps = []

# Appended to gaps when all of a user's rows of movie data have been counted:
gap_len = 0


# Populates gaps and complete_list_no_dups, omitting movies that the user has already rated:
# Note: This code is faster than using dataframe methods.
# Example: Filter the data by user and then remove duplicate movie ids for each user.
# That approach avoids slow dataframe iteration, but the filter method is also slow.
for row in complete_list:
    if last_id != row[0]:
        movie_set= set()
        complete_list_no_dups.append(row)
        movie_set.add(row[1])
        gaps.append(gap_len)
        gap_len = 1
    else:
        if row[1] not in movie_set:
            complete_list_no_dups.append(row)
            gap_len+=1
            movie_set.add(row[1])
    last_id = row[0]

# Add the last gap_len:
gaps.append(gap_len)



full_index = 0 
bounds = [] 

for user_index in range(len(gaps)):
    bounds.append([full_index, full_index+gaps[user_index]])
    full_index+=gaps[user_index]    
 


# Rundown of the process:
# There are three categories of users:
# 1. SVD users, which are only there to help fit the SVD that predicts ratings for the train and test users
# 2. train users
# 3. test users

# Test and train users should have the same range for their number of ratings.
# SVD users should have a different range.

# There are 2 features to train the final model...
# against the target ratings of the train users:
# feature 1: average rating of the train users
# feature 2: SVD prediction for the train users

# There are 2 features to test the final model...
# against the target ratings of the test users:
# feature 1: average rating of the test users
# feature 2: SVD prediction for the test users


# These set the bounds on the number of ratings for SVD users and for train and test users.
    

SVD_USER_RATING_LB = 20
SVD_USER_RATING_UB = 30
USER_RATING_LB = 5
USER_RATING_UB = 10




random.shuffle(bounds)
no_svd_users = 1000
train_users = 800
test_users = 800
last_index = -1
bounds_svd_users = []
bounds_train_users = []
bounds_test_users = []



# The SVD, train, and test users are taken from consecutive, non-overlapping stretches of the shuffled bounds list.
# Each search must resume at the element after the last user already selected.
# Resuming at the last selected element itself would select that user twice
# whenever two groups share overlapping rating bounds.

next_index = len(bounds)
for index, item in enumerate(bounds):
    if SVD_USER_RATING_LB <= item[1]-item[0] <= SVD_USER_RATING_UB:
        bounds_svd_users.append(item)
        if len(bounds_svd_users) == no_svd_users:
            next_index = index + 1
            break


start_index = next_index
next_index = len(bounds)
for index, item in enumerate(bounds[start_index:], start=start_index):
    if USER_RATING_LB <= item[1]-item[0] <= USER_RATING_UB:
        bounds_train_users.append(item)
        if len(bounds_train_users) == train_users:
            next_index = index + 1
            break

for item in bounds[next_index:]:
    if USER_RATING_LB <= item[1]-item[0] <= USER_RATING_UB:
        bounds_test_users.append(item)
        if len(bounds_test_users) == test_users:
            break



# Transformed data of the selected SVD users, train users, and test users (in that order):
sampled_data = []


for bound in bounds_svd_users:
    for movie in complete_list_no_dups[bound[0]:bound[1]]:
        movie_data = provide_data(movie)
        sampled_data.append(movie_data)


for bound in bounds_train_users:
    for movie in complete_list_no_dups[bound[0]:bound[1]]:
        movie_data = provide_data(movie)
        sampled_data.append(movie_data)



for bound in bounds_test_users:
    for movie in complete_list_no_dups[bound[0]:bound[1]]:
        movie_data = provide_data(movie)
        sampled_data.append(movie_data)



print("Minutes taken:", (time.time()-start_time)/60)




Complete number of users: 260788
Minutes taken: 0.6642666419347127


## Write Data

Save the selected data in constructed_data/constructed_data_3.csv so that the cells below can run without rerunning this cell and the cells above it.


In [3]:
import csv
import os

current_directory = os.getcwd()
final_directory = os.path.join(current_directory, 'constructed_data')
if not os.path.exists(final_directory):
   os.makedirs(final_directory)

with open("constructed_data/constructed_data_3.csv", "w", encoding="utf-8", newline='') as f:
    writer = csv.writer(f)
    writer.writerow(['userId','id','rating',"title", "genres","production_companies","keywords", "cast", "tagline", "overview"])
    writer.writerows(sampled_data)

## Read Data
This is the starting cell to run if the data is already saved to constructed_data/constructed_data_3.csv.
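
One detail worth keeping in mind when reading the file back: `csv.reader` returns every field as a string, which is why later cells call `float(...)` on the ratings. The sketch below shows the round trip on made-up rows, using an in-memory buffer instead of the real file.

```python
import csv
import io

# An in-memory buffer stands in for the CSV file; the rows are made up.
buffer = io.StringIO(newline="")
writer = csv.writer(buffer)
writer.writerow(["userId", "id", "rating"])
writer.writerows([[1, 862, 4.0], [1, 5, 3.5]])

buffer.seek(0)
rows = list(csv.reader(buffer))[1:]  # drop the header row, as the cell below does
print(rows)

# Numeric columns must be converted explicitly after reading.
ratings = [float(row[2]) for row in rows]
print(ratings)
```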

In [7]:
import csv

data_list =[]

with open("constructed_data/constructed_data_3.csv", 'r', encoding="utf-8") as f:
    csv_reader = csv.reader(f)
    data_list = list(csv_reader)

data_list = data_list[1:]




## Format and Re-sample Data

Group the rows so that each user has a list of movie data rows, one for each movie that the user rated. Then, select a subset of users for each user type (SVD, train, and test) and relabel the user and movie ids with consecutive integers.
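
The grouping and relabeling can be sketched on a few made-up rows. This version uses `itertools.groupby` and `dict.setdefault` for brevity; the cell below does the same work with explicit loops and keeps separate movie-index dictionaries for the SVD, train, and test groups, which this sketch does not reproduce.

```python
from itertools import groupby

# Made-up [userId, movieId, rating] rows, already ordered by user.
rows = [
    ["17", "862", "4.0"],
    ["17", "5", "3.0"],
    ["42", "862", "5.0"],
]

# groupby only merges consecutive rows with equal keys,
# so the rows must already be ordered by user.
users = [list(group) for _, group in groupby(rows, key=lambda row: row[0])]

# Relabel movies and users with consecutive integers starting at 0.
movie_index = {}
for new_user_id, user_rows in enumerate(users):
    for row in user_rows:
        row[1] = movie_index.setdefault(row[1], len(movie_index))
        row[0] = new_user_id

print(users)
print(movie_index)
```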

In [8]:
import random
import copy


#LOOK: change back to zero...
SEED_INT = 5

random.seed(SEED_INT)

user_to_data_svd = []
user_to_data_train= []
user_to_data_test = []

user_id = data_list[0][0]
ratings = []
user_index = 0


# Note: Rows are grouped by consecutive equal ids, so two adjacent users that share an id would be merged into one user.

for row in data_list:
    if (row[0]!=user_id):
        if(user_index<1000 and user_index>=0):
            user_to_data_svd.append(ratings)
        elif(user_index<1800 and user_index>=1000):
            user_to_data_train.append(ratings)
        else:
            user_to_data_test.append(ratings)         
        user_id = row[0]
        ratings = [row]
        user_index+=1
    else:
        ratings.append(row)



user_to_data_test.append(ratings)


#sample and relabel user and movie indices
user_to_data_svd = random.sample(user_to_data_svd, 1000)
user_to_data_train = random.sample(user_to_data_train, 800)
user_to_data_test = random.sample(user_to_data_test, 800)


old_to_new_svd  = dict()
last_index_svd = 0
svd_cnt = 0

for user in user_to_data_svd:
    for row in user: 
        if(row[1] in old_to_new_svd.keys()):
            row[1] = old_to_new_svd[row[1]]
        else:
            old_to_new_svd[row[1]] = last_index_svd
            row[1] = last_index_svd
            last_index_svd+=1      
        row[0] = svd_cnt
    svd_cnt+=1


old_to_new_train = copy.deepcopy(old_to_new_svd)
last_index_train = last_index_svd
train_cnt = svd_cnt

for user in user_to_data_train:
    for row in user: 
        if(row[1] in old_to_new_train.keys()):
            row[1] = old_to_new_train[row[1]]
        else:
            old_to_new_train[row[1]] = last_index_train
            row[1] = last_index_train
            last_index_train+=1      
        row[0] = train_cnt
    train_cnt+=1

old_to_new_test = copy.deepcopy(old_to_new_svd)
last_index_test = last_index_svd
test_cnt = svd_cnt

for user in user_to_data_test:
    for row in user: 
        if(row[1] in old_to_new_test.keys()):
            row[1] = old_to_new_test[row[1]]
        else:
            old_to_new_test[row[1]] = last_index_test
            row[1] = last_index_test
            last_index_test+=1      
        row[0] = test_cnt
    test_cnt+=1




In [12]:
from gensim.parsing.preprocessing import remove_stopwords
from nltk.stem import WordNetLemmatizer
import nltk
nltk.download('wordnet')
import random
from ordered_set import OrderedSet
from collections import Counter
import numpy as np
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.metrics.pairwise import linear_kernel
from sklearn.cluster import KMeans
import time
from sklearn.metrics import mean_squared_error
from sklearn.metrics import mean_absolute_error
from sklearn.metrics import r2_score
from sklearn.model_selection import ParameterGrid
from surprise import SVD,Dataset,Reader
import pandas as pd
from numba import njit
import copy

SEED_INT = 5
# Seed for consistent results across runtimes:
random.seed(SEED_INT)
np.random.seed(SEED_INT)


# Note: The number of test users matters a lot in the svd_focus notebook,
# which means more testing should be done.


# Methods of getting SVD predictions for train and test users
# (all of these are worth trying):


# 1:
# Build an SVD model for every single test or train user combined with all of the SVD users,
# then make predictions for the single attached test or train user.
# Problem: too slow.

# 2:
# Batch processing:
# use all of the train users with the SVD users to build a model, then make predictions for the train users;
# use all of the test users with the SVD users to build a model, then make predictions for the test users.

# Problem: this does not answer a user's rating request immediately without a pool of users with the same range
# for their number of ratings.
# However, to solve this problem, this pool of similar users can still be accessed randomly from a database.
# Problem: train and test users are less helpful for building the SVD since they have fewer ratings.
# Requirement: the same number of train and test users, because otherwise the SVD works differently...
# and the final model (a combination of two features) will be led astray.
# Problem: the train and test users only approximate the client user that enters a number of ratings.
# Just because the user enters a certain number of ratings doesn't mean we know how...
# it corresponds to the number of movies they rated in the database. Answering this is a whole other question.
# Helpful idea: force the user to enter an exact number of (movie, rating) pairs...
# because we don't know how the distribution of the number of train/test user ratings within certain bounds
# correlates to the number of (movie, rating) pairs entered on the client.
# Helpful idea: this method can be slightly optimized for individual prediction by replacing the users that are
# similar to the user that requests a rating with a number of SVD users with a higher rating range.
# This is method 1, which cannot be used in bulk given the sheer amount of data and is only good for individual
# client requests. Method 2 at best produces scores similar to method 1, but method 1 is infeasible in testing.
# This should be noted in the README.


# 3:
# Train the SVD model with the SVD users,
# then train a little more with a single train or test user,
# then make a prediction for that single train or test user.
# Problem: it is hard to determine the right amount of training.
# Problem: the training order is off.


# Note: Method 2 is attempted first.


# These 6 are the variables for the final model:
feature_1_train = [] # user average ratings (if none exist, set to the train or test overall average)
feature_1_test = [] 
feature_2_train = [] # SVD predictions
feature_2_test = [] 
target_rating_train = []
target_rating_test = []


#used later
movie_ratings_avg_list_train = []
movie_ratings_avg_list_test = []



#temporary variables:
movies_order_svd = OrderedSet()
movie_ratings_sum_dict_svd = dict()
movie_ratings_count_dict_svd = dict()
svd_user_to_movie_to_rating = []
overall_average_svd = 0 
cnt_svd = 0


for user in user_to_data_svd:
    movie_to_rating  = dict()
    for movie in user:
        movies_order_svd.add(movie[1])
        movie_to_rating[movie[1]] = float(movie[2])
        if(movie[1] in movie_ratings_sum_dict_svd.keys()):
            movie_ratings_sum_dict_svd[movie[1]] += float(movie[2])
            movie_ratings_count_dict_svd[movie[1]] += 1
        else:
            movie_ratings_sum_dict_svd[movie[1]] = float(movie[2])
            movie_ratings_count_dict_svd[movie[1]] = 1
        overall_average_svd+=float(movie[2])
        cnt_svd += 1
    svd_user_to_movie_to_rating.append(movie_to_rating)



# Problem: some users are in both the SVD set and the train set.

movies_order_train = copy.deepcopy(movies_order_svd)
movie_ratings_sum_dict_train = copy.deepcopy(movie_ratings_sum_dict_svd)
movie_ratings_count_dict_train = copy.deepcopy(movie_ratings_count_dict_svd)
train_user_to_movie_to_rating = copy.deepcopy(svd_user_to_movie_to_rating)
overall_average_train = overall_average_svd 
cnt_train = cnt_svd
target_movie_train = []

for user in user_to_data_train:
    rand_num  = random.randint(0, len(user)-1)
    index = 0
    movie_to_rating  = dict()
    # These two variables are used for the user's average rating:
    user_rating_sum = 0
    usr_rating_count = 0
    for movie in user:
        movies_order_train.add(movie[1])
        if(index == rand_num):
            target_movie_train.append(movie[1])
            target_rating_train.append(float(movie[2]))
        else:
            if(movie[1] in movie_ratings_sum_dict_train.keys()):
                movie_ratings_sum_dict_train[movie[1]] += float(movie[2])
                movie_ratings_count_dict_train[movie[1]] += 1
            else:
                movie_ratings_sum_dict_train[movie[1]] = float(movie[2])
                movie_ratings_count_dict_train[movie[1]] = 1
            movie_to_rating[movie[1]] = float(movie[2])
            overall_average_train+=float(movie[2])
            cnt_train += 1
            user_rating_sum+=float(movie[2])
            usr_rating_count +=1
        index+=1
    if(user_rating_sum==0):
        feature_1_train.append(-1)
    else: 
        feature_1_train.append(user_rating_sum/usr_rating_count)
    train_user_to_movie_to_rating.append(movie_to_rating)


overall_average_train = overall_average_train/cnt_train

for i in range(len(feature_1_train)):
    if feature_1_train[i] ==-1:
        feature_1_train[i] = overall_average_train




movies_order_test = copy.deepcopy(movies_order_svd)
movie_ratings_sum_dict_test = copy.deepcopy(movie_ratings_sum_dict_svd)
movie_ratings_count_dict_test = copy.deepcopy(movie_ratings_count_dict_svd)
test_user_to_movie_to_rating = copy.deepcopy(svd_user_to_movie_to_rating)
overall_average_test = overall_average_svd 
cnt_test = cnt_svd
target_movie_test = []


#left off here
for user in user_to_data_test:
    rand_num  = random.randint(0, len(user)-1)
    index = 0
    movie_to_rating  = dict()
    user_rating_sum = 0
    usr_rating_count = 0
    for movie in user:
        movies_order_test.add(movie[1])
        if(index == rand_num):
            target_movie_test.append(movie[1])
            target_rating_test.append(float(movie[2]))
        else:
            if(movie[1] in movie_ratings_sum_dict_test.keys()):
                movie_ratings_sum_dict_test[movie[1]] += float(movie[2])
                movie_ratings_count_dict_test[movie[1]] += 1
            else:
                movie_ratings_sum_dict_test[movie[1]] = float(movie[2])
                movie_ratings_count_dict_test[movie[1]] = 1
            movie_to_rating[movie[1]] = float(movie[2])
            overall_average_test+=float(movie[2])
            cnt_test += 1
            user_rating_sum+=float(movie[2])
            usr_rating_count +=1
        index+=1
    if(user_rating_sum==0):
        feature_1_test.append(-1)
    else: 
        feature_1_test.append(user_rating_sum/usr_rating_count)
    test_user_to_movie_to_rating.append(movie_to_rating)

overall_average_test  = overall_average_test/cnt_test

for i in range(len(feature_1_test)):
    if feature_1_test[i] ==-1:
        feature_1_test[i] = overall_average_test


#checkpoint1...


#need to find movie_wise averages for...
#the svd+train set and the svd+test set
#train:
movie_ratings_avg_list_train = []
for movie in movies_order_train:
    if movie in movie_ratings_sum_dict_train.keys():
        movie_ratings_avg_list_train.append(movie_ratings_sum_dict_train[movie]/movie_ratings_count_dict_train[movie])
    else:
        movie_ratings_avg_list_train.append(overall_average_train)

#test:
movie_ratings_avg_list_test = []
for movie in movies_order_test:
    if movie in movie_ratings_sum_dict_test.keys():
        movie_ratings_avg_list_test.append(movie_ratings_sum_dict_test[movie]/movie_ratings_count_dict_test[movie])
    else:
        movie_ratings_avg_list_test.append(overall_average_test) 


train_users_to_movie_ratings = []
test_users_to_movie_ratings = []

for i in range(len(user_to_data_svd)):
    j = 0
    lst = []
    for movie in movies_order_train: 
        if movie in train_user_to_movie_to_rating[i].keys():
            lst.append(train_user_to_movie_to_rating[i][movie])
        else:
            lst.append(movie_ratings_avg_list_train[j])
        j += 1
    train_users_to_movie_ratings.append(lst)
    j = 0
    lst1 = []
    for movie in movies_order_test: 
        if movie in test_user_to_movie_to_rating[i].keys():
            lst1.append(test_user_to_movie_to_rating[i][movie])
        else:
            lst1.append(movie_ratings_avg_list_test[j])
        j += 1
    test_users_to_movie_ratings.append(lst1)



# Note: The dictionaries of the train (or test) users follow those of the SVD users in
# train_user_to_movie_to_rating (or test_user_to_movie_to_rating),
# so the index of a train or test user is offset by the number of SVD users.
for i in range(len(user_to_data_train)):
    user_dict = train_user_to_movie_to_rating[len(user_to_data_svd) + i]
    j = 0
    lst = []
    for movie in movies_order_train: 
        if(target_movie_train[i] == movie):
            lst.append(movie_ratings_avg_list_train[j])
        elif movie in user_dict.keys():
            lst.append(user_dict[movie])
        else:
            lst.append(movie_ratings_avg_list_train[j])
        j += 1
    train_users_to_movie_ratings.append(lst)

for i in range(len(user_to_data_test)):
    user_dict = test_user_to_movie_to_rating[len(user_to_data_svd) + i]
    j = 0
    lst = []
    for movie in movies_order_test: 
        if(target_movie_test[i] == movie):
            lst.append(movie_ratings_avg_list_test[j])
        elif movie in user_dict.keys():
            lst.append(user_dict[movie])
        else:
            lst.append(movie_ratings_avg_list_test[j])
        j += 1
    test_users_to_movie_ratings.append(lst)



# SVD methods:
@njit
def epoch(ratings, b1, b2, p, q, overall_average, lr, rt):
    # One pass of stochastic gradient descent over the (user, movie, rating) rows.
    for row in ratings:
        # Conversions are needed because the numpy array stores the indices as floats.
        u = int(row[0])
        i = int(row[1])
        r = row[2]

        pred = overall_average+b1[u]+b2[i]+np.dot(p[u],q[i])
        error = r-pred
        b1[u] += lr*(error- rt*b1[u])
        b2[i] += lr*(error- rt*b2[i])
        # temp holds the update of p[u] so that q[i] is updated with the old p[u].
        temp = lr*(error*q[i] -rt*p[u])
        q[i] += lr*(error*p[u] -rt*q[i])
        p[u] += temp



# Note: Every user index in the rows must be smaller than nof_users and every movie index smaller than nof_movies.
# Otherwise, the lookups into b1, b2, p, and q are out of bounds.

def svd_iterative(ratings, n, epochs, rt, lr, overall_average, nof_users, nof_movies):
    # n: number of latent factors, rt: regularization term, lr: learning rate
    q = np.random.normal(0, .1, (nof_movies, n))
    p = np.random.normal(0, .1, (nof_users, n))

    b1 = np.zeros(nof_users)
    b2 = np.zeros(nof_movies)

    np_array = np.array(ratings)

    for _ in range(epochs):
        epoch(np_array, b1, b2, p, q, overall_average, lr, rt)

    return b1, b2, p, q


# train_list = []
# train_index = 0
# train_movies_to_predict = []
# for user_ratings, user_dict in zip(train_users_to_movie_ratings, train_user_to_movie_to_rating):
#     for rating, movie_id in zip(user_ratings, list(movies_order_train)):
#         if movie_id in user_dict.keys():
#             train_list.append((train_index, int(movie_id), rating))
#         elif(train_index >= len(user_to_data_svd)):
#             if(movie_id == target_movie_train[train_index-len(user_to_data_svd)]):
#                 train_movies_to_predict.append((train_index, int(movie_id)))
#     train_index+=1



# Note: This version reads each rating directly from the user's dictionary instead of from train_users_to_movie_ratings.
train_list = []
train_index = 0
train_movies_to_predict = []
for user_dict in train_user_to_movie_to_rating:
    for movie_id in list(movies_order_train):
        if movie_id in user_dict.keys():
            train_list.append((train_index, int(movie_id), user_dict[movie_id]))
        elif(train_index >= len(user_to_data_svd)):
            if(movie_id == target_movie_train[train_index-len(user_to_data_svd)]):
                train_movies_to_predict.append((train_index, int(movie_id)))
    train_index+=1

random.shuffle(train_list)

#train_list and train_movies to predict ready for usage...


# Note: The user indices must follow two separate orders,
# SVD then train and SVD then test,
# rather than a single order of SVD, train, test.

# test_list = []
# test_index = 0
# test_movies_to_predict = []
# for user_ratings, user_dict in zip(test_users_to_movie_ratings, test_user_to_movie_to_rating):
#     for rating, movie_id in zip(user_ratings, list(movies_order_test)):
#         if movie_id in user_dict.keys():
#             test_list.append((test_index, int(movie_id), rating))
#         elif(test_index >= len(user_to_data_svd)):     
#             if(movie_id == target_movie_test[test_index-len(user_to_data_svd)]):
#                 test_movies_to_predict.append((test_index, int(movie_id)))
#     test_index+=1

test_list = []
test_index = 0
test_movies_to_predict = []
for user_dict in test_user_to_movie_to_rating:
    for movie_id in list(movies_order_test):
        if movie_id in user_dict.keys():
            test_list.append((test_index, int(movie_id), user_dict[movie_id]))
        elif(test_index >= len(user_to_data_svd)):     
            if(movie_id == target_movie_test[test_index-len(user_to_data_svd)]):
                test_movies_to_predict.append((test_index, int(movie_id)))
    test_index+=1

random.shuffle(test_list)


#test_list and test_movies to predict ready for usage


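# Now invoke the SVD for the train and test sets:

# Signature: svd_iterative(train_list, n, epochs, rt, lr, overall_average, nof_users, nof_movies)
# The calls below use n = 50 latent factors, 30 epochs, rt = .03 (regularization term), and lr = .0075 (learning rate).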

b1, b2, p, q = svd_iterative(train_list, 50,30,.03,.0075,
                            overall_average_train, len(train_users_to_movie_ratings), len(list(movies_order_train)))

feature_2_train = [overall_average_train + b1[pair[0]]+b2[pair[1]]
                            +np.dot(p[pair[0]],q[pair[1]]) for pair in train_movies_to_predict]

b1, b2, p, q = svd_iterative(test_list, 50,30,.03,.0075,
                            overall_average_test, len(test_users_to_movie_ratings), len(list(movies_order_test)))

feature_2_test = [overall_average_test + b1[pair[0]]+b2[pair[1]]
                            +np.dot(p[pair[0]],q[pair[1]]) for pair in test_movies_to_predict]
                     
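# RMSE and R^2 of the SVD predictions (feature 2) for the train users, then for the test users:
# Note: np.sqrt is applied explicitly because newer scikit-learn releases no longer accept the squared parameter.
print(np.sqrt(mean_squared_error(target_rating_train, feature_2_train)))
print(r2_score(target_rating_train, feature_2_train))
print(np.sqrt(mean_squared_error(target_rating_test, feature_2_test)))
print(r2_score(target_rating_test, feature_2_test))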



[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\jackson\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


1.0125046273267093
0.26576702553132336
1.0563343901436708
0.24348389376077495
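
For reference, the update that `epoch` applies to each (user $u$, movie $i$, rating $r_{ui}$) row can be written out as below. The symbols are shorthand introduced here, not names from the code: $\mu$ is `overall_average`, $b_u$ is `b1[u]`, $b_i$ is `b2[i]`, $\gamma$ is `lr`, and $\lambda$ is `rt`.

```latex
\begin{aligned}
\hat{r}_{ui} &= \mu + b_u + b_i + p_u^{\top} q_i, \qquad e_{ui} = r_{ui} - \hat{r}_{ui} \\
b_u &\leftarrow b_u + \gamma \left( e_{ui} - \lambda\, b_u \right) \\
b_i &\leftarrow b_i + \gamma \left( e_{ui} - \lambda\, b_i \right) \\
p_u &\leftarrow p_u + \gamma \left( e_{ui}\, q_i - \lambda\, p_u \right) \\
q_i &\leftarrow q_i + \gamma \left( e_{ui}\, p_u - \lambda\, q_i \right)
\end{aligned}
```

The right-hand sides of the last two lines both use the values of $p_u$ and $q_i$ from before the step, which is what the `temp` variable in `epoch` ensures. The four numbers printed above are the RMSE and $R^2$ of $\hat{r}_{ui}$ on the held-out target rating of each train user, followed by the same two metrics for the test users.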
