## Download data source

* Download the data needed for this jupyter notebook from kaggle and store it in a new folder (the-movies-dataset) in the current directory.


* Upon running this cell, the user will be asked for their username and key which can be found in a fresh api token from kaggle.

* Instructions to get api token to authenticate the data request (Note: kaggle account required):
    1. Sign into kaggle.
    2. Go to the 'Account' tab of your user profile and select 'Create New Token'. 
    3. This will trigger the download of kaggle.json, a file containing your API credentials.

* If the folder has been created and the files are already in that folder, than this cell does nothing and requires no credentials.

* Data Source Information: https://www.kaggle.com/datasets/rounakbanik/the-movies-dataset?select=movies_metadata.csv


In [1]:
import opendatasets as od

od.download("https://www.kaggle.com/datasets/rounakbanik/the-movies-dataset")

Please provide your Kaggle credentials to download this dataset. Learn more: http://bit.ly/kaggle-creds
Your Kaggle username:

Your Kaggle Key:Downloading the-movies-dataset.zip to ./the-movies-dataset


100%|██████████| 228M/228M [00:23<00:00, 10.2MB/s] 





## Combine Raw Data

Combining certain data from the necessary csv files into a single dataframe (complete_df).

* Rows are removed from each dataframe when they do not have sufficent data for a column or the data from a column does not exist.
* This kind of row removal is done before multiple copies of the same movie data becomes present in multple rows, to save time and space.
* Iteration through rows of a dataframe at this level is inefficient compared to list iteration.
* This is why the dataframes are converted into lists before iteration and then back again to dataframes, so the merge function can be applied to combine the data into a single dataframe (complete_df).

In [1]:
import pandas as pd
import time

start_time = time.time()


pd.set_option('display.max_colwidth', None)

movies_df = pd.read_csv('./the-movies-dataset/movies_metadata.csv',usecols=("genres","id" ,"title","tagline", "overview","production_companies"),
                          dtype={'genres':"string","id":"string","title": "string", "tagline": "string","overview":"string",
                                    "production_companies" :"string"})[["genres","id" ,"title","tagline", "overview","production_companies"]]
movies_df.dropna(inplace = True)
movies_lst = [row for row in movies_df.values.tolist() if not (row[0][len(row[0])  - 2:] == "[]" or row[5][len(row[5]) - 2:] == "[]")]
movies_df = pd.DataFrame(movies_lst, columns = ("genres","id" ,"title","tagline", "overview","production_companies"), dtype = str)

print(len(movies_df))

#LOOK: trying small dataset for testing

ratings_df = pd.read_csv('./the-movies-dataset/ratings.csv', usecols = ("userId", "movieId", "rating"),
                       dtype={"userId": "string","movieId": "string","rating": "string"})[["userId", "movieId", "rating"]]
ratings_df.rename(columns={"movieId": "id"}, inplace = True)
ratings_df.dropna(inplace = True)


# Question: What if the removal of duplicate movie ids per user was processed here instead of the cell below???
# Answer: The duplicate removal function can be ran here,...
# but the complete_list in the cell below can also be iterated over with relative complexity in order to remove duplicates.
# The iteration in the next cell also populates the gap list...
# which is critical to be ran directly before the function that determines bounds for users rated movies.
# So, omitting the no duplicate function in this cell and making it run in the next cell avoids redundant iteration.


# Question: What if the test and train ratings bounds was enforced here instead of the cell below???
# Answer: The merge functions below needs to be executed before determining test and train users, because merge will remove rows and ratings from users...
# before enforcing the users to be in a certain bounds for the number of their ratings. 
# The current timing of this function will ensure that the final users are within the set train or test bounds.


keywords_df = pd.read_csv('./the-movies-dataset/keywords.csv', usecols = ("id", "keywords"), dtype={"id": "string","keywords":"string"})[["id", "keywords"]]
keywords_df.dropna(inplace = True)
keywords_lst = [row for row in keywords_df.values.tolist() if not (row[1][len(row[1])  - 2:] == "[]")]
keywords_df = pd.DataFrame(keywords_lst, columns = ("id", "keywords"), dtype = str)


credits_df = pd.read_csv("./the-movies-dataset/credits.csv", usecols = ("cast", "id"), dtype={"cast": "string", "id": "string"})[["cast", "id"]]
credits_df.dropna(inplace = True)
credits_lst = [row for row in credits_df.values.tolist() if (not row[0][len(row[0])  - 2:] == "[]")]
credits_df = pd.DataFrame(credits_lst, columns = ("cast", "id"), dtype = str)


# Default merge is inner: This only keeps movies that have the id existing in both dataframes.
complete_df =  pd.merge(movies_df, ratings_df, on ="id")
complete_df =  pd.merge(complete_df,keywords_df, on ="id")
complete_df  = pd.merge(complete_df,credits_df, on ="id")


print(len(complete_df["id"].unique()))

#LOOK: sort the movie id column then rerwrite the movie ids by...
#the lowest value starting at zero and onward
#LOOK: in the application of this model there need to be a mapping of...
#terms the user enters and the correct movie inthe database
# complete_df.sort_values(by = 'id', inplace=True)
# sorted_movies = list(complete_df["id"])
# movie_id = sorted_movies[0]
# i = 0
# sorted_movies[0] = i 
# for j in range(1, len(sorted_movies)):
#     if sorted_movies[j] != movie_id:
#         movie_id = sorted_movies[j]
#         i += 1
#         sorted_movies[j] = i
#     else: 
#         sorted_movies[j] = i
# complete_df["id"] = sorted_movies



complete_df.sort_values(by = 'userId', inplace = True)


# Master dataframe: For each (user id, movie id) row combination there is the combined movie data from movies_df, ratings_df, keywords_df, and credits_df for the movie id in question.
# The columns are reordered.
complete_df  = complete_df.loc[:,['userId','id','rating',"title", "genres","production_companies","keywords", "cast", "tagline", "overview" ]]


print(len(complete_df["id"].unique()))
# print(complete_df.tail())


# For testing:
# print("Minutes taken:", (time.time()-start_time)/60)
# print(complete_df.head())



# Tested on personal machine:
# Old run with dataframe iteration (old code): 1 minute and 5.7 seconds
# New run with list conversion before iteration (current code): 37.1 seconds

17651
3276
3276


## Data Extraction and Selection
1. Select data from users that have a number of ratings within a certain bounds.
2. Select a random subset of this data and simplify it.


In [3]:
import ast
import random
import time
# import matplotlib.pyplot as plt
# from matplotlib.pyplot import hist

start_time = time.time()


# LOOK: To make a fair comparison to the best possible implementation of the netflix data
# a closer distribution of user ratings to the netflix data should be selected 
# perhaps simalir proportion of users in each increment of 10 ratings
# problem: this does not correctly represent the population of the non-netflix dataset so it would have weaker applciaiton!!!
# But, another question is, is the netflix data biased???
# was the whole problem statment more theoretical then practical???
# how is it that the two datasets are vastly different???



# Also, since these are different datasets, a higher distribution is more extreme for this dataset than the netflix dataset
# meaning there is a higher selection bias for this dataset 

# the right distibution of users ratings should be selected with trial and error


# Note: in the netflix data, the distribution of nof user ratings does not change for users tested and users not tested
# this should be mimicked with this data



SEED_INT = 5
# Seed for consistent results across runtimes:
random.seed(SEED_INT)


def populate_names(item):
    """Extract names from the syntax of certain data entries:"""
    string  = item[1:-1]
    jsons = string.split("}, ")   
    names = ""
    index = 0
    for item in jsons:
        if(index == len(jsons)-1):
            temp_dict = ast.literal_eval(item)
            names+=str(temp_dict["name"])
        else:
            temp_dict = ast.literal_eval(item+"}")
            names+=str(str(temp_dict["name"])+" ")
        index += 1
    return names


def provide_data(row):
    """Extract data from row of complete_list:"""
    movie_data = []
    movie_data.append(int(row[0]))
    movie_data.append(int(row[1]))
    movie_data.append(float(row[2]))
    movie_data.append(row[3])  

    movie_data.append(populate_names(row[4]))
    movie_data.append(populate_names(row[5]))
    movie_data.append(populate_names(row[6]))
    movie_data.append(populate_names(row[7]))

    movie_data.append(str(row[8]))
    movie_data.append(str(row[9]))
    return movie_data
    


# The list of rows with users id, the users rating for the movie, and raw data for the movie:
# Note: It is sorted by user_id.
complete_list = complete_df.values.tolist()

print("Complete number of users:", len(list(complete_df["userId"].unique()))) # 260788

# The complete list of user rows without ratings of the same movie more than once for a given user:
complete_list_no_dups = []

# Distinquish the user the row belongs to:
last_id = complete_list[0][0]

# The set of movies that a user has rated:
# It is used to omit later ratings of a movie that the user has already rated.
movie_set = set()

# The number of rows of movie data a single user takes up for each user:
gaps = []

# Appended to gaps when all of a users rows of movie data have been counted:
gap_len = 0


# Populates gaps and complete_list_no_dups by omitting movies that already have a rating in respect to each user:
# Note: This code is faster than using dataframe methods.
# Example: Filter data by user and then remove duplicate movie ids for each user.
# This avoids slow dataframe iteration, but the filter method is also slow.
for row in complete_list:
    if last_id != row[0]:
        movie_set= set()
        complete_list_no_dups.append(row)
        movie_set.add(row[1])
        gaps.append(gap_len)
        gap_len = 1
    else:
        if row[1] not in movie_set:
            complete_list_no_dups.append(row)
            gap_len+=1
            movie_set.add(row[1])
    last_id = row[0]

# Add the last gap_len:
gaps.append(gap_len)



# Index in the complete_list_no_dups list:
full_index = 0 
bounds = [] 


#LOOK the problem stems from how complete_list_no_dups is built...


# Populates bounds_train and bounds_test by testing each user if they are a valid train or test user:
for user_index in range(len(gaps)):
    bounds.append([full_index, full_index+gaps[user_index]])
    full_index+=gaps[user_index]    



#LOOK: change back to 1300 when bounds are needed!
#LOOK: sampling does not remove items...
bounds_train = random.sample(bounds, 5)


bounds_test = random.sample(bounds_train, 3)

print(bounds_train)
print(bounds_test)


#LOOK: Now with the remaining bounds that were not sampled, iterate through them until...
# a number of users within a certain bounds of ratings has been selelcted


# commented out to test no trainbounds
# random.shuffle(bounds)
# bounds_test = []

# for item in bounds:
#     if item[1]-item[0] >=30 and item[1]-item[0] <=50:
#         bounds_test.append(item)
#         if len(bounds_test) == 200:
#             break



# Transformed data of the selected train users and test users (in that order):
sampled_data = []


cnt = 0

for bound in bounds_train:
    for movie in complete_list_no_dups[bound[0]:bound[1]]:
        movie_data = provide_data(movie)
        movie_data[0] = cnt
        sampled_data.append(movie_data)
    cnt+=1

for bound in bounds_test:
    for movie in complete_list_no_dups[bound[0]:bound[1]]:
        movie_data = provide_data(movie)
        movie_data[0] = cnt
        sampled_data.append(movie_data)
    cnt+=1




print(cnt)

print("Minutes taken:", (time.time()-start_time)/60)






Complete number of users: 260788
[[4834456, 4834516], [1965866, 1965874], [5736573, 5736587], [2778119, 2778123], [6162657, 6162782]]
[[6162657, 6162782], [4834456, 4834516], [1965866, 1965874]]
8
Minutes taken: 0.1821999708811442


## Write Data

Save selected data in constructed_data.csv file so that cells below it can run without running this cell and above.


In [16]:
import csv
import os

current_directory = os.getcwd()
final_directory = os.path.join(current_directory, 'constructed_data')
if not os.path.exists(final_directory):
   os.makedirs(final_directory)

with open("constructed_data/constructed_data_2.csv", "w", encoding="utf-8", newline='') as f:
    writer = csv.writer(f)
    writer.writerow(['userId','id','rating',"title", "genres","production_companies","keywords", "cast", "tagline", "overview"])
    writer.writerows(sampled_data)

In [17]:
#This cell is for testing how long the ratings small is...
import csv

data_list =[]

with open("ratings_small/ratings_small.csv", 'r', encoding="utf-8") as f:
    csv_reader = csv.reader(f)
    data_list = list(csv_reader)

data_list = data_list[1:]

cnt =0 
id = -1
for item in data_list:
     if item[0] != id:
          cnt+=1
          id = item[0]

print(cnt)


671


## Read Data
This is the starting cell to run if the data is already saved to the constructed_data.csv. 

In [18]:
import csv

data_list =[]

with open("constructed_data/constructed_data_2.csv", 'r', encoding="utf-8") as f:
    csv_reader = csv.reader(f)
    data_list = list(csv_reader)

data_list = data_list[1:]


## Format and re-sample Data:

Format the data into a list of movie data rows for each movie rated for the user for each user. Then, select a subset of that data for each user type.

In [19]:
import random


SEED_INT = 5

random.seed(SEED_INT)

user_to_data_train = []
user_to_data_test = []

user_id = data_list[0][0]

ratings = []


i = 0
for row in data_list:
    if (row[0]!=user_id):
        if(i<521):
            user_to_data_train.append(ratings)
        else:
            user_to_data_test.append(ratings)
        user_id = row[0]
        ratings = [row]
    else:
        ratings.append(row)
    i+=1



user_to_data_test.append(ratings)







#LOOK: Should try introducing higher rating bounds for train users
#in contrast to the other model in complete_11_03_2023.ipynb the test users need to be picked completely randomly 


## Create Features and Target values

* The train and test version of feature 1,2, and 3 are populated and in the final cell some subset of (feature 1, 2 and 3) is used to train and test the final model.
* The target values are ratings for each user from the randomly selected movie that they rated. They are also either train or test ratings used to train or test the the final model.


In [20]:
from gensim.parsing.preprocessing import remove_stopwords
from nltk.stem import WordNetLemmatizer
import nltk
nltk.download('wordnet')
import random
from ordered_set import OrderedSet
from collections import Counter
import numpy as np
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.metrics.pairwise import linear_kernel
from sklearn.cluster import KMeans
import time
from sklearn.metrics import mean_squared_error
from sklearn.metrics import mean_absolute_error
from sklearn.metrics import r2_score
from surprise import SVD,Dataset,Reader
import pandas as pd
from numba import njit

start_time = time.time()

N_VALUE = 100

SEED_INT = 5
# Seed for consistent results across runtimes:
random.seed(SEED_INT)



movies_order = OrderedSet()
movie_ratings_sum_dict = dict()
movie_ratings_count_dict = dict()
overall_average = 0
cnt = 0

train_user_to_movie_to_rating = [] 

for user in user_to_data_train:
    movie_to_rating  = dict()
    for movie in user:
        movies_order.add(movie[1])
        movie_to_rating[movie[1]] = float(movie[2])
        if(movie[1] in movie_ratings_sum_dict.keys()):
            movie_ratings_sum_dict[movie[1]] += float(movie[2])
            movie_ratings_count_dict[movie[1]] += 1
        else:
            movie_ratings_sum_dict[movie[1]] = float(movie[2])
            movie_ratings_count_dict[movie[1]] = 1
        overall_average+=float(movie[2])
        cnt += 1
    train_user_to_movie_to_rating.append(movie_to_rating)


test_user_to_movie_to_rating = [] 
target_movie = []
target_rating = []

for user in user_to_data_test:
    rand_num  = random.randint(0, len(user)-1)
    index = 0
    movie_to_rating  = dict()
    for movie in user:
        movies_order.add(movie[1])
        if(index == rand_num):
            target_movie.append(movie[1])
            target_rating.append(float(movie[2]))
        else:
            if(movie[1] in movie_ratings_sum_dict.keys()):
                movie_ratings_sum_dict[movie[1]] += float(movie[2])
                movie_ratings_count_dict[movie[1]] += 1
            else:
                movie_ratings_sum_dict[movie[1]] = float(movie[2])
                movie_ratings_count_dict[movie[1]] = 1
            movie_to_rating[movie[1]] = float(movie[2])
            overall_average+=float(movie[2])
            cnt += 1
        index+=1
    test_user_to_movie_to_rating.append(movie_to_rating)

overall_average  = overall_average/cnt

movie_ratings_avg_list = []



for movie in movies_order:
    if movie in movie_ratings_sum_dict.keys():
        movie_ratings_avg_list.append(movie_ratings_sum_dict[movie]/movie_ratings_count_dict[movie])
    else:
        movie_ratings_avg_list.append(overall_average)
    

train_user_averages = []
test_user_averages = []
for user in train_user_to_movie_to_rating:
    if len(user)==0:
        train_user_averages.append(overall_average)
    else:
        train_user_averages.append(sum([user[key] for key in user.keys()])/len(user))
for user in test_user_to_movie_to_rating:
    if len(user)==0:
        test_user_averages.append(overall_average)
    else:
        test_user_averages.append(sum([user[key] for key in user.keys()])/len(user))
        
user_averages = train_user_averages + test_user_averages


train_users_to_movie_ratings_transformed = []
test_users_to_movie_ratings_transformed = []

train_users_to_movie_ratings = []
test_users_to_movie_ratings = []


user_movie = []
overall_average_scaled = 0
cnt = 0

for i in range(len(user_to_data_train)):
    j = 0
    lst_1 = []
    lst_2 = []
    for movie in movies_order: 
        if movie in train_user_to_movie_to_rating[i].keys():
            lst_1.append(train_user_to_movie_to_rating[i][movie] - movie_ratings_avg_list[j])
            lst_2.append(train_user_to_movie_to_rating[i][movie])
            overall_average_scaled+=train_user_to_movie_to_rating[i][movie] - movie_ratings_avg_list[j]
            cnt+=1
            user_movie.append([i, j])
        else:
            lst_1.append(0)
            lst_2.append(movie_ratings_avg_list[j])
        j += 1
    train_users_to_movie_ratings_transformed.append(lst_1)
    train_users_to_movie_ratings.append(lst_2)

target_movie_index = []

for i in range(len(user_to_data_test)):
    j = 0
    lst_1 = []
    lst_2 = []
    for movie in movies_order: 
        if(target_movie[i] == movie):
            lst_1.append(0)
            lst_2.append(movie_ratings_avg_list[j])
            target_movie_index.append(j)
        elif movie in test_user_to_movie_to_rating[i].keys():
            lst_1.append(test_user_to_movie_to_rating[i][movie] - movie_ratings_avg_list[j])
            lst_2.append(test_user_to_movie_to_rating[i][movie])
            overall_average_scaled+=test_user_to_movie_to_rating[i][movie] - movie_ratings_avg_list[j]
            cnt+=1
            user_movie.append([len(user_to_data_train)+i, j])
        else:
            lst_1.append(0)
            lst_2.append(movie_ratings_avg_list[j])
        j += 1
    test_users_to_movie_ratings_transformed.append(lst_1)
    test_users_to_movie_ratings.append(lst_2)

overall_average_scaled = overall_average_scaled/cnt

users_to_movie_ratings_transformed = train_users_to_movie_ratings_transformed + test_users_to_movie_ratings_transformed
users_to_movie_ratings = train_users_to_movie_ratings + test_users_to_movie_ratings




#LOOK: need to try to optimize this function
def train(user_to_ratings, n, overall_average, user_movie):

    rt = .02
    lr = .005
    epochs  = 20

    np.random.seed(SEED_INT)
    R = np.array(user_to_ratings)
    q = np.random.normal(0, .1, (R.shape[1], n))
    p = np.random.normal(0, .1, (R.shape[0], n))

    b1 = [0]*R.shape[0]
    b2 = [0]*R.shape[1]

    R_hat = np.zeros((R.shape[0], R.shape[1]))
    eui = np.zeros((R.shape[0], R.shape[1]))


    #LOOK: caching is not done here
    #LOOK: R is not shuffled
    for _ in range(epochs):
        for u,i in user_movie:
            R_hat[u][i] = overall_average+b1[u]+b2[i]+np.dot(p[u],q[i])
            eui[u][i] = R[u][i]-R_hat[u][i]
            b1[u] += lr*(eui[u][i]- rt*b1[u])
            b2[i] += lr*(eui[u][i]- rt*b2[i])
            temp = lr*(eui[u][i]*q[i] -rt*p[u])
            q[i] += lr*(eui[u][i]*p[u] -rt*q[i])
            p[u] += temp


    return b1, b2, p, q



def epoch(train_list, b1, b2, p, q, overall_average, lr, rt):
    for row in train_list:
        u = int(row[0])
        i = int(row[1])
        r = float(row[2])

        pred = overall_average+b1[u]+b2[i]+np.dot(p[u],q[i])
        error = r-pred
        b1[u] += lr*(error- rt*b1[u])
        b2[i] += lr*(error- rt*b2[i])
        temp = lr*(error*q[i] -rt*p[u])
        q[i] += lr*(error*p[u] -rt*q[i])
        p[u] += temp
    # return b1, b2, p, q


#LOOK: (faster train)
def train_v2(train_list, n, overall_average, nof_users, nof_movies):

    rt = .02
    lr = .005
    epochs  = 20

    np.random.seed(SEED_INT)

    #LOOK: all these three arrays are unnecessary insetad use
    # R = np.array(user_to_ratings)
    # R_hat = np.zeros((R.shape[0], R.shape[1]))
    # eui = np.zeros((R.shape[0], R.shape[1]))


    #these values now need to be passsed!!!
    q = np.random.normal(0, .1, (nof_movies, n))
    p = np.random.normal(0, .1, (nof_users, n))

    b1 = np.zeros(nof_users)
    b2 = np.zeros(nof_movies)

    #LOOK: caching is not done here
    #LOOK: R is not shuffled
    #LOOK: after the movies are indexed from 0 some movies are not included since traget movies are omited from movie_order

    for _ in range(epochs):
        # for row in train_list:
        #     u = int(row[0])
        #     i = int(row[1])
        #     r = float(row[2])

        #     pred = overall_average+b1[u]+b2[i]+np.dot(p[u],q[i])
        #     error = r-pred
        #     b1[u] += lr*(error- rt*b1[u])
        #     b2[i] += lr*(error- rt*b2[i])
        #     temp = lr*(error*q[i] -rt*p[u])
        #     q[i] += lr*(error*p[u] -rt*q[i])
        #     p[u] += temp
        epoch(np.array(train_list), b1, b2, p, q, overall_average, lr, rt)

    return b1, b2, p, q


train_list = []

index =0
for user_ratings, user_dict in zip(train_users_to_movie_ratings, train_user_to_movie_to_rating):
    for rating, movie_id in zip(user_ratings, list(movies_order)):
        if movie_id in user_dict.keys():
            train_list.append((index, int(movie_id), rating))
    index+=1

i = 0
user_movies_to_predict = []
for user_ratings, user_dict in zip(test_users_to_movie_ratings, test_user_to_movie_to_rating):
    for rating, movie_id in zip(user_ratings, list(movies_order)):
        if movie_id in user_dict.keys():
            train_list.append((index, int(movie_id), rating))
        elif(movie_id == target_movie[i]):
            user_movies_to_predict.append((index, int(movie_id)))
    index+=1
    i+=1

random.shuffle(train_list)

reader = Reader()
train_df = pd.DataFrame(train_list, columns=['userId', 'movieId', 'rating'])
dataset = Dataset.load_from_df(train_df, reader)
trainset = dataset.build_full_trainset()

time_start_2 = time.time()
svd_model = SVD(random_state = SEED_INT)
svd_model_trained = svd_model.fit(trainset)

predictions_0 = []

for item in user_movies_to_predict:
    predictions_0.append(svd_model_trained.predict(item[0], item[1], verbose=False).est)

print(mean_absolute_error(target_rating , predictions_0))
# print(mean_squared_error(target_rating , predictions_0, squared = False))
# print(r2_score(target_rating , predictions_0))
print(time.time()- time_start_2)


#normalized version:
time_start_3 = time.time()
b1, b2, p, q = train(users_to_movie_ratings_transformed, N_VALUE, overall_average_scaled, user_movie)
predictions_1 = [movie_ratings_avg_list[target_movie_index[u]]+ overall_average_scaled 
                 + b1[len(user_to_data_train)+u]+b2[target_movie_index[u]]
                 +np.dot(p[len(user_to_data_train)+u],q[target_movie_index[u]]) for u in range(len(user_to_data_test))]

print(mean_absolute_error(target_rating , predictions_1))
# print(mean_squared_error(target_rating , predictions_1, squared = False))
# print(r2_score(target_rating , predictions_1))
print(time.time()- time_start_3)

#LOOK: after running... it is apparent that normalization does not improve performance
#non-normalized version:
time_start_4 = time.time()
b1, b2, p, q = train(users_to_movie_ratings, N_VALUE, overall_average, user_movie)
predictions_2 = [overall_average + b1[len(user_to_data_train)+u]+b2[target_movie_index[u]]
                 +np.dot(p[len(user_to_data_train)+u],q[target_movie_index[u]]) for u in range(len(user_to_data_test))]

print(mean_absolute_error(target_rating , predictions_2))
# print(mean_squared_error(target_rating , predictions_2, squared = False))
# print(r2_score(target_rating , predictions_2))
print(time.time() - time_start_4)


time_start_5 = time.time()
b1, b2, p, q = train_v2(train_list, N_VALUE, overall_average, len(users_to_movie_ratings), len(list(movies_order)))
predictions_3 = [overall_average + b1[pair[0]]+b2[pair[1]]
                 +np.dot(p[pair[0]],q[pair[1]]) for pair in user_movies_to_predict]

user_movies_to_predict.append((index, movie_id))

print(mean_absolute_error(target_rating , predictions_3))
# print(mean_squared_error(target_rating , predictions_3, squared = False))
# print(r2_score(target_rating , predictions_3))
print(time.time() - time_start_5)

print(time.time()- start_time)


#LOOK: need to try creating svd from the iterative process and compare the automated one (done)

#LOOK: try using r svd without pre and post scaling (done)

#LOOK: 
#sample code: https://www.kaggle.com/code/ankitahankare/collaborative-filtering-based-recommender-system
#A difference between his and my implementation is that in mine the target values for test users are never included in training
#this is simply because, in a realistic implementation of the model the users dont have these ratings
#Another difference is that the target ratings in his are randomly chosen (user,movie) pairs, where in mine they are one per test user

#LOOK: Try running the sample code (done) 

#LOOK: need to try with the small dataset (done) 

#LOOK: need to try removing random seeds (done)

#LOOK: Try changing outer and inner loop (done)

#LOOK: try less training, more training 

#LOOK: With train svd_full_alt_sgd how do you train the with the train data only without using the fill in values for target movies???

#LOOK: need to try data scaling...


#with ratings small: 521 
#train users: 150 test users

#seed_int: 3
#stock model: 0.73469082908731 
#other model: 0.7311272710900113

#seed_int: 2
#stock model: 0.7287513354608883
#other model: 0.7262573325419073

#seed_int: 4
#stock model: 0.6955065011420399
#other model: 0.7009906552586537

#train 2: seed 2: 0.712004583353562
#train: seed 2:0 .7088469144533179

#train 2: seed 3: 0.750989364450266
#train: seed 3: 0.7486500086499461

#train 2: seed 4: 
#train: seed 4: 

[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\jackson\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


0.7565252196712352
0.22500300407409668
0.7621295756966385
5.036997556686401
0.7595544087809621
5.015999794006348


IndexError: index 2939 is out of bounds for axis 0 with size 1500