## Download data source

* Download the data needed for this Jupyter notebook from Kaggle and store it in a new folder (the-movies-dataset) in the current directory.

* When this cell runs, it prompts for your Kaggle username and key, both of which can be found in a fresh Kaggle API token.

* To get an API token for authenticating the data request (note: a Kaggle account is required):
    1. Sign in to Kaggle.
    2. Go to the 'Account' tab of your user profile and select 'Create New Token'.
    3. This triggers the download of kaggle.json, a file containing your API credentials.

* If the folder already exists and contains the files, then this cell does nothing and requires no credentials.

* Data Source Information: https://www.kaggle.com/datasets/rounakbanik/the-movies-dataset?select=movies_metadata.csv
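
Whether the download cell will ask for credentials can be checked by hand first. The following is a minimal sketch, assuming the folder name used in this notebook; `dataset_present` and `REQUIRED_FILES` are hypothetical helpers for illustration and are not part of opendatasets:

```python
from pathlib import Path

# File names read later in this notebook; adjust the tuple if other files are needed.
REQUIRED_FILES = ("movies_metadata.csv", "ratings_small.csv", "keywords.csv", "credits.csv")


def dataset_present(folder, required=REQUIRED_FILES):
    """Return True if `folder` exists and contains every file named in `required`."""
    folder = Path(folder)
    return folder.is_dir() and all((folder / name).is_file() for name in required)


if dataset_present("./the-movies-dataset"):
    print("Dataset found; the download cell should not need credentials.")
else:
    print("Dataset missing; have your Kaggle username and key ready.")
```
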


In [1]:
import opendatasets as od

od.download("https://www.kaggle.com/datasets/rounakbanik/the-movies-dataset")

Please provide your Kaggle credentials to download this dataset. Learn more: http://bit.ly/kaggle-creds
Your Kaggle username:

Your Kaggle Key:Downloading the-movies-dataset.zip to ./the-movies-dataset


100%|██████████| 228M/228M [00:23<00:00, 10.2MB/s] 





## Combine Raw Data

Combine selected columns from the necessary CSV files into a single dataframe (complete_df).

* Rows are removed from each dataframe when a required column is missing or holds an empty list.
* This row removal is done before the merge, while each movie still occupies a single row, to save time and space. After the merge, the same movie data is repeated across multiple rows.
* Iterating over the rows of a dataframe is inefficient compared to iterating over a list.
* This is why the dataframes are converted into lists before iteration and then back into dataframes, so that the merge function can combine the data into a single dataframe (complete_df).
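
The filter-then-merge order described above can be illustrated without pandas. This is a minimal sketch on toy rows: plain lists stand in for the dataframes, and a dictionary lookup mimics what an inner merge on `id` keeps. The helper names `has_entries` and `inner_join` and the sample data are made up for illustration:

```python
def has_entries(cell):
    """True if a stringified list such as "[{'id': 1, 'name': 'x'}]" is present and non-empty."""
    return cell is not None and not cell.strip().endswith("[]")


def inner_join(left_rows, right_rows):
    """Join two lists of (id, value) rows on id, keeping only ids present in both."""
    right_by_id = {}
    for row_id, value in right_rows:
        right_by_id.setdefault(row_id, []).append(value)
    return [
        (row_id, left_value, right_value)
        for row_id, left_value in left_rows
        for right_value in right_by_id.get(row_id, [])
    ]


# Toy data: (movie id, genres) and (movie id, rating).
movies = [
    ("1", "[{'id': 35, 'name': 'Comedy'}]"),
    ("2", "[]"),   # no genres: dropped before the join
    ("3", None),   # missing value: dropped before the join
]
ratings = [("1", "4.0"), ("1", "3.5"), ("2", "2.0"), ("9", "5.0")]

# Filter first, while each movie still occupies a single row...
movies = [row for row in movies if has_entries(row[1])]
# ...then join, which repeats the movie data once per matching rating.
joined = inner_join(movies, ratings)
print(joined)
```

Filtering before the join means the emptiness check runs once per movie rather than once per (user, movie) pair.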

In [37]:
import pandas as pd
import time

start_time = time.time()


pd.set_option('display.max_colwidth', None)

movies_df = pd.read_csv('./the-movies-dataset/movies_metadata.csv',usecols=("genres","id" ,"title","tagline", "overview","production_companies"),
                          dtype={'genres':"string","id":"string","title": "string", "tagline": "string","overview":"string",
                                    "production_companies" :"string"})[["genres","id" ,"title","tagline", "overview","production_companies"]]
movies_df.dropna(inplace = True)
movies_lst = [row for row in movies_df.values.tolist() if not (row[0][len(row[0])  - 2:] == "[]" or row[5][len(row[5]) - 2:] == "[]")]
movies_df = pd.DataFrame(movies_lst, columns = ("genres","id" ,"title","tagline", "overview","production_companies"), dtype = str)


#LOOK: trying small dataset for testing

ratings_df = pd.read_csv('./the-movies-dataset/ratings_small.csv', usecols = ("userId", "movieId", "rating"),
                       dtype={"userId": "string","movieId": "string","rating": "string"})[["userId", "movieId", "rating"]]
ratings_df.rename(columns={"movieId": "id"}, inplace = True)
ratings_df.dropna(inplace = True)


# Question: What if duplicate movie ids per user were removed here instead of in the cell below?
# Answer: The duplicate removal could be run here,
# but the complete_list in the cell below can also be iterated over at comparable cost to remove duplicates.
# The iteration in the next cell also populates the gaps list,
# which must be built directly before the step that determines the bounds of each user's rated movies.
# So, deferring duplicate removal to the next cell avoids a redundant iteration.


# Question: What if the test and train rating bounds were enforced here instead of in the cell below?
# Answer: The merges below need to run before determining test and train users, because merging removes rows (and so ratings) from users.
# Enforcing the bounds after the merges ensures that the final users are within the set train or test bounds.


keywords_df = pd.read_csv('./the-movies-dataset/keywords.csv', usecols = ("id", "keywords"), dtype={"id": "string","keywords":"string"})[["id", "keywords"]]
keywords_df.dropna(inplace = True)
keywords_lst = [row for row in keywords_df.values.tolist() if not (row[1][len(row[1])  - 2:] == "[]")]
keywords_df = pd.DataFrame(keywords_lst, columns = ("id", "keywords"), dtype = str)


credits_df = pd.read_csv("./the-movies-dataset/credits.csv", usecols = ("cast", "id"), dtype={"cast": "string", "id": "string"})[["cast", "id"]]
credits_df.dropna(inplace = True)
credits_lst = [row for row in credits_df.values.tolist() if (not row[0][len(row[0])  - 2:] == "[]")]
credits_df = pd.DataFrame(credits_lst, columns = ("cast", "id"), dtype = str)


# Default merge is inner: This only keeps movies that have the id existing in both dataframes.
complete_df =  pd.merge(movies_df, ratings_df, on ="id")
complete_df =  pd.merge(complete_df,keywords_df, on ="id")
complete_df  = pd.merge(complete_df,credits_df, on ="id")


complete_df.sort_values(by = 'userId', inplace = True)


# Master dataframe: For each (user id, movie id) row combination there is the combined movie data from movies_df, ratings_df, keywords_df, and credits_df for the movie id in question.
# The columns are reordered.
complete_df  = complete_df.loc[:,['userId','id','rating',"title", "genres","production_companies","keywords", "cast", "tagline", "overview" ]]



# For testing:
print("Minutes taken:", (time.time()-start_time)/60)
print(complete_df.head())



# Tested on personal machine:
# Old run with dataframe iteration (old code): 1 minute and 5.7 seconds
# New run with list conversion before iteration (current code): 37.1 seconds

Minutes taken: 0.04170004924138387
      userId    id rating                           title  \
10151      1  2105    4.0                    American Pie   
9353       1  1371    2.5                       Rocky III   
12209      1  2193    2.0                        My Tutor   
15297      1  2294    2.0  Jay and Silent Bob Strike Back   
24848     10  1690    3.0                          Hostel   

                                                                                              genres  \
10151                               [{'id': 35, 'name': 'Comedy'}, {'id': 10749, 'name': 'Romance'}]   
9353                                                                   [{'id': 18, 'name': 'Drama'}]   
12209  [{'id': 35, 'name': 'Comedy'}, {'id': 18, 'name': 'Drama'}, {'id': 10749, 'name': 'Romance'}]   
15297                                                                 [{'id': 35, 'name': 'Comedy'}]   
24848                                                                 [{'id': 

## Data Extraction and Selection
1. Select data from users whose number of ratings falls within certain bounds.
2. Select a random subset of this data and simplify it.
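
The two steps above can be sketched on toy data. This is an illustrative variant rather than the cell's exact logic: it uses `itertools.groupby` to find each user's row range and a local `random.Random` instance instead of the global seed, and the names `user_bounds` and `select_users` are assumptions made for this example:

```python
import random
from itertools import groupby


def user_bounds(sorted_rows):
    """Return (start, end) index pairs, end exclusive, one per user, for rows sorted by user id."""
    bounds, start = [], 0
    for _, group in groupby(sorted_rows, key=lambda row: row[0]):
        count = sum(1 for _ in group)
        bounds.append((start, start + count))
        start += count
    return bounds


def select_users(bounds, min_ratings, max_ratings, n_users, seed):
    """Sample up to n_users bounds whose rating count lies in [min_ratings, max_ratings]."""
    eligible = [b for b in bounds if min_ratings <= b[1] - b[0] <= max_ratings]
    rng = random.Random(seed)  # local generator, so the global random state is untouched
    return rng.sample(eligible, min(n_users, len(eligible)))


# Toy rows (user id, movie id, rating), already sorted by user id.
rows = [
    ("1", "10", 4.0), ("1", "11", 3.0), ("1", "12", 5.0),
    ("2", "10", 2.0),
    ("3", "11", 4.5), ("3", "13", 1.0),
]
bounds = user_bounds(rows)
chosen = select_users(bounds, min_ratings=2, max_ratings=3, n_users=2, seed=3)
print(bounds)
print(chosen)
```

Each selected pair can then be used to slice the row list, as in `rows[start:end]`, to recover that user's ratings.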


In [38]:
import ast
import random
import time
import matplotlib.pyplot as plt
from matplotlib.pyplot import hist

start_time = time.time()


# LOOK: To make a fair comparison to the best possible implementation on the Netflix data,
# a distribution of user ratings closer to the Netflix data should be selected,
# perhaps a similar proportion of users in each increment of 10 ratings.
# Problem: this does not correctly represent the population of the non-Netflix dataset, so it would have weaker application!
# But another question is: is the Netflix data biased?
# Was the whole problem statement more theoretical than practical?
# How is it that the two datasets are vastly different?



# Also, since these are different datasets, a higher distribution is more extreme for this dataset than for the Netflix dataset,
# meaning there is a higher selection bias for this dataset.

# The right distribution of user ratings should be selected by trial and error.


# Note: in the Netflix data, the distribution of the number of user ratings does not change between users tested and users not tested.
# This should be mimicked with this data.



SEED_INT = 3
# Seed for consistent results across runtimes:
random.seed(SEED_INT)


def populate_names(item):
    """Extract the 'name' values from a stringified list of dicts and join them with spaces."""
    # Parse the whole list at once; splitting on "}, " breaks when a name contains that sequence.
    entries = ast.literal_eval(item)
    return " ".join(str(entry["name"]) for entry in entries)


def provide_data(row):
    """Extract data from row of complete_list:"""
    movie_data = []
    movie_data.append(int(row[0]))
    movie_data.append(int(row[1]))
    movie_data.append(float(row[2]))
    movie_data.append(row[3])  

    movie_data.append(populate_names(row[4]))
    movie_data.append(populate_names(row[5]))
    movie_data.append(populate_names(row[6]))
    movie_data.append(populate_names(row[7]))

    movie_data.append(str(row[8]))
    movie_data.append(str(row[9]))
    return movie_data
    


# The list of rows with the user id, the user's rating for the movie, and raw data for the movie:
# Note: It is sorted by userId.
complete_list = complete_df.values.tolist()

print("Complete number of users:", len(list(complete_df["userId"].unique()))) # 260788

# The complete list of user rows without ratings of the same movie more than once for a given user:
complete_list_no_dups = []

# Distinguish the user the row belongs to:
last_id = complete_list[0][0]

# The set of movies that a user has rated:
# It is used to omit later ratings of a movie that the user has already rated.
movie_set = set()

# The number of rows of movie data a single user takes up for each user:
gaps = []

# Appended to gaps when all of a user's rows of movie data have been counted:
gap_len = 0


# Populates gaps and complete_list_no_dups, omitting movies that the user has already rated:
# Note: This code is faster than using dataframe methods.
# Example: Filtering the data by user and then removing duplicate movie ids for each user
# avoids slow dataframe iteration, but the filter method is also slow.
for row in complete_list:
    if last_id != row[0]:
        movie_set= set()
        complete_list_no_dups.append(row)
        movie_set.add(row[1])
        gaps.append(gap_len)
        gap_len = 1
    else:
        if row[1] not in movie_set:
            complete_list_no_dups.append(row)
            gap_len+=1
            movie_set.add(row[1])
    last_id = row[0]

# Add the last gap_len:
gaps.append(gap_len)



# Index in the complete_list_no_dups list:
full_index = 0 
bounds = [] 



# Populates bounds with the [start, end) row indices of each user in complete_list_no_dups:
for user_index in range(len(gaps)):
    bounds.append([full_index, full_index+gaps[user_index]])
    full_index+=gaps[user_index]    


#LOOK: change back to 1300 when bounds are needed!
bounds_train = random.sample(bounds, 671)


bounds_test= random.sample(bounds_train, 150)

#LOOK: Now with the remaining bounds that were not sampled, iterate through them until
# a number of users within certain bounds of ratings has been selected


# commented out to test no trainbounds
# random.shuffle(bounds)
# bounds_test = []

# for item in bounds:
#     if item[1]-item[0] >=30 and item[1]-item[0] <=50:
#         bounds_test.append(item)
#         if len(bounds_test) == 200:
#             break



# Transformed data of the selected train users and test users (in that order):
sampled_data = []


cnt = 0

for bound in bounds_train:
    for movie in complete_list_no_dups[bound[0]:bound[1]]:
        movie_data = provide_data(movie)
        movie_data[0] = cnt
        sampled_data.append(movie_data)
    cnt+=1

for bound in bounds_test:
    for movie in complete_list_no_dups[bound[0]:bound[1]]:
        movie_data = provide_data(movie)
        movie_data[0] = cnt
        sampled_data.append(movie_data)
    cnt+=1


print("Minutes taken:", (time.time()-start_time)/60)






Complete number of users: 671
Minutes taken: 0.5788999676704407


## Write Data

Save the selected data to constructed_data/constructed_data_2.csv so that the cells below can run without rerunning this cell and the cells above it.


In [39]:
import csv
import os

current_directory = os.getcwd()
final_directory = os.path.join(current_directory, 'constructed_data')
# exist_ok=True makes this a no-op when the folder is already there.
os.makedirs(final_directory, exist_ok=True)

with open("constructed_data/constructed_data_2.csv", "w", encoding="utf-8", newline='') as f:
    writer = csv.writer(f)
    writer.writerow(['userId','id','rating',"title", "genres","production_companies","keywords", "cast", "tagline", "overview"])
    writer.writerows(sampled_data)

In [40]:
# This cell is for testing how many users ratings_small.csv contains (rows are assumed to be grouped by user).
import csv

data_list =[]

with open("ratings_small/ratings_small.csv", 'r', encoding="utf-8") as f:
    csv_reader = csv.reader(f)
    data_list = list(csv_reader)

data_list = data_list[1:]

cnt =0 
id = -1
for item in data_list:
     if item[0] != id:
          cnt+=1
          id = item[0]

print(cnt)


671


## Read Data
This is the starting cell to run if the data has already been saved to constructed_data/constructed_data_2.csv.
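
One consequence of reloading from CSV is that every field comes back as a string, which is why later cells cast ratings with `float(...)`. This is a minimal round-trip sketch using a temporary file and toy rows; the file name `example.csv` and the reduced column set are assumptions for illustration only:

```python
import csv
import os
import tempfile

HEADER = ["userId", "id", "rating", "title"]
rows = [[0, 2105, 4.0, "American Pie"], [0, 1371, 2.5, "Rocky III"]]

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "example.csv")  # hypothetical file name

    with open(path, "w", encoding="utf-8", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(HEADER)
        writer.writerows(rows)

    with open(path, "r", encoding="utf-8", newline="") as f:
        loaded = list(csv.reader(f))[1:]  # skip the header row

# Every field is now a string, so numeric columns need explicit casts.
print(loaded[0])
typed = [[int(u), int(m), float(r), t] for u, m, r, t in loaded]
print(typed[0])
```
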

In [41]:
import csv

data_list =[]

with open("constructed_data/constructed_data_2.csv", 'r', encoding="utf-8") as f:
    csv_reader = csv.reader(f)
    data_list = list(csv_reader)

data_list = data_list[1:]


## Format and Re-sample Data

Format the data into one list per user, each holding the rows of movie data for the movies that user rated. Then, select a subset of these per-user lists for each user type (train or test).
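
The grouping step can be sketched with `itertools.groupby`. Note that this is a variant, not the cell's exact logic: it splits by the number of users, whereas the counter in the cell below advances once per row. The function name `split_users` and the toy rows are assumptions for illustration:

```python
from itertools import groupby


def split_users(rows, n_train_users):
    """Group consecutive rows by user id, then split the groups by user count."""
    per_user = [list(group) for _, group in groupby(rows, key=lambda row: row[0])]
    return per_user[:n_train_users], per_user[n_train_users:]


# Toy rows as read from CSV: [user id, movie id, rating], grouped by user.
rows = [
    ["0", "10", "4.0"], ["0", "11", "3.0"],
    ["1", "10", "2.0"],
    ["2", "12", "5.0"], ["2", "13", "1.5"],
]
train, test = split_users(rows, n_train_users=2)
print(len(train), len(test))
```

`groupby` only merges adjacent rows with the same key, so the input must already be grouped by user, as it is in the file written earlier.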

In [42]:
import random


SEED_INT = 3

random.seed(SEED_INT)

user_to_data_train = []
user_to_data_test = []

user_id = data_list[0][0]

ratings = []


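# Note: i advances once per row, not once per user, so the train/test split below
# is decided by row position (521) in data_list rather than by user count.
i = 0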
for row in data_list:
    if (row[0]!=user_id):
        if(i<521):
            user_to_data_train.append(ratings)
        else:
            user_to_data_test.append(ratings)
        user_id = row[0]
        ratings = [row]
    else:
        ratings.append(row)
    i+=1



user_to_data_test.append(ratings)







#LOOK: Should try introducing higher rating bounds for train users
#in contrast to the other model in complete_11_03_2023.ipynb the test users need to be picked completely randomly 


## Create Features and Target values

* The train and test versions of features 1, 2, and 3 are populated, and in the final cell some subset of features 1, 2, and 3 is used to train and test the final model.
* The target values are ratings, one per user, for a randomly selected movie that the user rated. They are either train or test ratings, used to train or test the final model.
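
The target selection for test users can be sketched as follows: one rated movie per user is held out as the target, and the remaining ratings stay available to the model. The sketch scores a simple user-average baseline with a hand-written mean absolute error. The names `hold_out_one` and `mae` and the toy ratings are assumptions for illustration, and `mae` is a stand-in for scikit-learn's `mean_absolute_error`:

```python
import random


def hold_out_one(user_ratings, rng):
    """Split one user's {movie_id: rating} dict into (known ratings, target movie, target rating)."""
    target_movie = rng.choice(sorted(user_ratings))
    known = {m: r for m, r in user_ratings.items() if m != target_movie}
    return known, target_movie, user_ratings[target_movie]


def mae(y_true, y_pred):
    """Mean absolute error between two equally long sequences."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


rng = random.Random(3)
test_users = [
    {"10": 4.0, "11": 3.0, "12": 5.0},
    {"10": 2.0, "13": 4.0},
]

targets, user_avg_predictions = [], []
for ratings in test_users:
    known, movie, rating = hold_out_one(ratings, rng)
    targets.append(rating)
    # Baseline: predict the user's average over the ratings that remain known.
    user_avg_predictions.append(sum(known.values()) / len(known))

print(mae(targets, user_avg_predictions))
```

Because the held-out rating is excluded from `known`, the baseline never sees the value it is asked to predict.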


In [48]:
from gensim.parsing.preprocessing import remove_stopwords
from nltk.stem import WordNetLemmatizer
import nltk
nltk.download('wordnet')
import random
from ordered_set import OrderedSet
from collections import Counter
import numpy as np
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.metrics.pairwise import linear_kernel
from sklearn.cluster import KMeans
import time
from sklearn.metrics import mean_squared_error
from sklearn.metrics import mean_absolute_error
from sklearn.metrics import r2_score
from surprise import SVD,Dataset,Reader
import pandas as pd


start_time = time.time()

N_VALUE = 100

SEED_INT = 3
# Seed for consistent results across runtimes:
random.seed(SEED_INT)



movies_order = OrderedSet()
movie_ratings_sum_dict = dict()
movie_ratings_count_dict = dict()
overall_average = 0
cnt = 0

train_user_to_movie_to_rating = [] 

for user in user_to_data_train:
    movie_to_rating  = dict()
    for movie in user:
        movies_order.add(movie[1])
        movie_to_rating[movie[1]] = float(movie[2])
        if(movie[1] in movie_ratings_sum_dict.keys()):
            movie_ratings_sum_dict[movie[1]] += float(movie[2])
            movie_ratings_count_dict[movie[1]] += 1
        else:
            movie_ratings_sum_dict[movie[1]] = float(movie[2])
            movie_ratings_count_dict[movie[1]] = 1
        overall_average+=float(movie[2])
        cnt += 1
    train_user_to_movie_to_rating.append(movie_to_rating)


test_user_to_movie_to_rating = [] 
target_movie = []
target_rating = []

for user in user_to_data_test:
    rand_num  = random.randint(0, len(user)-1)
    index = 0
    movie_to_rating  = dict()
    for movie in user:
        movies_order.add(movie[1])
        if(index == rand_num):
            target_movie.append(movie[1])
            target_rating.append(float(movie[2]))
        else:
            if(movie[1] in movie_ratings_sum_dict.keys()):
                movie_ratings_sum_dict[movie[1]] += float(movie[2])
                movie_ratings_count_dict[movie[1]] += 1
            else:
                movie_ratings_sum_dict[movie[1]] = float(movie[2])
                movie_ratings_count_dict[movie[1]] = 1
            movie_to_rating[movie[1]] = float(movie[2])
            overall_average+=float(movie[2])
            cnt += 1
        index+=1
    test_user_to_movie_to_rating.append(movie_to_rating)

overall_average  = overall_average/cnt

movie_ratings_avg_list = []


#LOOK: here is where removing movies based on the value of movie_ratings_count_dict[movie] can be done
#Note: can't remove movies that are a target rating for a user
#this means it is probably better to assign target ratings after this action


for movie in movies_order:
    if movie in movie_ratings_sum_dict.keys():
        movie_ratings_avg_list.append(movie_ratings_sum_dict[movie]/movie_ratings_count_dict[movie])
    else:
        movie_ratings_avg_list.append(overall_average)
    

train_user_averages = []
test_user_averages = []
for user in train_user_to_movie_to_rating:
    if len(user)==0:
        train_user_averages.append(overall_average)
    else:
        train_user_averages.append(sum([user[key] for key in user.keys()])/len(user))
for user in test_user_to_movie_to_rating:
    if len(user)==0:
        test_user_averages.append(overall_average)
    else:
        test_user_averages.append(sum([user[key] for key in user.keys()])/len(user))
        
user_averages = train_user_averages + test_user_averages

#LOOK: new normalization method

# users_to_movie_ratings_transformed = []

# for i in range(len(user_to_data_train)):
#     j = 0
#     lst = []
#     for movie in movies_order: 
#         if movie in train_user_to_movie_to_rating[i].keys():
#             lst.append(train_user_to_movie_to_rating[i][movie] - train_user_averages[i])
#         else:
#             lst.append(0)
#         j += 1
#     users_to_movie_ratings_transformed.append(lst)

# target_movie_index = []

# for i in range(len(user_to_data_test)):
#     j = 0
#     lst = []
#     for movie in movies_order: 
#         if(target_movie[i] == movie):
#             lst.append(0)
#             target_movie_index.append(j)
#         elif movie in test_user_to_movie_to_rating[i].keys():
#             lst.append(test_user_to_movie_to_rating[i][movie] - test_user_averages[i])
#         else:
#             lst.append(0)
#         j += 1
#     users_to_movie_ratings_transformed.append(lst)



#LOOK: old normalization method
users_to_movie_ratings_transformed = []
train_users_to_movie_ratings_transformed = []
test_users_to_movie_ratings_transformed = []

users_to_movie_ratings = []
train_users_to_movie_ratings = []
test_users_to_movie_ratings = []

#bit map of raw ratings given by 1 and filled in ratings given by 0
rating_map = []

for i in range(len(user_to_data_train)):
    j = 0
    lst_1 = []
    lst_2 = []
    row_map = []
    for movie in movies_order: 
        if movie in train_user_to_movie_to_rating[i].keys():
            lst_1.append(train_user_to_movie_to_rating[i][movie] - movie_ratings_avg_list[j])
            lst_2.append(train_user_to_movie_to_rating[i][movie])
            row_map.append(1)
        else:
            lst_1.append(0)
            #try adding zero here...
            lst_2.append(movie_ratings_avg_list[j])
            row_map.append(0)
        j += 1
    rating_map.append(row_map)
    train_users_to_movie_ratings_transformed.append(lst_1)
    train_users_to_movie_ratings.append(lst_2)

target_movie_index = []

for i in range(len(user_to_data_test)):
    j = 0
    lst_1 = []
    lst_2 = []
    row_map = []
    for movie in movies_order: 
        if(target_movie[i] == movie):
            lst_1.append(0)
            #try adding zero here...
            lst_2.append(movie_ratings_avg_list[j])
            target_movie_index.append(j)
            #LOOK: change back if predict function doesn't work!!!
            row_map.append(0)
        elif movie in test_user_to_movie_to_rating[i].keys():
            lst_1.append(test_user_to_movie_to_rating[i][movie] - movie_ratings_avg_list[j])
            lst_2.append(test_user_to_movie_to_rating[i][movie])
            row_map.append(1)
        else:
            lst_1.append(0)
            #try adding zero here...
            lst_2.append(movie_ratings_avg_list[j])
            row_map.append(0)
        j += 1
    rating_map.append(row_map)
    test_users_to_movie_ratings_transformed.append(lst_1)
    test_users_to_movie_ratings.append(lst_2)

users_to_movie_ratings_transformed = train_users_to_movie_ratings_transformed + test_users_to_movie_ratings_transformed
users_to_movie_ratings = train_users_to_movie_ratings + test_users_to_movie_ratings


def svd_full(user_to_ratings_transformed, n, averages):
    """
    1. Get the SVD of user_to_ratings_transformed
    2. Truncate each factor to n components
    3. Multiply the truncated components together: (U x S) x V
    4. Scale the values back to the original rating scale (1-5) and return the result
    """
    U, S, V = np.linalg.svd(user_to_ratings_transformed, full_matrices=False)
    
    # Simplify factors to n components:
    U=U[:,0:n]
    S=np.diag(S)
    S=S[0:n,0:n]
    V=V[0:n,:]

    # Reconstruct to a new array:
    US = np.dot(U,S)
    USV = np.dot(US,V)

    # This transforms USV row by row back into the original rating scale (1-5).
    # LOOK: Old normalization method
    USV = USV + np.tile(averages, (USV.shape[0],1))

    # LOOK: New normalization method
    # averages_reshaped = np.reshape(averages,  (len(averages), 1))
    # USV = USV + np.repeat(np.array(averages_reshaped), USV.shape[1], 1)

    # Be consistent with data structures:
    return list(USV)


#LOOK: This is where the iterative svd algorithm will reside
#LOOK: as expected, the performance of this iterative algorithm is the same as
# the automatic algorithm, but this is the stepping stone for
# regularized svd.

# def svd_full_alt(user_to_ratings_transformed, n, averages):

#     R = np.array(user_to_ratings_transformed)
#     U = np.zeros((R.shape[0], n))
#     V = np.random.uniform(-1, 1, (R.shape[1], n))


#     last_error = np.inf
#     while(True):
#         # E = (norm(R - UVt))^2
#         current_error = np.square(np.linalg.norm(R-np.matmul(U, np.transpose(V))))

#         if  last_error - current_error <= .001:
#             break

#         # U = RV((VtV)^-1)
#         # V = RtU((UtU)^-1)
#         U = np.matmul(np.matmul(R, V),
#                        np.linalg.inv(np.matmul(np.transpose(V), V)))
#         V = np.matmul(np.matmul(np.transpose(R), U),
#                        np.linalg.inv(np.matmul(np.transpose(U), U)))

#         last_error  = current_error
    
#     US = np.dot(U,np.transpose(V))
#     US = US + np.tile(averages, (US.shape[0],1))

#     return list(US)



def svd_full_alt(user_to_ratings, n):

    #experiment with this
    reg_term = 0.02

    R = np.array(user_to_ratings)
    U = np.zeros((R.shape[0], n))
    #LOOK: This is used with normalization
    # V = np.random.uniform(-1, 1, (R.shape[1], n))
    V = np.random.uniform(1, 5, (R.shape[1], n))

    last_error = np.inf
    while(True):
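        # E = (norm(R - UVt))^2 + reg_term*(norm(U))^2 + reg_term*(norm(V))^2
        # The parentheses are required: without them the two "+ reg_term*..." lines
        # are separate statements and never reach current_error.
        current_error = (np.square(np.linalg.norm(R-np.matmul(U, np.transpose(V))))
                         + reg_term*np.square(np.linalg.norm(U))
                         + reg_term*np.square(np.linalg.norm(V)))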

        if  last_error - current_error <= .001:
            break

        # U = RV((VtV)^-1)
        # V = RtU((UtU)^-1)
        U = np.matmul(np.matmul(R, V),
                       np.linalg.inv(np.matmul(np.transpose(V), V)+reg_term*np.identity(n)))
        V = np.matmul(np.matmul(np.transpose(R), U),
                       np.linalg.inv(np.matmul(np.transpose(U), U)+reg_term*np.identity(n)))

        last_error  = current_error
    
    US = np.dot(U,np.transpose(V))
    # LOOK: removed to omit pre and post scaling 
    # US = US + np.tile(averages, (US.shape[0],1))

    return list(US)

#LOOK: the train predict version is used because including target ratings in the full procedure gives some value to
#the training that normally wouldn't be there if the movies to predict are unknown before changing

def predict(u, i, b1, b2, p, q, R, overall_average):

    b1_c = np.copy(b1) 
    b2_c = np.copy(b2) 
    p_c = np.copy(p) 
    q_c = np.copy(q) 

    rt = .02
    lr = .005
    prediction = -1


    for _ in range(15):
        prediction = overall_average+b1_c[u]+b2_c[i]+np.dot(p_c[u],q_c[i])
        error = R[u][i]-prediction
        b1_c[u] += lr*(error- rt*b1_c[u])
        b2_c[i] += lr*(error- rt*b2_c[i])
        temp = lr*(error*q_c[i] -rt*p_c[u])
        q_c[i] += lr*(error*p_c[u] -rt*q_c[i])
        p_c[u] += temp

    return prediction



def train(user_to_ratings, n, overall_average, rating_map):

    rt = .02
    lr = .005

    np.random.seed(SEED_INT)
    R = np.array(user_to_ratings)
    q = np.random.normal(0, .1, (R.shape[1], n))
    p = np.random.normal(0, .1, (R.shape[0], n))

    #LOOK: these are user and item biases respectively
    b1 = [0]*R.shape[0]
    b2 = [0]*R.shape[1]


    R_hat = np.zeros((R.shape[0], R.shape[1]))
    eui = np.zeros((R.shape[0], R.shape[1]))

    #LOOK: caching is not done here
    for _ in range(20):
        for u in range(R.shape[0]):
            for i in range(R.shape[1]):
                if rating_map[u][i]==1:
                    #This should not be computed in order... but at the same time..
                    R_hat[u][i] = overall_average+b1[u]+b2[i]+np.dot(p[u],q[i])
                    eui[u][i] = R[u][i]-R_hat[u][i]
                    b1[u] += lr*(eui[u][i]- rt*b1[u])
                    b2[i] += lr*(eui[u][i]- rt*b2[i])
                    temp = lr*(eui[u][i]*q[i] -rt*p[u])
                    q[i] += lr*(eui[u][i]*p[u] -rt*q[i])
                    p[u] += temp


    return b1, b2, p, q, R

def svd_full_alt_sgd(user_to_ratings, n, overall_average, rating_map):

    #experiment with this
    rt = .02
    lr = .005

    np.random.seed(SEED_INT)
    R = np.array(user_to_ratings)
    # q = np.zeros((R.shape[1], n))
    # p = np.zeros((R.shape[0], n))
    q = np.random.normal(0, .1, (R.shape[1], n))
    p = np.random.normal(0, .1, (R.shape[0], n))

    #LOOK: these are user and item biases respectively
    b1 = [0]*R.shape[0]
    b2 = [0]*R.shape[1]


    R_hat = np.zeros((R.shape[0], R.shape[1]))
    eui = np.zeros((R.shape[0], R.shape[1]))

    # last_error = np.inf
    # while(True):
        # error_sum = 0
        # for u in range(R.shape[0]):
        #     for i in range(R.shape[1]):
        #         R_hat[u][i] = u+b1[u]+b2[i]+np.dot(p[u],q[i])
        #         eui[u][i] = R[u][i]-R_hat[u][i]
        #         term_1 = (eui[u][i])**2
        #         term_2 = rt*(b1[u]**2 +b2[i]**2
        #                     +np.linalg.norm(p[u])**2
        #                     +np.linalg.norm(q[i])**2)
        #         error_sum+=(term_1+term_2)

        # if  last_error - error_sum <= .001:
        #     break
        # last_error  = error_sum

    #with biases:
    #try making outer loop the inner loop...
    #need to try omitting target movies and just predict them
    #LOOK: caching is not done here
    for _ in range(20):
        for u in range(R.shape[0]):
            for i in range(R.shape[1]):
                if rating_map[u][i]==1:
                    #This should not be computed in order... but at the same time..
                    R_hat[u][i] = overall_average+b1[u]+b2[i]+np.dot(p[u],q[i])
                    eui[u][i] = R[u][i]-R_hat[u][i]
                    b1[u] += lr*(eui[u][i]- rt*b1[u])
                    b2[i] += lr*(eui[u][i]- rt*b2[i])
                    temp = lr*(eui[u][i]*q[i] -rt*p[u])
                    q[i] += lr*(eui[u][i]*p[u] -rt*q[i])
                    p[u] += temp


    #how can this be optimized with numpy???
    #try without biases:
    # for a in range(20):
    #     for u in range(R.shape[0]):
    #         for i in range(R.shape[1]):
    #             R_hat[u][i] = np.dot(p[u],q[i])
    #             eui[u][i] = R[u][i]-R_hat[u][i]
    #             temp = lr*(eui[u][i]*q[i] -rt*p[u])
    #             q[i] += lr*(eui[u][i]*p[u] -rt*q[i])
    #             p[u] += temp



    # Note: R_hat is only filled where rating_map is 1, so held-out target entries stay 0.
    print(R_hat)
    return list(R_hat)

#LOOK: implementing Surprise...

train_list = []

index =0
for user_ratings, user_dict in zip(train_users_to_movie_ratings, train_user_to_movie_to_rating):
    for rating, movie_id in zip(user_ratings, list(movies_order)):
        if movie_id in user_dict.keys():
            train_list.append((index, movie_id, rating))
    index+=1

i = 0
user_movies_to_predict = []
for user_ratings, user_dict in zip(test_users_to_movie_ratings, test_user_to_movie_to_rating):
    for rating, movie_id in zip(user_ratings, list(movies_order)):
        if movie_id in user_dict.keys():
            train_list.append((index, movie_id, rating))
        elif(movie_id == target_movie[i]):
            user_movies_to_predict.append((index, movie_id))
    index+=1
    i+=1


reader = Reader()
train_df = pd.DataFrame(train_list, columns=['userId', 'movieId', 'rating'])

dataset = Dataset.load_from_df(train_df, reader)
svd_model = SVD(random_state = SEED_INT)
trainset = dataset.build_full_trainset()
svd_model_trained = svd_model.fit(trainset)

predictions_0 = []

for item in user_movies_to_predict:
    predictions_0.append(svd_model_trained.predict(item[0], item[1], verbose=False).est)

print(mean_absolute_error(target_rating , predictions_0))
print(r2_score(target_rating , predictions_0))
#End of Surprise

#predictions for sgd model:
svd_out_full_alt_sdg = svd_full_alt_sgd(users_to_movie_ratings, N_VALUE, overall_average, rating_map)
predictions_i = [svd_out_full_alt_sdg[i+len(user_to_data_train)][target_movie_index[i]] for i in range(len(user_to_data_test))]

print(mean_absolute_error(target_rating , predictions_i))
print(r2_score(target_rating , predictions_i))


#train/predict sgd model...
b1, b2, p, q, R = train(users_to_movie_ratings, N_VALUE, overall_average, rating_map)
predictions_j = [predict(u+len(user_to_data_train), target_movie_index[u], b1, b2, p, q, R, overall_average) for u in range(len(user_to_data_test))]

print(mean_absolute_error(target_rating , predictions_j))
print(r2_score(target_rating , predictions_j))

svd_out_full = svd_full(users_to_movie_ratings_transformed, N_VALUE, movie_ratings_avg_list)
# svd_out_full = svd_full(users_to_movie_ratings_transformed, N_VALUE, user_averages)

predictions_1 = [svd_out_full[i+len(user_to_data_train)][target_movie_index[i]] for i in range(len(user_to_data_test))]
print(mean_absolute_error(target_rating , predictions_1))
print(r2_score(target_rating , predictions_1))

svd_out_full_alt = svd_full_alt(users_to_movie_ratings, N_VALUE)
# svd_out_full_alt = svd_full_alt(users_to_movie_ratings_transformed, N_VALUE, user_averages)

predictions_2 = [svd_out_full_alt[i+len(user_to_data_train)][target_movie_index[i]] for i in range(len(user_to_data_test))]
print(mean_absolute_error(target_rating , predictions_2))
print(r2_score(target_rating , predictions_2))


predictions_3 = []
for index in target_movie_index:
    predictions_3.append(movie_ratings_avg_list[index])
print(mean_absolute_error(target_rating , predictions_3))
print(r2_score(target_rating , predictions_3))

predictions_4 = []
for avg in test_user_averages:
    predictions_4.append(avg)
print(mean_absolute_error(target_rating , predictions_4))
print(r2_score(target_rating , predictions_4))

print(time.time() - start_time)


#LOOK: need to try creating svd from the iterative process and compare the automated one (done)

#LOOK: try using r svd without pre and post scaling (done)

#LOOK: 
#sample code: https://www.kaggle.com/code/ankitahankare/collaborative-filtering-based-recommender-system
#A difference between that implementation and mine is that in mine the target values for test users are never included in training.
#This is simply because, in a realistic implementation of the model, the users don't have these ratings.
#Another difference is that the target ratings in the sample code are randomly chosen (user, movie) pairs, whereas in mine there is one per test user.

#LOOK: Try running the sample code (done) 

#https://surprise.readthedocs.io/en/stable/model_selection.html#cross-validation

#LOOK: need to try with the small dataset (done) 

#LOOK: need to try removing random seeds (done)

#LOOK: Try changing outer and inner loop (done)

#LOOK: try less training, more training 

#LOOK: With svd_full_alt_sgd, how do you train with the train data only, without using the fill-in values for target movies???


#stock model: 0.73469082908731
#other model: 0.7344406498456546 


[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\jackson\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


0.73469082908731
0.19286782477340159
[[3.23824892 4.00031962 3.86628608 ... 0.         0.         0.        ]
 [0.         0.         0.         ... 0.         0.         0.        ]
 [0.         0.         0.         ... 0.         0.         0.        ]
 ...
 [0.         0.         0.         ... 0.         0.         0.        ]
 [0.         0.         0.         ... 0.         0.         0.        ]
 [0.         0.         0.         ... 0.         0.         0.        ]]
3.690830235439901
-12.441134873610771
0.738227431848343
0.18735604184731924
0.7542808714099318
0.15866462819353588
0.7550506671143457
0.15677365008160282
0.7958538758702122
0.07814078337818142
0.7946165173412278
0.06020840723032861
44.40399932861328
