# PROJECT INF161
## Movie recommendation project - model
`@author` Kristin Loka Øydna

In this file I create different models to predict the rating a user would give to a movie. I have built a baseline based on the mean rating and a content-based model based on which genres the user has liked. In addition, I have created two features to check whether they improve the generalization root mean squared error (RMSE) of the model.

In [1]:
import numpy as np
import pandas as pd
from sklearn.dummy import DummyRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

In [2]:
# Reading the cleaned files
rang_df = pd.read_csv("/Users/kristinlokaoydna/UiB - Bioinformatikk/H20/INF161/Prosjekt/cleaned_data/rangering.csv")
film_df = pd.read_csv("/Users/kristinlokaoydna/UiB - Bioinformatikk/H20/INF161/Prosjekt/cleaned_data/film.csv")
bruker_df = pd.read_csv("/Users/kristinlokaoydna/UiB - Bioinformatikk/H20/INF161/Prosjekt/cleaned_data/bruker.csv")

In [3]:
# Merging dataframes
merged_df = pd.merge(rang_df, film_df, on = "FilmID")
merged_df = pd.merge(merged_df, bruker_df, on = "BrukerID")
merged_df

Unnamed: 0,BrukerID,FilmID,Tidsstempel,Rangering,Tittel,Action,Adventure,Animation,Children's,Comedy,...,Mystery,Romance,Sci-Fi,Thriller,War,Western,Kjonn,Alder,Jobb,Postkode
0,0,791,959442983.0,2,Armageddon (1998),1,1,0,0,0,...,0,0,1,1,0,0,M,45.0,6.0,92103
1,0,2975,959443833.0,4,"Room with a View, A (1986)",0,0,0,0,0,...,0,1,0,0,0,0,M,45.0,6.0,92103
2,0,3407,959443373.0,4,"Grand Day Out, A (1992)",0,0,1,0,1,...,0,0,0,0,0,0,M,45.0,6.0,92103
3,0,189,959445005.0,4,"Far Off Place, A (1993)",0,1,0,1,0,...,0,1,0,0,0,0,M,45.0,6.0,92103
4,0,773,959443944.0,2,Anna and the King (1999),0,0,0,0,0,...,0,1,0,0,0,0,M,45.0,6.0,92103
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
899853,2145,3490,965318885.0,2,Bronco Billy (1980),0,1,0,0,0,...,0,1,0,0,0,0,M,45.0,5.0,77662
899854,2145,240,965318885.0,3,Braddock: Missing in Action III (1988),1,0,0,0,0,...,0,0,0,0,1,0,M,45.0,5.0,77662
899855,2145,892,965319138.0,3,MacKenna's Gold (1969),0,0,0,0,0,...,0,0,0,0,0,1,M,45.0,5.0,77662
899856,2145,3807,965319221.0,5,Steel Magnolias (1989),0,0,0,0,0,...,0,0,0,0,0,0,M,45.0,5.0,77662


I have split the data into a training set (70%) and a test set (30%). I did not set aside a validation set, because it is not needed here: the final model has to make predictions for every user in `predict.ipynb`.

In [4]:
# Split into train and test set
X = merged_df.drop(columns=["Rangering", "Kjonn", "Jobb", "Postkode", "Tidsstempel", "Tittel", "Alder"])
y = merged_df["Rangering"]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

### Models
The baseline is a simple model that other models can be compared to. For every model I calculate the generalization root mean squared error (RMSE) on the test set. The RMSE of the content-based and collaborative models should be better (lower) than the RMSE of the baseline. The baseline here is `DummyRegressor(strategy="mean")`, which predicts the overall mean rating of the training set for every user and movie.
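
For reference, the RMSE reported below follows the standard definition. The symbols are introduced here and do not appear in the original notebook: $n$ is the number of test ratings, $y_i$ the observed rating and $\hat{y}_i$ the predicted rating.

```latex
\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2}
```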


In [5]:
# Create baseline based on mean of ratings
dummy_regressor = DummyRegressor(strategy = "mean")
dummy_regressor.fit(X_train,y_train)
print('Baseline RMSE:',np.sqrt(mean_squared_error(y_test, dummy_regressor.predict(X_test))))

Baseline RMSE: 1.1167012375354881
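
`DummyRegressor(strategy="mean")` predicts one global mean for all rows. A baseline that uses the mean rating of each individual user could be sketched as below. This is a minimal illustration on made-up data, not the notebook's method: `toy_train`, `toy_test` and the fallback to the global mean for unseen users are assumptions.

```python
import numpy as np
import pandas as pd

# Toy stand-ins for the train/test frames; the values are made up for illustration.
toy_train = pd.DataFrame({
    "BrukerID": [0, 0, 1, 1, 1],
    "Rangering": [4, 2, 5, 5, 2],
})
toy_test = pd.DataFrame({
    "BrukerID": [0, 1, 2],   # user 2 never appears in toy_train
    "Rangering": [3, 4, 3],
})

global_mean = toy_train["Rangering"].mean()                      # 3.6
user_means = toy_train.groupby("BrukerID")["Rangering"].mean()   # user 0: 3.0, user 1: 4.0

# Look up each test user's training mean; fall back to the global mean for unseen users.
user_pred = toy_test["BrukerID"].map(user_means).fillna(global_mean)

rmse_global = np.sqrt(np.mean((toy_test["Rangering"] - global_mean) ** 2))
rmse_user = np.sqrt(np.mean((toy_test["Rangering"] - user_pred) ** 2))
print("Global-mean RMSE:", rmse_global)
print("Per-user-mean RMSE:", rmse_user)
```

On the real data the same pattern would use `train` and `test` in place of the toy frames. Whether it beats the global mean there is not something this sketch can show.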


In [6]:
train = X_train.join(y_train)
test = X_test.join(y_test)

#### Content-based filtering
In content-based filtering you look at the content the user likes. Here I look at the ratings a user has given to movies in each genre. If they have rated movies in one genre highly, chances are they will like other movies in that genre. As the content-based model I fit one linear regression per user, with the genre indicators as inputs and the user's ratings as the target.


In [7]:
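model_dict = {}
predictions = []

# Fit one linear regression per user on the genre indicators of the movies they rated
for uid in train.BrukerID.unique():
    bruker = train.loc[train['BrukerID'] == uid]
    
    X = bruker.drop(["BrukerID", "FilmID", "Rangering"], axis = 1)
    y = bruker['Rangering']
    
    linreg = LinearRegression().fit(X, y)
    model_dict[uid] = linreg

# Predict each test rating with the model of the user who gave it
for i, row in test.iterrows():
    # Positions 2-19 of the row hold the 18 genre indicators
    X1 = np.array(row.iloc[2:20]).reshape(-1, 18)
    
    pred = model_dict[row.BrukerID].predict(X1)
    pred = np.clip(pred, 1, 5)  # keep the prediction inside the 1-5 range
    predictions.append(pred[0])
    
# mean_squared_error expects (y_true, y_pred)
mse_content = mean_squared_error(test['Rangering'], predictions)
rmse_content = np.sqrt(mse_content)

print("Content-based filtering RMSE on test set:",rmse_content)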

Content-based filtering RMSE on test set: 1.0698375445196797
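
The loop above calls `predict` once per test row and raises a `KeyError` if a test user has no trained model. A possible alternative is to group the test rows by user, call each model once, and use a fallback value for users without a model. The sketch below is only an illustration: `predict_per_user`, `ConstantModel`, the toy data and the fallback value are all hypothetical, and `ConstantModel` merely stands in for the fitted `LinearRegression` objects.

```python
import numpy as np
import pandas as pd

class ConstantModel:
    """Hypothetical stand-in for a fitted per-user model; only `predict` is needed."""
    def __init__(self, value):
        self.value = value

    def predict(self, X):
        return np.full(len(X), self.value, dtype=float)

def predict_per_user(test_df, models, feature_cols, fallback):
    """Predict every row of test_df, calling each user's model once."""
    preds = pd.Series(fallback, index=test_df.index, dtype=float)
    for uid, rows in test_df.groupby("BrukerID"):
        model = models.get(uid)
        if model is None:
            continue  # a user without a trained model keeps the fallback value
        raw = model.predict(rows[feature_cols].to_numpy())
        preds.loc[rows.index] = np.clip(raw, 1, 5)
    return preds

toy_test = pd.DataFrame({
    "BrukerID": [0, 1, 0, 7],
    "Action":   [1, 0, 0, 1],
    "Comedy":   [0, 1, 1, 0],
})
toy_models = {0: ConstantModel(4.2), 1: ConstantModel(5.8)}  # no model for user 7

toy_preds = predict_per_user(toy_test, toy_models, ["Action", "Comedy"], fallback=3.5)
print(toy_preds.tolist())
```

With the real data, `models` would be `model_dict` and `fallback` could be the mean training rating.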


#### Collaborative filtering
In collaborative filtering you look at the correlation between users or items. If two users have liked similar movies in the past, chances are that each will like movies the other has liked. You can also assume a user will like the same kind of item (here, movie) in the future as in the past, and you can look at similarities between movies (e.g. that they are in the same genre).

I tried to make a collaborative model but did not finish it, so I have no generalization RMSE to compare with the RMSE of the other models. This also means that I cannot use collaborative filtering to predict the top 10 movies for a specific user.

What I have done so far is to compute the correlation between every pair of users, to find users with similar taste in movies. Missing ratings were filled with a mean rating, taken across the user-ID columns (that is, one mean per movie). What I should have done instead is, for each user and movie, to find the k nearest neighbours (the users with the highest correlation among those who rated the movie), predict the mean of their ratings, and fill in that prediction for the user.

In [8]:
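# Rows: movies, columns: users, values: ratings (NaN where a user has not rated a movie)
movieuser_table = pd.pivot_table(train, values="Rangering", index="FilmID", columns=['BrukerID'])

# Replace missing ratings with 0, then take the mean of each row (one mean per movie).
# The zeros are included, so unrated entries count as 0 in this mean.
movieuser_table.replace(np.nan, 0, inplace=True)
mean_userrating = movieuser_table.mean(axis=1)

# Transpose to users x movies and fill each movie's zeros with that movie's mean
movieuser_table = movieuser_table.transpose()
for i in movieuser_table.columns:
    movieuser_table[i] = movieuser_table[i].replace([0.0], mean_userrating[i])

# Find correlation between users based on rating
user_index = list(movieuser_table.index)

corr_matrix = pd.DataFrame(np.corrcoef(movieuser_table))

corr_matrix.index = user_index
corr_matrix.columns = user_index
corr_matrix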
   

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,6031,6032,6033,6034,6035,6036,6037,6038,6039,6040
0,1.000000,0.191848,0.138901,0.249523,0.168896,0.276901,0.178050,0.236471,0.192819,0.185246,...,0.231860,0.228488,0.227216,0.188362,0.148559,0.232039,0.168899,0.324252,0.252306,0.185100
1,0.191848,1.000000,0.260659,0.334602,0.319005,0.407576,0.258869,0.319107,0.285602,0.309753,...,0.378492,0.296763,0.254681,0.237840,0.281202,0.319482,0.267127,0.449963,0.328991,0.313756
2,0.138901,0.260659,1.000000,0.286451,0.230708,0.306853,0.235230,0.311450,0.215089,0.250152,...,0.352377,0.293483,0.256836,0.258401,0.217857,0.288536,0.285840,0.379664,0.247875,0.310242
3,0.249523,0.334602,0.286451,1.000000,0.317498,0.403949,0.295680,0.339092,0.309181,0.334631,...,0.395734,0.316932,0.290182,0.299331,0.294558,0.358064,0.328949,0.496072,0.351317,0.330257
4,0.168896,0.319005,0.230708,0.317498,1.000000,0.373776,0.277453,0.263816,0.339869,0.299429,...,0.373507,0.289110,0.248353,0.257085,0.255392,0.289941,0.256830,0.437367,0.277683,0.300204
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
6036,0.232039,0.319482,0.288536,0.358064,0.289941,0.390181,0.331030,0.383579,0.283792,0.372596,...,0.422135,0.317804,0.278504,0.269975,0.320239,1.000000,0.290871,0.479860,0.419195,0.309063
6037,0.168899,0.267127,0.285840,0.328949,0.256830,0.323118,0.281545,0.398452,0.293421,0.265634,...,0.364464,0.288274,0.217876,0.356294,0.205458,0.290871,1.000000,0.408675,0.277620,0.310032
6038,0.324252,0.449963,0.379664,0.496072,0.437367,0.544849,0.418215,0.452932,0.409028,0.472632,...,0.531292,0.430197,0.450793,0.387981,0.426763,0.479860,0.408675,1.000000,0.489902,0.443873
6039,0.252306,0.328991,0.247875,0.351317,0.277683,0.379212,0.333692,0.306662,0.295669,0.315782,...,0.365703,0.300555,0.311550,0.246376,0.274001,0.419195,0.277620,0.489902,1.000000,0.288188
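
The k-nearest-neighbour idea described above was not implemented in the notebook. A minimal sketch of it is given below, under several assumptions: the function name `predict_knn`, the toy rating matrix, the use of Pearson correlation over co-rated movies only, and the plain (unweighted) mean of the neighbours' ratings are all illustrative choices.

```python
import numpy as np

def predict_knn(ratings, user, movie, k=2):
    """Mean rating of the k most similar users who rated `movie`.

    ratings: 2-D float array (users x movies) with np.nan for missing ratings.
    Returns np.nan when no user with a defined similarity rated the movie.
    """
    candidates = []
    for other in range(ratings.shape[0]):
        if other == user or np.isnan(ratings[other, movie]):
            continue
        # Pearson correlation over the movies both users have rated
        both = ~np.isnan(ratings[user]) & ~np.isnan(ratings[other])
        if both.sum() < 2:
            continue
        a, b = ratings[user, both], ratings[other, both]
        if a.std() == 0 or b.std() == 0:
            continue  # correlation is undefined for constant ratings
        sim = np.corrcoef(a, b)[0, 1]
        candidates.append((sim, ratings[other, movie]))
    if not candidates:
        return np.nan
    # Keep the k candidates with the highest correlation
    candidates.sort(key=lambda pair: pair[0], reverse=True)
    top = candidates[:k]
    return float(np.mean([rating for _, rating in top]))

# Made-up ratings: 4 users x 4 movies, user 0 has not rated movie 3.
toy = np.array([
    [5, 4, 1, np.nan],
    [5, 5, 1, 4],
    [4, 5, 2, 5],
    [1, 2, 5, 1],
])

print(predict_knn(toy, user=0, movie=3, k=2))
```

In this toy matrix users 1 and 2 correlate most strongly with user 0, so the prediction is the mean of their ratings for movie 3. If `k` is larger than the number of positively correlated users, negatively correlated users are averaged in as well; a fuller version would probably filter or weight by similarity.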


### Features
I created two new features that I think will improve the RMSE of the model I chose (content-based). I tried to work out what I would find informative as a user. The first feature is based on when the movie was released, grouped into time intervals, which makes it easy to find movies from a specific period. For the second feature I divided the users' ages into intervals, to make it easy to find movies that others in your own age group have liked.

I split the merged dataset with features into train (70%) and test (30%), and used the content-based model to find the RMSE.

In [9]:
feature_film = film_df.copy()

# Extract the four-digit release year from the "(YYYY)" in the title
feature_film["Årstall"] = feature_film["Tittel"].str.extract(r'\((\d{4})\)', expand=False)
feature_film["Årstall"] = feature_film["Årstall"].astype(int)

# One indicator column per release period; the first covers every year up to and including 1940
feature_film["1910-1940"] = [1 if (row <= 1940) else 0 for row in feature_film["Årstall"]]
feature_film["1941-1970"] = [1 if (1940 < row <= 1970) else 0 for row in feature_film["Årstall"]]
feature_film["1971-1980"] = [1 if (1970 < row <= 1980) else 0 for row in feature_film["Årstall"]]
feature_film["1981-1990"] = [1 if (1980 < row <= 1990) else 0 for row in feature_film["Årstall"]]
feature_film["1991-2000"] = [1 if (1990 < row <= 2000) else 0 for row in feature_film["Årstall"]]

# The year itself is not used as a feature
feature_film = feature_film.drop(["Årstall"], axis=1)
feature_film

Unnamed: 0,FilmID,Tittel,Action,Adventure,Animation,Children's,Comedy,Crime,Documentary,Drama,...,Romance,Sci-Fi,Thriller,War,Western,1910-1940,1941-1970,1971-1980,1981-1990,1991-2000
0,0,Autumn in New York (2000),0,0,0,0,0,0,0,1,...,1,0,0,0,0,0,0,0,0,1
1,1,"Vie est belle, La (Life is Rosey) (1987)",0,0,0,0,1,0,0,1,...,0,0,0,0,0,0,0,0,1,0
2,2,Defying Gravity (1997),0,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,1
3,3,Ruthless People (1986),0,0,0,0,1,0,0,0,...,0,0,0,0,0,0,0,0,1,0
4,5,Defending Your Life (1991),0,0,0,0,1,0,0,0,...,1,0,0,0,0,0,0,0,0,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3686,3948,Cat People (1982),0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,1,0
3687,3949,"Saltmen of Tibet, The (1997)",0,0,0,0,0,0,1,0,...,0,0,0,0,0,0,0,0,0,1
3688,3950,Bride of Re-Animator (1990),0,0,0,0,1,0,0,0,...,0,0,0,0,0,0,0,0,1,0
3689,3951,True Lies (1994),1,1,0,0,1,0,0,0,...,1,0,0,0,0,0,0,0,0,1
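
The release-period indicators built in `In [9]` could also be produced with `pd.cut` and `pd.get_dummies`. The sketch below uses four titles from the tables in this notebook. The variable names are illustrative, and the lower edge 1909 is an assumption: unlike the `row <= 1940` comparison above, `pd.cut` gives `NaN` for any year below the first edge.

```python
import pandas as pd

toy_titles = pd.Series([
    "Autumn in New York (2000)",
    "MacKenna's Gold (1969)",
    "Bronco Billy (1980)",
    "Cat People (1982)",
])

# Pull the four-digit release year out of the "(YYYY)" in the title.
years = toy_titles.str.extract(r"\((\d{4})\)", expand=False).astype(int)

# Right-closed bins: (1909, 1940], (1940, 1970], (1970, 1980], (1980, 1990], (1990, 2000].
edges = [1909, 1940, 1970, 1980, 1990, 2000]
labels = ["1910-1940", "1941-1970", "1971-1980", "1981-1990", "1991-2000"]
period = pd.cut(years, bins=edges, labels=labels)

# One 0/1 indicator column per period, including periods with no movies.
period_dummies = pd.get_dummies(period, dtype=int)
print(period_dummies)
```

Keeping the edges and labels in two lists makes it easier to change the intervals later without editing one line per column.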


In [10]:
feature_bruker = bruker_df.copy()

feature_bruker["Under 18-24"] = [1 if (row < 25) else 0 for row in feature_bruker["Alder"]]
feature_bruker["25-44"] = [1 if (25 <= row < 45) else 0 for row in feature_bruker["Alder"]]
feature_bruker["45-55+"] = [1 if (45 <= row) else 0 for row in feature_bruker["Alder"]]

feature_bruker

Unnamed: 0,BrukerID,Kjonn,Alder,Jobb,Postkode,Under 18-24,25-44,45-55+
0,0,M,45.0,6.0,92103,0,0,1
1,1,M,50.0,16.0,55405,0,0,1
2,2,M,18.0,20.0,44089,1,0,0
3,3,M,25.0,1.0,33304,0,1,0
4,4,M,35.0,6.0,48105,0,1,0
...,...,...,...,...,...,...,...,...
6035,6036,M,45.0,0.0,61821,0,0,1
6036,6037,F,25.0,0.0,Unknown,0,1,0
6037,6038,M,25.0,16.0,33301,0,1,0
6038,6039,M,35.0,14.0,92075,0,1,0


In [11]:
feature_merged = pd.merge(rang_df, feature_film, on = "FilmID")
feature_merged = pd.merge(feature_merged, feature_bruker, on = "BrukerID")
feature_merged

Unnamed: 0,BrukerID,FilmID,Tidsstempel,Rangering,Tittel,Action,Adventure,Animation,Children's,Comedy,...,1971-1980,1981-1990,1991-2000,Kjonn,Alder,Jobb,Postkode,Under 18-24,25-44,45-55+
0,0,791,959442983.0,2,Armageddon (1998),1,1,0,0,0,...,0,0,1,M,45.0,6.0,92103,0,0,1
1,0,2975,959443833.0,4,"Room with a View, A (1986)",0,0,0,0,0,...,0,1,0,M,45.0,6.0,92103,0,0,1
2,0,3407,959443373.0,4,"Grand Day Out, A (1992)",0,0,1,0,1,...,0,0,1,M,45.0,6.0,92103,0,0,1
3,0,189,959445005.0,4,"Far Off Place, A (1993)",0,1,0,1,0,...,0,0,1,M,45.0,6.0,92103,0,0,1
4,0,773,959443944.0,2,Anna and the King (1999),0,0,0,0,0,...,0,0,1,M,45.0,6.0,92103,0,0,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
899853,2145,3490,965318885.0,2,Bronco Billy (1980),0,1,0,0,0,...,1,0,0,M,45.0,5.0,77662,0,0,1
899854,2145,240,965318885.0,3,Braddock: Missing in Action III (1988),1,0,0,0,0,...,0,1,0,M,45.0,5.0,77662,0,0,1
899855,2145,892,965319138.0,3,MacKenna's Gold (1969),0,0,0,0,0,...,0,0,0,M,45.0,5.0,77662,0,0,1
899856,2145,3807,965319221.0,5,Steel Magnolias (1989),0,0,0,0,0,...,0,1,0,M,45.0,5.0,77662,0,0,1


In [12]:
X_f = feature_merged.drop(columns=["Rangering", "Kjonn", "Jobb", "Postkode", "Tidsstempel", "Tittel", "Alder"])
y_f = feature_merged["Rangering"]


X_f_train, X_f_test, y_f_train, y_f_test = train_test_split(X_f, y_f, test_size=0.3, random_state=42)

In [13]:
fea_train = X_f_train.join(y_f_train)
fea_test = X_f_test.join(y_f_test)

In [14]:
model_dict_f = {}
predictions_f = []

# Genre, release-period and age-group indicators used as model inputs
feature_cols = ['Action', 'Adventure', 'Animation', "Children's", 'Comedy',
                'Crime', 'Documentary', 'Drama', 'Fantasy', 'Film-Noir', 'Horror', 'Musical',
                'Mystery', 'Romance', 'Sci-Fi', 'Thriller', 'War', 'Western',
                '1910-1940','1941-1970','1971-1980','1981-1990','1991-2000',
                'Under 18-24','25-44','45-55+']

# Fit one linear regression per user
for uid in fea_train.BrukerID.unique():
    tmp = fea_train.loc[fea_train['BrukerID'] == uid]
    
    X_f = tmp[feature_cols]
    y_f = tmp['Rangering']
    
    linreg_f = LinearRegression().fit(X_f, y_f)
    model_dict_f[uid] = linreg_f

# Predict each test rating with the model of the user who gave it
for i, row in fea_test.iterrows():
    X1_f = np.array(row[feature_cols]).reshape(1, -1)
    
    pred_f = model_dict_f[row.BrukerID].predict(X1_f)
    pred_f = np.clip(pred_f, 1, 5)  # keep the prediction inside the 1-5 range
    predictions_f.append(pred_f[0])
    
# mean_squared_error expects (y_true, y_pred)
mse_content_f = mean_squared_error(fea_test['Rangering'], predictions_f)
rmse_content_f = np.sqrt(mse_content_f)

print("RMSE of content-based model with features on the test set:",rmse_content_f)

RMSE of content-based model with features on the test set: 1.065725024456886
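
One point about the age-group columns: each model above is fitted on the rows of a single user, so `Under 18-24`, `25-44` and `45-55+` are constant within each model's training data. A constant column carries the same information as the intercept, so it should not change the predictions. The sketch below illustrates this on made-up data. It uses `np.linalg.lstsq` as a stand-in for `LinearRegression`, and the helper `fit_predict` and all values are assumptions for illustration.

```python
import numpy as np

def fit_predict(X_train, y_train, X_new):
    """Ordinary least squares with an intercept, solved with np.linalg.lstsq."""
    A = np.column_stack([np.ones(len(X_train)), X_train])
    coef, *_ = np.linalg.lstsq(A, y_train, rcond=None)
    return np.column_stack([np.ones(len(X_new)), X_new]) @ coef

# Made-up ratings from a single user; the columns are two genre indicators.
genres = np.array([[1, 0], [0, 1], [1, 1], [0, 0], [1, 0]], dtype=float)
ratings = np.array([4.0, 2.0, 3.0, 3.0, 5.0])
new_movies = np.array([[1, 0], [0, 1]], dtype=float)

# The same user always falls in the same age group, so that column is all ones.
age_col_train = np.ones((len(genres), 1))
age_col_new = np.ones((len(new_movies), 1))

pred_without = fit_predict(genres, ratings, new_movies)
pred_with = fit_predict(np.hstack([genres, age_col_train]), ratings,
                        np.hstack([new_movies, age_col_new]))

print(pred_without)
print(pred_with)
```

The two prediction vectors agree up to floating-point error. If the same holds on the real data, the small RMSE change between the two content-based models would come from the release-period indicators rather than from the age groups.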


### Conclusion 
The RMSE of the content-based model (1.0698) is better than the RMSE of the baseline (1.1167). This is expected, since the baseline is a simple model that only looks at the mean of the ratings. The RMSE is also slightly better when the features are included (1.0657). This suggests that the features improve the recommendation predictions somewhat, so I choose to use the dataset with the features in `predict.ipynb`.

I have saved the content-based model to use later to predict the ratings for every user. 

In [15]:
import pickle

with open("content_model.pickle", "wb") as best_model:
    pickle.dump(model_dict_f, best_model, protocol=pickle.HIGHEST_PROTOCOL)
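
Reading the dictionary back in the prediction notebook could look like the sketch below. It is a self-contained round trip under assumptions: the temporary directory and `toy_models` are illustrative, and plain lists stand in for the fitted `LinearRegression` objects.

```python
import os
import pickle
import tempfile

# Stand-in for model_dict_f (user id -> model). Plain lists replace the fitted
# LinearRegression objects so the sketch runs on its own.
toy_models = {0: [0.5, -0.2], 1: [0.1, 0.3]}

path = os.path.join(tempfile.mkdtemp(), "content_model.pickle")

with open(path, "wb") as f:
    pickle.dump(toy_models, f, protocol=pickle.HIGHEST_PROTOCOL)

# Later, in the prediction notebook: load the dictionary and pick one user's model.
with open(path, "rb") as f:
    loaded_models = pickle.load(f)

print(loaded_models[1])
```

Pickled scikit-learn models are generally only safe to load with the same scikit-learn version that wrote them, and pickle files should only be loaded from a trusted source.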

In [16]:
feature_merged.to_csv(r'/Users/kristinlokaoydna/UiB - Bioinformatikk/H20/INF161/Prosjekt/movieuser_df.csv', index=False, header=True)