# Lab 7: Recommender Systems

---

We will use the MovieLens 1M ratings dataset (downloaded from http://www.grouplens.org/), which contains about 1,000,000 ratings (1-5) from roughly 6,000 users on roughly 4,000 movies.<br>

<br>
<br>

**USERS FILE DESCRIPTION** <br>

User information is in the file "users.dat".<br>

- Gender is denoted by "M" for male and "F" for female
- Age is chosen from the following ranges:

	*  1:  "Under 18"
	* 18:  "18-24"
	* 25:  "25-34"
	* 35:  "35-44"
	* 45:  "45-49"
	* 50:  "50-55"
	* 56:  "56+"

- Occupation is chosen from the following choices:

	*  0:  "other" or not specified
	*  1:  "academic/educator"
	*  2:  "artist"
	*  3:  "clerical/admin"
	*  4:  "college/grad student"
	*  5:  "customer service"
	*  6:  "doctor/health care"
	*  7:  "executive/managerial"
	*  8:  "farmer"
	*  9:  "homemaker"
	* 10:  "K-12 student"
	* 11:  "lawyer"
	* 12:  "programmer"
	* 13:  "retired"
	* 14:  "sales/marketing"
	* 15:  "scientist"
	* 16:  "self-employed"
	* 17:  "technician/engineer"
	* 18:  "tradesman/craftsman"
	* 19:  "unemployed"
	* 20:  "writer"
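
The age and occupation codes listed above can be turned into readable labels with a lookup table. The snippet below is a minimal sketch on a small made-up frame, `users_demo`, which is a hypothetical stand-in for the real `users` table; the two dictionaries simply copy the lists above.

```python
import pandas as pd

# Lookup tables copied from the code lists above
AGE_LABELS = {
    1: "Under 18", 18: "18-24", 25: "25-34", 35: "35-44",
    45: "45-49", 50: "50-55", 56: "56+",
}
OCCUPATION_LABELS = {
    0: "other", 1: "academic/educator", 2: "artist", 3: "clerical/admin",
    4: "college/grad student", 5: "customer service", 6: "doctor/health care",
    7: "executive/managerial", 8: "farmer", 9: "homemaker", 10: "K-12 student",
    11: "lawyer", 12: "programmer", 13: "retired", 14: "sales/marketing",
    15: "scientist", 16: "self-employed", 17: "technician/engineer",
    18: "tradesman/craftsman", 19: "unemployed", 20: "writer",
}

# Hypothetical rows in the users.dat layout (values are made up)
users_demo = pd.DataFrame({
    "user_id": [1, 2, 3],
    "gender": ["F", "M", "M"],
    "age": [1, 56, 25],
    "occupation": [10, 16, 15],
})

# Series.map replaces each code with its label
users_demo["age_group"] = users_demo["age"].map(AGE_LABELS)
users_demo["occupation_name"] = users_demo["occupation"].map(OCCUPATION_LABELS)
print(users_demo)
```

Codes missing from a dictionary would map to `NaN`, which makes unexpected values easy to spot.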

<br>
<br>

**MOVIES FILE DESCRIPTION** <br>

Movie information is in the file "movies.dat" 

- Titles are identical to the titles provided by IMDB (including year of release)
- Genres are pipe-separated and are selected from the following genres:
	* Action
	* Adventure
	* Animation
	* Children's
	* Comedy
	* Crime
	* Documentary
	* Drama
	* Fantasy
	* Film-Noir
	* Horror
	* Musical
	* Mystery
	* Romance
	* Sci-Fi
	* Thriller
	* War
	* Western
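
Because genres are stored as one pipe-separated string per movie, they usually need to be split before analysis. The sketch below shows two possible ways on a made-up frame, `movies_demo` (a hypothetical stand-in for `movies`, with placeholder titles): one 0/1 column per genre, or a list of genres per row.

```python
import pandas as pd

# Hypothetical rows in the movies.dat layout (titles are placeholders)
movies_demo = pd.DataFrame({
    "movie_id": [1, 2, 3],
    "title": ["Movie A (1995)", "Movie B (1995)", "Movie C (1996)"],
    "genre": ["Animation|Children's|Comedy", "Action|Thriller", "Comedy"],
})

# Option 1: one 0/1 indicator column per genre
genre_flags = movies_demo["genre"].str.get_dummies(sep="|")
print(genre_flags)

# Option 2: a Python list of genres for each movie
movies_demo["genre_list"] = movies_demo["genre"].str.split("|")
print(movies_demo[["title", "genre_list"]])
```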

<br>
<br>

**RATINGS FILE DESCRIPTION** <br>

All ratings are contained in the file "ratings.dat" 
- UserIDs range between 1 and 6040 
- MovieIDs range between 1 and 3952
- Ratings are made on a 5-star scale (whole-star ratings only)
- Timestamp is a Unix timestamp: the number of seconds elapsed since the epoch (00:00:00 UTC on January 1, 1970)
- Each user has at least 20 ratings
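
Unix timestamps are hard to read directly, so it can help to convert them to datetimes. The following is a minimal sketch on a made-up frame, `ratings_demo` (a hypothetical stand-in for `ratings`); `pd.to_datetime` with `unit="s"` interprets the integers as seconds since the epoch.

```python
import pandas as pd

# Hypothetical rows in the ratings.dat layout (values are made up)
ratings_demo = pd.DataFrame({
    "user_id": [1, 1],
    "movie_id": [10, 20],
    "rating": [5, 3],
    "timestamp": [0, 978300760],
})

# Interpret the integers as seconds since 1970-01-01 00:00:00 UTC
ratings_demo["rated_at"] = pd.to_datetime(ratings_demo["timestamp"], unit="s")
print(ratings_demo[["timestamp", "rated_at"]])
```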

## 1: Upload and clean data

In [None]:
# Mounting Google Drive
from google.colab import drive
drive.mount('/content/drive')

In [None]:
# Import libraries
import pandas as pd
import numpy as np
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.metrics import pairwise_distances
from scipy.sparse import csr_matrix
from sklearn.metrics import mean_squared_error
from sklearn.metrics import mean_absolute_error
from sklearn import preprocessing

import torch
import torch.nn as nn
import torch.optim as optim
from itertools import chain

In [None]:
# Read user data
u_columns = ['user_id', 'gender', 'age', 'occupation', 'zip_code']
users = pd.read_csv('/content/drive/MyDrive/DL_data/users.dat', sep='::', names=u_columns, engine='python')
users

In [None]:
# Read movie data
m_columns = ['movie_id', 'title', 'genre']
movies = pd.read_csv('/content/drive/MyDrive/DL_data/movies.dat', sep='::', names=m_columns, encoding='latin-1', engine='python')
movies

In [None]:
# Read rating data
r_columns = ['user_id', 'movie_id', 'rating', 'timestamp']
ratings = pd.read_csv('/content/drive/MyDrive/DL_data/ratings.dat', sep = '::', names=r_columns, engine='python')
ratings

In [None]:
# Create one merged DataFrame (join movies to ratings on movie_id, then to users on user_id)
movie_ratings = pd.merge(movies, ratings, on='movie_id')
MovieLense = pd.merge(movie_ratings, users, on='user_id')
MovieLense

In [None]:
# Show the head of data frame
MovieLense.head()

## 2: Data Preprocessing

In [None]:
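# Encode movie_id and user_id as contiguous 0-based integers (usable as matrix and embedding indices)
# Separate encoders keep both mappings available for inverse_transform later
movie_encoder = preprocessing.LabelEncoder()
user_encoder = preprocessing.LabelEncoder()
ratings['movie_id'] = movie_encoder.fit_transform(ratings['movie_id'])
ratings['user_id'] = user_encoder.fit_transform(ratings['user_id'])
ratings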

In [None]:
# Sort data based on 'user_id' and 'timestamp'
ratings = ratings.sort_values(by=['user_id', 'timestamp'])
ratings

In [None]:
# Partition the data
test_data = ratings.drop_duplicates(subset=["user_id"], keep='last')
index_df = ratings.index.isin(test_data.index)
train_data = ratings.iloc[~index_df]
print(len(train_data), len(test_data))
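
The partition above holds out each user's most recent rating for testing and keeps the rest for training. The sketch below replays the same steps on a tiny made-up frame (`toy`, `toy_test` and `toy_train` are illustrative names) so the effect is easy to check by eye.

```python
import pandas as pd

# Toy ratings already sorted by user_id and timestamp (values are made up).
# The index labels are out of order, as they are after sort_values.
toy = pd.DataFrame({
    "user_id":   [0, 0, 0, 1, 1],
    "movie_id":  [5, 7, 2, 7, 3],
    "rating":    [4, 3, 5, 2, 4],
    "timestamp": [10, 20, 30, 15, 25],
}, index=[3, 0, 4, 1, 2])

# Keep the last (most recent) row of every user as the test set
toy_test = toy.drop_duplicates(subset=["user_id"], keep="last")

# Everything that is not a test row goes to the training set
is_test = toy.index.isin(toy_test.index)
toy_train = toy[~is_test]

print(toy_test)
print(toy_train)
```

Each user contributes exactly one test row, so the test set has as many rows as there are users.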

In [None]:
# Remove the timestamp column
train_data = train_data[['user_id', 'movie_id', 'rating']]
test_data = test_data[['user_id', 'movie_id', 'rating']]
print(train_data.shape, test_data.shape)

## 3: Explore the MovieLens data

In [None]:
# Total number of users
user_num = len(ratings['user_id'].unique())
user_num

In [None]:
# Total number of movies
movie_num = len(ratings['movie_id'].unique())
movie_num

In [None]:
# Rating information
ratings['rating'].mean()

In [None]:
# Rating distribution
sns.countplot(x='rating', data=ratings)

## 4: Collaborative Filtering Recommender Systems

In [None]:
# Create user-item matrix for training and testing data
train_matrix = np.zeros([user_num, movie_num])
for line in train_data.itertuples():
  train_matrix[line.user_id, line.movie_id] = line.rating

test_matrix = np.zeros([user_num, movie_num])
for line in test_data.itertuples():
  test_matrix[line.user_id, line.movie_id] = line.rating

In [None]:
# calculate the average rating for each user
average_user_rating = np.true_divide(train_matrix.sum(1),(train_matrix!=0).sum(1))

# Create train_matrix_sp: each observed rating minus that user's average rating (mean-centered preferences)
train_matrix_sp = csr_matrix(train_matrix, dtype=np.float64)
nz = train_matrix_sp.nonzero()
train_matrix_sp[nz] -= average_user_rating[nz[0]]
train_matrix_sp = train_matrix_sp.toarray()

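# Calculate the pairwise user and movie weights that are used as "similarity" below.
# Note: pairwise_distances defaults to metric='euclidean', so larger values mean LESS similar rows.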
user_similarity = pairwise_distances(train_matrix_sp)
movie_similarity = pairwise_distances(train_matrix_sp.T)
np.fill_diagonal(user_similarity, 0)
np.fill_diagonal(movie_similarity, 0)
print(user_similarity)
print(movie_similarity)
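
To see how a distance differs from a similarity, the sketch below compares the default Euclidean output of `pairwise_distances` with cosine similarity on a made-up, mean-centered matrix `X`. Cosine similarity is one common alternative weighting, shown here only for comparison; it is not what the cell above computes.

```python
import numpy as np
from sklearn.metrics import pairwise_distances

# Made-up mean-centered preferences: rows are users, columns are movies
X = np.array([
    [ 1.0, -1.0, 0.0],
    [ 2.0, -2.0, 0.0],   # same taste as user 0, stronger opinions
    [-1.0,  1.0, 0.0],   # opposite taste to user 0
])

# Default metric is 'euclidean': 0 means identical, larger means further apart
euclid = pairwise_distances(X)

# Cosine similarity: 1 means same direction, -1 means opposite direction
cosine_sim = 1.0 - pairwise_distances(X, metric="cosine")

print(np.round(euclid, 3))
print(np.round(cosine_sim, 3))
```

Users 0 and 1 agree on direction, so their cosine similarity is 1 even though their Euclidean distance is not 0.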

In [None]:
# Create a collaborative filtering algorithm
zero_index = np.zeros(train_matrix_sp.shape)
zero_index[nz] = 1
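def collaborative_filtering(type='user'):
  # type selects the neighbourhood: 'user' (user-based) or 'item' (item-based)
  if type == 'user':
    # user's mean rating + weighted average of the other users' mean-centered ratings
    pre_rating = average_user_rating[:, np.newaxis] + np.dot(user_similarity, train_matrix_sp)/np.dot(user_similarity, zero_index)
  elif type == 'item':
    # weighted average of the user's raw ratings on the other movies
    pre_rating = (np.dot(movie_similarity, train_matrix.T)/np.dot(movie_similarity, zero_index.T)).T
  else:
    raise ValueError("type must be 'user' or 'item'")
  return pre_rating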
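
Written out, the two branches of `collaborative_filtering` correspond to the formulas below. The symbols are notation introduced here for illustration: $\bar{r}_u$ is `average_user_rating`, $\tilde{r}_{vi}$ is `train_matrix_sp`, $r_{uj}$ is `train_matrix`, $m$ is `zero_index` (1 where a training rating exists, 0 otherwise), $s$ is `user_similarity` and $s'$ is `movie_similarity`.

```latex
\hat{r}^{\,\text{user}}_{ui}
  = \bar{r}_u + \frac{\sum_{v} s_{uv}\,\tilde{r}_{vi}}{\sum_{v} s_{uv}\,m_{vi}},
\qquad
\hat{r}^{\,\text{item}}_{ui}
  = \frac{\sum_{j} s'_{ij}\,r_{uj}}{\sum_{j} s'_{ij}\,m_{uj}}
```

In both cases the denominator sums the weights only over neighbours that actually have a rating, so unrated entries do not dilute the average.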


In [None]:
# Make predictions
user_prediction = collaborative_filtering(type='user')
item_prediction = collaborative_filtering(type='item')
# A 0/0 division (no weighted neighbour has rated the movie) yields NaN; fall back to a default rating of 4
user_prediction = np.nan_to_num(user_prediction, nan=4)
item_prediction = np.nan_to_num(item_prediction, nan=4)

In [None]:
# Examine the evaluation results of user-based collaborative filtering on testing data: MAE and RMSE
MAE = mean_absolute_error(test_matrix[test_matrix!=0], user_prediction[test_matrix!=0])
RMSE = np.sqrt(mean_squared_error(test_matrix[test_matrix!=0], user_prediction[test_matrix!=0]))  # sqrt(MSE); recent scikit-learn releases no longer accept squared=False
print("MAE:", MAE)
print("RMSE:", RMSE)

In [None]:
# Examine the evaluation results of item-based collaborative filtering on testing data: MAE and RMSE
MAE = mean_absolute_error(test_matrix[test_matrix!=0], item_prediction[test_matrix!=0])
RMSE = np.sqrt(mean_squared_error(test_matrix[test_matrix!=0], item_prediction[test_matrix!=0]))  # sqrt(MSE); recent scikit-learn releases no longer accept squared=False
print("MAE:", MAE)
print("RMSE:", RMSE)
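
Both evaluation cells score only the entries where `test_matrix` is non-zero, that is, the held-out ratings. The sketch below reproduces that masking and the two metrics by hand on made-up matrices (`truth` and `pred` are illustrative names), which can help when sanity-checking the numbers above.

```python
import numpy as np

# Made-up 2x3 matrices; a 0 in `truth` means "no held-out rating here"
truth = np.array([[4.0, 0.0, 3.0],
                  [0.0, 5.0, 2.0]])
pred = np.array([[3.5, 1.0, 3.0],
                 [2.0, 4.0, 4.0]])

mask = truth != 0                       # evaluate only the held-out ratings
errors = pred[mask] - truth[mask]       # [-0.5, 0.0, -1.0, 2.0]

mae = np.abs(errors).mean()             # mean absolute error
rmse = np.sqrt((errors ** 2).mean())    # root mean squared error

print("MAE:", mae)
print("RMSE:", rmse)
```

RMSE squares the errors before averaging, so the single large miss (2.0) pulls it above the MAE.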

Q1. Which recommender system has better performance, user-based or item-based, and why? <br>


## 5: Neural Collaborative Filtering

In [None]:
# Build a neural network on training data
class neural_network(nn.Module):
    def __init__(self,  emb_size, hidden_size1, hidden_size2, hidden_size3, hidden_size4, out_size):
        super().__init__()

        self.user_emb = nn.Embedding(user_num, emb_size)
        self.item_emb = nn.Embedding(movie_num, emb_size)
        
        self.network = nn.Sequential(
          nn.Linear(emb_size*2, hidden_size1),
          nn.ReLU(),
          nn.Linear(hidden_size1, hidden_size2),
          nn.ReLU(),
          nn.Linear(hidden_size2, hidden_size3),
          nn.ReLU(),
          nn.Linear(hidden_size3, hidden_size4),
          nn.ReLU(),
          nn.Linear(hidden_size4, out_size))

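    def forward(self, u_id, v_id):
        u = self.user_emb(u_id)              # (batch, emb_size)
        v = self.item_emb(v_id)              # (batch, emb_size)
        c = torch.cat([u,v], dim = 1)        # (batch, emb_size*2)
        out = self.network(c)                # (batch, out_size)
        out_sig = torch.sigmoid(out) * 5.0   # scale predictions into the open interval (0, 5)
        return out_sig.squeeze(-1)           # drop only the trailing dimension, so a batch of one keeps shape (1,)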
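
The forward pass looks up one embedding per user and per movie, concatenates them, and squashes the network output into the rating range. The sketch below traces the tensor shapes through those steps with toy sizes; `n_users`, `n_items` and the single `nn.Linear` layer standing in for the full MLP are assumptions made for illustration.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
n_users, n_items, emb_size = 4, 6, 3       # toy sizes, chosen for illustration

user_emb = nn.Embedding(n_users, emb_size)
item_emb = nn.Embedding(n_items, emb_size)
head = nn.Linear(emb_size * 2, 1)          # one layer standing in for the MLP

u_id = torch.tensor([0, 3])                # a batch of two (user, movie) pairs
v_id = torch.tensor([5, 1])

c = torch.cat([user_emb(u_id), item_emb(v_id)], dim=1)   # shape (2, 6)
scores = torch.sigmoid(head(c)) * 5.0                     # shape (2, 1), values in (0, 5)
scores = scores.squeeze(-1)                               # shape (2,)

print(c.shape, scores.shape)
```

Since the sigmoid is scaled by 5, predictions lie strictly between 0 and 5, while the observed ratings run from 1 to 5.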

In [None]:
# Create tensor from pandas dataframe


# Create tensor dataset


# Define training and testing data loader, and set batch size to 512
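
One possible way to approach the three steps above is sketched below on a made-up frame. It is only an illustration, not the reference solution: `toy_train` stands in for `train_data`, the batch size is 2 so the toy data spans several batches (the lab asks for 512), and the tensor names follow the comments in the training loop.

```python
import pandas as pd
import torch
from torch.utils.data import TensorDataset, DataLoader

# Made-up frame with the same columns as train_data
toy_train = pd.DataFrame({
    "user_id":  [0, 0, 1, 2, 2],
    "movie_id": [1, 3, 0, 2, 3],
    "rating":   [5, 3, 4, 2, 1],
})

# IDs must be integer (long) tensors for nn.Embedding; ratings are floats for the loss
train_user_tensor = torch.tensor(toy_train["user_id"].values, dtype=torch.long)
train_movie_tensor = torch.tensor(toy_train["movie_id"].values, dtype=torch.long)
train_rating_tensor = torch.tensor(toy_train["rating"].values, dtype=torch.float32)

# Each dataset item is a (user, movie, rating) triple
train_dataset = TensorDataset(train_user_tensor, train_movie_tensor, train_rating_tensor)

# Shuffle training batches; a test loader would normally use shuffle=False
train_loader = DataLoader(train_dataset, batch_size=2, shuffle=True)

for user_input, movie_input, labels in train_loader:
    print(user_input.shape, movie_input.shape, labels.shape)
```

The same pattern applied to `test_data` would give the test tensors, dataset and loader.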


In [None]:
# Define training loop function
def training_loop(n_epochs, optimizer, model, loss_fn, train_loader):
    for epoch in range(0, n_epochs):
        # Training Phase 
        model.train()
        loss_train = 0.0
        for user_input, movie_input, labels in train_loader: # (user_input, movie_input, labels) are from (train_user_tensor, train_movie_tensor, train_rating_tensor) in train_dataset
                                                             # (user_input, movie_input, labels) are the inputs for each batch
            outputs = model(user_input, movie_input) # (user_input, movie_input) correspond to the u_id, v_id, which are the inputs of the forward(self, u_id, v_id) function
            loss = loss_fn(outputs, labels)
            
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
            loss_train += loss.item()

        # Report the mean batch loss after every epoch
        print('Epoch {}, Training loss {}'.format(epoch, loss_train / len(train_loader)))

In [None]:
# Model training


In [None]:
# Define testing function
def test(model, train_loader, test_loader):
 
  # testing phase
  model.eval()
  predict_train = []
  predict_test = []
  label_train = []
  label_test = []

  with torch.no_grad():
      for user_input, movie_input, labels in train_loader: # (user_input, movie_input, labels) are from (train_user_tensor, train_movie_tensor, train_rating_tensor) in train_dataset
                                                           # (user_input, movie_input, labels) are the inputs for each batch
          outputs = model(user_input, movie_input)         # (user_input, movie_input) correspond to the u_id, v_id, which are the inputs of the forward(self, u_id, v_id) function
          predict_train.append(outputs.tolist())
          label_train.append(labels.tolist())

      for user_input, movie_input, labels in test_loader: # (user_input, movie_input, labels) are from (test_user_tensor, test_movie_tensor, test_rating_tensor) in test_dataset
                                                          # (user_input, movie_input, labels) are the inputs for each batch
          outputs = model(user_input, movie_input)        # (user_input, movie_input) correspond to the u_id, v_id, which are the inputs of the forward(self, u_id, v_id) function
          predict_test.append(outputs.tolist())
          label_test.append(labels.tolist())
  
  # RMSE is computed as sqrt(MSE) because recent scikit-learn releases no longer accept squared=False
  MAE_train = mean_absolute_error(list(chain(*label_train)), list(chain(*predict_train)))
  RMSE_train = np.sqrt(mean_squared_error(list(chain(*label_train)), list(chain(*predict_train))))

  MAE_test = mean_absolute_error(list(chain(*label_test)), list(chain(*predict_test)))
  RMSE_test = np.sqrt(mean_squared_error(list(chain(*label_test)), list(chain(*predict_test))))

  print("Training MAE and RMSE:", MAE_train, RMSE_train)
  print()
  print("Testing MAE and RMSE:", MAE_test, RMSE_test)

In [None]:
# Examine evaluation results


In [None]:
!jupyter nbconvert --to html "/content/drive/MyDrive/DL_lab/Lab7:Recommender_Systems.ipynb"