# IEOR4571 Final Project: Data Preparation



Team members: Diyue Gu (dg3198), Jingyan Xu (jx2424p), Yifei Zhang (yz3925), Chelsea Cui (ac4788), Yishi Wang (yw3619)


## Data Preparation

This notebook prepares the data for our hybrid model. The hybrid is a switching model: each user is funneled into a different recommendation algorithm depending on their profile. Specifically, a user with only a few rated items is funneled into a content-based model, a user with a moderate number of ratings is funneled into a matrix factorization (MF) model, and a user with many ratings is funneled into a deep learning model.


Our switching criteria:
*   Content-based model: Category 0, number of ratings 0 to 6
*   MF model: Category 1, number of ratings 7 to 10
*   Deep Learning model: Category 2, number of ratings 11 and above
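
The switching rule above can be sketched as a small routing function. This is only an illustration of the stated thresholds: `route_user` and the model names in `MODEL_BY_CATEGORY` are hypothetical and are not used elsewhere in this notebook.

```python
# Hypothetical labels for the three branches of the switching model
MODEL_BY_CATEGORY = {
    0: "content_based",
    1: "matrix_factorization",
    2: "deep_learning",
}


def route_user(n_ratings: int) -> str:
    """Return the model a user is funneled into, given their rating count."""
    if n_ratings <= 6:
        category = 0   # 0 to 6 ratings
    elif n_ratings <= 10:
        category = 1   # 7 to 10 ratings
    else:
        category = 2   # 11 ratings and above
    return MODEL_BY_CATEGORY[category]


for n in (3, 9, 40):
    print(n, route_user(n))
```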

## Import Packages


Python packages needed for data preparation are imported here.

In [None]:
import os

import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Seeded generator so that the sampling below is reproducible
rng = np.random.RandomState(42)

## Import Raw Data

In [None]:
!curl -O http://files.grouplens.org/datasets/movielens/ml-latest.zip
# -o overwrites existing files without prompting, so the cell does not hang on re-runs
!unzip -o ml-latest.zip

In [None]:
# Each `!` command runs in its own subshell, so a separate `!cd` would not persist
!ls

drive  ml-latest  ml-latest.zip  sample_data


In [None]:
ratings_df = pd.read_csv('/content/ml-latest/ratings.csv', sep=',', header=0)

In [None]:
ratings_df.head()

Unnamed: 0,userId,movieId,rating,timestamp
0,1,307,3.5,1256677000.0
1,1,481,3.5,1256677000.0
2,1,1091,1.5,1256677000.0
3,1,1257,4.5,1256677000.0
4,1,1449,4.5,1256677000.0


## Data Sampling and Train Test Split

In [None]:
def sample_dataset(df, target, size, upper_trim=None, lower_trim=0, rng=None):
  """
  Input:
    - df: A dataframe with a 'rating' column
    - target: Column whose unique values are sampled (e.g. 'userId' or 'movieId')
    - size: Number of unique target values to sample
    - lower_trim: Exclusive lower bound on the rating count per target value
    - upper_trim: Exclusive upper bound on the rating count per target value, if any
    - rng: A np.random.RandomState used for sampling
  Output:
    - the rows of df whose target value is among the sampled ids
  """
  if rng is None:
    rng = np.random.RandomState()

  # Number of ratings per target value, computed once
  counts = df.groupby(target)['rating'].count()

  if upper_trim is None:
    upper_trim = counts.max()

  # Keep ids whose rating count lies strictly between the two bounds
  qualified_ids = counts[(counts > lower_trim) & (counts < upper_trim)].index.values
  selected_ids = rng.choice(qualified_ids, size, replace=False)

  res_df = df[df[target].isin(selected_ids)]

  return res_df

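We sample 20000 unique users and then 1000 unique movies to form our sample dataset. In both steps, only users or movies with more than 100 ratings in the data being sampled are eligible (`lower_trim = 100`). We then use `train_test_split` to split the sampled dataset into a training set (80%) and a test set (20%).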
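
The filter-then-sample step can be illustrated on a toy table. This is a minimal sketch with made-up ratings and a threshold of 1 instead of 100; the names `toy`, `rng_demo`, `eligible`, `chosen`, and `subset` are hypothetical.

```python
import numpy as np
import pandas as pd

# Toy ratings table (made-up values, not MovieLens data)
toy = pd.DataFrame({
    "userId":  [1, 1, 1, 2, 2, 3, 3, 3, 4],
    "movieId": [10, 11, 12, 10, 11, 10, 11, 12, 10],
    "rating":  [4.0, 3.5, 5.0, 2.0, 3.0, 4.5, 4.0, 1.0, 3.0],
})

rng_demo = np.random.RandomState(0)

# Step 1: only users with more than 1 rating are eligible (user 4 is excluded)
counts = toy.groupby("userId")["rating"].count()
eligible = counts[counts > 1].index.values

# Step 2: draw 2 distinct eligible users without replacement
chosen = rng_demo.choice(eligible, 2, replace=False)

# Step 3: keep every rating row that belongs to a chosen user
subset = toy[toy["userId"].isin(chosen)]
print(subset["userId"].nunique())
```

Applying the same three steps to `movieId` on the result gives the second sampling stage.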

In [None]:
sampled_test = sample_dataset(ratings_df, 'userId', 20000, lower_trim = 100, rng = rng )

In [None]:
sampled_test = sample_dataset(sampled_test, 'movieId', 1000, lower_trim = 100, rng = rng)

In [None]:
# Passing the seeded generator makes the split reproducible across runs
train_t, test_t = train_test_split(sampled_test, test_size = 0.20, random_state = rng)

In [None]:
train_t.shape

(777916, 4)

In [None]:
test_t.shape

(194479, 4)

## Assign test data to different models

In [None]:
test_t.head()

Unnamed: 0,userId,movieId,rating,timestamp
10895664,111958,492,2.0,1055160000.0
2389269,24587,4776,3.0,1279389000.0
2372257,24422,2706,3.0,1264370000.0
14037709,143706,153,3.0,847443800.0
5757687,59314,97913,4.0,1478893000.0


In [None]:
user_count = test_t.groupby('userId')[['rating']].count()

In [None]:
user_count.head()

Unnamed: 0_level_0,rating
userId,Unnamed: 1_level_1
4,32
10,3
15,9
19,6
36,6


Here we assign users to groups according to their number of ratings, counted over the test set only.
* Users with 0 to 6 ratings are assigned category 0 and go to the Content-based model.
* Users with 7 to 10 ratings are assigned category 1 and go to the MF model.
* Users with 11 or more ratings are assigned category 2 and go to the Deep Learning model.

Note that the cell below uses a strict `> 11` comparison for category 2, so users with exactly 11 ratings remain in category 1.
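
As a sketch of an alternative, the thresholds as stated in the list (with 11 counted as category 2) could be applied in one vectorized step with `pd.cut`. The frame `demo_counts` and its values are made up for illustration.

```python
import numpy as np
import pandas as pd

# Toy per-user rating counts (made-up values)
demo_counts = pd.DataFrame(
    {"rating": [3, 6, 7, 10, 11, 32]},
    index=pd.Index([10, 19, 15, 20, 21, 4], name="userId"),
)

# Right-closed bins: (-1, 6] -> 0, (6, 10] -> 1, (10, inf] -> 2
demo_counts["category"] = pd.cut(
    demo_counts["rating"],
    bins=[-1, 6, 10, np.inf],
    labels=[0, 1, 2],
).astype(int)

print(demo_counts)
```

Keeping the bin edges in a single list makes the boundaries easy to compare against the stated criteria.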

In [None]:
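# Default to category 1 (MF model), then overwrite the two outer groups.
# .loc avoids pandas' chained-assignment warning.
user_count['category'] = 1
user_count.loc[user_count['rating'] <= 6, 'category'] = 0
# Strict comparison: users with exactly 11 ratings remain in category 1
user_count.loc[user_count['rating'] > 11, 'category'] = 2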

In [None]:
user_count.head()

Unnamed: 0_level_0,rating,category
userId,Unnamed: 1_level_1,Unnamed: 2_level_1
4,32,2
10,3,0
15,9,1
19,6,0
36,6,0


In [None]:
test_t = test_t.merge(user_count, on = 'userId', how = 'left')

In [None]:
test_t.head()

Unnamed: 0,userId,movieId,rating_x,timestamp,rating_y,category
0,111958,492,2.0,1055160000.0,73,2
1,24587,4776,3.0,1279389000.0,9,1
2,24422,2706,3.0,1264370000.0,11,1
3,143706,153,3.0,847443800.0,7,1
4,59314,97913,4.0,1478893000.0,66,2
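
The `rating_x` and `rating_y` columns appear because both frames contain a column named `rating`, so pandas appends its default suffixes. A minimal sketch of one way to avoid this, assuming a hypothetical column name `n_ratings` for the per-user count and made-up toy frames:

```python
import pandas as pd

# Toy stand-ins for the test ratings and the per-user counts
ratings_demo = pd.DataFrame({
    "userId": [4, 4, 10],
    "movieId": [492, 153, 47],
    "rating": [2.0, 3.0, 4.5],
})
counts_demo = pd.DataFrame({
    "userId": [4, 10],
    "rating": [32, 3],      # number of ratings per user
    "category": [2, 0],
})

# Renaming the count column before merging avoids the _x / _y suffixes
merged = ratings_demo.merge(
    counts_demo.rename(columns={"rating": "n_ratings"}),
    on="userId",
    how="left",
)
print(list(merged.columns))
```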


In [None]:
test_for_content = test_t[test_t['category'] == 0].iloc[: , 0 : 3].rename(columns={"rating_x": "rating"})
test_for_mf = test_t[test_t['category'] == 1].iloc[: , 0 : 3].rename(columns={"rating_x": "rating"})
test_for_dp = test_t[test_t['category'] == 2].iloc[: , 0 : 3].rename(columns={"rating_x": "rating"})

In [None]:
print(test_for_content.shape)
print(test_for_mf.shape)
print(test_for_dp.shape)

(37335, 3)
(44436, 3)
(112708, 3)


In [None]:
test_for_content.head()

Unnamed: 0,userId,movieId,rating
8,24651,47,4.5
11,12503,1301,3.0
14,12773,1466,4.0
18,106506,593,4.0
21,152897,4232,0.5


In [None]:
train_t = train_t.iloc[:, 0:3]

In [None]:
train_t.head()

Unnamed: 0,userId,movieId,rating
8135517,83642,2953,2.5
3358636,34473,1355,1.0
14618475,149605,849,2.0
8570273,88346,47,4.0
2994589,30792,750,3.5


In [None]:
train_t.shape

(777916, 3)

## Export prepared datasets to csv

In [None]:
from google.colab import drive
drive.mount("/content/drive")

In [None]:
train_t.to_csv("/content/drive/MyDrive/final/train.csv")

In [None]:
test_for_content.to_csv("/content/drive/MyDrive/final/test_for_content.csv")

In [None]:
test_for_mf.to_csv("/content/drive/MyDrive/final/test_for_mf.csv")

In [None]:
test_for_dp.to_csv("/content/drive/MyDrive/final/test_for_dp.csv")
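
By default `to_csv` also writes the dataframe index, which comes back as an `Unnamed: 0` column when the file is read again. The sketch below shows the difference on a two-row frame using in-memory buffers; whether the downstream notebooks need the original row index is an assumption left open here.

```python
import io

import pandas as pd

demo = pd.DataFrame(
    {"userId": [83642, 34473], "movieId": [2953, 1355], "rating": [2.5, 1.0]},
    index=[8135517, 3358636],
)

# Default export: the index is written as an unnamed first column
with_index = io.StringIO()
demo.to_csv(with_index)
with_index.seek(0)
cols_with = list(pd.read_csv(with_index).columns)

# index=False writes only the three data columns
without_index = io.StringIO()
demo.to_csv(without_index, index=False)
without_index.seek(0)
cols_without = list(pd.read_csv(without_index).columns)

print(cols_with)
print(cols_without)
```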
