# Transform

Before we can train deep learning models, we need to prepare the data. There are two main transformations:

- **Preprocessing**:
    - Users: Convert to label encoded values
    - Items: Convert to label encoded values
    - User Side Features: Convert to 
    - Ratings: Convert the interaction to Explicit or Implicit signals


- **Data Splitting**: Create train, validation, and test datasets for evaluating the model
    - Random Split
    - Stratified Split
    - Chronological Split
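
To make the explicit/implicit distinction in the preprocessing list concrete, here is a minimal sketch on a toy DataFrame. The toy values and the threshold of 4 are assumptions chosen for illustration; they are not taken from the dataset or from the `reco` package.

```python
import pandas as pd

# Toy ratings; the values are illustrative, not taken from the dataset.
toy = pd.DataFrame({
    "user_id": [1, 1, 2, 2],
    "movie_id": [10, 20, 10, 30],
    "rating": [5, 2, 4, 3],
})

# Explicit signal: keep the rating itself as the target.
toy["explicit"] = toy["rating"].astype("float32")

# Implicit signal: 1 if the rating clears a threshold, else 0.
# The threshold of 4 is an assumption made for this sketch.
LIKE_THRESHOLD = 4
toy["implicit"] = (toy["rating"] >= LIKE_THRESHOLD).astype("int8")

print(toy)
```

An explicit target suits regression-style models that predict the rating, while a binary implicit target suits models that predict whether an interaction is positive.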
    

In [1]:
# set the environment path to find reco
import sys
sys.path.append("../")

In [2]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

In [3]:
users = pd.read_csv("data/users.csv")
items = pd.read_csv("data/items.csv")
ratings = pd.read_csv("data/ratings.csv")

In [4]:
users.head()

Unnamed: 0,user_id,age,sex,occupation,zip_code
0,1,24,M,technician,85711
1,2,53,F,other,94043
2,3,23,M,writer,32067
3,4,24,M,technician,43537
4,5,33,F,other,15213


In [5]:
ratings.head()

Unnamed: 0,user_id,movie_id,rating,unix_timestamp
0,196,242,3,881250949
1,186,302,3,891717742
2,22,377,1,878887116
3,244,51,2,880606923
4,166,346,1,886397596


In [6]:
items.head()

Unnamed: 0,movie_id,title,genre_unknown,Action,Adventure,Animation,Children,Comedy,Crime,Documentary,...,SciFi,Thriller,War,Western,year,overview,original_language,runtime,vote_average,vote_count
0,1,Toy Story (1995),0,0,0,1,1,1,0,0,...,0,0,0,0,1995.0,"Led by Woody, Andy's toys live happily in his ...",en,81.0,7.9,10878.0
1,2,GoldenEye (1995),0,1,1,0,0,0,0,0,...,0,1,0,0,1995.0,James Bond must unmask the mysterious head of ...,en,130.0,6.8,2037.0
2,3,Four Rooms (1995),0,0,0,0,0,0,0,0,...,0,1,0,0,1995.0,It's Ted the Bellhop's first night on the job....,en,98.0,6.1,1251.0
3,4,Get Shorty (1995),0,1,0,0,0,1,0,0,...,0,0,0,0,1995.0,Chili Palmer is a Miami mobster who gets sent ...,en,105.0,6.5,501.0
4,5,Copycat (1995),0,0,0,0,0,0,1,0,...,0,1,0,0,1995.0,An agoraphobic psychologist and a female detec...,en,124.0,6.5,424.0


In [7]:
items.iloc[0]

movie_id                                                             1
title                                                 Toy Story (1995)
genre_unknown                                                        0
Action                                                               0
Adventure                                                            0
Animation                                                            1
Children                                                             1
Comedy                                                               1
Crime                                                                0
Documentary                                                          0
Drama                                                                0
Fantasy                                                              0
FilmNoir                                                             0
Horror                                                               0
Musica

## Preprocessing

In [8]:
from reco.preprocess import encode_user_item

In [9]:
encoded_ratings, user_encoder, item_encoder = encode_user_item(ratings, "user_id", "movie_id")

Number of users:  943
Number of items:  1682


In [10]:
encoded_ratings.head()

Unnamed: 0,user_id,movie_id,rating,unix_timestamp,user_index,item_index
0,196,242,3,881250949,195,241
1,186,302,3,891717742,185,301
2,22,377,1,878887116,21,376
3,244,51,2,880606923,243,50
4,166,346,1,886397596,165,345
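
The `user_index` and `item_index` columns above are label encodings: each raw ID is mapped to a contiguous integer starting at 0, which is what an embedding layer expects. The sketch below shows the general idea with `np.unique`. The helper `label_encode` is hypothetical and is not necessarily how `encode_user_item` is implemented.

```python
import numpy as np
import pandas as pd

def label_encode(df, column, index_name):
    """Map the raw IDs in `column` to contiguous integers 0..n-1 (hypothetical helper)."""
    # `classes` holds the sorted unique IDs; `codes` is each row's position in it.
    classes, codes = np.unique(df[column].to_numpy(), return_inverse=True)
    out = df.copy()
    out[index_name] = codes
    return out, classes

# Toy interactions; the IDs are illustrative.
toy = pd.DataFrame({"user_id": [196, 186, 22, 196],
                    "movie_id": [242, 302, 377, 302]})
toy, user_classes = label_encode(toy, "user_id", "user_index")
toy, item_classes = label_encode(toy, "movie_id", "item_index")

# Decoding: index back into the array of classes.
decoded_users = user_classes[toy["user_index"].to_numpy()]
print(toy)
```

Keeping the array of classes around plays the same role as the returned encoders: it lets you translate model outputs back to the original IDs.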


In [11]:
# cast ratings to float32 and record the observed rating range
ratings['rating'] = ratings['rating'].astype(np.float32)
min_rating = ratings['rating'].min()
max_rating = ratings['rating'].max()
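
One plausible use of the rating bounds is to rescale ratings to the range [0, 1] for training and to map predictions back afterwards. This is an assumption, since the notebook does not state how `min_rating` and `max_rating` are used later; the helper names below are hypothetical.

```python
import numpy as np

def scale_ratings(ratings, min_rating, max_rating):
    """Min-max scale ratings into [0, 1]."""
    return (ratings - min_rating) / (max_rating - min_rating)

def unscale_ratings(scaled, min_rating, max_rating):
    """Invert the min-max scaling to recover the original rating scale."""
    return scaled * (max_rating - min_rating) + min_rating

# Toy ratings; illustrative values only.
raw = np.array([1, 3, 5], dtype=np.float32)
scaled = scale_ratings(raw, raw.min(), raw.max())
restored = unscale_ratings(scaled, raw.min(), raw.max())
print(scaled, restored)
```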

## Data Splitting

- Random Split
- Stratified Split
- Chronological Split

In [12]:
from reco.preprocess import random_split, user_split, sample_data

In [13]:
sampledf = sample_data()

### Random Split 

In [14]:
train_random, val_random, test_random = random_split(sampledf, [0.6, 0.2, 0.2])

In [15]:
train_random

Unnamed: 0,user_index,item_index,rating,timestamp,split_index
0,1,1,4,2000-01-01,0
1,1,1,4,2000-01-01,0
2,1,2,3,2000-01-02,0
3,1,2,3,2000-01-02,0
4,1,2,3,2000-01-02,0
5,2,1,4,2000-01-01,0
6,2,2,5,2000-01-01,0
7,2,1,4,2000-01-03,0
8,2,2,5,2000-01-03,0


In [16]:
val_random

Unnamed: 0,user_index,item_index,rating,timestamp,split_index
9,2,3,5,2000-01-03,1
10,3,3,5,2000-01-01,1
11,3,3,5,2000-01-03,1


In [17]:
test_random

Unnamed: 0,user_index,item_index,rating,timestamp,split_index
12,3,3,5,2000-01-03,2
13,3,3,5,2000-01-03,2
14,3,1,4,2000-01-04,2
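
For intuition, a random split can be sketched as a shuffle followed by cuts at the cumulative ratios. The function `random_split_sketch` and the toy data are hypothetical; this is not the implementation of `reco.preprocess.random_split`.

```python
import numpy as np
import pandas as pd

def random_split_sketch(df, ratios, seed=42):
    """Shuffle the rows, then cut them into consecutive chunks sized by `ratios`."""
    rng = np.random.default_rng(seed)
    shuffled = df.iloc[rng.permutation(len(df))]
    # Row positions where one split ends and the next begins.
    cuts = np.round(np.cumsum(ratios)[:-1] * len(df)).astype(int)
    return [shuffled.iloc[start:stop]
            for start, stop in zip([0, *cuts], [*cuts, len(df)])]

# Toy interactions: 3 users with 5 items each (illustrative only).
toy = pd.DataFrame({"user_index": np.repeat([1, 2, 3], 5),
                    "item_index": np.tile([1, 2, 3, 4, 5], 3)})
train, val, test = random_split_sketch(toy, [0.6, 0.2, 0.2])
print(len(train), len(val), len(test))
```

Because rows are assigned without regard to user or time, a user may end up with all of their interactions in a single split, and training rows may postdate test rows.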


### User Chronological Split

In [18]:
train_chrono, val_chrono, test_chrono = user_split(sampledf, "timestamp", "user_index", [0.6, 0.2, 0.2])

In [19]:
train_chrono

Unnamed: 0,user_index,item_index,rating,timestamp,split_index
0,1,1,4,2000-01-01,0
1,1,1,4,2000-01-01,0
2,1,2,3,2000-01-02,0
5,2,1,4,2000-01-01,0
6,2,2,5,2000-01-01,0
7,2,1,4,2000-01-03,0
10,3,3,5,2000-01-01,0
11,3,3,5,2000-01-03,0
12,3,3,5,2000-01-03,0


In [20]:
val_chrono

Unnamed: 0,user_index,item_index,rating,timestamp,split_index
3,1,2,3,2000-01-02,1
8,2,2,5,2000-01-03,1
13,3,3,5,2000-01-03,1


In [21]:
test_chrono

Unnamed: 0,user_index,item_index,rating,timestamp,split_index
4,1,2,3,2000-01-02,2
9,2,3,5,2000-01-03,2
14,3,1,4,2000-01-04,2
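
In the tables above, each user's earliest interactions land in train and their latest in test. A minimal sketch of that idea follows. The function `chrono_user_split` and the toy data are hypothetical and are not the implementation of `reco.preprocess.user_split`.

```python
import numpy as np
import pandas as pd

def chrono_user_split(df, user_col, time_col, ratios):
    """Split each user's history by time: earliest rows go to the first split."""
    ordered = df.sort_values([user_col, time_col], kind="stable")
    # Position of each row within its user's history, and that history's length.
    position = ordered.groupby(user_col).cumcount().to_numpy()
    size = ordered.groupby(user_col)[user_col].transform("size").to_numpy()
    # A row moves to the next split each time its position passes a cut point.
    split_index = np.zeros(len(ordered), dtype=int)
    for bound in np.cumsum(ratios)[:-1]:
        split_index += position >= np.round(bound * size)
    return [ordered[split_index == k] for k in range(len(ratios))]

# Toy interactions: 3 users, 5 items each, with shuffled dates (illustrative only).
dates = ["2000-01-05", "2000-01-01", "2000-01-03", "2000-01-02", "2000-01-04"]
toy = pd.DataFrame({
    "user_index": np.repeat([1, 2, 3], 5),
    "item_index": np.tile([1, 2, 3, 4, 5], 3),
    "timestamp": pd.to_datetime(np.tile(dates, 3)),
})
train, val, test = chrono_user_split(toy, "user_index", "timestamp", [0.6, 0.2, 0.2])
print(len(train), len(val), len(test))
```

Splitting per user keeps every user present in all three sets, and ordering by time means the model is evaluated only on interactions that come after the ones it was trained on for that user.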
