# Data Split

The purpose of a recommender system is to predict content a user will choose, watch, and rate highly. Since future selections hinge heavily on content that has been previously watched and evaluated, time becomes an important dimension in building a model.

To better capture how user preferences evolve over time, I used a chronological split for each user. This is my simplified pandas version of [Microsoft's](https://github.com/microsoft/recommenders) [ChronoSplit for Recommenders in PySpark](https://github.com/microsoft/recommenders/blob/main/recommenders/datasets/spark_splitters.py).

Step-by-step explanation of the code below:
1. Identify the test and holdout set sizes (20% and 10% of the data)
2. Label every review with one of 4 groups based on how many reviews the customer has given
    - User has >= 3 reviews: Final review is a holdout candidate (4), 2nd to last is a test candidate (3)
    - User has 2 reviews: Final review will be in the test set (2)
    - Every other review is labeled 1
3. Holdout Set: A random sample of final reviews made by users with 3 or more reviews
    - Ensures the model is evaluated on customers with an established review history
4. Test Set:
    - 2nd to last reviews of those users who are in the holdout set
    - Final reviews of those who review only twice
5. Chronological Split: The remaining data is ordered by review date
    - The 42,612 latest observations are added to the test set to bring it up to its target size
6. Train Set: All remaining observations will be used to train the models

In [1]:
import pandas as pd
import numpy as np

In [2]:
#notify me when a long running cell is complete
%load_ext jupyternotify


In [3]:
data = pd.read_csv('data/1m_useratt_minreq.csv')
minorityrec = pd.read_csv('data/minreq.csv')

#for split
data['r_date'] = data['r_date'].astype('datetime64[ns]')

data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000000 entries, 0 to 999999
Data columns (total 10 columns):
 #   Column                  Non-Null Count    Dtype         
---  ------                  --------------    -----         
 0   mid                     1000000 non-null  int64         
 1   cust_id                 1000000 non-null  int64         
 2   rating                  1000000 non-null  float64       
 3   r_date                  1000000 non-null  datetime64[ns]
 4   m_decade                1000000 non-null  int64         
 5   m_avg_rating            1000000 non-null  float64       
 6   user_engagement         1000000 non-null  int64         
 7   cust_act_activity_rank  1000000 non-null  int64         
 8   adopters                1000000 non-null  int64         
 9   m_minreq                1000000 non-null  float64       
dtypes: datetime64[ns](1), float64(3), int64(6)
memory usage: 76.3 MB


## 1. Identify Train/Test/Holdout Size

In [4]:
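#test set is 20% of the data, holdout is 10%
testsize = round(len(data) * 0.2)
hosize = round(len(data) * 0.1)

#group each user's rows together
#note: this does not order a user's rows by r_date, so the "final" review below is the
#last row per user in this ordering; add 'r_date' as a second sort key for a strict per-user chronology
data = data.sort_values(by=['cust_id'])
data.head()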

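|  | mid | cust_id | rating | r_date | m_decade | m_avg_rating | user_engagement | cust_act_activity_rank | adopters | m_minreq |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 320284 | 13462 | 6 | 3.0 | 2004-11-13 | 5 | 3.340967 | 7 | 4 | 3 | 0.0 |
| 370521 | 2782 | 6 | 5.0 | 2004-09-15 | 4 | 4.314801 | 7 | 4 | 3 | 0.0 |
| 78196 | 15105 | 6 | 3.0 | 2005-12-04 | 4 | 3.581345 | 7 | 4 | 3 | 0.0 |
| 291435 | 10730 | 6 | 5.0 | 2004-09-15 | 5 | 3.349899 | 7 | 4 | 3 | 0.0 |
| 677004 | 6339 | 6 | 1.0 | 2004-09-25 | 4 | 3.659341 | 7 | 4 | 3 | 1.0 |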


## 2. Identify How Many Reviews a User Has Given

In [5]:
#splitting data into 4 groups based on how many reviews they've given
for cust in data['cust_id'].unique():
    i = data.index[data['cust_id'] == cust]
    
    #if more than 2, the final review is a holdout candidate, 2nd to last is a test candidate
    if len(i) > 2:
        data.loc[i[-1], 'split'] = 4
        data.loc[i[-2], 'split'] = 3
        data.loc[i[:-2], 'split'] = 1
        
    #if 2, the final review will be in the test
    elif len(i) == 2:
        data.loc[i[-1], 'split'] = 2
        data.loc[i[0], 'split'] = 1
        
    #everyone with one review gets a linear temporal split
    else:
        data.loc[i[0], 'split'] = 1
        
#sort values by date
data = data.sort_values(by=['r_date'])
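The loop above rescans the whole `cust_id` column once per user, which is a large part of why this cell is slow. A vectorized sketch of the same labeling rules is shown here on toy data. It is not the notebook's code: `label_reviews` is a hypothetical helper, it sorts by `r_date` within each user (the notebook sorts by `cust_id` only), and it writes integer labels rather than floats.

```python
import pandas as pd

def label_reviews(df: pd.DataFrame) -> pd.DataFrame:
    """Label each review 1-4 using the same rules as the per-user loop."""
    # Assumption: order each user's reviews by date so "last" means latest.
    out = df.sort_values(["cust_id", "r_date"]).copy()
    grp = out.groupby("cust_id")
    n_reviews = grp["cust_id"].transform("size")   # reviews per user, per row
    from_end = grp.cumcount(ascending=False)       # 0 = user's last row

    out["split"] = 1                                              # default group
    out.loc[(n_reviews > 2) & (from_end == 0), "split"] = 4       # holdout candidate
    out.loc[(n_reviews > 2) & (from_end == 1), "split"] = 3       # test candidate
    out.loc[(n_reviews == 2) & (from_end == 0), "split"] = 2      # final of two reviews
    return out

# Toy data: user 1 has four reviews, user 2 has two, user 3 has one.
toy = pd.DataFrame({
    "cust_id": [1, 1, 1, 1, 2, 2, 3],
    "r_date": pd.to_datetime([
        "2004-01-01", "2004-03-01", "2004-02-01", "2004-04-01",
        "2005-01-02", "2005-01-01", "2005-06-01",
    ]),
})
labeled = label_reviews(toy)
print(labeled)
```

When two reviews by the same user share a date, their relative order is arbitrary in this sketch.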

## 3. Holdout Set

In [6]:
#random sample of final reviews
holdout = data[(data['split'] == 4)].sample(n=hosize, random_state=1)
holdout.head()

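|  | mid | cust_id | rating | r_date | m_decade | m_avg_rating | user_engagement | cust_act_activity_rank | adopters | m_minreq | split |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 153641 | 8596 | 533018 | 4.0 | 2004-10-19 | 4 | 4.129328 | 3 | 3 | 3 | 0.0 | 4.0 |
| 469815 | 6974 | 2523636 | 2.0 | 2003-07-04 | 4 | 4.403211 | 6 | 4 | 2 | 1.0 | 4.0 |
| 390996 | 15409 | 872268 | 4.0 | 2004-10-18 | 5 | 3.448441 | 3 | 3 | 3 | 0.0 | 4.0 |
| 155017 | 17324 | 1003661 | 5.0 | 2005-08-01 | 5 | 3.90613 | 12 | 5 | 3 | 1.0 | 4.0 |
| 898127 | 17769 | 1293202 | 1.0 | 2004-10-05 | 5 | 2.539474 | 4 | 3 | 2 | 0.0 | 4.0 |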


## 4. Test Set

In [7]:
#if the final review is in the holdout set, the 2nd-to-last will be in the test set

#new df with all 2nd to last reviews
ho2 = data[(data['split'] == 3)]

#list of all users in the holdout set
ho_custid = holdout['cust_id'].to_list()

# keep only users in the holdout using mask method
mask = ho2['cust_id'].isin(ho_custid)
test = ho2[mask]

test.info()

<class 'pandas.core.frame.DataFrame'>
Index: 100000 entries, 43064 to 748396
Data columns (total 11 columns):
 #   Column                  Non-Null Count   Dtype         
---  ------                  --------------   -----         
 0   mid                     100000 non-null  int64         
 1   cust_id                 100000 non-null  int64         
 2   rating                  100000 non-null  float64       
 3   r_date                  100000 non-null  datetime64[ns]
 4   m_decade                100000 non-null  int64         
 5   m_avg_rating            100000 non-null  float64       
 6   user_engagement         100000 non-null  int64         
 7   cust_act_activity_rank  100000 non-null  int64         
 8   adopters                100000 non-null  int64         
 9   m_minreq                100000 non-null  float64       
 10  split                   100000 non-null  float64       
dtypes: datetime64[ns](1), float64(4), int64(6)
memory usage: 9.2 MB
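The test frame has exactly as many rows as the holdout because every sampled holdout row belongs to a different user with at least 3 reviews, and each such user has exactly one row labeled 3. A toy sketch of that selection, with made-up labels and a hand-picked "sample" standing in for the random draw:

```python
import pandas as pd

# Toy labels: users 1 and 2 have three reviews each, user 3 has two.
labeled = pd.DataFrame({
    "cust_id": [1, 1, 1, 2, 2, 2, 3, 3],
    "split":   [1, 3, 4, 1, 3, 4, 1, 2],
})

# Pretend the random sample picked only user 1's final review.
holdout = labeled[(labeled["split"] == 4) & (labeled["cust_id"] == 1)]

# Keep the 2nd-to-last reviews of users that appear in the holdout.
second_to_last = labeled[labeled["split"] == 3]
test = second_to_last[second_to_last["cust_id"].isin(holdout["cust_id"])]

# One split == 3 row per holdout user, so the two frames have equal length.
print(len(holdout), len(test))
```

User 2's 2nd-to-last review is not selected here, so it stays in the pool that later feeds the chronological split and the train set.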


In [8]:
#using the rest of the data for train/test
traintest = data.loc[~data.index.isin(holdout.index)]
traintest = traintest.loc[~traintest.index.isin(test.index)]
traintest = traintest.sort_values(by=['r_date'])

In [9]:
#sanity check
traintest.info()

<class 'pandas.core.frame.DataFrame'>
Index: 800000 entries, 0 to 997311
Data columns (total 11 columns):
 #   Column                  Non-Null Count   Dtype         
---  ------                  --------------   -----         
 0   mid                     800000 non-null  int64         
 1   cust_id                 800000 non-null  int64         
 2   rating                  800000 non-null  float64       
 3   r_date                  800000 non-null  datetime64[ns]
 4   m_decade                800000 non-null  int64         
 5   m_avg_rating            800000 non-null  float64       
 6   user_engagement         800000 non-null  int64         
 7   cust_act_activity_rank  800000 non-null  int64         
 8   adopters                800000 non-null  int64         
 9   m_minreq                800000 non-null  float64       
 10  split                   800000 non-null  float64       
dtypes: datetime64[ns](1), float64(4), int64(6)
memory usage: 73.2 MB


In [10]:
#for reviewers who reviewed 2x, final reviews
test2 = traintest[(traintest['split'] == 2)]

#concat test
test = pd.concat([test, test2])
test.info()

<class 'pandas.core.frame.DataFrame'>
Index: 157388 entries, 43064 to 163663
Data columns (total 11 columns):
 #   Column                  Non-Null Count   Dtype         
---  ------                  --------------   -----         
 0   mid                     157388 non-null  int64         
 1   cust_id                 157388 non-null  int64         
 2   rating                  157388 non-null  float64       
 3   r_date                  157388 non-null  datetime64[ns]
 4   m_decade                157388 non-null  int64         
 5   m_avg_rating            157388 non-null  float64       
 6   user_engagement         157388 non-null  int64         
 7   cust_act_activity_rank  157388 non-null  int64         
 8   adopters                157388 non-null  int64         
 9   m_minreq                157388 non-null  float64       
 10  split                   157388 non-null  float64       
dtypes: datetime64[ns](1), float64(4), int64(6)
memory usage: 14.4 MB
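The number of rows the chronological step still has to supply follows directly from the counts so far. A quick check of that arithmetic, using only figures reported in this notebook's outputs (the variable names for the two counts are illustrative):

```python
# Figures taken from the cell outputs in this notebook.
n_rows = 1_000_000                 # rows in the full dataset
testsize = round(n_rows * 0.2)     # target test size
hosize = round(n_rows * 0.1)       # target holdout size

n_second_to_last = 100_000         # split == 3 rows belonging to holdout users
n_two_review_final = 57_388        # split == 2 rows (final review of 2-review users)

test_so_far = n_second_to_last + n_two_review_final
n_fill = testsize - test_so_far    # rows still needed from the chronological tail

print(test_so_far, n_fill)
```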


## 5. Chronological Split

In [11]:
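#chronological split: fill the remainder of the test set with the latest remaining reviews
#tail(n) is used instead of [-n:] because [-0:] would return every row when nothing is left to fill
n_fill = testsize - len(test)
test_lin = traintest.loc[~traintest.index.isin(test.index)].tail(n_fill)
test = pd.concat([test, test_lin])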

In [12]:
#sanity check
test.info()

<class 'pandas.core.frame.DataFrame'>
Index: 200000 entries, 43064 to 997311
Data columns (total 11 columns):
 #   Column                  Non-Null Count   Dtype         
---  ------                  --------------   -----         
 0   mid                     200000 non-null  int64         
 1   cust_id                 200000 non-null  int64         
 2   rating                  200000 non-null  float64       
 3   r_date                  200000 non-null  datetime64[ns]
 4   m_decade                200000 non-null  int64         
 5   m_avg_rating            200000 non-null  float64       
 6   user_engagement         200000 non-null  int64         
 7   cust_act_activity_rank  200000 non-null  int64         
 8   adopters                200000 non-null  int64         
 9   m_minreq                200000 non-null  float64       
 10  split                   200000 non-null  float64       
dtypes: datetime64[ns](1), float64(4), int64(6)
memory usage: 18.3 MB
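A small sketch of the fill step on toy data. It guards the case where the test set is already full, because a slice of the form `[-0:]` returns every row instead of none. `fill_from_tail` is a hypothetical helper, not part of the notebook.

```python
import pandas as pd

def fill_from_tail(pool: pd.DataFrame, n_fill: int) -> pd.DataFrame:
    """Return the n_fill most recent rows of pool by r_date (none if n_fill <= 0)."""
    if n_fill <= 0:
        return pool.iloc[0:0]          # empty frame with the same columns
    return pool.sort_values("r_date").tail(n_fill)

pool = pd.DataFrame({
    "r_date": pd.to_datetime(["2005-03-01", "2005-01-01", "2005-02-01"]),
    "rating": [3.0, 5.0, 4.0],
})

print(len(pool[-0:]))                              # the negative-zero slice keeps every row
print(len(fill_from_tail(pool, 0)))                # the guarded helper returns none
print(fill_from_tail(pool, 2)["rating"].tolist())  # ratings of the two latest reviews
```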


## 6. Train Set

In [14]:
#remainder in train
train = traintest.loc[~traintest.index.isin(test.index)]

print('holdout shape: ', holdout.shape)
print('test shape: ', test.shape)
print('train shape: ', train.shape)

holdout shape:  (100000, 11)
test shape:  (200000, 11)
train shape:  (700000, 11)
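The shapes add up to the full dataset, but matching totals do not rule out a row landing in two splits. A minimal sketch of a stricter check, assuming the row index is unique (as with the default `RangeIndex` from `read_csv`); `check_partition` is a hypothetical helper:

```python
import pandas as pd

def check_partition(full: pd.DataFrame, parts: dict) -> None:
    """Raise AssertionError unless the parts are disjoint and together cover full."""
    combined = pd.concat(list(parts.values()))
    assert not combined.index.duplicated().any(), "a row appears in two splits"
    assert len(combined) == len(full), "some rows are missing from the splits"
    assert combined.index.sort_values().equals(full.index.sort_values())

# Toy data: five rows split 3 / 1 / 1.
full = pd.DataFrame({"rating": [1.0, 2.0, 3.0, 4.0, 5.0]})
parts = {"train": full.iloc[:3], "test": full.iloc[3:4], "holdout": full.iloc[4:]}
check_partition(full, parts)
print("splits are disjoint and complete")
```

In this notebook the equivalent call would be `check_partition(data, {"train": train, "test": test, "holdout": holdout})`.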


Because the split takes a long time to run, I exported each set to its own .csv file so the data can be reused without rerunning it.

In [15]:
train.to_csv('data/train_1M.csv', index=False)
test.to_csv('data/test_1M.csv', index=False)
holdout.to_csv('data/ho_1M.csv', index=False)
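Two things change on the way through CSV: `index=False` drops the original row labels, and `r_date` is stored as text. A short sketch of reading a split back with the date column restored, using an in-memory buffer in place of the real files:

```python
import io
import pandas as pd

frame = pd.DataFrame({
    "cust_id": [6, 6],
    "r_date": pd.to_datetime(["2004-09-15", "2005-12-04"]),
})

# Write to an in-memory buffer instead of a file on disk.
buffer = io.StringIO()
frame.to_csv(buffer, index=False)

buffer.seek(0)
plain = pd.read_csv(buffer)                            # r_date comes back as text
buffer.seek(0)
parsed = pd.read_csv(buffer, parse_dates=["r_date"])   # r_date parsed as datetime

print(plain["r_date"].dtype, parsed["r_date"].dtype)
```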

In [16]:
test

|  | mid | cust_id | rating | r_date | m_decade | m_avg_rating | user_engagement | cust_act_activity_rank | adopters | m_minreq | split |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 43064 | 16438 | 882798 | 4.0 | 1999-12-30 | 4 | 3.732673 | 4 | 3 | 1 | 0.0 | 3.0 |
| 89202 | 15894 | 422071 | 3.0 | 1999-12-30 | 4 | 3.475410 | 7 | 4 | 1 | 0.0 | 3.0 |
| 452139 | 15455 | 2522229 | 4.0 | 1999-12-31 | 4 | 3.526316 | 8 | 4 | 1 | 0.0 | 3.0 |
| 964459 | 15599 | 802939 | 3.0 | 2000-01-05 | 2 | 3.333333 | 4 | 3 | 1 | 0.0 | 3.0 |
| 748729 | 9635 | 1611303 | 2.0 | 2000-01-05 | 4 | 3.531250 | 10 | 4 | 1 | 0.0 | 3.0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 613135 | 16377 | 2034915 | 5.0 | 2005-12-31 | 5 | 4.319194 | 2 | 2 | 5 | 0.0 | 1.0 |
| 906727 | 15463 | 1953749 | 5.0 | 2005-12-31 | 4 | 3.466667 | 3 | 3 | 4 | 0.0 | 1.0 |
| 997371 | 17750 | 1148389 | 2.0 | 2005-12-31 | 5 | 3.500000 | 15 | 5 | 3 | 0.0 | 1.0 |
| 861147 | 5908 | 1151752 | 5.0 | 2005-12-31 | 0 | 3.563380 | 19 | 5 | 4 | 0.0 | 1.0 |


In [17]:
test['split'].value_counts()

split
3.0    101655
2.0     57388
1.0     39325
4.0      1632
Name: count, dtype: int64

In [18]:
train['split'].value_counts()

split
1.0    646921
4.0     26551
3.0     26528
Name: count, dtype: int64