In [1]:
from pathlib import Path

import pandas as pd
import numpy as np

In [3]:
RAW_DATA_DIR = Path("data/raw/amazon")

In [20]:
reviews = pd.read_feather(RAW_DATA_DIR / "Movies_and_TV.f")

In [21]:
reviews.head()

Unnamed: 0,overall,verified,reviewTime,reviewerID,asin,style,reviewerName,reviewText,summary,unixReviewTime,vote,image
0,5,True,"03 11, 2013",A3478QRKQDOPQ2,1527665,"{'Color:': None, 'Format:': ' VHS Tape', 'Shap...",jacki,really happy they got evangelised .. spoiler a...,great,1362960000,,
1,5,True,"02 18, 2013",A2VHSG6TZHU1OB,1527665,"{'Color:': None, 'Format:': ' Amazon Video', '...",Ken P,Having lived in West New Guinea (Papua) during...,Realistic and Accurate,1361145600,3.0,
2,5,False,"01 17, 2013",A23EJWOW1TLENE,1527665,"{'Color:': None, 'Format:': ' Amazon Video', '...",Reina Berumen,Excellent look into contextualizing the Gospel...,Peace Child,1358380800,,
3,5,True,"01 10, 2013",A1KM9FNEJ8Q171,1527665,"{'Color:': None, 'Format:': ' Amazon Video', '...",N Coyle,"More than anything, I've been challenged to fi...",Culturally relevant ways to share the love of ...,1357776000,,
4,4,True,"12 26, 2012",A38LY2SSHVHRYB,1527665,"{'Color:': None, 'Format:': ' Amazon Video', '...",Jodie Vesely,This is a great movie for a missionary going i...,Good Movie! Great for cross-cultural missionar...,1356480000,,


We will use the last 5 years of data to build the recommender system (recsys).

In [22]:
reviews["reviewDate"] = pd.to_datetime(reviews["unixReviewTime"], unit="s")

In [23]:
start_date = reviews.reviewDate.max() - pd.DateOffset(years=5)

In [24]:
reviews_sample = reviews[reviews.reviewDate >= start_date]

In [25]:
reviews_sample.shape

(6256392, 13)

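Also, to train the model, we will probably keep only users that have interacted with at least N items (say N=5). Note that the counts below are computed over the full `reviews` history, not only over the 5-year sample.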

In [26]:
user_counts = reviews.reviewerID.value_counts()
user_counts = user_counts[user_counts >= 5].reset_index()
user_counts.columns = ["reviewerID", "counts"]

In [27]:
user_counts.head()

Unnamed: 0,reviewerID,counts
0,AV6QDP8Q0ONK4,4254
1,A1GGOC9PVDXW7Z,2292
2,A328S9RN3U5M68,2175
3,ABO2ZI2Y5DQ9T,2136
4,AWG2O9C42XW5G,2046


In [28]:
reviews_sample = reviews_sample[
    reviews_sample.reviewerID.isin(user_counts.reviewerID)
]

In [29]:
reviews_sample.shape

(2475976, 13)

For the purposes of this book/exercise, we will hold out the last N months (say N=3) to run a "live" simulation, so let's separate that period from the data now.

In [38]:
start_date = reviews_sample.reviewDate.max() - pd.DateOffset(months=3)

In [39]:
start_date

Timestamp('2018-07-01 00:00:00')

In [40]:
data_for_live_simulation = reviews_sample[reviews_sample.reviewDate >= start_date]

In [41]:
data_for_live_simulation.shape

(5529, 13)

In [37]:
reviews_sample.reviewDate.max()

Timestamp('2018-10-01 00:00:00')

In [42]:
reviews_sample = reviews_sample[reviews_sample.reviewDate < start_date]

In [43]:
reviews_sample.shape

(2470447, 13)

In [44]:
reviews_sample.reviewDate.min(), reviews_sample.reviewDate.max()

(Timestamp('2013-10-03 00:00:00'), Timestamp('2018-06-30 00:00:00'))

Regarding how one might split the data into train, validation and test sets, we could go a number of ways. We could split simply based on a timeline, which is probably what makes the most sense. Alternatively, we could use the method followed with Mult-VAE (splitting based on customers, rather than time) or the one in the Neural CF paper (use the last interaction of each user for testing and all the earlier ones for training).

For the time being I will go for the first option and split based on the timeline, using the most recent 10% for testing and the previous 10% as the validation set. Before that, let's have a look at the data.
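As a rough sketch of the timeline split described above, the helper below sorts by date and cuts by row count. The function name `split_by_timeline` and the toy frame are hypothetical, and reading "10%" as a fraction of rows (rather than of the time span) is an assumption.

```python
import pandas as pd


def split_by_timeline(df, date_col="reviewDate", valid_frac=0.1, test_frac=0.1):
    """Chronological split: oldest rows -> train, next -> valid, newest -> test.

    Hypothetical helper; the fractions are taken over rows, not over time.
    """
    df = df.sort_values(date_col, kind="stable").reset_index(drop=True)
    n = len(df)
    n_test = int(n * test_frac)
    n_valid = int(n * valid_frac)
    n_train = n - n_valid - n_test
    train = df.iloc[:n_train]
    valid = df.iloc[n_train:n_train + n_valid]
    test = df.iloc[n_train + n_valid:]
    return train, valid, test


# Toy data standing in for reviews_sample (one review per day)
toy = pd.DataFrame({
    "reviewerID": [f"u{i % 4}" for i in range(20)],
    "reviewDate": pd.date_range("2018-01-01", periods=20, freq="D"),
})
train, valid, test = split_by_timeline(toy)
```

Because the cut is made by row position, reviews that share the same date can end up on both sides of a boundary. Cutting at a date threshold instead would avoid that, at the cost of split sizes that are only approximately 10%.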

In [48]:
reviews_sample.count()/reviews_sample.shape[0]

overall           1.000000
verified          1.000000
reviewTime        1.000000
reviewerID        1.000000
asin              1.000000
style             0.957579
reviewerName      0.999972
reviewText        0.999355
summary           0.999790
unixReviewTime    1.000000
vote              0.083006
image             0.002177
reviewDate        1.000000
dtype: float64

## Overall (stars)

In [45]:
reviews_sample.overall.value_counts()

5    1551203
4     445960
3     232772
1     129517
2     110995
Name: overall, dtype: int64

As we knew, most of the ratings are 5. If we were to treat this as a multi-class problem, I would map 1-2 to 0, 3 to 1 and 4-5 to 2.
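The three-class mapping mentioned above could look like the following minimal sketch. The dictionary name `RATING_TO_CLASS`, the column name `rating_class` and the toy frame are illustrative assumptions.

```python
import pandas as pd

# Mapping taken from the note above: 1-2 -> 0, 3 -> 1, 4-5 -> 2
RATING_TO_CLASS = {1: 0, 2: 0, 3: 1, 4: 2, 5: 2}

# Toy data standing in for reviews_sample
toy = pd.DataFrame({"overall": [5, 4, 3, 1, 2, 5]})
toy["rating_class"] = toy["overall"].map(RATING_TO_CLASS)
```

Any rating outside 1-5 would map to NaN, which makes unexpected values easy to spot.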

## Verified

In [46]:
reviews_sample.verified.value_counts()

True     2113284
False     357163
Name: verified, dtype: int64

A binary categorical feature.

## reviewTime

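Here we could build a number of behavioural features: most common time of the day when they buy, most common day of the week, or the mean or median time difference between events. Note that the timestamps shown above all fall at midnight, so the time of day may not be recoverable from this dataset.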
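A minimal sketch of two of these behavioural features, computed per user on toy data: the most common day of the week and the median number of days between consecutive reviews. The column names `dayofweek`, `days_since_prev` and the output frame `user_time_feats` are assumptions made for illustration.

```python
import pandas as pd

# Toy data standing in for reviews_sample
toy = pd.DataFrame({
    "reviewerID": ["a", "a", "a", "b", "b"],
    "reviewDate": pd.to_datetime(
        ["2018-01-01", "2018-01-08", "2018-01-10", "2018-02-02", "2018-02-03"]
    ),
})

toy = toy.sort_values(["reviewerID", "reviewDate"])
toy["dayofweek"] = toy["reviewDate"].dt.dayofweek  # Monday=0 ... Sunday=6
# Days elapsed since the same user's previous review (NaN for the first one)
toy["days_since_prev"] = toy.groupby("reviewerID")["reviewDate"].diff().dt.days

user_time_feats = toy.groupby("reviewerID").agg(
    # On ties, mode() returns the values sorted, so the smallest one is kept
    most_common_dow=("dayofweek", lambda s: s.mode().iloc[0]),
    median_days_between=("days_since_prev", "median"),
)
```

In a real pipeline these aggregates should be computed only from data that precedes the validation and test periods, to avoid leaking future behaviour into the features.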

## reviewerID

I will look for the presentation where the CatBoost team uses users as a categorical feature (which would be roughly "equivalent" to using embeddings). In any case, when using LightGBM/CatBoost, I intend to use it as a categorical feature and see what happens.
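One way to prepare `reviewerID` as a categorical feature is sketched below, using only pandas on toy data. LightGBM's Python API can pick up pandas `category` columns as categorical features, and CatBoost accepts a `cat_features` argument; the model-fitting step itself is left out here, and the `reviewerID_code` column is an illustrative assumption.

```python
import pandas as pd

# Toy data standing in for reviews_sample
toy = pd.DataFrame({
    "reviewerID": ["A1", "B2", "A1", "C3"],
    "overall": [5, 4, 3, 5],
})

# pandas category dtype; categories are stored in sorted order
toy["reviewerID"] = toy["reviewerID"].astype("category")

# Integer codes, in case a library expects integer-encoded categories
toy["reviewerID_code"] = toy["reviewerID"].cat.codes
```

The categories learned on the training set would need to be reused for the validation and test sets so that the same user always gets the same code.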

## asin

Here we could build user features based on the products they buy, for example: most common category or min/max/mean/median price.

Also, the same comment as for reviewerID applies: asin could be used as a categorical feature.
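The product-based user features could be sketched as follows. The reviews file shown above has no category or price columns, so this assumes a separate, hypothetical item metadata table (`items_toy`, with `asin`, `category` and `price` columns) that is joined on `asin`.

```python
import pandas as pd

# Toy data standing in for reviews_sample
reviews_toy = pd.DataFrame({
    "reviewerID": ["a", "a", "a", "b"],
    "asin": ["p1", "p2", "p3", "p1"],
})

# Hypothetical item metadata; not part of the reviews file used above
items_toy = pd.DataFrame({
    "asin": ["p1", "p2", "p3"],
    "category": ["Drama", "Drama", "Comedy"],
    "price": [10.0, 20.0, 6.0],
})

merged = reviews_toy.merge(items_toy, on="asin", how="left")

user_item_feats = merged.groupby("reviewerID").agg(
    # Assumes every user has at least one item with a known category
    top_category=("category", lambda s: s.mode().iloc[0]),
    price_min=("price", "min"),
    price_max=("price", "max"),
    price_mean=("price", "mean"),
    price_median=("price", "median"),
)
```

With real metadata, items missing a price or category would need explicit handling before aggregating.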