# DATA DESCRIPTION

## Where is the dataset from?
- This dataset was constructed by YOOCHOOSE GmbH to support participants in the RecSys Challenge 2015.
- The dataset can be downloaded from https://www.kaggle.com/chadgostopp/recsys-challenge-2015
- The YOOCHOOSE dataset contains a collection of sessions from a retailer, where each session encapsulates the click events that the user performed in that session. Some sessions also contain buy events, meaning that the session ended with the user buying something from the web shop. The data was collected over several months in 2014 and reflects the clicks and purchases performed by the users of an online retailer in Europe.





## Why did we choose this dataset?
- Unfortunately, the majority of e-commerce data is private. This dataset can help identify patterns in the e-commerce domain: what kinds of users a store interacts with, how often they access the site, where they are located, and how products and users are distributed.
- The dataset is also quite large, so it can realistically represent e-commerce data in general.

## Why is it interesting for us?

- We are Brazilian, and the e-commerce market in Brazil is growing, so this work can add value in our country.
- Dealing with e-commerce data is a challenge, since the data is extremely unbalanced and sessions vary a lot in size, as we will see below.
- During the COVID-19 pandemic that we are experiencing, large retail stores had to reinvent themselves for the online world, which makes studying online user data increasingly important.

## How is the data organized?

- The data is sequential: each session is a sequence of customer clicks that may or may not lead to a purchase.



- The file yoochoose-clicks.dat contains the click events of the users on the items.

  - Each record/line in the file has the following fields: Session ID, Timestamp, Item ID, Category

  - Session ID – the id of the session. One session contains one or more clicks. Can be represented as an integer.

  - Timestamp – the time when the click occurred, in the format YYYY-MM-DDThh:mm:ss.SSSZ

  - Item ID – the unique identifier of the item that was clicked. Can be represented as an integer.

  - Category – the context of the click. The value "S" indicates a special offer, "0" indicates a missing value, a number from 1 to 12 indicates a real category identifier, and any other number indicates a brand. For example, if an item was clicked in the context of a promotion or special offer, the value is "S"; if the context was a brand, e.g. BOSCH, the value is an 8-10 digit number; if the item was clicked under a regular category, e.g. sport, the value is a number from 1 to 12.

- The file yoochoose-buys.dat contains the buy events of the users on the items.
  - Each record/line in the file has the following fields: Session ID, Timestamp, Item ID, Price, Quantity
  - Session ID – the id of the session. One session contains one or more buy events. Can be represented as an integer.
  - Timestamp – the time when the buy occurred, in the format YYYY-MM-DDThh:mm:ss.SSSZ
  - Item ID – the unique identifier of the item that was bought. Can be represented as an integer.
  - Price – the price of the item. Can be represented as an integer.
  - Quantity – the number of units of the item bought in this event. Can be represented as an integer.
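
The Category rules and the timestamp format described above can be expressed in a few lines. This is a minimal sketch using only the standard library; the helper name `classify_category` and the labels it returns are illustrative assumptions, not part of the dataset documentation.

```python
from datetime import datetime

def classify_category(value):
    """Map a raw Category value to the kind of click context it encodes.

    Hypothetical helper: the name and the returned labels are illustrative.
    """
    value = str(value).strip()
    if value == "S":
        return "special_offer"
    if value == "0":
        return "missing"
    if value.isdigit() and 1 <= int(value) <= 12:
        return "category"
    if value.isdigit():
        return "brand"
    return "unknown"  # anything the description above does not cover

# Timestamps follow the layout YYYY-MM-DDThh:mm:ss.SSSZ.
ts = datetime.strptime("2014-04-07T10:51:09.277Z", "%Y-%m-%dT%H:%M:%S.%fZ")

# Invented example values, one per rule.
examples = ["S", "0", "7", "2089046645"]
kinds = [classify_category(v) for v in examples]
print(kinds)
print(ts.hour, ts.month)
```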





# CONSTRUCT DATASET

- The objective of this section is to create clean data and extract new information (as columns) for better data analysis.
- To clean the data we are going to:
  - remove possible null values.
  - remove duplicate rows.
  - remove possible noisy data.
- Also, we want to create:
  - Columns with time components (hour, month, day of week, etc.).
  - A column with the time between two consecutive clicks (dwelltime).
  - An item rank column, that is, the total quantity in which an item was purchased.
  - A label column that can be True or False. True if a user session terminates with bought items, False if not.
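
The derived columns listed above can be illustrated on a toy example before applying them to the full dataset. This is only a sketch: the two small frames and their values are invented, and dwelltime is assumed to mean the number of seconds since the previous click in the same session.

```python
import pandas as pd

# Toy click log (invented values): session 1 has three clicks, session 2 has one.
clicks = pd.DataFrame({
    "session_id": [1, 1, 1, 2],
    "timestamp": pd.to_datetime([
        "2014-04-07T10:51:09Z", "2014-04-07T10:54:09Z",
        "2014-04-07T10:54:39Z", "2014-04-07T13:56:37Z",
    ]),
    "item_id": [10, 11, 10, 12],
})
# Toy buy log (invented values): only session 1 ends with a purchase.
buys = pd.DataFrame({"session_id": [1, 1], "item_id": [10, 11], "quantity": [2, 1]})

# Time columns derived from the timestamp.
clicks["hour"] = clicks.timestamp.dt.hour
clicks["weekday"] = clicks.timestamp.dt.weekday  # Monday = 0

# Dwelltime: seconds since the previous click of the same session
# (0 for the first click of each session).
clicks = clicks.sort_values(["session_id", "timestamp"])
clicks["dwelltime"] = (
    clicks.groupby("session_id").timestamp.diff().dt.total_seconds().fillna(0.0)
)

# Item rank: total quantity purchased per item; -1 for items never bought.
item_rank = buys.groupby("item_id").quantity.sum()
clicks["item_rank"] = clicks.item_id.map(item_rank).fillna(-1)

# Label: True when the click belongs to a session that contains a buy event.
clicks["label"] = clicks.session_id.isin(buys.session_id)

print(clicks)
```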


## Pre-Loading data

### Drive

In [None]:
from google.colab import drive
drive.mount('/content/gdrive', force_remount=True)

Mounted at /content/gdrive


In [None]:
import os
os.chdir('/content/gdrive/My Drive/recsys/buyer_classification/')

### Download raw data

In [None]:
!wget http://s3-eu-west-1.amazonaws.com/yc-rdata/yoochoose-data.7z

--2020-04-04 20:36:48--  http://s3-eu-west-1.amazonaws.com/yc-rdata/yoochoose-data.7z
Resolving s3-eu-west-1.amazonaws.com (s3-eu-west-1.amazonaws.com)... 52.218.105.42
Connecting to s3-eu-west-1.amazonaws.com (s3-eu-west-1.amazonaws.com)|52.218.105.42|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 287211932 (274M) [application/octet-stream]
Saving to: ‘yoochoose-data.7z.1’


2020-04-04 20:41:17 (1.02 MB/s) - ‘yoochoose-data.7z.1’ saved [287211932/287211932]



In [None]:
!7z x yoochoose-data.7z.1


7-Zip [64] 16.02 : Copyright (c) 1999-2016 Igor Pavlov : 2016-05-21
p7zip Version 16.02 (locale=C.UTF-8,Utf16=on,HugeFiles=on,64 bits,8 CPUs Intel(R) Xeon(R) CPU E5-2623 v4 @ 2.60GHz (406F1),ASM,AES-NI)

Scanning the drive for archives:
  0M Sca        1 file, 287211932 bytes (274 MiB)

Extracting archive: yoochoose-data.7z.1
--
Path = yoochoose-data.7z.1
Type = 7z
Physical Size = 287211932
Headers Size = 255
Method = LZMA:24
Solid = +
Blocks = 2



### Load dataset

In [None]:
import pandas as pd

c_cols = ['session_id','timestamp','item_id','category']
b_cols = ['session_id','timestamp','item_id','price','quantity']

dc_path = 'yoochoose-clicks.dat'
db_path = 'yoochoose-buys.dat'

db = pd.read_csv(db_path,sep=',',header=None,names=b_cols)
db

Unnamed: 0,session_id,timestamp,item_id,price,quantity
0,420374,2014-04-06T18:44:58.314Z,214537888,12462,1
1,420374,2014-04-06T18:44:58.325Z,214537850,10471,1
2,281626,2014-04-06T09:40:13.032Z,214535653,1883,1
3,420368,2014-04-04T06:13:28.848Z,214530572,6073,1
4,420368,2014-04-04T06:13:28.858Z,214835025,2617,1
...,...,...,...,...,...
1150748,11368701,2014-09-26T07:52:51.357Z,214849809,554,2
1150749,11368691,2014-09-25T09:37:44.206Z,214700002,6806,5
1150750,11523941,2014-09-25T06:14:47.965Z,214578011,14556,1
1150751,11423202,2014-09-26T18:49:34.024Z,214849164,1046,1


In [None]:
dc = pd.read_csv(dc_path,sep=',',header=None,names=c_cols)
dc



Unnamed: 0,session_id,timestamp,item_id,category
0,1,2014-04-07T10:51:09.277Z,214536502,0
1,1,2014-04-07T10:54:09.868Z,214536500,0
2,1,2014-04-07T10:54:46.998Z,214536506,0
3,1,2014-04-07T10:57:00.306Z,214577561,0
4,2,2014-04-07T13:56:37.614Z,214662742,0
...,...,...,...,...
33003939,11299809,2014-09-25T09:33:22.412Z,214819412,S
33003940,11299809,2014-09-25T09:43:52.821Z,214830939,S
33003941,11299811,2014-09-24T19:02:09.741Z,214854855,S
33003942,11299811,2014-09-24T19:02:11.894Z,214854838,S
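
The category column mixes values such as `S`, `0` and numeric codes, so pandas may infer inconsistent types for it while reading a large file in chunks. One possible way to avoid this, assuming it is acceptable to treat the column as text, is to declare its type explicitly. The sketch below uses a three-row in-memory sample in place of the real file; `sample_file` and `dc_small` are illustrative names.

```python
import io
import pandas as pd

c_cols = ['session_id', 'timestamp', 'item_id', 'category']

# Three rows in the layout of yoochoose-clicks.dat (values copied from the output above).
sample_file = io.StringIO(
    "1,2014-04-07T10:51:09.277Z,214536502,0\n"
    "2,2014-04-07T13:56:37.614Z,214662742,0\n"
    "11299809,2014-09-25T09:33:22.412Z,214819412,S\n"
)

# Reading category as str keeps "0", "S" and numeric codes in one consistent type;
# parse_dates converts the timestamp column while loading.
dc_small = pd.read_csv(
    sample_file, sep=',', header=None, names=c_cols,
    dtype={'category': str}, parse_dates=['timestamp'],
)
print(dc_small.dtypes)
print(dc_small.category.tolist())
```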


In [None]:
data = dc

In [None]:
data.shape

(33003944, 4)

## Data Cleaning

### Remove possible null rows

In [None]:
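# Count the missing values in each column.
# (data[data.isnull()].sum() masks every non-null value to NaN before summing,
# so it reports 0 whether or not nulls are present.)
data.isnull().sum()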

session_id    0.0
timestamp     0.0
item_id       0.0
category      0.0
dtype: float64

### Drop duplicates

In [None]:
print("Shape before dropping duplicates:", data.shape)
data = data.drop_duplicates()
print("Shape after dropping duplicates:", data.shape)

Shape before dropping duplicates: (33003944, 4)
Shape after dropping duplicates: (33003876, 4)


### Remove noise data in purchased items

In [None]:
db.describe()

Unnamed: 0,session_id,item_id,price,quantity
count,1150753.0,1150753.0,1150753.0,1150753.0
mean,5914902.0,220453300.0,1423.527,0.6460865
std,3347447.0,48973050.0,4651.549,1.14452
min,11.0,214507300.0,0.0,0.0
25%,2958503.0,214716700.0,0.0,0.0
50%,5968063.0,214835000.0,0.0,0.0
75%,8824554.0,214849800.0,1046.0,1.0
max,11562120.0,1178838000.0,334998.0,30.0


In [None]:
db = db[db.quantity != 0]
db = db[db.price != 0]
db

Unnamed: 0,session_id,timestamp,item_id,price,quantity
0,420374,2014-04-06T18:44:58.314Z,214537888,12462,1
1,420374,2014-04-06T18:44:58.325Z,214537850,10471,1
2,281626,2014-04-06T09:40:13.032Z,214535653,1883,1
3,420368,2014-04-04T06:13:28.848Z,214530572,6073,1
4,420368,2014-04-04T06:13:28.858Z,214835025,2617,1
...,...,...,...,...,...
1150748,11368701,2014-09-26T07:52:51.357Z,214849809,554,2
1150749,11368691,2014-09-25T09:37:44.206Z,214700002,6806,5
1150750,11523941,2014-09-25T06:14:47.965Z,214578011,14556,1
1150751,11423202,2014-09-26T18:49:34.024Z,214849164,1046,1


## Sample dataset

In [None]:
data = data.sample(int(5e6))
data

Unnamed: 0,session_id,timestamp,item_id,category
28866956,9777361,2014-09-07T18:27:45.186Z,214844559,11
5007369,1696067,2014-04-24T15:46:06.432Z,214829362,0
6276609,1899713,2014-05-02T00:04:44.006Z,214532940,0
13578129,4528294,2014-06-10T17:50:36.477Z,214642787,0
2251022,785562,2014-04-14T12:53:09.308Z,214711438,0
...,...,...,...,...
15177658,5161458,2014-06-21T13:49:05.969Z,214835469,0
22411162,7656593,2014-08-10T15:29:33.509Z,214663232,1
10235978,3198206,2014-05-25T13:54:27.373Z,214844192,0
21051997,7081707,2014-08-01T13:14:17.405Z,214684715,S
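
Sampling individual rows can keep only some of the clicks of a session, which affects any per-session feature such as the time between clicks, and without a seed the sample changes on every run. A possible alternative, sketched below on invented toy data, is to sample whole sessions with a fixed seed. The helper `sample_sessions` and the seed value are assumptions for illustration, not what this notebook does.

```python
import pandas as pd

def sample_sessions(clicks, n_sessions, random_state=42):
    """Keep every click of `n_sessions` randomly chosen sessions.

    Hypothetical helper; the seed value is arbitrary.
    """
    session_ids = clicks['session_id'].drop_duplicates()
    chosen = session_ids.sample(n=n_sessions, random_state=random_state)
    return clicks[clicks['session_id'].isin(chosen)]

# Toy click log (invented values) with three sessions.
toy = pd.DataFrame({
    'session_id': [1, 1, 2, 3, 3, 3],
    'item_id': [10, 11, 12, 13, 14, 15],
})
subset = sample_sessions(toy, n_sessions=2)
print(subset)
```

Every session that appears in `subset` keeps all of its clicks, and rerunning with the same `random_state` returns the same sessions.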


## Data Processing

### Add times columns

In [None]:
data['timestamp'] = pd.to_datetime(data['timestamp'])

data['hour'] = data.timestamp.dt.hour
data['month'] = data.timestamp.dt.month
data['weekday'] = data.timestamp.dt.weekday
data['day'] = data.timestamp.dt.day
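# Series.dt.week was deprecated in pandas 1.1 and removed in 2.0;
# isocalendar().week returns the same ISO week number.
data['week'] = data.timestamp.dt.isocalendar().week.astype(int)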

### Add dwelltime (time between clicks)

In [None]:
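def apply_dwelltime(sample):
    # Sort clicks chronologically within each session so that diff() measures
    # the time elapsed since the previous click of the same session.
    sample = sample.sort_values(['session_id', 'timestamp'])
    # Seconds since the previous click. The first click of a session (and the
    # only click of a single-click session) has no predecessor, so it gets 0.
    # total_seconds() is used because .dt.seconds only returns the seconds
    # component of the difference, not the whole duration.
    sample['dwelltime'] = sample.groupby('session_id').timestamp.diff()\
                .dt.total_seconds().fillna(0.0)
    return sample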

In [None]:
data = apply_dwelltime(data)
data



### Add Item Rank

In [None]:
# Item rank: total quantity purchased per item, summed over all buy events.
item_rank = db.groupby('item_id').quantity.sum().to_dict()
item_rank

In [None]:
data['item_rank'] = data['item_id'].map(item_rank).fillna(-1)
data

### Add Buyer Label

In [None]:
data['label'] = data.session_id.isin(db.session_id)
data

## Save data

In [None]:
data.to_csv("sample.csv")

In [None]:
db.to_csv("purchased_items.csv")
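
By default `to_csv` also writes the row index as an unnamed first column, which shows up as `Unnamed: 0` when the file is read back. If that column is not needed, `index=False` leaves it out. The sketch below shows the difference on a toy frame with invented values, writing to in-memory buffers instead of files.

```python
import io
import pandas as pd

# Toy frame (invented values) standing in for the sampled data.
toy = pd.DataFrame({'session_id': [1, 2], 'label': [True, False]}, index=[7, 3])

# Default behaviour: the index is written as an unnamed first column.
with_index = io.StringIO()
toy.to_csv(with_index)
with_index.seek(0)
cols_with_index = pd.read_csv(with_index).columns.tolist()
print(cols_with_index)

# index=False writes only the data columns.
without_index = io.StringIO()
toy.to_csv(without_index, index=False)
without_index.seek(0)
reloaded = pd.read_csv(without_index)
print(reloaded.columns.tolist())
```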