# Retail Rocket - Anonymous e-commerce site
## Buy or not buy

Retail Rocket provided an anonymized [data set]() of an e-commerce site's web activity to Kaggle for analysis. Three main data sets were provided: events, item properties, and the item category tree.

1. **events** - The behavioral events of each visitor to the site. Three types of events were tracked: view, add to cart, and transaction (or buy).
2. **item properties** - A time series, in weekly buckets, of the item properties. Item properties can change over time, so the value in effect when the visitor interacted with the item is tracked.
3. **category tree** - A hierarchical data set relating each category to its parent category. This data is not utilized in this analysis.

The data set is from an actual e-commerce site, and Retail Rocket anonymized many significant parts of the data. The category ID (but not the category name) and item availability were not anonymized. Some of the more interesting properties, such as price and discount, were anonymized, which ruled out some of the more interesting types of analysis. My analysis focused on splitting the event data into individual sessions and attempting to classify, from a visitor's previous session, whether or not the next session will result in a buy.

First, let's import custom libraries to perform the analysis.

In [2]:
import data_transformation # loads and transforms the raw csv data into data frames which can be used for analysis
import create_features # generates the features which will be utilized for classification
import model_selection # runs the classification models

import pandas as pd
import seaborn as sns

In [3]:
%load_ext autoreload
%autoreload 2

### Data Loading

The events data set was a dump of site activity from May to September 2015. The data did not identify sessions, so the sessions for each visitor had to be calculated. A new session "started" each time there was a gap of at least 30 minutes (configurable) between a visitor's events.
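The gap rule can be sketched in pandas as follows. This is a minimal illustration, not the code inside `data_transformation`: the helper `assign_sessions` and the `session_number` column are hypothetical names, and the sketch assumes `timestamp` holds epoch milliseconds, as the sample rows suggest.

```python
import pandas as pd

SESSION_TIME_LIMIT = 30  # minutes; stands in for the configurable limit


def assign_sessions(events: pd.DataFrame, limit_minutes: int = SESSION_TIME_LIMIT) -> pd.DataFrame:
    """Add per-visitor gap and session-number columns (hypothetical helper)."""
    out = events.sort_values(["visitorid", "timestamp"]).copy()
    # Milliseconds since the same visitor's previous event; 0 for a visitor's first event.
    gap_ms = out.groupby("visitorid")["timestamp"].diff().fillna(0)
    out["minutes_since_prev_event"] = gap_ms / 60_000
    # A gap of at least the limit opens a new session; the running sum numbers them from 1.
    new_session = (out["minutes_since_prev_event"] >= limit_minutes).astype(int)
    out["session_number"] = new_session.groupby(out["visitorid"]).cumsum() + 1
    return out


# Toy data: visitor 1 returns after a 45 minute gap, so the third event opens session 2.
toy = pd.DataFrame({
    "visitorid": [1, 1, 1, 2],
    "timestamp": [0, 5 * 60_000, 50 * 60_000, 10 * 60_000],
})
sessions = assign_sessions(toy)
print(sessions)
```

Sorting before taking the per-visitor difference matters: without it, the gaps would be measured between unrelated rows.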

In [4]:
print(f'The session limit is set to {data_transformation.SESSION_TIME_LIMIT} minutes.')

The session limit is set to 30 minutes.


Let's load the events data.

In [5]:
events = data_transformation.load()

Total number of events: 2,756,101

Session calcualted and sequenced.

Added category and item availability property.


In [6]:
print(f'Number of unique items: {len(events.itemid.unique()):,}')
print(f'Number of unique visitors: {len(events.visitorid.unique()):,}')
print(f'Events start at {events.local_date_time.min():%m-%d-%Y} and ends at {events.local_date_time.max():%m-%d-%Y}')
events.sample(10)

Number of unique items: 235,061
Number of unique visitors: 1,407,580
Events start at 05-02-2015 and ends at 09-17-2015


| | timestamp | visitorid | event | itemid | transactionid | local_date_time | minutes_since_prev_event | session_id | seq | categoryid | available |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 1338123 | 1436302686197 | 310759 | view | 447987 | | 2015-07-07 13:58:06.197 | 0.095833 | 310759_78069 | 9.0 | | |
| 1649290 | 1437501181900 | 1254532 | view | 398883 | | 2015-07-21 10:53:01.900 | 0.0 | 1254532_1 | 1.0 | | |
| 1864309 | 1438230319036 | 440406 | view | 84773 | | 2015-07-29 21:25:19.036 | 0.0 | 440406_1 | 1.0 | 53.0 | 1.0 |
| 866476 | 1434317188460 | 1407335 | view | 71952 | | 2015-06-14 14:26:28.460 | 0.545267 | 1407335_1 | 1.0 | 973.0 | 0.0 |
| 865802 | 1434314720384 | 828245 | view | 341333 | | 2015-06-14 13:45:20.384 | 0.0 | 828245_1 | 1.0 | 1263.0 | 1.0 |
| 1376252 | 1436428827368 | 703922 | view | 254888 | | 2015-07-09 01:00:27.368 | 2.62125 | 703922_1 | 1.0 | 1429.0 | 1.0 |
| 1626636 | 1437418511459 | 833064 | view | 260404 | | 2015-07-20 11:55:11.459 | 0.0 | 833064_1 | 1.0 | 730.0 | 0.0 |
| 1245759 | 1435889700975 | 271126 | view | 295782 | | 2015-07-02 19:15:00.975 | 0.0 | 271126_1 | 1.0 | 29.0 | 0.0 |
| 1620021 | 1437401640061 | 736018 | view | 55925 | | 2015-07-20 07:14:00.061 | 0.600583 | 736018_1 | 1.0 | 1018.0 | 1.0 |
| 1994946 | 1438827743462 | 879315 | view | 458803 | | 2015-08-05 19:22:23.462 | 0.0 | 879315_1 | 1.0 | 1503.0 | 1.0 |


The full data set is rather large, so we will reduce it to the first two sessions for each visitor. The second session will be the observation data set, and the first session will be utilized to generate features to predict whether or not the second session will result in a buy.
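One way to build such a pair of frames is sketched below. It is an assumption-laden illustration rather than the actual `create_observations` implementation: the function name `first_two_sessions`, the `session_number` column, and the `"transaction"` event label are all hypothetical stand-ins.

```python
import pandas as pd


def first_two_sessions(events: pd.DataFrame):
    """Return (second-session, first-session) buy labels per visitor (hypothetical helper)."""
    # One row per visitor session, flagged 1 if any event in it was a transaction.
    sessions = (
        events.assign(is_buy=events["event"].eq("transaction"))
        .groupby(["visitorid", "session_number"], as_index=False)["is_buy"]
        .max()
        .rename(columns={"is_buy": "buy_event"})
    )
    sessions["buy_event"] = sessions["buy_event"].astype(int)
    # Keep only visitors who actually have a second session to predict.
    has_second = sessions.loc[sessions["session_number"] == 2, "visitorid"]
    sessions = sessions[sessions["visitorid"].isin(has_second)]
    obs = sessions[sessions["session_number"] == 2].reset_index(drop=True)
    prior_obs = sessions[sessions["session_number"] == 1].reset_index(drop=True)
    return obs, prior_obs


# Toy data: visitor 2 never returns, so only visitors 1 and 3 yield observations.
toy_events = pd.DataFrame({
    "visitorid": [1, 1, 1, 2, 3, 3],
    "session_number": [1, 1, 2, 1, 1, 2],
    "event": ["view", "addtocart", "transaction", "view", "view", "view"],
})
toy_obs, toy_prior = first_two_sessions(toy_events)
print(toy_obs)
print(toy_prior)
```

Dropping visitors with a single session keeps the two frames aligned row for row, which makes joining features to labels straightforward.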

In [7]:
obs, prior_obs = data_transformation.create_observations(events, 2)

In [8]:
buy_percent = prior_obs[prior_obs.buy_event == 1].shape[0] / prior_obs.shape[0]
print(f'Percentage of first sessions that result in a buy: {buy_percent:.2%}')
prior_obs.head()

Percentage of first sessions that result in a buy: 1.21%


| | session_id | seq | buy_event | visitor_id |
|---|---|---|---|---|
| 0 | 1000001_1 | 1.0 | 0 | 1000001 |
| 2 | 1000007_1 | 1.0 | 0 | 1000007 |
| 4 | 1000042_1 | 1.0 | 0 | 1000042 |
| 6 | 1000057_1 | 1.0 | 0 | 1000057 |
| 8 | 1000067_1 | 1.0 | 0 | 1000067 |


Historically, very few sessions result in a buy transaction. The conversion rate of 1.21% is also not great compared with the 2018 average of around 2-3%, based on this [article](https://www.smartinsights.com/ecommerce/ecommerce-analytics/ecommerce-conversion-rates/). There is potential for improvement for this company, and a great question to answer is who will or will not buy next. If the likely non-buyers can be accurately identified, then effective ad campaigns can be developed to help convert them into buyers.

### Feature generation

The features will be calculated from the previous session and are the following:

1. Count of views
2. Length of session
3. Number of unique items viewed
4. Number of add to cart events
5. Number of transactions
6. Average item availability, e.g. if 3 pages were viewed and 2 of the 3 items were available, then item availability is about 67%.
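The six features above could be computed per session with a pandas `groupby`, roughly as follows. This is a sketch, not the `create_features.gen_features` code: `session_features` is a hypothetical helper, the `"addtocart"` and `"transaction"` event labels are assumed, and session length is assumed to be the span between the first and last event in minutes.

```python
import pandas as pd


def session_features(events: pd.DataFrame) -> pd.DataFrame:
    """Aggregate raw events into one feature row per session (hypothetical helper)."""
    flags = events.assign(
        is_view=events["event"].eq("view"),
        is_cart=events["event"].eq("addtocart"),
        is_buy=events["event"].eq("transaction"),
    )
    grouped = flags.groupby("session_id")
    # Summing boolean flags counts the events of each type.
    feats = grouped.agg(
        view_count=("is_view", "sum"),
        add_to_cart_count=("is_cart", "sum"),
        transaction_count=("is_buy", "sum"),
    )
    # Session length in minutes, from epoch-millisecond timestamps.
    feats["session_length"] = (grouped["timestamp"].max() - grouped["timestamp"].min()) / 60_000
    # Unique items and mean availability are taken over view events only;
    # a session without views would get NaN here.
    viewed = flags[flags["is_view"]].groupby("session_id")
    feats["item_views"] = viewed["itemid"].nunique()
    feats["avg_avail"] = viewed["available"].mean()
    return feats.reset_index()


toy = pd.DataFrame({
    "session_id": ["a", "a", "a", "b", "b"],
    "event": ["view", "view", "view", "view", "addtocart"],
    "itemid": [10, 11, 12, 10, 10],
    "timestamp": [0, 60_000, 120_000, 0, 30_000],
    "available": [1, 1, 0, 1, 1],
})
feats = session_features(toy).set_index("session_id")
print(feats)
```

Session `a` reproduces the availability example from the list: three views with two available items.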

In [9]:
feature_df = create_features.gen_features(events, prior_obs, obs)

feature_df.sample(10)

| | session_id | seq | buy_event | visitor_id | view_count | session_length | item_views | add_to_cart_count | transaction_count | avg_avail |
|---|---|---|---|---|---|---|---|---|---|---|
| 2781 | 1019335_256360 | 2.0 | 0 | 1019335 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 |
| 38444 | 1270618_318686 | 2.0 | 0 | 1270618 | 2.0 | 16.794783 | 1.0 | 0.0 | 0.0 | 1.0 |
| 157316 | 831599_209027 | 2.0 | 0 | 831599 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 |
| 7615 | 1052939_264483 | 2.0 | 0 | 1052939 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 |
| 101641 | 4428_1143 | 2.0 | 0 | 4428 | 2.0 | 0.517267 | 2.0 | 0.0 | 0.0 | 1.0 |
| 18969 | 1132889_284840 | 2.0 | 0 | 1132889 | 3.0 | 11.951233 | 3.0 | 0.0 | 0.0 | 1.0 |
| 29305 | 120674_30661 | 2.0 | 0 | 120674 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 |
| 74343 | 252093_63897 | 2.0 | 0 | 252093 | 2.0 | 14.356933 | 2.0 | 0.0 | 0.0 | 0.0 |
| 8012 | 1055773_265101 | 2.0 | 0 | 1055773 | 3.0 | 2.3892 | 3.0 | 0.0 | 0.0 | 0.333333 |
| 107769 | 485447_121742 | 2.0 | 0 | 485447 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 |


All features are calculated, and we can now generate the training and test data sets. The train/test split will be 75/25. The buy population is very small, so stratified sampling will be used when generating the train and test data sets. Lastly, the train data set will be upsampled utilizing SMOTE for model selection.
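The split-then-upsample order can be illustrated on synthetic data. The sketch below is not the `model_selection.create_Xy` code: it uses scikit-learn's `train_test_split` for the stratified 75/25 split and a simplified, hand-written SMOTE-style interpolation (`smote_sketch`, a hypothetical helper) in place of a library implementation such as the one in imbalanced-learn.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import NearestNeighbors


def smote_sketch(X, y, k=5, seed=0):
    """Oversample class 1 to parity by interpolating between minority neighbours."""
    rng = np.random.default_rng(seed)
    minority = X[y == 1]
    n_new = int((y == 0).sum() - (y == 1).sum())
    # Ask for k + 1 neighbours because each point is its own nearest neighbour.
    nn = NearestNeighbors(n_neighbors=k + 1).fit(minority)
    neighbours = nn.kneighbors(minority, return_distance=False)[:, 1:]
    base = rng.integers(0, len(minority), size=n_new)
    picked = neighbours[base, rng.integers(0, k, size=n_new)]
    # Each synthetic point lies on the segment between a minority sample and a neighbour.
    step = rng.random((n_new, 1))
    synthetic = minority[base] + step * (minority[picked] - minority[base])
    X_out = np.vstack([X, synthetic])
    y_out = np.concatenate([y, np.ones(n_new, dtype=y.dtype)])
    return X_out, y_out


# Synthetic stand-in for the feature matrix: 1,000 rows, 4% positives.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(1000, 6))
y_demo = np.zeros(1000, dtype=int)
y_demo[:40] = 1

X_train, X_test, y_train, y_test = train_test_split(
    X_demo, y_demo, test_size=0.25, stratify=y_demo, random_state=0
)
# Only the training split is upsampled; the test split keeps the original class balance.
X_res, y_res = smote_sketch(X_train, y_train)
print(X_res.shape, y_res.mean(), y_test.mean())
```

Upsampling after the split keeps synthetic points out of the test set, so test metrics still reflect the real class balance.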

In [10]:
X, y, X_train, X_test, y_train, y_test = model_selection.create_Xy(feature_df)

In [11]:
X_train.describe()

| | view_count | session_length | item_views | add_to_cart_count | transaction_count | avg_avail |
|---|---|---|---|---|---|---|
| count | 208854.0 | 208854.0 | 208854.0 | 208854.0 | 208854.0 | 208854.0 |
| mean | 2.654698 | 6.620509 | 2.010979 | 0.296388 | 0.071502 | 0.724582 |
| std | 4.626543 | 18.09293 | 3.325323 | 0.914663 | 0.467081 | 0.425255 |
| min | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 |
| 25% | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.333333 |
| 50% | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 |
| 75% | 2.688323 | 5.573479 | 2.0 | 0.0 | 0.0 | 1.0 |
| max | 132.0 | 391.790017 | 118.0 | 44.0 | 16.0 | 1.0 |


In [12]:
y_train.describe()

| | 0 |
|---|---|
| count | 208854.0 |
| mean | 0.5 |
| std | 0.500001 |
| min | 0.0 |
| 25% | 0.0 |
| 50% | 0.5 |
| 75% | 1.0 |
| max | 1.0 |


### Model selection

We will now train three types of models: Logistic Regression, Gradient Boosting, and Random Forest. A 10-fold stratified cross validation will be performed, and parameters will be tuned using Bayesian search optimization. The metric utilized for model selection will be AUC, which was chosen because it measures how well each model ranks buy sessions above non-buy sessions regardless of the probability threshold.
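As a rough illustration of the comparison, the sketch below scores the three model families with 10-fold stratified cross validation and AUC on synthetic data. It is not the `model_selection.cv_models` code: the hyperparameters are arbitrary and the Bayesian tuning step is left out to keep the example small.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic data: the label depends on the first two columns plus noise.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(400, 6))
signal = X_demo[:, 0] + 0.5 * X_demo[:, 1] + rng.normal(scale=0.5, size=400)
y_demo = (signal > 0).astype(int)

models = {
    "logistic": LogisticRegression(max_iter=1000),
    "gradient_boost": GradientBoostingClassifier(n_estimators=50, random_state=0),
    "random_forest": RandomForestClassifier(n_estimators=50, random_state=0),
}

# Stratified folds keep the class ratio the same in every fold.
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
auc_by_model = {
    name: cross_val_score(model, X_demo, y_demo, cv=cv, scoring="roc_auc").mean()
    for name, model in models.items()
}
for name, auc in auc_by_model.items():
    print(f"{name}: mean AUC = {auc:.3f}")
```

A hyperparameter search would wrap each estimator and reuse the same `cv` object and `scoring="roc_auc"` setting, so the models stay comparable.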

In [13]:
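import pickle

results = model_selection.cv_models(X_train, y_train, n_iters=1)

features = ['view_count', 'session_length', 'item_views', 'add_to_cart_count',
            'transaction_count', 'avg_avail']

for k, v in results.items():
    print(f'{k} Summary')
    print()
    print(f'Best out-of-sample AUC score: {v.best_score_}')
    print()
    # Tree ensembles expose feature_importances_; linear models such as
    # logistic regression do not, so skip the report when it is missing.
    importances = getattr(v.best_estimator_, 'feature_importances_', None)
    if importances is not None:
        print('Feature Importance:')
        for f, i in zip(features, importances):
            print(f'{f}: \t {i:.2%}')
        print()
    print('----------------------------')
    # Persist the best estimator; the context manager closes the file handle.
    with open(f'../data/best_{k}_model.pkl', 'wb') as fh:
        pickle.dump(v.best_estimator_, fh)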


KeyboardInterrupt: 

1. Need to plot the ROC curve of each of the best models.
2. Need to say which model is best
3. Need to think about cost benefit analysis on where to set the probability
4. Report some metric on test data set
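A possible starting point for these open items is sketched below on synthetic data: it reports AUC on a held-out test set, computes the ROC curve points, and picks a probability threshold by minimizing expected cost. The model, the data, and both cost constants are placeholders chosen for illustration, not results or values from this analysis.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the feature matrix and buy labels.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(2000, 6))
signal = X_demo[:, 0] + rng.normal(scale=1.0, size=2000)
y_demo = (signal > 1.5).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, stratify=y_demo, random_state=0
)
model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
scores = model.predict_proba(X_te)[:, 1]

# Metric on the untouched test set, plus the points needed to plot the ROC curve.
test_auc = roc_auc_score(y_te, scores)
fpr, tpr, thresholds = roc_curve(y_te, scores)

# Hypothetical costs: a missed buyer is assumed to cost five times a wasted ad.
COST_FALSE_POSITIVE = 1.0
COST_FALSE_NEGATIVE = 5.0
n_pos = int(y_te.sum())
n_neg = len(y_te) - n_pos
expected_cost = COST_FALSE_POSITIVE * fpr * n_neg + COST_FALSE_NEGATIVE * (1 - tpr) * n_pos
best = int(np.argmin(expected_cost))
best_threshold = thresholds[best]

print(f"Test AUC: {test_auc:.3f}")
print(f"Lowest expected cost {expected_cost[best]:.1f} at threshold {best_threshold:.3f}")
```

The arrays `fpr` and `tpr` can be passed straight to a plotting library to draw the curve for each candidate model.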