# Churn - Classification Analysis

## Overview

- [Description](#Description)
- [Data Description](#Data-Description)
- [Data Preparation](#Data-Preparation)
- [Binary Classification](#Binary-Classification)

## Description

Our objective is to predict user actions on our e-commerce site. Predicting these actions can have a direct monetary impact, for example:
- Predict bounces (see [Bounce rate](https://en.wikipedia.org/wiki/Bounce_rate)): if a user is likely to bounce, we can show a pop-up that prompts them to convert or take some action other than leaving the site.
- Discover which areas of the site are weak.
- Detect usability problems on specific platforms (for example, mobile).
- Make data-driven decisions.
- Improve the user experience.

## Data Description

Columns:
- **is_mobile** (0/1). Binary feature
- **n_products_viewed** (int >=0). Numerical feature
- **visit_duration** (real >=0). Numerical feature
- **is_returning_visitor** (0/1). Binary feature
- **time_of_day** (0/1/2/3 = 24h split into 4 categories). Categorical feature
- **user_action** (bounce / add_to_cart / begin_checkout / finish_checkout). Target
    - bounce (0): the user just left the site
    - add_to_cart (1): the user added items to their cart but did not begin the checkout process
    - begin_checkout (2): the user started the checkout process but never completed it
    - finish_checkout (3): the user paid and we have a successful order
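
The integer encoding of the target described above can be made explicit with a small lookup table. The following is a minimal sketch; the name `ACTION_LABELS` and the toy codes are assumptions introduced only for illustration and are not part of the dataset or the notebook.

```python
import pandas as pd

# Hypothetical lookup table built from the target description above.
ACTION_LABELS = {
    0: "bounce",
    1: "add_to_cart",
    2: "begin_checkout",
    3: "finish_checkout",
}

# Toy target codes, made up for illustration.
codes = pd.Series([0, 1, 0, 2, 3], name="user_action")

# Map each integer code to its readable label.
labels = codes.map(ACTION_LABELS)
print(labels.tolist())
```

Keeping the mapping in one place makes it easier to read value counts and model outputs later on.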

In [1]:
import numpy as np
import pandas as pd

df = pd.read_csv('data/churn.csv')

In [2]:
df.head()

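| | is_mobile | n_products_viewed | visit_duration | is_returning_visitor | time_of_day | user_action |
|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 0.65751 | 0 | 3 | 0 |
| 1 | 1 | 1 | 0.568571 | 0 | 2 | 1 |
| 2 | 1 | 0 | 0.042246 | 1 | 1 | 0 |
| 3 | 1 | 1 | 1.659793 | 1 | 1 | 2 |
| 4 | 0 | 1 | 2.014745 | 1 | 1 | 2 |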


In [3]:
df.dtypes

is_mobile                 int64
n_products_viewed         int64
visit_duration          float64
is_returning_visitor      int64
time_of_day               int64
user_action               int64
dtype: object

## Data Preparation

- Check for missing data (see [Missing Data](../../00 Data Preparation/01_Missing Data.ipynb))
- Feature scaling, which is necessary for some classification algorithms (see [Feature Scaling](../../00 Data Preparation/03_Feature_Scaling.ipynb))
- One-hot encoding for categorical data (see [Categorical Data](../../00 Data Preparation/02_Categorical Data.ipynb))
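
As a compact illustration of the three steps listed above, they can be chained with scikit-learn's `ColumnTransformer`. This is only a sketch under stated assumptions: the toy frame `toy` and its values are made up, and this is an alternative arrangement, not the approach used in the cells of this notebook, which apply each step by hand.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy frame reusing the dataset's column names (values are made up).
toy = pd.DataFrame({
    "is_mobile": [1, 0, 1, 0],
    "n_products_viewed": [0, 1, 2, 4],
    "visit_duration": [0.1, 0.8, 1.5, 6.0],
    "is_returning_visitor": [0, 1, 1, 0],
    "time_of_day": [0, 1, 2, 3],
})

# Step 1: check for missing data.
n_missing = int(toy.isnull().sum().sum())

# Steps 2 and 3: scale the numerical columns, one-hot encode the categorical one.
preprocess = ColumnTransformer(
    transformers=[
        ("scale", StandardScaler(), ["n_products_viewed", "visit_duration"]),
        ("onehot", OneHotEncoder(handle_unknown="ignore"), ["time_of_day"]),
    ],
    remainder="passthrough",  # binary columns are passed through unchanged
    sparse_threshold=0,       # always return a dense array
)

X_prepared = preprocess.fit_transform(toy)
print(n_missing, X_prepared.shape)
```

The result has 2 scaled columns, 4 one-hot columns, and 2 untouched binary columns. One reason to prefer a transformer object is that it can be fitted on training data only and then reused on unseen data.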

In [4]:
df.describe()

| | is_mobile | n_products_viewed | visit_duration | is_returning_visitor | time_of_day | user_action |
|---|---|---|---|---|---|---|
| count | 500.0 | 500.0 | 500.0 | 500.0 | 500.0 | 500.0 |
| mean | 0.486 | 0.854 | 1.05588 | 0.518 | 1.588 | 0.748 |
| std | 0.500305 | 1.046362 | 0.976711 | 0.500176 | 1.121057 | 0.89336 |
| min | 0.0 | 0.0 | 0.000141 | 0.0 | 0.0 | 0.0 |
| 25% | 0.0 | 0.0 | 0.32855 | 0.0 | 1.0 | 0.0 |
| 50% | 0.0 | 1.0 | 0.804717 | 1.0 | 2.0 | 0.0 |
| 75% | 1.0 | 1.0 | 1.499518 | 1.0 | 3.0 | 1.0 |
| max | 1.0 | 4.0 | 6.368775 | 1.0 | 3.0 | 3.0 |


In [5]:
df.isnull().sum()

is_mobile               0
n_products_viewed       0
visit_duration          0
is_returning_visitor    0
time_of_day             0
user_action             0
dtype: int64

In [6]:
for column in df.columns:
    print(column)
    print('----------------')
    print(df[column].value_counts())
    print('\n')

is_mobile
----------------
0    257
1    243
Name: is_mobile, dtype: int64


n_products_viewed
----------------
0    240
1    153
2     62
3     30
4     15
Name: n_products_viewed, dtype: int64


visit_duration
----------------
3.211422    1
0.720206    1
0.272298    1
1.387943    1
3.369352    1
0.465946    1
1.724702    1
0.851933    1
3.806308    1
0.597065    1
0.454992    1
0.707463    1
0.308415    1
0.350968    1
0.143695    1
0.355265    1
1.480493    1
0.091978    1
0.479018    1
1.112894    1
2.046040    1
0.728014    1
0.878761    1
0.564505    1
1.075602    1
0.076742    1
0.359318    1
2.378410    1
0.032269    1
0.571083    1
           ..
1.113962    1
1.176730    1
6.089099    1
2.013423    1
0.166900    1
1.678533    1
1.457982    1
0.572542    1
0.984629    1
0.018690    1
0.866789    1
1.509405    1
0.329477    1
0.603154    1
0.424030    1
0.046689    1
0.303757    1
0.569088    1
3.428653    1
0.573198    1
0.652474    1
2.266245    1
1.664271    1
1.391700    1
3

In [7]:
y = df[['user_action']]
X = df.drop(labels=['user_action'], axis=1)

In [8]:
y.head()

| | user_action |
|---|---|
| 0 | 0 |
| 1 | 1 |
| 2 | 0 |
| 3 | 2 |
| 4 | 2 |


In [9]:
X.head()

| | is_mobile | n_products_viewed | visit_duration | is_returning_visitor | time_of_day |
|---|---|---|---|---|---|
| 0 | 1 | 0 | 0.65751 | 0 | 3 |
| 1 | 1 | 1 | 0.568571 | 0 | 2 |
| 2 | 1 | 0 | 0.042246 | 1 | 1 |
| 3 | 1 | 1 | 1.659793 | 1 | 1 |
| 4 | 0 | 1 | 2.014745 | 1 | 1 |


In [10]:
from sklearn.preprocessing import StandardScaler
sd = StandardScaler(with_mean=True, with_std=True)  # zero mean, unit variance

features_to_scale = ['n_products_viewed', 'visit_duration']

Xscaled = sd.fit_transform(X[features_to_scale])
X[features_to_scale].describe()  # summary before scaling, for comparison

| | n_products_viewed | visit_duration |
|---|---|---|
| count | 500.0 | 500.0 |
| mean | 0.854 | 1.05588 |
| std | 1.046362 | 0.976711 |
| min | 0.0 | 0.000141 |
| 25% | 0.0 | 0.32855 |
| 50% | 1.0 | 0.804717 |
| 75% | 1.0 | 1.499518 |
| max | 4.0 | 6.368775 |


In [11]:
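X = X.drop(labels=features_to_scale, axis=1)  # remove the unscaled columns

for idx, feature in enumerate(features_to_scale):
    X[feature] = Xscaled[:, idx]  # re-add each column, scaled (appended at the end)

X[features_to_scale].describe()  # summary after scaling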

| | n_products_viewed | visit_duration |
|---|---|---|
| count | 500.0 | 500.0 |
| mean | -9.414691e-17 | -3.286260e-17 |
| std | 1.001002 | 1.001002 |
| min | -0.8169784 | -1.081995 |
| 25% | -0.8169784 | -0.7454186 |
| 50% | 0.1396708 | -0.2574098 |
| 75% | 0.1396708 | 0.4546714 |
| max | 3.009618 | 5.445026 |
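
The scaled columns report a standard deviation of 1.001002 rather than exactly 1. This is expected: `StandardScaler` divides by the population standard deviation (`ddof=0`), while `describe()` reports the sample standard deviation (`ddof=1`), so the two differ by a factor of sqrt(n / (n - 1)). The sketch below checks this arithmetic for n = 500; the synthetic array `x` is an assumption used only to reproduce the effect.

```python
import numpy as np

n = 500  # number of rows reported by describe()

# Ratio between the sample (ddof=1) and population (ddof=0) standard deviation.
ratio = np.sqrt(n / (n - 1))
print(round(ratio, 6))  # approximately 1.001002

# Reproduce the effect on synthetic data (values are made up).
rng = np.random.default_rng(0)
x = rng.exponential(size=n)
z = (x - x.mean()) / x.std(ddof=0)  # same formula StandardScaler applies

# The sample standard deviation of z matches the ratio above.
matches = bool(np.isclose(z.std(ddof=1), ratio))
print(matches)
```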


In [12]:
X.head()

| | is_mobile | is_returning_visitor | time_of_day | n_products_viewed | visit_duration |
|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | -0.816978 | -0.408278 |
| 1 | 1 | 0 | 2 | 0.139671 | -0.499428 |
| 2 | 1 | 1 | 1 | -0.816978 | -1.038843 |
| 3 | 1 | 1 | 1 | 0.139671 | 0.618932 |
| 4 | 0 | 1 | 1 | 0.139671 | 0.982712 |


In [13]:
X = pd.get_dummies(X, columns=['time_of_day'], dtype=int)  # dtype=int keeps 0/1 output on newer pandas
X.head()

| | is_mobile | is_returning_visitor | n_products_viewed | visit_duration | time_of_day_0 | time_of_day_1 | time_of_day_2 | time_of_day_3 |
|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | -0.816978 | -0.408278 | 0 | 0 | 0 | 1 |
| 1 | 1 | 0 | 0.139671 | -0.499428 | 0 | 0 | 1 | 0 |
| 2 | 1 | 1 | -0.816978 | -1.038843 | 0 | 1 | 0 | 0 |
| 3 | 1 | 1 | 0.139671 | 0.618932 | 0 | 1 | 0 | 0 |
| 4 | 0 | 1 | 0.139671 | 0.982712 | 0 | 1 | 0 | 0 |


## Binary Classification

First we are going to perform a **binary classification** for the bounce user action. To do this, we group the actions 'add_to_cart', 'begin_checkout', and 'finish_checkout' into a single non-bounce class, so the data is classified as:
- y=0 if user bounces
- y=1 if user doesn't bounce
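
The rule above amounts to a single comparison: any action code greater than 0 is a non-bounce. A minimal sketch on made-up codes (the series `user_action` here is a toy example, not the dataset) could look like this:

```python
import pandas as pd

# Toy target codes: 0 = bounce, 1 to 3 = non-bounce actions (values are made up).
user_action = pd.Series([0, 1, 0, 2, 3], name="user_action")

# Vectorized binarization: True where the user did not bounce, cast to 0/1.
y_binary = (user_action > 0).astype(int)
print(y_binary.tolist())  # [0, 1, 0, 1, 1]
```

A vectorized comparison gives the same result as applying a function to each element and tends to be faster on large frames.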

In [14]:
y['user_action'].value_counts()

0    253
1    145
2     77
3     25
Name: user_action, dtype: int64

In [15]:
y = y['user_action'].apply(lambda x: 1 if x>0 else 0)
y.value_counts()

0    253
1    247
Name: user_action, dtype: int64