# Zeplin Churn Prediction (Classification)

Predict October churn from July data. 

Useful articles:

https://medium.com/analytics-vidhya/handling-categorical-features-using-encoding-techniques-in-python-7b46207111ca

https://www.kaggle.com/learn/data-cleaning

https://www.kaggle.com/dansbecker/model-validation

https://www.kaggle.com/alexisbcook/categorical-variables

Feature importance: https://towardsdatascience.com/explain-your-machine-learning-with-feature-importance-774cd72abe

scikit-learn feature importance: https://scikit-learn.org/stable/auto_examples/ensemble/plot_forest_importances.html

Hyperparameter tuning: https://towardsdatascience.com/hyperparameter-tuning-the-random-forest-in-python-using-scikit-learn-28d2aa77dd74

Gradient boosting XGBoost, LightGBM, CatBoost: https://towardsdatascience.com/xgboost-lightgbm-and-other-kaggle-competition-favorites-6212e8b0e835

### Load CSV file with Pandas

In [3]:
import pandas as pd

In [4]:
file_path = '../Desktop/random_forest/random_forest_main_v1.csv'

df = pd.read_csv(file_path)

len(df)
df.head()

Unnamed: 0,oid,name,paymentPlan,status,subscriptionStatus,minimumSeatCount,stripeCustomerId,domain,billingDomain,totalMembers,...,CoCoEnabled,FeaturesActivatedJuly,isStrategic,companySizeBand,companyCountry,alexaGlobal,companyRevenue,creeateDate,novemberARR,novemberAccountStatus
0,5d8340b2f4858088bcde8259,VMLY&R Poland,organization_monthly_v2,active,active,12,cus_B7tGjMvq7DiKZu,vml.com,vml.com,14,...,0,1,,5K-10K,United States,1060495.0,,42947.0,3087.0,active
1,5d77ebe10190471fe092646e,Vodafone,organization_annual_v3,active,active,35,cus_FmnizUSckyaknm,vodafone.com,eworld.com.mt,15,...,0,1,,50K-100K,United Kingdom,13801.0,$10B+,43718.0,4515.0,active
2,5d7208ac8919b56be9042486,NDC mediagroep,organization_monthly_v2,active,active,12,cus_DocdbKVsyHq95Z,ndcmediagroep.nl,ndcmediagroep.nl,17,...,0,1,,251-1K,Netherlands,223990.0,$50M-$100M,43392.0,2499.0,active
3,5d70f50857a1854ec7c92ae9,Fairmas GmbH,organization_annual_v3,active,active,12,cus_CMHMOiUs90oxoD,fairmas.com,fairmas.com,11,...,0,1,,18568,Germany,5090680.0,$1M-$10M,43151.0,1548.0,active
4,5d7f8ae46c5b26290c838d73,WestJet,organization_monthly_v2,active,active,12,cus_FoxyzJvdpSPMeu,westjet.com,westjet.com,25,...,0,0,,10K-50K,Canada,29968.0,$1B-$10B,43724.0,7645.08,active


Drop rows with a missing target value (Y):
https://stackoverflow.com/questions/13413590/how-to-drop-rows-of-pandas-dataframe-whose-value-in-a-certain-column-is-nan

**Remove rows with NaN for target value**

In [5]:
df = df[df['novemberAccountStatus'].notna()]

keep = ['active', 'canceled']
df = df[df.novemberAccountStatus.isin(keep)]

pd.unique(df['novemberAccountStatus'])

array(['active', 'canceled'], dtype=object)
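
Before modeling, it may help to check how the two classes are split, since a heavily skewed target makes accuracy hard to interpret. A minimal sketch on a toy frame (the values below are made up; only the column name comes from the data above):

```python
import pandas as pd

# Toy stand-in for the real data; the values are made up for illustration
toy = pd.DataFrame({
    'novemberAccountStatus': ['active'] * 18 + ['canceled'] * 2 + [None, 'trialing']
})

# Same filtering idea as above: keep only rows labeled active or canceled
# (isin also drops the missing label, because None is not in the list)
toy = toy[toy['novemberAccountStatus'].isin(['active', 'canceled'])]

# Share of each class; a skewed split means accuracy alone can mislead
class_share = toy['novemberAccountStatus'].value_counts(normalize=True)
print(class_share)
```

On the real data the same `value_counts(normalize=True)` call on `df['novemberAccountStatus']` would show the actual split.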

### Separate into X and Y variables.

In [6]:
# Use every column except the last two (novemberARR, novemberAccountStatus) as features
features = df.keys()[:-2]
X = df[features]
y = df.novemberAccountStatus

**Check for missing values**

In [7]:
missing_values_count = X.isnull().sum()

missing_values_count[:]

oid                      0
name                     0
paymentPlan              0
status                   0
subscriptionStatus       0
                      ... 
companySizeBand        815
companyCountry         605
alexaGlobal            560
companyRevenue        1252
creeateDate              0
Length: 75, dtype: int64

## Pre-process Data
### Get list of categorical variables

In [8]:
s = (X.dtypes == 'object')

obj_cols = list(s[s].index)
print(obj_cols, len(obj_cols))

['oid', 'name', 'paymentPlan', 'status', 'subscriptionStatus', 'stripeCustomerId', 'domain', 'billingDomain', 'isStrategic', 'companySizeBand', 'companyCountry', 'companyRevenue'] 12


### Encode categorical to numerical

In [9]:
from sklearn.preprocessing import LabelEncoder, OrdinalEncoder

# Make copy to avoid changing original data 
X_copy = X.copy()

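# Apply label encoder to each column with categorical data
label_encoder = LabelEncoder()

for col in obj_cols:
    # Note: astype(str) runs first and typically turns NaN into the string 'nan',
    # so fillna has nothing left to fill and missing values end up as the category 'nan'
    X_copy[col] = X_copy[col].astype(str).fillna('NONE')
    X_copy[col] = label_encoder.fit_transform(X_copy[col])

# Encode the target too: classes are sorted, so 'active' -> 0 and 'canceled' -> 1
y = label_encoder.fit_transform(y)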

X_copy.head()

Unnamed: 0,oid,name,paymentPlan,status,subscriptionStatus,minimumSeatCount,stripeCustomerId,domain,billingDomain,totalMembers,...,SlackEnabled,JiraEnabled,CoCoEnabled,FeaturesActivatedJuly,isStrategic,companySizeBand,companyCountry,alexaGlobal,companyRevenue,creeateDate
0,2961,3510,5,0,0,12,1158,3459,3441,14,...,0,1,0,1,1,8,86,1060495.0,9,42947.0
1,2925,3602,3,0,0,35,3370,3470,1081,15,...,1,0,0,1,1,6,85,13801.0,2,43718.0
2,2909,2229,5,0,0,12,2338,2177,2182,17,...,1,0,0,1,1,4,58,223990.0,8,43392.0
3,2904,1194,3,0,0,12,1576,1092,1103,11,...,0,0,0,1,1,2,31,5090680.0,5,43151.0
4,2945,3687,5,0,0,12,3387,3554,3530,25,...,0,0,0,0,1,1,13,29968.0,4,43724.0
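
A possible alternative for the encoding step, sketched on a toy frame with made-up values: fill missing values before casting to string so they become an explicit `'NONE'` category, and use `OrdinalEncoder`, which is meant for feature columns and keeps one category list per column. The frame and variable names here are assumptions for illustration, not part of the notebook.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Toy frame with made-up values; column names borrowed from the data above
toy = pd.DataFrame({
    'paymentPlan': ['organization_monthly_v2', 'organization_annual_v3', np.nan],
    'companyCountry': ['Germany', np.nan, 'Canada'],
})

# Fill first, then cast, so missing values become the explicit category 'NONE'
filled = toy.fillna('NONE').astype(str)

# One encoder handles all columns and remembers the categories of each one
encoder = OrdinalEncoder()
encoded = pd.DataFrame(encoder.fit_transform(filled), columns=filled.columns)

print(encoded)
print(encoder.categories_)
```

Because the fitted encoder keeps `categories_` per column, the integer codes can be mapped back to the original labels later with `encoder.inverse_transform`.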


### Impute missing values

In [10]:
from sklearn.impute import SimpleImputer

# Default strategy is 'mean': each NaN is replaced with its column mean
imputer = SimpleImputer()
imputed_X_copy = pd.DataFrame(imputer.fit_transform(X_copy))

imputed_X_copy.columns = X_copy.columns

imputed_X_copy.head()

Unnamed: 0,oid,name,paymentPlan,status,subscriptionStatus,minimumSeatCount,stripeCustomerId,domain,billingDomain,totalMembers,...,SlackEnabled,JiraEnabled,CoCoEnabled,FeaturesActivatedJuly,isStrategic,companySizeBand,companyCountry,alexaGlobal,companyRevenue,creeateDate
0,2961.0,3510.0,5.0,0.0,0.0,12.0,1158.0,3459.0,3441.0,14.0,...,0.0,1.0,0.0,1.0,1.0,8.0,86.0,1060495.0,9.0,42947.0
1,2925.0,3602.0,3.0,0.0,0.0,35.0,3370.0,3470.0,1081.0,15.0,...,1.0,0.0,0.0,1.0,1.0,6.0,85.0,13801.0,2.0,43718.0
2,2909.0,2229.0,5.0,0.0,0.0,12.0,2338.0,2177.0,2182.0,17.0,...,1.0,0.0,0.0,1.0,1.0,4.0,58.0,223990.0,8.0,43392.0
3,2904.0,1194.0,3.0,0.0,0.0,12.0,1576.0,1092.0,1103.0,11.0,...,0.0,0.0,0.0,1.0,1.0,2.0,31.0,5090680.0,5.0,43151.0
4,2945.0,3687.0,5.0,0.0,0.0,12.0,3387.0,3554.0,3530.0,25.0,...,0.0,0.0,0.0,0.0,1.0,1.0,13.0,29968.0,4.0,43724.0


In [11]:
from sklearn.model_selection import train_test_split

train_X, val_X, train_y, val_y = train_test_split(imputed_X_copy, y, random_state=0)
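
A variation worth considering, sketched on synthetic data: split first with `stratify` so both splits keep the same class ratio, then fit the imputer on the training rows only so no validation statistics leak into training. All names and values below are assumptions for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split

# Synthetic data with made-up values: 200 rows, 10% positives
rng = np.random.default_rng(0)
X_toy = pd.DataFrame(rng.normal(size=(200, 3)), columns=['a', 'b', 'c'])
X_toy.iloc[::7, 1] = np.nan          # sprinkle in some missing values
y_toy = np.array([1] * 20 + [0] * 180)

# stratify keeps the class ratio the same in both splits
tr_X, va_X, tr_y, va_y = train_test_split(
    X_toy, y_toy, random_state=0, stratify=y_toy)

# Fit the imputer on the training rows only, then reuse it on validation rows
toy_imputer = SimpleImputer(strategy='mean')
tr_X_imp = pd.DataFrame(toy_imputer.fit_transform(tr_X), columns=tr_X.columns)
va_X_imp = pd.DataFrame(toy_imputer.transform(va_X), columns=va_X.columns)

print(tr_y.mean(), va_y.mean())
```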

### Decision Tree

In [12]:
from sklearn.tree import DecisionTreeClassifier

model = DecisionTreeClassifier(random_state=1)

model.fit(train_X, train_y)

DecisionTreeClassifier(random_state=1)

In [13]:
from sklearn.metrics import roc_auc_score, accuracy_score

predictedARR = model.predict(val_X)
roc_auc_score(val_y, predictedARR), accuracy_score(val_y, predictedARR)

(0.6320710696338837, 0.9096098953377736)
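
The score above passes hard 0/1 predictions to `roc_auc_score`. ROC AUC is usually computed from a ranking score such as the positive-class probability, so a sketch like the following (on synthetic data, not the churn data) may give a fairer picture. A confusion matrix is included because it shows how many churners are actually caught.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import confusion_matrix, roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic, imbalanced stand-in for the churn data (about 10% positives)
X_toy, y_toy = make_classification(
    n_samples=1000, n_features=10, weights=[0.9, 0.1], random_state=0)
tr_X, va_X, tr_y, va_y = train_test_split(
    X_toy, y_toy, random_state=0, stratify=y_toy)

tree = DecisionTreeClassifier(max_depth=4, random_state=1).fit(tr_X, tr_y)

# AUC from hard 0/1 labels only sees a single operating point
auc_from_labels = roc_auc_score(va_y, tree.predict(va_X))

# AUC from the positive-class probability uses the full ranking
auc_from_proba = roc_auc_score(va_y, tree.predict_proba(va_X)[:, 1])

# Rows are true classes, columns are predicted classes
cm = confusion_matrix(va_y, tree.predict(va_X))

print(auc_from_labels, auc_from_proba)
print(cm)
```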

### Random Forest

In [14]:
from sklearn.ensemble import RandomForestClassifier

forest_model = RandomForestClassifier(random_state=1)
forest_model.fit(train_X, train_y)
forestPredictedARR = forest_model.predict(val_X)
roc_auc_score(val_y, forestPredictedARR), accuracy_score(val_y, forestPredictedARR)

(0.543135319454415, 0.9486203615604186)

In [15]:
from sklearn.ensemble import RandomForestClassifier

forest_model = RandomForestClassifier(random_state=1, max_depth=20, n_estimators=10)
forest_model.fit(train_X, train_y)
forestPredictedARR = forest_model.predict(val_X)
roc_auc_score(val_y, forestPredictedARR), accuracy_score(val_y, forestPredictedARR)

(0.568916008614501, 0.9495718363463368)
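
High accuracy next to an AUC close to 0.5 is often a sign that the model mostly predicts the majority class. One option to try, shown here as a rough sketch on synthetic data, is `class_weight='balanced'`, which reweights classes inversely to their frequency. Whether it helps on the real data is untested.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced stand-in for the churn data (about 10% positives)
X_toy, y_toy = make_classification(
    n_samples=1000, n_features=10, weights=[0.9, 0.1], random_state=0)
tr_X, va_X, tr_y, va_y = train_test_split(
    X_toy, y_toy, random_state=0, stratify=y_toy)

# Same forest twice; the second one reweights classes by inverse frequency
plain = RandomForestClassifier(n_estimators=50, random_state=1)
weighted = RandomForestClassifier(
    n_estimators=50, random_state=1, class_weight='balanced')

plain.fit(tr_X, tr_y)
weighted.fit(tr_X, tr_y)

# Recall on the minority class: the share of true positives that were found
recall_plain = recall_score(va_y, plain.predict(va_X))
recall_weighted = recall_score(va_y, weighted.predict(va_X))
print(recall_plain, recall_weighted)
```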

### Hyper-parameter Tuning

In [18]:
from sklearn.model_selection import RandomizedSearchCV
import numpy as np

# Create random grid
n_estimators = [int(x) for x in np.linspace(start = 200, stop = 2000, num = 10)]
max_features = ['auto', 'sqrt']
max_depth = [int(x) for x in np.linspace(10, 110, num = 11)]
max_depth.append(None)
min_samples_split = [2, 5, 10]
min_samples_leaf = [1, 2, 4]
bootstrap = [True, False]

random_grid = {'n_estimators': n_estimators,
               'max_features': max_features,
               'max_depth': max_depth,
               'min_samples_split': min_samples_split,
               'min_samples_leaf': min_samples_leaf,
               'bootstrap': bootstrap}

# Get best parameter
rf = RandomForestClassifier()

forest_model_tuned = RandomizedSearchCV(estimator = rf, param_distributions = random_grid, 
                                        n_iter = 100, cv = 3, verbose=2, random_state=42, n_jobs = -1)
forest_model_tuned.fit(train_X, train_y)

Fitting 3 folds for each of 100 candidates, totalling 300 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 12 concurrent workers.
[Parallel(n_jobs=-1)]: Done  17 tasks      | elapsed:   22.7s
[Parallel(n_jobs=-1)]: Done 138 tasks      | elapsed:  2.4min
[Parallel(n_jobs=-1)]: Done 300 out of 300 | elapsed:  5.2min finished


AttributeError: 'RandomForestClassifier' object has no attribute 'best_params_'

In [19]:
forest_model_tuned.best_params_

{'n_estimators': 1000,
 'min_samples_split': 2,
 'min_samples_leaf': 1,
 'max_features': 'auto',
 'max_depth': 50,
 'bootstrap': False}

In [20]:
# Score the best estimator found by the random search on the validation set
tuned_prediction = forest_model_tuned.best_estimator_.predict(val_X)
roc_auc_score(val_y, tuned_prediction), accuracy_score(val_y, tuned_prediction)

(0.5788496051687008, 0.9524262607040913)
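
By default the search picks the candidate with the best accuracy. If AUC is the metric of interest, `scoring='roc_auc'` can be passed instead. The sketch below uses synthetic data and a deliberately tiny grid so it runs in seconds; it also uses `'sqrt'` and `'log2'` for `max_features`, since newer scikit-learn releases no longer accept `'auto'`.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

# Synthetic, imbalanced stand-in for the churn data
X_toy, y_toy = make_classification(
    n_samples=300, n_features=8, weights=[0.9, 0.1], random_state=0)

# Deliberately tiny grid so the sketch runs quickly
small_grid = {
    'n_estimators': [20, 50],
    'max_features': ['sqrt', 'log2'],
    'max_depth': [5, 10, None],
    'min_samples_leaf': [1, 2, 4],
}

# scoring='roc_auc' ranks candidates by cross-validated AUC instead of accuracy
search = RandomizedSearchCV(
    estimator=RandomForestClassifier(random_state=1),
    param_distributions=small_grid,
    n_iter=5, cv=3, scoring='roc_auc', random_state=42)
search.fit(X_toy, y_toy)

print(search.best_params_)
print(search.best_score_)
```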

### Apply Best Parameters

In [23]:
from sklearn.model_selection import GridSearchCV

# Create the parameter grid based on the results of random search 
param_grid = {
    'bootstrap': [False],
    'max_depth': [30, 50, 80, 150],
    'max_features': ['auto', 2, 3],
    'min_samples_leaf': [1, 3, 6],
    'min_samples_split': [2, 4, 8],
    'n_estimators': [800, 1000, 1500, 2000]
}
# Create a base model
rf = RandomForestClassifier()

# Instantiate the grid search model
grid_search = GridSearchCV(estimator = rf, param_grid = param_grid, 
                          cv = 3, n_jobs = -1, verbose = 2)

# Fit the grid search to the data
grid_search.fit(train_X, train_y)
grid_search.best_params_

Fitting 3 folds for each of 432 candidates, totalling 1296 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 12 concurrent workers.
[Parallel(n_jobs=-1)]: Done  17 tasks      | elapsed:   25.3s
[Parallel(n_jobs=-1)]: Done 138 tasks      | elapsed:  2.7min
[Parallel(n_jobs=-1)]: Done 341 tasks      | elapsed:  5.4min
[Parallel(n_jobs=-1)]: Done 624 tasks      | elapsed:  9.6min
[Parallel(n_jobs=-1)]: Done 989 tasks      | elapsed: 15.2min
[Parallel(n_jobs=-1)]: Done 1296 out of 1296 | elapsed: 19.9min finished


{'bootstrap': False,
 'max_depth': 50,
 'max_features': 'auto',
 'min_samples_leaf': 1,
 'min_samples_split': 4,
 'n_estimators': 1000}

In [24]:
best_grid = grid_search.best_estimator_

tuned_prediction2 = best_grid.predict(val_X)
roc_auc_score(val_y, tuned_prediction2), accuracy_score(val_y, tuned_prediction2)

(0.5699210337401293, 0.9514747859181731)

### Measure Feature Importance

In [27]:
col_sorted_by_importance=forest_model.feature_importances_.argsort()
feat_imp=pd.DataFrame({
    'cols':X.columns[col_sorted_by_importance],
    'imps':forest_model.feature_importances_[col_sorted_by_importance]
})

import plotly_express as px
px.bar(feat_imp, x='cols', y='imps')
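
Impurity-based importances such as `feature_importances_` can favor high-cardinality columns (for example label-encoded IDs). As a cross-check, permutation importance on the validation set is one option. This is only a sketch on synthetic data with hypothetical column names `f0` to `f5`.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic data; column names f0..f5 are placeholders
X_arr, y_toy = make_classification(
    n_samples=500, n_features=6, n_informative=3, random_state=0)
X_toy = pd.DataFrame(X_arr, columns=[f'f{i}' for i in range(6)])
tr_X, va_X, tr_y, va_y = train_test_split(X_toy, y_toy, random_state=0)

forest = RandomForestClassifier(n_estimators=50, random_state=1).fit(tr_X, tr_y)

# Shuffle one column at a time and measure how much validation AUC drops
result = permutation_importance(
    forest, va_X, va_y, scoring='roc_auc', n_repeats=5, random_state=0)

# Label the mean drops with column names and sort from least to most important
perm_imp = pd.Series(result.importances_mean, index=va_X.columns).sort_values()
print(perm_imp)
```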

### Drop lowest feature importance columns

In [29]:
X = X[X.columns[col_sorted_by_importance][20:]]

IndexError: index 69 is out of bounds for axis 0 with size 55

In [31]:
drop_cols = ['subscriptionStatus', 'stripeCustomerId', 'oid']
# Reassign instead of dropping in place to avoid modifying a slice of df
X = X.drop(columns=drop_cols)
len(X.columns)

52

In [33]:
val_X = val_X[X.columns]
train_X = train_X[X.columns]
len(val_X.columns)

52
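
Slicing `X` by position and overwriting it means the cell cannot be re-run safely, which is what the IndexError above points to. A sketch of a name-based alternative on synthetic data follows; the number of columns to drop (3) and the column names are arbitrary assumptions.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data; column names f0..f7 are placeholders
X_arr, y_toy = make_classification(
    n_samples=300, n_features=8, n_informative=3, random_state=0)
X_toy = pd.DataFrame(X_arr, columns=[f'f{i}' for i in range(8)])

forest = RandomForestClassifier(n_estimators=50, random_state=1).fit(X_toy, y_toy)

# Pair each importance with the column name the model was actually trained on
importances = pd.Series(forest.feature_importances_, index=X_toy.columns)

# Names of the 3 least important columns (3 is an arbitrary choice here)
to_drop = importances.nsmallest(3).index.tolist()

# Build a new frame instead of overwriting X_toy, so re-running is harmless
X_reduced = X_toy.drop(columns=to_drop)
print(to_drop, X_reduced.shape)
```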

In [35]:
grid_search.fit(train_X, train_y)
best_grid = grid_search.best_estimator_
tuned_prediction2 = best_grid.predict(val_X)
roc_auc_score(val_y, tuned_prediction2), accuracy_score(val_y, tuned_prediction2)

Fitting 3 folds for each of 432 candidates, totalling 1296 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 12 concurrent workers.
[Parallel(n_jobs=-1)]: Done  17 tasks      | elapsed:   28.3s
[Parallel(n_jobs=-1)]: Done 138 tasks      | elapsed:  3.0min
[Parallel(n_jobs=-1)]: Done 341 tasks      | elapsed:  5.8min
[Parallel(n_jobs=-1)]: Done 624 tasks      | elapsed: 10.1min
[Parallel(n_jobs=-1)]: Done 989 tasks      | elapsed: 16.0min
[Parallel(n_jobs=-1)]: Done 1296 out of 1296 | elapsed: 20.6min finished


(0.5089285714285714, 0.9476688867745005)

In [36]:
col_sorted_by_importance=best_grid.feature_importances_.argsort()
feat_imp=pd.DataFrame({
    'cols':X.columns[col_sorted_by_importance],
    # Take the importances from the same refit model that produced the ordering
    'imps':best_grid.feature_importances_[col_sorted_by_importance]
})

import plotly_express as px
px.bar(feat_imp, x='cols', y='imps')