# Predicting Customer Churn - Data Modeling: Decision Trees & Linear Classifiers

## Setup

### Common imports

In [9]:
# Data processing libraries
import pandas as pd
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Data modeling libraries
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import LinearSVC

### Data loading

In [10]:
df = pd.read_csv('data/customer-churn.csv')
df.head()

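| | COLLEGE | INCOME | OVERAGE | LEFTOVER | HOUSE | HANDSET_PRICE | OVER_15MINS_CALLS_PER_MONTH | AVERAGE_CALL_DURATION | REPORTED_SATISFACTION | REPORTED_USAGE_LEVEL | CONSIDERING_CHANGE_OF_PLAN | LEAVE |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 31953 | 0 | 6 | 313378 | 161 | 0 | 4 | 3 | 3 | 1 | 1 |
| 1 | 1 | 36147 | 0 | 13 | 800586 | 244 | 0 | 6 | 3 | 3 | 2 | 1 |
| 2 | 1 | 27273 | 230 | 0 | 305049 | 201 | 16 | 15 | 3 | 4 | 3 | 1 |
| 3 | 0 | 120070 | 38 | 33 | 788235 | 780 | 3 | 2 | 3 | 0 | 2 | 0 |
| 4 | 1 | 29215 | 208 | 85 | 224784 | 241 | 21 | 1 | 4 | 3 | 0 | 1 |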


In [11]:
df.shape

(20000, 12)

In [12]:
df.columns

Index(['COLLEGE', 'INCOME', 'OVERAGE', 'LEFTOVER', 'HOUSE', 'HANDSET_PRICE',
       'OVER_15MINS_CALLS_PER_MONTH', 'AVERAGE_CALL_DURATION',
       'REPORTED_SATISFACTION', 'REPORTED_USAGE_LEVEL',
       'CONSIDERING_CHANGE_OF_PLAN', 'LEAVE'],
      dtype='object')

## Data Modeling

In [29]:
training = df.loc[:, df.columns != 'LEAVE']
labels = df['LEAVE']

### Evaluating performance

We will use accuracy, the proportion of correct predictions out of all predictions, as a simple measure of performance to compare different models:

In [30]:
def evaluate_performance(model, training, labels):
    """Print the percentage of samples that a fitted model classifies correctly."""
    labels_pred = model.predict(training)
    n_correct = (labels_pred == labels).sum()
    print(f"Model performance: {n_correct / len(labels) * 100:.2f}%")
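The measure computed by `evaluate_performance` is what scikit-learn calls accuracy. As a minimal sketch on synthetic data (the `make_classification` dataset and the variable names below are illustrative assumptions, not the churn data), the manual computation should agree with `accuracy_score` and with the classifier's own `score` method:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the churn data (assumption: not the real dataset)
X, y = make_classification(n_samples=500, n_features=8, random_state=0)

model = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)
y_pred = model.predict(X)

manual_accuracy = np.mean(y_pred == y)        # fraction of correct predictions
sklearn_accuracy = accuracy_score(y, y_pred)  # library equivalent
score_accuracy = model.score(X, y)            # classifier shortcut for accuracy

print(f"manual={manual_accuracy:.4f}  accuracy_score={sklearn_accuracy:.4f}  score={score_accuracy:.4f}")
```

Using the library functions avoids reimplementing the metric, at the cost of hiding the simple computation behind it.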

## Decision tree with maximum depth = 10

In [15]:
tree_depth_10_clf = DecisionTreeClassifier(
    max_depth = 10)

tree_depth_10_clf.fit(training, labels)

DecisionTreeClassifier(max_depth=10)

In [16]:
evaluate_performance(tree_depth_10_clf, training, labels)

Model performance: 73.92%


## Decision tree with maximum depth = 20

In [17]:
tree_depth_20_clf = DecisionTreeClassifier(
    max_depth = 20)

tree_depth_20_clf.fit(training, labels)

DecisionTreeClassifier(max_depth=20)

In [18]:
evaluate_performance(tree_depth_20_clf, training, labels)

Model performance: 91.29%
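Note that `max_depth` is an upper bound, so a fitted tree can end up shallower than the limit. The sketch below, on noisy synthetic data standing in for the churn dataset (an assumption made purely for illustration), shows how the fitted depth and number of leaves could be inspected with `get_depth()` and `get_n_leaves()`:

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic, noisy data (assumption: a stand-in, not the churn dataset)
X, y = make_classification(
    n_samples=1000, n_features=10, flip_y=0.2, random_state=0)

tree_sizes = {}
for max_depth in (10, 20):
    clf = DecisionTreeClassifier(max_depth=max_depth, random_state=0).fit(X, y)
    # Record the depth actually reached and the number of leaves
    tree_sizes[max_depth] = (clf.get_depth(), clf.get_n_leaves())
    print(
        f"max_depth={max_depth}: fitted depth={clf.get_depth()}, "
        f"leaves={clf.get_n_leaves()}, training accuracy={clf.score(X, y):.2%}")
```

A tree with many more leaves can memorize more of the training set, which is one plausible reason for a higher training accuracy.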


## Logistic regression

In [23]:
logistic_regr_clf = Pipeline([
    ("scaler", StandardScaler()),
    ("logistic_regression", LogisticRegression()),
])

logistic_regr_clf.fit(training, labels)

Pipeline(steps=[('scaler', StandardScaler()),
                ('logistic_regression', LogisticRegression())])

In [24]:
evaluate_performance(logistic_regr_clf, training, labels)

Model performance: 62.06%
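Because the pipeline standardizes the features before fitting, the magnitudes of the logistic regression coefficients are roughly comparable across features. A minimal sketch of reading them back through `named_steps`, using synthetic data and hypothetical feature names rather than the churn columns:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data and made-up feature names (assumptions for illustration)
X, y = make_classification(
    n_samples=500, n_features=5, n_informative=3, n_redundant=0, random_state=0)
feature_names = [f"feature_{i}" for i in range(X.shape[1])]

clf = Pipeline([
    ("scaler", StandardScaler()),
    ("logistic_regression", LogisticRegression()),
]).fit(X, y)

# For a binary problem coef_ has shape (1, n_features); take the single row
coefficients = clf.named_steps["logistic_regression"].coef_[0]

# List features from largest to smallest absolute coefficient
order = np.argsort(-np.abs(coefficients))
for i in order:
    print(f"{feature_names[i]:>10}: {coefficients[i]:+.3f}")
```

The same access pattern should apply to `logistic_regr_clf`, with `training.columns` supplying the feature names.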


## Linear support vector machine

In [26]:
svm_clf = Pipeline([
    ("scaler", StandardScaler()),
    ("linear_svc", LinearSVC()), # SVC stands for support vector classifier
])

In [27]:
svm_clf.fit(training, labels)

Pipeline(steps=[('scaler', StandardScaler()), ('linear_svc', LinearSVC())])

In [28]:
evaluate_performance(svm_clf, training, labels)

Model performance: 61.99%
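A linear SVM classifies each sample by the sign of a linear score. As a hedged sketch on synthetic data (not the churn dataset; `random_state` and `max_iter` are set here only to make the illustration reproducible and to give the solver room to converge), the scores can be read with `decision_function` and compared with the predicted classes:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC

# Synthetic binary classification data with labels 0 and 1 (assumption)
X, y = make_classification(
    n_samples=500, n_features=5, n_informative=3, n_redundant=0, random_state=0)

clf = Pipeline([
    ("scaler", StandardScaler()),
    ("linear_svc", LinearSVC(random_state=0, max_iter=5000)),
]).fit(X, y)

scores = clf.decision_function(X)  # signed score for each sample
predictions = clf.predict(X)       # class 1 where the score is positive

print(scores[:5])
print(predictions[:5])
```

Samples with scores close to zero lie near the decision boundary, so their predictions are the least certain.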


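> A decision tree seems to be the best modeling technique to predict customer churn in this dataset: the tree with maximum depth 20 reaches 91.29% accuracy, against about 62% for both linear classifiers. However, we have only measured performance on the training data, so we do not yet know how well these models would perform on new data. A deep tree in particular may be memorizing the training set rather than learning patterns that generalize.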
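One common way to estimate performance on new data is to hold out part of the dataset for testing. The following is only a sketch under stated assumptions: it uses noisy synthetic data instead of the churn dataset, and the split size and tree depths are arbitrary choices for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Noisy synthetic data (assumption: a stand-in for the churn dataset)
X, y = make_classification(
    n_samples=2000, n_features=10, flip_y=0.2, random_state=0)

# Keep 25% of the samples aside; the model never sees them during fitting
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

results = {}
for max_depth in (3, 20):
    clf = DecisionTreeClassifier(max_depth=max_depth, random_state=0)
    clf.fit(X_train, y_train)
    results[max_depth] = (clf.score(X_train, y_train), clf.score(X_test, y_test))
    print(
        f"max_depth={max_depth}: train accuracy={results[max_depth][0]:.2%}, "
        f"test accuracy={results[max_depth][1]:.2%}")
```

A large gap between training and test accuracy for the deeper tree would suggest overfitting, which is why training accuracy alone is not enough to choose between models.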

## Evaluation

After selecting the best model, the last step is to evaluate its performance and assess its generalizability, that is, how well it performs on data it has not seen during training.

We will leave this task for another class.

## References

Provost, F., & Fawcett, T. (2013). Data Science for Business: What You Need to Know about Data Mining and Data-Analytic Thinking. Chapter 4.

Géron, A. (2019). Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow: Concepts, Tools, and Techniques to Build Intelligent Systems. Chapter 5.