# Introduction - Customer Churn Prediction notebook
This notebook shows how to train a churn prediction model with scikit-learn. After training the model, you step through the instructions to deploy it using Watson Machine Learning.

This notebook is a variation of the original notebook referenced in this GitHub repo: https://github.com/elenalowery/cpd4_demo/blob/master/assets/jupyterlab/Predict_Customer_Churn_CPD4.ipynb


In [1]:
# Install required Python modules
!pip install sklearn-pandas > /dev/null


## Step 1: Review Use Case
The analytics use case implemented in this notebook is telco churn prediction. It is a simple use case that illustrates a typical process for model development and deployment using Cloud Pak for Data.

In [2]:
import subprocess
CURRENT_BRANCH = subprocess.run(['git', 'rev-parse', '--abbrev-ref', 'HEAD'], stdout=subprocess.PIPE)\
    .stdout.strip().decode("utf-8")

if CURRENT_BRANCH in ['prd','uat']:
    CURRENT_ENV=CURRENT_BRANCH
else:
    CURRENT_ENV='dev'
    
print('Current branch     : {}'.format(CURRENT_BRANCH))
print('Current environment: {}'.format(CURRENT_ENV))

Current branch     : dev
Current environment: dev
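
The branch-to-environment rule in the cell above can be isolated as a small pure function, which makes it possible to check the rule without a Git checkout. The following is a minimal sketch; the name `resolve_env` and its `known_envs` and `default` parameters are hypothetical and not part of the original notebook.

```python
def resolve_env(branch, known_envs=("prd", "uat"), default="dev"):
    """Map a Git branch name to a deployment environment.

    Hypothetical helper: a branch named after a known environment maps to
    itself; every other branch (feature branches, 'dev', ...) maps to the
    default environment.
    """
    return branch if branch in known_envs else default


for branch in ("prd", "uat", "dev", "feature/new-model"):
    print("{:<18} -> {}".format(branch, resolve_env(branch)))
```

Only `prd` and `uat` select their own data file; any other branch falls back to the `dev` file.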


In [3]:
import pandas as pd
customer_data_df=pd.read_csv('/userfs/assets/data_asset/CUSTOMER_DATA_ready-'+CURRENT_ENV+'.csv')
customer_data_df.head(10)

Unnamed: 0,ID,LONGDISTANCE,INTERNATIONAL,LOCAL,DROPPED,PAYMETHOD,LOCALBILLTYPE,LONGDISTANCEBILLTYPE,USAGE,RATEPLAN,...,CREDITCARD,DOB,ADDRESS_1,CITY,STATE,ZIP,ZIP4,LONGITUDE,LATITUDE,CHURN
0,1,23,0,206,0,CC,Budget,Intnl_discount,229,3,...,1814139000000000.0,32dad3590f2243b8709201348e1ae897,159 HUTTON ST BSMT A,ABSECON,NJ,8201,0,,,T
1,1004,28,0,60,0,Auto,FreeLocal,Standard,89,4,...,6494422000000000.0,c643e317495168f62085716c81ec164d,1724 WHITEHAVEN,GLYNDON,MN,56547,0,,,F
2,1005,24,0,5,0,CH,Budget,Standard,29,4,...,3218720000000000.0,80c40ce517ca57e0919e238e0e29e75c,95 W 25TH ST APT 1,WAPPINGERS FALLS,NY,12590,1723,,,F
3,1006,28,0,97,0,CC,FreeLocal,Standard,125,1,...,3016220000000000.0,df7b078f544b61f867ad0dc1fa51c046,66 KULLA DR,RICHLAND,NE,68601,0,-97.377539,41.441233,T
4,1008,0,0,4,2,CC,Budget,Standard,4,2,...,7070216000000000.0,273a525adc7bb0bd49252e47dab190e9,5621 MCCARTY RD,EVERETT,WA,98205,0,,,F
5,1009,29,0,9,0,CC,Budget,Intnl_discount,38,2,...,4919386000000000.0,efb18ce1ef44f169687df57e9b9fdf53,2000 CALLE 4,CAROLINA,PR,979,0,,,F
6,1010,13,0,40,0,CC,Budget,Standard,53,4,...,9402648000000000.0,227f74a0e2d7b254a9c73ec61528ee94,3801 YOSEMITE BLVD STE F,HOUSTON,TX,77024,7776,,,F
7,1016,16,0,114,0,CH,Budget,Standard,130,1,...,8522563000000000.0,92e4302092a290acd3bc1fb75ada5267,843 EUCLID ST APT 101S,KIRKLAND,WA,98034,0,-122.209175,47.709619,T
8,1017,7,0,6,0,CC,Budget,Standard,13,3,...,2981966000000000.0,32bd821d9a01040a89f9a7d3766017ce,3801 MAC CV,NEW YORK,NY,10019,0,-73.990852,40.768196,F
9,1018,21,0,87,0,CC,Budget,Standard,108,1,...,3074091000000000.0,e78d37c276f03bdfa0eef28dc18f9c3a,390 W BROADWAY ST,BUTLER,NJ,7405,0,,,F


In [4]:
# Copy the DataFrame into a new DataFrame called data
data=customer_data_df.copy()

In [5]:
# List all the columns
print(data.columns)

Index(['ID', 'LONGDISTANCE', 'INTERNATIONAL', 'LOCAL', 'DROPPED', 'PAYMETHOD',
       'LOCALBILLTYPE', 'LONGDISTANCEBILLTYPE', 'USAGE', 'RATEPLAN', 'GENDER',
       'STATUS', 'CHILDREN', 'ESTINCOME', 'CAROWNER', 'AGE', 'CREDITCARD',
       'DOB', 'ADDRESS_1', 'CITY', 'STATE', 'ZIP', 'ZIP4', 'LONGITUDE',
       'LATITUDE', 'CHURN'],
      dtype='object')


In [6]:
# Keep only the columns that are relevant for churn prediction.
# .copy() makes the result an independent DataFrame, so later column assignments are unambiguous.
data = data[['LONGDISTANCE', 'INTERNATIONAL', 'LOCAL', 'DROPPED', 'PAYMETHOD', 'LOCALBILLTYPE', 'LONGDISTANCEBILLTYPE', 'USAGE', 'RATEPLAN', 'GENDER','STATUS', 'CHILDREN', 'ESTINCOME', 'CAROWNER', 'AGE', 'CHURN']].copy()
data.head()


Unnamed: 0,LONGDISTANCE,INTERNATIONAL,LOCAL,DROPPED,PAYMETHOD,LOCALBILLTYPE,LONGDISTANCEBILLTYPE,USAGE,RATEPLAN,GENDER,STATUS,CHILDREN,ESTINCOME,CAROWNER,AGE,CHURN
0,23,0,206,0,CC,Budget,Intnl_discount,229,3,F,S,1,38000.0,N,24.393333,T
1,28,0,60,0,Auto,FreeLocal,Standard,89,4,F,M,1,8073.11,N,46.0,F
2,24,0,5,0,CH,Budget,Standard,29,4,M,M,0,95448.6,Y,53.68,F
3,28,0,97,0,CC,FreeLocal,Standard,125,1,M,S,1,24141.5,Y,17.006667,T
4,0,0,4,2,CC,Budget,Standard,4,2,M,S,1,31952.0,N,34.266667,F


## Step 2: Try the Random Forest model

In [7]:
import pandas as pd
import sklearn
pd.options.display.max_columns = 999

import warnings
warnings.filterwarnings('ignore')

from scipy.stats import chi2_contingency,ttest_ind
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

from sklearn.model_selection import train_test_split

from sklearn.preprocessing import MinMaxScaler, LabelEncoder
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score, accuracy_score, roc_curve, roc_auc_score

import numpy as np

import urllib3, requests, json

In [8]:
# Convert CHURN from T/F to 1/0 (LabelEncoder sorts the classes, so F -> 0 and T -> 1)
le = LabelEncoder()
data['CHURN'] = le.fit_transform(data['CHURN'])
data.head()

Unnamed: 0,LONGDISTANCE,INTERNATIONAL,LOCAL,DROPPED,PAYMETHOD,LOCALBILLTYPE,LONGDISTANCEBILLTYPE,USAGE,RATEPLAN,GENDER,STATUS,CHILDREN,ESTINCOME,CAROWNER,AGE,CHURN
0,23,0,206,0,CC,Budget,Intnl_discount,229,3,F,S,1,38000.0,N,24.393333,1
1,28,0,60,0,Auto,FreeLocal,Standard,89,4,F,M,1,8073.11,N,46.0,0
2,24,0,5,0,CH,Budget,Standard,29,4,M,M,0,95448.6,Y,53.68,0
3,28,0,97,0,CC,FreeLocal,Standard,125,1,M,S,1,24141.5,Y,17.006667,1
4,0,0,4,2,CC,Budget,Standard,4,2,M,S,1,31952.0,N,34.266667,0
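
`LabelEncoder` assigns integer codes in sorted order of the distinct labels, which is why `F` becomes 0 and `T` becomes 1 in the output above. A minimal sketch on a hypothetical four-value list (not the notebook's data) shows the mapping and how to recover the original labels:

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical sample of CHURN values, used only for illustration.
churn_values = ["T", "F", "F", "T"]

encoder = LabelEncoder()
encoded = encoder.fit_transform(churn_values)

# classes_ holds the distinct labels in sorted order; a label's position
# in that array is its integer code.
mapping = {str(label): code for code, label in enumerate(encoder.classes_)}
print(mapping)
print(encoded.tolist())

# inverse_transform maps the integer codes back to the original labels.
decoded = [str(label) for label in encoder.inverse_transform(encoded)]
print(decoded)
```

Keeping the fitted encoder (`le` in the notebook) around is what allows predictions of 0/1 to be translated back to `F`/`T` later.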


In [9]:
# Define the label (y) and the features (x); x excludes the label and the columns this model does not use
y = np.float32(data.CHURN)
x = data.drop(['CHURN','RATEPLAN','GENDER','ESTINCOME','STATUS','AGE','USAGE'], axis = 1)

In [10]:
x.columns

Index(['LONGDISTANCE', 'INTERNATIONAL', 'LOCAL', 'DROPPED', 'PAYMETHOD',
       'LOCALBILLTYPE', 'LONGDISTANCEBILLTYPE', 'CHILDREN', 'CAROWNER'],
      dtype='object')

In [11]:
# Apply the LabelEncoder to encode the input features in numeric form where applicable
from sklearn_pandas import DataFrameMapper

'''
mapper = DataFrameMapper(
    [('GENDER', LabelEncoder()),
     ('STATUS', LabelEncoder()),
     ('CHILDREN', None),
     ('ESTINCOME',None),
     ('CAROWNER', LabelEncoder()),
     ('AGE',None),
     ('LONGDISTANCE',None),
     ('INTERNATIONAL',None),
     ('LOCAL',None),
     ('DROPPED',None),
     ('PAYMETHOD',LabelEncoder()),
     ('LOCALBILLTYPE',LabelEncoder()),
     ('LONGDISTANCEBILLTYPE',LabelEncoder()),
     ('USAGE',None),
     ('RATEPLAN',None)
    ]
)
'''

mapper = DataFrameMapper(
    [
     ('CHILDREN', None),
     ('CAROWNER', LabelEncoder()),
     ('LONGDISTANCE',None),
     ('INTERNATIONAL',None),
     ('LOCAL',None),
     ('DROPPED',None),
     ('PAYMETHOD',LabelEncoder()),
     ('LOCALBILLTYPE',LabelEncoder()),
     ('LONGDISTANCEBILLTYPE',LabelEncoder())
    ]
)
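
The scikit-learn documentation describes `LabelEncoder` as intended for target labels and provides `OrdinalEncoder` for input features. As an alternative to the `DataFrameMapper` above (a swapped-in technique, not what this notebook uses), the same kind of per-column encoding can be sketched with scikit-learn's own `ColumnTransformer`. The toy DataFrame below is hypothetical and only reuses three of the notebook's column names.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OrdinalEncoder

# Tiny hypothetical frame; the values are made up for illustration.
toy = pd.DataFrame({
    "CHILDREN": [1, 0, 2],
    "CAROWNER": ["N", "Y", "N"],
    "PAYMETHOD": ["CC", "Auto", "CH"],
})

categorical = ["CAROWNER", "PAYMETHOD"]
numeric = ["CHILDREN"]

column_encoder = ColumnTransformer(
    transformers=[
        # Encode each categorical column as integer codes (sorted categories).
        ("categorical", OrdinalEncoder(), categorical),
        # Leave numeric columns untouched.
        ("numeric", "passthrough", numeric),
    ]
)

# Output columns follow the transformer order: CAROWNER, PAYMETHOD, CHILDREN.
encoded_features = column_encoder.fit_transform(toy)
print(encoded_features)
```

Like the mapper, this transformer can be used as the first step of a `Pipeline`, so the encoding learned during training is reapplied at prediction time.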

In [12]:
# Split the data into training (80%) and test (20%) sets, stratified on the label
X_train, X_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=42, stratify=y)
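
The `stratify=y` argument keeps the proportion of churners the same in the training and test sets. A minimal sketch on hypothetical labels (60 non-churners and 40 churners, not the notebook's data) illustrates the effect:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical labels: 60 non-churners (0) and 40 churners (1).
labels = np.array([0] * 60 + [1] * 40)
features = np.arange(len(labels)).reshape(-1, 1)

X_tr, X_te, y_tr, y_te = train_test_split(
    features, labels, test_size=0.2, random_state=42, stratify=labels
)

# With stratify, both subsets keep the original 40% churn rate.
print("train size:", len(y_tr), "churn rate:", y_tr.mean())
print("test size: ", len(y_te), "churn rate:", y_te.mean())
```

Without stratification, a small test set can end up with a noticeably different churn rate from the training set, which makes the evaluation metrics harder to interpret.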

In [13]:
# fit the model

import sklearn.pipeline
from sklearn.preprocessing import OneHotEncoder

random_forest = RandomForestClassifier()
steps = [('mapper', mapper),('RandomForestClassifier', random_forest)]
pipeline = sklearn.pipeline.Pipeline(steps)
model=pipeline.fit( X_train, y_train )
model

Pipeline(steps=[('mapper',
                 DataFrameMapper(drop_cols=[],
                                 features=[('CHILDREN', None),
                                           ('CAROWNER', LabelEncoder()),
                                           ('LONGDISTANCE', None),
                                           ('INTERNATIONAL', None),
                                           ('LOCAL', None), ('DROPPED', None),
                                           ('PAYMETHOD', LabelEncoder()),
                                           ('LOCALBILLTYPE', LabelEncoder()),
                                           ('LONGDISTANCEBILLTYPE',
                                            LabelEncoder())])),
                ('RandomForestClassifier', RandomForestClassifier())])

In [14]:
# Call pipeline.predict() on the X_test data to generate a set of test predictions
y_prediction = pipeline.predict(X_test)

# Evaluate the predictions with sklearn.metrics.classification_report()
report = sklearn.metrics.classification_report(y_test, y_prediction)

# Print the report
print(report)

              precision    recall  f1-score   support

         0.0       0.98      0.92      0.95       168
         1.0       0.90      0.97      0.93       115

    accuracy                           0.94       283
   macro avg       0.94      0.95      0.94       283
weighted avg       0.95      0.94      0.94       283
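
Each row of the report is derived from the counts of true positives, false positives and false negatives for that class. The sketch below computes the three per-class metrics by hand on a small hypothetical set of labels (not the notebook's test set); the helper name `class_metrics` is an assumption made for illustration.

```python
def class_metrics(y_true, y_pred, positive):
    """Return (precision, recall, f1) for one class, treating it as positive."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)

    # Guard against division by zero when a class is never predicted or absent.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1


# Hypothetical labels: 1 = churn, 0 = no churn.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 0, 1, 1]

for label in (0, 1):
    precision, recall, f1 = class_metrics(y_true, y_pred, positive=label)
    print("class {}: precision={:.2f} recall={:.2f} f1={:.2f}".format(
        label, precision, recall, f1))
```

In the report, `support` is the number of true samples of each class, and the `macro avg` row is the unweighted mean of the per-class values.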



### Evaluate

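The trained model reaches 94% accuracy on the 283-row test set, with F1-scores of 0.95 for the non-churn class and 0.93 for the churn class. These results are good enough to justify deploying the model so that applications can use it.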