# Mobile Customer Churn

In this Portfolio task you will work with some (fake but realistic) data on Mobile Customer Churn.  Churn is where
a customer leaves the mobile provider.   The goal is to build a simple predictive model to predict churn from available features. 

The data was generated (by Hume Winzar at Macquarie) based on a real dataset provided by Optus.  The data is simulated but the column headings are the same. (Note that I'm not sure if all of the real relationships in this data are preserved so you need to be cautious in interpreting the results of your analysis here).  

The data is provided in file `MobileCustomerChurn.csv` and column headings are defined in a file `MobileChurnDataDictionary.csv` (store these in the `files` folder in your project).

Your high level goal in this notebook is to try to build and evaluate a __predictive model for churn__ - predict the value of the CHURN_IND field in the data from some of the other fields.  Note that the three `RECON` fields should not be used as they indicate whether the customer reconnected after having churned. 

__Note:__ you are not being evaluated on the _accuracy_ of the model but on the _process_ that you use to generate it.  You can use a simple model such as Logistic Regression for this task or try one of the more advanced methods covered in recent weeks.  Explore the data, build a model using a selection of features and then do some work on finding out which features provide the most accurate results.  

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

from sklearn.cluster import KMeans
from sklearn.metrics import pairwise_distances
from scipy.cluster.hierarchy import linkage, dendrogram, cut_tree
from scipy.spatial.distance import pdist 
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn import linear_model
from sklearn.metrics import r2_score
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, accuracy_score, plot_confusion_matrix
from sklearn.feature_selection import RFE

In [2]:
churn = pd.read_csv("files/MobileCustomerChurn.csv", na_values=["NA", "#VALUE!"], index_col='INDEX')
churn.head()

Unnamed: 0_level_0,CUST_ID,ACCOUNT_TENURE,ACCT_CNT_SERVICES,AGE,CFU,SERVICE_TENURE,PLAN_ACCESS_FEE,BYO_PLAN_STATUS,PLAN_TENURE,MONTHS_OF_CONTRACT_REMAINING,...,CONTRACT_STATUS,PREV_CONTRACT_DURATION,HANDSET_USED_BRAND,CHURN_IND,MONTHLY_SPEND,COUNTRY_METRO_REGION,STATE,RECON_SMS_NEXT_MTH,RECON_TELE_NEXT_MTH,RECON_EMAIL_NEXT_MTH
INDEX,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
1,1,46,1,30.0,CONSUMER,46,54.54,NON BYO,15,0,...,OFF-CONTRACT,24,SAMSUNG,1,61.4,COUNTRY,WA,,,
2,2,60,3,55.0,CONSUMER,59,54.54,NON BYO,5,0,...,OFF-CONTRACT,24,APPLE,1,54.54,METRO,NSW,,,
3,5,65,1,29.0,CONSUMER,65,40.9,BYO,15,0,...,OFF-CONTRACT,12,APPLE,1,2.5,COUNTRY,WA,,,
4,6,31,1,51.0,CONSUMER,31,31.81,NON BYO,31,0,...,OFF-CONTRACT,24,APPLE,1,6.48,COUNTRY,VIC,,,
5,8,95,1,31.0,CONSUMER,95,54.54,NON BYO,0,0,...,OFF-CONTRACT,24,APPLE,1,100.22,METRO,NSW,,,


In [3]:
churn = churn.dropna()

### One hot encoding

In [4]:
churn = churn.append(pd.get_dummies(churn["HANDSET_USED_BRAND"], dummy_na=False))

In [5]:
churn = churn.append(pd.get_dummies(churn["CONTRACT_STATUS"], dummy_na=False))

In [6]:
churnAnalysis = churn.drop("CUST_ID", axis = 1)

In [7]:
churnAnalysis

Unnamed: 0_level_0,ACCOUNT_TENURE,ACCT_CNT_SERVICES,AGE,CFU,SERVICE_TENURE,PLAN_ACCESS_FEE,BYO_PLAN_STATUS,PLAN_TENURE,MONTHS_OF_CONTRACT_REMAINING,LAST_FX_CONTRACT_DURATION,...,RECON_EMAIL_NEXT_MTH,APPLE,GOOGLE,HUAWEI,OTHER,SAMSUNG,UNKNOWN,NO-CONTRACT,OFF-CONTRACT,ON-CONTRACT
INDEX,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
8153,15.0,1.0,66.0,CONSUMER,15.0,31.81,NON BYO,15.0,9.0,24.0,...,0.0,,,,,,,,,
8155,49.0,2.0,55.0,CONSUMER,49.0,45.44,NON BYO,29.0,0.0,24.0,...,0.0,,,,,,,,,
8159,71.0,2.0,34.0,CONSUMER,51.0,72.72,NON BYO,29.0,0.0,24.0,...,0.0,,,,,,,,,
8169,9.0,1.0,27.0,SMALL BUSINESS,9.0,72.72,NON BYO,9.0,15.0,24.0,...,0.0,,,,,,,,,
8172,46.0,1.0,34.0,CONSUMER,46.0,72.72,NON BYO,7.0,17.0,24.0,...,0.0,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
46201,,,,,,,,,,,...,,,,,,,,0.0,0.0,0.0
46202,,,,,,,,,,,...,,,,,,,,0.0,0.0,0.0
46204,,,,,,,,,,,...,,,,,,,,0.0,0.0,0.0
46205,,,,,,,,,,,...,,,,,,,,0.0,0.0,0.0


In [10]:
#churnAnalysis = churn[["ACCOUNT_TENURE", "ACCT_CNT_SERVICES", "AGE", "SERVICE_TENURE", "PLAN_ACCESS_FEE", "PLAN_TENURE", #"MONTHS_OF_CONTRACT_REMAINING"
#                       "PREV_CONTRACT_DURATION", "MONTHLY_SPEND", "CHURN_IND"]]
churnAnalysis = churn.drop(["CFU", "BYO_PLAN_STATUS", "UNKNOWN", "CONTRACT_STATUS", "CUST_ID", "HANDSET_USED_BRAND", "COUNTRY_METRO_REGION", "STATE"], axis = 1)
churnAnalysis = churnAnalysis.fillna(0)
churnAnalysis

Unnamed: 0_level_0,ACCOUNT_TENURE,ACCT_CNT_SERVICES,AGE,SERVICE_TENURE,PLAN_ACCESS_FEE,PLAN_TENURE,MONTHS_OF_CONTRACT_REMAINING,LAST_FX_CONTRACT_DURATION,PREV_CONTRACT_DURATION,CHURN_IND,...,RECON_TELE_NEXT_MTH,RECON_EMAIL_NEXT_MTH,APPLE,GOOGLE,HUAWEI,OTHER,SAMSUNG,NO-CONTRACT,OFF-CONTRACT,ON-CONTRACT
INDEX,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
8153,15.0,1.0,66.0,15.0,31.81,15.0,9.0,24.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
8155,49.0,2.0,55.0,49.0,45.44,29.0,0.0,24.0,24.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
8159,71.0,2.0,34.0,51.0,72.72,29.0,0.0,24.0,24.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
8169,9.0,1.0,27.0,9.0,72.72,9.0,15.0,24.0,24.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
8172,46.0,1.0,34.0,46.0,72.72,7.0,17.0,24.0,24.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
46201,0.0,0.0,0.0,0.0,0.00,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
46202,0.0,0.0,0.0,0.0,0.00,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
46204,0.0,0.0,0.0,0.0,0.00,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
46205,0.0,0.0,0.0,0.0,0.00,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


1.0

In [9]:
train, test = train_test_split(churnAnalysis, test_size=0.2, random_state=142)
print(train.shape)
print(test.shape)

ValueError: With n_samples=0, test_size=0.2 and train_size=None, the resulting train set will be empty. Adjust any of the aforementioned parameters.

In [None]:
X_train = train.drop(['CHURN_IND'], axis=1)
y_train = train['CHURN_IND']
X_test = test.drop(['CHURN_IND'], axis=1)
y_test = test['CHURN_IND']


lr = LogisticRegression()
lr.fit(X_train, y_train)

In [None]:
train_preds = lr.predict(X_train)
test_preds = lr.predict(X_test)

In [None]:
print("Train Accuracy: ")
print(accuracy_score(y_train, train_preds))
print("Test Accuracy: ")
print(accuracy_score(y_test, test_preds))

In [None]:
print("Train Confusion Matrix: ")
plot_confusion_matrix(lr, X_test, y_test, cmap="OrRd", colorbar = False)

In [None]:
lr = LogisticRegression()
rfe = RFE(estimator = lr, n_features_to_select = 5, step=1)
rfe.fit(X_train, y_train)

In [None]:
X_train.columns[rfe.support_]

In [None]:
train_rfe_preds = rfe.predict(X_train)
test_rfe_preds = rfe.predict(X_test)
print("Train Accuracy: ")
print(accuracy_score(y_train, train_rfe_preds))
print("Test Accuracy: ")
print(accuracy_score(y_test, test_rfe_preds))

In [None]:
train_accuracies = []
test_accuracies = []
for i in range(1, X_train.shape[1]+1):
    lr = LogisticRegression()
    rfe = RFE(estimator = lr, n_features_to_select = i, step=1)
    rfe.fit(X_train, y_train)
    train_rfe_preds = rfe.predict(X_train)
    test_rfe_preds = rfe.predict(X_test)
    
    train_accuracies.append(accuracy_score(y_train, train_rfe_preds))
    test_accuracies.append(accuracy_score(y_test, test_rfe_preds))

In [None]:
plt.figure(figsize=(12,5))
plt.plot(range(1, X_train.shape[1]+1), train_accuracies)
plt.plot(range(1, X_train.shape[1]+1), test_accuracies)
plt.title("Train and Test Accuracy at Each Number of Features")
plt.xlabel("Number of Features")
plt.ylabel("Accuracy")

In [None]:
lr = LogisticRegression()
rfe = RFE(estimator = lr, n_features_to_select = 1, step=1)
rfe.fit(X_train, y_train)
print("The most important value for this model is: ")
print(X_train.columns[rfe.support_])