## Final Project Submission

Please fill out:
* Student name: James Brochhausen
* Student pace: part time
* Scheduled project review date/time: 10/08/2020
* Instructor name: James Irving

## Introduction

In this notebook we explore a telecommunications dataset to understand why customers churn. To do this we walk through two classification models, a random forest and XGBoost, and examine which features (columns) are the most important. We also need to understand the direction of each feature's importance: is it important because it is associated with more churn, or because it is associated with less churn? Answering that question is the main goal of this notebook.

## Importing and Exploring Data

In [None]:
# Importing necessary libraries and functions.
import warnings
from collections import Counter

import numpy as np
import pandas as pd
import matplotlib as mpl
import matplotlib.pyplot as plt
import seaborn as sns
import shap
import sklearn.metrics as metrics
from sklearn import tree
from sklearn.model_selection import train_test_split, GridSearchCV, cross_val_score
# from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, roc_curve, auc
from sklearn.metrics import plot_confusion_matrix, recall_score
# from sklearn.preprocessing import OneHotEncoder
# from sklearn.preprocessing import MinMaxScaler
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import normalize
from xgboost import XGBRFClassifier, XGBClassifier
from imblearn.over_sampling import SMOTE

shap.initjs()
warnings.filterwarnings('ignore')

In [None]:
# Classification Function
def classify(y_true, y_pred,X_true,clf,cm_kws=dict(cmap="Greens",
                                  normalize='true'),figsize=(9,5),
                   plot_roc_auc=True):
    
    # Class report 
    print(metrics.classification_report(y_true,y_pred))

    if plot_roc_auc:
        num_cols=1
    else:
        num_cols=1

In [None]:
#Opening my data
df = pd.read_csv('bigml_59c28831336c6604c800002a.csv')
pd.set_option('display.max_columns', None)
df.head()

In [None]:
df.nunique()

In [None]:
df.info()

In [None]:
df.isna().sum()

## EDA

In [None]:
# pd.plotting.scatter_matrix(X)

In [None]:
columns = list(df.columns)

In [None]:
df.head()

In [None]:
t_f = df['churn'].value_counts()
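# Pass x and y by keyword, and keep the Axes object in `ax` (set_title returns a Text object).
ax = sns.barplot(x=t_f.index, y=t_f.values)
ax.set_title('Total Churn Vs. No Churn')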
# ax.set(xlabel = 'Churn', ylabel = 'Total Amount')
# ax.set(xlabel='Churn', ylabel='Client Volume')
plt.xlabel('Churn')
plt.ylabel('Client Volume')
plt.show()

In [None]:
df['churn'].value_counts()

In [None]:
# fig, axes = plt.subplots(nrows=len(columns), ncols=1, figsize=(14,4))

days = ['account length']
minutes = ['total day minutes','total eve minutes','total night minutes',
           'total intl minutes', 'customer service calls']
calls = ['total day calls','total eve calls','total night calls',
         'total intl calls']
charge = ['total day charge','total eve charge','total night charge',
          'total intl charge']

# Average each numeric feature by churn status; text columns are skipped.
mean = df.groupby('churn').mean(numeric_only=True)
print(mean)



In [None]:
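# Swarm plots of the minute-based features (plus customer service calls) by churn.
# sns.swarmplot is used here because the figure-level sns.catplot does not draw on a supplied `ax`.
fig, ax = plt.subplots(2, 3, figsize=(20,10))
for i, col in enumerate(minutes):
    sns.swarmplot(x='churn', y=col, data=df, ax=ax[i//3][i%3])
ax[1][2].set_visible(False)  # only five features, so hide the unused sixth panel
plt.show()
#Take a look at violin plot and Swarm.    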

In [None]:
# Axes-level swarmplot draws on the supplied `ax`; `dodge` is the current name for `split`.
fig, ax = plt.subplots(2, 2, figsize=(20,10))
for i, col in enumerate(calls):
    sns.swarmplot(x='churn', y=col, data=df, hue='churn', dodge=True,
                  ax=ax[i//2][i%2])
plt.show()

In [None]:
# Axes-level swarmplot draws on the supplied `ax` (catplot would open its own figure).
fig, ax = plt.subplots(2, 2, figsize=(20,10))
for i, col in enumerate(charge):
    sns.swarmplot(x='churn', y=col, data=df, ax=ax[i//2][i%2])
plt.show()

## Get dummies

In [None]:
# Dropping Phone Number
# Setting my X and y data
y = df['churn'].astype(int)
X = df.drop(columns = ['churn', 'phone number']).copy()

In [None]:
y.value_counts(normalize = True)

In [None]:
X = pd.get_dummies(X)

In [None]:
# 

## Class Imbalance

## Train Test Split

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                    test_size=0.30,
                                                    random_state=10)

In [None]:
# X_sm, y_sm = SMOTE().fit_resample(X, y)

In [None]:
std_scale = StandardScaler()

X_train_scaled = std_scale.fit_transform(X_train)
X_train_scaled = pd.DataFrame(X_train_scaled, columns = X.columns)
X_test_scaled = std_scale.transform(X_test)
X_test_scaled = pd.DataFrame(X_test_scaled, columns = X.columns)
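
The scaler above is fit on the training split only and then reused on the test split. As a rough illustration of what that means, the sketch below standardizes a toy column by hand with NumPy. The arrays are made-up values, not rows from the dataset, and the arithmetic mirrors the usual standard-score formula rather than calling `StandardScaler` itself.

```python
import numpy as np

# Toy stand-ins for one feature column (hypothetical values).
train = np.array([[1.0], [2.0], [3.0]])
test = np.array([[4.0]])

# Statistics come from the training data only.
mean = train.mean(axis=0)
std = train.std(axis=0)  # population standard deviation (ddof=0)

train_scaled = (train - mean) / std
test_scaled = (test - mean) / std  # reuses the training mean and std

print(train_scaled.ravel())
print(test_scaled.ravel())
```

Because the test values never influence `mean` or `std`, no information about the test set leaks into the preprocessing step.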

In [None]:
X_train, y_train = SMOTE().fit_resample(X_train_scaled, y_train)

In [None]:
# y_train now holds the SMOTE-resampled labels (y_sm is never defined).
print('Resampled dataset shape %s' % Counter(y_train))
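
The resampling cell above relies on SMOTE to balance the classes. As a rough sketch of the idea only (this is not imblearn's implementation), the code below creates synthetic minority points by interpolating between a minority sample and one of its nearest minority neighbours. The helper name `smote_like`, its parameters, and the toy data are all hypothetical.

```python
import numpy as np

def smote_like(X_min, n_new, k=3, seed=0):
    """Hypothetical helper: simplified SMOTE-style oversampling of minority rows."""
    rng = np.random.default_rng(seed)
    X_min = np.asarray(X_min, dtype=float)
    # Pairwise Euclidean distances between minority samples.
    dists = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=2)
    np.fill_diagonal(dists, np.inf)  # a point is not its own neighbour
    neighbours = np.argsort(dists, axis=1)[:, :k]
    # Pick a random base point and one of its k nearest neighbours.
    base = rng.integers(0, len(X_min), size=n_new)
    picked = neighbours[base, rng.integers(0, k, size=n_new)]
    # Place the new point somewhere on the segment between the two.
    gap = rng.random((n_new, 1))
    return X_min[base] + gap * (X_min[picked] - X_min[base])

# Toy minority class: four points on the unit square (made-up data).
X_minority = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
synthetic = smote_like(X_minority, n_new=6)
print(synthetic.shape)
```

Every synthetic row lies between two real minority rows, which is why the notebook resamples the training split only: the test set stays made of real customers.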

## Random Forest

In [None]:
rf_clf = RandomForestClassifier(random_state=10)

rf_clf.fit(X_train, y_train)

In [None]:
y_pred = rf_clf.predict(X_test_scaled)

In [None]:
acc = accuracy_score(y_test,y_pred) * 100
print('Accuracy :{0}'.format(round(acc, 2)))

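# AUC, computed from predicted churn probabilities rather than hard 0/1 labels
y_score = rf_clf.predict_proba(X_test_scaled)[:, 1]
false_positive_rate, true_positive_rate, thresholds = roc_curve(y_test,
                                                                y_score)
roc_auc = auc(false_positive_rate, true_positive_rate)
print('\nAUC :{0}'.format(round(roc_auc, 2)))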

# Printed confusion matrix 
pd.crosstab(y_test, y_pred, rownames=['True'], colnames=['Predicted'],
            margins=True)

In [None]:
classify(y_test,y_pred,X_test_scaled,rf_clf)

plot_confusion_matrix(rf_clf, X_test_scaled, y_test, values_format='.3g',
                     normalize = 'true')

plt.title('Confusion Matrix')
plt.show()

## XGB

In [None]:
xgb_rf = XGBClassifier()
xgb_rf.fit(X_train, y_train)
# print('Training score: ' ,round(xgb_rf.score(X_train,y_train),2))
# print('Test score: ',round(xgb_rf.score(X_test,y_test),2))

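# The model was fit on scaled features, so predict on the scaled test set.
y_pred2 = xgb_rf.predict(X_test_scaled)

classify(y_test,y_pred2,X_test_scaled,xgb_rf)


# Compare predictions against the true labels (y_test), not against themselves.
plot_confusion_matrix(xgb_rf, X_test_scaled, y_test, values_format='.3g',
                      normalize = 'true')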
plt.title('Confusion Matrix')
plt.show()

## SHAP

In [None]:
explainer = shap.TreeExplainer(xgb_rf)

In [None]:
# y_train was resampled by SMOTE and no longer lines up with X_train_scaled, so pass the features only.
shap_values = explainer.shap_values(X_train_scaled)
shap.summary_plot(shap_values, X_train_scaled, plot_type="bar")

In [None]:
shap.summary_plot(shap_values,X_train_scaled)
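
The summary plot above shows direction through colour. One hedged way to put a number on it is to correlate each feature's values with that feature's SHAP values: a positive correlation suggests that higher values push predictions toward churn, a negative one the opposite. The helper `shap_direction` is hypothetical, and the small arrays below are toy stand-ins for `shap_values` and `X_train_scaled.values`, not results from the model.

```python
import numpy as np

def shap_direction(shap_vals, features, names):
    """Hypothetical helper: correlation between each feature and its SHAP values."""
    directions = {}
    for j, name in enumerate(names):
        corr = np.corrcoef(features[:, j], shap_vals[:, j])[0, 1]
        directions[name] = float(corr)
    return directions

# Toy stand-ins (made-up numbers): 4 rows, 2 features.
features = np.array([[0.0, 1.0], [1.0, 2.0], [2.0, 3.0], [3.0, 5.0]])
shap_vals = np.array([[-0.3, 0.4], [-0.1, 0.2], [0.1, -0.1], [0.3, -0.5]])

directions = shap_direction(
    shap_vals, features, ["customer service calls", "total intl calls"]
)
for name, corr in directions.items():
    label = "pushes toward churn" if corr > 0 else "pushes away from churn"
    print(f"{name}: {corr:+.2f} ({label})")
```

A correlation is only a summary; features with a non-monotonic effect still need the full summary plot.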

## GridSearch

In [None]:
best_score = cross_val_score(rf_clf, X_train, y_train, cv=3)

mean_best_score = np.mean(best_score)
print(mean_best_score)

In [None]:
grid_params = {
    'criterion': ['gini', 'entropy'],
    'max_depth': [None, 2, 4, 5, 6, 8, 10, 12],
    'min_samples_split': [2, 4, 6, 8, 10],
    'min_samples_leaf': [2,4, 6, 8, 10, 12, 14],
}
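
Before launching the search it helps to know how much work the grid above implies. The sketch below counts the parameter combinations and multiplies by the number of folds; `cv_folds = 3` is taken from the `cv=3` used in this notebook, and the variable names are illustrative.

```python
from math import prod

grid_params = {
    'criterion': ['gini', 'entropy'],
    'max_depth': [None, 2, 4, 5, 6, 8, 10, 12],
    'min_samples_split': [2, 4, 6, 8, 10],
    'min_samples_leaf': [2, 4, 6, 8, 10, 12, 14],
}
cv_folds = 3

# One candidate model per combination of parameter values.
n_combinations = prod(len(values) for values in grid_params.values())
# Each candidate is fit once per cross-validation fold.
n_fits = n_combinations * cv_folds

print(n_combinations, n_fits)  # 560 1680
```

With the default `refit=True`, `GridSearchCV` also fits the best combination one more time on the full training set.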

In [None]:
rf_grid_search = GridSearchCV(rf_clf, grid_params, scoring = 'recall', cv=3,
                              return_train_score=True)

rf_grid_search.fit(X_train, y_train)

In [None]:
# The grid search was fit on scaled features, so predict on the scaled test set.
y_pred_acc = rf_grid_search.predict(X_test_scaled)

In [None]:
print('Recall Score : ' + str(recall_score(y_test,y_pred_acc)))
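
Recall is tracked alongside accuracy because churners are the minority class, so a model can be mostly right overall while still missing many of them. The sketch below works through that arithmetic on hypothetical confusion-matrix counts. The numbers are invented for illustration and are not results from this notebook.

```python
def accuracy_and_recall(tp, fp, tn, fn):
    """Return (accuracy, recall) from confusion-matrix counts."""
    total = tp + fp + tn + fn
    accuracy = (tp + tn) / total
    recall = tp / (tp + fn)  # share of actual churners that were caught
    return accuracy, recall

# Hypothetical counts: 1000 customers, 150 of whom churn; the model catches 60.
acc, rec = accuracy_and_recall(tp=60, fp=10, tn=840, fn=90)
print(f"Accuracy: {acc:.0%}, Recall: {rec:.0%}")  # Accuracy: 90%, Recall: 40%
```

In this made-up case the model is right 90% of the time yet finds only 40% of the churners, which is why the grid search is scored on recall.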

In [None]:
# rf_gs_training_score = np.mean(rf_grid_search.cv_results_['mean_train_score'])

# # Mean test score
# rf_gs_testing_score = rf_grid_search.score(X_test, y_test)

# print(f"Avg. Training Score: {rf_gs_training_score :.2%}")
# print(f"Avg. Test Score: {rf_gs_testing_score :.2%}")
# print('Recall Score : ' + str(recall_score(y_test,y_pred2)))
# # print("Best Param Combo:")
# # rf_grid_search.best_params_


In [None]:
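# Plot the tuned random forest on the scaled test set against the true labels.
plot_confusion_matrix(rf_grid_search.best_estimator_, X_test_scaled, y_test,
                      values_format='.3g', normalize = 'true')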
plt.title('Confusion Matrix')
plt.show()

In [None]:
# With scoring='recall', .score() returns recall on the (scaled) test set, not accuracy.
rf_score = rf_grid_search.score(X_test_scaled, y_test)

print(rf_score*100)

## Conclusion

From our exploration of the dataset we can draw a few conclusions. First, our models performed very well: we found 60% accuracy using our random forest model, 95% accuracy after applying XGBoost, and finally 93% accuracy with our grid search parameters. The other important result we wanted to look at was recall. After reviewing the results, we can confirm that the top 5 most important features are:

- International Plan Number (less likely to churn)
- Total International Calls (less likely to churn)
- Total International Minutes (less likely to churn)
- Customer Service Calls (more likely to churn)
- Voicemail Plan (less likely to churn)

Moving forward, we recommend that our telecommunications client look into increasing the price of its international plan. We found that customers who have the plan are less likely to churn, so there may be an opportunity to make more money. Among customers with an international plan, those with higher minutes and calls are less likely to churn, while those who use it less frequently churn more often, so we'd recommend looking at different pricing options for users with lower minute and call volumes. We'd also recommend reducing the number of customer service calls. A customer's problem should be solved by the first or second call, because the more calls a customer makes, the higher the likelihood that the customer churns. Finally, we'd recommend reviewing the pricing of the voicemail plan as well. There is another opportunity here to lift prices and potentially make more money.
