# Data Science Workflow

## Problem Definition

Using customer data, we want to predict which customers are likely to churn, that is, to cancel the subscription you provide.

### Further Information:
### What is Churn?
Churn measures the number of people who stop subscribing to or buying a product over time. For a subscription service, it is important to estimate how likely each customer is to stop subscribing. In this demo, the churn label is 1 when a customer churns and 0 when the customer does not.
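
As a small illustration of this definition, the sketch below computes a churn rate from a list of 0/1 labels. The label values are made up for the example and are not taken from the demo dataset.

```python
# Hypothetical labels for ten customers: 1 = churned, 0 = stayed.
churn_labels = [0, 0, 1, 0, 0, 0, 1, 0, 0, 0]

# The churn rate is the share of customers whose label is 1.
churn_rate = sum(churn_labels) / len(churn_labels)
print("Churn rate: " + str(churn_rate))
```

Churners are usually the minority class, so a model that always predicts 0 can still reach high accuracy. This is one reason to look at precision, recall, and F1 alongside accuracy.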

### Imports

In [None]:
%matplotlib inline
import scipy
import importlib
import os
import re
import sys
import matplotlib
# numpy, pandas, and pyplot are used below as np, pd, and plt
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from sklearn.utils import shuffle
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score, f1_score, recall_score, auc, precision_score, roc_auc_score, roc_curve
from sklearn.model_selection import cross_val_score
from collections import OrderedDict

## Data Collection

This phase of the project is where we connect to an external data stream such as a SQL database, an S3 bucket, or a simple CSV file so that we can gain access to our data.
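
To make the idea of different data sources concrete, here is a minimal sketch that loads the same small table once from CSV text and once from a SQL database. The table contents, the column names, and the `customers` table name are assumptions for illustration; an in-memory buffer and an in-memory SQLite database stand in for a real file and a real server.

```python
import io
import sqlite3

import pandas as pd

# Hypothetical two-row sample standing in for a real export.
csv_text = "state,custserv_calls,churn\nKS,1,0\nOH,4,1\n"

# Source 1: a CSV file (an in-memory buffer plays the role of the file here).
from_csv = pd.read_csv(io.StringIO(csv_text))

# Source 2: a SQL database (an in-memory SQLite database for the example).
conn = sqlite3.connect(":memory:")
from_csv.to_sql("customers", conn, index=False)
from_sql = pd.read_sql_query("SELECT * FROM customers", conn)
conn.close()

print(from_csv.shape, from_sql.shape)
```

Whichever source is used, the result is a pandas DataFrame, so the later steps do not depend on where the data came from.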

In [None]:
churn_data = pd.read_hdf("C:\\Users\\elijah2352\\Downloads\\churndata.h5")
print(churn_data.head())

## Feature Engineering
One of the most important parts of a data scientist's workflow is creating the right features to feed into the machine learning algorithm. The features must be in the correct format, and they should carry as much information as possible in as few columns as possible, which helps avoid the curse of dimensionality.
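
The three kinds of transformation this usually involves can be sketched on a toy table. The data values below are invented, and `pd.get_dummies` is shown as a compact alternative for creating indicator columns; it is not necessarily how the rest of this notebook does it.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical miniature dataset; the column names mirror the demo data.
toy = pd.DataFrame({
    "State": ["KS", "OH", "KS", "NJ"],
    "Int'l Plan": ["no", "yes", "no", "yes"],
    "Day Mins": [265.1, 161.6, 243.4, 299.4],
})

# Correct format: turn yes/no strings into 1/0 integers.
toy["Int'l Plan"] = toy["Int'l Plan"].map({"yes": 1, "no": 0})

# Comparable scale: standardize numeric columns to zero mean and unit variance.
toy[["Day Mins"]] = StandardScaler().fit_transform(toy[["Day Mins"]])

# Categorical data: one indicator column per state.
toy = pd.get_dummies(toy, columns=["State"], dtype=int)

print(toy.columns.tolist())
```

Note that one-hot encoding adds a column per category, so it works against the goal of keeping the number of features small. High-cardinality columns deserve extra thought.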

In [None]:
# First, copy the data and clean the column names: drop the '?' from the label column
churn_exp = churn_data.copy()
churn_exp.rename(columns={'Churn?': 'Churn'}, inplace=True)

# Next, clean the label values by stripping any periods from them
churn_exp['Churn'] = churn_exp['Churn'].map(lambda x: re.sub('[.]', '', x))

# Now convert the yes/no columns to 1/0. You also need to know which columns have to be changed,
# which can be a chore if you have hundreds of columns.
# The result is assigned back to the column: an inplace replace on a .loc selection
# may modify a temporary copy and leave the DataFrame unchanged.

churn_exp['VMail Plan'] = churn_exp['VMail Plan'].map({'yes': 1, 'no': 0})
churn_exp['Int\'l Plan'] = churn_exp['Int\'l Plan'].map({'yes': 1, 'no': 0})

# Now change the labels that we have to predict into integers:

churn_exp['Churn'] = churn_exp['Churn'].map({'False': 0, 'True': 1})

# Standardize the numeric columns for scale-sensitive algorithms such as Logistic Regression.

from sklearn.preprocessing import StandardScaler
# StandardScaler centers each column on its mean and scales it to unit variance.
# Note: fitting on the full dataset lets test-set statistics leak into training;
# for a stricter evaluation, fit the scaler on the training rows only.
numeric_columns = ['Account Length', 'VMail Message', 'Day Mins', 'Day Calls',
                   'Day Charge', 'Eve Calls', 'Eve Charge', 'Night Mins', 'Eve Mins',
                   'Night Calls', 'Night Charge', 'Intl Mins', 'Intl Calls',
                   'Intl Charge', 'CustServ Calls']
standard_scaler = StandardScaler()
churn_exp[numeric_columns] = standard_scaler.fit_transform(churn_exp[numeric_columns])

# Now we have to do something about the states. We convert them to dummy variables.
state_columns = churn_exp['State'].unique()
for state in state_columns:
    churn_exp.loc[:, state] = (churn_exp.loc[:, 'State'] == state).astype('int')
churn_exp.drop('State', axis=1, inplace=True)

# You would also have to do something with area codes as well
area_code_columns = churn_exp['Area Code'].unique()
for area_code in area_code_columns:
    churn_exp.loc[:, 'Area Code: ' + str(area_code)] = (churn_exp.loc[:, 'Area Code'] == area_code).astype('int')
churn_exp.drop('Area Code', axis=1, inplace=True)

# Now, because each phone number is unique, we can simply drop it.
churn_exp.drop(['Phone'], axis=1, inplace=True)

# The label column was already renamed to 'Churn' at the top of this cell, so just inspect the result.
print(churn_exp.head())

# Writing the code above is only part of the work; you also have to verify that it behaves correctly.
# With Eezzy, the code is already there at your disposal.

## Model Development
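
The cell below shuffles the data and splits it by hand. As a point of comparison, here is a minimal sketch of the same 80/20 split using scikit-learn's `train_test_split`, run on synthetic data so that it is self-contained. The `stratify` option, which keeps the churn ratio the same in both halves, is an addition for illustration and not something the notebook's own split does.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the churn features and labels (roughly 15% positives).
X, y = make_classification(n_samples=200, n_features=8, weights=[0.85], random_state=0)

# Hold out 20% of the rows for testing; fix the seed so the split is repeatable.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, stratify=y, random_state=0
)

print(X_train.shape, X_test.shape)
```

Fixing `random_state` makes reported metrics reproducible between runs, which a bare shuffle does not.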

In [None]:
# Split the data between training (first 80% of the shuffled rows) and testing (remaining 20%).
churn_exp = shuffle(churn_exp)
cutoff_x = int(churn_exp.shape[0]*.80)
# The test set starts at cutoff_x itself; starting at cutoff_x + 1 would silently drop one row.
X_train = churn_exp.drop('Churn', axis=1).iloc[:cutoff_x, :]
Y_train = churn_exp['Churn'].iloc[:cutoff_x]
X_test = churn_exp.drop('Churn', axis=1).iloc[cutoff_x:, :]
Y_test = churn_exp['Churn'].iloc[cutoff_x:]
# Here, you have to know the  

# Decision Trees
from sklearn import tree
DT = tree.DecisionTreeClassifier()
print("Decision Trees")
# The score difference between the highest and lowest model is X_score
# Based on 3 different criteria, the average difference between the average 
DT.fit(X_train, Y_train)
# scoring='roc_auc' returns a score (higher is better), not an error
print("Decision Trees Mean CV ROC AUC: " + str(np.mean(cross_val_score(DT, X_train, Y_train, scoring='roc_auc', cv=10))))
Y_pred = DT.predict(X_test)
# ROC AUC is computed from scores rather than hard labels: use the predicted probability of churn (class 1)
Y_score = DT.predict_proba(X_test)[:, 1]
print("Accuracy: " + str(accuracy_score(Y_test, Y_pred)))
print("F1 Score: " + str(f1_score(Y_test, Y_pred)))
print("ROC AUC Score: " + str(roc_auc_score(Y_test, Y_score)))
print("Precision: " + str(precision_score(Y_test, Y_pred)))
print("Recall: " + str(recall_score(Y_test, Y_pred)))

# Bar chart of the test-set metrics
objects = ("Accuracy", "F1 Score", "ROC AUC", "Precision", "Recall")
y_pos = np.arange(len(objects))
performance = [accuracy_score(Y_test, Y_pred), f1_score(Y_test, Y_pred), roc_auc_score(Y_test, Y_score),
    precision_score(Y_test, Y_pred), recall_score(Y_test, Y_pred)]

plt.bar(y_pos, performance, align='center', alpha=0.5)
plt.xticks(y_pos, objects)
plt.ylabel('Score')
plt.title('Decision Tree: test-set metrics')

plt.show()

# Logistic Regression
print("Logistic Regression")
from sklearn.linear_model import LogisticRegression
logistic = LogisticRegression(penalty='l2', C=1e5)
logistic.fit(X_train, Y_train)
# scoring='roc_auc' returns a score (higher is better), not an error
print("Logistic Regression Mean CV ROC AUC: " + str(np.mean(cross_val_score(logistic,
                                                                            X_train, Y_train, scoring='roc_auc', cv=10))))
Y_pred = logistic.predict(X_test)
# ROC AUC is computed from scores rather than hard labels: use the predicted probability of churn (class 1)
Y_score = logistic.predict_proba(X_test)[:, 1]
print("Accuracy: " + str(accuracy_score(Y_test, Y_pred)))
print("F1 Score: " + str(f1_score(Y_test, Y_pred)))
print("ROC AUC Score: " + str(roc_auc_score(Y_test, Y_score)))
print("Precision: " + str(precision_score(Y_test, Y_pred)))
print("Recall: " + str(recall_score(Y_test, Y_pred)))

# Bar chart of the test-set metrics
objects = ("Accuracy", "F1 Score", "ROC AUC", "Precision", "Recall")
y_pos = np.arange(len(objects))
performance = [accuracy_score(Y_test, Y_pred), f1_score(Y_test, Y_pred), roc_auc_score(Y_test, Y_score),
    precision_score(Y_test, Y_pred), recall_score(Y_test, Y_pred)]

plt.bar(y_pos, performance, align='center', alpha=0.5)
plt.xticks(y_pos, objects)
plt.ylabel('Score')
plt.title('Logistic Regression: test-set metrics')

plt.show()

# Random Forests
print("Random Forests")
from sklearn.ensemble import RandomForestClassifier
rf = RandomForestClassifier(n_estimators=10)
rf.fit(X_train, Y_train)
# scoring='roc_auc' returns a score (higher is better), not an error
print("Random Forests Mean CV ROC AUC: " + str(np.mean(cross_val_score(rf,
                                                                       X_train, Y_train, scoring='roc_auc', cv=10))))
Y_pred = rf.predict(X_test)
# ROC AUC is computed from scores rather than hard labels: use the predicted probability of churn (class 1)
Y_score = rf.predict_proba(X_test)[:, 1]
print("Accuracy: " + str(accuracy_score(Y_test, Y_pred)))
print("F1 Score: " + str(f1_score(Y_test, Y_pred)))
print("ROC AUC Score: " + str(roc_auc_score(Y_test, Y_score)))
print("Precision: " + str(precision_score(Y_test, Y_pred)))
print("Recall: " + str(recall_score(Y_test, Y_pred)))

# Bar chart of the test-set metrics
objects = ("Accuracy", "F1 Score", "ROC AUC", "Precision", "Recall")
y_pos = np.arange(len(objects))
performance = [accuracy_score(Y_test, Y_pred), f1_score(Y_test, Y_pred), roc_auc_score(Y_test, Y_score),
    precision_score(Y_test, Y_pred), recall_score(Y_test, Y_pred)]

plt.bar(y_pos, performance, align='center', alpha=0.5)
plt.xticks(y_pos, objects)
plt.ylabel('Score')
plt.title('Random Forest: test-set metrics')

plt.show()

# K Nearest Neighbors
from sklearn.neighbors import KNeighborsClassifier
print("K Nearest Neighbors")
knn = KNeighborsClassifier(n_neighbors=5, weights='uniform')
knn.fit(X_train, Y_train)
# n_neighbors=5 is not tuned; finding the optimal parameters would require a grid search or random search
# scoring='roc_auc' returns a score (higher is better), not an error
print("K Nearest Neighbors Mean CV ROC AUC: " + str(np.mean(cross_val_score(knn,
                                                                            X_train, Y_train, scoring='roc_auc', cv=10))))
Y_pred = knn.predict(X_test)
# ROC AUC is computed from scores rather than hard labels: use the predicted probability of churn (class 1)
Y_score = knn.predict_proba(X_test)[:, 1]
print("Accuracy: " + str(accuracy_score(Y_test, Y_pred)))
print("F1 Score: " + str(f1_score(Y_test, Y_pred)))
print("ROC AUC Score: " + str(roc_auc_score(Y_test, Y_score)))
print("Precision: " + str(precision_score(Y_test, Y_pred)))
print("Recall: " + str(recall_score(Y_test, Y_pred)))

# Bar chart of the test-set metrics
objects = ("Accuracy", "F1 Score", "ROC AUC", "Precision", "Recall")
y_pos = np.arange(len(objects))
performance = [accuracy_score(Y_test, Y_pred), f1_score(Y_test, Y_pred), roc_auc_score(Y_test, Y_score),
    precision_score(Y_test, Y_pred), recall_score(Y_test, Y_pred)]

plt.bar(y_pos, performance, align='center', alpha=0.5)
plt.xticks(y_pos, objects)
plt.ylabel('Score')
plt.title('K Nearest Neighbors: test-set metrics')

plt.show()

In [None]:
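from sklearn.model_selection import learning_curve
# Learning curve: score the KNN model on growing fractions (10% to 90%) of the training data.
# np.linspace gives exact endpoints, avoiding floating-point surprises from np.arange with a float step.
train_sizes, train_scores, valid_scores = learning_curve(knn, X_train, Y_train, train_sizes=np.linspace(0.1, 0.9, 9), cv=10)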



#TODO: Probability Charts
#TODO: Correlation Charts (Done, but needs to be done in machine learning graphs)

In [None]:
tbs_plot.ml_plot_learning_curve(train_scores, valid_scores)
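
`tbs_plot` is never imported or defined in this notebook, so the call above relies on a helper that is not shown here. As a rough stand-in, the sketch below plots a learning curve with plain matplotlib. The function `plot_learning_curve` is hypothetical and is not the `tbs_plot` implementation, and the data is synthetic so that the example runs on its own.

```python
import matplotlib.pyplot as plt
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import learning_curve
from sklearn.neighbors import KNeighborsClassifier


def plot_learning_curve(train_sizes, train_scores, valid_scores):
    """Plot mean training and validation scores against training-set size.

    Hypothetical helper; not the tbs_plot implementation.
    """
    fig, ax = plt.subplots()
    # Each row holds the scores of all CV folds for one training-set size.
    ax.plot(train_sizes, train_scores.mean(axis=1), marker="o", label="Training score")
    ax.plot(train_sizes, valid_scores.mean(axis=1), marker="o", label="Validation score")
    ax.set_xlabel("Training examples")
    ax.set_ylabel("Score")
    ax.set_title("Learning curve")
    ax.legend()
    return ax


# Synthetic data so the sketch runs on its own.
X, y = make_classification(n_samples=300, n_features=8, random_state=0)
sizes, train_scores, valid_scores = learning_curve(
    KNeighborsClassifier(n_neighbors=5), X, y,
    train_sizes=np.linspace(0.2, 1.0, 5), cv=5,
)
ax = plot_learning_curve(sizes, train_scores, valid_scores)
plt.show()
```

A persistent gap between the two lines suggests overfitting, while two low lines that have converged suggest that more data alone is unlikely to help.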