# Data Science Workflow

## Problem Definition: Using customer data, we want to predict which customers are likely to churn, that is, to cancel their subscription.

### Further Information:
### What is Churn?
Churn is a measure of the number of people who stop subscribing to or buying a product over time. When consumers subscribe to a service, it is important to measure how likely they are to stop subscribing. In this demo, the churn label holds the value 1 when a customer churns and 0 when a customer does not.
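
As a small illustration of the 1/0 encoding, the sketch below computes a churn rate from a list of labels. The `churn_rate` helper and the sample labels are hypothetical; they are not part of Eezzy or of the churn dataset.

```python
def churn_rate(labels):
    """Return the fraction of customers whose churn label is 1."""
    if not labels:
        raise ValueError("labels must not be empty")
    return sum(labels) / len(labels)


# Hypothetical labels: 1 = churned, 0 = stayed.
labels = [0, 0, 1, 0, 1, 0, 0, 0, 0, 0]
rate = churn_rate(labels)
print(f"Churn rate: {rate:.0%}")
```

Because the label is 0 or 1, the mean of the label column is the churn rate, which is one reason this encoding is convenient.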

### Imports

<style>
th {
background-color:#55FF33;
}
td {
background-color:#00FFFF;
}
</style>

In [None]:
%matplotlib inline
import sys
sys.path.append("C:\\Users\\elijah2352\\Documents\\GitHub\\eezzy")
from imports import *
# Concern with this data science workflow: I don't want to lock the user into a specified workflow with Eezzy.
# Eezzy should be a flexible tool, not one that locks you into a workflow.

## Data Collection

This phase of the project is where we connect to an external data stream such as a SQL database, an S3 bucket, or a simple CSV file so that we can gain access to our data.
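
To make the idea of interchangeable data sources concrete, here is a minimal sketch that reads the same records from CSV text and from a SQL database using only the Python standard library. The helper names, the table name `customers`, and the sample rows are all assumptions made for illustration; the notebook itself loads an HDF5 file with pandas.

```python
import csv
import io
import sqlite3

# Hypothetical rows standing in for a real customer table.
CSV_TEXT = "Phone,Day Mins,Churn?\n555-0100,265.1,False.\n555-0101,161.6,True.\n"


def rows_from_csv(text):
    """Parse CSV text into a list of dicts keyed by column name."""
    return list(csv.DictReader(io.StringIO(text)))


def rows_from_sql(connection, query):
    """Run a query and return the rows as a list of dicts."""
    connection.row_factory = sqlite3.Row
    return [dict(row) for row in connection.execute(query)]


csv_rows = rows_from_csv(CSV_TEXT)

# Load the same rows into an in-memory SQLite database and read them back.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (phone TEXT, day_mins REAL, churn TEXT)")
conn.executemany(
    "INSERT INTO customers VALUES (?, ?, ?)",
    [(r["Phone"], float(r["Day Mins"]), r["Churn?"]) for r in csv_rows],
)
sql_rows = rows_from_sql(conn, "SELECT * FROM customers")
conn.close()

print(len(csv_rows), len(sql_rows))
```

Both helpers return the same shape (a list of dicts), so later steps do not need to know which source the data came from.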

In [None]:
churn_data = pd.read_hdf("C:\\Users\\elijah2352\\Downloads\\churndata.h5")
#print(churn_data.head())
# Outlier check
tbs_summary.summarize_data(churn_data, 'Churn?')

## Data Preparation: Model Development

Here, we apply aggregations, filtering, and variable conversions to put the data into a format that a machine learning algorithm can understand. This is a bit different from 

In [None]:
churn_exp = churn_data.copy()
eezzy_data.eezzy_check(data=churn_exp, prediction='Churn?', clean_columns=True)

churn_exp = cleaner.eezzy_clean(churn_exp, 'Churn', regex='[.]')
tbs_transform.dummy_variable(data=churn_exp, column='State', in_column=False)
tbs_transform.area_code_converter(data=churn_exp, column='Area Code', region_convert=False)
tbs_transform.dummy_variable(data=churn_exp, column='State Area Code',
                             in_column=True)

tbs_transform.dummy_variable(data=churn_exp, column=["Int'l Plan", "VMail Plan"],
                             in_column=True, transform_keys={'yes': 1, 'no': 0})
tbs_transform.dummy_variable(data=churn_exp, column=["Churn"],
                             in_column=True, transform_keys={'True.': 1, 'False.': 0})
tbs_transform.stratified_sampling(data=churn_exp, prediction='Churn',
                                  sample_split=[.80, .20], class_imbal_tol=.10)
churn_exp.drop(['Phone'], axis=1, inplace=True)
churn_exp.drop(['State Area Code'], axis=1, inplace=True)
print(churn_exp.head())
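
The cell above relies on Eezzy helpers whose internals are not shown here. As a rough, standalone sketch of the underlying ideas (value mapping, dummy variables, and an 80/20 stratified split), the code below uses plain Python. The function names `map_values`, `one_hot`, and `stratified_split` and the sample rows are hypothetical, and this is not a description of how `tbs_transform` actually works.

```python
import random
from collections import defaultdict


def map_values(rows, column, mapping):
    """Replace the values in one column using a lookup such as {'yes': 1, 'no': 0}."""
    return [{**row, column: mapping[row[column]]} for row in rows]


def one_hot(rows, column):
    """Replace a categorical column with one 0/1 indicator column per category."""
    categories = sorted({row[column] for row in rows})
    encoded = []
    for row in rows:
        new_row = {k: v for k, v in row.items() if k != column}
        for category in categories:
            new_row[f"{column}_{category}"] = int(row[column] == category)
        encoded.append(new_row)
    return encoded


def stratified_split(rows, label, train_fraction=0.80, seed=0):
    """Split rows into train and test sets, preserving the class ratio of `label`."""
    by_class = defaultdict(list)
    for row in rows:
        by_class[row[label]].append(row)
    rng = random.Random(seed)
    train, test = [], []
    for members in by_class.values():
        members = members[:]  # copy so the caller's lists are not shuffled
        rng.shuffle(members)
        cut = round(len(members) * train_fraction)
        train.extend(members[:cut])
        test.extend(members[cut:])
    return train, test


# Hypothetical data: 100 customers, 20 of whom churned.
rows = [
    {"State": "KS" if i % 2 else "OH",
     "VMail Plan": "yes" if i % 3 else "no",
     "Churn": "True." if i < 20 else "False."}
    for i in range(100)
]
rows = map_values(rows, "VMail Plan", {"yes": 1, "no": 0})
rows = map_values(rows, "Churn", {"True.": 1, "False.": 0})
rows = one_hot(rows, "State")
train, test = stratified_split(rows, "Churn", train_fraction=0.80)
print(len(train), len(test))
```

Splitting each class separately keeps the churn rate the same in both sets, which matters when churners are a small minority of the data.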

## Model Development and Error Analysis

In [None]:
# Eezzy ML

# - Need to see if client-side monitoring is possible. This would make it easier to discern what data should go where.
#   If we can see the predictor variable specified in one of our functions, then we can fill in details like that.
#   If we see multiple data sources, then we can map the predictor variable to the data source if they've used an
#   Eezzy function and update it on the fly by re-displaying the output.
#   It may also allow us to automatically discern the training/testing split.
#   Maybe call it eezzy_monitor and have it act as a separate service.
# - For AUC, tell the user that skewed data may have a negative effect on the validity of the AUC.
# - Need a way to do error analysis: first decrease the dimensionality of the data, then plot the points with
#   each class in a different color.
# - Need a way for the tool to help with ensembling the methods if the user wants.
# - Need to also graph the clusters and add them as a feature.
# - Need to show the dimensionality reduction features as well.
# - Need to add an elbow curve method to the clustering part.
# - Optimize only the best hyperparameters for the job first, if the user specifies this. We should also show the
#   progress of each job as a plotly plot.
# - Need to add a distribution of the predicted probabilities.
# - Need cross-validation scores vs. test set scores. Maybe just show the differences between the two as a faster method.
# - Should probably also plot the average prediction times.
# - Need to think about adding an exploratory data analysis method.
# - Aggregations will be key. Faster mapping from aggregations to data is what will enable faster analysis.
# - While going through Random Forests or Decision Trees, I wonder if it would be beneficial to add a hint like
#   "Maybe adding more categorical features would increase the accuracy of the model if you're using Decision Trees".
#   Also, give hints such as the ability to tune the precision/recall tradeoff.
# - Maybe let the user pass in their own training and testing splits.
tbs_ml.eezzy_ml(X=churn_exp, prediction_feature='Churn')
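
The note above about AUC and skewed data could be prototyped along these lines. This is only a sketch in plain Python: `class_balance`, `roc_auc`, the sample labels and scores, and the 0.25 warning threshold are all hypothetical and are not part of `tbs_ml`. The AUC is computed from its pairwise definition, the probability that a randomly chosen positive example scores higher than a randomly chosen negative one.

```python
def class_balance(labels):
    """Return the share of the minority class in a list of 0/1 labels."""
    positive_share = sum(labels) / len(labels)
    return min(positive_share, 1 - positive_share)


def roc_auc(labels, scores):
    """Compute ROC AUC as the chance that a random positive outranks a random negative."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one example of each class")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half a win
    return wins / (len(positives) * len(negatives))


# Hypothetical labels and predicted churn probabilities.
labels = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
scores = [0.1, 0.2, 0.15, 0.3, 0.05, 0.4, 0.25, 0.7, 0.6, 0.9]

auc = roc_auc(labels, scores)
minority = class_balance(labels)
print(f"AUC: {auc:.3f}, minority share: {minority:.2f}")
if minority < 0.25:  # hypothetical tolerance, chosen only for this example
    print("Warning: classes are skewed; interpret the AUC with care.")
```

The pairwise loop is quadratic in the number of examples, so it suits small checks like this one; a rank-based implementation would be a better fit for large datasets.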