# DS Automation Assignment

Using our prepared churn data from week 2:
- use pycaret to find an ML algorithm that performs best on the data
    - Choose a metric you think is best to use for finding the best model; by default, it is accuracy but it could be AUC, precision, recall, etc. The week 3 FTE has some information on these different metrics.
- save the model to disk
- create a Python script/file/module with a function that takes a pandas dataframe as an input and returns the probability of churn for each row in the dataframe
    - your Python file/function should print out the predictions for new data (new_churn_data.csv)
    - the true values for the new data are [1, 0, 0, 1, 0] if you're interested
- test your Python module and function with the new data, new_churn_data.csv
- write a short summary of the process and results at the end of this notebook
- upload this Jupyter Notebook and Python file to a GitHub repository, and turn in a link to the repository in the week 5 assignment dropbox

*Optional* challenges:
- return the probability of churn for each new prediction, and the percentile where that prediction is in the distribution of probability predictions from the training dataset (e.g. a high probability of churn like 0.78 might be at the 90th percentile)
- use other autoML packages, such as TPOT, H2O, MLBox, etc, and compare performance and features with pycaret
- create a class in your Python module to hold the functions that you created
- accept user input to specify a file using a tool such as Python's `input()` function, the `click` package for command-line arguments, or a GUI
- Use the unmodified churn data (new_unmodified_churn_data.csv) in your Python script. This will require adding the same preprocessing steps from week 2 since this data is like the original unmodified dataset from week 1.
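
For the optional percentile challenge, one possible approach is to keep the sorted churn probabilities from the training set and look up where each new prediction falls. This is a minimal sketch, not a required solution: the function name `percentile_of` and the sample probabilities are hypothetical and only there for illustration.

```python
from bisect import bisect_right


def percentile_of(value, reference_scores):
    """Return the percentage of reference scores that are <= value."""
    ordered = sorted(reference_scores)
    if not ordered:
        raise ValueError("reference_scores must not be empty")
    # bisect_right counts how many sorted scores are <= value.
    return 100.0 * bisect_right(ordered, value) / len(ordered)


# Hypothetical training-set churn probabilities, for illustration only.
train_scores = [0.05, 0.10, 0.20, 0.35, 0.50, 0.60, 0.70, 0.80, 0.90, 0.95]

for p in [0.78, 0.12]:
    print(f"p(churn)={p:.2f} -> percentile {percentile_of(p, train_scores):.0f}")
```

In practice `train_scores` would come from running the saved model on the training data, so the percentiles depend on that distribution rather than on the made-up values here.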

In [2]:
'''Load the cleaned data into a dataframe. As per discussion in previous assignments, 
I am dropping a feature I created because the logic behind the feature was unsound.'''
import pandas as pd
df=pd.read_csv(r'C:\Users\Madeline\Desktop\School\Regis\IDS\cleaned_churn_data', index_col='customerID')
df_1=df.drop('num_phone_times_num_churn', axis=1)
df_1

customerID,tenure,PhoneService,Contract,PaymentMethod,MonthlyCharges,TotalCharges,Churn,tenure_charges_ratio
7590-VHVEG,1,0,0,0,29.85,29.85,0,29.850000
5575-GNVDE,34,1,1,1,56.95,1889.50,0,55.573529
3668-QPYBK,2,1,0,1,53.85,108.15,1,54.075000
7795-CFOCW,45,0,1,2,42.30,1840.75,0,40.905556
9237-HQITU,2,1,0,0,70.70,151.65,1,75.825000
...,...,...,...,...,...,...,...,...
6840-RESVB,24,1,1,1,84.80,1990.50,0,82.937500
2234-XADUH,72,1,1,3,103.20,7362.90,0,102.262500
4801-JZAZL,11,0,0,0,29.60,346.45,0,31.495455
8361-LTMKD,4,1,0,1,74.40,306.60,1,76.650000


In [None]:
## Only needs to be run if pycaret is not yet installed. I had to roll scikit-learn back to version 0.23.2 for pycaret to work.
!conda install -c conda-forge pycaret -y

In [3]:
## Import the needed functions from pycaret
from pycaret.classification import setup, compare_models, predict_model, load_model, save_model

In [5]:
##Set up automl and verify data types
automl = setup(df_1, target='Churn', fold_shuffle=True)

Unnamed: 0,Description,Value
0,session_id,7882
1,Target,Churn
2,Target Type,Binary
3,Label Encoded,"0: 0, 1: 1"
4,Original Data,"(7032, 8)"
5,Missing Values,0
6,Numeric Features,4
7,Categorical Features,3
8,Ordinal Features,0
9,High Cardinality Features,0


In [8]:
## Visually verify the preprocessed data
automl[8]

customerID,tenure,PhoneService,Contract,PaymentMethod,MonthlyCharges,TotalCharges,Churn,tenure_charges_ratio
7590-VHVEG,1,0,0,0,29.85,29.85,0,29.850000
5575-GNVDE,34,1,1,1,56.95,1889.50,0,55.573529
3668-QPYBK,2,1,0,1,53.85,108.15,1,54.075000
7795-CFOCW,45,0,1,2,42.30,1840.75,0,40.905556
9237-HQITU,2,1,0,0,70.70,151.65,1,75.825000
...,...,...,...,...,...,...,...,...
6840-RESVB,24,1,1,1,84.80,1990.50,0,82.937500
2234-XADUH,72,1,1,3,103.20,7362.90,0,102.262500
4801-JZAZL,11,0,0,0,29.60,346.45,0,31.495455
8361-LTMKD,4,1,0,1,74.40,306.60,1,76.650000


In [9]:
##Run models to find the best.
best_model = compare_models()

Unnamed: 0,Model,Accuracy,AUC,Recall,Prec.,F1,Kappa,MCC,TT (Sec)
lr,Logistic Regression,0.7946,0.8364,0.5284,0.6438,0.5797,0.4456,0.4499,0.667
gbc,Gradient Boosting Classifier,0.792,0.8392,0.4981,0.6476,0.5618,0.4286,0.4357,0.165
ada,Ada Boost Classifier,0.7909,0.8376,0.4867,0.6509,0.5551,0.4223,0.4311,0.089
lda,Linear Discriminant Analysis,0.7897,0.8271,0.5345,0.6271,0.5764,0.4379,0.4408,0.011
ridge,Ridge Classifier,0.7877,0.0,0.4693,0.6437,0.542,0.4083,0.4175,0.041
catboost,CatBoost Classifier,0.7875,0.8326,0.4943,0.6348,0.5552,0.4184,0.4245,1.343
lightgbm,Light Gradient Boosting Machine,0.783,0.8267,0.5041,0.6174,0.5544,0.413,0.4171,0.096
xgboost,Extreme Gradient Boosting,0.7753,0.8156,0.5019,0.597,0.5446,0.3971,0.4001,0.21
knn,K Neighbors Classifier,0.7615,0.7479,0.4435,0.5715,0.4984,0.3455,0.3508,0.027
rf,Random Forest Classifier,0.7554,0.7913,0.4655,0.5528,0.5051,0.3443,0.3467,0.171


Logistic regression (LR) came in as the best performer in 3 out of 7 categories. Naive Bayes (NB, which is not among the rows displayed above) shows much higher recall, so that model could be run as well if recall is more important than accuracy. For the sake of the assignment, I moved forward with LR.
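
To illustrate how the choice of metric changes which model looks best, here is a small sketch that ranks the rows displayed in the comparison table above by a chosen metric. The scores are copied from that output; the names `METRICS`, `ROWS`, and `best_by` are hypothetical. Only the ten displayed rows are included, so models cut from the display (such as NB) are not ranked.

```python
# Cross-validated scores copied from the compare_models() rows shown above.
METRICS = ("Accuracy", "AUC", "Recall", "Prec.", "F1")
ROWS = {
    "lr": (0.7946, 0.8364, 0.5284, 0.6438, 0.5797),
    "gbc": (0.7920, 0.8392, 0.4981, 0.6476, 0.5618),
    "ada": (0.7909, 0.8376, 0.4867, 0.6509, 0.5551),
    "lda": (0.7897, 0.8271, 0.5345, 0.6271, 0.5764),
    "ridge": (0.7877, 0.0, 0.4693, 0.6437, 0.5420),
    "catboost": (0.7875, 0.8326, 0.4943, 0.6348, 0.5552),
    "lightgbm": (0.7830, 0.8267, 0.5041, 0.6174, 0.5544),
    "xgboost": (0.7753, 0.8156, 0.5019, 0.5970, 0.5446),
    "knn": (0.7615, 0.7479, 0.4435, 0.5715, 0.4984),
    "rf": (0.7554, 0.7913, 0.4655, 0.5528, 0.5051),
}


def best_by(metric):
    """Return the model id with the highest score for the given metric."""
    column = METRICS.index(metric)
    return max(ROWS, key=lambda name: ROWS[name][column])


for metric in METRICS:
    print(metric, "->", best_by(metric))
```

Within pycaret itself, `compare_models` accepts a `sort` argument (for example `sort='Recall'`) that orders the table by a metric other than accuracy.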

In [10]:
##verify best model matched the computation above
best_model

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=1000,
                   multi_class='auto', n_jobs=None, penalty='l2',
                   random_state=7882, solver='lbfgs', tol=0.0001, verbose=0,
                   warm_start=False)

In [12]:
predict_model(best_model, df_1.iloc[-2:-1])

customerID,tenure,PhoneService,Contract,PaymentMethod,MonthlyCharges,TotalCharges,Churn,tenure_charges_ratio,Label,Score
8361-LTMKD,4,1,0,1,74.4,306.6,1,76.65,1,0.5339


In [13]:
##Save model for future use
save_model(best_model, 'LR')

Transformation Pipeline and Model Succesfully Saved


(Pipeline(memory=None,
          steps=[('dtypes',
                  DataTypes_Auto_infer(categorical_features=[],
                                       display_types=True, features_todrop=[],
                                       id_columns=[],
                                       ml_usecase='classification',
                                       numerical_features=[], target='Churn',
                                       time_features=[])),
                 ('imputer',
                  Simple_Imputer(categorical_strategy='not_available',
                                 fill_value_categorical=None,
                                 fill_value_numerical=None,
                                 numeric_strate...
                 ('feature_select', 'passthrough'), ('fix_multi', 'passthrough'),
                 ('dfs', 'passthrough'), ('pca', 'passthrough'),
                 ['trained_model',
                  LogisticRegression(C=1.0, class_weight=None, dual=False,
                 

In [14]:
loaded_LR = load_model('LR')

Transformation Pipeline and Model Successfully Loaded


In [15]:
##Using saved model to run a prediction on data already used in the model, to verify everything is working
new_data = df_1.iloc[-2:-1].copy()
new_data.drop('Churn', axis=1, inplace=True)
predict_model(loaded_LR, new_data)

customerID,tenure,PhoneService,Contract,PaymentMethod,MonthlyCharges,TotalCharges,tenure_charges_ratio,Label,Score
8361-LTMKD,4,1,0,1,74.4,306.6,76.65,1,0.5339


In [16]:
## Developed model as a .py, verifying code is correct and logical
from IPython.display import Code
Code('LR_mod.py')

In [17]:
## Tested model on new data
%run LR_mod.py

Transformation Pipeline and Model Successfully Loaded
predictions:
customerID
9305-CKSKC    No Churn
1452-KNGVK       Churn
6723-OKKJM    No Churn
7832-POPKP    No Churn
6348-TACGU       Churn
Name: Churn_prediction, dtype: object
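
The output above reports labels, while the assignment asks for the probability of churn for each row. The following is a minimal sketch of a function that returns probabilities instead. It is not the content of LR_mod.py; the function names are hypothetical. It assumes that `predict_model` returns `Label` and `Score` columns (as in the outputs earlier in this notebook) and that `Score` is the probability of the predicted label, which may differ between pycaret versions and should be checked.

```python
def churn_probability(label, score):
    """Convert one (Label, Score) pair into P(churn).

    Assumption: Score is the probability of the predicted Label,
    so a 'no churn' prediction (label 0) implies P(churn) = 1 - Score.
    """
    score = float(score)
    return score if int(label) == 1 else 1.0 - score


def predict_churn_probabilities(df, model_name="LR"):
    """Load the saved pycaret model and return P(churn) for each row of df."""
    # Imported lazily so the helper above can be used without pycaret installed.
    from pycaret.classification import load_model, predict_model

    model = load_model(model_name)
    predictions = predict_model(model, data=df)
    # Apply the conversion row by row; the result keeps the DataFrame index.
    return predictions.apply(
        lambda row: churn_probability(row["Label"], row["Score"]), axis=1
    )
```

Returning a Series indexed by customerID keeps each probability tied to its customer, which makes it easier to compare against the true values.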


Since the true results should have been [1, 0, 0, 1, 0] and I received [0, 1, 0, 0, 1], only one of the five predictions was correct, so this model did not do a great job on the new data set. One option would be to go back to the compare_models results and pick another model to run. Another would be to go back to the original data set to make sure nothing was overlooked or changed during cleaning.
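
A quick way to quantify the mismatch is to compute the accuracy directly from the two lists quoted above. The sketch below also checks the inverted predictions as a sanity test: if flipping the labels agrees much better with the truth, that could hint at a label-mapping problem in the script rather than in the model. With only five rows this is weak evidence, and the `accuracy` helper is a hypothetical name.

```python
true_values = [1, 0, 0, 1, 0]  # true labels given in the assignment
predicted = [0, 1, 0, 0, 1]    # Churn = 1, No Churn = 0, from the output above


def accuracy(y_true, y_pred):
    """Fraction of positions where the prediction equals the true label."""
    matches = sum(t == p for t, p in zip(y_true, y_pred))
    return matches / len(y_true)


# Invert every prediction to test for a possible label-mapping mix-up.
flipped = [1 - p for p in predicted]

print("accuracy as predicted:", accuracy(true_values, predicted))
print("accuracy if flipped:  ", accuracy(true_values, flipped))
```

If the flipped accuracy is the higher of the two, it would be worth confirming how the script maps the model's 0/1 output to "Churn" and "No Churn" before changing models.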

The remaining cells in this notebook are optional exploration of another autoML package, TPOT. I noted that TPOT required splitting the input into training and testing data, which is not advisable with new_churn_data because it is so small. I have not yet worked out how to train the model on the cleaned_churn_data set and then predict on the new_churn_data.
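
One possible way to handle the open question above is to fit on the cleaned data only and then call the fitted optimizer on the new rows, without splitting the new data at all. This is a sketch under assumptions, not a tested solution: the function names are hypothetical, the new file is assumed to contain the same feature columns as the training data, and `predict_proba` only works when the pipeline TPOT selects supports it.

```python
def churn_probabilities(fitted_model, new_features):
    """Return P(class 1) per row for any fitted model exposing predict_proba."""
    # Column 1 of predict_proba corresponds to class 1 when the classes are [0, 1].
    return [float(row[1]) for row in fitted_model.predict_proba(new_features)]


def demo_with_tpot(train_csv, new_csv, target="Churn",
                   drop_columns=("num_phone_times_num_churn",)):
    """Hypothetical end-to-end run: fit on the cleaned data, score the new data."""
    # Imported lazily so the helper above works without pandas or tpot installed.
    import pandas as pd
    from tpot import TPOTClassifier

    train = pd.read_csv(train_csv, index_col="customerID")
    new = pd.read_csv(new_csv, index_col="customerID")

    # Remove the target and any leaky engineered features before fitting.
    features = train.drop(columns=[target, *drop_columns], errors="ignore")

    optimizer = TPOTClassifier(generations=5, population_size=20, cv=5,
                               random_state=42, verbosity=2)
    optimizer.fit(features, train[target])

    # Reuse the training column order so the new rows line up with the model.
    return churn_probabilities(optimizer, new[features.columns])
```

A held-out test split is still useful for estimating performance on the cleaned data, but it is not needed on the new file just to produce predictions.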

In [None]:
conda install -c conda-forge tpot

In [None]:
from tpot import TPOTClassifier

In [None]:
from sklearn.model_selection import train_test_split
import pandas as pd

In [None]:
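## Drop the leaky engineered feature here as well, as was done before the pycaret run
df_1 = pd.read_csv(r'C:\Users\Madeline\Downloads\cleaned_churn_data', index_col='customerID').drop(columns='num_phone_times_num_churn', errors='ignore')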

In [None]:
features = df_1.drop('Churn', axis=1)
targets = df_1['Churn']
x_train, x_test, y_train, y_test = train_test_split(features, targets, stratify=targets, random_state=42)

In [None]:
model_optimizer = TPOTClassifier(generations=5, population_size=20, cv=5, random_state=42, verbosity=2)

In [None]:
model_optimizer.fit(x_train, y_train)
print(model_optimizer.score(x_test, y_test))

In [None]:
model_optimizer.export('tpot_model.py')

In [None]:
from IPython.display import Code
Code('tpot_model.py')

In [None]:
%run tpot_model.py

# Summary

I began by loading my cleaned data into a pandas DataFrame. I chose to drop a feature I had created because, upon further review, the logic was unsound: the feature included the target data, which was falsely boosting the accuracy of my models. I then needed to install pycaret to run an autoML comparison, and I had to roll scikit-learn back to version 0.23.2 for it to work. I did this in a virtual environment to prevent any issues with my base environment. Once it was working, pycaret preprocessed my data, and running the autoML comparison identified logistic regression as the best fit. This matches similar findings from earlier assignments. I then saved the model and wrote a .py file to load and use the model with any new churn data that is supplied. Upon running the model on the new data, I found that its predictions were not very accurate compared to the true values (only one of the five matched); additional work on either the model or the data is likely required for further refinement.
I spent some optional time exploring TPOT, another autoML package. Running the optimizer required splitting the data into training and testing sets. The exported model script also expects train and test sets, which I feel is inadvisable for new_churn_data because it is so small. More exploration and understanding of TPOT would be required to develop a usable, deployable model.