# DS Automation Assignment

Using our prepared churn data from week 2:
- use pycaret to find an ML algorithm that performs best on the data
    - Choose a metric you think is best to use for finding the best model; by default, it is accuracy but it could be AUC, precision, recall, etc. The week 3 FTE has some information on these different metrics.
- save the model to disk
- create a Python script/file/module with a function that takes a pandas dataframe as an input and returns the probability of churn for each row in the dataframe
    - your Python file/function should print out the predictions for new data (new_churn_data.csv)
    - the true values for the new data are [1, 0, 0, 1, 0] if you're interested
- test your Python module and function with the new data, new_churn_data.csv
- write a short summary of the process and results at the end of this notebook
- upload this Jupyter Notebook and Python file to a GitHub repository, and turn in a link to the repository in the week 5 assignment dropbox

*Optional* challenges:
- return the probability of churn for each new prediction, and the percentile where that prediction is in the distribution of probability predictions from the training dataset (e.g. a high probability of churn like 0.78 might be at the 90th percentile)
- use other autoML packages, such as TPOT, H2O, MLBox, etc, and compare performance and features with pycaret
- create a class in your Python module to hold the functions that you created
- accept user input to specify a file using a tool such as Python's `input()` function, the `click` package for command-line arguments, or a GUI
- Use the unmodified churn data (new_unmodified_churn_data.csv) in your Python script. This will require adding the same preprocessing steps from week 2 since this data is like the original unmodified dataset from week 1.
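
For the optional percentile challenge above, one possible approach is to keep the churn probabilities predicted on the training set and look up where a new probability falls among them. The sketch below is only an illustration: `percentile_of` is a hypothetical helper name and `train_probs` holds made-up values, not results from this notebook.

```python
from bisect import bisect_right


def percentile_of(score, training_scores):
    """Return the percentage of training scores that are <= score."""
    ordered = sorted(training_scores)
    return 100.0 * bisect_right(ordered, score) / len(ordered)


# Hypothetical churn probabilities predicted on the training set
# (made-up numbers, for illustration only).
train_probs = [0.05, 0.10, 0.15, 0.22, 0.30, 0.41, 0.55, 0.63, 0.78, 0.91]

print(percentile_of(0.78, train_probs))  # 90.0
```

With real data, `train_probs` would be the churn probabilities the saved model produces for the training rows.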

## Load data

In [1]:
import pandas as pd

df = pd.read_csv('data/prepared_churn_data.csv', index_col='customerID')
df

customerID,tenure,PhoneService,Contract,PaymentMethod,MonthlyCharges,TotalCharges,Churn,NoContract,ElectronicCheck,Difference,MonthlyRatio
7590-VHVEG,1.0,0,0,0,29.85,29.85,0,1,1,0.00,29.850000
5575-GNVDE,34.0,1,1,1,56.95,1889.50,0,0,0,-46.80,55.573529
3668-QPYBK,2.0,1,0,1,53.85,108.15,1,1,0,0.45,54.075000
7795-CFOCW,45.0,0,1,2,42.30,1840.75,0,0,0,-62.75,40.905556
9237-HQITU,2.0,1,0,0,70.70,151.65,1,1,1,10.25,75.825000
...,...,...,...,...,...,...,...,...,...,...,...
6840-RESVB,24.0,1,1,1,84.80,1990.50,0,0,0,-44.70,82.937500
2234-XADUH,72.0,1,1,3,103.20,7362.90,0,0,0,-67.50,102.262500
4801-JZAZL,11.0,0,0,0,29.60,346.45,0,1,1,20.85,31.495455
8361-LTMKD,4.0,1,0,1,74.40,306.60,1,1,0,9.00,76.650000


Removing custom features

In [2]:
# Drop the custom features engineered in week 2 in a single call
df = df.drop(columns=['NoContract', 'ElectronicCheck', 'Difference', 'MonthlyRatio'])

## AutoML with pycaret

In [3]:
from pycaret.classification import setup, compare_models, predict_model, save_model, load_model

In [4]:
automl = setup(df, target = 'Churn', session_id = 3)

,Description,Value
0,session_id,3
1,Target,Churn
2,Target Type,Binary
3,Label Encoded,"0: 0, 1: 1"
4,Original Data,"(7043, 7)"
5,Missing Values,False
6,Numeric Features,3
7,Categorical Features,3
8,Ordinal Features,False
9,High Cardinality Features,False


The preprocessed data is at index 19 of the tuple returned by `setup()` this time

In [5]:
automl[19]

customerID,tenure,MonthlyCharges,TotalCharges,PhoneService_1,Contract_0,Contract_1,Contract_2,PaymentMethod_0,PaymentMethod_1,PaymentMethod_2,PaymentMethod_3
7590-VHVEG,1.0,29.850000,29.850000,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0
5575-GNVDE,34.0,56.950001,1889.500000,1.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0
3668-QPYBK,2.0,53.849998,108.150002,1.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0
7795-CFOCW,45.0,42.299999,1840.750000,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0
9237-HQITU,2.0,70.699997,151.649994,1.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...
6840-RESVB,24.0,84.800003,1990.500000,1.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0
2234-XADUH,72.0,103.199997,7362.899902,1.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0
4801-JZAZL,11.0,29.600000,346.450012,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0
8361-LTMKD,4.0,74.400002,306.600006,1.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0


With this `session_id`, the best model by accuracy is LDA (Linear Discriminant Analysis)

In [6]:
best_model = compare_models(sort = 'Accuracy')

,Model,Accuracy,AUC,Recall,Prec.,F1,Kappa,MCC,TT (Sec)
lda,Linear Discriminant Analysis,0.7854,0.8192,0.5113,0.6226,0.5608,0.4208,0.4248,0.005
ada,Ada Boost Classifier,0.7846,0.828,0.4683,0.6352,0.5384,0.4022,0.4106,0.021
lr,Logistic Regression,0.7842,0.8275,0.4978,0.6244,0.5531,0.4134,0.4185,0.264
gbc,Gradient Boosting Classifier,0.7836,0.8295,0.4993,0.6218,0.5531,0.4127,0.4175,0.045
ridge,Ridge Classifier,0.7828,0.0,0.4411,0.639,0.5213,0.387,0.3984,0.004
lightgbm,Light Gradient Boosting Machine,0.7815,0.8188,0.5114,0.6125,0.5568,0.4135,0.4169,0.079
catboost,CatBoost Classifier,0.7795,0.8267,0.4955,0.6119,0.5468,0.4033,0.4078,0.428
xgboost,Extreme Gradient Boosting,0.7682,0.8072,0.494,0.5798,0.5327,0.3801,0.3826,0.182
knn,K Neighbors Classifier,0.7584,0.7369,0.4419,0.5639,0.4944,0.3392,0.344,0.173
rf,Random Forest Classifier,0.757,0.7831,0.4592,0.5576,0.5029,0.3443,0.3475,0.051
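
Which model counts as "best" depends on the metric. As a rough check, the sketch below copies the ten rows shown in the comparison table and picks the top model per metric in plain Python. `best_by` is a hypothetical helper; in pycaret itself the same choice is made through the `sort` argument of `compare_models`, as used in the cell above.

```python
# (model id, Accuracy, AUC, Recall, Prec., F1) copied from the comparison table
results = [
    ("lda", 0.7854, 0.8192, 0.5113, 0.6226, 0.5608),
    ("ada", 0.7846, 0.8280, 0.4683, 0.6352, 0.5384),
    ("lr", 0.7842, 0.8275, 0.4978, 0.6244, 0.5531),
    ("gbc", 0.7836, 0.8295, 0.4993, 0.6218, 0.5531),
    ("ridge", 0.7828, 0.0, 0.4411, 0.6390, 0.5213),
    ("lightgbm", 0.7815, 0.8188, 0.5114, 0.6125, 0.5568),
    ("catboost", 0.7795, 0.8267, 0.4955, 0.6119, 0.5468),
    ("xgboost", 0.7682, 0.8072, 0.4940, 0.5798, 0.5327),
    ("knn", 0.7584, 0.7369, 0.4419, 0.5639, 0.4944),
    ("rf", 0.7570, 0.7831, 0.4592, 0.5576, 0.5029),
]

# Position of each metric inside a result tuple
COLUMNS = {"Accuracy": 1, "AUC": 2, "Recall": 3, "Prec.": 4, "F1": 5}


def best_by(metric):
    """Return the id of the model with the highest value for the metric."""
    idx = COLUMNS[metric]
    return max(results, key=lambda row: row[idx])[0]


for metric in COLUMNS:
    print(metric, best_by(metric))
```

Among these rows, LDA leads on accuracy and F1, while gradient boosting leads on AUC, so the winner would change if AUC were the chosen metric.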


In [7]:
best_model

LinearDiscriminantAnalysis(n_components=None, priors=None, shrinkage=None,
                           solver='svd', store_covariance=False, tol=0.0001)

### Trying different shapes to see how the data changes

In [8]:
df.iloc[-2:-1]

customerID,tenure,PhoneService,Contract,PaymentMethod,MonthlyCharges,TotalCharges,Churn
8361-LTMKD,4.0,1,0,1,74.4,306.6,1


In [9]:
df.iloc[-1]

tenure              66.00
PhoneService         1.00
Contract             2.00
PaymentMethod        2.00
MonthlyCharges     105.65
TotalCharges      6844.50
Churn                0.00
Name: 3186-AJIEK, dtype: float64
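
The two cells above return different shapes because `iloc` follows ordinary Python slice rules: a slice keeps a table with one row, while a single index returns the row itself. The sketch below shows the same behavior with a plain list as an analogy; it does not use pandas, and the three rows keep only the customer ID and tenure from the data above.

```python
# Last three customers as (customerID, tenure) pairs
rows = [
    ("4801-JZAZL", 11.0),
    ("8361-LTMKD", 4.0),
    ("3186-AJIEK", 66.0),
]

second_to_last = rows[-2:-1]  # a list holding one row, like df.iloc[-2:-1]
last_row = rows[-1]           # the bare row, like df.iloc[-1]
last_as_list = rows[-1:]      # a list holding the last row, like df.iloc[-1:]

print(second_to_last)  # [('8361-LTMKD', 4.0)]
print(last_row)        # ('3186-AJIEK', 66.0)
print(last_as_list)    # [('3186-AJIEK', 66.0)]
```

Note that `-2:-1` selects the second-to-last row; `-1:` would select the last row while keeping the two-dimensional shape.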

Running prediction on the second-to-last row

In [10]:
predict_model(best_model, df.iloc[-2:-1])

customerID,tenure,PhoneService,Contract,PaymentMethod,MonthlyCharges,TotalCharges,Churn,Label,Score
8361-LTMKD,4.0,1,0,1,74.4,306.6,1,1,0.5107


## Saving model to disk

In [11]:
save_model(best_model, 'BEST')

Transformation Pipeline and Model Succesfully Saved


(Pipeline(memory=None,
          steps=[('dtypes',
                  DataTypes_Auto_infer(categorical_features=[],
                                       display_types=True, features_todrop=[],
                                       id_columns=[],
                                       ml_usecase='classification',
                                       numerical_features=[], target='Churn',
                                       time_features=[])),
                 ('imputer',
                  Simple_Imputer(categorical_strategy='not_available',
                                 fill_value_categorical=None,
                                 fill_value_numerical=None,
                                 numeric_strate...
                 ('dummy', Dummify(target='Churn')),
                 ('fix_perfect', Remove_100(target='Churn')),
                 ('clean_names', Clean_Colum_Names()),
                 ('feature_select', 'passthrough'), ('fix_multi', 'passthrough'),
                 ('dfs

### Writing to another file with Pickle

In [12]:
import pickle

with open('BEST_model.pk', 'wb') as f:
    pickle.dump(best_model, f)

In [13]:
with open('BEST_model.pk', 'rb') as f:
    loaded_model = pickle.load(f)

In [14]:
new_data = df.iloc[-2:-1].copy()
new_data.drop('Churn', axis=1, inplace=True)

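##### This line crashes - ValueError: query data dimension must match training data dimension

The pickled object is most likely the bare estimator, which was fitted on pycaret's preprocessed data (11 one-hot-encoded columns), whereas `new_data` still has the 6 raw feature columns.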

loaded_model.predict(new_data)
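
A likely reason for the dimension error is that the model was trained on the one-hot-encoded columns shown in the `automl[19]` output, not on the raw columns. The sketch below is a hand-rolled illustration of that expansion for the 8361-LTMKD row; it is not pycaret's actual preprocessing code, and `one_hot` is a hypothetical helper whose column layout is copied from that output.

```python
# Column layout taken from the preprocessed data shown earlier
NUMERIC = ["tenure", "MonthlyCharges", "TotalCharges"]
CATEGORIES = {
    "PhoneService": [1],
    "Contract": [0, 1, 2],
    "PaymentMethod": [0, 1, 2, 3],
}


def one_hot(row):
    """Expand a raw row into numeric columns followed by indicator columns."""
    encoded = [float(row[name]) for name in NUMERIC]
    for name, levels in CATEGORIES.items():
        encoded.extend(1.0 if row[name] == level else 0.0 for level in levels)
    return encoded


# Raw values for customer 8361-LTMKD, without the Churn target
raw = {
    "tenure": 4.0,
    "PhoneService": 1,
    "Contract": 0,
    "PaymentMethod": 1,
    "MonthlyCharges": 74.4,
    "TotalCharges": 306.6,
}

encoded = one_hot(raw)
print(len(raw), len(encoded))  # 6 11
```

The saved pycaret pipeline applies its own preprocessing before predicting, which is why `load_model` followed by `predict_model` accepts the raw columns.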

## Load new data

In [15]:
loaded_best = load_model('BEST')

Transformation Pipeline and Model Successfully Loaded


In [16]:
predict_model(loaded_best, new_data)

customerID,tenure,PhoneService,Contract,PaymentMethod,MonthlyCharges,TotalCharges,Label,Score
8361-LTMKD,4.0,1,0,1,74.4,306.6,1,0.5107
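
The assignment asks for the probability of churn, but `Score` appears to be the probability of whichever label was predicted (the "No Churn" rows in the script output further down have scores above 0.5). Under that assumption, a minimal sketch of the conversion could look like this; `churn_probability` is a hypothetical helper, not part of pycaret.

```python
def churn_probability(label, score):
    """Convert (predicted label, confidence in that label) to P(churn).

    Assumes score is the probability of the predicted label, so a
    prediction of 0 (no churn) with score s means P(churn) = 1 - s.
    """
    return score if label == 1 else 1.0 - score


# (Label, Score) pairs: the row above, and a "No Churn" row from the script output
print(round(churn_probability(1, 0.5107), 4))  # 0.5107
print(round(churn_probability(0, 0.7268), 4))  # 0.2732
```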


## Load python script

In [17]:
from IPython.display import Code

Code('predict_churn.py')

## Run prediction on new data

In [21]:
%run predict_churn.py

Transformation Pipeline and Model Successfully Loaded
predictions:
           Churn_prediction  Percentage
customerID                             
9305-CKSKC            Churn      0.7980
1452-KNGVK         No Churn      0.7268
6723-OKKJM         No Churn      0.8433
7832-POPKP            Churn      0.5727
6348-TACGU         No Churn      0.9252
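
The assignment gives the true values for the new data as [1, 0, 0, 1, 0], so the predictions above can be checked directly. The sketch below recomputes accuracy, precision, and recall by hand from the printed labels. Five rows are far too few to say much about the model, so this is only a sanity check.

```python
true_values = [1, 0, 0, 1, 0]
predicted_labels = ["Churn", "No Churn", "No Churn", "Churn", "No Churn"]
predicted = [1 if label == "Churn" else 0 for label in predicted_labels]

pairs = list(zip(true_values, predicted))
tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # churners caught
tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # non-churners kept
fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false alarms
fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # churners missed

accuracy = (tp + tn) / len(pairs)
precision = tp / (tp + fp) if tp + fp else 0.0
recall = tp / (tp + fn) if tp + fn else 0.0

print(accuracy, precision, recall)  # 1.0 1.0 1.0
```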


# Summary


Load data: I loaded the prepared churn data from week 2. However, my prepared data was not a perfect match for the modified data, so I removed my custom features (NoContract, ElectronicCheck, Difference, MonthlyRatio) to keep them from interfering with the churn/no-churn prediction.

AutoML with pycaret: I imported pycaret and set the target to Churn. I also found the setting that fixes the random seed, `session_id`. Even with a fixed seed, the position of the preprocessed data in the `automl[]` tuple still changed, which makes it awkward to look up. I chose a `session_id` for which LDA had the highest accuracy.

Saving model: I saved the model to disk as BEST rather than naming it after an algorithm, since the top model changes frequently depending on the random seed. I was also able to pickle the model, but I could not get `loaded_model.predict(new_data)` to work; it kept raising a ValueError.

Load new data: I loaded the saved model and tested the prediction on one of the last rows.

Python script: I adapted the Python script from the example to use the unmodified churn data, encoding the categories with the same integer values I used in my week 2 prepared data. At first I got reversed predictions because some of my integer values (1/2/3/4) did not match the new data that had already been modified.
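
One way to avoid this kind of mismatch is to keep a single mapping per column and use it for both the training data and any new data. The sketch below is an assumption-laden illustration: the category names and integer codes are hypothetical, not necessarily the ones used in my week 2 notebook, and `encode` is a made-up helper.

```python
# Hypothetical category-to-code mappings (names and codes are assumptions)
CONTRACT_CODES = {"Month-to-month": 0, "One year": 1, "Two year": 2}
PAYMENT_CODES = {
    "Electronic check": 0,
    "Mailed check": 1,
    "Bank transfer (automatic)": 2,
    "Credit card (automatic)": 3,
}


def encode(value, codes, column):
    """Look up the integer code, failing loudly on an unknown category."""
    try:
        return codes[value]
    except KeyError:
        raise ValueError(f"unexpected {column} value: {value!r}") from None


print(encode("One year", CONTRACT_CODES, "Contract"))               # 1
print(encode("Electronic check", PAYMENT_CODES, "PaymentMethod"))   # 0
```

Raising an error on an unknown value makes a mismatch visible immediately instead of silently producing reversed predictions.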

The predictions ran successfully, and when LDA is the most accurate model they match the true values given in the assignment. I also added the score to the output, because I wanted to see how confident the model was in each prediction.

