# DS Automation Assignment

Using our prepared churn data from week 2:
- use pycaret to find an ML algorithm that performs best on the data
    - Choose a metric you think is best to use for finding the best model; by default, it is accuracy but it could be AUC, precision, recall, etc. The week 3 FTE has some information on these different metrics.
- save the model to disk
- create a Python script/file/module with a function that takes a pandas dataframe as an input and returns the probability of churn for each row in the dataframe
    - your Python file/function should print out the predictions for new data (new_churn_data.csv)
    - the true values for the new data are [1, 0, 0, 1, 0] if you're interested
- test your Python module and function with the new data, new_churn_data.csv
- write a short summary of the process and results at the end of this notebook
- upload this Jupyter Notebook and Python file to a GitHub repository, and turn in a link to the repository in the week 5 assignment dropbox

*Optional* challenges:
- return the probability of churn for each new prediction, and the percentile where that prediction is in the distribution of probability predictions from the training dataset (e.g. a high probability of churn like 0.78 might be at the 90th percentile)
- use other autoML packages, such as TPOT, H2O, MLBox, etc, and compare performance and features with pycaret
- create a class in your Python module to hold the functions that you created
- accept user input to specify a file using a tool such as Python's `input()` function, the `click` package for command-line arguments, or a GUI
- Use the unmodified churn data (new_unmodified_churn_data.csv) in your Python script. This will require adding the same preprocessing steps from week 2 since this data is like the original unmodified dataset from week 1.
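
For the percentile challenge above, a minimal sketch using only the standard library might look like the following. The function name `percentile_of_probability` and the training probabilities are hypothetical and only illustrate the idea; in practice the training probabilities would come from the model's predictions on the training data.

```python
from bisect import bisect_right


def percentile_of_probability(train_probs, new_prob):
    """Return the percentage of training probabilities that are <= new_prob."""
    ordered = sorted(train_probs)
    if not ordered:
        raise ValueError("train_probs must not be empty")
    # bisect_right counts how many sorted values are <= new_prob
    rank = bisect_right(ordered, new_prob)
    return 100.0 * rank / len(ordered)


# Hypothetical training-set churn probabilities, for illustration only
train_probs = [0.05, 0.10, 0.20, 0.35, 0.40, 0.55, 0.60, 0.70, 0.80, 0.95]

print(percentile_of_probability(train_probs, 0.78))  # 80.0
```

With these made-up values, a churn probability of 0.78 is at or above 8 of the 10 training probabilities, so it lands at the 80th percentile.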

Data Load

To begin, the prepped churn data is loaded into the notebook.

In [1]:
import pandas as pd

df = pd.read_csv('/Users/aaron/Documents/Jupyter/data/prepped_churn_data.csv', index_col='customerID')
df.head(10)

customerID,tenure,PhoneService,Contract,PaymentMethod,MonthlyCharges,TotalCharges,Churn,tenure_totalcharge_ratio
7590-VHVEG,1,0,0,1,29.85,29.85,0,0.033501
5575-GNVDE,34,1,1,0,56.95,1889.5,0,0.017994
3668-QPYBK,2,1,0,0,53.85,108.15,1,0.018493
7795-CFOCW,45,0,1,2,42.3,1840.75,0,0.024447
9237-HQITU,2,1,0,1,70.7,151.65,1,0.013188
9305-CDSKC,8,1,0,1,99.65,820.5,1,0.00975
1452-KIOVK,22,1,0,3,89.1,1949.4,0,0.011286
6713-OKOMC,10,0,0,0,29.75,301.9,0,0.033124
7892-POOKP,28,1,0,1,104.8,3046.05,1,0.009192
6388-TABGU,62,1,1,2,56.15,3487.95,0,0.017775


AutoML with Pycaret

I am keeping the FTE cell below so I can refer back to it later.

***FTE***

Our next step is to use pycaret for autoML. 

PyCaret only supports up to Python 3.10 and I'm using Python 3.11, so I'm going to create a **virtual environment**. Instructions are at https://pycaret.gitbook.io/docs/get-started/installation.

**NOTE:** I suggest doing all this from a command line, <u>not from inside your notebook</u>.

```
# create a conda environment
conda create --name <yourenvname> anaconda python=3.10

# activate conda environment
conda activate <yourenvname>

# install pycaret
pip install pycaret

# create notebook kernel
python -m ipykernel install --user --name <yourenvname> --display-name "<display-name-for yourenvname>"
```


The following allows checking the kernels available.

In [2]:
!jupyter kernelspec list

Available kernels:
  py310      /Users/aaron/Library/Jupyter/kernels/py310
  python3    /Users/aaron/anaconda3/share/jupyter/kernels/python3


The new kernel is available and it has been selected. 

In [3]:
df.info() 

<class 'pandas.core.frame.DataFrame'>
Index: 7032 entries, 7590-VHVEG to 3186-AJIEK
Data columns (total 8 columns):
 #   Column                    Non-Null Count  Dtype  
---  ------                    --------------  -----  
 0   tenure                    7032 non-null   int64  
 1   PhoneService              7032 non-null   int64  
 2   Contract                  7032 non-null   int64  
 3   PaymentMethod             7032 non-null   int64  
 4   MonthlyCharges            7032 non-null   float64
 5   TotalCharges              7032 non-null   float64
 6   Churn                     7032 non-null   int64  
 7   tenure_totalcharge_ratio  7032 non-null   float64
dtypes: float64(3), int64(5)
memory usage: 494.4+ KB


In [4]:
from pycaret.classification import ClassificationExperiment #setup, compare_models, predict_model, save_model, load_model

***Note for future reference, when looking back at this notebook.***

The first install corrupted the whole environment, which was not discovered until after a system restart the next morning, when Anaconda Navigator would not start. The fix was removing and reinstalling Navigator:

conda remove -n base anaconda-navigator
conda install -n base anaconda-navigator

Next remove the new environment:

conda deactivate
conda remove --name ENV_NAME --all

Recreate the environment per the instructions above. It will fail on the pycaret install because of the lightgbm module: the installer can't find the library and does not install it either.

A direct lightgbm install doesn't work either until you add an additional channel:

conda config --add channels conda-forge
conda config --set channel_priority strict

Reinstall lightgbm:

conda install lightgbm

Redo the pycaret install. It still fails because two dependencies are missing, blosc2 and fuzzyTM. Install them with:

pip install fuzzytm
pip install blosc2

Rerun the pycaret install; it finishes now. Then finish step 4 (creating the notebook kernel) from the pycaret install instructions above. Install complete.


Setting up for automl

In [5]:
automl = ClassificationExperiment() #setup(df, target='Churn')

In [6]:
automl.setup(df, target='Churn')

Unnamed: 0,Description,Value
0,Session id,4680
1,Target,Churn
2,Target type,Binary
3,Original data shape,"(7032, 8)"
4,Transformed data shape,"(7032, 8)"
5,Transformed train set shape,"(4922, 8)"
6,Transformed test set shape,"(2110, 8)"
7,Numeric features,7
8,Preprocess,True
9,Imputation type,simple


<pycaret.classification.oop.ClassificationExperiment at 0x2b17ed610>

AutoML is now set up.

***FTE***

This will ask us to check if the datatypes of the input data are correct. In this case, they seem fine. There are a huge number of parameters we can set that we can see in the docs or if we run ?setup in a cell. For now, we are leaving everything else at the default. However, relating it to last week, we can see there is a feature_selection option we could set.

By default, it preprocesses data (converts categorical columns into numeric). We can see what the preprocessed data looks like from one of the elements in the automl object. It seems like the index of the object (6 for unmodified data and 14 for preprocessed here) may change sometimes (possibly a bug or peculiarity with pycaret).

Running automl to find the best model

In [7]:
best_model = automl.compare_models()

Unnamed: 0,Model,Accuracy,AUC,Recall,Prec.,F1,Kappa,MCC,TT (Sec)
gbc,Gradient Boosting Classifier,0.7989,0.8392,0.4999,0.6602,0.5683,0.4406,0.4482,0.175
ada,Ada Boost Classifier,0.7966,0.8362,0.5121,0.6488,0.5716,0.4409,0.4466,0.146
lr,Logistic Regression,0.7956,0.8366,0.526,0.6406,0.5772,0.4442,0.4482,0.229
ridge,Ridge Classifier,0.7936,0.0,0.464,0.6594,0.544,0.4159,0.427,0.004
lightgbm,Light Gradient Boosting Machine,0.793,0.8274,0.5206,0.6354,0.5718,0.4371,0.4412,1.151
lda,Linear Discriminant Analysis,0.7899,0.8228,0.5046,0.6321,0.5604,0.4248,0.4299,0.004
rf,Random Forest Classifier,0.7714,0.8018,0.4648,0.5912,0.5188,0.372,0.3776,0.04
et,Extra Trees Classifier,0.7702,0.7821,0.4893,0.5822,0.5305,0.3802,0.3834,0.032
knn,K Neighbors Classifier,0.7668,0.7493,0.4426,0.5813,0.5017,0.3533,0.3593,0.137
qda,Quadratic Discriminant Analysis,0.7395,0.8227,0.7637,0.5078,0.6096,0.4262,0.4464,0.003


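At a quick glance, logistic regression appeared to be the best model in the first run. My metric of choice is accuracy: it is what we have been working with for the past week or so, and it is the default metric pycaret sorts by.

After a kernel restart, the best model is now GBC (Gradient Boosting Classifier), as shown in the table above. The top three models are less than half a percentage point apart on accuracy, so the ranking can change between runs when the session id (the random seed) changes.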
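
The setup output above shows a randomly generated session id (4680), and PyCaret's `setup` accepts a `session_id` argument to fix it, e.g. `automl.setup(df, target='Churn', session_id=42)`. The sketch below is not PyCaret's implementation; it only illustrates, with the standard library and a hypothetical `split_indices` helper, why a fixed seed makes a train/test split repeatable while an unseeded one generally is not.

```python
import random


def split_indices(n_rows, train_fraction=0.7, seed=None):
    """Shuffle row indices and split them into train and test lists."""
    rng = random.Random(seed)  # seed=None draws a fresh seed on every call
    indices = list(range(n_rows))
    rng.shuffle(indices)
    cut = int(n_rows * train_fraction)
    return indices[:cut], indices[cut:]


# Same seed twice: the two splits are identical
train_a, test_a = split_indices(7032, seed=42)
train_b, test_b = split_indices(7032, seed=42)
print(train_a == train_b)          # True
print(len(train_a), len(test_a))   # 4922 2110
```

When the split (and the cross-validation folds) change between runs, models whose scores are nearly tied can swap places in the ranking.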


***FTE***

Within the notebook, this updates in real time as it's fitting. We can see the boosting algorithms like xgboost and catboost take the longest to run. Often xgboost will be near the top. To get xgboost and lightgbm working, we either need to allow preprocessing (which converts categorical columns into numeric columns) or we need to set our categorical columns as numeric with automl = setup(df, target='Diabetes', preprocess=False, numeric_features=['Gender']).

Our best_model object now holds the highest-scoring model. We can also set an argument sort in compare_models to choose another metric as our scoring metric. By default, it uses accuracy (and we can see the table above is sorted by accuracy). We could set this to sort='Precision' to use precision (TP / (TP + FP)), for example.
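
Precision and recall are easy to mix up, so here is a small worked example. The confusion-matrix counts are hypothetical and chosen only to make the arithmetic easy to follow.

```python
def precision(tp, fp):
    """Of all rows predicted as churn, the fraction that really churned."""
    return tp / (tp + fp) if (tp + fp) else 0.0


def recall(tp, fn):
    """Of all rows that really churned, the fraction the model caught."""
    return tp / (tp + fn) if (tp + fn) else 0.0


# Hypothetical counts: 30 true positives, 10 false positives, 20 false negatives
tp, fp, fn = 30, 10, 20

print(precision(tp, fp))  # 0.75
print(recall(tp, fn))     # 0.6
```

For churn, recall matters when missing a customer who is about to leave is costly, while precision matters when acting on a false alarm is costly.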


Find the best model

In [8]:
best_model

AutoML Evaluation

In [9]:
automl.evaluate_model(best_model)

interactive(children=(ToggleButtons(description='Plot Type:', icons=('',), options=(('Pipeline Plot', 'pipelin…

The best model was Logistic Regression, now it is GBC. 



Plotting and Predictions

This is not part of the assignment instructions, but it is a good thing to be exposed to, so I followed the FTE examples.

I have since pulled those additional FTE plots and predictions out for brevity. 

Save the model to disk


Save the model using pycaret so it can be used again later.

In [10]:
automl.save_model(best_model, 'pycaret_model')

Transformation Pipeline and Model Successfully Saved


(Pipeline(memory=Memory(location=None),
          steps=[('numerical_imputer',
                  TransformerWrapper(exclude=None,
                                     include=['tenure', 'PhoneService',
                                              'Contract', 'PaymentMethod',
                                              'MonthlyCharges', 'TotalCharges',
                                              'tenure_totalcharge_ratio'],
                                     transformer=SimpleImputer(add_indicator=False,
                                                               copy=True,
                                                               fill_value=None,
                                                               keep_empty_features=False,
                                                               missing_values=nan,
                                                               strategy='mean'...
                                             criterion='friedman_mse', init

The model is saved; now it is time to load it and test it by making predictions.

In [11]:
new_pycaret = ClassificationExperiment()
loaded_model = new_pycaret.load_model('pycaret_model')

Transformation Pipeline and Model Successfully Loaded


In [12]:
new_pycaret.predict_model(loaded_model, df.iloc[-2:-1])

customerID,tenure,PhoneService,Contract,PaymentMethod,MonthlyCharges,TotalCharges,tenure_totalcharge_ratio,Churn,prediction_label,prediction_score
8361-LTMKD,4,1,0,0,74.400002,306.600006,0.013046,1,1,0.573


The saved model was successfully loaded, and it was good to see it make a prediction. In this case the prediction was correct: the true Churn value is 1 and the predicted label is 1, with a prediction score of 0.573.


Create a Python script/file/module with a function that takes a pandas dataframe as an input and returns the probability of churn for each row in the dataframe.

This is to be done against the new_churn_data.csv
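
The assignment asks for the probability of churn, while the prediction output shown earlier has a prediction_label and a prediction_score. The sketch below assumes that prediction_score is the probability of the predicted label (my understanding of PyCaret's default output; `predict_model` also has a `raw_score=True` option that can be used to check this). The helper name `churn_probability` and the second row are hypothetical; the first row is taken from the prediction output above.

```python
def churn_probability(prediction_label, prediction_score):
    """Convert a (label, score) pair into the probability of churn (label 1).

    Assumes prediction_score is the probability of the predicted label.
    """
    if prediction_label == 1:
        return prediction_score
    return 1.0 - prediction_score


rows = [
    # Row from the prediction output earlier in this notebook
    {"customerID": "8361-LTMKD", "prediction_label": 1, "prediction_score": 0.573},
    # Hypothetical row, for illustration only
    {"customerID": "hypothetical-id", "prediction_label": 0, "prediction_score": 0.80},
]

for row in rows:
    prob = churn_probability(row["prediction_label"], row["prediction_score"])
    print(row["customerID"], round(prob, 3))
```

Under that assumption, a "No churn" prediction with a score of 0.80 corresponds to a churn probability of about 0.20.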

In [13]:
from IPython.display import Code

Code('predict_churn_pycaret.py')

The new predict file is loaded and ready to go. 

In [14]:
%run predict_churn_pycaret.py

Transformation Pipeline and Model Successfully Loaded
predictions:
customerID
7590-VHVEG    No churn
5575-GNVDE    No churn
3668-QPYBK       Churn
7795-CFOCW    No churn
9237-HQITU       Churn
                ...   
6840-RESVB    No churn
2234-XADUH    No churn
4801-JZAZL    No churn
8361-LTMKD       Churn
3186-AJIEK    No churn
Name: Churn, Length: 7032, dtype: object



# Summary

This homework assignment was very interesting to say the least. It got off to a rocky start and then the whole environment died the next morning as was described in a cell above. After fixing that, things seemed to be going ok until the script portion. 

Going through the assignment, it was observed that the model can make predictions (both good and bad), report each one in the prediction_label column, and supplement it with the prediction_score that the prediction was based upon. That is how I thought the script was supposed to work as well, but that is not the case.

Another observation is that the model is wholly dependent on the target column specified at the beginning, which makes sense. Building the model on a Churn column, saving it, and running it against the new five-entry new_churn_data.csv doesn't work because the new dataset does not have the Churn column. This results in key errors, much like a MySQL/SQL script that references a missing column. That makes sense.

The next step was starting with the original raw churn data CSV, which is what I started with before everything died. Choosing this one gives the correct target column, but trying to run the script on the prepped churn data still fails, even though there are no key errors. I suspect this is because the pycaret_model that is called in the script will only work with the original dataset. I tried this in two different environments and observed the same results. It only works if the input dataset for the model and the script are the same.

Is this expected behaviour, or is this an error that could be attributed to the environment failure I had at the beginning? If this is indeed expected behaviour, which I am beginning to think it is, then the script is less generalised than expected, because it is completely dependent on the pycaret model, and that only works in one scenario.
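
One way to narrow down mismatches like the ones described above is to compare the columns of a new file against the feature columns the saved pipeline was trained on before calling the model. This is only a sketch: the helper `compare_columns` is hypothetical, the expected feature names are copied from the saved pipeline output earlier in this notebook, and the new file's column names are made up for illustration.

```python
def compare_columns(new_columns, expected_features, target="Churn"):
    """Return (missing, unexpected) column names relative to the trained features."""
    new_set = set(new_columns)
    missing = [c for c in expected_features if c not in new_set]
    # The target column is allowed to be present or absent in new data
    unexpected = [
        c for c in new_columns if c not in expected_features and c != target
    ]
    return missing, unexpected


# Feature columns listed in the saved pipeline output above
expected = [
    "tenure", "PhoneService", "Contract", "PaymentMethod",
    "MonthlyCharges", "TotalCharges", "tenure_totalcharge_ratio",
]

# Hypothetical columns of a new file, for illustration only
new_cols = [
    "tenure", "PhoneService", "Contract", "PaymentMethod",
    "MonthlyCharges", "TotalCharges", "charge_per_tenure",
]

missing, unexpected = compare_columns(new_cols, expected)
print("missing:", missing)        # ['tenure_totalcharge_ratio']
print("unexpected:", unexpected)  # ['charge_per_tenure']
```

If both lists are empty and the script still fails, the cause is more likely in the script or the environment than in the column names.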

Overall, it was good to see predictions happen in the model. I learned how to do some of them, and I feel I learned a decent amount about the dependencies both in and outside of the notebook in this scenario.