# DS_Automation_Assignment_Oussama_Ennaciri

Using our prepared churn data from week 2:
- use pycaret to find an ML algorithm that performs best on the data
    - Choose a metric you think is best to use for finding the best model; by default, it is accuracy but it could be AUC, precision, recall, etc. The week 3 FTE has some information on these different metrics.
- save the model to disk
- create a Python script/file/module with a function that takes a pandas dataframe as an input and returns the probability of churn for each row in the dataframe
    - your Python file/function should print out the predictions for new data (new_churn_data.csv)
    - the true values for the new data are [1, 0, 0, 1, 0] if you're interested
- test your Python module and function with the new data, new_churn_data.csv
- write a short summary of the process and results at the end of this notebook
- upload this Jupyter Notebook and Python file to a Github repository, and turn in a link to the repository in the week 5 assignment dropbox

*Optional* challenges:
- return the probability of churn for each new prediction, and the percentile where that prediction is in the distribution of probability predictions from the training dataset (e.g. a high probability of churn like 0.78 might be at the 90th percentile)
- use other autoML packages, such as TPOT, H2O, MLBox, etc, and compare performance and features with pycaret
- create a class in your Python module to hold the functions that you created
- accept user input to specify a file using a tool such as Python's `input()` function, the `click` package for command-line arguments, or a GUI
- Use the unmodified churn data (new_unmodified_churn_data.csv) in your Python script. This will require adding the same preprocessing steps from week 2 since this data is like the original unmodified dataset from week 1.

## Preparing the Environment and Loading the Data

We started by creating and activating the `msds` virtual environment from the terminal with the following commands.

In [68]:
#!conda create -n msds python=3.10.14 -y
#!conda activate msds

Then, we loaded the prepared data.

In [45]:
import pandas as pd

df = pd.read_csv('prepped_churn_data.csv', index_col='customerID')
df

customerID,tenure,PhoneService,Contract,PaymentMethod,MonthlyCharges,TotalCharges,Churn,charge_per_tenure
7590-VHVEG,1,0,0,3,29.85,29.85,0,29.850000
5575-GNVDE,34,1,1,2,56.95,1889.50,0,55.573529
3668-QPYBK,2,1,0,2,53.85,108.15,1,54.075000
7795-CFOCW,45,0,1,1,42.30,1840.75,0,40.905556
9237-HQITU,2,1,0,3,70.70,151.65,1,75.825000
...,...,...,...,...,...,...,...,...
6840-RESVB,24,1,1,2,84.80,1990.50,0,82.937500
2234-XADUH,72,1,1,0,103.20,7362.90,0,102.262500
4801-JZAZL,11,0,0,3,29.60,346.45,0,31.495455
8361-LTMKD,4,1,0,2,74.40,306.60,1,76.650000


## AutoML with pycaret

To use pycaret for autoML, we first install it and then import the required functions.

In [3]:
conda install -c conda-forge pycaret -y

Channels:
 - conda-forge
 - defaults
Platform: osx-arm64
Collecting package metadata (repodata.json): done
Solving environment: done

# All requested packages already installed.


Note: you may need to restart the kernel to use updated packages.


In [46]:
from pycaret.classification import setup, compare_models, predict_model, save_model, load_model

We start by preparing the autoML and defining the target "Churn".

In [47]:
automl = setup(df, target='Churn')

,Description,Value
0,Session id,8110
1,Target,Churn
2,Target type,Binary
3,Original data shape,"(7032, 8)"
4,Transformed data shape,"(7032, 8)"
5,Transformed train set shape,"(4922, 8)"
6,Transformed test set shape,"(2110, 8)"
7,Numeric features,7
8,Preprocess,True
9,Imputation type,simple


Next, we run the autoML to compare the ML models and find the best one.

In [48]:
best_model = compare_models()

,Model,Accuracy,AUC,Recall,Prec.,F1,Kappa,MCC,TT (Sec)
lr,Logistic Regression,0.7903,0.8359,0.5137,0.6287,0.5649,0.4288,0.4328,0.034
ridge,Ridge Classifier,0.7897,0.8216,0.4433,0.6538,0.5278,0.3991,0.4118,0.006
gbc,Gradient Boosting Classifier,0.7891,0.8322,0.4823,0.6359,0.5478,0.4139,0.421,0.116
lda,Linear Discriminant Analysis,0.7891,0.8216,0.4953,0.6312,0.5547,0.4193,0.4247,0.014
lightgbm,Light Gradient Boosting Machine,0.7852,0.823,0.4915,0.6224,0.5485,0.4103,0.4156,0.02
ada,Ada Boost Classifier,0.7846,0.8314,0.48,0.6243,0.5416,0.4042,0.4108,0.036
rf,Random Forest Classifier,0.7765,0.8007,0.4862,0.5981,0.5357,0.3907,0.3948,0.087
knn,K Neighbors Classifier,0.7696,0.7468,0.4479,0.5885,0.5075,0.361,0.3673,0.013
et,Extra Trees Classifier,0.7653,0.7804,0.4808,0.5685,0.5204,0.3667,0.3693,0.061
qda,Quadratic Discriminant Analysis,0.7444,0.818,0.7385,0.5133,0.6055,0.4254,0.441,0.006


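We notice that logistic regression has the highest accuracy (0.7903), AUC (0.8359), and kappa (0.4288) of the compared models, although quadratic discriminant analysis scores higher on recall (0.7385), F1 (0.6055), and MCC (0.4410).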
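To make the choice of metric more concrete, the sketch below computes accuracy, precision, recall, and F1 from confusion-matrix counts. The counts are invented for illustration and are not taken from the churn data, and the function name `classification_metrics` is hypothetical.

```python
def classification_metrics(tp, fp, fn, tn):
    """Compute common binary-classification metrics from confusion-matrix counts."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    # Guard against division by zero when a class is never predicted or never present.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (
        2 * precision * recall / (precision + recall)
        if (precision + recall)
        else 0.0
    )
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Illustrative counts only (not from this dataset).
metrics = classification_metrics(tp=50, fp=30, fn=50, tn=270)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

In this made-up case accuracy is 0.800 while recall is only 0.500, so half of the actual churners are missed. If missed churners matter more than overall accuracy, `compare_models` can rank by another metric through its `sort` argument (for example `sort='AUC'` or `sort='Recall'`).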

In [49]:
best_model

Because `compare_models` ranks by accuracy by default, we conclude from the autoML comparison that the best model to use is logistic regression.

In [50]:
df.iloc[-2:-1].shape

(1, 8)

In [51]:
predict_model(best_model, df.iloc[-2:-1])

,Model,Accuracy,AUC,Recall,Prec.,F1,Kappa,MCC
0,Logistic Regression,1.0,0,1.0,1.0,1.0,,0.0


customerID,tenure,PhoneService,Contract,PaymentMethod,MonthlyCharges,TotalCharges,charge_per_tenure,Churn,prediction_label,prediction_score
8361-LTMKD,4,1,0,2,74.400002,306.600006,76.650002,1,1,0.5782


We use the model to make predictions on our data.
We select the second-to-last row with the slice `[-2:-1]`, which returns a one-row dataframe.
The model predicts churn (a `prediction_label` of 1) with a score of 0.5782 (57.82%), which matches the true value: this customer did churn.
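The assignment asks for the probability of churn for each row. As far as we understand pycaret's default output, `prediction_score` is the score of the predicted label rather than of churn specifically, so a "No Churn" prediction with a score of 0.90 would correspond to a churn probability of 0.10. Under that assumption, a minimal sketch of the conversion looks like this (the helper name `churn_probability` and the second row are hypothetical):

```python
import pandas as pd


def churn_probability(predictions: pd.DataFrame) -> pd.Series:
    """Return P(churn) per row.

    Assumes prediction_score is the probability of prediction_label,
    so rows labelled 0 need 1 - prediction_score.
    """
    is_churn = predictions["prediction_label"] == 1
    score = predictions["prediction_score"]
    # Keep the score where churn was predicted, otherwise take the complement.
    return score.where(is_churn, 1 - score).rename("churn_probability")


example = pd.DataFrame(
    {"prediction_label": [1, 0], "prediction_score": [0.5782, 0.90]},
    index=pd.Index(["8361-LTMKD", "hypothetical-id"], name="customerID"),
)
result = churn_probability(example)
print(result)
```

Alternatively, `predict_model` has a `raw_score=True` option that returns a score column per class, which avoids the manual conversion.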

## Saving and Loading our Model

Next, we will save our trained model so we can use it in a Python file later.

In [52]:
save_model(best_model, 'LR')

Transformation Pipeline and Model Successfully Saved


(Pipeline(memory=Memory(location=None),
          steps=[('numerical_imputer',
                  TransformerWrapper(exclude=None,
                                     include=['tenure', 'PhoneService',
                                              'Contract', 'PaymentMethod',
                                              'MonthlyCharges', 'TotalCharges',
                                              'charge_per_tenure'],
                                     transformer=SimpleImputer(add_indicator=False,
                                                               copy=True,
                                                               fill_value=None,
                                                               keep_empty_features=False,
                                                               missing_values=nan,
                                                               strategy='mean'))),
                 ('c...
                                                         

In [53]:
import pickle

with open('LR_model.pk', 'wb') as f:
    pickle.dump(best_model, f)

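Here, we open a file named `LR_model.pk` in binary write mode, bind the file object to the variable `f`, and pickle the model into it. After that, we load the model back from the file and use it to predict on the same row with the `Churn` column removed.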

In [54]:
with open('LR_model.pk', 'rb') as f:
    loaded_model = pickle.load(f)

In [55]:
new_data = df.iloc[-2:-1].copy()
new_data.drop('Churn', axis=1, inplace=True)
loaded_model.predict(new_data)

array([1], dtype=int8)
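The save-and-reload pattern used above can be checked in isolation. The sketch below pickles a small stand-in dictionary of parameters (hypothetical values, not the pycaret model) to a temporary file and confirms that the reloaded copy equals the original.

```python
import os
import pickle
import tempfile

# Stand-in for a trained model: any picklable Python object works the same way.
model_params = {"name": "LR", "threshold": 0.5, "features": ["tenure", "MonthlyCharges"]}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "model.pk")

    # 'wb' opens the file for writing in binary mode, which pickle requires.
    with open(path, "wb") as f:
        pickle.dump(model_params, f)

    # 'rb' opens the same file for reading in binary mode.
    with open(path, "rb") as f:
        restored = pickle.load(f)

print(restored == model_params)  # equal content
print(restored is model_params)  # but a separate object
```

Pickle files can execute code when loaded, so they should only be loaded from sources we trust.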

In [79]:
loaded_LR = load_model('LR')

Transformation Pipeline and Model Successfully Loaded


In [74]:
predict_model(loaded_LR, new_data)

customerID,tenure,PhoneService,Contract,PaymentMethod,MonthlyCharges,TotalCharges,charge_per_tenure,prediction_label,prediction_score
8361-LTMKD,4,1,0,2,74.400002,306.600006,76.650002,1,0.5782


## Making a Python module to make predictions

In [75]:
from IPython.display import Code

Code('predict_churn.py')
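
The contents of `predict_churn.py` are not reproduced in this export. As a rough, hypothetical sketch (not the actual file), a module of this kind could look as follows. The function names `label_predictions` and `predict_churn`, the default model name, and the hand-made test frame are all assumptions made for illustration; only `load_model` and `predict_model` come from pycaret.

```python
import pandas as pd


def label_predictions(predictions: pd.DataFrame) -> pd.Series:
    """Map the 0/1 prediction_label column to readable labels."""
    labels = predictions["prediction_label"].map({1: "Churn", 0: "No Churn"})
    return labels.rename("churn_prediction")


def predict_churn(df: pd.DataFrame, model_name: str = "LR") -> pd.Series:
    """Load the saved pipeline and return a churn label for each row of df."""
    # Imported here so the labelling helper can be used without pycaret installed.
    from pycaret.classification import load_model, predict_model

    model = load_model(model_name)
    predictions = predict_model(model, data=df)
    return label_predictions(predictions)


# Quick check of the labelling step on a hand-made frame (hypothetical IDs and scores).
fake_predictions = pd.DataFrame(
    {"prediction_label": [0, 1], "prediction_score": [0.81, 0.64]},
    index=pd.Index(["id-a", "id-b"], name="customerID"),
)
labels = label_predictions(fake_predictions)
print(labels)
```

With pycaret installed and the `LR` model saved, a call such as `predict_churn(pd.read_csv('new_churn_data.csv', index_col='customerID'))` would be one way to use it, assuming the new file has the same columns as the training features.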

Finally, we run the Python script, which predicts churn for the new data using the saved pre-trained model.

In [78]:
%run predict_churn.py

Transformation Pipeline and Model Successfully Loaded


Index(['tenure', 'PhoneService', 'Contract', 'PaymentMethod', 'MonthlyCharges',
       'TotalCharges', 'charge_per_tenure', 'prediction_label',
       'prediction_score'],
      dtype='object')
predictions:
customerID
9305-CKSKC    No Churn
1452-KNGVK       Churn
6723-OKKJM    No Churn
7832-POPKP    No Churn
6348-TACGU       Churn
Name: churn_prediction, dtype: object
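
The assignment lists the true values for the new data as [1, 0, 0, 1, 0]. Assuming those values are in the same row order as the predictions printed above (an assumption, since the order is not confirmed anywhere in this notebook), a quick comparison can be sketched as:

```python
# True values as given in the assignment text.
true_values = [1, 0, 0, 1, 0]

# Labels printed by the script, in the order shown above.
predicted_labels = ["No Churn", "Churn", "No Churn", "No Churn", "Churn"]

# Convert labels back to 0/1 so they can be compared with the true values.
predicted_values = [1 if label == "Churn" else 0 for label in predicted_labels]

matches = [int(t == p) for t, p in zip(true_values, predicted_values)]
accuracy = sum(matches) / len(matches)
print(f"correct: {sum(matches)} of {len(matches)} (accuracy {accuracy:.2f})")
```

Under that ordering assumption only one of the five predictions matches. Five rows are far too few to judge the model, but a result this low suggests it is worth checking that the new data has the same column encoding and row order as expected.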


## Summary

We started by creating and activating the `msds` virtual environment from the terminal. Then, we loaded the prepared data.

To use pycaret for autoML, we installed it via conda and then imported the required functions: `setup`, `compare_models`, `predict_model`, `save_model`, and `load_model`.

We prepared the autoML and defined the target "Churn".
Next, we ran the autoML to compare the ML models and find the best one. Logistic regression had the highest accuracy, AUC, and kappa measures.

Based on this comparison, we concluded that the best model to use is logistic regression.

We used the model to make predictions on our data. We selected the second-to-last row with the slice `[-2:-1]`. The model predicted churn, as indicated by the prediction label and the score of 57.82%, which matches the true value because that customer did churn. We then saved our trained model so we could use it in a Python file later.

Finally, we ran the Python script, which predicts churn using the saved pre-trained model. The model predicted the following results:

- 9305-CKSKC: No Churn
- 1452-KNGVK: Churn
- 6723-OKKJM: No Churn
- 7832-POPKP: No Churn
- 6348-TACGU: Churn
