# DS Automation Assignment

Using our prepared churn data from week 2:
- use pycaret to find an ML algorithm that performs best on the data
    - Choose a metric you think is best to use for finding the best model; by default, it is accuracy but it could be AUC, precision, recall, etc. The week 3 FTE has some information on these different metrics.
- save the model to disk
- create a Python script/file/module with a function that takes a pandas dataframe as an input and returns the probability of churn for each row in the dataframe
    - your Python file/function should print out the predictions for new data (new_churn_data.csv)
    - the true values for the new data are [1, 0, 0, 1, 0] if you're interested
- test your Python module and function with the new data, new_churn_data.csv
- write a short summary of the process and results at the end of this notebook
- upload this Jupyter Notebook and Python file to a GitHub repository, and turn in a link to the repository in the week 5 assignment dropbox

*Optional* challenges:
- return the probability of churn for each new prediction, and the percentile where that prediction is in the distribution of probability predictions from the training dataset (e.g. a high probability of churn like 0.78 might be at the 90th percentile)
- use other autoML packages, such as TPOT, H2O, MLBox, etc., and compare performance and features with pycaret
- create a class in your Python module to hold the functions that you created
- accept user input to specify a file using a tool such as Python's `input()` function, the `click` package for command-line arguments, or a GUI
- Use the unmodified churn data (new_unmodified_churn_data.csv) in your Python script. This will require adding the same preprocessing steps from week 2 since this data is like the original unmodified dataset from week 1.
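
The percentile challenge above can be sketched with the standard library alone. This is a minimal illustration, not part of the assignment solution: the function name `percentile_of` and the list `train_probs` are hypothetical, and the training probabilities are made-up values used only to show the arithmetic.

```python
from bisect import bisect_right


def percentile_of(score, train_scores):
    """Return the percentage of training scores that are <= score."""
    ordered = sorted(train_scores)
    # bisect_right counts how many sorted values are <= score
    return 100.0 * bisect_right(ordered, score) / len(ordered)


# Hypothetical churn probabilities predicted on the training set
train_probs = [0.05, 0.10, 0.20, 0.35, 0.40, 0.55, 0.60, 0.70, 0.80, 0.90]

# Where does a new prediction of 0.78 fall in that distribution?
print(percentile_of(0.78, train_probs))
```

In practice `train_probs` would be replaced by the churn probabilities the saved model produces on the training data.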

In [2]:
import pandas as pd 
from pycaret.classification import setup, compare_models, predict_model, save_model, load_model

In [4]:
# Loaded the dataset
df = pd.read_csv('cleaned_churn_data.csv', index_col='customerID')
df

customerID,tenure,PhoneService,Contract,PaymentMethod,MonthlyCharges,TotalCharges,Churn,Charges_per_month
7590-VHVEG,1,0,0,3,29.85,29.85,0,14.925000
5575-GNVDE,34,1,1,2,56.95,1889.50,0,53.985714
3668-QPYBK,2,1,0,2,53.85,108.15,1,36.050000
7795-CFOCW,45,0,1,1,42.30,1840.75,0,40.016304
9237-HQITU,2,1,0,3,70.70,151.65,1,50.550000
...,...,...,...,...,...,...,...,...
6840-RESVB,24,1,1,2,84.80,1990.50,0,79.620000
2234-XADUH,72,1,1,0,103.20,7362.90,0,100.861644
4801-JZAZL,11,0,0,3,29.60,346.45,0,28.870833
8361-LTMKD,4,1,0,2,74.40,306.60,1,61.320000


In [5]:
# checking the info of the dataset
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 7043 entries, 7590-VHVEG to 3186-AJIEK
Data columns (total 8 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   tenure             7043 non-null   int64  
 1   PhoneService       7043 non-null   int64  
 2   Contract           7043 non-null   int64  
 3   PaymentMethod      7043 non-null   int64  
 4   MonthlyCharges     7043 non-null   float64
 5   TotalCharges       7043 non-null   float64
 6   Churn              7043 non-null   int64  
 7   Charges_per_month  7043 non-null   float64
dtypes: float64(3), int64(5)
memory usage: 495.2+ KB


In [6]:
df = df.drop(['Charges_per_month'], axis=1)


In [7]:
# displaying top 5 rows in the dataset
df.head()

customerID,tenure,PhoneService,Contract,PaymentMethod,MonthlyCharges,TotalCharges,Churn
7590-VHVEG,1,0,0,3,29.85,29.85,0
5575-GNVDE,34,1,1,2,56.95,1889.5,0
3668-QPYBK,2,1,0,2,53.85,108.15,1
7795-CFOCW,45,0,1,1,42.3,1840.75,0
9237-HQITU,2,1,0,3,70.7,151.65,1


In [8]:
# setup of automl
automl = setup(df, target='Churn')

,Description,Value
0,Session id,8417
1,Target,Churn
2,Target type,Binary
3,Original data shape,"(7043, 7)"
4,Transformed data shape,"(7043, 7)"
5,Transformed train set shape,"(4930, 7)"
6,Transformed test set shape,"(2113, 7)"
7,Numeric features,6
8,Preprocess,True
9,Imputation type,simple


In [9]:
# comparing candidate models; compare_models() ranks by Accuracy by default
# (pass sort='AUC' to rank by AUC instead)
best_model = compare_models()

,Model,Accuracy,AUC,Recall,Prec.,F1,Kappa,MCC,TT (Sec)
lr,Logistic Regression,0.7959,0.8398,0.5253,0.6415,0.577,0.4444,0.4486,1.795
ridge,Ridge Classifier,0.7933,0.8264,0.4534,0.6619,0.5377,0.4106,0.4231,0.037
lda,Linear Discriminant Analysis,0.7929,0.8264,0.5085,0.6389,0.5657,0.4321,0.4373,0.038
ada,Ada Boost Classifier,0.7923,0.8397,0.5062,0.6371,0.5637,0.4298,0.435,0.191
gbc,Gradient Boosting Classifier,0.7909,0.8424,0.5169,0.6297,0.567,0.4311,0.4352,0.361
lightgbm,Light Gradient Boosting Machine,0.7826,0.8322,0.5283,0.6038,0.5624,0.4188,0.4212,0.176
rf,Random Forest Classifier,0.7718,0.8042,0.4969,0.5826,0.5357,0.3859,0.3884,0.463
et,Extra Trees Classifier,0.7639,0.7851,0.503,0.564,0.5308,0.3739,0.3756,0.328
knn,K Neighbors Classifier,0.7635,0.7503,0.4426,0.5705,0.4978,0.3465,0.3517,0.07
qda,Quadratic Discriminant Analysis,0.7507,0.8288,0.7439,0.5218,0.6129,0.4375,0.4528,0.029
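
To make the metric columns in the comparison table easier to interpret, here is a small sketch of how accuracy, precision, recall, and F1 follow from confusion-matrix counts. The function name `classification_metrics` and the counts are hypothetical and are not taken from the churn data. It shows why a model can have decent accuracy while missing half of the churners (low recall), a pattern visible in the table, where QDA has the highest recall (0.7439) but the lowest accuracy of the models shown.

```python
def classification_metrics(tp, fp, fn, tn):
    """Compute common binary-classification metrics from confusion counts."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Hypothetical counts: 100 actual churners, of which only 50 are caught
m = classification_metrics(tp=50, fp=25, fn=50, tn=275)
for name, value in m.items():
    print(f"{name}: {value:.4f}")
```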


In [11]:
# printing the best model
best_model

In [12]:
# selecting the second-to-last row as a one-row DataFrame
df.iloc[-2:-1].shape

(1, 7)
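
The slice `iloc[-2:-1]` is used rather than `iloc[-2]` because a slice keeps the result two-dimensional, which is what a function expecting a dataframe needs. A minimal sketch on a toy frame (the index labels `a`, `b`, `c` are hypothetical; the values are copied from the first rows shown earlier):

```python
import pandas as pd

toy = pd.DataFrame(
    {"tenure": [1, 34, 2], "MonthlyCharges": [29.85, 56.95, 53.85]},
    index=["a", "b", "c"],
)

# A slice returns a DataFrame with one row
one_row_frame = toy.iloc[-2:-1]
# A scalar position returns a Series (the row's values, indexed by column)
one_row_series = toy.iloc[-2]

print(type(one_row_frame).__name__, one_row_frame.shape)
print(type(one_row_series).__name__, one_row_series.shape)
```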

In [13]:
# predicting the selected row with the trained model
predict_model(best_model, df.iloc[-2:-1])

customerID,tenure,PhoneService,Contract,PaymentMethod,MonthlyCharges,TotalCharges,Churn,prediction_label,prediction_score
8361-LTMKD,4,1,0,2,74.400002,306.600006,1,1,0.5764


In [14]:
# saving the best model (preprocessing pipeline + estimator) to GBC.pkl
save_model(best_model, 'GBC')

Transformation Pipeline and Model Successfully Saved


(Pipeline(memory=Memory(location=None),
          steps=[('numerical_imputer',
                  TransformerWrapper(exclude=None,
                                     include=['tenure', 'PhoneService',
                                              'Contract', 'PaymentMethod',
                                              'MonthlyCharges', 'TotalCharges'],
                                     transformer=SimpleImputer(add_indicator=False,
                                                               copy=True,
                                                               fill_value=None,
                                                               keep_empty_features=False,
                                                               missing_values=nan,
                                                               strategy='mean'))),
                 ('categorical_imputer',...
                                                               fill_value=None,
                        

In [15]:
# saving the model in pickle format
import pickle

with open('GBC_model.pk', 'wb') as f:
    pickle.dump(best_model, f)

In [17]:
# displaying the code from predict_churn.py
from IPython.display import Code

Code('predict_churn.py')

In [18]:
# running the python file
%run predict_churn.py

Transformation Pipeline and Model Successfully Loaded


Index(['tenure', 'PhoneService', 'Contract', 'PaymentMethod', 'MonthlyCharges',
       'TotalCharges', 'prediction_label', 'prediction_score'],
      dtype='object')
            tenure  PhoneService  Contract  PaymentMethod  MonthlyCharges  \
customerID                                                                  
9305-CKSKC      22             1         0              2       97.400002   
1452-KNGVK       8             0         1              1       77.300003   
6723-OKKJM      28             1         0              0       28.250000   
7832-POPKP      62             1         0              2      101.699997   
6348-TACGU      10             0         0              1       51.150002   

            TotalCharges  prediction_label  prediction_score  
customerID                                                    
9305-CKSKC    811.700012                 0            0.5018  
1452-KNGVK   1701.949951                 1            0.5681  
6723-OKKJM    250.899994                 0

# Summary

Write a short summary of the process and results here.

- Loaded the preprocessed churn data.
- Used pycaret to compare candidate models on the data across several metrics (accuracy, AUC, recall, precision, F1, and others).
- I chose AUC as the metric for selecting the best model; by AUC, the Gradient Boosting Classifier ranks highest (0.8424), narrowly ahead of Logistic Regression (0.8398).
- Saved the model to disk under a model name so that the trained model can be loaded later to predict on unseen/new data.
- The predict_churn.py file loads the saved model and makes predictions on the new data.
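
The contents of predict_churn.py are not shown in this notebook export, so the following is only a sketch of what such a module might look like, not the actual file. It assumes the model was saved under the name `GBC` as above, and that `prediction_score` is the probability of the *predicted* label (consistent with the outputs above, where a label of 0 comes with a score just over 0.5). The function names `churn_probability` and `predict_churn` are hypothetical.

```python
import pandas as pd


def churn_probability(predictions: pd.DataFrame) -> pd.Series:
    """Convert prediction_label / prediction_score into P(churn = 1).

    Assumes prediction_score is the probability of the predicted label,
    so rows predicted as 0 need 1 - score to get the churn probability.
    """
    label = predictions["prediction_label"]
    score = predictions["prediction_score"]
    # Keep the score where churn was predicted, otherwise use its complement
    prob = score.where(label == 1, 1 - score)
    return prob.rename("churn_probability")


def predict_churn(df: pd.DataFrame, model_name: str = "GBC") -> pd.Series:
    """Load the saved pycaret model and return P(churn) for each row of df."""
    # Imported here so churn_probability can be used without pycaret installed
    from pycaret.classification import load_model, predict_model

    model = load_model(model_name)  # reads '<model_name>.pkl'
    predictions = predict_model(model, data=df)
    return churn_probability(predictions)
```

A typical use would be to read new_churn_data.csv with `pd.read_csv(..., index_col='customerID')` and print the result of `predict_churn` on it.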
