# DS Automation Assignment

Using our prepared churn data from week 2:
- use pycaret to find an ML algorithm that performs best on the data
    - Choose a metric you think is best to use for finding the best model; by default, it is accuracy but it could be AUC, precision, recall, etc. The week 3 FTE has some information on these different metrics.
- save the model to disk
- create a Python script/file/module with a function that takes a pandas dataframe as an input and returns the probability of churn for each row in the dataframe
    - your Python file/function should print out the predictions for new data (new_churn_data.csv)
    - the true values for the new data are [1, 0, 0, 1, 0] if you're interested
- test your Python module and function with the new data, new_churn_data.csv
- write a short summary of the process and results at the end of this notebook
- upload this Jupyter Notebook and Python file to a GitHub repository, and turn in a link to the repository in the week 5 assignment dropbox

*Optional* challenges:
- return the probability of churn for each new prediction, and the percentile where that prediction is in the distribution of probability predictions from the training dataset (e.g. a high probability of churn like 0.78 might be at the 90th percentile)
- use other autoML packages, such as TPOT, H2O, MLBox, etc., and compare performance and features with pycaret
- create a class in your Python module to hold the functions that you created
- accept user input to specify a file using a tool such as Python's `input()` function, the `click` package for command-line arguments, or a GUI
- Use the unmodified churn data (new_unmodified_churn_data.csv) in your Python script. This will require adding the same preprocessing steps from week 2 since this data is like the original unmodified dataset from week 1.

LOAD DATA:

In [1]:
import pandas as pd

# read the CSV file into a DataFrame, using customerID as the index column
df = pd.read_csv('prepped_churn_data.csv', index_col='customerID')
df

customerID,tenure,PhoneService,Contract,PaymentMethod,MonthlyCharges,TotalCharges,Churn,yj_tenure,Total_tenure_ratio
7590-VHVEG,1.0,0,0,0,29.85,29.85,0,-1.644343,29.850000
5575-GNVDE,34.0,1,1,1,56.95,1889.50,0,0.297205,55.573529
3668-QPYBK,2.0,1,0,1,53.85,108.15,1,-1.495444,54.075000
7795-CFOCW,45.0,0,1,2,42.30,1840.75,0,0.646327,40.905556
9237-HQITU,2.0,1,0,0,70.70,151.65,1,-1.495444,75.825000
...,...,...,...,...,...,...,...,...,...
6840-RESVB,24.0,1,1,1,84.80,1990.50,0,-0.078084,82.937500
2234-XADUH,72.0,1,1,3,103.20,7362.90,0,1.342198,102.262500
4801-JZAZL,11.0,0,0,0,29.60,346.45,0,-0.725121,31.495455
8361-LTMKD,4.0,1,0,1,74.40,306.60,1,-1.265130,76.650000


In [2]:
del df['yj_tenure']

In [3]:
del df['Total_tenure_ratio']

AutoML WITH PYCARET:

In [4]:
pip install pycaret






In [5]:
pip install scikit-plot

Note: you may need to restart the kernel to use updated packages.


In [6]:
#importing specific functions from pycaret
from pycaret.classification import setup, compare_models, predict_model, save_model, load_model

In [7]:
# setup() must be called before any other PyCaret function;
# its two required arguments are the data and the target column
automl = setup(df, target='Churn', fold_shuffle=True, preprocess=False)

,Description,Value
0,session_id,4219
1,Target,Churn
2,Target Type,Binary
3,Label Encoded,"0: 0, 1: 1"
4,Original Data,"(7043, 7)"
5,Missing Values,False
6,Numeric Features,3
7,Categorical Features,3
8,Transformed Train Set,"(4930, 6)"
9,Transformed Test Set,"(2113, 6)"
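
The train and test sizes in the setup table can be checked with a little arithmetic. The sketch below assumes PyCaret's default `train_size` of 0.7, which is not set explicitly in the cell above; the variable names are only illustrative.

```python
import math

n_rows = 7043      # rows reported under "Original Data"
train_size = 0.7   # assumption: PyCaret's default training share

# The training share is rounded down to whole rows; the rest go to the test set
n_train = math.floor(train_size * n_rows)
n_test = n_rows - n_train

print(n_train, n_test)
```

The two counts agree with the "Transformed Train Set" and "Transformed Test Set" rows of the table.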


In [23]:
automl[7]

False

In [9]:
# compare_models() evaluates every available model with cross-validation;
# sort='F1' ranks the results by F1 score instead of the default, accuracy
best_model = compare_models(sort='F1')

,Model,Accuracy,AUC,Recall,Prec.,F1,Kappa,MCC,TT (Sec)
qda,Quadratic Discriminant Analysis,0.7454,0.8182,0.7351,0.5242,0.6116,0.4302,0.4441,0.01
nb,Naive Bayes,0.7146,0.8034,0.7537,0.4852,0.5901,0.3867,0.409,0.01
lr,Logistic Regression,0.7864,0.8298,0.5178,0.6342,0.5687,0.4289,0.4337,0.568
gbc,Gradient Boosting Classifier,0.7858,0.8345,0.5029,0.6364,0.561,0.4221,0.4277,0.115
catboost,CatBoost Classifier,0.7868,0.8324,0.4984,0.6402,0.5592,0.4218,0.4283,1.034
lda,Linear Discriminant Analysis,0.7832,0.8189,0.5,0.629,0.5561,0.4154,0.4207,0.012
ada,Ada Boost Classifier,0.783,0.8307,0.4903,0.6337,0.5509,0.4113,0.4183,0.141
ridge,Ridge Classifier,0.786,0.0,0.4531,0.658,0.5349,0.4024,0.4151,0.01
rf,Random Forest Classifier,0.7661,0.8017,0.4791,0.5879,0.5275,0.3743,0.3781,0.151
et,Extra Trees Classifier,0.7531,0.7775,0.4918,0.5542,0.5203,0.3551,0.3567,0.135
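
To see why sorting by F1 puts QDA ahead of logistic regression even though its accuracy is lower, it helps to recompute F1 from the precision and recall columns. This is a minimal sketch with a hand-written helper (`f1_from` is a made-up name, not a PyCaret function), using values copied from the table above.

```python
def f1_from(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Precision and recall copied from the qda and lr rows of the table
qda_f1 = f1_from(precision=0.5242, recall=0.7351)
lr_f1 = f1_from(precision=0.6342, recall=0.5178)

print(round(qda_f1, 3), round(lr_f1, 3))
```

The results land close to, but not exactly on, the table's F1 column, most likely because the table averages the metric fold by fold rather than computing it from the averaged precision and recall. Either way, QDA's much higher recall outweighs its lower precision.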


In [10]:
#used to display the best model
best_model

QuadraticDiscriminantAnalysis(priors=None, reg_param=0.0,
                              store_covariance=False, tol=0.0001)

In [11]:
# iloc selects rows by integer position; a single position returns a 1-D Series
df.iloc[-1].shape  # shape is an attribute (not a method) holding the dimensions

(7,)

In [12]:
# a slice such as -2:-1 returns a 2-D DataFrame, here with a single row
df.iloc[-2:-1].shape

(1, 7)
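
The difference between a single position and a slice is easier to see on a tiny frame. The DataFrame below is a stand-in with made-up index labels, not the churn data.

```python
import pandas as pd

# Tiny stand-in frame with hypothetical index labels
toy = pd.DataFrame(
    {"tenure": [1.0, 34.0, 2.0], "MonthlyCharges": [29.85, 56.95, 53.85]},
    index=["a", "b", "c"],
)

last_row = toy.iloc[-1]        # single integer position -> 1-D Series
second_last = toy.iloc[-2:-1]  # slice (end excluded) -> 2-D DataFrame with one row

print(type(last_row).__name__, last_row.shape)
print(type(second_last).__name__, second_last.shape)
print(list(second_last.index))
```

Because the end of a slice is excluded, `-2:-1` selects only the second-to-last row, which is why the cell above returns shape `(1, 7)` rather than `(2, 7)`.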

In [13]:
#predicts label and score using a trained model
predict_model(best_model, df.iloc[-2:-1])

customerID,tenure,PhoneService,Contract,PaymentMethod,MonthlyCharges,TotalCharges,Churn,Label,Score
8361-LTMKD,4.0,1,0,1,74.4,306.6,1,1,0.9089


SAVING AND LOADING OUR MODEL:

In [14]:
# save_model() saves the trained (best) model together with its transformation pipeline
save_model(best_model, 'QDA')

Transformation Pipeline and Model Succesfully Saved


(Pipeline(memory=None,
          steps=[('dtypes',
                  DataTypes_Auto_infer(categorical_features=[],
                                       display_types=True, features_todrop=[],
                                       id_columns=[],
                                       ml_usecase='classification',
                                       numerical_features=[], target='Churn',
                                       time_features=[])),
                 ['trained_model',
                  QuadraticDiscriminantAnalysis(priors=None, reg_param=0.0,
                                                store_covariance=False,
                                                tol=0.0001)]],
          verbose=False),
 'QDA.pkl')

In [15]:
# the pickle module serializes Python objects to disk and loads them back
import pickle
# open the file for writing ('w') in binary mode ('b'); f is the file object
with open('QDA_model.pk', 'wb') as f:
    pickle.dump(best_model, f)

In [16]:
# open the file for reading ('r') in binary mode ('b'); f is the file object
with open('QDA_model.pk', 'rb') as f:
    loaded_model = pickle.load(f)
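
The same save-and-load round trip can be tried without a trained model. This is a small sketch that pickles a plain dictionary as a stand-in for the model and writes to a temporary directory; the file name is hypothetical.

```python
import os
import pickle
import tempfile

# Stand-in for a fitted model: any picklable Python object behaves the same way
model_like = {"name": "QDA", "reg_param": 0.0}

path = os.path.join(tempfile.mkdtemp(), "demo_model.pk")  # hypothetical file name

with open(path, "wb") as f:    # 'w' = write, 'b' = binary
    pickle.dump(model_like, f)
closed_after_with = f.closed    # the with block has already closed the file

with open(path, "rb") as f:    # 'r' = read, 'b' = binary
    restored = pickle.load(f)

print(closed_after_with, restored == model_like)
```

Only unpickle files you created yourself or otherwise trust, since loading a pickle can execute arbitrary code.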

In [17]:
new_data = df.iloc[-2:-1].copy()  # copy() so the original DataFrame is left untouched
new_data.drop('Churn', axis=1, inplace=True)  # inplace=True modifies new_data directly and returns None
loaded_model.predict(new_data)

array([1])
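
The two ways of calling `drop()` behave differently, which is worth checking on a one-row frame. The values below are copied from the row used above, but the frame itself is only an illustration.

```python
import pandas as pd

row = pd.DataFrame({"tenure": [4.0], "MonthlyCharges": [74.4], "Churn": [1]})

# Default (inplace=False): returns a new frame and leaves `row` unchanged
features = row.drop("Churn", axis=1)

# inplace=True: modifies the object itself and returns None
working = row.copy()
result = working.drop("Churn", axis=1, inplace=True)

print(list(row.columns))
print(list(features.columns), list(working.columns), result)
```

Assigning the returned frame, as in the first call, avoids the need for `copy()` and is often the simpler choice.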

In [18]:
#load_model() is used for loading the trained model
loaded_nb = load_model('QDA')

Transformation Pipeline and Model Successfully Loaded


In [19]:
#predict_model() is used for making predictions on new_data
predict_model(loaded_nb, new_data)

customerID,tenure,PhoneService,Contract,PaymentMethod,MonthlyCharges,TotalCharges,Label,Score
8361-LTMKD,4.0,1,0,1,74.4,306.6,1,0.9089


MAKING A PYTHON MODULE TO MAKE PREDICTIONS:

In [20]:
from IPython.display import Code
# predict_churn.py (written in Visual Studio) predicts churn for each row of new_churn_data.csv
Code('predict_churn.py')

In [21]:
# the %run magic command runs the script and prints its predictions for the new data
%run predict_churn.py

Transformation Pipeline and Model Successfully Loaded
predictions:
customerID
9305-CKSKC    Yes
1452-KNGVK     No
6723-OKKJM     No
7832-POPKP     No
6348-TACGU     No
Name: churn_prediction, dtype: object
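
The source of predict_churn.py is not reproduced in this export, so the following is not that file. It is a rough sketch of one way a function could take a DataFrame and return a churn probability per row, assuming a model that exposes a scikit-learn style `predict_proba`. `StubModel`, `churn_probabilities`, and `churn_labels` are hypothetical names, and the stub's probabilities are invented for illustration.

```python
import pandas as pd


class StubModel:
    """Hypothetical stand-in for a fitted classifier with predict_proba."""

    def predict_proba(self, X):
        # Invented rule for illustration: churn probability falls as tenure grows
        churn = [1 / (1 + t / 10) for t in X["tenure"]]
        return [[1 - p, p] for p in churn]


def churn_probabilities(model, df: pd.DataFrame) -> pd.Series:
    """Return the probability of churn (class 1) for each row of df."""
    proba = model.predict_proba(df)
    return pd.Series([row[1] for row in proba], index=df.index,
                     name="churn_probability")


def churn_labels(probabilities: pd.Series, threshold: float = 0.5) -> pd.Series:
    """Map probabilities to 'Yes'/'No' labels in the style of the output above."""
    labels = ["Yes" if p >= threshold else "No" for p in probabilities]
    return pd.Series(labels, index=probabilities.index, name="churn_prediction")


new_rows = pd.DataFrame({"tenure": [2.0, 40.0]}, index=["row-a", "row-b"])
probs = churn_probabilities(StubModel(), new_rows)
print(probs)
print(churn_labels(probs))
```

Returning the probabilities and deriving the labels in a separate step keeps the threshold adjustable, which matters when false negatives are costlier than false positives.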


# Summary


The week 5 assignment uses the prepped churn data to set up AutoML with PyCaret (whose setup() output summarizes the configuration), compare models with cross-validation, predict a label and score with the best model, save and load that model with both PyCaret and Python's built-in pickle module, make predictions on a row selected with iloc, and create a Python file that makes predictions on the new churn data. I imported pandas and read the prepped churn data into a DataFrame with read_csv.

1. Before setting up AutoML with PyCaret, I installed pycaret and scikit-plot with pip and imported setup, compare_models, predict_model, save_model, and load_model from pycaret.classification. The setup() function initializes the training environment and creates the transformation pipeline; all preprocessing and data transformations are configured through it. It takes two mandatory parameters, the data and the target. Here, the data is the prepped churn data read into a pandas DataFrame and the target is Churn. All other parameters are optional. When preprocess is set to False, no transformations are applied other than the train/test split, so the data must already be ready for modeling. fold_shuffle=True turns on the shuffle parameter of cross-validation (CV); it is False by default.

2. AutoML is run with the compare_models() function to find the best model. This function evaluates every model available in the model library using cross-validation and ranks them. The sort argument of compare_models selects the scoring metric used for the ranking; by default it is accuracy. I sorted the table by F1 score. Based on the output, the best model is QuadraticDiscriminantAnalysis (qda), which has the highest F1 score (0.6116, about 61%) and one of the shortest training times (0.01 s), while xgboost and lightgbm take the longest to run. Entering best_model in a cell displays the best model. I tried other metrics too, but the score (0.9) and the predictions for new_churn_data were better with F1 as the metric.

3. The iloc indexer fetches records from a DataFrame by integer position. It can retrieve a particular value from a row and column using their positions, and it works with integer positions rather than index labels. df.shape gives the number of rows and columns. The general form is df.iloc[start_row:end_row, start_col:end_col], where end_row and end_col are excluded. The index [-2:-1] is a range (a slice), while [-1] is a scalar, so df.iloc[-1] returns a 1-D Series with shape (7,) and df.iloc[-2:-1] returns a 2-D DataFrame with shape (1, 7). The predict_model() function makes predictions on new data (here, the row fetched with iloc) using the trained model (QDA). It adds two new columns, Label and Score. Label holds the predicted class and Score holds the associated probability; a probability of 0.5 or higher is rounded up to a label of 1. For the row fetched with iloc, the Score is 0.9089 and the Label is 1.

4. The save_model() function in PyCaret saves the trained model (QDA). Python's built-in pickle module can also save and load objects as binary data. I imported pickle and used the built-in open function to open a file named QDA_model.pk for writing ('w') in binary format ('b'). The file object is stored in the variable f. The with statement closes the file automatically when the block exits; otherwise f.close() would have to be called. The model is written to the file with pickle.dump().

5. The built-in open function opens QDA_model.pk for reading ('r') in binary format ('b'). The file object is again stored in f and closed automatically when the with block exits. pickle.load() loads the saved model. The new_data row is fetched with iloc, and the Churn column is dropped with the drop() method. axis=0 refers to rows and axis=1 to columns, so axis is set to 1 to delete the Churn column. copy() creates a copy of the object. With inplace=True, drop() returns nothing and modifies the object directly (here, deleting the Churn column). With inplace=False (the default), it returns a modified copy, which must be assigned to a variable to be kept. The QDA model loaded with pickle.load() is then used to make predictions on new_data.

6. Saving and loading can also be done with PyCaret: save_model() saves the trained model and load_model() loads it. predict_model() then makes predictions on new_data with the loaded model (QDA), adding the Label and Score columns. For new_data the Score is about 0.9 (0.9089) and the Label is 1, which matches the true Churn value of 1 for that row.

7. I imported Code from the IPython.display module to display source code. The Python file predict_churn.py was created in Visual Studio to predict churn for each row of new_churn_data. Code() displays the file's source, and the %run magic command runs predict_churn.py and prints the predictions for new_churn_data. The predictions are Yes, No, No, No, No (1, 0, 0, 0, 0), and the true values are 1, 0, 0, 1, 0. The model therefore gets four of the five rows right, with one false negative (the fourth customer). The QDA model performs reasonably on this small sample but is not perfect.
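
The comparison between the five predictions and the true values can be made concrete with a few counts. This is a minimal sketch in plain Python; with only five rows, the resulting metrics are a rough indication at best.

```python
predicted = [1, 0, 0, 0, 0]  # Yes, No, No, No, No from predict_churn.py
actual = [1, 0, 0, 1, 0]     # true values given in the assignment

pairs = list(zip(actual, predicted))
tp = sum(a == 1 and p == 1 for a, p in pairs)  # churners correctly flagged
tn = sum(a == 0 and p == 0 for a, p in pairs)  # non-churners correctly cleared
fp = sum(a == 0 and p == 1 for a, p in pairs)  # false alarms
fn = sum(a == 1 and p == 0 for a, p in pairs)  # missed churners

accuracy = (tp + tn) / len(pairs)
precision = tp / (tp + fp) if tp + fp else 0.0
recall = tp / (tp + fn) if tp + fn else 0.0
f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0

print(tp, tn, fp, fn)
print(accuracy, precision, recall, round(f1, 3))
```

On this sample the accuracy is 0.8, but recall is only 0.5 because one of the two actual churners was missed.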