# DS Automation Assignment

Using our prepared churn data from week 2:
- use pycaret to find an ML algorithm that performs best on the data
    - Choose a metric you think is best to use for finding the best model; by default, it is accuracy but it could be AUC, precision, recall, etc. The week 3 FTE has some information on these different metrics.
- save the model to disk
- create a Python script/file/module with a function that takes a pandas dataframe as an input and returns the probability of churn for each row in the dataframe
    - your Python file/function should print out the predictions for new data (new_churn_data.csv)
    - the true values for the new data are [1, 0, 0, 1, 0] if you're interested
- test your Python module and function with the new data, new_churn_data.csv
- write a short summary of the process and results at the end of this notebook
- upload this Jupyter Notebook and Python file to a GitHub repository, and turn in a link to the repository in the week 5 assignment dropbox

*Optional* challenges:
- return the probability of churn for each new prediction, and the percentile where that prediction is in the distribution of probability predictions from the training dataset (e.g. a high probability of churn like 0.78 might be at the 90th percentile)
- use other autoML packages, such as TPOT, H2O, MLBox, etc, and compare performance and features with pycaret
- create a class in your Python module to hold the functions that you created
- accept user input to specify a file using a tool such as Python's `input()` function, the `click` package for command-line arguments, or a GUI
- Use the unmodified churn data (new_unmodified_churn_data.csv) in your Python script. This will require adding the same preprocessing steps from week 2 since this data is like the original unmodified dataset from week 1.
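
The percentile idea in the first optional challenge can be sketched with the standard library alone. This is only an illustration: the helper name `percentile_of` and the `train_probs` values are made up for the example and are not taken from the churn data.

```python
from bisect import bisect_right


def percentile_of(score, training_scores):
    """Percentage of training scores that are less than or equal to `score`."""
    ordered = sorted(training_scores)
    return 100.0 * bisect_right(ordered, score) / len(ordered)


# Hypothetical churn probabilities predicted on a training set (made-up values).
train_probs = [0.05, 0.10, 0.20, 0.30, 0.40, 0.50, 0.60, 0.70, 0.80, 0.90]

# Where does a new prediction of 0.78 fall in that distribution?
pct = percentile_of(0.78, train_probs)
print(f"0.78 is at the {pct:.0f}th percentile of the training predictions")
```

In practice the training scores would come from running the saved model on the training dataset rather than from a hand-written list.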

# Load Data 

In [1]:
import pandas as pd

In [2]:
df = pd.read_csv('clean_churn_data.csv')

In [3]:
df.head()

Unnamed: 0.1,Unnamed: 0,customerID,tenure,PhoneService,Contract,PaymentMethod,MonthlyCharges,TotalCharges,Churn,AvgMonthlyCharges
0,0,7590-VHVEG,1,0,0,1,29.85,29.85,0,29.85
1,1,5575-GNVDE,34,1,1,0,56.95,1889.5,0,55.573529
2,2,3668-QPYBK,2,1,0,0,53.85,108.15,1,54.075
3,3,7795-CFOCW,45,0,1,2,42.3,1840.75,0,40.905556
4,4,9237-HQITU,2,1,0,1,70.7,151.65,1,75.825


In [4]:
df.isna().sum()

Unnamed: 0            0
customerID            0
tenure                0
PhoneService          0
Contract              0
PaymentMethod         0
MonthlyCharges        0
TotalCharges          0
Churn                 0
AvgMonthlyCharges    11
dtype: int64

In [5]:
df_copy = df.copy()

In [6]:
# Drop AvgMonthlyCharges: it is not part of the original data provided and it is similar to MonthlyCharges.
# Also drop the leftover index column and the customerID column.
df.drop(['AvgMonthlyCharges', 'Unnamed: 0', 'customerID'], axis=1, inplace=True)

In [7]:
df

Unnamed: 0,tenure,PhoneService,Contract,PaymentMethod,MonthlyCharges,TotalCharges,Churn
0,1,0,0,1,29.85,29.85,0
1,34,1,1,0,56.95,1889.50,0
2,2,1,0,0,53.85,108.15,1
3,45,0,1,2,42.30,1840.75,0
4,2,1,0,1,70.70,151.65,1
...,...,...,...,...,...,...,...
7038,24,1,1,0,84.80,1990.50,0
7039,72,1,1,3,103.20,7362.90,0
7040,11,0,0,1,29.60,346.45,0
7041,4,1,0,0,74.40,306.60,1


# use pycaret to find an ML algorithm that performs best on the data

In [8]:
from pycaret.classification import *

In [9]:
automl = setup(df, target = 'Churn')

Unnamed: 0,Description,Value
0,Session id,4843
1,Target,Churn
2,Target type,Binary
3,Original data shape,"(7043, 7)"
4,Transformed data shape,"(7043, 7)"
5,Transformed train set shape,"(4930, 7)"
6,Transformed test set shape,"(2113, 7)"
7,Numeric features,6
8,Preprocess,True
9,Imputation type,simple


# Choose a metric you think is best to use for finding the best model; by default, it is accuracy but it could be AUC, precision, recall, etc. The week 3 FTE has some information on these different metrics.

In [10]:
best_model = compare_models(fold=2, sort='Recall')

Unnamed: 0,Model,Accuracy,AUC,Recall,Prec.,F1,Kappa,MCC,TT (Sec)
nb,Naive Bayes,0.713,0.8044,0.7508,0.4742,0.5813,0.3794,0.4026,0.455
qda,Quadratic Discriminant Analysis,0.7529,0.8223,0.737,0.5245,0.6128,0.4389,0.4526,0.01
svm,SVM - Linear Kernel,0.6523,0.0,0.6988,0.4687,0.5154,0.2902,0.3418,0.465
lightgbm,Light Gradient Boosting Machine,0.7757,0.8223,0.5145,0.5885,0.549,0.4006,0.4022,0.305
lr,Logistic Regression,0.7921,0.8319,0.5069,0.6359,0.5639,0.4297,0.4346,0.69
dt,Decision Tree Classifier,0.7325,0.6614,0.5023,0.496,0.4991,0.3166,0.3166,0.015
lda,Linear Discriminant Analysis,0.7822,0.8168,0.4809,0.6146,0.5394,0.3995,0.4048,0.01
gbc,Gradient Boosting Classifier,0.788,0.8349,0.4794,0.6327,0.5454,0.4106,0.4174,0.09
rf,Random Forest Classifier,0.7677,0.8051,0.4778,0.5747,0.5215,0.37,0.3729,0.08
et,Extra Trees Classifier,0.7529,0.78,0.4725,0.5393,0.5036,0.3401,0.3415,0.075


# save the model to disk

In [11]:
save_model(best_model, 'best_model_recall')

Transformation Pipeline and Model Successfully Saved


(Pipeline(memory=Memory(location=None),
          steps=[('numerical_imputer',
                  TransformerWrapper(exclude=None,
                                     include=['tenure', 'PhoneService',
                                              'Contract', 'PaymentMethod',
                                              'MonthlyCharges', 'TotalCharges'],
                                     transformer=SimpleImputer(add_indicator=False,
                                                               copy=True,
                                                               fill_value=None,
                                                               missing_values=nan,
                                                               strategy='mean',
                                                               verbose='deprecated'))),
                 ('categorical_imputer',
                  TransformerWrapper(exclude=None, include=[],
                                     transformer=

# create a Python script/file/module with a function that takes a pandas dataframe as an input and returns the probability of churn for each row in the dataframe
## your Python file/function should print out the predictions for new data (new_churn_data.csv)
## the true values for the new data are [1, 0, 0, 1, 0] if you're interested

In [12]:
def prob_churn(data):
    """Return the PyCaret predictions (label and score) for each row of `data`."""
    # load_model adds the .pkl extension to the name that was given to save_model
    loaded_model = load_model('best_model_recall')

    # predict_model appends the prediction_label and prediction_score columns
    prediction = predict_model(loaded_model, data)

    return prediction

data = pd.read_csv('new_churn_data.csv')

output = prob_churn(data)
#output_with_id = pd.concat([df_copy['customerID'], output], axis=1)

print(output)
        


Transformation Pipeline and Model Successfully Loaded


   customerID  tenure  PhoneService  Contract  PaymentMethod  MonthlyCharges  \
0  9305-CKSKC      22             1         0              2       97.400002   
1  1452-KNGVK       8             0         1              1       77.300003   
2  6723-OKKJM      28             1         0              0       28.250000   
3  7832-POPKP      62             1         0              2      101.699997   
4  6348-TACGU      10             0         0              1       51.150002   

   TotalCharges  charge_per_tenure  prediction_label  prediction_score  
0    811.700012          36.895454                 1            0.8036  
1   1701.949951         212.743744                 0            0.7563  
2    250.899994           8.960714                 0            0.5168  
3   3106.560059          50.105808                 0            0.7582  
4   3440.969971         344.096985                 1            0.5606  
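
As far as I understand PyCaret's output, `prediction_score` is the probability of the predicted label rather than the probability of churn, so a row predicted as 0 with a score of 0.7563 would have a churn probability of about 0.2437. The sketch below shows that conversion under that assumption. The helper name `churn_probability` is made up for illustration, and the labels and scores are copied from the output above.

```python
def churn_probability(labels, scores):
    """Convert (predicted label, score of that label) pairs into P(churn)."""
    return [
        round(score if label == 1 else 1 - score, 4)
        for label, score in zip(labels, scores)
    ]


# Labels and scores copied from the printed predictions for new_churn_data.csv.
labels = [1, 0, 0, 0, 1]
scores = [0.8036, 0.7563, 0.5168, 0.7582, 0.5606]

churn_probs = churn_probability(labels, scores)
print(churn_probs)
```

The same function also works on the `prediction_label` and `prediction_score` columns of the returned dataframe, since `zip` accepts pandas Series.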


In [13]:

print("Predicted values for new data:")
print(output['prediction_label'])

print("\n Expected True values for new data:")
true_values = [1, 0, 0, 1, 0]
print("\n ",true_values)

Predicted values for new data:
0    1
1    0
2    0
3    0
4    1
Name: prediction_label, dtype: int64

 Expected True values for new data:

  [1, 0, 0, 1, 0]
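
To make the comparison above easier to read, here is a small sketch that counts where the predicted labels and the true values agree. The `predicted` list is copied from the `prediction_label` output above; the variable names are only illustrative.

```python
true_values = [1, 0, 0, 1, 0]
predicted = [1, 0, 0, 0, 1]  # prediction_label values from the output above

pairs = list(zip(true_values, predicted))

# Rows where the prediction disagrees with the true value
mismatched_rows = [i for i, (t, p) in enumerate(pairs) if t != p]

# Confusion counts, treating churn (1) as the positive class
tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # churners caught
fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # churners missed
fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false alarms
tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # correct non-churners

print("Mismatched rows:", mismatched_rows)
print("TP:", tp, "FN:", fn, "FP:", fp, "TN:", tn)
```

With only five rows these counts say little about the model overall, but they show that one churner was missed and one non-churner was flagged.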


# Summary

For this assignment, I started by loading my data and checking for missing values in case I had missed any in the week 2 assignment file. There were some missing values, so I dropped the "AvgMonthlyCharges" column: it had 11 missing values and its content is similar to the "MonthlyCharges" column. I then dropped two more columns: "Unnamed: 0", which wasn't needed, and "customerID", which contains both letters and digits and sometimes causes issues in my code when I want to work only with numeric values.

With a clean dataset, I used PyCaret's AutoML to find the best machine learning algorithm. Using the setup function, I specified my data and my target variable, "Churn." I then ran compare_models sorted by recall. I think recall is the best metric for churn because it prioritizes the identification of churn cases, aiming to minimize false negatives and capture as many actual churners as possible. Naive Bayes ranked first with a recall of 0.7508.
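
The trade-off between accuracy and recall can be sketched with a toy example. The ten labels below are hypothetical and are not taken from the churn data; `accuracy` and `recall` are small helpers written for this illustration only.

```python
def accuracy(y_true, y_pred):
    """Share of all rows that were predicted correctly."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def recall(y_true, y_pred):
    """Share of actual churners (label 1) that the model caught."""
    true_pos = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    actual_pos = sum(1 for t in y_true if t == 1)
    return true_pos / actual_pos


# Hypothetical data: 3 churners out of 10 customers.
y_true = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]

# A cautious model flags one customer; an eager model flags six.
cautious = [1, 0, 0, 0, 0, 0, 0, 0, 0, 0]
eager = [1, 1, 1, 1, 1, 1, 0, 0, 0, 0]

for name, preds in [("cautious", cautious), ("eager", eager)]:
    print(name, "accuracy:", accuracy(y_true, preds),
          "recall:", round(recall(y_true, preds), 3))
```

Here the cautious model has the higher accuracy but misses two of the three churners, while the eager model catches all of them. The leaderboard shows a similar pattern: Logistic Regression has higher accuracy than Naive Bayes but lower recall.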

After that step, I saved my model to disk for later use. To use it on new data, my function "prob_churn" loads the saved model with the "load_model" function and then uses the "predict_model" function to predict the outcome for the new data.

As you can see from the result of using "new_churn_data.csv" as my data, I get predicted values of 10001 instead of the given true values of 10010. My guess is that the "fold" setting might be affecting the true positives, but I tried changing it to "fold = 20" and got 10101, which still does not match.
