### Name: Gousuddin Mohammad

# DS Automation Assignment

Using our prepared churn data from week 2:
- use pycaret to find an ML algorithm that performs best on the data
    - Choose a metric you think is best to use for finding the best model; by default, it is accuracy but it could be AUC, precision, recall, etc. The week 3 FTE has some information on these different metrics.
- save the model to disk
- create a Python script/file/module with a function that takes a pandas dataframe as an input and returns the probability of churn for each row in the dataframe
    - your Python file/function should print out the predictions for new data (new_churn_data.csv)
    - the true values for the new data are [1, 0, 0, 1, 0] if you're interested
- test your Python module and function with the new data, new_churn_data.csv
- write a short summary of the process and results at the end of this notebook
- upload this Jupyter Notebook and Python file to a GitHub repository, and turn in a link to the repository in the week 5 assignment dropbox

*Optional* challenges:
- return the probability of churn for each new prediction, and the percentile where that prediction is in the distribution of probability predictions from the training dataset (e.g. a high probability of churn like 0.78 might be at the 90th percentile)
- use other autoML packages, such as TPOT, H2O, MLBox, etc., and compare performance and features with pycaret
- create a class in your Python module to hold the functions that you created
- accept user input to specify a file using a tool such as Python's `input()` function, the `click` package for command-line arguments, or a GUI
- Use the unmodified churn data (new_unmodified_churn_data.csv) in your Python script. This will require adding the same preprocessing steps from week 2 since this data is like the original unmodified dataset from week 1.

In [None]:
!pip install pycaret


Collecting pycaret
  Downloading pycaret-3.3.0-py3-none-any.whl (485 kB)
Collecting scikit-learn>1.4.0 (from pycaret)
  Downloading scikit_learn-1.4.1.post1-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (12.1 MB)
Collecting pyod>=1.1.3 (from pycaret)
  Downloading pyod-1.1.3.tar.gz (160 kB)
  Preparing metadata (setup.py) ... done
Collectin

In [None]:
import pandas as pd
import numpy as np
from pycaret.classification import setup, compare_models, save_model

data = pd.read_csv('churn_data.csv', index_col='customerID')

# Fill missing TotalCharges with the column median (only this column is filled)
data['TotalCharges'] = data['TotalCharges'].fillna(data['TotalCharges'].median())

# Encode categorical columns as integers
yn_dict = {'Yes': 1, 'No': 0}
data['PhoneService'] = data['PhoneService'].replace(yn_dict)
data['PaymentMethod'] = data['PaymentMethod'].replace({'Electronic check': 3, 'Mailed check': 2, 'Bank transfer (automatic)': 1, 'Credit card (automatic)': 0})
data['Contract'] = data['Contract'].replace({'Month-to-month': 0, 'One year': 1, 'Two year': 2})
data['Churn'] = data['Churn'].replace(yn_dict)

# Treat zero tenure as missing and fill it with the median to avoid dividing by zero
data['tenure'] = data['tenure'].replace(0, np.nan)
data['tenure'] = data['tenure'].fillna(data['tenure'].median())
data['charge_per_tenure'] = data['TotalCharges'] / data['tenure']
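
The zero-tenure replacement in the cell above matters because `charge_per_tenure` divides by `tenure`. A minimal sketch on a made-up three-row frame (the values are illustrative and not taken from `churn_data.csv`) shows what pandas produces with and without that step:

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical values): the second customer has zero tenure.
toy = pd.DataFrame({"tenure": [10.0, 0.0, 20.0],
                    "TotalCharges": [500.0, 45.0, 1500.0]})

# Dividing by a zero tenure yields inf rather than raising an error.
naive = toy["TotalCharges"] / toy["tenure"]

# Same guard as the cell above: treat zero tenure as missing,
# then fill it with the median of the remaining values.
tenure = toy["tenure"].replace(0, np.nan)
tenure = tenure.fillna(tenure.median())
charge_per_tenure = toy["TotalCharges"] / tenure

print(naive.tolist())
print(charge_per_tenure.tolist())
```

Without the guard, the infinite value would be passed on to model training; with it, the zero-tenure row gets a finite ratio based on the median tenure.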


In [None]:
# Setup PyCaret
clf = setup(data=data, target='Churn', session_id=123, verbose=False)


# Compare models
best_model = compare_models(sort='AUC')

# Save the best model
save_model(best_model, 'best_model_churn')


| ID | Model | Accuracy | AUC | Recall | Prec. | F1 | Kappa | MCC | TT (Sec) |
|----|-------|----------|-----|--------|-------|----|-------|-----|----------|
| gbc | Gradient Boosting Classifier | 0.7929 | 0.8376 | 0.5031 | 0.6403 | 0.5623 | 0.4295 | 0.4355 | 0.749 |
| ada | Ada Boost Classifier | 0.7856 | 0.8357 | 0.4886 | 0.6239 | 0.5466 | 0.4092 | 0.4153 | 0.234 |
| lr | Logistic Regression | 0.7921 | 0.8335 | 0.5062 | 0.6384 | 0.5632 | 0.4294 | 0.4352 | 0.879 |
| lightgbm | Light Gradient Boosting Machine | 0.7836 | 0.8243 | 0.5092 | 0.6101 | 0.5546 | 0.4134 | 0.4166 | 0.375 |
| lda | Linear Discriminant Analysis | 0.7884 | 0.821 | 0.4947 | 0.6314 | 0.5534 | 0.4177 | 0.4238 | 0.037 |
| qda | Quadratic Discriminant Analysis | 0.7408 | 0.8171 | 0.7225 | 0.5082 | 0.5966 | 0.414 | 0.4281 | 0.038 |
| xgboost | Extreme Gradient Boosting | 0.7738 | 0.8156 | 0.5039 | 0.5856 | 0.5411 | 0.3924 | 0.3946 | 0.127 |
| nb | Naive Bayes | 0.7402 | 0.8132 | 0.7118 | 0.5077 | 0.5924 | 0.4096 | 0.4224 | 0.037 |
| rf | Random Forest Classifier | 0.7755 | 0.8019 | 0.4748 | 0.5974 | 0.5279 | 0.3835 | 0.3884 | 0.724 |
| et | Extra Trees Classifier | 0.7615 | 0.7792 | 0.4749 | 0.5603 | 0.5131 | 0.3568 | 0.3595 | 0.425 |



Transformation Pipeline and Model Successfully Saved


(Pipeline(memory=Memory(location=None),
          steps=[('numerical_imputer',
                  TransformerWrapper(exclude=None,
                                     include=['tenure', 'PhoneService',
                                              'Contract', 'PaymentMethod',
                                              'MonthlyCharges', 'TotalCharges',
                                              'charge_per_tenure'],
                                     transformer=SimpleImputer(add_indicator=False,
                                                               copy=True,
                                                               fill_value=None,
                                                               keep_empty_features=False,
                                                               missing_values=nan,
                                                               strategy='mean'))),
                 ('c...
                                             criterion='f

In [None]:
new_data = pd.read_csv('new_churn_data.csv')

In [None]:
new_data.head()

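|   | customerID | tenure | PhoneService | Contract | PaymentMethod | MonthlyCharges | TotalCharges | charge_per_tenure |
|---|------------|--------|--------------|----------|---------------|----------------|--------------|-------------------|
| 0 | 9305-CKSKC | 22 | 1 | 0 | 2 | 97.4 | 811.7 | 36.895455 |
| 1 | 1452-KNGVK | 8 | 0 | 1 | 1 | 77.3 | 1701.95 | 212.74375 |
| 2 | 6723-OKKJM | 28 | 1 | 0 | 0 | 28.25 | 250.9 | 8.960714 |
| 3 | 7832-POPKP | 62 | 1 | 0 | 2 | 101.7 | 3106.56 | 50.105806 |
| 4 | 6348-TACGU | 10 | 0 | 0 | 1 | 51.15 | 3440.97 | 344.097 |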


In [None]:
import pandas as pd
from pycaret.classification import load_model, predict_model

def preprocess_data(data):
    """Encode categorical columns with the same mappings used for training."""
    data = data.copy()  # avoid mutating the caller's dataframe
    yn_dict = {'Yes': 1, 'No': 0}
    data['PhoneService'] = data['PhoneService'].replace(yn_dict)
    data['PaymentMethod'] = data['PaymentMethod'].replace({'Electronic check': 3, 'Mailed check': 2, 'Bank transfer (automatic)': 1, 'Credit card (automatic)': 0})
    data['Contract'] = data['Contract'].replace({'Month-to-month': 0, 'One year': 1, 'Two year': 2})
    return data

def predict_churn(dataframe):
    """Load the saved pipeline and return its predictions for each row."""
    model = load_model('best_model_churn')
    predictions = predict_model(model, data=dataframe)
    # PyCaret 2.x names the output columns 'Label' and 'Score'; PyCaret 3.x
    # appends 'prediction_label' and 'prediction_score' to the input columns,
    # in which case the full dataframe is returned.
    return predictions[['Label', 'Score']] if 'Label' in predictions.columns else predictions

if __name__ == "__main__":
    # Load new data
    new_data = pd.read_csv('new_churn_data.csv')
    # Preprocess the new data (a no-op for columns that are already encoded)
    new_data_processed = preprocess_data(new_data)
    # Predict
    predictions = predict_churn(new_data_processed)
    print(predictions)


Transformation Pipeline and Model Successfully Loaded


   customerID  tenure  PhoneService  Contract  PaymentMethod  MonthlyCharges  \
0  9305-CKSKC      22             1         0              2       97.400002   
1  1452-KNGVK       8             0         1              1       77.300003   
2  6723-OKKJM      28             1         0              0       28.250000   
3  7832-POPKP      62             1         0              2      101.699997   
4  6348-TACGU      10             0         0              1       51.150002   

   TotalCharges  charge_per_tenure  prediction_label  prediction_score  
0    811.700012          36.895454                 0            0.5953  
1   1701.949951         212.743744                 0            0.9099  
2    250.899994           8.960714                 0            0.8546  
3   3106.560059          50.105808                 1            0.5477  
4   3440.969971         344.096985                 0            0.8580  
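
The assignment asks for the probability of churn, whereas `prediction_score` appears to be the probability of whichever class was predicted (an assumption about PyCaret 3's output that is worth confirming against its documentation). Under that assumption, a hypothetical helper, here called `churn_probability`, could convert the two columns into a single churn probability:

```python
import pandas as pd

def churn_probability(predictions: pd.DataFrame) -> pd.Series:
    """Return P(churn) per row, assuming prediction_score is the
    probability of the predicted class and label 1 means churn."""
    score = predictions["prediction_score"]
    is_churn = predictions["prediction_label"] == 1
    # Keep the score when churn was predicted; otherwise take its complement.
    return score.where(is_churn, 1 - score).round(4).rename("churn_probability")

# Labels and scores copied from the output above.
preds = pd.DataFrame({
    "prediction_label": [0, 0, 0, 1, 0],
    "prediction_score": [0.5953, 0.9099, 0.8546, 0.5477, 0.8580],
})
churn_probs = churn_probability(preds)
print(churn_probs.tolist())
```

PyCaret's `predict_model` also has a `raw_score` option that should return one score column per class, which would make this conversion unnecessary.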


# Summary


### Overview of the Process
1. **Initial Data Preparation**: The churn data was preprocessed first. Missing `TotalCharges` values were filled with the median; `PhoneService`, `Contract`, `PaymentMethod`, and `Churn` were encoded as integers; zero `tenure` values were replaced with the median; and a new feature, `charge_per_tenure` (`TotalCharges` divided by `tenure`), was added.

2. **Choosing a Model with PyCaret**: PyCaret's `compare_models` was used to train and cross-validate a range of classifiers. The models were ranked by AUC, and PyCaret also reported Accuracy, Recall, Precision, F1 Score, Kappa, and MCC for each.

3. **Top Performing Model**: The Gradient Boosting Classifier (GBC) ranked first, with the highest AUC (0.8376) and the highest Accuracy (0.7929) of the models compared, and an F1 Score of 0.5623.

4. **Saving the Model**: The best model and its preprocessing pipeline were saved to disk as `best_model_churn` using `save_model`.

5. **Applying the Model to Fresh Data**: The saved model was loaded and applied to new data (`new_churn_data.csv`), using the same categorical encodings as the training data.

### Performance on the New Dataset
Predictions were made on the new dataset. Each row receives a predicted class (`prediction_label`, where 1 means churn) and a score for that predicted class (`prediction_score`). The results were as follows:

| CustomerID | Tenure | PhoneService | Contract | PaymentMethod | MonthlyCharges | TotalCharges | ChargePerTenure | PredictionLabel | PredictionScore |
|------------|--------|--------------|----------|---------------|----------------|--------------|-----------------|-----------------|-----------------|
| 9305-CKSKC | 22     | 1            | 0        | 2             | 97.40          | 811.70       | 36.90           | 0               | 0.5953          |
| 1452-KNGVK | 8      | 0            | 1        | 1             | 77.30          | 1701.95      | 212.74          | 0               | 0.9099          |
| 6723-OKKJM | 28     | 1            | 0        | 0             | 28.25          | 250.90       | 8.96            | 0               | 0.8546          |
| 7832-POPKP | 62     | 1            | 0        | 2             | 101.70         | 3106.56      | 50.11           | 1               | 0.5477          |
| 6348-TACGU | 10     | 0            | 0        | 1             | 51.15          | 3440.97      | 344.10          | 0               | 0.8580          |

### Analysis
- The model predicted that 4 of the 5 customers in the new dataset will not churn. Three of those predictions have scores between 0.85 and 0.91, while the fourth (CustomerID: 9305-CKSKC) is much less certain at 0.5953.
- A single customer (CustomerID: 7832-POPKP) was predicted to churn, with a prediction score of 0.5477, only slightly above 0.5, so this is a borderline prediction.
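
Since the assignment lists the true values for these five customers as [1, 0, 0, 1, 0], the predictions can be scored directly. The sketch below is a plain-Python illustration rather than part of the original notebook. The churn probabilities in it are derived from the output by taking 1 - score for rows labelled 0, which assumes `prediction_score` is the probability of the predicted class.

```python
y_true = [1, 0, 0, 1, 0]   # true values given in the assignment
y_pred = [0, 0, 0, 1, 0]   # prediction_label from the output
# Derived churn probabilities (assumption: score belongs to the predicted class).
churn_prob = [0.4047, 0.0901, 0.1454, 0.5477, 0.1420]

# Confusion-matrix counts, treating churn (1) as the positive class.
tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
tn = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))

accuracy = (tp + tn) / len(y_true)
precision = tp / (tp + fp)
recall = tp / (tp + fn)

# AUC as the fraction of (churner, non-churner) pairs in which the churner
# has the higher churn probability; ties count as half.
pos = [p for t, p in zip(y_true, churn_prob) if t == 1]
neg = [p for t, p in zip(y_true, churn_prob) if t == 0]
auc = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg) / (len(pos) * len(neg))

print(accuracy, precision, recall, auc)
```

On these five rows this gives accuracy 0.8, precision 1.0, recall 0.5, and AUC 1.0. The one missed churner (9305-CKSKC) still has a higher churn probability than all three non-churners, so the ranking is correct even though the label is not. Five rows are too few to draw conclusions about performance in general.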

### Final Thoughts
- The Gradient Boosting Classifier ranked first among the compared models by AUC, the metric chosen for model selection.
- Predictions like these can be used to identify customers at risk of churning, so that retention efforts can be targeted at them.