# DS Automation Assignment

Using our prepared churn data from week 2:
- use pycaret to find an ML algorithm that performs best on the data
    - Choose a metric you think is best to use for finding the best model; by default, it is accuracy but it could be AUC, precision, recall, etc. The week 3 FTE has some information on these different metrics.
- save the model to disk
- create a Python script/file/module with a function that takes a pandas dataframe as an input and returns the probability of churn for each row in the dataframe
    - your Python file/function should print out the predictions for new data (new_churn_data.csv)
    - the true values for the new data are [1, 0, 0, 1, 0] if you're interested
- test your Python module and function with the new data, new_churn_data.csv
- write a short summary of the process and results at the end of this notebook
- upload this Jupyter Notebook and Python file to a Github repository, and turn in a link to the repository in the week 5 assignment dropbox

*Optional* challenges:
- return the probability of churn for each new prediction, and the percentile where that prediction is in the distribution of probability predictions from the training dataset (e.g. a high probability of churn like 0.78 might be at the 90th percentile)
- use other autoML packages, such as TPOT, H2O, MLBox, etc, and compare performance and features with pycaret
- create a class in your Python module to hold the functions that you created
- accept user input to specify a file using a tool such as Python's `input()` function, the `click` package for command-line arguments, or a GUI
- Use the unmodified churn data (new_unmodified_churn_data.csv) in your Python script. This will require adding the same preprocessing steps from week 2 since this data is like the original unmodified dataset from week 1.
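
The percentile challenge in the list above can be sketched in plain Python. This is a minimal illustration, assuming the churn probabilities predicted for the training set are already available as a list. The function name `percentile_of` and the sample values are hypothetical and are not part of the assignment files.

```python
from bisect import bisect_right


def percentile_of(probability, training_probabilities):
    """Return the percentage of training probabilities that are <= probability."""
    ordered = sorted(training_probabilities)
    # bisect_right counts how many sorted values are <= probability.
    rank = bisect_right(ordered, probability)
    return 100.0 * rank / len(ordered)


# Hypothetical training-set churn probabilities (illustration only).
training_probabilities = [0.05, 0.10, 0.20, 0.30, 0.40, 0.50, 0.60, 0.70, 0.75, 0.90]

new_prediction_percentile = percentile_of(0.78, training_probabilities)
print(new_prediction_percentile)
```

With these made-up values, a churn probability of 0.78 sits above nine of the ten training probabilities, which mirrors the "0.78 at the 90th percentile" example in the challenge text.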

# Mellors MSDS 600 Week 5 Assignment

# Section 1: Loading and Cleaning the Data

In [1]:
import pandas as pd

In [2]:
raw_churn_df = pd.read_csv('churn_data.csv', index_col='customerID')

In [3]:
raw_churn_df

customerID,tenure,PhoneService,Contract,PaymentMethod,MonthlyCharges,TotalCharges,Churn
7590-VHVEG,1,No,Month-to-month,Electronic check,29.85,29.85,No
5575-GNVDE,34,Yes,One year,Mailed check,56.95,1889.50,No
3668-QPYBK,2,Yes,Month-to-month,Mailed check,53.85,108.15,Yes
7795-CFOCW,45,No,One year,Bank transfer (automatic),42.30,1840.75,No
9237-HQITU,2,Yes,Month-to-month,Electronic check,70.70,151.65,Yes
...,...,...,...,...,...,...,...
6840-RESVB,24,Yes,One year,Mailed check,84.80,1990.50,No
2234-XADUH,72,Yes,One year,Credit card (automatic),103.20,7362.90,No
4801-JZAZL,11,No,Month-to-month,Electronic check,29.60,346.45,No
8361-LTMKD,4,Yes,Month-to-month,Mailed check,74.40,306.60,Yes


In [4]:
raw_churn_df['PhoneService'].unique().tolist()

['No', 'Yes']

In [5]:
raw_churn_df['Contract'].unique().tolist()

['Month-to-month', 'One year', 'Two year']

In [6]:
raw_churn_df['PaymentMethod'].unique().tolist()

['Electronic check',
 'Mailed check',
 'Bank transfer (automatic)',
 'Credit card (automatic)']

In [7]:
raw_churn_df['PhoneService'] = raw_churn_df['PhoneService'].replace({'Yes':1, 'No':0})
raw_churn_df['Contract'] = raw_churn_df['Contract'].replace({'Month-to-month':0, 'One year':1, 'Two year':2})
raw_churn_df['PaymentMethod'] = raw_churn_df['PaymentMethod'].replace({'Electronic check':0, 'Mailed check':1,'Bank transfer (automatic)':2 ,'Credit card (automatic)':3})

In [8]:
raw_churn_df

customerID,tenure,PhoneService,Contract,PaymentMethod,MonthlyCharges,TotalCharges,Churn
7590-VHVEG,1,0,0,0,29.85,29.85,No
5575-GNVDE,34,1,1,1,56.95,1889.50,No
3668-QPYBK,2,1,0,1,53.85,108.15,Yes
7795-CFOCW,45,0,1,2,42.30,1840.75,No
9237-HQITU,2,1,0,0,70.70,151.65,Yes
...,...,...,...,...,...,...,...
6840-RESVB,24,1,1,1,84.80,1990.50,No
2234-XADUH,72,1,1,3,103.20,7362.90,No
4801-JZAZL,11,0,0,0,29.60,346.45,No
8361-LTMKD,4,1,0,1,74.40,306.60,Yes


In [9]:
churn_df = raw_churn_df.copy()

In [10]:
churn_df

customerID,tenure,PhoneService,Contract,PaymentMethod,MonthlyCharges,TotalCharges,Churn
7590-VHVEG,1,0,0,0,29.85,29.85,No
5575-GNVDE,34,1,1,1,56.95,1889.50,No
3668-QPYBK,2,1,0,1,53.85,108.15,Yes
7795-CFOCW,45,0,1,2,42.30,1840.75,No
9237-HQITU,2,1,0,0,70.70,151.65,Yes
...,...,...,...,...,...,...,...
6840-RESVB,24,1,1,1,84.80,1990.50,No
2234-XADUH,72,1,1,3,103.20,7362.90,No
4801-JZAZL,11,0,0,0,29.60,346.45,No
8361-LTMKD,4,1,0,1,74.40,306.60,Yes


In [11]:
churn_df

customerID,tenure,PhoneService,Contract,PaymentMethod,MonthlyCharges,TotalCharges,Churn
7590-VHVEG,1,0,0,0,29.85,29.85,No
5575-GNVDE,34,1,1,1,56.95,1889.50,No
3668-QPYBK,2,1,0,1,53.85,108.15,Yes
7795-CFOCW,45,0,1,2,42.30,1840.75,No
9237-HQITU,2,1,0,0,70.70,151.65,Yes
...,...,...,...,...,...,...,...
6840-RESVB,24,1,1,1,84.80,1990.50,No
2234-XADUH,72,1,1,3,103.20,7362.90,No
4801-JZAZL,11,0,0,0,29.60,346.45,No
8361-LTMKD,4,1,0,1,74.40,306.60,Yes


In [12]:
churn_df.tenure.unique()

array([ 1, 34,  2, 45,  8, 22, 10, 28, 62, 13, 16, 58, 49, 25, 69, 52, 71,
       21, 12, 30, 47, 72, 17, 27,  5, 46, 11, 70, 63, 43, 15, 60, 18, 66,
        9,  3, 31, 50, 64, 56,  7, 42, 35, 48, 29, 65, 38, 68, 32, 55, 37,
       36, 41,  6,  4, 33, 67, 23, 57, 61, 14, 20, 53, 40, 59, 24, 44, 19,
       54, 51, 26,  0, 39], dtype=int64)

In [13]:
churn_df.tenure.value_counts(ascending=True)

0      11
36     50
44     51
39     56
28     57
     ... 
4     176
3     200
2     238
72    362
1     613
Name: tenure, Length: 73, dtype: int64

In [14]:
churn_df

customerID,tenure,PhoneService,Contract,PaymentMethod,MonthlyCharges,TotalCharges,Churn
7590-VHVEG,1,0,0,0,29.85,29.85,No
5575-GNVDE,34,1,1,1,56.95,1889.50,No
3668-QPYBK,2,1,0,1,53.85,108.15,Yes
7795-CFOCW,45,0,1,2,42.30,1840.75,No
9237-HQITU,2,1,0,0,70.70,151.65,Yes
...,...,...,...,...,...,...,...
6840-RESVB,24,1,1,1,84.80,1990.50,No
2234-XADUH,72,1,1,3,103.20,7362.90,No
4801-JZAZL,11,0,0,0,29.60,346.45,No
8361-LTMKD,4,1,0,1,74.40,306.60,Yes


In [15]:
churn_df = churn_df[churn_df['tenure'] != 0]

In [16]:
churn_df

customerID,tenure,PhoneService,Contract,PaymentMethod,MonthlyCharges,TotalCharges,Churn
7590-VHVEG,1,0,0,0,29.85,29.85,No
5575-GNVDE,34,1,1,1,56.95,1889.50,No
3668-QPYBK,2,1,0,1,53.85,108.15,Yes
7795-CFOCW,45,0,1,2,42.30,1840.75,No
9237-HQITU,2,1,0,0,70.70,151.65,Yes
...,...,...,...,...,...,...,...
6840-RESVB,24,1,1,1,84.80,1990.50,No
2234-XADUH,72,1,1,3,103.20,7362.90,No
4801-JZAZL,11,0,0,0,29.60,346.45,No
8361-LTMKD,4,1,0,1,74.40,306.60,Yes


In [17]:
churn_df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 7032 entries, 7590-VHVEG to 3186-AJIEK
Data columns (total 7 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   tenure          7032 non-null   int64  
 1   PhoneService    7032 non-null   int64  
 2   Contract        7032 non-null   int64  
 3   PaymentMethod   7032 non-null   int64  
 4   MonthlyCharges  7032 non-null   float64
 5   TotalCharges    7032 non-null   float64
 6   Churn           7032 non-null   object 
dtypes: float64(2), int64(4), object(1)
memory usage: 439.5+ KB


The data is now clean!

In [18]:
churn_df.to_csv('churn_df.csv')

# Section 2: AutoML

## Section 2 Part 1: Setting up AutoML

In [19]:
from pycaret.classification import ClassificationExperiment, setup, compare_models, predict_model, save_model, load_model

In [20]:
automl = ClassificationExperiment()

In [21]:
automl.setup(churn_df, target='Churn')

,Description,Value
0,Session id,1561
1,Target,Churn
2,Target type,Binary
3,Target mapping,"No: 0, Yes: 1"
4,Original data shape,"(7032, 7)"
5,Transformed data shape,"(7032, 7)"
6,Transformed train set shape,"(4922, 7)"
7,Transformed test set shape,"(2110, 7)"
8,Numeric features,6
9,Preprocess,True


<pycaret.classification.oop.ClassificationExperiment at 0x270309d1a80>

## Section 2 Part 2: Running and Identifying the Best Model

In [22]:
best_churn_model = automl.compare_models()

,Model,Accuracy,AUC,Recall,Prec.,F1,Kappa,MCC,TT (Sec)
lr,Logistic Regression,0.7924,0.8281,0.7924,0.7827,0.7837,0.4297,0.4363,0.499
lda,Linear Discriminant Analysis,0.7891,0.8161,0.7891,0.7785,0.7795,0.4174,0.4248,0.017
ridge,Ridge Classifier,0.7887,0.0,0.7887,0.7754,0.7729,0.3926,0.4081,0.016
ada,Ada Boost Classifier,0.7848,0.8273,0.7848,0.7724,0.7738,0.4003,0.4084,0.064
gbc,Gradient Boosting Classifier,0.7836,0.828,0.7836,0.7704,0.7713,0.3921,0.4017,0.143
lightgbm,Light Gradient Boosting Machine,0.7814,0.8195,0.7814,0.7697,0.7724,0.3991,0.4045,0.65
rf,Random Forest Classifier,0.7682,0.7902,0.7682,0.7573,0.7604,0.3705,0.3743,0.174
knn,K Neighbors Classifier,0.7643,0.739,0.7643,0.7483,0.7516,0.3402,0.3475,0.574
et,Extra Trees Classifier,0.7534,0.7671,0.7534,0.7439,0.7472,0.3392,0.3415,0.129
qda,Quadratic Discriminant Analysis,0.7432,0.8156,0.7432,0.787,0.7551,0.4227,0.4381,0.017


In [23]:
best_churn_model

**Best Model = Logistic Regression, AdaBoost, or Gradient Boosting.** The winner changes from run to run. At the time this notebook was saved, the best model was Logistic Regression.
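
One likely reason the winner changes is that the session id (and therefore the train/test split and fold assignment) is random unless it is fixed, and the top models in the table above are within about 0.01 accuracy of each other. The sketch below is one way to pin the run down, assuming PyCaret 3's `ClassificationExperiment` API with the `session_id` argument of `setup` and the `sort` argument of `compare_models`. The function name `find_best_model`, the seed value, and the choice of AUC as the ranking metric are my own assumptions for illustration.

```python
def find_best_model(data, target="Churn", seed=42, metric="AUC"):
    """Run a reproducible PyCaret model comparison (hypothetical helper)."""
    # Imported lazily so this function can be defined without PyCaret installed.
    from pycaret.classification import ClassificationExperiment

    experiment = ClassificationExperiment()
    # A fixed session_id keeps the train/test split and CV folds the same each run.
    experiment.setup(data, target=target, session_id=seed)
    # sort= selects the metric used to rank models (accuracy is the default).
    best_model = experiment.compare_models(sort=metric)
    return experiment, best_model
```

Calling `find_best_model(churn_df)` should then return the same ranking on each rerun, as long as the data and package versions stay the same.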

In [24]:
automl.evaluate_model(best_churn_model)


In [25]:
automl.predict_model(best_churn_model, churn_df.iloc[-2:-1])

,Model,Accuracy,AUC,Recall,Prec.,F1,Kappa,MCC
0,Logistic Regression,1.0,0,1.0,1.0,1.0,,0.0


customerID,tenure,PhoneService,Contract,PaymentMethod,MonthlyCharges,TotalCharges,Churn,prediction_label,prediction_score
8361-LTMKD,4,1,0,1,74.400002,306.600006,Yes,Yes,0.5575


In [26]:
churn_predictions = automl.predict_model(best_churn_model, data=churn_df)

,Model,Accuracy,AUC,Recall,Prec.,F1,Kappa,MCC
0,Logistic Regression,0.7922,0.8357,0.7922,0.7811,0.7835,0.4279,0.4335


In [27]:
churn_predictions.head(10)

customerID,tenure,PhoneService,Contract,PaymentMethod,MonthlyCharges,TotalCharges,Churn,prediction_label,prediction_score
7590-VHVEG,1,0,0,0,29.85,29.85,No,Yes,0.539
5575-GNVDE,34,1,1,1,56.950001,1889.5,No,No,0.9196
3668-QPYBK,2,1,0,1,53.849998,108.150002,Yes,No,0.5489
7795-CFOCW,45,0,1,2,42.299999,1840.75,No,No,0.9467
9237-HQITU,2,1,0,0,70.699997,151.649994,Yes,Yes,0.6001
9305-CDSKC,8,1,0,0,99.650002,820.5,Yes,Yes,0.7195
1452-KIOVK,22,1,0,3,89.099998,1949.400024,No,No,0.5831
6713-OKOMC,10,0,0,1,29.75,301.899994,No,No,0.6087
7892-POOKP,28,1,0,0,104.800003,3046.050049,Yes,Yes,0.6435
6388-TABGU,62,1,1,2,56.150002,3487.949951,No,No,0.9759
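
In the table above, `prediction_score` looks like the probability of the *predicted* label rather than the probability of churn (for example, 0.9196 sits next to a `No` prediction). That matches my understanding of PyCaret's default output, but I am treating it as an assumption. Since the assignment asks for the probability of churn, a minimal sketch of the conversion could look like this. The helper name `churn_probability` is hypothetical, and the three sample rows are copied from the table.

```python
def churn_probability(prediction_label, prediction_score, churn_label="Yes"):
    """Convert a (label, score) pair into the probability of churn.

    Assumes prediction_score is the probability of the predicted label.
    """
    if prediction_label == churn_label:
        return prediction_score
    # The model predicted "no churn", so churn gets the remaining probability.
    return 1.0 - prediction_score


# (prediction_label, prediction_score) for the first three customers above.
rows = [("Yes", 0.539), ("No", 0.9196), ("No", 0.5489)]
churn_probabilities = [round(churn_probability(label, score), 4) for label, score in rows]
print(churn_probabilities)
```

PyCaret's `predict_model` also accepts `raw_score=True`, which should add one score column per class. I have not relied on the exact column names here.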


## Section 2 Part 3: Identifying Incorrect Predictions

**Below:** I am creating a dataframe that shows only the rows where the churn prediction did not match the actual outcome. I created a boolean mask (incorrect_predictions) that is true wherever the prediction result did not match the target outcome. I then used it to create another dataframe (incorrect_rows_df) that contains only the incorrect predictions from my prediction results (churn_predictions). From here I can evaluate what some of the similarities are among the incorrect predictions.

In [28]:
incorrect_predictions = churn_predictions['prediction_label'] != churn_predictions['Churn']

In [29]:
incorrect_rows_df = churn_predictions[incorrect_predictions]

In [30]:
incorrect_rows_df

customerID,tenure,PhoneService,Contract,PaymentMethod,MonthlyCharges,TotalCharges,Churn,prediction_label,prediction_score
7590-VHVEG,1,0,0,0,29.850000,29.850000,No,Yes,0.5390
3668-QPYBK,2,1,0,1,53.849998,108.150002,Yes,No,0.5489
0280-XJGEX,49,1,0,2,103.699997,5036.299805,Yes,No,0.6017
5129-JLPIS,25,1,0,0,105.500000,2686.050049,No,Yes,0.6617
4190-MFLUW,10,1,0,3,55.200001,528.349976,Yes,No,0.7042
...,...,...,...,...,...,...,...,...,...
1699-HPSBG,12,1,1,0,59.799999,727.799988,Yes,No,0.7901
2823-LKABH,18,1,0,2,95.050003,1679.400024,No,Yes,0.5349
8775-CEBBJ,9,1,0,2,44.200001,403.350006,Yes,No,0.7125
2235-DWLJU,6,0,0,0,44.400002,263.049988,No,Yes,0.5735


**Below:** In this scenario we would want to decrease the number of false negatives (customers who did churn but were predicted not to), so I am going to look only at the rows where the model predicted false negatives.

In [31]:
false_negatives = (incorrect_rows_df['Churn'] == 'Yes') & (incorrect_rows_df['prediction_label'] == 'No')

In [32]:
churn_false_negatives_df = incorrect_rows_df[false_negatives]

In [33]:
churn_false_negatives_df

customerID,tenure,PhoneService,Contract,PaymentMethod,MonthlyCharges,TotalCharges,Churn,prediction_label,prediction_score
3668-QPYBK,2,1,0,1,53.849998,108.150002,Yes,No,0.5489
0280-XJGEX,49,1,0,2,103.699997,5036.299805,Yes,No,0.6017
4190-MFLUW,10,1,0,3,55.200001,528.349976,Yes,No,0.7042
1066-JKSGK,1,1,0,1,20.150000,20.150000,Yes,No,0.7221
6467-CHFZW,47,1,0,0,99.349998,4749.149902,Yes,No,0.5225
...,...,...,...,...,...,...,...,...,...
1980-KXVPM,3,1,0,3,75.050003,256.250000,Yes,No,0.5270
5482-NUPNA,4,1,0,1,60.400002,272.149994,Yes,No,0.5263
1699-HPSBG,12,1,1,0,59.799999,727.799988,Yes,No,0.7901
8775-CEBBJ,9,1,0,2,44.200001,403.350006,Yes,No,0.7125


In [34]:
churn_false_negatives_df.tenure.value_counts()

1     139
2      45
3      34
4      27
15     20
     ... 
44      5
62      5
23      5
63      4
64      4
Name: tenure, Length: 72, dtype: int64

In [35]:
churn_false_negatives_df.PhoneService.value_counts()

1    826
0    108
Name: PhoneService, dtype: int64

In [36]:
churn_false_negatives_df.Contract.value_counts()

0    721
1    165
2     48
Name: Contract, dtype: int64

In [37]:
churn_false_negatives_df.PaymentMethod.value_counts()

0    329
3    207
1    204
2    194
Name: PaymentMethod, dtype: int64

In [38]:
churn_false_negatives_df.prediction_score.value_counts()

0.7205    4
0.7219    4
0.6746    3
0.5408    3
0.5034    3
         ..
0.8221    1
0.6449    1
0.5990    1
0.5342    1
0.7485    1
Name: prediction_score, Length: 834, dtype: int64

**Above:** The value counts suggest some patterns in the features. For example, most of the false negatives had a phone service (826/934) and/or a month-to-month contract (721/934). I am not going to do any further manipulation at this point, but I wanted to take the opportunity to point out that we can separate out the incorrect predictions and look for patterns in their feature values.
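
To make the value counts above easier to compare, they can be turned into shares of all false negatives. This is a small sketch that only uses the counts printed in the cells above. The helper name `shares` is hypothetical.

```python
# Counts copied from the value_counts() outputs for the false negatives.
phone_service_counts = {1: 826, 0: 108}
contract_counts = {0: 721, 1: 165, 2: 48}


def shares(counts):
    """Return each value's share of the total, rounded to three decimals."""
    total = sum(counts.values())
    return {value: round(count / total, 3) for value, count in counts.items()}


phone_service_shares = shares(phone_service_counts)
contract_shares = shares(contract_counts)
print(phone_service_shares)
print(contract_shares)
```

These shares alone do not show that a feature value is over-represented among the false negatives. For that, they would need to be compared with the same shares in the full dataset, which I did not compute here.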

# Section 3: Saving and Loading the Model

## Section 3 Part 1: Saving the Model

In [39]:
automl.save_model(best_churn_model, 'churn_pycaret_model')

Transformation Pipeline and Model Successfully Saved


(Pipeline(memory=Memory(location=None),
          steps=[('label_encoding',
                  TransformerWrapperWithInverse(exclude=None, include=None,
                                                transformer=LabelEncoder())),
                 ('numerical_imputer',
                  TransformerWrapper(exclude=None,
                                     include=['tenure', 'PhoneService',
                                              'Contract', 'PaymentMethod',
                                              'MonthlyCharges', 'TotalCharges'],
                                     transformer=SimpleImputer(add_indicator=False,
                                                               copy=True,...
                                                               keep_empty_features=False,
                                                               missing_values=nan,
                                                               strategy='most_frequent',
                             

## Section 3 Part 2: Loading and Testing the Model

In [40]:
new_churn_pycaret = ClassificationExperiment()
loaded_model = new_churn_pycaret.load_model('churn_pycaret_model')

Transformation Pipeline and Model Successfully Loaded


In [54]:
new_churn_pycaret.predict_model(loaded_model, churn_df.iloc[-5:-1])

customerID,tenure,PhoneService,Contract,PaymentMethod,MonthlyCharges,TotalCharges,Churn,prediction_label,prediction_score
6840-RESVB,24,1,1,1,84.800003,1990.5,No,No,0.7754
2234-XADUH,72,1,1,3,103.199997,7362.899902,No,No,0.911
4801-JZAZL,11,0,0,0,29.6,346.450012,No,No,0.5718
8361-LTMKD,4,1,0,1,74.400002,306.600006,Yes,Yes,0.5575


# Section 4: Running and Testing My Python Script

## Section 4 Part 1: My Python Script Against Model DataFrame

The script below runs against churn_df.csv, the same data that was used to train the model.

In [42]:
%run Mellors_MSDS_600_Week_5_pycaret_script_model

Transformation Pipeline and Model Successfully Loaded
Churn Predictions:
           Prediction
customerID           
7590-VHVEG      Churn
5575-GNVDE   No Churn
3668-QPYBK   No Churn
7795-CFOCW   No Churn
9237-HQITU      Churn
...               ...
6840-RESVB   No Churn
2234-XADUH   No Churn
4801-JZAZL   No Churn
8361-LTMKD      Churn
3186-AJIEK   No Churn

[7032 rows x 1 columns]



In [43]:
from IPython.display import Code

Code('Mellors_MSDS_600_Week_5_pycaret_script_model.py')

## Section 4 Part 2: My Python Script Against Assignment DataFrame

The script below runs against new_churn_data.csv, the data used to test the model.

In [44]:
%run Mellors_MSDS_600_Week_5_pycaret_script_test

Transformation Pipeline and Model Successfully Loaded
Churn Predictions:
           Prediction
customerID           
9305-CKSKC   No Churn
1452-KNGVK      Churn
6723-OKKJM   No Churn
7832-POPKP   No Churn
6348-TACGU      Churn


In [45]:
from IPython.display import Code

Code('Mellors_MSDS_600_Week_5_pycaret_script_test.py')

My script worked against the new data! It seems like it performed very poorly, though (and I am not sure why). The target mapping of my ClassificationExperiment labeled No (did not churn) = 0 and Yes (did churn) = 1. The expected results for new_churn_data were [1, 0, 0, 1, 0] and the results from my testing were [0, 1, 0, 0, 1], so only one customer was predicted correctly. I spent quite a bit of time trying to figure out why, but couldn't come up with a good answer or solution.

The only educated guess I could come up with was that the ClassificationExperiment setup used for new_churn_data set the target mapping the opposite way from my own setup. I consider this possible because I noticed in the Week 5 FTE that the ClassificationExperiment setup (automl.setup) listed 'Diabetes:0, No Diabetes:1', which is backwards from my setup of 'Churn:1, No Churn:0'. If this is the case, then my model predicted all but one of the customers correctly (expected [0, 1, 1, 0, 1] after flipping versus my [0, 1, 0, 0, 1])! <- I sure hope this is correct and what happened.
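
The two readings can be checked with a short sketch. It assumes the script output above maps 'Churn' to 1 and 'No Churn' to 0. The predicted labels are copied from the output of the test script, and the true values come from the assignment text.

```python
# True values from the assignment and labels printed by the test script.
true_values = [1, 0, 0, 1, 0]
predicted_labels = ["No Churn", "Churn", "No Churn", "No Churn", "Churn"]

# Assumed encoding: 'Churn' -> 1, 'No Churn' -> 0.
predictions = [1 if label == "Churn" else 0 for label in predicted_labels]


def accuracy(expected, predicted):
    """Fraction of positions where the two lists agree."""
    matches = sum(e == p for e, p in zip(expected, predicted))
    return matches / len(expected)


accuracy_as_given = accuracy(true_values, predictions)
# Same comparison if the true values actually use the opposite mapping.
flipped_true_values = [1 - value for value in true_values]
accuracy_if_flipped = accuracy(flipped_true_values, predictions)
print(accuracy_as_given, accuracy_if_flipped)
```

With only five rows, neither number settles whether the mapping is flipped, so this is a sanity check rather than proof.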

# Section 5: Summary

In this assignment, I had to show my ability to use machine learning to test my data against a number of different models, save the best model, and then write a Python script that could take new data and run it against my saved model.

First, I imported my packages and my data. I decided to bring in the raw churn data I have been working with and clean it again. I didn't realize until much later that I was expected to clean the data for the new_churn_data dataframe. Regardless, the first thing I did was demonstrate that I can still clean up my data.

Next, I set up the ClassificationExperiment so that PyCaret knew what the target was and how to encode it numerically, among other settings. I then ran a model comparison. The results varied each time I reran it, but at the time of writing this summary (and as can be viewed in the PDF version of this notebook) the best model was Logistic Regression. After that, I evaluated the model with a confusion matrix to check its performance and chose a customer to confirm that the output included a prediction along with a prediction score. This all went well.

After that, I did some dataframe manipulation to look at the incorrect predictions and to assess what, if any, patterns could be found. I noticed from the data that most of the false negatives had a phone service and/or a month-to-month contract. I did not attempt to manipulate the data further, and did this portion just to see if I was capable of isolating incorrect predictions and counting their feature values to see where most of the incorrect predictions lay.

After viewing the incorrect predictions, I saved my model, which created the .pkl file that my Python code could use to run current and future data against. I then tested that the saved model worked by creating a new ClassificationExperiment, loading the model into it, and viewing predictions for a few customers from the dataframe.

**Note:** While this did show that the model was working, I also did additional testing of my model using the Python script in Section 4. That is why Section 4 has two parts: testing my script and model against my own churn data, and then testing against the assignment dataframe.

Once I had verified that my model worked, I began working on a Python script, written in VSCode, that could run data against my model. First, I created a script that I could run against my churn data (that file ends in 'model.py'). Once I was happy with the results (which I added to this notebook), I made a copy of the script and set it up to test against the assignment dataframe (new_churn_data.csv); this script ends in 'test.py'. It functioned properly, but I had trouble interpreting the effectiveness of my model against the expected outcomes (this is explained in depth at the end of Section 4 Part 2).

Final thoughts: It was really cool to be able to create my first data science script. I look forward to making more! Also, now that I can do this, I can practice creating prediction scripts for other dataframes that I find online. Learning how we can use ML to aid in data science and how to use it in practice was a neat experience.

And finally, I got my code up on GitHub!

**Reference: The script code template I used to create my Python script was obtained from the Week_5_FTE_pycaret.ipynb** 