### DAT303 - Module 6.2 Notebook
---
Name:    
Date:

> It is assumed you are using the module6 conda environment specified in the *module6.yaml* file downloaded from Canvas. Be sure to read all cells in this notebook. You are only to provide code in cells that contain `##### YOUR CODE HERE #####` and written responses in cells that contain `YOUR WRITTEN RESPONSE HERE`. Ensure that code cells are executed sequentially to prevent unexpected errors.

In this notebook, you will fit a number of classification models and evaluate the results on the *fraud-claims.csv* dataset available on Canvas. There are three sub-sections to this assignment:

- Part I: Data Preparation  
- Part II: Classification Models
- Part III: Evaluation  

**BE SURE TO READ THE INSTRUCTIONS FOR ALL SECTIONS!!!**

<br>


## Part I: Data Preparation
---

You will first pre-process the dataset. You can use the example code provided [here](https://github.com/jtrive84/DMACC/blob/master/DAT303/Demos/preprocessing-pipeline-demo.ipynb) to give you an idea of how to handle imputation, scaling and one-hot encoding for categorical features. 

The objective is to determine whether an insurance claim is suspicious based on the available features. The target variable is "suspicious", where 0 represents claims that are not suspicious and 1 represents claims that are suspicious and require further investigation. For a description of each of the columns, refer to *fraud-claims-data-dictionary.csv*. The required steps are:

- Determining which features are categorical and which are continuous.

- Imputing missing values (remember that imputation is handled differently for continuous and categorical features). 

- Scaling continuous features.

- One-hot encoding categorical features.
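
The steps above can be sketched on a toy DataFrame as follows. This is a minimal illustration, assuming median imputation for continuous features and most-frequent imputation for categorical features; the column names (`amount`, `age`, `region`) are hypothetical and are not taken from *fraud-claims.csv*.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy data with hypothetical column names (not the assignment's columns).
df_toy = pd.DataFrame({
    "amount": [100.0, np.nan, 250.0, 400.0],
    "age": [34.0, 51.0, np.nan, 29.0],
    "region": ["north", "south", np.nan, "north"],
})
continuous = ["amount", "age"]
categorical = ["region"]

# Continuous features: fill gaps with the median, then standardize.
continuous_steps = Pipeline([
    ("imputer", SimpleImputer(strategy="median")),
    ("scaler", StandardScaler()),
])

# Categorical features: fill gaps with the most frequent level, then one-hot encode.
categorical_steps = Pipeline([
    ("imputer", SimpleImputer(strategy="most_frequent")),
    ("onehot", OneHotEncoder(handle_unknown="ignore")),
])

preprocessor = ColumnTransformer([
    ("continuous", continuous_steps, continuous),
    ("categorical", categorical_steps, categorical),
])

X_toy = preprocessor.fit_transform(df_toy)
print(X_toy.shape)  # two scaled columns plus one column per region level
```

Other imputation strategies (for example a constant "MISSING" level for categorical features) are equally valid; the choice here is only for illustration.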

---                              

- 1.a Read *fraud-claims.csv* into a Pandas DataFrame. 
- 1.b Drop any records in which the target is missing. 
- 1.c Display the first 10 rows of the DataFrame.

In [None]:

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

np.set_printoptions(suppress=True, precision=8, linewidth=1000)
pd.options.mode.chained_assignment = None
pd.set_option('display.max_columns', None)
pd.set_option('display.width', None)

##### YOUR CODE HERE #####




<br>

2. Determine the total number of positive and negative instances in the suspicious column.
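
As a small illustration of counting classes and converting counts to proportions, the sketch below uses a made-up label Series (`y_toy` is hypothetical and unrelated to the assignment data).

```python
import pandas as pd

# Toy binary labels standing in for a target column.
y_toy = pd.Series([0, 0, 0, 1, 0, 0, 1, 0, 0, 0], name="label")

counts = y_toy.value_counts()                     # number of rows per class
proportions = y_toy.value_counts(normalize=True)  # share of rows per class

print(counts)
print(proportions)
```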

In [None]:

##### YOUR CODE HERE #####


<br>

3. What proportion of the data belong to the suspicious class? Would this dataset be considered balanced or imbalanced?



YOUR WRITTEN RESPONSE HERE


<br>

4. Create a barplot of the proportion (not the count) of suspicious samples by the legalrep column. 
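
Because the target is coded 0/1, the mean of the target within each group equals the proportion of positives in that group. A minimal sketch on toy data, with hypothetical column names `group` and `flag`:

```python
import pandas as pd

# Toy data: "group" and "flag" are hypothetical stand-ins.
df_toy = pd.DataFrame({
    "group": ["a", "a", "a", "b", "b", "b", "b", "b"],
    "flag":  [1, 0, 0, 1, 1, 0, 1, 0],
})

# Mean of a 0/1 column per group = proportion of positives per group.
rate_by_group = df_toy.groupby("group")["flag"].mean()
print(rate_by_group)
```

The resulting Series can be drawn as a bar chart, for example with `rate_by_group.plot.bar()`.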

In [None]:

##### YOUR CODE HERE #####


<br>

5. Is the proportion of suspicious claims greater for individuals with or without legal representation?


YOUR WRITTEN RESPONSE HERE


<br>

6. Inspect the columns of the DataFrame, and create categorical and continuous feature lists. 

In [None]:

target = "suspicious"

##### YOUR CODE HERE #####


<br>

7. For each feature identified as categorical, perform a check to consolidate values into an "OTHER" group if they appear fewer than 20 times. There may not be any such cases, but add logic to handle any instances.
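
One possible way to consolidate rare levels is sketched below on toy data. The threshold is lowered to 3 so the effect is visible on a small example, and the column name `color` is hypothetical.

```python
import pandas as pd

min_count = 3  # small threshold for the toy example only

df_toy = pd.DataFrame({"color": ["red"] * 5 + ["blue"] * 4 + ["teal", "pink"]})
categorical = ["color"]  # hypothetical list of categorical features

for col in categorical:
    counts = df_toy[col].value_counts()
    rare_levels = counts[counts < min_count].index
    # Keep values that are not rare; replace rare ones with "OTHER".
    # Missing values are left untouched so they can be imputed later.
    df_toy[col] = df_toy[col].where(~df_toy[col].isin(rare_levels), "OTHER")

print(df_toy["color"].value_counts())
```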

In [None]:

##### YOUR CODE HERE #####


<br>

8. Implement the preprocessing pipeline. Be sure to create train, validation and test subsets. We will use the train and validation sets for modeling, but the test set will not be used until Part III, where all models are compared on unseen data. Print the number of rows and columns in each split. Refer to [this](https://github.com/jtrive84/DMACC/blob/master/DAT303/Demos/preprocessing-pipeline-demo.ipynb) link for an example of how to handle pre-processing in scikit-learn. 

<br>


> Note that this section should be no different from your work in *module-05-2.ipynb*. Be sure to name your datasets `dftrain`, `dfvalid` and `dftest` and your responses `ytrain`, `yvalid` and `ytest` for compatibility with tests in Part III.
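
A three-way split can be produced with two calls to `train_test_split`. The sketch below uses synthetic data and assumes a 60/20/20 split stratified on the label; the proportions, the `random_state` value and the variable names are illustrative choices, not requirements of the assignment.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic features and an imbalanced binary label.
rng = np.random.default_rng(0)
X_toy = pd.DataFrame(rng.normal(size=(1000, 3)), columns=["x1", "x2", "x3"])
y_toy = pd.Series(rng.binomial(1, 0.1, size=1000), name="label")

# Step 1: hold out 20% of the rows as the test set.
X_rest, X_test, y_rest, y_test = train_test_split(
    X_toy, y_toy, test_size=0.20, stratify=y_toy, random_state=516
)

# Step 2: split the remaining 80% into 75% train / 25% validation,
# which gives 60% / 20% / 20% of the original rows overall.
X_train, X_valid, y_train, y_valid = train_test_split(
    X_rest, y_rest, test_size=0.25, stratify=y_rest, random_state=516
)

print(X_train.shape, X_valid.shape, X_test.shape)
```

Fitting the imputers, scaler and encoder on the training split only, and then applying them to the validation and test splits, avoids leaking information from held-out data.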

In [None]:

##### YOUR CODE HERE #####



print(f"dftrain.shape: {dftrain.shape}")
print(f"dfvalid.shape: {dfvalid.shape}")
print(f"dftest.shape : {dftest.shape}")


<br>

## Part II: Fitting Classification Models
---
In this section, you will fit four separate classification models and answer additional questions about each. In particular, you will fit the following:

- [LogisticRegression](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html#sklearn.linear_model.LogisticRegression)
- [DecisionTreeClassifier](https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html#sklearn-tree-decisiontreeclassifier)
- [GradientBoostingClassifier](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.GradientBoostingClassifier.html#sklearn-ensemble-gradientboostingclassifier)
- Classification model of your choice (list of models available [here](https://scikit-learn.org/stable/supervised_learning.html))

Follow the instructions that accompany each model. Remember that in this section, **we are only working with the training and validation sets, not the test set!**

---


### i. LogisticRegression

1. Fit a `LogisticRegression` model. After fitting the model, report the accuracy, precision, recall and f1-score using the default classification threshold of .50 on the validation set. Name the resulting model `mdl1`.


In [None]:

from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_score, recall_score, accuracy_score, f1_score

##### YOUR CODE HERE #####


<br>

2. Using one of the approaches outlined in the classifier threshold notebook, update the classifier threshold and recalculate accuracy, precision, recall and f1-score on the validation set. 
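
The sketch below illustrates one possible approach on synthetic data: choosing the threshold that maximizes f1-score on the validation set using `precision_recall_curve`. This is an assumption for illustration and may or may not match the approaches in the classifier threshold notebook.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score, precision_recall_curve
from sklearn.model_selection import train_test_split

# Synthetic imbalanced data standing in for the train/validation sets.
X, y = make_classification(
    n_samples=2000, n_features=10, weights=[0.9, 0.1], random_state=0
)
X_tr, X_va, y_tr, y_va = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
p_va = clf.predict_proba(X_va)[:, 1]  # probability of the positive class

precision, recall, thresholds = precision_recall_curve(y_va, p_va)

# precision and recall have one more entry than thresholds; drop the last.
denom = np.clip(precision[:-1] + recall[:-1], 1e-12, None)
f1 = 2 * precision[:-1] * recall[:-1] / denom
best_threshold = thresholds[np.argmax(f1)]

yhat_default = (p_va >= 0.5).astype(int)
yhat_tuned = (p_va >= best_threshold).astype(int)

print(f"best threshold  : {best_threshold:.3f}")
print(f"f1 at 0.50      : {f1_score(y_va, yhat_default):.3f}")
print(f"f1 at threshold : {f1_score(y_va, yhat_tuned):.3f}")
```

Lowering the threshold generally trades precision for recall, and raising it does the opposite, so the "best" threshold depends on which errors are more costly.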

In [None]:

##### YOUR CODE HERE #####


<br>

3. Which method did you choose to adjust the classification threshold? What is the value of the new threshold you selected?


YOUR WRITTEN RESPONSE HERE


<br>

4. How did recall and precision change as the threshold changed?


YOUR WRITTEN RESPONSE HERE


<br>

5. Create a DataFrame consisting of the LogisticRegression coefficients along with the feature names, and sort the rows in decreasing order of absolute coefficient value. 
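
A minimal sketch of pairing coefficients with names and sorting by magnitude, using synthetic data and made-up feature names (`x0` to `x4`):

```python
import numpy as np
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(
    n_samples=500, n_features=5, n_informative=3, n_redundant=0, random_state=0
)
feature_names = [f"x{i}" for i in range(X.shape[1])]  # hypothetical names

clf = LogisticRegression(max_iter=1000).fit(X, y)

# coef_ has shape (1, n_features) for a binary problem; flatten it.
dfcoef = pd.DataFrame({"feature": feature_names, "coef": clf.coef_.ravel()})

# Sort by absolute value while keeping the signed coefficient in the table.
dfcoef = dfcoef.sort_values("coef", key=np.abs, ascending=False).reset_index(drop=True)
print(dfcoef)
```

When features pass through a fitted `ColumnTransformer`, the transformed column names are typically available from its `get_feature_names_out()` method in recent scikit-learn versions.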

In [None]:

##### YOUR CODE HERE #####



<br>

6. Which five features have the highest absolute coefficient values?



YOUR WRITTEN RESPONSE HERE



<br>


### ii. DecisionTreeClassifier

1. Using `GridSearchCV`, create a parameter grid with at least three hyperparameters, and fit a `DecisionTreeClassifier` which will be identified as `mdl2`. Determine which metric to optimize against. Print the best set of parameters, and report the accuracy, precision, recall and f1-score on the validation set using the default threshold. 
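
For orientation, a grid search over three decision tree hyperparameters might look like the sketch below. The data are synthetic, and the grid values, the `scoring="f1"` choice and `cv=5` are illustrative assumptions rather than recommended settings.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(
    n_samples=600, n_features=8, weights=[0.85, 0.15], random_state=0
)

# Illustrative grid: 3 x 3 x 2 = 18 candidate settings.
param_grid = {
    "max_depth": [2, 4, 8],
    "min_samples_leaf": [1, 10, 25],
    "criterion": ["gini", "entropy"],
}

search = GridSearchCV(
    DecisionTreeClassifier(random_state=516),
    param_grid=param_grid,
    scoring="f1",  # metric to optimize; other scorers can be substituted
    cv=5,
)
search.fit(X, y)

print(search.best_params_)
print(f"best cross-validated f1: {search.best_score_:.3f}")
```

With the default `refit=True`, calling `search.predict(...)` uses the refitted `search.best_estimator_`, whereas `search.estimator` is the unfitted template passed to the constructor.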

In [None]:

from sklearn.metrics import precision_score, recall_score, accuracy_score, f1_score
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier


##### YOUR CODE HERE #####


<br>

2. Recall that in Module 5 validation curves were discussed. For this question, you will vary the DecisionTreeClassifier's max_depth hyperparameter, and monitor how train f1-score and validation f1-score vary for each value against the default classification threshold. 


    1. For each max_depth in `np.arange(1, 51)`, do:  
    
        - Fit a DecisionTreeClassifier with that particular max_depth.
        - Compute the training and validation f1-score. 

    2. Create a DataFrame of your results with columns max_depth, train_f1, and valid_f1. Display all rows.


In [None]:

##### YOUR CODE HERE #####


<br>

3. Plot the validation curve comparing train_f1 and valid_f1 with max_depth on the x-axis. Draw a vertical black line at the value of max_depth where overfitting starts to occur. Be sure to label your axes.
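
The loop behind a validation curve can be sketched on synthetic data as shown here. The depth range is shortened to 1 through 15, the result column names are illustrative, and taking the depth with the highest validation f1-score is only one heuristic for locating where overfitting begins.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(
    n_samples=1500, n_features=10, weights=[0.85, 0.15], flip_y=0.05, random_state=0
)
X_tr, X_va, y_tr, y_va = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0
)

rows = []
for depth in range(1, 16):
    tree = DecisionTreeClassifier(max_depth=depth, random_state=516).fit(X_tr, y_tr)
    rows.append({
        "max_depth": depth,
        # zero_division=0 avoids warnings when a shallow tree predicts no positives.
        "train_score": f1_score(y_tr, tree.predict(X_tr), zero_division=0),
        "valid_score": f1_score(y_va, tree.predict(X_va), zero_division=0),
    })

dfcurve = pd.DataFrame(rows)

# Heuristic: the depth with the best validation score.
best_depth = int(dfcurve.loc[dfcurve["valid_score"].idxmax(), "max_depth"])
print(dfcurve)
print("depth with best validation f1:", best_depth)
```

Beyond that depth the training score typically keeps rising while the validation score flattens or falls; a vertical marker at the chosen depth can be added to a plot with `plt.axvline`.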

In [None]:

##### YOUR CODE HERE #####



<br>

### iii. GradientBoostingClassifier

1. Using `GridSearchCV`, create a parameter grid with at least four hyperparameters, and fit a `GradientBoostingClassifier` which will be identified as `mdl3`. Determine which metric to optimize against. Print the best set of parameters, and report the accuracy, precision, recall and f1-score on the validation set using the default threshold. 

In [None]:

from sklearn.metrics import precision_score, recall_score, accuracy_score, f1_score
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import GradientBoostingClassifier


##### YOUR CODE HERE #####


<br>

2. Display the ROC curve using `mdl3` predicted probabilities and validation labels.
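
The quantities behind a ROC curve can be computed directly, as in this sketch on synthetic data (the model settings are arbitrary). `RocCurveDisplay.from_predictions(y_va, p_va)` plots the same false positive and true positive rates in recent scikit-learn versions.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split

X, y = make_classification(
    n_samples=1000, n_features=10, weights=[0.85, 0.15], random_state=0
)
X_tr, X_va, y_tr, y_va = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0
)

gbm = GradientBoostingClassifier(n_estimators=50, random_state=516).fit(X_tr, y_tr)

# ROC curves are built from predicted probabilities, not hard 0/1 predictions.
p_va = gbm.predict_proba(X_va)[:, 1]

fpr, tpr, thresholds = roc_curve(y_va, p_va)
auc = roc_auc_score(y_va, p_va)
print(f"validation AUC: {auc:.3f}")
```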

In [None]:

from sklearn.metrics import RocCurveDisplay

##### YOUR CODE HERE #####



<br>

3. Display `mdl3` feature importances as a bar plot in descending order of importance.
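
A short sketch of ordering feature importances, using synthetic data and made-up feature names. When the model comes from `GridSearchCV`, the importances live on the refitted `best_estimator_` rather than on the search object itself.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

X, y = make_classification(
    n_samples=500, n_features=6, n_informative=3, n_redundant=0, random_state=0
)
feature_names = [f"x{i}" for i in range(X.shape[1])]  # hypothetical names

gbm = GradientBoostingClassifier(n_estimators=50, random_state=516).fit(X, y)

# Importances are non-negative and normalized to sum to 1.
importances = pd.Series(gbm.feature_importances_, index=feature_names)
importances = importances.sort_values(ascending=False)
print(importances)
```

The sorted Series can then be drawn with `importances.plot.bar()`.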


In [None]:

##### YOUR CODE HERE #####


<br>

4. What are the top-5 features according to `mdl3`? Are any of these the same as the top-5 coefficients from `mdl1`?


YOUR WRITTEN RESPONSE HERE


<br> 

### iv. Model of Your Choice



1. Select a scikit-learn classification model not already covered in this notebook. Using `GridSearchCV`, create a parameter grid appropriate for your model, and fit the model, which will be identified as `mdl4`. Determine which metric to optimize against. Print the best set of parameters, and report the accuracy, precision, recall and f1-score on the validation set using the default threshold. 

In [None]:

##### YOUR CODE HERE #####


<br>

2. Generate the precision-recall plot for `mdl4`.

In [None]:

from sklearn.metrics import PrecisionRecallDisplay

##### YOUR CODE HERE #####


<br>

3. Roughly which precision-recall pair is associated with the optimal threshold?


YOUR WRITTEN RESPONSE HERE


<br>

## Part III: Evaluation
---

Execute the next cell, which computes accuracy, precision, recall and f1-score on the final test set for the four models created. Recall that:

- `mdl1` = LogisticRegression model
- `mdl2` = DecisionTreeClassifier model with cross-validated hyperparameter selection 
- `mdl3` = GradientBoostingClassifier model with cross-validated hyperparameter selection 
- `mdl4` = Selected model of your choice with cross-validated hyperparameter selection  
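
scikit-learn's metric functions take the true labels first and the predictions second. Accuracy and f1-score are unchanged when the two are swapped, but precision and recall trade places. The hand-checkable sketch below uses made-up labels to show this.

```python
from sklearn.metrics import f1_score, precision_score, recall_score

# Made-up labels: 4 actual positives, 2 predicted positives.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 0, 0, 0, 1, 0, 0, 0, 0, 0]
# TP = 1 (index 0), FP = 1 (index 4), FN = 3 (indices 1-3).

precision = precision_score(y_true, y_pred)        # TP / (TP + FP) = 1/2
recall = recall_score(y_true, y_pred)              # TP / (TP + FN) = 1/4

# Swapping the arguments swaps the roles of FP and FN.
precision_swapped = precision_score(y_pred, y_true)  # 1/4
recall_swapped = recall_score(y_pred, y_true)        # 1/2

# f1 is symmetric in its arguments for a binary problem.
f1 = f1_score(y_true, y_pred)
f1_swapped = f1_score(y_pred, y_true)

print(precision, recall, precision_swapped, recall_swapped, f1, f1_swapped)
```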

In [None]:

# Run this cell as-is, no updates necessary.

# Predictions on the held-out test set at the default threshold.
yhat_mdl1 = mdl1.predict(dftest)
yhat_mdl2 = mdl2.predict(dftest)
yhat_mdl3 = mdl3.predict(dftest)
yhat_mdl4 = mdl4.predict(dftest)


# scikit-learn metrics expect (y_true, y_pred), in that order. Swapping
# the arguments would exchange the reported precision and recall.
metrics = [
    {
        "model": f"{repr(mdl1)}",
        "precision": precision_score(ytest, yhat_mdl1),
        "recall": recall_score(ytest, yhat_mdl1),
        "accuracy": accuracy_score(ytest, yhat_mdl1),
        "f1": f1_score(ytest, yhat_mdl1)
    },
    {
        "model": f"{repr(mdl2.estimator)}",
        "precision": precision_score(ytest, yhat_mdl2),
        "recall": recall_score(ytest, yhat_mdl2),
        "accuracy": accuracy_score(ytest, yhat_mdl2),
        "f1": f1_score(ytest, yhat_mdl2)
    },
    {
        "model": f"{repr(mdl3.estimator)}",
        "precision": precision_score(ytest, yhat_mdl3),
        "recall": recall_score(ytest, yhat_mdl3),
        "accuracy": accuracy_score(ytest, yhat_mdl3),
        "f1": f1_score(ytest, yhat_mdl3)
    },
    {
        "model": f"{repr(mdl4.estimator)}",
        "precision": precision_score(ytest, yhat_mdl4),
        "recall": recall_score(ytest, yhat_mdl4),
        "accuracy": accuracy_score(ytest, yhat_mdl4),
        "f1": f1_score(ytest, yhat_mdl4)
    },
]


# One row per model.
pd.DataFrame(metrics)


<br>

1. Which model exhibited the best performance in terms of your preferred metric? Which model exhibited the worst performance in terms of your preferred metric? Why do you think the best performing model out-performed the others?


YOUR WRITTEN RESPONSE HERE



<br>

2. Select the best model in terms of your preferred metric, and plot the confusion matrix using the default threshold.

In [None]:

from sklearn.metrics import ConfusionMatrixDisplay

##### YOUR CODE HERE #####


<br>

3. Select the worst performing model in terms of your preferred metric and create the confusion matrix using the default classification threshold.

In [None]:

from sklearn.metrics import ConfusionMatrixDisplay

##### YOUR CODE HERE #####


<br>

4. Compare the two confusion matrices. How do the numbers of TP, TN, FP and FN differ between the two models?
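
For a binary problem, `confusion_matrix` returns a 2 x 2 array whose flattened order is TN, FP, FN, TP. The sketch below compares two sets of made-up predictions (`y_pred_a` and `y_pred_b` are hypothetical) against the same labels.

```python
from sklearn.metrics import confusion_matrix

# Made-up labels: 6 negatives followed by 4 positives.
y_true   = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
y_pred_a = [0, 0, 0, 0, 0, 1, 1, 1, 0, 0]  # conservative: few positives predicted
y_pred_b = [0, 0, 0, 1, 1, 1, 1, 1, 1, 0]  # aggressive: many positives predicted

# Rows are actual classes, columns are predicted classes.
tn_a, fp_a, fn_a, tp_a = confusion_matrix(y_true, y_pred_a).ravel()
tn_b, fp_b, fn_b, tp_b = confusion_matrix(y_true, y_pred_b).ravel()

print("model a:", dict(TN=tn_a, FP=fp_a, FN=fn_a, TP=tp_a))
print("model b:", dict(TN=tn_b, FP=fp_b, FN=fn_b, TP=tp_b))
```

In this toy case the second set of predictions finds one more true positive at the cost of two more false positives, which is the kind of trade-off the comparison is meant to surface.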


YOUR WRITTEN RESPONSE HERE



<br>

5. Why did you choose the metric you selected to evaluate the classification models?


YOUR WRITTEN RESPONSE HERE
