## CASE STUDY ASSIGNMENT 3 - DETECTING SPAM EMAIL

#### Team: Nicole Wittlin, Joe Schueder, Steven Hayden and Kevin Mendonsa

## Introduction

Despite numerous methods to filter out unwanted email, spam still presents a number of challenges to organizations. While ordinary spam is often considered a mere nuisance, the true danger lies in spoofed emails and the malware that spam delivers. Spam messages fill up inboxes, waste time, and often carry viruses and malware. They interrupt the workflow of the whole team and reduce productivity. Furthermore, they can pose a considerable risk to the security of the company.

Spam includes phishing emails designed to appear as though they come from legitimate sources like banks or online merchants. This increases the chance that unwitting users will download suspicious files. The cost to businesses around the world runs into billions of dollars as a result of hampered productivity, security breaches, and other issues.
92.4% of spam email messages contain malware attachments for Ransomware, Malware, Spoofing etc. disrupting IT systems around the world, including those of large enterprises. This can also cause great reputational damage, especially if the victims are customers. 

Additionally, security experts say that spam is becoming an increasingly successful attack vector, with cybercriminals now aiming to gain access to a computer network and damage it. Organizations have to allocate resources for recovering and securing compromised employee and customer data and pay forensic and legal fees to deal with regulatory bodies and disgruntled customers.

We have been asked to examine a set of emails that have been classified either as spam or not and devise an explanation for what drives the classification. We will use a Decision Tree model because it is simple to understand, visualize, and interpret. This model also illustrates feature importance very well, which will be crucial for explaining the spam classification process.       

(References: https://www.graphus.ai/the-impact-of-spam-and-spoofed-emails-on-your-business/, https://towardsdatascience.com/decision-trees-in-machine-learning-641b9c4e8052)

## Data Processing 

We received the data in an "R data format" file. Since we opted to do this analysis in Python, the Python package "pyreadr" was used to read the data into a Python Pandas dataframe; we then converted the dataframe to a comma separated file (CSV) to store on GitHub for ease of use. Each execution of this Jupyter notebook reads the data from the CSV file on GitHub. 

In [None]:
#import all packages needed
import pandas as pd
import pyreadr
from pandas_profiling import ProfileReport

In [None]:
# Importing the rda dataset using pyreadr and then converting to a dataframe and a csv file - ONLY DONE ONCE, for reference

# import the dataset
# result = pyreadr.read_r(r'C:\Users\jjsch\Downloads\Week_5_Materials_2\data.Rda')

# convert to a dataframe
# df = result["emailDFrp"]

# export dataframe to CSV
# df.to_csv(r'C:\Users\jjsch\Downloads\Week_5_Materials_2\data.csv')

In [None]:
# read and view the data 
df = pd.read_csv(r'https://raw.githubusercontent.com/jjschueder/7333QTW/master/Case%20Study%203/data.csv')
df.head()

Our initial review of the data indicated that many columns are formatted as a "logical" data type of TRUE or FALSE. Machine learning classification algorithms require numeric variables. Therefore, we converted the TRUE/FALSE variables to a numeric type: TRUE = 1; FALSE = 0. The remaining variables are already numeric, as they represent percentages, counts, and hours of the day.

In [None]:
# Transform logical data to numeric data (T -> 1, F -> 0)
# df['isSpam'] = df['isSpam'].map({'T': 1, 'F': 0})
logical_cols = ['isRe', 'underscore', 'priority', 'isInReplyTo', 'sortedRec',
                'subPunc', 'multipartText', 'hasImages', 'isPGPsigned',
                'subSpamWords', 'noHost', 'numEnd', 'isYelling', 'isOrigMsg',
                'isDear', 'isWrote']

# Apply the same mapping to every logical column
for col in logical_cols:
    df[col] = df[col].map({'T': 1, 'F': 0})

## Exploratory Data Analysis 

After confirming that the data loaded as a data frame, we conducted exploratory data analysis to better understand the dataset. Our first step was to understand the variables and their meaning in relation to an email message. The data dictionary below provides a summary. Note that the target variable to predict, **isSpam**, is also included. There is a unique record id, Unnamed: 0, which is not necessary for our analysis, but we retained it for reference as needed.

In [None]:
type(df)

In [None]:
df.columns

### Data Dictionary

| VARIABLE | DEFINITION |
|:-|:-|
| isSpam | TRUE if email classified as spam |
| isRe | TRUE if Re: appears at the start of subject line |
| numLines | number of lines in body of message |
| bodyCharCt | number of characters in the body of message |
| underscore | TRUE if email address in From field of header contains underscore |
| subExcCt | number of exclamation marks in subject |
| subQuesCt | number of question marks in subject |
| numAtt | number of attachments in message |
| priority | TRUE if Priority key is present in header |
| numRec | number of recipients of message, including CCs |
| perCaps | percentage of capitals among all letters in message |
| isInReplyTo | TRUE if the In-Reply-To key is present in header |
| sortedRec | TRUE if recipients' email addresses are sorted |
| subPunc | TRUE if words in subject have punctuation or numbers embedded (e.g., w!se) |
| hour | hour of the day in the Date field |
| multipartText | TRUE if MIME type is multipart/text |
| hasImages | TRUE if message contains images |
| isPGPsigned | TRUE if message contains a PGP (encryption) signature |
| perHTML | percentage of characters in HTML tags in message compared to all characters |
| subSpamWords | TRUE if subject contains one of the words in spam word vector |
| subBlanks | percentage of blanks in subject |
| noHost | TRUE if there is no hostname in Message-Id key in header |
| numEnd | TRUE if sender's email address (before @) ends in number |
| isYelling | TRUE if subject is in all capital letters |
| forwards | number of forward symbols in a line of the body (e.g., >>> xxx contains 3) |
| isOrigMsg | TRUE if message body contains phrase "original message" |
| isDear | TRUE if message body contains word "dear" |
| isWrote | TRUE if message contains phrase "wrote:" |
| avgWordLen | average length of words in message |
| numDlr | number of dollar signs in message |

In [None]:
profile = ProfileReport(df, title='SpamData_EDA_Report', plot={'histogram': {'bins': 8}}, explorative=True)

profile.to_notebook_iframe()

profile.to_file("SpamData_EDA_Report.html")

## Data Profile Analysis

Next, we dug deeper into the exploratory analysis to understand the descriptive statistics, data distributions, and correlations between variables. Using standard Python operators and the **Pandas Profiling** package, we found some notable characteristics about select variables. 

-------------------------------------------------------------------------------------------------------------------------------

#### Missing data: 
* subSpamWords (7 missing)
* noHost (1 missing)
* isYelling (7 missing)
* subExcCt (20 missing)
* subQuesCt (20 missing) 
* numRec (282 missing)
* subBlanks (20 missing)

We devised an imputation strategy to address the missing data; this is outlined in detail later in the document. The charts below visually reinforce the missing information.

#### Missing Values
![title](img/MissingValues2.png) 

The graphic above illustrates the variables with missing values listed above. The count graph shows the number of data points present in each feature, confirming that some features have missing data points. The attributes that have missing values are highlighted.



#### Missing Values - Heatmap
![title](img/MissingValuesHeatMap2.png) 

The correlation heat map helps identify whether missing values in different variables are related. The closer the value is to 1, the more often the two variables are missing together in the same records.
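
As a rough sketch of what such a heat map measures, the relationship between missing values can be approximated in pandas by correlating the missing-value indicators of each column. The toy frame below is an illustrative assumption (it only borrows three column names), not the spam dataset, and the profiling report may compute its heat map differently.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical): subExcCt and subQuesCt are missing in the same rows
toy = pd.DataFrame({
    "subExcCt":  [1.0, np.nan, 0.0, np.nan, 2.0, 0.0],
    "subQuesCt": [0.0, np.nan, 1.0, np.nan, 0.0, 3.0],
    "numRec":    [np.nan, 2.0, 5.0, 1.0, np.nan, 4.0],
})

# 1 where a value is missing, 0 otherwise
missing_flags = toy.isnull().astype(int)

# Correlation between the missingness indicators of each pair of columns
nullity_corr = missing_flags.corr()
print(nullity_corr.round(2))
```

In this toy example the two subject-line columns have a nullity correlation of 1 because they are always missing together, which mirrors the pattern described for subExcCt, subQuesCt, and subBlanks.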

#### Data Issues and Warnings
![title](img/DataWarnings.png) 

-------------------------------------------------------------------------------------------------------------------------------

### Correlations and Outliers
Next, we looked at two types of correlation to understand the relationships between variables and noted outliers that might impact our analysis.

**Pearson's Correlation**

The Pearson correlation coefficient measures the strength and direction of a linear association between two continuous variables, where r = 1 means a perfect positive correlation and r = -1 means a perfect negative correlation. The closer the absolute value of r is to 1, the closer the relationship is to a perfectly straight line.

(Reference: https://www.dummies.com/education/math/statistics/how-to-interpret-a-correlation-coefficient-r/)

Interpretation of Pearson's coefficient.
* Exactly –1. A perfect downhill (negative) linear relationship

* –0.70. A strong downhill (negative) linear relationship

* –0.50. A moderate downhill (negative) relationship

* –0.30. A weak downhill (negative) linear relationship

* 0. No linear relationship

* +0.30. A weak uphill (positive) linear relationship

* +0.50. A moderate uphill (positive) relationship

* +0.70. A strong uphill (positive) linear relationship

* Exactly +1. A perfect uphill (positive) linear relationship

![title](img/PearsonCorrelation.png) 

It is evident from the correlation map above that some of the variables are correlated with others. perHTML is correlated with isSpam (the dependent variable), bodyCharCt, and numLines, while perCaps is also correlated with isSpam. subQuesCt is also correlated with numDlr.
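
For readers who want to reproduce a Pearson correlation matrix outside the profiling report, a minimal sketch with pandas follows. The values are made up for illustration and only reuse three column names from the data dictionary.

```python
import pandas as pd

# Hypothetical values: bodyCharCt grows almost linearly with numLines
toy = pd.DataFrame({
    "numLines":   [10, 20, 30, 40, 50],
    "bodyCharCt": [400, 810, 1190, 1620, 2000],
    "hour":       [9, 23, 4, 15, 12],
})

# Pairwise Pearson correlation between all numeric columns
pearson = toy.corr(method="pearson")
print(pearson.round(2))
```

On the real data, the same call on `df` would give the matrix behind a heat map like the one above, assuming all columns are numeric at that point.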

-------------------------------------------------------------------------------------------------------------------------------

**PHI Correlation**

The phi coefficient (also called the mean square contingency coefficient and denoted by φ or rφ) is a measure of association between two binary variables. It is a nonparametric statistic used with cross-tabulated data where both variables are dichotomous.

Interpretation of the Phi coefficient.
* -1.0 to -0.7 strong negative association. 
* -0.7 to -0.3 weak negative association. 
* -0.3 to +0.3 little or no association.
* +0.3 to +0.7 weak positive association.
* +0.7 to +1.0 strong positive association.

![title](img/PhikCorrelation.png)
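
To make the definition concrete, here is a small sketch that computes the classical phi coefficient from the 2x2 cross-tabulation of two binary variables. The helper name `phi_coefficient` and the toy vectors are assumptions for illustration; the profiling report may use a related but different statistic.

```python
import numpy as np

def phi_coefficient(x, y):
    """Phi coefficient for two binary (0/1) sequences of equal length."""
    x = np.asarray(x)
    y = np.asarray(y)
    # Cell counts of the 2x2 contingency table
    n11 = np.sum((x == 1) & (y == 1))
    n10 = np.sum((x == 1) & (y == 0))
    n01 = np.sum((x == 0) & (y == 1))
    n00 = np.sum((x == 0) & (y == 0))
    # Product of the four marginal totals
    denom = np.sqrt(float((n11 + n10) * (n01 + n00) * (n11 + n01) * (n10 + n00)))
    return (n11 * n00 - n10 * n01) / denom if denom else float("nan")

# Hypothetical binary flags for eight emails
is_re       = [1, 1, 0, 0, 1, 0, 0, 1]
is_in_reply = [1, 1, 0, 0, 1, 0, 1, 0]

phi = phi_coefficient(is_re, is_in_reply)
print(round(phi, 2))
```

For 0/1 data, this value coincides with the Pearson correlation of the two columns, which is a convenient way to sanity-check the result.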

**Outliers and Odd Data**

A simple call to the pandas describe function revealed a few outliers and data anomalies. Since our dataset includes the target variable **isSpam**, we can see whether these outliers or oddities are known to be associated with spam. While they may not directly impact our predictions, they provide insight into our data.

- **subQuesCt** outliers of 8 and 12 question marks in the subject were not necessarily spam.
- **numAtt** outlier of 18 email attachments is not necessarily spam.
- **numRec** outlier of 311 recipients on an email; this was spam.
- **perCaps** outlier where 100% of the email is in capital letters; this was spam. It appears a threshold exists where perCaps > 80% is spam.
- **perHTML** outlier where 100% of characters are HTML; this was spam. Threshold where perHTML > 97% is spam.
- **subBlanks** outlier where 86% of the email subject line is blank; this was spam. Threshold of subBlanks > 25% is spam.
- **forwards** outlier of 99 forward symbols is not necessarily spam.
- **avgWordLen** outlier with an average word length equal to 26 is spam. 
- **numDlr** outlier of 1977 dollar signs in a message is not necessarily spam.



-------------------------------------------------------------------------------------------------------------------------------

In [None]:
df.describe()

### Missing Value Imputation Steps

In [None]:
# Missing Data Summary
countOfNan = pd.Series(df.isnull().sum()) 
DataType = pd.Series(df.dtypes) 

# Assemble into a single dataframe for viewing
frame = { 'datatype': DataType, 'count of Nan': countOfNan } 
result = pd.DataFrame(frame) 
print(result)

As observed above, we have missing values in the dataset. To improve the spam classification results, we adopted the following approaches to address them.

**Imputation approach for Variables: subExcCt, subQuesCt, subBlanks, isYelling, subSpamWords**
* Given the large dataset of 9348 observations, we opted to drop the 20 rows that were missing data in the variables “subExcCt”, “subQuesCt”, and “subBlanks”. 
* The missing values associated with these three variables were all absent in the same observations/rows; therefore, only 20 observations were excluded in the dataset. This raised concerns that there could have been an error recording these observations. 
* We felt this was an appropriate step given it represented only **0.2%** of the overall dataset. 
* Dropping these observations/records also eliminated the missing values in the variables “isYelling” and “subSpamWords.”



**Imputation approach for Variable: numRec**
* The variable number of recipients (numRec) has the most missing values at 282. 
* We decided to impute the missing values with the mean number of recipients. 
* This imputation was done after splitting the data into training and testing datasets. 
* The mean was calculated using only the training dataset. 
* We felt this would most closely mimic a production environment and limit "data leakage" between the training and testing datasets. 
* We used the Pipeline approach as indicated below to address this imputation as one of the steps in the pipeline.


**Imputation approach for Variable: noHost**
* The remaining missing data point was one value for "noHost"; we imputed this with the mean. 
* We used the Pipeline approach as indicated below to address this imputation as one of the steps in the pipeline.
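
The idea of learning the imputation value from the training data only can be sketched as follows. The toy array stands in for numRec and is purely hypothetical; the notebook itself performs this step inside a Pipeline rather than by hand.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split

# Toy stand-in for numRec with a few missing values (hypothetical data)
num_rec = np.array([[1.0], [2.0], [np.nan], [4.0], [10.0],
                    [np.nan], [3.0], [1.0], [5.0], [2.0]])

train, test = train_test_split(num_rec, test_size=0.2, random_state=101)

imputer = SimpleImputer(missing_values=np.nan, strategy="mean")
imputer.fit(train)                      # the mean is learned from training rows only
test_imputed = imputer.transform(test)  # the test rows reuse that training mean

print("training mean used for imputation:", imputer.statistics_[0])
```

Because the test rows never contribute to the mean, no information from the test set leaks into the fitted model.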

In [None]:
# Create a new dataset after filtering out NaNs for variable subQuesCt 
# which also addressed NaNs in subExcCt, subBlanks, isYelling and subSpamWords

dfNoNa = df[~df['subQuesCt'].isnull()]

## Splitting the Data

#### Using Python's Data Split method from scikit learn which is equivalent to R's split() function

We used the ShuffleSplit method, which randomly samples the entire dataset during each iteration to generate a training set and a test set. The test_size and train_size parameters determine the proportion of the data assigned to the test and training sets in each iteration. Because each iteration samples from the entire dataset, observations selected in one iteration can be selected again in another. (Reference: https://stackoverflow.com/questions/34731421/whats-the-difference-between-kfold-and-shufflesplit-cv)

The settings we used for the ShuffleSplit are:
* n_splits=10
* random_state=101 
* test_size=0.2 (20% testing set)
* train_size=None (80% training set)
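
To illustrate how these settings behave, the sketch below runs the same ShuffleSplit configuration on 50 hypothetical observations and collects the test indices of each iteration. The array `X_toy` is an assumption for illustration only.

```python
import numpy as np
from sklearn.model_selection import ShuffleSplit

X_toy = np.arange(50).reshape(-1, 1)  # 50 hypothetical observations
cv = ShuffleSplit(n_splits=10, test_size=0.20, random_state=101)

# Collect the test indices drawn in each of the 10 iterations
test_sets = [set(test_idx) for _, test_idx in cv.split(X_toy)]

sizes = [len(s) for s in test_sets]
reused = test_sets[0] & test_sets[1]
print("test sizes per split:", sizes)
print("indices in both of the first two test sets:", sorted(reused))
```

Each iteration holds out 20% of the rows, and because every iteration reshuffles the full dataset, the same row can appear in more than one test set.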


In [None]:
# Split the data into test and train using scikit learn built-ins
from sklearn.model_selection import ShuffleSplit
cvx = ShuffleSplit(n_splits=10, test_size=0.20, random_state=101)
print (cvx)

In [None]:
# Determine the features to be used for the independent variables and those for the dependent or response variable
features = ['isRe', 'underscore', 'priority', 'isInReplyTo',
            'sortedRec', 'subPunc', 'multipartText', 'hasImages', 'isPGPsigned',
            'subSpamWords', 'noHost', 'numEnd', 'isYelling', 'isOrigMsg', 'isDear',
            'isWrote', 'numLines', 'bodyCharCt', 'subExcCt', 'subQuesCt', 'numAtt',
            'numRec', 'perCaps', 'hour', 'perHTML', 'subBlanks', 'forwards',
            'avgWordLen', 'numDlr']

X = dfNoNa[features].copy()
Y = dfNoNa[['isSpam']].copy()
# flatten to a 1-D array, which is the shape scikit-learn expects for the target
y = Y.values.ravel()

## Analysis

#### INITIAL GRID SEARCH MODEL PARAMETERS AND APPROACH 
To begin our more comprehensive analysis, we used GridSearchCV from the Python package "sklearn" to perform an exhaustive search over specified parameter values for an estimator. For the GridSearch, we defined a range of options for min_samples_leaf, which is the minimum number of sample observations required at a leaf node. We also specified a range of options for max_depth, which is the maximum depth allowed for the Decision Tree. The ranges of values selected are considered standard.

We used all the features and then followed these steps:

- The data is split into random training and testing datasets using the ShuffleSplit method and an 80/20 split
- The missing values are imputed as per our imputation approach above.  
- A grid search mechanism is then leveraged to train models based on the following parameters:
    - classify__criterion: ['gini', 'entropy']
    - classify__max_depth: [4,5,6,7,8,9,10,11,12,15,20,30,40,50,70,90,120,150]
    - classify__min_samples_leaf: min_samples_leaf_range
    - classify__max_features: ['auto', 'auto', 'log2']
- The model is scored using the accuracy metric
- The model with the best accuracy is then selected as the baseline

In [None]:
# grid search to determine optimal model and parameters for baselining
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.tree import DecisionTreeClassifier

min_samples_leaf_range = [2,3,4,5,6,7,8,9,10,11,15,20,30,40,50,75,100,150]

# Create a dictionary of all the parameter options 
# Note that you can access the parameters of a pipeline step by using '<step name>__<parameter>'
# Note: 'auto' is listed twice for max_features, so that setting is searched twice.
# For DecisionTreeClassifier 'auto' behaved like 'sqrt'; it was deprecated and later removed in newer scikit-learn releases.

parameters ={'classify__criterion':['gini','entropy'],
             'classify__max_depth':[4,5,6,7,8,9,10,11,12,15,20,30,40,50,70,90,120,150],
             'classify__min_samples_leaf': min_samples_leaf_range,
             'classify__max_features': ['auto','auto', 'log2']}

# Name the pipeline instance 'pipe' so it does not shadow the imported Pipeline class
pipe = Pipeline([("imputer", SimpleImputer(missing_values = np.nan, strategy = 'mean')),
                 ("classify",DecisionTreeClassifier())])

In [None]:
# the cv=cvx parameter sets the grid search to split the training and testing data 10 times. 
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn import metrics as mt
X1_train, X1_test, y1_train, y1_test = train_test_split(X, y, test_size = 0.2, random_state = 101)
dt_clf = GridSearchCV(pipe, param_grid = parameters, cv = cvx)

# train the decision tree algorithm
# NOTE: the search is fit on all of X, which includes the rows in X1_test,
# so the accuracy printed below is an optimistic estimate.
%time dt_clf.fit(X,y)
yhat = dt_clf.best_estimator_.predict(X1_test)
print ('accuracy:', mt.accuracy_score(y1_test,yhat))
print (dt_clf.best_params_)

### SUMMARY OF GRID SEARCH 

The GridSearch for our Decision Tree model returned 1944 options with outputs related to time to run the model, recommended parameters, and mean test score (which is accuracy). To quickly assess these options, we explored the results based on accuracy. 
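
The count of 1944 follows directly from the size of the parameter grid. A quick way to check it, sketched here with scikit-learn's ParameterGrid and the same dictionary used in the search, is:

```python
from sklearn.model_selection import ParameterGrid

min_samples_leaf_range = [2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 15, 20, 30, 40, 50, 75, 100, 150]

parameters = {
    'classify__criterion': ['gini', 'entropy'],
    'classify__max_depth': [4, 5, 6, 7, 8, 9, 10, 11, 12, 15, 20, 30, 40, 50, 70, 90, 120, 150],
    'classify__min_samples_leaf': min_samples_leaf_range,
    'classify__max_features': ['auto', 'auto', 'log2'],
}

# Number of parameter combinations: 2 criteria * 18 depths * 18 leaf sizes * 3 max_features entries
n_candidates = len(ParameterGrid(parameters))
print(n_candidates)
```

Each of these combinations is fit once per ShuffleSplit iteration, which helps explain the run time of several minutes.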

The GridSearch identified a model with close to 98% accuracy. The model had a max_depth of 150 and required only two samples at each leaf. The large depth and small leaf size resulted in a very complex model. This complexity makes interpretation quite difficult, so the model did not meet our goal of being easily understood by non-technical audiences. We reviewed and analyzed the estimators to develop a more manageable Decision Tree. 


**RESULTS:**
* Wall time: 8min 48s

* Accuracy: 0.9844587352625938 (98%)

* Parameters: {'classify__criterion': 'entropy', 'classify__max_depth': 150, 'classify__max_features': 'auto', 'classify__min_samples_leaf': 2}



**ANALYSIS:**

The "OPTIMAL" parameters determined by GRID SEARCH algorithm for the BEST model are:

* Criterion: The GridSearch algorithm selected "entropy" over "gini" as the optimal split criterion

* Max_depth: The selected maximum depth is 150 levels, the largest value in the grid, which allows a very deep tree

* Max_features: The 'auto' option was selected, which considers the square root of the number of features at each split

* Min_samples_leaf: 2

* Accuracy: very high at 98%, but the depth that achieves it makes the model much harder to interpret
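
Note that max_depth is only an upper bound; the fitted tree can stop growing well before it reaches that limit. A minimal sketch of how the actual depth can be inspected, using synthetic data from make_classification rather than the spam dataset, is shown here. The variable names and the synthetic data are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (not the spam dataset)
X_toy, y_toy = make_classification(n_samples=500, n_features=10, random_state=101)

tree_clf = DecisionTreeClassifier(criterion="entropy", max_depth=150,
                                  min_samples_leaf=2, random_state=101)
tree_clf.fit(X_toy, y_toy)

actual_depth = tree_clf.get_depth()   # depth the fitted tree actually reached
n_leaves = tree_clf.get_n_leaves()    # number of terminal nodes
print("max_depth allowed: 150, actual depth:", actual_depth, ", leaves:", n_leaves)
```

Applied to the best estimator from the grid search, the same two methods would give a more direct measure of how complex the selected tree really is.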


In [None]:
# Capture the cv results/estimators in a dataframe for review and further analysis
estimatorsdf = pd.DataFrame(dt_clf.cv_results_)

# the estimators can be downloaded as a csv for further analysis to determine optimal pruning opportunities
# estimatorsdf.to_csv(r'C:\Users\kevinm\Documents\GitHub\7333QTW\Case Study 3\Estimators.csv')

estimatorsdf

### Parameters
To fully understand the results of the GridSearch, it is important to understand the parameters used by the Decision Tree classifier.  Pruning the tree involves selecting the most optimal combinations to tune the model that will deliver a good balance of model accuracy and interpretability. 

A summary of definitions of the parameters we used for the GridSearch are below. In addition, there are also other parameters that we did not use.


| Parameter                | Definition                                                                                                                                                                                  |
|:-------------------------|:--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| Criterion                | function to measure the quality of a split; supported criteria are "gini" for Gini impurity and "entropy" for information gain                                                              |
| Splitter                 | strategy used to choose the best split at each node; supported strategies are "best" to choose best split and "random"  to choose best random split                                              |
| Max_depth                | maximum depth of tree; if "None", then nodes are expanded until all leaves are pure or until all leaves contain less than min_samples_split samples                                           |
| Min_samples_leaf         | minimum number of samples required at a leaf node; split point at any depth will only be considered if it leaves at least min_samples_leaf training samples in each left and right branches |
| Max_features             | number of features to consider when looking for best split                                                                                                                                  |


### PRUNING THE MODEL (Q1)

The grid search found a model with very good accuracy of 98%. However, the model is extremely complex, with a maximum depth of 150 levels. A review of all the models evaluated shows that a tree with a much smaller depth is also capable of producing fairly accurate results with fewer branches. This simpler tree describes the important features in a more compact manner and is easier to interpret.

#### Parameters selected for pruning (Q3)

* criterion: 'gini'
* max_depth: 7
* max_features: 'auto'
* min_samples_leaf: 2

As noted above, our goal is to define an interpretable Decision Tree with strong accuracy. To narrow down the 1944 GridSearch results, we looked at models with accuracy greater than 0.90. Acknowledging that a model with higher accuracy would likely require more depth, we felt that 0.90 was an appropriate threshold. To achieve a simpler model, we need to trade off some accuracy for a tree with less depth.  
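
One way to carry out this kind of filtering is sketched below. The frame `toy_results` is a hypothetical excerpt shaped like `pd.DataFrame(dt_clf.cv_results_)`, and its scores are made up for illustration; only the column naming pattern (`param_<step>__<parameter>` and `mean_test_score`) follows scikit-learn.

```python
import pandas as pd

# Hypothetical excerpt of grid search results (scores are invented)
toy_results = pd.DataFrame({
    "param_classify__criterion": ["entropy", "gini", "gini", "entropy", "gini"],
    "param_classify__max_depth": [150, 7, 4, 12, 5],
    "param_classify__min_samples_leaf": [2, 2, 50, 5, 150],
    "mean_test_score": [0.93, 0.91, 0.86, 0.92, 0.84],
})

# Keep models above the accuracy threshold, then list the shallowest trees first
candidates = (
    toy_results[toy_results["mean_test_score"] > 0.90]
    .sort_values(["param_classify__max_depth", "mean_test_score"],
                 ascending=[True, False])
)
print(candidates)
```

Sorting by depth first puts the most interpretable candidates at the top, so the accuracy given up by choosing a shallower tree is easy to read off.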

##### The "Optimized" Decision Tree parameters we ultimately selected for the model can be summarized as follows:
                                      
                                        
| Parameter | Value | Explanation |
|:-|:-|:-|
| Criterion | Gini | function to measure the quality of a split; tells likelihood of incorrect classification of new data, if new data were randomly classified according to distribution of data set class labels. |
| Splitter | Best | strategy used to choose best split at each node; best is default  |
| Max_depth | 7 | tree has maximum depth of 7 layers |
| Min_samples_leaf | 2  | 2 samples required for each leaf node |
| Max_features | Auto  | number of features to consider when looking for best split, Auto equals sqrt of n_features |

In [None]:
# PRUNED Model using max_depth=7 and min_samples_leaf=2 with "GINI" as a criterion
from sklearn.model_selection import train_test_split
from sklearn import metrics as mt
# re-import Pipeline in case the name was reassigned in an earlier cell
from sklearn.pipeline import Pipeline

X1_train, X1_test, y1_train, y1_test = train_test_split(X, y, test_size = 0.2, random_state = 101)

clf = Pipeline([("imputer", SimpleImputer(missing_values = np.nan, strategy = 'mean')),
                ("classify", DecisionTreeClassifier(criterion = 'gini',
                                                    max_depth = 7,
                                                    max_features = 'auto',
                                                    min_samples_leaf = 2))])

# fit on the training split only, then score on the held-out test split
clf.fit(X1_train,y1_train)
yhat2 = clf.predict(X1_test)
print ('accuracy:', mt.accuracy_score(y1_test,yhat2))
print (clf)

### ANALYSIS OF VARIABLES CONTRIBUTING THE MOST TO THE MODEL (Q4)

The final part of our analysis was to explore feature importance. This illustrates which variables in the dataset are the strongest predictors of spam. The list below captures the results from our model and aligns with the Decision Tree paths described in the path analysis section of this document.

The features determined to be the most important to our model are:
* perCaps
* perHTML
* bodyCharCt
* forwards
* numDlr
* isInReplyTo
* avgWordLen
* isWrote

The bar-chart below further illustrates the feature importance and their order of importance to the Decision Tree from left to right.

In [None]:
# Merge field names and feature importance to display them together
fi = pd.DataFrame(clf.steps[1][1].feature_importances_, columns =['featimp'])
featuresnames = pd.DataFrame(X1_test.columns.values.tolist(), columns =['fields'])

featimpdf = pd.merge(featuresnames, fi, left_index=True, right_index=True)
featimpdf = featimpdf.sort_values(by='featimp', ascending=False)
featimpdf

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt
plt.figure(figsize=(15,8))
ax = sns.barplot(x=featimpdf.fields, y=featimpdf.featimp)
ax.set_title('FEATURE IMPORTANCE')
plt.xlabel("Feature")
plt.ylabel("Importance")
ax.set_xticklabels(ax.get_xticklabels(), rotation=65, horizontalalignment='right')

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn import tree
#fig, axes = plt.subplots(nrows = 1, ncols = 1, figsize = (50,20), dpi = 300)

#tree.plot_tree(clf.steps[1][1]);

In [None]:
# class_names must be listed in ascending order of the class labels (not spam first, then spam)
cn = ['NotSpam', 'Spam']
#fig, axes = plt.subplots(nrows = 1, ncols = 1, figsize = (50,20), dpi = 300)

#tree.plot_tree(clf.steps[1][1],
#               feature_names = features, 
#               class_names=cn,
#               filled = True);
# fig.savefig(r'C:\Users\jjsch\Downloads\Week_5_Materials_2\plottreefncn.png')

In [None]:
# export the decision tree to a dot file
from graphviz import Digraph
from graphviz import Source
from io import StringIO
import pydot
import graphviz

tree.export_graphviz(clf.steps[1][1],
                     out_file=(r"C:\Users\kevinm\Documents\GitHub\7333QTW\Case Study 3\tree.dot"),
                               #(r"C:\Users\jjsch\Downloads\Week_5_Materials_2\tree.dot"),
                     feature_names = features, 
                     class_names=cn,
                     filled = True)

dotfile = StringIO()

In [None]:
import os
os.environ["PATH"] += os.pathsep + 'C:/Program Files (x86)/Graphviz2.38/bin/'
# export to PDF for easier zooming and viewing of the decision tree. 
dot_data = tree.export_graphviz(clf.steps[1][1], out_file=None, feature_names=features, class_names=cn,
               filled = True)
graph = graphviz.Source(dot_data) 
graph.render("case3chosentree",view = True)

### ANALYZE PATHS THROUGH TREE (Q2)


The images below show portions of the chosen tree and are included for the purpose of explanation. To see the complete tree that was used, open the file at this location: https://github.com/jjschueder/7333QTW/blob/master/Case%20Study%203/case3chosentree.pdf



**High Level Visual of the Chosen Tree:**
The tree uses eight layers to classify an email as spam or not spam. Each node contains a threshold against which an email attribute is compared. Depending on whether the email meets the criterion, it is either classified or passed on to another node where further decisions are made. At the top of the visual, two lines extend downward from the first decision node: the left line means the email meets the criterion displayed in the node, while the right line means it does not. The same holds as the tree branches downward to further layers: left branches mean the criterion is met, and right branches mean it is not. 

Each node contains the classification decision, a gini index, samples, value, and a class value. 

* gini - "The gini score is a metric that quantifies the purity of the node/leaf. A gini score greater than zero implies that samples contained within that node belong to different classes. A gini score of zero means that the node is pure, that within that node only a single class of samples exist." - Source - https://towardsdatascience.com/scikit-learn-decision-trees-explained-803f3812290d
* samples - the number of training samples that reach this node.
* value - the count of samples in each class at this node.
* class value - the majority class at this node, i.e., the direction the decision is currently leaning (e.g., spam or not spam)
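
The gini value shown in each node can be reproduced from the class counts in that node's value field. The sketch below uses the standard Gini impurity formula (one minus the sum of squared class proportions); the helper name `gini_impurity` and the example counts are illustrative assumptions, not values read from the chosen tree.

```python
def gini_impurity(class_counts):
    """Gini impurity of a node given the number of samples in each class."""
    total = sum(class_counts)
    if total == 0:
        return 0.0
    # 1 minus the sum of squared class proportions
    return 1.0 - sum((count / total) ** 2 for count in class_counts)

print(gini_impurity([50, 50]))            # evenly mixed node
print(gini_impurity([100, 0]))            # pure node
print(round(gini_impurity([90, 10]), 2))  # mostly one class
```

An evenly mixed two-class node has the maximum impurity of 0.5, a pure node has 0, and nodes dominated by one class fall in between.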

![title](img/ChosenTreehighlevelgraph.png)

**Left Side of Tree:**
This side of the tree contains the emails that meet the first node's criterion, which is numEnd less than or equal to 0.5. Of the samples, 5,574 rows are routed to the left side of the tree, while the remaining 1,888 are analyzed further in the right branch. The next node evaluates whether perCaps is less than 12.61, and we continue to follow the branch where this is true; 1301 samples meet this criterion. Next, subExcCt is evaluated: if the value is greater than 0.5, the email moves right to a node that tests whether forwards is less than or greater than 11.84. If forwards is greater than 11.84, the email is classified as spam. This branch also contains many other evaluations, ending in leaves that classify emails as either spam or not spam. 
![title](img/TreeLeftBranchZoom.png)

**Right Side of Tree:** 
This side of the tree contains the emails that do not meet the first node's criterion. Next, an evaluation is performed on whether bodyCharCt is less than 11,726. Approximately 157 of the 891 emails evaluated do not meet this criterion and are passed to the next node, where subBlanks is evaluated. Those below the threshold value are classified as spam. 
![title](img/RightBranchTreeZoom.png)


Both sides of the tree have many decision points across the 8 layers. As might be expected, these decisions are based on the fields that the model identified as important features. The features with the highest importance values are used more often in the tree to make decisions. Each node also contains a gini value, the number of samples, and the count of samples in each class, along with a label that indicates whether the node leans towards spam or not spam. 

### Q5: EVALUATE PERFORMANCE OF MODEL

###  DECISION TREE MODEL EVALUATION AND SELECTION BASED ON METRICS (Q5)

In order to evaluate the model, we reviewed several standard metrics, which are defined below. The aim is to maximize our precision, recall, and accuracy scores in our models.


The model had the following evaluation profile, calculated by averaging each metric over 10 cross-validation splits:  
 * Accuracy: 90.1%
 * Precision: 90.1%
 * Recall: 90.1%
 * F1: 89.9%

 

* **Confusion Matrix** - The confusion matrix shows that this model categorized 55 emails incorrectly as spam when they were not and 123 emails as not spam when they actually were spam.  
 [1315   55]   
 [ 123  373]


  
* **ACCURACY** - total number of correct predictions (True Positives/TP; True Negatives/TN) over total number of predictions made. <br>
Accuracy = (TP + TN)/(TP + FP + FN + TN)


* **PRECISION** - proportion of true positives over the total number of predicted positives, whether correctly predicted (TP) or incorrectly predicted (FP); shows how often the model is right when it assigns the positive class.<br>
Precision = (TP) / (TP + FP)


* **RECALL** - proportion of actual positives that were correctly classified by the model; shows how many positives the model missed (FN); a good pair with precision to determine whether the model is overfitting or favoring a single class; also known as sensitivity.<br>
Recall/Sensitivity = (TP) / (TP + FN) 


* **F1 SCORE** - harmonic mean of precision and recall; balances false positives and false negatives in a single number.<br>
F1 score = 2 * (Precision * Recall) / (Precision + Recall)
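
As a worked example, the four definitions can be applied directly to the confusion matrix quoted above, treating spam as the positive class and computing F1 as the harmonic mean of precision and recall. This is only a sketch: `binary_metrics` is an illustrative name, and the results are spam-class metrics from that one matrix, so they need not match the weighted, cross-validated averages reported earlier.

```python
def binary_metrics(tn, fp, fn, tp):
    """Accuracy, precision, recall and F1 for the positive class.

    Assumes at least one predicted positive and one actual positive,
    so none of the denominators is zero.
    """
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Counts taken from the confusion matrix [[1315, 55], [123, 373]]
scores = binary_metrics(tn=1315, fp=55, fn=123, tp=373)
for name, value in scores.items():
    print(f"{name}: {value:.3f}")
```

For this matrix the sketch gives roughly 0.90 accuracy, 0.87 precision, 0.75 recall and 0.81 F1 for the spam class, which shows that missed spam (false negatives) costs the model more than false alarms do.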


We further plotted an **ROC curve** to visualize the Decision Tree's performance. An ROC curve, or Receiver Operating Characteristic curve, plots a classifier's True Positive Rate (TPR) against its False Positive Rate (FPR) as the decision threshold varies, which also allows classifiers to be compared. An AUC (Area Under the Curve) score of 1.0 denotes a perfect classifier, and an area of 0.5 represents a model that is no better than a random guess. The higher the AUC, the better the classifier.
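
To make the threshold sweep concrete, the sketch below builds ROC points and a trapezoidal AUC by hand on made-up labels and scores. It is a plain-Python illustration, not the scikit-learn `roc_curve` and `auc` routines used in the notebook, and `roc_points` and `auc_trapezoid` are hypothetical names.

```python
def roc_points(y_true, scores):
    """Sweep a threshold from high to low and collect (FPR, TPR) points."""
    n_pos = sum(y_true)
    n_neg = len(y_true) - n_pos
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    fpr, tpr = [0.0], [0.0]
    tp = fp = 0
    previous = None
    for i in order:
        # Emit a point each time the score changes, so tied scores share one threshold
        if previous is not None and scores[i] != previous:
            fpr.append(fp / n_neg)
            tpr.append(tp / n_pos)
        if y_true[i] == 1:
            tp += 1
        else:
            fp += 1
        previous = scores[i]
    fpr.append(fp / n_neg)
    tpr.append(tp / n_pos)
    return fpr, tpr


def auc_trapezoid(fpr, tpr):
    """Area under the piecewise-linear curve through the ROC points."""
    return sum((fpr[k] - fpr[k - 1]) * (tpr[k] + tpr[k - 1]) / 2
               for k in range(1, len(fpr)))


labels = [0, 0, 1, 1, 0, 1]                        # made-up ground truth (1 = spam)
probabilities = [0.1, 0.4, 0.35, 0.8, 0.2, 0.9]    # made-up spam probabilities
hard_labels = [1 if p >= 0.5 else 0 for p in probabilities]

for name, scores in [("probabilities", probabilities), ("hard labels", hard_labels)]:
    fpr, tpr = roc_points(labels, scores)
    print(name, "points:", len(fpr), "AUC:", round(auc_trapezoid(fpr, tpr), 3))
```

With hard 0/1 predictions there is only one usable threshold, so the curve collapses to a single point between the two corners. Probability scores trace more of the curve; in scikit-learn, `cross_val_predict` accepts `method='predict_proba'` for that purpose.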

In addition to the chosen model, a more complex model produced by the grid search reported higher accuracy scores of close to 98%. This model was discarded due to concerns about the number of layers it contained (more than 130). The depth posed two problems: the tree would be difficult to explain, and the model was likely overfitted and would not generalize cleanly to new data sets with different data profiles. Finally, a simple Gaussian model was also analyzed. Though it produced good results for a fairly simple model, those results were not as high as those of the chosen simplified decision tree. The Gaussian model can be viewed here: https://github.com/jjschueder/7333QTW/blob/master/Case%20Study%203/Case_Study_3_Gaussian.ipynb



In [17]:
from sklearn.metrics import f1_score
from sklearn.metrics import precision_score
from sklearn.metrics import recall_score
from sklearn import metrics as mt
totalacc = 0
totalprec = 0
totalrec= 0
totalf1  = 0
for fold, (train, test) in enumerate(cvx.split(X,y)):
#    print ('Next Evaluation:')
    # train the decision tree algorithm
    clf.fit(X.iloc[train],y[train])
    yhat = clf.predict(X.iloc[test])
#    print ('accuracy:', mt.accuracy_score(y[test],yhat),'\n')
    conf = mt.confusion_matrix(y[test],yhat)
#    print("Confusion Matrix\n",conf,'\n')
#    print("Precision Score is: {}" .format(precision_score(y[test],yhat, average='weighted')),'\n')
#    print("Recall Score is: {}" .format(recall_score(y[test],yhat, average='weighted')),'\n')
#    print("F1 Score is: {}" .format(f1_score(y[test],yhat, average='weighted')),'\n')
    acc = mt.accuracy_score(y[test],yhat)
    prec = precision_score(y[test],yhat, average='weighted')
    rec = recall_score(y[test],yhat, average='weighted')
    f1 = f1_score(y[test],yhat, average='weighted')
    totalacc += acc
    totalprec += prec
    totalrec += rec
    totalf1 += f1

avgaccuracy = 100*totalacc / cvx.n_splits
avgprec = 100*totalprec / cvx.n_splits
avgrec = 100*totalrec / cvx.n_splits
avgf1 = 100*totalf1 / cvx.n_splits
print("Ten Time Split Average Metrics: \n")
print("Accuracy:", avgaccuracy,'\n Precision:',avgprec,'\n Recall:', avgrec, '\n F1: ', avgf1, '\n')
print("Last evaluation confusion Matrix: \n")
print("Confusion Matrix\n",conf,'\n')

Ten Time Split Average Metrics: 

Accuracy: 89.20685959271168 
 Precision: 89.42257343833556 
 Recall: 89.20685959271168 
 F1:  88.7554777319813 

Last evaluation confusion Matrix: 

Confusion Matrix
 [[1348   22]
 [ 208  288]] 



In [18]:
import numpy as np
import seaborn as sns
from sklearn.metrics import accuracy_score, confusion_matrix, roc_auc_score, auc, roc_curve
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import cross_val_predict

plt.figure(figsize=(10, 6))
sns.set(font_scale=2)
plt.title('CONFUSION MATRIX')
group_names = ['True Neg','False Pos','False Neg','True Pos']
group_counts = ["{0:0.0f}".format(value) for value in
                conf.flatten()]
group_percentages = ["{0:.2%}".format(value) for value in
                     conf.flatten()/np.sum(conf)]
labels = [f"{v1}\n{v2}\n{v3}" for v1, v2, v3 in
          zip(group_names,group_counts,group_percentages)]
labels = np.asarray(labels).reshape(2,2)
sns.heatmap(conf, annot=labels, fmt='', cmap='Blues')


In [20]:
import os
os.environ["PATH"] += os.pathsep + 'C:/Program Files (x86)/Graphviz2.38/bin/'

#https://graphviz.gitlab.io/_pages/Download/Download_windows.html

ylist = Y.values.astype('int64')
ylist
from sklearn.preprocessing import label_binarize
ybinary = label_binarize(ylist, classes=[0, 1])
n_classes = ybinary.shape[1]
X_train2, X_test2, y_train2, y_test2 = train_test_split(X,ybinary, test_size=0.2)
y_score = cross_val_predict(clf, X, ybinary, cv=10 ,method='predict')

In [21]:
#X1_train, X1_test, y1_train, y1_test
#y_score = classifier.fit(X_train3, y_train3).decision_function(X_test3)
# Compute ROC curve and ROC area for each class
fpr = dict()
tpr = dict()
roc_auc = dict()
for i in range(n_classes):

    #y1_test,yhat    
    fpr[i], tpr[i], _ = roc_curve(ybinary, y_score)
    roc_auc[i] = auc(fpr[i], tpr[i])

# Compute micro-average ROC curve and ROC area
fpr["micro"], tpr["micro"], _ = roc_curve(ybinary.ravel(), y_score.ravel())
roc_auc["micro"] = auc(fpr["micro"], tpr["micro"])

#Plot the ROC curve for the spam class (index 0, the only column after binarizing)
plt.figure()
lw = 2
plt.plot(fpr[0], tpr[0], color='darkorange',
         lw=lw, label='ROC curve (area = %0.2f)' % roc_auc[0])
plt.plot([0, 1], [0, 1], color='navy', lw=lw, linestyle='--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver Operating Characteristic Spam')
plt.legend(loc="lower right")
plt.show()



### SUMMARY/CONCLUSION/FINAL THOUGHTS

Our mission is to reduce the time lost to spam email and the threat it poses, so that you and your customers can be more productive. We created two models as discussed above, one with a max depth of 150 for accuracy and the other with a max depth of 7 for simplicity and generalization. We chose the less complex model for implementation because spam is always changing, and the hope is that a more generalized model will stay relevant to a wide range of attacks without being retrained. This idea is supported by the 10-fold cross validation that was performed: the average scores did not change by more than a percentage point from the original score of 89.2%. As stated above, the model is also more interpretable, which makes it easier to explain to users. An easy-to-read chart can be provided to help users understand which attributes are most likely to indicate that an email is spam.

Future implementations can use a variation of the pipeline built above to take feedback from users and keep the model fresh and accurate. The pipeline is built to handle missing values that users or emails may have left out. Spammers are always adapting to their environment, and our model, along with feedback from users, will adapt along with them.


## APPENDIX

#### Backup evaluation of complex model
The evaluation below shows cross validation of the deeper tree model with more branches.  Accuracy and other metrics are better.

In [None]:
dt_clf=DecisionTreeClassifier(**dt_clf.best_params_)
dt_clf.fit(X1_train,y1_train)
yhat = dt_clf.predict(X1_test)
print ('accuracy:', mt.accuracy_score(y1_test,yhat))


fi_dt = pd.DataFrame(dt_clf.feature_importances_, columns =['featimp'])
featuresnames_dt = pd.DataFrame(X1_test.columns.values.tolist(), columns =['fields'])
featimpdt_df = pd.merge(featuresnames_dt, fi_dt, left_index=True, right_index=True)
# sort the complex model's own importances (featimpdt_df, not the simpler model's frame)
featimpdt_df = featimpdt_df.sort_values(by='featimp', ascending=False)
featimpdt_df

In [None]:
from sklearn.metrics import f1_score
from sklearn.metrics import precision_score
from sklearn.metrics import recall_score
totalacc = 0
totalprec = 0
totalrec= 0
totalf1  = 0
for fold, (train, test) in enumerate(cvx.split(X,y)):
    print ('Next Evaluation:')
    # train the decision tree algorithm
    dt_clf.fit(X.iloc[train],y[train])
    yhat = dt_clf.predict(X.iloc[test])
    print ('accuracy:', mt.accuracy_score(y[test],yhat),'\n')
    conf = mt.confusion_matrix(y[test],yhat)
    print("confusion matrix\n",conf,'\n')
    print("Precision Score is: {}" .format(precision_score(y[test],yhat, average='weighted')),'\n')
    print("Recall Score is: {}" .format(recall_score(y[test],yhat, average='weighted')),'\n')
    print("F1 Score is: {}" .format(f1_score(y[test],yhat, average='weighted')),'\n')
    acc = mt.accuracy_score(y[test],yhat)
    prec = precision_score(y[test],yhat, average='weighted')
    rec = recall_score(y[test],yhat, average='weighted')
    f1 = f1_score(y[test],yhat, average='weighted')
    totalacc += acc
    totalprec += prec
    totalrec += rec
    totalf1 += f1

avgaccuracy = 100*totalacc / cvx.n_splits
avgprec = 100*totalprec / cvx.n_splits
avgrec = 100*totalrec / cvx.n_splits
avgf1 = 100*totalf1 / cvx.n_splits
print("Ten Time Split Average Metrics: \n")
print("Accuracy:", avgaccuracy,'\n Precision:',avgprec,'\n Recall:', avgrec, '\n F1: ', avgf1, '\n')
print("Last evaluation confusion Matrix: \n")
print("Confusion Matrix\n",conf,'\n')

In [None]:
import os
os.environ["PATH"] += os.pathsep + 'C:/Program Files (x86)/Graphviz2.38/bin/'

#https://graphviz.gitlab.io/_pages/Download/Download_windows.html

ylistd = Y.values.astype('int64')
ylistd
from sklearn.preprocessing import label_binarize
ybinaryd = label_binarize(ylistd, classes=[0, 1])
n_classes = ybinaryd.shape[1]
X_train2d, X_test2d, y_train2d, y_test2d = train_test_split(X,ybinaryd, test_size=0.2)
y_scored = cross_val_predict(dt_clf, X, ybinaryd, cv=10 ,method='predict')

In [None]:
#X1_train, X1_test, y1_train, y1_test
#y_score = classifier.fit(X_train3, y_train3).decision_function(X_test3)
# Compute ROC curve and ROC area for each class
fprd = dict()
tprd = dict()
roc_aucd = dict()
for i in range(n_classes):

#y1_test,yhat    
    fprd[i], tprd[i], _ = roc_curve(ybinaryd, y_scored)
    roc_aucd[i] = auc(fprd[i], tprd[i])

# Compute micro-average ROC curve and ROC area
fprd["micro"], tprd["micro"], _ = roc_curve(ybinaryd.ravel(), y_scored.ravel())
roc_aucd["micro"] = auc(fprd["micro"], tprd["micro"])

#Plot the ROC curve for the spam class (index 0, the only column after binarizing)
plt.figure()
lw = 2
plt.plot(fprd[0], tprd[0], color='darkorange',
         lw=lw, label='ROC curve (area = %0.2f)' % roc_aucd[0])
plt.plot([0, 1], [0, 1], color='navy', lw=lw, linestyle='--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver Operating Characteristic Spam')
plt.legend(loc="lower right")
plt.show()

# Excluded explanations but saved for reference if needed

In the Python sklearn package, "Stratified Shuffle Split" provides train/test indices to split data into the train and test data sets.
This cross-validation object is a combination of the StratifiedKFold and ShuffleSplit methods, which returns stratified randomized folds. The folds are made by preserving the percentage of samples for each class. (Reference: https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.StratifiedShuffleSplit.html)

The settings we used for training are 1 fold using **80%** of the data for training and **20%** for testing in our initial model. 
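
To illustrate what "preserving the percentage of samples for each class" means, the sketch below performs one stratified 80/20 split by hand on made-up labels. It is a simplified stand-in written for illustration, not scikit-learn's `StratifiedShuffleSplit`, and `stratified_shuffle_split` is a hypothetical helper.

```python
import random
from collections import Counter, defaultdict


def stratified_shuffle_split(labels, test_size=0.2, seed=101):
    """Return (train_indices, test_indices) with class proportions preserved."""
    rng = random.Random(seed)
    indices_by_class = defaultdict(list)
    for index, label in enumerate(labels):
        indices_by_class[label].append(index)

    train, test = [], []
    for indices in indices_by_class.values():
        rng.shuffle(indices)                      # shuffle within each class
        n_test = round(len(indices) * test_size)  # same test fraction per class
        test.extend(indices[:n_test])
        train.extend(indices[n_test:])
    return train, test


labels = [0] * 80 + [1] * 20   # made-up data: 80 not-spam, 20 spam
train_idx, test_idx = stratified_shuffle_split(labels)
print("train:", Counter(labels[i] for i in train_idx))
print("test: ", Counter(labels[i] for i in test_idx))
```

Both subsets keep the 80/20 class ratio of the full data. A plain shuffle split only matches that ratio on average, which matters more when one class is rare.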

** DID WE USE STRATIFIED SHUFFLE SPLIT OR JUST SHUFFLE SPLIT ?? It looks like we commented out Strat Shuffle Split. Why did we decide to do that? Why was Shuffle Split better than Strat Shuffle Split??


# now divide the data into test and train using scikit learn built-ins
from sklearn.model_selection import StratifiedShuffleSplit 
from sklearn.model_selection import ShuffleSplit
cvx = ShuffleSplit(n_splits=10, test_size=0.20, random_state=101)
#cv = StratifiedShuffleSplit(n_splits=10,train_size=0.8)
print (cvx)


# features
features = ['isRe', 'underscore', 'priority', 'isInReplyTo',
            'sortedRec', 'subPunc', 'multipartText', 'hasImages', 'isPGPsigned',
            'subSpamWords', 'noHost', 'numEnd', 'isYelling', 'isOrigMsg', 'isDear',
            'isWrote', 'numLines', 'bodyCharCt', 'subExcCt', 'subQuesCt', 'numAtt',
            'numRec', 'perCaps', 'hour', 'perHTML', 'subBlanks', 'forwards',
            'avgWordLen', 'numDlr']

X = dfNoNa[features].copy()

#scaler = StandardScaler()
#scaler.fit(X2)

#This makes our model's coefficients take on the same scale for accurate feature importance analysis
#Notice we scaled the data before the cross validation
#X = scaler.transform(X2)

Y= dfNoNa[['isSpam']].copy()
y = Y.values


* RAN BASELINE MODEL -- with accuracy of 96%. can we get other stats about the baseline tree: how many leaves/levels, what were the results of the other metrics. 

** We could define the metrics to evaluate models here (before we run baseline) to be able to show the full change in results between base and our defined model.


In [None]:
# Create a new dataset after filtering out NaNs for variable subQuesCt 
# which also addressed NaNs in subExcCt, subBlanks, isYelling and subSpamWords
# dfNoNa = df[~df['subQuesCt'].isnull()]

# Imputation for variable: numRec - Replace with mean
# import numpy as np
# from sklearn.impute import SimpleImputer
# imp = SimpleImputer(missing_values="NaN", strategy='mean')

# X = dfNoNa.iloc[:, 23].values
# imp = SimpleImputer(missing_values=-1, strategy='mean')
# imp.fit(X)

# # dfNoNa = imp.fit(dfNoNa.iloc[:, 23].values)
# # Imputation for variable: noHost - Replace with mode
# imp = SimpleImputer(strategy="most_frequent")

# # Data Summary post imputations
# countOfNan = pd.Series(dfNoNa.isnull().sum()) 
# DataType = pd.Series(dfNoNa.dtypes) 

# # Assemble into a single dataframe for viewing
# frame = { 'datatype': DataType, 'count of Nan': countOfNan } 
# result = pd.DataFrame(frame) 
# print(result)

In [None]:
from sklearn.metrics import confusion_matrix
cf_matrix = confusion_matrix(y1_test,yhat)
print(cf_matrix)

### train the decision tree algorithm
%time dt_clf.fit(X1_train,y1_train)
yhat = dt_clf.best_estimator_.predict(X1_test)
print ('accuracy:', mt.accuracy_score(y1_test,yhat))
print (dt_clf.best_params_)
Wall time: 1min 2s
accuracy: 0.9463792150359315
{'criterion': 'gini', 'max_depth': 120, 'max_features': 'auto', 'min_samples_leaf': 2}




In [None]:
#Pipeline.fit(X1_train,y1_train)
#yhat = Pipeline.predict(X1_test)
#print ('accuracy:', mt.accuracy_score(y1_test,yhat))
#print (dt_clf)

In [None]:
# from sklearn.tree import DecisionTreeClassifier
# from sklearn.model_selection import train_test_split, GridSearchCV, KFold, TimeSeriesSplit, StratifiedShuffleSplit
# from sklearn import metrics as mt
# X1_train, X1_test, y1_train, y1_test = train_test_split(X, y, test_size = 0.2, random_state = 101)

# impute missing values for noHost and numRec using the mean from the training set
# X1_train.loc[X1_train['numRec'].isnull(),'numRec'] = X1_train.numRec.mean(skipna = True)
# X1_test.loc[X1_test['numRec'].isnull(),'numRec'] = X1_train.numRec.mean(skipna = True)
# X1_train.loc[X1_train['noHost'].isnull(),'noHost'] = X1_train.noHost.mean(skipna = True)
# X1_test.loc[X1_test['noHost'].isnull(),'noHost'] = X1_train.noHost.mean(skipna = True)

# run baseline decision tree classifier 
#dt_clf = DecisionTreeClassifier()
#dt_clf.fit(X1_train,y1_train)
#yhat = dt_clf.predict(X1_test)
#print ('accuracy:', mt.accuracy_score(y1_test,yhat))
#print (dt_clf)

### Q3: EXPLAIN PARAMETERS INVOLVED IN "PRUNING" THE MODEL

#### Parameters
To fully understand the results of the GridSearch, it is important to understand the parameters used by the Decision Tree classifier.  Pruning the tree involves selecting the combination of parameter values that delivers a good balance of model accuracy and interpretability. 

A summary of the parameter definitions is below. 


| Parameter                | Definition                                                                                                                                                                                  |
|:-------------------------|:--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| Criterion                | function to measure the quality of a split; supported criteria are "gini" for Gini impurity and "entropy" for information gain                                                              |
| Splitter                 | strategy used to choose the best split at each node; supported strategies are "best" to choose best split and "random"  to choose best random split                                              |
| Max_depth                | maximum depth of tree; if "None", then nodes are expanded until all leaves are pure or until all leaves contain less than min_samples_split samples                                           |
| Min_samples_split        | minimum number of samples required to split an internal node                                                                                                                                |
| Min_samples_leaf         | minimum number of samples required at a leaf node; split point at any depth will only be considered if it leaves at least min_samples_leaf training samples in each left and right branches |
| Min_weight_fraction_leaf | minimum weighted fraction of sum total of weights (for all input samples) required to be at a leaf node; samples have equal weight when sample_weight is not provided                       |
| Max_features             | number of features to consider when looking for best split                                                                                                                                  |
| Random_state             | controls randomness of the estimator                                                                                                                                                            |
| Max_leaf_nodes           | grow a tree with max_leaf_nodes in best-first fashion; best nodes defined as relative reduction in impurity                                                                                 |
| Min_impurity_decrease    | node will be split if this split induces a decrease of impurity greater than/equal to this value                                                                                            |
| Class_weight             | weights associated with classes                                                                                                                                                             |
| Ccp_alpha                | complexity parameter used for Minimal Cost-Complexity Pruning                                                                                                                               |


### PRUNING THE MODEL

Explain the rationale for our decisions, how that impacted from base model
The grid search found a model with very good accuracy of 98%. However, the model is extremely complex, with a depth of 120. A review of all the models evaluated shows that a tree with a much smaller depth is also capable of producing fairly accurate results with fewer branches. This simpler tree describes the important features more compactly and is easier to interpret.

#### Parameters selected for pruning

* criterion: 'gini'
* max_depth: 7
* max_features: 'auto'
* min_samples_leaf: 2

As noted above, our goal is to define an interpretable Decision Tree with strong accuracy. To narrow down the 1944 GridSearch results, we looked at models with accuracy greater than 0.90. Acknowledging that a model with higher accuracy would likely require more depth, we felt that 0.90 was an appropriate threshold. To achieve a simpler model, we need to trade off some accuracy for a tree with less depth.  
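
The selection step described above (keep the grid results that clear an accuracy threshold, then prefer the shallowest tree) can be sketched as follows. Everything here is illustrative: the data come from `make_classification`, the grid is far smaller than the one that produced 1944 results, and the threshold is lowered to suit the synthetic data.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the email features (assumption, not the project data)
X_demo, y_demo = make_classification(n_samples=500, n_features=10,
                                     n_informative=5, random_state=101)

# Small hypothetical grid: 2 * 4 * 3 = 24 combinations
param_grid = {
    "criterion": ["gini", "entropy"],
    "max_depth": [3, 5, 7, None],
    "min_samples_leaf": [1, 2, 5],
}
search = GridSearchCV(DecisionTreeClassifier(random_state=101),
                      param_grid, scoring="accuracy", cv=5)
search.fit(X_demo, y_demo)

threshold = 0.80  # illustrative cut-off; the report used 0.90 on the real data
results = list(zip(search.cv_results_["params"],
                   search.cv_results_["mean_test_score"]))
candidates = [(params, score) for params, score in results if score > threshold]
if not candidates:  # fall back to the best model if nothing clears the bar
    candidates = [(search.best_params_, search.best_score_)]


def depth_key(item):
    """Prefer shallow trees first, then higher accuracy; None means unlimited depth."""
    params, score = item
    depth = params["max_depth"]
    return (float("inf") if depth is None else depth, -score)


chosen_params, chosen_score = min(candidates, key=depth_key)
print(chosen_params, round(chosen_score, 3))
```

Ordering by depth first and accuracy second makes the trade-off explicit: any model above the threshold is treated as accurate enough, and interpretability decides among them.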

##### The "Optimized" Decision Tree parameters we ultimately selected for the model can be summarized as follows:
                                      
                                        
| Parameter | Value | Explanation |
|:-|:-|:-|
| Criterion | Gini | function to measure the quality of a split; supported criteria are "gini" for Gini impurity and "entropy" for information gain |
| Splitter | Best | strategy used to choose best split at each node; supported strategies are "best" to choose best split and "random" to choose best random split |
| Max_depth | 7 | maximum depth of tree; if None, then nodes are expanded until all leaves are pure or until all leaves contain less than min_samples_split samples |
| Min_samples_split | 2  | minimum number of samples required to split an internal node |
| Min_samples_leaf | 2  | minimum number of samples required at a leaf node; split point at any depth will only be considered if it leaves at least min_samples_leaf training samples in each left and right branches |
| Min_weight_fraction_leaf |  | minimum weighted fraction of sum total of weights (for all input samples) required to be at a leaf node; samples have equal weight when sample_weight is not provided |
| Max_features | Auto  | number of features to consider when looking for best split |
| Random_state | None | controls randomness of estimator |
| Max_leaf_nodes | None  | grow a tree with max_leaf_nodes in best-first fashion; best nodes defined as relative reduction in impurity |
| Min_impurity_decrease | 0.0 | node will be split if this split induces a decrease of impurity greater than/equal to this value |
| Class_weight | None | weights associated with classes |
| Ccp_alpha | 0.0 | complexity parameter used for Minimal Cost-Complexity Pruning |
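
The selected values in the table map directly onto constructor arguments. The sketch below fits such a tree on synthetic data (`make_classification` stands in for the email features, which are not reproduced here). One assumption to flag: newer scikit-learn releases no longer accept `max_features='auto'`, so the sketch uses `'sqrt'`, which is what `'auto'` meant for classification trees. Min_weight_fraction_leaf is left at the library default because the table gives no value.

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the email features (assumption, not the project data)
X_demo, y_demo = make_classification(n_samples=400, n_features=12,
                                     n_informative=6, random_state=101)

pruned_clf = DecisionTreeClassifier(
    criterion="gini",
    splitter="best",
    max_depth=7,
    min_samples_split=2,
    min_samples_leaf=2,
    max_features="sqrt",        # stands in for 'auto' (see note above)
    random_state=None,
    max_leaf_nodes=None,
    min_impurity_decrease=0.0,
    class_weight=None,
    ccp_alpha=0.0,
)
pruned_clf.fit(X_demo, y_demo)

# The depth cap limits how many tests any single email passes through
print("depth:", pruned_clf.get_depth())
print("leaves:", pruned_clf.get_n_leaves())
```

Because `random_state` is `None` and only a subset of features is considered at each split, repeated fits can produce slightly different trees. Fixing `random_state` would make the tree reproducible.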