# Homework Assignment 5: Model Evaluation
As in the previous assignments, in this homework assignment you will continue your exploration of the [SWAN-SF Dataset](https://doi.org/10.7910/DVN/EBCFKM), described in the paper found [here](https://doi.org/10.1038/s41597-020-0548-x).


This assignment will utilize a copy of the extracted feature dataset we have been working with. The dataset has been processed by performing outlier clipping, z-score and range scaling, and forward feature selection to select 20 features. We are now going to utilize more than one partition worth of data, so for the z-score and range scaling, the mean, standard deviation, minimum, and maximum were calculated using data from both partitions so that a global scaling can be performed on each partition. 

---

## Step 1: Downloading the Data

This assignment will continue to only use [Partition 1](https://dataverse.harvard.edu/api/access/datafile/:persistentId?persistentId=doi:10.7910/DVN/EBCFKM/BMXYCB) and will add the use of [Partition 2](https://dataverse.harvard.edu/api/access/datafile/:persistentId?persistentId=doi:10.7910/DVN/EBCFKM/TCRPUD) as a testing set. 

---

For this assignment, cleaning, transforming, and normalization of the data has been completed using both partitions to find the various minimum, maximum, standard deviation, and mean values needed to perform these operations. Recall from lecture that we should not perform these operations on each partition individually, but as a whole as there may(will) be different values for these in different partitions. 

For example, if we perform simple range scaling on each partition individually and we see a range of 0 to 100 in one partition and 0 to 10 in another. After individual scaling the values with 100 in the first would be mapped to 1 just like the values that had 10 in the second. This can cause serious performance problems in your model, so I have made sure that the normalization was treated properly for you. 

Below you will find the full partitions and `toy` sampled data from each partition, where only 20 samples from each of our 5 classes have been included in the data.  

#### Full
- [Full Normalized Partition 1 feature dataset](http://dmlab.cs.gsu.edu/solar/data/normalized_partition1ExtractedFeatures.csv)
- [Full Normalized Partition 2 feature dataset](http://dmlab.cs.gsu.edu/solar/data/normalized_partition2ExtractedFeatures.csv)

#### Toy
- [Toy Normalized Partition 1 feature dataset](http://dmlab.cs.gsu.edu/solar/data/toy_normalized_partition1ExtractedFeatures.csv)
- [Toy Normalized Partition 2 feature dataset](http://dmlab.cs.gsu.edu/solar/data/toy_normalized_partition2ExtractedFeatures.csv)

Now that you have the two files, you should load each into a Pandas DataFrame using the [pandas.read_csv](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html) method. 

---

### Evaluation Metric

For each of the models we evaluate in this assignmnet, you will calculate the True Skill Statistic score using the test data from Partition 2 to determine which model performs the best for classifying the positive flaring class.

    True skill statistic (TSS) = TPR + TNR - 1 = TPR - (1-TNR) = TPR - FPR

Where:

    True positive rate (TPR) = TP/(TP+FN) Also known as recall or sensitivity
    True negative rate (TNR) = TN/(TN+FP) Also known as specificity or selectivity
    False positive rate (FPR) = FP/(FP+TN) = (1-TNR) Also known as fall-out or false alarm ratio


**Recall**

    True positive (TP)
    True negative (TN)
    False positive (FP)
    False negative (FN)
    
See [confusion matrix](https://en.wikipedia.org/wiki/Confusion_matrix) for more information.

Below is a function implemented to provide your score for each model.

In [1]:
import os
import itertools
import pandas as pd
from pandas import DataFrame 
import numpy as np
from sklearn.metrics import confusion_matrix
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import f_classif
from sklearn.feature_selection import mutual_info_classif
from sklearn.feature_selection import chi2
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import GaussianNB

In [2]:
def calc_tss(y_true=None, y_predict=None):
    """
    Calculates the true skill score for binary classification based on the output of the confusion
    table function
    """
    scores = confusion_matrix(y_true, y_predict).ravel()
    TN, FP, FN, TP = scores
    print('TN={0}\tFP={1}\tFN={2}\tTP={3}'.format(TN, FP, FN, TP))
    tp_rate = TP / float(TP + FN) if TP > 0 else 0  
    fp_rate = FP / float(FP + TN) if FP > 0 else 0
    
    return tp_rate - fp_rate

As in the previous assignment, we will be utilizing a binary classification of our 5 class dataset. So, below is the helper function to change our class labels from the 5 class target feature to the binary target feature. The function is implemented to take a dataframe (e.g. our `abt`) and prepares it for a binary classification by merging the `X`- and `M`-class samples into one group, and the rest (`NF`, `B`, and `C`) into another group, labeled with `1`s and `0`s, respectively.

In [3]:
def dichotomize_X_y(data: pd.DataFrame):
    """
    dichotomizes the dataset and split it into the features (X) and the labels (y).
    
    :return: two np.ndarray objects X and y.
    """
    data_dich = data.copy()
    data_dich['lab'] = data_dich['lab'].map({'NF': 0, 'B': 0, 'C': 0, 'M': 1, 'X': 1})
    y = data_dich['lab'].copy()
    X = data_dich.copy().drop(['lab'], axis=1)
    return X.values, y.values

In [4]:
data_dir = 'data/FDS'
data_file = "normalized_partition1ExtractedFeatures.csv"
data_file2 = "normalized_partition2ExtractedFeatures.csv"

In [5]:
abt = pd.read_csv(os.path.join(data_dir, data_file).replace('\\', '/'))
abt2 = pd.read_csv(os.path.join(data_dir, data_file2).replace('\\', '/'))

In [6]:
abt

Unnamed: 0.1,lab,TOTUSJH_var,TOTUSJH_difference_of_vars,TOTBSQ_min,TOTBSQ_max,TOTBSQ_median,TOTBSQ_mean,TOTBSQ_var,TOTBSQ_difference_of_mins,TOTBSQ_difference_of_maxs,...,TOTUSJZ_slope_of_longest_mono_decrease,TOTUSJZ_gderivative_stddev,MEANPOT_max,MEANPOT_gderivative_mean,TOTFX_stddev,SAVNCPP_slope_of_longest_mono_decrease,TOTPOT_avg_mono_decrease_slope,USFLUX_stddev,TOTBSQ_dderivative_stddev,Unnamed: 0
0,NF,0.703435,0.691661,0.878752,0.886776,0.880269,0.880795,0.838756,0.731461,0.849843,...,0.988157,0.019057,0.000029,0.322544,0.049426,0.932794,0.999939,0.051082,0.017805,0.000000
1,NF,0.536687,0.369924,0.827467,0.837925,0.832165,0.832613,0.805319,0.805999,0.814957,...,0.974288,0.002260,0.000028,0.322542,0.014624,0.987151,0.999982,0.013241,0.003582,0.000011
2,NF,0.593047,0.551995,0.831203,0.843844,0.835858,0.837174,0.819294,0.806446,0.828374,...,0.974871,0.007661,0.000025,0.322542,0.030668,0.949625,0.999962,0.027575,0.006493,0.000023
3,NF,0.646995,0.467533,0.924507,0.925464,0.924840,0.924978,0.843487,0.825663,0.754150,...,0.999076,0.010761,0.000087,0.322539,0.055006,0.938782,0.999863,0.052563,0.018347,0.000034
4,NF,0.508972,0.470260,0.863690,0.867887,0.866987,0.866551,0.803660,0.806323,0.791700,...,0.997272,0.004687,0.000122,0.322533,0.014136,0.985678,0.999949,0.014227,0.005817,0.000045
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
73487,NF,0.462685,0.411532,0.753679,0.792289,0.777925,0.780560,0.774680,0.789393,0.793877,...,0.999178,0.003487,0.000029,0.322549,0.010079,0.998463,0.999983,0.008226,0.005827,0.829836
73488,NF,0.606655,0.637593,0.809834,0.846074,0.843249,0.839888,0.841479,0.867723,0.790040,...,0.981707,0.005644,0.000066,0.322560,0.016209,0.924634,0.999961,0.022735,0.007617,0.829848
73489,C,0.711481,0.737709,0.939336,0.943771,0.941188,0.941403,0.905944,0.895820,0.900880,...,0.910137,0.021655,0.000111,0.322558,0.176565,0.999444,0.999772,0.201647,0.030702,0.829859
73490,B,0.732800,0.611349,0.926691,0.929611,0.928393,0.927970,0.879584,0.886586,0.874680,...,0.962007,0.012777,0.000035,0.322546,0.051507,0.915419,0.999905,0.110994,0.028382,0.829870


---
### Q1 (10 points)

Just like you did with the previous assignment, you will be utilizing a few different types of feature selection to find subsets of descriptive features to use in the models we will be evaluating.  For this question you will again be utilizing the [SelectKBest](https://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.SelectKBest.html#sklearn.feature_selection.SelectKBest) class from [scikit-learn Univariate Feature Selection](https://scikit-learn.org/stable/modules/feature_selection.html#univariate-feature-selection). You will then be using 3 diferent feature evaluation functions.

-  [scikit-learn f_classif](https://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.f_classif.html#sklearn.feature_selection.f_classif)

- [scikit-learn mutual_info_classif](https://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.mutual_info_classif.html#sklearn.feature_selection.mutual_info_classif)

- [chi2](https://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.chi2.html#sklearn.feature_selection.chi2)

For each of these combinations of evaluation functions, you need to construct a 20 feature training and testing dataset. This will be done by:
<ol>
    <li>Use the `SelectKBest` class with each of the evaluation functions to perform feature selection using Partition 1 as your input data</li>
    <li>Construct a new train `DataFrame` for each instance of `SelectKBest` from 1 with the `lab` class labels using Partition 1</li>
    <li>Construct a new test `DataFrame` for each instance of `SelectKBest` from 1 with the `lab` class labels using Partition 2</li>
</ol>

After this question, you should have a total of 6 `DataFrame`s to use in later questions, a train and test pair for each feature selection method.

---

In [7]:
numFeat = 20
abt_cpy = abt.copy()
abt2_cpy = abt2.copy()
y1 = abt_cpy['lab']
y2 = abt2_cpy['lab']
x1 = abt_cpy.drop(['lab'], axis=1)
x2 = abt2_cpy.drop(['lab'], axis=1)
labels = pd.DataFrame(y1, columns = ['lab'])

In [8]:
#f_classif
selector = SelectKBest(k= numFeat)
selector.fit_transform(x1.values, y1.values)
col_index = selector.get_support(indices = True)
columns = x1.columns[col_index]
f_classif_abt = abt_cpy[columns.values].join(y1, rsuffix = " ")
f_classif_abt_2 = abt2_cpy[columns.values].join(y2, rsuffix = " ")
#m_classif
selector = SelectKBest(mutual_info_classif, k= numFeat)
selector.fit_transform(x1.values, y1.values)
col_indices = selector.get_support(indices = True)
columns = x1.columns[col_indices]
m_classif_abt = abt_cpy[columns.values].join(y1, rsuffix = " ")
mi_classif_abt2 = abt2_cpy[columns.values].join(y2, rsuffix = " ")
#chi2
selector = SelectKBest(chi2, k= numFeat)
selector.fit_transform(x1.values, y1.values)
col_indices = selector.get_support(indices = True)
columns = x1.columns[col_indices]
chi2_abt = abt_cpy[columns.values].join(y1, rsuffix = " ")
chi2_abt2 = abt_cpy[columns.values].join(y2, rsuffix = " ")

---
### Q2 (10 points)

Now that we have our training and testing datasets for each of our feature subsets, we need to attempt to perform hyperparameter tuning on our model for each of the datasets. We want to see which combination of dataset and parameter settings seem to provide the best results. 

In order to do this, we must first dichotomize the training and testing data. Lucky for you, a method has already been provided to do this. All you need to do is apply it to teach of the `DataFrame`s you constructed in Q1.  

With your binary classification dataset constructed, now it's time to start training and testing some models. We will start with the simple [KNeighborsClassifer](https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html), and try several different settings to see how/if using different settings will improve our score. So, for each of your three copies of the Partition 1 training datasets that have had their `lab` columns converted to a binary label, train 4 different instances with the following settings. **(see documentation to know what these are)** In total you will train and evaluate 12 model setting and feature selected data pairings. 

|Model Number| n_neighbors | p |
|------------|-------------|---|
|1|3|1|
|2|3|2|
|3|5|1|
|4|5|2|


Once you have done that, test each of your models using your binary classification copy of the Partition 2 testing dataset that was cunstructed with the same features the model was trained on. You shall then calculate and print the TSS score for each result. **NOTE: The model does take a little while to evaluate.**

---

In [9]:
n_neighbors = [3, 5]
p = [1,2]
temp = [n_neighbors, p]
params = list(itertools.product(*temp))

# f_classif Scores

In [10]:
train1, train2 = dichotomize_X_y(f_classif_abt)
test1, test2 = dichotomize_X_y(f_classif_abt_2)
for neigh, ps in params:
    classifier = KNeighborsClassifier(n_neighbors=neigh, p=ps)
    classifier.fit(train1, train2)
    y_pred = classifier.predict(test1)
    score = calc_tss(test2, y_pred)
    print(score) 

TN=86045	FP=1111	FN=1038	TP=363
0.24635338460765865
TN=85973	FP=1183	FN=1043	TP=358
0.24195840032045718
TN=86182	FP=974	FN=1044	TP=357
0.24364262343639792
TN=86094	FP=1062	FN=1057	TP=344
0.23335385328412084


# mutual_info_classif Scores

In [11]:
train2, train3 = dichotomize_X_y(m_classif_abt)
test2, test3 = dichotomize_X_y(mi_classif_abt2)
for neigh, ps in params:
    classifier = KNeighborsClassifier(n_neighbors=neigh, p=ps)
    classifier.fit(train2, train3)
    y_pred = classifier.predict(test2)
    score = calc_tss(test3, y_pred)
    print(score)

TN=86081	FP=1075	FN=1123	TP=278
0.18609548774340784
TN=86035	FP=1121	FN=1117	TP=284
0.18985035373820336
TN=86253	FP=903	FN=1127	TP=274
0.18521385709918065
TN=86241	FP=915	FN=1127	TP=274
0.1850761729466266


# chi2 Scores

In [12]:
train3, train4 = dichotomize_X_y(chi2_abt)
test3, test4 = dichotomize_X_y(chi2_abt2)
for neigh, ps in params:
    classifier = KNeighborsClassifier(n_neighbors=neigh, p=ps)
    classifier.fit(train3, train4)
    y_pred = classifier.predict(test3)
    score = calc_tss(test4, y_pred)
    print(score)

TN=71113	FP=1218	FN=1128	TP=33
0.01158451973069399
TN=71113	FP=1218	FN=1128	TP=33
0.01158451973069399
TN=71124	FP=1207	FN=1129	TP=32
0.010875271926453621
TN=71131	FP=1200	FN=1129	TP=32
0.010972049241850895


---
### Q3 (10 points)

After evaluating the various results from Q2, you will notice that the results are not all that great with greater than 1000 false negatives for nearly all of our settings tried. But, what can be done to improve our results? If you read the documentation for the [KNeighborsClassifer](https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html), which you certainly should have, you will see that we were only using the `MinkowskiDistance` metric with different values of `p`. If you look into the [DistanceMetric](https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.DistanceMetric.html#sklearn.neighbors.DistanceMetric) documentation for the neighbors classifiers, you will see there are several others available to use.

So, for this question, train and evaluate two more instances of [KNeighborsClassifer](https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html) for each of our different feature selection train test datsets, but this time using the `ChebyshevDistance` metric instead of the `MinkowskiDistance` metric.  For these models you will only be changing the number neighbors to 3 and 5, as the values of `p` are not used for the `ChebyshevDistance` metric. 

---

In [13]:
n_neighbors = [3, 5]
temp = [n_neighbors]
params = list(itertools.product(*temp))

# f_classif Scores

In [14]:
#----------------------------------------------
xtrain, ytrain = dichotomize_X_y(f_classif_abt)
xtest, ytest = dichotomize_X_y(f_classif_abt_2)
for neighbor in params:
    classifier = KNeighborsClassifier(n_neighbors=neighbor[0], metric="chebyshev")
    classifier.fit(xtrain, ytrain)
    y_pred = classifier.predict(xtest)
    score = calc_tss(ytest, y_pred)
    print(score)
#----------------------------------------------

TN=85934	FP=1222	FN=1056	TP=345
0.23223184045777573
TN=86062	FP=1094	FN=1068	TP=333
0.22513516092584682


# mutual_info_classif Scores

In [15]:
xtrain, ytrain = dichotomize_X_y(m_classif_abt)
xtest, ytest = dichotomize_X_y(mi_classif_abt2)
for neighbor in params:
    classifier = KNeighborsClassifier(n_neighbors=neighbor[0], metric="chebyshev")
    classifier.fit(xtrain, ytrain)
    y_pred = classifier.predict(xtest)
    score = calc_tss(ytest, y_pred)
    print(score)

TN=85990	FP=1166	FN=1144	TP=257
0.17006208955798865
TN=86181	FP=975	FN=1171	TP=230
0.15298161371133678


# chi2 Scores

In [16]:
xtrain, ytrain = dichotomize_X_y(chi2_abt)
xtest, ytest = dichotomize_X_y(chi2_abt2)
for neighbor in params:
    classifier = KNeighborsClassifier(n_neighbors=neighbor[0], metric="chebyshev")
    classifier.fit(xtrain, ytrain)
    y_pred = classifier.predict(xtest)
    score = calc_tss(ytest, y_pred)
    print(score)

TN=71123	FP=1208	FN=1130	TP=31
0.010000120152960791
TN=71161	FP=1170	FN=1130	TP=31
0.010525482722260261


---
### Q4 (10 points)

After evaluating the results from Q3, you will see that the results are no better than those we found for Q2. This leads to the thought that maybe the [KNeighborsClassifer](https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html) is just not a good fit for the problem we are applying it to. So, let's move on to another classifier for this problem. 

In this question, you will utilize the [DecisionTreeClassifier](https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html), and try several different settings to see how/if using different settings will improve our score. So, continuing to use our training/testing pairs constructed with different feature selection methods that have had their `lab` column converted to a binary label, train 8 different instances with the following settings. **(see documentation to know what these are)**

|Model Number| criterion | max_depth | splitter |
|------------|---------|-------------|---|
|1|gini|5|best|
|2|gini|5|random|
|3|gini|None|best|
|4|gini|None|random|
|5|entropy|5|best|
|6|entropy|5|random|
|7|entropy|None|best|
|8|entropy|None|random|



Once you have done that, test each of your models using your binary classification copy of the Partition 2 testing dataset that was cunstructed with the same features the model was trained on. You shall then calculate and print the TSS score for each result.

---

In [17]:
criterion = ['gini', 'entropy']
depth = [5, None]
splitter = ['best', 'random']
temp = [criterion, depth, splitter]
params = list(itertools.product(*temp))

# f_classif Scores

In [18]:
#----------------------------------------------
xtrain, ytrain = dichotomize_X_y(f_classif_abt)
xtest, ytest = dichotomize_X_y(f_classif_abt_2)
for criteria, deep, split in params:
    classifier = DecisionTreeClassifier(criterion = criteria, max_depth = deep, splitter = split)
    classifier.fit(xtrain, ytrain)
    y_pred = classifier.predict(xtest)
    score = calc_tss(ytest, y_pred)
    print(score)
#----------------------------------------------

TN=86976	FP=180	FN=1331	TP=70
0.04789904891797061
TN=86827	FP=329	FN=1183	TP=218
0.15182830009799061
TN=85844	FP=1312	FN=1073	TP=328
0.21906501944923784
TN=85646	FP=1510	FN=1036	TP=365
0.24320293828398765
TN=87002	FP=154	FN=1335	TP=66
0.04534226108433592
TN=86833	FP=323	FN=1088	TP=313
0.21970585023993502
TN=86131	FP=1025	FN=1155	TP=246
0.16382834373236874
TN=85757	FP=1399	FN=1041	TP=360
0.2409076373232353


# mutual_info_classif Scores

In [19]:
xtrain, ytrain = dichotomize_X_y(m_classif_abt)
xtest, ytest = dichotomize_X_y(mi_classif_abt2)
for criteria, deep, split in params:
    classifier = DecisionTreeClassifier(criterion = criteria, max_depth = deep, splitter = split)
    classifier.fit(xtrain, ytrain)
    y_pred = classifier.predict(xtest)
    score = calc_tss(ytest, y_pred)
    print(score)

TN=86847	FP=309	FN=1313	TP=88
0.0592669100167727
TN=87028	FP=128	FN=1339	TP=62
0.04278547325070122
TN=85652	FP=1504	FN=1028	TP=373
0.24898198735526825
TN=85417	FP=1739	FN=1038	TP=363
0.2391479139573305
TN=86978	FP=178	FN=1356	TP=45
0.030077599417343465
TN=86993	FP=163	FN=1340	TP=61
0.041670118598043156
TN=85824	FP=1332	FN=1108	TP=293
0.19385339025850715
TN=85706	FP=1450	FN=1121	TP=280
0.18322040972484493


# chi2 Scores

In [20]:
xtrain, ytrain = dichotomize_X_y(chi2_abt)
xtest, ytest = dichotomize_X_y(chi2_abt2)
for criteria, deep, split in params:
    classifier = DecisionTreeClassifier(criterion = criteria, max_depth = deep, splitter = split)
    classifier.fit(xtrain, ytrain)
    y_pred = classifier.predict(xtest)
    score = calc_tss(ytest, y_pred)
    print(score)

TN=71471	FP=860	FN=1139	TP=22
0.007059397276786134
TN=71680	FP=651	FN=1145	TP=16
0.004780932751602473
TN=71110	FP=1221	FN=1128	TP=33
0.011543043738380873
TN=71110	FP=1221	FN=1128	TP=33
0.011543043738380873
TN=71812	FP=519	FN=1154	TP=7
-0.0011460615711165424
TN=71986	FP=345	FN=1151	TP=10
0.0038435253112095655
TN=71110	FP=1221	FN=1128	TP=33
0.011543043738380873
TN=71110	FP=1221	FN=1128	TP=33
0.011543043738380873


---
### Q5 (10 points)

After evaluating results from Q4, you will see that the [DecisionTreeClassifier](https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html) was able to accomplish a bit of an improvement over the best resutls we found for the [KNeighborsClassifer](https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html).  This is indeed great, but can we do better than this if we use yet another classifier? Let's move on to yet another and find out.

For this question you will be utilizing the [GaussianNB](https://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.GaussianNB.html) classifier. We won't be changing any of the default settings, just train 1 model for each of our feature selected data subsets. You will again be using your training/testing pairs constructed with different feature selection methods that have had their `lab` column converted to a binary label. You will then test each of your models using your binary classification copy of the Partition 2 testing dataset that was cunstructed with the same features the model was trained on. You shall then calculate and print the TSS score for each result.

---

# f_classif Scores

In [21]:
#----------------------------------------------
xtrain, ytrain = dichotomize_X_y(f_classif_abt)
xtest, ytest = dichotomize_X_y(f_classif_abt_2)
classifier = GaussianNB()
classifier.fit(xtrain, ytrain)
y_pred = classifier.predict(xtest)
score = calc_tss(ytest, y_pred)
score
#----------------------------------------------

TN=78609	FP=8547	FN=113	TP=1288


0.8212777885389588

# mutual_info_classif Scores

In [22]:
xtrain, ytrain = dichotomize_X_y(m_classif_abt)
xtest, ytest = dichotomize_X_y(mi_classif_abt2)
classifier = GaussianNB()
classifier.fit(xtrain, ytrain)
y_pred = classifier.predict(xtest)
score = calc_tss(ytest, y_pred)
score

TN=76146	FP=11010	FN=176	TP=1225


0.7480502361415888

# chi2 Scores

In [23]:
xtrain, ytrain = dichotomize_X_y(chi2_abt)
xtest, ytest = dichotomize_X_y(chi2_abt2)
classifier = GaussianNB()
classifier.fit(xtrain, ytrain)
y_pred = classifier.predict(xtest)
score = calc_tss(ytest, y_pred)
score

TN=62761	FP=9570	FN=999	TP=162


0.00722646824208989

---
### Q6 (10 points)

If you recall from a lecture some time back, it was shown that another way of improving the results of classification is to perform some form of sampling to balance the number of samples there are for the various classes. The reason why this works for specific classifiers, and methods for doing the sampling, are numerious and we don't have enough time to cover all of them in this course.  However, it is still beneficial to know this works and that it is something that you should be considering when you are training models.  

So, for this question, we will implement a very naive method for sampling so we can use the results for training our models again.  Below you will find a function stub, complete the function and have it return a copy of the input dataframe where each class (except for the smallest one) have been undersampled to match the size of the smallest class in the dataset. In this function you should assume the `lab` column is the class label and not the dicotomized binary classification converted label.

To do this you may want to use the [groupby](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.groupby.html) function of the DataFrame to get groups of rows from your DataFrame.  You may also wish to use the [sample](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.sample.html) function to select a number of rows from a group. You can also use the [apply](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.apply.html) method to process each group from your grouped rows. These are just hints, you can solve the problem how you see fit.

Once this function is complete, apply it to each of your training datasets that have been constructed with different feature selection methods from partition 1 (the ones with all the NF, C, .., X labels). You will not be applying this to your testing sets. After you have your sampled feature selected datasets, you will then apply your function that converts the multi-class problem to a binary problem to each of the resultant selected subsets so we can use these new undersampled data for the next several questions.

---

In [24]:
def perform_under_sample(data:DataFrame)->DataFrame:
    #----------------------------------------------
    output = pd.DataFrame()
    data.groupby('lab')
    labels_set = list(set(data['lab'].tolist()))
    values = dict(data['lab'].value_counts())
    target = values[min(values, key=values.get)]
    for label in labels_set:
        data1 = data[data['lab']==label]
        data1 = data1.sample(n = target)
        output = pd.concat([output, data1])
    return output
    #----------------------------------------------

In [25]:
    #----------------------------------------------
    training_sets = [f_classif_abt, m_classif_abt, chi2_abt]
    f_classif_train, m_classif_train, chi2_train = perform_under_sample(f_classif_abt), perform_under_sample(m_classif_abt), perform_under_sample(chi2_abt)
    #----------------------------------------------

---
### Q7

For this question repeat what you did for Q2, but with your balanced binary classification datasets constructed in Q6, uese the [KNeighborsClassifer](https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html), and try several different settings to see how/if using different settings will improve our score. 

So, train 4 different instances with the following settings for each of your feature selected subsets, for a total of 12 different evaluations. **(see documentation to know what these are)**

|Model Number| n_neighbors | p |
|------------|-------------|---|
|1|3|1|
|2|3|2|
|3|5|1|
|4|5|2|


Once you have done that, test each of your models using your binary classification copy of Partition 2 testing dataset that was cunstructed with the same features the model was trained on (these should not have been balanced). You shall then calculate and print the TSS score for each result. **NOTE: The model now takes less time to evaluate!**

---

In [26]:
n_neighbors = [3, 5]
p = [1,2]
temp = [n_neighbors, p]
params = list(itertools.product(*temp))

# f_classif Scores

In [27]:
#----------------------------------------------
train1, train2 = dichotomize_X_y(f_classif_train)
test1, test2 = dichotomize_X_y(f_classif_abt_2)
for neigh, ps in params:
    classifier = KNeighborsClassifier(n_neighbors=neigh, p=ps)
    classifier.fit(train1, train2)
    y_pred = classifier.predict(test1)
    score = calc_tss(test2, y_pred)
    print(score)
#----------------------------------------------

TN=82433	FP=4723	FN=428	TP=973
0.6403137380579144
TN=82308	FP=4848	FN=424	TP=977
0.6417346316329783
TN=82750	FP=4406	FN=387	TP=1014
0.6732157052706104
TN=82579	FP=4577	FN=391	TP=1010
0.6683986025992134


# mutual_info_classif Scores

In [28]:
train2, train3 = dichotomize_X_y(m_classif_train)
test2, test3 = dichotomize_X_y(mi_classif_abt2)
for neigh, ps in params:
    classifier = KNeighborsClassifier(n_neighbors=neigh, p=ps)
    classifier.fit(train2, train3)
    y_pred = classifier.predict(test2)
    score = calc_tss(test3, y_pred)
    print(score)

TN=81353	FP=5803	FN=531	TP=870
0.5544032492673798
TN=81150	FP=6006	FN=571	TP=830
0.5235230573783227
TN=81476	FP=5680	FN=457	TP=944
0.6086339265348417
TN=81280	FP=5876	FN=489	TP=912
0.5835442573964448


# chi2 Scores

In [29]:
train3, train4 = dichotomize_X_y(chi2_abt)
test3, test4 = dichotomize_X_y(chi2_abt2)
for neigh, ps in params:
    classifier = KNeighborsClassifier(n_neighbors=neigh, p=ps)
    classifier.fit(train3, train4)
    y_pred = classifier.predict(test3)
    score = calc_tss(test4, y_pred)
    print(score)

TN=71113	FP=1218	FN=1128	TP=33
0.01158451973069399
TN=71113	FP=1218	FN=1128	TP=33
0.01158451973069399
TN=71124	FP=1207	FN=1129	TP=32
0.010875271926453621
TN=71131	FP=1200	FN=1129	TP=32
0.010972049241850895


---
### Q8

After evaluating the various results from Q7, you will notice that some of the results are improved over the same experiments we conducted in Q2. Additionally, you should also notice a improvement in the speed at which the results were obtained. The question now is will we continue to see these improvements for all of our experiments? So, let's move on and see.

For this question, you will repeat the experiments from Q3, but using the balanced binary classification datasets constructed in Q6. You will still be using the [KNeighborsClassifer](https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html) like you did in Q7, but you will again be changing from using the `MinkowskiDistance` metric with different values of `p` to using the `ChebyshevDistance` metric. You will construct two models for each of your feature selected datasets by changing the number neighbors to 3 and 5.

Once you have done that, test each of your models using your binary classification copy of Partition 2 testing dataset that was cunstructed with the same features the model was trained on (these should not have been balanced), then calculate and print the TSS score for each result. 

---

In [30]:
n_neighbors = [3, 5]
temp = [n_neighbors]
params = list(itertools.product(*temp))

# f_classif Scores

In [31]:
#----------------------------------------------
xtrain, ytrain = dichotomize_X_y(f_classif_train)
xtest, ytest = dichotomize_X_y(f_classif_abt_2)
for neighbor in params:
    classifier = KNeighborsClassifier(n_neighbors=neighbor[0], metric="chebyshev")
    classifier.fit(xtrain, ytrain)
    y_pred = classifier.predict(xtest)
    score = calc_tss(ytest, y_pred)
    print(score)
#----------------------------------------------

TN=82114	FP=5042	FN=438	TP=963
0.6295158755920983
TN=82473	FP=4683	FN=374	TP=1027
0.6793165824493687


# mutual_info_classif Scores

In [32]:
xtrain, ytrain = dichotomize_X_y(m_classif_train)
xtest, ytest = dichotomize_X_y(mi_classif_abt2)
for neighbor in params:
    classifier = KNeighborsClassifier(n_neighbors=neighbor[0], metric="chebyshev")
    classifier.fit(xtrain, ytrain)
    y_pred = classifier.predict(xtest)
    score = calc_tss(ytest, y_pred)
    print(score)

TN=81310	FP=5846	FN=575	TP=826
0.5225037425815415
TN=81110	FP=6046	FN=491	TP=910
0.5801661801531782


# chi2 Scores

In [33]:
xtrain, ytrain = dichotomize_X_y(chi2_train)
xtest, ytest = dichotomize_X_y(chi2_abt2)
for neighbor in params:
    classifier = KNeighborsClassifier(n_neighbors=neighbor[0], metric="chebyshev")
    classifier.fit(xtrain, ytrain)
    y_pred = classifier.predict(xtest)
    score = calc_tss(ytest, y_pred)
    print(score)

TN=68023	FP=4308	FN=1077	TP=84
0.01279189622699578
TN=67791	FP=4540	FN=1070	TP=91
0.015613704587167349


---
### Q9

After evaluating the results of Q8 things are looking a little less encouraging, since none of those results look to be better than the results of Q7. However, the results from Q3 weren't really any better than Q2 in the first place, so not all is lost.  Let's continue on and see how things turn out with models like we used in Q4 since those were actaully an improvement over Q2 originally.

So, in this question, you will utilize the [DecisionTreeClassifier](https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html), like you did in Q4, and try several different settings to see how/if using different settings will improve our score. The difference will again be that you are now using the balanced binary classification datasets constructed in Q6 to train 8 different instances for each of your feature selected datasets using the following settings. **(see documentation to know what these are)**

|Model Number| criterion | max_depth | splitter |
|------------|---------|-------------|---|
|1|gini|5|best|
|2|gini|5|random|
|3|gini|None|best|
|4|gini|None|random|
|5|entropy|5|best|
|6|entropy|5|random|
|7|entropy|None|best|
|8|entropy|None|random|



Once you have done that, test each of your models using your binary classification copy of copy of Partition 2 testing dataset that was cunstructed with the same features the model was trained on (this should not have been balanced), then calculate and print the TSS score for each result. 

---

In [34]:
criterion = ['gini', 'entropy']
depth = [5, None]
splitter = ['best', 'random']
temp = [criterion, depth, splitter]
params = list(itertools.product(*temp))

# f_classif Scores

In [35]:
#----------------------------------------------
xtrain, ytrain = dichotomize_X_y(f_classif_train)
xtest, ytest = dichotomize_X_y(f_classif_abt_2)
for criteria, deep, split in params:
    classifier = DecisionTreeClassifier(criterion = criteria, max_depth = deep, splitter = split)
    classifier.fit(xtrain, ytrain)
    y_pred = classifier.predict(xtest)
    score = calc_tss(ytest, y_pred)
    print(score)
#----------------------------------------------

TN=82502	FP=4654	FN=468	TP=933
0.6125543869600824
TN=80443	FP=6713	FN=201	TP=1200
0.779508239575929
TN=82448	FP=4708	FN=554	TP=847
0.5505500830773008
TN=82311	FP=4845	FN=562	TP=839
0.5432679820073053
TN=82666	FP=4490	FN=325	TP=1076
0.7165060204140097
TN=82644	FP=4512	FN=344	TP=1057
0.702691857854527
TN=82227	FP=4929	FN=525	TP=876
0.5687139002913184
TN=81913	FP=5243	FN=620	TP=781
0.497302456900487


# mutual_info_classif Scores

In [36]:
xtrain, ytrain = dichotomize_X_y(m_classif_train)
xtest, ytest = dichotomize_X_y(mi_classif_abt2)
for criteria, deep, split in params:
    classifier = DecisionTreeClassifier(criterion = criteria, max_depth = deep, splitter = split)
    classifier.fit(xtrain, ytrain)
    y_pred = classifier.predict(xtest)
    score = calc_tss(ytest, y_pred)
    print(score)

TN=79726	FP=7430	FN=484	TP=917
0.5692830390125737
TN=77417	FP=9739	FN=336	TP=1065
0.6484291427328662
TN=80629	FP=6527	FN=573	TP=828
0.5161177186728506
TN=79835	FP=7321	FN=642	TP=759
0.4577570819136191
TN=81426	FP=5730	FN=532	TP=869
0.554527051987708
TN=80669	FP=6487	FN=502	TP=899
0.5672547529286874
TN=81537	FP=5619	FN=687	TP=714
0.44516536987063876
TN=81316	FP=5840	FN=678	TP=723
0.4490536695971476


# chi2 Scores

In [37]:
xtrain, ytrain = dichotomize_X_y(chi2_train)
xtest, ytest = dichotomize_X_y(chi2_abt2)
for criteria, deep, split in params:
    classifier = DecisionTreeClassifier(criterion = criteria, max_depth = deep, splitter = split)
    classifier.fit(xtrain, ytrain)
    y_pred = classifier.predict(xtest)
    score = calc_tss(ytest, y_pred)
    print(score)

TN=67804	FP=4527	FN=1067	TP=94
0.018377413215356228
TN=69025	FP=3306	FN=1101	TP=60
0.005973043034253556
TN=68194	FP=4137	FN=1085	TP=76
0.008265416247069071
TN=68380	FP=3951	FN=1096	TP=65
0.0013623369005425628
TN=69243	FP=3088	FN=1092	TP=69
0.016738903126836117
TN=68563	FP=3768	FN=1085	TP=76
0.013366963301582352
TN=68418	FP=3913	FN=1089	TP=72
0.007916984568894572
TN=67911	FP=4420	FN=1080	TP=81
0.008659479852474075


---
### Q10

Unlike with [KNeighborsClassifer](https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html), it seems that the sampling didn't really help much for the [DecisionTreeClassifier](https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html).  Where before we saw a 3X improvement with the Decision Tree over the KNN classifier, we now see similar results for both classifiers.  Let's see how switching to the sampled data affectes our best performing classifier when we were using the full dataset.

For this question you will again be utilizing the [GaussianNB](https://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.GaussianNB.html) classifier as you did in Q5 but using your balanced binary classification dataset constructed in Q6 to train just 1 model for each feature selected dataset. Once you have done that, test the model using your binary classification copy of Partition 2 testing dataset that was cunstructed with the same features the model was trained on (this should not have been balanced), then calculate and print the TSS score. 

---

# f_classif Scores

In [38]:
#----------------------------------------------
xtrain, ytrain = dichotomize_X_y(f_classif_train)
xtest, ytest = dichotomize_X_y(f_classif_abt_2)
classifier = GaussianNB()
classifier.fit(xtrain, ytrain)
y_pred = classifier.predict(xtest)
score = calc_tss(ytest, y_pred)
score
#----------------------------------------------

TN=83160	FP=3996	FN=348	TP=1053


0.7057571729168491

# mutual_info_classif Scores

In [39]:
xtrain, ytrain = dichotomize_X_y(m_classif_train)
xtest, ytest = dichotomize_X_y(mi_classif_abt2)
classifier = GaussianNB()
classifier.fit(xtrain, ytrain)
y_pred = classifier.predict(xtest)
score = calc_tss(ytest, y_pred)
score

TN=81410	FP=5746	FN=378	TP=1023


0.6642649577714547

# chi2 Scores

In [40]:
xtrain, ytrain = dichotomize_X_y(chi2_abt)
xtest, ytest = dichotomize_X_y(chi2_abt2)
classifier = GaussianNB()
classifier.fit(xtrain, ytrain)
y_pred = classifier.predict(xtest)
score = calc_tss(ytest, y_pred)
score

TN=62761	FP=9570	FN=999	TP=162


0.00722646824208989

Unfortunately, we don't see much improvement for our [GaussianNB](https://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.GaussianNB.html) classifier. 

**Note:The TA would like you to turn in assignments that have been run and have results, so make sure to do a restart and run all from the kernel menu. Then make sure to save before you turn it in. You might find it necessary to use the toy dataset if you have time constraints.**