# **Classification Metrics: Tools (Aside from Accuracy) for Assessing ML Models**




In [None]:
# Import the following packages
import pandas as pd
import numpy as np

# Numpy random seed for consistency
np.set_printoptions(precision=4, suppress=True)
np.random.seed(123) #use this random "seed" so that we can all get the same synthetic data!

# To model normal distribution
from scipy.stats import norm

# To make data
from sklearn.datasets import make_blobs

## Let's begin by making synthetic data with 2 features (to be used for classification as a 1 or 0)

In [None]:
#Make data set with 3000 observations
n = 3000

In [None]:
centers = [[9.5, 0], [10.5, 0]] # Define the coordinates to center our blobs (x,y)
X, y = make_blobs(n_samples=n, centers=centers, cluster_std=0.4, random_state=7)
data = pd.DataFrame(X, columns=['feature1','feature2']) # Rename the feature columns (like x and y; coordinates to be used to classify points as 0 or 1)
data['target'] = y.astype('str') # Convert dtype to help w/ viz

data.head() #view the first few rows 

In [None]:
#View the shape of our synthetic data, and the frequencies of each class (Hint: value_counts())


As you can see, the class frequencies show a 50-50 split: half of our observations are 1's and half are 0's, so the classes are balanced.
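A minimal sketch of one way to check the shape and class balance. It rebuilds the data with the same `make_blobs` call so that it runs on its own; the name `class_counts` is only illustrative.

```python
import pandas as pd
from sklearn.datasets import make_blobs

# Rebuild the synthetic data exactly as in the cells above
centers = [[9.5, 0], [10.5, 0]]
X, y = make_blobs(n_samples=3000, centers=centers, cluster_std=0.4, random_state=7)
data = pd.DataFrame(X, columns=['feature1', 'feature2'])
data['target'] = y.astype('str')

# Shape is (rows, columns): 3000 observations, 2 features plus the target
print(data.shape)

# value_counts() tallies how many observations fall in each class
class_counts = data['target'].value_counts()
print(class_counts)
```

Because `make_blobs` divides `n_samples` evenly across the two centers, each class should hold 1500 observations.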

**Below is a pre-made classifier (common classifiers we have learned or may learn include logistic regression, Decision Trees, and K Nearest Neighbors). This classifier will make the predictions of 0's and 1's for our training and testing data.**

In [None]:
class BoundaryClassifier():
    def __init__(self):
        from scipy.stats import norm
        self.name = 'Classify observations on 1D boundary'
    
    def fit(self, X_train, y_train, x_boundary=None):
        self.boundary = x_boundary
        
    def predict(self, X_test):
        b = self.boundary
        x = X_test.feature1
        y_pred = (x > b).astype(int)  # boundary b is the threshold that decides whether an observation is a 0 or a 1 (np.int was removed from NumPy)
        return y_pred
    
    def predict_proba(self, X_test): #the predicted probability
        b = self.boundary
        x = X_test.feature1
        
        # Use the normal distribution to model probabilities
        y_pred_proba = ((x-b)/0.4).apply(norm.cdf)
        return y_pred_proba

**1. As learned, split your data into training and testing data**

**2. Use the classifier `BoundaryClassifier()` to fit the model to the data and predict the 0 and 1 classes. Hint: `clf.fit()` takes an extra input called `x_boundary`. Set `x_boundary=10`, the threshold that determines whether a point is a 0 or a 1 (chosen for this specific synthetic dataset).**

**3. Create a data frame to view the actual class, the predicted class (from the model), and the predicted probability (`y_pred_proba`) from `BoundaryClassifier()`.**
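The three steps above could be sketched as follows. This is one possible approach, not the only solution: the `test_size=0.25` and `random_state=123` values, the `results` name, and the `actual` column label are assumptions, the target is kept as integers here (rather than strings) so it can be compared with the predictions, and the classifier is restated compactly so the block runs on its own.

```python
import pandas as pd
from scipy.stats import norm
from sklearn.datasets import make_blobs
from sklearn.model_selection import train_test_split

# Rebuild the synthetic data so this sketch is self-contained
centers = [[9.5, 0], [10.5, 0]]
X, y = make_blobs(n_samples=3000, centers=centers, cluster_std=0.4, random_state=7)
data = pd.DataFrame(X, columns=['feature1', 'feature2'])
data['target'] = y  # assumption: integer labels instead of strings

class BoundaryClassifier():
    # Compact restatement of the classifier defined above
    def fit(self, X_train, y_train, x_boundary=None):
        self.boundary = x_boundary

    def predict(self, X_test):
        return (X_test.feature1 > self.boundary).astype(int)

    def predict_proba(self, X_test):
        return ((X_test.feature1 - self.boundary) / 0.4).apply(norm.cdf)

# 1. Split into training and testing data (split sizes are an assumption)
X_train, X_test, y_train, y_test = train_test_split(
    data[['feature1', 'feature2']], data['target'],
    test_size=0.25, random_state=123)

# 2. Fit with the extra x_boundary input, then predict classes and probabilities
clf = BoundaryClassifier()
clf.fit(X_train, y_train, x_boundary=10)
y_pred = clf.predict(X_test)
y_pred_proba = clf.predict_proba(X_test)

# 3. Collect actual class, predicted class, and predicted probability
results = pd.DataFrame({'actual': y_test,
                        'y_pred': y_pred,
                        'y_pred_proba': y_pred_proba})
print(results.head())
```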

## **Classification Metrics**

1. Compute the accuracy of the model

In [None]:
##ACCURACY SCORE
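Accuracy is the fraction of predictions that match the actual class. A small sketch on made-up labels (the arrays `y_true_toy` and `y_pred_toy` are hypothetical, not the notebook's data) shows that `accuracy_score` from scikit-learn agrees with the hand computation:

```python
import numpy as np
from sklearn.metrics import accuracy_score

# Hypothetical labels, only for illustration
y_true_toy = np.array([0, 0, 1, 1, 1, 0, 1, 0])
y_pred_toy = np.array([0, 1, 1, 1, 0, 0, 1, 0])

# Accuracy by hand: share of positions where prediction equals truth
manual_accuracy = np.mean(y_true_toy == y_pred_toy)

# Same quantity from scikit-learn
sk_accuracy = accuracy_score(y_true_toy, y_pred_toy)

print(manual_accuracy, sk_accuracy)  # 6 of 8 correct -> 0.75 0.75
```

For the notebook's own data, the same call would take the test labels and predicted classes, provided both use the same label type.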


2. Create a confusion matrix to tabulate the true positives, true negatives, false positives, and false negatives

In [None]:
##CONFUSION MATRIX
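As a hedged illustration of how scikit-learn lays out the matrix, the sketch below uses hypothetical toy labels (not the notebook's data). For binary labels 0 and 1, rows are the actual classes and columns are the predicted classes, so `ravel()` yields the counts in the order tn, fp, fn, tp.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Hypothetical labels, only for illustration
y_true_toy = np.array([0, 0, 1, 1, 1, 0, 1, 0])
y_pred_toy = np.array([0, 1, 1, 1, 0, 0, 1, 0])

# Rows: actual 0, actual 1. Columns: predicted 0, predicted 1.
cm_toy = confusion_matrix(y_true_toy, y_pred_toy)
print(cm_toy)

# Flatten row by row to get the four counts
tn, fp, fn, tp = cm_toy.ravel()
print(tn, fp, fn, tp)  # 3 1 1 3
```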


In [None]:
# Code to turn the outputted matrix into a dataframe
from sklearn.metrics import confusion_matrix
def custom_confusion_matrix(y_test_, y_pred_proba_, alpha=0.5, output='dataframe'):
    """
    Usage:
        cm = custom_confusion_matrix(y_test, y_pred_proba, output = 'dataframe')
        tn, fp, fn, tp = custom_confusion_matrix(y_test, y_pred_proba, output = 'rates')

    Params:
        alpha: Threshold probability for classification (default = 0.5)
        output: One of 'dataframe', 'rates' (the flattened counts tn, fp, fn, tp), or 'array'
    """
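    y_pred_ = (y_pred_proba_ > alpha).map({True: 1, False: 0})
    # The target was stored as strings for plotting; cast back to int so both label sets share a type
    cf_mat_ = confusion_matrix(pd.Series(y_test_).astype(int), y_pred_)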
    if output == 'dataframe':
        return pd.DataFrame(cf_mat_, columns=['Predicted 0', 'Predicted 1'], index=['Actual 0', 'Actual 1'])
    elif output == 'rates':
        return cf_mat_.ravel()
    else:
        return cf_mat_

In [None]:
cm = custom_confusion_matrix(y_test, y_pred_proba, output = 'dataframe')
cm

In [None]:
# Assign the four counts to tn, fp, fn, tp (these are counts, not rates, despite the option name)
tn, fp, fn, tp = custom_confusion_matrix(y_test, y_pred_proba, output = 'rates')
tn, fp, fn, tp

3. Compute the Sensitivity, Specificity, Precision, and F-1 Score.
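All four metrics follow directly from the confusion-matrix counts. The sketch below uses hypothetical counts (not results from this notebook) to show the standard formulas; the variable names are illustrative.

```python
# Hypothetical counts, only for illustration
tn, fp, fn, tp = 50, 10, 5, 35

# Sensitivity (recall, true positive rate): share of actual 1's predicted as 1
sensitivity = tp / (tp + fn)

# Specificity (true negative rate): share of actual 0's predicted as 0
specificity = tn / (tn + fp)

# Precision: share of predicted 1's that are actually 1
precision = tp / (tp + fp)

# F-1 score: harmonic mean of precision and sensitivity
f1 = 2 * precision * sensitivity / (precision + sensitivity)

print(round(sensitivity, 4), round(specificity, 4),
      round(precision, 4), round(f1, 4))
```

scikit-learn also offers `recall_score`, `precision_score`, and `f1_score` in `sklearn.metrics`, which take the actual and predicted labels rather than the counts.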

In [None]:
##SENSITIVITY


In [None]:
##SPECIFICITY


In [None]:
##PRECISION


In [None]:
##F-1 SCORE


# What does each of these metrics mean in the context of the classes in this model?
