# 90-803 Machine Learning Foundations with Python
### Spring 2025 / Lab03  Classifier Performance Metrics

#### Your name: `Joanna Chang`
#### Your Andrew Id: `joannac2`

In this exercise we will work with the Pima Indians Diabetes dataset.  This data set is 
[originally](https://archive.ics.uci.edu/ml/datasets/Diabetes) from the 
[UC-Irvine machine learning repository](https://archive.ics.uci.edu/ml/datasets.php).  
We will use a cleaned up version from 
[Kaggle](https://www.kaggle.com/uciml/pima-indians-diabetes-database).  For convenience I've already downloaded the dataset to the exercise folder. The dataset has the following variables (columns):

- Pregnancies
- Glucose
- BloodPressure
- SkinThickness
- Insulin
- BMI
- DiabetesPedigreeFunction
- Age
- Outcome

Spend some time on the Kaggle site familiarizing yourself with the dataset.

In [1]:
import pandas as pd
file = 'diabetes.csv'
data = pd.read_csv(file)
data.head()

Unnamed: 0,Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age,Outcome
0,6,148,72,35,0,33.6,0.627,50,1
1,1,85,66,29,0,26.6,0.351,31,0
2,8,183,64,0,0,23.3,0.672,32,1
3,1,89,66,23,94,28.1,0.167,21,0
4,0,137,40,35,168,43.1,2.288,33,1


For the purposes of this exercise, we are going to explore whether we can predict the diabetes status of a patient given the following 4 health measurements:

In [2]:
features = ['Pregnancies', 'Insulin', 'BMI', 'Age']
X = data[features]
y = data.Outcome

In [3]:
total_cases = len(data)  # == len(X) == len(y)
total_cases

768

There are 768 rows in the data set.  We split them into a _Training data set_ and a _Test data set_ with a scikit-learn function.  If we all use the same value for `random_state`, our splits will be the same.

In [4]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=2025) # randomly split the data into training and testing sets

In [8]:
X.shape

(768, 4)

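How can we find out the proportions of the split? In Jupyter, place the cursor inside the `train_test_split(...)` call and press Shift+Tab to display its docstring. By default `test_size` is 0.25, so 192 of the 768 rows go to the test set.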
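To see why a shared `random_state` gives everyone the same split, here is a minimal pure-Python sketch of a seeded shuffle-and-split. It is a simplified stand-in, not scikit-learn's actual implementation; the helper name `simple_split` and its parameters are made up for illustration, and the 25% test fraction is assumed from the default noted above.

```python
import math
import random


def simple_split(n_rows, test_fraction=0.25, seed=2025):
    """Shuffle row indices with a seeded generator, then split off a test set.

    Hypothetical helper for illustration only; train_test_split does more
    (stratification, multiple arrays, and so on).
    """
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)        # same seed -> same shuffle
    n_test = math.ceil(test_fraction * n_rows)  # 25% of 768 rows is 192
    return indices[n_test:], indices[:n_test]   # (train indices, test indices)


train_a, test_a = simple_split(768)
train_b, test_b = simple_split(768)

print(len(train_a), len(test_a))  # 576 192
print(test_a == test_b)           # True: identical seed, identical split
```

Changing `seed` would produce a different shuffle, which is why the lab fixes `random_state=2025` for everyone.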

Now, let's use Logistic Regression as our classifier, training it on the `(X_train, y_train)` combo. We will study the details of Logistic Regression later in this course.

In [11]:
from sklearn.linear_model import LogisticRegression

clf = LogisticRegression()

clf.fit(X_train, y_train)

We use the fitted model to make a prediction with `X_test`.  We get the predictions as a numpy array.

In [12]:
y_pred_class = clf.predict(X_test)
y_pred_class[:5]

array([0, 1, 0, 1, 0])

For the rest of this exercise we will examine various metrics with which we can measure the performance of the classifier.  You will use the built-in scikit-learn functions for these metrics and will also calculate them yourself.  Wherever you see a function of the form `my_<metric>`, you need to define the function yourself to do the same calculation that the built-in scikit-learn function does.

#### Classification accuracy: percentage of correct predictions

In [14]:
len(X_test)

192

Of the 192 test cases how many did we get right?

In [15]:
from sklearn import metrics

metrics.accuracy_score(y_test, y_pred_class)

0.6770833333333334

In [16]:
type(y_test), type(y_pred_class)

(pandas.core.series.Series, numpy.ndarray)

In [17]:
## TODO: my_accuracy_score(y_test, y_pred_class)

def my_accuracy_score( actual, predicted): 
    return sum(actual == predicted) / len(actual)


my_accuracy_score(y_test, y_pred_class)

0.6770833333333334

#### Null accuracy:

This is defined as the accuracy that could be achieved by always predicting the most frequent class.

In [18]:
y_test.value_counts(normalize=True)
# Baseline: the accuracy we get with no modelling effort at all
# The majority class (0) makes up 0.609 of the test set, so 0.609 is the baseline accuracy

Outcome
0    0.609375
1    0.390625
Name: proportion, dtype: float64

The null accuracy is 60.94%, the proportion of the majority class (`0`) in the test set.
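The null accuracy can also be computed directly from class counts. The sketch below rebuilds the test labels from the counts this notebook reports for `y_test` (117 zeros and 75 ones, consistent with 0.609375 × 192 = 117); the variable names are illustrative.

```python
from collections import Counter

# Test-set labels rebuilt from the class counts reported for y_test
# (117 negatives, 75 positives); the order does not matter here.
y_test_labels = [0] * 117 + [1] * 75

counts = Counter(y_test_labels)
majority_class, majority_count = counts.most_common(1)[0]

# Accuracy of a "model" that always predicts the majority class
null_accuracy = majority_count / len(y_test_labels)
print(majority_class, null_accuracy)  # 0 0.609375
```

The fitted classifier's accuracy of about 0.677 should be judged against this baseline rather than against 0.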

### Confusion matrix

While the confusion matrix itself is not a metric, all of the metrics can be calculated from it. Read the scikit-learn [documentation](https://scikit-learn.org/stable/auto_examples/model_selection/plot_confusion_matrix.html) on the confusion matrix.

In [20]:
metrics.confusion_matrix(y_test, y_pred_class)

array([[97, 20],
       [42, 33]])

**Basic terminology**

- **True Positives (TP):** we *correctly* predicted that they *do* have diabetes
- **True Negatives (TN):** we *correctly* predicted that they *don't* have diabetes
- **False Positives (FP):** we *incorrectly* predicted that they *do* have diabetes (a "Type I error")
- **False Negatives (FN):** we *incorrectly* predicted that they *don't* have diabetes (a "Type II error")

In [21]:
## TODO  In terms of the confusion matrix define the following 4 terms

cm = metrics.confusion_matrix(y_test, y_pred_class)

# Note: if we used our own my_confusion_matrix, which returns a DataFrame, we would have to write:
# cm = my_confusion_matrix(y_test, y_pred_class).values

TP = cm[1, 1]
TN = cm[0, 0]
FP = cm[0, 1]
FN = cm[1, 0]
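The indexing above relies on scikit-learn's layout for binary labels: rows are actual classes, columns are predicted classes, so the matrix reads `[[TN, FP], [FN, TP]]`. A minimal sketch of counting the four cells by hand follows; the function name `confusion_counts` and the toy labels are made up for illustration and are not the lab's data.

```python
def confusion_counts(actual, predicted):
    """Return a 2x2 nested list [[TN, FP], [FN, TP]] for binary 0/1 labels.

    Rows index the actual class and columns the predicted class,
    mirroring the layout used for the cm array above.
    """
    matrix = [[0, 0], [0, 0]]
    for a, p in zip(actual, predicted):
        matrix[a][p] += 1
    return matrix


# Toy labels, not taken from the diabetes data
toy_actual = [0, 0, 0, 1, 1, 1, 1, 0]
toy_predicted = [0, 1, 0, 1, 0, 1, 1, 0]

toy_cm = confusion_counts(toy_actual, toy_predicted)
toy_TN, toy_FP = toy_cm[0]
toy_FN, toy_TP = toy_cm[1]
print(toy_cm)  # [[3, 1], [1, 3]]
```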

#### Metrics computed from a confusion matrix

Now we will calculate the following metrics:

- accuracy
- error, misclassification rate
- recall, sensitivity, True Positive Rate (TPR)
- specificity
- false positive rate (FPR)
- precision


**Accuracy:** Overall, how often is the classifier correct?

In [23]:
metrics.accuracy_score(y_test, y_pred_class)

0.6770833333333334

In [25]:
## TODO: Express accuracy in terms of TP, TN, FP, FN
accuracy = (TP + TN) / (TP + TN + FP + FN)
accuracy


0.6770833333333334

**Error:** Overall, how often is the classifier incorrect? Also known as "Misclassification Rate"

In [17]:
1 - metrics.accuracy_score(y_test, y_pred_class)

0.32291666666666663

**Recall:** When the actual value is positive, how often is the prediction correct?

- How "sensitive" is the classifier to detecting positive instances?
- Also known as _Sensitivity_ or _True Positive Rate_ (TPR)

In [18]:
metrics.recall_score(y_test, y_pred_class)

0.44

In [26]:
## TODO: Express recall in terms of TP, TN, FP, FN

recall = TP / (TP + FN)
recall

0.44

**Specificity:** When the actual value is negative, how often is the prediction correct?

- How "specific" (or "selective") is the classifier in predicting positive instances?

In [27]:
## TODO:  Express specificity in terms of TP, TN, FP, FN
## Interestingly, there is no built-in function in scikit-learn to calculate specificity
specificity = TN / (TN + FP)
specificity



0.8290598290598291

**False Positive Rate:** When the actual value is negative, how often is the prediction incorrect?

In [28]:
## TODO:  Express FPR in terms of TP, TN, FP, FN
## Interestingly, there is no built-in function in scikit-learn to calculate FPR
FPR = FP / (TN + FP)
FPR


0.17094017094017094

**Precision:** When a positive value is predicted, how often is the prediction correct?

- How "precise" is the classifier when predicting positive instances?

In [22]:
metrics.precision_score(y_test, y_pred_class)

0.6226415094339622

In [29]:
## TODO :  Express precision in terms of TP, TN, FP, FN

precision = TP / (TP + FP)
precision

0.6226415094339622
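All of the formulas in this section use the same four counts, so they can be checked side by side. The sketch below hard-codes the counts from the confusion matrix printed earlier (`[[97, 20], [42, 33]]`) instead of recomputing them; the dictionary name `summary` is only for illustration.

```python
# Counts read off the confusion matrix [[TN, FP], [FN, TP]] = [[97, 20], [42, 33]]
TN, FP, FN, TP = 97, 20, 42, 33
total = TN + FP + FN + TP  # 192 test cases

summary = {
    "accuracy": (TP + TN) / total,
    "error": (FP + FN) / total,
    "recall": TP / (TP + FN),       # also called sensitivity or TPR
    "specificity": TN / (TN + FP),
    "FPR": FP / (TN + FP),
    "precision": TP / (TP + FP),
}

for name, value in summary.items():
    print(f"{name:>12}: {value:.4f}")
```

Two identities are worth noticing: accuracy and error sum to 1, and so do specificity and FPR, because each pair splits the same denominator.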

## Nothing TODO below this;  just study the material

## The ROC curve


The values of y_test are

In [30]:
y_test

82     0
152    1
393    0
691    1
14     1
      ..
538    0
569    1
482    0
30     0
412    0
Name: Outcome, Length: 192, dtype: int64

In [31]:
y_test.value_counts()

Outcome
0    117
1     75
Name: count, dtype: int64

Though we used `clf.predict(X_test)` to see what the classifier predicts, in practice we rarely do this.  Rather, we use the classifier to determine the probability of each test observation being in the `0` or `1` class using the `predict_proba` method:

In [32]:
y_pred_proba = clf.predict_proba(X_test)

### Note:
Study the output below and make sure you understand the correspondence between the values of `y_test` and the two probabilities.

In [34]:
list(zip(y_test, y_pred_proba))[:10] # y_test shows the true label, y_pred_proba shows the probability of being 0 or 1

[(0, array([0.64767145, 0.35232855])),
 (1, array([0.40229481, 0.59770519])),
 (0, array([0.81760299, 0.18239701])),
 (1, array([0.23941319, 0.76058681])),
 (1, array([0.63275924, 0.36724076])),
 (1, array([0.39376132, 0.60623868])),
 (1, array([0.74102404, 0.25897596])),
 (0, array([0.86344877, 0.13655123])),
 (0, array([0.73107741, 0.26892259])),
 (0, array([0.75570228, 0.24429772]))]
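One way to check this correspondence is to confirm that each pair of probabilities sums to 1 and that picking the more probable class reproduces the hard predictions seen earlier. The sketch below copies the first five rows of the output above by hand; the variable names are illustrative, and the 0.5 cut-off is the implicit default for a binary classifier.

```python
# First five (true label, (P(class 0), P(class 1))) pairs from the output above
rows = [
    (0, (0.64767145, 0.35232855)),
    (1, (0.40229481, 0.59770519)),
    (0, (0.81760299, 0.18239701)),
    (1, (0.23941319, 0.76058681)),
    (1, (0.63275924, 0.36724076)),
]

predicted = []
for actual, (p0, p1) in rows:
    # The two class probabilities of one observation sum to 1
    assert abs(p0 + p1 - 1.0) < 1e-6
    # Choosing the more probable class is the same as testing p1 > 0.5
    predicted.append(int(p1 > 0.5))

print(predicted)  # [0, 1, 0, 1, 0]
```

The result matches `y_pred_class[:5]` from earlier. The fifth observation is a true `1` that receives only a 0.367 probability of being positive, so it ends up as a false negative.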

Once the probabilities are determined, it is left to the business domain to decide the threshold above which a test observation will be classified as a `1`. As we saw in class, as we vary this threshold the _sensitivity_ and _specificity_ of the classifier will vary.  The ROC curve graphically depicts how these two metrics vary as the classification threshold varies.  There is a built-in function that helps determine the ROC curve:

In [37]:
positive = y_pred_proba[:,1]

`y_pred_proba[:, 1]` slices the `y_pred_proba` array to extract the second column (index 1) from each row. `y_pred_proba` is a 2D array in which each row contains the predicted probabilities for each class (here, class 0 and class 1).

The extracted column, the predicted probability of class 1, is assigned to the variable `positive`.
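To make the effect of the threshold concrete, the sketch below applies two different cut-offs to the positive-class probabilities of the first ten test observations (values copied from the `zip` output above). The helper name `sensitivity_specificity` and the two thresholds are chosen for illustration, and ten observations are far too few to draw conclusions from.

```python
# True labels and positive-class probabilities of the first ten test observations
y_true = [0, 1, 0, 1, 1, 1, 1, 0, 0, 0]
scores = [0.35232855, 0.59770519, 0.18239701, 0.76058681, 0.36724076,
          0.60623868, 0.25897596, 0.13655123, 0.26892259, 0.24429772]


def sensitivity_specificity(y_true, scores, threshold):
    """Classify as 1 when score >= threshold, then return (sensitivity, specificity)."""
    y_hat = [int(s >= threshold) for s in scores]
    tp = sum(1 for y, p in zip(y_true, y_hat) if y == 1 and p == 1)
    fn = sum(1 for y, p in zip(y_true, y_hat) if y == 1 and p == 0)
    tn = sum(1 for y, p in zip(y_true, y_hat) if y == 0 and p == 0)
    fp = sum(1 for y, p in zip(y_true, y_hat) if y == 0 and p == 1)
    return tp / (tp + fn), tn / (tn + fp)


at_050 = sensitivity_specificity(y_true, scores, 0.5)
at_030 = sensitivity_specificity(y_true, scores, 0.3)
print(at_050)  # (0.6, 1.0)
print(at_030)  # (0.8, 0.8)
```

Lowering the threshold from 0.5 to 0.3 catches one more true positive (sensitivity rises from 0.6 to 0.8) at the cost of one new false positive (specificity falls from 1.0 to 0.8).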

In [36]:
positive

array([0.35232855, 0.59770519, 0.18239701, 0.76058681, 0.36724076,
       0.60623868, 0.25897596, 0.13655123, 0.26892259, 0.24429772,
       0.125335  , 0.47696354, 0.24893877, 0.39517472, 0.27503182,
       0.10655623, 0.57870414, 0.53587282, 0.29299912, 0.48233003,
       0.46630438, 0.49725319, 0.09213436, 0.25561705, 0.23076444,
       0.54558546, 0.16360658, 0.13088116, 0.72372675, 0.01310239,
       0.68269949, 0.50471716, 0.38354189, 0.55810984, 0.60489713,
       0.63524243, 0.12555399, 0.34192092, 0.66971485, 0.50443855,
       0.45310255, 0.08625742, 0.0550422 , 0.23499985, 0.23577241,
       0.11817464, 0.50845252, 0.28221166, 0.29141993, 0.69358741,
       0.2326289 , 0.33284236, 0.11859455, 0.55250792, 0.47574205,
       0.18954509, 0.33028788, 0.13972553, 0.73936357, 0.66642644,
       0.3721631 , 0.57906886, 0.39458965, 0.26199887, 0.31629932,
       0.20276387, 0.27470307, 0.13373564, 0.23875684, 0.44992215,
       0.18747004, 0.30509344, 0.20463375, 0.38265515, 0.26823

In [40]:
fpr, tpr, thresholds = metrics.roc_curve(y_test, positive)

In [38]:
import altair as alt
import pandas as pd

In [41]:
df = pd.DataFrame({'fpr':fpr, 'tpr':tpr, 'threshold':thresholds})
df.head()

Unnamed: 0,fpr,tpr,threshold
0,0.0,0.0,inf
1,0.0,0.013333,0.845192
2,0.008547,0.013333,0.815818
3,0.008547,0.106667,0.693587
4,0.017094,0.106667,0.68815


In [42]:
len(df)

83
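Each row of this table is one threshold together with the FPR and TPR it produces. A minimal sketch of that sweep, again on the first ten test observations only, is shown below. It is a simplified version, not scikit-learn's implementation: the function name `roc_points` is made up, and it keeps every distinct score as a threshold, whereas `metrics.roc_curve` by default drops some intermediate points, which is likely one reason the table has 83 rows rather than one per observation.

```python
# True labels and positive-class probabilities of the first ten test observations
y_true = [0, 1, 0, 1, 1, 1, 1, 0, 0, 0]
scores = [0.35232855, 0.59770519, 0.18239701, 0.76058681, 0.36724076,
          0.60623868, 0.25897596, 0.13655123, 0.26892259, 0.24429772]


def roc_points(y_true, scores):
    """Return (fpr, tpr, threshold) triples, from the highest threshold to the lowest."""
    n_pos = sum(y_true)
    n_neg = len(y_true) - n_pos
    # An infinite threshold classifies nothing as positive: the (0, 0) corner
    points = [(0.0, 0.0, float("inf"))]
    for t in sorted(set(scores), reverse=True):
        tp = sum(1 for y, s in zip(y_true, scores) if s >= t and y == 1)
        fp = sum(1 for y, s in zip(y_true, scores) if s >= t and y == 0)
        points.append((fp / n_neg, tp / n_pos, t))
    return points


points = roc_points(y_true, scores)
for fpr_i, tpr_i, t in points:
    print(f"threshold={t:.4f}  fpr={fpr_i:.1f}  tpr={tpr_i:.1f}")
```

The sweep starts at (0, 0) and ends at (1, 1), moving up whenever the lowered threshold admits a true positive and right whenever it admits a false positive.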

In [43]:
alt.Chart(df).mark_line(point=True).encode(
    x='fpr',
    y='tpr',
    tooltip='threshold'
)

While the ROC curve is determined by the classification threshold, the threshold itself is not directly observable from the curve.  Using the interactive feature of Altair we are able to inspect the threshold.