# Classification Metrics

* K-Nearest Neighbor
* TP, FP, TN, FN.  Consequences of false predictions (FP, FN)
* Accuracy, precision, recall

#### Review
* Supervised vs unsupervised learning
    + supervised - there's a target variable (y) to learn from.
    + unsupervised - there's no target variable.
* Three main types of ML problems:
    + Unsupervised: clustering
    + Supervised: regression and classification
        + Regression: the target is numerical, ordered or continuous (e.g. Rings, House costs)
        + Classification: the target is categorical (e.g. Species, Outcome of a diagnosis)

In [30]:
import pandas
from sklearn.preprocessing import StandardScaler

diabetes = pandas.read_csv('../Datasets/diabetes.csv')
diabetes.sample(3)


Unnamed: 0,Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age,Outcome
524,3,125,58,0,0,31.6,0.151,24,0
101,1,151,60,0,0,26.1,0.179,22,0
136,0,100,70,26,50,30.8,0.597,21,0


In [31]:
y = diabetes['Outcome']
X = diabetes.drop(columns=['Outcome'])


### Making sense of the data

Before we attempt to model y, based on X, what do we do?

* Understand the features
* Check for missing values, and check whether the data makes sense (e.g. a BMI of 0 is not plausible).


In [32]:
X.isna().sum()

Pregnancies                 0
Glucose                     0
BloodPressure               0
SkinThickness               0
Insulin                     0
BMI                         0
DiabetesPedigreeFunction    0
Age                         0
dtype: int64

In [33]:
X.min()

Pregnancies                  0.000
Glucose                      0.000
BloodPressure                0.000
SkinThickness                0.000
Insulin                      0.000
BMI                          0.000
DiabetesPedigreeFunction     0.078
Age                         21.000
dtype: float64

In [34]:
len(diabetes[diabetes['BMI']==0])

11

In [35]:
len(diabetes[diabetes['BloodPressure']==0])

35

In [36]:
len(diabetes)

768

In [37]:
Q = diabetes['BMI']>0
Q &= diabetes['BloodPressure']>0
df = diabetes[Q]
len(df)

729
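The per-column zero checks above can also be done in a single pass. The sketch below uses a small made-up frame (the values are illustrative, not rows from `diabetes.csv`), and the choice of which columns count as "suspect" is an assumption made for the example.

```python
import pandas

# Hypothetical toy data; not taken from diabetes.csv.
toy = pandas.DataFrame({
    'BMI':           [31.6, 0.0, 26.1, 30.8],
    'BloodPressure': [58, 60, 0, 0],
    'Pregnancies':   [3, 0, 1, 0],   # zero is a legitimate value here
})

# Columns where a zero most likely encodes a missing measurement (an assumption).
suspect = ['BMI', 'BloodPressure']

# Number of zeros in each suspect column.
zero_counts = (toy[suspect] == 0).sum()
print(zero_counts)

# Keep only the rows where every suspect column is positive.
clean = toy[(toy[suspect] > 0).all(axis=1)]
print(len(clean))
```

This is equivalent to chaining one boolean mask per column with `&=`, but it scales to more columns without extra lines.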

In [42]:
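# Pick 250 positive cases (Outcome == 1) at random; they are removed below
# to create a highly imbalanced dataset.
idx = diabetes[diabetes['Outcome']==1].sample(250).index
idx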

Index([443,  84, 681, 220, 753, 715, 663, 185, 603, 569,
       ...
        72, 541, 755, 339, 170, 116, 198, 732, 214, 269],
      dtype='int64', length=250)

In [45]:
diabetes2 = diabetes[ ~diabetes.index.isin(idx)]
len(diabetes2)

518

In [49]:
diabetes2['Outcome'].value_counts(normalize=True)

Outcome
0    0.965251
1    0.034749
Name: proportion, dtype: float64

### Normalizing the data

In [55]:
from sklearn.preprocessing import StandardScaler

df = diabetes2
y = df['Outcome']
X = df.drop(columns=['Outcome'])
X = pandas.DataFrame(
    data = StandardScaler().fit_transform(X),
    columns = X.columns,
    index = X.index,  # keep the row labels so X stays aligned with y
)
X.sample(3)

Unnamed: 0,Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age
160,2.14386,0.399601,-0.001575,-1.32274,-0.672416,0.094251,-0.584238,0.809276
488,0.202392,-1.021155,0.977438,-0.054631,-0.672416,-0.152062,-0.393615,0.213935
183,-1.091919,-0.256133,-0.219134,0.212339,0.445078,-0.34652,0.049019,-0.721601
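As a rough illustration of what `StandardScaler` does, the sketch below compares it with a manual z-score on a made-up matrix (the numbers are hypothetical). Scaling matters for KNN because distances are otherwise dominated by the features with the largest numeric range.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical toy feature matrix: 4 samples, 2 features on very different scales.
raw = np.array([
    [1.0, 100.0],
    [2.0, 200.0],
    [3.0, 300.0],
    [4.0, 400.0],
])

scaled = StandardScaler().fit_transform(raw)

# Manual z-score: subtract the column mean, divide by the column standard deviation.
# StandardScaler uses the population standard deviation (ddof=0), which is numpy's default.
manual = (raw - raw.mean(axis=0)) / raw.std(axis=0)

print(np.allclose(scaled, manual))
print(scaled.mean(axis=0).round(6), scaled.std(axis=0).round(6))
```

After scaling, every column has mean 0 and standard deviation 1, so both features contribute on the same footing.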


### K Nearest Neighbor Classification

In K Nearest Neighbor learning, the model's knowledge is the training data itself, stored in a multi-dimensional geometric data structure.

This data structure lets the model find the k training points nearest to any given data point.

**How does 7-Nearest Neighbor work for predicting if a new patient A has diabetes?**
+ The model first finds the 7 training data points most similar to A (in terms of features).
+ The predicted outcome is the majority outcome among those 7 data points.

(The code below uses `n_neighbors=11`; the procedure is the same with 11 neighbors.)
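The two steps above can be sketched from scratch as follows. This is a minimal illustration assuming Euclidean distance and a simple majority vote; `knn_predict` and the toy data are hypothetical names and values, and this is not how `KNeighborsClassifier` is implemented internally.

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, a, k):
    """Predict the label of point `a` by majority vote among its k nearest training points."""
    # Euclidean distance from `a` to every training point.
    distances = np.linalg.norm(X_train - a, axis=1)
    # Indices of the k smallest distances.
    nearest = np.argsort(distances)[:k]
    # Majority vote over the neighbours' labels.
    votes = Counter(y_train[nearest].tolist())
    return votes.most_common(1)[0][0]

# Hypothetical toy training data: two features, two well-separated classes.
X_train = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1], [0.3, 0.3],
                    [2.0, 2.0], [2.1, 1.9], [1.9, 2.2], [2.2, 2.1]])
y_train = np.array([0, 0, 0, 0, 1, 1, 1, 1])

a = np.array([1.8, 1.9])
print(knn_predict(X_train, y_train, a, k=7))
```

With k=7, the neighbours of `a` are the four class 1 points plus three class 0 points, so the vote goes to class 1.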

In [62]:
from sklearn.neighbors import KNeighborsClassifier
model = KNeighborsClassifier(n_neighbors=11)
model.fit(X,y)

In [57]:
A = X.sample(3)
A

Unnamed: 0,Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age
320,-1.091919,2.257512,0.542321,0.813023,1.88048,2.077719,2.326804,2.255105
186,2.14386,0.654608,0.433542,0.546052,0.50288,0.703551,-0.513158,0.639179
66,0.849548,-0.656859,-0.980588,0.679538,-0.055867,-0.229846,-0.26761,-0.721601


In [60]:
# model.predict(A)

In [61]:
# y.loc[A.index]

To see how good KNN is, we have to cross-validate.

To cross-validate, what do we need?
+ A cross validator (e.g. ShuffleSplit, KFold)
+ A metric (e.g. accuracy)
+ A baseline
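To make the role of the cross validator concrete, the sketch below shows what `ShuffleSplit` produces: a sequence of random train/test index pairs. The toy data and the fixed `random_state` are assumptions made only so the example is small and reproducible.

```python
import numpy as np
from sklearn.model_selection import ShuffleSplit

# Hypothetical toy data: 20 samples, 1 feature.
X_toy = np.arange(20).reshape(-1, 1)

# 3 random splits, each holding out 25% of the samples for testing.
# random_state is fixed here only to make the sketch reproducible.
splitter = ShuffleSplit(n_splits=3, test_size=0.25, random_state=0)

splits = list(splitter.split(X_toy))
for train_idx, test_idx in splits:
    print(len(train_idx), len(test_idx), sorted(test_idx))
```

Each split trains on 15 rows and tests on the other 5; the metric is then averaged over the splits.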

In [23]:
from sklearn.model_selection import ShuffleSplit
from sklearn.metrics import accuracy_score
from sklearn.dummy import DummyClassifier

from sklearn.model_selection import cross_validate



In [63]:
from sklearn.preprocessing import StandardScaler

y = df['Outcome']
X = df.drop(columns=['Outcome'])
X = pandas.DataFrame(
    data = StandardScaler().fit_transform(X),
    columns = X.columns,
)

from sklearn.neighbors import KNeighborsClassifier
model = KNeighborsClassifier(n_neighbors=11)
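# Baseline: DummyClassifier ignores the features; by default it always predicts the most frequent class.
baseline = DummyClassifier()

# 100 random splits, each holding out 5% of the rows for testing.
validator = ShuffleSplit(n_splits=100, test_size=0.05)

result = cross_validate(model, X, y, cv=validator, scoring=['accuracy'])

baseline_result = cross_validate(baseline, X, y, cv=validator, scoring=['accuracy'])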

In [64]:
result['test_accuracy'].mean().round(2)

0.97

In [65]:
baseline_result['test_accuracy'].mean().round(2)

0.97

In [66]:
y.value_counts()

Outcome
0    500
1     18
Name: count, dtype: int64

In [67]:
y.value_counts(normalize=True).round(2)

Outcome
0    0.97
1    0.03
Name: proportion, dtype: float64

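For very imbalanced datasets (the distribution of y is very skewed), accuracy is not a meaningful metric: here the baseline, which never looks at the features, reaches the same 0.97 accuracy as KNN simply because 97% of the rows are class 0.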
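A short sketch of why accuracy misleads here, using the class counts shown above (500 negatives, 18 positives). The always-zero "model" is a hypothetical stand-in for any classifier that ignores the features.

```python
import numpy as np
from sklearn.metrics import accuracy_score

# Class counts taken from the value_counts() output above: 500 negatives, 18 positives.
y_true = np.array([0] * 500 + [1] * 18)

# A "model" that ignores the features and always predicts class 0.
y_pred = np.zeros_like(y_true)

accuracy = accuracy_score(y_true, y_pred)
print(round(accuracy, 2))  # 500 / 518

# How many of the 18 actual positives did it catch?
positives_found = int((y_pred[y_true == 1] == 1).sum())
print(positives_found)
```

The accuracy looks excellent, yet not a single positive case is detected.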

### Performance metrics

#### Accuracy

+ Measures how often the model is right.
+ The percentage of correct predictions.

Key terminology:
+ True positive (TP)
+ False positive (FP)
+ True negative (TN)
+ False negative (FN)

Conventionally, class 1 of y is the class we want the model to detect, and it is called the positive class.

A positive is a prediction of class 1.

A negative is a prediction of class 0.

A "True" prediction is a correct prediction.

A "False" prediction is an incorrect prediction.

For example, a false positive is a prediction of class 1 for a patient whose actual outcome is 0.

```
    Accuracy = total correct predictions / total predictions
             = (TP + TN) / (TP + FP + TN + FN)
```

In [70]:
df = diabetes
y = df['Outcome']
X = df.drop(columns=['Outcome'])
X = pandas.DataFrame(
    data = StandardScaler().fit_transform(X),
    columns = X.columns,
)
result = cross_validate(model, X, y, cv=validator, scoring=['accuracy'])
result['test_accuracy'].mean()

0.7392307692307692

In [74]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.02)

In [75]:
model.fit(X_train, y_train)
predictions = model.predict(X_test)

In [80]:
predictions

array([0, 0, 0, 0, 0, 1, 0, 1, 0, 1, 0, 1, 1, 0, 1, 0])

In [78]:
y_test.values

array([0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 1, 1, 0, 0, 1])

```
TP = 3
FP = 3
TN = 8
FN = 2

Accuracy = 11/16
```
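The counts above can be checked with `confusion_matrix`. The sketch below re-enters the two printed arrays by hand; the variable name `y_true` is chosen for the example.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, accuracy_score

# The two arrays printed above.
predictions = np.array([0, 0, 0, 0, 0, 1, 0, 1, 0, 1, 0, 1, 1, 0, 1, 0])
y_true      = np.array([0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 1, 1, 0, 0, 1])

# For binary labels 0 and 1, confusion_matrix returns [[TN, FP], [FN, TP]].
TN, FP, FN, TP = confusion_matrix(y_true, predictions).ravel()
print(TP, FP, TN, FN)

accuracy = (TP + TN) / (TP + FP + TN + FN)
print(accuracy, accuracy_score(y_true, predictions))
```

Note that the first argument is the true labels and the second is the predictions; swapping them swaps FP and FN.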


### Precision and Recall

We know that for highly imbalanced data, accuracy is not very meaningful.

It turns out that it's hard to use a single metric for highly imbalanced data, so we look at two: precision and recall.

```
    Precision = TP / (TP + FP)
    Recall = TP / (TP + FN)

```

```
TP = 3
FP = 3
TN = 8
FN = 2

Precision = 3/6

Recall = 3/5
```
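The same two numbers can be obtained from scikit-learn. This sketch re-enters the prediction and test arrays from the accuracy example by hand; `y_true` is a name chosen for the example.

```python
import numpy as np
from sklearn.metrics import precision_score, recall_score

# Predictions and true labels from the 16-row test split.
predictions = np.array([0, 0, 0, 0, 0, 1, 0, 1, 0, 1, 0, 1, 1, 0, 1, 0])
y_true      = np.array([0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 1, 1, 0, 0, 1])

precision = precision_score(y_true, predictions)  # TP / (TP + FP)
recall = recall_score(y_true, predictions)        # TP / (TP + FN)
print(precision, recall)
```

Both functions treat class 1 as the positive class by default, which matches the convention used here.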

Precision is essentially the probability that a positive prediction is correct.

Recall is the probability that an actual class 1 case is predicted as positive.




### Precision Recall tradeoff

Predicting COVID infection


Possible features:
1. **Fever**: A common symptom that is easy to measure and report.
2. **Cough**: Persistent coughing, particularly dry cough, is frequently associated with COVID-19.
3. **Shortness of Breath or Difficulty Breathing**: This symptom can indicate a more severe infection.
4. **Fatigue**: General tiredness is a common symptom reported by many infected individuals.
5. **Loss of Taste or Smell**: This symptom has been highlighted as particularly distinctive for COVID-19 compared to other respiratory viruses.
6. **Sore Throat**: Often reported alongside other respiratory symptoms.
7. **Congestion or Runny Nose**: Common in many respiratory infections but still relevant.
8. **Muscle or Body Aches**: General body discomfort or pain.
9. **Headache**: A frequently reported symptom that can be associated with many viral infections.
10. **Chills**: Sometimes accompanied by shaking.
11. **Nausea or Vomiting**: Gastrointestinal symptoms are less common but relevant.
12. **Contact with a COVID-infected person**


What criteria should we use for making predictions?

Consider two algorithms:

+ **A** - if a person has all 12 symptoms, declare "YES" (a positive)
+ **B** - if a person has the first 4 symptoms, declare "YES" (a positive)

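In terms of FP and FN,  
+ B has more FP than A has.
+ B has fewer FN than A has.

So B has higher recall than A, typically at the cost of lower precision; A is the reverse. This is the precision-recall tradeoff.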
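The comparison of A and B can be tried on a tiny invented population. Everything in the sketch below is hypothetical: the eight people, their symptoms, and their infection status are made up purely to illustrate the direction of the tradeoff, and `person` is a helper defined only for this example.

```python
import numpy as np
from sklearn.metrics import precision_score, recall_score

def person(*symptom_numbers):
    """Return a 12-element 0/1 vector with the listed symptoms (numbered 1 to 12) present."""
    row = np.zeros(12, dtype=int)
    for n in symptom_numbers:
        row[n - 1] = 1
    return row

# Hypothetical people, invented for illustration only.
symptoms = np.array([
    person(*range(1, 13)),     # all 12 symptoms
    person(1, 2, 3, 4, 5, 9),
    person(1, 2, 3, 4),
    person(1, 2, 3, 4),
    person(1, 2, 3, 4, 6, 7),
    person(1, 2),
    person(),                  # no symptoms
    person(4),
])
infected = np.array([1, 1, 1, 0, 0, 1, 0, 0])

# Algorithm A: positive only if all 12 symptoms are present.
pred_A = symptoms.all(axis=1).astype(int)
# Algorithm B: positive if the first 4 symptoms are present.
pred_B = symptoms[:, :4].all(axis=1).astype(int)

for name, pred in [('A', pred_A), ('B', pred_B)]:
    print(name, precision_score(infected, pred), recall_score(infected, pred))
```

On this toy population the strict rule A makes no false positives but misses most infections, while the looser rule B catches more infections and also flags some healthy people.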
