# Classification Metrics

* K-Nearest Neighbor
* TP, FP, TN, FN.  Consequences of false predictions (FP, FN)
* Accuracy, precision, recall

#### Review
* Supervised vs unsupervised learning
    + supervised - there's a target variable (y) to learn from.
    + unsupervised - there's no target variable.
* Three main types of ML problems:
    + Unsupervised: clustering
    + Supervised: regression and classification
        + Regression: the target is numerical, either ordinal or continuous (e.g. Rings, house costs)
        + Classification: the target is categorical (e.g. Species, outcome of a diagnosis)

In [3]:
import pandas
from sklearn.preprocessing import StandardScaler

diabetes = pandas.read_csv('../Datasets/diabetes.csv')
diabetes.sample(3)


Unnamed: 0,Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age,Outcome
114,7,160,54,32,175,30.5,0.588,39,1
68,1,95,66,13,38,19.6,0.334,25,0
119,4,99,76,15,51,23.2,0.223,21,0


In [4]:
y = diabetes['Outcome']
X = diabetes.drop(columns=['Outcome'])


### Making sense of the data

Before we attempt to model y, based on X, what do we do?

* Understand the features
* Check for missing values and see whether the data makes sense (e.g. a BMI or blood pressure of 0 is not plausible).


In [5]:
X.isna().sum()

Pregnancies                 0
Glucose                     0
BloodPressure               0
SkinThickness               0
Insulin                     0
BMI                         0
DiabetesPedigreeFunction    0
Age                         0
dtype: int64

In [6]:
X.min()

Pregnancies                  0.000
Glucose                      0.000
BloodPressure                0.000
SkinThickness                0.000
Insulin                      0.000
BMI                          0.000
DiabetesPedigreeFunction     0.078
Age                         21.000
dtype: float64

In [7]:
len(diabetes[diabetes['BMI']==0])

11

In [8]:
len(diabetes[diabetes['BloodPressure']==0])

35

In [9]:
len(diabetes)

768

In [10]:
Q = diabetes['BMI']>0
Q &= diabetes['BloodPressure']>0
df = diabetes[Q]
len(df)

729

In [11]:
idx = diabetes[diabetes['Outcome']==1].sample(250).index
idx

Int64Index([291, 171,  64, 485, 424, 355, 664, 159, 195, 220,
            ...
            387, 417, 339, 143, 238, 207, 754, 618, 646, 231],
           dtype='int64', length=250)

In [12]:
diabetes2 = diabetes[ ~diabetes.index.isin(idx)]
len(diabetes2)

518

In [13]:
diabetes2['Outcome'].value_counts(1)

0    0.965251
1    0.034749
Name: Outcome, dtype: float64

### Normalizing the data

In [14]:
from sklearn.preprocessing import StandardScaler

df = diabetes2
y = df['Outcome']
X = df.drop(columns=['Outcome'])
X = pandas.DataFrame(
    data = StandardScaler().fit_transform(X),
    columns = X.columns,
)
X.sample(3)

Unnamed: 0,Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age
440,-1.098169,-0.139354,-0.446638,0.322823,-0.689453,-0.534953,-1.00209,-0.719382
258,0.216335,0.150547,0.207533,-1.304485,-0.689453,-0.209788,-0.196118,1.240983
99,-0.769543,0.911536,0.316562,1.95013,1.320088,0.895773,-0.119832,-0.634148


### K Nearest Neighbor Classification

In K Nearest Neighbor (KNN) learning, the knowledge is captured in a multi-dimensional geometric data structure built from the training data.

This data structure allows the model to identify the k nearest data points to any given data point.

**How does 7-Nearest Neighbor predict whether a new patient A has diabetes?**
+ The model first finds the 7 training data points most similar to A (in terms of features).
+ The predicted outcome is the majority outcome among those 7 data points.
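
The two steps above can be sketched in a few lines of NumPy. This is a minimal illustration, not how scikit-learn implements `KNeighborsClassifier` internally: the function name `knn_predict` and the toy data are hypothetical, and Euclidean distance is assumed as the similarity measure (it is also scikit-learn's default).

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, a, k=7):
    """Predict the label of point `a` by majority vote among its k nearest training points."""
    X_train = np.asarray(X_train, dtype=float)
    y_train = np.asarray(y_train)
    # Step 1: Euclidean distance from `a` to every training point
    distances = np.linalg.norm(X_train - np.asarray(a, dtype=float), axis=1)
    # Indices of the k closest training points
    nearest = np.argsort(distances)[:k]
    # Step 2: majority vote over the neighbors' labels
    votes = Counter(y_train[nearest].tolist())
    return votes.most_common(1)[0][0]

# Hypothetical toy data with two features (not the diabetes dataset)
X_toy = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1], [0.3, 0.3],
                  [2.0, 2.0], [2.1, 1.9], [1.9, 2.2], [2.2, 2.1], [2.0, 1.8]])
y_toy = np.array([0, 0, 0, 0, 1, 1, 1, 1, 1])

new_patient = [1.8, 1.9]
prediction = knn_predict(X_toy, y_toy, new_patient, k=7)
print(prediction)  # 5 of the 7 nearest neighbors are class 1, so this prints 1
```

Using an odd k with two classes avoids tied votes.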

In [15]:
from sklearn.neighbors import KNeighborsClassifier
model = KNeighborsClassifier(n_neighbors=11)
model.fit(X,y)

In [16]:
A = X.sample(3)
A

Unnamed: 0,Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age
377,-1.098169,-0.719155,0.643647,-1.304485,-0.689453,0.245443,0.550153,-0.378449
231,0.544961,-0.429254,-0.773724,0.518099,0.128154,0.453549,0.211844,-0.122749
99,-0.769543,0.911536,0.316562,1.95013,1.320088,0.895773,-0.119832,-0.634148


In [17]:
# model.predict(A)

In [18]:
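# y.iloc[A.index]  # X was rebuilt with a fresh 0..n-1 index, so look up the labels by position (.iloc), not by label (.loc)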

To see how good KNN is, we have to cross-validate.

### To cross-validate, what do we need?

* A cross validator (e.g. ShuffleSplit, KFold)
* A metric (e.g. accuracy)
* A baseline

In [19]:
from sklearn.model_selection import ShuffleSplit
from sklearn.metrics import accuracy_score
from sklearn.dummy import DummyClassifier

from sklearn.model_selection import cross_validate



In [20]:
from sklearn.preprocessing import StandardScaler

y = df['Outcome']
X = df.drop(columns=['Outcome'])
X = pandas.DataFrame(
    data = StandardScaler().fit_transform(X),
    columns = X.columns,
)

from sklearn.neighbors import KNeighborsClassifier
model = KNeighborsClassifier(n_neighbors=11)
# Baseline that ignores the features entirely
baseline = DummyClassifier()

# 100 random train/test splits, each holding out 5% of the rows for testing
validator = ShuffleSplit(n_splits=100, test_size=0.05)

result = cross_validate(model, X, y, cv=validator, scoring=['accuracy'])

baseline_result = cross_validate(baseline, X, y, cv=validator, scoring=['accuracy'])

In [21]:
result.keys()

dict_keys(['fit_time', 'score_time', 'test_accuracy'])

In [22]:
result['test_accuracy'].mean().round(2)

0.96

In [23]:
baseline_result['test_accuracy'].mean().round(2)

0.96

In [24]:
y.value_counts()

0    500
1     18
Name: Outcome, dtype: int64

In [25]:
y.value_counts(1).round(2)

0    0.97
1    0.03
Name: Outcome, dtype: float64

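KNN and the baseline both reach 0.96 accuracy, simply because about 97% of the samples are class 0. For very imbalanced datasets (the distribution of y is very skewed), accuracy is not a meaningful metric.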
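
To make this concrete, here is a small sketch that uses the same class counts as `y.value_counts()` above (500 negatives, 18 positives). The always-predict-0 "model" is a hypothetical stand-in for a trivial baseline, not the fitted `DummyClassifier` from the earlier cell.

```python
import numpy as np
from sklearn.metrics import accuracy_score

# Same class counts as y.value_counts() above: 500 of class 0, 18 of class 1
y_true = np.array([0] * 500 + [1] * 18)

# A trivial "model" that ignores the features and always predicts class 0
y_pred = np.zeros_like(y_true)

accuracy = accuracy_score(y_true, y_pred)
# How many of the 18 actual positives did the model find?
positives_found = int(np.sum(y_pred[y_true == 1] == 1))

print(round(accuracy, 2), positives_found)  # 0.97 0
```

The accuracy looks excellent even though not a single diabetic patient is detected.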

### Performance metrics

##### Accuracy

+ Measures how good a model is overall.
+ The percentage of correct predictions.

Key terminology:
+ True positive
+ False positive
+ True negative
+ False negative

By convention, we set up the model to learn y and treat value 1 of y (class 1) as the class of interest.

A positive is a prediction of class 1.

A negative is a prediction of class 0.

A "True" prediction is a correct prediction.

A "False" prediction is an incorrect prediction.

```
    Accuracy = total correct predictions / total predictions
             = (TP + TN) / (TP + FP + TN + FN)
```

In [26]:
df = diabetes
y = df['Outcome']
X = df.drop(columns=['Outcome'])
X = pandas.DataFrame(
    data = StandardScaler().fit_transform(X),
    columns = X.columns,
)
result = cross_validate(model, X, y, cv=validator, scoring=['accuracy'])
result['test_accuracy'].mean()

0.7282051282051283

In [27]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.02)

In [28]:
model.fit(X_train, y_train)
predictions = model.predict(X_test)

In [29]:
predictions

array([1, 0, 0, 0, 0, 1, 1, 0, 0, 0, 1, 0, 0, 0, 1, 1])

In [30]:
y_test.values

array([1, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0])

```
TP = 3
FP = 3
TN = 9
FN = 1

Accuracy = 12/16 = 0.75
```
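
The four counts can also be derived programmatically instead of by hand. The sketch below copies the two arrays printed above and uses scikit-learn's `confusion_matrix`; the variable name `y_true` is chosen here for illustration.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, accuracy_score

# Arrays copied from the `predictions` and `y_test.values` outputs above
predictions = np.array([1, 0, 0, 0, 0, 1, 1, 0, 0, 0, 1, 0, 0, 0, 1, 1])
y_true = np.array([1, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0])

# With labels=[0, 1], rows are actual classes and columns are predicted classes,
# so ravel() yields the counts in the order TN, FP, FN, TP
tn, fp, fn, tp = (int(v) for v in confusion_matrix(y_true, predictions, labels=[0, 1]).ravel())
print(tp, fp, tn, fn)  # 3 3 9 1

# Accuracy from the formula, and the same value from scikit-learn
accuracy = (tp + tn) / (tp + fp + tn + fn)
print(accuracy, accuracy_score(y_true, predictions))  # 0.75 0.75
```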


### Precision and Recall

We know that for highly imbalanced data, accuracy is not very meaningful.

It turns out that it's hard to capture performance on highly imbalanced data with a single metric, so we look at two: precision and recall.

```
    Precision = TP / (TP + FP)
    Recall = TP / (TP + FN)

```

```
TP = 3
FP = 3
TN = 9
FN = 1

Precision = 3/6 = 0.5

Recall = 3/4 = 0.75
```

Precision is essentially the probability that a positive prediction is correct.

Recall is the probability that an actual class 1 sample is predicted as class 1.
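
Both metrics are available in scikit-learn as `precision_score` and `recall_score`. As a quick check, this sketch applies them to the 16 test predictions shown earlier (arrays copied from the outputs above; `y_true` is a name chosen here).

```python
import numpy as np
from sklearn.metrics import precision_score, recall_score

# Arrays copied from the `predictions` and `y_test.values` outputs above
predictions = np.array([1, 0, 0, 0, 0, 1, 1, 0, 0, 0, 1, 0, 0, 0, 1, 1])
y_true = np.array([1, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0])

# Precision: of the 6 positive predictions, how many are correct?
precision = precision_score(y_true, predictions)
# Recall: of the 4 actual positives, how many were found?
recall = recall_score(y_true, predictions)

print(precision, recall)  # 0.5 0.75
```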




### Precision Recall tradeoff

Predicting COVID infection


Possible features:
1. **Fever**: A common symptom that is easy to measure and report.
2. **Cough**: Persistent coughing, particularly dry cough, is frequently associated with COVID-19.
3. **Shortness of Breath or Difficulty Breathing**: This symptom can indicate a more severe infection.
4. **Fatigue**: General tiredness is a common symptom reported by many infected individuals.
5. **Loss of Taste or Smell**: This symptom has been highlighted as particularly distinctive for COVID-19 compared to other respiratory viruses.
6. **Sore Throat**: Often reported alongside other respiratory symptoms.
7. **Congestion or Runny Nose**: Common in many respiratory infections but still relevant.
8. **Muscle or Body Aches**: General body discomfort or pain.
9. **Headache**: A frequently reported symptom that can be associated with many viral infections.
10. **Chills**: Sometimes accompanied by shaking.
11. **Nausea or Vomiting**: Gastrointestinal symptoms are less common but relevant.
12. **Contact with a COVID-infected person**


What criteria should we use for making predictions?

Consider two algorithms:

+ **A** - if a person has all 12 features above, declare "YES" (a positive)
+ **B** - if a person has the first 4 features above, declare "YES" (a positive)

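Every person that A flags is also flagged by B, because B's condition is part of A's. In terms of FP and FN:
+ B has at least as many FP as A has, and typically more.
+ B has at most as many FN as A has, and typically fewer.

So B favors recall, while A tends to favor precision.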
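
The comparison between A and B can be illustrated with a small simulation. Everything in this sketch is hypothetical: the population size, the 20% infection rate, and the assumption that infected people show each feature with probability 0.8 versus 0.2 for everyone else are made up for illustration, not taken from real COVID data.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical population: 1000 people, 12 binary features (True = feature present)
n_people, n_features = 1000, 12
infected = rng.random(n_people) < 0.2
# Assumption for illustration: infected people show each feature more often
p_feature = np.where(infected[:, None], 0.8, 0.2)
features = rng.random((n_people, n_features)) < p_feature

pred_A = features.all(axis=1)          # rule A: all 12 features present
pred_B = features[:, :4].all(axis=1)   # rule B: the first 4 features present

def fp_fn(pred, truth):
    """Return (false positives, false negatives) for boolean arrays."""
    fp = int(np.sum(pred & ~truth))
    fn = int(np.sum(~pred & truth))
    return fp, fn

fp_A, fn_A = fp_fn(pred_A, infected)
fp_B, fn_B = fp_fn(pred_B, infected)
print("A: FP =", fp_A, "FN =", fn_A)
print("B: FP =", fp_B, "FN =", fn_B)
```

Whatever the simulated numbers are, B's positives always include A's positives, so B can only add false positives and remove false negatives relative to A.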
