# Classification and K-Nearest Neighbors

**Classification problems:** Problems where the label `y` is categorical (one of a fixed set of classes) rather than numeric.

## Case Study: Breast Tissue Classification

Electrical signals can be used to detect whether tissue is cancerous.

![Tissue](https://i.imgur.com/MBNUHae.png)

The goal is to determine which of the following a sample of breast tissue is:
- Connective tissue
- Adipose tissue
- Glandular tissue
- Carcinoma
- Fibro-adenoma
- Mastopathy

In [1]:
import pandas as pd
df = pd.read_csv("https://datasci112.stanford.edu/data/BreastTissue.csv")
df

Unnamed: 0,Case #,Class,I0,PA500,HFS,DA,Area,A/DA,Max IP,DR,P
0,1,car,524.794072,0.187448,0.032114,228.800228,6843.598481,29.910803,60.204880,220.737212,556.828334
1,2,car,330.000000,0.226893,0.265290,121.154201,3163.239472,26.109202,69.717361,99.084964,400.225776
2,3,car,551.879287,0.232478,0.063530,264.804935,11888.391827,44.894903,77.793297,253.785300,656.769449
3,4,car,380.000000,0.240855,0.286234,137.640111,5402.171180,39.248524,88.758446,105.198568,493.701814
4,5,car,362.831266,0.200713,0.244346,124.912559,3290.462446,26.342127,69.389389,103.866552,424.796503
...,...,...,...,...,...,...,...,...,...,...,...
101,102,adi,2000.000000,0.106989,0.105418,520.222649,40087.920984,77.059161,204.090347,478.517223,2088.648870
102,103,adi,2600.000000,0.200538,0.208043,1063.441427,174480.476218,164.071543,418.687286,977.552367,2664.583623
103,104,adi,1600.000000,0.071908,-0.066323,436.943603,12655.342135,28.963331,103.732704,432.129749,1475.371534
104,105,adi,2300.000000,0.045029,0.136834,185.446044,5086.292497,27.427344,178.691742,49.593290,2480.592151


### Features
- **I0:** Impedance at zero frequency (ohms)
- **PA500:** Phase angle at 500 kHz
- **HFS:** High-frequency slope
- **DA:** Distance between the spectrum points
- **Area:** Area under the spectrum
- **A/DA:** Area normalized by DA
- **Max IP:** Maximum point of the spectrum
- **DR:** Distance between I0 and the maximum frequency point
- **P:** Length of the spectral curve
- **Class:** Tissue class label (e.g., car, fad, mas, gla, con, adi)

In this notebook we focus only on `I0` and `PA500`.

In [14]:
import plotly.express as px

fig = px.scatter(
    df,
    x="I0",
    y="PA500",
    color="Class",
)


fig.update_layout(
    title_font_size=24,
    legend_title_text="Tissue type"
)

fig.show()

### K-Nearest Neighbors Classification

KNN classification can be implemented using `KNeighborsClassifier` from scikit-learn.

In [16]:
X_train = df[["I0", "PA500"]]
y_train = df["Class"]

x_test = pd.Series({"I0": 400, "PA500": .18})
X_test = x_test.to_frame().T

X_test

Unnamed: 0,I0,PA500
0,400.0,0.18


In [18]:
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline

pipeline = make_pipeline(
    StandardScaler(),
    KNeighborsClassifier(n_neighbors=5, metric="euclidean")
)

pipeline.fit(X_train, y_train)
pipeline.predict(X_test)

array(['car'], dtype=object)
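The pipeline standardizes the features before computing distances. One likely reason, sketched below with two rows copied from the table above: `I0` is in the hundreds or thousands while `PA500` is below 1, so the raw Euclidean distance is determined almost entirely by `I0`. The `euclidean` helper is a hypothetical name introduced only for this illustration.

```python
import math

# (I0, PA500) for the test point and two training rows copied from the table above.
x_test = (400.0, 0.18)
row_car = (380.000000, 0.240855)   # row 3, class "car"
row_adi = (1600.000000, 0.071908)  # row 103, class "adi"

def euclidean(a, b):
    """Euclidean distance between two equal-length tuples (hypothetical helper)."""
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))

for name, row in [("car", row_car), ("adi", row_adi)]:
    d_full = euclidean(x_test, row)
    d_i0_only = abs(x_test[0] - row[0])
    # Share of the squared distance that comes from PA500 alone.
    pa500_share = (x_test[1] - row[1]) ** 2 / d_full ** 2
    print(f"{name}: distance={d_full:.4f}, I0-only={d_i0_only:.4f}, "
          f"PA500 share={pa500_share:.2e}")
```

On unscaled data the full distance and the `I0`-only distance are nearly identical, so `PA500` would have almost no say in which neighbors are nearest. Standardizing puts both features on a comparable scale.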

In [20]:
pipeline.predict_proba(X_test)

array([[0. , 0.6, 0. , 0.2, 0. , 0.2]])

In [21]:
proba_dict = dict(zip(pipeline.classes_, pipeline.predict_proba(X_test)[0]))
pd.Series(proba_dict).sort_values(ascending=False)

car    0.6
fad    0.2
mas    0.2
adi    0.0
con    0.0
gla    0.0
dtype: float64
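The probabilities are multiples of 0.2 because, with `n_neighbors=5` and scikit-learn's default uniform weighting, each of the 5 nearest neighbors casts one vote and the reported probability is the fraction of votes for each class. The sketch below reproduces that voting logic in plain Python. The data points and the `knn_vote` helper are made up for illustration (they are not from the breast tissue data), and the standardization step is skipped for brevity.

```python
import math
from collections import Counter

def knn_vote(train_points, train_labels, query, k=5):
    """Return (predicted_label, {label: vote_fraction}) for one query point."""
    # Euclidean distance from the query to every training point.
    distances = [math.dist(p, query) for p in train_points]
    # Indices of the k training points with the smallest distances.
    nearest = sorted(range(len(train_points)), key=lambda i: distances[i])[:k]
    # Each neighbor casts one vote for its own label.
    votes = Counter(train_labels[i] for i in nearest)
    fractions = {label: count / k for label, count in votes.items()}
    predicted = votes.most_common(1)[0][0]
    return predicted, fractions

# Made-up 2D points, assumed to be already scaled.
train_points = [(0.0, 0.0), (0.1, 0.2), (0.2, 0.1), (1.0, 1.0),
                (1.1, 0.9), (3.0, 3.0), (3.1, 2.9)]
train_labels = ["car", "car", "car", "fad", "mas", "adi", "adi"]

label, fractions = knn_vote(train_points, train_labels, query=(0.3, 0.3), k=5)
print(label, fractions)
```

Classes that receive no votes are simply absent from `fractions` here, whereas `predict_proba` reports them as 0. Tie-breaking in this sketch is arbitrary and may differ from scikit-learn's.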

### Cross-Validation for Classification

When cross-validating a regression model, we use `scoring="neg_mean_squared_error"`. That metric does not apply to categorical labels, so for classification we can use `scoring="accuracy"` instead.

In [22]:
from sklearn.model_selection import cross_val_score

scores = cross_val_score(
    pipeline, X_train, y_train,
    scoring="accuracy",
    cv=10
)

scores

array([0.63636364, 0.81818182, 0.45454545, 0.54545455, 0.63636364,
       0.54545455, 0.5       , 0.6       , 0.4       , 0.7       ])

As before, we can get an overall estimate of test accuracy by averaging the cross-validation accuracies.

In [23]:
scores.mean()

np.float64(0.5836363636363637)
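Accuracy is the fraction of predictions that match the true label, and the cross-validated estimate is the mean of the per-fold accuracies. A minimal sketch of both steps follows. The `accuracy` helper and the example labels are hypothetical, while the fold scores are copied from the `cross_val_score` output above.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that equal the true label (hypothetical helper)."""
    matches = sum(t == p for t, p in zip(y_true, y_pred))
    return matches / len(y_true)

# Made-up labels for one fold of 11 samples; 7 of the 11 predictions are correct.
y_true = ["car", "car", "fad", "mas", "gla", "con", "adi", "adi", "car", "fad", "mas"]
y_pred = ["car", "car", "mas", "mas", "fad", "con", "adi", "adi", "car", "gla", "fad"]
print(accuracy(y_true, y_pred))

# Per-fold accuracies copied from the cross_val_score output above.
fold_scores = [0.63636364, 0.81818182, 0.45454545, 0.54545455, 0.63636364,
               0.54545455, 0.5, 0.6, 0.4, 0.7]
print(sum(fold_scores) / len(fold_scores))
```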