# Classification Problems

- The regression models assume that the response variable Y is quantitative.
- But in some situations, the response variable is qualitative instead. Qualitative variables are often referred to as **categorical**.
- **Predicting a qualitative response** for an observation can be referred to as **classifying that observation**, since it involves assigning the observation to a category, or class.
- Examples:
    - An online banking service must be able to determine whether or not a transaction being performed on the site is fraudulent, on the basis of the user’s IP address, past transaction history, and so forth.
    - On the basis of DNA sequence data for a number of patients with and without a given disease, a biologist would like to figure out which DNA mutations are deleterious (disease-causing) and which are not.
    - A person arrives at the emergency room with a set of symptoms that could possibly be attributed to one of three medical conditions. Which of the three conditions does the individual have?

## K Nearest Neighbor

- Nearest neighbor classifiers memorize the training dataset and make predictions on the fly, rather than fitting an explicit model.
- The K Nearest Neighbor (kNN) algorithm takes all the **data** and a **distance metric**, and finds the $k$ data instances nearest to the data point to be predicted. The **majority** class among those $k$ nearest data points is assigned as the predicted class of the new data point.
 ![knn](./knn.PNG)
 
- In general, nearest neighbor classifiers are well-suited for classification tasks, where relationships among the features and the target classes are **numerous, complicated, or extremely difficult** to understand.
- KNN can be useful in the following analysis:
    - Computer vision applications, including optical character recognition and facial recognition in both still images and video
    - Predicting whether a person will enjoy a movie or music recommendation
    - Identifying patterns in genetic data, perhaps to use them in detecting specific proteins or diseases
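
The majority-vote rule described above can be sketched in a few lines of plain Python. This is only an illustrative toy, not how scikit-learn implements kNN: the helper names `euclidean` and `knn_predict`, the toy points, and the choice of `k=3` are all assumptions made for this example.

```python
from collections import Counter
import math

def euclidean(p, q):
    """Straight-line distance between two equal-length points."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def knn_predict(train_points, train_labels, query, k=3):
    """Return the majority class among the k training points nearest to query."""
    # Sort training indices by distance to the query and keep the k closest
    nearest = sorted(range(len(train_points)),
                     key=lambda i: euclidean(train_points[i], query))[:k]
    # Count the labels of those neighbors and pick the most frequent one
    votes = Counter(train_labels[i] for i in nearest)
    return votes.most_common(1)[0][0]

# Toy data (made up for illustration): two small, well-separated clusters
points = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (5.0, 5.0), (5.2, 4.9), (4.8, 5.1)]
labels = ["B", "B", "B", "M", "M", "M"]

print(knn_predict(points, labels, (1.1, 1.0)))  # query inside the "B" cluster
print(knn_predict(points, labels, (5.1, 5.0)))  # query inside the "M" cluster
```

Note that nothing is "trained" here: all the work happens at prediction time, which is why the whole dataset has to be kept around.
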

Let's try to identify malignant cancers using the kNN algorithm on the Wisconsin Breast Cancer Dataset:

In [6]:
import pandas as pd
   
bc_data=pd.read_csv("./wdbc.data",header=None)
bc_data.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,22,23,24,25,26,27,28,29,30,31
0,842302,M,17.99,10.38,122.8,1001.0,0.1184,0.2776,0.3001,0.1471,...,25.38,17.33,184.6,2019.0,0.1622,0.6656,0.7119,0.2654,0.4601,0.1189
1,842517,M,20.57,17.77,132.9,1326.0,0.08474,0.07864,0.0869,0.07017,...,24.99,23.41,158.8,1956.0,0.1238,0.1866,0.2416,0.186,0.275,0.08902
2,84300903,M,19.69,21.25,130.0,1203.0,0.1096,0.1599,0.1974,0.1279,...,23.57,25.53,152.5,1709.0,0.1444,0.4245,0.4504,0.243,0.3613,0.08758
3,84348301,M,11.42,20.38,77.58,386.1,0.1425,0.2839,0.2414,0.1052,...,14.91,26.5,98.87,567.7,0.2098,0.8663,0.6869,0.2575,0.6638,0.173
4,84358402,M,20.29,14.34,135.1,1297.0,0.1003,0.1328,0.198,0.1043,...,22.54,16.67,152.2,1575.0,0.1374,0.205,0.4,0.1625,0.2364,0.07678


Routine breast cancer screening allows the disease to be diagnosed and treated before it causes noticeable symptoms. <br>The process of early detection involves examining the breast tissue for abnormal lumps or masses.
<br>If a lump is found, a fine-needle aspiration biopsy is performed, which uses a hollow needle to extract a small sample of cells from the mass.
<br>A clinician then examines the cells under a microscope to determine whether the mass is likely to be malignant (cancerous) or benign (not cancerous).
<br>We will investigate the utility of machine learning for detecting cancer by applying the k-NN algorithm to measurements of biopsied cells from women with abnormal breast masses.

We will utilize the Wisconsin Breast Cancer Diagnostic dataset from the UCI Machine Learning Repository at http://archive.ics.uci.edu/ml. 
<br>The data were donated by researchers at the University of Wisconsin and include measurements from digitized images of a fine-needle aspirate of a breast mass.
<br>The values represent the characteristics of the cell nuclei present in the digital image.

The breast cancer data includes 569 examples of cancer biopsies, each with 32 features. 
<br> **One feature is an identification number, another is the cancer diagnosis, and 30 are numeric-valued laboratory measurements.** 
<br>**The diagnosis is coded as "M" to indicate malignant or "B" to indicate benign.**

In [7]:
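# Drop the identification number (column 0), then split features and target

bc_data.drop(0, axis='columns', inplace=True)

# NOTE: 1:-1 also leaves out the last column (31), so X holds 29 of the 30
# measurements; use 1: to keep all 30. The outputs below reflect 1:-1.
X = bc_data.iloc[:, 1:-1]
y = bc_data.iloc[:, 0]   # column 1: diagnosis ('M' or 'B')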

In [8]:
X.head()

Unnamed: 0,2,3,4,5,6,7,8,9,10,11,...,21,22,23,24,25,26,27,28,29,30
0,17.99,10.38,122.8,1001.0,0.1184,0.2776,0.3001,0.1471,0.2419,0.07871,...,0.006193,25.38,17.33,184.6,2019.0,0.1622,0.6656,0.7119,0.2654,0.4601
1,20.57,17.77,132.9,1326.0,0.08474,0.07864,0.0869,0.07017,0.1812,0.05667,...,0.003532,24.99,23.41,158.8,1956.0,0.1238,0.1866,0.2416,0.186,0.275
2,19.69,21.25,130.0,1203.0,0.1096,0.1599,0.1974,0.1279,0.2069,0.05999,...,0.004571,23.57,25.53,152.5,1709.0,0.1444,0.4245,0.4504,0.243,0.3613
3,11.42,20.38,77.58,386.1,0.1425,0.2839,0.2414,0.1052,0.2597,0.09744,...,0.009208,14.91,26.5,98.87,567.7,0.2098,0.8663,0.6869,0.2575,0.6638
4,20.29,14.34,135.1,1297.0,0.1003,0.1328,0.198,0.1043,0.1809,0.05883,...,0.005115,22.54,16.67,152.2,1575.0,0.1374,0.205,0.4,0.1625,0.2364


In [9]:
y.head()

0    M
1    M
2    M
3    M
4    M
Name: 1, dtype: object

The diagnosis variable is of particular interest, as it is the outcome we hope to predict.
<br>This feature indicates whether the example is from a benign or malignant mass.
<br>Let's count how many data points there are for each diagnosis:

In [10]:
bc_data.groupby(1).size()

1
B    357
M    212
dtype: int64

## Distance metric and normalization

- The kNN algorithm usually employs the **Euclidean distance** metric:

\\[ d(p,q) = \sqrt{(p_1 - q_1)^2 + (p_2 - q_2)^2 + \ldots + (p_n - q_n)^2} \\]

- When we apply a distance metric like this directly to the data, the features with larger magnitudes dominate the measure. This creates a bias toward giving more importance to features with larger values.

- Thus, we need to bring all features to the same scale (i.e. normalize or standardize)

- In sklearn, we can do normalization by using MinMaxScaler: https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.MinMaxScaler.html
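
To see why scaling matters, here is a small sketch using made-up numbers (the three rows below are hypothetical, loosely imitating an area-like column and a smoothness-like column). It applies the min-max formula $(x - \min) / (\max - \min)$ by hand; the helper names are assumptions for this example, and the code assumes no column is constant.

```python
import math

# Hypothetical rows: [area-like feature, smoothness-like feature]
rows = [
    [1000.0, 0.10],
    [1010.0, 0.30],   # similar area, very different smoothness
    [1300.0, 0.10],   # different area, identical smoothness
]

def euclidean(p, q):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def min_max_scale(data):
    """Rescale every column to [0, 1] via (x - min) / (max - min).

    Assumes each column has at least two distinct values (max > min).
    """
    columns = list(zip(*data))
    lows = [min(c) for c in columns]
    highs = [max(c) for c in columns]
    return [[(x - lo) / (hi - lo) for x, lo, hi in zip(row, lows, highs)]
            for row in data]

# Raw distances from row 0: the large-magnitude column decides almost everything
raw_d1 = euclidean(rows[0], rows[1])
raw_d2 = euclidean(rows[0], rows[2])

# After scaling, both columns contribute on the same [0, 1] scale
scaled = min_max_scale(rows)
scaled_d1 = euclidean(scaled[0], scaled[1])
scaled_d2 = euclidean(scaled[0], scaled[2])

print("raw distances:   ", raw_d1, raw_d2)
print("scaled distances:", scaled_d1, scaled_d2)
```

On the raw data, row 1 looks far closer to row 0 than row 2 does, purely because smoothness differences are numerically tiny. After scaling, the two distances are nearly equal.
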

Let's normalize our data:

In [15]:
from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import train_test_split

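# Split first, then fit the scaler on the training set only, so that no
# information from the test set leaks into the scaling parameters
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=100)

scaler = MinMaxScaler()
X_scaled = scaler.fit_transform(X_train)

# first training row after scaling: every value now lies in [0, 1]
X_scaled[0, :]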

array([0.60717497, 0.40726644, 0.59574321, 0.47399703, 0.41238603,
       0.28532292, 0.34653233, 0.47206759, 0.31016043, 0.0885486 ,
       0.23276738, 0.14515559, 0.24016672, 0.20354358, 0.16252507,
       0.1601375 , 0.11161949, 0.37188264, 0.10597633, 0.05017937,
       0.7697499 , 0.56794317, 0.80045777, 0.73010426, 0.54357128,
       0.27913768, 0.42907348, 0.82259731, 0.28581611])

In [16]:
X_train.iloc[0, :]

2       19.810000
3       22.150000
4      130.000000
5     1260.000000
6        0.098310
7        0.102700
8        0.147900
9        0.094980
10       0.158200
11       0.053950
12       0.758200
13       1.017000
14       5.865000
15     112.400000
16       0.006494
17       0.018930
18       0.033910
19       0.015210
20       0.013560
21       0.001997
22      27.320000
23      30.880000
24     186.800000
25    2398.000000
26       0.151200
27       0.315000
28       0.537200
29       0.238800
30       0.276800
Name: 18, dtype: float64

In [17]:
# import kNN classifier and train the model
from sklearn.neighbors import KNeighborsClassifier 

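knn = KNeighborsClassifier()  # uses the default n_neighbors=5

knn.fit(X_scaled, y_train)

# scale the test set with the scaler that was fitted on the training data
X_test_scaled = scaler.transform(X_test)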

predictions = knn.predict(X_test_scaled)
predictions[0:5]


array(['M', 'B', 'M', 'B', 'B'], dtype=object)

In [18]:
y_test[0:5]

400    M
225    B
321    M
173    B
506    B
Name: 1, dtype: object

## Classification Performance Measures

- A variety of metrics exist to evaluate the performance of binary classifiers against trusted labels.
- The most common metrics are **accuracy**, **precision**, **recall**, **F1 measure**, and **ROC AUC** score.
- All of these measures depend on the concepts of **true positives (TP)**, **true negatives (TN)**, **false positives (FP)**, and **false negatives (FN)**.
![confusion](./confusion-matrix.jpg)
- Accuracy: (Use ```sklearn.metrics.accuracy_score(...)```) 
\\[Accuracy = \frac{TP + TN }{TP + TN + FP + FN} \\]
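
As a rough illustration of how the four counts feed the accuracy formula, the sketch below tallies them by hand on a tiny made-up label list. The helper `confusion_counts` and the demo labels are hypothetical, chosen only for this example, with `'M'` treated as the positive class.

```python
def confusion_counts(y_true, y_pred, positive):
    """Count TP, TN, FP, FN for a binary problem, given the positive label."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    return tp, tn, fp, fn

# Made-up labels for illustration only
y_true_demo = ["M", "M", "M", "B", "B", "B", "B", "B"]
y_pred_demo = ["M", "M", "B", "B", "B", "B", "M", "B"]

tp, tn, fp, fn = confusion_counts(y_true_demo, y_pred_demo, positive="M")
accuracy = (tp + tn) / (tp + tn + fp + fn)

print("TP, TN, FP, FN:", tp, tn, fp, fn)
print("Accuracy:", accuracy)
```
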

In [22]:
from sklearn.metrics import accuracy_score
accuracy_score(y_test, predictions)

0.9707602339181286

In [24]:
# the default score() of scikit-learn classifiers is mean accuracy
knn.score(X_test_scaled, y_test)

0.9707602339181286

### Precision and Recall

- In some cases, accuracy may not be an ideal measure to assess the performance of the classifier. 
    - This is especially true when the class distribution is heavily skewed toward one class.
    - For example, consider a disease that affects 1 in every 100,000 people. If a classifier labels everyone as healthy, what is its accuracy?
    - If the aim is to catch the patients with the disease, then this classifier is useless even though its accuracy is very high (99.999%).
- Solution: use **Precision**, **Recall**, or a combination of both (**F1 Score**).
\\[Precision = \frac{TP }{TP + FP} \qquad\qquad Recall = \frac{TP}{TP + FN}\\]

 ![precision_recall](./precision_recall.PNG)
 
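 - For the classifier mentioned above, **recall is 0** when the **relevant (positive) class is disease positive**, because it produces no true positives. Its precision is 0/0, which is undefined; scikit-learn reports it as 0 by default and issues a warning.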
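
The rare-disease example can be worked through numerically. The sketch below assumes a population of exactly 100,000 with one sick person and a classifier that predicts "healthy" for everyone; the variable names are illustrative. Precision has a zero denominator in this case, and conventions differ on how to report it (scikit-learn's `precision_score` returns 0 with a warning by default), so the sketch leaves it as `None`.

```python
population = 100_000
sick = 1                      # prevalence of 1 in 100,000
healthy = population - sick

# The classifier predicts "healthy" for everyone, so with disease as the
# positive class it never makes a positive prediction.
tp, fp = 0, 0
fn, tn = sick, healthy

accuracy = (tp + tn) / population
recall = tp / (tp + fn)

# Precision is TP / (TP + FP) = 0 / 0 here, so it is left undefined (None)
precision = tp / (tp + fp) if (tp + fp) > 0 else None

print("Accuracy: ", accuracy)
print("Recall:   ", recall)
print("Precision:", precision)
```

The accuracy is almost perfect, yet the one patient who matters is missed, which is exactly what recall exposes.
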
 
 Let's see how we can calculate precision and recall in sklearn:

In [26]:
from sklearn.metrics import precision_score,recall_score

prec = precision_score(y_test, predictions,  pos_label='M') 
rec = recall_score(y_test, predictions,  pos_label='M') 

print("Precision for Malignant Tumors:" , prec)
print("Recall for Malignant Tumors:" , rec)

Precision for Malignant Tumors: 0.9848484848484849
Recall for Malignant Tumors: 0.9420289855072463
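
The F1 score mentioned earlier combines the two numbers as their harmonic mean, $F_1 = 2PR / (P + R)$. As a quick sketch, the snippet below plugs in the precision and recall values printed above rather than recomputing them from the model; `sklearn.metrics.f1_score(y_test, predictions, pos_label='M')` should give the same result directly.

```python
# Values copied from the precision/recall output above (class 'M')
precision = 0.9848484848484849
recall = 0.9420289855072463

# Harmonic mean of precision and recall
f1 = 2 * precision * recall / (precision + recall)

print("F1 for Malignant Tumors:", round(f1, 4))
```

Because the harmonic mean is pulled toward the smaller of the two values, F1 stays low whenever either precision or recall is poor.
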
