# Supervised Learning Classification

**Agenda**


- What are the basic supervised learning methods for classification in scikit-learn, and how do you use them?
- The scikit-learn 4-step modeling pattern
    > 1. Import the class you plan to use
    > 2. "Instantiate" the "estimator"
    > 3. Fit the model with data (aka "model training")
    > 4. Predict the response for a new observation
- Measuring the simple accuracy of these predictions

---

## K-nearest neighbors (KNN) classification

1. Pick a value for K.
2. Search for the K observations in the training data that are "nearest" to the measurements of the unknown observation (here, an unknown iris).
3. Use the most common response value among the K nearest neighbors as the predicted response value for the unknown iris.
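
The three steps above can be sketched in plain Python. This is a minimal illustration, not scikit-learn's implementation: the function name `knn_predict`, the toy data, and the use of Euclidean distance are assumptions made for the example, and ties in the vote are broken naively.

```python
from collections import Counter
import math


def knn_predict(X_train, y_train, x_new, k):
    """Predict the label of x_new by majority vote among its k nearest training points."""
    # Step 2: Euclidean distance from x_new to every training observation
    distances = [(math.dist(x, x_new), label) for x, label in zip(X_train, y_train)]
    # keep only the k closest observations
    nearest = sorted(distances, key=lambda pair: pair[0])[:k]
    # Step 3: the most common label among those neighbors is the prediction
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Hypothetical toy data: two well-separated groups of 2-D points
X_toy = [[1.0, 1.0], [1.2, 0.8], [0.9, 1.1], [5.0, 5.0], [5.2, 4.9], [4.8, 5.1]]
y_toy = [0, 0, 0, 1, 1, 1]

# Step 1: pick K, then predict for two new points
print(knn_predict(X_toy, y_toy, [1.1, 1.0], k=3))
print(knn_predict(X_toy, y_toy, [5.1, 5.0], k=3))
```

In practice `KNeighborsClassifier`, used below, does the same job with faster neighbor search and more options.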

### Example training data

![Training data](images/04_knn_dataset.png)

### KNN classification map (K=1)

![1NN classification map](images/04_1nn_map.png)

### KNN classification map (K=5)

![5NN classification map](images/04_5nn_map.png)

*Image Credits: [Data3classes](http://commons.wikimedia.org/wiki/File:Data3classes.png#/media/File:Data3classes.png), [Map1NN](http://commons.wikimedia.org/wiki/File:Map1NN.png#/media/File:Map1NN.png), [Map5NN](http://commons.wikimedia.org/wiki/File:Map5NN.png#/media/File:Map5NN.png) by Agor153. Licensed under CC BY-SA 3.0*

## Loading the data

In [None]:
# import load_iris function from datasets module
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

# save "bunch" object containing iris dataset and its attributes
iris = load_iris()

# split into training and test sets, as we did in the preprocessing part
X_train, X_test, y_train, y_test = train_test_split(iris.data, iris.target, shuffle=True)
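
Because the split above is shuffled randomly, the accuracies reported later can change from run to run. A sketch of a reproducible variant follows, assuming you want a fixed seed and equal class proportions in both sets. The seed `42` and the variable names `X_tr`, `X_te`, `y_tr`, `y_te` are arbitrary choices for this illustration.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

iris = load_iris()

# random_state fixes the shuffle; stratify keeps class proportions similar in both sets
X_tr, X_te, y_tr, y_te = train_test_split(
    iris.data, iris.target, random_state=42, stratify=iris.target
)

# number of observations and features in each set
print(X_tr.shape, X_te.shape)
# number of observations per class in each set
print(np.bincount(y_tr), np.bincount(y_te))
```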


## scikit-learn 4-step modeling pattern

**Step 1:** Import the class you plan to use

In [None]:
from sklearn.neighbors import KNeighborsClassifier

**Step 2:** "Instantiate" the "estimator"

- "Estimator" is scikit-learn's term for model
- "Instantiate" means "make an instance of"

In [None]:
knn = KNeighborsClassifier(n_neighbors=1)

- Name of the object does not matter
- Can specify tuning parameters (aka "hyperparameters") during this step
- All parameters not specified are set to their defaults

In [None]:
print(knn)
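
Depending on the scikit-learn version, `print(knn)` may show only the parameters that differ from their defaults. To see every parameter, including defaults, one option is `get_params()`. The names `knn_demo` and `default_knn` are used only for this illustration.

```python
from sklearn.neighbors import KNeighborsClassifier

knn_demo = KNeighborsClassifier(n_neighbors=1)

# get_params() returns a dict of all constructor parameters and their current values
params = knn_demo.get_params()
print(params["n_neighbors"])  # the value we set explicitly
print(params["weights"])      # left at its default

# an estimator created without arguments uses the default for every parameter
default_knn = KNeighborsClassifier()
print(default_knn.get_params()["n_neighbors"])
```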

**Step 3:** Fit the model with data (aka "model training")

- Model is learning the relationship between X and y
- Occurs in place: `fit` updates the estimator object itself, so the result does not need to be assigned

In [None]:
knn.fit(X_train, y_train)

**Step 4:** Predict the response for a new observation

- New observations are called "out-of-sample" data
- Uses the information it learned during the model training process
- Returns a NumPy array
- Can predict for multiple observations at once

In [None]:
y_predknn1 = knn.predict(X_test)

In [None]:
from sklearn import metrics
print(metrics.accuracy_score(y_test, y_predknn1))
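
Accuracy here is simply the fraction of predictions that match the true labels. The sketch below checks this by hand on small made-up label arrays (`y_true_demo` and `y_pred_demo` are hypothetical, not results from the iris model).

```python
import numpy as np
from sklearn import metrics

# hypothetical true and predicted labels; 5 of the 6 predictions are correct
y_true_demo = np.array([0, 1, 2, 2, 1, 0])
y_pred_demo = np.array([0, 1, 2, 1, 1, 0])

# element-wise comparison gives booleans; their mean is the share of correct predictions
manual_accuracy = np.mean(y_true_demo == y_pred_demo)
print(manual_accuracy)

# accuracy_score computes the same quantity
print(metrics.accuracy_score(y_true_demo, y_pred_demo))
```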

## Using a different value for K

In [None]:
# instantiate the model (using the value K=5)
knn = KNeighborsClassifier(n_neighbors=5)

# fit the model with data
knn.fit(X_train, y_train)

In [None]:
# predict the response for new observations
y_predknn5 = knn.predict(X_test)
print(metrics.accuracy_score(y_test, y_predknn5))
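
To compare more than two values of K, the same fit and predict steps can be wrapped in a loop. This is a sketch under assumptions: the candidate values, the seed `0`, and the variable names are illustrative, and picking K by test-set accuracy tends to give an optimistic estimate.

```python
from sklearn import metrics
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

iris = load_iris()
X_tr, X_te, y_tr, y_te = train_test_split(iris.data, iris.target, random_state=0)

# fit one model per candidate K and record its accuracy on the test set
scores = {}
for k in (1, 3, 5, 7, 9):
    model = KNeighborsClassifier(n_neighbors=k)
    model.fit(X_tr, y_tr)
    scores[k] = metrics.accuracy_score(y_te, model.predict(X_te))

for k, acc in scores.items():
    print(f"K={k}: accuracy={acc:.3f}")
```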

## Using a different classification model

**Step 1:** Import the class you plan to use

In [None]:
from sklearn.linear_model import LogisticRegression

**Step 2:** "Instantiate" the "estimator"

- "Estimator" is scikit-learn's term for model
- "Instantiate" means "make an instance of"
- We mostly keep the default parameters here; only `max_iter` is raised above its default of 100 so the solver has enough iterations to converge

In [None]:
# Some participants had problems with multi_class='auto'.
# One option is 'ovr' (one-vs-rest), which we discussed briefly.
# The other is 'multinomial', which fits a multinomial model even when there are only two classes.
# Note: scikit-learn 1.5 and later deprecate the multi_class argument; if you get a warning, drop it.
logreg = LogisticRegression(max_iter=1000, multi_class='auto', solver='lbfgs')

**Step 3:** Fit the model with data (aka "model training")

- Model is learning the relationship between X and y
- Occurs in place: `fit` updates the estimator object itself, so the result does not need to be assigned

In [None]:
logreg.fit(X_train, y_train)

**Step 4:** Predict the response for a new observation

- New observations are called "out-of-sample" data
- Uses the information it learned during the model training process

In [None]:
y_predlog = logreg.predict(X_test)
print(metrics.accuracy_score(y_test, y_predlog))
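
The comments above mention the one-vs-rest strategy. If the `multi_class` argument causes trouble in your scikit-learn version, one possible alternative is to wrap the model in `OneVsRestClassifier`, which fits one binary classifier per class. This is a sketch rather than the approach used in this notebook; the seed and variable names are illustrative.

```python
from sklearn import metrics
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.multiclass import OneVsRestClassifier

iris = load_iris()
X_tr, X_te, y_tr, y_te = train_test_split(iris.data, iris.target, random_state=0)

# one binary logistic regression is trained per class ("this class" vs. "the rest")
ovr = OneVsRestClassifier(LogisticRegression(max_iter=1000))
ovr.fit(X_tr, y_tr)

# one fitted binary estimator per iris class
print(len(ovr.estimators_))

ovr_accuracy = metrics.accuracy_score(y_te, ovr.predict(X_te))
print(ovr_accuracy)
```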

## Exercise

Use logistic regression, KNN with K=1, and KNN with K=5 to predict the labels for the wine dataset, and compare their accuracy.

In [None]:
# import load_wine function from datasets module
from sklearn.datasets import load_wine

# save "bunch" object containing wine dataset and its attributes


# split into training and test sets, as we did in the preprocessing part


In [None]:
# For this dataset, increase the number of iterations: max_iter=10000
# Some participants had problems with multi_class='auto'.
# One option is 'ovr' (one-vs-rest), which we discussed briefly.
# The other is 'multinomial', which fits a multinomial model even when there are only two classes.
