# Classification

* You're probably aware of some parametric prediction methods, e.g., linear regression.  
* Let's study a non-parametric prediction method. 
* The goal of this method: classify something into one of a finite number of discrete types. 
* Classification is one form of 'supervised learning': the method learns from examples whose types are already known. 

# Scikit-Learn

* Scikit-Learn is a major machine learning library that includes many reference data sets. 
* Initial release: June 2007, which predates `pandas`, but not by much! (Scikit-Learn and `pandas` solve different sets of problems, so they have simply coexisted for a long time.)
* It has its own formats. 
* It's important to know how to translate to other formats to accomplish tasks. 

<img align="right" style="padding-left:10px; height: 24%; width: 24%;" src="figures/iris_with_labels.jpg">

# The Iris Dataset

* There is one dataset that is so well-known that it bears mentioning in any context. 
* The *iris dataset* consists of a multidimensional array of iris characteristics used in determining species. 
* Let's explore this dataset and see if we can understand it. 

In [None]:
from sklearn import datasets
iris = datasets.load_iris()
iris


# This is a special-purpose format. 
* A class (`Bunch`). 
* Implemented as a dictionary, so a field can be read either as `iris['data']` or as `iris.data`. 
* Intended for testing machine-learning algorithms. 
* With fields that make sense for that task.
* Most entries are arrays in `numpy` format. 

Let's find out a bit about it. 

In [None]:
print(iris.DESCR)

# Important fields in the iris dataset
* `iris.data`: a set of feature vectors, one row of four measurements per plant. 
* `iris.target`: the species of each plant, encoded as an integer (0, 1, or 2). 
* `iris.feature_names`: the names of the columns of `iris.data`. 
* `iris.target_names`: the English names of the species. 
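
As a rough sketch of how these fields fit together, the cell below prints their shapes and translates the dataset into a `pandas` `DataFrame`. The column name `species` and the variable `df` are our own choices for illustration, not part of scikit-learn.

```python
import pandas as pd
from sklearn import datasets

iris = datasets.load_iris()

# 150 plants with 4 measurements each, and one integer label per plant.
print(iris.data.shape)      # (150, 4)
print(iris.target.shape)    # (150,)
print(iris.feature_names)
print(iris.target_names)    # ['setosa' 'versicolor' 'virginica']

# Translate to a DataFrame: one column per feature.
df = pd.DataFrame(iris.data, columns=iris.feature_names)
# Indexing the array of names with the integer labels gives one name per row.
# The column name 'species' is an assumption made for this example.
df["species"] = iris.target_names[iris.target]
print(df.head())
```

Indexing `iris.target_names` with the whole `iris.target` array is what turns the integer codes into readable names.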

# The classification problem
* Given what we know about a thing (`iris.data`) 
* What species is it (`iris.target`)? 

# How we approach classification: 
* Take all data into account. 
* Think of the data as a function from `data` to `target`.
* Approximate that function. 

# Then, if there is a new iris to classify, 
* Use the function to predict what species it is. 

# Let's run the demo provided by scikit-learn: 

In [None]:
from sklearn.neighbors import KNeighborsClassifier

# Declare a KNN classifier. n_neighbors is k: the number of nearest neighbors that vote on each prediction.
knn = KNeighborsClassifier(n_neighbors=6)

# create a map between data and target. 
knn.fit(iris['data'], iris['target'])

# Provide data whose class labels are to be predicted
X = [
    [5.9, 1.0, 5.1, 1.8],
    [3.4, 2.0, 1.1, 4.8],
]

# Prints the data provided
print(X)

# Store predicted class labels of X
prediction = knn.predict(X)

# Prints the predicted class labels of X
print(prediction)

According to the predictor, both samples are species 1 (the possible labels are 0-2). 
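
Integer labels are hard to read, so here is a small sketch that repeats the demo and then looks up the species names. It also calls `predict_proba`, which for this classifier (with its default uniform weighting) reports the share of the 6 neighbors voting for each species. The variable names `species_names` and `probabilities` are ours.

```python
from sklearn import datasets
from sklearn.neighbors import KNeighborsClassifier

iris = datasets.load_iris()
knn = KNeighborsClassifier(n_neighbors=6)
knn.fit(iris.data, iris.target)

X = [
    [5.9, 1.0, 5.1, 1.8],
    [3.4, 2.0, 1.1, 4.8],
]
prediction = knn.predict(X)

# Translate integer labels into English species names.
species_names = iris.target_names[prediction]
print(species_names)

# One row per sample, one column per species; each row sums to 1.
probabilities = knn.predict_proba(X)
print(probabilities)
```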

* Writing such a predictor is a complex task that we study in COMP 135. 
* You can read up on it here: https://en.wikipedia.org/wiki/K-nearest_neighbors_algorithm

For now, suffice it to say that from enough measurements, one can form a prediction 
from the instances that have been observed so far. This prediction can be accurate or inaccurate 
based upon the prediction method. 
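
To make the "oracle" a little less mysterious, here is a minimal sketch of the k-nearest-neighbors idea: find the k closest observed instances and take a majority vote. This is not scikit-learn's actual implementation; the function `knn_predict` and the toy data are hypothetical and exist only for illustration.

```python
from collections import Counter

import numpy as np


def knn_predict(train_X, train_y, query, k=3):
    """Predict the label of `query` by majority vote among its k nearest training points."""
    train_X = np.asarray(train_X, dtype=float)
    query = np.asarray(query, dtype=float)
    # Euclidean distance from the query to every training row.
    distances = np.linalg.norm(train_X - query, axis=1)
    # Offsets of the k closest rows.
    nearest = np.argsort(distances)[:k]
    # Most common label among those neighbors.
    votes = Counter(train_y[i] for i in nearest)
    return votes.most_common(1)[0][0]


# Hypothetical toy data: two well-separated clusters in 2-D, labeled 0 and 1.
toy_X = [[1.0, 1.0], [1.2, 0.8], [0.9, 1.1], [5.0, 5.0], [5.2, 4.9], [4.8, 5.1]]
toy_y = [0, 0, 0, 1, 1, 1]

print(knn_predict(toy_X, toy_y, [1.1, 1.0]))  # 0
print(knn_predict(toy_X, toy_y, [5.1, 5.0]))  # 1
```

There is no training step beyond remembering the data, which is why the method is called non-parametric.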

# From whence comes accuracy
* You would be right to be suspicious of what I just did. 
* I didn't tell you anything at all about the prediction method. It is an "oracle". 
* How do we know that this worked? 

# Cross-validation
* Cross-validation is a standard technique in machine learning for testing classifiers. 
* Separate all feature data into 'training' and 'testing' subsets. 
* Train on the training subset. 
* Test on the testing subset. 
* See if you get the correct answers.
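
The steps above can be sketched with scikit-learn's own helper, `train_test_split`. The exercise below performs the same split by hand; this version is only a compact preview, and `random_state=0` is an arbitrary choice that makes the shuffle repeatable.

```python
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

iris = datasets.load_iris()

# Hold out 20 of the 150 rows for testing; the other 130 are used for training.
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=20, random_state=0
)

# Train only on the training subset.
model = KNeighborsClassifier(n_neighbors=6)
model.fit(X_train, y_train)

# score() returns the fraction of test rows that were classified correctly.
accuracy = model.score(X_test, y_test)
print(accuracy)
```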

# Let's do this. I'll help.
* This is a different kind of exercise. 
* This is a real cross-validation using a random split of the data. 
* There is no one "correct" answer. 
* I can check your answers for sanity but not for correctness. 

First let's select rows of the data to use as training and testing data. This recipe selects them randomly. 

In [None]:
import random
selections = list(range(len(iris.data)))
random.shuffle(selections)
training_selections = selections[:130]
testing_selections = selections[130:]

# What this does
* `random.shuffle` scrambles the numbers 0 through 149 in place. 
* `training_selections` is a list of the array offsets for a training set (130 rows). 
* `testing_selections` is a list of the array offsets for a testing set (the remaining 20 rows). 
* These are disjoint lists with no elements in common. 
* These represent a random sampling of the data in the iris dataset. 
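
If you want to convince yourself of the claims above, a quick check like the following works. It rebuilds the two lists from scratch (so it does not depend on earlier cells) and uses the hypothetical name `n_rows` for the 150 rows of `iris.data`.

```python
import random

n_rows = 150  # number of rows in iris.data
selections = list(range(n_rows))
random.shuffle(selections)
training_selections = selections[:130]
testing_selections = selections[130:]

# No offset appears in both lists...
assert set(training_selections).isdisjoint(testing_selections)
# ...and together they cover every row exactly once.
assert sorted(training_selections + testing_selections) == list(range(n_rows))

print(len(training_selections), len(testing_selections))  # 130 20
```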

In [None]:
print(training_selections)

In [None]:
print(testing_selections)

In [None]:
# Don't change this cell; just run it. 
from client.api.notebook import Notebook
ok = Notebook('Classification.ok')
ok.auth(inline=True)

1. Create an `array` `training_features` that consists of the rows that match `training_selections`. Look up how to do it. Hint: `iris.data` is an `array`. Use row selection for `np.array`. 
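
As a reminder of how row selection works, here is a small sketch on a made-up array. The names `toy`, `rows`, and `picked` are hypothetical; the same indexing pattern applies to `iris.data`.

```python
import numpy as np

# Hypothetical 4x2 array standing in for iris.data.
toy = np.array([[10, 11], [20, 21], [30, 31], [40, 41]])
rows = [2, 0]  # a list of row offsets, in the order we want them

# Indexing with a list of offsets returns those rows as a new array.
picked = toy[rows]
print(picked)
# [[30 31]
#  [10 11]]
```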

In [None]:
# Your answer:
training_features = iris.data[training_selections]
training_features

In [None]:
_ = ok.grade('q01')  # check answer for sanity

2. Create an `array` `training_targets` that consists of the targets corresponding to the selected training rows. 

In [None]:
# Your answer
training_targets = iris.target[training_selections]
training_targets

In [None]:
_ = ok.grade('q02')  # check answer for sanity

3. Using the pattern above, train a kNN classifier on the training data. Start with a new classifier, `knn2`, and train it only on the training data. Hint: You need the data from parts 1 and 2. 

In [None]:
# Your answer: 
# Declare a KNN classifier. n_neighbors is k: the number of nearest neighbors that vote on each prediction.
import numpy as np

knn2 = KNeighborsClassifier(n_neighbors=6)

# create a map between data and target. 
knn2.fit(training_features, training_targets)
knn2

4. Put the test data into an `array` `testing_features`, repeating what you did for training data. 

In [None]:
# Your answer: 
testing_features = iris.data[testing_selections]
testing_features

In [None]:
_ = ok.grade('q04')  # check answer for sanity

5. Run the predictor as above, but on the array `testing_features`. Put the result into `test_results`

In [None]:
# Your answer: 
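# Use knn2, which was trained only on the training rows.
# The earlier knn was trained on all 150 rows, so it has already seen the test rows.
test_results = knn2.predict(testing_features)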
test_results

In [None]:
_ = ok.grade('q05')  # check answer for sanity

6. Compute the expected outcomes and put them into the `array` `expected_results`. 

In [None]:
# Your answer: 
expected_results = iris.target[testing_selections]
expected_results

In [None]:
_ = ok.grade('q06')  # check answer for sanity

7. Count the number of identical answers between `test_results` and `expected_results` and place the result into `correct_answers`. 

In [None]:
# Your answer: 
correct_answers = (test_results == expected_results).sum()
correct_answers
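
A count is easier to interpret as a fraction of the test set. The sketch below shows the arithmetic on made-up results for a 20-row test set; the arrays are hypothetical and are not output from the classifier.

```python
import numpy as np

# Hypothetical results for a 20-row test set, for illustration only.
expected_results = np.array([0, 1, 2, 2, 1, 0, 0, 1, 2, 1, 0, 2, 1, 1, 0, 2, 2, 0, 1, 2])
test_results = expected_results.copy()
test_results[3] = 1  # pretend one row was misclassified

# Comparing two arrays gives True/False per row; sum() counts the True values.
correct_answers = (test_results == expected_results).sum()
accuracy = correct_answers / len(expected_results)
print(correct_answers, accuracy)  # 19 0.95
```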

# An afterword on cross-validation
* If you got a perfect result, you're lucky. 
* Classification algorithms aren't perfect. 
* If you run it again with a different random split, you may get an imperfect result. 
* Running the cross-validation multiple times gives one an idea of how accurate the classifier will be. 
* There are no "correct" answers to this. You just ran a random trial. 
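
One common way to run several trials at once is k-fold cross-validation, which is a different procedure from the single random split used above: the data is cut into k parts, and each part takes a turn as the test set. A hedged sketch using scikit-learn's `cross_val_score`, with `cv=5` as an arbitrary choice:

```python
from sklearn import datasets
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

iris = datasets.load_iris()

# 5-fold cross-validation: five train/test rounds, each row tested exactly once.
scores = cross_val_score(
    KNeighborsClassifier(n_neighbors=6), iris.data, iris.target, cv=5
)
print(scores)         # one accuracy per fold
print(scores.mean())  # average accuracy across the folds
```

The spread of the per-fold scores gives a sense of how much a single trial can vary.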

# When you are done with this notebook, 

* Save and checkpoint. 
* Ensure that the name of this file is precisely `04-02-classification.ipynb`. 
* Save and checkpoint the notebook. 


* If your Jupyter installation can download the notebook as a PDF,
    * (File >> Download as >> PDF via LaTeX (.pdf)), 
    * Rename the downloaded file to `<loginid>-04-02-classification.pdf`. For example, my filename would be `jsingh11-04-02-classification.pdf`.
    * Submit the file `<loginid>-04-02-classification.pdf` to Canvas.
* Otherwise 
    * (File >> Download as >> Notebook (.ipynb)), 
    * Rename the downloaded file to `<loginid>-04-02-classification.ipynb`. For example, my filename would be `jsingh11-04-02-classification.ipynb`.
    * Submit the file `<loginid>-04-02-classification.ipynb` to Canvas.