# Lab 10: Classification

This assignment covers **Chapter 17** from the textbook as well as lecture material from Weeks 12-13. Please complete this assignment by providing answers in cells after the question. Use **Code** cells to write and run any code you need to answer the question and **Markdown** cells to write out answers in words. After you are finished with the assignment, remember to download it as an **HTML file** and submit it in **ELMS**.

This assignment is due by **11:59pm on Thursday, April 30**. 

In [1]:
import numpy as np
from datascience import *

from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier

from sklearn.metrics import confusion_matrix
from sklearn.metrics import precision_score, recall_score
from sklearn.model_selection import cross_validate

Using the classifiers in `sklearn` is very similar to finding the least squares regression line using `LinearRegression`. We first create the model object, then fit the model with our data. In our case, we will fit the models on the train data, and then predict for the test data (using the `predict_proba` method).

This assignment will cover the main steps for applying machine learning models.
- Create train and test sets
- Fit the models using train set
- Predict using the test set
- Evaluate models using metrics such as precision and recall
- Make your conclusions

## Brazilian Sign Language

Brazilian Sign Language is a visual language used primarily by Brazilians who are deaf.  It is more commonly called Libras.  People who communicate with visual language are called *signers*.  Here is a video of someone signing in Libras:

In [2]:
from IPython.lib.display import YouTubeVideo
YouTubeVideo("mhIcuMZmyWM")

Programs like Siri or Google Now begin the process of understanding human speech by classifying short clips of raw sound into basic categories called *phones*.  For example, the recorded sound of someone saying the word "robot" might be broken down into several phones: "rrr", "oh", "buh", "aah", and "tuh".  Phones are then grouped together into further categories like words ("robot") and sentences ("I, for one, welcome our new robot overlords") that carry more meaning.

A visual language like Libras has an analogous structure.  Instead of phones, each word is made up of several *hand movements*.  As a first step in interpreting Libras, we can break down a video clip into small segments, each containing a single hand movement.  The task is then to figure out what hand movement each segment represents.

We can do that with classification!

The [data](https://archive.ics.uci.edu/ml/machine-learning-databases/libras/movement_libras.names) in this exercise come from Dias, Peres, and Biscaro, researchers at the University of Sao Paulo in Brazil.  They identified 15 distinct hand movements in Libras (probably an oversimplification, but a useful one) and captured short videos of signers making those hand movements.  (You can read more about their work [here](http://ieeexplore.ieee.org/Xplore/login.jsp?url=http%3A%2F%2Fieeexplore.ieee.org%2Fiel5%2F5161636%2F5178557%2F05178917.pdf&authDecision=-203). The paper is gated, so you will need to use the UMD Wi-Fi or VPN to access it.)

For each video, they chose 45 still frames from the video and identified the location (in horizontal and vertical coordinates) of the signer's hand in each frame.  Since there are two coordinates for each frame, this gives us a total of 90 numbers summarizing how a hand moved in each video.  Those 90 numbers will be our *attributes*.
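To make that layout concrete, here is a minimal sketch of how one video's 90 attributes could be viewed as 45 (x, y) pairs. The `row` values are made up for illustration and are not taken from the dataset; the only assumption is that the attributes are ordered x, y, x, y, ... frame by frame.

```python
import numpy as np

# Hypothetical row: 90 made-up numbers standing in for one video's attributes,
# ordered as Frame 1 x, Frame 1 y, Frame 2 x, Frame 2 y, ...
row = np.arange(90) / 100

# Each consecutive pair is one frame, so reshape into 45 rows of (x, y)
frames = row.reshape(45, 2)

x_positions = frames[:, 0]  # horizontal coordinate in each frame
y_positions = frames[:, 1]  # vertical coordinate in each frame

print(frames.shape)  # (45, 2)
```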

Each video is *labeled* with the kind of hand movement the signer was making in it.  Each label is one of 15 strings like "horizontal swing" or "vertical zigzag".

For simplicity, we're going to focus on distinguishing between just two kinds of movements: "horizontal straight-line" and "vertical straight-line".  We took the Sao Paulo researchers' original dataset, which was quite small, and used some simple techniques to create a much larger synthetic dataset.

These data are in the file `movements.csv`.  Run the next cell to load it.

In [3]:
movements = Table.read_table("movements.csv")
movements.take(np.arange(5))

Frame 1 x,Frame 1 y,Frame 2 x,Frame 2 y,Frame 3 x,Frame 3 y,Frame 4 x,Frame 4 y,Frame 5 x,Frame 5 y,Frame 6 x,Frame 6 y,Frame 7 x,Frame 7 y,Frame 8 x,Frame 8 y,Frame 9 x,Frame 9 y,Frame 10 x,Frame 10 y,Frame 11 x,Frame 11 y,Frame 12 x,Frame 12 y,Frame 13 x,Frame 13 y,Frame 14 x,Frame 14 y,Frame 15 x,Frame 15 y,Frame 16 x,Frame 16 y,Frame 17 x,Frame 17 y,Frame 18 x,Frame 18 y,Frame 19 x,Frame 19 y,Frame 20 x,Frame 20 y,Frame 21 x,Frame 21 y,Frame 22 x,Frame 22 y,Frame 23 x,Frame 23 y,Frame 24 x,Frame 24 y,Frame 25 x,Frame 25 y,Frame 26 x,Frame 26 y,Frame 27 x,Frame 27 y,Frame 28 x,Frame 28 y,Frame 29 x,Frame 29 y,Frame 30 x,Frame 30 y,Frame 31 x,Frame 31 y,Frame 32 x,Frame 32 y,Frame 33 x,Frame 33 y,Frame 34 x,Frame 34 y,Frame 35 x,Frame 35 y,Frame 36 x,Frame 36 y,Frame 37 x,Frame 37 y,Frame 38 x,Frame 38 y,Frame 39 x,Frame 39 y,Frame 40 x,Frame 40 y,Frame 41 x,Frame 41 y,Frame 42 x,Frame 42 y,Frame 43 x,Frame 43 y,Frame 44 x,Frame 44 y,Frame 45 x,Frame 45 y,Movement type
0.522768,0.769731,0.536186,0.749446,0.518625,0.757197,0.517752,0.756847,0.504951,0.726008,0.50008,0.712113,0.463555,0.712355,0.49873,0.736872,0.51472,0.754353,0.517935,0.748163,0.5082,0.734278,0.50004,0.726941,0.49291,0.71189,0.480587,0.715755,0.476772,0.723531,0.504372,0.717318,0.46351,0.70031,0.463217,0.693279,0.474777,0.722122,0.512079,0.73267,0.506785,0.731242,0.497417,0.723703,0.505879,0.726615,0.51537,0.741874,0.544376,0.741177,0.51367,0.714379,0.509508,0.715222,0.519559,0.704945,0.511828,0.69361,0.511366,0.685024,0.510194,0.686122,0.518486,0.694125,0.524232,0.68817,0.531254,0.672905,0.530833,0.672029,0.521013,0.621037,0.481328,0.586983,0.450996,0.576725,0.474634,0.585757,0.465209,0.572517,0.430172,0.547155,0.429693,0.531896,0.415799,0.516734,0.40249,0.528653,0.413692,0.510434,vertical straight-line
0.179546,0.658986,0.177132,0.656834,0.168157,0.664803,0.176407,0.654713,0.167577,0.635559,0.138276,0.633621,0.143817,0.633303,0.154967,0.643993,0.169151,0.646888,0.138409,0.62286,0.141052,0.638818,0.129957,0.644284,0.141763,0.643459,0.127024,0.641122,0.133745,0.63458,0.114496,0.632741,0.0891234,0.631917,0.0836099,0.630901,0.07445,0.621396,0.072605,0.635247,0.0506362,0.620064,0.0467104,0.62067,0.0531715,0.645212,0.0374171,0.634352,0.0182681,0.61547,-0.0197023,0.6088,-0.027299,0.605641,-0.0482872,0.594468,-0.0640002,0.588416,-0.0565593,0.582703,-0.0881633,0.586423,-0.0929613,0.600561,-0.0928198,0.609785,-0.107121,0.624372,-0.115449,0.613028,-0.140709,0.614448,-0.148999,0.607538,-0.179288,0.582983,-0.196426,0.612175,-0.195264,0.580151,-0.230368,0.577835,-0.250168,0.550737,-0.274717,0.571828,-0.258795,0.590663,-0.256045,0.578798,horizontal straight-line
0.805813,0.651365,0.832204,0.666023,0.834636,0.645757,0.826685,0.645685,0.816671,0.625701,0.810289,0.637001,0.819373,0.635922,0.827567,0.637587,0.813763,0.645346,0.824472,0.632012,0.82673,0.643524,0.817462,0.638418,0.804468,0.63604,0.830122,0.652033,0.828967,0.658297,0.850648,0.678696,0.845375,0.679893,0.858148,0.677961,0.852067,0.673301,0.849921,0.668893,0.84142,0.681652,0.869216,0.68519,0.857929,0.69222,0.868462,0.683252,0.843773,0.668541,0.848835,0.674522,0.843266,0.663946,0.830001,0.655817,0.825753,0.654858,0.822624,0.660058,0.818284,0.643763,0.796939,0.62913,0.789691,0.61749,0.772315,0.606656,0.773609,0.605172,0.76006,0.579637,0.728993,0.576794,0.726034,0.584777,0.705394,0.573393,0.693345,0.579456,0.693249,0.581378,0.684606,0.576406,0.670061,0.566151,0.642557,0.569876,0.629915,0.561387,horizontal straight-line
0.83942,0.564511,0.853031,0.560031,0.845024,0.549989,0.824814,0.546812,0.821869,0.5462,0.820898,0.536278,0.800887,0.525634,0.801667,0.542531,0.806793,0.553656,0.799924,0.576862,0.810348,0.571102,0.801704,0.57294,0.773529,0.561476,0.772628,0.565349,0.773298,0.566374,0.727042,0.553929,0.723279,0.579006,0.731698,0.593158,0.727945,0.606501,0.72577,0.644594,0.721218,0.642742,0.718306,0.65346,0.702917,0.676261,0.724201,0.707004,0.711995,0.708004,0.703505,0.708526,0.697355,0.711636,0.674235,0.737123,0.68839,0.735325,0.682767,0.741957,0.671688,0.739555,0.634614,0.737214,0.605281,0.713473,0.592041,0.713161,0.561725,0.714786,0.538708,0.703583,0.531588,0.718057,0.553363,0.737859,0.539013,0.719495,0.513489,0.721538,0.503373,0.719414,0.504463,0.731782,0.514171,0.730937,0.518139,0.738488,0.503466,0.730267,horizontal straight-line
0.5504,0.724639,0.548864,0.727437,0.559092,0.757221,0.576803,0.763471,0.579116,0.752175,0.581021,0.771376,0.588351,0.773922,0.604139,0.782165,0.603875,0.768626,0.608751,0.74764,0.601986,0.732743,0.599202,0.717549,0.607302,0.721427,0.620328,0.682498,0.603376,0.66756,0.61182,0.641005,0.571499,0.605139,0.563333,0.55631,0.532991,0.52395,0.514682,0.500591,0.530536,0.486458,0.522758,0.453329,0.515001,0.412563,0.502188,0.39027,0.503148,0.368665,0.501019,0.346839,0.512556,0.312493,0.47574,0.279755,0.476174,0.257592,0.473331,0.23701,0.492565,0.245318,0.510208,0.231261,0.509312,0.21478,0.507778,0.202246,0.506741,0.192624,0.502328,0.170399,0.488535,0.143743,0.495343,0.156119,0.510498,0.17154,0.538879,0.160089,0.531483,0.171206,0.55924,0.159821,0.539761,0.153518,0.520628,0.133368,0.503185,0.112633,vertical straight-line


### Step 1: Create train and test sets

First, let's split up the data into train and test sets. For this assignment, we will do a simple holdout set, assigning a random 20% of the data as the test data, and building the model on the remaining 80% of the data. 

<font color = 'red'>**Question 1. Create two Tables, one called `test` and one called `train`. The `test` table should contain a random 20% of the data, while the `train` Table should contain the other 80%.** </font>

*Hint:* You can shuffle the entire dataset (sample the whole dataset without replacement), then take the first 20% of the rows as your test data and the remaining rows as your train data.
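The idea behind the hint can be sketched with plain row indices. This is only an illustration on a hypothetical 10-row dataset; the names `num_rows`, `test_indices`, and `train_indices` are made up here, and the seed is there only to make the sketch repeatable.

```python
import numpy as np

# Hypothetical stand-in for the dataset: 10 rows identified by their index
num_rows = 10
rng = np.random.default_rng(0)  # seeded only so the sketch is repeatable

# Shuffle the row indices (every row appears exactly once, i.e. no replacement)
shuffled_indices = rng.permutation(num_rows)

# The first 20% of the shuffled indices form the test set, the rest the train set
test_size = int(num_rows * 0.2)
test_indices = shuffled_indices[:test_size]
train_indices = shuffled_indices[test_size:]

print(len(test_indices), len(train_indices))  # 2 8
```

Because the two index arrays are slices of one shuffled array, no row can end up in both sets.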

In [6]:
movements.num_rows * .2

192.0

In [7]:
shuffled_movements = movements.sample(with_replacement = False)




### Step 2: Fitting the models

Let's fit three different models: Logistic Regression, Decision Trees, and K-Nearest Neighbors. The model objects are initialized using the following code.

In [8]:
# Create the model objects
logit = LogisticRegression(solver = 'liblinear')
tree = DecisionTreeClassifier()
knn = KNeighborsClassifier(n_neighbors = 7)

Note that we are using the default values for many of these. We generally would want to try a variety of different models with different parameters (for example, using different values for the stopping criteria in decision trees). We'll stick with just the defaults for this assignment. You can now use the `.fit` method to fit each model to the data.

<font color = 'red'>**Question 2. Using the `train` Table you created above, fit each of the three models.**</font>

If you're not sure about the exact format of the data needed, remember that you need to use `.rows` for the `X` values and `.column` for the `y` values. See lecture material and Labs 8 and 9 for how we fit linear regression models if you're still unsure of how to proceed. 

After you fit the models, you can use the `.classes_` instance variable to check the classes that are being predicted.

In [None]:
knn.classes_
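As a rough illustration of the create, fit, and inspect pattern, here is a sketch on a tiny made-up dataset with 2 features instead of 90. The names `X_train`, `y_train`, and `toy_knn` are hypothetical, and `n_neighbors=3` is chosen only because the toy data has six rows.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Made-up training data: 2 features per example instead of 90
X_train = np.array([[0.0, 0.1], [0.1, 0.0], [0.2, 0.1],
                    [0.9, 1.0], [1.0, 0.9], [0.8, 1.0]])
y_train = np.array(["horizontal straight-line"] * 3 +
                   ["vertical straight-line"] * 3)

# Create the model object, then fit it with features (X) and labels (y)
toy_knn = KNeighborsClassifier(n_neighbors=3)
toy_knn.fit(X_train, y_train)

# classes_ lists the labels in sorted order; predict_proba columns follow that order
print(toy_knn.classes_)
toy_scores = toy_knn.predict_proba(np.array([[0.05, 0.05]]))
print(toy_scores)
```

The new point sits next to the three "horizontal straight-line" examples, so all three of its nearest neighbors carry that label and the first column of `toy_scores` is 1.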

### Step 3: Predict Test Set

Where `LinearRegression` has a `.predict` method, the classifier model objects also have a `.predict_proba` method. The `.predict_proba` method provides a score between 0 and 1 for each class. We need to select one of the two classes as the "positive" outcome. Since there isn't one outcome we're particularly interested in out of the two possibilities, we will just choose "horizontal straight-line" to be positive.

> In other situations, there might be one that you care more about. For example, if you are trying to predict recidivism in order to provide social programs to help released prisoners at higher risk of returning, then your "positive" outcome -- the outcome you are trying to predict -- would be "return", so that is what you would want to set as `True` below. 

We can't simply evaluate using the scores, though. We need to set a threshold so that we predict that an observation is "horizontal straight-line" or not. Here, we will set the threshold to be 0.5, but remember, there is nothing special about this number. We could have easily selected any other number between 0 and 1, and we should, in practice, try a wide range of values and see how our models perform with each of these thresholds.

An example using K-Nearest Neighbors is shown below.

In [None]:
# Setting a threshold (above means predicted to be horizontal straight-line)
threshold = 0.5

In [None]:
# Make sure you fit the model before running this!
test_features = test.select(np.arange(0,90)).rows
knn_predicted = knn.predict_proba(test_features)[:,0] > threshold

Note that we are only taking the first value from each row, as evidenced by the `[:,0]`. This is because we want the predicted scores for horizontal straight-line, and that is the first value (see the `.classes_` instance variable above). Scores are given for each of the categories, but we only need one of them, because the scores in each row add up to 1.
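The slicing and thresholding step can be seen on its own with a small made-up score array. The values in `scores` are invented for illustration and only mimic the shape of `predict_proba` output (one row per example, one column per class).

```python
import numpy as np

# Made-up scores shaped like predict_proba output: one row per example,
# column 0 = "horizontal straight-line", column 1 = "vertical straight-line"
scores = np.array([[0.9, 0.1],
                   [0.4, 0.6],
                   [0.5, 0.5],
                   [0.7, 0.3]])

threshold = 0.5

# Keep only the first column, then compare each score against the threshold
horizontal_scores = scores[:, 0]
predicted = horizontal_scores > threshold

print(predicted.tolist())  # [True, False, False, True]
```

Note that the comparison is strict, so the third example, whose score is exactly 0.5, is not predicted to be positive.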

<font color = 'red'>**Question 3. Create an array called `expected` that contains the expected values based on the test set. The `expected` array should contain `True` if the observation is actually a "horizontal straight-line" and `False` otherwise. Note that the `expected` array is based entirely on the real dataset, and not on any predictions! Create additional arrays that contain the predicted values for each of the models that we've fit (call them `logit_predicted` and `tree_predicted`).**</font>

### Step 4: Evaluate

You can get a confusion matrix using the `confusion_matrix` function that we imported at the beginning. This is part of the `sklearn.metrics` module.

In [None]:
conf_matrix = confusion_matrix(expected,knn_predicted)

In [None]:
conf_matrix

The columns represent predictions and the rows represent actual values, so the top left is true negatives, the bottom right is true positives, the top right is false positives, and the bottom left is false negatives.
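To see where each of the four counts comes from, here is a sketch that builds the same layout by hand with `numpy`. The `actual` and `predicted_labels` arrays are made up for illustration and are not results from any of the models.

```python
import numpy as np

# Made-up actual and predicted labels (True = positive class)
actual = np.array([True, True, True, False, False,
                   False, False, False, True, False])
predicted_labels = np.array([True, False, True, False, True,
                             False, False, True, True, False])

tn = np.sum(~actual & ~predicted_labels)  # actual negative, predicted negative
fp = np.sum(~actual & predicted_labels)   # actual negative, predicted positive
fn = np.sum(actual & ~predicted_labels)   # actual positive, predicted negative
tp = np.sum(actual & predicted_labels)    # actual positive, predicted positive

# Rows are actual values, columns are predictions
matrix = np.array([[tn, fp],
                   [fn, tp]])
print(matrix.tolist())  # [[4, 2], [1, 3]]
```

The four counts always add up to the number of examples, which is a quick sanity check on any confusion matrix.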

#### Evaluation metrics

Two metrics that are often more relevant than overall accuracy are **precision** and **recall**. 

Precision measures the accuracy of the classifier when it predicts an example to be positive. It is the ratio of correctly predicted positive examples to examples predicted to be positive. 

$$ Precision = \frac{TP}{TP+FP}$$

Recall measures the ability of the classifier to find positive examples in the data. It is the ratio of correctly predicted positive examples to examples that are actually positive.

$$ Recall = \frac{TP}{TP+FN} $$

By selecting different thresholds we can vary and tune the precision and recall of a given classifier. A conservative classifier (threshold 0.99) will classify a case as positive only when it is *very sure*, which typically leads to high precision. On the other end of the spectrum, a low threshold (e.g. 0.01) will lead to higher recall.
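The trade-off can be illustrated with a handful of made-up scores. In this sketch, `positive_scores`, `is_positive`, and the helper `precision_and_recall` are hypothetical names introduced only for the example; the formulas are the ones given above.

```python
import numpy as np

# Made-up scores for the positive class and the true labels
positive_scores = np.array([0.95, 0.80, 0.60, 0.40, 0.30, 0.05])
is_positive = np.array([True, True, False, True, False, False])

def precision_and_recall(threshold):
    """Return (precision, recall) when predicting positive above the threshold."""
    predicted_positive = positive_scores > threshold
    tp = np.sum(predicted_positive & is_positive)
    fp = np.sum(predicted_positive & ~is_positive)
    fn = np.sum(~predicted_positive & is_positive)
    return tp / (tp + fp), tp / (tp + fn)

for t in [0.2, 0.5, 0.9]:
    p, r = precision_and_recall(t)
    print(t, round(p, 2), round(r, 2))
```

With these particular numbers, raising the threshold from 0.2 to 0.9 moves precision from 0.6 up to 1.0 while recall drops from 1.0 to about 0.33. A threshold so high that nothing is predicted positive would make precision undefined (division by zero), which this sketch does not handle.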

We can use the `precision_score` and `recall_score` functions to find the value of these measures.

In [None]:
precision_score(expected,knn_predicted)

In [None]:
recall_score(expected,knn_predicted)

<font color = 'red'>**Question 4. Find the confusion matrix for the Logistic Regression and Decision Tree models. Using the confusion matrix, compute the accuracy, precision, and recall. Afterwards, use the `precision_score` and `recall_score` functions to verify your answer.**</font>

### Step 5: Repeating the steps

We've done one iteration ... but we've only done it with one threshold, and we haven't tuned the parameters much. We won't go through all of the various ways we can fine-tune our models, but we can show how it is done: using loops.

<font color = 'red'>**Question 5. Write a loop that tries thresholds of .1, .3, .5, .7, and .9. Store the precision of each model at each threshold within their own arrays, named `knn_precision`, `logit_precision` and `tree_precision`. Do the same for recall (replacing precision with recall in the naming convention).**</font>

The loop has been started below for you.

In [None]:
knn_precision = make_array()
logit_precision = make_array()
tree_precision = make_array()
    
knn_recall = make_array()
logit_recall = make_array()
tree_recall = make_array()

# Set up models and fit them here
...

for i in make_array(.1,.3,.5,.7,.9):
    ...
    knn_precision = np.append(knn_precision, ...)
    logit_precision = np.append(logit_precision, ...)
    tree_precision = np.append(tree_precision, ...)
    
    knn_recall = np.append(knn_recall, ...)
    logit_recall = np.append(logit_recall, ...)
    tree_recall = np.append(tree_recall, ...)
    
# You can use these to look at results
precision_results = Table().with_columns('Threshold', make_array(.1, .3, .5, .7, .9),
                              'KNN Precision', knn_precision,
                              'Logistic Regression Precision', logit_precision,
                              'Decision Tree Precision', tree_precision)

recall_results = Table().with_columns('Threshold', make_array(.1, .3, .5, .7, .9),
                              'KNN Recall', knn_recall,
                              'Logistic Regression Recall', logit_recall,
                              'Decision Tree Recall', tree_recall)

### Step 6: Model Selection and Conclusions

Generally, when deciding on the best model, we compare the models we fit with each other, as well as against a baseline.
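One common kind of baseline is a "model" that ignores the features entirely. The sketch below simulates coin-flip guessing on made-up labels; the 30% share of positives, the sample size, and the seed are all assumptions chosen for illustration and say nothing about the `movements` data.

```python
import numpy as np

rng = np.random.default_rng(42)  # seeded only so the sketch is repeatable

# Made-up labels: roughly 30% of 100,000 examples are positive
labels = rng.random(100_000) < 0.3

# Baseline "model": flip a fair coin for every example, ignoring the features
random_guess = rng.random(100_000) < 0.5

tp = np.sum(random_guess & labels)
fp = np.sum(random_guess & ~labels)
baseline_precision = tp / (tp + fp)

# Compare the baseline's precision with the share of positives in the labels
print(round(baseline_precision, 2), round(labels.mean(), 2))
```

Any fitted model should be judged by how far it improves on a number like this, not by its precision in isolation.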

<font color = 'red'>**Question 6. Suppose we were to choose the .5 threshold. If we were to try to predict whether a hand movement was "horizontal straight-line" or not completely randomly, how often would we be right? That is, what would be our precision if we were guessing completely randomly? How well does our best model perform compared to that baseline?**</font>