# Lab 11: Prediction

Please complete this lab by providing answers in cells after the question. Use **Code** cells to write and run any code you need to answer the question and **Markdown** cells to write out answers in words. After you are finished with the assignment, remember to download it as an **HTML file** and submit it in **ELMS**.

In [None]:
# These lines import the Numpy and Datascience modules.
import numpy as np
from datascience import *

# These lines do some fancy plotting magic
import matplotlib
%matplotlib inline
import matplotlib.pyplot as plt
plt.style.use('fivethirtyeight')

# This is for KNN
from sklearn.neighbors import KNeighborsClassifier

# For getting results from the test set
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score

## K-Nearest Neighbors

K-Nearest Neighbors (k-NN) is a classification algorithm. Given some *attributes* (also called *predictors* or *features*) of an unseen example, it decides which category that example belongs to based on its similarity to previously seen examples. In this lab there are two categories. The predicted category is called an *outcome* or *label*.
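
To make the idea of "similarity" concrete, here is a minimal sketch of the one-nearest-neighbor rule written with plain NumPy. The data points, labels, and the function name `nearest_neighbor_label` are made up for illustration, and the sketch assumes similarity is measured by straight-line (Euclidean) distance.

```python
import numpy as np

# Hypothetical training examples: two attributes per row, one label per row.
train_points = np.array([[1.0, 1.0], [2.0, 1.5], [8.0, 9.0], [9.0, 8.5]])
train_labels = np.array(['low', 'low', 'not low', 'not low'])

def nearest_neighbor_label(new_point, points, labels):
    """Return the label of the training point closest to new_point."""
    # Euclidean distance from new_point to every training point
    distances = np.sqrt(np.sum((points - new_point) ** 2, axis=1))
    # The closest training point decides the label
    return labels[np.argmin(distances)]

prediction = nearest_neighbor_label(np.array([1.5, 1.2]), train_points, train_labels)
print(prediction)
```

The new point sits next to the first two training examples, so it receives their label.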

### North Carolina Births Data

In this lab, we will work with a dataset of births in North Carolina.  A series of variables were collected, most notably the smoking habit of the mother as well as the birthweight of the baby and the low birthweight status of the baby. 

Similar to the lab using hypothesis tests, the key variable we are interested in is whether babies have lower birthweights. In particular, we are interested in being able to **predict whether a baby will be categorized as a low birthweight baby or not**. This is particularly important for hospitals, because they can use this information to determine which births might need extra care and allocate resources accordingly. That is, **if the hospital is able to accurately predict low birthweight babies, then they can make sure the birth goes as smoothly as possible because they will be prepared for it**.

Let's start by bringing in the data that we will use to **train**, or build, our K-Nearest Neighbors model. You can think about this as the data that we have collected from the past. The model built from it can then be used to try to predict what will happen in the future.

In [None]:
ncbirths_train = Table.read_table('ncbirths_train.csv')

In [None]:
ncbirths_train.show(5)

The `lowbirthweight` variable has two categories: `low` and `not low`. 

## Visualizing Relationships

Let's start by looking at the relationship between some of the variables and our outcome, `lowbirthweight`. We'll create a scatterplot of mother's age and weeks of gestation, and color the points by the outcome.

In [None]:
ncbirths_train.scatter('mage', 'weeks', group = 'lowbirthweight')

It looks like there's a decent amount of separation between the babies that were low birthweight and not low birthweight, though there is some overlap. 

<font color = 'red'>**Question 1. Create scatterplots with observations colored by `lowbirthweight` for `mage` and `gained`, as well as `weeks` and `gained`. Does it look like there is separation between the groups in these graphs?**</font>

## Training a K-Nearest Neighbors Model

To train a model using the K-Nearest Neighbors algorithm, we will use the tools within the `sklearn` package. The `sklearn` package contains lots of very useful tools for machine learning and predictive models. You may have noticed we brought in `KNeighborsClassifier` (as well as a few other functions we'll use later) at the beginning of the lab. We'll use that now to create our **model object**. To create the model object, we simply use `KNeighborsClassifier` with an argument of `n_neighbors` signifying how many neighbors to use (that is, what the value of k should be). To start, we'll just use 1.

In [None]:
# Create a model object
knn = KNeighborsClassifier(n_neighbors = 1)

You can think about the `knn` object as having all of the instructions to run the K-Nearest Neighbors algorithm. Note that we haven't provided any data yet, so it hasn't fit a model. We give it the data by calling its `fit` method with the `X` and `y` arguments.

We'll first create `predictor` and `outcome` objects to provide as arguments to the `fit` method. Note that we need to provide each of them in a certain format: the `X` needs to take in **Table rows** (since it can take in multiple variables), while the `y` can take in a **numpy array**.

> Be careful about how you define the `predictor` and `outcome` objects! You need to use `.select()` (or `.drop()`) and `.rows` for the predictor, and use `.column()` (or have an array of some sort) for the outcome.
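
If the two formats are hard to picture, the following sketch uses plain NumPy arrays in place of Table rows to show the shapes that `fit` expects. The numbers and the names `X_toy`, `y_toy`, and `toy_model` are made up for illustration and are not taken from the births data.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Made-up data: each inner list is one row of predictors (two columns here).
X_toy = np.array([[20, 39], [25, 40], [31, 30], [35, 28]])
# One label per row, in the same order as the rows of X_toy.
y_toy = np.array(['not low', 'not low', 'low', 'low'])

print(X_toy.shape)  # two-dimensional: (number of rows, number of predictors)
print(y_toy.shape)  # one-dimensional: (number of rows,)

# Fitting works the same way as with Table rows and a column array.
toy_model = KNeighborsClassifier(n_neighbors=1)
toy_model.fit(X=X_toy, y=y_toy)
```

The key point is that the predictors form a grid of rows and columns, while the outcome is a single sequence with one entry per row.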

In [None]:
# Define predictor and outcome
predictor = ncbirths_train.select('mage', 'weeks').rows
outcome = ncbirths_train.column('lowbirthweight')

# Fit the model
knn.fit(X = predictor, y = outcome)

You won't see any meaningful output here. That's ok! We've fit the model, and the model now exists within the `knn` object. We just need to access it to make predictions. We'll do that in a little bit, but first, let's try visualizing it to see what this model actually looks like.

<font color = 'red'>**Question 2. Train the KNN model again, using only the `weeks` and `gained` variables as the predictors. Start by initializing the model object as `knn2`, then fit using the two predictors (`weeks` and `gained`) and the outcome of `lowbirthweight`.**</font>

In [None]:
knn2 = ...

# Define predictor and outcome
predictor = ...
outcome = ...

# Fit the model
...

## Visualizing the Predictions

In order to visualize what combinations of mother's age and weeks would result in what predictions, we'll use the following `decision_graph` function to graph the **decision boundary**. The decision boundary is what we will use to visualize the model, and see what combinations of the predictors result in what predictions. 

The `decision_graph` function takes in 7 arguments:
- model: the model object
- x: the predictor variable to plot on the x-axis
- x_min: the minimum value of x
- x_max: the maximum value of x
- y: the predictor variable to plot on the y-axis
- y_min: the minimum value of y
- y_max: the maximum value of y

> Note that these x and y are both predictors and are NOT the same as the `X` and `y` arguments from the `.fit()` method.

Within this function, we take the following steps:
- Make a grid (`test_grid`) of combinations of x and y points spanning the x and y minimum and maximum values, incrementing by 0.25.
- Predict for each combination of x and y variables. 
- Create a scatterplot of the grid.
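
The grid-building step above can also be sketched with `np.meshgrid`, which pairs every x value with every y value without writing loops. This is an alternative illustration, not the code used in `decision_graph`, and the ranges here are hypothetical and kept tiny so the result is easy to inspect.

```python
import numpy as np

# Hypothetical ranges, chosen only to keep the example small.
x_values = np.arange(0, 1, 0.25)     # 0.0, 0.25, 0.5, 0.75 (the stop value is excluded)
y_values = np.arange(10, 11, 0.25)   # 10.0, 10.25, 10.5, 10.75

# meshgrid pairs every x value with every y value.
# indexing='ij' keeps x fixed while y cycles, like a nested for loop.
x_grid, y_grid = np.meshgrid(x_values, y_values, indexing='ij')
grid_points = np.column_stack([x_grid.ravel(), y_grid.ravel()])

print(grid_points.shape)
print(grid_points[:5])
```

Four x values combined with four y values give sixteen grid points. Note that `np.arange` stops just before the maximum, so the maximum value itself is not part of the grid.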

In [None]:
def decision_graph(model, x, x_min, x_max, y, y_min, y_max):
    '''
    Displays a graph of the decision boundary for a given KNN model.
    
    Arguments:
    model: A model object
    x: A string representing the predictor variable to put on the x-axis.
    x_min: An integer representing the minimum value of x.
    x_max: An integer representing the maximum value of x.
    y: A string representing the predictor variable to put on the y-axis.
    y_min: An integer representing the minimum value of y.
    y_max: An integer representing the maximum value of y.
    
    Returns:
    None
    '''
    x_array = make_array()
    y_array = make_array()
    for i in np.arange(x_min, x_max, 0.25):
        for j in np.arange(y_min, y_max, 0.25):
            x_array = np.append(x_array, i)
            y_array = np.append(y_array, j)

    test_grid = Table().with_columns(
        x, x_array,
        y, y_array
    )
    
    grid_predictions = model.predict(test_grid.rows)
    test_grid_with_predictions = test_grid.with_column('Class', grid_predictions)
    test_grid_with_predictions.scatter(x, y, group = 'Class', alpha = 0.4, s=10)

In [None]:
decision_graph(knn, 'mage', 15, 50, 'weeks', 25, 45)

<font color = 'red'>**Question 3. Interpret the decision boundary for this model. Can you describe the combinations of mother's age and weeks of gestation that would lead to a prediction of low birthweight?**</font>

<font color = 'red'>**Question 4. Using the `decision_graph` function, draw the decision boundary for the `knn2` model using `weeks` on the x-axis and `gained` on the y-axis. How would you describe the predictions that would be made using this model?**</font>

*Hint:* Use the scatterplots from earlier to motivate the minimum and maximum values for `gained`.

### Understanding the Decision Boundary

According to the decision boundary we found above, it looks like there are some big areas of blue or yellow, but also some small islands. However, this is likely not how the relationship really works. In reality, the model is probably picking up on a few random occurrences in the training data, and we wouldn't actually want to predict that future observations inside the islands are low birthweight. To avoid being too sensitive to the exact data that we have, we can use bigger values of k and take a "vote" among the nearest few points instead of using just the single nearest neighbor.
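
Here is a small sketch of how the "vote" works, written by hand with NumPy. The data and the function name `knn_vote` are made up for illustration: one lone `low` point sits inside a cluster of `not low` points, much like an island in the decision graph.

```python
import numpy as np
from collections import Counter

# Made-up data: the fourth point is a lone 'low' inside a 'not low' cluster.
points = np.array([[1.0, 1.0], [1.2, 0.8], [0.8, 1.1],
                   [1.1, 1.0], [5.0, 5.0], [5.2, 4.9]])
labels = np.array(['not low', 'not low', 'not low', 'low', 'low', 'low'])

def knn_vote(new_point, points, labels, k):
    """Return the most common label among the k closest training points."""
    distances = np.sqrt(np.sum((points - new_point) ** 2, axis=1))
    nearest = np.argsort(distances)[:k]          # positions of the k closest points
    votes = Counter(labels[nearest].tolist())    # count each label among them
    return votes.most_common(1)[0][0]

new_point = np.array([1.08, 1.0])
print(knn_vote(new_point, points, labels, k=1))
print(knn_vote(new_point, points, labels, k=3))
```

With `k=1` the lone `low` point decides the prediction on its own. With `k=3` its two `not low` neighbors outvote it, so the island disappears.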

We'll try different values of k using the same predictors as the original `knn` model fit earlier, and compare how these decision boundaries look compared to what we saw above.

<font color = 'red'>**Question 5. Fit a KNN model again, once using `n_neighbors = 11` and once using `n_neighbors = 25`. Use `mage` and `weeks` as your predictors. Make sure to give each model object a different name.**</font>

In [None]:
# With k = 11
...

# Define predictor and outcome
predictor = ...
outcome = ...

# Fit the model
...

In [None]:
# With k = 25
...

# Define predictor and outcome
predictor = ...
outcome = ...

# Fit the model
...

<font color = 'red'>**Question 6. Using the `decision_graph` function, draw the decision boundary for the models fit using 11 neighbors and 25 neighbors. What does the decision boundary look like for each? How did it change in each instance?**</font>

## Adding Predictors

We can use more than just two variables to predict the outcome. We started with just `mage` and `weeks`, but we might be interested in using some others in addition to these two. Let's try adding `gained` to the model. This variable represents the amount of weight the mother gained during the pregnancy.

<font color = 'red'>**Question 7. Train the KNN model again, adding the `gained` variable to the list of predictors to be included. Start by initializing the model object as `knn_three_predictors`, then fit using the three predictors (`mage`, `weeks`, and `gained`) and the outcome of `lowbirthweight`. Use k = 11 for the number of neighbors.**</font>

In [None]:
knn_three_predictors = ...

# Define predictor and outcome
predictor = ...
outcome = ...

# Fit the model
...

We won't be able to visualize this as easily as we did above because we now have three predictors, but that's ok. We'll generally want to evaluate these models using their **performance** rather than based on the graphs.

## Predicting for Future Data

We've tried a few different values of k and different combinations of predictors. We've noticed that some higher values of k seem to do better than others. But there are many possible values of k. How do we decide which one is the best to use?

Remember, our ultimate goal here is to make predictions on low birthweight status. So, the choice of k should be the value that does the best at making predictions. Ideally, we would try to make some predictions on **new data** to see how well we are performing. In other words, if we had some more data, we could apply our models to that new data and compare how well the models with different values of k performed to decide which one was the best.

So, how do we get **new data**? In general, we actually hold out some of our data set so that we can see how our models would have performed **if we were to get new data**. In other words, we build our models on just part of our data so that we have the rest to test out how our models would have performed on new data. In this case, we've done that for you already by only including 80% of the data in the `ncbirths_train.csv` dataset. The rest was saved inside `ncbirths_test.csv`.
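
As a rough sketch of what holding out data can look like, the code below shuffles row positions and keeps the first 80% for training. This only illustrates the idea. It is not how the lab's files were actually created, and the row count and seed are arbitrary assumptions.

```python
import numpy as np

rng = np.random.default_rng(seed=0)   # the seed is arbitrary
n_rows = 10                           # pretend the full data set has 10 rows

# Shuffle the row positions so the split is random.
shuffled = rng.permutation(n_rows)

# Keep the first 80% for training and the remaining 20% for testing.
n_train = int(0.8 * n_rows)
train_idx = shuffled[:n_train]
test_idx = shuffled[n_train:]

print(len(train_idx), len(test_idx))
```

Every row lands in exactly one of the two groups, so the test rows are never seen while the model is being trained.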

Let's look at the `ncbirths_test` **test set** and see how well our models perform.

In [None]:
ncbirths_test = Table.read_table('ncbirths_test.csv')
ncbirths_test.show(5)

This was part of the same original dataset, but randomly sampled to be left out so that we could use it to **test the models**. This acts as our "future data", so we can see how our models might perform if we were to get actual future data.

Let's try making predictions on this data set. We need to get the data in our test set in the same format as the predictors, then pass it as an argument with the `.predict` method. We'll use the original `knn` model with just one neighbor and only two predictors, `mage` and `weeks`, to start out.

In [None]:
test_predictors = ncbirths_test.select('mage','weeks').rows
test_predictions = knn.predict(test_predictors)
test_predictions

<font color = 'red'>**Question 8. Find the predictions for your `knn_three_predictors` model. Assign these predictions the name `test_predictions_three_predictors`.**</font>

### Performance

So, how well did these predictions do? Let's check by using accuracy. **Accuracy** is simply the proportion of predictions that were correct. In other words, the numerator is the number of correct predictions in the test set, and the denominator is the number of cases in the test set. We can calculate the accuracy of our model using the `accuracy_score` function. If you tabulate actual labels against predicted labels in a **confusion matrix**, the correct predictions sit on the diagonal, so accuracy is the sum of the diagonal values divided by the total number of cases.
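
To see how these pieces fit together, here is a small sketch on made-up labels (not the births data). It computes accuracy by hand, from the diagonal of a confusion matrix, and with `accuracy_score`. The arrays `actual` and `predicted` are hypothetical.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, accuracy_score

# Made-up labels for six cases.
actual = np.array(['low', 'low', 'not low', 'not low', 'not low', 'low'])
predicted = np.array(['low', 'not low', 'not low', 'not low', 'low', 'low'])

# By hand: the proportion of cases where the prediction matches the actual label.
by_hand = np.mean(actual == predicted)

# Confusion matrix: rows are actual labels, columns are predicted labels.
matrix = confusion_matrix(actual, predicted, labels=['low', 'not low'])

# Correct predictions are on the diagonal.
from_matrix = np.trace(matrix) / np.sum(matrix)

# The same value from sklearn's helper.
from_sklearn = accuracy_score(actual, predicted)

print(matrix)
print(by_hand, from_matrix, from_sklearn)
```

All three calculations agree, because they are different ways of counting the same correct predictions.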

In [None]:
accuracy_score(ncbirths_test.column('lowbirthweight'), test_predictions)

<font color = 'red'>**Question 9. Find the confusion matrix and accuracy on the test dataset using `knn_three_predictors`. How does the accuracy differ? Which model do you think is better?**</font>

## Finding the Best Model

We've started doing a little bit of what is called **supervised machine learning**. In this example, we tried to build a model that is the best at making predictions by fitting multiple models using different values of k and different numbers of predictors and seeing how well each of them performed on the **test set**. Once we've decided on the best model, we would use that to make our predictions for any future incoming observations. So, in our example, a hospital might take the model that we decided on, then use that to predict which babies would be low birthweight, and use those predictions to make decisions on where to allocate their resources. 

Making a decision on which model is best is actually more complicated than this though. To think about why that might be, think about what our actual goal was at the beginning: identifying babies at risk of having a low birthweight. What if **we really valued identifying low birthweight babies highly**, even at the **cost of falsely predicting some normal weight babies as low birthweight**? Because of nuances in what we want our models to be the best at, we need to take some care when choosing which model is our "best model".
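
One way to see why accuracy alone may not settle the question is sketched below with made-up predictions from two hypothetical models. Both have the same accuracy, but they differ in how many of the actual low birthweight cases they catch. The function names `accuracy` and `share_of_low_caught` are illustrative only (the second quantity is often called recall).

```python
import numpy as np

# Made-up labels: three actual 'low' cases and five 'not low' cases.
actual = np.array(['low', 'low', 'low',
                   'not low', 'not low', 'not low', 'not low', 'not low'])
# Hypothetical model A misses two of the three low birthweight babies.
model_a = np.array(['not low', 'not low', 'low',
                    'not low', 'not low', 'not low', 'not low', 'not low'])
# Hypothetical model B catches all three, but raises two false alarms.
model_b = np.array(['low', 'low', 'low',
                    'low', 'low', 'not low', 'not low', 'not low'])

def accuracy(actual, predicted):
    """Proportion of all predictions that are correct."""
    return np.mean(actual == predicted)

def share_of_low_caught(actual, predicted):
    """Among babies that really were 'low', the proportion predicted 'low'."""
    is_low = actual == 'low'
    return np.mean(predicted[is_low] == 'low')

print(accuracy(actual, model_a), share_of_low_caught(actual, model_a))
print(accuracy(actual, model_b), share_of_low_caught(actual, model_b))
```

In this toy example both models are correct 75% of the time, yet a hospital that cares most about not missing low birthweight babies would likely prefer model B.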