# Other predictors

This exercise is a chance to practice working with predictors.

First, set up the tests and imports by running the cell below.

In [None]:
# Run this cell.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
plt.style.use('fivethirtyeight')

# These lines load the tests.
from client.api.notebook import Notebook
ok = Notebook('other_predictors.ok')

We studied prediction errors in the [the meaning of the mean](https://lisds.github.io/textbook/mean-slopes/mean_meaning) notebook.

In that notebook, you see the assertions that, *for any sequence of numbers*:

* The mean gives a sum of prediction errors of zero (and therefore, a mean
  prediction error of zero);
* The mean gives the lowest sum of squared prediction errors (and therefore the
  lowest mean squared prediction error).

As you remember, if you have a value $c$ that is a *predictor*, you get the prediction error for each element in your sequence by subtracting $c$ from that element.
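
As a small illustration of that definition, here is a sketch on a made-up sequence. The names `toy_seq` and `c`, and their values, are assumptions for illustration only; they are not part of the exercise data.

```python
import numpy as np

# Made-up sequence and candidate predictor, for illustration only.
toy_seq = np.array([2, 3, 7])
c = 5

# Prediction error for each element: the element minus the predictor.
errors = toy_seq - c
print(errors)  # [-3 -2  2]

# Using the mean (here 4.0) as the predictor, the errors sum to zero.
errors_from_mean = toy_seq - np.mean(toy_seq)
print(errors_from_mean)  # [-2. -1.  3.]
print(np.sum(errors_from_mean))  # 0.0
```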

To be more specific, let's look at some [data on chronic kidney disease](https://lisds.github.io/textbook/data/chronic_kidney_disease).

This is a data table with one row per patient and one column per test on that patient.  Many of the columns are values from blood tests.  Most of the patients have chronic kidney disease.

In [None]:
# Run this cell
ckd = pd.read_csv('ckd_clean.csv')
ckd.head()

We are interested in the column 'White Blood Cell Count'.

Make a new variable `wbc` that is a Series containing the "White Blood Cell Count" data.  Plot a histogram of these values:
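
As a reminder of the general pattern, here is a minimal sketch using a made-up table. The names `toy_table`, `toy_col` and the column label `Toy Count` are hypothetical and do not come from the kidney disease data.

```python
import pandas as pd

# Made-up table standing in for a real data file.
toy_table = pd.DataFrame({
    'Patient': ['a', 'b', 'c'],
    'Toy Count': [7800, 6700, 9600],
})

# Indexing with a column label returns that column as a Series.
toy_col = toy_table['Toy Count']
print(type(toy_col).__name__)  # Series
print(toy_col.mean())
```

In a notebook, `plt.hist(toy_col)` or `toy_col.hist()` would then draw a histogram of the values in the Series.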

In [None]:
wbc = ...
wbc.head()

In [None]:
_ = ok.grade('q_1_wbc')

Could these values plausibly have been drawn from a normal distribution?

Assign either 1, 2, or 3 to the name `wbc_likely_normal` below.

1. Yes, that's plausible.
2. There isn't enough evidence to be confident either way.
3. No, that's not plausible.

In [None]:
wbc_likely_normal = ...

In [None]:
_ = ok.grade('q_2_wbc_likely_normal')

## Mean square error

Make a function called `mean_sq_err` that accepts two inputs:

1. a sequence of numbers
1. a predictor (a single number)

It returns the mean of the squared prediction errors.

For example, say the sequence of numbers was `np.array([3, 4])`, and your
predictor was 5.  Then the sum of squared prediction errors is `(3 - 5) **
2 + (4 - 5) ** 2` = `5`, and the mean of the squared prediction errors is `5 / 2` = 2.5.
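
The same calculation can be written as a formula. The notation here is an assumption for illustration: $x_1, x_2, ..., x_n$ are the $n$ numbers in the sequence, and $c$ is the predictor.

$$
\text{MSE}(c) = \frac{1}{n} \sum_{i=1}^{n} (x_i - c)^2
$$

For the sequence $3, 4$ and $c = 5$, this gives:

$$
\frac{(3 - 5)^2 + (4 - 5)^2}{2} = \frac{4 + 1}{2} = 2.5
$$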


In [None]:
def mean_sq_err(seq, p):
    # Your code here
    ...

Run a simple test with the following:

In [None]:
print(mean_sq_err(np.array([3, 4]), 5))  # Should show 2.5
print(mean_sq_err(np.array([3, 5]), 4))  # Should show 1.0
print(mean_sq_err(np.array([2, 3, 5]), 4))  # Should show 2.0

In [None]:
_ = ok.grade('q_6_mse_func')

Use this function to calculate the mean squared error of `wbc` for candidate
predictors from 7000, up to, but not including 10000, in steps of 0.5.  Your
predictors should include 7000, 7000.5, 7001.0 ... 9999.5, and you should
calculate a mean squared error for `wbc`, for each predictor.
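
The pattern of stepping through candidate values can be sketched on a much smaller scale. Everything in this sketch is made up for illustration: the sequence `toy_seq`, the tiny range of `toy_candidates`, and the other names. The squared error is also written out inline here, rather than using your function.

```python
import numpy as np

# Made-up sequence and a deliberately tiny range of candidates.
toy_seq = np.array([2, 3, 7])
toy_candidates = np.arange(2, 5, 0.5)  # The stop value 5 is not included.
print(toy_candidates)  # [2.  2.5 3.  3.5 4.  4.5]

# One result per candidate, filled in by a loop.
n_candidates = len(toy_candidates)
toy_results = np.zeros(n_candidates)
for i in np.arange(n_candidates):
    candidate = toy_candidates[i]
    toy_results[i] = np.mean((toy_seq - candidate) ** 2)

print(toy_results)
# The candidate with the smallest result.
print(toy_candidates[np.argmin(toy_results)])  # 4.0
```

In this toy case, the best candidate (4.0) is also the mean of `toy_seq`.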

In [None]:
predictors = ...
mse_for_predictors = ...
# Show the first five mean squared error values.
mse_for_predictors[:5]

In [None]:
_ = ok.grade('q_7_mse_for_predictors')

Plot the `predictors` on the x axis against `mse_for_predictors` on the y axis.

In [None]:
#- Plot mse_for_predictors against predictors

Now calculate the mean squared error for `wbc` using the mean as a predictor.
Subtract this value from the minimum of `mse_for_predictors` and put the result
into the variable `best_vs_mean`:

In [None]:
best_vs_mean = ...
best_vs_mean

In [None]:
_ = ok.grade('q_8_best_vs_mean')

Calculate the median of `wbc`, calculate the mean squared error for `wbc` using the median as predictor, and subtract the mean squared error using the mean as predictor, putting the result into `median_vs_mean`.

In [None]:
mse_for_median = ...
median_vs_mean = ...
median_vs_mean

In [None]:
_ = ok.grade('q_9_median_vs_mean')

## Mean absolute error

You have dealt with one measure of a predictor: the mean squared prediction
error.

Another measure of a predictor is its ability to reduce the *absolute* error.

For example, say we have a sequence `3, 4`, and a predictor `5`.  The absolute
errors are `abs(3 - 5), abs(4 - 5)`, and the mean absolute error is then
(2 + 1) / 2 = 1.5.
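
Written as a formula, the mean absolute error looks like this. The notation is an assumption for illustration: $x_1, x_2, ..., x_n$ are the $n$ numbers in the sequence, and $c$ is the predictor.

$$
\text{MAE}(c) = \frac{1}{n} \sum_{i=1}^{n} |x_i - c|
$$

For the sequence $3, 4$ and $c = 5$, this gives:

$$
\frac{|3 - 5| + |4 - 5|}{2} = \frac{2 + 1}{2} = 1.5
$$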

Before you continue, take some time to think whether you think the mean or the
median will do a better job here.  Write down your answer *on the piece of
paper you already had next to you on the desk*!

Write a function `mean_abs_err` to calculate the mean absolute error for a
sequence `seq` and a predictor `p`.

*Hint*: remember the Numpy function to return the absolute values in an array.
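
As a reminder of how that function behaves, here is a minimal sketch on a made-up array. The names `toy_values` and `toy_abs` are hypothetical.

```python
import numpy as np

# Made-up values, some negative, some positive.
toy_values = np.array([-2, 1, -0.5])

# np.abs returns a new array with the sign removed from each element.
toy_abs = np.abs(toy_values)
print(toy_abs)  # [2.  1.  0.5]
```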

In [None]:
def mean_abs_err(seq, p):
    # Your code here
    ...

Run a simple test with the following:

In [None]:
print(mean_abs_err(np.array([3, 4]), 5))  # Should show 1.5
print(mean_abs_err(np.array([3, 5]), 4))  # Should show 1.0
print(mean_abs_err(np.array([2, 3, 5]), 4))  # Should show 1.333 ish

In [None]:
_ = ok.grade('q_10_mae_func')

Use this function to calculate the mean absolute error of `wbc` for the candidate
predictors you used before, from 7000 up to, but not including, 10000, in steps
of 0.5.  You should calculate a mean absolute error for `wbc`, for each predictor.

In [None]:
mae_for_predictors = ...
# Show the first five mean absolute error values.
mae_for_predictors[:5]

In [None]:
_ = ok.grade('q_11_mae_for_predictors')

Plot the `predictors` on the x axis against `mae_for_predictors` on the y axis.

In [None]:
#- Plot mae_for_predictors against predictors

Now calculate the mean absolute error for `wbc` using the mean as a predictor.
Subtract this value from the minimum of `mae_for_predictors` and put the result
into the variable `a_best_vs_mean`.

In [None]:
a_best_vs_mean = ...
a_best_vs_mean

In [None]:
_ = ok.grade('q_12_a_best_vs_mean')

Calculate the median of `wbc`, calculate the mean absolute error for `wbc` using
the median as predictor, and subtract the mean absolute error using the mean as
predictor, putting the result into `a_median_vs_mean`.

In [None]:
mae_for_median = ...
a_median_vs_mean = ...
a_median_vs_mean

In [None]:
_ = ok.grade('q_13_a_median_vs_mean')

Were you right in your speculation about whether the median or the mean would give the lower mean absolute error?

## Done

You're finished with the assignment!  Be sure to...

- **run all the tests** (the next cell has a shortcut for that),
- **Save and Checkpoint** from the "File" menu.

In [None]:
# For your convenience, you can run this cell to run all the tests at once!
import os
_ = [ok.grade(q[:-3]) for q in sorted(os.listdir("tests")) if q.startswith('q')]