# Lab 4: Plotting, smoothing, transformation

**If you are not attending lab, this assignment is due 09/19/2017 at 11:59pm (graded on accuracy).**

**If you are attending lab, you do not need to submit the assignment; you just need to get checked off by your TA.**

In [None]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
plt.style.use('fivethirtyeight')
sns.set()
sns.set_context("talk")
%matplotlib inline

In [None]:
# These lines load the tests.
# !pip install -U okpy

from IPython.display import display, Latex, Markdown
from client.api.notebook import Notebook
ok = Notebook('lab04.ok')

In [None]:
# ok.auth(force=True)
ok.auth(force=True)

## Objectives for Lab 04:

In this lab you will practice applying data transformations and working with kernel density estimators.  We will be working with data from the World Bank containing various statistics for countries and territories around the world.


In [None]:
wb = pd.read_csv("data/world_bank_misc.csv", index_col=0)
wb.head()

This table contains some interesting columns.  Take a look:

In [None]:
list(wb.columns)

# Part 1: Scaling

In the first part of this assignment we will be scaling the data to linearize visualizations.
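
As a small illustration of why rescaling can linearize a plot, here is a sketch on synthetic data (not the World Bank table; the arrays `x` and `y` and the growth rate 0.2 are made up for this example). When `y` grows exponentially in `x`, taking `log10(y)` turns the curve into a straight line:

```python
import numpy as np

# Hypothetical synthetic data: y grows exponentially with x.
x = np.linspace(0, 10, 50)
y = 3.0 * 10 ** (0.2 * x)

# On the raw scale the relationship is curved, so the linear correlation is below 1.
raw_corr = np.corrcoef(x, y)[0, 1]

# After the transform, log10(y) = log10(3) + 0.2 * x, which is exactly linear in x.
log_corr = np.corrcoef(x, np.log10(y))[0, 1]

# The fitted slope on the log scale recovers the growth rate.
slope, intercept = np.polyfit(x, np.log10(y), 1)
print(raw_corr, log_corr, slope)
```

The same idea applies to real data, although there the transformed relationship will only be approximately linear.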


## Question 1:

Extract the fields corresponding to the **adult literacy rate for females ages 15 and older (2005-14)** and the **gross national income per capita (Atlas method)** into a new dataframe.  Then drop any rows that have missing values.

In [None]:
df = pd.DataFrame(index=wb.index)
df['lit'] = ...
df['inc'] = ...
df.dropna(inplace=True)
print("Original records:", len(wb))
print("Final records:", len(df))

In [None]:
_ = ok.grade('q01')
_ = ok.backup()

## Question 2a:

Use the seaborn `distplot` tool to construct histograms for the adult literacy data and the income data:

In [None]:
plt.figure(figsize=(10,5))

plt.subplot(1,2,1)
# Make plot here

plt.xlabel("Adult literacy rate: Female: % ages 15 and older: 2005-14")



plt.subplot(1,2,2)
# Make plot here

plt.xlabel('Gross national income per capita, Atlas method: $: 2016')


## Question 2b

One of the above plots could benefit from a log transformation.  Which one?

In [None]:
needs_log_transformation = ... # lit or inc

In [None]:
_ = ok.grade('q02b')
_ = ok.backup()

## Question 2c

Remake the appropriate plot with the data transformed using `log10`. Be sure to correct the axis label:

In [None]:
plt.figure()
...

## Question 3a

Use the Seaborn `regplot` function to plot `y=inc` vs `x=lit`:

In [None]:
plt.figure(figsize=(10,5))
# Make plot here

## Question 3b 

Using the Tukey-Mosteller bulge diagram:

![The Bulge](https://i0.wp.com/f.hypotheses.org/wp-content/blogs.dir/253/files/2014/06/Selection_005.png?zoom=2&resize=295%2C318)

Create two new columns `trans_lit` and `trans_inc` that correct the plot:

In [None]:
plt.figure(figsize=(10,5))

df['trans_lit'] = df['lit'] # Change me
df['trans_inc'] = df['inc'] # Change me

sns.regplot(x='trans_lit', y='trans_inc', data=df)

In [None]:
_ = ok.grade('q03b')
_ = ok.backup()

# Part 2: Kernel Density Estimation

In this part of the lab you will implement a kernel density estimator.


Let's implement our own version of the KDE plot above.  Below we give you the Gaussian kernel function:

$$\Large
K_\alpha(x, z) = \frac{1}{\sqrt{2 \pi \alpha^2}} \exp\left(-\frac{(x - z)^2}{2  \alpha ^2} \right)
$$

In [None]:
def gaussian_kernel(alpha, x, z):
    return 1.0/np.sqrt(2. * np.pi * alpha**2) * np.exp(-(x - z) ** 2 / (2.0 * alpha**2))
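
A quick sanity check on the kernel, as a sketch: a valid kernel should integrate to 1 for any `alpha`, while its peak gets lower and wider as `alpha` grows. The grid, the Riemann-sum approximation, and the `areas`/`peaks` dictionaries below are choices made for this example only.

```python
import numpy as np

def gaussian_kernel(alpha, x, z):
    return 1.0/np.sqrt(2. * np.pi * alpha**2) * np.exp(-(x - z) ** 2 / (2.0 * alpha**2))

# Fine grid wide enough to cover essentially all of the kernel's mass.
grid = np.linspace(-50, 50, 100001)
dx = grid[1] - grid[0]

areas = {}
peaks = {}
for alpha in [0.5, 1.0, 3.0]:
    # Riemann-sum approximation of the integral of the kernel centered at z = 0.
    areas[alpha] = np.sum(gaussian_kernel(alpha, grid, 0.0)) * dx
    # Height of the kernel at its center.
    peaks[alpha] = gaussian_kernel(alpha, 0.0, 0.0)
    print(alpha, areas[alpha], peaks[alpha])
```

Each area should come out very close to 1, and the peak height shrinks as `alpha` increases.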

## Question 4a

Implement the KDE function which computes:

$$\Large
f_\alpha(x) = \frac{1}{n} \sum_{i=1}^n K_\alpha(x, z_i)
$$

where $\{z_i\}$ is the data and $\alpha$ is a parameter that controls the smoothness.
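
As a worked example of the formula (with made-up numbers, chosen only for illustration): take $n = 2$ data points $z_1 = 0$ and $z_2 = 2$, smoothing parameter $\alpha = 1$, and query point $x = 1$. Both data points are at distance 1 from $x$, so the two Gaussian kernel terms are equal and the estimate is:

$$
\frac{1}{2}\left[K_1(1, 0) + K_1(1, 2)\right] = \frac{1}{\sqrt{2\pi}} \exp\left(-\frac{1}{2}\right) \approx 0.242
$$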

In [None]:
def kde(kernel, alpha, x, data):
    """
    Compute the kernel density estimate for the single query point x.

    Args:
        kernel: a kernel function taking three arguments (alpha, x, z)
        alpha: the smoothing parameter to pass to the kernel
        x: a single query point (in one dimension)
        data: a numpy array of data points

    Returns:
        The smoothed estimate at the query point x
    """
    ...

In [None]:
_ = ok.grade('q04a')
_ = ok.backup()

Now let's test your function to generate a plot. You may find the ```np.linspace``` function helpful when plotting the KDE curve.
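
For reference, a minimal sketch of what `np.linspace` returns (the endpoints and count here are arbitrary): it produces evenly spaced values and includes both endpoints by default.

```python
import numpy as np

# Five evenly spaced points from 0 to 1, endpoints included.
xs = np.linspace(0.0, 1.0, 5)
print(xs.tolist())  # [0.0, 0.25, 0.5, 0.75, 1.0]
```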

In [None]:
alpha = 1.0
xs = np.linspace(df['trans_inc'].min(), df['trans_inc'].max(), 1000)
curve = [kde(gaussian_kernel, alpha, x, df['trans_inc']) for x in xs]
# density=True replaces the `normed` argument, which was removed from newer matplotlib releases
plt.hist(df['trans_inc'], density=True, color='orange')
plt.plot(xs, curve, 'k-')

## Question 4b

If we make the value of alpha much smaller, what happens?  Try setting alpha to a more appropriate value:

In [None]:
alpha = 1.0
xs = np.linspace(df['trans_inc'].min(), df['trans_inc'].max(), 1000)
curve = [kde(gaussian_kernel, alpha, x, df['trans_inc']) for x in xs]
# density=True replaces the `normed` argument, which was removed from newer matplotlib releases
plt.hist(df['trans_inc'], density=True, color='orange')
plt.plot(xs, curve, 'k-')

In [None]:
_ = ok.grade('q04b')
_ = ok.backup()

## Question 4c

How does alpha affect the results? What happens if we keep using a larger alpha on the transformed data?

In [None]:
q4c_answer = """

"""

## Question 4d

We can also try other kernel functions such as the [boxcar kernel](https://en.wikipedia.org/wiki/Boxcar_function).


In [None]:
def boxcar_kernel(alpha, x, z):
    return (((x-z)>=-alpha/2)&((x-z)<=alpha/2))/alpha
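
To make the boxcar kernel concrete, here is a small sketch that evaluates it at a few hand-picked points (the value `alpha = 2.0` and the arrays are arbitrary choices for illustration). The kernel equals `1/alpha` inside a window of width `alpha` centered at `z` and 0 outside, so its total area is 1.

```python
import numpy as np

def boxcar_kernel(alpha, x, z):
    return (((x-z)>=-alpha/2)&((x-z)<=alpha/2))/alpha

alpha = 2.0
x = np.array([-1.5, -1.0, 0.0, 1.0, 1.5])

# Points within alpha/2 = 1 of the center get height 1/alpha = 0.5; the rest get 0.
values = boxcar_kernel(alpha, x, 0.0)
print(values.tolist())  # [0.0, 0.5, 0.5, 0.5, 0.0]

# Riemann-sum check that the kernel has (approximately) unit area.
grid = np.linspace(-5, 5, 10001)
area = np.sum(boxcar_kernel(alpha, grid, 0.0)) * (grid[1] - grid[0])
print(area)
```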

We can plot the kernel functions to see what they look like:

In [None]:
from ipywidgets import interact
x = np.linspace(-10,10,1000)
def f(alpha):
    plt.plot(x, boxcar_kernel(alpha,x,0))
    plt.plot(x, gaussian_kernel(alpha,x,0))
    plt.show()
interact(f, alpha=(1,10,0.1))

Using the interactive plot below, compare the two kernel techniques.  (Generating the KDE plot is slow, so expect some latency after you move a slider.)

In [None]:
xs = np.linspace(df['trans_inc'].min(), df['trans_inc'].max(), 1000)
def f(alpha_g, alpha_b):
    # density=True replaces the `normed` argument, which was removed from newer matplotlib releases
    plt.hist(df['trans_inc'], density=True, color='orange')
    g_curve = [kde(gaussian_kernel, alpha_g, x, df['trans_inc']) for x in xs]
    plt.plot(xs, g_curve, 'k-', label='Gaussian')
    b_curve = [kde(boxcar_kernel, alpha_b, x, df['trans_inc']) for x in xs]
    plt.plot(xs, b_curve, 'r-', label='Boxcar')
    plt.legend()  # display the 'Gaussian' and 'Boxcar' labels set above
    plt.show()
interact(f, alpha_g=(0.01,.5,0.01), alpha_b=(0.01,3,0.1))

How does the boxcar KDE plot compare to the previous plot using the Gaussian kernel?

In [None]:
answer="""

"""

## Submission

Congrats! You are finished with this assignment. For convenience, we've included a cell below that runs all the OkPy tests.

In [None]:
import os
print("Running all tests...")
_ = [ok.grade(q[:-3]) for q in os.listdir("ok_tests") if q.startswith('q')]

Now, run the cell below to submit your assignment to OkPy. The autograder should email you shortly with your autograded score. The autograder will only run once every 30 minutes.

**If you're failing tests on the autograder but pass them locally**, you should simulate the autograder by doing the following:

1. In the top menu, click Kernel -> Restart and Run all.
2. Run the cell above to run each OkPy test.

**You must make sure that you pass all the tests when running steps 1 and 2 in order.** If you are still failing autograder tests, you should double check your results.

In [None]:
_ = ok.submit()