In [None]:
# Initialize Otter
import otter
grader = otter.Notebook("lab07.ipynb")

# Lab 7: Visualization, Transformations, and KDEs

In this lab you will get some practice plotting, applying data transformations, and working with kernel density estimators (KDEs).  We will be working with data from the World Bank containing various statistics for countries and territories around the world. 



## Setup

Note that we configure a custom default figure size. Virtually every default aspect of matplotlib [can be customized](https://matplotlib.org/users/customizing.html).

In [None]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

plt.style.use('fivethirtyeight') # Use plt.style.available to see more styles
sns.set()
sns.set_context("talk")
plt.rcParams['figure.figsize'] = (8, 5)
%matplotlib inline

### Get the Data

Let us load some World Bank data into a `pd.DataFrame` object named `wb`.

In [None]:
wb = pd.read_csv("data/world_bank_misc.csv", index_col=0)
wb.head()

This table contains some interesting columns.  Take a look:

In [None]:
list(wb.columns)

<br/><br/>
<hr style="border: 5px solid #8a8c8c;" />
<hr style="border: 1px solid #ffcd00;" />


## Part 1: Scaling 



In the first part of this assignment we will look at the distribution of values for combined adult literacy rate as well as the gross national income per capita. The code below creates a copy of the DataFrame that contains only the two Series we want, and then drops all rows that contain null values in either column.

**Note:** For this lab we are dropping null values without investigating them further. However, this is generally not the best practice and can severely affect our analyses.

Here the combined literacy rate is the sum of the female and male literacy rates as reported by the World Bank. 0 represents no literacy, and 200 would represent total literacy by both genders that are included in the World Bank's dataset.

In this lab, we will be using the `sns.histplot`, `sns.rugplot`, and `sns.displot` functions to visualize distributions. You may find it useful to consult the seaborn documentation on [distributions](https://seaborn.pydata.org/tutorial/distributions.html) and [functions](https://seaborn.pydata.org/tutorial/function_overview.html) for more details.

In [None]:
#creates a DataFrame with the appropriate index
df = pd.DataFrame(index=wb.index)

#copies the Series we want
df['lit'] = wb['Adult literacy rate: Female: % ages 15 and older: 2005-14'] + \
  wb["Adult literacy rate: Male: % ages 15 and older: 2005-14"]
df['inc'] = wb['Gross national income per capita, Atlas method: $: 2016']

#the line below drops all records that have a NaN value in either column
df.dropna(inplace=True)
print("Original records:", len(wb))
print("Final records:", len(df))

In [None]:
df.head(5)

<br>

--- 

### Question 1a 

Suppose we wanted to build a histogram of our data to understand the distribution of literacy rates and income per capita individually. We can use [seaborn's `countplot`](https://seaborn.pydata.org/generated/seaborn.countplot.html) to create bar charts from categorical data. 

In [None]:
sns.countplot(x = "lit", data = df)
plt.xlabel("Combined literacy rate: % ages 15 and older: 2005-14")
plt.title('World Bank Combined Adult Literacy Rate')

In [None]:
sns.countplot(x = "inc", data = df)
plt.xlabel('Gross national income per capita, Atlas method: $: 2016')
plt.title('World Bank Gross National Income Per Capita')

<!-- BEGIN QUESTION -->

In the cell below, explain why `countplot` is NOT the right tool for visualizing the distribution of our data.

_Type your answer here, replacing this text._

<!-- END QUESTION -->

<!-- BEGIN QUESTION -->

<br>

---

### Question 1b

In the cell below, create plots of the **literacy rate** and **income per capita** (the two plots above) using the [seaborn `histplot`](https://seaborn.pydata.org/generated/seaborn.histplot.html) function. Display the plots as two subplots, where the top subplot is literacy and the bottom subplot is income. See the [matplotlib `subplots`](https://matplotlib.org/stable/api/_as_gen/matplotlib.pyplot.subplots.html) function and the `ax` parameter of `histplot`.

Don't forget to title the plot and label axes!

You may need to change the size of the figure (and the font sizes for all the labels). 

**Hint:** *Copy and paste from above to start.*

In [None]:
...

<!-- END QUESTION -->

<br>

In the cell below, we explore overlaying a rug plot on top of a histogram using `rugplot`. Note that the rug plot is hard to see.

In [None]:
sns.histplot(x="inc", data = df)
sns.rugplot(x="inc", data = df)
plt.xlabel('Gross national income per capita, Atlas method: $: 2016')
plt.title('World Bank Gross National Income Per Capita')

One way to make it easier to see the difference between the rug plot and the bars is to set a different color, for example:

In [None]:
sns.histplot(x="inc", data = df, color = "lightsteelblue")
sns.rugplot(x="inc", data = df)
plt.xlabel('Gross national income per capita, Atlas method: $: 2016')
plt.title('World Bank Gross National Income Per Capita')

There is also another function called `kdeplot` which plots a Kernel Density Estimate as described in class, and covered in more detail later in this lab.



<!-- BEGIN QUESTION -->

<br>

---

### Question 1c 

Rather than manually calling `histplot`, `rugplot`, and `kdeplot` to plot histograms, rug plots, and KDE plots, respectively, we can instead use `displot`, which can simultaneously plot histogram bars, a rug plot, and a KDE plot, and adjust all the colors automatically for visibility. Using the documentation for [`displot`](https://seaborn.pydata.org/generated/seaborn.displot.html), make a plot of the income data that includes a histogram, rug plot, and KDE plot.

Hint: You'll need to set two parameters to `True`.  Also, use a parameter to ensure the figure is approximately (8, 5) in size. 

In [None]:
...

You should see roughly the same histogram as before. However, now you should see an overlaid smooth line. This is the kernel density estimate discussed in class. 

<!-- END QUESTION -->

In the figure above, the y-axis is labeled by the counts. We can also label the y-axis by the density. An example is given below, this time using the literacy data from the beginning of this lab.

In [None]:
sns.displot(x="lit", data = df, rug = True, kde = True, stat = "density")
plt.xlabel("Adult literacy rate: Combined: % ages 15 and older: 2005-14", fontsize=12)
plt.title('World Bank Combined Adult Literacy Rate')

Observations:
* The y-axis value is no longer the count. Instead, it is a value such that the total **area** in the histogram is 1. For example, the area of the last bar is approximately 22.22 * 0.028 = 0.62.

* The KDE is a smooth estimate of the distribution of the given variable. The area under the KDE is also 1. While it is not obvious from the figure, some of the area under the KDE is beyond the 100% literacy. In other words, the KDE is non-zero for values greater than 100%. This, of course, makes no physical sense. Nonetheless, it is a mathematical feature of the KDE.

We'll talk more about KDEs later in this lab.
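
The area property described in the observations can be checked numerically. The following is a minimal sketch, assuming a made-up sample (`sample` is hypothetical and is not the World Bank data). It uses `np.histogram` with `density=True`, which rescales the bar heights so that the bars' total area is 1.

```python
import numpy as np

# Hypothetical sample standing in for a column such as df['lit'].
rng = np.random.default_rng(0)
sample = rng.uniform(60, 200, size=150)

# density=True rescales the bar heights so that the total area of the bars is 1.
heights, edges = np.histogram(sample, bins=10, density=True)
widths = np.diff(edges)
total_area = np.sum(heights * widths)

print(f"total area of the bars: {total_area:.6f}")
print(f"area of the last bar:   {heights[-1] * widths[-1]:.3f}")
```

The area of each bar equals the fraction of observations that fall in that bin, which is why the areas add up to 1.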

<!-- BEGIN QUESTION -->

<br>

---

### Question 1d 

Looking at the income data, it is difficult to see the distribution among low income countries because they are all scrunched up at the left side of the plot. The KDE also has a problem where the density function has a lot of area below 0. 

Transforming the `inc` data logarithmically gives us a more symmetric distribution of values. This can make it easier to see patterns.

In addition, summary statistics like the mean and standard deviation (square-root of the variance) are more stable with symmetric distributions.

In the cell below, make a distribution plot of `inc` with the data transformed using `np.log10` and `kde=True`. If you want to see only the histogram bars, set `kde=False`; this is also what `displot` does by default when the `kde` parameter is not specified.

**Hint:** Unlike the examples above, you can pass a series to the `displot` function, i.e. rather than passing an entire DataFrame as `data` and a column as `x`, you can instead pass a series.  Use a parameter to ensure the figure size is approximately (8,5). 

In [None]:
ax = ...
plt.title('World Bank Gross National Income Per Capita')
plt.ylabel('Density')
plt.xlabel('Log Gross national income per capita, Atlas method: $: 2016', 
           fontsize=16);

<!-- END QUESTION -->

When a distribution has a long right tail, a log-transformation often does a good job of symmetrizing the distribution, as it did here.  Long right tails are common with variables that have a lower limit on the values. 
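
To make the effect of the log transformation concrete, here is a small sketch on synthetic data. The `income` sample is an assumption for illustration only (it is generated so that its base-10 logarithm is normally distributed) and is not the World Bank data. In a long right tail the mean is pulled well above the median, while after the transformation the two nearly coincide.

```python
import numpy as np

# Hypothetical right-skewed "income" sample: log10(income) is drawn from a
# normal distribution, so income itself has a long right tail.
rng = np.random.default_rng(0)
income = 10 ** rng.normal(loc=3.5, scale=0.6, size=5000)
log_income = np.log10(income)

# In a right-skewed sample the mean sits far above the median.
raw_ratio = np.mean(income) / np.median(income)
# In a roughly symmetric sample the mean and median are close together.
log_gap = abs(np.mean(log_income) - np.median(log_income))

print(f"raw:   mean / median  = {raw_ratio:.2f}")
print(f"log10: |mean - median| = {log_gap:.3f}")
```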

On the other hand, long left tails are common with distributions of variables that have an upper limit, such as percentages (can't be higher than 100%) and GPAs (can't be higher than 4).  That is the case for the literacy rate. Typically, a power transformation such as squaring or cubing the values can help symmetrize a left-skewed distribution.
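
As a rough illustration of how raising values to a power reduces left skew, the sketch below uses a synthetic percentage-like sample. The `pct` data and the `skewness` helper are both assumptions introduced here, not part of the lab. Skewness is negative for a long left tail and close to 0 for a symmetric sample.

```python
import numpy as np

def skewness(values):
    """Sample skewness: the mean of the cubed z-scores (about 0 if symmetric)."""
    z = (values - np.mean(values)) / np.std(values)
    return np.mean(z ** 3)

# Hypothetical left-skewed sample bounded above by 100, standing in for a percentage.
rng = np.random.default_rng(0)
pct = 100 * rng.beta(5, 1, size=5000)

# Higher powers stretch the upper end of the range, pulling the skewness toward 0.
for power in (1, 2, 4):
    print(f"power {power}: skewness = {skewness(pct ** power):+.2f}")
```

How large a power is needed depends on the data, so it is worth trying a few and comparing the plots.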

In the cell below, we will make a distribution plot of `lit` with the data transformed using a power, i.e., raise `lit` to the 2nd, 3rd, and 4th power. We plot the transformation with the 4th power below.

In [None]:
ax = sns.displot((df['lit']**4), kde = True, aspect=1.3)
plt.ylabel('Density')
plt.xlabel("Adult literacy rate: Combined: % ages 15+: 2005-14", fontsize=13)
plt.title('World Bank Combined Adult Literacy Rate (4th power)', pad=30);

<!-- BEGIN QUESTION -->

<br>

---

### Question 1e 


If we want to examine the relationship between the combined adult literacy rate and the gross national income per capita, we need to make a scatter plot. 

In the cell below, create a scatter plot of untransformed income per capita and literacy rate using the `sns.scatterplot` function. Make  sure to label both axes using `plt.xlabel` and `plt.ylabel`.

In [None]:
...

<!-- END QUESTION -->

<br> 

We can better assess the relationship between two variables when they have been straightened because it is easier for us to recognize linearity.

In the cell below, we see a scatter plot of log-transformed income per capita against literacy rate.


In [None]:
sns.scatterplot(x = df['lit'], y = np.log10(df['inc']))
plt.xlabel("Adult literacy rate: Combined: % ages 15 and older")
plt.ylabel('Gross national income per capita (log scale)')
plt.title('World Bank: Gross National Income Per Capita vs\n Combined Adult Literacy Rate');

This scatter plot looks better. The relationship is closer to linear.

We can think of the log-linear relationship between x and y as follows: a constant change in x corresponds to a percent (scaled) change in y.
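
A short numerical sketch of this idea follows. The coefficients `a` and `b` are made-up values chosen for illustration, not estimates from the World Bank data. If $\log_{10}(y) = a + bx$, then every step of size $s$ in x multiplies y by the same factor $10^{bs}$.

```python
import numpy as np

# Hypothetical log-linear relationship: log10(y) = a + b * x.
a, b = 2.0, 0.01
step = 20
x = np.arange(100, 201, step)      # equally spaced x values
y = 10 ** (a + b * x)

# Each constant step in x multiplies y by the same factor, 10 ** (b * step).
ratios = y[1:] / y[:-1]
print(ratios)
print(10 ** (b * step))
```

With these values the factor is $10^{0.2} \approx 1.585$, so each increase of 20 in x corresponds to roughly a 58% increase in y, regardless of where on the x-axis the step starts.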

We can also see that the long left tail of literacy is represented in this plot by a lot of the points being bunched up near 100. Try squaring literacy and taking the log of income. Does the plot look better? 

In [None]:

...


Choosing the best transformation for a relationship is often a balance between keeping the model simple and straightening the scatter plot.

<br/><br/>
<hr style="border: 5px solid #8a8c8c;" />
<hr style="border: 1px solid #ffcd00;" />

## Part 2: Kernel Density Estimation 

In this part of the lab you will develop a deeper understanding of how kernel density estimation works.

### Overview

Kernel density estimation is used to estimate a probability density function (i.e. a density curve) from a set of data. Just like a histogram, a density function's total area must sum to 1.

KDE centrally revolves around this idea of a "kernel". A kernel is a function whose area sums to 1. The three steps involved in building a kernel density estimate are:
1. Placing a kernel at each observation
2. Normalizing kernels so that the sum of their areas is 1
3. Summing all kernels together

The end result is a function that takes in some value `x` and returns a density estimate at the point `x`.
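
The three steps can be sketched as follows. This is only an illustration under stated assumptions: it uses a triangular kernel rather than the kernel used later in this lab, and the names `triangular_kernel`, `kde_on_grid`, and `h` are hypothetical helpers introduced here.

```python
import numpy as np

def triangular_kernel(h, x, z):
    """Triangle of half-width h centered at z; the area under it is 1."""
    return np.maximum(0.0, 1.0 - np.abs(x - z) / h) / h

def kde_on_grid(data, xs, h):
    """Evaluate the density estimate at every grid value in xs."""
    # Step 1: one kernel per observation (rows: grid values, columns: observations).
    kernels = triangular_kernel(h, xs[:, None], data[None, :])
    # Step 2: scale each kernel by 1/n so that the areas sum to 1.
    kernels = kernels / len(data)
    # Step 3: add the scaled kernels together.
    return kernels.sum(axis=1)

points = np.array([2.0, 4.0, 9.0])
xs = np.linspace(-2, 13, 3001)
density = kde_on_grid(points, xs, h=1.0)

# Trapezoid rule: the area under the estimate should be very close to 1.
area = np.sum((density[1:] + density[:-1]) / 2 * np.diff(xs))
print(f"area under the estimate: {area:.4f}")
```

Swapping in a different kernel function or a different value of `h` changes the shape of the estimate but not the area under it.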

When constructing a KDE, there are several choices to make regarding the kernel. Specifically, we need to choose the function we want to use as our kernel, as well as a bandwidth parameter, which tells us how wide or narrow each kernel should be. We will explore these ideas now.

In [None]:
data3pts = np.array([2, 4, 9])
sns.displot(data3pts, kde = True, stat = "density");

To understand how KDEs are computed, we need to see the KDE outside the given range. 

The easiest way to do this used to be an older function called `distplot`. However, `distplot` is now deprecated and will be removed at a future date, so we do not rely on it here.

Using a deprecated function will often result in a `UserWarning`, and you can usually follow the warning's suggestions to adapt your code. In this case, we can use `displot` [(documentation)](https://seaborn.pydata.org/archive/0.11/generated/seaborn.displot.html#seaborn.displot) with some additional parameters. These additional parameters are needed because the default values of `displot` are different from those of `distplot`; we can manually set them to be the same.

In [None]:
# Run this cell to create KDE plot and histogram using displot; no further action is needed.
# displot is a figure-level function: it creates its own figure, sized via height and aspect.
sns.displot(data3pts, kde=True, stat="density", kde_kws={"cut":4}, bins=2, 
            height=5, aspect=1.5);

One question you might have is how the kernel density estimator decides how "wide" each kernel should be. It turns out this is controlled by a parameter you can set called `bw_method` (named `bw` in older versions of seaborn), which determines the bandwidth. For example, the code below sets `bw_method` to 0.5. You'll see the resulting KDE is quite different. Try experimenting with different values and see what happens.

In [None]:
# Run this cell to plot displot with the specified bandwidth. 
# displot is a figure-level function: it creates its own figure, sized via height and aspect.
sns.displot(data3pts, kde=True, stat="density", 
            kde_kws={"bw_method": 0.5, "cut":4}, bins=2, height=5, aspect=1.5);

<br>

---

### Question 2a

As mentioned above, the kernel density estimate is just the sum of a bunch of copies of the kernel, each centered on our data points. The default kernel used by the `displot` function (as well as `kdeplot`) is the Gaussian kernel, given by:

$$\Large
K_\alpha(x, z) = \frac{1}{\sqrt{2 \pi \alpha^2}} \exp\left(-\frac{(x - z)^2}{2  \alpha ^2} \right)
$$

We've implemented the Gaussian kernel for you in Python below. Here, `alpha` is the smoothing or bandwidth parameter $\alpha$ for the KDE, `z` is the center of the Gaussian (i.e., a data point or an array of data points), and `x` is an array of values of the variable whose distribution we are plotting. In other words, `z` represents the center point of our smooth KDE bell curve, while `x` represents the range of values over which we want to generate the KDE plot.
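
As a quick sanity check on the formula (a worked evaluation added for illustration), consider the kernel at its center, $x = z$. There the exponential equals 1, so the peak height depends only on $\alpha$:

$$\Large
K_\alpha(z, z) = \frac{1}{\sqrt{2 \pi \alpha^2}} \exp(0) = \frac{1}{\alpha \sqrt{2 \pi}} \approx \frac{0.399}{\alpha}, \qquad K_{0.5}(z, z) \approx 0.798
$$

Halving $\alpha$ therefore doubles the peak height while making the bell narrower, so the area under the kernel stays equal to 1.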

In [None]:
def gaussian_kernel(alpha, x, z):
    """
    Compute the Gaussian density estimate for values in x.

    Args:
        alpha: the smoothing parameter to pass to the kernel.
        x: an array of values whose density will be calculated.
        z: center of Gaussian.

    Returns:
        The smoothed estimate at values of x.
    """    
    return 1.0/np.sqrt(2. * np.pi * alpha**2) * np.exp(-(x - z) ** 2 / (2.0 * alpha**2))

For example, we can plot the Gaussian kernel centered at 9 with $\alpha$ = 0.5 as below: 

In [None]:
xs = np.linspace(-2, 12, 200)
alpha = 0.5
kde_curve = [gaussian_kernel(alpha, x, 9) for x in xs]
plt.plot(xs, kde_curve);

<!-- BEGIN QUESTION -->

In the cell below, plot the 3 kernel density functions corresponding to our 3 data points on the same axis. Use an `alpha` value of 0.5. Recall that our three data points are 2, 4, and 9. 

**Note:** Make sure to normalize your kernels! This means that the area under each of your kernels should be $\frac{1}{3}$ since there are three data points.

You don't have to use the following hints, but they might be helpful in simplifying your code.

**Hint:** The `gaussian_kernel` function can also take a `NumPy` array as an argument for `z`.

**Hint:** To plot multiple plots at once, you can use `plt.plot(xs, y)` with a two-dimensional array as `y`.

Add a legend to the plot. 


In [None]:
...

<!-- END QUESTION -->

In the cell below, we see a plot that shows the sum of all three of the kernels above. The plot resembles the KDE shown when you called the `displot` function with bandwidth 0.5 earlier. The area under the final curve will be 1 since the area under each of the three normalized kernels is $\frac{1}{3}$.

In [None]:
# Run this cell to plot the sum of the kernels; no further action is needed.
xs = np.linspace(-2, 12, 200)
alpha = 0.5
kde_curve = np.array([1/3 * gaussian_kernel(alpha, x, data3pts) for x in xs])
plt.plot(xs, np.sum(kde_curve, axis=1));

Recall that earlier we plotted the kernel density estimation for the logarithm of the income data, as shown again below.

In [None]:
# Run this cell to plot KDE of log income; no further action is needed.
ax = sns.displot(data=df, x=np.log10(df['inc']), kind='kde', rug=True, aspect=1.4)
plt.xlabel('Log Gross national income per capita, Atlas method: $: 2016', fontsize=13)
plt.title('World Bank Gross National Income Per Capita');

In the cell below, a similar plot is shown using what was done above. Try out different values of alpha in {0.1, 0.2, 0.3, 0.4, 0.5}. You will see that when `alpha=0.2`, the graph matches the previous graph well, except that the `displot` function hides the KDE values outside the range of the available data. Recall that `alpha` represents the spread of each individual kernel curve, which in turn affects the smoothness of the final KDE plot. When would you expect a smoother plot - when alpha = 0.1 or 0.5?

In [None]:
# Run this cell to plot KDE of log income; try out different bandwidths!
xs = np.linspace(1, 6, 200)
alpha = 0.5
kde_curve = np.array([1/len(df['inc']) * gaussian_kernel(alpha, x, np.log10(df['inc'])) for x in xs])
plt.xlabel('Log Gross national income per capita, Atlas method: $: 2016', fontsize=13)
plt.ylabel("Density")
plt.title('World Bank Gross National Income Per Capita')
plt.plot(xs, np.sum(kde_curve, axis = 1));

<br>

---

### Question 2b

In your answers above, you hard-coded a lot of your work. In this problem, you'll build a more general kernel density estimator function.

Implement the KDE function, which computes:

$$\Large
f_\alpha(x) = \frac{1}{n} \sum_{i=1}^n K_\alpha(x, z_i)
$$

where each $z_i$ represents a single datapoint in the collected dataset, $\alpha$ is a parameter to control the smoothness, and $K_\alpha$ is the kernel density function passed as `kernel`. Your code should run no longer than a couple of seconds. 

In [None]:
def kde(kernel, alpha, x, data):
    """
    Compute the kernel density estimate for the single query point x.

    Args:
        kernel: a kernel function with 3 parameters: alpha, x, and data.
        alpha: the smoothing parameter to pass to the kernel.
        x: a single query point (in one dimension).
        data: a NumPy array of data points.

    Returns:
        The smoothed estimate at the query point x.
    """    
    ...
    
kde(gaussian_kernel, 1.0, 2.0, np.array([3.0, 4.0, 5.0, 7.0]))

In [None]:
grader.check("q2b")

Assuming you implemented `kde` correctly, the code below should generate the `kde` of the log of the income data as before.

In [None]:
# Run this cell to generate the kde of the log of the income data; no further action is needed.
df['trans_inc'] = np.log10(df['inc'])
xs = np.linspace(df['trans_inc'].min(), df['trans_inc'].max(), 1000)
curve = [kde(gaussian_kernel, alpha, x, df['trans_inc']) for x in xs]
plt.hist(df['trans_inc'], density=True, color='orange')
plt.xlabel('Log Gross national income per capita, Atlas method: $: 2016', fontsize = 13);
plt.title('World Bank Gross National Income Per Capita')
plt.plot(xs, curve, 'k-');

And the code below should show a 3 x 3 set of plots showing the output of the kde for different `alpha` values.

In [None]:
# Run this cell to generate the kde of the log of the income data 
# with different alphas values, no further action is needed.
plt.figure(figsize=(15,15))
alphas = np.arange(0.2, 2.0, 0.2)
for i, alpha in enumerate(alphas):
    plt.subplot(3, 3, i+1)
    xs = np.linspace(df['trans_inc'].min(), df['trans_inc'].max(), 1000)
    curve = [kde(gaussian_kernel, alpha, x, df['trans_inc']) for x in xs]
    plt.hist(df['trans_inc'], density=True, color='orange')
    plt.plot(xs, curve, 'k-')
    plt.title(r"$\alpha = " + format(alpha, ".02") + "$")
plt.show()

<br>

---

### Question 2c 


Let's take a look at another kernel, the Boxcar kernel. The function `boxcar_kernel` is defined below.

\begin{equation} \Large
K_{\alpha}(x, z) =
    \begin{cases}
        \frac{1}{\alpha}, & \text{if } -\frac{\alpha}{2} \leq (x-z) \leq \frac{\alpha}{2}\\
        0, & \text{otherwise}
    \end{cases}
\end{equation}
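
Like any kernel, the Boxcar kernel has a total area of 1. A short worked check (added for illustration): the kernel is a rectangle of width $\alpha$ and height $\frac{1}{\alpha}$ centered at $z$, so

$$\Large
\int_{-\infty}^{\infty} K_\alpha(x, z) \, dx = \int_{z - \alpha/2}^{z + \alpha/2} \frac{1}{\alpha} \, dx = \frac{1}{\alpha} \cdot \alpha = 1
$$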

In [None]:
def boxcar_kernel(alpha, x, z):
    """
    Compute the boxcar density estimate for values in x.

    Args:
        alpha: the smoothing parameter to pass to the kernel.
        x: an array of values whose density will be calculated.
        z: center of boxcar function.

    Returns:
        The smoothed estimate at values of x.
    """    
    return (((x-z)>=-alpha/2)&((x-z)<=alpha/2))/alpha

Run the cell below to enable interactive plots. 

Now, we can plot the Boxcar and Gaussian kernel functions to see what they look like. 

In [None]:
from ipywidgets import interact

x = np.linspace(-10,10,1000)
def f(alpha):
    plt.plot(x, boxcar_kernel(alpha,x,0), label='Boxcar')
    plt.plot(x, gaussian_kernel(alpha,x,0), label='Gaussian')
    plt.legend(title='Kernel Function')
    plt.show()
interact(f, alpha=(1,10,0.1));

Using the interactive plot below, compare the two kernel techniques on the data `data3pts` or log income. Assign `demo = 1` to see `data3pts` or `demo = 2` to see log income.

**Note:** Generating the KDE plot is slow, so expect some latency after you move the slider.

In [None]:
demo = 1 # ... # set this value to 1 or 2

if demo == 1:
    xs = np.linspace(data3pts.min()-3, data3pts.max()+3, 1000)
    def f(alpha_g, alpha_b):
        plt.hist(data3pts, density=True, color='orange')
        g_curve = [kde(gaussian_kernel, alpha_g, x, data3pts) for x in xs]
        plt.plot(xs, g_curve, 'k-', label='Gaussian')
        b_curve = [kde(boxcar_kernel, alpha_b, x, data3pts) for x in xs]
        plt.plot(xs, b_curve, 'r-', label='Boxcar')
        plt.legend(title='Kernel Function')
        plt.show()
    interact(f, alpha_g=(0.01,.5,0.01), alpha_b=(0.01,3,0.1));
else:
    xs = np.linspace(df['trans_inc'].min(), df['trans_inc'].max(), 1000)
    def f(alpha_g, alpha_b):
        plt.hist(df['trans_inc'], density=True, color='orange')
        g_curve = [kde(gaussian_kernel, alpha_g, x, df['trans_inc']) for x in xs]
        plt.plot(xs, g_curve, 'k-', label='Gaussian')
        b_curve = [kde(boxcar_kernel, alpha_b, x, df['trans_inc']) for x in xs]
        plt.plot(xs, b_curve, 'r-', label='Boxcar')
        plt.legend(title='Kernel Function')
        plt.show()
    interact(f, alpha_g=(0.01,.5,0.01), alpha_b=(0.01,3,0.1));

<!-- BEGIN QUESTION -->

Briefly compare and contrast the Gaussian and Boxcar kernels in the cell below. How do the two kernels relate to each other for the same alpha value?

_Type your answer here, replacing this text._

<!-- END QUESTION -->

<br><br>

<hr style="border: 5px solid #8a8c8c;" />
<hr style="border: 1px solid #ffcd00;" />

## Congratulations! You have finished Lab 07!


Congrats! You are finished with this assignment.

Below, you will see a cell. Running this cell will automatically generate a zip file with your autograded answers. Once you submit this file to the Lab 07 Coding assignment on Gradescope, Gradescope will automatically submit a PDF file with your written answers to the Lab 07 - Figures assignment. If you run into any issues when running this cell, feel free to check the [Debugging Guide](https://mtu.instructure.com/courses/1527249/pages/debugging-guide).

**Important**: Please check that your written responses were generated and submitted correctly to the Lab 07 - Figures assignment.

**You are responsible for ensuring your submission follows our requirements and that the PDF for Lab07 Figures answers was generated/submitted correctly. We will not be granting regrade requests nor extensions to submissions that don't follow instructions.** If you encounter any difficulties with submission, please don't hesitate to reach out to staff prior to the deadline. 

## Submission

Make sure you have run all cells in your notebook in order before running the cell below, so that all images/graphs appear in the output. The cell below will generate a zip file for you to submit. **Please save before exporting!**

In [None]:
# Save your notebook first, then run this cell to export your submission.
grader.export(run_tests=True)