Before you turn this problem in, make sure everything runs as expected. First, restart the kernel (in the menubar, select Kernel → Restart) and then run all cells (in the menubar, select Cell → Run All).

Make sure you fill in any place that says YOUR CODE HERE.
Do not write your answer anywhere other than where it says YOUR CODE HERE.

First, write your name and NetID below:

In [None]:
NAME = 'WRITE YOUR NAME HERE'
NETID = 'WRITE YOUR NETID HERE'

## Problem 10.1. PMF and CDF.

In this problem, we will compute and plot the probability mass function (PMF) of departure delay and the cumulative distribution function (CDF) of arrival delay.

In [None]:
%matplotlib inline

In [None]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

In [None]:
df = pd.read_csv(
    "/home/data_scientist/data/2001.csv", # edit this path if necessary
    encoding="latin-1",
    usecols=["DepDelay", "ArrDelay"]  # PMF uses DepDelay, CDF plot uses ArrDelay
    )

Note: The plots shown in the instructions are examples, not the answers. Your plots may look different (within reasonable limits, of course), and you do not need to make them look exactly like mine.

## Plot: Probability Mass Function

- In the following cell, plot the PMF of departure delay.

Note that Pandas automatically replaces missing values (`'NA'`) with `numpy.nan`, displayed as `NaN` (Not a Number). You have to remove rows with missing values before plotting.
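As a small illustration of this behavior, here is a sketch on a made-up three-row CSV (the `csv_text` string and the `toy` dataframe are hypothetical, not part of the assignment data). It assumes the default `na_values` handling of `pd.read_csv()`.

```python
import io

import pandas as pd

# Hypothetical three-row CSV; the second DepDelay value is missing ("NA").
csv_text = "DepDelay\n5\nNA\n-3\n"
toy = pd.read_csv(io.StringIO(csv_text))

# read_csv parses "NA" as NaN, so the column is stored as float64.
n_missing = int(toy["DepDelay"].isnull().sum())

# dropna() returns a new Series without the NaN rows.
clean = toy["DepDelay"].dropna()

print(n_missing)       # number of missing rows
print(clean.tolist())  # remaining values
```

`dropna()` does not modify `toy` in place, so the cleaned result has to be assigned to a new name (or back to the original one).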

In [None]:
def plot_pmf(df, column, nbins=200):
    """
    Plots the PMF of the specified column of the input Pandas dataframe.
    
    Parameters
    ----------
    df (pandas.DataFrame): input dataframe.
    column (str): target column.
    nbins (int): number of bins. Defaults to 200.
    
    Returns
    -------
    None
    """
    # YOUR CODE HERE
    
    return None

In the following cells, we plot two PMFs using two different numbers of bins. First, using 50 bins, I get
![](https://raw.githubusercontent.com/UI-DataScience/info490-fa15/master/Week10/assignment/pmf_50.png)

In [None]:
plot_pmf(df, "DepDelay", nbins=50)

And using 200 bins, I get

![](https://raw.githubusercontent.com/UI-DataScience/info490-fa15/master/Week10/assignment/pmf_200.png)

In [None]:
plot_pmf(df, "DepDelay", nbins=200)

The shape of the PMF depends heavily on the bin size, so the two plots look somewhat different. It can be tricky to get the bin size right. Furthermore, parts of these figures are hard to interpret because of the spikes.

The CDF avoids these problems.
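To make the bin-size dependence concrete without plotting, the sketch below bins a small made-up sample with two different bin counts. The `sample` array and the helper `binned_pmf()` are hypothetical names used only for illustration. They are not the expected implementation of `plot_pmf()`.

```python
import numpy as np

# Hypothetical toy sample standing in for a delay column (in minutes).
sample = np.array([-5, -2, 0, 0, 1, 3, 4, 10, 25, 60], dtype=float)


def binned_pmf(values, nbins):
    """Return the bin edges and the fraction of values falling in each bin."""
    counts, edges = np.histogram(values, bins=nbins)
    return edges, counts / counts.sum()


edges_2, pmf_2 = binned_pmf(sample, nbins=2)
edges_5, pmf_5 = binned_pmf(sample, nbins=5)

print(pmf_2)  # two coarse bins
print(pmf_5)  # five finer bins, including one empty bin
```

The same ten values give quite different pictures depending on `nbins`, which is the effect visible in the two PMF plots above.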

## Cumulative Distribution Function

## Function: get\_cdf()

- Write a function named `get_cdf()` that takes a Pandas DataFrame and a column name, and returns a tuple of two arrays that represent the $x$ and $y$ axes of the (empirical) CDF of that column.

According to [Wikipedia](http://en.wikipedia.org/wiki/Empirical_distribution_function), the empirical distribution function of a sample of size $n$ is defined as
  
  $\text{CDF} (t) = \frac{1}{n} \cdot \left (\text{number of elements in the sample} \leq t \right)$

So, given an array, e.g. `[1, 2, 2, 3, 5]`, you could go through each value and count the number of elements less than or equal to 1, less than or equal to 2, etc. But this method is very inefficient and slow when the input array is large. In Python, when you are dealing with numerical operations on a potentially huge array, you should think NumPy (because a pure-Python `for` loop is very slow and often leads to code that is difficult to read and maintain).
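The following sketch contrasts the two approaches on the small array from the text. The helper names `ecdf_naive()` and `ecdf_numpy()` are hypothetical and exist only for this comparison. Both evaluate the definition above at a single point `t`.

```python
import numpy as np

sample = [1, 2, 2, 3, 5]


def ecdf_naive(values, t):
    """Count elements <= t with a pure-Python loop (slow for large inputs)."""
    count = 0
    for v in values:
        if v <= t:
            count += 1
    return count / len(values)


def ecdf_numpy(values, t):
    """Same quantity with a vectorized comparison: the mean of a boolean mask."""
    return float(np.mean(np.asarray(values) <= t))


naive = [ecdf_naive(sample, t) for t in range(6)]
vectorized = [ecdf_numpy(sample, t) for t in range(6)]

print(naive)
print(vectorized)
```

Both versions agree on this input. The vectorized one still makes a full pass over the array for every `t`, which is why the sort-based algorithm described next is preferable when the whole curve is needed.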
  
Here is a faster algorithm to produce the empirical CDF. As an example, suppose the array has the values `[2, 1, 2, 5, 3]`.

1. Use [`numpy.sort()`](http://docs.scipy.org/doc/numpy/reference/generated/numpy.sort.html) to sort the array (with no missing values) in ascending order. In our case, when we sort the input array, we have
```
[1, 2, 2, 3, 5]
```
This will be our $x$-axis of CDF.

2. Create an array of $0, \frac{1}{N}, \frac{2}{N}, \ldots, 1 - \frac{1}{N}$, where $N$ is the length of the input array (5 in our case). In our case, this array is
```
[0.0, 0.2, 0.4, 0.6, 0.8]
```
This will be our $y$-axis. All you have to do is use `np.arange()` to make an array of length $N$, and divide each element by $N$.

3. Use the $x$-axis from Step 1 (`[1, 2, 2, 3, 5]`) and the $y$-axis from Step 2 (`[0.0, 0.2, 0.4, 0.6, 0.8]`) to plot the CDF:
![](https://raw.githubusercontent.com/UI-DataScience/info490-fa15/master/Week10/assignment/cdf_short.png)
Our list is short and simple, so we can count the values to verify that the CDF looks correct.
```
CDF(0) = 0 
CDF(1) = 0.2
CDF(2) = 0.6
CDF(3) = 0.8
CDF(4) = 0.8
CDF(5) = 1
```
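Steps 1 and 2 can be sketched on the toy array as follows. This is a minimal illustration on a plain NumPy array and assumes there are no missing values. It is not the full `get_cdf()` function, which must also read a DataFrame column and drop `NaN` rows. The last part uses `np.searchsorted()` with `side="right"` as one possible way to count the elements less than or equal to each `t`. The variable names are chosen for illustration.

```python
import numpy as np

values = np.array([2, 1, 2, 5, 3])

# Step 1: sort in ascending order to get the x-axis.
x = np.sort(values)

# Step 2: 0, 1/N, ..., 1 - 1/N for the y-axis.
n = len(x)
y = np.arange(n) / n

print(x)
print(y)

# Check against the definition: for each t, count the elements <= t.
# side="right" returns the number of entries of the sorted array that are <= t.
t = np.arange(6)
cdf_at_t = np.searchsorted(x, t, side="right") / n
print(cdf_at_t)
```

The final array evaluates the definition at `t = 0, 1, ..., 5` and can be compared with the values listed above.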

According to Wikipedia, the resulting empirical CDF is an unbiased estimator for the true CDF.

Note: Do NOT use the `numpy.histogram()` function to create a CDF.
  It uses binning, which might be useful in other cases but not in this one.
  The method outlined above is a better characterization of the true CDF because it keeps one point per observation.
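To see what binning costs, the sketch below builds a histogram-based cumulative sum on the same toy array, purely as a counterexample. The choice of two bins and the variable names are arbitrary assumptions for illustration.

```python
import numpy as np

x = np.array([1, 2, 2, 3, 5])

# Histogram-based approach (not what this assignment asks for):
# with 2 bins the edges are [1, 3, 5], so the data collapse to two points.
counts, edges = np.histogram(x, bins=2)
binned_cdf = np.cumsum(counts) / counts.sum()

print(edges)
print(binned_cdf)
```

Five observations are reduced to two cumulative values, and the result changes with the number of bins, whereas the sort-based method needs no such parameter.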

In [None]:
def get_cdf(df, column):
    '''
    Reads a specific column of a Pandas DataFrame,
    and returns a tuple of arrays that represent the x and y axes of
    cumulative distribution function.
    
    Parameters
    ----------
    df (pandas.DataFrame): A pandas.DataFrame.
    column (str): The header of the target column in df.
    
    Returns
    -------
    A tuple of two numpy arrays of equal length.
    The first array represents the x axis of CDF.
    The second array represents the y axis of CDF.
    '''
    
    # YOUR CODE HERE
    
    return cdf_x, cdf_y

Make sure that your function passes all the tests.

In [None]:
test1 = pd.DataFrame(
    {
        "a": [1, 2, 2, 3, 5],
        "b": [3, 2, 5, 1, 2],
        }
    )

answer1 = np.array([1, 2, 2, 3, 5]), np.array([0.0, 0.2, 0.4, 0.6, 0.8])

np.testing.assert_allclose(get_cdf(test1, "a")[0], answer1[0])
np.testing.assert_allclose(get_cdf(test1, "a")[1], answer1[1])

np.testing.assert_allclose(get_cdf(test1, "b")[0], answer1[0])
np.testing.assert_allclose(get_cdf(test1, "b")[1], answer1[1])

test2 = pd.DataFrame(
    {
        "c": [1, 2, 2, 3, 5, np.nan],
        "d": [3, 2, 5, np.nan, 2, 1],
        }
    )

answer2 = np.array([1, 2, 2, 3, 5]), np.array([0.0, 0.2, 0.4, 0.6, 0.8])

np.testing.assert_allclose(get_cdf(test2, "c")[0], answer2[0])
np.testing.assert_allclose(get_cdf(test2, "c")[1], answer2[1])

np.testing.assert_allclose(get_cdf(test2, "d")[0], answer2[0])
np.testing.assert_allclose(get_cdf(test2, "d")[1], answer2[1])

## Plot: CDF

- Use the `get_cdf()` function to create a CDF of the `ArrDelay` column in `2001.csv`. Here's an example:

![](https://raw.githubusercontent.com/UI-DataScience/info490-fa15/master/Week10/assignment/cdf_arrival_delay.png)

In [None]:
# YOUR CODE HERE