# Week 8 Problem 2

If you are not using the `Assignments` tab on the course JupyterHub server to read this notebook, read [Activating the assignments tab](https://github.com/UI-DataScience/info490-fa16/blob/master/Week2/assignments/README.md).

A few things you should keep in mind when working on assignments:

1. Make sure you fill in any place that says `YOUR CODE HERE`. Do **not** write your answer anywhere other than where it says `YOUR CODE HERE`. Anything you write elsewhere will be removed or overwritten by the autograder.

2. Before you submit your assignment, make sure everything runs as expected. Go to the menubar, select _Kernel_, and restart the kernel and run all cells (_Restart & Run all_).

3. Do not change the title (i.e. file name) of this notebook.

4. Make sure that you save your work (in the menubar, select _File_ → _Save and Checkpoint_).

5. You are allowed to submit an assignment multiple times, but only the most recent submission will be graded.

In [1]:
import os
import numpy as np
import pandas as pd
from nose.tools import ok_, assert_equal

import warnings
warnings.filterwarnings("ignore")

# Problem 1.

In the previous weeks, we have seen different ways to read selected columns from the airline on-time performance CSV file and calculate basic statistics. In this problem, we will see how easy it is to perform the same task using Pandas. In particular, we will rewrite the `get_stats()` function from Problem 4.1 and the `get_column()` function from Problem 5.1. Remember, the purpose of this problem is to let you experience how easy it is to make a data table using Pandas. Don't overthink it.

## Function: get_column()
First, write a function named `get_column()` that takes a filename (string) and a column name (string), and returns a `pandas.DataFrame`. Remember that the file is encoded in `latin-1`, so pass `encoding='latin-1'` when reading it.
Another useful tip: reading the entire file takes a long time. Read in only the column you need by specifying it with the `usecols` option, so that `get_column()` returns a DataFrame with exactly one column.
With Pandas, the `get_column()` function can be written in one line.
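
To illustrate how `usecols` and `encoding` work together, here is a minimal sketch that reads one column from a small in-memory CSV rather than the real data file. The sample data and the name `sample_csv` are made up for illustration and are not part of the assignment.

```python
import io

import pandas as pd

# Hypothetical three-column sample (an assumption for this sketch);
# the real on-time performance file is far larger.
sample_csv = "Year,Month,ArrDelay\n2001,1,-3\n2001,1,4\n2001,1,23\n".encode("latin-1")

# usecols restricts parsing to the listed columns, so the resulting
# DataFrame has exactly one column.
df = pd.read_csv(io.BytesIO(sample_csv), encoding="latin-1", usecols=["ArrDelay"])

print(df.columns.tolist())
print(df.shape)
```

Passing the column name inside a list keeps the result a DataFrame rather than a Series, which is what the problem statement asks for.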

In [2]:
def get_column(filename, column):
    '''
    Reads the specified column of an airline on-time performance CSV file,
    which is in 'latin-1' encoding.
    Returns a Pandas DataFrame with only one column.
    
    Parameters
    ----------
    filename(str): The file name.
    column(str): The column header.
    
    Returns
    -------
    A pandas.DataFrame object that has only one column.
    
    Examples
    --------
    arr_delay = get_column('/home/data_scientist/data/2001.csv', 'ArrDelay')
    '''
    # YOUR CODE HERE
    df = pd.read_csv(filename, encoding='latin-1', usecols=[column])
    return df


In [3]:
csv_with_header = '''
Year,Month,DayofMonth,DayOfWeek,DepTime,CRSDepTime,ArrTime,CRSArrTime,UniqueCarrier,FlightNum,TailNum,ActualElapsedTime,CRSElapsedTime,AirTime,ArrDelay,DepDelay,Origin,Dest,Distance,TaxiIn,TaxiOut,Cancelled,CancellationCode,Diverted,CarrierDelay,WeatherDelay,NASDelay,SecurityDelay,LateAircraftDelay
2001,1,17,3,1806,1810,1931,1934,US,375,N700äæ,85,84,60,-3,-4,BWI,CLT,361,5,20,0,NA,0,NA,NA,NA,NA,1
2001,1,18,4,1805,1810,1938,1934,US,375,N713äæ,93,84,64,4,-5,BWI,CLT,361,9,20,0,NA,0,NA,NA,NA,NA,1
2001,1,19,5,1821,1810,1957,1934,US,375,N702äæ,96,84,80,23,11,BWI,CLT,361,6,10,0,NA,0,NA,NA,NA,NA,NA
2001,1,20,6,1807,1810,1944,1934,US,375,N701äæ,97,84,66,10,-3,BWI,CLT,361,4,27,0,NA,0,NA,NA,NA,NA,NA
'''.strip().encode('latin-1')

csv_no_header = '''
2001,1,17,3,1806,1810,1931,1934,US,375,N700äæ,85,84,60,-3,-4,BWI,CLT,361,5,20,0,NA,0,NA,NA,NA,NA,1
2001,1,18,4,1805,1810,1938,1934,US,375,N713äæ,93,84,64,4,-5,BWI,CLT,361,9,20,0,NA,0,NA,NA,NA,NA,1
2001,1,19,5,1821,1810,1957,1934,US,375,N702äæ,96,84,80,23,11,BWI,CLT,361,6,10,0,NA,0,NA,NA,NA,NA,NA
2001,1,20,6,1807,1810,1944,1934,US,375,N701äæ,97,84,66,10,-3,BWI,CLT,361,4,27,0,NA,0,NA,NA,NA,NA,NA
2001,1,21,7,1810,1810,1954,1934,US,375,N768äæ,104,84,62,20,0,BWI,CLT,361,4,38,0,NA,0,NA,NA,NA,NA,1
'''.strip().encode('latin-1')

with open('test.header.csv', 'wb') as f:
    f.write(csv_with_header)
    
with open('test.noheader.csv', 'wb') as f:
    f.write(csv_no_header)

# header cases
ok_(
    get_column('test.header.csv', 'Year').equals(
        pd.DataFrame(data=[2001] * 4, columns=['Year'])
    ))
ok_(
    get_column('test.header.csv', 'DayofMonth').equals(
        pd.DataFrame(data=list(range(17, 21)), columns=['DayofMonth'])
    ))
ok_(
    get_column('test.header.csv', 'DepTime').equals(
        pd.DataFrame(data=[1806, 1805, 1821, 1807], columns=['DepTime'])
    ))
ok_(
    get_column('test.header.csv', 'SecurityDelay').equals(
        pd.DataFrame(data=[np.nan] * 4, columns=['SecurityDelay'])
    ))
ok_(
    get_column('test.header.csv', 'LateAircraftDelay').equals(
        pd.DataFrame(data=[1, 1, np.nan, np.nan], columns=['LateAircraftDelay'])
    ))

# clean up
os.remove('test.header.csv')
os.remove('test.noheader.csv')


## Function: get_stats()
Next, write a function named `get_stats()` that takes a `pandas.DataFrame` and a column name (string), and returns the minimum, maximum, mean, and median (all floats) of the column.

In [4]:
def get_stats(df, column):
    '''
    Calculates the minimum, maximum, mean, and median values
    of a column from a Pandas DataFrame object.
    
    Parameters
    ----------
    df(pandas.DataFrame): A Pandas DataFrame.
    column(str): The column header.
    
    Returns
    -------
    minimum(float)
    maximum(float)
    mean(float)
    median(float)
    '''
    # YOUR CODE HERE
    # Get the column
    dt = df[column]
    # Compute the stat
    minimum = dt.min()
    maximum = dt.max()
    mean = dt.mean()
    median = dt.median()
    return minimum, maximum, mean, median

In [5]:
import warnings
warnings.filterwarnings("ignore")
data1 = {
    'A': [0, 1, 2, 3, 4],
    'B': [1, 2, 3, 4, np.nan], # append NaN since we need same number of elements
    'C': [4, 3, 2, 1, 0],
    'D': [4, 1, 0, 2, 3]
    }
df1= pd.DataFrame(data1)

assert_equal(get_stats(df1, 'A'), (0, 4, 2, 2))
assert_equal(get_stats(df1, 'B'), (1, 4, 2.5, 2.5))
assert_equal(get_stats(df1, 'C'), (0, 4, 2, 2))
assert_equal(get_stats(df1, 'D'), (0, 4, 2, 2))

data2 = {
    'E': np.append(np.arange(51), np.nan), # append NaN since we need same number of elements
    'F': np.arange(52)
}
df2 = pd.DataFrame(data2)

assert_equal(get_stats(df2, 'E'), (0, 50, 25.0, 25.0))
assert_equal(get_stats(df2, 'F'), (0, 51, 25.5, 25.5))

# shuffle rows in df2; the statistics must not depend on row order
df3 = df2.reindex(np.random.permutation(df2.index))
assert_equal(get_stats(df3, 'E'), (0, 50, 25.0, 25.0))
assert_equal(get_stats(df3, 'F'), (0, 51, 25.5, 25.5))
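
The expected values for columns `B` and `E` in the tests above rely on Pandas skipping `NaN` entries by default when it computes these statistics. The following sketch shows that behavior on a standalone Series; the variable names are chosen for illustration only.

```python
import numpy as np
import pandas as pd

# Same values as column 'B' in the test data, including the trailing NaN.
s = pd.Series([1, 2, 3, 4, np.nan])

# With the default skipna=True, the NaN is ignored in every statistic.
stats = (s.min(), s.max(), s.mean(), s.median())
print(stats)

# With skipna=False, the NaN propagates into the result instead.
mean_with_nan = s.mean(skipna=False)
print(np.isnan(mean_with_nan))
```

This is why `get_stats()` needs no explicit missing-value handling for these tests.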