# Week 2 Exercises

In this week's exercises you will use Numpy/Scipy to implement some numerical algorithms, and then you will use Pandas to perform a rudimentary data analysis of the KDD 98 dataset.  Along the way you will use unix/basic python from the first week, as well as git to save your work.

As a first step we import the libraries we'll use later on.  This allows us to use numpy library calls by prefixing the call with np.

In [None]:
#Import the libraries 
import numpy as np
import scipy as sp
import pandas as pd
import matplotlib.pyplot as plt

In [None]:
%matplotlib inline

## Matrix Manipulations
Let's first create a matrix and perform some manipulations on it.

Using numpy's matrix data structure, define the following matrices:

$$A=\left[ \begin{array}{ccc} 3 & 5 & 9 \\ 3 & 3 & 4 \\ 5 & 9 & 17 \end{array} \right]$$

$$B=\left[ \begin{array}{c} 2 \\ 1 \\ 4 \end{array} \right]$$

After this solve the matrix equation:
$$Ax = B$$

Now write three functions for the matrix product $C=AB$, one in each of the following styles:

1. By using nested for loops to implement the naive algorithm ($C_{ij}=\sum_{k=0}^{m-1}A_{ik}B_{kj}$)
2. Using numpy's built-in matrix multiplication
3. Using Cython

The three methods should give the same answer.

In [None]:
# define A and B
A = [[3, 5, 9], 
     [3, 3, 4], 
     [5, 9, 17]]

B = [[2], 
     [1], 
     [4]]

# solve Ax = B by multiplying B by the inverse of A (the .I attribute of np.matrix)
x = np.matrix(A).I * np.matrix(B)
print(x)
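
Forming the inverse explicitly is not required to solve the system. As a minimal sketch, assuming plain numpy arrays rather than `np.matrix`, `np.linalg.solve` returns the solution directly. The names `A_arr`, `B_arr` and `x_solve` are introduced here for illustration only.

```python
import numpy as np

# same A and B as above, stored as plain float arrays (illustrative names)
A_arr = np.array([[3, 5, 9],
                  [3, 3, 4],
                  [5, 9, 17]], dtype=float)
B_arr = np.array([[2], [1], [4]], dtype=float)

# solve A x = B without computing the inverse of A
x_solve = np.linalg.solve(A_arr, B_arr)
print(x_solve)

# check the solution by substituting it back into the equation
print(np.allclose(A_arr.dot(x_solve), B_arr))
```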

In [None]:
# solving C = AB using nested for loops
def matrix_mult_naive(A, B):
    C = []
    # iterate over rows of A
    for i in range(len(A)):
        C_row = []
        # iterate over columns of B
        for j in range(len(B[0])):
            # iteratively sum pairwise elements in ith row of A and jth column of B
            indexSum = 0
            for k in range(len(A[i])):
                indexSum += A[i][k]*B[k][j]
            C_row.append(indexSum)
        C.append(C_row)
    return C

In [None]:
'''
# test cell for multiplying large matrixes using naive python
tA = [[np.random.uniform() for x in range(100)] for y in range(1000)]
tB = [[np.random.uniform() for x in range(1000)] for y in range(100)]

matrix_mult_naive(tA, tB)
'''

In [None]:
# function to multiply matrices using numpy
def matrix_mult_np(A, B):
    return np.matrix(A) * np.matrix(B)

In [None]:
%load_ext Cython

In [None]:
%%cython
cimport numpy as np
import numpy as np

# solving C = AB using cython
def matrix_mult_cython(np.ndarray[np.int64_t, ndim=2] A, np.ndarray[np.int64_t, ndim=2] B):
    cdef np.ndarray[np.int64_t, ndim=2] C = np.empty((A.shape[0], B.shape[1]), dtype=np.int64)
    cdef long i
    cdef long j
    cdef long k
    # integer accumulator, matching the int64 element type of A, B and C
    cdef np.int64_t indexSum

    # iterate over rows of A
    for i in xrange(A.shape[0]):
        # iterate over columns of B
        for j in xrange(B.shape[1]):
            # sum pairwise products of the ith row of A and the jth column of B
            # note: chained indexing such as A[i][k] builds a temporary row object on
            # every access; A[i, k] would let Cython index the typed buffer directly
            indexSum = 0
            for k in xrange(A.shape[1]):
                indexSum += A[i][k]*B[k][j]
            C[i][j] = indexSum
    return C

In [None]:
# show that all three methods produce the same result

# basic python matrix multiplication
print(matrix_mult_naive(A, B))
print('\n')

# numpy matrix multiplication
print(matrix_mult_np(A, B))
print('\n')

# cython matrix multiplication (the function expects int64 arrays)
print(matrix_mult_cython(np.array(A, dtype=np.int64), np.array(B, dtype=np.int64)))
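
Comparing printed output by eye works for a 3 x 1 result but not for larger matrices. A hedged sketch of a programmatic check follows. `naive_product` is a compact stand-in for `matrix_mult_naive` so that the block runs on its own, and the Cython version is left out because it needs a compiled cell.

```python
import numpy as np

def naive_product(A, B):
    # triple loop over rows of A, columns of B, and the shared dimension
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))]
            for i in range(len(A))]

A = [[3, 5, 9], [3, 3, 4], [5, 9, 17]]
B = [[2], [1], [4]]

loop_result = np.array(naive_product(A, B))
numpy_result = np.dot(np.array(A), np.array(B))

# array_equal is True only if the shapes and all elements match
print(np.array_equal(loop_result, numpy_result))
```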

Now we wish to evaluate the performance of these three methods.  Write a function that, given three dimensions (a,b,c), makes a random a x b matrix and a random b x c matrix, computes their product using each of your three functions, and reports the time taken by each method.

After this, measure the performance of each method for all $a,b,c \in \{10,100,1000,10000\}$ and plot the results.  Is one method always the fastest?  Discuss why this is or is not the case.

In [None]:
import time

# function to determine time to multiply two matrices of size a x b and b x c
def rand_matrix_prod(a, b, c):
    # create m1 and m2, random matrices of dimension a x b and b x c, respectively
    m1 = [[np.random.randint(5) for x in range(b)] for y in range(a)]
    m2 = [[np.random.randint(5) for x in range(c)] for y in range(b)]
    
    start_time = time.time()
    matrix_mult_naive(m1, m2)
    naive_time = time.time()
    matrix_mult_np(np.matrix(m1), np.matrix(m2))
    numpy_time = time.time()
    matrix_mult_cython(np.array(m1), np.array(m2))
    cython_time = time.time()
    
    digits = 8
    
    naive_dur = round(naive_time - start_time, digits)
    numpy_dur = round(numpy_time - naive_time, digits)
    cython_dur = round(cython_time - numpy_time, digits)
    
    return [naive_dur, numpy_dur, cython_dur]
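
Timing a single call with `time.time()` is sensitive to whatever else the machine is doing. One possible refinement, sketched here under the assumption that repeating each multiplication a few times is affordable, is to keep the best of several runs measured with `timeit.default_timer`. The helper name `best_time` and its `repeats` parameter are made up for this example.

```python
import timeit
import numpy as np

def best_time(func, args, repeats=3):
    # hypothetical helper: call func(*args) several times and keep the fastest run
    durations = []
    for _ in range(repeats):
        start = timeit.default_timer()
        func(*args)
        durations.append(timeit.default_timer() - start)
    return min(durations)

# random 100 x 100 integer matrices, as in the exercise
m1 = np.random.randint(5, size=(100, 100))
m2 = np.random.randint(5, size=(100, 100))

print(best_time(np.dot, (m1, m2)))
```

Converting the lists to arrays before the timer starts also keeps the conversion cost out of the measurement.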

In [None]:
# create list of inputs
s = [10, 100]#, 1000, 10000]
inputs = [[a, b, c] for a in s for b in s for c in s]

# run rand_matrix_prod function to time execution of matrix multiplications
results_prod = [rand_matrix_prod(a, b, c) for a, b, c in inputs]

# print matrix size inputs and list of times required for multiplying with python, numpy, and cython
inputs, results_prod

In [None]:
# create plots of matrix multiplication timing results

# isolate results from each method
results_prod_transpose = np.array(results_prod).T
python_results_prod = results_prod_transpose[0]
numpy_results_prod = results_prod_transpose[1]
cython_results_prod = results_prod_transpose[2]

# boxplot of results
bp = plt.boxplot(np.array(results_prod))

## change the style of fliers and their fill
for flier in bp['fliers']:
    flier.set(marker='o', color='#e7298a', alpha=0.5)

# add labels to boxplot
plt.title('Matrix Multiplication Execution Time by Method')
plt.ylabel('Seconds')
plt.xticks([1, 2, 3], ['Python', 'Numpy', 'Cython'])

plt.grid()
plt.show()

# create plot by experiment number, which corresponds to 'inputs' printed in the previous cell
plt.title('Matrix Multiplication Execution Time by Method')
plt.ylabel('Seconds')
plt.xlabel('Experiment #')

# re-index values so their index corresponds to experiment number
plt.plot(np.append(np.roll(python_results_prod,1),python_results_prod[-1]), 'r--', label = 'Python', alpha = 0.75)
plt.plot(np.append(np.roll(numpy_results_prod,1),numpy_results_prod[-1]), 'bs', label = 'Numpy', alpha = 0.75)
plt.plot(np.append(np.roll(cython_results_prod,1),cython_results_prod[-1]), 'g^', label = 'Cython', alpha = 0.75)
plt.legend(loc = 'upper left')

# reset xticks to correspond to experiment number
ax = plt.subplot(111)
size = len(python_results_prod) + 1
ax.set_xlim(1, size - 1)
dim = np.arange(1, size, 1)
plt.xticks(dim)

plt.grid()
plt.show()

### Discussion of Matrix Multiplication Timing Results
It appears that the time required to calculate the product of two matrices using base python and cython increases as the size of the matrices increases, while the time required to calculate the product using numpy is not significantly increased over the sizes tested.

For example, calculating the product of two random square matrices of size 100 using naive python requires roughly 0.31 seconds and cython requires roughly 0.4 seconds on my machine, but the time required using numpy is approximately 0.02 seconds.  In all cases tested, numpy results in a faster calculation.

The reason for numpy's relative speed is that it is a specialized package that performs its calculations in compiled code on arrays whose elements must all be of the same type, while python lists are arrays of pointers to objects, even if all the elements are of the same type, as in this case.  The implication of this difference is that python lists require more space to store and take longer to access from memory.  The cython version gains little here, most likely because it indexes the arrays as `A[i][k]`, which goes through python-level indexing rather than direct access to the typed buffer.
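
The storage difference can be made concrete. The sketch below, using an arbitrary 1000-element example, compares the size of a numpy array's data buffer with a python list holding the same integers. `sys.getsizeof` on a list counts only the list's own pointer table, so the sizes of the integer objects are added separately.

```python
import sys
import numpy as np

n = 1000
values = list(range(n))
arr = np.arange(n, dtype=np.int64)

# numpy stores the raw integers contiguously: 1000 elements x 8 bytes = 8000 bytes
print(arr.nbytes)

# the list stores one pointer per element, plus the integer objects they point to
list_bytes = sys.getsizeof(values) + sum(sys.getsizeof(v) for v in values)
print(list_bytes)
```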

**BONUS** Now repeat the past two problems, but instead of computing the matrix product, compute a matrix's [determinant](http://en.wikipedia.org/wiki/Determinant).  Measure performance for matrices of various sizes and discuss the results.  The determinant may become impractical to calculate even for fairly small matrices, so there is no need to go to 1000x1000 matrices.

In [None]:
# function to calculate the determinant of a matrix using naive python
def matrix_det_naive(A):
    # base cases:  A is 1x1 or 2x2
    if len(A) == 1:
        det = A[0][0]
    elif len(A) == 2:
        det = A[0][0]*A[1][1] - A[0][1]*A[1][0]
    else:
        det = 0
        # calculate determinant by minors, iterating over first row of A
        # see http://mathworld.wolfram.com/DeterminantExpansionbyMinors.html
        for i in range(len(A[0])):
            minor = [row[0:i] + row[i+1:] for row in A[1:]]
            det += (-1)**(i+2) * A[0][i] * matrix_det_naive(minor)
    return det

In [None]:
'''
# test determinant function for random matrix
size = 8
m = [[np.random.randint(10) for _ in range(size)] for _ in range(size)]

print np.linalg.det(np.array(m))
print matrix_det_naive(m)
'''

In [None]:
# function to determine time required to calculate determinant of matrix of size n using python and numpy
def rand_matrix_det(n):
    # create random nxn matrix named m1
    m1 = [[np.random.randint(5) for _ in range(n)] for _ in range(n)]
    
    start_time = time.time()
    matrix_det_naive(m1)
    naive_time = time.time()
    np.linalg.det(np.array(m1))
    numpy_time = time.time()
    #matrix_det_cython(np.array(m1))
    #cython_time = time.time()
    
    # round time required to 8 digits
    digits = 8
    naive_dur = round(naive_time - start_time, digits)
    numpy_dur = round(numpy_time - naive_time, digits)
    #cython_dur = round(cython_time - numpy_time, digits)
    
    return [naive_dur, numpy_dur]#, cython_dur]

In [None]:
# run function to determine time to calculate determinant of matrices of size 1 to 11
results_det = [rand_matrix_det(size) for size in range(1,12)]
results_det

In [None]:
# isolate results from each method
results_det_transpose = np.array(results_det).T
python_results_det = results_det_transpose[0]
numpy_results_det = results_det_transpose[1]

# plot results
plt.title('Matrix Determinant Execution Time by Method')
plt.ylabel('Seconds')
plt.xlabel('Matrix Size')

# re-index values so their index corresponds to matrix size
plt.plot(np.append(np.roll(python_results_det,1),python_results_det[-1]), 'r--', label = 'Python', alpha = 0.75)
plt.plot(np.append(np.roll(numpy_results_det,1),numpy_results_det[-1]), 'bs', label = 'Numpy', alpha = 0.75)
plt.legend(loc = 'upper left')

# reset xticks to correspond to matrix size
ax = plt.subplot(111)
size = len(python_results_det) + 1
ax.set_xlim(1, size - 1)
dim = np.arange(1, size, 1)
plt.xticks(dim)

plt.grid()
plt.show()

### Discussion of Matrix Determinant Timing Results
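It appears that the time required to calculate the determinant of a matrix using base python increases factorially as the size of the matrix increases, while the time required to calculate the determinant using numpy is not significantly increased as matrix size grows.

For example, calculating the determinant of a random matrix of size 9, 10, and 11 using naive python requires roughly 0.8 seconds, 7.9 seconds, and 84 seconds on my machine, respectively, while the time required using numpy remains less than roughly $10^{-4}$ seconds in each case.  Each step up in size multiplies the python time by roughly the new matrix size, which is what expansion by minors predicts: an n x n determinant requires n determinants of size (n-1) x (n-1), for on the order of n! operations in total.

The main reason for the increased speed of numpy here is the algorithm rather than the data structure: `np.linalg.det` computes the determinant from an LU factorization, which takes on the order of n^3 operations.  As discussed above for multiplying matrices, numpy also benefits from running compiled code on arrays of homogeneous data that require less memory to store and access.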
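
Much of the cost of the recursive expansion comes from the algorithm itself. As an illustration, here is a pure-python sketch of a determinant computed by Gaussian elimination with partial pivoting, which needs on the order of n^3 operations. It is not numpy's implementation, only the same general idea, and the name `det_by_elimination` is an assumption of this example.

```python
import numpy as np

def det_by_elimination(rows):
    # work on a float copy so the caller's matrix is left untouched
    a = [[float(v) for v in row] for row in rows]
    n = len(a)
    det = 1.0
    for col in range(n):
        # partial pivoting: use the row with the largest entry in this column
        pivot = max(range(col, n), key=lambda r: abs(a[r][col]))
        if a[pivot][col] == 0.0:
            return 0.0  # no usable pivot, so the matrix is singular
        if pivot != col:
            a[col], a[pivot] = a[pivot], a[col]
            det = -det  # swapping two rows flips the sign of the determinant
        det *= a[col][col]
        # eliminate the entries below the pivot
        for r in range(col + 1, n):
            factor = a[r][col] / a[col][col]
            for c in range(col, n):
                a[r][c] -= factor * a[col][c]
    return det

m = [[3, 5, 9], [3, 3, 4], [5, 9, 17]]
print(det_by_elimination(m))
print(np.linalg.det(np.array(m)))
```

Both calls should agree up to floating point rounding. The three nested loops make the cubic cost visible.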

### IO Exercises

Below is a map of various datatypes in python that you have come across and their corresponding JSON equivalents.

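$$Datatypes=\left[ \begin{array}{cc} \text{JSON} & \text{Python3} \\ \text{object} & \text{dictionary} \\ \text{array} & \text{list} \\ \text{string} & \text{string} \\ \text{integer} & \text{integer} \\ \text{real number} & \text{float} \\ \text{true} & \text{True} \\ \text{false} & \text{False} \\ \text{null} & \text{None}  \end{array} \right]$$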


There are at least two very important python datatypes missing from the above list. 
Can you find them?  [list the two missing python datatypes in this markdown cell below]

1. Tuple
2. Set
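
A short check of how the standard `json` module treats these two types (a sketch using only the standard library):

```python
import json

# a tuple is written as a JSON array, so it comes back as a list
round_trip = json.loads(json.dumps((1, 2)))
print(round_trip)

# a set has no JSON equivalent and is rejected by the json module
try:
    json.dumps({1, 2})
except TypeError as err:
    print('TypeError: %s' % err)
```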

Now we can save the above map as a dictionary of key-value pairs.
1. Create a python dictionary named datatypes, having the above map as the key-value pairs, with JSON equivalents as keys and Python datatypes as values.
2. Save it as a pickle called datatypes and gzip it.
3. Reload this pickle, read the file contents, and output the data formatted as in this example: "The JSON equivalent for the Python datatype Dictionary is Object". Output the rest of the key-value pairs similarly.
4. Save this data as JSON, but this time using Python datatypes as keys and JSON equivalents as values.

In [None]:
import pickle
import gzip
import json

# create datatypes dictionary (JSON type -> Python type)
# the table's header row (JSON / Python3) is not a datatype pair, so it is left out
datatypes = {
    'object':'dictionary',
    'array':'list',
    'string':'string',
    'integer':'integer',
    'real number':'float',
    'true':'True',
    'false':'False',
    'null':'None'
}

# save as pickle called datatypes and gzip; 'with' closes the file so all data is flushed
with gzip.open('datatypes.pkl','wb') as f:
    pickle.dump(datatypes, f)

# reload, read, and output data in format requested
with gzip.open('datatypes.pkl','rb') as f:
    data = pickle.load(f)
for key in data.keys():
    print("The JSON equivalent for the Python datatype %s is %s." % (data[key], key))
    
# save as JSON with Python datatypes as keys and JSON datatypes as values
new_dict = {}
for key in data.keys():
    new_dict[data[key]] = key  
    
# JSON is text, so the file is opened in text mode
with open('new_dict.jsn','w') as f:
    json.dump(new_dict, f)
!cat new_dict.jsn

## Pandas Data Analysis
Pandas gives us a nice set of tools to work with columnar data (similar to R's dataframe). 
To learn how to use this it makes the most sense to use a real data set.
For this assignment we'll use the KDD Cup 1998 dataset, which can be sourced from http://kdd.ics.uci.edu/databases/kddcup98/kddcup98.html .


### Acquiring Data
First we pull the README file from the dataset into this notebook via the unix "curl" command.  Remember you can hide/minimize output cells via the button on the left of the output.

In [None]:
!curl http://kdd.ics.uci.edu/databases/kddcup98/epsilon_mirror/readme

As you can see this README describes several files which may be of use.  In particular there are two more documentation files (DOC and DIC) we should read to get an idea of the data format.  Bring these files into the notebook.

In [None]:
!curl http://kdd.ics.uci.edu/databases/kddcup98/epsilon_mirror/cup98doc.txt

In [None]:
!curl http://kdd.ics.uci.edu/databases/kddcup98/epsilon_mirror/cup98dic.txt

Now we wish to download the cup98lrn.zip file and unzip it into a new subdirectory called "data".  
However, since this file is pretty big we don't want to store it on github.  
Luckily git provides the [.gitignore](http://git-scm.com/docs/gitignore) file which allows us to specify files we don't want to put into our git repository.

Please do the following steps:

1. Add the directory "data" to the .gitignore file
2. Commit the new .gitignore file
3. Create a new directory "data"
4. Download http://kdd.ics.uci.edu/databases/kddcup98/epsilon_mirror/cup98lrn.zip into the data directory
5. Unzip the cup98lrn.zip (we will only be using the unzipped version, so feel free to remove the zip file)
6. Run "git status" to show that the data directory is not a tracked file (this indicates it is ignored)

**NOTE:** These steps only need to be run once.  After they have run, it is advised that you comment out all the lines by putting a # at the start of each line.  This will save you time in the future when you have to rerun all cells and don't want to spend a few minutes downloading the data file again.

In [None]:
#!echo '/data' >> .gitignore
#!cat .gitignore
#!git add .gitignore
#!git commit .gitignore -m 'added data folder to .gitignore'
#!mkdir data
#!curl http://kdd.ics.uci.edu/databases/kddcup98/epsilon_mirror/cup98lrn.zip -o 'data/cup98lrn.zip'
#!unzip 'data/cup98lrn.zip' -d data
#!rm 'data/cup98lrn.zip'
#!git status

Now perform some basic sanity checks on the data.  Using a combination of unix/basic python, answer the following questions:

1. How many lines are there?  
2. Is the file character separated or fixed width format?
3. Is there a header?  If so, how many fields are in it?
4. Do all rows have the same number of fields as the header?
5. Does anything in 1-4 disagree with the readme file or indicate erroneous data?

In [None]:
# determine number of lines in file
!wc -l 'data/cup98LRN.txt'

In [None]:
# look at first row to see if delimited or fixed width, and to see if there is a header
!head -1 'data/cup98LRN.txt'

In [None]:
# replace commas in header with newlines and count lines to determine number of header fields
!head -1 'data/cup98LRN.txt' | tr ',' '\n' | wc -l

In [None]:
# count the rows whose number of fields differs from the number of header fields
with open("data/cup98LRN.txt") as f:
    rows = [line.split(',') for line in f]
    header_cols = len(rows[0])
    mismatched_rows = []
    for i in range(len(rows)):
        if len(rows[i]) != header_cols:
            mismatched_rows.append((i, rows[i]))
            
print(len(mismatched_rows))
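
Reading every row into a list keeps the whole file in memory. A lighter alternative, sketched here with the standard `csv` module, tallies the number of fields per row while streaming through the lines. `field_count_histogram` is a hypothetical helper and the sample text is made up. With the real file, the argument would be the open file handle.

```python
import csv
import io
from collections import Counter

def field_count_histogram(lines):
    # map "number of fields" -> "number of rows with that many fields"
    return Counter(len(row) for row in csv.reader(lines))

# tiny made-up sample standing in for the real file: one row is short
sample = io.StringIO(u"a,b,c\n1,2,3\n4,5\n6,7,8\n")
histogram = field_count_histogram(sample)
print(histogram)
```

A file in which every row matches the header produces a histogram with a single key.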

Give answers to questions 1-4 in this markdown cell:

1. There are 95,413 lines in the file (including the header row).
2. The file is comma separated.
3. Yes, there is a header with 481 fields.
4. All rows have 481 fields.

Now load the data file into a pandas dataframe called "learn".  To save some time, we've loaded the data dictionary into col_types.  

Finally split learn into two data frames, learn_y: the targets (two columns described in the documentation) and learn_x: the predictors (everything but the targets)

In [None]:
dict_file = open("dict.dat")
col_types = [ (x.split("\t")[0], x.strip().split("\t")[1]) for x in dict_file.readlines() ]
#col_types

In [None]:
# create data dictionary from 'col_types', mapping each column name to a pandas dtype
data_dict = {}
for col in col_types:
    if col[1] == 'Num':
        data_dict[col[0]] = 'float64'
    else:
        data_dict[col[0]] = 'object'

#data_dict

In [None]:
# load data into learn dataframe
learn = pd.read_csv('data/cup98LRN.txt', dtype = data_dict)

# Split TARGET_B and TARGET_D into learn_y dataframe
learn_y = learn[['TARGET_B', 'TARGET_D']]

# Remove TARGET_B and TARGET_D from learn_x dataframe
learn_x = learn.drop(['TARGET_B', 'TARGET_D'], axis=1)

In [None]:
learn_y.head()

In [None]:
learn_x.head()

### Summarizing Data
Now that we have loaded data into the learn table, we wish to summarize the data.  
Write a function called summary which takes a pandas data frame and prints a summary of each column containing the following:

If the column is numeric:

1. Mean
2. Standard Deviation
3. Min/Max
4. Number of missing values (NaN, Inf, NA)

If the column is non-numeric:

1. Number of distinct values
2. Number of missing values (NaN, INF, NA, blank/all spaces)
3. The frequency of the 3 most common values and 3 least common values

Format the output to be human readable.

For example:
> Field_1  
> mean: 50  
> std_dev: 25  
> min: 0  
> max: 100  
> missing: 5
>  
> Field_2  
> distinct_values: 100  
> missing: 10  
>  
> 3 most common:  
>   the: 1000  
>   cat: 950  
>   meows: 900  
>  
> 3 least common:  
>   dogs: 5  
>   lizards: 4  
>   eggs: 1  

In [None]:
# function to summarize dataframe of format similar to 'learn_x'
def summary(df):
    for col in df.columns:
        data = df[col]
        print(col)
        if np.issubdtype(data.dtype, np.number):
            # treat +/-Inf as missing, alongside NaN
            data = data.replace([np.inf, -np.inf], np.nan)

            print('mean:     %.2f' % data.mean())
            print('std dev:  %.2f' % data.std())
            print('min:      %.2f' % data.min())
            print('max:      %.2f' % data.max())
            print('missing:  %d\n' % data.isnull().sum())
        else:
            # treat blank and all-space strings as missing
            data = data.replace(r'^\s*$', np.nan, regex=True)
            # value_counts ignores NaN and sorts from most to least common
            counts = data.value_counts()

            print('distinct: %d' % len(counts))
            print('missing:  %d\n' % data.isnull().sum())

            print('3 most common:')
            for value, n in counts.head(3).items():
                print('%s: %d' % (value, n))

            print('\n3 least common:')
            for value, n in counts.iloc[::-1].head(3).items():
                print('%s: %d' % (value, n))
            print('')
    return None

# summarize 'learn_x' dataframe
summary(learn_x)
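
For the non-numeric columns, one way to obtain the most and least common values is `Series.value_counts`. A small sketch on a made-up column follows. The names `toy`, `cleaned` and `counts` are illustrative only.

```python
import numpy as np
import pandas as pd

# made-up column with one all-space entry and one missing entry
toy = pd.Series(['cat', 'dog', 'cat', ' ', 'eel', 'cat', 'dog', np.nan])

# treat blank or all-space strings as missing
cleaned = toy.replace(r'^\s*$', np.nan, regex=True)

# value_counts excludes NaN and sorts from most to least common
counts = cleaned.value_counts()
print(counts.head(3))            # most common values first
print(counts.iloc[::-1].head(3)) # least common values first
print(cleaned.isnull().sum())    # number of missing entries
```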

### Pandas analysis on Calit2 data

Import data from http://archive.ics.uci.edu/ml/machine-learning-databases/event-detection/CalIt2.data using curl

This data comes from the main door of the CalIt2 building at UCI. Observations come from 2 data streams (people flow in and out of the building), over 15 weeks, 48 time slices per day (half hour count aggregates).

Attribute Information:
1. Flow ID: 7 is out flow, 9 is in flow
2. Date: MM/DD/YY
3. Time: HH:MM:SS
4. Count: Number of counts reported for the previous half hour


In [None]:
!curl http://archive.ics.uci.edu/ml/machine-learning-databases/event-detection/CalIt2.data > CalIt2.data
df = pd.read_csv('CalIt2.data', names = ['Flow_ID','Date','Time','Count'])
df.head()

#### Selecting Data ####
1. Select all data for the date July 24 2005 having flow id=7. Also output the row count of results 
2. Select all rows whose count is greater than 5. Sort the result on count in descending order and output the top 10 rows

In [None]:
# all rows for 'Date' July 24 2005 having 'Flow_ID' 7
subset_df = df.loc[(df['Flow_ID'] == 7) & (df['Date'] == '07/24/05')]

# print first few rows of data
print(subset_df.head())

# output row count
print('row count = %d' % (subset_df.shape[0]))

In [None]:
# select rows whose count is greater than 5 and print top 10 rows by count
subset_df2 = df[df['Count'] > 5].sort_values('Count', ascending = False)
subset_df2 = subset_df2.iloc[0:10]
subset_df2

#### Apply function ####
1. For the 10 rows outputted above, use Pandas Apply function to subtract lowest value of the 10 from all of them and then output the average value of the resulting counts
2. On the entire data, use apply function to sum all counts with flow_id=9 and date is 07/24/05

In [None]:
# use apply to subtract lowest count from top 10 and output average of resulting counts
lowest_count = subset_df2['Count'].values.min()
print(subset_df2['Count'].apply(lambda x: x - lowest_count).values.mean())

# use apply to sum all counts with Flow_ID = 9 and Date = 07/24/05
# (the lambda is applied to each column and keeps only the matching rows)
print(df.apply(lambda x: x.loc[(df['Flow_ID'] == 9) & (df['Date'] == '07/24/05')])['Count'].sum())
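
As a point of comparison, here is a hedged sketch of the same kind of conditional sum written two ways on a made-up four-row frame: with a row-wise `apply` (`axis=1`) and with a boolean mask. The frame `toy` and the helper `count_if_match` are hypothetical.

```python
import pandas as pd

toy = pd.DataFrame({
    'Flow_ID': [7, 9, 9, 9],
    'Date': ['07/24/05', '07/24/05', '07/24/05', '07/25/05'],
    'Count': [3, 4, 6, 10],
})

def count_if_match(row):
    # contribute the count only for flow 9 on 07/24/05
    if row['Flow_ID'] == 9 and row['Date'] == '07/24/05':
        return row['Count']
    return 0

# row-wise apply: the function is called once per row
total_apply = toy.apply(count_if_match, axis=1).sum()

# boolean mask: the comparison is evaluated on whole columns at once
mask = (toy['Flow_ID'] == 9) & (toy['Date'] == '07/24/05')
total_mask = toy.loc[mask, 'Count'].sum()

print(total_apply)
print(total_mask)
```

Both approaches give the same total. On large frames the mask is usually the faster of the two, because it avoids a python function call per row.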

#### Indexing and Selecting ####
Explain the following

1. loc:  subsets a dataframe based on index label; most suitable when a dataframe has meaningful and clearly labeled non-numerical indices (such as months in order)

2. iloc:  subsets a dataframe based on numerical index position; suitable use case would be returning the top N rows of a dataframe after sorting

3. ix:  subsets a dataframe based on index label or position; most general and flexible but can be confusing and produce unexpected results, which is why it was deprecated in pandas 0.20 and removed in pandas 1.0; use case would be subsetting a dataframe with mixed positional and label based indices

4. at:  similar to loc, but faster, and can only be used to access a single element from a dataframe; use case is the same as loc, but if you only need to access a single element from the dataframe

5. iat:  similar to iloc, but faster, and can only be used to access a single element from a dataframe; use case is the same as iloc, but if you only need to access a single element from the dataframe

Highlight the differences by providing use cases where one is more useful than the other.
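
A minimal sketch of the differences on a made-up three-row frame with string labels (the frame `toy` is illustrative only):

```python
import pandas as pd

toy = pd.DataFrame({'Count': [4, 8, 15]}, index=['jan', 'feb', 'mar'])

# loc slices by label and includes both endpoints
by_label = toy.loc['jan':'feb']
print(by_label)

# iloc slices by position and excludes the stop position
by_position = toy.iloc[0:1]
print(by_position)

# at and iat return a single value, by label and by position respectively
print(toy.at['mar', 'Count'])
print(toy.iat[2, 0])
```

The label slice returns two rows while the positional slice returns one, which is the most common source of off-by-one surprises when switching between the two.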

Write a function that takes two dates as input and returns all flow ids and counts in that date range, with both dates inclusive. You can use the pandas to_datetime function to convert the dates to pandas datetime format.

In [None]:
# function to return all flow ids and counts falling within specified date range
def dateRange(start, end):
    # work on a copy so the Date column of the global df is not overwritten
    return_df = df.copy()
    return_df['Date'] =  pd.to_datetime(return_df['Date'], format='%m/%d/%y')
    return_df = return_df.sort_values('Date', ascending = 1)
    return_df = return_df.set_index('Date')
    return_df = return_df.loc[start:end]
    return_df.reset_index(level=0, inplace=True)
    return_df = return_df.sort_values(['Date','Time'], ascending = [1, 1])
    return_df['Date'] = return_df['Date'].dt.strftime('%m/%d/%y')
    return_df = return_df[['Flow_ID','Date','Time','Count']]
    return return_df

# test dateRange function
test_df = dateRange('08/01/05','08/31/05')
print(test_df.head())
print('\n')
print(test_df.tail())
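
An alternative way to select an inclusive date range is a boolean mask on parsed dates, which leaves the input frame untouched. This is a sketch on made-up data. The function name `date_range_mask` and its parameters are assumptions of this example, not part of the exercise.

```python
import pandas as pd

toy = pd.DataFrame({
    'Flow_ID': [7, 9, 7, 9],
    'Date': ['07/31/05', '08/01/05', '08/31/05', '09/01/05'],
    'Count': [1, 2, 3, 4],
})

def date_range_mask(frame, start, end, fmt='%m/%d/%y'):
    # parse the string dates once, without modifying the input frame
    dates = pd.to_datetime(frame['Date'], format=fmt)
    lower = pd.to_datetime(start, format=fmt)
    upper = pd.to_datetime(end, format=fmt)
    # both comparisons are inclusive
    return frame[(dates >= lower) & (dates <= upper)]

august = date_range_mask(toy, '08/01/05', '08/31/05')
print(august)
```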

#### Grouping ####
1. Select data in the month of August 2005 having flow id=7
2. Group the data based on date and get the max count per date

In [None]:
# use dateRange function written above to select data for August 2005
subset_august = dateRange('08/01/05','08/31/05')

# subset August 2005 data by flow id = 7
subset_august = subset_august[subset_august['Flow_ID'] == 7]

subset_august.head()

In [None]:
subset_august.tail()

In [None]:
# group by date and get max count by date
print(subset_august.groupby('Date')['Count'].max())

# in case the intention was to group by date and get total count by date
#print subset_august.groupby('Date')['Count'].sum()

#### Stacking, Unstacking ####
1. Stack the data with count and flow_id as indexes
2. Use reset_index to reset the stacked hierarchy by 1 level. The index then will just be the counts
3. Unstack the data to get back original data

In [None]:
# stacking then applying 'reset_index'
stacked_df = df.set_index(['Flow_ID', 'Count'], append = True)
stacked_df = stacked_df.stack()
stacked_df = stacked_df.reset_index()
print(stacked_df.head())

# print newline for readability
print('\n')

# stacking and unstacking, then applying 'reset_index'
stacked_df = df.set_index(['Flow_ID', 'Count'], append = True)
stacked_df = stacked_df.stack()
stacked_df = stacked_df.unstack()
stacked_df = stacked_df.reset_index(['Flow_ID','Count'])
stacked_df
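
To see what `stack` and `unstack` do in isolation, here is a small sketch on a made-up two-row frame (the names `toy`, `stacked` and `restored` are illustrative):

```python
import pandas as pd

toy = pd.DataFrame({'Date': ['07/24/05', '07/24/05'],
                    'Time': ['00:00:00', '00:30:00']})

# stack moves the column labels into the innermost index level,
# giving one long Series indexed by (row label, column name)
stacked = toy.stack()
print(stacked)

# unstack pivots that innermost level back into columns
restored = stacked.unstack()
print(restored)
```

Here `restored` holds the same values as `toy`, which is the round trip the third step of the exercise asks for.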

#### Pandas and Matplotlib

Plot a histogram of date vs total counts for flow_id=7 and flow_id=9 for the month of July 2005

In [None]:
# use dateRange function to create dataframe with only data from July 2005
july_df = dateRange('07/01/05','07/31/05')

# reformat dataframe to get flow ids in separate columns
hist_df = july_df.groupby(['Date','Flow_ID'])['Count'].sum().unstack('Flow_ID')

# extract data by flow_id
f7 = hist_df[7]
f9 = hist_df[9]

# determine dates and set x_index values
dates = july_df['Date'].unique()
x_index = np.arange(len(dates))

# plot Flow_ID = 7
plt.bar(x_index, f7, align='center', color='blue', alpha=0.5)
plt.xticks(x_index, dates, rotation = 90)
plt.title('Daily July Counts for Flow ID = 7')
plt.xlabel('Date in July 2005') 
plt.ylabel('Count')
plt.show()

# plot Flow_ID = 9
plt.bar(x_index, f9, align='center', color='green', alpha=0.5)
plt.xticks(x_index, dates, rotation = 90)
plt.title('Daily July Counts for Flow ID = 9')
plt.xlabel('Date in July 2005') 
plt.ylabel('Count')
plt.show()