# Python Data Handling:
## A Deeper Dive

This notebook works through the challenges in Dave Beazley's presentation:

http://www.dabeaz.com/datadeepdive/DataDeepDive.pdf

In [1]:
from dateutil.parser import parse
from datetime import date as Date
cta_in = open("../BIG_DATA/ctabus.csv")
next(cta_in)  # skip first line

'route,date,daytype,rides\n'

In [2]:
records = {}
for row in cta_in:
    route,route_date,daytype,rides = row.rstrip().split(',')
    # print(route, route_date, daytype, rides)  # debug: inspect each parsed row
    key = (route, parse(route_date).date(), daytype)
    value = int(rides)
    records[key] = value

print(len(records))
cta_in.close()

736461
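
As a hedged alternative to the loading cell above, the sketch below reads the same four-column layout with the standard `csv` module and `datetime.strptime` instead of `dateutil`. It assumes every date is in `MM/DD/YYYY` form, and it runs on a small in-memory sample rather than the real file. The `SAMPLE` string and the `load_records` helper are illustrative names, not part of the original notebook.

```python
import csv
import io
from datetime import datetime

# Hypothetical sample in the same layout as ctabus.csv (header + rows).
SAMPLE = """route,date,daytype,rides
3,01/01/2001,U,7354
4,01/01/2001,U,9288
22,04/09/2007,W,24154
"""

def load_records(lines):
    """Build {(route, date, daytype): rides} from an iterable of CSV lines."""
    records = {}
    for row in csv.DictReader(lines):
        # Assumes MM/DD/YYYY dates; strptime raises ValueError otherwise.
        day = datetime.strptime(row["date"], "%m/%d/%Y").date()
        records[(row["route"], day, row["daytype"])] = int(row["rides"])
    return records

records = load_records(io.StringIO(SAMPLE))
print(len(records))
```

With the real file, the same helper could be called as `load_records(f)` inside a `with open("../BIG_DATA/ctabus.csv", newline="") as f:` block, which also closes the file automatically.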


Answer a few questions about the Chicago bus data:
1. How many bus routes exist?
2. How many people rode route 22 on 9-Apr-2007?
3. What are the 10 most popular routes?
4. What are the 10 most popular routes in 2016?
5. Which 10 routes had the greatest increase in ridership from 2001 to 2016?

In [3]:
# how many routes
len(set(r[0] for r in records))

185

In [4]:
# how many rode route 22 on 4/9/07?

for x, y in records.items():
    print(x, y)
    break
    
records[("22", Date(2007, 4, 9), 'W')]


('3', datetime.date(2001, 1, 1), 'U') 7354


24154
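
Questions 3 to 5 are not answered in the cells above. One possible approach, sketched here under the assumption that `records` keeps the `(route, date, daytype) -> rides` layout built earlier, is to total rides per route with `collections.Counter`. The tiny `records` dict and the `rides_by_route` helper are made up for illustration; the numbers are not real CTA data.

```python
from collections import Counter
from datetime import date

# Made-up sample in the same shape as the notebook's `records` dict.
records = {
    ("3", date(2001, 1, 2), "W"): 100,
    ("3", date(2016, 1, 4), "W"): 250,
    ("4", date(2001, 1, 2), "W"): 300,
    ("4", date(2016, 1, 4), "W"): 200,
    ("22", date(2016, 1, 4), "W"): 400,
}

def rides_by_route(records, year=None):
    """Total rides per route, optionally restricted to one calendar year."""
    totals = Counter()
    for (route, day, _daytype), rides in records.items():
        if year is None or day.year == year:
            totals[route] += rides
    return totals

# Q3: most popular routes over the whole data set.
top_all = rides_by_route(records).most_common(10)

# Q4: most popular routes in 2016 only.
totals_2016 = rides_by_route(records, 2016)
top_2016 = totals_2016.most_common(10)

# Q5: greatest increase from 2001 to 2016 (Counter.subtract keeps negatives).
change = Counter(totals_2016)
change.subtract(rides_by_route(records, 2001))
top_increase = change.most_common(10)

print(top_all)
print(top_2016)
print(top_increase)
```

Under this reading of question 5, a route that did not run in 2001 shows its full 2016 total as the increase, which may or may not be the intended interpretation.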

## Using pandas on the same data

In [5]:
import pandas as pd
df = pd.read_csv(
    "../BIG_DATA/ctabus.csv", 
    dtype={'date': 'category', 'route': "category", 'daytype': "category"}
)
df.head()

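|   | route | date | daytype | rides |
|---|-------|------|---------|-------|
| 0 | 3 | 01/01/2001 | U | 7354 |
| 1 | 4 | 01/01/2001 | U | 9288 |
| 2 | 6 | 01/01/2001 | U | 6048 |
| 3 | 8 | 01/01/2001 | U | 6309 |
| 4 | 9 | 01/01/2001 | U | 11207 |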


In [6]:
df.info(memory_usage="deep")

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 736461 entries, 0 to 736460
Data columns (total 4 columns):
route      736461 non-null category
date       736461 non-null category
daytype    736461 non-null category
rides      736461 non-null int64
dtypes: category(3), int64(1)
memory usage: 9.7 MB


In [None]:
df.daytype.unique()

In [None]:
len(df.route.unique())

In [None]:
len(df.date.unique())

In [None]:
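# `date` was loaded as a category of 'MM/DD/YYYY' strings, so compare against a string
# (9 April 2007, matching question 2) and leave `df` itself unchanged
df[(df.route == '22') & (df.date == '04/09/2007')]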

## Now using awk

In [None]:
%%bash
time awk -F, 'FNR == 1 {next} { foo[$1]++ } END { print length(foo)}' ../BIG_DATA/ctabus.csv