### Challenge 1
Open a new IPython notebook.
Download a few MTA turnstile data files.
Open one file and read it with the csv reader. Build a Python dict with one key for each (C/A, UNIT, SCP, STATION) tuple; these are the first four columns. The value for each key should be a list of lists, where each inner list holds the remaining columns of one row. For example, one key-value pair should look like
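
The cells below load the data with pandas instead, so here is a minimal sketch of the plain `csv.reader` version described above. The helper name `read_turnstile_rows` is hypothetical, and the two sample rows are copied from the `turnstile.head()` output shown later in this notebook; a real file would be opened with `open(path, newline='')` rather than `io.StringIO`.

```python
import csv
import io
from collections import defaultdict

# Sample rows copied from the turnstile.head() output later in this notebook.
sample = io.StringIO(
    "C/A,UNIT,SCP,STATION,LINENAME,DIVISION,DATE,TIME,DESC,ENTRIES,EXITS\n"
    "A002,R051,02-00-00,59 ST,NQR456W,BMT,01/05/2019,03:00:00,REGULAR,6897012,2338472\n"
    "A002,R051,02-00-00,59 ST,NQR456W,BMT,01/05/2019,07:00:00,REGULAR,6897023,2338487\n"
)

def read_turnstile_rows(handle):
    """Map (C/A, UNIT, SCP, STATION) -> list of the remaining columns of each row."""
    reader = csv.reader(handle)
    next(reader)  # skip the header row
    rows_by_turnstile = defaultdict(list)
    for row in reader:
        key = tuple(row[:4])  # the first four columns identify the turnstile
        rows_by_turnstile[key].append([cell.strip() for cell in row[4:]])
    return dict(rows_by_turnstile)

turnstile_rows = read_turnstile_rows(sample)
print(turnstile_rows[('A002', 'R051', '02-00-00', '59 ST')][0])
# ['NQR456W', 'BMT', '01/05/2019', '03:00:00', 'REGULAR', '6897012', '2338472']
```

All values stay strings here because `csv.reader` does no type conversion.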

In [102]:
import pandas as pd
from datetime import datetime
import csv
import urllib.request
import codecs

In [103]:
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', 25)
pd.set_option('display.precision', 3)

In [104]:
# Copy desired dates from website: http://web.mta.info/developers/turnstile.html
raw_dates_txt = """
Saturday, January 12, 2019
Saturday, January 05, 2019
"""

In [105]:
# Convert e.g. 'Saturday, January 12, 2019' into the 'yymmdd' suffix used in the file names
date_obj_lst = [datetime.strptime(line.strip(), '%A, %B %d, %Y').strftime('%y%m%d')
                for line in raw_dates_txt.split('\n') if line.strip()]

turnstile_url = ['http://web.mta.info/developers/data/nyct/turnstile/turnstile_' + date + '.txt' for date in date_obj_lst]

# Concatenate the weekly files into one CSV, keeping only the first file's header row
with open('turnstile_data', 'w', newline='') as outfile:
    writer = csv.writer(outfile, delimiter=',')
    for i, url in enumerate(turnstile_url):
        ftpstream = urllib.request.urlopen(url)
        csvfile = csv.reader(codecs.iterdecode(ftpstream, 'utf-8'))
        if i > 0:
            next(csvfile, None)  # skip the header row of every file after the first
        writer.writerows(csvfile)

In [106]:
# Pass the path directly; this avoids shadowing the built-in input()
turnstile = pd.read_csv('turnstile_data')

In [92]:
turnstile.columns

Index(['C/A', 'UNIT', 'SCP', 'STATION', 'LINENAME', 'DIVISION', 'DATE', 'TIME',
       'DESC', 'ENTRIES',
       'EXITS                                                               '],
      dtype='object')

In [107]:
turnstile.columns = [col.strip() for col in turnstile.columns]

In [83]:
turnstile_dict = ['C/A','UNIT','SCP','STATION']
turnstile_dict_df = turnstile.set_index(turnstile_dict)  # make MultiIndex
# turnstile_dict_df.head()


## Challenge 2
Let's turn this into a time series.
For each key (the control area, unit, device address, and station that identify a specific turnstile), build a list again, but this time keep only the point in time and the count of entries.

This means keeping only the date, time, and entries fields from each row. Convert the date and time into a datetime object, a Python class that represents a point in time: combine the date and time fields into one string and parse it with the dateutil module.
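
A minimal sketch of that conversion on the dict-of-lists shape from Challenge 1, assuming each row is `[LINENAME, DIVISION, DATE, TIME, DESC, ENTRIES, EXITS]`. The helper name `to_time_series` is hypothetical, and the sample rows are copied from the `turnstile.head()` output in this notebook.

```python
from dateutil import parser  # third-party package, installed alongside pandas

# Assumed input shape: key -> rows of
# [LINENAME, DIVISION, DATE, TIME, DESC, ENTRIES, EXITS], all as strings.
rows_by_turnstile = {
    ('A002', 'R051', '02-00-00', '59 ST'): [
        ['NQR456W', 'BMT', '01/05/2019', '03:00:00', 'REGULAR', '6897012', '2338472'],
        ['NQR456W', 'BMT', '01/05/2019', '07:00:00', 'REGULAR', '6897023', '2338487'],
    ],
}

def to_time_series(rows_by_key):
    """Map each key to a list of (datetime, entries) pairs."""
    series = {}
    for key, rows in rows_by_key.items():
        series[key] = [
            # DATE is column 2, TIME is column 3, ENTRIES is column 5
            (parser.parse(row[2] + ' ' + row[3]), int(row[5]))
            for row in rows
        ]
    return series

time_series = to_time_series(rows_by_turnstile)
print(time_series[('A002', 'R051', '02-00-00', '59 ST')][0])
# (datetime.datetime(2019, 1, 5, 3, 0), 6897012)
```

`parser.parse` reads `01/05/2019` month-first by default, which matches the MTA date format.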

In [84]:
turnstile.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 402693 entries, 0 to 402692
Data columns (total 11 columns):
C/A         402693 non-null object
UNIT        402693 non-null object
SCP         402693 non-null object
STATION     402693 non-null object
LINENAME    402693 non-null object
DIVISION    402693 non-null object
DATE        402693 non-null object
TIME        402693 non-null object
DESC        402693 non-null object
ENTRIES     402693 non-null int64
EXITS       402693 non-null int64
dtypes: int64(2), object(9)
memory usage: 33.8+ MB


In [108]:
turnstile['DATE & TIME']=turnstile['DATE'] +' '+ turnstile['TIME']

In [113]:
turnstile.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 402693 entries, 0 to 402692
Data columns (total 12 columns):
C/A            402693 non-null object
UNIT           402693 non-null object
SCP            402693 non-null object
STATION        402693 non-null object
LINENAME       402693 non-null object
DIVISION       402693 non-null object
DATE           402693 non-null object
TIME           402693 non-null object
DESC           402693 non-null object
ENTRIES        402693 non-null int64
EXITS          402693 non-null int64
DATE & TIME    402693 non-null datetime64[ns]
dtypes: datetime64[ns](1), int64(2), object(9)
memory usage: 36.9+ MB


In [114]:
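import dateutil.parser

# Parse each 'MM/DD/YYYY HH:MM:SS' string into a datetime.
# (No `import datetime` here: it would shadow the `datetime` class imported above.)
turnstile['DATE & TIME'] = turnstile['DATE & TIME'].apply(dateutil.parser.parse)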

In [115]:
turnstile.head()

| | C/A | UNIT | SCP | STATION | LINENAME | DIVISION | DATE | TIME | DESC | ENTRIES | EXITS | DATE & TIME |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | A002 | R051 | 02-00-00 | 59 ST | NQR456W | BMT | 01/05/2019 | 03:00:00 | REGULAR | 6897012 | 2338472 | 2019-01-05 03:00:00 |
| 1 | A002 | R051 | 02-00-00 | 59 ST | NQR456W | BMT | 01/05/2019 | 07:00:00 | REGULAR | 6897023 | 2338487 | 2019-01-05 07:00:00 |
| 2 | A002 | R051 | 02-00-00 | 59 ST | NQR456W | BMT | 01/05/2019 | 11:00:00 | REGULAR | 6897083 | 2338565 | 2019-01-05 11:00:00 |
| 3 | A002 | R051 | 02-00-00 | 59 ST | NQR456W | BMT | 01/05/2019 | 15:00:00 | REGULAR | 6897262 | 2338624 | 2019-01-05 15:00:00 |
| 4 | A002 | R051 | 02-00-00 | 59 ST | NQR456W | BMT | 01/05/2019 | 19:00:00 | REGULAR | 6897572 | 2338679 | 2019-01-05 19:00:00 |


In [120]:
# The DATE strings look like '01/05/2019' (month/day/year)
turnstile['DATE'] = pd.to_datetime(turnstile['DATE'], format='%m/%d/%Y')

## Challenge 3
These counts are recorded every n hours. (What is n?) We want total daily entries.
Keep the same keys, but store a single value per day: the total number of passengers who entered through that turnstile on that day.
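
Because ENTRIES is a cumulative counter, one way to get daily totals is to difference consecutive readings per turnstile and then sum the differences per day. The following is a sketch under that assumption, not the notebook's own method: the helper `daily_entries` and the `DATETIME` and `DAY` columns are hypothetical names, the first six readings are copied from outputs in this notebook, and the two readings on 01/06/2019 are made up for illustration. Counter resets and negative differences are not handled.

```python
import pandas as pd

KEYS = ['C/A', 'UNIT', 'SCP', 'STATION']

readings = [
    ('01/05/2019', '03:00:00', 6897012),
    ('01/05/2019', '07:00:00', 6897023),
    ('01/05/2019', '11:00:00', 6897083),
    ('01/05/2019', '15:00:00', 6897262),
    ('01/05/2019', '19:00:00', 6897572),
    ('01/05/2019', '23:00:00', 6897740),
    ('01/06/2019', '03:00:00', 6897800),  # hypothetical reading
    ('01/06/2019', '07:00:00', 6897810),  # hypothetical reading
]
sample = pd.DataFrame(readings, columns=['DATE', 'TIME', 'ENTRIES']).assign(
    **{'C/A': 'A002', 'UNIT': 'R051', 'SCP': '02-00-00', 'STATION': '59 ST'}
)

def daily_entries(df, keys=KEYS):
    """Sum per-reading differences of the cumulative ENTRIES counter per day."""
    df = df.copy()
    df['DATETIME'] = pd.to_datetime(df['DATE'] + ' ' + df['TIME'],
                                    format='%m/%d/%Y %H:%M:%S')
    df = df.sort_values(keys + ['DATETIME'])
    # Difference between consecutive readings of the same turnstile;
    # the first reading of each turnstile has no predecessor and becomes NaN.
    df['ENTRIES DIFF'] = df.groupby(keys)['ENTRIES'].diff()
    df['DAY'] = df['DATETIME'].dt.normalize()
    # sum() skips NaN, so the first reading contributes nothing
    return df.groupby(keys + ['DAY'])['ENTRIES DIFF'].sum()

daily = daily_entries(sample)
print(daily)
```

With this sample the totals are 728 for 2019-01-05 and 70 for 2019-01-06. Each difference is attributed to the day of the later reading, so the interval that crosses midnight counts toward the following day.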



In [121]:
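# Select columns with a list; DATE cannot be summed, so only ENTRIES is kept.
# NOTE: ENTRIES is a cumulative counter, so this adds up counter readings per
# turnstile over the whole period; true daily entries require differences
# between consecutive readings.
turnstile.groupby(turnstile_dict)[['ENTRIES']].sum()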

| C/A | UNIT | SCP | STATION | ENTRIES |
|---|---|---|---|---|
| A002 | R051 | 02-00-00 | 59 ST | 565517642 |
| A002 | R051 | 02-00-01 | 59 ST | 511097911 |
| A002 | R051 | 02-03-00 | 59 ST | 97657711 |
| A002 | R051 | 02-03-01 | 59 ST | 79344866 |
| A002 | R051 | 02-03-02 | 59 ST | 501366118 |
| A002 | R051 | 02-03-03 | 59 ST | 465674501 |
| A002 | R051 | 02-03-04 | 59 ST | 578981091 |
| A002 | R051 | 02-03-05 | 59 ST | 929665773 |
| A002 | R051 | 02-03-06 | 59 ST | 730233414 |
| A002 | R051 | 02-05-00 | 59 ST | 7438 |


## Challenge 4
We will plot the daily time series for a turnstile.
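
A minimal plotting sketch for one turnstile, assuming a daily series has already been computed. The numbers below are made up purely for illustration and are not MTA data; the variable name `daily` is hypothetical.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical daily totals for a single turnstile (illustrative values only)
daily = pd.Series(
    [728, 70, 1450, 1502, 1488, 1530, 1611],
    index=pd.date_range('2019-01-05', periods=7, freq='D'),
    name='DAILY ENTRIES',
)

fig, ax = plt.subplots(figsize=(10, 3))
ax.plot(daily.index, daily.values, marker='o')
ax.set_xlabel('Date')
ax.set_ylabel('Entries')
ax.set_title('Daily entries for one turnstile (sample data)')
fig.autofmt_xdate()  # tilt the date labels so they do not overlap
```

In a notebook with `%matplotlib inline` the figure renders below the cell; in a script, call `plt.show()` or `fig.savefig(...)`.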

In [133]:
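import matplotlib.pyplot as plt
%matplotlib inline

# Sum the ENTRIES readings of all turnstiles per date.
# NOTE: ENTRIES is a cumulative counter, so these sums are totals of counter
# readings, not the number of riders on each day.
df = (turnstile.groupby(['DATE'])['ENTRIES']
               .sum()
               .reset_index(name='TOTAL ENTRIES per DAY'))
df.head()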

| | DATE | TOTAL ENTRIES per DAY |
|---|---|---|
| 0 | 2018-12-29 | 1147188776251 |
| 1 | 2018-12-30 | 1161499174459 |
| 2 | 2018-12-31 | 1164076390193 |
| 3 | 2019-01-01 | 1152861327413 |
| 4 | 2019-01-02 | 1151465206350 |


### Challenge 9
**Over multiple weeks, sum the total ridership for each station and sort the results to find the stations with the highest traffic during the period you investigate.**
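
A minimal sketch of the sum-and-sort step, assuming a frame with a `STATION` column and a per-interval `ENTRIES DIFF` column like the one used in the cells below. The station names appear in this notebook's output, but the counts here are made up for illustration.

```python
import pandas as pd

# Hypothetical per-interval entry counts (illustrative values only)
sample = pd.DataFrame({
    'STATION': ['59 ST', '59 ST', '1 AV', '1 AV', 'YORK ST'],
    'ENTRIES DIFF': [11.0, 60.0, 250.0, 300.0, 40.0],
})

# Total per station over the whole period, busiest first
station_totals = (sample.groupby('STATION')['ENTRIES DIFF']
                        .sum()
                        .sort_values(ascending=False))
print(station_totals.head(10))
```

Here `1 AV` comes first with 550, followed by `59 ST` with 71 and `YORK ST` with 40.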

In [188]:
Jan05['year_month'] = pd.to_datetime(Jan05['DATE']).dt.to_period('M')

In [189]:
Jan05.head()

| | C/A | UNIT | SCP | STATION | LINENAME | DIVISION | DATE | TIME | DESC | ENTRIES | EXITS | ENTRIES DIFF | EXITS DIFF | year_month |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | A002 | R051 | 02-00-00 | 59 ST | NQR456W | BMT | 01/05/2019 | 07:00:00 | REGULAR | 6897023 | 2338487 | 11.0 | 15.0 | 2019-01 |
| 2 | A002 | R051 | 02-00-00 | 59 ST | NQR456W | BMT | 01/05/2019 | 11:00:00 | REGULAR | 6897083 | 2338565 | 60.0 | 78.0 | 2019-01 |
| 3 | A002 | R051 | 02-00-00 | 59 ST | NQR456W | BMT | 01/05/2019 | 15:00:00 | REGULAR | 6897262 | 2338624 | 179.0 | 59.0 | 2019-01 |
| 4 | A002 | R051 | 02-00-00 | 59 ST | NQR456W | BMT | 01/05/2019 | 19:00:00 | REGULAR | 6897572 | 2338679 | 310.0 | 55.0 | 2019-01 |
| 5 | A002 | R051 | 02-00-00 | 59 ST | NQR456W | BMT | 01/05/2019 | 23:00:00 | REGULAR | 6897740 | 2338703 | 168.0 | 24.0 | 2019-01 |


In [190]:
Jan05['year_month_station'] = Jan05['year_month'].map(str) + ' ' + Jan05['STATION']

In [191]:
Jan05.head()

| | C/A | UNIT | SCP | STATION | LINENAME | DIVISION | DATE | TIME | DESC | ENTRIES | EXITS | ENTRIES DIFF | EXITS DIFF | year_month | year_month_station |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | A002 | R051 | 02-00-00 | 59 ST | NQR456W | BMT | 01/05/2019 | 07:00:00 | REGULAR | 6897023 | 2338487 | 11.0 | 15.0 | 2019-01 | 2019-01 59 ST |
| 2 | A002 | R051 | 02-00-00 | 59 ST | NQR456W | BMT | 01/05/2019 | 11:00:00 | REGULAR | 6897083 | 2338565 | 60.0 | 78.0 | 2019-01 | 2019-01 59 ST |
| 3 | A002 | R051 | 02-00-00 | 59 ST | NQR456W | BMT | 01/05/2019 | 15:00:00 | REGULAR | 6897262 | 2338624 | 179.0 | 59.0 | 2019-01 | 2019-01 59 ST |
| 4 | A002 | R051 | 02-00-00 | 59 ST | NQR456W | BMT | 01/05/2019 | 19:00:00 | REGULAR | 6897572 | 2338679 | 310.0 | 55.0 | 2019-01 | 2019-01 59 ST |
| 5 | A002 | R051 | 02-00-00 | 59 ST | NQR456W | BMT | 01/05/2019 | 23:00:00 | REGULAR | 6897740 | 2338703 | 168.0 | 24.0 | 2019-01 | 2019-01 59 ST |


In [192]:
Jan05.groupby('year_month_station')['ENTRIES DIFF'].sum()

year_month_station
2018-12 1 AV               3.940e+09
2018-12 103 ST             1.875e+08
2018-12 103 ST-CORONA      6.658e+07
2018-12 104 ST             8.389e+09
2018-12 110 ST             2.966e+07
2018-12 111 ST             2.048e+08
2018-12 116 ST             4.485e+09
2018-12 116 ST-COLUMBIA    3.473e+09
2018-12 121 ST             5.838e+09
2018-12 125 ST             1.085e+10
2018-12 135 ST             2.795e+08
2018-12 137 ST CITY COL    5.649e+08
                             ...    
2019-01 WEST FARMS SQ      4.287e+08
2019-01 WESTCHESTER SQ     7.163e+06
2019-01 WHITEHALL S-FRY    1.394e+08
2019-01 WHITLOCK AV        3.980e+06
2019-01 WILSON AV          7.731e+05
2019-01 WINTHROP ST        1.625e+07
2019-01 WOODHAVEN BLVD     3.812e+07
2019-01 WOODLAWN           5.970e+06
2019-01 WORLD TRADE CTR    1.587e+09
2019-01 WTC-CORTLANDT      7.309e+08
2019-01 YORK ST            1.019e+05
2019-01 ZEREGA AV          8.418e+05
Name: ENTRIES DIFF, Length: 754, dtype: float64

In [None]:
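# Total entries per station over the whole period, busiest first
df_station_total = (Jan05.groupby('STATION')['ENTRIES DIFF']
                         .sum()
                         .sort_values(ascending=False))
df_station_total

In [None]:
# Plot the 20 busiest stations
df_station_total.head(20).plot(kind='bar', figsize=(20, 10))

In [None]:
ym_station = list(Jan05['year_month_station'].unique())

In [None]:
# Index by the combined month/station label and sort by it
df_ym_station_sort = (Jan05.set_index('year_month_station')
                           .sort_index())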