# MTA 1 Exercise


## Exercise 1

- Open up a new IPython notebook
- Download MTA turnstile data files for the following three dates: `160903`, `160910`, `160917`

- Load the files into a pandas DataFrame (hint: `pd.read_csv()` to load files and `pd.concat()` to combine DataFrames)


In [3]:
import pandas as pd

one = pd.read_csv("http://web.mta.info/developers/data/nyct/turnstile/turnstile_160903.txt")
two = pd.read_csv("http://web.mta.info/developers/data/nyct/turnstile/turnstile_160910.txt")
three = pd.read_csv("http://web.mta.info/developers/data/nyct/turnstile/turnstile_160917.txt")

In [None]:
one

In [None]:
df = pd.concat([one, two, three], join="inner", ignore_index=True)

In [None]:
df.head()
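
The three downloads differ only in the date suffix, so the loading step can also be written as a loop. The sketch below assumes the same URL pattern as the cells above. `turnstile_url` and `load_weeks` are hypothetical helper names, and `load_weeks` needs network access (the files may no longer be served at this address).

```python
import pandas as pd

BASE_URL = "http://web.mta.info/developers/data/nyct/turnstile/turnstile_{}.txt"


def turnstile_url(week):
    """Build the download URL for one weekly file, e.g. week=160903."""
    return BASE_URL.format(week)


def load_weeks(weeks):
    """Read each weekly file and stack them into one DataFrame."""
    frames = [pd.read_csv(turnstile_url(week)) for week in weeks]
    return pd.concat(frames, ignore_index=True)


weeks = [160903, 160910, 160917]
urls = [turnstile_url(week) for week in weeks]
# df = load_weeks(weeks)  # uncomment to download; requires network access
```

Keeping the list of weeks in one place makes it easy to add more files later without copying another `pd.read_csv` line.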

## Exercise 2

- Let's turn this into a time series.

- Our pandas DataFrame has columns called `DATE` and `TIME` (what datatype did pandas assign to these columns on import?). In Python and pandas, we can convert date and time information to _datetime_ objects, which allow us to do time-based operations.

- Using either [pd.to_datetime](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.to_datetime.html) in pandas or the [Python datetime library](https://docs.python.org/2/library/datetime.html), combine the `DATE` and `TIME` columns into a single new column of the datetime datatype.

In [None]:
df.dtypes

Pandas assigned an `object` (i.e. string) type to the `DATE` and `TIME` columns on import.

As a first cleaning step, I will strip whitespace from the column names, lowercase them, and replace `/` with `_`, so the columns are easier to work with later on.

In [None]:
df.columns = [column.strip().lower().replace('/','_') for column in df.columns] 
df.columns

In [None]:
df['date_time'] = pd.to_datetime(df.date + ' ' + df.time)

In [None]:
df.dtypes

The `date_time` column now has a `datetime64` dtype.
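
As an optional refinement, `pd.to_datetime` accepts an explicit `format`, which avoids guessing the layout and raises an error if a row does not match. The sketch below uses made-up rows and assumes the `MM/DD/YYYY` and `HH:MM:SS` layouts seen in the turnstile files.

```python
import pandas as pd

# Toy rows in the same layout as the turnstile files (values are made up).
toy = pd.DataFrame({
    "date": ["08/27/2016", "08/27/2016"],
    "time": ["00:00:00", "04:00:00"],
})

# An explicit format avoids format inference and fails loudly
# if a row is not month/day/year hour:minute:second.
toy["date_time"] = pd.to_datetime(
    toy["date"] + " " + toy["time"], format="%m/%d/%Y %H:%M:%S"
)

# Datetime columns support arithmetic, e.g. the gap between two readings.
elapsed = toy["date_time"].iloc[1] - toy["date_time"].iloc[0]
```

Here `elapsed` is a `Timedelta`, which is the kind of time-based operation that plain string columns do not allow.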

## Exercise 3a
* Each combination of C/A, UNIT, SCP, and STATION represents a unique turnstile. Take a look at one specific turnstile on a specific date. What does each row in the dataframe represent?
* Obtain the maximum ENTRIES value for each day, for each unique turnstile.

To look at a specific turnstile, I'll use a mask.

In [None]:
mask = ((df["c_a"] == "A002") &
        (df["unit"] == "R051") & 
        (df["scp"] == "02-00-00") & 
        (df["station"] == "59 ST"))

df[mask].head()

It looks like, for a unique turnstile, each row is a reading taken at roughly four-hour intervals, recording the CUMULATIVE number of entries and exits at that point in time.

As per the question, let's get the maximum `entries` value per turnstile per day and call the new dataframe `df_daily`. Assuming the counter only counts upward and the rows are in chronological order, the last reading of each day is also the maximum, so we take the last row in each day.

In [None]:
df_daily = (df
            .groupby(["c_a", "unit", "scp", "station", "date"],as_index=False)
            .entries.last())

In [None]:
df_daily
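
Taking the last row per day only matches the daily maximum when the rows are in time order and the counter never decreases. The sketch below, on made-up readings for a single hypothetical turnstile, compares `.last()` on unsorted rows, `.last()` after sorting by `date_time`, and `.max()`.

```python
import pandas as pd

# Made-up readings for one turnstile, deliberately out of time order.
toy = pd.DataFrame({
    "scp": ["02-00-00"] * 4,
    "date": ["08/27/2016", "08/27/2016", "08/28/2016", "08/28/2016"],
    "date_time": pd.to_datetime([
        "2016-08-27 20:00", "2016-08-27 04:00",
        "2016-08-28 04:00", "2016-08-28 20:00",
    ]),
    "entries": [150, 100, 180, 260],
})

keys = ["scp", "date"]

# Without sorting, .last() picks whichever row happens to come last.
unsorted_last = toy.groupby(keys, as_index=False).entries.last()

# Sorting by timestamp first makes .last() the final reading of the day.
sorted_last = (toy.sort_values("date_time")
               .groupby(keys, as_index=False).entries.last())

# .max() does not depend on row order at all.
daily_max = toy.groupby(keys, as_index=False).entries.max()
```

On this toy data the unsorted version reports 100 for the first day, while the sorted `.last()` and `.max()` both report 150.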

## Exercise 3b

- Work off of the daily maximum `ENTRIES` calculations from Problem 3a. Recall that the `ENTRIES` column contains cumulative entries on each day. We would now like you to calculate daily entries, i.e. the number of new entries gained each day
- Hint: Group the data by turnstile and use the Pandas `.apply()` method to compute the same differencing function for each turnstile. Check out the `.shift()` and `.diff()` DataFrame methods for this purpose
- Look through your results. Do they all make sense?

To calculate daily entries, I have to subtract the previous day's cumulative entries from each row's `entries` value. I can do this by grouping by turnstile and shifting the `date` and `entries` columns down one row within each group.

In [None]:
# Select the two columns with a list (double brackets), then shift within each turnstile
df_daily[['prev_date', 'prev_entries']] = (df_daily
                                   .groupby(["c_a", "unit", "scp", "station"])[["date", "entries"]]
                                   .shift(1))

In [None]:
df_daily.head()

The first row for each turnstile has no previous day, so its `prev_date` and `prev_entries` are NaN. I will drop those rows.

In [None]:
df_daily.dropna(subset=['prev_date'], inplace=True)

In [None]:
df_daily.head()

In [None]:
df_daily['daily_entries'] = df_daily['entries'] - df_daily['prev_entries']

In [None]:
df_daily.head()
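
The same differencing can be written in one step with `.diff()` on the grouped column, as the exercise hint suggests. A minimal sketch on made-up cumulative readings, where the turnstile labels `A` and `B` are placeholders:

```python
import pandas as pd

# Made-up end-of-day cumulative readings for two turnstiles.
toy = pd.DataFrame({
    "scp": ["A", "A", "A", "B", "B"],
    "date": ["08/27/2016", "08/28/2016", "08/29/2016",
             "08/27/2016", "08/28/2016"],
    "entries": [1000, 1400, 2100, 50, 80],
})

# diff() within each group equals entries - entries.shift(1).
toy["daily_entries"] = toy.groupby("scp")["entries"].diff()

# The first day of each turnstile has no previous reading, hence NaN.
toy_clean = toy.dropna(subset=["daily_entries"])
```

The shift-based version keeps `prev_date` and `prev_entries` around for inspection, which is useful when debugging odd values. `.diff()` is shorter when only the difference is needed.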

Now let's see if there's anything funky with the `daily_entries` column.

In [None]:
(df_daily[df_daily['daily_entries'] < 0]
 .groupby(['c_a', 'unit', 'scp', 'station'])).head()

Some `daily_entries` values are negative, meaning the previous day's cumulative count is higher than the current day's. Let's look at the rows where this happens.

In [None]:
df_daily[df_daily['entries'] < df_daily['prev_entries']].head()

Let's look at a specific instance of 2 consecutive days to see what is happening.

In [None]:
mask3 = ((df["c_a"] == "A011") &
        (df["unit"] == "R080") & 
        (df["scp"] == "01-00-00") & 
        (df["station"] == "57 ST-7 AV") &
        ((df["date"] == "08/28/2016") |
        (df["date"] == "08/29/2016")))

df[mask3]

The counter seems to be working in reverse, so I will assume we can simply flip the sign of `daily_entries` for the rows with negative values.

In [None]:
df_daily['daily_entries'] = df_daily['daily_entries'].abs()

In [188]:
# Let's see if there are any more negative values

(df_daily[df_daily['daily_entries'] < 0]).size

0
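
Flipping the sign rests on the assumption that every negative value comes from a counter running backwards. One way to probe that assumption, sketched here on made-up numbers, is to compute the share of negative days per turnstile before flipping: a reversed counter should be negative on nearly every day, while a one-off reset shows up as a single negative day. The name `share_negative` is hypothetical.

```python
import pandas as pd

# Made-up daily differences for two turnstiles, before any sign flip.
toy = pd.DataFrame({
    "scp": ["A"] * 4 + ["B"] * 4,
    "daily_entries": [-300, -250, -310, -280,   # A: counts down every day
                      500, 520, -900000, 510],  # B: one sudden drop
})

# Fraction of days with a negative difference, per turnstile.
share_negative = (toy.assign(is_negative=toy["daily_entries"] < 0)
                  .groupby("scp")["is_negative"].mean())
```

Turnstiles with a share near 1.0 fit the reversed-counter explanation. Isolated negatives, like turnstile `B` here, probably need different handling.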

## Exercise 4

So far we've been operating on a single turnstile level. Now let's combine turnstiles that fall within the same ControlArea/Unit/Station combo. There are some ControlArea/Unit/Station groups that have a single turnstile, but most have multiple turnstiles: same value for the C/A, UNIT and STATION columns, different values for the SCP column.

We want to combine the numbers together. For each ControlArea/UNIT/STATION combo, for each day, sum the counts from each turnstile belonging to that combo. (hint: `DataFrame.groupby`)

In [202]:
df_unit_station_daily = df_daily.groupby(['c_a', 'unit', 'station', 'date']).daily_entries.sum().reset_index()

In [203]:
df_unit_station_daily.head()

|   | c_a | unit | station | date | daily_entries |
|---|---|---|---|---|---|
| 0 | A002 | R051 | 59 ST | 08/28/2016 | 7896.0 |
| 1 | A002 | R051 | 59 ST | 08/29/2016 | 15462.0 |
| 2 | A002 | R051 | 59 ST | 08/30/2016 | 16622.0 |
| 3 | A002 | R051 | 59 ST | 08/31/2016 | 16557.0 |
| 4 | A002 | R051 | 59 ST | 09/01/2016 | 16464.0 |


## Exercise 5

Similarly, come up with daily time series for each STATION, by adding up all the turnstiles in a station.

In [207]:
df_station_daily = df_unit_station_daily.groupby(['station', 'date']).daily_entries.sum().reset_index()

In [209]:
df_station_daily.head()

|   | station | date | daily_entries |
|---|---|---|---|
| 0 | 1 AV | 08/28/2016 | 13871.0 |
| 1 | 1 AV | 08/29/2016 | 18064.0 |
| 2 | 1 AV | 08/30/2016 | 19182.0 |
| 3 | 1 AV | 08/31/2016 | 19616.0 |
| 4 | 1 AV | 09/01/2016 | 20170.0 |
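
The `date` column here is still a string. That sorts correctly while all dates fall in the same year, but not across a year boundary. To treat the result as a proper time series, one option is to parse the dates and pivot to one column per station. A sketch on made-up values, with dates chosen to cross a year boundary:

```python
import pandas as pd

# Made-up station totals spanning a year boundary.
toy = pd.DataFrame({
    "station": ["1 AV", "1 AV", "59 ST", "59 ST"],
    "date": ["12/31/2016", "01/01/2017", "12/31/2016", "01/01/2017"],
    "daily_entries": [100.0, 80.0, 300.0, 250.0],
})

# As strings, "01/01/2017" sorts before "12/31/2016"; parsing fixes that.
toy["date"] = pd.to_datetime(toy["date"], format="%m/%d/%Y")

# One column per station, one row per day, in chronological order.
series = toy.pivot(index="date", columns="station",
                   values="daily_entries").sort_index()
```

With a `DatetimeIndex`, each column of `series` is a daily time series for one station.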


## Exercise 6

Over multiple weeks, sum total ridership for each station and sort them, so you can find out the stations with the highest traffic during the time you investigate.

In [216]:
# Sum only daily_entries; a bare .sum() would also try to aggregate the string date column
df_station_alltime = df_station_daily.groupby('station').daily_entries.sum().sort_values(ascending=False).reset_index()

In [222]:
df_station_alltime

|   | station | daily_entries |
|---|---|---|
| 0 | HIGH ST | 1905251230.0 |
| 1 | 57 ST-7 AV | 1895370561.0 |
| 2 | EUCLID AV | 1508883271.0 |
| 3 | CHRISTOPHER ST | 1145702830.0 |
| 4 | 137 ST CITY COL | 418038454.0 |
| ... | ... | ... |
| 368 | BEACH 44 ST | 12581.0 |
| 369 | BEACH 105 ST | 6556.0 |
| 370 | BROAD CHANNEL | 5597.0 |
| 371 | ORCHARD BEACH | 1020.0 |
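
The totals at the top of this list run into the billions for three weeks of data, which is not plausible for a single station. This suggests that some `daily_entries` values come from counter resets or jumps rather than real traffic. One possible way to handle this, sketched below on made-up numbers, is to blank out daily values above a cutoff before summing. The cutoff `MAX_DAILY` (100,000 entries per turnstile per day) and the column name `daily_entries_clean` are assumptions, not values from the exercise.

```python
import pandas as pd

MAX_DAILY = 100_000  # assumed upper bound per turnstile per day

# Made-up daily values; one row mimics a counter jump.
toy = pd.DataFrame({
    "station": ["HIGH ST", "HIGH ST", "1 AV", "1 AV"],
    "daily_entries": [1200.0, 1_900_000_000.0, 900.0, 950.0],
})

# Replace implausible values with NaN; sum() skips NaN by default.
toy["daily_entries_clean"] = toy["daily_entries"].where(
    toy["daily_entries"] <= MAX_DAILY
)

totals = (toy.groupby("station")["daily_entries_clean"].sum()
          .sort_values(ascending=False))
```

In the real pipeline this filter would be applied at the turnstile level (on `df_daily`) before the station sums, and the ranking should then be re-checked, since the choice of cutoff affects which stations come out on top.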
