# MTA Traffic Forecasting
#### Justin Morgan & Khyatee Desai
This notebook covers the data collection, storage, and cleaning process. The data is sourced from the [MTA turnstile data archive](http://web.mta.info/developers/turnstile.html), which contains turnstile data from all NYC subway stations in comma-delimited text files, segmented by week.
# Part 1: Web Scrape MTA Data
The data is iteratively scraped from the webpage using BeautifulSoup, and the text files are currently stored locally within a "data" folder.
<br><br>
Future steps will involve storing this data within an S3 bucket in the AWS cloud for faster storage & retrieval.

In [4]:
# import necessary packages
import pandas as pd
import numpy as np
import os
import requests
import urllib.request
import time
from timeit import default_timer as timer
import humanfriendly
from bs4 import BeautifulSoup
from datetime import datetime as dt

### Get MTA turnstile data from publicly available website

In [7]:
url_root = r'http://web.mta.info/developers/' # set root url
# files are saved to the relative "./data/" folder (instead of an absolute local path) so the notebook works for both of us - k
starttime = timer() # start timer to time process

req = requests.get(url_root + 'turnstile.html') # send request
soup = BeautifulSoup(req.content, 'html.parser') # parse html and save to bs4 object
weekly_data = soup.find(class_='span-84 last') # find class_ where file links are located

## This way took much longer so used urllib.request.urlretrieve
# for file in weekly_data.findAll('a'):
#     print('Saving file turnstile ' + str(file)[39:49])
#     datafile = requests.get(url_root + str(file)[9:49])
#     with open(path + str(file)[39:49], 'w') as outf:
#         for line in datafile.text:
#             outf.writelines(line)
#     time.sleep(1)

# endtime = timer()
# print('Completed in ' + humanfriendly.format_timespan(endtime-starttime))

os.makedirs('./data/', exist_ok=True) # make sure the target folder exists

for one_a_tag in weekly_data.findAll('a')[:5]: # only the first 5 links on the page
    file = one_a_tag['href']
    datafile = url_root + file
    filename = os.path.basename(file) # e.g. turnstile_210313.txt
    urllib.request.urlretrieve(datafile, './data/' + filename)
    time.sleep(1) # pause between requests
    print('Saving file ' + filename.replace('_', ' '))
    
endtime = timer()
print('Completed in ' + humanfriendly.format_timespan(endtime-starttime))

Saving file turnstile 210313.txt
Saving file turnstile 210306.txt
Saving file turnstile 210227.txt
Saving file turnstile 210220.txt
Saving file turnstile 210213.txt
Completed in 11.89 seconds


Downloading all files took __47 minutes and 17.79 seconds__, and the resulting folder is __12.05GB__.

There may be a faster way to do this.
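
One possible improvement, sketched here rather than tested against the live site, is to skip files that are already in the data folder so that re-running the notebook only fetches new weeks. The helper name `missing_files` is hypothetical, and the href strings are assumed to have the same shape as the links scraped above.

```python
import os

def missing_files(hrefs, data_dir):
    """Return the hrefs whose file name is not yet present in data_dir."""
    existing = set(os.listdir(data_dir)) if os.path.isdir(data_dir) else set()
    return [href for href in hrefs if os.path.basename(href) not in existing]

# assumed href shape, based on the file names printed above
hrefs = [
    "data/nyct/turnstile/turnstile_210313.txt",
    "data/nyct/turnstile/turnstile_210306.txt",
]

# only these links would need to be passed to urlretrieve
to_fetch = missing_files(hrefs, "./data/")
print(to_fetch)
```

This does not make a first full download any faster; it only avoids repeating work on later runs.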

# Part 2: Data Cleaning
This part covers the data cleaning process.

### Inspect Data
Field Descriptions: http://web.mta.info/developers/resources/nyct/turnstile/ts_Field_Description.txt

`C/A      = Control Area (A002)`<br>
`UNIT     = Remote Unit for a station (R051)`<br>
`SCP      = Subunit Channel Position, which represents a specific address for a device (02-00-00)`<br>
`STATION  = Represents the station name the device is located at`<br>
`LINENAME = Represents all train lines that can be boarded at this station
           Normally lines are represented by one character.  LINENAME 456NQR represents train service for the 4, 5, 6, N, Q, and R trains.`<br>
`DIVISION = Represents the line the station originally belonged to: BMT, IRT, or IND`<br>
`DATE     = Represents the date (MM-DD-YY)`<br>
`TIME     = Represents the time (hh:mm:ss) for a scheduled audit event`<br>
`DESC     = Represents the "REGULAR" scheduled audit event (normally occurs every 4 hours)`<br>
        `1. Audits may occur more than 4 hours apart due to planning or troubleshooting activities.`<br>
        `2. Additionally, there may be a "RECOVR AUD" entry: This refers to a missed audit that was recovered.`<br>
`ENTRIES  = The cumulative entry register value for a device`<br>
`EXITS    = The cumulative exit register value for a device`<br>

In [8]:
df = pd.read_csv("./data/turnstile_210313.txt")
df.head()

,C/A,UNIT,SCP,STATION,LINENAME,DIVISION,DATE,TIME,DESC,ENTRIES,EXITS
0,A002,R051,02-00-00,59 ST,NQR456W,BMT,03/06/2021,03:00:00,REGULAR,7540642,2572027
1,A002,R051,02-00-00,59 ST,NQR456W,BMT,03/06/2021,07:00:00,REGULAR,7540645,2572030
2,A002,R051,02-00-00,59 ST,NQR456W,BMT,03/06/2021,11:00:00,REGULAR,7540676,2572093
3,A002,R051,02-00-00,59 ST,NQR456W,BMT,03/06/2021,15:00:00,REGULAR,7540764,2572128
4,A002,R051,02-00-00,59 ST,NQR456W,BMT,03/06/2021,19:00:00,REGULAR,7540904,2572160
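
Since ENTRIES and EXITS are cumulative registers, the raw values only become meaningful once consecutive readings are subtracted. As a small illustration, the sketch below copies the five rows shown above for one turnstile into a throwaway dataframe (the name `sample` is just for this example) and takes the difference between audits.

```python
import pandas as pd

# readings for turnstile A002 / R051 / 02-00-00 on 03/06/2021, copied from the output above
sample = pd.DataFrame({
    "TIME": ["03:00:00", "07:00:00", "11:00:00", "15:00:00", "19:00:00"],
    "ENTRIES": [7540642, 7540645, 7540676, 7540764, 7540904],
    "EXITS": [2572027, 2572030, 2572093, 2572128, 2572160],
})

# difference between consecutive audits; the first row has no earlier reading, so it is NaN
per_interval = sample[["ENTRIES", "EXITS"]].diff()
print(per_interval)
```

For these rows the entries per 4-hour interval come out to 3, 31, 88, and 140.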


## Concatenate Files
Concatenate each data file into a Pandas dataframe

In [9]:
# list the weekly text files once, in sorted order, skipping anything that is not a .txt file
filenames = sorted(f for f in os.listdir("./data/") if f.endswith(".txt"))

# read each file and concatenate them all into the main df in a single step
df = pd.concat([pd.read_csv("./data/" + filename) for filename in filenames])

In [10]:
df.head()

,C/A,UNIT,SCP,STATION,LINENAME,DIVISION,DATE,TIME,DESC,ENTRIES,EXITS
0,A002,R051,02-00-00,59 ST,NQR456W,BMT,02/06/2021,03:00:00,REGULAR,7527244,2565995
1,A002,R051,02-00-00,59 ST,NQR456W,BMT,02/06/2021,07:00:00,REGULAR,7527246,2566004
2,A002,R051,02-00-00,59 ST,NQR456W,BMT,02/06/2021,11:00:00,REGULAR,7527296,2566054
3,A002,R051,02-00-00,59 ST,NQR456W,BMT,02/06/2021,15:00:00,REGULAR,7527430,2566098
4,A002,R051,02-00-00,59 ST,NQR456W,BMT,02/06/2021,19:00:00,REGULAR,7527588,2566129


## Reformat Data Types
Convert strings to DateTime format

In [11]:
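# dates in the raw files look like 02/06/2021, so state the format explicitly
df['DATE'] = pd.to_datetime(df['DATE'], format='%m/%d/%Y')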

### Change Time Column to Timestamp Object

In [12]:
# df.TIME = pd.to_datetime(df['TIME'])
# df.TIME.apply(lambda x: dt.timestamp(x))
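
One way the TIME conversion could be done, assuming DATE has already been converted to datetime as in the previous cell, is to parse TIME as an offset from midnight and add it to DATE. This is only a sketch on two made-up rows in the same shape as the data; the column name `DATETIME` is hypothetical.

```python
import pandas as pd

# two rows in the same shape as the turnstile data, with DATE already a datetime
sample = pd.DataFrame({
    "DATE": pd.to_datetime(["03/06/2021", "03/06/2021"], format="%m/%d/%Y"),
    "TIME": ["03:00:00", "07:00:00"],
})

# hh:mm:ss strings parse as time offsets, which can be added to the date
sample["DATETIME"] = sample["DATE"] + pd.to_timedelta(sample["TIME"])
print(sample.dtypes)
```

Keeping the result in a separate column leaves the original DATE column available for the daily grouping later on.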

In [13]:
df.dtypes

C/A                                                                             object
UNIT                                                                            object
SCP                                                                             object
STATION                                                                         object
LINENAME                                                                        object
DIVISION                                                                        object
DATE                                                                    datetime64[ns]
TIME                                                                            object
DESC                                                                            object
ENTRIES                                                                          int64
EXITS                                                                            int64
dtype: object

### Inspect a specific station

In [38]:
df[(df.STATION == 'HALSEY ST') & (df.DATE.astype(str)=='2021-02-15')].head(50)

,C/A,UNIT,SCP,STATION,LINENAME,DIVISION,DATE,TIME,DESC,ENTRIES,EXITS
36293,H028,R266,00-00-00,HALSEY ST,L,BMT,2021-02-15,03:00:00,REGULAR,5897529,3008491
36294,H028,R266,00-00-00,HALSEY ST,L,BMT,2021-02-15,07:00:00,REGULAR,5897606,3008499
36295,H028,R266,00-00-00,HALSEY ST,L,BMT,2021-02-15,11:00:00,REGULAR,5897771,3008564
36296,H028,R266,00-00-00,HALSEY ST,L,BMT,2021-02-15,15:00:00,REGULAR,5897864,3008636
36297,H028,R266,00-00-00,HALSEY ST,L,BMT,2021-02-15,19:00:00,REGULAR,5897921,3008755
36298,H028,R266,00-00-00,HALSEY ST,L,BMT,2021-02-15,23:00:00,REGULAR,5897949,3008810
36335,H028,R266,00-00-01,HALSEY ST,L,BMT,2021-02-15,03:00:00,REGULAR,7492100,1779800
36336,H028,R266,00-00-01,HALSEY ST,L,BMT,2021-02-15,07:00:00,REGULAR,7492141,1779803
36337,H028,R266,00-00-01,HALSEY ST,L,BMT,2021-02-15,11:00:00,REGULAR,7492228,1779827
36338,H028,R266,00-00-01,HALSEY ST,L,BMT,2021-02-15,15:00:00,REGULAR,7492286,1779840


## Daily Entries
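Group by station, date, SCP, UNIT, and C/A, take the max and min of the cumulative ENTRIES counter in each group, then subtract the two to get the number of entries per turnstile per day. Summing those differences gives the daily entries at each station.

**Note:** Based on the field descriptions above, C/A is the control area, UNIT is the remote unit for a station, and SCP is the address of a specific device, so together they seem to identify an individual turnstile at a station.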

In [61]:
# min gives us cumulative entries at the beginning of each day
df.groupby(['STATION','DATE','SCP','UNIT','C/A'])[['ENTRIES']].min()

STATION,DATE,SCP,UNIT,C/A,ENTRIES
1 AV,2021-02-06,00-00-00,R248,H007,15524923
1 AV,2021-02-06,00-00-01,R248,H007,61232032
1 AV,2021-02-06,00-03-00,R248,H007,370878741
1 AV,2021-02-06,00-03-01,R248,H007,2615699
1 AV,2021-02-06,00-03-02,R248,H007,6659920
...,...,...,...,...,...
ZEREGA AV,2021-03-12,00-00-01,R326,R419,227376
ZEREGA AV,2021-03-12,00-03-00,R326,R419,1142551
ZEREGA AV,2021-03-12,00-03-01,R326,R419,1218309
ZEREGA AV,2021-03-12,00-05-00,R326,R419,232


In [62]:
# max gives us cumulative entries at the end of each day
df.groupby(['STATION','DATE','SCP','UNIT','C/A'])[['ENTRIES']].max()

STATION,DATE,SCP,UNIT,C/A,ENTRIES
1 AV,2021-02-06,00-00-00,R248,H007,15525120
1 AV,2021-02-06,00-00-01,R248,H007,61232338
1 AV,2021-02-06,00-03-00,R248,H007,370878782
1 AV,2021-02-06,00-03-01,R248,H007,2615731
1 AV,2021-02-06,00-03-02,R248,H007,6660014
...,...,...,...,...,...
ZEREGA AV,2021-03-12,00-00-01,R326,R419,227473
ZEREGA AV,2021-03-12,00-03-00,R326,R419,1142796
ZEREGA AV,2021-03-12,00-03-01,R326,R419,1218850
ZEREGA AV,2021-03-12,00-05-00,R326,R419,232


In [63]:
# subtract min from max to get number of entries each day
by_turnstile = df.groupby(['STATION','DATE','SCP','UNIT','C/A'])[['ENTRIES']]
grouped = by_turnstile.max() - by_turnstile.min()
grouped


STATION,DATE,SCP,UNIT,C/A,ENTRIES
1 AV,2021-02-06,00-00-00,R248,H007,197
1 AV,2021-02-06,00-00-01,R248,H007,306
1 AV,2021-02-06,00-03-00,R248,H007,41
1 AV,2021-02-06,00-03-01,R248,H007,32
1 AV,2021-02-06,00-03-02,R248,H007,94
...,...,...,...,...,...
ZEREGA AV,2021-03-12,00-00-01,R326,R419,97
ZEREGA AV,2021-03-12,00-03-00,R326,R419,245
ZEREGA AV,2021-03-12,00-03-01,R326,R419,541
ZEREGA AV,2021-03-12,00-05-00,R326,R419,0
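
As a check on the max-minus-min logic, here is the same calculation on the HALSEY ST rows shown earlier, using a single `agg` call for both statistics. The `sample` dataframe only contains the rows visible in that output, so the totals are for illustration and not the real daily figure for the station.

```python
import pandas as pd

keys = ["STATION", "DATE", "SCP", "UNIT", "C/A"]

# HALSEY ST readings for 2021-02-15, copied from the inspection output above
sample = pd.DataFrame({
    "STATION": ["HALSEY ST"] * 10,
    "DATE": pd.to_datetime(["2021-02-15"] * 10),
    "SCP": ["00-00-00"] * 6 + ["00-00-01"] * 4,
    "UNIT": ["R266"] * 10,
    "C/A": ["H028"] * 10,
    "ENTRIES": [5897529, 5897606, 5897771, 5897864, 5897921, 5897949,
                7492100, 7492141, 7492228, 7492286],
})

# min and max of the cumulative counter for each turnstile and day, in one pass
stats = sample.groupby(keys)["ENTRIES"].agg(["min", "max"])
per_turnstile = stats["max"] - stats["min"]

# add the turnstiles together to get one number per station and day
per_station = per_turnstile.groupby(level=["STATION", "DATE"]).sum()
print(per_station)
```

The two turnstiles contribute 420 and 186 entries, so the station total for these rows is 606.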


In [64]:
# sum up the entries at each station by day
grouped_entries = grouped.groupby(['STATION', 'DATE']).sum()
grouped_entries

STATION,DATE,ENTRIES
1 AV,2021-02-06,4295
1 AV,2021-02-07,2559
1 AV,2021-02-08,5636
1 AV,2021-02-09,5638
1 AV,2021-02-10,5995
...,...,...
ZEREGA AV,2021-03-08,918
ZEREGA AV,2021-03-09,979
ZEREGA AV,2021-03-10,976
ZEREGA AV,2021-03-11,1051


In [66]:
# pivot the dataframe so each station is a row and each date is a column
entries_df = grouped_entries.pivot_table(index='STATION', columns='DATE')
entries_df

STATION,2021-02-06,2021-02-07,2021-02-08,2021-02-09,2021-02-10,2021-02-11,2021-02-12,2021-02-13,2021-02-14,2021-02-15,...,2021-03-03,2021-03-04,2021-03-05,2021-03-06,2021-03-07,2021-03-08,2021-03-09,2021-03-10,2021-03-11,2021-03-12
1 AV,4295.0,2559.0,5636.0,5638.0,5995.0,5927.0,6044.0,4112.0,3343.0,4040.0,...,6365.0,6290.0,6377.0,4470.0,3392.0,5833.0,6154.0,6021.0,6356.0,6631.0
103 ST,5498.0,3432.0,8697.0,8926.0,9227.0,9013.0,8845.0,5195.0,4030.0,5743.0,...,9707.0,9392.0,9380.0,5533.0,4236.0,9097.0,9570.0,9773.0,9705.0,9605.0
103 ST-CORONA,7324.0,4676.0,9175.0,9150.0,9461.0,9798.0,9717.0,7207.0,5439.0,7855.0,...,9736.0,10141.0,10283.0,7292.0,5654.0,9427.0,9947.0,9926.0,10257.0,10197.0
104 ST,1076.0,663.0,1844.0,1890.0,1834.0,1936.0,1816.0,1123.0,765.0,1404.0,...,1766.0,1946.0,1932.0,1052.0,862.0,1952.0,1943.0,1952.0,1894.0,1909.0
110 ST,2551.0,1551.0,3620.0,3797.0,3945.0,3803.0,3594.0,2485.0,1722.0,2500.0,...,4203.0,4086.0,4224.0,2436.0,1888.0,3802.0,4267.0,4190.0,4299.0,4247.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
WOODLAWN,1974.0,929.0,2660.0,2825.0,2783.0,2934.0,2918.0,1867.0,1290.0,2006.0,...,3022.0,2889.0,2869.0,1910.0,1395.0,2841.0,3077.0,3083.0,3217.0,3087.0
WORLD TRADE CTR,132.0,73.0,2274.0,2318.0,2387.0,2409.0,2324.0,138.0,102.0,137.0,...,2446.0,2477.0,2414.0,187.0,107.0,2381.0,2565.0,2493.0,2669.0,2568.0
WTC-CORTLANDT,979.0,469.0,1524.0,1511.0,1587.0,1560.0,1544.0,1204.0,837.0,1025.0,...,1599.0,1682.0,1607.0,1150.0,836.0,1641.0,1759.0,1628.0,1747.0,1702.0
YORK ST,1.0,0.0,2034.0,2124.0,2290.0,2245.0,2198.0,0.0,0.0,0.0,...,2448.0,2427.0,2362.0,0.0,0.0,2176.0,2418.0,2374.0,2386.0,2517.0
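
The values above are floats because `pivot_table` aggregates with the mean by default. Since each station and date pair appears only once after the groupby, an alternative that may be worth trying is `unstack`, which reshapes without aggregating. The sketch below uses four values copied from the table above; the names `daily` and `wide` are just for this example.

```python
import pandas as pd

# a small stand-in for grouped_entries: one ENTRIES value per (STATION, DATE)
index = pd.MultiIndex.from_product(
    [["1 AV", "103 ST"], pd.to_datetime(["2021-02-06", "2021-02-07"])],
    names=["STATION", "DATE"],
)
daily = pd.DataFrame({"ENTRIES": [4295, 2559, 5498, 3432]}, index=index)

# move the DATE level of the index into the columns; no aggregation happens
wide = daily["ENTRIES"].unstack("DATE")
print(wide)
```

With no missing station and date pairs the counts stay integers; any missing pair would show up as NaN and turn that column into floats.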


### Pickle Cleaned Data

In [67]:
# TODO: pickle the cleaned dataframe
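
A minimal sketch of what this step could look like, using pandas' `to_pickle` and `read_pickle`. The file name `entries_df.pkl`, the temp-folder location, and the small stand-in dataframe are all assumptions for the example; in the notebook the real `entries_df` would be saved instead.

```python
import os
import tempfile

import pandas as pd

# stand-in for the pivoted entries_df built above
entries_df = pd.DataFrame(
    {"2021-02-06": [4295.0, 5498.0], "2021-02-07": [2559.0, 3432.0]},
    index=pd.Index(["1 AV", "103 ST"], name="STATION"),
)

# hypothetical output path, kept outside ./data/ so that folder holds only the raw text files
path = os.path.join(tempfile.gettempdir(), "entries_df.pkl")

entries_df.to_pickle(path)        # write the dataframe to disk
restored = pd.read_pickle(path)   # read it back in a later session
print(restored.equals(entries_df))
```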