# MTA Traffic Forecasting
#### Justin Morgan & Khyatee Desai
This notebook covers the data collection, storage, and cleaning process. The data is sourced from the [MTA turnstile data archive](http://web.mta.info/developers/turnstile.html), which publishes turnstile data for all NYC subway stations as comma-delimited text files, segmented by week.
# Part 1: Web Scrape MTA Data
The data is iteratively scraped from the webpage using BeautifulSoup, and the text files are currently stored locally within a "data" folder.
<br><br>
Future steps will involve storing this data within an S3 bucket in the AWS cloud for faster storage & retrieval.

In [3]:
# import necessary packages
import pandas as pd
import numpy as np
import os
import requests
import urllib.request
import time
from timeit import default_timer as timer
import humanfriendly
from bs4 import BeautifulSoup

### Get MTA turnstile data from publicly available website

In [6]:
url_root = r'http://web.mta.info/developers/' # set root url
# path = r'/Users/justinwilliams/projects/mta_turnstile/data/' # commented this out and changed it below to just save to "./data/" folder so it works for both of us - k
starttime = timer() # start timer to time process

req = requests.get(url_root + 'turnstile.html') # send request
soup = BeautifulSoup(req.content, 'html.parser') # parse html and save to bs4 object
weekly_data = soup.find(class_='span-84 last') # find class_ where file links are located

## This way took much longer so used urllib.request.urlretrieve
# for file in weekly_data.findAll('a'):
#     print('Saving file turnstile ' + str(file)[39:49])
#     datafile = requests.get(url_root + str(file)[9:49])
#     with open(path + str(file)[39:49], 'w') as outf:
#         for line in datafile.text:
#             outf.writelines(line)
#     time.sleep(1)

# endtime = timer()
# print('Completed in ' + humanfriendly.format_timespan(endtime-starttime))

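os.makedirs('./data/', exist_ok=True) # make sure the target folder exists
for one_a_tag in weekly_data.findAll('a')[:5]: # [:5] limits this run to the first five links
    file = one_a_tag['href']
    datafile = url_root + file
    filename = datafile[datafile.find('/turnstile_')+1:] # e.g. turnstile_210313.txt
    urllib.request.urlretrieve(datafile, './data/' + filename)
    time.sleep(1) # pause between requests
    print('Saving file ' + filename.replace('_', ' '))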
    
endtime = timer()
print('Completed in ' + humanfriendly.format_timespan(endtime-starttime))

Saving file turnstile 210313.txt
Saving file turnstile 210306.txt
Saving file turnstile 210227.txt
Saving file turnstile 210220.txt
Saving file turnstile 210213.txt
Completed in 18.03 seconds


Downloading all files took __47 minutes and 17.79 seconds__, and the resulting folder is __12.05GB__.

Maybe there is a better way to do this?
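
One possible improvement, offered only as a sketch: skip any weekly file that is already in the data folder, so a re-run downloads only what is new. The helper name `files_to_download` and the example hrefs are hypothetical, and the sketch assumes that a file already present locally is complete. It runs offline against a temporary folder rather than the real `./data/` folder.

```python
import os
import tempfile


def files_to_download(hrefs, data_dir):
    """Return (href, local_path) pairs for hrefs whose file is not yet in data_dir.

    Hypothetical helper: assumes a file that already exists locally is complete
    and does not need to be fetched again.
    """
    existing = set(os.listdir(data_dir)) if os.path.isdir(data_dir) else set()
    pending = []
    for href in hrefs:
        filename = href.rsplit('/', 1)[-1]  # e.g. 'turnstile_210313.txt'
        if filename not in existing:
            pending.append((href, os.path.join(data_dir, filename)))
    return pending


# Offline demonstration: a temporary folder stands in for ./data/
with tempfile.TemporaryDirectory() as demo_dir:
    # pretend one weekly file was downloaded on an earlier run
    open(os.path.join(demo_dir, 'turnstile_210313.txt'), 'w').close()
    example_hrefs = [  # assumed href shape, for illustration only
        'data/nyct/turnstile/turnstile_210313.txt',
        'data/nyct/turnstile/turnstile_210306.txt',
    ]
    pending = files_to_download(example_hrefs, demo_dir)
    pending_names = [os.path.basename(path) for _, path in pending]
    print(pending_names)
```

In the download loop, only the pairs returned by the helper would then be passed to `urllib.request.urlretrieve`.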

# Part 2: Data Cleaning
This part covers the data cleaning process.

### Inspect Data
Field Descriptions: http://web.mta.info/developers/resources/nyct/turnstile/ts_Field_Description.txt

`C/A      = Control Area (A002)`<br>
`UNIT     = Remote Unit for a station (R051)`<br>
`SCP      = Subunit Channel Position represents a specific address for a device (02-00-00)`<br>
`STATION  = Represents the station name the device is located at`<br>
`LINENAME = Represents all train lines that can be boarded at this station
           Normally lines are represented by one character.  LINENAME 456NQR represents train service for 4, 5, 6, N, Q, and R trains.`<br>
`DIVISION = Represents the line the station originally belonged to: BMT, IRT, or IND`<br>
`DATE     = Represents the date (MM-DD-YY)`<br>
`TIME     = Represents the time (hh:mm:ss) for a scheduled audit event`<br>
`DESC     = Represents the "REGULAR" scheduled audit event (normally occurs every 4 hours)`<br>
        `1. Audits may occur more than 4 hours apart due to planning or troubleshooting activities.`<br>
        `2. Additionally, there may be a "RECOVR AUD" entry: this refers to a missed audit that was recovered.`<br>
`ENTRIES  = The cumulative entry register value for a device`<br>
`EXITS    = The cumulative exit register value for a device`<br>

In [7]:
df = pd.read_csv("./data/turnstile_210313.txt")
df.head()

Unnamed: 0,C/A,UNIT,SCP,STATION,LINENAME,DIVISION,DATE,TIME,DESC,ENTRIES,EXITS
0,A002,R051,02-00-00,59 ST,NQR456W,BMT,03/06/2021,03:00:00,REGULAR,7540642,2572027
1,A002,R051,02-00-00,59 ST,NQR456W,BMT,03/06/2021,07:00:00,REGULAR,7540645,2572030
2,A002,R051,02-00-00,59 ST,NQR456W,BMT,03/06/2021,11:00:00,REGULAR,7540676,2572093
3,A002,R051,02-00-00,59 ST,NQR456W,BMT,03/06/2021,15:00:00,REGULAR,7540764,2572128
4,A002,R051,02-00-00,59 ST,NQR456W,BMT,03/06/2021,19:00:00,REGULAR,7540904,2572160
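
Because `ENTRIES` and `EXITS` are cumulative register values, the number of riders in each audit interval has to be derived by differencing consecutive readings from the same device. The sketch below does this for the five rows shown above. Treating `C/A` + `UNIT` + `SCP` + `STATION` as the key for one turnstile is an assumption, and the `ENTRIES_DIFF` / `EXITS_DIFF` column names are introduced here for illustration only.

```python
import pandas as pd

# The five rows shown in the df.head() output above
sample = pd.DataFrame({
    'C/A': ['A002'] * 5,
    'UNIT': ['R051'] * 5,
    'SCP': ['02-00-00'] * 5,
    'STATION': ['59 ST'] * 5,
    'TIME': ['03:00:00', '07:00:00', '11:00:00', '15:00:00', '19:00:00'],
    'ENTRIES': [7540642, 7540645, 7540676, 7540764, 7540904],
    'EXITS': [2572027, 2572030, 2572093, 2572128, 2572160],
})

# Assumed key identifying one physical turnstile
turnstile_keys = ['C/A', 'UNIT', 'SCP', 'STATION']

# Difference consecutive readings within each turnstile.
# The first reading of each group has no predecessor, so its difference is NaN.
interval_counts = sample.groupby(turnstile_keys)[['ENTRIES', 'EXITS']].diff()
sample['ENTRIES_DIFF'] = interval_counts['ENTRIES']
sample['EXITS_DIFF'] = interval_counts['EXITS']

print(sample[['TIME', 'ENTRIES_DIFF', 'EXITS_DIFF']])
```

On the full dataset the rows would need to be sorted by date and time within each turnstile before differencing, and counter resets could produce negative or implausibly large differences that need separate handling.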


### Concatenate Data
Concatenate each data file into a Pandas dataframe

In [8]:
# read every file in the folder, then concatenate them all in a single call
frames = [pd.read_csv("./data/"+filename) for filename in os.listdir("./data/")]
df = pd.concat(frames)

In [9]:
df.head()

Unnamed: 0,C/A,UNIT,SCP,STATION,LINENAME,DIVISION,DATE,TIME,DESC,ENTRIES,EXITS
0,A002,R051,02-00-00,59 ST,NQR456W,BMT,02/06/2021,03:00:00,REGULAR,7527244,2565995
1,A002,R051,02-00-00,59 ST,NQR456W,BMT,02/06/2021,07:00:00,REGULAR,7527246,2566004
2,A002,R051,02-00-00,59 ST,NQR456W,BMT,02/06/2021,11:00:00,REGULAR,7527296,2566054
3,A002,R051,02-00-00,59 ST,NQR456W,BMT,02/06/2021,15:00:00,REGULAR,7527430,2566098
4,A002,R051,02-00-00,59 ST,NQR456W,BMT,02/06/2021,19:00:00,REGULAR,7527588,2566129


### Reformat Data Types
Convert strings to DateTime format

In [10]:
df.DATE = pd.to_datetime(df['DATE'])
df.TIME = pd.to_datetime( df['TIME'])

In [11]:
df.dtypes

C/A                                                                             object
UNIT                                                                            object
SCP                                                                             object
STATION                                                                         object
LINENAME                                                                        object
DIVISION                                                                        object
DATE                                                                    datetime64[ns]
TIME                                                                    datetime64[ns]
DESC                                                                            object
ENTRIES                                                                          int64
EXITS                                                                            int64
dtype: object
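
The wide padding in the `df.dtypes` output above hints that at least one column name may carry trailing whitespace. This is an inference from the output, not something the notebook confirms. If that is the case, a lookup such as `df['EXITS']` would raise a `KeyError`. A minimal sketch of the usual remedy, on a hypothetical toy frame with a padded header:

```python
import pandas as pd

# Toy frame whose last header has trailing spaces (hypothetical stand-in for the raw file)
raw = pd.DataFrame({'ENTRIES': [7527244], 'EXITS     ': [2565995]})

# Strip leading and trailing whitespace from every column name
raw.columns = raw.columns.str.strip()

print(list(raw.columns))
```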

In [13]:
df

Unnamed: 0,C/A,UNIT,SCP,STATION,LINENAME,DIVISION,DATE,TIME,DESC,ENTRIES,EXITS
0,A002,R051,02-00-00,59 ST,NQR456W,BMT,2021-02-06,2021-03-16 03:00:00,REGULAR,7527244,2565995
1,A002,R051,02-00-00,59 ST,NQR456W,BMT,2021-02-06,2021-03-16 07:00:00,REGULAR,7527246,2566004
2,A002,R051,02-00-00,59 ST,NQR456W,BMT,2021-02-06,2021-03-16 11:00:00,REGULAR,7527296,2566054
3,A002,R051,02-00-00,59 ST,NQR456W,BMT,2021-02-06,2021-03-16 15:00:00,REGULAR,7527430,2566098
4,A002,R051,02-00-00,59 ST,NQR456W,BMT,2021-02-06,2021-03-16 19:00:00,REGULAR,7527588,2566129
...,...,...,...,...,...,...,...,...,...,...,...
209039,TRAM2,R469,00-05-01,RIT-ROOSEVELT,R,RIT,2021-02-19,2021-03-16 04:00:00,REGULAR,5554,544
209040,TRAM2,R469,00-05-01,RIT-ROOSEVELT,R,RIT,2021-02-19,2021-03-16 08:00:00,REGULAR,5554,544
209041,TRAM2,R469,00-05-01,RIT-ROOSEVELT,R,RIT,2021-02-19,2021-03-16 12:00:00,REGULAR,5554,544
209042,TRAM2,R469,00-05-01,RIT-ROOSEVELT,R,RIT,2021-02-19,2021-03-16 16:00:00,REGULAR,5554,544
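
In the output above, every `TIME` value carries the date 2021-03-16 rather than the date of the reading, presumably because `pd.to_datetime` fills in the current date when it parses a time-only string. One way around this, sketched on a toy frame holding the raw strings as they appear in the files, is to parse the two columns together with an explicit format. The `DATETIME` column name is an assumption introduced for illustration.

```python
import pandas as pd

# Raw string values as they appear in the turnstile files
raw = pd.DataFrame({
    'DATE': ['02/06/2021', '02/06/2021'],
    'TIME': ['03:00:00', '07:00:00'],
})

# Join date and time, then parse them as one timestamp with an explicit format
raw['DATETIME'] = pd.to_datetime(
    raw['DATE'] + ' ' + raw['TIME'],
    format='%m/%d/%Y %H:%M:%S',
)

print(raw['DATETIME'])
```

Passing `format` explicitly also avoids per-value format inference, which tends to matter on a dataset of this size.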


### Pickle Cleaned Data