# NYC MTA

This notebook explores traffic patterns in the NYC MTA system. The findings will be used to identify ways of spreading out peak traffic. The visualizations are meant to help the viewer see which lulls could absorb some of the peak transit traffic.

In [194]:
import pandas as pd
from FixWith import likeNew
from FixWith import combinedTraffic
from FixWith import trafficFix

## Getting Data

If the data is not already saved locally, it must first be retrieved online. For one-time use, reading directly from the web is convenient. For multiple notebook sessions, it is smoother to save the data locally after the first retrieval and load the local copy from then on.

In [181]:
# Source: http://web.mta.info/developers/turnstile.html

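def get_data_online(week_nums):
    '''
    Downloads the weekly turnstile file for each week number (YYMMDD)
    and returns them stacked into one dataframe.
    '''
    url = "http://web.mta.info/developers/data/nyct/turnstile/turnstile_{}.txt"
    dfs = []
    for week_num in week_nums:
        file_url = url.format(week_num)
        dfs.append(pd.read_csv(file_url))
    # Each weekly file restarts its row index at 0, so renumber after stacking.
    return pd.concat(dfs, ignore_index=True)

def get_csv_data(ourCSV):
    '''
    Reads a locally saved turnstile CSV and drops the leftover index column.
    '''
    df = pd.read_csv(ourCSV)
    # to_csv() writes the index as an unnamed column; drop it if present.
    df = df.drop(columns="Unnamed: 0", errors="ignore")
    return df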

The week numbers for retrieving data online follow the format YYMMDD (year, month, day), and each week number falls on a Saturday.

Time Frames (3 wks each) of interest for the project:
1. Early to Mid December 2019
2. Mid to Late March 2020
3. Mid to Late June 2020

The code for retrieving the data online is as follows.
```
week_nums = [200613, 200620, 200627]
june_turnstiles = get_data_online(week_nums)

week_nums = [200328, 200321, 200314]
march_turnstiles = get_data_online(week_nums)

week_nums = [191207, 191214, 191221]
december_turnstiles = get_data_online(week_nums)
```
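The week numbers can also be generated instead of typed by hand. The following is a minimal sketch using only the standard library; `saturday_week_nums` is a hypothetical helper name, not part of the notebook or of `FixWith`.

```python
from datetime import date, timedelta

def saturday_week_nums(first_saturday, n_weeks):
    """Return n_weeks consecutive YYMMDD integers, starting at first_saturday."""
    if first_saturday.weekday() != 5:  # Monday is 0, so Saturday is 5
        raise ValueError("first_saturday must fall on a Saturday")
    return [
        int((first_saturday + timedelta(weeks=i)).strftime("%y%m%d"))
        for i in range(n_weeks)
    ]

print(saturday_week_nums(date(2020, 6, 13), 3))  # [200613, 200620, 200627]
```

The Saturday check guards against a typo in the start date, which would otherwise produce URLs for files that may not exist.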

Once retrieved, the data can be written to .csv files:

```
# december_turnstiles.to_csv('december_turnstiles.csv')
# march_turnstiles.to_csv('march_turnstiles.csv')
# june_turnstiles.to_csv('june_turnstiles.csv')
```

In [182]:
december_turnstiles = get_csv_data('december_turnstiles.csv')
march_turnstiles = get_csv_data('march_turnstiles.csv')
june_turnstiles = get_csv_data('june_turnstiles.csv')
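The download-once, read-locally workflow could also be wrapped in a single helper so the notebook does not need commented-out cells. This is only a sketch: `load_or_fetch` and its `fetch` argument are hypothetical names introduced here for illustration.

```python
from pathlib import Path

import pandas as pd

def load_or_fetch(csv_path, fetch):
    """Read csv_path if it exists; otherwise call fetch() and save the result."""
    path = Path(csv_path)
    if path.exists():
        return pd.read_csv(path)
    df = fetch()  # fetch is any zero-argument function returning a dataframe
    # index=False avoids writing the extra "Unnamed: 0" column.
    df.to_csv(path, index=False)
    return df
```

With this sketch, a call such as `load_or_fetch('june_turnstiles.csv', lambda: get_data_online([200613, 200620, 200627]))` would only reach the network the first time it runs.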

## Cleaning Up 

The data's date and time columns need to be combined into a single timestamp column.
Column names should also be stripped of any whitespace.

In [190]:
june_turnstiles = betterColumns(june_turnstiles)
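The body of `betterColumns` is not shown in this notebook, so the following is only a sketch of what the two cleaning steps might look like. `clean_columns` is a hypothetical stand-in, and the date format is assumed from the `DATE` and `TIME` values visible in the outputs further down.

```python
import pandas as pd

def clean_columns(df):
    """Strip whitespace from column names and build a TIMESTAMP column."""
    df = df.copy()
    # Remove stray leading/trailing spaces from the column names.
    df.columns = df.columns.str.strip()
    # Combine the DATE and TIME strings into one datetime column.
    df["TIMESTAMP"] = pd.to_datetime(
        df["DATE"] + " " + df["TIME"], format="%m/%d/%Y %H:%M:%S"
    )
    return df
```

Working on a copy keeps the original frame unchanged, which makes the cell safe to re-run.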

#### Making a Turnstiles Column

Although C/A, UNIT, SCP, and STATION contain enough information to tell turnstiles apart, a turnstile column has not yet been created. Together, these four fields identify a single turnstile.

In [191]:
def turnstileColumn(dataframe):
    '''
    Takes an MTA dataframe & creates a new TURNSTILE column
    holding an index number (starting at 1) for each turnstile in the frame.
    '''
    keys = ["C/A", "UNIT", "SCP", "STATION"]
    # One row per unique turnstile; the summed ENTRIES is only a placeholder.
    eachTS = dataframe.groupby(keys)[['ENTRIES']].sum()
    howMany = eachTS.shape[0]
    eachTS['TURNSTILE'] = range(1, howMany + 1)
    del eachTS['ENTRIES']
    # Attach each turnstile's number back onto its rows.
    dataframe = pd.merge(dataframe, eachTS.reset_index(), how='left', on=keys)
    return dataframe

In [192]:
june_turnstiles = turnstileColumn(june_turnstiles)
june_turnstiles.head(2)

Unnamed: 0,C/A,UNIT,SCP,STATION,LINENAME,DIVISION,DATE,TIME,DESC,ENTRIES,EXITS,TIMESTAMP,FOOTTRAFFIC,TURNSTILE_x,TURNSTILE_y,TURNSTILE
0,A002,R051,02-00-00,59 ST,NQR456W,BMT,06/20/2020,00:00:00,REGULAR,7424218,2522558,2020-06-20 00:00:00,9946776,1,1,1
1,A002,R051,02-00-00,59 ST,NQR456W,BMT,06/20/2020,04:00:00,REGULAR,7424220,2522559,2020-06-20 04:00:00,9946779,1,1,1
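The `TURNSTILE_x` and `TURNSTILE_y` columns in the output above most likely come from running the merge on a frame that already had a `TURNSTILE` column, since pandas adds suffixes to overlapping column names. One alternative, sketched here under that assumption, is to number the groups directly with `groupby(...).ngroup()`, which overwrites the column instead of duplicating it. `add_turnstile_id` is a hypothetical name.

```python
import pandas as pd

KEYS = ["C/A", "UNIT", "SCP", "STATION"]

def add_turnstile_id(df):
    """Number each unique (C/A, UNIT, SCP, STATION) combination from 1."""
    df = df.copy()
    # ngroup() labels groups 0..n-1 in sorted key order; shift to start at 1.
    df["TURNSTILE"] = df.groupby(KEYS).ngroup() + 1
    return df
```

Because the column is assigned rather than merged, calling the function again leaves the set of columns unchanged.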


#### Updating Foot Traffic Column

In [193]:
june_turnstiles = trafficFix(june_turnstiles)
june_turnstiles.head(2)

Unnamed: 0,C/A,UNIT,SCP,STATION,LINENAME,DIVISION,DATE,TIME,DESC,ENTRIES,EXITS,TIMESTAMP,FOOTTRAFFIC,TURNSTILE_x,TURNSTILE_y,TURNSTILE
0,A002,R051,02-00-00,59 ST,NQR456W,BMT,06/20/2020,00:00:00,REGULAR,7424218,2522558,2020-06-20 00:00:00,3,1,1,1
1,A002,R051,02-00-00,59 ST,NQR456W,BMT,06/20/2020,04:00:00,REGULAR,7424220,2522559,2020-06-20 04:00:00,3,1,1,1
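`trafficFix` is imported from `FixWith` and its body is not shown here. Judging only from the outputs, `FOOTTRAFFIC` starts out as the cumulative `ENTRIES + EXITS` counter and ends up as the change between consecutive readings. The sketch below illustrates that idea with a per-turnstile difference; `interval_traffic` is a hypothetical name, and it is not claimed to match `trafficFix` (for example, it leaves the first reading of each turnstile as NaN).

```python
import pandas as pd

def interval_traffic(df):
    """Replace cumulative FOOTTRAFFIC with the change since the last reading."""
    df = df.sort_values(["TURNSTILE", "TIMESTAMP"]).copy()
    cumulative = df["ENTRIES"] + df["EXITS"]
    # Difference consecutive readings within each turnstile only, so one
    # turnstile's counter is never subtracted from another's.
    df["FOOTTRAFFIC"] = cumulative.groupby(df["TURNSTILE"]).diff()
    return df
```

Counter resets would show up as negative or implausibly large differences, so a real pipeline would probably need to filter those rows as well.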


### General Data Info

In [178]:
print(june_turnstiles.shape, '\n')
print('Columns:\n' , june_turnstiles.columns)
june_turnstiles.head(2)

(620072, 13) 

Columns:
 Index(['C/A', 'UNIT', 'SCP', 'STATION', 'LINENAME', 'DIVISION', 'DATE', 'TIME',
       'DESC', 'ENTRIES', 'EXITS', 'TIMESTAMP', 'FOOTTRAFFIC'],
      dtype='object')


Unnamed: 0,C/A,UNIT,SCP,STATION,LINENAME,DIVISION,DATE,TIME,DESC,ENTRIES,EXITS,TIMESTAMP,FOOTTRAFFIC
0,A002,R051,02-00-00,59 ST,NQR456W,BMT,06/20/2020,00:00:00,REGULAR,7424218,2522558,2020-06-20 00:00:00,9946776
1,A002,R051,02-00-00,59 ST,NQR456W,BMT,06/20/2020,04:00:00,REGULAR,7424220,2522559,2020-06-20 04:00:00,9946779


## Plotting Values
- Average for each hour?
- Which stations are used at what times?


In [179]:
import matplotlib.pyplot as plt
import numpy as np
%matplotlib inline 

In [None]:
june_turnstiles['']

In [None]:
plt.plot(data_x, data_list)
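One possible way to approach the "average for each hour" question is sketched below. It assumes the frame has the `TIMESTAMP` column as a datetime and a per-reading `FOOTTRAFFIC` column; `average_traffic_by_hour` and `plot_hourly_average` are hypothetical helper names.

```python
import pandas as pd

def average_traffic_by_hour(df):
    """Mean FOOTTRAFFIC for each hour of day at which a reading was taken."""
    hours = df["TIMESTAMP"].dt.hour.rename("HOUR")
    return df.groupby(hours)["FOOTTRAFFIC"].mean()

def plot_hourly_average(hourly):
    """Bar chart of the series returned by average_traffic_by_hour."""
    # Imported here so the averaging step can be used without matplotlib.
    import matplotlib.pyplot as plt
    fig, ax = plt.subplots()
    ax.bar(hourly.index, hourly.values)
    ax.set_xlabel("Hour of reading")
    ax.set_ylabel("Average foot traffic per reading")
    return ax
```

Since the readings in the sample output are four hours apart, each bar would describe the interval ending at that hour rather than a single hour of traffic.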

### Notes

- Issues with grouping by station
- Chunk the time series into hours
- Rush hour time frames
- Would need an explanation for why to divide by 4
- Interval starts depend on line and unit
- Put all stations on an hourly clock
- Take the dates into account
- Focus on the most popular stations