# NYC MTA

This notebook explores traffic patterns in the NYC MTA system. The findings will be used to identify measures for spreading out peak traffic times. The visualizations are meant to help the viewer see which lulls may be opportune times for absorbing some of the peak transit traffic.

In [1]:
import pandas as pd
import numpy as np

## Getting Data

If the data is not already saved locally, it must first be retrieved online. For one-time use, reading the data straight from the web is handy. For multiple notebook sessions, it is more seamless to save the data locally after the first retrieval, so that later sessions only need to read the local files.
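
The retrieve-once, then-read-locally workflow could be wrapped in a single helper. The sketch below is only an illustration and not part of the original notebook: `load_or_fetch` and its `fetch` argument are hypothetical names, and `fetch` stands in for any function that returns a DataFrame (for example, a call to `get_data_online`).

```python
import os
import pandas as pd

def load_or_fetch(csv_path, fetch):
    """Return the DataFrame saved at csv_path, creating it with fetch() if missing.

    Hypothetical helper: `fetch` is any zero-argument callable that returns a
    DataFrame, e.g. lambda: get_data_online([200613, 200620, 200627]).
    """
    if os.path.exists(csv_path):
        # A local copy exists, so skip the download entirely
        return pd.read_csv(csv_path)
    df = fetch()
    # index=False keeps the row index out of the file, so no "Unnamed: 0" column appears on reload
    df.to_csv(csv_path, index=False)
    return df
```

Files written this way have no `Unnamed: 0` column, so they would be read back with plain `pd.read_csv` rather than with `get_csv_data`, which expects that column.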

In [2]:
# Source: http://web.mta.info/developers/turnstile.html

def get_data_online(week_nums):
    url = "http://web.mta.info/developers/data/nyct/turnstile/turnstile_{}.txt"
    dfs = []
    for week_num in week_nums:
        file_url = url.format(week_num)
        dfs.append(pd.read_csv(file_url))
    return pd.concat(dfs)

def get_csv_data(ourCSV):
    df = pd.read_csv(ourCSV)
    del df["Unnamed: 0"]
    return df

The week numbers used for online retrieval follow the format year, month, day (YYMMDD), and each week number falls on a Saturday.

Time Frames (3 wks each) of interest for the project:
1. Early to Mid December 2019
2. Mid to Late March 2020
3. Mid to Late June 2020
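
The week numbers for each three-week time frame can also be generated instead of typed by hand. This is a small sketch under the assumption that only the first Saturday of a frame is known; `saturday_week_nums` is a hypothetical helper name, not something from the MTA site or the original notebook.

```python
from datetime import date, timedelta

def saturday_week_nums(first_saturday, n_weeks=3):
    """Return n_weeks consecutive YYMMDD integers, one per Saturday."""
    if first_saturday.weekday() != 5:  # Monday is 0, so Saturday is 5
        raise ValueError("first_saturday must fall on a Saturday")
    return [int((first_saturday + timedelta(weeks=i)).strftime("%y%m%d"))
            for i in range(n_weeks)]

print(saturday_week_nums(date(2020, 6, 13)))   # [200613, 200620, 200627]
print(saturday_week_nums(date(2020, 3, 14)))   # [200314, 200321, 200328]
print(saturday_week_nums(date(2019, 12, 7)))   # [191207, 191214, 191221]
```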

The code for getting data from online is as follows.

`week_nums = [200613, 200620, 200627]
june_turnstiles = get_data_online(week_nums)`

`week_nums = [200328, 200321, 200314]
march_turnstiles = get_data_online(week_nums)`

`week_nums = [191207, 191214, 191221]
december_turnstiles = get_data_online(week_nums)`

Placeholder

`week_nums = [200613, 200620, 200627]
june_turnstilesv1 = get_data_online(week_nums)`

`week_nums = [200328, 200321, 200314]
march_turnstilesv1 = get_data_online(week_nums)`

`week_nums = [191207, 191214, 191221]
december_turnstilesv1 = get_data_online(week_nums)`

Once retrieved, the data can be written to CSV files.

`december_turnstiles.to_csv('december_turnstiles.csv')
march_turnstiles.to_csv('march_turnstiles.csv')
june_turnstiles.to_csv('june_turnstiles.csv')
`

Placeholder

`december_turnstilesv1.to_csv('v1december_turnstiles.csv')
march_turnstilesv1.to_csv('v1march_turnstiles.csv')
june_turnstilesv1.to_csv('v1june_turnstiles.csv')
`

In [3]:
december_turnstiles = get_csv_data('december_turnstiles.csv')
march_turnstiles = get_csv_data('march_turnstiles.csv')
june_turnstiles = get_csv_data('june_turnstiles.csv')

## Cleaning Up 

The data's `DATE` and `TIME` columns need to be combined into a single timestamp column.
Column names should also be stripped of any surrounding whitespace.

##### Initial Column Adjustment

In [4]:
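def betterColumns(dataframe):
    # Strip whitespace from the column names first, so every lookup below uses clean names
    dataframe.columns = dataframe.columns.str.strip()
    dataframe['TIMESTAMP'] = pd.to_datetime(dataframe['DATE'] + ' ' + dataframe['TIME'])
    # Cumulative foot traffic; trafficFix later converts it to per-interval counts
    dataframe['FOOTTRAFFIC'] = dataframe['ENTRIES'] + dataframe['EXITS']
    return dataframe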

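def trafficFix(dataframe):
    # Per-turnstile difference between consecutive cumulative readings;
    # the first reading of each turnstile has no predecessor, so it is backfilled
    dataframe['FOOTTRAFFIC'] = dataframe.groupby('TURNSTILE')['FOOTTRAFFIC'].diff().bfill()
    dataframe['FOOTTRAFFIC'] = dataframe['FOOTTRAFFIC'].astype(int)
    return dataframe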

```
june_turnstiles = betterColumns(june_turnstiles)
march_turnstiles = betterColumns(march_turnstiles)
december_turnstiles = betterColumns(december_turnstiles)
```

##### Turnstiles' Column

Together, the C/A, UNIT, SCP, and STATION columns identify a unique turnstile, so they are used to create a turnstile column.

In [5]:
def turnstileColumn(dataframe):
    '''
    Takes an MTA dataframe and adds a TURNSTILE column
    holding an index number for each unique turnstile in the frame.
    '''
    eachTS = dataframe.groupby(["C/A", "UNIT", "SCP", "STATION"])[['ENTRIES']].sum()
    howMany = eachTS.shape[0]
    eachTS['TURNSTILE'] = range(1,(howMany + 1))
    del eachTS['ENTRIES']
    dataframe = pd.merge(dataframe, eachTS,  how='left',
                         left_on=['C/A','UNIT','SCP', 'STATION'],
                         right_on = ['C/A','UNIT','SCP', 'STATION'])    
    return dataframe

```
june_turnstiles = turnstileColumn(june_turnstiles)
march_turnstiles = turnstileColumn(march_turnstiles)
december_turnstiles = turnstileColumn(december_turnstiles)
```
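
As an aside, the same numbering can be sketched without the intermediate sum and merge by using pandas' `ngroup()`. This is an alternative illustration, not the method used in `turnstileColumn`; the tiny frame below is made up for the example.

```python
import pandas as pd

# Tiny made-up frame; the values are illustrative only
demo = pd.DataFrame({
    "C/A":     ["A002", "A002", "A002", "TRAM2"],
    "UNIT":    ["R051", "R051", "R051", "R469"],
    "SCP":     ["02-00-00", "02-00-00", "02-00-01", "00-05-01"],
    "STATION": ["59 ST", "59 ST", "59 ST", "RIT-ROOSEVELT"],
})

keys = ["C/A", "UNIT", "SCP", "STATION"]
# ngroup() numbers the groups from 0 in sorted key order; add 1 to start at 1
demo["TURNSTILE"] = demo.groupby(keys).ngroup() + 1
print(demo["TURNSTILE"].tolist())  # [1, 1, 2, 3]
```

Rows that share all four key columns receive the same number, which is the property the `TURNSTILE` column relies on.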

#### Updating Foot Traffic Column

```
june_turnstiles = trafficFix(june_turnstiles)
march_turnstiles = trafficFix(march_turnstiles)
december_turnstiles = trafficFix(december_turnstiles)
```
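
To see what this step does, here is a small worked sketch. It reuses a few `ENTRIES` and `EXITS` readings from the March output shown in this notebook; the frame itself is built by hand for illustration. The counters are cumulative, so the traffic in each interval is the difference between consecutive readings of the same turnstile.

```python
import pandas as pd

demo = pd.DataFrame({
    "TURNSTILE": [1, 1, 1, 4935, 4935],
    "ENTRIES":   [7411940, 7411942, 7411945, 5554, 5554],
    "EXITS":     [2515962, 2515966, 2515979, 507, 507],
})

# Cumulative counter: entries plus exits since the device started counting
cumulative = demo["ENTRIES"] + demo["EXITS"]
# diff() within each turnstile turns the running total into per-interval traffic;
# the first reading of each turnstile has no predecessor, so it is backfilled
demo["FOOTTRAFFIC"] = cumulative.groupby(demo["TURNSTILE"]).diff().bfill().astype(int)
print(demo["FOOTTRAFFIC"].tolist())  # [6, 6, 16, 0, 0]
```

The backfill copies the next row's value into each turnstile's first row, so it assumes the rows are ordered by time within each turnstile and that each turnstile has at least two readings.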

In [15]:
march_turnstiles
# december_turnstiles.head(2)

Unnamed: 0,C/A,UNIT,SCP,STATION,LINENAME,DIVISION,DATE,TIME,DESC,ENTRIES,EXITS,TIMESTAMP,FOOTTRAFFIC,TURNSTILE
0,A002,R051,02-00-00,59 ST,NQR456W,BMT,03/21/2020,00:00:00,REGULAR,7411940,2515962,2020-03-21 00:00:00,6,1
1,A002,R051,02-00-00,59 ST,NQR456W,BMT,03/21/2020,04:00:00,REGULAR,7411942,2515966,2020-03-21 04:00:00,6,1
2,A002,R051,02-00-00,59 ST,NQR456W,BMT,03/21/2020,08:00:00,REGULAR,7411945,2515979,2020-03-21 08:00:00,16,1
3,A002,R051,02-00-00,59 ST,NQR456W,BMT,03/21/2020,12:00:00,REGULAR,7411969,2516000,2020-03-21 12:00:00,45,1
4,A002,R051,02-00-00,59 ST,NQR456W,BMT,03/21/2020,16:00:00,REGULAR,7412028,2516024,2020-03-21 16:00:00,83,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
616467,TRAM2,R469,00-05-01,RIT-ROOSEVELT,R,RIT,03/13/2020,05:00:00,REGULAR,5554,507,2020-03-13 05:00:00,0,4935
616468,TRAM2,R469,00-05-01,RIT-ROOSEVELT,R,RIT,03/13/2020,09:00:00,REGULAR,5554,507,2020-03-13 09:00:00,0,4935
616469,TRAM2,R469,00-05-01,RIT-ROOSEVELT,R,RIT,03/13/2020,13:00:00,REGULAR,5554,507,2020-03-13 13:00:00,0,4935
616470,TRAM2,R469,00-05-01,RIT-ROOSEVELT,R,RIT,03/13/2020,17:00:00,REGULAR,5554,507,2020-03-13 17:00:00,0,4935


#### No more negatives
7/3 note: this hasn't been written to CSV yet and still needs to be used.

In [7]:
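def noNegatives(dataframe):
    # Blank out only the negative FOOTTRAFFIC values. Setting whole rows to NaN
    # would also erase TURNSTILE, leaving nothing to match on in the merge below.
    dataframe['FOOTTRAFFIC'] = dataframe['FOOTTRAFFIC'].where(dataframe['FOOTTRAFFIC'] >= 0)
    # Mean of the remaining (non-negative) readings for each turnstile
    grouped = dataframe.groupby('TURNSTILE')['FOOTTRAFFIC'].mean()
    dataframe = pd.merge(dataframe, grouped,  how='left',
                         left_on=['TURNSTILE'],
                         right_on = ['TURNSTILE'])
    # Fill the blanked readings with that turnstile's mean
    dataframe['FOOTTRAFFIC'] = dataframe.FOOTTRAFFIC_x.fillna(dataframe.FOOTTRAFFIC_y)
    del dataframe['FOOTTRAFFIC_x']
    dataframe = dataframe.rename(columns = {'FOOTTRAFFIC_y':'MEANTRAFFIC'})
    return dataframe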

```
june_turnstiles = noNegatives(june_turnstiles)
march_turnstiles = noNegatives(march_turnstiles)
december_turnstiles = noNegatives(december_turnstiles)
```

Let's be sure! Each shape printed below should have zero rows once `noNegatives` has been applied.

In [8]:
no_good = march_turnstiles[march_turnstiles['FOOTTRAFFIC'] < 0]
print(no_good.shape)
no_good = june_turnstiles[june_turnstiles['FOOTTRAFFIC'] < 0]
print(no_good.shape)
no_good = december_turnstiles[december_turnstiles['FOOTTRAFFIC'] < 0]
print(no_good.shape)

(14499, 14)
(13963, 14)
(5752, 14)


#### Re-writing to CSVs
Now that the data is in good condition, it's helpful to save these updated frames again.

`
december_turnstiles.to_csv('december_turnstiles.csv')
march_turnstiles.to_csv('march_turnstiles.csv')
june_turnstiles.to_csv('june_turnstiles.csv')
`

### General Data Info

In [9]:
print(june_turnstiles.shape, '\n')
print('Columns:\n' , june_turnstiles.columns)
june_turnstiles.head(2)

(620072, 14) 

Columns:
 Index(['C/A', 'UNIT', 'SCP', 'STATION', 'LINENAME', 'DIVISION', 'DATE', 'TIME',
       'DESC', 'ENTRIES', 'EXITS', 'TIMESTAMP', 'FOOTTRAFFIC', 'TURNSTILE'],
      dtype='object')


Unnamed: 0,C/A,UNIT,SCP,STATION,LINENAME,DIVISION,DATE,TIME,DESC,ENTRIES,EXITS,TIMESTAMP,FOOTTRAFFIC,TURNSTILE
0,A002,R051,02-00-00,59 ST,NQR456W,BMT,06/20/2020,00:00:00,REGULAR,7424218,2522558,2020-06-20 00:00:00,3,1
1,A002,R051,02-00-00,59 ST,NQR456W,BMT,06/20/2020,04:00:00,REGULAR,7424220,2522559,2020-06-20 04:00:00,3,1


#### Finding Daily Foottraffic Numbers

In [10]:
june_turnstiles.head(2)

Unnamed: 0,C/A,UNIT,SCP,STATION,LINENAME,DIVISION,DATE,TIME,DESC,ENTRIES,EXITS,TIMESTAMP,FOOTTRAFFIC,TURNSTILE
0,A002,R051,02-00-00,59 ST,NQR456W,BMT,06/20/2020,00:00:00,REGULAR,7424218,2522558,2020-06-20 00:00:00,3,1
1,A002,R051,02-00-00,59 ST,NQR456W,BMT,06/20/2020,04:00:00,REGULAR,7424220,2522559,2020-06-20 04:00:00,3,1


as 0's

In [11]:
daily_counts_df = june_turnstiles.groupby(["TURNSTILE","DATE"],as_index=False)["FOOTTRAFFIC"].sum()
daily_counts_df.head(2)

Unnamed: 0,TURNSTILE,DATE,FOOTTRAFFIC
0,1,06/06/2020,-4539
1,1,06/07/2020,124


#### Checking for Top Stations

1. Group by station.
2. Sum the foot traffic.
3. Sort the totals in descending order.
4. Slice the top 5 rows.
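
The steps listed above could be sketched as follows. The frame is a made-up stand-in for one of the turnstile frames (`STATION X` is a placeholder name), so the numbers carry no meaning beyond the example.

```python
import pandas as pd

demo = pd.DataFrame({
    "STATION":     ["59 ST", "59 ST", "RIT-ROOSEVELT", "STATION X", "STATION X"],
    "FOOTTRAFFIC": [45, 83, 0, 10, 5],
})

top_stations = (demo.groupby("STATION")["FOOTTRAFFIC"]
                    .sum()                          # total foot traffic per station
                    .sort_values(ascending=False)   # busiest first
                    .head(5))                       # slice the top 5 rows
print(top_stations.index.tolist())  # ['59 ST', 'STATION X', 'RIT-ROOSEVELT']
```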

## Plotting Values
- Average for each hour?
- Which stations are used at what time?
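
One possible starting point for the first question is to average `FOOTTRAFFIC` by the hour of each reading. This is a rough sketch on made-up numbers, assuming `TIMESTAMP` is a datetime column; note that in the sample rows shown in this notebook the readings are four hours apart, so each "hour" here really labels the end of a four-hour interval.

```python
import pandas as pd

demo = pd.DataFrame({
    "TIMESTAMP": pd.to_datetime([
        "2020-03-21 08:00:00", "2020-03-21 12:00:00",
        "2020-03-22 08:00:00", "2020-03-22 12:00:00",
    ]),
    "FOOTTRAFFIC": [16, 45, 20, 55],
})

# Group by the hour of the reading and average across days
hourly_mean = demo.groupby(demo["TIMESTAMP"].dt.hour)["FOOTTRAFFIC"].mean()
print(hourly_mean.index.tolist())  # [8, 12]
print(hourly_mean.tolist())        # [18.0, 50.0]
```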


In [12]:
import matplotlib.pyplot as plt
import numpy as np
%matplotlib inline 

In [13]:
import matplotlib.dates as mdates

# TURNSTILE holds integer IDs, so compare against an int rather than "1"
turnstile = 1

one_turnstile = daily_counts_df[daily_counts_df["TURNSTILE"] == turnstile]
# DATE is stored as MM/DD/YYYY text; convert it so the date formatter works
dates = pd.to_datetime(one_turnstile["DATE"], format="%m/%d/%Y")
# daily_counts_df has a FOOTTRAFFIC column, not DAILY_ENTRIES
counts = one_turnstile["FOOTTRAFFIC"]

fig, ax = plt.subplots(figsize=(10,3))
ax.plot(dates, counts)
ax.set_title(f"Turnstile ({turnstile}) Daily Foot Traffic", weight="bold")
ax.xaxis.set_major_formatter(mdates.DateFormatter("%m/%d"))
ax.xaxis.set_major_locator(mdates.DayLocator(interval = 3))
ax.set_ylabel("Daily Foot Traffic", labelpad=8);

In [None]:
june_turnstiles['']

In [None]:
plt.plot(data_x, data_list)

### Notes

- Issues with grouping by station
- Chunk the time series into hours
- Rush hour time frames
- Would need an explanation for why to divide by 4
- Interval starts depend on line and unit
- Put all stations on the same hourly clock
- Take the dates into account
- Focus on the most popular stations

In [None]:
# def get_daily_counts(row, max_counter):
#     """
#     This function is used for maxFoottraffic.
#     """
#     counter = row["FOOTTRAFFIC"] - row["PREV_FOOTTRAFFIC"]
#     if row < 0:
#         # Maybe counter is reversed?
#         row =- row
        
#     if row > max_counter:
#         # Maybe counter was reset to 0? 
#         row = min(row["FOOTTRAFFIC"], row["PREV_FOOTTRAFFIC"])
        
#     if row > max_counter:
#         # Check it again to make sure we're not still giving a counter that's too big
#         return 0
#     return row

# def maxFoottraffic(dataframe):
#     """
#     This function calls get_daily_counts 
#     so that function must be entered prior.
#     The purpose of this function is check the 
#     top 10 highly trafficked stations,
#     depending on what dataframe is entered 
#     (June, July, December, etc.)
#     """
#     daily_counts_df = dataframe.groupby(["TURNSTILE","DATE"], as_index=False)["FOOTTRAFFIC"].first()

#     daily_counts_df[["PREV_DATE", "PREV_FOOTTRAFFIC"]] = (daily_counts_df
#                                                        .groupby(["TURNSTILE"])["DATE", "FOOTTRAFFIC"]
#                                                        .apply(lambda grp: grp.shift(1)))

#     daily_counts_df.dropna(subset=["PREV_DATE"], inplace=True)
    
#     daily_counts_df["DAILY_FOOTTRAFFIC"] = daily_counts_df.apply(get_daily_counts, axis=1, max_counter=100000)
  
#     station_counts_df = daily_counts_df.groupby(["STATION"], as_index=False)["DAILY_FOOTTRAFFIC"].sum()
    
#     return station_counts_df.sort_values('DAILY_FOOTTRAFFIC', ascending=False)

In [None]:
def get_daily_counts(row, max_counter):
    """
    This function is used for maxFoottraffic.
    """
    counter = row["FOOTTRAFFIC"] - row["PREV_FOOTTRAFFIC"]
    if counter < 0:
        # Maybe counter is reversed?
        counter = -counter
    if counter > max_counter:
        # Maybe counter was reset to 0? 
        counter = min(row["FOOTTRAFFIC"], row["PREV_FOOTTRAFFIC"])
    if counter > max_counter:
        # Check it again to make sure we're not still giving a counter that's too big
        return 0
    return counter

def maxFoottraffic(dataframe):
    """
    This function calls get_daily_counts, so that function must be defined first.
    Its purpose is to find the top 10 most heavily trafficked stations
    for whichever dataframe is passed in (June, March, December, etc.).
    """
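    # Keep STATION in the grouping so it is still available for the station totals below
    daily_counts_df = dataframe.groupby(["STATION","TURNSTILE","DATE"], as_index=False)["FOOTTRAFFIC"].first()
    # Select the columns with a list (double brackets); newer pandas versions reject a bare tuple here
    daily_counts_df[["PREV_DATE", "PREV_FOOTTRAFFIC"]] = (daily_counts_df
                                                       .groupby(["TURNSTILE"])[["DATE", "FOOTTRAFFIC"]]
                                                       .shift(1))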
    daily_counts_df.dropna(subset=["PREV_DATE"], inplace=True)
    
    daily_counts_df["DAILY_FOOTTRAFFIC"] = daily_counts_df.apply(get_daily_counts, axis=1, max_counter=100000)
  
    station_counts_df = daily_counts_df.groupby(["STATION"], as_index=False)["DAILY_FOOTTRAFFIC"].sum()
    
    return station_counts_df.sort_values('DAILY_FOOTTRAFFIC', ascending=False).head(10)


In [None]:
# station_counts_df = daily_counts_df.groupby(["STATION", "DATE"], as_index=False)["DAILY_ENTRIES"].sum()
# station_counts_df.head()

In [None]:
# Keep only the rows for the five busiest stations reported by maxFoottraffic
top5_names = maxFoottraffic(june_turnstiles)['STATION'].head(5)
top5 = june_turnstiles[june_turnstiles['STATION'].isin(top5_names)]