# Project 1: Exploratory Data Analysis

## MTA Turnstile Dataset

### Chris Doenlen, Vanessa Hu, Jay Park, Matt Ranalletta

#### Sources & Reference
- [MTA Turnstile Data](http://web.mta.info/developers/turnstile.html)
- [MTA Turnstile Data - Codebook](http://web.mta.info/developers/resources/nyct/turnstile/ts_Field_Description.txt)
- [MTA NYC Subway Map](http://web.mta.info/maps/Large_Print_Map.pdf)
- [Kaggle: MTA Turnstile Data Analysis](https://www.kaggle.com/nieyuqi/mta-turnstile-data-analysis)

### Data Compilation and Cleaning

1. Retrieve 10 weeks of MTA Turnstile data (July 20, 2019 through September 21, 2019) and compile them into a single dataframe
2. Clean the raw data and perform basic manipulations and calculations
3. Export the final dataset to CSV for use in subsequent notebooks
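
The weekly file names encode a date as `yymmdd`, one file every seven days, so the list of URLs for step 1 can also be generated instead of typed out. A minimal sketch, assuming that naming pattern holds for the whole range; `BASE_URL` and `weekly_turnstile_urls` are hypothetical names introduced here for illustration:

```python
import pandas as pd

# URL pattern taken from the weekly files used in this notebook
BASE_URL = 'http://web.mta.info/developers/data/nyct/turnstile/turnstile_{}.txt'


def weekly_turnstile_urls(start, end):
    """Return one URL per week between start and end (inclusive)."""
    # Assumption: one file every seven days, starting on `start`
    file_dates = pd.date_range(start=start, end=end, freq='7D')
    return [BASE_URL.format(day.strftime('%y%m%d')) for day in file_dates]


urls = weekly_turnstile_urls('2019-07-20', '2019-09-21')
print(len(urls))
print(urls[0])
print(urls[-1])
```

Generating the list this way makes it easier to change the date range later without editing each URL by hand.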

#### Raw Data Import and Treatment

In [None]:
import numpy as np
import pandas as pd

In [None]:
# MTA Files to read: 10 Week Range 2019-07-20 through 2019-09-21

datafiles = [
    'http://web.mta.info/developers/data/nyct/turnstile/turnstile_190720.txt',
    'http://web.mta.info/developers/data/nyct/turnstile/turnstile_190727.txt',
    'http://web.mta.info/developers/data/nyct/turnstile/turnstile_190803.txt',
    'http://web.mta.info/developers/data/nyct/turnstile/turnstile_190810.txt',
    'http://web.mta.info/developers/data/nyct/turnstile/turnstile_190817.txt',
    'http://web.mta.info/developers/data/nyct/turnstile/turnstile_190824.txt',
    'http://web.mta.info/developers/data/nyct/turnstile/turnstile_190831.txt',
    'http://web.mta.info/developers/data/nyct/turnstile/turnstile_190907.txt',
    'http://web.mta.info/developers/data/nyct/turnstile/turnstile_190914.txt',
    'http://web.mta.info/developers/data/nyct/turnstile/turnstile_190921.txt',
]

In [None]:
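# Read each weekly file and stack them into a single dataframe
file_list = []

for datafile in datafiles:
    df = pd.read_csv(datafile)
    file_list.append(df)

# ignore_index avoids repeating each file's 0..n row labels in the result
mta_raw = pd.concat(file_list, ignore_index=True)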

In [None]:
mta_raw.head()

In [None]:
# Rename columns (the entry and exit counters are cumulative, hence the _cum suffix)

mta_raw.columns = ['ca',
                   'unit',
                   'scp',
                   'station',
                   'linename',
                   'division',
                   'date',
                   'time',
                   'desc',
                   'entries_cum',
                   'exits_cum']

In [None]:
# Create a timestamp column with datetime object
# Convert date data to datetime object

mta_raw['timestamp'] = pd.to_datetime(mta_raw['date'] + ' ' + mta_raw['time'])
mta_raw['date'] = pd.to_datetime(mta_raw['date'])

In [None]:
# Create turnstile column as proxy for unique identifier
mta_raw['turnstile'] = mta_raw['station'] + '-' + mta_raw['ca'] + '-' + mta_raw['unit'] + '-' + mta_raw['scp']

#### Calculating actual entries and exits from cumulative figures

In [None]:
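# Sort so that each turnstile's readings are in chronological order
mta_sorted = mta_raw.sort_values(['turnstile', 'timestamp'])
mta_sorted = mta_sorted.reset_index(drop=True)

turnstile_grouped = mta_sorted.groupby('turnstile')

# Difference consecutive cumulative readings within each turnstile;
# the first reading of every turnstile has no predecessor and becomes NaN
mta_sorted['entries'] = turnstile_grouped['entries_cum'].diff()
mta_sorted['exits'] = turnstile_grouped['exits_cum'].diff()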
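
To see what the per-turnstile difference does, here is a minimal sketch on a toy frame. The turnstile labels and counter values are made up for illustration and are not taken from the MTA data:

```python
import pandas as pd

# Toy cumulative counter readings for two hypothetical turnstiles
toy = pd.DataFrame({
    'turnstile': ['A', 'A', 'A', 'B', 'B'],
    'timestamp': pd.to_datetime(['2019-07-20 00:00:00',
                                 '2019-07-20 04:00:00',
                                 '2019-07-20 08:00:00',
                                 '2019-07-20 00:00:00',
                                 '2019-07-20 04:00:00']),
    'entries_cum': [100, 130, 175, 5000, 5020],
})

# Same pattern as above: sort, then difference within each turnstile
toy = toy.sort_values(['turnstile', 'timestamp']).reset_index(drop=True)
toy['entries'] = toy.groupby('turnstile')['entries_cum'].diff()

print(toy[['turnstile', 'entries_cum', 'entries']])
```

The first row of each turnstile has no earlier reading, so its `entries` value is NaN; the difference never crosses from one turnstile into the next.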

In [None]:
mta_sorted.head()

In [None]:
# Delete mta_raw
del mta_raw

#### Dealing with outliers and messy data

Three types of messy data:
* **Negative values**: some turnstiles count down, so the difference between consecutive readings is negative. Because the dataset is so large, we can afford to convert these values to NaN and drop them.
* **Very large values**: some turnstile counts are implausibly large. We set a threshold of 10,000 entries or exits per turnstile per time period. Over a typical four-hour audit period this works out to roughly 40 entries or exits per minute, which is feasible. Any value above the threshold is converted to NaN and dropped.
* **Not a number (NaN)**: most of these values occur at the first reading for each turnstile in our time period, where there is no prior reading from which to calculate actual values from the cumulative figures.

All three cases will be converted to NaN (if not already NaN) and dropped from the dataset. 
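
The three cases can be sketched on a toy frame as follows. The values are hypothetical, and the four-hour period used for the per-minute arithmetic is an assumption about the typical audit interval:

```python
import numpy as np
import pandas as pd

THRESHOLD = 10_000       # maximum plausible entries or exits per period
PERIOD_MINUTES = 4 * 60  # assumption: a regular audit period lasts four hours

# Threshold expressed per minute (10,000 / 240)
per_minute = THRESHOLD / PERIOD_MINUTES

# Hypothetical per-period counts showing each kind of messy value
toy = pd.DataFrame({
    'entries': [np.nan, 120.0, -35.0, 2_500_000.0, 80.0],
    'exits':   [np.nan, 90.0, 40.0, 60.0, 15_000.0],
})

# Convert negative and above-threshold values to NaN
for col in ['entries', 'exits']:
    bad = (toy[col] < 0) | (toy[col] > THRESHOLD)
    toy.loc[bad, col] = np.nan

# Drop every row in which either count is missing
clean = toy.dropna(subset=['entries', 'exits'])

print(round(per_minute, 1))
print(clean)
```

Only the row with plausible values in both columns survives; a row is dropped as soon as either its entries or its exits value is invalid.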

In [None]:
print('Number of negative entries: %d' %len(mta_sorted['entries'][mta_sorted['entries'] < 0]))
print('Number of negative exits: %d' %len(mta_sorted['exits'][mta_sorted['exits'] < 0]))
print('')
print('Number of entries > 10k: %d' %len(mta_sorted['entries'][mta_sorted['entries'] > 10000]))
print('Number of exits > 10k: %d' %len(mta_sorted['exits'][mta_sorted['exits'] > 10000]))
print('')
print('Number of NaN rows: %d' %len(mta_sorted[mta_sorted['entries'].isnull()]))

In [None]:
# Filtering for negative and above threshold values

ents_neg = mta_sorted.loc[:, 'entries'] < 0
exits_neg = mta_sorted.loc[:, 'exits'] < 0

ents_10k = mta_sorted.loc[:, 'entries'] > 10000
exits_10k = mta_sorted.loc[:, 'exits'] > 10000

In [None]:
# Converting negative and above-threshold values to NaN

mta_sorted.loc[ents_neg, 'entries'] = np.nan
mta_sorted.loc[exits_neg, 'exits'] = np.nan

mta_sorted.loc[ents_10k, 'entries'] = np.nan
mta_sorted.loc[exits_10k, 'exits'] = np.nan

In [None]:
print('Number of negative entries: %d' %len(mta_sorted['entries'][mta_sorted['entries'] < 0]))
print('Number of negative exits: %d' %len(mta_sorted['exits'][mta_sorted['exits'] < 0]))
print('')
print('Number of entries > 10k: %d' %len(mta_sorted['entries'][mta_sorted['entries'] > 10000]))
print('Number of exits > 10k: %d' %len(mta_sorted['exits'][mta_sorted['exits'] > 10000]))
print('')
print('Number of NaN rows: %d' %len(mta_sorted[mta_sorted['entries'].isnull()]))

In [None]:
# Dropping rows where the calculated entries or exits are NaN

mta_sorted.dropna(subset=['entries', 'exits'], inplace=True)

In [None]:
print('Number of NaN rows: %d' %len(mta_sorted[mta_sorted['entries'].isnull()]))

#### Calculating total activity per turnstile

In [None]:
mta_sorted['total'] = mta_sorted['entries'] + mta_sorted['exits']

In [None]:
mta_sorted.head()

#### Creating clean, organized dataframe to be exported as csv and used for analysis

In [None]:
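# Select and order the final columns; .copy() makes mta an independent
# dataframe rather than a slice of mta_sorted
mta = mta_sorted[['station',
                  'turnstile',
                  'ca',
                  'unit',
                  'scp',
                  'linename',
                  'division',
                  'date',
                  'time',
                  'desc',
                  'timestamp',
                  'entries',
                  'exits',
                  'total']].copy()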

In [None]:
mta.head()

In [None]:
del mta_sorted

### Export clean dataframe to CSV for later use

*NOTE: Saved as a zip file to stay within GitHub's file size limits.*

In [None]:
mta.to_csv('mta_clean.zip', index=False)
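
A later notebook can read the zipped file back with `pd.read_csv`, which infers the compression from the `.zip` extension. A minimal round-trip sketch on a toy frame written to a temporary directory; the sample rows are hypothetical, and naming the inner file `mta_clean.csv` through `archive_name` is an optional choice not made in this notebook:

```python
import os
import tempfile

import pandas as pd

# Hypothetical stand-in for the cleaned dataframe
sample = pd.DataFrame({
    'station': ['TOY ST', 'TOY ST'],
    'timestamp': pd.to_datetime(['2019-07-20 04:00:00',
                                 '2019-07-20 08:00:00']),
    'entries': [30.0, 45.0],
})

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'mta_clean.zip')

    # Write a zip archive holding a single CSV file
    sample.to_csv(path, index=False,
                  compression={'method': 'zip',
                               'archive_name': 'mta_clean.csv'})

    # Compression is inferred from the extension; dates must be re-parsed
    restored = pd.read_csv(path, parse_dates=['timestamp'])

print(restored.dtypes)
```

CSV does not store column types, so datetime columns such as `timestamp` and `date` come back as strings unless they are passed to `parse_dates`.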