# Explore the Data

In [1]:
import pandas as pd
import processTurnstiles

### List functions of the processTurnstiles module

In [3]:
help(processTurnstiles)

Help on module processTurnstiles:

NAME
    processTurnstiles - Jonathan L Chu, 2020 for Metis SF20_DS18

DESCRIPTION
    Module to download and process
    MTA turnstiles data for use in pandas

FUNCTIONS
    get_data(week_nums)
        Downloads MTA turnstiles data from site, for specified week_nums.
        Returns pandas dataframe of raw data
    
    processTurnstiles(df)
        Reads in raw MTA turnstile data as a pd dataframe
        Converts 'DATE' and 'TIME' cols to Datetime objects,
        adds 'ENTRIES_DIFF' col of count of entries (calculated 
        from 'ENTRIES' running total)
        
        Please be careful about the automatic outlier removal here
        
        Returns: pandas DataFrame
    
    readProcessedData(path)
        Reads in ALREADY PROCESSED turnstile data from 'file' and drops old index,
        then converts DATETIME and DATE columns to datetime objects

FILE
    /Users/kibbles/Documents/Metis/metisproject1/processTurnstiles.py
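
The help text above says `processTurnstiles(df)` derives an `ENTRIES_DIFF` column from the `ENTRIES` running total. The module's source is not shown here, so the following is only a minimal sketch of one way to do that, assuming each physical turnstile is identified by the combination of `C/A`, `UNIT`, `SCP`, and `STATION`. The toy values are made up for illustration, and the automatic outlier removal mentioned in the docstring is not reproduced.

```python
import pandas as pd

# Toy data using the raw MTA column names; the values are invented for illustration.
raw = pd.DataFrame({
    "C/A": ["A002"] * 3 + ["A006"] * 3,
    "UNIT": ["R051"] * 3 + ["R079"] * 3,
    "SCP": ["02-00-00"] * 3 + ["00-00-00"] * 3,
    "STATION": ["59 ST"] * 3 + ["5 AV/59 ST"] * 3,
    "DATE": ["06/01/2019"] * 6,
    "TIME": ["00:00:00", "04:00:00", "08:00:00"] * 2,
    "ENTRIES": [7080000, 7080020, 7080065, 1500, 1510, 1550],
})

# Combine the DATE and TIME strings into a single datetime column.
raw["DATETIME"] = pd.to_datetime(
    raw["DATE"] + " " + raw["TIME"], format="%m/%d/%Y %H:%M:%S"
)

# Assumption: these four columns together identify one turnstile.
turnstile_keys = ["C/A", "UNIT", "SCP", "STATION"]

# Sort chronologically within each turnstile, then difference the running total.
# The first reading of each turnstile has no predecessor, so its diff is NaN.
raw = raw.sort_values(turnstile_keys + ["DATETIME"])
raw["ENTRIES_DIFF"] = raw.groupby(turnstile_keys)["ENTRIES"].diff()

print(raw[["STATION", "DATETIME", "ENTRIES", "ENTRIES_DIFF"]])
```

Differencing within each group keeps one turnstile's counter from being subtracted from another's at group boundaries.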




In [4]:
path = r'' # directory containing the turnstiles data; empty here because the files are in the current directory
turnstiles_raw = r'turnstiles_june2019.txt'
path+turnstiles_raw

'turnstiles_june2019.txt'

### To download or process the data, uncomment the following lines (and skip the readProcessedData() call further down)

In [5]:
week_nums = [190608, 190615, 190622, 190629]
#df = processTurnstiles.get_data(week_nums)
#df = processTurnstiles.processTurnstiles(df)
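
The `week_nums` values above read as `YYMMDD` dates, and each one falls on a Saturday in June 2019. Assuming that pattern holds for other months, the list can be generated rather than typed by hand. This is a small sketch; `generated_week_nums` is a name introduced here for illustration.

```python
import pandas as pd

# Every Saturday from 2019-06-02 through 2019-06-30.
saturdays = pd.date_range("2019-06-02", "2019-06-30", freq="W-SAT")

# Format each date as a YYMMDD integer, matching the hand-written list.
generated_week_nums = [int(d.strftime("%y%m%d")) for d in saturdays]
print(generated_week_nums)  # [190608, 190615, 190622, 190629]
```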

### Optionally, save the data

In [8]:
turnstiles_processed = r'turnstiles_june2019_procd.txt'
# df.to_csv(path+turnstiles_processed)

In [6]:

df = processTurnstiles.readProcessedData(path+turnstiles_processed)
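
According to its docstring, `readProcessedData` drops the old index and converts the date columns back to datetime objects. Those steps are needed because a CSV round trip stores datetimes as plain strings. The sketch below illustrates the idea with plain pandas on a made-up frame and an in-memory buffer. It is an assumption about the general approach, not the module's actual code.

```python
import io
import pandas as pd

# Made-up processed data with a datetime column.
processed = pd.DataFrame({
    "STATION": ["59 ST", "59 ST"],
    "DATETIME": pd.to_datetime(["2019-06-01 00:00:00", "2019-06-01 04:00:00"]),
    "ENTRIES_DIFF": [float("nan"), 20.0],
})

# Write to an in-memory buffer instead of a file; the index is saved
# as an unnamed first column.
buffer = io.StringIO()
processed.to_csv(buffer)
buffer.seek(0)

# Use the saved index column as the index again and parse DATETIME on read.
reloaded = pd.read_csv(buffer, index_col=0, parse_dates=["DATETIME"])
print(reloaded.dtypes)
```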

In [7]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 779480 entries, 0 to 779479
Data columns (total 16 columns):
 #   Column                                                                Non-Null Count   Dtype         
---  ------                                                                --------------   -----         
 0   level_0                                                               779480 non-null  int64         
 1   C/A                                                                   779480 non-null  object        
 2   UNIT                                                                  779480 non-null  object        
 3   SCP                                                                   779480 non-null  object        
 4   STATION                                                               779480 non-null  object        
 5   LINENAME                                                              779480 non-null  object        
 6   DIVISION                    