**MTA Importer**

This module downloads turnstile data from the MTA website for the period between two specified dates.
Each text file that falls between the given dates is downloaded, and the files are combined into one dataframe.
At the end, the dataframe can optionally be saved to a CSV file.

This notebook only demonstrates and tests the functions; the **main script is located in mta_importer.py**.

The functions from mta_importer.py are shown below.

In [12]:
import csv
import pandas as pd
from bs4 import BeautifulSoup
import urllib.request  #a plain "import urllib" does not reliably load the request submodule
import re


In [4]:
#Find all the data files on MTA website
def mta_updater():
    prefix='http://web.mta.info/developers/'
    html='http://web.mta.info/developers/turnstile.html'
    webv=urllib.request.urlopen(html)
    soup=BeautifulSoup(webv,"lxml")
    tags = soup('a')
    linkslist=[]
    for tag in tags:
        h=tag.get('href',None)
        if h is not None:
            if h.startswith('data'):
                dates=re.findall('.[_]([0-9]+)',h)[0]
                linkslist.append((int(dates),prefix+h))
    return linkslist

In [5]:
links=mta_updater()
print (links[:5])

[(170624, 'http://web.mta.info/developers/data/nyct/turnstile/turnstile_170624.txt'), (170617, 'http://web.mta.info/developers/data/nyct/turnstile/turnstile_170617.txt'), (170610, 'http://web.mta.info/developers/data/nyct/turnstile/turnstile_170610.txt'), (170603, 'http://web.mta.info/developers/data/nyct/turnstile/turnstile_170603.txt'), (170527, 'http://web.mta.info/developers/data/nyct/turnstile/turnstile_170527.txt')]
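
The integer in each tuple is the six-digit `YYMMDD` stamp taken from the file name. As a minimal sketch (not part of mta_importer.py), the stamp can also be read as a real calendar date using only the standard library. The helper name `link_date` is hypothetical, and the pattern assumes file names of the form `turnstile_YYMMDD.txt`, as in the output above.

```python
import re
from datetime import date, datetime

def link_date(url):
    """Hypothetical helper: return the date encoded in a turnstile file name, or None."""
    # Assumes the file name ends in turnstile_YYMMDD.txt, as in the links listed above
    match = re.search(r'turnstile_(\d{6})\.txt$', url)
    if match is None:
        return None
    return datetime.strptime(match.group(1), '%y%m%d').date()

url = 'http://web.mta.info/developers/data/nyct/turnstile/turnstile_170624.txt'
print(link_date(url))  # 2017-06-24
```

Working with `date` objects rather than integers makes later comparisons and week offsets follow the calendar.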


>`mta_importer` returns the index positions, within the list of links, of the files that match the start and end dates.

In [7]:
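#Return the index of the files for the start and end dates
#ds and de are 'MM-DD-YY' strings; links is the list returned by mta_updater (newest file first)
def mta_importer(ds,de,links):
    #Convert 'MM-DD-YY' to a YYMMDD integer so it can be compared with the link dates
    start=int(ds[-2:]+ds[0:2]+ds[3:5])
    end=int(de[-2:]+de[0:2]+de[3:5])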
    i=0
    for date_end in links:
        if end >= date_end[0]+7:
            start_ind=i
            d_e=date_end[0]
            break
        else:
            i=i+1
    for date_start in links[start_ind:]:
        if start >= date_start[0]:
            end_ind=i
            d_s=date_start[0]
            break
        else:
            i+=1
    print ('Date Range: %s to %s' % (d_s,d_e))
    return ([start_ind,end_ind])

In [None]:
ds='06-17-17'
de='06-30-17'
sel=mta_importer(ds,de,links)
print (sel)
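
One thing to watch in `mta_importer`: the `+7` is applied to a `YYMMDD` integer, so near a month boundary the result is not a calendar date (for example, 170527 + 7 is 170534). The sketch below applies the same two comparisons with `datetime` objects instead. It is an illustration under assumptions, not the code in mta_importer.py: the names `select_range`, `parse_mmddyy`, and `parse_yymmdd` are hypothetical, and `sample` is a hand-made list in the same shape as `links`.

```python
from datetime import datetime, timedelta

def parse_mmddyy(text):
    """Parse an 'MM-DD-YY' string such as '06-17-17' into a date."""
    return datetime.strptime(text, '%m-%d-%y').date()

def parse_yymmdd(number):
    """Parse a YYMMDD integer such as 170624 into a date."""
    return datetime.strptime('%06d' % number, '%y%m%d').date()

def select_range(ds, de, links):
    """Hypothetical variant of mta_importer that compares real dates."""
    start, end = parse_mmddyy(ds), parse_mmddyy(de)
    dates = [parse_yymmdd(d) for d, _ in links]  # assumed newest first
    week = timedelta(days=7)
    # First file whose date plus one week is not after the end date
    start_ind = next((i for i, d in enumerate(dates) if end >= d + week), None)
    if start_ind is None:
        raise ValueError('no file matches end date %s' % de)
    # From there on, first file dated on or before the start date
    end_ind = next((i for i in range(start_ind, len(dates)) if start >= dates[i]), None)
    if end_ind is None:
        raise ValueError('no file matches start date %s' % ds)
    return [start_ind, end_ind]

# Sample data shaped like the output of mta_updater (URLs shortened to file names)
sample = [(d, 'turnstile_%d.txt' % d)
          for d in (170624, 170617, 170610, 170603, 170527, 170520)]
print(select_range('06-17-17', '06-30-17', sample))  # [1, 1]
print(select_range('05-20-17', '06-01-17', sample))  # [5, 5]
```

Raising `ValueError` when no file matches is a design choice of this sketch; it avoids reaching the `return` with an index that was never assigned.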

>This is the main function; it returns a dataframe built from the files that fall within the requested dates.
>The dates are currently hard-coded to make the function easier to run in a Jupyter notebook.

In [26]:
def mta_selector():
    #Define date period
    ds='06-17-17' #I have fixed the values just for usability in jupyter notebooks
    de='06-30-17' # Fixed
    #Run mta_updater which returns an updated list of links from MTA website
    links=mta_updater()
    sel=mta_importer(ds,de,links)
    df_list=[]
    clicks=0
    for url in links[sel[0]:sel[1]+1]:
        df_list.append(pd.read_csv(url[1],header=0))
        clicks+=1
        print ('%d/%d completed' % (clicks,sel[1]-sel[0]+1)) #total is inclusive of both ends

    #df_list=[(pd.read_csv(url[1],header=0),print() for url in links[sel[0]:sel[1]+1]]
    df=pd.concat(df_list,ignore_index=True)

    #Write to csv file
    csv_q=input('Do you want to write to csv (y/n): ')

    #startswith also handles an empty answer, which csv_q[0] would not
    if csv_q.strip().lower().startswith('y'):
        name=input('CSV file name: ')
        df.to_csv(name, sep=',')

    return df

In [27]:
qq=mta_selector() #This returns the dataframe for the specified dates

Date Range: 170617 to 170617
1/1 completed
Do you want to write to csv (y/n): n


In [24]:
qq.head()

Unnamed: 0,C/A,UNIT,SCP,STATION,LINENAME,DIVISION,DATE,TIME,DESC,ENTRIES,EXITS
0,A002,R051,02-00-00,59 ST,NQR456W,BMT,06/10/2017,00:00:00,REGULAR,6215258,2104297
1,A002,R051,02-00-00,59 ST,NQR456W,BMT,06/10/2017,04:00:00,REGULAR,6215284,2104303
2,A002,R051,02-00-00,59 ST,NQR456W,BMT,06/10/2017,08:00:00,REGULAR,6215318,2104337
3,A002,R051,02-00-00,59 ST,NQR456W,BMT,06/10/2017,12:00:00,REGULAR,6215475,2104417
4,A002,R051,02-00-00,59 ST,NQR456W,BMT,06/10/2017,16:00:00,REGULAR,6215841,2104465
