# Ridership per Station

In [1]:
import sys
import string
import logging

from util import mapper_logfile

logging.basicConfig(filename=mapper_logfile, format='%(message)s', level=logging.INFO, filemode='w')

For each line of input, the mapper should PRINT (not return) the `UNIT` as the key and the `ENTRIESn_hourly` value as the value, separated by a tab.

For example:
```
R002\t105105.0
```
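
The key/value line described above can be built from a single CSV row. The following is a minimal sketch, assuming `UNIT` sits at index 1 and `ENTRIESn_hourly` at index 6 and that the header row starts with a comma. `format_pair` is a hypothetical helper and the sample row is made up for illustration. It returns the string instead of printing it so that it is easy to check.

```python
def format_pair(csv_line):
    """Return 'UNIT<TAB>ENTRIESn_hourly' for one CSV row, or None for the header."""
    # Assumption: the header row starts with a comma (unnamed index column)
    if csv_line.startswith(','):
        return None
    fields = csv_line.strip().split(',')
    # Assumption: UNIT is column 1 and ENTRIESn_hourly is column 6
    return '{0}\t{1}'.format(fields[1], fields[6])

# Made-up sample row, shortened to the first seven columns
row = '0,R002,2011-05-01,01:00:00,1,REGULAR,105105.0'
pair = format_pair(row)
```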

In [3]:
def mapper():
    for line in sys.stdin:
        # Ignore the header
        if line.startswith(','):
            continue
        else:
            data = line.strip().split(',')
            unit = data[1]
            entries = data[6]
            print '{0}\t{1}'.format(unit, entries)
        
mapper()

Given the mapper output for this exercise, the reducer should PRINT (not return) one line per `UNIT` with the total `ENTRIESn_hourly` over the course of May (the period our data covers), separated by a tab.

An example output row from the reducer might look like this: 
```
R001\t500625.0
```

You can assume that the input to the reducer is sorted such that all rows corresponding to a particular `UNIT` are grouped together.
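
Because rows for the same `UNIT` arrive next to each other, the per-unit total can be computed in a single pass. The following is a minimal sketch of that idea using `itertools.groupby`, which is a different technique from a hand-written loop that tracks the previous key. `sum_by_unit` is a hypothetical helper and the sample lines are made up.

```python
from itertools import groupby

def sum_by_unit(lines):
    """Sum the values per key for tab-separated lines already grouped by key."""
    pairs = (line.strip().split('\t') for line in lines)
    # Skip malformed lines that do not have exactly a key and a value
    pairs = (p for p in pairs if len(p) == 2)
    totals = []
    # groupby only merges adjacent rows, which is why grouped input is required
    for unit, group in groupby(pairs, key=lambda p: p[0]):
        totals.append((unit, sum(float(entries) for _, entries in group)))
    return totals

# Made-up sample input, grouped by unit
sample = ['R001\t10.0\n', 'R001\t5.5\n', 'R002\t3.0\n']
totals = sum_by_unit(sample)
```

If the input were not grouped, the same unit would appear more than once in the result, so the sortedness assumption matters.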

In [4]:
def reducer():
    total_entries = 0
    old_unit = None

    for line in sys.stdin:
        data = line.strip().split('\t')
        
        if len(data) != 2:
            continue
        
        this_unit, entries = data
        
        # If we found a new unit, print out the data up to here
        if old_unit and old_unit != this_unit:
            print '{0}\t{1}'.format(old_unit, total_entries)
            total_entries = 0
        
        old_unit = this_unit
        total_entries += float(entries)
        
    # After the last line, emit the total for the final unit
    if old_unit is not None:
        print '{0}\t{1}'.format(old_unit, total_entries)
            
reducer()

# Ridership by Weather Type

For this exercise, compute the average value of the `ENTRIESn_hourly` column for different weather types. A weather type is defined by the combination of the `fog` and `rain` columns, which hold boolean values.

For example, one output of our reducer would be the average hourly entries across all hours when it was raining but not foggy.

Each line of input will be a row from our final Subway-MTA dataset in csv format.
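
One way to turn the two boolean columns into a weather-type key is sketched below. It assumes the fields arrive as strings such as `'0.0'` and `'1.0'`. Since any non-empty string is truthy in Python, the fields are converted to numbers before they are tested. `weather_key` is a hypothetical helper name.

```python
def weather_key(fog_field, rain_field):
    """Build a key such as 'nofog-rain' from two CSV fields."""
    # Assumption: fields look like '0.0' or '1.0'; convert before testing,
    # because the string '0.0' itself would count as true
    fog = int(float(fog_field))
    rain = int(float(rain_field))
    return '{0}fog-{1}rain'.format('' if fog else 'no', '' if rain else 'no')

# All four combinations of fog and rain
keys = [weather_key(f, r) for f in ('0.0', '1.0') for r in ('0.0', '1.0')]
```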

In [8]:
def mapper():

    def format_key(fog, rain):
        return '{}fog-{}rain'.format(
            '' if fog else 'no',
            '' if rain else 'no'
        )
    
    for line in sys.stdin:
        data = line.strip().split(',')
        if len(data) != 22 or data[1] == "UNIT":
            continue

        # fog and rain arrive as strings such as '0.0'. A non-empty string
        # is always truthy, so convert to int before passing to format_key.
        fog = int(float(data[14])) 
        rain = int(float(data[15]))

        entries = float(data[6])
        key = format_key(fog, rain)
        
        print '{0}\t{1}'.format(key, entries)    

mapper()

In [9]:
def reducer():
    riders = 0      # The number of total riders for this key
    num_hours = 0   # The number of hours with this key
    old_key = None

    for line in sys.stdin:
        data = line.strip().split('\t')
        
        if len(data) != 2:
            continue
        
        this_key, entries = data
        
        # If we found a new key, print out the data up to here
        if old_key and old_key != this_key:
            logging.info("Old Key: %s, This Key: %s" % (old_key, this_key))
            avg_riders = riders / num_hours
            print '{0}\t{1}'.format(old_key, avg_riders)
            riders = 0
            num_hours = 0
        
        old_key = this_key
        riders += float(entries)
        num_hours += 1.0
        
    # After the last line, emit the average for the final key
    if old_key is not None:
        avg_riders = riders / num_hours
        print '{0}\t{1}'.format(old_key, avg_riders)
        
reducer()

# Busiest Hour

In this exercise, for each turnstile unit, you will determine the date and time (in the span of this data set) at which the most people entered through the unit.
    
For each line of input, the mapper should PRINT (not return) the `UNIT`, `ENTRIESn_hourly`, `DATEn`, and `TIMEn` columns, separated by tabs.

For example:
```
R001\t100000.0\t2011-05-01\t01:00:00
```

In [10]:
import sys
import string
import logging

from util import mapper_logfile

logging.basicConfig(filename=mapper_logfile, format='%(message)s', level=logging.INFO, filemode='w')

In [11]:
def mapper():
    for line in sys.stdin:
        data = line.strip().split(',')

        if len(data) != 22 or data[1] == "UNIT":
            continue

        unit = data[1]
        date = data[2]
        time = data[3]
        entries = float(data[6])
        print '{0}\t{1}\t{2}\t{3}'.format(unit, entries, date, time) 

mapper()

Write a reducer that computes the busiest date and time (that is, the date and time with the most entries) for each turnstile unit. Ties should be broken in favor of datetimes that are later in the month of May. You may assume that the input to the reducer is sorted so that all entries corresponding to a given `UNIT` are grouped together.

The reducer should print the `UNIT` name, the datetime (the `DATEn` column followed by the `TIMEn` column, separated by a single space), and the number of entries at this datetime, separated by tabs.

For example, the output of the reducer should look like this:
```
R001	2011-05-11 17:00:00	   31213.0
R002	2011-05-12 21:00:00	   4295.0
R003	2011-05-05 12:00:00	   995.0
R004	2011-05-12 12:00:00	   2318.0
R005	2011-05-10 12:00:00	   2705.0
R006	2011-05-25 12:00:00	   2784.0
R007	2011-05-10 12:00:00	   1763.0
R008	2011-05-12 12:00:00	   1724.0
R009	2011-05-05 12:00:00	   1230.0
R010	2011-05-09 18:00:00	   30916.0
...
...
```
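
The tie-breaking rule can be expressed through tuple comparison. The following is a minimal sketch for the records of a single unit, assuming datetimes are formatted as `YYYY-MM-DD HH:MM:SS` so that string order matches chronological order. `busiest` is a hypothetical helper and the sample records are made up.

```python
def busiest(records):
    """records: iterable of (entries, 'YYYY-MM-DD HH:MM:SS') tuples for ONE unit."""
    # Tuples compare element by element: entries first, then the datetime
    # string, so a tie on entries is won by the later datetime
    return max(records)

# Made-up records for one unit; the first two tie on entries
sample = [
    (995.0, '2011-05-05 12:00:00'),
    (995.0, '2011-05-20 08:00:00'),
    (400.0, '2011-05-30 20:00:00'),
]
entries, when = busiest(sample)
```

Here the two records with 995.0 entries tie, and the one dated 2011-05-20 is chosen because it is later.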

In [13]:
def reducer():
    max_entries = 0
    old_unit = None
    datetime = ''

    for line in sys.stdin:
        data = line.strip().split('\t')

        if len(data) != 4:
            continue
        
        current_unit, entries, date, time = data
        current_entries = float(entries)
        current_datetime = date + ' ' + time

        # When the unit changes, emit the busiest datetime for the previous unit
        if old_unit is not None and old_unit != current_unit:
            print "{0}\t{1}\t{2}".format(old_unit, datetime, max_entries)
            max_entries = 0
            datetime = ''

        old_unit = current_unit

        # Keep this record if it has strictly more entries, or if it ties
        # the current maximum and falls later in the month
        if (current_entries > max_entries or
                (current_entries == max_entries and current_datetime > datetime)):
            datetime = current_datetime
            max_entries = current_entries

    # After the last line, emit the result for the final unit
    if old_unit is not None:
        print "{0}\t{1}\t{2}".format(old_unit, datetime, max_entries)

reducer()