
# TLC Trip Data

<img src="https://images.unsplash.com/photo-1540644622016-296c896739f3?ixlib=rb-1.2.1&ixid=eyJhcHBfaWQiOjEyMDd9&auto=format&fit=crop&w=1050&q=80" width=600>

*** 
The New York City Taxi and Limousine Commission (TLC), created in 1971, is the agency responsible for licensing and regulating New York City's Medallion (Yellow) taxi cabs, for-hire vehicles (community-based liveries, black cars and luxury limousines), commuter vans, and paratransit vehicles. The Commission's Board consists of nine members, eight of whom are unsalaried Commissioners. The salaried Chair/Commissioner presides over regularly scheduled public commission meetings and is the head of the agency, which maintains a staff of approximately 600 TLC employees.

Over 200,000 TLC licensees complete approximately 1,000,000 trips each day. To operate for hire, drivers must first undergo a background check, have a safe driving record, and complete 24 hours of driver training. TLC-licensed vehicles are inspected for safety and emissions at TLC's Woodside Inspection Facility.

*** 
TLC data has spawned an enormous number of coding challenges on Kaggle (including [this](https://www.kaggle.com/c/nyc-taxi-trip-duration) and [this](https://www.kaggle.com/c/new-york-city-taxi-fare-prediction)) and cool interactive visualizations like [this one](http://chriswhong.github.io/nyctaxi/), and it has been known to pop up in coding challenges given out by firms hiring data scientists.

Today we'll try to calculate the likelihood of ending a trip in a given area of the city given a certain starting point. With roughly 8 million trip records per month, we'll have to think carefully about how we process our data and scale our analysis.    
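As a minimal sketch of the quantity we're after: the likelihood of a dropoff zone given a pickup zone can be estimated as the number of trips between the two zones divided by the number of trips starting in the pickup zone. The trip pairs and the `likelihood` helper below are made up for illustration and are not part of the TLC data.

```python
from collections import Counter

# Hypothetical (pickup_zone, dropoff_zone) pairs standing in for real trip records
toy_trips = [(1, 2), (1, 2), (1, 3), (2, 1)]

# Count trips per pickup zone and per (pickup, dropoff) pair
pickup_counts = Counter(p for p, _ in toy_trips)
pair_counts = Counter(toy_trips)


def likelihood(pickup, dropoff):
    """Estimated share of trips starting in `pickup` that end in `dropoff`."""
    return pair_counts[(pickup, dropoff)] / pickup_counts[pickup]


print(likelihood(1, 2))
```

Only the counts need to be kept in memory, not the trips themselves, which is what makes this approach workable at millions of rows per month.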


In [None]:
import requests
import time
import json
import pandas as pd

pd.options.mode.chained_assignment = None 

In [None]:
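# Import altair first so that `alt` is defined before we pick a renderer
import altair as alt

imUsingColab = False
if imUsingColab:
    !pip install geopandas
    alt.renderers.enable('colab')
else: 
    alt.renderers.enable('notebook')

# Import geopandas after the optional install above
import geopandas as gpd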


#### Read in taxi zone data so we can better interpret our results

Taxi zone data is available for download as GeoJSON from [here](https://data.cityofnewyork.us/Transportation/NYC-Taxi-Zones/d3c5-ddgc). I've posted it on my site for convenience.


In [None]:
zones = gpd.read_file('https://grantmlong.com/data/taxi_zones.geojson')

# Create a dictionary that will allow us to better understand the zones defined by the TLC
zone_guide = zones[['location_id', 'zone']].set_index('zone')['location_id'].to_dict()

In [None]:
zone_guide["Manhattanville"]
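Since the trip files refer to zones by ID rather than by name, it can also help to look things up in the other direction. The sketch below inverts a toy dictionary shaped like `zone_guide`; the zone names and IDs are placeholders, and storing the IDs as strings is an assumption (suggested by the `astype(int)` cast used later in the mapping function).

```python
# Toy stand-in for zone_guide; these names and IDs are placeholders, not real TLC values
toy_zone_guide = {"Zone A": "1", "Zone B": "2"}

# Invert the mapping so integer location IDs point back to zone names
zone_names = {int(location_id): zone for zone, location_id in toy_zone_guide.items()}

print(zone_names[2])
```

One caveat: a dictionary keyed on zone name keeps only one ID per name, so if any name appears more than once in the zone file, some IDs will be missing from both lookups.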

# Stream in our files from AWS S3

The TLC hosts uncompressed csv files on AWS S3. Links to the files can be [found here](https://www1.nyc.gov/site/tlc/about/tlc-trip-record-data.page) and include several columns described below and in [a data dictionary posted on the City's website](https://www1.nyc.gov/assets/tlc/downloads/pdf/data_dictionary_trip_records_yellow.pdf).

['VendorID', 'tpep_pickup_datetime', 'tpep_dropoff_datetime', 'passenger_count', 'trip_distance', 'RatecodeID', 'store_and_fwd_flag', 'PULocationID', 'DOLocationID', 'payment_type', 'fare_amount', 'extra', 'mta_tax', 'tip_amount', 'tolls_amount', 'improvement_surcharge', 'total_amount']

| Position | Column | Definition |
|---|---|---|
| 0 | `VendorID` | A code indicating the TPEP provider that provided the record. |
| 1 | `tpep_pickup_datetime` | The date and time when the meter was engaged. | 
| 2 | `tpep_dropoff_datetime` | The date and time when the meter was disengaged. |
| 3 | `passenger_count` | The number of passengers in the vehicle. |
| 4 | `trip_distance` | The elapsed trip distance in miles reported by the taximeter. |
| 5 | `RatecodeID` | The final rate code in effect at the end of the trip. |
| 6 | `store_and_fwd_flag` | This flag indicates whether the trip record was held in vehicle memory before sending to the vendor, aka “store and forward,” because the vehicle did not have a connection to the server. |
| 7 | `PULocationID` | TLC Taxi Zone in which the taximeter was engaged. | 
| 8 | `DOLocationID` | TLC Taxi Zone in which the taximeter was disengaged. | 
| 9 | `payment_type` | A numeric code signifying how the passenger paid for the trip. |
| 10 | `fare_amount` | The time-and-distance fare calculated by the meter. | 
| 11 | `extra` | Miscellaneous extras and surcharges. Currently, this only includes the \$0.50 and \$1 rush hour and overnight charges. |
| 12 | `mta_tax` | \$0.50 MTA tax that is automatically triggered based on the metered rate in use. |
| 13 | `tip_amount` | Tip amount – This field is automatically populated for credit card tips. Cash tips are not included. |
| 14 | `tolls_amount` | Total amount of all tolls paid in trip. |
| 15 | `improvement_surcharge` | \$0.30 improvement surcharge assessed on trips at the flag drop. The improvement surcharge began being levied in 2015. |
| 16 | `total_amount` | The total amount charged to passengers. Does not include cash tips. |
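Because we'll be splitting each line by hand, we need to know which position holds each field. A small sketch, using the column list above as the header row, shows how the positions could be looked up by name rather than hard-coded. The data row is invented for illustration, and the timestamp layout in it is an assumption.

```python
# Header row copied from the column list above
header_line = ('VendorID,tpep_pickup_datetime,tpep_dropoff_datetime,passenger_count,'
               'trip_distance,RatecodeID,store_and_fwd_flag,PULocationID,DOLocationID,'
               'payment_type,fare_amount,extra,mta_tax,tip_amount,tolls_amount,'
               'improvement_surcharge,total_amount')
headers = header_line.split(',')

# Look up the positions of the pickup and dropoff zone columns by name
pu_idx = headers.index('PULocationID')
do_idx = headers.index('DOLocationID')

# Hypothetical data row, invented for illustration
row = '1,2018-12-01 00:28:22,2018-12-01 00:44:07,2,2.50,1,N,148,234,1,12,0.5,0.5,3.95,0,0.3,17.25'
trip = row.split(',')
pickup, dropoff = int(trip[pu_idx]), int(trip[do_idx])

print(pu_idx, do_idx, pickup, dropoff)
```

Looking positions up by name costs almost nothing and protects the analysis if the column order differs between files.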

In [None]:
def process_TLC_file(month, 
                     pickups=None,
                     pickups_to=None,
                     base_URL='https://s3.amazonaws.com/nyc-tlc/trip+data/yellow_tripdata_',
                     max_rows=1000,
                     verbose=True
                    ):

    # time.clock() was removed in Python 3.8; perf_counter() is the replacement
    t0 = time.perf_counter()
    
    data_stream = requests.get(url=base_URL + month + '.csv', stream=True)

    # Use a single line iterator, since iter_lines() is not safe to call twice
    lines = data_stream.iter_lines()

    # Strip out headers
    headers = next(lines).decode("utf-8").split(',')

    # Initialize row and error counters
    rows_processed = 0
    error_count = 0

    # Initialize counting dictionaries if not given 
    if pickups is None:
        pickups = {}
    if pickups_to is None:
        pickups_to = {}

    for line in lines:

        trip = line.decode("utf-8").split(',')

        try:
            # Positions 7 and 8 hold PULocationID and DOLocationID
            pickup = int(trip[7])
            dropoff = int(trip[8])

            if pickup in pickups:
                pickups[pickup] += 1
                if dropoff in pickups_to[pickup]:
                    pickups_to[pickup][dropoff] += 1
                else:
                    pickups_to[pickup][dropoff] = 1
            else:
                pickups[pickup] = 1
                pickups_to[pickup] = {}
                pickups_to[pickup][dropoff] = 1

        except (ValueError, IndexError):
            # Blank or malformed rows are counted and skipped
            error_count += 1

        rows_processed += 1
        # Stop once exactly max_rows rows have been read
        if rows_processed >= max_rows:
            break
    
    if verbose:
        print('%s: %i rows skipped due to errors.' % (month, error_count))        
        print('%s: %i rows processed.' % (month, rows_processed))
        print('%s: %0.1f seconds processing time.' % (month, (time.perf_counter()-t0)))
            
    return pickups, pickups_to

pickups, pickups_to = process_TLC_file('2018-12', max_rows=10000)
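Because `process_TLC_file` accepts existing `pickups` and `pickups_to` dictionaries, the running counts from one month can be passed back in when processing the next. The sketch below illustrates that accumulation pattern on in-memory rows so it runs without a download. `count_trips` and the toy rows are hypothetical stand-ins, and `Counter`/`defaultdict` are swapped in for the plain dictionaries used above.

```python
from collections import Counter, defaultdict


def count_trips(rows, pickups=None, pickups_to=None):
    """Tally pickups and pickup-to-dropoff pairs from (pickup, dropoff) tuples."""
    pickups = Counter() if pickups is None else pickups
    pickups_to = defaultdict(Counter) if pickups_to is None else pickups_to
    for pickup, dropoff in rows:
        pickups[pickup] += 1
        pickups_to[pickup][dropoff] += 1
    return pickups, pickups_to


# Hypothetical rows standing in for two monthly files
month_1 = [(1, 2), (1, 3)]
month_2 = [(1, 2), (2, 1)]

# Feed the counts from the first "month" into the second
toy_pickups, toy_pickups_to = count_trips(month_1)
toy_pickups, toy_pickups_to = count_trips(month_2, toy_pickups, toy_pickups_to)

print(toy_pickups[1], toy_pickups_to[1][2])
```

The same loop shape should carry over to the real function: call it once per month string and keep passing the returned dictionaries along.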

# Analyzing the Results
* What's the top spot in the city for pickups?
* What's the top destination here in Manhattanville? Other neighborhoods?
* What's the likelihood a taxi picking up here drops off on the Upper East Side?


Let's also create some additional functions to help us make sense of the data.

In [None]:
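# Create function to return the likelihood that a trip starting in one zone ends in another
def dropoff_likelihood(pickup, dropoff):
    n_pickups = pickups[pickup]
    # Treat destinations never observed from this pickup zone as zero trips
    n_dropoffs = pickups_to[pickup].get(dropoff, 0)
    return n_dropoffs/n_pickups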


# Create function to return top destination for a given pickup zone
def get_top_destination(pickup, pickups_to=pickups_to):
    top_dropoffs = sorted(pickups_to[pickup].items(), key=lambda x: x[1], reverse=True)
    return top_dropoffs[0]


In [None]:
get_top_destination(152)

In [None]:
dropoff_likelihood(152, 166)

In [None]:
pickups[152]

In [None]:
pickups_to[152][166]
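To look beyond the single top destination, one option is to return the top few destinations along with their share of trips. This is a sketch rather than part of the original analysis: `top_destinations` is a hypothetical helper, and the toy dictionaries only mimic the shape of `pickups` and `pickups_to`.

```python
def top_destinations(pickup, pickups, pickups_to, n=3):
    """Return up to n (dropoff, count, share) tuples, largest count first."""
    total = pickups[pickup]
    ranked = sorted(pickups_to[pickup].items(), key=lambda x: x[1], reverse=True)
    return [(dropoff, count, count / total) for dropoff, count in ranked[:n]]


# Toy counts shaped like the dictionaries built above
toy_pickups = {1: 4}
toy_pickups_to = {1: {2: 2, 3: 1, 4: 1}}

print(top_destinations(1, toy_pickups, toy_pickups_to, n=2))
```

In the notebook, the equivalent call would pass the real `pickups` and `pickups_to` dictionaries along with a zone ID such as 152.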

# Visualize Results

This Altair code derives from [this example](https://medium.com/dataexplorations/creating-choropleth-maps-in-altair-eeb7085779a1). We'll create a function that will allow us to map different trip frequencies across the city.



In [None]:
def map_values(trips, 
               zones=zones, 
               title="Number of Trips", 
               size=750, 
               drop=['Staten Island', 'EWR']):
    
    # Optional, but drop Staten Island and Newark Airport to make things cleaner
    # Copy so that adding a column does not modify the caller's frame
    zones = zones.loc[~zones.borough.isin(drop)].copy()
    
    # Join values for map to frame; zones with no recorded trips get 0
    zones['trips'] = zones.location_id.astype(int).map(trips).fillna(0)
    
    # Reformat into Altair-friendly format
    choro_json = json.loads(zones.to_json())
    choro_data = alt.Data(values=choro_json['features'])

    # Add base map layer with taxi zones
    base = alt.Chart(choro_data).mark_geoshape(
            stroke='black',
            strokeWidth=1
    ).encode(
    ).properties(
        width=size,
        height=size
    )

    # Add choropleth layer with taxi data
    choro = alt.Chart(choro_data).mark_geoshape(
        fill='lightgray',
        stroke='black'
    ).encode(
        alt.Color(
            'properties.trips:Q',
            type='quantitative', 
            scale=alt.Scale(scheme='plasma'),
            title=title),
        tooltip=[
            'properties.zone:O',
            'properties.location_id:O',
            'properties.borough:O',
            'properties.trips:Q',
        ]
    )

    return base + choro


In [None]:
map_values(pickups_to[152])
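Raw counts depend on how many rows were processed, so it may be more informative to map each destination's share of trips instead. The helper below is a hypothetical sketch that converts a dictionary of counts into shares; the toy counts are invented.

```python
def to_shares(counts):
    """Convert a {zone: count} dictionary into {zone: share of total}."""
    total = sum(counts.values())
    if total == 0:
        return {}
    return {zone: n / total for zone, n in counts.items()}


# Toy counts shaped like pickups_to[152]
toy_counts = {2: 3, 3: 1}
shares = to_shares(toy_counts)
print(shares)

# In the notebook this could be passed to the mapping function, e.g.:
# map_values(to_shares(pickups_to[152]), title="Share of Trips")
```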

# Going Deeper: Other Factors to Explore

#### Big data allows us to better understand events and their likelihoods under a wide range of scenarios. How might we look at taxi trips across the following dimensions: 
1. Time of year
2. Day of week
3. Time of day
4. Number of passengers
5. Others?
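As one possible starting point for the time-based dimensions, the pickup timestamp can be parsed and folded into the counting key. This sketch assumes `tpep_pickup_datetime` is formatted as `YYYY-MM-DD HH:MM:SS`, which should be checked against the actual files; the records and helper name are hypothetical.

```python
from collections import Counter
from datetime import datetime

# Assumed timestamp layout for tpep_pickup_datetime; verify against the actual files
TIME_FORMAT = '%Y-%m-%d %H:%M:%S'


def pickup_hour(timestamp):
    """Return the hour of day (0-23) from a pickup timestamp string."""
    return datetime.strptime(timestamp, TIME_FORMAT).hour


# Hypothetical (pickup time, pickup zone, dropoff zone) records
toy_records = [
    ('2018-12-01 08:15:00', 1, 2),
    ('2018-12-01 08:45:00', 1, 2),
    ('2018-12-01 23:05:00', 1, 3),
]

# Count trips keyed by (hour of day, pickup zone, dropoff zone)
hourly_counts = Counter()
for ts, pickup, dropoff in toy_records:
    hourly_counts[(pickup_hour(ts), pickup, dropoff)] += 1

print(hourly_counts[(8, 1, 2)])
```

The same idea extends to day of week (via `datetime.weekday()`) or passenger count by swapping what goes into the key, at the cost of spreading the trips across more, smaller buckets.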

*** 
Suppose we owned a bar or restaurant in a certain area of the city and wanted to learn more about where people going out in our neighborhood were coming from and how that has changed over time. How could this data help us?