## CRQ2

Visualize taxi movements! NYC is divided into many taxi zones. For each yellow cab trip we know the zones where the taxi picks up and drops off its passengers. Let's visualize, on a choropleth map, the number of trips that start in each zone. Then, draw another map counting the trips that end in each zone. Comment on your discoveries. To perform this task we use the library folium. You can find some examples of choropleth maps here and here. The GeoJSON we use to trace the zones is taxi_zones.json in the Homework's repository.

In [184]:
''' imports '''
import pandas as pd
import numpy as np
from loader import Loader
import folium
from IPython.display import HTML


''' data paths '''
data = {
    'jan': {
        'path': 'data/yellow_tripdata_2018-01.csv',
        'start': '2018-01-01',
        'end': '2018-01-31'
    },
    'feb': {
        'path': 'data/yellow_tripdata_2018-02.csv',
        'start': '2018-02-01',
        'end': '2018-02-28'
    },
    'mar': {
        'path': 'data/yellow_tripdata_2018-03.csv',
        'start': '2018-03-01',
        'end': '2018-03-31'
    },
    'apr': {
        'path': 'data/yellow_tripdata_2018-04.csv',
        'start': '2018-04-01',
        'end': '2018-04-30'
    },
    'may': {
        'path': 'data/yellow_tripdata_2018-05.csv',
        'start': '2018-05-01',
        'end': '2018-05-31'
    },
    'jun': {
        'path': 'data/yellow_tripdata_2018-06.csv',
        'start': '2018-06-01',
        'end': '2018-06-30'
    }
}
locations = 'data/taxi_zone_lookup.csv'
zones = 'data/taxi_zones.json'

# map prefs
coords = [40.7142700,-74.0059700]
zoom = 8

# Months to work on
MONTHS = [(m, data[m]['path']) for m in data.keys()]

The code below may look messy (and could certainly be improved), but it is needed to gather the data chunk by chunk.

The basic flow is explained in the comments below. Trips are counted per pick-up zone and per drop-off zone, and the counts are accumulated over all the months.
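
The `Loader` class comes from a local module that is not shown here. As a rough idea of what chunked reading can look like, here is a minimal sketch built on `pd.read_csv(..., chunksize=...)`. The function `iterate_chunks` and its parameters are hypothetical and are not the actual `Loader` API; the tiny in-memory CSV only stands in for a monthly file.

```python
import io
import pandas as pd


def iterate_chunks(csv_sources, chunksize, date_col, usecols):
    """Yield (month, chunk) pairs, reading each source a few rows at a time.

    Hypothetical helper: an assumption about what a chunked loader may do,
    not the real Loader implementation.
    """
    for month, source in csv_sources:
        reader = pd.read_csv(
            source,
            usecols=usecols,
            parse_dates=[date_col],
            chunksize=chunksize,
        )
        for chunk in reader:
            # index by pickup time so rows can later be sliced by date
            yield month, chunk.set_index(date_col)


# toy in-memory CSV standing in for one monthly file
toy_csv = io.StringIO(
    "tpep_pickup_datetime,PULocationID,DOLocationID\n"
    "2018-01-01 00:10:00,1,2\n"
    "2018-01-01 00:20:00,1,3\n"
    "2018-01-02 08:00:00,2,3\n"
)

chunks = list(iterate_chunks(
    [("jan", toy_csv)],
    chunksize=2,
    date_col="tpep_pickup_datetime",
    usecols=["tpep_pickup_datetime", "PULocationID", "DOLocationID"],
))
```

With `chunksize=2` the three toy rows arrive as two chunks, so only a small slice of the file is held in memory at any time.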

In [186]:
# read data for each month
loader = Loader(
    csv=MONTHS,
    chunksize=100000
)

# get data from iterator
data_iterator = loader.iterate(
    usecols=['tpep_pickup_datetime', 'PULocationID', 'DOLocationID'],
    parse_dates=['tpep_pickup_datetime'],
    date_index='tpep_pickup_datetime'
)

In [187]:
''' processing chunk of data '''
# declaring two counters to enhance verbosity
tot_rows = 0
processed_rows = 0

# count will be stored here
# and incremented chunk by chunk
d_pu_bkp = None
d_do_bkp = None

# iterate over chunks
for month, d in data_iterator:
    
    # info
    tot_rows += len(d.index)
    
    # keep only rows whose pickup date falls within the years considered
    d = d.loc['2017' : '2019']
    
    # drop any row with missing values
    d = d.dropna()
    
    # group by PULocationID and DOLocationID
    d_pu = d.groupby(['PULocationID']).PULocationID.agg('count').to_frame('count')
    d_do = d.groupby(['DOLocationID']).DOLocationID.agg('count').to_frame('count')
    
    # info
    processed_rows += len(d.index)
    
    # concat and save data
    d_pu_bkp = pd.concat([d_pu, d_pu_bkp]) if d_pu_bkp is not None else d_pu
    d_do_bkp = pd.concat([d_do, d_do_bkp]) if d_do_bkp is not None else d_do
    
    # re-group and sum() to keep less stuff in memory
    d_pu_bkp = d_pu_bkp.groupby(['PULocationID']).sum()
    d_do_bkp = d_do_bkp.groupby(['DOLocationID']).sum()

# be verbose
print(str(processed_rows) + ' over ' + str(tot_rows) + ' rows have been processed')

53925242 over 53925735 rows have been processed
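
The chunk-by-chunk counting above can be illustrated on toy data. The sketch below is an alternative formulation, not the code used in this notebook: it keeps a running `Series` of counts and adds each chunk's counts with `Series.add(fill_value=0)`. The helper name `add_chunk_counts` and the two toy chunks are made up for illustration.

```python
import pandas as pd


def add_chunk_counts(running, chunk, column):
    """Add the per-zone trip counts of one chunk to the running totals."""
    counts = chunk[column].value_counts()
    if running is None:
        return counts
    # zones missing on either side are treated as 0 before summing
    return running.add(counts, fill_value=0).astype(int)


# two toy chunks standing in for pieces of a monthly file
chunk_a = pd.DataFrame({"PULocationID": [1, 1, 2]})
chunk_b = pd.DataFrame({"PULocationID": [2, 3]})

totals = None
for chunk in (chunk_a, chunk_b):
    totals = add_chunk_counts(totals, chunk, "PULocationID")
```

Either way, only one small table of per-zone totals survives between chunks, which is what keeps memory usage flat.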


### Data on maps

In [188]:
# create map container
# the following are adapted from http://comet.lehman.cuny.edu/owen/teaching/datasci/choroplethLab.html
pu_locations = folium.Map(location=coords, zoom_start=zoom)
do_locations = folium.Map(location=coords, zoom_start=zoom)

In [189]:
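# Map.choropleth is deprecated in newer folium releases; folium.Choropleth is its replacement
folium.Choropleth(
    geo_data=zones,
    key_on='feature.properties.LocationID',
    legend_name='Trip no.',
    fill_color='YlGn', fill_opacity=0.8, line_opacity=0.5,
    # pass a DataFrame plus the names of the key and value columns
    data=d_pu_bkp.reset_index(),
    columns=['PULocationID', 'count']
).add_to(pu_locations)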

# produce .html file
folium.LayerControl().add_to(pu_locations)
pu_locations.save('pu_locations.html')

In [190]:
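# Map.choropleth is deprecated in newer folium releases; folium.Choropleth is its replacement
folium.Choropleth(
    geo_data=zones,
    key_on='feature.properties.LocationID',
    legend_name='Trip no.',
    fill_color='YlGn', fill_opacity=0.8, line_opacity=0.5,
    # pass a DataFrame plus the names of the key and value columns
    data=d_do_bkp.reset_index(),
    columns=['DOLocationID', 'count']
).add_to(do_locations)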

# produce .html file
folium.LayerControl().add_to(do_locations)
do_locations.save('do_locations.html')

**Maps are available in HTML format in the repository**

Manhattan zones are clearly the most common pick-up and drop-off locations. In particular, the island in front of Manhattan is the hottest zone and is probably a trendy area. As for drop-offs, the zones passengers choose most often are those in the center of Manhattan and Great Kills Park in Staten Island, where people can spend their leisure time.
On the other hand, the hottest pick-up zones are (unsurprisingly) JFK airport, the center of Manhattan and its island and, strangely, one specific residential zone of Staten Island (see the map).

As far as Manhattan is concerned, it is clear that people use taxis a lot there, since the borough is the beating heart of NYC and offers entertainment, nightlife and top-notch jobs. People from more suburban zones probably rely on private vehicles to get around.
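
To back up observations like these with zone names rather than colors alone, the counts can be joined with the zone lookup table. The sketch below assumes that `taxi_zone_lookup.csv` exposes `LocationID`, `Borough` and `Zone` columns. The helper `top_zones` is hypothetical and the counts are made-up toy values, not results from the data above.

```python
import pandas as pd

# toy stand-in for d_pu_bkp (made-up counts, for illustration only)
counts = pd.DataFrame(
    {"count": [50, 900, 300]},
    index=pd.Index([1, 132, 161], name="PULocationID"),
)

# toy stand-in for the lookup table (assumed column names)
lookup = pd.DataFrame({
    "LocationID": [1, 132, 161],
    "Borough": ["EWR", "Queens", "Manhattan"],
    "Zone": ["Newark Airport", "JFK Airport", "Midtown Center"],
})


def top_zones(counts, lookup, n=2):
    """Return the n busiest zones together with their borough and name."""
    names = lookup.set_index("LocationID")[["Borough", "Zone"]]
    # join on the index: location IDs on both sides
    named = counts.join(names)
    return named.sort_values("count", ascending=False).head(n)


top = top_zones(counts, lookup)
```

On the real data, the same call with `d_pu_bkp` (or `d_do_bkp`) and `pd.read_csv(locations)` would list the busiest zones by name, which makes it easier to check whether a highlighted polygon really is the place it appears to be.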