# New York taxi trips

This homework is about New York taxi trips. Here is how [Todd Schneider](https://toddwschneider.com/posts/analyzing-1-1-billion-nyc-taxi-and-uber-trips-with-a-vengeance/) describes the dataset:

> The New York City Taxi & Limousine Commission has released a detailed historical dataset covering over 1 billion individual taxi trips in the city from January 2009 through December 2019.
Taken as a whole, the detailed trip-level data is more than just a vast list of taxi pickup and drop off coordinates: it's a story of a City. 
How bad is the rush hour traffic from Midtown to JFK? 
Where does the Bridge and Tunnel crowd hang out on Saturday nights?
What time do investment bankers get to work? How has Uber changed the landscape for taxis?
The dataset addresses all of these questions and many more.

The NY taxi trips dataset has been explored by a series of distinguished data scientists.
The dataset is available on Amazon S3 (Amazon's cloud storage service).
The link for each file has the following form:

    https://s3.amazonaws.com/nyc-tlc/trip+data/yellow_tripdata_{year}-{month}.csv

There is one CSV file for each NY taxi service (`yellow`, `green`, `fhv`) and each calendar month; replace `{year}` and `{month}` with the desired values.
Each file is moderately large, a few gigabytes.
The full dataset (several hundred gigabytes) is large for a laptop.
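
The link template above can be filled in programmatically. The following is a minimal sketch: the helper name `trip_data_url`, the assumption that the service name simply replaces `yellow` in the file name, and the zero-padded two-digit month are illustrative assumptions rather than documented facts.

```python
# Hypothetical helper: builds the download link for one service and one month.
BASE_URL = "https://s3.amazonaws.com/nyc-tlc/trip+data"
SERVICES = ("yellow", "green", "fhv")


def trip_data_url(service: str, year: int, month: int) -> str:
    """Return the CSV link for one taxi service and one calendar month."""
    if service not in SERVICES:
        raise ValueError(f"unknown service: {service!r}")
    if not 1 <= month <= 12:
        raise ValueError(f"month must be in 1..12, got {month}")
    # Assumption: months are zero-padded to two digits (e.g. 2018-06).
    return f"{BASE_URL}/{service}_tripdata_{year}-{month:02d}.csv"


print(trip_data_url("yellow", 2015, 12))
# → https://s3.amazonaws.com/nyc-tlc/trip+data/yellow_tripdata_2015-12.csv
```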

You will focus on the `yellow` taxi service and a pair of months, one from 2015 and one from 2018.
Between those two years, for-hire vehicle services took off and carved out a huge market share.

Whatever framework you use, `CSV` files prove hard to handle.
After downloading the appropriate files (this takes time, but this is routine), a first step consists of converting the CSV files into a more Spark-friendly format such as `parquet`.

Saving into such a format requires decisions about bucketing, partitioning and so on. Such decisions influence performance. It is your call.
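
As one possible illustration of such a decision, the sketch below partitions the parquet output by pickup year and month. It is a sketch under stated assumptions, not a recommendation: the function name `write_partitioned`, the derived columns `pickup_year` and `pickup_month`, and the output path are hypothetical, and `df` is assumed to be a Spark DataFrame with a `tpep_pickup_datetime` column.

```python
def write_partitioned(df, path, mode="overwrite"):
    """Write `df` as parquet, with one sub-directory per pickup year and month."""
    # selectExpr accepts SQL expressions, so no extra imports are needed.
    with_keys = df.selectExpr(
        "*",
        "year(tpep_pickup_datetime) AS pickup_year",
        "month(tpep_pickup_datetime) AS pickup_month",
    )
    (
        with_keys.write
        .mode(mode)  # "overwrite" replaces any previous output at `path`
        .partitionBy("pickup_year", "pickup_month")
        .parquet(path)
    )


# Usage (hypothetical DataFrame and path):
# write_partitioned(trips_df, "./trips-parquet")
```

Queries that filter on `pickup_year` and `pickup_month` can then skip the directories they do not need. With only two months of data the benefit is modest, and a finer key such as the pickup day is another option.
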
Many people have been working on this dataset, to cite but a few:


- [1 billion trips with a vengeance](https://toddwschneider.com/posts/analyzing-1-1-billion-nyc-taxi-and-uber-trips-with-a-vengeance/)
- [1 billion trips with R and SQL](http://freerangestats.info/blog/2019/12/22/nyc-taxis-sql)
- [1 billion trips with redshift](https://tech.marksblogg.com/billion-nyc-taxi-rides-redshift.html)
- [nyc-taxi](https://github.com/fmaletski/nyc-taxi-map)

In [None]:
!pip install geojson geopandas plotly geopy ipyleaflet

In [None]:
!pip install pyshp

In [1]:
# import the usual suspects
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import os
from pathlib import Path
import sys
import timeit
import shapefile
import urllib.request
import zipfile
import random
import itertools
import math
import geopandas as gpd
import seaborn as sns
# spark
from pyspark import SparkConf, SparkContext
from pyspark import StorageLevel
from pyspark.sql import SparkSession
from pyspark.sql import Window
from pyspark.sql.functions import col, count, isnan, when
import pyspark.sql.functions as fn
from pyspark.sql.catalog import Catalog
from pyspark.sql.types import StructType, StructField
from pyspark.sql.types import IntegerType, StringType
from pyspark.sql.types import *  # also provides TimestampType and FloatType used below

In [2]:
conf = SparkConf().setAppName("Spark SQL Illustrations")
conf.set("spark.driver.memory", "8g")
sc = SparkContext(conf=conf)

spark = (SparkSession
    .builder
    .appName("Spark SQL")
    .getOrCreate()
)

# Loading data as parquet files
## Try to read the CSV file without imposing a schema. 

In [3]:
df_data_2015_12 = spark.read\
             .format('csv')\
             .option("header", "true")\
             .option("sep", ",")\
             .load('data_2015/yellow_tripdata_2015-12.csv')

## Inspect the inferred schema. Do you agree with Spark's typing decision?
Answer: No. Without schema inference, Spark reads every field as a string, whereas the columns should be typed as integers, floats, or timestamps.

In [17]:
df_data_2015_12.rdd.getNumPartitions()

14

In [13]:
df_data_2015_12.printSchema()

root
 |-- VendorID: string (nullable = true)
 |-- tpep_pickup_datetime: string (nullable = true)
 |-- tpep_dropoff_datetime: string (nullable = true)
 |-- passenger_count: string (nullable = true)
 |-- trip_distance: string (nullable = true)
 |-- pickup_longitude: string (nullable = true)
 |-- pickup_latitude: string (nullable = true)
 |-- RatecodeID: string (nullable = true)
 |-- store_and_fwd_flag: string (nullable = true)
 |-- dropoff_longitude: string (nullable = true)
 |-- dropoff_latitude: string (nullable = true)
 |-- payment_type: string (nullable = true)
 |-- fare_amount: string (nullable = true)
 |-- extra: string (nullable = true)
 |-- mta_tax: string (nullable = true)
 |-- tip_amount: string (nullable = true)
 |-- tolls_amount: string (nullable = true)
 |-- improvement_surcharge: string (nullable = true)
 |-- total_amount: string (nullable = true)
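
The same check can be done programmatically rather than by eye. The sketch below relies on `DataFrame.dtypes`, which returns a list of `(column name, type name)` pairs; the helper name `string_columns` is hypothetical.

```python
def string_columns(dtypes):
    """Return the names of the columns whose Spark type is `string`.

    `dtypes` is a list of (name, type) pairs, as returned by `df.dtypes`.
    """
    return [name for name, dtype in dtypes if dtype == "string"]


# Small stand-in for `df_data_2015_12.dtypes`, using three columns of the schema above.
sample_dtypes = [
    ("VendorID", "string"),
    ("tpep_pickup_datetime", "string"),
    ("trip_distance", "string"),
]
print(string_columns(sample_dtypes))
```

Applied to the real data, `string_columns(df_data_2015_12.dtypes)` should list all 19 columns of the schema printed above.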



## If needed, correct the schema and read the data again

In [18]:
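def type_transformer(df):
    """Cast the string columns read from the CSV file to timestamp, integer and float types.

    VendorID and store_and_fwd_flag are left as strings.
    """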
    df = df.withColumn('tpep_pickup_datetime', df['tpep_pickup_datetime'].cast(TimestampType()))\
    .withColumn('tpep_dropoff_datetime', df['tpep_dropoff_datetime'].cast(TimestampType()))\
    .withColumn('passenger_count', df['passenger_count'].cast(IntegerType()))\
    .withColumn('trip_distance', df['trip_distance'].cast(FloatType()))\
    .withColumn('pickup_longitude', df['pickup_longitude'].cast(FloatType()))\
    .withColumn('pickup_latitude', df['pickup_latitude'].cast(FloatType()))\
    .withColumn('RateCodeID', df['RateCodeID'].cast(IntegerType()))\
    .withColumn('dropoff_longitude', df['dropoff_longitude'].cast(FloatType()))\
    .withColumn('dropoff_latitude', df['dropoff_latitude'].cast(FloatType()))\
    .withColumn('payment_type', df['payment_type'].cast(IntegerType()))\
    .withColumn('fare_amount', df['fare_amount'].cast(FloatType()))\
    .withColumn('extra', df['extra'].cast(FloatType()))\
    .withColumn('mta_tax', df['mta_tax'].cast(FloatType()))\
    .withColumn('tip_amount', df['tip_amount'].cast(FloatType()))\
    .withColumn('tolls_amount', df['tolls_amount'].cast(FloatType()))\
    .withColumn('improvement_surcharge', df['improvement_surcharge'].cast(FloatType()))\
    .withColumn('total_amount', df['total_amount'].cast(FloatType()))
    return df
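
An alternative to casting after the fact is to hand Spark the schema when reading the file, so that the CSV is parsed once with the intended types. `DataFrameReader.schema` accepts a DDL-formatted string, which the sketch below builds from a dictionary. The names `COLUMN_TYPES`, `build_ddl_schema` and `read_trips` are hypothetical, as are the choice of `INT` for `VendorID`, the use of `DOUBLE` (rather than the `FloatType` used above) for fractional columns, and the timestamp pattern, which assumes values formatted like `2015-12-01 00:00:00`.

```python
# Assumed target types, keyed by the column names of the 2015 file, in file order.
COLUMN_TYPES = {
    "VendorID": "INT",
    "tpep_pickup_datetime": "TIMESTAMP",
    "tpep_dropoff_datetime": "TIMESTAMP",
    "passenger_count": "INT",
    "trip_distance": "DOUBLE",
    "pickup_longitude": "DOUBLE",
    "pickup_latitude": "DOUBLE",
    "RatecodeID": "INT",
    "store_and_fwd_flag": "STRING",
    "dropoff_longitude": "DOUBLE",
    "dropoff_latitude": "DOUBLE",
    "payment_type": "INT",
    "fare_amount": "DOUBLE",
    "extra": "DOUBLE",
    "mta_tax": "DOUBLE",
    "tip_amount": "DOUBLE",
    "tolls_amount": "DOUBLE",
    "improvement_surcharge": "DOUBLE",
    "total_amount": "DOUBLE",
}


def build_ddl_schema(column_types):
    """Turn {"a": "INT", "b": "DOUBLE"} into the DDL string "a INT, b DOUBLE"."""
    return ", ".join(f"{name} {dtype}" for name, dtype in column_types.items())


def read_trips(spark, path):
    """Read one CSV file with an explicit schema instead of all-string columns."""
    return (
        spark.read
        .schema(build_ddl_schema(COLUMN_TYPES))
        .option("header", "true")
        # Assumption: timestamps look like "2015-12-01 00:00:00".
        .option("timestampFormat", "yyyy-MM-dd HH:mm:ss")
        .csv(path)
    )


# Usage: df = read_trips(spark, "data_2015/yellow_tripdata_2015-12.csv")
```

With a schema supplied, Spark applies the types by column position by default, so the dictionary has to list the columns in file order.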

In [21]:
transformers = [
    type_transformer,
]

for transformer in transformers:
    df_data_2015_12 = transformer(df_data_2015_12)

df_data_2015_12.printSchema()

root
 |-- VendorID: string (nullable = true)
 |-- tpep_pickup_datetime: timestamp (nullable = true)
 |-- tpep_dropoff_datetime: timestamp (nullable = true)
 |-- passenger_count: integer (nullable = true)
 |-- trip_distance: float (nullable = true)
 |-- pickup_longitude: float (nullable = true)
 |-- pickup_latitude: float (nullable = true)
 |-- RateCodeID: integer (nullable = true)
 |-- store_and_fwd_flag: string (nullable = true)
 |-- dropoff_longitude: float (nullable = true)
 |-- dropoff_latitude: float (nullable = true)
 |-- payment_type: integer (nullable = true)
 |-- fare_amount: float (nullable = true)
 |-- extra: float (nullable = true)
 |-- mta_tax: float (nullable = true)
 |-- tip_amount: float (nullable = true)
 |-- tolls_amount: float (nullable = true)
 |-- improvement_surcharge: float (nullable = true)
 |-- total_amount: float (nullable = true)



## Save the data into parquet files

In [22]:
df_data_2015_12.write.parquet('./input-parquet')

ERROR:root:Exception while sending command.
Traceback (most recent call last):
  File "/usr/local/lib/python3.7/dist-packages/py4j/java_gateway.py", line 1159, in send_command
    raise Py4JNetworkError("Answer from Java side is empty")
py4j.protocol.Py4JNetworkError: Answer from Java side is empty

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/lib/python3.7/dist-packages/py4j/java_gateway.py", line 985, in send_command
    response = connection.send_command(command)
  File "/usr/local/lib/python3.7/dist-packages/py4j/java_gateway.py", line 1164, in send_command
    "Error while receiving", e, proto.ERROR_ON_RECEIVE)
py4j.protocol.Py4JNetworkError: Error while receiving


Py4JError: An error occurred while calling o339.parquet
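
The traceback above only shows that the Python side lost its connection to the JVM; it does not say why, and it leaves open whether `./input-parquet` was written completely. When a Spark write job finishes normally, the default output committer adds an empty `_SUCCESS` file next to the `part-*` files, so one way to check is to look for it. The helper below is a hedged sketch using only the standard library; its name `parquet_write_completed` is hypothetical.

```python
from pathlib import Path


def parquet_write_completed(output_dir):
    """Return True if `output_dir` holds a `_SUCCESS` marker and at least one parquet part file."""
    root = Path(output_dir)
    if not (root / "_SUCCESS").is_file():
        return False
    # rglob also finds part files inside partition sub-directories.
    return any(root.rglob("part-*.parquet"))


print(parquet_write_completed("./input-parquet"))
```

If the check fails, the directory may hold partial output. Since `write.parquet` refuses by default to write into an existing path, a rerun would need either a fresh path or `.mode("overwrite")`.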