# Import the dataset in CSV format and export to Parquet format

## Setup

In [1]:
from pyspark.sql import SparkSession

In [2]:
spark = SparkSession.builder.appName("Convertor").getOrCreate()

## Read CSV file, infer the schema

In [3]:
df = spark.read.csv('./data/green_tripdata_2013-09.csv', inferSchema=True, header=True)

In [4]:
df.printSchema()

root
 |-- VendorID: integer (nullable = true)
 |-- lpep_pickup_datetime: timestamp (nullable = true)
 |-- Lpep_dropoff_datetime: timestamp (nullable = true)
 |-- Store_and_fwd_flag: string (nullable = true)
 |-- RateCodeID: integer (nullable = true)
 |-- Pickup_longitude: double (nullable = true)
 |-- Pickup_latitude: double (nullable = true)
 |-- Dropoff_longitude: double (nullable = true)
 |-- Dropoff_latitude: double (nullable = true)
 |-- Passenger_count: integer (nullable = true)
 |-- Trip_distance: double (nullable = true)
 |-- Fare_amount: double (nullable = true)
 |-- Extra: double (nullable = true)
 |-- MTA_tax: double (nullable = true)
 |-- Tip_amount: double (nullable = true)
 |-- Tolls_amount: double (nullable = true)
 |-- Ehail_fee: string (nullable = true)
 |-- Total_amount: double (nullable = true)
 |-- Payment_type: integer (nullable = true)
 |-- Trip_type: integer (nullable = true)



The data types look good except for `Ehail_fee`: it sits among the other "amount" fields but has a string type.

In [5]:
# Describe the dataset
df.describe().show()

+-------+-------------------+------------------+------------------+------------------+------------------+------------------+-----------------+------------------+------------------+-----------------+-------------------+-------------------+------------------+------------------+---------+------------------+------------------+---------+
|summary|           VendorID|Store_and_fwd_flag|        RateCodeID|  Pickup_longitude|   Pickup_latitude| Dropoff_longitude| Dropoff_latitude|   Passenger_count|     Trip_distance|      Fare_amount|              Extra|            MTA_tax|        Tip_amount|      Tolls_amount|Ehail_fee|      Total_amount|      Payment_type|Trip_type|
+-------+-------------------+------------------+------------------+------------------+------------------+------------------+-----------------+------------------+------------------+-----------------+-------------------+-------------------+------------------+------------------+---------+------------------+------------------+------

In [6]:
df.filter('Ehail_fee is not null').show()

+--------+--------------------+---------------------+------------------+----------+----------------+---------------+-----------------+----------------+---------------+-------------+-----------+-----+-------+----------+------------+---------+------------+------------+---------+
|VendorID|lpep_pickup_datetime|Lpep_dropoff_datetime|Store_and_fwd_flag|RateCodeID|Pickup_longitude|Pickup_latitude|Dropoff_longitude|Dropoff_latitude|Passenger_count|Trip_distance|Fare_amount|Extra|MTA_tax|Tip_amount|Tolls_amount|Ehail_fee|Total_amount|Payment_type|Trip_type|
+--------+--------------------+---------------------+------------------+----------+----------------+---------------+-----------------+----------------+---------------+-------------+-----------+-----+-------+----------+------------+---------+------------+------------+---------+
+--------+--------------------+---------------------+------------------+----------+----------------+---------------+-----------------+----------------+---------------

The `Ehail_fee` field is completely empty, which is why schema inference fell back to `string`. I think we should keep it and cast it to `double`, like the other "amount" columns, to keep the schema future-proof.

In [7]:
df = df.withColumn("Ehail_fee", df.Ehail_fee.cast("double"))

Print the schema again.

In [8]:
df.printSchema()

root
 |-- VendorID: integer (nullable = true)
 |-- lpep_pickup_datetime: timestamp (nullable = true)
 |-- Lpep_dropoff_datetime: timestamp (nullable = true)
 |-- Store_and_fwd_flag: string (nullable = true)
 |-- RateCodeID: integer (nullable = true)
 |-- Pickup_longitude: double (nullable = true)
 |-- Pickup_latitude: double (nullable = true)
 |-- Dropoff_longitude: double (nullable = true)
 |-- Dropoff_latitude: double (nullable = true)
 |-- Passenger_count: integer (nullable = true)
 |-- Trip_distance: double (nullable = true)
 |-- Fare_amount: double (nullable = true)
 |-- Extra: double (nullable = true)
 |-- MTA_tax: double (nullable = true)
 |-- Tip_amount: double (nullable = true)
 |-- Tolls_amount: double (nullable = true)
 |-- Ehail_fee: double (nullable = true)
 |-- Total_amount: double (nullable = true)
 |-- Payment_type: integer (nullable = true)
 |-- Trip_type: integer (nullable = true)



Every field looks good now.

## Export to Parquet

In [9]:
df.write.mode("overwrite").parquet("output/tripdata.parquet")