## Problem Statement:

The Taxi and Limousine Commission (TLC) of New York City collects trip record data from licensed taxis and for-hire vehicles (FHVs) and provides it to the public. The data includes details such as pick-up and drop-off times, locations, passenger counts, and payment information for each trip. As a data engineer, your task is to build a batch data processing pipeline using PySpark to process and analyze this data to gain insights into taxi and FHV trips in New York City.

### Goals:

Data ingestion: Download the trip record data from the NYC TLC website and ingest it into the pipeline for further processing.

Data cleaning and validation: Perform data quality checks and validation to ensure that the data is clean and consistent. Identify and remove duplicates, null values, and other data quality issues that may impact downstream analysis.

Data transformation: Transform the raw trip record data into a format that is optimized for analysis. This may include aggregating the data by time periods, geographical regions, and other factors of interest.

Data analysis: Use PySpark to perform statistical analysis, data exploration, and data visualization to gain insights into taxi and FHV trips in New York City. This may include identifying popular pick-up and drop-off locations, peak trip times, and other patterns and trends in the data.

Data storage: Store the processed and analyzed data in a suitable data storage system such as Hadoop Distributed File System (HDFS) or Apache Cassandra for future use.

Automation and scheduling: Automate the data processing pipeline using tools such as Apache Airflow or Apache Oozie. Schedule the pipeline to run at regular intervals to ensure that the data is up to date and accurate.

---

The overall goal of the project is to build a batch data processing pipeline using PySpark to extract insights from the NYC TLC trip record data. The pipeline should be scalable, efficient, and automated to enable easy data processing and analysis.

### Import Libraries and Initiate Spark Session

In [1]:
import configparser

import pandas as pd

from pyspark.sql import SparkSession
import pyspark.sql.functions as F
from pyspark.sql.types import StructType, StructField, StringType, DoubleType, LongType, TimestampType

In [2]:
spark = SparkSession.builder.appName("nyc_batch_pipeline").getOrCreate()

23/05/15 21:17:49 WARN Utils: Your hostname, joker021-VirtualBox resolves to a loopback address: 127.0.1.1; using 10.0.2.15 instead (on interface enp0s3)
23/05/15 21:17:49 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
23/05/15 21:17:50 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable


In [3]:
spark

### Data Ingestion

In [4]:
# Parsing from config file
conf = configparser.ConfigParser()
conf.read("config")
data_source_path = conf.get("DATASOURCE PATH", "PATH")

# Reading the DataSource from PySpark
df_full = spark.read.parquet(data_source_path)
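
The cell above assumes a file named `config` that contains a `DATASOURCE PATH` section with a `PATH` option. The notebook does not show that file, so the following is only a sketch of what it might look like; the path value is a hypothetical placeholder, not the location actually used here.

```python
import configparser

# Hypothetical contents of the `config` file; the path is a placeholder.
SAMPLE_CONFIG = """
[DATASOURCE PATH]
PATH = /data/nyc_tlc/yellow_tripdata_2009-01.parquet
"""

conf = configparser.ConfigParser()
conf.read_string(SAMPLE_CONFIG)  # conf.read("config") would load the same text from disk

# Section names are case-sensitive; option names are case-insensitive by default.
data_source_path = conf.get("DATASOURCE PATH", "PATH")
print(data_source_path)
```

Keeping the path in a config file means the notebook can be pointed at a different month or storage location without editing code.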


In [5]:
print(f"No. of partitions: {df_full.rdd.getNumPartitions()}")

No. of partitions: 4


In [6]:
frac = 0.1
df = df_full.sample(fraction=frac, seed=123)
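
Note that `sample(fraction=...)` is not guaranteed to return exactly that fraction of the rows; the sample size varies around the target. The plain-Python sketch below illustrates the idea with a Bernoulli-style sampler in which every row is kept independently. It is an illustration of the concept only, not Spark's implementation, and `bernoulli_sample` is a hypothetical helper.

```python
import random

def bernoulli_sample(rows, fraction, seed):
    """Keep each row independently with probability `fraction`."""
    rng = random.Random(seed)
    return [row for row in rows if rng.random() < fraction]

rows = range(100_000)  # stand-in for DataFrame rows
sample_a = bernoulli_sample(rows, fraction=0.1, seed=123)
sample_b = bernoulli_sample(rows, fraction=0.1, seed=123)

# The size is close to, but generally not exactly, 10% of the input.
print(len(sample_a))
# The same seed reproduces the same sample.
print(sample_a == sample_b)
```

Fixing the seed, as the notebook does, keeps the sampled subset stable between runs on the same input.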

In [7]:
# Schema
df.printSchema()

root
 |-- vendor_name: string (nullable = true)
 |-- Trip_Pickup_DateTime: string (nullable = true)
 |-- Trip_Dropoff_DateTime: string (nullable = true)
 |-- Passenger_Count: long (nullable = true)
 |-- Trip_Distance: double (nullable = true)
 |-- Start_Lon: double (nullable = true)
 |-- Start_Lat: double (nullable = true)
 |-- Rate_Code: double (nullable = true)
 |-- store_and_forward: double (nullable = true)
 |-- End_Lon: double (nullable = true)
 |-- End_Lat: double (nullable = true)
 |-- Payment_Type: string (nullable = true)
 |-- Fare_Amt: double (nullable = true)
 |-- surcharge: double (nullable = true)
 |-- mta_tax: double (nullable = true)
 |-- Tip_Amt: double (nullable = true)
 |-- Tolls_Amt: double (nullable = true)
 |-- Total_Amt: double (nullable = true)



### Data Cleaning And Validation

#### Missing Values

In [8]:
no_of_row = df.count()
print(f"No of Rows: {no_of_row}")
print(f"No of cols: {len(df.columns)}")

No of Rows: 1409962
No of cols: 18


In [9]:
null_count = df.\
select(
    [F.count(F.when(F.isnan(c) | F.col(c).isNull(), c)).alias(c) for c in df.columns]
).collect()[0]\
.asDict()
null_col_list = [c for c in null_count if null_count[c] > 0]


In [10]:
null_count

{'vendor_name': 0,
 'Trip_Pickup_DateTime': 0,
 'Trip_Dropoff_DateTime': 0,
 'Passenger_Count': 0,
 'Trip_Distance': 0,
 'Start_Lon': 0,
 'Start_Lat': 0,
 'Rate_Code': 1409962,
 'store_and_forward': 1409835,
 'End_Lon': 0,
 'End_Lat': 0,
 'Payment_Type': 0,
 'Fare_Amt': 0,
 'surcharge': 0,
 'mta_tax': 1409962,
 'Tip_Amt': 0,
 'Tolls_Amt': 0,
 'Total_Amt': 0}

In [11]:
# Three columns are almost entirely null:
# Rate_Code and mta_tax are null in every row,
# and store_and_forward has only a handful of non-null rows.
null_col_list = ["Rate_Code", "store_and_forward", "mta_tax"]
df.select([F.count(F.when(F.isnan(c) | F.col(c).isNull(), c)).alias(c) for c in null_col_list]).show()

+---------+-----------------+-------+
|Rate_Code|store_and_forward|mta_tax|
+---------+-----------------+-------+
|  1409962|          1409835|1409962|
+---------+-----------------+-------+



In [12]:
# Check the number of distinct values in the mostly-null columns
df.select([F.countDistinct(F.col(c)).alias(c) for c in null_col_list]).show()

+---------+-----------------+-------+
|Rate_Code|store_and_forward|mta_tax|
+---------+-----------------+-------+
|        0|                2|      0|
+---------+-----------------+-------+



In [13]:
# Value counts for store_and_forward: the non-null values are a tiny fraction of the total rows
df.select(F.col("store_and_forward")).groupBy('store_and_forward').count().show()

+-----------------+-------+
|store_and_forward|  count|
+-----------------+-------+
|              0.0|    126|
|             null|1409835|
|              1.0|      1|
+-----------------+-------+



In [14]:
# These columns are almost entirely null, so we drop them
df_not_null = df.drop(*null_col_list)
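
The decision to drop these columns can also be written as an explicit rule on the null fraction. The sketch below uses the counts reported above; the 99% cut-off is an assumption chosen for illustration, not a threshold stated in this notebook.

```python
# Null counts and row count copied from the outputs above.
no_of_row = 1409962
null_count = {
    "Rate_Code": 1409962,
    "store_and_forward": 1409835,
    "mta_tax": 1409962,
    "Fare_Amt": 0,
}

# Assumed cut-off: drop a column when more than 99% of its rows are null.
NULL_FRACTION_THRESHOLD = 0.99

null_fraction = {c: n / no_of_row for c, n in null_count.items()}
cols_to_drop = [c for c, frac in null_fraction.items() if frac > NULL_FRACTION_THRESHOLD]

print(cols_to_drop)
```

Deriving the list from a threshold avoids hard-coding column names, which helps when the same pipeline runs on months with a different schema.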

In [15]:
df.unpersist()

DataFrame[vendor_name: string, Trip_Pickup_DateTime: string, Trip_Dropoff_DateTime: string, Passenger_Count: bigint, Trip_Distance: double, Start_Lon: double, Start_Lat: double, Rate_Code: double, store_and_forward: double, End_Lon: double, End_Lat: double, Payment_Type: string, Fare_Amt: double, surcharge: double, mta_tax: double, Tip_Amt: double, Tolls_Amt: double, Total_Amt: double]

In [16]:
df_not_null.printSchema()

root
 |-- vendor_name: string (nullable = true)
 |-- Trip_Pickup_DateTime: string (nullable = true)
 |-- Trip_Dropoff_DateTime: string (nullable = true)
 |-- Passenger_Count: long (nullable = true)
 |-- Trip_Distance: double (nullable = true)
 |-- Start_Lon: double (nullable = true)
 |-- Start_Lat: double (nullable = true)
 |-- End_Lon: double (nullable = true)
 |-- End_Lat: double (nullable = true)
 |-- Payment_Type: string (nullable = true)
 |-- Fare_Amt: double (nullable = true)
 |-- surcharge: double (nullable = true)
 |-- Tip_Amt: double (nullable = true)
 |-- Tolls_Amt: double (nullable = true)
 |-- Total_Amt: double (nullable = true)



#### DateTime

In [17]:
# The datetime columns are currently strings; inspect their format first
date_col = ["Trip_Pickup_DateTime", "Trip_Dropoff_DateTime"]
df_not_null.select(date_col).show(3)

+--------------------+---------------------+
|Trip_Pickup_DateTime|Trip_Dropoff_DateTime|
+--------------------+---------------------+
| 2009-01-05 10:23:13|  2009-01-05 10:33:56|
| 2009-01-03 17:08:30|  2009-01-03 17:16:31|
| 2009-01-19 21:26:03|  2009-01-19 21:51:42|
+--------------------+---------------------+
only showing top 3 rows



In [18]:
# The strings follow the pattern yyyy-MM-dd HH:mm:ss
spark.conf.set("spark.sql.legacy.timeParserPolicy", "LEGACY")
date_dict_map = {date_c: F.to_timestamp(F.col(date_c)) for date_c in date_col}
df_date_parsed = df_not_null.withColumns(date_dict_map)

In [19]:
df_date_parsed.printSchema()

root
 |-- vendor_name: string (nullable = true)
 |-- Trip_Pickup_DateTime: timestamp (nullable = true)
 |-- Trip_Dropoff_DateTime: timestamp (nullable = true)
 |-- Passenger_Count: long (nullable = true)
 |-- Trip_Distance: double (nullable = true)
 |-- Start_Lon: double (nullable = true)
 |-- Start_Lat: double (nullable = true)
 |-- End_Lon: double (nullable = true)
 |-- End_Lat: double (nullable = true)
 |-- Payment_Type: string (nullable = true)
 |-- Fare_Amt: double (nullable = true)
 |-- surcharge: double (nullable = true)
 |-- Tip_Amt: double (nullable = true)
 |-- Tolls_Amt: double (nullable = true)
 |-- Total_Amt: double (nullable = true)



#### Distinct Values

In [20]:
string_cols = [f.name for f in df_date_parsed.schema.fields if isinstance(f.dataType, StringType)]
distinct_count = df_date_parsed.select([F.countDistinct(F.col(c)).alias(c) for c in string_cols])
distinct_count_pd = distinct_count.pandas_api().transpose()
dist_cols = distinct_count_pd[distinct_count_pd[0]<50].index.tolist()


In [21]:
df_date_parsed.select([F.countDistinct(F.col(c)).alias(c) for c in dist_cols]).show()

+-----------+------------+
|vendor_name|Payment_Type|
+-----------+------------+
|          3|           6|
+-----------+------------+



In [22]:
# View Distinct Values
for c in dist_cols:
    df_date_parsed.select(c).distinct().show()

+-----------+
|vendor_name|
+-----------+
|        CMT|
|        VTS|
|        DDS|
+-----------+

+------------+
|Payment_Type|
+------------+
|   No Charge|
|        CASH|
|      Credit|
|        Cash|
|     Dispute|
|      CREDIT|
+------------+



#### Duplicates

In [23]:
# Both distinct() and a direct dropDuplicates() consumed a lot of memory here,
# so we deduplicate on a single derived column instead. This gave the same result
# as a direct dropDuplicates() with lower memory consumption.

# Concatenate all columns, then drop duplicates based on the concatenated column.
# Note: concat_ws skips NULLs, so this relies on the remaining columns having no nulls.
df_drop_by_concat = df_date_parsed\
.withColumn("concat_cols", F.concat_ws("||", *df_date_parsed.columns))\
.dropDuplicates(["concat_cols"])\
.drop("concat_cols")
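
One caveat with deduplicating on a concatenated key: `concat_ws` skips NULL inputs, so two different rows can produce the same key when nulls sit in different columns. The plain-Python sketch below mimics that behaviour to show the collision and one possible workaround. Both helpers are hypothetical stand-ins, and the `<NULL>` marker is an arbitrary choice.

```python
def concat_ws(sep, *values):
    """Plain-Python stand-in for Spark's concat_ws: NULL (None) inputs are skipped."""
    return sep.join(str(v) for v in values if v is not None)

def null_safe_key(sep, *values, null_marker="<NULL>"):
    """Replace None with an explicit marker so column positions stay aligned."""
    return sep.join(null_marker if v is None else str(v) for v in values)

row_a = ("CMT", None, "Cash")
row_b = ("CMT", "Cash", None)

# With NULLs skipped, the two different rows collapse to the same key.
print(concat_ws("||", *row_a) == concat_ws("||", *row_b))
# With an explicit marker, the keys stay distinct.
print(null_safe_key("||", *row_a) == null_safe_key("||", *row_b))
```

In this notebook the mostly-null columns have already been dropped, so the collision should not arise, but it is worth checking before reusing the trick on other data.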

In [24]:
# Row count after removing duplicates
cnt_after_drop = df_drop_by_concat.count()

# Number of duplicates dropped
no_of_row - cnt_after_drop


0

#### Data Transformation

In [25]:
# Add a trip duration column: seconds between pickup and drop-off
df_cleaned = df_drop_by_concat.withColumn(
    "duration", 
    F.col("Trip_Dropoff_DateTime").cast("long") - F.col("Trip_Pickup_DateTime").cast("long")
)
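
Casting a timestamp to `long` yields epoch seconds, so `duration` is the trip length in seconds. As a quick sanity check, the same arithmetic can be reproduced in plain Python on the sample rows shown in the DateTime section. `duration_seconds` is a hypothetical helper used only for this check.

```python
from datetime import datetime

FMT = "%Y-%m-%d %H:%M:%S"  # Python equivalent of the Spark pattern yyyy-MM-dd HH:mm:ss

def duration_seconds(pickup: str, dropoff: str) -> int:
    """Whole seconds between two timestamp strings."""
    delta = datetime.strptime(dropoff, FMT) - datetime.strptime(pickup, FMT)
    return int(delta.total_seconds())

# First sample row shown in the DateTime section: 10 minutes 43 seconds.
print(duration_seconds("2009-01-05 10:23:13", "2009-01-05 10:33:56"))
```

A trip of roughly 11 minutes is in line with the per-vendor `duration_avg` values of about 640 to 770 seconds reported later.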

### Data Validation

#### Schema Validation

In [26]:
# Defining Schema for Validation
validate_schema = StructType(
    [
        StructField('vendor_name', StringType(), True), 
        StructField('Trip_Pickup_DateTime', TimestampType(), True), 
        StructField('Trip_Dropoff_DateTime', TimestampType(), True), 
        StructField('Passenger_Count', LongType(), True), 
        StructField('Trip_Distance', DoubleType(), True), 
        StructField('Start_Lon', DoubleType(), True), 
        StructField('Start_Lat', DoubleType(), True), 
        StructField('End_Lon', DoubleType(), True), 
        StructField('End_Lat', DoubleType(), True), 
        StructField('Payment_Type', StringType(), True), 
        StructField('Fare_Amt', DoubleType(), True), 
        StructField('surcharge', DoubleType(), True), 
        StructField('Tip_Amt', DoubleType(), True), 
        StructField('Tolls_Amt', DoubleType(), True),
        StructField('Total_Amt', DoubleType(), True),
        StructField('duration', LongType(), True)
    ]
)

In [27]:
# Validate Schema
assert validate_schema == df_cleaned.schema, "schema is not valid"
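
A bare `assert` only says that the schemas differ, not where. The sketch below shows one way to report the mismatching columns, working on `(name, type)` pairs of the kind returned by `DataFrame.dtypes`. The `schema_diff` helper and the "actual" schema are hypothetical, and unlike the `StructType` comparison this check ignores column order and nullability.

```python
def schema_diff(expected, actual):
    """Return human-readable differences between two lists of (name, type) pairs."""
    problems = []
    expected_map, actual_map = dict(expected), dict(actual)
    for name, dtype in expected:
        if name not in actual_map:
            problems.append(f"missing column: {name}")
        elif actual_map[name] != dtype:
            problems.append(f"{name}: expected {dtype}, got {actual_map[name]}")
    for name in actual_map:
        if name not in expected_map:
            problems.append(f"unexpected column: {name}")
    return problems

expected = [("vendor_name", "string"), ("Trip_Pickup_DateTime", "timestamp"), ("duration", "bigint")]
# Hypothetical actual schema in which the timestamp column was never parsed.
actual = [("vendor_name", "string"), ("Trip_Pickup_DateTime", "string"), ("duration", "bigint")]

for problem in schema_diff(expected, actual):
    print(problem)
```

An empty result means the two schemas agree on column names and types.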

#### Null Value Validation

In [28]:
# Counting nulls for all columns in one pass consumed more memory, so it is commented out; it is better to check one column at a time
# is_null_values = df_cleaned.select([F.count(F.when(F.col(c).isNull(), c)).alias(c) for c in df_cleaned.columns]).collect()[0].asDict()
# [col for col in is_null_values if is_null_values[col] > 0]

In [29]:
null_cols = {}
for c in df_cleaned.columns:
    # One count per column keeps memory usage low
    cnt = df_cleaned.where(F.col(c).isNull()).select(c).count()
    if cnt > 0:
        null_cols[c] = cnt
        print(c, cnt)
# Report success only when no column contained nulls
if not null_cols:
    print("There are no null columns")

There are no null columns


#### Data Range Validation

In [31]:
# df_numeric_cols = [
#     f.name 
#     for f in df_cleaned.schema.fields 
#     if isinstance(f.dataType, DoubleType) or isinstance(f.dataType, LongType)
# ]
# bounds = {
#     c: dict(
#         zip(["q1", "q3"], df_cleaned.approxQuantile(c, [0.25, 0.75], 0.1))
#     )
#     for c in df_numeric_cols
# }

In [32]:
# for c in bounds:
#     iqr = bounds[c]['q3'] - bounds[c]['q1']
#     bounds[c]['lower'] = bounds[c]['q1'] - (iqr * 1.5)
#     bounds[c]['upper'] = bounds[c]['q3'] + (iqr * 1.5)
# print(bounds)

In [33]:
# df_outlier = df_cleaned.select(
#     "*",
#     *[
#         F.when(
#             F.col(c).between(bounds[c]['lower'], bounds[c]['upper']),
#             0
#         ).otherwise(1).alias(c+"_out") 
#         for c in df_numeric_cols
#     ]
# )

In [34]:
# df_outlier.select([F.sum(c).alias(c) for c in df_outlier.columns if c.endswith("out")]).show()
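
The commented-out cells above outline an interquartile-range (IQR) outlier check: compute Q1 and Q3, then flag values outside `Q1 - 1.5*IQR` and `Q3 + 1.5*IQR`. The sketch below runs the same rule in plain Python on a small, made-up sample. It uses exact quartiles from the `statistics` module, whereas `approxQuantile` with a relative error of 0.1 returns approximations, so the bounds on the real data would differ.

```python
import statistics

def iqr_bounds(values, k=1.5):
    """Return (lower, upper) fences: Q1 - k*IQR and Q3 + k*IQR."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Hypothetical trip distances in miles; 120.0 is a deliberately implausible value.
trip_distance = [2.5, 3.0, 4.0, 5.5, 6.0, 7.5, 9.0, 120.0]

lower, upper = iqr_bounds(trip_distance)
outliers = [v for v in trip_distance if not lower <= v <= upper]
print(outliers)
```

For skewed quantities such as fares and distances the lower fence is often negative, so in practice it is the upper fence that does the filtering.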

### Data Transformation

#### Data Aggregations

In [35]:
df_cleaned.printSchema()

root
 |-- vendor_name: string (nullable = true)
 |-- Trip_Pickup_DateTime: timestamp (nullable = true)
 |-- Trip_Dropoff_DateTime: timestamp (nullable = true)
 |-- Passenger_Count: long (nullable = true)
 |-- Trip_Distance: double (nullable = true)
 |-- Start_Lon: double (nullable = true)
 |-- Start_Lat: double (nullable = true)
 |-- End_Lon: double (nullable = true)
 |-- End_Lat: double (nullable = true)
 |-- Payment_Type: string (nullable = true)
 |-- Fare_Amt: double (nullable = true)
 |-- surcharge: double (nullable = true)
 |-- Tip_Amt: double (nullable = true)
 |-- Tolls_Amt: double (nullable = true)
 |-- Total_Amt: double (nullable = true)
 |-- duration: long (nullable = true)



In [36]:
numeric_col = ["Passenger_Count", "Trip_Distance", "Fare_Amt", "surcharge", "Tip_Amt", "Tolls_Amt", "Total_Amt", "duration"]
df_per_vendor = df_cleaned.select("vendor_name",*numeric_col)\
.groupBy(F.col("vendor_name"))\
.agg(*[F.mean(F.col(c)).alias(c+"_avg") for c in numeric_col])

In [37]:
df_per_vendor.show()

23/05/15 21:23:38 WARN TaskMemoryManager: Failed to allocate a page (67108864 bytes), try again.

+-----------+-------------------+------------------+-----------------+-------------------+------------------+-------------------+------------------+-----------------+
|vendor_name|Passenger_Count_avg| Trip_Distance_avg|     Fare_Amt_avg|      surcharge_avg|       Tip_Amt_avg|      Tolls_Amt_avg|     Total_Amt_avg|     duration_avg|
+-----------+-------------------+------------------+-----------------+-------------------+------------------+-------------------+------------------+-----------------+
|        CMT| 1.3154908167742745|2.5192744812887864|9.542147809189458|                0.0|0.4255705509935462|0.10707211809690544|10.078321851689623|644.7473672663449|
|        VTS|  2.104780176092554|2.5648959230269863| 9.41795026516011|0.32534740712102533|0.5058805138342817|0.11915780489762136| 10.37054279399214|729.3283243487712|
|        DDS| 1.3510595206406888| 2.710563958938519|9.704646845741683| 0.3411413427152893|0.4095606837409003|0.13657516173458314|10.593646695290929|772.2622704179059



In [38]:
numeric_col = ["Passenger_Count", "Trip_Distance", "Fare_Amt", "surcharge", "Tip_Amt", "Tolls_Amt", "Total_Amt", "duration"]
df_per_hour = df_cleaned.select("Trip_Pickup_DateTime", *numeric_col)\
.groupBy(F.hour(F.col("Trip_Pickup_DateTime")).alias("Hour"))\
.agg(*[F.mean(F.col(c)).alias(c+"_avg") for c in numeric_col])

In [39]:
df_per_hour.show(3)

23/05/15 21:23:58 WARN RowBasedKeyValueBatch: Calling spill() on RowBasedKeyValueBatch. Will not spill but return 0.
23/05/15 21:23:58 WARN RowBasedKeyValueBatch: Calling spill() on RowBasedKeyValueBatch. Will not spill but return 0.
23/05/15 21:24:05 WARN RowBasedKeyValueBatch: Calling spill() on RowBasedKeyValueBatch. Will not spill but return 0.
23/05/15 21:24:05 WARN RowBasedKeyValueBatch: Calling spill() on RowBasedKeyValueBatch. Will not spill but return 0.
23/05/15 21:24:05 WARN RowBasedKeyValueBatch: Calling spill() on RowBasedKeyValueBatch. Will not spill but return 0.
23/05/15 21:24:05 WARN RowBasedKeyValueBatch: Calling spill() on RowBasedKeyValueBatch. Will not spill but return 0.


+----+-------------------+------------------+------------------+--------------------+-------------------+-------------------+------------------+-----------------+
|Hour|Passenger_Count_avg| Trip_Distance_avg|      Fare_Amt_avg|       surcharge_avg|        Tip_Amt_avg|      Tolls_Amt_avg|     Total_Amt_avg|     duration_avg|
+----+-------------------+------------------+------------------+--------------------+-------------------+-------------------+------------------+-----------------+
|  12| 1.6641747789254664|2.3139684327859102|  8.95982284312999|2.972430705209184...|0.39772683361819133| 0.1345919595749435| 9.496324589433005|681.5317232667013|
|  22| 1.8037264809954692| 2.753081841565801| 9.850208390457944| 0.27377361693237756| 0.5254443965686075|0.08620836434381855|10.738692075678665| 678.570227323175|
|   1|  1.802593842795868|  3.03510776376234|10.250417755894517|  0.2565036420395421| 0.5168655617877718|0.05219106114058011|  11.0793611837263| 689.831527118601|
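+----+-------------------+------------------+------------------+--------------------+-------------------+-------------------+------------------+-----------------+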



In [40]:
numeric_col = ["Passenger_Count", "Trip_Distance", "Fare_Amt", "surcharge", "Tip_Amt", "Tolls_Amt", "Total_Amt", "duration"]
df_per_date = df_cleaned.select(
    F.to_date("Trip_Pickup_DateTime", "yyyy-MM-dd").alias("date"), 
    *numeric_col
)\
.groupBy("date")\
.agg(*[F.mean(F.col(c)).alias(c+"_avg") for c in numeric_col])

In [41]:
df_per_date.show(3)

23/05/15 21:24:15 WARN TaskMemoryManager: Failed to allocate a page (67108864 bytes), try again.
23/05/15 21:24:15 WARN RowBasedKeyValueBatch: Calling spill() on RowBasedKeyValueBatch. Will not spill but return 0.
23/05/15 21:24:24 WARN RowBasedKeyValueBatch: Calling spill() on RowBasedKeyValueBatch. Will not spill but return 0.
23/05/15 21:24:24 WARN RowBasedKeyValueBatch: Calling spill() on RowBasedKeyValueBatch. Will not spill but return 0.
23/05/15 21:24:24 WARN RowBasedKeyValueBatch: Calling spill() on RowBasedKeyValueBatch. Will not spill but return 0.
23/05/15 21:24:24 WARN RowBasedKeyValueBatch: Calling spill() on RowBasedKeyValueBatch. Will not spill but return 0.


+----------+-------------------+------------------+-----------------+-------------------+-------------------+-------------------+------------------+-----------------+
|      date|Passenger_Count_avg| Trip_Distance_avg|     Fare_Amt_avg|      surcharge_avg|        Tip_Amt_avg|      Tolls_Amt_avg|     Total_Amt_avg|     duration_avg|
+----------+-------------------+------------------+-----------------+-------------------+-------------------+-------------------+------------------+-----------------+
|2009-01-01| 1.8418353329059438| 2.880220105626273|9.835373813230772|0.06561956223097354|0.33303568702872666|0.12587080623988736|10.368288915346342|629.5846689257259|
|2009-01-30| 1.6884121126708327|2.5138183099375095|9.603914551152494|0.22340386077475105| 0.5020155024384818|0.11263689802881714|10.445140838540988|738.3874497005211|
|2009-01-22| 1.6341111956130168|2.5263746747788525|9.592409838690319| 0.2132149861906096| 0.5204180842973222|0.12623283833006493|10.455622823519986|715.0952047392227


In [42]:
numeric_col = ["Passenger_Count", "Trip_Distance", "Fare_Amt", "surcharge", "Tip_Amt", "Tolls_Amt", "Total_Amt", "duration"]
df_per_week = df_cleaned.select("Trip_Pickup_DateTime", *numeric_col)\
.groupBy(F.weekofyear(F.col("Trip_Pickup_DateTime")).alias("WeekOfYear"))\
.agg(*[F.mean(F.col(c)).alias(c+"_avg") for c in numeric_col])

In [43]:
df_per_week.show(3)

23/05/15 21:24:36 WARN TaskMemoryManager: Failed to allocate a page (67108864 bytes), try again.
23/05/15 21:24:36 WARN RowBasedKeyValueBatch: Calling spill() on RowBasedKeyValueBatch. Will not spill but return 0.
23/05/15 21:24:44 WARN RowBasedKeyValueBatch: Calling spill() on RowBasedKeyValueBatch. Will not spill but return 0.
23/05/15 21:24:44 WARN RowBasedKeyValueBatch: Calling spill() on RowBasedKeyValueBatch. Will not spill but return 0.
23/05/15 21:24:44 WARN RowBasedKeyValueBatch: Calling spill() on RowBasedKeyValueBatch. Will not spill but return 0.
23/05/15 21:24:44 WARN RowBasedKeyValueBatch: Calling spill() on RowBasedKeyValueBatch. Will not spill but return 0.


+----------+-------------------+------------------+-----------------+-------------------+-------------------+-------------------+------------------+-----------------+
|WeekOfYear|Passenger_Count_avg| Trip_Distance_avg|     Fare_Amt_avg|      surcharge_avg|        Tip_Amt_avg|      Tolls_Amt_avg|     Total_Amt_avg|     duration_avg|
+----------+-------------------+------------------+-----------------+-------------------+-------------------+-------------------+------------------+-----------------+
|         1|  1.848428970842764|2.8394042670749577| 9.88707196112367| 0.1321794701105046|0.37229097323924903|0.13152762614831554|10.527513380375462|675.2425575822127|
|         3| 1.6847972481071636|2.4436628415599015|9.399485385119249|0.18399188264414676|0.47540322874199165| 0.1129132692680994|10.173902154921254|689.8315587992623|
|         5| 1.6651743047129717|2.4685146411891914|9.362059425604409|0.19905577886381662|0.49271878749262815|0.10638172833815433|10.163310603338951|704.8975726806256


In [44]:
numeric_col = ["Passenger_Count", "Trip_Distance", "Fare_Amt", "surcharge", "Tip_Amt", "Tolls_Amt", "Total_Amt", "duration"]
df_per_dayofmonth = df_cleaned.select("Trip_Pickup_DateTime", *numeric_col)\
.groupBy(F.dayofmonth(F.col("Trip_Pickup_DateTime")).alias("DayOfMonth"))\
.agg(*[F.mean(F.col(c)).alias(c+"_avg") for c in numeric_col])

In [45]:
df_per_dayofmonth.show(3)

23/05/15 21:24:55 WARN TaskMemoryManager: Failed to allocate a page (67108864 bytes), try again.
23/05/15 21:24:55 WARN RowBasedKeyValueBatch: Calling spill() on RowBasedKeyValueBatch. Will not spill but return 0.
23/05/15 21:25:03 WARN RowBasedKeyValueBatch: Calling spill() on RowBasedKeyValueBatch. Will not spill but return 0.
23/05/15 21:25:03 WARN RowBasedKeyValueBatch: Calling spill() on RowBasedKeyValueBatch. Will not spill but return 0.
23/05/15 21:25:03 WARN RowBasedKeyValueBatch: Calling spill() on RowBasedKeyValueBatch. Will not spill but return 0.
23/05/15 21:25:03 WARN RowBasedKeyValueBatch: Calling spill() on RowBasedKeyValueBatch. Will not spill but return 0.

+----------+-------------------+------------------+-----------------+-------------------+-------------------+-------------------+------------------+-----------------+
|DayOfMonth|Passenger_Count_avg| Trip_Distance_avg|     Fare_Amt_avg|      surcharge_avg|        Tip_Amt_avg|      Tolls_Amt_avg|     Total_Amt_avg|     duration_avg|
+----------+-------------------+------------------+-----------------+-------------------+-------------------+-------------------+------------------+-----------------+
|        31| 1.7918311626019259|2.4959505405305684| 9.24101280301073|0.12173006678227502|0.43194019112275395|0.06843393720252351| 9.864978969117798| 680.225657676272|
|        28| 1.6335729847494553| 2.350026187363829| 9.13503071895426|0.21962962962962962|  0.518329193899782|0.09630675381263602|  9.97617712418301|719.3332897603486|
|        26| 1.6117397127418833|2.5066151803445242|9.272378423313162|0.21411582690620237|  0.479838590495543|0.12951623331639991|10.097802844871353|654.2429455502702


In [46]:
numeric_col = ["Passenger_Count", "Trip_Distance", "Fare_Amt", "surcharge", "Tip_Amt", "Tolls_Amt", "Total_Amt", "duration"]
df_per_dayofweek = df_cleaned.select("Trip_Pickup_DateTime", *numeric_col)\
.groupBy(F.dayofweek(F.col("Trip_Pickup_DateTime")).alias("dayofweek"))\
.agg(*[F.mean(F.col(c)).alias(c+"_avg") for c in numeric_col])

In [47]:
df_per_dayofweek.show(3)

23/05/15 21:25:13 WARN TaskMemoryManager: Failed to allocate a page (67108864 bytes), try again.
23/05/15 21:25:13 WARN RowBasedKeyValueBatch: Calling spill() on RowBasedKeyValueBatch. Will not spill but return 0.
23/05/15 21:25:21 WARN RowBasedKeyValueBatch: Calling spill() on RowBasedKeyValueBatch. Will not spill but return 0.
23/05/15 21:25:21 WARN RowBasedKeyValueBatch: Calling spill() on RowBasedKeyValueBatch. Will not spill but return 0.
23/05/15 21:25:21 WARN RowBasedKeyValueBatch: Calling spill() on RowBasedKeyValueBatch. Will not spill but return 0.
23/05/15 21:25:21 WARN RowBasedKeyValueBatch: Calling spill() on RowBasedKeyValueBatch. Will not spill but return 0.

+---------+-------------------+-----------------+-----------------+-------------------+-------------------+-------------------+------------------+-----------------+
|dayofweek|Passenger_Count_avg|Trip_Distance_avg|     Fare_Amt_avg|      surcharge_avg|        Tip_Amt_avg|      Tolls_Amt_avg|     Total_Amt_avg|     duration_avg|
+---------+-------------------+-----------------+-----------------+-------------------+-------------------+-------------------+------------------+-----------------+
|        1| 1.7937177911138296|2.845371548993836|9.775786608370222|0.11217321795159053|0.46167743576389947|0.12997583765812917|10.481227555970078|664.9007103365167|
|        6| 1.7096297782682854|2.538974576101145|9.577798896357946|0.22210895956656967| 0.4565002909601689|0.11610362195244035|10.375630019062903|716.8463248720778|
|        3| 1.6294155571128242|2.499379582340564|9.361223545900499|0.20972950245269797|0.48244339173090384|0.11964384022424589|10.174955851436597| 692.901359495445|
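+---------+-------------------+-----------------+-----------------+-------------------+-------------------+-------------------+------------------+-----------------+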
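
When reading the `dayofweek` and `WeekOfYear` keys, it helps to remember the numbering conventions: Spark's `dayofweek` counts 1 = Sunday through 7 = Saturday, and `weekofyear` follows ISO 8601 weeks. The plain-Python helpers below are hypothetical stand-ins that mimic those conventions so the keys can be checked against a calendar.

```python
from datetime import date

def spark_dayofweek(d: date) -> int:
    """Mimic Spark's dayofweek numbering: 1 = Sunday, ..., 7 = Saturday."""
    return d.isoweekday() % 7 + 1

def iso_weekofyear(d: date) -> int:
    """ISO 8601 week number, the convention Spark's weekofyear follows."""
    return d.isocalendar()[1]

DAY_NAMES = {1: "Sunday", 2: "Monday", 3: "Tuesday", 4: "Wednesday",
             5: "Thursday", 6: "Friday", 7: "Saturday"}

d = date(2009, 1, 5)  # pickup date of the first sample row, a Monday
print(spark_dayofweek(d), DAY_NAMES[spark_dayofweek(d)], iso_weekofyear(d))
```

Under this numbering, the `dayofweek` values 1, 6 and 3 in the output above correspond to Sunday, Friday and Tuesday.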


In [48]:
df_cnt_per_payment_type = df_cleaned.select("Payment_Type")\
.groupBy(F.col("Payment_Type"))\
.count()

In [49]:
df_cnt_per_payment_type.show()



+------------+------+
|Payment_Type| count|
+------------+------+
|   No Charge|  3990|
|        CASH|602997|
|      Credit|287365|
+------------+------+
only showing top 3 rows





In [50]:
numeric_col = ["Passenger_Count", "Trip_Distance", "Fare_Amt", "surcharge", "Tip_Amt", "Tolls_Amt", "Total_Amt", "duration"]
df_per_payment_type = df_cleaned.select("Payment_Type",*numeric_col)\
.groupBy(F.col("Payment_Type"))\
.agg(*[F.mean(F.col(c)).alias(c+"_avg") for c in numeric_col])

In [51]:
df_per_payment_type.show()


+------------+-------------------+------------------+------------------+-------------------+--------------------+-------------------+------------------+-----------------+
|Payment_Type|Passenger_Count_avg| Trip_Distance_avg|      Fare_Amt_avg|      surcharge_avg|         Tip_Amt_avg|      Tolls_Amt_avg|     Total_Amt_avg|     duration_avg|
+------------+-------------------+------------------+------------------+-------------------+--------------------+-------------------+------------------+-----------------+
|   No Charge| 1.2150375939849625| 2.456716791979948| 10.36890977443609|                0.0|0.007982456140350877| 0.1405087719298246|10.571872180451125|658.1862155388471|
|        CASH|  2.029670131028844|2.3973074973838964| 8.900942475667307|0.32386894130484895|2.487574565047587...|0.09005142645817063| 9.315773710316913|658.9716797927684|
|      Credit| 1.6813947418787953|3.2070011135663887|11.407868390374597| 0.1797818105893202|   2.148675099612018| 0.2152307692307591|13.958098341



In [52]:
df_cnt_payment_cnt_per_vendor = df_cleaned\
.select("vendor_name", "Payment_Type")\
.groupBy("vendor_name", "Payment_Type")\
.count()

In [53]:
df_cnt_payment_cnt_per_vendor.show()


+-----------+------------+------+
|vendor_name|Payment_Type| count|
+-----------+------------+------+
|        VTS|        CASH|532456|
|        DDS|      CREDIT| 15866|
|        CMT|      Credit|134386|
|        DDS|        CASH| 70541|
|        CMT|   No Charge|  3990|
|        CMT|        Cash|498909|
|        VTS|      Credit|152979|
|        CMT|     Dispute|   835|
+-----------+------------+------+
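
The table shows that vendors spell the same payment type differently (`CASH` vs `Cash`, `CREDIT` vs `Credit`), which splits what is logically one category. The sketch below normalizes the labels in plain Python and re-aggregates the counts copied from the output above. `normalize_payment_type` is a hypothetical helper; in the pipeline itself a similar effect could probably be achieved with something like `F.initcap(F.lower(F.col("Payment_Type")))` before grouping.

```python
from collections import Counter

# (vendor_name, Payment_Type, count) rows copied from the output above.
vendor_payment_counts = [
    ("VTS", "CASH", 532456),
    ("DDS", "CREDIT", 15866),
    ("CMT", "Credit", 134386),
    ("DDS", "CASH", 70541),
    ("CMT", "No Charge", 3990),
    ("CMT", "Cash", 498909),
    ("VTS", "Credit", 152979),
    ("CMT", "Dispute", 835),
]

def normalize_payment_type(label: str) -> str:
    """Collapse case variants such as 'CASH' and 'Cash' into one label."""
    return label.strip().title()

totals = Counter()
for _vendor, payment_type, count in vendor_payment_counts:
    totals[normalize_payment_type(payment_type)] += count

for payment_type, count in totals.most_common():
    print(payment_type, count)
```

After normalization the six raw labels collapse into four categories, and the counts still add up to the 1,409,962 rows in the sample.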



