# NYC Taxi Analytics with PySpark, Parquet, and Delta Lake

This notebook walks through the NYC Taxi dataset with PySpark: reading Parquet files, inspecting column pruning in the query plan, writing a partitioned copy, and then storing the data as a Delta Lake table that supports history, time travel, and deletes.

## 1. Why Parquet Matters

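Parquet is a columnar storage format optimized for analytical workloads. Because the values of each column are stored together, a query reads only the columns it references, similar values compress well, and per-row-group min/max statistics let the reader skip data that cannot match a filter.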
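
The I/O argument can be illustrated with a toy sketch in plain Python. This is not the Parquet file format (which adds row groups, pages, and encodings); it only contrasts a row layout with a columnar layout. The sample values and the `passenger_count` field are hypothetical.

```python
import struct

# Hypothetical sample trips: (trip_distance, fare_amount, passenger_count).
trips = [(1.2, 7.5, 1), (3.4, 14.0, 2), (0.8, 5.5, 1), (10.1, 52.0, 3)]
n = len(trips)

# Row layout: every record stores all three fields side by side (8 + 8 + 4 bytes).
row_layout = b"".join(struct.pack("<ddi", *trip) for trip in trips)

# Columnar layout: the values of each column are stored contiguously.
distance_col = struct.pack(f"<{n}d", *(t[0] for t in trips))
fare_col = struct.pack(f"<{n}d", *(t[1] for t in trips))
passenger_col = struct.pack(f"<{n}i", *(t[2] for t in trips))

# Summing fare_amount from the row layout means scanning every full record,
# while the columnar layout only needs the bytes of the fare column.
bytes_scanned_row = len(row_layout)
bytes_scanned_columnar = len(fare_col)

total_fare = sum(struct.unpack(f"<{n}d", fare_col))
print(bytes_scanned_row, bytes_scanned_columnar, total_fare)
```

In this toy case the columnar read touches 32 of the 80 stored bytes; the gap grows with the number of columns a query does not need.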

## 2. Spark Session Initialization

In [None]:

from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("NYC Taxi Parquet and Delta Lake")
    .getOrCreate()
)

spark


## 3. Loading the NYC Taxi Dataset

In [None]:

df = spark.read.parquet("/content/nyc_taxi_parquet/")
df.printSchema()


## 4. Column Pruning in Parquet

In [None]:

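# explain(True) prints the parsed, analyzed, and optimized logical plans plus the physical plan.
# In the physical plan's Parquet scan, ReadSchema should list only the two selected columns,
# and PushedFilters should include the fare_amount predicate.
(
    df.select("pickup_datetime", "fare_amount")
    .filter("fare_amount > 50")
    .explain(True)
)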


## 5. Partitioning the Dataset

In [None]:

from pyspark.sql.functions import year, month

df_part = (
    df.withColumn("pickup_year", year("pickup_datetime"))
    .withColumn("pickup_month", month("pickup_datetime"))
)


In [None]:

(
    df_part.write
    .mode("overwrite")
    .partitionBy("pickup_year", "pickup_month")
    .parquet("/content/nyc_taxi_partitioned/")
)
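
A partitioned write stores each `(pickup_year, pickup_month)` combination in its own Hive-style directory, so a filter on those columns lets Spark skip whole directories. The sketch below shows the directory naming and a filtered read. The helper names and the year 2019 are hypothetical, chosen only for illustration.

```python
def partition_dir(base_path: str, year: int, month: int) -> str:
    """Return the Hive-style directory that partitionBy creates for one month."""
    return f"{base_path.rstrip('/')}/pickup_year={year}/pickup_month={month}"


def partition_filter(year: int, month: int) -> str:
    """Build a SQL predicate that references only the partition columns."""
    return f"pickup_year = {year} AND pickup_month = {month}"


def read_month(spark, base_path: str, year: int, month: int):
    """Read one month of trips; `spark` is an active SparkSession."""
    return spark.read.parquet(base_path).filter(partition_filter(year, month))


print(partition_dir("/content/nyc_taxi_partitioned/", 2019, 1))
print(partition_filter(2019, 1))
```

Calling `read_month(spark, "/content/nyc_taxi_partitioned/", 2019, 1).explain()` should show the predicate under `PartitionFilters` in the scan node, which indicates that only the matching directory is read.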


## 6. Delta Lake Installation

In [None]:
!pip install delta-spark
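
Each `delta-spark` release is built against a specific Spark release line, so a version mismatch is a common cause of startup errors. A minimal sketch for checking what is installed, using only the standard library (the helper name `installed_versions` is hypothetical):

```python
from importlib.metadata import PackageNotFoundError, version


def installed_versions(packages=("pyspark", "delta-spark")):
    """Map each package name to its installed version, or None if it is missing."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


print(installed_versions())
```

The printed versions can then be compared against the compatibility table in the Delta Lake release notes.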

## 7. Enable Delta Lake

In [None]:

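from delta import configure_spark_with_delta_pip
from pyspark.sql import SparkSession

# getOrCreate() would return the session from section 2 and ignore the static
# Delta settings, so stop that session before building a new one.
spark.stop()

builder = (
    SparkSession.builder
    .appName("NYC Taxi Delta Lake")
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog")
)
# configure_spark_with_delta_pip adds the Delta JARs that match the pip-installed delta-spark.
spark = configure_spark_with_delta_pip(builder).getOrCreate()

# DataFrames from the stopped session are no longer usable, so reload the partitioned data.
df_part = spark.read.parquet("/content/nyc_taxi_partitioned/")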


## 8. Writing Delta Table

In [None]:

(
    df_part.write
    .format("delta")
    .mode("overwrite")
    .partitionBy("pickup_year", "pickup_month")
    .save("/content/nyc_taxi_delta/")
)


## 9. Reading Delta Table

In [None]:

df_delta = spark.read.format("delta").load("/content/nyc_taxi_delta/")
df_delta.show(5)


## 10. Time Travel

In [None]:

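# DataFrames have no history() method; the commit log is exposed through
# DESCRIBE HISTORY (or DeltaTable.history()).
spark.sql("DESCRIBE HISTORY delta.`/content/nyc_taxi_delta/`").show()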
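
The table history lists one version number per commit, and an earlier snapshot can be loaded with the `versionAsOf` or `timestampAsOf` reader options. A minimal sketch, assuming an active Delta-enabled `spark` session; the helper names are hypothetical:

```python
def time_travel_options(version=None, timestamp=None):
    """Return the reader option that selects a single table snapshot."""
    if (version is None) == (timestamp is None):
        raise ValueError("pass exactly one of version or timestamp")
    if version is not None:
        return {"versionAsOf": version}
    return {"timestampAsOf": timestamp}


def read_snapshot(spark, path, version=None, timestamp=None):
    """Load the Delta table at `path` as of a version number or a timestamp string."""
    options = time_travel_options(version, timestamp)
    return spark.read.format("delta").options(**options).load(path)
```

For example, `read_snapshot(spark, "/content/nyc_taxi_delta/", version=0)` returns the table as it was after its first commit.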


## 11. Updates and Deletes

In [None]:

from delta.tables import DeltaTable

delta_table = DeltaTable.forPath(spark, "/content/nyc_taxi_delta/")
delta_table.delete("trip_distance < 0")
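
The cell above covers deletes; an update follows the same pattern through `DeltaTable.update`, which takes a SQL condition and a mapping from column names to SQL expressions. The cleaning rule below (replacing negative fares with their absolute value) and the function name are hypothetical and serve only to show the call shape.

```python
def fix_negative_fares(delta_table):
    """Rewrite rows with a negative fare_amount in place.

    `delta_table` is expected to be a delta.tables.DeltaTable instance.
    """
    delta_table.update(
        condition="fare_amount < 0",
        set={"fare_amount": "abs(fare_amount)"},
    )
```

Calling `fix_negative_fares(delta_table)` records a new table version, just as the delete does, so the earlier state remains reachable through time travel.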


## Final Summary

Parquet provides efficient columnar storage. Delta Lake layers a transaction log on top of Parquet files, which adds ACID transactions, a queryable table history with time travel, and in-place updates and deletes.