Exercise 1 — CSV ↔ Parquet Conversion & Basic EDA
- Dataset: NYC Yellow Taxi 2023‑01 CSV  https://d37ci6vzurychx.cloudfront.net/tripdata/yellow_tripdata_2023-01.csv.gz
- Load the first 500,000 rows with Pandas for schema inspection and summary statistics (describe()).
- Write the full file to Parquet with Snappy compression via Pandas (PyArrow backend).
- Start a local Spark session, read the Parquet back, cache it, and verify matching row counts.
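
The exercise statement names a CSV source, while the cell that follows reads the Parquet release directly. For the CSV route, a minimal sketch of loading only the first N rows for schema inspection might look like this. The inline CSV text and its values are made up for illustration, and `N_ROWS` is a hypothetical name; with the real file one would presumably pass the path and `nrows=500_000` instead.

```python
import io

import pandas as pd

# Hypothetical miniature CSV standing in for the taxi file (values are invented).
csv_text = """VendorID,tpep_pickup_datetime,trip_distance,fare_amount
1,2023-01-01 00:32:10,0.97,9.3
2,2023-01-01 00:55:08,1.10,7.9
2,2023-01-01 00:25:04,2.51,14.9
1,2023-01-01 00:03:48,1.90,12.1
"""

N_ROWS = 3  # the exercise would use 500_000 here

# nrows stops parsing after N_ROWS data rows, so the rest of the file is never loaded.
sample = pd.read_csv(
    io.StringIO(csv_text),
    nrows=N_ROWS,
    parse_dates=["tpep_pickup_datetime"],
)

print(sample.dtypes)      # schema inspection
print(sample.describe())  # summary statistics for the numeric columns
```

Because `nrows` limits parsing rather than slicing afterwards, this keeps memory bounded even when the source file is large.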

In [1]:
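import os

import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq  # import the submodule explicitly instead of relying on pa.parquet

url = 'https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2023-01.parquet'

# head(5_000_000) exceeds the file's 3,066,766 rows (see the count in the output),
# so `sample` holds the full month rather than a 500,000-row sample.
sample = pd.read_parquet(url, engine='pyarrow').head(5_000_000)

# Schema and summary statistics
print(sample.dtypes)
print(sample.describe(include='all'))

os.makedirs('./data', exist_ok=True)

# Write to Parquet with Snappy compression
table = pa.Table.from_pandas(sample)
pq.write_table(table, './data/yellow_tripdata_2023-01.snappy.parquet', compression='snappy')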

VendorID                          int64
tpep_pickup_datetime     datetime64[us]
tpep_dropoff_datetime    datetime64[us]
passenger_count                 float64
trip_distance                   float64
RatecodeID                      float64
store_and_fwd_flag               object
PULocationID                      int64
DOLocationID                      int64
payment_type                      int64
fare_amount                     float64
extra                           float64
mta_tax                         float64
tip_amount                      float64
tolls_amount                    float64
improvement_surcharge           float64
total_amount                    float64
congestion_surcharge            float64
airport_fee                     float64
dtype: object
            VendorID        tpep_pickup_datetime       tpep_dropoff_datetime  \
count   3.066766e+06                     3066766                     3066766   
unique           NaN                         NaN                  

In [2]:
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("Taxi") \
    .master("local[*]") \
    .getOrCreate()
df_spark = spark.read.parquet('data/yellow_tripdata_2023-01.snappy.parquet')


Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
25/05/10 11:08:27 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable


In [3]:
import time

# Uncached count(): Spark scans the Parquet file.
start_spark_count = time.time()
spark_count = df_spark.count()
spark_count_time = time.time() - start_spark_count
print(f"count() time WITHOUT cache: {spark_count_time:.2f} seconds")

# cache() is lazy; the first count() after it materializes the cache.
start_cache = time.time()
df_spark.cache()
cached_count = df_spark.count()
cache_time = time.time() - start_cache
print(f"Caching time (including 1st count()): {cache_time:.2f} seconds")

# Later actions read from the cached data.
start_cached_count = time.time()
spark_count_cached = df_spark.count()
cached_count_time = time.time() - start_cached_count
print(f"count() time WITH cache: {cached_count_time:.2f} seconds")

# Note: only the first 500 rows are kept here, so the Pandas count printed in the
# summary is not comparable with the Spark count of the full file.
sample = pd.read_parquet('data/yellow_tripdata_2023-01.snappy.parquet').head(500)

start_pandas = time.time()
pandas_count = len(sample)
pandas_time = time.time() - start_pandas
print(f"\nPandas len() time: {pandas_time:.4f} seconds")

print("\nSummary:")
print(f"Spark (without cache): {spark_count_time:.2f} seconds")
print(f"Spark (with cache): {cached_count_time:.2f} seconds")
print(f"Pandas: {pandas_time:.4f} seconds")
print(f"Spark count: {spark_count}")
print(f"Pandas count: {pandas_count}")

print("\nCache status (Spark):")
print(df_spark.storageLevel)

count() time WITHOUT cache: 1.29 seconds

Caching time (including 1st count()): 4.88 seconds
count() time WITH cache: 0.18 seconds

Pandas len() time: 0.0001 seconds

Summary:
Spark (without cache): 1.29 seconds
Spark (with cache): 0.18 seconds
Pandas: 0.0001 seconds
Spark count: 3066766
Pandas count: 500

Cache status (Spark):
Disk Memory Deserialized 1x Replicated


25/05/10 11:08:39 WARN GarbageCollectionMetrics: To enable non-built-in garbage collector(s) List(G1 Concurrent GC), users should configure it(them) to spark.eventLog.gcMetrics.youngGenerationGarbageCollectors or spark.eventLog.gcMetrics.oldGenerationGarbageCollectors
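
The timing cell repeats the same start/stop pattern for every measurement. One way to reduce that repetition is a small context manager built on `time.perf_counter()`, which is a monotonic clock intended for measuring intervals. This is only a sketch: `timed` and `timings` are hypothetical names, and the summed range is a stand-in workload, not a Spark action.

```python
import time
from contextlib import contextmanager

# Collected measurements, keyed by a human-readable label.
timings = {}


@contextmanager
def timed(label):
    """Record the wall-clock duration of the enclosed block under `label`."""
    start = time.perf_counter()
    try:
        yield
    finally:
        # Stored even if the block raises, so partial runs are still visible.
        timings[label] = time.perf_counter() - start


# Stand-in workload; in the notebook this would be an action such as df_spark.count().
with timed("sum of squares"):
    total = sum(i * i for i in range(100_000))

print(f"sum of squares: {timings['sum of squares']:.4f} seconds")
```

In the notebook, the three Spark measurements could then be written as `with timed("count, no cache"): df_spark.count()` and so on, with the summary printed from `timings`.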


Exercise 2 — Pandas Profiling vs. Spark SQL Analysis
- Generate a Pandas Profiling (ydata‑profiling) report on the 500 k taxi sample.
- Recreate three key insights in Spark SQL (e.g., mean trip distance, 95th percentile fare).
- Compare runtimes and memory; justify Pandas vs. Spark choices for typical HCIE workloads.
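
As a starting point for the comparison, the two example insights and a memory figure can be sketched in Pandas alongside a Spark SQL counterpart. The toy values are invented, and `trips`, `SPARK_SQL`, and `spark_insights` are hypothetical names; only the column names come from the taxi schema. The Spark function is defined but not called here, since it needs a running SparkSession. Note that `percentile_approx` is approximate and does not interpolate the way the Pandas `quantile` default does, so the two results can differ slightly.

```python
import pandas as pd

# Toy stand-in for the taxi sample (values are invented for illustration).
trips = pd.DataFrame({
    "trip_distance": [1.0, 2.0, 3.0, 4.0, 10.0],
    "fare_amount": [5.0, 10.0, 15.0, 20.0, 50.0],
})

# Pandas side: mean distance, 95th percentile fare, and in-memory footprint.
mean_distance = trips["trip_distance"].mean()
p95_fare = trips["fare_amount"].quantile(0.95)  # linear interpolation by default
memory_bytes = trips.memory_usage(deep=True).sum()

print(f"mean trip distance: {mean_distance:.2f}")
print(f"95th percentile fare: {p95_fare:.2f}")
print(f"memory: {memory_bytes} bytes")

# Spark SQL side: the same two metrics expressed as a query over a temp view.
SPARK_SQL = """
SELECT AVG(trip_distance)                   AS mean_distance,
       percentile_approx(fare_amount, 0.95) AS p95_fare
FROM trips
"""


def spark_insights(spark, parquet_path):
    """Run SPARK_SQL against a Parquet file; requires an active SparkSession."""
    spark.read.parquet(parquet_path).createOrReplaceTempView("trips")
    return spark.sql(SPARK_SQL).first()
```

With the session from Exercise 1, a call such as `spark_insights(spark, 'data/yellow_tripdata_2023-01.snappy.parquet')` would return a single row holding both metrics.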