# Analytics on the NYC Taxi Dataset Using Parquet & Spark

This notebook walks through analytical, statistical, and machine-learning workflows in Apache Spark on the NYC Taxi dataset, which is stored in Parquet format.

## 1. Dataset Motivation

The NYC Taxi dataset combines time-based, numerical, and categorical data at scale, making it suitable for statistical analysis and machine learning.

## 2. Load Data from Parquet

In [None]:
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

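# Create a Spark session, or reuse one that is already running.
spark = SparkSession.builder.appName("NYTaxiAnalytics").getOrCreate()
# Read all Parquet files under the directory; the schema comes from the Parquet metadata.
taxi_df = spark.read.parquet("/data/ny_taxi_parquet/")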

## 3. Schema Inspection and Profiling

In [None]:
taxi_df.printSchema()  # column names and types
taxi_df.count()        # total number of rows (runs a Spark job)

## 4. Summary Statistics

In [None]:
# describe() reports count, mean, stddev, min, and max for each selected column.
taxi_df.select(
    "trip_distance",
    "fare_amount",
    "passenger_count"
).describe().show()

## 5. Distribution Analysis

In [None]:
from pyspark.sql.functions import floor

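# Group fares into $5-wide bins, each labeled by its lower bound (0, 5, 10, ...).
# Note: show() prints only the first 20 rows by default.
taxi_df.withColumn(
    "fare_bucket", floor(col("fare_amount") / 5) * 5
).groupBy("fare_bucket").count().orderBy("fare_bucket").show()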
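
The bucketing expression above can be checked by hand on a few values. The following is a minimal plain-Python sketch of the same arithmetic, not Spark code; the helper `fare_bucket` and the sample fares are made up for illustration. It also shows that a negative fare (for example a refund or a data error) falls into a negative bucket, because `floor` rounds toward negative infinity.

```python
import math
from collections import Counter


def fare_bucket(fare_amount: float, width: int = 5) -> int:
    """Return the lower bound of the `width`-dollar bin containing the fare."""
    return math.floor(fare_amount / width) * width


# Made-up fares, chosen to hit bin edges and a negative value.
sample_fares = [3.5, 7.0, 9.99, 10.0, 52.3, -2.5]

# Equivalent of groupBy("fare_bucket").count().orderBy("fare_bucket").
bucket_counts = Counter(fare_bucket(fare) for fare in sample_fares)
for bucket in sorted(bucket_counts):
    print(bucket, bucket_counts[bucket])
```

A fare of exactly 10.0 opens the next bin, so each bin covers the half-open interval from its label up to, but not including, the label plus 5.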

## 6. Time-Based Analysis

In [None]:
from pyspark.sql.functions import hour

# Count trips by hour of day (0-23) of the pickup timestamp.
taxi_df.withColumn(
    "pickup_hour", hour("pickup_datetime")
).groupBy("pickup_hour").count().orderBy("pickup_hour").show()

## 7. Feature Engineering

In [None]:
from pyspark.sql.functions import unix_timestamp

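# Trip duration in minutes: difference of the two timestamps in seconds, divided by 60.
taxi_df = taxi_df.withColumn(
    "trip_duration_min",
    (unix_timestamp("dropoff_datetime") - unix_timestamp("pickup_datetime")) / 60
)

# Keep plausible trips only: longer than 1 minute, shorter than 3 hours,
# with a positive fare and a positive distance. Rows with nulls in these columns are dropped too.
clean_df = taxi_df.filter(
    (col("trip_duration_min") > 1) &
    (col("trip_duration_min") < 180) &
    (col("fare_amount") > 0) &
    (col("trip_distance") > 0)
)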
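
To make the duration formula and the filter boundaries concrete, here is a small plain-Python sketch of the same logic. The helpers `trip_duration_min` and `keep_trip` and the sample timestamps are hypothetical and only mirror the Spark expressions above; they are not part of the notebook's pipeline.

```python
from datetime import datetime


def trip_duration_min(pickup: datetime, dropoff: datetime) -> float:
    """Duration in minutes: difference in seconds divided by 60."""
    return (dropoff - pickup).total_seconds() / 60


def keep_trip(duration_min: float, fare_amount: float, trip_distance: float) -> bool:
    """Same predicate as the clean_df filter; all comparisons are strict."""
    return 1 < duration_min < 180 and fare_amount > 0 and trip_distance > 0


# Made-up trip: 12 minutes 30 seconds long.
pickup = datetime(2024, 1, 1, 8, 0, 0)
dropoff = datetime(2024, 1, 1, 8, 12, 30)
duration = trip_duration_min(pickup, dropoff)

print(duration)                          # 12.5
print(keep_trip(duration, 14.2, 2.1))    # True
print(keep_trip(1.0, 5.0, 0.3))          # False: exactly 1 minute is excluded
print(keep_trip(240.0, 80.0, 30.0))      # False: longer than 180 minutes
```

Because the inequalities are strict, a trip of exactly 1 minute or exactly 180 minutes is removed along with the outliers.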

## 8. Regression: Fare Prediction

In [None]:
from pyspark.ml.feature import VectorAssembler

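# Combine the numeric columns into a single feature vector.
# handleInvalid="skip" drops rows with a null in any input column
# (the default, "error", would fail on them).
assembler = VectorAssembler(
    inputCols=["trip_distance", "trip_duration_min", "passenger_count"],
    outputCol="features",
    handleInvalid="skip"
)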

ml_df = assembler.transform(clean_df).select("features", "fare_amount")
train_df, test_df = ml_df.randomSplit([0.8, 0.2], seed=42)

In [None]:
from pyspark.ml.regression import LinearRegression

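# featuresCol defaults to "features", which matches the assembler's output column.
lr = LinearRegression(labelCol="fare_amount")
lr_model = lr.fit(train_df)
# transform() adds a "prediction" column to the test set.
predictions = lr_model.transform(test_df)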

In [None]:
from pyspark.ml.evaluation import RegressionEvaluator

RegressionEvaluator(
    labelCol="fare_amount",
    predictionCol="prediction",
    metricName="rmse"
).evaluate(predictions)
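
For reference, RMSE is the square root of the mean squared difference between label and prediction, so it is expressed in the same unit as `fare_amount`. The sketch below computes it in plain Python on made-up numbers; the `rmse` helper is illustrative and is not how Spark evaluates the metric internally.

```python
import math


def rmse(labels, predictions):
    """Root mean squared error of paired labels and predictions."""
    squared_errors = [(y - p) ** 2 for y, p in zip(labels, predictions)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Made-up fares and predictions; errors are -2, +2, and -3.
actual_fares = [10.0, 20.0, 30.0]
predicted_fares = [12.0, 18.0, 33.0]

# Squared errors are 4, 4, 9, so this is sqrt(17 / 3).
print(rmse(actual_fares, predicted_fares))
```

Squaring weights large misses more heavily, so a few badly predicted long trips can dominate the score.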

## 9. Classification: High Fare Trips

In [None]:
from pyspark.sql.functions import when

# Binary label: 1 if the fare exceeds $50, otherwise 0.
labeled_df = clean_df.withColumn(
    "high_fare", when(col("fare_amount") > 50, 1).otherwise(0)
)

final_df = assembler.transform(labeled_df).select("features", "high_fare")
train_df, test_df = final_df.randomSplit([0.8, 0.2], seed=42)

In [None]:
from pyspark.ml.classification import LogisticRegression

log_reg = LogisticRegression(labelCol="high_fare")
# Named log_model so that later cells do not overwrite it.
log_model = log_reg.fit(train_df)
predictions = log_model.transform(test_df)

In [None]:
from pyspark.ml.evaluation import BinaryClassificationEvaluator

BinaryClassificationEvaluator(
    labelCol="high_fare",
    metricName="areaUnderROC"
).evaluate(predictions)
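
Area under the ROC curve can be read as the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one, with ties counted as half. The plain-Python sketch below applies that definition to made-up labels and scores. It illustrates the metric only; it is not Spark's implementation, and the `area_under_roc` helper is hypothetical.

```python
def area_under_roc(labels, scores):
    """Fraction of (positive, negative) pairs ranked correctly; ties count as 0.5."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    wins = 0.0
    for pos_score in positives:
        for neg_score in negatives:
            if pos_score > neg_score:
                wins += 1.0
            elif pos_score == neg_score:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Made-up labels (1 = high fare) and model scores.
labels = [1, 0, 1, 0, 0]
scores = [0.9, 0.4, 0.35, 0.2, 0.1]

# 5 of the 6 positive/negative pairs are ordered correctly.
print(area_under_roc(labels, scores))
```

A value of 0.5 corresponds to random ranking and 1.0 to perfect separation. The metric depends only on the ordering of scores, not on any classification threshold.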

## 10. Clustering Trips

In [None]:
from pyspark.ml.clustering import KMeans

cluster_df = assembler.transform(clean_df).select("features")

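# k=5 clusters; the seed makes the initialization reproducible.
kmeans = KMeans(k=5, seed=42)
kmeans_model = kmeans.fit(cluster_df)

# One center per cluster, in the order of the assembler's inputCols.
kmeans_model.clusterCenters()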
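
Each center returned by `clusterCenters()` is a point in feature space, and a trip belongs to the cluster whose center is closest in Euclidean distance. The sketch below shows that assignment rule in plain Python. The two centers and the sample trip are hypothetical values, not output from the fitted model.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest_center(point, centers):
    """Index of the center closest to the point."""
    return min(range(len(centers)), key=lambda i: squared_distance(point, centers[i]))


# Hypothetical centers: [trip_distance, trip_duration_min, passenger_count].
centers = [
    [1.2, 8.0, 1.0],    # short trips
    [9.5, 35.0, 2.0],   # long trips
]
trip = [2.0, 12.0, 1.0]

print(nearest_center(trip, centers))  # 0
```

In this example most of the distance comes from `trip_duration_min`, simply because it has the largest numeric range. With unscaled features the clusters may therefore reflect duration more than the other columns, and standardizing the features first is worth considering.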

## Summary

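This notebook demonstrated scalable statistical analysis (summary statistics, fare and pickup-hour distributions) and machine-learning workflows (linear regression, logistic regression, and k-means clustering) using Spark and Parquet on the NYC Taxi dataset.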