# Tip Range Prediction Model (Classification)

This notebook trains and compares classification models that predict the **range** (bin) of the tip amount for a taxi trip.

## Classes (Target Bins):
*   **0 (No Tip)**: tip < 0.0001 (in practice, tip = 0)
*   **1 (Low)**: 0.0001 <= tip < 3
*   **2 (Medium)**: 3 <= tip < 6
*   **3 (High)**: 6 <= tip < 10
*   **4 (Very High)**: tip >= 10
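
The target is built later with Spark's `Bucketizer`, whose buckets are left-closed: a value `x` falls in bucket `i` when `splits[i] <= x < splits[i+1]`. As a rough illustration of how a tip maps to a class under the notebook's split points, here is a plain-Python sketch. The `tip_class` helper is hypothetical and is not used anywhere in the notebook.

```python
from bisect import bisect_right

# Split points copied from the notebook's Bucketizer
SPLITS = [-float("inf"), 0.0001, 3, 6, 10, float("inf")]

def tip_class(tip: float) -> int:
    """Return the bucket index i such that SPLITS[i] <= tip < SPLITS[i + 1]."""
    idx = bisect_right(SPLITS, tip) - 1
    # +inf equals the last split point; keep it in the last bucket
    return min(idx, len(SPLITS) - 2)

for tip in [0.0, 2.5, 3.0, 6.0, 12.0]:
    print(tip, "->", tip_class(tip))
```

Under these splits a tip of exactly 3.0 lands in class 2, and a tip of exactly 10.0 lands in class 4.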

## Models:
1. Random Forest Classifier
2. XGBoost Classifier
3. Random Forest Classifier with Hyperparameter Tuning

In [1]:
import os
import sys

from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.ml.feature import VectorAssembler, StringIndexer, Bucketizer
from pyspark.ml.classification import RandomForestClassifier
from pyspark.ml.evaluation import MulticlassClassificationEvaluator

from xgboost.spark import SparkXGBClassifier

In [2]:
from pyspark.sql import SparkSession
import importlib.util
import os

# ========== LOAD CONFIG FIRST ==========
src_path = os.path.join(os.path.dirname(os.getcwd()), 'src')
config_file = os.path.join(src_path, 'config.py')

spec = importlib.util.spec_from_file_location("config", config_file)
config_module = importlib.util.module_from_spec(spec)
spec.loader.exec_module(config_module)

Config = config_module.Config
print("✓ Config loaded")

# ========== STOP EXISTING SPARK ==========
try:
    spark.stop()
    print("✓ Stopped existing Spark session")
except NameError:
    # `spark` is not defined yet on a fresh kernel
    print("ℹ No existing Spark session to stop")

# ========== CREATE SPARK SESSION ==========
print(f"Creating Spark session: {Config.APP_NAME}")

spark = SparkSession.builder \
    .appName(Config.APP_NAME) \
    .config("spark.driver.memory", Config.SPARK_DRIVER_MEMORY) \
    .config("spark.executor.memory", Config.SPARK_EXECUTOR_MEMORY) \
    .config("spark.executor.instances", Config.SPARK_EXECUTOR_INSTANCES) \
    .config("spark.executor.cores", Config.SPARK_EXECUTOR_CORES) \
    .config("spark.sql.execution.arrow.pyspark.enabled", "true") \
    .config("spark.jars.packages", "org.apache.hadoop:hadoop-aws:3.3.4,com.amazonaws:aws-java-sdk-bundle:1.12.262") \
    .getOrCreate()

print(f"✓ Spark session created successfully (version {spark.version})")

# ========== CONFIGURE HADOOP FOR MINIO ==========
print("Configuring Hadoop for MinIO...")

hadoop_conf = spark._jsc.hadoopConfiguration()
hadoop_conf.set("fs.s3a.endpoint", Config.MINIO_ENDPOINT)
hadoop_conf.set("fs.s3a.access.key", Config.MINIO_ACCESS_KEY)
hadoop_conf.set("fs.s3a.secret.key", Config.MINIO_SECRET_KEY)
hadoop_conf.set("fs.s3a.path.style.access", "true")
hadoop_conf.set("fs.s3a.connection.ssl.enabled", "false")
hadoop_conf.set("fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")

print(f"✓ Spark configured for environment: {Config.ENVIRONMENT}")
print(f"✓ Using MinIO endpoint: {Config.MINIO_ENDPOINT}")
print(f"✓ Reading from bucket: {Config.S3_BUCKET_NAME}")

print("\n" + "="*50)
print("✓ Spark Session Ready")
print("="*50)

# Display configuration
Config.display_config()

✓ Config loaded
ℹ No existing Spark session to stop
Creating Spark session: NYC Taxi EDA
✓ Spark session created successfully (version 3.5.0)
Configuring Hadoop for MinIO...
✓ Spark configured for environment: development
✓ Using MinIO endpoint: http://minio:9000
✓ Reading from bucket: nyc-taxi

✓ Spark Session Ready
Current Configuration:
App Name: NYC Taxi EDA
Environment: development
MinIO Endpoint: http://minio:9000
S3 Bucket: nyc-taxi
Spark Driver Memory: 3g
Spark Executor Memory: 3g
Spark Executor Instances: 3
Spark Executor Cores: 2
Log Level: INFO
Log File: eda.log
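
The notebook does not show `src/config.py`. For readers who want to reproduce the setup, the following is a minimal sketch of what such a `Config` class might look like. The attribute names used by the notebook and the default values are taken from the cell and its printed output above; the environment variable names, `LOG_LEVEL`, `LOG_FILE`, and the empty credential defaults are assumptions.

```python
import os

class Config:
    """Hypothetical reconstruction of src/config.py (not the author's file)."""

    APP_NAME = os.getenv("APP_NAME", "NYC Taxi EDA")
    ENVIRONMENT = os.getenv("ENVIRONMENT", "development")

    # MinIO / S3 settings; credentials are read from the environment (assumption)
    MINIO_ENDPOINT = os.getenv("MINIO_ENDPOINT", "http://minio:9000")
    MINIO_ACCESS_KEY = os.getenv("MINIO_ACCESS_KEY", "")
    MINIO_SECRET_KEY = os.getenv("MINIO_SECRET_KEY", "")
    S3_BUCKET_NAME = os.getenv("S3_BUCKET_NAME", "nyc-taxi")

    # Spark resources, kept as strings because they are passed to .config()
    SPARK_DRIVER_MEMORY = os.getenv("SPARK_DRIVER_MEMORY", "3g")
    SPARK_EXECUTOR_MEMORY = os.getenv("SPARK_EXECUTOR_MEMORY", "3g")
    SPARK_EXECUTOR_INSTANCES = os.getenv("SPARK_EXECUTOR_INSTANCES", "3")
    SPARK_EXECUTOR_CORES = os.getenv("SPARK_EXECUTOR_CORES", "2")

    # Logging settings (attribute names are assumptions)
    LOG_LEVEL = os.getenv("LOG_LEVEL", "INFO")
    LOG_FILE = os.getenv("LOG_FILE", "eda.log")

    @classmethod
    def display_config(cls):
        """Print the settings in the same layout as the notebook output."""
        print("Current Configuration:")
        print(f"App Name: {cls.APP_NAME}")
        print(f"Environment: {cls.ENVIRONMENT}")
        print(f"MinIO Endpoint: {cls.MINIO_ENDPOINT}")
        print(f"S3 Bucket: {cls.S3_BUCKET_NAME}")
        print(f"Spark Driver Memory: {cls.SPARK_DRIVER_MEMORY}")
        print(f"Spark Executor Memory: {cls.SPARK_EXECUTOR_MEMORY}")
        print(f"Spark Executor Instances: {cls.SPARK_EXECUTOR_INSTANCES}")
        print(f"Spark Executor Cores: {cls.SPARK_EXECUTOR_CORES}")
        print(f"Log Level: {cls.LOG_LEVEL}")
        print(f"Log File: {cls.LOG_FILE}")

Config.display_config()
```

Keeping the MinIO keys out of the source file and the printed output avoids leaking credentials when the notebook is shared.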


## Load Data

In [3]:
data_path = f"s3a://{Config.S3_BUCKET_NAME}/Tip_Prediction_Model_DF/"
df = spark.read.parquet(data_path)
print(f"Total records: {df.count()}")

Total records: 7975663


## Feature Engineering & Target Binning

In [4]:
# Drop missing values
df = df.dropna(subset=[
    "fare_amount", "trip_distance", "payment_type",
    "tip_amount", "tpep_pickup_datetime", "tpep_dropoff_datetime"
])

# 1. Time Features
# F.dayofweek returns 1 = Sunday ... 7 = Saturday, so [1, 7] marks the weekend
df = df.withColumn("pickup_hour", F.hour("tpep_pickup_datetime")) \
       .withColumn("pickup_day", F.dayofweek("tpep_pickup_datetime")) \
       .withColumn("pickup_month", F.month("tpep_pickup_datetime")) \
       .withColumn("is_weekend", F.when(F.col("pickup_day").isin([1, 7]), 1).otherwise(0))

# 2. Trip Duration in minutes; keep only trips longer than 0 and shorter than 300 minutes
df = df.withColumn(
    "trip_duration",
    (F.unix_timestamp("tpep_dropoff_datetime") - F.unix_timestamp("tpep_pickup_datetime")) / 60
).filter((F.col("trip_duration") > 0) & (F.col("trip_duration") < 300))

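# 3. Target Binning (Tip Class)
# Bucketizer buckets are left-closed: [lower, upper)
# 0: No Tip    tip < 0.0001 (exact 0 and any negative values)
# 1: Low       [0.0001, 3)
# 2: Medium    [3, 6)
# 3: High      [6, 10)
# 4: Very High [10, inf)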

bucketizer = Bucketizer(
    splits=[-float("inf"), 0.0001, 3, 6, 10, float("inf")],
    inputCol="tip_amount",
    outputCol="tip_class_raw"
)

# Bucketizer outputs the bucket index as a double (0.0 to 4.0),
# which is the label type Spark ML classifiers expect.
# The 0.0001 split gives exact-zero tips their own bucket:
#   (-inf, 0.0001) -> bucket 0, [0.0001, 3) -> bucket 1, and so on.
df = bucketizer.transform(df)
df = df.withColumn("label", F.col("tip_class_raw").cast("double"))

# Check class distribution
df.groupBy("label").count().orderBy("label").show()

+-----+-------+
|label|  count|
+-----+-------+
|  0.0| 430738|
|  1.0|2863423|
|  2.0|3283119|
|  3.0| 654426|
|  4.0| 713790|
+-----+-------+
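
The classes are imbalanced, so it helps to know what accuracy a trivial model would reach. The sketch below computes each class's share and the accuracy of always predicting the most frequent class, using the counts printed above. These counts cover the full filtered dataset rather than the test split, so the baseline for the test set is only approximate.

```python
# Class counts copied from the groupBy("label") output above
class_counts = {0: 430738, 1: 2863423, 2: 3283119, 3: 654426, 4: 713790}

total = sum(class_counts.values())
shares = {label: count / total for label, count in class_counts.items()}

# A model that always predicts the majority class is right this often
majority_label = max(class_counts, key=class_counts.get)
baseline_accuracy = shares[majority_label]

for label, share in shares.items():
    print(f"class {label}: {share:.1%}")
print(f"majority class: {majority_label}, baseline accuracy: {baseline_accuracy:.3f}")
```

Class 2 is the majority at roughly 41% of trips, which is the reference point for the accuracy figures reported later in the notebook.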



## Vector Assembly

In [5]:
indexer = StringIndexer(inputCol="payment_type", outputCol="payment_type_idx", handleInvalid="keep")
df = indexer.fit(df).transform(df)

feature_cols = [
    "fare_amount", "trip_distance", "trip_duration",
    "pickup_hour", "pickup_day", "pickup_month",
    "is_weekend", "payment_type_idx"
]

assembler = VectorAssembler(inputCols=feature_cols, outputCol="features")
df_ml = assembler.transform(df).select("features", "label")

train_df, test_df = df_ml.randomSplit([0.8, 0.2], seed=42)
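
To make the `payment_type_idx` feature easier to interpret: with its default ordering, `StringIndexer` casts values to strings and assigns index 0 to the most frequent label, 1 to the next, and so on, and `handleInvalid="keep"` sends unseen labels to one extra index. The following plain-Python sketch mimics that behavior under those assumptions. The helper names and the sample payment codes are made up for illustration.

```python
from collections import Counter

def fit_string_index(values):
    """Map each distinct value (as a string) to an index, most frequent first."""
    counts = Counter(str(v) for v in values)
    # Sort by descending frequency, then alphabetically to break ties
    ordered = sorted(counts, key=lambda label: (-counts[label], label))
    return {label: float(i) for i, label in enumerate(ordered)}

def transform_keep(value, mapping):
    """Mimic handleInvalid="keep": unseen values get one extra index."""
    return mapping.get(str(value), float(len(mapping)))

# Made-up payment_type codes, for illustration only
mapping = fit_string_index([1, 1, 1, 2, 2, 4])
print(mapping)
print(transform_keep(3, mapping))
```

Because the indices reflect frequency rather than any natural order, tree-based models such as the ones below can split on them, but the numeric value itself carries no meaning.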

## 1. Random Forest Classifier

In [7]:
rf = RandomForestClassifier(
    labelCol="label",
    featuresCol="features",
    numTrees=20,
    maxDepth=5,
    seed=42
)

rf_model = rf.fit(train_df)
rf_preds = rf_model.transform(test_df)

evaluator_acc = MulticlassClassificationEvaluator(labelCol="label", predictionCol="prediction", metricName="accuracy")
evaluator_f1 = MulticlassClassificationEvaluator(labelCol="label", predictionCol="prediction", metricName="f1")

print(f"Random Forest Accuracy: {evaluator_acc.evaluate(rf_preds)}")
print(f"Random Forest F1 Score: {evaluator_f1.evaluate(rf_preds)}")

Random Forest Accuracy: 0.7134076066077603
Random Forest F1 Score: 0.6920054707191383
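
The two metrics measure different things. Accuracy is the fraction of correct predictions, while `metricName="f1"` in `MulticlassClassificationEvaluator` reports a weighted F1: the per-class F1 scores averaged with each class weighted by its share of the true labels. A small plain-Python sketch of both, on made-up labels and predictions:

```python
from collections import Counter

def accuracy(labels, preds):
    """Fraction of predictions that match the true label."""
    return sum(l == p for l, p in zip(labels, preds)) / len(labels)

def weighted_f1(labels, preds):
    """Per-class F1 averaged with weights equal to each class's true-label share."""
    support = Counter(labels)
    total = len(labels)
    score = 0.0
    for cls, n in support.items():
        tp = sum(1 for l, p in zip(labels, preds) if l == cls and p == cls)
        predicted = sum(1 for p in preds if p == cls)
        precision = tp / predicted if predicted else 0.0
        recall = tp / n
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        score += f1 * n / total
    return score

# Made-up example, not taken from the notebook's data
labels = [0, 0, 1, 1, 1, 2]
preds = [0, 1, 1, 1, 1, 2]
print(f"accuracy:    {accuracy(labels, preds):.4f}")
print(f"weighted F1: {weighted_f1(labels, preds):.4f}")
```

A weighted F1 below the accuracy, as in the results above, usually indicates that some classes are predicted noticeably worse than others.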


                                                                                

## 2. XGBoost Classifier

In [8]:
xgb = SparkXGBClassifier(
    features_col="features",
    label_col="label",
    num_workers=2,
    device="cpu"  # replaces use_gpu=False, which is deprecated in xgboost 2.x
)

xgb_model = xgb.fit(train_df)
xgb_preds = xgb_model.transform(test_df)

print(f"XGBoost Accuracy: {evaluator_acc.evaluate(xgb_preds)}")
print(f"XGBoost F1 Score: {evaluator_f1.evaluate(xgb_preds)}")

2026-01-29 15:46:18,989 INFO XGBoost-PySpark: _fit Running xgboost-2.1.4 on 2 workers with
	booster params: {'objective': 'multi:softprob', 'device': 'cpu', 'num_class': 5, 'nthread': 1}
	train_call_kwargs_params: {'verbose_eval': True, 'num_boost_round': 100}
	dmatrix_kwargs: {'nthread': 1, 'missing': nan}
2026-01-29 15:46:21,455 INFO XGBoost-PySpark: _train_booster Training on CPUs
[15:46:22] Task 1 got rank 1
[15:46:22] Task 0 got rank 0
2026-01-29 15:46:28,809 INFO XGBoost-PySpark: _fit Finished xgboost training!
2026-01-29 15:46:29,052 INFO XGBoost-PySpark: predict_udf Do the inference on the CPUs

XGBoost Accuracy: 0.7436627346756265

2026-01-29 15:46:30,901 INFO XGBoost-PySpark: predict_udf Do the inference on the CPUs

XGBoost F1 Score: 0.7325345050544245
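
The training log shows the objective `multi:softprob` with `num_class: 5`. With that objective the model produces one probability per class, obtained by applying a softmax to the per-class raw scores, and the predicted label is the class with the highest probability. A minimal sketch of that last step, using made-up raw scores rather than values from the trained model:

```python
import math

def softmax(margins):
    """Convert raw per-class scores into probabilities that sum to 1."""
    shift = max(margins)  # subtract the max for numerical stability
    exps = [math.exp(m - shift) for m in margins]
    total = sum(exps)
    return [e / total for e in exps]

# Made-up raw scores for the five tip classes (illustration only)
raw_margins = [0.2, 1.5, 2.1, -0.3, -1.0]

probs = softmax(raw_margins)
predicted = max(range(len(probs)), key=probs.__getitem__)

print([round(p, 3) for p in probs])
print("predicted class:", predicted)
```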


                                                                                

## 3. Random Forest Classifier with Hyperparameter Tuning

In [None]:
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder

rf = RandomForestClassifier(
    labelCol="label",
    featuresCol="features",
    seed=42
)

paramGrid = (
    ParamGridBuilder()
    .addGrid(rf.numTrees, [50, 100])
    .addGrid(rf.maxDepth, [5, 10])
    .build()
)

cv = CrossValidator(
    estimator=rf,
    estimatorParamMaps=paramGrid,
    evaluator=evaluator_acc,   # accuracy for model selection
    numFolds=3,
    parallelism=2
)

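# Cache the training data so the repeated CV fits do not re-read it from storage;
# count() forces the cache to materialize.
train_df.cache()
train_df.count()

cv_model = cv.fit(train_df)

# bestModel is refit on the full training set; evaluate it on the held-out test set
best_rf_preds = cv_model.bestModel.transform(test_df)

print(f"Best RF Accuracy: {evaluator_acc.evaluate(best_rf_preds)}")
print(f"Best RF F1 Score: {evaluator_f1.evaluate(best_rf_preds)}")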

Best RF Accuracy: 0.7365146142849045
Best RF F1 Score: 0.7225512663308105
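
To get a sense of the cost of this search: `CrossValidator` fits every parameter combination once per fold and then refits the best combination on the full training set. The sketch below enumerates the grid from the cell above in plain Python and counts the fits; it does not run Spark.

```python
from itertools import product

# Grid and fold count copied from the tuning cell above
grid = {"numTrees": [50, 100], "maxDepth": [5, 10]}
num_folds = 3

# Cartesian product of all parameter values, one dict per combination
param_maps = [dict(zip(grid, values)) for values in product(*grid.values())]

cv_fits = len(param_maps) * num_folds
total_fits = cv_fits + 1  # plus the final refit of the best combination

for params in param_maps:
    print(params)
print(f"{len(param_maps)} combinations x {num_folds} folds = {cv_fits} fits, {total_fits} including the refit")
```

Each added grid value multiplies the number of fits, so widening the search beyond this 2 x 2 grid raises training time quickly.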


                                                                                

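Note: Hyperparameter tuning is applied only to the Random Forest Classifier in this notebook; SparkXGBClassifier was not tuned here. The xgboost documentation describes SparkXGBClassifier as usable with PySpark meta-algorithms such as CrossValidator, so tuning it the same way remains a possible follow-up.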