# New York Taxi Tipping Prediction: Yellow Taxi and Green Taxi Classification
After data manipulation, the tabular data is ready to go to the next step, which is the classification. There are two challenges here which are the data size and the imballance dataset. Solution for the first challenge is by using batching. In batching, the calculation will be done every batch size. In here we will use 100000 rows of data for each batch. Meanwhile for the second challenge, we will use tree based model that friendly with imbalance dataset. In addition there will be weightted balancing to focus on minority target.

In [1]:
import duckdb

## Step 1: Create query for Yellow dan Green Taxi
The query is the same as written in the manipulation data.

In [2]:
query_yellow_green = """
WITH CTE_yellow_2009 AS (
    SELECT 
        CAST(Trip_Pickup_DateTime AS TIMESTAMP) AS pick_up_time,
        CAST(Trip_Dropoff_DateTime AS TIMESTAMP) AS drop_off_time,
        CAST(Passenger_Count AS INTEGER) AS passenger_count,
        CAST(Trip_Distance AS INTEGER) AS trip_distance,
        Payment_Type AS payment_type,
        CAST(Total_Amt AS FLOAT)   AS total_amount,
        Tip_Amt AS tip_amount,
        CAST(Tip_Amt AS FLOAT)   AS tip_amount,
        CAST( CASE Payment_Type
            WHEN  'Credit' THEN 1
            WHEN  'CREDIT' THEN 1
            ELSE 2
        END AS INTEGER) AS payment_category,
        CAST( CASE Tip_Amt
            WHEN  0.0 THEN 1
            ELSE 0
        END AS INTEGER) AS tip_category
    FROM 'C:/Users/ekadw/Documents/DATA/NY_Taxi/2009/yellow_taxi_2009/yellow_tripdata_*.parquet'
    WHERE Trip_Pickup_DateTime IS NOT NULL
        AND Trip_Dropoff_DateTime IS NOT NULL
        AND Passenger_Count >= 0
        AND Trip_Distance >= 0 
        AND Trip_Distance <= 50
        AND Payment_Type IS NOT NULL
        AND Total_Amt >= 0
        AND Tip_Amt >= 0
        AND Trip_Pickup_DateTime >= '2009-01-01' 
        AND Trip_Pickup_DateTime < '2010-01-01'
), CTE_duration_yellow_2009 AS (
    SELECT
        pick_up_time,
        drop_off_time,
        passenger_count,
        trip_distance,
        total_amount,
        payment_category,
        tip_category,
        DATE_DIFF('day', pick_up_time, drop_off_time) AS duration_days,
        EPOCH(drop_off_time - pick_up_time) AS duration_seconds
    FROM CTE_yellow_2009
    WHERE payment_category = 1
), CTE_yellow_2010 AS (
    SELECT 
        CAST(pickup_datetime AS TIMESTAMP) AS pick_up_time,
        CAST(dropoff_datetime AS TIMESTAMP) AS drop_off_time,
        CAST(passenger_count AS INTEGER) AS passenger_count,
        CAST(total_amount AS FLOAT)   AS total_amount,
        CAST(trip_distance AS FLOAT)   AS trip_distance,
        payment_type,
        tip_amount,
        CAST( CASE payment_type
            WHEN  'Cre' THEN 1
            WHEN  'CRE' THEN 1
            ELSE 2
        END AS INTEGER) AS payment_category,
        CAST( CASE tip_amount
            WHEN  0.0 THEN 1
            ELSE 0
        END AS INTEGER) AS tip_category
    FROM 'C:/Users/ekadw/Documents/DATA/NY_Taxi/2010/yellow_taxi_2010/yellow_tripdata_*.parquet'
    WHERE pickup_datetime IS NOT NULL
        AND dropoff_datetime  IS NOT NULL
        AND passenger_count >= 0
        AND trip_distance >= 0
        AND trip_distance <= 50
        AND payment_type IS NOT NULL
        AND total_amount >= 0
        AND tip_amount >= 0 
        AND pickup_datetime >= '2010-01-01' 
        AND pickup_datetime < '2011-01-01'
), CTE_duration_yellow_2010 AS (
    SELECT
        pick_up_time,
        drop_off_time,
        passenger_count,
        trip_distance,
        total_amount,
        payment_category,
        tip_category,
        DATE_DIFF('day', pick_up_time, drop_off_time) AS duration_days,
        EPOCH(drop_off_time - pick_up_time) AS duration_seconds
    FROM CTE_yellow_2010
    WHERE payment_category = 1
), CTE_yellow_2011_2023 AS (
    SELECT 
        tpep_pickup_datetime AS pick_up_time,
        tpep_dropoff_datetime AS drop_off_time,
        CAST(passenger_count AS INTEGER) AS passenger_count,
        CAST(total_amount AS FLOAT)   AS total_amount,
        CAST(trip_distance AS FLOAT)   AS trip_distance,
        CAST(payment_type AS INTEGER) AS payment_category,
        tip_amount,
        CAST( CASE tip_amount
            WHEN  0.0 THEN 1
            ELSE 0
        END AS INTEGER) AS tip_category
    FROM 'C:/Users/ekadw/Documents/DATA/NY_Taxi/*/yellow_taxi/yellow_tripdata_*.parquet'
    WHERE tpep_pickup_datetime IS NOT NULL
        AND tpep_dropoff_datetime  IS NOT NULL
        AND passenger_count IS NOT NULL
        AND trip_distance >= 0
        AND trip_distance <= 50
        AND payment_type IS NOT NULL
        AND total_amount >= 0
        AND tip_amount >= 0
        AND tpep_pickup_datetime >= '2011-01-01' 
        AND tpep_pickup_datetime < '2023-10-01'
), CTE_duration_yellow_2011_2023 AS (
    SELECT
        pick_up_time,
        drop_off_time,
        passenger_count,
        trip_distance,
        total_amount,
        payment_category,
        tip_category,
        DATE_DIFF('day', pick_up_time, drop_off_time) AS duration_days,
        EPOCH(drop_off_time - pick_up_time) AS duration_seconds
    FROM CTE_yellow_2011_2023
    WHERE payment_category = 1
), CTE_green_2011_2023 AS (
    SELECT 
        lpep_pickup_datetime AS pick_up_time,
        lpep_dropoff_datetime AS drop_off_time,
        CAST(passenger_count AS INTEGER) AS passenger_count,
        CAST(total_amount AS FLOAT)   AS total_amount,
        CAST(trip_distance AS FLOAT)   AS trip_distance,
        CAST(payment_type AS INTEGER) AS payment_category,
        tip_amount,
        CAST( CASE tip_amount
            WHEN  0.0 THEN 1
            ELSE 0
        END AS INTEGER) AS tip_category
    FROM 'C:/Users/ekadw/Documents/DATA/NY_Taxi/*/green_taxi/green_tripdata_*.parquet'
    WHERE lpep_pickup_datetime IS NOT NULL
        AND lpep_dropoff_datetime  IS NOT NULL
        AND passenger_count IS NOT NULL
        AND trip_distance >= 0
        AND trip_distance <= 50
        AND payment_type IS NOT NULL
        AND total_amount >= 0
        AND tip_amount >= 0
        AND lpep_pickup_datetime >= '2009-01-01' 
        AND lpep_pickup_datetime < '2023-10-01'
), CTE_duration_green_2011_2023 AS (
    SELECT
        pick_up_time,
        drop_off_time,
        passenger_count,
        trip_distance,
        total_amount,
        payment_category,
        tip_category,
        DATE_DIFF('day', pick_up_time, drop_off_time) AS duration_days,
        EPOCH(drop_off_time - pick_up_time) AS duration_seconds
    FROM CTE_green_2011_2023
    WHERE payment_category = 1
), CTE_union_all AS (
    SELECT * FROM CTE_duration_yellow_2009
    UNION ALL
    SELECT * FROM CTE_duration_yellow_2010
    UNION ALL
    SELECT * FROM CTE_duration_yellow_2011_2023
    UNION ALL
    SELECT * FROM CTE_duration_green_2011_2023
)

SELECT 
    passenger_count,
    trip_distance,
    total_amount,
    CAST(duration_seconds AS FLOAT)   AS duration_seconds,
    tip_category
FROM CTE_union_all
WHERE duration_days = 0
"""

## Step 2: There are some step in here:
1. Create connection from duckDB database to query above
2. Create pipeline with scaling and the calling of tree based model method
3. Create the metrics that can be called for every routine. All the metrics are accuracy, recall, precision and F1. The F1 score is good for imbalance dataset.
4. Do the calculation for each batch = 100000 rows of data

In [3]:
import pandas as pd
from river import preprocessing, tree, metrics

# ----------------------------
# 1. Connect to DuckDB and query
# ----------------------------
con = duckdb.connect("my_data.duckdb")
res = con.execute(query_yellow_green)

# ----------------------------
# 2. Build pipeline
# ----------------------------
pipeline = preprocessing.StandardScaler() | tree.HoeffdingTreeClassifier()

# Compact classification report table
all_metrics = metrics.ClassificationReport()

downsample_ratio = 3  # keep 1:3 balance

# ----------------------------
# 3. Streaming loop
# ----------------------------
while True:
    chunk = res.fetch_df_chunk(vectors_per_chunk=100_000)
    if chunk is None or len(chunk) == 0:
        break

    # Balance the batch
    minority = chunk[chunk["tip_category"] == 1]
    majority = chunk[chunk["tip_category"] == 0]

    if len(minority) > 0:
        majority_down = majority.sample(
            n=min(len(majority), downsample_ratio * len(minority)),
            random_state=42
        )
        batch_balanced = pd.concat([minority, majority_down], ignore_index=True)
    else:
        batch_balanced = chunk

    # Convert to dicts for speed
    records = batch_balanced.to_dict(orient="records")

    for r in records:
        x = {k: v for k, v in r.items() if k != "tip_category"}
        y = r["tip_category"]

        # Predict
        y_pred = pipeline.predict_one(x)

        # Update metrics
        if y_pred is not None:
            all_metrics.update(y, y_pred)

        # Train
        pipeline.learn_one(x, y)

    # Print compact classification report after each chunk
    print(all_metrics)


           Precision   Recall    F1       Support  
                                                   
       0     100.00%    99.66%   99.83%   4648635  
       1      99.00%   100.00%   99.50%   1549544  
                                                   
   Macro      99.50%    99.83%   99.67%            
   Micro      99.75%    99.75%   99.75%            
Weighted      99.75%    99.75%   99.75%            

                  99.75% accuracy                  
           Precision   Recall   F1       Support   
                                                   
       0      99.58%   99.67%   99.63%   11222490  
       1      99.00%   98.75%   98.87%    3740829  
                                                   
   Macro      99.29%   99.21%   99.25%             
   Micro      99.44%   99.44%   99.44%             
Weighted      99.44%   99.44%   99.44%             

                  99.44% accuracy                  
           Precision   Recall   F1       Support   
          

##### Result: with the accuracy of 97.60% and F1 score around 97%, the model here is really good to predict the tipping from yellow and green taxi customer in New York.