### About the project
This notebook is a basic ETL demo for a big data project.

The process flows like this:

- Raw CSV Data ---> PySpark ---> Basic Transformation ---> Load to MongoDB ---> Read From MongoDB ---> Load to HBase ---> Do Data Analysis

The system is set up like this:
- PySpark (set up locally, used for raw data processing)
- MongoDB (set up locally, used to store the processed data)
- HBase (set up in Docker, used for data analysis)

About the dataset:
- TLC Trip Record Data
    - Yellow and green taxi trip records include fields capturing pick-up and drop-off dates/times, pick-up and drop-off locations, trip distances, itemized fares, rate types, payment types, and driver-reported passenger counts. The data used in the attached datasets were collected and provided to the NYC Taxi and Limousine Commission (TLC) by technology providers authorized under the Taxicab & Livery Passenger Enhancement Programs (TPEP/LPEP). The trip data was not created by the TLC, and TLC makes no representations as to the accuracy of these data.

    - For-Hire Vehicle (“FHV”) trip records include fields capturing the dispatching base license number and the pick-up date, time, and taxi zone location ID (shape file below). These records are generated from the FHV Trip Record submissions made by bases. Note: The TLC publishes base trip record data as submitted by the bases, and we cannot guarantee or confirm their accuracy or completeness. Therefore, this may not represent the total amount of trips dispatched by all TLC-licensed bases. The TLC performs routine reviews of the records and takes enforcement actions when necessary to ensure, to the extent possible, complete and accurate information.

    - The dataset was originally in Parquet format and was converted to CSV with PySpark.

    - The dataset contains 3,475,226 records.

The table below describes the fields in the NYC yellow taxi trip data:

| Field Name | Description |
|------------|-------------|
| **VendorID** | A code indicating the TPEP provider that provided the record.<br>• 1 = Creative Mobile Technologies, LLC<br>• 2 = Curb Mobility, LLC<br>• 6 = Myle Technologies Inc<br>• 7 = Helix tpep |
| **tpep_pickup_datetime** | The date and time when the meter was engaged. |
| **tpep_dropoff_datetime** | The date and time when the meter was disengaged. |
| **passenger_count** | The number of passengers in the vehicle. |
| **trip_distance** | The elapsed trip distance in miles reported by the taximeter. |
| **RatecodeID** | The final rate code in effect at the end of the trip.<br>• 1 = Standard rate<br>• 2 = JFK<br>• 3 = Newark<br>• 4 = Nassau or Westchester<br>• 5 = Negotiated fare<br>• 6 = Group ride<br>• 99 = Null/unknown |
| **store_and_fwd_flag** | This flag indicates whether the trip record was held in vehicle memory before sending to the vendor.<br>• Y = store and forward trip<br>• N = not a store and forward trip |
| **PULocationID** | TLC Taxi Zone in which the taximeter was engaged. |
| **DOLocationID** | TLC Taxi Zone in which the taximeter was disengaged. |
| **payment_type** | A numeric code signifying how the passenger paid for the trip.<br>• 0 = Flex Fare trip<br>• 1 = Credit card<br>• 2 = Cash<br>• 3 = No charge<br>• 4 = Dispute<br>• 5 = Unknown<br>• 6 = Voided trip |
| **fare_amount** | The time-and-distance fare calculated by the meter. |
| **extra** | Miscellaneous extras and surcharges. |
| **mta_tax** | Tax that is automatically triggered based on the metered rate in use. |
| **tip_amount** | Tip amount – This field is automatically populated for credit card tips. Cash tips are not included. |
| **tolls_amount** | Total amount of all tolls paid in trip. |
| **improvement_surcharge** | Improvement surcharge assessed trips at the flag drop. The improvement surcharge began being levied in 2015. |
| **total_amount** | The total amount charged to passengers. Does not include cash tips. |
| **congestion_surcharge** | Total amount collected in trip for NYS congestion surcharge. |
| **airport_fee** | For pick up only at LaGuardia and John F. Kennedy Airports. |
| **cbd_congestion_fee** | Per-trip charge for MTA's Congestion Relief Zone starting Jan. 5, 2025. |


Note: Spark has no out-of-the-box support for HBase, so I used the `happybase` library to connect to HBase and run the data analysis. Because of CPU and time constraints, the pipeline works on a random sample of 100,000 records from the original dataset.

## Part I: Data Ingestion to MongoDB

#### Load Necessary Libraries

In [1]:
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, to_timestamp, hour, dayofweek, month, year, date_format, rand
import pymongo
import happybase
import uuid
from datetime import datetime
import time
import json
import concurrent.futures
import multiprocessing
from threading import Lock
import queue
import itertools
import threading

# Initialize Spark with proper configurations
spark = SparkSession.builder \
    .appName("NYC Taxi Data Pipeline") \
    .config("spark.mongodb.output.uri", "mongodb://localhost:27017/bigdata_demo.raw_taxi_data") \
    .config("spark.jars.packages", "org.mongodb.spark:mongo-spark-connector_2.12:3.0.1") \
    .config("spark.mongodb.input.uri", "mongodb://localhost:27017/bigdata_demo.raw_taxi_data") \
    .getOrCreate()
    
spark.sparkContext.setLogLevel("ERROR")


your 131072x1 screen size is bogus. expect trouble
25/04/15 22:35:10 WARN Utils: Your hostname, DESKTOP-U7R862J resolves to a loopback address: 127.0.1.1; using 10.255.255.254 instead (on interface lo)
25/04/15 22:35:10 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address


:: loading settings :: url = jar:file:/opt/spark/jars/ivy-2.5.1.jar!/org/apache/ivy/core/settings/ivysettings.xml


Ivy Default Cache set to: /home/suman/.ivy2/cache
The jars for the packages stored in: /home/suman/.ivy2/jars
org.mongodb.spark#mongo-spark-connector_2.12 added as a dependency
:: resolving dependencies :: org.apache.spark#spark-submit-parent-ee934838-392a-482d-8c3f-392062220cdb;1.0
	confs: [default]
	found org.mongodb.spark#mongo-spark-connector_2.12;3.0.1 in central
	found org.mongodb#mongodb-driver-sync;4.0.5 in central
	found org.mongodb#bson;4.0.5 in central
	found org.mongodb#mongodb-driver-core;4.0.5 in central
:: resolution report :: resolve 288ms :: artifacts dl 11ms
	:: modules in use:
	org.mongodb#bson;4.0.5 from central in [default]
	org.mongodb#mongodb-driver-core;4.0.5 from central in [default]
	org.mongodb#mongodb-driver-sync;4.0.5 from central in [default]
	org.mongodb.spark#mongo-spark-connector_2.12;3.0.1 from central in [default]
	---------------------------------------------------------------------
	|                  |            modules            ||   artifacts  

#### Load the Yellow Trip CSV file into Spark

In [2]:
taxi_df = spark.read.csv("yellow_tripdata_2025-01.csv", inferSchema=True, header=True)
taxi_df.show()

                                                                                

+--------+--------------------+---------------------+---------------+-------------+----------+------------------+------------+------------+------------+-----------+-----+-------+----------+------------+---------------------+------------+--------------------+-----------+
|VendorID|tpep_pickup_datetime|tpep_dropoff_datetime|passenger_count|trip_distance|RatecodeID|store_and_fwd_flag|PULocationID|DOLocationID|payment_type|fare_amount|extra|mta_tax|tip_amount|tolls_amount|improvement_surcharge|total_amount|congestion_surcharge|Airport_fee|
+--------+--------------------+---------------------+---------------+-------------+----------+------------------+------------+------------+------------+-----------+-----+-------+----------+------------+---------------------+------------+--------------------+-----------+
|       2| 2025-01-17 18:08:04|  2025-01-17 18:19:55|           NULL|          0.0|      NULL|              NULL|         161|         186|           0|      13.22|  0.0|    0.5|       0.

#### Check the schema: Spark should infer it automatically

In [3]:
# Display the schema
print("Dataset Schema:")
taxi_df.printSchema()

Dataset Schema:
root
 |-- VendorID: integer (nullable = true)
 |-- tpep_pickup_datetime: timestamp (nullable = true)
 |-- tpep_dropoff_datetime: timestamp (nullable = true)
 |-- passenger_count: integer (nullable = true)
 |-- trip_distance: double (nullable = true)
 |-- RatecodeID: integer (nullable = true)
 |-- store_and_fwd_flag: string (nullable = true)
 |-- PULocationID: integer (nullable = true)
 |-- DOLocationID: integer (nullable = true)
 |-- payment_type: integer (nullable = true)
 |-- fare_amount: double (nullable = true)
 |-- extra: double (nullable = true)
 |-- mta_tax: double (nullable = true)
 |-- tip_amount: double (nullable = true)
 |-- tolls_amount: double (nullable = true)
 |-- improvement_surcharge: double (nullable = true)
 |-- total_amount: double (nullable = true)
 |-- congestion_surcharge: double (nullable = true)
 |-- Airport_fee: double (nullable = true)



#### Basic Data Manipulation: Renaming Columns and Deriving Time Features

In [4]:
print("Performing data cleaning and transformations...")
cleaned_df = taxi_df \
    .withColumn("pickup_datetime", to_timestamp(col("tpep_pickup_datetime"))) \
    .withColumn("dropoff_datetime", to_timestamp(col("tpep_dropoff_datetime"))) \
    .withColumn("pickup_hour", hour(col("pickup_datetime"))) \
    .withColumn("pickup_day", dayofweek(col("pickup_datetime"))) \
    .withColumn("pickup_month", month(col("pickup_datetime"))) \
    .withColumn("pickup_year", year(col("pickup_datetime"))) \
    .withColumn("trip_duration_minutes", (col("dropoff_datetime").cast("long") - col("pickup_datetime").cast("long")) / 60) \
    .withColumn("date", date_format(col("pickup_datetime"), "yyyy-MM-dd")) \
    .drop("tpep_pickup_datetime", "tpep_dropoff_datetime")  # Drop original timestamp columns
    
    
# Take a random sample of 100,000 records: the machine has limited RAM and
# processing all ~3.5 million records is slow
cleaned_df = cleaned_df.orderBy(rand()).limit(100000)

Performing data cleaning and transformations...
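
To make the derived columns concrete, here is a minimal plain-Python sketch of what the transformation computes for a single trip. It is only an illustration and not part of the pipeline: `derive_trip_fields` is a hypothetical helper, the timestamps are made up, and it assumes whole-second timestamps. Spark's `dayofweek` numbers days from 1 (Sunday) to 7 (Saturday), which the sketch reproduces with `isoweekday`.

```python
from datetime import datetime


def derive_trip_fields(pickup: datetime, dropoff: datetime) -> dict:
    """Plain-Python mirror of the derived columns from the Spark cell."""
    return {
        "pickup_hour": pickup.hour,
        # Spark's dayofweek(): 1 = Sunday ... 7 = Saturday
        "pickup_day": pickup.isoweekday() % 7 + 1,
        "pickup_month": pickup.month,
        "pickup_year": pickup.year,
        # Difference in seconds divided by 60, like the cast("long") expression
        "trip_duration_minutes": int((dropoff - pickup).total_seconds()) / 60,
        "date": pickup.strftime("%Y-%m-%d"),
    }


# Illustrative timestamps (assumed, not taken from the dataset)
fields = derive_trip_fields(
    datetime(2025, 1, 1, 2, 28, 44),
    datetime(2025, 1, 1, 2, 32, 20),
)
print(fields)
```

A trip of 3 minutes 36 seconds comes out as a `trip_duration_minutes` of 3.6, and a pickup on Wednesday, 1 January 2025 gets a `pickup_day` of 4.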


In [5]:

print("Sample of transformed data:")
cleaned_df.show(5)

Sample of transformed data:


                                                                                

+--------+---------------+-------------+----------+------------------+------------+------------+------------+-----------+-----+-------+----------+------------+---------------------+------------+--------------------+-----------+-------------------+-------------------+-----------+----------+------------+-----------+---------------------+----------+
|VendorID|passenger_count|trip_distance|RatecodeID|store_and_fwd_flag|PULocationID|DOLocationID|payment_type|fare_amount|extra|mta_tax|tip_amount|tolls_amount|improvement_surcharge|total_amount|congestion_surcharge|Airport_fee|    pickup_datetime|   dropoff_datetime|pickup_hour|pickup_day|pickup_month|pickup_year|trip_duration_minutes|      date|
+--------+---------------+-------------+----------+------------------+------------+------------+------------+-----------+-----+-------+----------+------------+---------------------+------------+--------------------+-----------+-------------------+-------------------+-----------+----------+------------

#### Load the cleaned data in MongoDB

In [6]:
print("Writing data to MongoDB...")
cleaned_df.write.format("mongo").mode("overwrite").save()
print("Data successfully loaded into MongoDB collection: raw_taxi_data")

Writing data to MongoDB...


[Stage 5:>                                                          (0 + 1) / 1]

Data successfully loaded into MongoDB collection: raw_taxi_data


                                                                                

## Part II: Load the data from local database, i.e. MongoDB

#### Load via Spark

In [7]:
# Load data from MongoDB
print("Loading data from MongoDB...")
mongo_df = spark.read.format("mongo").load()

Loading data from MongoDB...


                                                                                

In [8]:
# Show sample data from MongoDB
print("Sample data from MongoDB:")
mongo_df.show(5)

Sample data from MongoDB:
+-----------+------------+------------+----------+--------+--------------------+--------------------+----------+-------------------+-----+-----------+---------------------+-------+---------------+------------+-------------------+----------+-----------+------------+-----------+------------------+----------+------------+------------+-------------+---------------------+
|Airport_fee|DOLocationID|PULocationID|RatecodeID|VendorID|                 _id|congestion_surcharge|      date|   dropoff_datetime|extra|fare_amount|improvement_surcharge|mta_tax|passenger_count|payment_type|    pickup_datetime|pickup_day|pickup_hour|pickup_month|pickup_year|store_and_fwd_flag|tip_amount|tolls_amount|total_amount|trip_distance|trip_duration_minutes|
+-----------+------------+------------+----------+--------+--------------------+--------------------+----------+-------------------+-----+-----------+---------------------+-------+---------------+------------+-------------------+-----

#### Setup MongoDB connection

In [9]:
def connect_to_mongodb():
    """Connect to MongoDB"""
    try:
        # Update with your MongoDB connection details if different
        client = pymongo.MongoClient("mongodb://localhost:27017/")
        # Assuming you have a database called bigdata_demo with raw_taxi_data collection
        db = client["bigdata_demo"]
        collection = db["raw_taxi_data"]
        
        print("Successfully connected to MongoDB")
        print(f"Collection count: {collection.count_documents({})}")
        return collection
    except Exception as e:
        print(f"Error connecting to MongoDB: {e}")
        return None


#### Setup HBase connection

In [10]:
def connect_to_hbase():
    """Connect to HBase running in Docker"""
    try:
        # Connect to HBase Thrift server running on default port 9090
        connection = happybase.Connection('localhost', port=9090, timeout=300000, autoconnect=True)
        print("Successfully connected to HBase")
        return connection
    except Exception as e:
        print(f"Error connecting to HBase: {e}")
        return None

#### Create the Table in HBase

In [11]:

def create_hbase_table(connection):
    """Create HBase table for taxi data"""
    try:
        # Define the table schema
        families = {
            'trips': dict(max_versions=1),     # Trip details
            'payment': dict(max_versions=1),   # Payment information 
            'location': dict(max_versions=1)   # Location information
        }
        
        if b'taxi_trips' in connection.tables():
            print("Dropping existing taxi_trips table for demo...")
            connection.delete_table('taxi_trips', disable=True)
        
        # Create the table
        print("Creating taxi_trips table in HBase...")
        connection.create_table('taxi_trips', families)
        print("Table created successfully!")
        
        return True
    except Exception as e:
        print(f"Error creating HBase table: {e}")
        return False

#### Convert MongoDB documents into equivalent HBase records

In [12]:

def mongodb_to_hbase_record(mongo_doc):
    """Convert a MongoDB document to HBase record format"""
    
    date_str = mongo_doc.get('date', datetime.now().strftime('%Y-%m-%d'))
    date_formatted = date_str.replace('-', '')
    row_key = f"{date_formatted}-{str(uuid.uuid4())}"
    
    # Prepare the HBase record
    hbase_record = {
        'row_key': row_key,
        'trips': {},
        'payment': {},
        'location': {}
    }
    
    # Map MongoDB fields to HBase column families
    # Trips related fields
    trip_fields = ['pickup_datetime', 'dropoff_datetime', 'pickup_hour', 'pickup_day',
                  'pickup_month', 'pickup_year', 'trip_duration_minutes', 'passenger_count']
    
    # Payment related fields
    payment_fields = ['fare_amount', 'extra', 'mta_tax', 'tip_amount', 'tolls_amount',
                     'improvement_surcharge', 'total_amount', 'congestion_surcharge', 'payment_type']
    
    # Location related fields
    location_fields = ['PULocationID', 'DOLocationID', 'trip_distance']
    
    # Fill the HBase record from MongoDB document
    for field in trip_fields:
        if field in mongo_doc:
            hbase_record['trips'][field] = str(mongo_doc[field])
    
    for field in payment_fields:
        if field in mongo_doc:
            hbase_record['payment'][field] = str(mongo_doc[field])
    
    for field in location_fields:
        if field in mongo_doc:
            hbase_record['location'][field] = str(mongo_doc[field])
    
    return hbase_record
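
The row key built above is the pickup date (as `YYYYMMDD`) followed by a random UUID. HBase keeps rows sorted by key, so all trips for one date sit next to each other and can be read with a prefix scan. The sketch below shows the key shape in isolation; `make_row_key` and `date_prefix` are hypothetical helper names used only for this illustration.

```python
import uuid


def make_row_key(date_str: str) -> str:
    """Build a row key shaped like the one in mongodb_to_hbase_record."""
    return f"{date_str.replace('-', '')}-{uuid.uuid4()}"


def date_prefix(date_str: str) -> bytes:
    """Byte prefix shared by every row key for a given pickup date."""
    return date_str.replace("-", "").encode()


key = make_row_key("2025-01-01")
prefix = date_prefix("2025-01-01")
print(key, key.encode().startswith(prefix))

# With a live happybase table, the prefix could drive a date-scoped scan:
#   for row_key, data in table.scan(row_prefix=prefix): ...
```

Because the UUID part is random, converting the same document twice yields two different keys, so re-running a transfer without dropping the table first would presumably create duplicate rows.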


#### Insert the Batched Data into HBase

In [13]:

processed_count = 0
processed_lock = Lock()
progress_update_interval = 5
last_progress_time = time.time()

def insert_batch_to_hbase(batch, host='localhost', port=9090):
    """
    Insert a batch of records into HBase
    Each worker thread/process will have its own connection
    """
    connection = None
    try:
        # Create a new connection for this thread/process
        connection = happybase.Connection(host=host, port=port, timeout=30000)
        table = connection.table('taxi_trips')
        
        for record in batch:
            # Prepare data in HBase format
            data = {}
            
            # Add trip details
            for column, value in record['trips'].items():
                data[f'trips:{column}'.encode()] = str(value).encode()
                
            # Add payment details
            for column, value in record['payment'].items():
                data[f'payment:{column}'.encode()] = str(value).encode()
                
            # Add location details
            for column, value in record['location'].items():
                data[f'location:{column}'.encode()] = str(value).encode()
            
            # Insert record
            table.put(record['row_key'].encode(), data)
        
        # Update progress counter
        global processed_count, processed_lock, last_progress_time
        with processed_lock:
            processed_count += len(batch)
            current_time = time.time()
            if current_time - last_progress_time >= progress_update_interval:
                print(f"Progress: {processed_count} records processed")
                last_progress_time = current_time
        
        return len(batch)
    except Exception as e:
        print(f"Error in worker thread/process: {e}")
        return 0
    finally:
        # Close the connection even if an insert failed
        if connection is not None:
            connection.close()
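
The function above sends one `put` per record. happybase also provides `Table.batch()`, a context manager that buffers mutations and sends them together, which may reduce round trips to the Thrift server. The sketch below is an alternative and not what this notebook ran: `put_records_batched` is a hypothetical helper, and `table` is assumed to behave like a `happybase.Table`.

```python
def put_records_batched(table, records, batch_size=1000):
    """Write records through a happybase Batch instead of one put per row.

    `records` use the dict layout produced by mongodb_to_hbase_record.
    """
    written = 0
    # The batch flushes after every `batch_size` puts and again on exit
    with table.batch(batch_size=batch_size) as batch:
        for record in records:
            data = {}
            for family in ("trips", "payment", "location"):
                for column, value in record[family].items():
                    data[f"{family}:{column}".encode()] = str(value).encode()
            batch.put(record["row_key"].encode(), data)
            written += 1
    return written

# Assumed usage with a live connection:
#   table = happybase.Connection("localhost", port=9090).table("taxi_trips")
#   put_records_batched(table, hbase_records)
```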

##### Since the data is large and Spark has no out-of-the-box support for HBase, I am using asynchronous batched processing to make the transfer faster.
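
Before the real implementation, here is a minimal sketch of the fan-out arithmetic with the HBase write replaced by a stub. `fake_insert` and `fan_out` are hypothetical names for this illustration; the sizes mirror the run further down (100,000 documents, chunks of 10,000, batches of 1,000), while the worker count of 4 is an arbitrary assumption.

```python
import itertools
from concurrent.futures import ThreadPoolExecutor, as_completed


def chunker(iterable, chunk_size):
    """Yield successive lists of up to chunk_size items."""
    it = iter(iterable)
    while chunk := list(itertools.islice(it, chunk_size)):
        yield chunk


def fake_insert(batch):
    """Hypothetical stand-in for the HBase insert; reports the batch size."""
    return len(batch)


def fan_out(total_docs=100_000, chunk_size=10_000, batch_size=1_000, workers=4):
    futures = []
    with ThreadPoolExecutor(max_workers=workers) as executor:
        for chunk in chunker(range(total_docs), chunk_size):
            # Split each chunk into batches and hand each batch to a worker
            for i in range(0, len(chunk), batch_size):
                futures.append(executor.submit(fake_insert, chunk[i:i + batch_size]))
        processed = sum(f.result() for f in as_completed(futures))
    return len(futures), processed


print(fan_out())  # (100, 100000): 10 chunks x 10 batches each
```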

In [None]:

def chunker(iterable, chunk_size):
    """Yield chunks from an iterable"""
    it = iter(iterable)
    while True:
        chunk = list(itertools.islice(it, chunk_size))
        if not chunk:
            break
        yield chunk

def transfer_data_async(mongo_collection, hbase_connection, batch_size=100, num_workers=None):
    """Transfer data from MongoDB to HBase using multiple threads/processes"""
    try:
        
        
        if num_workers is None:
            num_workers = multiprocessing.cpu_count()
        
        print(f"Using {num_workers} worker threads for data transfer")
        
        
        total_docs = mongo_collection.count_documents({})
        print(f"Transferring {total_docs} documents from MongoDB to HBase...")
        
        
        global processed_count
        processed_count = 0
        
        
        start_time = time.time()
        
        
        cursor = mongo_collection.find({})
        
        
        print("Converting MongoDB documents to HBase records...")
        
        
        with concurrent.futures.ThreadPoolExecutor(max_workers=num_workers) as executor:
        
            futures = []
            
        
            for chunk in chunker(cursor, 10000):
                # Convert documents to HBase records
                hbase_records = [mongodb_to_hbase_record(doc) for doc in chunk]
                
                
                for i in range(0, len(hbase_records), batch_size):
                    batch = hbase_records[i:i + batch_size]
                    futures.append(executor.submit(insert_batch_to_hbase, batch))
            
            total_processed = 0
            for future in concurrent.futures.as_completed(futures):
                try:
                    batch_count = future.result()
                    total_processed += batch_count
                except Exception as exc:
                    print(f"Batch processing generated an exception: {exc}")
        
        
        total_time = time.time() - start_time
        print(f"Transfer complete! {total_processed} records transferred to HBase in {total_time:.2f} seconds")
        if total_time > 0:
            print(f"Transfer rate: {total_processed/total_time:.2f} records/second")
        
        return True
    except Exception as e:
        print(f"Error transferring data: {e}")
        return False

def transfer_data_with_queue(mongo_collection, hbase_connection, batch_size=100, num_workers=None):
    """
    Transfer data using a producer-consumer pattern with a queue
    This is an alternative implementation that can be more memory efficient
    """
    try:
        if num_workers is None:
            num_workers = multiprocessing.cpu_count()
        
        print(f"Using producer-consumer pattern with {num_workers} workers")
        
        
        total_docs = mongo_collection.count_documents({})
        print(f"Transferring {total_docs} documents from MongoDB to HBase...")
        
        
        global processed_count
        processed_count = 0     
        
        batch_queue = queue.Queue(maxsize=num_workers * 2)  # Buffer some batches
               
        done_producing = threading.Event()
        
        
        start_time = time.time()
        
        
        def consumer():
            while not (done_producing.is_set() and batch_queue.empty()):
                try:
                    batch = batch_queue.get(timeout=1)
                    insert_batch_to_hbase(batch)
                    batch_queue.task_done()
                except queue.Empty:
                    continue
                except Exception as e:
                    print(f"Consumer error: {e}")
        
        
        consumers = []
        for _ in range(num_workers):
            t = threading.Thread(target=consumer)
            t.daemon = True
            t.start()
            consumers.append(t)
        
        
        try:
            cursor = mongo_collection.find({})
            current_batch = []
            
            for doc in cursor:
                
                record = mongodb_to_hbase_record(doc)
                current_batch.append(record)
                                
                if len(current_batch) >= batch_size:
                    batch_queue.put(current_batch)
                    current_batch = []
            
            
            if current_batch:
                batch_queue.put(current_batch)
            
        finally:
            
            done_producing.set()
                
        batch_queue.join()        
        for t in consumers:
            t.join(timeout=1)
        
        total_time = time.time() - start_time
        print(f"Transfer complete! {processed_count} records transferred to HBase in {total_time:.2f} seconds")
        if total_time > 0:
            print(f"Transfer rate: {processed_count/total_time:.2f} records/second")
        
        return True
    except Exception as e:
        print(f"Error transferring data: {e}")
        return False
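
The queue-based version above stops its consumers by polling an `Event` with a one-second timeout. A common alternative is to push one sentinel object per consumer once the producer is done, so each consumer exits as soon as it sees it. The sketch below shows that variant in isolation; `run_pipeline` and `handle_batch` are hypothetical names, and `handle_batch` is assumed not to raise (as `insert_batch_to_hbase` catches its own errors).

```python
import queue
import threading

_STOP = object()  # sentinel telling a consumer to exit


def run_pipeline(items, handle_batch, batch_size=100, num_workers=4):
    """Producer-consumer transfer that shuts down via sentinels."""
    batch_queue = queue.Queue(maxsize=num_workers * 2)
    processed = 0
    lock = threading.Lock()

    def consumer():
        nonlocal processed
        while True:
            batch = batch_queue.get()
            if batch is _STOP:
                break
            count = handle_batch(batch)
            with lock:
                processed += count

    workers = [threading.Thread(target=consumer) for _ in range(num_workers)]
    for worker in workers:
        worker.start()

    batch = []
    try:
        for item in items:
            batch.append(item)
            if len(batch) >= batch_size:
                batch_queue.put(batch)
                batch = []
        if batch:
            batch_queue.put(batch)  # final partial batch
    finally:
        # One sentinel per consumer so every thread terminates
        for _ in workers:
            batch_queue.put(_STOP)

    for worker in workers:
        worker.join()
    return processed


print(run_pipeline(range(1050), len))  # 1050
```

Because the queue is bounded, the producer blocks whenever the consumers fall behind, which keeps memory use roughly proportional to `num_workers * batch_size` rather than to the size of the collection.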

#### Load the Data from MongoDB to HBase

In [15]:
mongo_client = pymongo.MongoClient("mongodb://localhost:27017/")
mongo_db = mongo_client["bigdata_demo"]
mongo_collection = mongo_db["raw_taxi_data"]

hbase_connection = connect_to_hbase()
create_hbase_table(hbase_connection)
transfer_data_async(mongo_collection, hbase_connection, batch_size=1000)

Successfully connected to HBase
Dropping existing taxi_trips table for demo...
Creating taxi_trips table in HBase...
Table created successfully!
Using 12 worker threads for data transfer
Transferring 100000 documents from MongoDB to HBase...
Converting MongoDB documents to HBase records...
Progress: 1000 records processed
Waiting for 100 batch operations to complete...
Progress: 13000 records processed
Progress: 25000 records processed
Progress: 37000 records processed
Progress: 49000 records processed
Progress: 61000 records processed
Progress: 73000 records processed
Progress: 85000 records processed
Transfer complete! 100000 records transferred to HBase in 60.92 seconds
Transfer rate: 1641.56 records/second


True

## Part III: Data analysis in HBase

In [18]:
def query_sample_data():
    """Query and display sample data from the HBase table"""
    try:
        connection = connect_to_hbase()
        table = connection.table('taxi_trips')
        
        print("\n" + "="*80)
        print(" "*30 + "SAMPLE RECORDS FROM HBASE")
        print("="*80)
        
        # Scan the first 5 records
        records = []
        for key, data in table.scan(limit=5):
            # Decode binary data to string
            decoded_data = {col_name.decode(): value.decode() 
                           for col_name, value in data.items()}
            
            # Group by column family for readability
            grouped_data = {}
            for col_key, value in decoded_data.items():
                family, qualifier = col_key.split(':', 1)
                if family not in grouped_data:
                    grouped_data[family] = {}
                grouped_data[family][qualifier] = value
            
            # Add to records list
            records.append({
                'key': key.decode(),
                'data': grouped_data
            })
        
        # Display the records in a nicely formatted way
        for i, record in enumerate(records, 1):
            print(f"\n{'='*40} RECORD {i} {'='*40}")
            print(f"KEY: {record['key']}")
            
            for family, columns in record['data'].items():
                print(f"\n  {family.upper()} FAMILY:")
                print("  " + "-"*50)
                
                # Get max column name length for alignment
                max_col_len = max([len(col) for col in columns.keys()]) if columns else 0
                
                # Print columns with aligned values
                for col, val in columns.items():
                    # Truncate very long values
                    display_val = val
                    if len(val) > 50:
                        display_val = val[:47] + "..."
                    
                    print(f"  {col.ljust(max_col_len)} : {display_val}")
        
        # Count total records
        print("\n" + "="*80)
        print(" "*30 + "RECORD COUNT SUMMARY")
        print("="*80)
        
        # Show progress while counting
        row_count = 0
        chunk_size = 1000
        print("Counting records", end="", flush=True)
        
        for _ in table.scan(batch_size=chunk_size, columns=[b'trips:pickup_datetime']):
            row_count += 1
            if row_count % chunk_size == 0:
                print(".", end="", flush=True)
        
        print(f"\n\nTotal records in HBase table: {row_count:,}")
        print("="*80 + "\n")
        
        return True
    except Exception as e:
        print(f"\n❌ Error querying data: {e}")
        return False

In [19]:
query_sample_data()

Successfully connected to HBase

                              SAMPLE RECORDS FROM HBASE

KEY: 20250101-001ccea6-3ffb-4d5b-aa53-f80069f32c3c

  LOCATION FAMILY:
  --------------------------------------------------
  DOLocationID  : 114
  PULocationID  : 211
  trip_distance : 0.72

  PAYMENT FAMILY:
  --------------------------------------------------
  congestion_surcharge  : 2.5
  extra                 : 1.0
  fare_amount           : 5.8
  improvement_surcharge : 1.0
  mta_tax               : 0.5
  payment_type          : 2
  tip_amount            : 0.0
  tolls_amount          : 0.0
  total_amount          : 10.8

  TRIPS FAMILY:
  --------------------------------------------------
  dropoff_datetime      : 2024-12-31 20:47:20
  passenger_count       : 3
  pickup_datetime       : 2024-12-31 20:43:44
  pickup_day            : 4
  pickup_hour           : 2
  pickup_month          : 1
  pickup_year           : 2025
  trip_duration_minutes : 3.6

KEY: 20250101-00330f03-2bc7-41fb-b40b-3751

True
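
As a next step for the analysis, the scanned rows can be aggregated in plain Python. The sketch below averages `payment:fare_amount` per `trips:pickup_hour`; `average_fare_by_hour` is a hypothetical helper, and the sample rows are made up to match the shape `table.scan()` yields (bytes keys and values), not taken from the dataset.

```python
from collections import defaultdict


def average_fare_by_hour(rows):
    """Average payment:fare_amount per trips:pickup_hour.

    `rows` is any iterable of (row_key, data) pairs with bytes keys and
    values, the shape yielded by happybase's Table.scan().
    """
    totals = defaultdict(lambda: [0.0, 0])  # hour -> [fare sum, trip count]
    for _key, data in rows:
        hour = data.get(b"trips:pickup_hour")
        fare = data.get(b"payment:fare_amount")
        if hour is None or fare is None:
            continue  # skip rows missing either column
        try:
            h, f = int(hour.decode()), float(fare.decode())
        except ValueError:
            continue  # e.g. a null that was stored as the string "None"
        totals[h][0] += f
        totals[h][1] += 1
    return {h: s / n for h, (s, n) in sorted(totals.items())}


# Made-up rows for illustration only
sample_rows = [
    (b"20250101-a", {b"trips:pickup_hour": b"2", b"payment:fare_amount": b"5.5"}),
    (b"20250101-b", {b"trips:pickup_hour": b"2", b"payment:fare_amount": b"10.5"}),
    (b"20250101-c", {b"trips:pickup_hour": b"9", b"payment:fare_amount": b"None"}),
]
print(average_fare_by_hour(sample_rows))
```

Against the live table, the same function could be fed with `table.scan(columns=[b'trips:pickup_hour', b'payment:fare_amount'])` so that only the two needed columns are transferred.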