# Introduction

This notebook demonstrates two techniques for working with large datasets in Pandas: reading a file in chunks and using the Parquet file format.

# What is chunking in Pandas?

In Pandas, chunking means reading or processing large datasets in smaller pieces (called chunks) instead of loading everything into memory at once.

This is especially useful when working with very large files that would otherwise exceed your available RAM.

Instead of loading a huge file all at once with `pd.read_csv()`, you read it incrementally and process each part separately.

👉 Lower memory usage

If we try to load very large files normally, we'll likely run out of memory.
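As a minimal sketch of the mechanism described above: when `chunksize` is passed, `read_csv` returns an iterator that yields one DataFrame per chunk. The tiny in-memory CSV and its column names are made up for illustration and only stand in for a large file on disk.

```python
import io
import pandas as pd

# Hypothetical toy CSV (10 data rows) standing in for a large file on disk
csv_text = "trip_id,total_amount\n" + "\n".join(f"{i},{i * 1.5}" for i in range(10))

# With chunksize set, read_csv returns an iterator instead of one big DataFrame
reader = pd.read_csv(io.StringIO(csv_text), chunksize=4)

chunk_lengths = []
for chunk in reader:
    # Each chunk is an ordinary DataFrame with at most 4 rows
    chunk_lengths.append(len(chunk))

print(chunk_lengths)  # [4, 4, 2]
```

Only one chunk has to be held in memory at a time, which is why the peak memory load stays small.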

# Chunking demo: Reducing memory load

For this demo, I will use data from the NYC TLC (New York City Taxi & Limousine Commission):

https://www.nyc.gov/site/tlc/about/tlc-trip-record-data.page 


In [1]:
# Specify file to use
DATASET_DIR = "/home/juebrauer/link_to_vcd/10_datasets/63_nyc_taxis"
FILE = DATASET_DIR + "/yellow_tripdata_2025-01.csv"
FILE

'/home/juebrauer/link_to_vcd/10_datasets/63_nyc_taxis/yellow_tripdata_2025-01.csv'

In [2]:
# How large is this file?
import os
size_bytes = os.path.getsize(FILE)
size_mb = size_bytes / (1024 ** 2)
print(f"File size: {size_mb:.1f} MB")

File size: 359.3 MB


In [3]:
# Read the complete file into memory in one go (this takes a few seconds)
import pandas
df = pandas.read_csv(FILE)
df.head()



Unnamed: 0,VendorID,tpep_pickup_datetime,tpep_dropoff_datetime,passenger_count,trip_distance,RatecodeID,store_and_fwd_flag,PULocationID,DOLocationID,payment_type,fare_amount,extra,mta_tax,tip_amount,tolls_amount,improvement_surcharge,total_amount,congestion_surcharge,Airport_fee,cbd_congestion_fee
0,1,2025-01-01 00:18:38,2025-01-01 00:26:59,1.0,1.6,1.0,N,229,237,1,10.0,3.5,0.5,3.0,0.0,1.0,18.0,2.5,0.0,0.0
1,1,2025-01-01 00:32:40,2025-01-01 00:35:13,1.0,0.5,1.0,N,236,237,1,5.1,3.5,0.5,2.02,0.0,1.0,12.12,2.5,0.0,0.0
2,1,2025-01-01 00:44:04,2025-01-01 00:46:01,1.0,0.6,1.0,N,141,141,1,5.1,3.5,0.5,2.0,0.0,1.0,12.1,2.5,0.0,0.0
3,2,2025-01-01 00:14:27,2025-01-01 00:20:01,3.0,0.52,1.0,N,244,244,2,7.2,1.0,0.5,0.0,0.0,1.0,9.7,0.0,0.0,0.0
4,2,2025-01-01 00:21:34,2025-01-01 00:25:06,3.0,0.66,1.0,N,244,116,2,5.8,1.0,0.5,0.0,0.0,1.0,8.3,0.0,0.0,0.0


In [4]:
# Which features are available?
df.columns

Index(['VendorID', 'tpep_pickup_datetime', 'tpep_dropoff_datetime',
       'passenger_count', 'trip_distance', 'RatecodeID', 'store_and_fwd_flag',
       'PULocationID', 'DOLocationID', 'payment_type', 'fare_amount', 'extra',
       'mta_tax', 'tip_amount', 'tolls_amount', 'improvement_surcharge',
       'total_amount', 'congestion_surcharge', 'Airport_fee',
       'cbd_congestion_fee'],
      dtype='object')

What each column means is described here:
https://www.nyc.gov/assets/tlc/downloads/pdf/data_dictionary_trip_records_yellow.pdf 

| Field Name | Description |
|-----------|-------------|
| VendorID | A code indicating the TPEP provider that provided the record.<br>1 = Creative Mobile Technologies, LLC<br>2 = Curb Mobility, LLC<br>6 = Myle Technologies Inc<br>7 = Helix |
| tpep_pickup_datetime | The date and time when the meter was engaged. |
| tpep_dropoff_datetime | The date and time when the meter was disengaged. |
| passenger_count | The number of passengers in the vehicle. |
| trip_distance | The elapsed trip distance in miles reported by the taximeter. |
| RatecodeID | The final rate code in effect at the end of the trip.<br>1 = Standard rate<br>2 = JFK<br>3 = Newark<br>4 = Nassau or Westchester<br>5 = Negotiated fare<br>6 = Group ride<br>99 = Null/unknown |
| store_and_fwd_flag | Indicates whether the trip record was held in vehicle memory before sending to the vendor ("store and forward").<br>Y = store and forward trip<br>N = not a store and forward trip |
| PULocationID | TLC Taxi Zone in which the taximeter was engaged. |
| DOLocationID | TLC Taxi Zone in which the taximeter was disengaged. |
| payment_type | A numeric code signifying how the passenger paid for the trip.<br>0 = Flex Fare trip<br>1 = Credit card<br>2 = Cash<br>3 = No charge<br>4 = Dispute<br>5 = Unknown<br>6 = Voided trip |
| fare_amount | The time-and-distance fare calculated by the meter. |
| extra | Miscellaneous extras and surcharges. |
| mta_tax | Tax that is automatically triggered based on the metered rate in use. |
| tip_amount | Tip amount. Automatically populated for credit card tips. Cash tips are not included. |
| tolls_amount | Total amount of all tolls paid in trip. |
| improvement_surcharge | Improvement surcharge assessed trips at the flag drop. The improvement surcharge began being levied in 2015. |
| total_amount | The total amount charged to passengers. Does not include cash tips. |
| congestion_surcharge | Total amount collected in trip for NYS congestion surcharge. |
| Airport_fee | For pick up only at LaGuardia and John F. Kennedy Airports. |
| cbd_congestion_fee | Per-trip charge for MTA's Congestion Relief Zone starting Jan 5, 2025. |



In [5]:
# This table is large
df.shape

(3475226, 20)

In [6]:
# Data type per column seems to be ok at first glance
# (but note that the two datetime columns are read as object, i.e., as strings)
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3475226 entries, 0 to 3475225
Data columns (total 20 columns):
 #   Column                 Dtype  
---  ------                 -----  
 0   VendorID               int64  
 1   tpep_pickup_datetime   object 
 2   tpep_dropoff_datetime  object 
 3   passenger_count        float64
 4   trip_distance          float64
 5   RatecodeID             float64
 6   store_and_fwd_flag     object 
 7   PULocationID           int64  
 8   DOLocationID           int64  
 9   payment_type           int64  
 10  fare_amount            float64
 11  extra                  float64
 12  mta_tax                float64
 13  tip_amount             float64
 14  tolls_amount           float64
 15  improvement_surcharge  float64
 16  total_amount           float64
 17  congestion_surcharge   float64
 18  Airport_fee            float64
 19  cbd_congestion_fee     float64
dtypes: float64(13), int64(4), object(3)
memory usage: 530.3+ MB


In [7]:
# However, something in the data is strange...
# -> Why are there negative fare/total amounts?
# -> Why is there a taxi ride with a total amount of $863K?
# -> Why is there a taxi ride with a trip distance of 276K miles?
# Also note that, e.g., some passenger_count values are missing in the table!
df.describe().transpose()

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
VendorID,3475226.0,1.785428,0.426328,1.0,2.0,2.0,2.0,7.0
passenger_count,2935077.0,1.297859,0.75075,0.0,1.0,1.0,1.0,9.0
trip_distance,3475226.0,5.855126,564.6016,0.0,0.98,1.67,3.1,276423.57
RatecodeID,2935077.0,2.482535,11.632772,1.0,1.0,1.0,1.0,99.0
PULocationID,3475226.0,165.191576,64.529483,1.0,132.0,162.0,234.0,265.0
DOLocationID,3475226.0,164.125177,69.401686,1.0,113.0,162.0,234.0,265.0
payment_type,3475226.0,1.036623,0.701333,0.0,1.0,1.0,1.0,5.0
fare_amount,3475226.0,17.081803,463.472918,-900.0,8.6,12.11,19.5,863372.12
extra,3475226.0,1.317737,1.861509,-7.5,0.0,0.0,2.5,15.0
mta_tax,3475226.0,0.478099,0.137462,-0.5,0.5,0.5,0.5,10.5


In [8]:
# Let us compute how much memory and time is needed
# to do some data processing
# We can see a high peak memory load

import tracemalloc
import time

tracemalloc.start()
t0 = time.perf_counter()

df = pandas.read_csv(FILE)
df2 = df.query("total_amount > 0 and total_amount < 1000")
df2 = df2.dropna()
mean_amount = df2["total_amount"].mean()
mean_nr_passengers = df2["passenger_count"].mean()

t1 = time.perf_counter()
current_a, peak_a = tracemalloc.get_traced_memory()
tracemalloc.stop()

print(f"Mean amount: ${mean_amount:.2f}")
print(f"Mean number of passengers: {mean_nr_passengers:.1f}")
print(f"Time: {t1 - t0:.2f} s")
print(f"Peak memory (tracemalloc): {peak_a / (1024**2):.1f} MB")



Mean amount: $27.45
Mean number of passengers: 1.3
Time: 5.91 s
Peak memory (tracemalloc): 1996.4 MB
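The measurement pattern used in the cell above (start `tracemalloc`, take a `perf_counter` timestamp, run the workload, read the peak) is repeated several times in this notebook. It could be wrapped in a small helper. The following is only a sketch: the name `measure` and the result dictionary are assumptions for illustration, and the list comprehension is a placeholder workload.

```python
import time
import tracemalloc
from contextlib import contextmanager

@contextmanager
def measure(result):
    """Store elapsed seconds and peak traced memory (in MB) in the dict `result`."""
    tracemalloc.start()
    t0 = time.perf_counter()
    try:
        yield result
    finally:
        result["seconds"] = time.perf_counter() - t0
        _, peak = tracemalloc.get_traced_memory()
        tracemalloc.stop()
        result["peak_mb"] = peak / (1024 ** 2)

stats = {}
with measure(stats):
    # Placeholder workload; in the notebook this would be the read_csv call
    data = [i * i for i in range(100_000)]

print(f"Time: {stats['seconds']:.2f} s")
print(f"Peak memory (tracemalloc): {stats['peak_mb']:.1f} MB")
```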


In [9]:
# Now the same with chunking
# The peak memory load will be much smaller!
CHUNK_SIZE = 100_000

import tracemalloc
import time

tracemalloc.start()
t0 = time.perf_counter()

sum_amount = 0
sum_nr_passengers = 0
N = 0
for chunk in pandas.read_csv(FILE, chunksize=CHUNK_SIZE):
    df2 = chunk.query("total_amount > 0 and total_amount < 1000")
    df2 = df2.dropna()
    sum_amount += df2["total_amount"].sum()
    sum_nr_passengers += df2["passenger_count"].sum()
    N += len(df2)
mean_amount = sum_amount / N
mean_nr_passengers = sum_nr_passengers / N    

t1 = time.perf_counter()
current_a, peak_a = tracemalloc.get_traced_memory()
tracemalloc.stop()

print(f"Mean amount: ${mean_amount:.2f}")
print(f"Mean number of passengers: {mean_nr_passengers:.1f}")
print(f"Time: {t1 - t0:.2f} s")
print(f"Peak memory (tracemalloc): {peak_a / (1024**2):.1f} MB")



Mean amount: $27.45
Mean number of passengers: 1.3
Time: 5.69 s
Peak memory (tracemalloc): 98.4 MB
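The chunked version accumulates sums and a row count and divides only once at the end. A small sketch with made-up numbers shows why: after filtering, chunks generally have different sizes, so averaging the per-chunk means would give a different (wrong) result.

```python
import pandas as pd

# Hypothetical toy data: five values split into two chunks of unequal size
values = pd.Series([10.0, 20.0, 30.0, 40.0, 100.0])
chunks = [values.iloc[0:4], values.iloc[4:5]]

# Correct: accumulate sum and count, divide once at the end
total = sum(c.sum() for c in chunks)
count = sum(len(c) for c in chunks)
global_mean = total / count

# Misleading: average the per-chunk means, which ignores the chunk sizes
mean_of_means = sum(c.mean() for c in chunks) / len(chunks)

print(global_mean)    # 40.0
print(mean_of_means)  # 62.5
```

Statistics that cannot be combined from simple running totals (e.g., the median) need a different strategy when processing data chunk by chunk.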


# What are Parquet Files?

Nowadays, many large data files are provided as Parquet files.

For example, the NYC TLC has switched to providing its data as Parquet files:
https://www.nyc.gov/assets/tlc/downloads/pdf/working_parquet_format.pdf

        Working With Parquet Format

        TLC is switching to the Parquet file type for storing raw trip data on our website.
        Parquet is the industry standard for working with big data.
        Using Parquet format results in reduced file sizes and increased speeds.
        However, we have been using the CSV format for a while and
        the Parquet format might be new to some users.

Parquet files are a columnar data storage format mainly used in big data and analytics.

They're designed to be fast, efficient, and compact, especially when working with large datasets.

- Benefit 1: Column-based storage
Instead of storing data row by row (like CSV), Parquet stores data column by column.
This makes reading specific columns much faster, because columns that are not needed can be skipped.

- Benefit 2: Highly compressed
Because values in a column are often similar, Parquet compresses data very well, saving disk space. Parquet is also a binary format, i.e., numbers are stored in a compact binary representation instead of as text.

        E.g.,

        1,DE,ACTIVE
        2,DE,ACTIVE
        3,DE,ACTIVE
        4,US,ACTIVE

        can be stored like this:

        id:      1, 2, 3, 4
        country: DE, DE, DE, US
        status:  ACTIVE, ACTIVE, ACTIVE, ACTIVE

        or much more efficiently like this using run-length encoding (RLE),
        which pays off for columns with many repeated values:

        id:      1, 2, 3, 4
        country: 3xDE, US
        status:  4xACTIVE


- Benefit 3: Schema-aware
Data types (int, string, timestamp, etc.) are stored with the file, so tools know exactly how to interpret the data.
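The run-length encoding idea from the Benefit 2 example can be sketched in a few lines of plain Python. This only illustrates the principle; it is not Parquet's actual on-disk encoding, which combines run-length encoding with other techniques such as dictionary encoding and bit-packing. The function names are chosen for this example.

```python
from itertools import groupby

def rle_encode(values):
    """Collapse runs of consecutive equal values into (value, run_length) pairs."""
    return [(value, len(list(run))) for value, run in groupby(values)]

def rle_decode(pairs):
    """Expand (value, run_length) pairs back into the original list."""
    return [value for value, n in pairs for _ in range(n)]

ids = [1, 2, 3, 4]
country = ["DE", "DE", "DE", "US"]
status = ["ACTIVE", "ACTIVE", "ACTIVE", "ACTIVE"]

print(rle_encode(country))  # [('DE', 3), ('US', 1)]
print(rle_encode(status))   # [('ACTIVE', 4)]
print(rle_encode(ids))      # [(1, 1), (2, 1), (3, 1), (4, 1)]
```

The `id` column shows the limit of the technique: when all values differ, every run has length 1 and nothing is saved.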

Common tools that use Parquet:
- Apache Spark
- Apache Hive
- Apache Arrow
- Pandas (via PyArrow or Fastparquet)
- BigQuery, Snowflake, AWS Athena



# Historical background of Parquet files

The Parquet file format was invented by engineers at Twitter and Cloudera.

More specifically:
- Julien Le Dem (then at Twitter)
- Nong Li (then at Cloudera)
- Along with further contributors from both companies

When and why:
- Created around 2013
- Built to solve performance and storage problems in Hadoop-based analytics
- Designed as a columnar, compressed, analytics-optimized format

What happened next:
- Parquet was open-sourced
- Donated to the Apache Software Foundation
- Became Apache Parquet, now an industry standard

# Parquet file demo: Correct data types, reduced disk and RAM usage

In [10]:
# Specify files to use
DATASET_DIR = "/home/juebrauer/link_to_vcd/10_datasets/63_nyc_taxis"
FILE1 = DATASET_DIR + "/yellow_tripdata_2025-01.csv"
FILE2 = DATASET_DIR + "/yellow_tripdata_2025-01.parquet"
print(FILE1)
print(FILE2)

/home/juebrauer/link_to_vcd/10_datasets/63_nyc_taxis/yellow_tripdata_2025-01.csv
/home/juebrauer/link_to_vcd/10_datasets/63_nyc_taxis/yellow_tripdata_2025-01.parquet


In [11]:
# How large are these files?
import os

size_mb = os.path.getsize(FILE1) / (1024**2)
print(f"File size {FILE1}: {size_mb:.1f} MB")

size_mb = os.path.getsize(FILE2) / (1024**2)
print(f"File size {FILE2}: {size_mb:.1f} MB")

File size /home/juebrauer/link_to_vcd/10_datasets/63_nyc_taxis/yellow_tripdata_2025-01.csv: 359.3 MB
File size /home/juebrauer/link_to_vcd/10_datasets/63_nyc_taxis/yellow_tripdata_2025-01.parquet: 56.4 MB


In [12]:
# Read in .csv file and observe
# reading time and memory usage

import tracemalloc
import time
tracemalloc.start()
t0 = time.perf_counter()

df = pandas.read_csv(FILE1)

t1 = time.perf_counter()
current_a, peak_a = tracemalloc.get_traced_memory()
tracemalloc.stop()
print(f"Time: {t1 - t0:.2f} s")
print(f"Peak memory (tracemalloc): {peak_a / (1024**2):.1f} MB")



Time: 5.36 s
Peak memory (tracemalloc): 1996.3 MB


In [13]:
# Show memory usage and data types
# Please note, ...
# 1. the data types for the pickup/dropoff datetime timestamps are not correct
# 2. since the datetime information is stored internally as object (strings), we have a high memory load
df.info(memory_usage="deep")

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3475226 entries, 0 to 3475225
Data columns (total 20 columns):
 #   Column                 Dtype  
---  ------                 -----  
 0   VendorID               int64  
 1   tpep_pickup_datetime   object 
 2   tpep_dropoff_datetime  object 
 3   passenger_count        float64
 4   trip_distance          float64
 5   RatecodeID             float64
 6   store_and_fwd_flag     object 
 7   PULocationID           int64  
 8   DOLocationID           int64  
 9   payment_type           int64  
 10  fare_amount            float64
 11  extra                  float64
 12  mta_tax                float64
 13  tip_amount             float64
 14  tolls_amount           float64
 15  improvement_surcharge  float64
 16  total_amount           float64
 17  congestion_surcharge   float64
 18  Airport_fee            float64
 19  cbd_congestion_fee     float64
dtypes: float64(13), int64(4), object(3)
memory usage: 1.0 GB


In [14]:
# Show memory usage per column
df.memory_usage(deep=True)

Index                          132
VendorID                  27801808
tpep_pickup_datetime     236315368
tpep_dropoff_datetime    236315368
passenger_count           27801808
trip_distance             27801808
RatecodeID                27801808
store_and_fwd_flag       164038618
PULocationID              27801808
DOLocationID              27801808
payment_type              27801808
fare_amount               27801808
extra                     27801808
mta_tax                   27801808
tip_amount                27801808
tolls_amount              27801808
improvement_surcharge     27801808
total_amount              27801808
congestion_surcharge      27801808
Airport_fee               27801808
cbd_congestion_fee        27801808
dtype: int64

In [15]:
# Compute total memory usage based on the column memory usages
mem_bytes = df.memory_usage(deep=True).sum()
mem_mb = mem_bytes / 1024**2
print(f"{mem_mb:.2f} MB")

1057.91 MB


In [16]:
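# Read in .parquet file and observe
# reading time and memory usage
# Caution: tracemalloc only traces memory allocated through Python's
# memory allocators. Buffers allocated natively by the Parquet engine
# (e.g., PyArrow) are presumably not counted, so the peak reported here
# likely underestimates the real load (compare with df.memory_usage below).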

import tracemalloc
import time
tracemalloc.start()
t0 = time.perf_counter()

df = pandas.read_parquet(FILE2)

t1 = time.perf_counter()
current_a, peak_a = tracemalloc.get_traced_memory()
tracemalloc.stop()
print(f"Time: {t1 - t0:.2f} s")
print(f"Peak memory (tracemalloc): {peak_a / (1024**2):.1f} MB")

Time: 0.30 s
Peak memory (tracemalloc): 57.5 MB


In [17]:
# Now, the pickup/dropoff information is correctly stored as datetime timestamps
# resulting in a smaller memory load for the table
df.info(memory_usage="deep")

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3475226 entries, 0 to 3475225
Data columns (total 20 columns):
 #   Column                 Dtype         
---  ------                 -----         
 0   VendorID               int32         
 1   tpep_pickup_datetime   datetime64[us]
 2   tpep_dropoff_datetime  datetime64[us]
 3   passenger_count        float64       
 4   trip_distance          float64       
 5   RatecodeID             float64       
 6   store_and_fwd_flag     object        
 7   PULocationID           int32         
 8   DOLocationID           int32         
 9   payment_type           int64         
 10  fare_amount            float64       
 11  extra                  float64       
 12  mta_tax                float64       
 13  tip_amount             float64       
 14  tolls_amount           float64       
 15  improvement_surcharge  float64       
 16  total_amount           float64       
 17  congestion_surcharge   float64       
 18  Airport_fee           

In [18]:
# Show memory usage per column
df.memory_usage(deep=True)

Index                          132
VendorID                  13900904
tpep_pickup_datetime      27801808
tpep_dropoff_datetime     27801808
passenger_count           27801808
trip_distance             27801808
RatecodeID                27801808
store_and_fwd_flag       159717426
PULocationID              13900904
DOLocationID              13900904
payment_type              27801808
fare_amount               27801808
extra                     27801808
mta_tax                   27801808
tip_amount                27801808
tolls_amount              27801808
improvement_surcharge     27801808
total_amount              27801808
congestion_surcharge      27801808
Airport_fee               27801808
cbd_congestion_fee        27801808
dtype: int64
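The per-column numbers above follow directly from the row count and the fixed size of each dtype. A quick arithmetic check, assuming only the row count reported by `df.shape` earlier in this notebook:

```python
import numpy as np

n_rows = 3_475_226  # row count reported by df.shape

for name in ("int32", "int64", "float64", "datetime64[us]"):
    itemsize = np.dtype(name).itemsize  # bytes per value
    print(f"{name:>15}: {itemsize} bytes/row -> {n_rows * itemsize:,} bytes")
```

The `int32` columns need 4 bytes per row (13,900,904 bytes), all other fixed-width columns 8 bytes per row (27,801,808 bytes). Only the `object` column `store_and_fwd_flag` has no fixed size per value.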

In [19]:
# Compute total memory usage based on the column memory usages
mem_bytes = df.memory_usage(deep=True).sum()
mem_mb = mem_bytes / 1024**2
print(f"{mem_mb:.2f} MB")

616.31 MB
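Most of the difference between the two totals comes from the datetime columns. The effect can be reproduced on a small synthetic table: the sketch below uses a made-up column `pickup`, stores the timestamps once as strings (as `read_csv` does by default) and once as proper datetimes, and compares the memory usage.

```python
import pandas as pd

# Hypothetical toy frame: timestamps stored as strings, as after a plain read_csv
n = 10_000
stamps = pd.date_range("2025-01-01", periods=n, freq="min")
df_str = pd.DataFrame({"pickup": stamps.strftime("%Y-%m-%d %H:%M:%S")})
bytes_as_str = df_str["pickup"].memory_usage(deep=True, index=False)

# Convert the string column to a real datetime column (8 bytes per value)
df_dt = df_str.assign(pickup=pd.to_datetime(df_str["pickup"]))
bytes_as_dt = df_dt["pickup"].memory_usage(deep=True, index=False)

print(f"as strings:  {bytes_as_str:,} bytes")
print(f"as datetime: {bytes_as_dt:,} bytes")
```

When a CSV file has to be used, a similar saving should be achievable by letting `read_csv` parse the datetime columns directly via its `parse_dates` parameter.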
