# Reading a big data file with two different approaches

This notebook compares two methods for reading a large CSV file (`train.csv`) in Python. The workload is a stand-in computation: counting the total number of rows. The two methods are:

- Using the **`chunksize` parameter in Pandas**.
- Using the **Dask** library.

The goal is to measure and compare the time taken by each approach to process a file with 135,000,000 rows (about 5.5 GB). A third run repeats the Pandas approach on a gzip-compressed copy of the same file.

In [1]:
import dask.dataframe as dd
import pandas as pd
import time
import psutil 
import os

## Method 1: Using Pandas with chunksize

This method reads the large CSV file in smaller chunks to avoid loading the entire file into memory at once. It iterates through the file piece by piece, counting the rows in each chunk.

In [2]:
process = psutil.Process(os.getpid())
file_path = './data/train.csv'
chunk_size = 150_000
total_rows = 0

# Start timing
start_time = time.time()
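start_mem = process.memory_info().rss / (1024 * 1024)  # Convert bytes to MB
# Note: peak_mem is never updated below, so the "peak" printed later equals the starting value
peak_mem = start_mem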
print(f"Starting RAM: {start_mem:.2f} MB")


chunk_iterator = pd.read_csv(file_path, chunksize=chunk_size)

# Loop through each chunk
for chunk in chunk_iterator:
    total_rows += len(chunk)

# Stop timing
end_time = time.time()
time_taken = end_time - start_time
end_mem = process.memory_info().rss / (1024 * 1024)

print(f"Total rows: {total_rows}")
print(f"Time taken with Pandas (chunksize): {time_taken:.2f} seconds")
print(f"Peak RAM usage during processing: {peak_mem:.2f} MB")
print(f"Ending RAM usage: {end_mem:.2f} MB")

Starting RAM: 170.52 MB
Total rows: 135000000
Time taken with Pandas (chunksize): 98.38 seconds
Peak RAM usage during processing: 170.52 MB
Ending RAM usage: 195.41 MB
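
In the cell above, `peak_mem` is set once before the loop and never updated, so the "Peak RAM" line repeats the starting value. One possible fix, shown here as a minimal sketch, is to sample memory after every chunk and keep the maximum. The helper name `count_rows_with_peak` and its `read_mem_mb` callback are hypothetical, and the demo feeds it scripted readings so that it runs without the data file.

```python
def count_rows_with_peak(chunks, read_mem_mb):
    """Count rows across chunks and track the highest memory reading seen.

    chunks      -- any iterable of objects supporting len() (e.g. DataFrames)
    read_mem_mb -- zero-argument callable returning current memory use in MB
    """
    peak_mb = read_mem_mb()  # baseline before any chunk is read
    total_rows = 0
    for chunk in chunks:
        total_rows += len(chunk)
        # Sample while the chunk is still alive and keep the maximum
        peak_mb = max(peak_mb, read_mem_mb())
    return total_rows, peak_mb


# Demo with stand-in data: two "chunks" and a scripted sequence of readings
fake_readings = iter([100.0, 150.0, 120.0])
demo_rows, demo_peak = count_rows_with_peak(
    chunks=[[1, 2], [3]],
    read_mem_mb=lambda: next(fake_readings),
)
print(demo_rows, demo_peak)
```

In the notebook, the same helper could be called with `pd.read_csv(file_path, chunksize=chunk_size)` as `chunks` and `lambda: process.memory_info().rss / (1024 * 1024)` as `read_mem_mb`. Sampling once per chunk still misses any spike inside the parser itself, so the result is a lower bound on the true peak.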


## Method 2: Using Dask

Dask is a parallel computing library that is designed to scale natively from a single machine to a cluster. It can handle datasets that are larger than memory by breaking them into smaller, manageable pieces (similar to Pandas chunks) and processing them in parallel.
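
To make the "split into pieces, process in parallel" idea concrete, here is a minimal standard-library sketch. It is not how Dask is implemented: Dask parses each block into a Pandas DataFrame and schedules the work itself, whereas this sketch only counts newline bytes in fixed-size blocks using a thread pool. The function names and the tiny demo file are hypothetical, and the sketch assumes a single header line, a trailing newline, and no newlines inside quoted fields.

```python
import os
import tempfile
from concurrent.futures import ThreadPoolExecutor


def count_newlines(path, start, length):
    """Count newline bytes in one block of the file."""
    with open(path, "rb") as f:
        f.seek(start)
        return f.read(length).count(b"\n")


def parallel_row_count(path, block_size):
    """Count data rows by scanning fixed-size blocks in a thread pool.

    Assumes one header line, a trailing newline, and no newlines inside fields.
    """
    file_size = os.path.getsize(path)
    starts = range(0, file_size, block_size)
    with ThreadPoolExecutor() as pool:
        per_block = pool.map(lambda s: count_newlines(path, s, block_size), starts)
        total_newlines = sum(per_block)
    return total_newlines - 1  # subtract the header line


# Demo on a small temporary CSV: 1 header line + 1000 data rows
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, "demo.csv")
    with open(demo_path, "w", newline="") as f:
        f.write("id,value\n")
        for i in range(1000):
            f.write(f"{i},{i * 2}\n")
    demo_count = parallel_row_count(demo_path, block_size=64)

print(demo_count)
```

Because every byte belongs to exactly one block, the per-block newline counts can simply be summed; the blocks do not need to be aligned to line boundaries for a pure row count.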

In [3]:
file_path = './data/train.csv'
process = psutil.Process(os.getpid())

# Start timing
start_time = time.time()
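start_mem = process.memory_info().rss / (1024 * 1024)  # Convert bytes to MB
# Note: peak_mem is never updated, so the "peak" printed below equals the starting value
peak_mem = start_mem

# The 'attributed_time' column has missing values, which causes dtype issues,
# so tell Dask to read it as text (object).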
ddf = dd.read_csv(
    file_path,
    dtype={'attributed_time': 'object'},
    blocksize="8MB"
)

total_rows = len(ddf)

# Stop timing
end_time = time.time()
end_mem = process.memory_info().rss / (1024 * 1024)
time_taken = end_time - start_time

print(f"Total rows: {total_rows}")
print(f"Time taken with Dask: {time_taken:.2f} seconds")
print(f"Peak RAM usage during computation: {peak_mem:.2f} MB")
print(f"Ending RAM usage: {end_mem:.2f} MB")

Total rows: 135000000
Time taken with Dask: 54.76 seconds
Peak RAM usage during computation: 195.41 MB
Ending RAM usage: 1095.54 MB


## Method 3: Pandas with chunksize on Compressed File (.csv.gz)

This method follows Method 1 but reads from the gzip-compressed file (and, like the Dask run, reads the `attributed_time` column as text). Compression saves significant disk space, but we expect a performance penalty because the data must be decompressed on the fly as it is read.
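
As a rough illustration of what "decompressing on the fly" means, the sketch below streams a gzip-compressed CSV with the standard library and counts its rows without ever writing an uncompressed copy to disk. It is a stand-in for the idea, not a description of what Pandas does internally; the function name `count_rows_gzip` and the small demo file are hypothetical.

```python
import csv
import gzip
import os
import tempfile


def count_rows_gzip(path):
    """Count data rows in a gzip-compressed CSV without decompressing it to disk."""
    # Text mode ("rt") decompresses incrementally as the reader pulls lines
    with gzip.open(path, "rt", newline="") as f:
        reader = csv.reader(f)
        next(reader, None)  # skip the header row
        return sum(1 for _ in reader)


# Demo: write a small compressed CSV (1 header + 500 rows), then count it
with tempfile.TemporaryDirectory() as tmp_dir:
    gz_path = os.path.join(tmp_dir, "demo.csv.gz")
    with gzip.open(gz_path, "wt", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["id", "value"])
        writer.writerows((i, i * 2) for i in range(500))
    gz_rows = count_rows_gzip(gz_path)

print(gz_rows)
```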

In [7]:
process = psutil.Process(os.getpid())

zip_file_path = './data/train.csv.gz' 
chunk_size = 150_000
total_rows = 0

# Start timing
start_time = time.time()
start_mem = process.memory_info().rss / (1024 * 1024) 
print(f"Starting RAM: {start_mem:.2f} MB")

# With chunksize set, read_csv returns an iterator that also works as a context manager
with pd.read_csv(
    zip_file_path,
    compression='gzip',
    dtype={'attributed_time': 'object'},
    chunksize=chunk_size,
) as reader:
    
    for chunk in reader:
        total_rows += len(chunk)
 

# Stop timing
end_time = time.time()
time_taken = end_time - start_time
end_mem = process.memory_info().rss / (1024 * 1024)

print(f"Total rows: {total_rows}")
print(f"Time taken with Pandas (chunksize, gzip): {time_taken:.2f} seconds")
print(f"Ending RAM usage: {end_mem:.2f} MB")

Starting RAM: 1110.05 MB
Total rows: 135000000
Time taken with Pandas (chunksize, gzip): 115.04 seconds
Ending RAM usage: 1155.93 MB


## Comparison and Conclusion

Here's a summary of the performance for reading 135 million rows on this local machine:

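| Method | Time taken (s) | RAM at start (MB) | RAM at end (MB) | RAM increase (MB) |
| :--- | :---: | :---: | :---: | :---: |
| Pandas (chunksize) | 98.38 | 170.52 | 195.41 | 24.89 |
| Dask | 54.76 | 195.41 | 1095.54 | 900.13 |
| Pandas (chunksize, compressed file) | 115.04 | 1110.05 | 1155.93 | 45.88 |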

### Analysis

Here are the key observations from these results:

* **Time (Performance) ⏱️:** **Dask was the fastest**, finishing in **54.76 seconds**. The most likely reason is that Dask splits the file into blocks and parses them on multiple CPU cores at once, whereas the Pandas `chunksize` loop parses chunks one after another. The Pandas run took **98.38 seconds**, about 1.8 times as long. Reading the compressed file was the slowest at **115.04 seconds** (about 17% longer than the uncompressed Pandas run), which is expected given the extra CPU work of decompressing the data as it is read.

* **Memory (RAM) 💾:** The process's resident memory grew the least during the **Pandas (chunksize)** run, from **170.52 MB** to **195.41 MB** (about 25 MB). During the Dask run it grew from 195.41 MB to **1095.54 MB** (about 900 MB). The compressed Pandas run ended at **1155.93 MB**, but it started at 1110.05 MB because memory from the Dask cell was still held by the notebook process; the run itself added only about 46 MB. Note that the "peak" values printed by the first two cells are just the starting values, because `peak_mem` is never updated, so true peaks were not measured. The key takeaway is that the `chunksize` method is designed to keep memory low by holding only one chunk at a time.
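
The ratios and memory increases behind these observations can be recomputed from the printed cell outputs. The short sketch below does that arithmetic; the dictionary layout and key names are hypothetical, and the Dask starting value is taken from that cell's "Peak RAM" line, which equals its starting RAM because `peak_mem` is never updated.

```python
# Figures copied from the cell outputs above
runs = {
    "pandas_chunksize": {"seconds": 98.38, "start_mb": 170.52, "end_mb": 195.41},
    "dask": {"seconds": 54.76, "start_mb": 195.41, "end_mb": 1095.54},
    "pandas_chunksize_gzip": {"seconds": 115.04, "start_mb": 1110.05, "end_mb": 1155.93},
}

fastest = min(run["seconds"] for run in runs.values())

summary = {}
for name, run in runs.items():
    summary[name] = {
        # How many times longer this run took than the fastest run
        "slowdown_vs_fastest": round(run["seconds"] / fastest, 2),
        # Growth in resident memory over the course of the run
        "ram_increase_mb": round(run["end_mb"] - run["start_mb"], 2),
    }

for name, stats in summary.items():
    print(f"{name}: {stats['slowdown_vs_fastest']}x the fastest time, "
          f"+{stats['ram_increase_mb']} MB RAM")
```

Comparing the increase rather than the absolute ending value removes most of the carry-over effect from earlier cells, although running each method in a fresh kernel would give a cleaner baseline.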

### Conclusion

Based on this lab:

1.  **For maximum speed**, **Dask** was the best choice in this test, especially on a local machine with multiple CPU cores where its parallelism provides a significant advantage.
2.  **For memory-constrained environments**, **Pandas with `chunksize`** is the safer option: it added the least memory here because it holds only one chunk at a time.
3.  **For saving disk space**, using a **compressed file (`.gz`)** is effective, but it comes with a performance trade-off: it had the slowest read time in this test.