# Lab2: How to work with big data files (5GB+)

This notebook implements and compares three different methods for reading a very large CSV file (`train.csv` from the TalkingData AdTracking competition) to perform a simple computation (counting the total number of rows).

The three methods are:

1.  **Pandas with `chunksize`**: Reading the file in small pieces to keep memory usage low.
2.  **Dask**: Using a parallel computing library designed for large datasets.
3.  **Pandas with `chunksize` on a Compressed File**: Testing whether storing the data as a `.gz` file saves disk space at the cost of read time.

We will compare them based on the lab requirements: **Time** (wall clock time) and **Storage** (both Peak RAM usage and Disk Space).

## 0. Setup and Imports

First, let's install the required `kaggle` library and import all the packages we'll need for this analysis.

In [1]:
!pip install kaggle

import dask.dataframe as dd
import pandas as pd
import time
import psutil
import os
import sys
import gzip
import shutil

print("All libraries imported.")

All libraries imported.


## 2. Disk Storage Comparison

Before we test the read times, let's create the compressed file for Method 3 and compare the **disk storage** sizes, as required by the lab.

In [10]:
from google.colab import files
import os
import shutil
import gzip

# --- STEP 1: Upload your kaggle.json file ---
# Get it from https://www.kaggle.com/account -> "Create New API Token"
print("📂 Please upload your kaggle.json file")
uploaded = files.upload()  # Choose kaggle.json when prompted

📂 Please upload your kaggle.json file


Saving kaggle.json to kaggle.json


In [11]:
# --- STEP 2: Configure Kaggle credentials ---
os.makedirs(os.path.expanduser("~/.kaggle"), exist_ok=True)

# Move the uploaded file to the .kaggle directory
# This assumes you just uploaded it in the cell above
for fn in uploaded.keys():
    shutil.move(fn, os.path.expanduser("~/.kaggle/kaggle.json"))

!chmod 600 ~/.kaggle/kaggle.json
!kaggle config view  # optional: verify credentials

Configuration values from /root/.kaggle
- username: mohamedlifa
- path: None
- proxy: None
- competition: None


In [12]:
# --- STEP 3: Set up paths ---
data_dir = './data'
os.makedirs(data_dir, exist_ok=True)

competition_name = 'talkingdata-adtracking-fraud-detection'
train_file = os.path.join(data_dir, 'train.csv')
compressed_file = os.path.join(data_dir, 'train.csv.gz')
zip_path = os.path.join(data_dir, f"{competition_name}.zip")

print(f"Data directory set to: {data_dir}")
print(f"Train file path set to: {train_file}")

Data directory set to: ./data
Train file path set to: ./data/train.csv


In [13]:
# --- STEP 4: Download competition files ---
print("⬇️  Downloading competition dataset...")
!kaggle competitions download -c {competition_name} -p {data_dir}
print("Download complete.")

⬇️  Downloading competition dataset...
talkingdata-adtracking-fraud-detection.zip: Skipping, found more recently modified local copy (use --force to force download)
Download complete.


In [14]:
# --- STEP 5: Unzip only train.csv (or fallback to train_sample.csv) ---
print("\n📦 Extracting train.csv from the zip...")
# This command tries to unzip train.csv. If it fails, it tries to unzip train_sample.csv
!unzip -j -o {zip_path} train.csv -d {data_dir} || unzip -j -o {zip_path} train_sample.csv -d {data_dir}

print("Extraction attempt complete.")


📦 Extracting train.csv from the zip...
Archive:  ./data/talkingdata-adtracking-fraud-detection.zip
  inflating: ./data/train.csv        
Extraction attempt complete.


In [15]:
# --- STEP 6: Create compressed version for big-data lab ---
target = train_file if os.path.exists(train_file) else os.path.join(data_dir, "train_sample.csv")

if os.path.exists(target):
    print(f"\n🗜️  Creating compressed file for: {target}")
    print("...this may take 10+ minutes, please wait...")

    with open(target, 'rb') as f_in:
        with gzip.open(compressed_file, 'wb') as f_out:
            shutil.copyfileobj(f_in, f_out)

    print("✅ Setup complete! Compressed file saved at:", compressed_file)
else:
    print("❌ train.csv not found in the zip. Please check dataset contents.")


🗜️  Creating compressed file for: ./data/train.csv
...this may take 10+ minutes, please wait...
✅ Setup complete! Compressed file saved at: ./data/train.csv.gz
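
The cells above create the compressed file but never print the two sizes. A minimal sketch of that comparison follows, assuming the paths from STEP 3; the helper names `size_mb` and `compare_sizes` are hypothetical and not part of the original notebook.

```python
import os


def size_mb(path):
    """Return the on-disk size of `path` in MB (1 MB = 1024 * 1024 bytes)."""
    return os.path.getsize(path) / (1024 * 1024)


def compare_sizes(original, compressed):
    """Return (original MB, compressed MB, compressed size as a fraction of the original)."""
    orig_mb = size_mb(original)
    comp_mb = size_mb(compressed)
    ratio = comp_mb / orig_mb if orig_mb else float("nan")
    return orig_mb, comp_mb, ratio


# Paths as defined in STEP 3; adjust them if your layout differs.
train_file = os.path.join("./data", "train.csv")
compressed_file = os.path.join("./data", "train.csv.gz")

if os.path.exists(train_file) and os.path.exists(compressed_file):
    orig_mb, comp_mb, ratio = compare_sizes(train_file, compressed_file)
    print(f"Uncompressed: {orig_mb:,.2f} MB")
    print(f"Compressed:   {comp_mb:,.2f} MB ({ratio:.1%} of the original)")
else:
    print("Data files not found; run the download and compression steps first.")
```

The sizes use the same 1024 * 1024 divisor as the RAM measurements, so the disk and memory figures are in comparable units.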


## 3. Method 1: Using Pandas with `chunksize`

This method reads the uncompressed `train.csv` file in chunks of 150,000 rows to avoid loading the entire 5.5GB+ file into memory.

In [17]:
process = psutil.Process(os.getpid())
chunk_size = 150_000
total_rows = 0

print("Starting Method 1: Pandas (chunksize)")

# Start timing
start_time = time.time()
start_mem = process.memory_info().rss / (1024 * 1024)
peak_mem = start_mem # Initialize peak memory
print(f"Starting RAM: {start_mem:.2f} MB")

try:
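    # Pin 'attributed_time' to object: the column has mixed types (see Method 2)
    chunk_iterator = pd.read_csv(
        train_file,
        dtype={'attributed_time': 'object'},
        chunksize=chunk_size
    )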

    # Loop through each chunk
    for chunk in chunk_iterator:
        total_rows += len(chunk)

        # Track the highest RSS seen across chunks
        current_mem = process.memory_info().rss / (1024 * 1024)
        if current_mem > peak_mem:
            peak_mem = current_mem

    # Stop timing
    end_time = time.time()
    time_taken = end_time - start_time
    end_mem = process.memory_info().rss / (1024 * 1024)

    print("\n--- Results for Method 1 ---")
    print(f"Total rows: {total_rows}")
    print(f"Time taken: {time_taken:.2f} seconds")
    print(f"Peak RAM usage: {peak_mem:.2f} MB")
    print(f"Ending RAM usage: {end_mem:.2f} MB")

except FileNotFoundError:
    print(f"ERROR: File not found at {train_file}. Please check the download step.")

Starting Method 1: Pandas (chunksize)
Starting RAM: 611.36 MB





--- Results for Method 1 ---
Total rows: 184903890
Time taken: 140.00 seconds
Peak RAM usage: 650.68 MB
Ending RAM usage: 649.08 MB


## 4. Method 2: Using Dask

Dask is a parallel computing library. Like Method 1, it reads the file in blocks, but it can process several blocks at once on multiple CPU cores, which can make it faster when spare cores are available. Dask is "lazy": `dd.read_csv` only builds a task graph, and the file is actually read when we call `.compute()` or, in this case, `len()`.
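
With a small `blocksize`, the file is split into many partitions, and each one becomes a task the scheduler has to manage. The sketch below gives a rough estimate of the partition count as file size divided by blocksize. The helper name `approx_partitions` is hypothetical, and reading `"8MB"` as 8 * 10**6 bytes is an assumption; `ddf.npartitions` reports the actual number once the dataframe exists.

```python
import os


def approx_partitions(file_size_bytes, blocksize_bytes):
    """Rough number of blocks for a file of the given size (at least 1)."""
    if blocksize_bytes <= 0:
        raise ValueError("blocksize_bytes must be positive")
    return max(1, file_size_bytes // blocksize_bytes)


BLOCKSIZE = 8 * 10**6  # assumption: "8MB" is interpreted as 8 * 10**6 bytes
train_file = os.path.join("./data", "train.csv")  # path from STEP 3

if os.path.exists(train_file):
    n = approx_partitions(os.path.getsize(train_file), BLOCKSIZE)
    print(f"Roughly {n} partitions at an 8 MB blocksize")
```

If the estimate runs into the hundreds, trying a larger `blocksize` is a reasonable experiment, since fewer and larger partitions mean less scheduling overhead per row.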

In [18]:
process = psutil.Process(os.getpid())

print("Starting Method 2: Dask")

# Start timing
start_time = time.time()
start_mem = process.memory_info().rss / (1024 * 1024) # Convert to MB
print(f"Starting RAM: {start_mem:.2f} MB")

try:
    # Dask lazily creates a dataframe object
    # We must specify dtype for 'attributed_time' due to mixed types in the file
    ddf = dd.read_csv(
        train_file,
        dtype={'attributed_time': 'object'},
        blocksize="8MB"
    )

    ram_before_compute = process.memory_info().rss / (1024 * 1024)
    print(f"RAM after Dask setup (before compute): {ram_before_compute:.2f} MB")

    # This triggers the actual computation
    total_rows = len(ddf)

    # Stop timing
    end_time = time.time()
    time_taken = end_time - start_time
    ram_after_compute = process.memory_info().rss / (1024 * 1024)

    print("\n--- Results for Method 2 ---")
    print(f"Total rows: {total_rows}")
    print(f"Time taken: {time_taken:.2f} seconds")
    print(f"Peak RAM usage (after compute): {ram_after_compute:.2f} MB")

except FileNotFoundError:
    print(f"ERROR: File not found at {train_file}. Please check the download step.")

Starting Method 2: Dask
Starting RAM: 649.08 MB
RAM after Dask setup (before compute): 656.10 MB

--- Results for Method 2 ---
Total rows: 184903890
Time taken: 154.19 seconds
Peak RAM usage (after compute): 663.97 MB
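
The figure above is the RSS read once after `len(ddf)` returns, so it can miss a higher value reached during the computation. One way to catch that is to poll memory from a background thread while the work runs. The sketch below is an assumption-laden illustration, not part of the original notebook: `PeakSampler` and `rss_mb` are hypothetical names, and the 0.05 second polling interval is arbitrary.

```python
import threading


def rss_mb():
    """Resident set size of this process in MB, read with psutil as in the cells above."""
    import psutil  # imported here so the sampler itself has no hard dependency
    return psutil.Process().memory_info().rss / (1024 * 1024)


class PeakSampler:
    """Context manager that polls a memory reader in a background thread and keeps the maximum."""

    def __init__(self, read_mb=rss_mb, interval=0.05):
        self.read_mb = read_mb
        self.interval = interval
        self.peak = 0.0
        self._stop = threading.Event()
        self._thread = threading.Thread(target=self._run, daemon=True)

    def _sample(self):
        self.peak = max(self.peak, self.read_mb())

    def _run(self):
        # Poll until the main thread signals that the work is finished.
        while not self._stop.is_set():
            self._sample()
            self._stop.wait(self.interval)

    def __enter__(self):
        self._sample()  # baseline reading before the work starts
        self._thread.start()
        return self

    def __exit__(self, exc_type, exc, tb):
        self._stop.set()
        self._thread.join()
        self._sample()  # final reading after the work ends
        return False


# Usage in the Dask cell (ddf as created above):
# with PeakSampler() as sampler:
#     total_rows = len(ddf)
# print(f"Sampled peak RAM: {sampler.peak:.2f} MB")
```

Polling can still miss very short spikes between samples, so the result is a lower bound on the true peak rather than an exact value.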


## 5. Method 3: Pandas with `chunksize` (Compressed File)

This method follows Method 1, but it reads from the `train.csv.gz` file (it also pins the `attributed_time` dtype and skips malformed lines). We expect this to be slower because the data must be decompressed on the fly as each chunk is read, but it has the advantage of using much less disk space.

In [16]:
process = psutil.Process(os.getpid())
chunk_size = 150_000
total_rows = 0

print("Starting Method 3: Pandas (chunksize, gzip)")

# Start timing
start_time = time.time()
start_mem = process.memory_info().rss / (1024 * 1024)
peak_mem = start_mem # Initialize
print(f"Starting RAM: {start_mem:.2f} MB")

try:
    # This creates an iterator
    with pd.read_csv(
        compressed_file,
        compression='gzip',
        dtype={'attributed_time': 'object'},
        chunksize=chunk_size,
        on_bad_lines='skip'  # Skip malformed rows instead of raising an error
    ) as reader:

        for chunk in reader:
            total_rows += len(chunk)

            # Track the highest RSS seen across chunks
            current_mem = process.memory_info().rss / (1024 * 1024)
            if current_mem > peak_mem:
                peak_mem = current_mem


    # Stop timing
    end_time = time.time()
    time_taken = end_time - start_time
    end_mem = process.memory_info().rss / (1024 * 1024)

    print("\n--- Results for Method 3 ---")
    print(f"Total rows: {total_rows}")
    print(f"Time taken: {time_taken:.2f} seconds")
    print(f"Peak RAM usage: {peak_mem:.2f} MB")
    print(f"Ending RAM usage: {end_mem:.2f} MB")

except FileNotFoundError:
    print(f"ERROR: File not found at {compressed_file}. Please check the compression step.")

Starting Method 3: Pandas (chunksize, gzip)
Starting RAM: 587.75 MB

--- Results for Method 3 ---
Total rows: 184903890
Time taken: 224.05 seconds
Peak RAM usage: 610.11 MB
Ending RAM usage: 610.11 MB
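
Because this cell passes `on_bad_lines='skip'`, it is worth confirming that no rows were silently dropped. One independent check is to stream the gzip file with the standard library and count lines. This is a sketch under the assumption that no field contains an embedded newline; the function name `count_data_rows_gz` is hypothetical.

```python
import gzip


def count_data_rows_gz(path, has_header=True, buffer_size=1024 * 1024):
    """Count data rows in a gzip-compressed CSV by streaming it in fixed-size blocks."""
    newlines = 0
    last_byte = b"\n"
    with gzip.open(path, "rb") as f:
        while True:
            block = f.read(buffer_size)
            if not block:
                break
            newlines += block.count(b"\n")
            last_byte = block[-1:]
    # A final line without a trailing newline still counts as a row.
    lines = newlines + (0 if last_byte == b"\n" else 1)
    return max(0, lines - (1 if has_header else 0))


# Usage (path from STEP 3):
# print(count_data_rows_gz("./data/train.csv.gz"))
```

If the result matches the `Total rows` value printed by pandas, no lines were skipped during parsing.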


## 6. Comparison and Conclusion

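Here is a summary of the performance for each method, taken from the cell outputs above.

| Method | Time Taken | Peak RAM Usage |
| :--- | :---: | :---: |
| 1. Pandas (chunksize) | 140.00 sec | 650.68 MB |
| 2. Dask | 154.19 sec | 663.97 MB |
| 3. Pandas (chunksize, gzip) | 224.05 sec | 610.11 MB |

---

### Conclusion

- Pandas with `chunksize` on the uncompressed file was the fastest in the runs recorded above (140.00 sec), with Dask close behind (154.19 sec).
- Pandas on the gzip file was the slowest (224.05 sec), which is consistent with decompression overhead. In exchange, the `.gz` file takes less disk space.
- Peak RAM was similar for all three methods (about 610 to 664 MB). Most of that is the notebook process's baseline (588 to 649 MB before each run), so each method added less than 40 MB.
- The Dask figure is the RSS measured after the computation finished, not a sampled peak, so it may understate the true peak.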
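
Because each run started from a different baseline, the absolute peak values are hard to compare directly. A short sketch of the per-method increase, using the start and peak values printed by the three cells above (the Dask value is the reading after compute, not a sampled peak):

```python
# Start and peak RSS in MB, copied from the three cell outputs above.
runs = {
    "1. Pandas (chunksize)": (611.36, 650.68),
    "2. Dask": (649.08, 663.97),  # second value is RSS after compute
    "3. Pandas (chunksize, gzip)": (587.75, 610.11),
}


def ram_increase(start_mb, peak_mb):
    """Memory added on top of the baseline, rounded to 2 decimals."""
    return round(peak_mb - start_mb, 2)


increases = {name: ram_increase(start, peak) for name, (start, peak) in runs.items()}

for name, delta in increases.items():
    print(f"{name}: +{delta:.2f} MB over baseline")
```

Seen this way, all three methods added under 40 MB on top of the notebook's baseline.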