**Main Question**: Parse a large CSV file, `customers-2000000.csv`, while keeping memory usage low.

*Assumptions*:
* In this question, I only need to parse the CSV; I don't need to do EDA as in the previous question.
* I experiment with several techniques for parsing a large CSV and compare them by memory usage. Lower is better.

In [30]:
import pandas as pd
import csv
import sys
import polars as pl

In [31]:
FILE_PATH = "./data/customers-2000000.csv"

In [32]:
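def get_memory_usage(obj, framework="pandas"):
    """Return the approximate in-memory size of `obj` in MB."""
    if framework == "pandas":
        # Pandas DataFrame; deep=True also counts the string contents
        return obj.memory_usage(deep=True).sum() / (1024**2)
    
    elif framework == "csv":
        # List or dict from csv module.
        # Note: sys.getsizeof is shallow. It counts the container only,
        # not the strings stored inside it.
        return sys.getsizeof(obj) / (1024**2)
    
    elif framework == "polars":
        # Polars DataFrame
        return obj.estimated_size() / (1024**2)

    raise ValueError(f"Unknown framework: {framework}")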
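
The helper above reports the size of the finished object, not the peak memory used while parsing. As a minimal sketch of measuring the peak instead, the standard-library `tracemalloc` module can be wrapped around a parsing function. The synthetic CSV and the function names below are hypothetical, and `tracemalloc` only sees allocations made through Python's allocator, so memory allocated by native libraries may not be counted.

```python
import csv
import io
import tracemalloc


def make_csv_text(n_rows):
    """Build a small synthetic CSV (hypothetical columns) as a string."""
    buffer = io.StringIO()
    writer = csv.writer(buffer)
    writer.writerow(["Index", "Email"])
    for i in range(n_rows):
        writer.writerow([i, f"user{i}@example.com"])
    return buffer.getvalue()


def load_all(text):
    """Keep every parsed row in a list."""
    return list(csv.DictReader(io.StringIO(text)))


def stream(text):
    """Visit rows one at a time without storing them."""
    return sum(1 for _ in csv.DictReader(io.StringIO(text)))


def peak_mb(func, text):
    """Peak Python-level allocation while func(text) runs, in MB."""
    tracemalloc.start()
    func(text)
    _, peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    return peak / (1024**2)


csv_text = make_csv_text(20_000)
peak_load_all = peak_mb(load_all, csv_text)
peak_stream = peak_mb(stream, csv_text)
print(f"Peak when storing all rows: {peak_load_all:.2f} MB")
print(f"Peak when streaming rows:   {peak_stream:.2f} MB")
```

Storing every row should show a higher peak than streaming, which is the distinction between "total size" and "memory needed at one time" that matters for the methods below.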

# Method 1

Using the original `read_csv` from `pandas`.

In [33]:
df = pd.read_csv(FILE_PATH)
memory = get_memory_usage(df, framework="pandas")
print(f"Pandas original read_csv: {memory:.2f} MB")

Pandas original read_csv: 1506.04 MB


# Method 2

I read the CSV in chunks. The chunk sizes sum to the same total as the original method, but only one chunk (about 75 MB) needs to be in memory at a time, so peak memory is much lower as long as each chunk is processed and discarded. I can also preprocess each chunk before compiling the final dataset.

In [38]:
chunksize = 100_000
total_mem_chunk = 0
for chunk in pd.read_csv(FILE_PATH, chunksize=chunksize):
    print(f"Chunk memory: {get_memory_usage(chunk, framework='pandas'):.2f} MB")
    total_mem_chunk += get_memory_usage(chunk, framework="pandas")

print(f"Pandas chunking total memory: {total_mem_chunk:.2f} MB")

Chunk memory: 75.30 MB
Chunk memory: 75.29 MB
Chunk memory: 75.30 MB
Chunk memory: 75.31 MB
Chunk memory: 75.30 MB
Chunk memory: 75.30 MB
Chunk memory: 75.31 MB
Chunk memory: 75.31 MB
Chunk memory: 75.31 MB
Chunk memory: 75.30 MB
Chunk memory: 75.31 MB
Chunk memory: 75.31 MB
Chunk memory: 75.30 MB
Chunk memory: 75.30 MB
Chunk memory: 75.29 MB
Chunk memory: 75.29 MB
Chunk memory: 75.31 MB
Chunk memory: 75.30 MB
Chunk memory: 75.30 MB
Chunk memory: 75.30 MB
Pandas chunking total memory: 1506.04 MB
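
To illustrate preprocessing within each chunk, here is a minimal sketch that keeps only a small running aggregate and lets every chunk be discarded after use. The in-memory sample data and the `count_by_country` helper are hypothetical stand-ins; only the `Country` column name is taken from the dataset.

```python
import io
from collections import Counter

import pandas as pd

# Hypothetical miniature stand-in for the real file.
SAMPLE_CSV = io.StringIO(
    "Index,Country\n"
    "1,Norway\n"
    "2,Chile\n"
    "3,Norway\n"
    "4,Japan\n"
    "5,Chile\n"
    "6,Norway\n"
)


def count_by_country(source, chunksize):
    """Accumulate per-country row counts one chunk at a time."""
    totals = Counter()
    # usecols limits parsing to the column the aggregate needs.
    for chunk in pd.read_csv(source, chunksize=chunksize, usecols=["Country"]):
        # Fold this chunk's counts into the running totals; the chunk itself
        # is released when the loop moves to the next one.
        totals.update(chunk["Country"].value_counts().to_dict())
    return totals


country_counts = count_by_country(SAMPLE_CSV, chunksize=2)
print(country_counts)
```

Only the `Counter` survives between iterations, so memory stays bounded by one chunk plus the aggregate, regardless of file size.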


# Method 3

I set explicit `dtype`s instead of the defaults used in the previous `df`.

* `category` → for string columns with many repeated values, which saves significant memory.
* `int32` → sufficient for indexes up to 2 million and half the size of `int64`. `int8` and `int16` are too small: their maximum values (127 and 32,767) are below 2 million.
* `object` → for mostly unique columns like email/phone, where `category` doesn't help much.

So `int64` is changed to `int32`, and `object` columns that are likely to contain repeated values are changed to `category`.

In [None]:
dtype_map = {
    "Index": "int32",
    "Customer Id": "category",
    "First Name": "category",
    "Last Name": "category",
    "Company": "category",
    "City": "category",
    "Country": "category",
    "Phone 1": "object",                # Unique values, so keep as object
    "Phone 2": "object",                # Unique values, so keep as object
    "Email": "object",                  # Unique values, so keep as object
    "Subscription Date": "object",      # Unique values, so keep as object
    "Website": "category"
}

df_opt = pd.read_csv(FILE_PATH, dtype=dtype_map)
memory_opt = get_memory_usage(df_opt, framework="pandas")
print(f"Pandas optimized dtype: {memory_opt:.2f} MB")

Pandas optimized dtype: 1010.90 MB
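
The reasoning behind these dtype choices can be checked on a small scale. The sketch below uses synthetic columns (hypothetical data, not taken from the real file) to print the integer limits and to compare `object` against `category` for a repeated column and an all-distinct column.

```python
import numpy as np
import pandas as pd

N_ROWS = 2_000_000  # row count implied by the file name

# Largest value each signed integer type can hold.
for int_type in (np.int8, np.int16, np.int32):
    max_value = np.iinfo(int_type).max
    print(f"{np.dtype(int_type).name}: max={max_value:,}, fits {N_ROWS:,}: {max_value >= N_ROWS}")

# Synthetic columns: one with few distinct values, one with all distinct values.
repeated = pd.Series(["Norway", "Chile", "Japan"] * 10_000).astype("object")
unique = pd.Series([f"user{i}@example.com" for i in range(30_000)]).astype("object")


def size_mb(series):
    """Deep memory usage of a Series in MB."""
    return series.memory_usage(deep=True) / (1024**2)


repeated_obj_mb = size_mb(repeated)
repeated_cat_mb = size_mb(repeated.astype("category"))
unique_obj_mb = size_mb(unique)
unique_cat_mb = size_mb(unique.astype("category"))

print(f"repeated: object={repeated_obj_mb:.3f} MB, category={repeated_cat_mb:.3f} MB")
print(f"unique:   object={unique_obj_mb:.3f} MB, category={unique_cat_mb:.3f} MB")
```

The same comparison can be run per column on a sample of the real file to decide which columns are worth converting.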


# Method 4

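Manual approach using the `csv` module. Rows are read one at a time and never stored, so the total below is a sum of shallow `sys.getsizeof` values for each row dict (the container only, not the strings inside it).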

In [41]:
total_mem_csv = 0
with open(FILE_PATH, newline="", encoding="utf-8") as f:
    reader = csv.DictReader(f)
    for row in reader:
        # Shallow size of the row dict only; the strings it holds are not counted
        total_mem_csv += get_memory_usage(row, framework="csv")

print(f"CSV module total memory estimate: {total_mem_csv:.2f} MB")

CSV module total memory estimate: 1220.70 MB
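
Because `sys.getsizeof` on a dict does not follow references, the figure above excludes the field strings. A minimal sketch of a deeper per-row estimate is shown below. The sample row and the `deep_row_size` helper are hypothetical, and keys are left out on purpose because `csv.DictReader` reuses the same header strings for every row.

```python
import sys


def shallow_row_size(row):
    """Size of the dict container only, in bytes."""
    return sys.getsizeof(row)


def deep_row_size(row):
    """Dict container plus the value strings it references, in bytes."""
    return sys.getsizeof(row) + sum(sys.getsizeof(value) for value in row.values())


# Hypothetical row shaped like the output of csv.DictReader.
sample_row = {
    "Index": "1",
    "First Name": "Ada",
    "Email": "ada@example.com",
}

print(f"shallow: {shallow_row_size(sample_row)} bytes")
print(f"deep:    {deep_row_size(sample_row)} bytes")
```

Summing `deep_row_size` over all rows would give an estimate closer to what a list of row dicts would occupy, which is the quantity comparable to a fully loaded DataFrame.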


# Method 5

Using the original `read_csv` from `polars`.

In [50]:
df_polars = pl.read_csv(FILE_PATH)
memory_polars = get_memory_usage(df_polars, framework="polars")
print(f"Polars original read_csv: {memory_polars:.2f} MB")

Polars original read_csv: 310.13 MB


# Conclusion

In this experiment, the goal was to parse a large CSV file, `customers-2000000.csv`, while keeping memory usage as low as possible. I compared several approaches and measured how much memory each method consumes.

## Summary of Approaches and Memory Usage

| Method | Description | Memory Usage (MB) |
|--------|-------------|-----------------|
| Pandas default `read_csv` | Loading the full CSV with default settings | 1506.04 |
| Pandas chunking | Reading the CSV in 100k-row chunks | 1506.04 summed over chunks (~75 MB per chunk) |
| Pandas with optimized dtypes | Converting repeated strings to `category` and indexes to `int32` | 1010.90 |
| CSV module | Reading the CSV manually row by row | 1220.70 (sum of shallow row-dict sizes; one row in memory at a time) |
| Polars `read_csv` | Using Polars, designed for speed and low memory | 310.13 |

## Key Takeaways

Parsing large CSV files can consume a lot of memory if not handled carefully. Default Pandas `read_csv` uses significant memory because all string columns are stored as `object` and indexes as `int64`. Chunking processes the data in smaller pieces: the chunk sizes sum to the same total, but only about 75 MB needs to be held at a time if each chunk is processed and discarded. Optimizing dtypes in Pandas, by converting repeated strings to `category` and indexes to `int32`, reduces memory by roughly a third (1506.04 MB to 1010.90 MB). The `csv` module figure is not directly comparable with the others: it sums the shallow `sys.getsizeof` of each row dict, which excludes the string contents, and only one row is in memory at a time. Row-by-row parsing therefore keeps peak memory low but is less convenient for large datasets. Polars produces the smallest in-memory DataFrame of all methods, using only about 20% of the memory of the Pandas default.

For large datasets, **Polars** is the best choice when memory is a concern, as it is designed for both speed and memory efficiency. If you prefer **Pandas**, always apply dtype optimization. Chunking is useful for step-by-step processing on limited-memory machines. Overall, careful selection of library and data types can drastically reduce memory usage, and for this dataset, **Polars is the most memory-efficient**, followed by **Pandas with optimized dtypes**.
