# Reading a large data file with two different approaches

This notebook compares two methods for reading a large CSV file (`train.csv`) in Python and counting its total number of rows. The row count is a deliberately trivial computation that stands in for real work. The two methods are:

- Using the **chunksize parameter in Pandas**.
- Using the **Dask** library.

The goal is to measure and compare the time each approach takes to process a file with 135,000,000 rows (about 5.5 GB).

## Method 1: Using Pandas with chunksize

This method reads the large CSV file in smaller chunks to avoid loading the entire file into memory at once. It iterates through the file piece by piece, counting the rows in each chunk.

In [2]:
import pandas as pd
import time

file_path = './data/train.csv'
chunk_size = 1_000_000
total_rows = 0

# Start timing
start_time = time.time()

chunk_iterator = pd.read_csv(file_path, chunksize=chunk_size)

# Loop through each chunk
for chunk in chunk_iterator:
    total_rows += len(chunk)

# Stop timing
end_time = time.time()
time_taken = end_time - start_time

print(f"Total rows: {total_rows}")
print(f"Time taken with Pandas (chunksize): {time_taken:.2f} seconds")


Total rows: 135000000
Time taken with Pandas (chunksize): 111.81 seconds
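
The chunked loop can be checked on a tiny input before running it against the full file. The sketch below is an illustration only: `csv_text` is a hypothetical 10-row stand-in for `train.csv`, `count_rows_in_chunks` is a helper name introduced here, and `usecols=[0]` is an optional tweak (not part of the timed run above) that parses a single column, which is all a row count needs.

```python
import io

import pandas as pd

# Hypothetical stand-in for train.csv: a header line plus 10 data rows.
csv_text = "id,value\n" + "".join(f"{i},{i * 2}\n" for i in range(10))


def count_rows_in_chunks(source, chunk_size):
    """Count data rows by summing the length of each chunk."""
    total = 0
    # usecols=[0] parses only the first column; enough for counting rows.
    for chunk in pd.read_csv(source, chunksize=chunk_size, usecols=[0]):
        total += len(chunk)
    return total


# With chunk_size=4 the 10 rows arrive as chunks of 4, 4 and 2 rows.
total_rows = count_rows_in_chunks(io.StringIO(csv_text), chunk_size=4)
print(total_rows)  # 10
```

The result does not depend on the chunk size; the chunk size only controls how many rows are held in memory at once.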


## Method 2: Using Dask

Dask is a parallel computing library that is designed to scale natively from a single machine to a cluster. It can handle datasets that are larger than memory by breaking them into smaller, manageable pieces (similar to Pandas chunks) and processing them in parallel.

In [3]:
import dask.dataframe as dd
import time

file_path = './data/train.csv'

# Start timing
start_time = time.time()

# The 'attributed_time' column has missing values, which can trip up
# Dask's sample-based dtype inference, so read it as text (object).
ddf = dd.read_csv(
    file_path,
    dtype={'attributed_time': 'object'}
)

total_rows = len(ddf)

# Stop timing
end_time = time.time()
time_taken = end_time - start_time

print(f"Total rows: {total_rows}")
print(f"Time taken with Dask: {time_taken:.2f} seconds")

Total rows: 135000000
Time taken with Dask: 64.61 seconds
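
The "split into pieces and process them in parallel" idea can be illustrated with the standard library alone. This is a conceptual sketch, not Dask's actual implementation: `count_newlines` and `parallel_row_count` are hypothetical helpers, the small temporary file stands in for `train.csv`, and the code assumes one header line, a trailing newline, and no newlines embedded in quoted fields.

```python
import os
import tempfile
from concurrent.futures import ThreadPoolExecutor


def count_newlines(path, start, length):
    """Count newline bytes in one byte range of the file."""
    with open(path, "rb") as f:
        f.seek(start)
        return f.read(length).count(b"\n")


def parallel_row_count(path, block_size, workers=4):
    """Split the file into byte blocks, count newlines per block, sum them."""
    size = os.path.getsize(path)
    starts = range(0, size, block_size)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        counts = pool.map(lambda s: count_newlines(path, s, block_size), starts)
        newlines = sum(counts)
    return newlines - 1  # subtract the header line


# Hypothetical small file standing in for train.csv: header + 1000 rows.
with tempfile.NamedTemporaryFile(
    "w", suffix=".csv", delete=False, newline=""
) as tmp:
    tmp.write("id,value\n")
    tmp.writelines(f"{i},{i * 2}\n" for i in range(1000))
    demo_path = tmp.name

demo_rows = parallel_row_count(demo_path, block_size=256)
os.remove(demo_path)
print(demo_rows)  # 1000
```

Threads keep the sketch short. Dask additionally parses each block into a Pandas DataFrame and schedules the work itself, so its real cost profile differs from this toy counter.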


## Comparison and Conclusion

Here's a summary of the performance for reading 135 million rows:

| Library            | Time Taken (seconds) |
|--------------------|----------------------|
| Pandas (chunksize) | 111.81               |
| Dask               | 64.61                |

For this task, **Dask** was faster: in this single run it finished the row count in 64.61 seconds versus 111.81 seconds, about 1.7 times quicker than the Pandas chunksize method. The likely reason is that Dask processes pieces of the file in parallel, whereas the Pandas loop reads and counts one chunk at a time.
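
The speedup figure follows directly from the two timings. The short calculation below uses only the numbers reported in the table and also derives the throughput of each run; the variable names are introduced here for illustration.

```python
total_rows = 135_000_000
pandas_seconds = 111.81  # reported time for Pandas (chunksize)
dask_seconds = 64.61     # reported time for Dask

# Ratio of the two wall-clock times.
speedup = pandas_seconds / dask_seconds
# Share of the Pandas run time that Dask saved, as a percentage.
time_saved_pct = (pandas_seconds - dask_seconds) / pandas_seconds * 100

print(f"Speedup: {speedup:.2f}x")
print(f"Time saved: {time_saved_pct:.1f}%")
print(f"Pandas throughput: {total_rows / pandas_seconds:,.0f} rows/s")
print(f"Dask throughput: {total_rows / dask_seconds:,.0f} rows/s")
```

The ratio comes out at roughly 1.73, which corresponds to about 42% less wall-clock time for the Dask run.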