<a href="https://colab.research.google.com/github/harrydevforlife/sandbox/blob/main/pyarrow_and_csv_writer_benchmark.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Benchmarking pyarrow's CSV writer against Python's `csv.writer` shows the trade-offs between speed, ease of use, and flexibility. Below is a Python script that times both methods on the same DataFrame.

## Prerequisites
Install the required libraries if you haven't already:



In [1]:
!pip install pandas pyarrow



In [3]:
import pandas as pd
import numpy as np
import pyarrow as pa
import pyarrow.csv as pcsv
import csv
import time
import os
import tempfile

def generate_sample_data(num_rows=10_000_000, num_cols=10):
    """
    Generates a pandas DataFrame with random data.

    Args:
        num_rows (int): Number of rows.
        num_cols (int): Number of columns.

    Returns:
        pd.DataFrame: Generated DataFrame.
    """
    data = {f'col_{i}': np.random.randn(num_rows) for i in range(num_cols)}
    return pd.DataFrame(data)

def write_csv_pyarrow(df, file_path):
    """
    Writes a pandas DataFrame to a CSV file using PyArrow.

    Args:
        df (pd.DataFrame): Data to write.
        file_path (str): Destination file path.
    """
    table = pa.Table.from_pandas(df)
    pcsv.write_csv(table, file_path)

def write_csv_writer(df, file_path):
    """
    Writes a pandas DataFrame to a CSV file using Python's csv.writer.

    Args:
        df (pd.DataFrame): Data to write.
        file_path (str): Destination file path.
    """
    with open(file_path, mode='w', newline='', encoding='utf-8') as f:
        writer = csv.writer(f)
        # Write the header
        writer.writerow(df.columns)
        # Write the data
        writer.writerows(df.values)

def benchmark_write(func, *args, **kwargs):
    """
    Benchmarks the time taken by a function to execute.

    Args:
        func (callable): Function to benchmark.
        *args: Positional arguments for the function.
        **kwargs: Keyword arguments for the function.

    Returns:
        float: Time taken in seconds.
    """
    # perf_counter is a monotonic, high-resolution clock intended for timing code
    start_time = time.perf_counter()
    func(*args, **kwargs)
    end_time = time.perf_counter()
    return end_time - start_time

def run_benchmark(df, num_runs=3):
    """
    Runs the benchmark for PyArrow and csv.writer.

    Args:
        df (pd.DataFrame): Data to write.
        num_runs (int): Number of times to run each benchmark.

    Returns:
        dict: Average write times for each method.
    """
    results = {'pyarrow': [], 'csv_writer': []}

    for run in range(1, num_runs + 1):
        print(f"\nRun {run} of {num_runs}:")

        with tempfile.TemporaryDirectory() as tmpdirname:
            # Define file paths
            pyarrow_file = os.path.join(tmpdirname, 'data_pyarrow.csv')
            csv_writer_file = os.path.join(tmpdirname, 'data_csv_writer.csv')

            # Benchmark PyArrow
            time_pyarrow = benchmark_write(write_csv_pyarrow, df, pyarrow_file)
            results['pyarrow'].append(time_pyarrow)
            print(f"PyArrow CSV write time: {time_pyarrow:.2f} seconds")

            # Benchmark csv.writer
            time_csv_writer = benchmark_write(write_csv_writer, df, csv_writer_file)
            results['csv_writer'].append(time_csv_writer)
            print(f"csv.writer write time: {time_csv_writer:.2f} seconds")

    # Calculate average times
    avg_pyarrow = sum(results['pyarrow']) / num_runs
    avg_csv_writer = sum(results['csv_writer']) / num_runs

    return {
        'PyArrow Average Time (s)': avg_pyarrow,
        'csv.writer Average Time (s)': avg_csv_writer
    }

def main():
    # Parameters
    NUM_ROWS = 10_000_000  # Number of rows (reduce if memory is limited)
    NUM_COLS = 10          # Number of columns
    NUM_RUNS = 3           # Number of benchmark runs

    print("Generating sample data...")
    df = generate_sample_data(num_rows=NUM_ROWS, num_cols=NUM_COLS)
    print(f"DataFrame with {NUM_ROWS} rows and {NUM_COLS} columns generated.")

    print("\nStarting benchmark...")
    results = run_benchmark(df, num_runs=NUM_RUNS)

    print("\nBenchmark Results:")
    for lib, avg_time in results.items():
        print(f"{lib}: {avg_time:.2f} seconds on average over {NUM_RUNS} runs")

if __name__ == "__main__":
    main()


Generating sample data...
DataFrame with 10000000 rows and 10 columns generated.

Starting benchmark...

Run 1 of 3:
PyArrow CSV write time: 25.32 seconds
csv.writer write time: 163.20 seconds

Run 2 of 3:
PyArrow CSV write time: 26.55 seconds
csv.writer write time: 159.69 seconds

Run 3 of 3:
PyArrow CSV write time: 30.03 seconds
csv.writer write time: 161.15 seconds

Benchmark Results:
PyArrow Average Time (s): 27.30 seconds on average over 3 runs
csv.writer Average Time (s): 161.35 seconds on average over 3 runs


### **Explanation**

1. **Data Generation (`generate_sample_data`)**:
   - Creates a pandas DataFrame with random numerical data. You can adjust the number of rows and columns.

2. **Writing Functions**:
   - **`write_csv_pyarrow`**: Converts the pandas DataFrame to a PyArrow Table and writes it as a CSV file.
   - **`write_csv_writer`**: Uses Python's `csv.writer` to manually write rows to a CSV file.

3. **Benchmarking Function (`benchmark_write`)**:
   - Measures execution time for a specific write function.

4. **Run Benchmark (`run_benchmark`)**:
   - Runs the write operations multiple times and computes the average execution time for each method.
   - Uses `tempfile.TemporaryDirectory` to manage temporary file paths.

5. **Main Function (`main`)**:
   - Defines parameters (number of rows, columns, and runs) and starts the benchmark.
   - Displays average execution times for both methods.
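
The timing and averaging steps described above can be sketched with the standard library alone. This is a minimal illustration, not the notebook's own code: `time_repeated` is a hypothetical helper name, and reporting the minimum, median, and standard deviation alongside the mean is an added assumption, useful because a single slow run (such as the 30.03 s PyArrow run) can pull the mean upward.

```python
import statistics
import time


def time_repeated(func, *args, repeats=3, **kwargs):
    """Call func(*args, **kwargs) `repeats` times and summarize the wall-clock timings."""
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()  # monotonic, high-resolution clock
        func(*args, **kwargs)
        timings.append(time.perf_counter() - start)
    return {
        "min": min(timings),
        "median": statistics.median(timings),
        "mean": statistics.fmean(timings),
        # stdev needs at least two samples
        "stdev": statistics.stdev(timings) if len(timings) > 1 else 0.0,
    }


if __name__ == "__main__":
    # Stand-in workload; in the benchmark this would be one of the write functions.
    summary = time_repeated(sum, range(1_000_000), repeats=5)
    print({name: round(value, 4) for name, value in summary.items()})
```

In the benchmark itself, the call would look like `time_repeated(write_csv_pyarrow, df, file_path)`, with `file_path` pointing into a temporary directory.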

---

### **Running the Benchmark**

1. **Save the Script**: Save the above script to a file, e.g., `csv_benchmark.py`.

2. **Run the Script**:

    ```bash
    python csv_benchmark.py
    ```
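
Generating the full dataset takes a fair amount of memory, so it can help to try a smaller run first. The sketch below shows one possible way to expose the parameters on the command line and to estimate the DataFrame's size before generating it. The `--rows`, `--cols`, and `--runs` flags and both function names are hypothetical; the script above does not define them. The estimate assumes 8 bytes per value, which matches the float64 output of `np.random.randn`.

```python
import argparse


def parse_args(argv=None):
    """Parse hypothetical command-line overrides for the benchmark parameters."""
    parser = argparse.ArgumentParser(description="CSV writer benchmark (sketch)")
    parser.add_argument("--rows", type=int, default=10_000_000)
    parser.add_argument("--cols", type=int, default=10)
    parser.add_argument("--runs", type=int, default=3)
    return parser.parse_args(argv)


def estimate_dataframe_bytes(num_rows, num_cols):
    """Estimate memory for a float64 DataFrame: 8 bytes per value, ignoring the index."""
    return num_rows * num_cols * 8


if __name__ == "__main__":
    # Explicit argv so the sketch also runs inside a notebook cell.
    args = parse_args(["--rows", "100000"])
    size_mb = estimate_dataframe_bytes(args.rows, args.cols) / 1e6
    print(f"{args.rows} rows x {args.cols} cols ~ {size_mb:.0f} MB of float64 data")
```

With the defaults (10,000,000 rows and 10 columns) the same estimate comes to 800 MB for the raw values alone.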

---

### **Key Insights**

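- **`pyarrow`**: Optimized for high performance, especially on large datasets. Its CSV writer is implemented in C++ and works on columnar Arrow data, so formatting happens outside the Python interpreter. In the run above it averaged 27.30 s, against 161.35 s for `csv.writer`.
- **`csv.writer`**: Part of the standard library and simple to use, but slower here, because each row is handed over as a sequence of Python objects and every value is converted to text individually.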
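
To put a number on the difference, the per-run timings from the output above can be turned into averages and a speedup ratio. The values below are copied from that output; they reflect one machine and one dataset shape, so they are not a general figure.

```python
# Per-run timings copied from the benchmark output above (seconds).
pyarrow_times = [25.32, 26.55, 30.03]
csv_writer_times = [163.20, 159.69, 161.15]

avg_pyarrow = sum(pyarrow_times) / len(pyarrow_times)
avg_csv_writer = sum(csv_writer_times) / len(csv_writer_times)

# How many times longer csv.writer took than PyArrow, on average.
speedup = avg_csv_writer / avg_pyarrow

print(f"PyArrow average:    {avg_pyarrow:.2f} s")
print(f"csv.writer average: {avg_csv_writer:.2f} s")
print(f"Speedup:            {speedup:.1f}x")
```

For this run the ratio works out to roughly 5.9x in PyArrow's favor.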

### **Customizations**

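- **Dataset Size**: Adjust `NUM_ROWS` and `NUM_COLS` to simulate larger or smaller datasets. The default 10,000,000 x 10 frame of float64 values holds about 800 MB of data, so reduce `NUM_ROWS` if memory is limited.
- **Number of Runs**: Increase `NUM_RUNS` for more robust averages.
- **Compression**: `pcsv.write_csv` accepts an output stream as well as a file path, so the destination can be wrapped in `pa.CompressedOutputStream` to measure the cost of writing compressed CSV with PyArrow.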

---

This script should help you understand the performance trade-offs between the two methods for your use case.