
Save CSVs with PyArrow #280

Merged: 6 commits into neuromodulation:main on Jan 27, 2024

Conversation

toni-neurosc
Collaborator

I noticed that saving the output file sub_FEATURES.csv was unreasonably slow, and apparently this is a Pandas issue. I changed the saving function to use PyArrow instead and got a 7.5x speed-up. For a 1 GB, all-np.float64 dataset:

  • pandas_df.to_csv(): Time to save: 45.128990650177 seconds.
  • pyarrow.csv.write_csv(): Time to save: 6.1338911056518555 seconds.
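
For reference, the write-side change is essentially the following (a minimal sketch, not necessarily the exact code in the PR; df and out_path stand in for the actual feature DataFrame and output file):

import pyarrow as pa
from pyarrow import csv

# Convert the pandas DataFrame to an Arrow Table (dropping the pandas
# row index) and write it with PyArrow's CSV writer instead of to_csv().
csv.write_csv(pa.Table.from_pandas(df, preserve_index=False), out_path)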

Reading CSVs, however, is faster with Pandas, so I did not change that; I only removed the index_col=0 argument, since PyArrow does not save an extra row-index column (which seems unnecessary to me anyway).

  • Time to read CSV (pyarrow): 14.770382642745972 seconds.
  • Time to read CSV (pandas): 8.440594673156738 seconds.

Note 1: When loading CSVs generated with previous versions of the program, they will now have an unnamed column with the row indexes. This should not break compatibility as long as columns are accessed by name rather than by index.
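
If needed, older files can be normalized on load with something like this (a hypothetical shim, not part of the PR; "Unnamed: 0" is the name pandas assigns to a header-less index column):

import pandas as pd

df = pd.read_csv(path)  # path: a CSV written by an older version
# Drop the leftover row-index column so old and new files look alike.
df = df.drop(columns=["Unnamed: 0"], errors="ignore")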

Note 2: I haven't checked the minimum PyArrow version needed for this very basic functionality; I just added to the dependencies the version that was installed in my environment, but it can probably go much further back.

@toni-neurosc
Collaborator Author

PS: Pandas will at some point in the future add an option to save CSVs with PyArrow, as discussed here, but it does not look like it's being worked on at the moment.

@timonmerk
Contributor

@toni-neurosc First of all thanks a lot for contributing to py_neuromodulation! That's really great!

Regarding this PR, I could reproduce the saving timings, but contrary to your results I found that reading a .csv file is also much faster with pyarrow. This is true when reading with pyarrow directly, but I also get the same speed-up when reading with pd.read_csv(PATH, engine="pyarrow").

Could you share how you tested the pyarrow csv read?

I tested it using this minimal example:

import pandas as pd
import numpy as np
import time
import pyarrow
from pyarrow import csv

# Time Write pandas: 30.8 seconds, Time read: 6.6 seconds
# Time Write pyarrow: 7.9 seconds, Time read: 0.65 seconds

# Writing with the pandas pyarrow engine is currently not possible; Time read: 0.7 seconds


# Define the size of the DataFrame
n_rows = 10**6
n_cols = 100

# Create a DataFrame with random values
df = pd.DataFrame(
    np.random.randint(0, 100, size=(n_rows, n_cols)),
    columns=["col" + str(i) for i in range(n_cols)],
)

for SAVE_PYARROW in [True, False]:
    print(f"SAVE_PYARROW: {SAVE_PYARROW}")
    start_time = time.time()

    if SAVE_PYARROW:
        csv.write_csv(pyarrow.Table.from_pandas(df), "data.csv")
    else:
        # Save the DataFrame to a CSV file
        df.to_csv("data.csv", index=False)

    # Calculate the elapsed time
    elapsed_time = time.time() - start_time

    print(f"Time elapsed: {elapsed_time} seconds")

for READ_PYARROW in [True, False]:
    print(f"READ_PYARROW: {READ_PYARROW}")
    start_time = time.time()

    if READ_PYARROW:
        table = csv.read_csv("data.csv")
        df_pyarrow = table.to_pandas()
    else:
        # Read the CSV file into a DataFrame
        df_pd = pd.read_csv("data.csv")

    # Calculate the elapsed time
    elapsed_time = time.time() - start_time

    print(f"Time elapsed: {elapsed_time} seconds")

print("Use pandas pyarrow engine to read the csv file time")
start_time = time.time()
df = pd.read_csv("data.csv", engine="pyarrow")
elapsed_time = time.time() - start_time
print(f"Time elapsed: {elapsed_time} seconds")

Probably it's version-independent, but I tested it with pyarrow 14.0.2 and pandas 2.1.4.

Also, I should probably say that I tested it with a pretty good CPU (a Threadripper 2990WX with 32 cores), which might be a factor in the speed-up.
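
If you want to check whether core count is the factor, PyArrow exposes the size of its thread pool and lets you cap it (a quick check, using the pyarrow.cpu_count / set_cpu_count API):

import pyarrow as pa

print(pa.cpu_count())  # size of PyArrow's CPU thread pool
pa.set_cpu_count(1)    # re-run the read benchmark single-threaded to compare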

@toni-neurosc
Collaborator Author

toni-neurosc commented Jan 20, 2024

Hi Timon, happy to help!

I checked again and I can reproduce both my results and yours. It seems to depend on file size, or perhaps on the shape of the dataframe, since my file is only 4x bigger but takes way more than 4x longer to load.

My code for testing was this:

import pandas as pd
import numpy as np
import time
from pyarrow import csv
import py_neuromodulation as nm

## Generate chunky features CSV
NUM_CHANNELS = 5
NUM_DATA = 1000000
sfreq = 1000  # Hz
feature_freq = 3  # Hz

np.random.seed(0)

data = np.random.random([NUM_CHANNELS, NUM_DATA])
stream = nm.Stream(sfreq=sfreq, data=data, verbose=False)
stream.run(parallel=True)

# Test PyArrow
def read_csv(file):
    return csv.read_csv(file).to_pandas()
    
t = time.time()
df = read_csv("sub/sub_FEATURES.csv")
print(f"Time to load PyArrow: {time.time() - t} seconds")

# Test Pandas
t = time.time()
df = pd.read_csv("sub/sub_FEATURES.csv")
print(f"Time to load Pandas: {time.time() - t} seconds")

My code generates a 963 MB CSV file with shape 9991 rows × 5227 columns, and the times are:

Time to load PyArrow: 15.755833625793457 seconds
Time to load Pandas: 8.531198978424072 seconds

When I ran your code on my computer, it generated a 276 MB file, and here PyArrow was way faster:

READ_PYARROW: True
Time elapsed: 0.6403310298919678 seconds
READ_PYARROW: False
Time elapsed: 2.7735869884490967 seconds
Use pandas pyarrow engine to read the csv file time
Time elapsed: 0.677858829498291 seconds

Then, swapping np.random.randint (integers) for np.random.random (floats) produced a 1.79 GB file, and read times were still faster for PyArrow:

READ_PYARROW: True
Time elapsed: 1.7801580429077148 seconds
READ_PYARROW: False
Time elapsed: 12.708122491836548 seconds
Use pandas pyarrow engine to read the csv file time
Time elapsed: 2.566972017288208 seconds

Just to be sure, I read the file generated by my code with your script:

READ_PYARROW: True
Time elapsed: 10.525209903717041 seconds
READ_PYARROW: False
Time elapsed: 8.005020141601562 seconds
Use pandas pyarrow engine to read the csv file time
Time elapsed: 18.455359935760498 seconds

So your script is faster than mine on the same data. Why? It turns out it's because this:

table = csv.read_csv("sub/sub_FEATURES.csv")
df = table.to_pandas()

is for some reason faster than this:

df = csv.read_csv("sub/sub_FEATURES.csv").to_pandas()

Python is weird sometimes. Anyway, even so, Pandas is a bit faster at reading the CSV generated by py_neuromodulation's stream.run(), so maybe test my code on your computer and see if it's the same for you.
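
One way to dig into the two-statement oddity is to time the Arrow parse and the to_pandas() conversion separately (a rough sketch; running it a few times also rules out OS file-cache warm-up, since whichever variant reads the file first pays the cold-cache cost):

import time
from pyarrow import csv

t = time.perf_counter()
table = csv.read_csv("sub/sub_FEATURES.csv")  # Arrow-side parse
t_read = time.perf_counter() - t

t = time.perf_counter()
df = table.to_pandas()  # Arrow -> pandas conversion
t_convert = time.perf_counter() - t

print(f"read_csv: {t_read:.2f} s, to_pandas: {t_convert:.2f} s")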

@timonmerk merged commit 702c3d3 into neuromodulation:main on Jan 27, 2024
2 checks passed
@toni-neurosc deleted the add_pyarrow_pr branch on February 3, 2024