
Save CSVs with PyArrow #280

Merged: 6 commits into neuromodulation:main on Jan 27, 2024

Conversation

toni-neurosc
Collaborator

I noticed that saving the output file sub_FEATURES.csv was unreasonably slow, and apparently this is a Pandas issue. I changed the saving function to use PyArrow instead and got a 7.5x speed-up. For a 1 GB, all-np.float64 dataset:

  • pandas_df.to_csv(): Time to save: 45.128990650177 seconds.
  • pyarrow.csv.write_csv(): Time to save: 6.1338911056518555 seconds.
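
For reference, the write-side change is essentially the following (a minimal sketch, not necessarily the exact code in the PR; df and out_path stand in for the actual feature DataFrame and output file):

import pyarrow as pa
from pyarrow import csv

# Convert the pandas DataFrame to an Arrow Table (dropping the pandas
# row index) and write it with PyArrow's CSV writer instead of to_csv().
csv.write_csv(pa.Table.from_pandas(df, preserve_index=False), out_path)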

Reading CSVs, however, is faster with Pandas, so I did not change that; I only removed the index_col=0 argument, since PyArrow does not save an extra row-index column (which seems unnecessary to me anyway).

  • Time to read CSV (pyarrow): 14.770382642745972 seconds.
  • Time to read CSV (pandas): 8.440594673156738 seconds.

Note 1: When loading CSVs generated with previous versions of the program, they will now have an unnamed column with the row indexes. This should not break compatibility as long as columns are accessed by name rather than by index.
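
If needed, older files can be normalized on load with something like this (a hypothetical shim, not part of the PR; "Unnamed: 0" is the name pandas assigns to a header-less index column):

import pandas as pd

df = pd.read_csv(path)  # path: a CSV written by an older version
# Drop the leftover row-index column so old and new files look alike.
df = df.drop(columns=["Unnamed: 0"], errors="ignore")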

Note 2: I haven't checked the minimum PyArrow version needed for this very basic functionality; I just added to the dependencies the version that was installed in my environment, but it can probably go much further back.

@toni-neurosc
Collaborator Author

PS: Pandas will at some point in the future add an option to save CSVs with PyArrow, as discussed here, but it does not look like it's being worked on at the moment.

@timonmerk
Contributor

@toni-neurosc First of all thanks a lot for contributing to py_neuromodulation! That's really great!

Regarding this PR, I could reproduce the saving timings, but contrary to your results I found that reading a .csv file is also much faster with pyarrow. This is true when reading with pyarrow directly, but I also get the same speed-up when reading with pd.read_csv(PATH, engine="pyarrow").

Could you share how you tested the pyarrow csv read?

I tested it using this minimal example:

import pandas as pd
import numpy as np
import time
import pyarrow
from pyarrow import csv

# Time Write pandas: 30.8 seconds, Time read: 6.6 seconds
# Time Write pyarrow: 7.9 seconds, Time read: 0.65 seconds

# Writing with the pandas pyarrow engine is currently not possible; Time read: 0.7 seconds


# Define the size of the DataFrame
n_rows = 10**6
n_cols = 100

# Create a DataFrame with random values
df = pd.DataFrame(
    np.random.randint(0, 100, size=(n_rows, n_cols)),
    columns=["col" + str(i) for i in range(n_cols)],
)

for SAVE_PYARROW in [True, False]:
    print(f"SAVE_PYARROW: {SAVE_PYARROW}")
    start_time = time.time()

    if SAVE_PYARROW:
        csv.write_csv(pyarrow.Table.from_pandas(df), "data.csv")
    else:
        # Save the DataFrame to a CSV file
        df.to_csv("data.csv", index=False)

    # Calculate the elapsed time
    elapsed_time = time.time() - start_time

    print(f"Time elapsed: {elapsed_time} seconds")

for READ_PYARROW in [True, False]:
    print(f"READ_PYARROW: {READ_PYARROW}")
    start_time = time.time()

    if READ_PYARROW:
        table = csv.read_csv("data.csv")
        df_pyarrow = table.to_pandas()
    else:
        # Read the CSV file into a DataFrame
        df_pd = pd.read_csv("data.csv")

    # Calculate the elapsed time
    elapsed_time = time.time() - start_time

    print(f"Time elapsed: {elapsed_time} seconds")

print("Use pandas pyarrow engine to read the csv file time")
start_time = time.time()
df = pd.read_csv("data.csv", engine="pyarrow")
elapsed_time = time.time() - start_time
print(f"Time elapsed: {elapsed_time} seconds")

Probably it's version-independent, but I tested it with pyarrow 14.0.2 and pandas 2.1.4.

Also, I should probably say that I tested it with a pretty good CPU (a Threadripper 2990WX with 32 cores), which might be a factor in the speed-up.
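
If you want to check whether core count is the factor, PyArrow exposes the size of its thread pool and lets you cap it (a quick check, using the pyarrow.cpu_count / set_cpu_count API):

import pyarrow as pa

print(pa.cpu_count())  # size of PyArrow's CPU thread pool
pa.set_cpu_count(1)    # re-run the read benchmark single-threaded to compare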

@toni-neurosc
Collaborator Author

toni-neurosc commented Jan 20, 2024

Hi Timon, happy to help!

I checked again and I can reproduce both my results and yours. It seems to depend on file size, or perhaps on the shape of the dataframe, since my file is only 4x bigger but takes way more than 4x longer to load.

My code for testing was this:

import pandas as pd
import numpy as np
import time
from pyarrow import csv
import py_neuromodulation as nm

## Generate chunky features CSV
NUM_CHANNELS = 5
NUM_DATA = 1000000
sfreq = 1000  # Hz
feature_freq = 3  # Hz

np.random.seed(0)

data = np.random.random([NUM_CHANNELS, NUM_DATA])
stream = nm.Stream(sfreq=sfreq, data=data, verbose=False)
stream.run(parallel=True)

# Test PyArrow
def read_csv(file):
    return csv.read_csv(file).to_pandas()
    
t = time.time()
df = read_csv("sub/sub_FEATURES.csv")
print(f"Time to load PyArrow: {time.time() - t} seconds")

# Test Pandas
t = time.time()
df = pd.read_csv("sub/sub_FEATURES.csv")
print(f"Time to load Pandas: {time.time() - t} seconds")

My code generates a 963 MB CSV file with shape 9991 rows × 5227 columns, and the times are:

Time to load PyArrow: 15.755833625793457 seconds
Time to load Pandas: 8.531198978424072 seconds

When I ran your code on my computer, it generated a 276 MB file, and here PyArrow was way faster:

READ_PYARROW: True
Time elapsed: 0.6403310298919678 seconds
READ_PYARROW: False
Time elapsed: 2.7735869884490967 seconds
Use pandas pyarrow engine to read the csv file time
Time elapsed: 0.677858829498291 seconds

Then, swapping np.random.randint (integers) for np.random.random (floats) produced a 1.79 GB file, and read times were still faster for PyArrow:

READ_PYARROW: True
Time elapsed: 1.7801580429077148 seconds
READ_PYARROW: False
Time elapsed: 12.708122491836548 seconds
Use pandas pyarrow engine to read the csv file time
Time elapsed: 2.566972017288208 seconds

Just to be sure, I read the file generated by my code with your script:

READ_PYARROW: True
Time elapsed: 10.525209903717041 seconds
READ_PYARROW: False
Time elapsed: 8.005020141601562 seconds
Use pandas pyarrow engine to read the csv file time
Time elapsed: 18.455359935760498 seconds

So your script is faster than mine on the same data. Why? It turns out it's because this:

table = csv.read_csv("sub/sub_FEATURES.csv")
df = table.to_pandas()

is for some reason faster than this:

df = csv.read_csv("sub/sub_FEATURES.csv").to_pandas()

Python is weird sometimes. Anyway, even so, Pandas is a bit faster at reading the CSV generated by py_neuromodulation's stream.run(), so maybe test my code on your computer and see if it's the same for you.
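
One way to dig into the two-statement oddity is to time the Arrow parse and the to_pandas() conversion separately (a rough sketch; running it a few times also rules out OS file-cache warm-up, since whichever variant reads the file first pays the cold-cache cost):

import time
from pyarrow import csv

t = time.perf_counter()
table = csv.read_csv("sub/sub_FEATURES.csv")  # Arrow-side parse
t_read = time.perf_counter() - t

t = time.perf_counter()
df = table.to_pandas()  # Arrow -> pandas conversion
t_convert = time.perf_counter() - t

print(f"read_csv: {t_read:.2f} s, to_pandas: {t_convert:.2f} s")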

@timonmerk merged commit 702c3d3 into neuromodulation:main on Jan 27, 2024
2 checks passed
@toni-neurosc deleted the add_pyarrow_pr branch on February 3, 2024