Save CSVs with PyArrow #280
Conversation
@toni-neurosc First of all, thanks a lot for contributing to py_neuromodulation, that's really great! Regarding this PR: I could reproduce the timing of saving, but contrary to your results I found that reading a .csv file is also much faster with pyarrow. This is true when reading with pyarrow directly, but I also get the same speed-up when reading through pd.read_csv with the pyarrow engine. Could you share how you tested the pyarrow csv read? I tested it using this minimal example:
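A minimal sketch of this kind of comparison (not the original snippet; the file name and the integer test data are assumptions based on the replies below):

```python
# Sketch: generate an integer-valued CSV, then time three ways of reading it.
import time

import numpy as np
import pandas as pd
from pyarrow import csv

FILE = "test.csv"  # hypothetical file name

# Write a test file (integer data; the replies below mention np.random.randint).
pd.DataFrame(np.random.randint(0, 1000, size=(1_000_000, 30))).to_csv(FILE, index=False)

t = time.time()
df_arrow = csv.read_csv(FILE).to_pandas()  # pyarrow reader
print(f"pyarrow:                 {time.time() - t:.2f} s")

t = time.time()
df_pa_engine = pd.read_csv(FILE, engine="pyarrow")  # pandas with the pyarrow engine
print(f"pandas (pyarrow engine): {time.time() - t:.2f} s")

t = time.time()
df_pandas = pd.read_csv(FILE)  # pandas default engine
print(f"pandas (default engine): {time.time() - t:.2f} s")
```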
This is probably version-independent, but I tested with pyarrow 14.0.2 and pandas 2.1.4. I should also mention that I tested on a fairly powerful CPU (a Threadripper 2990WX with 32 cores), which might be a factor in the speed-up.
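For anyone reproducing the comparison, the installed versions can be checked quickly; a trivial sketch:

```python
# Sketch: confirm the library versions used for the benchmark.
import pandas as pd
import pyarrow as pa

print(f"pandas {pd.__version__}, pyarrow {pa.__version__}")
# The comment above reports pandas 2.1.4 and pyarrow 14.0.2.
```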
Hi Timon, happy to help! I checked again and I can reproduce both my results and yours. It seems to depend on file size, or perhaps on the shape of the dataframe, since my file is only 4x bigger but takes way more than 4x longer to load. My code for testing was this:

```python
import pandas as pd
import numpy as np
import time
from pyarrow import csv
import py_neuromodulation as nm

## Generate chunky features CSV
NUM_CHANNELS = 5
NUM_DATA = 1000000
sfreq = 1000  # Hz
feature_freq = 3  # Hz

np.random.seed(0)
data = np.random.random([NUM_CHANNELS, NUM_DATA])

stream = nm.Stream(sfreq=sfreq, data=data, verbose=False)
stream.run(parallel=True)

# Test PyArrow
def read_csv(file):
    return csv.read_csv(file).to_pandas()

t = time.time()
df = read_csv("sub/sub_FEATURES.csv")
print(f"Time to load PyArrow: {time.time() - t} seconds")

# Test Pandas
t = time.time()
df = pd.read_csv("sub/sub_FEATURES.csv")
print(f"Time to load Pandas: {time.time() - t} seconds")
```

My code generates a 963 MB CSV file with shape 9991 rows × 5227 columns, and for this file Pandas was the faster reader for me.
When I ran your code on my computer, it generated a 276 MB file, and here PyArrow was way faster.
Then, swapping np.random.randint (integers) for np.random.random (floats) produced a 1.79 GB file, and the read times were still faster for PyArrow.
Just to be sure, I also read the file generated by my code with your script, and your script was faster than mine on the same data. Why? It turns out that this:

```python
table = csv.read_csv("sub/sub_FEATURES.csv")
df = table.to_pandas()
```

is for some reason faster than this:

```python
df = csv.read_csv("sub/sub_FEATURES.csv").to_pandas()
```

Python is weird sometimes. Anyway, even so, Pandas is a bit faster at reading the CSV generated by py_neuromodulation's stream.run(), so maybe test my code on your computer and see if you get the same results.
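As a hedged aside: the two forms should do the same work, so single-run differences are likely noise or OS file-cache effects; one way to check is best-of-N timing. A minimal sketch, reusing the file path from the example above (the best_of helper is hypothetical):

```python
# Sketch: best-of-N timing of both read variants to smooth out caching effects.
import time

from pyarrow import csv

PATH = "sub/sub_FEATURES.csv"  # path from the example above

def best_of(fn, runs=3):
    # Return the fastest wall-clock time over several runs.
    times = []
    for _ in range(runs):
        t0 = time.perf_counter()
        fn()
        times.append(time.perf_counter() - t0)
    return min(times)

def read_two_step():
    # Bind the Arrow table to a name first, then convert.
    table = csv.read_csv(PATH)
    return table.to_pandas()

def read_chained():
    # Chain the conversion directly onto the read.
    return csv.read_csv(PATH).to_pandas()

print(f"two-step: {best_of(read_two_step):.2f} s")
print(f"chained:  {best_of(read_chained):.2f} s")
```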
I noticed that saving the output file sub_FEATURES.csv was unreasonably slow, and apparently this is a Pandas issue, so I changed the saving function to use PyArrow instead and got a 7.5x speed-up in saving on a 1 GB, all-np.float64 dataset.
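For context, a minimal sketch of what a PyArrow-based save can look like (illustrative only; the function name is hypothetical and the actual change is in this PR's diff):

```python
# Sketch: writing a DataFrame with pyarrow instead of DataFrame.to_csv.
import pandas as pd
import pyarrow as pa
from pyarrow import csv

def save_features(df: pd.DataFrame, path: str) -> None:
    # preserve_index=False drops the pandas row index, matching the removal
    # of index_col=0 on the read side described below.
    table = pa.Table.from_pandas(df, preserve_index=False)
    csv.write_csv(table, path)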
Reading CSVs, however, is faster with Pandas, so I did not change that; I only removed the index_col=0 argument, since PyArrow does not save an extra column with the row index (which seemed unnecessary to me anyway).
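A sketch of the read side under this change (file names hypothetical), including handling for files written by older versions, as covered in Note 1 below:

```python
# Sketch: reading features CSVs after this change; no index_col=0 is needed
# because no index column is written anymore.
import pandas as pd

df = pd.read_csv("sub/sub_FEATURES.csv")  # new-format file

# Files written by older versions still carry the row index as an unnamed
# first column ("Unnamed: 0"); it can be dropped safely if present.
old = pd.read_csv("old_sub/sub_FEATURES.csv")
old = old.drop(columns="Unnamed: 0", errors="ignore")
```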
Note 1: CSVs generated with previous versions of the program will now load with an extra unnamed column containing the row indexes; this should not break compatibility as long as columns are accessed by name rather than by index.
Note 2: I haven't checked the minimum PyArrow version needed for this very basic functionality; I just added the version installed in my environment to the dependencies, but it could probably go much further back.

PS. Pandas will at some point in the future add an option to save CSVs with PyArrow, as discussed here, but it does not look like it is being worked on at this time.