lookup_linke_turbidity is (still) slow #437
Related to #368.

From the line_profiler output: @wholmgren, you stated that loading the matlab file was one of the two bottlenecks in lookup_linke_turbidity. Loading the file is pretty slow, and we are doing it each time we run the function. Could we cache the file in the module to avoid reloading it? There are actually two ways to see that.

Comments
Maybe... A cache sounds like something that might be easy to implement initially but might also have all sorts of unexpected side effects. This is probably more of a concern in low-memory and parallel environments (I know some people use pvlib in these situations). Is it slow because it's a MATLAB file? Would an h5 file be faster? Would support for a vectorized lookup be a better approach? Could you preload the turbidity data? Maybe joblib or similar could handle the caching for you?
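For concreteness, here is a minimal sketch of the module-level caching idea; the helper name `_load_turbidity_data` and the file path are hypothetical, not pvlib's actual API:

```python
import functools

import scipy.io


@functools.lru_cache(maxsize=1)
def _load_turbidity_data(filepath='LinkeTurbidities.mat'):
    # Hypothetical helper: load the Linke turbidity table once and keep it
    # in memory; repeated calls with the same path return the cached array.
    return scipy.io.loadmat(filepath)['LinkeTurbidity']
```

The trade-off is the one raised above: the whole array stays resident for the life of the process, which may matter in low-memory or many-worker setups.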
I'm curious whether breaking the single MAT file into smaller chunks (e.g. 18 lats x 36 lons = 648 .mat files), each containing a 10-degree x 10-degree set of data, would be a simple improvement. The load function would take lat/lon, and only one smaller file would be opened.
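As a rough sketch of that idea (the tile naming scheme and helper below are hypothetical):

```python
import scipy.io


def load_turbidity_tile(latitude, longitude):
    # Hypothetical layout: one .mat file per 10-degree x 10-degree tile,
    # indexed from the north-west corner, e.g. 'LinkeTurbidity_08_17.mat'.
    lat_idx = min(int((90 - latitude) // 10), 17)    # 0..17
    lon_idx = min(int((longitude + 180) // 10), 35)  # 0..35
    fname = 'LinkeTurbidity_{:02d}_{:02d}.mat'.format(lat_idx, lon_idx)
    return scipy.io.loadmat(fname)['LinkeTurbidity']
```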
Chunking might help. HDF5 supports chunking within a single file.
Or a memory-mapped numpy array.
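For illustration, a minimal sketch of that approach, assuming the table is first exported from the .mat file to a .npy file (the file name here is made up):

```python
import numpy as np
import scipy.io

# One-time conversion of the MATLAB table to a plain .npy file.
mat = scipy.io.loadmat('LinkeTurbidities.mat')
np.save('LinkeTurbidities.npy', mat['LinkeTurbidity'])

# Later lookups open the array memory-mapped, so only the indexed
# slices are actually read from disk.
data = np.load('LinkeTurbidities.npy', mmap_mode='r')
values = data[1, 1, :]
```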
In MATLAB pvlib we read this file the first time a Linke turbidity value is needed and hold the data in memory; reading it each time a value was needed killed performance. If we want to see whether the file format is the bottleneck, I can export the MATLAB file as HDF5.
Here is what I have been trying. First, timing a full load of the .mat file with scipy:

```python
import datetime

import scipy.io

t1 = datetime.datetime.utcnow()
mat = scipy.io.loadmat('LinkeTurbidities.mat')
data = mat['LinkeTurbidity']
t2 = datetime.datetime.utcnow()
print(t2 - t1)
# 0:00:00.483048
```

Then reading a slice from an HDF5 file with tables:

```python
import datetime

from tables import open_file

t1 = datetime.datetime.utcnow()
lt_h5_file = open_file("LinkeTurbidities.h5")
data = lt_h5_file.root.LinkeTurbidity[1, 1, :]
lt_h5_file.close()
t2 = datetime.datetime.utcnow()
print(data)        # [38 38 38 40 41 41 42 42 40 39 38 38]
print(type(data))  # <class 'numpy.ndarray'>
print(t2 - t1)     # 0:00:00.003001
```

Here is what I used for creating the HDF5 file:

```python
import h5py
import scipy.io

mat = scipy.io.loadmat('LinkeTurbidities.mat')
data = mat['LinkeTurbidity']
h5f = h5py.File('LinkeTurbidities.h5', 'w')
h5f.create_dataset('LinkeTurbidity', data=data,
                   compression="gzip", compression_opts=9)
h5f.close()
```
I used h5py with gzip compression to create the file and tables to read it.
That's great! I think an optional tables dependency is the right way to go. tables is widely used and it's a smaller dependency than scipy. You might experiment with the compression_opts value. I find that 4 or 5 is usually the sweet spot for speed vs. file size, but it depends on the data. Can you make a pull request for this improvement?
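If it helps with that experiment, a quick way to compare compression levels for this dataset (the output file names are just placeholders):

```python
import os

import h5py
import scipy.io

data = scipy.io.loadmat('LinkeTurbidities.mat')['LinkeTurbidity']

# Write one file per gzip level and report its size on disk.
for level in (1, 4, 5, 9):
    fname = 'LinkeTurbidities_gzip{}.h5'.format(level)
    with h5py.File(fname, 'w') as f:
        f.create_dataset('LinkeTurbidity', data=data,
                         compression='gzip', compression_opts=level)
    print(level, os.path.getsize(fname), 'bytes')
```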
Closed by #442.