
lookup_linke_turbidity is (still) slow #437

Closed
cedricleroy opened this issue Mar 13, 2018 · 8 comments

Comments

@cedricleroy
Contributor

Related to #368

From the line_profiler results, @wholmgren, you stated that loading the matlab file was one of the two bottlenecks.

Loading the file is pretty slow, and we are doing it each time we run the function. Could we cache the file in the module to avoid reloading it?

There are actually two ways to go about it:

  1. Loading the file when we first call lookup_linke_turbidity
CACHE = {}

[...]

def lookup_linke_turbidity(...):
    [...]
    # try to get `LinkeTurbidity` data from the `CACHE`
    mat = CACHE.get('LinkeTurbidity')
    if mat is None:
        # load `LinkeTurbidity` data and save it to the `CACHE`
        mat = scipy.io.loadmat(filepath)
        CACHE['LinkeTurbidity'] = mat
    [...]
  2. Loading the file at import time and caching it
CACHE = {}

def load_linke_turbidity():
    # scipy import (inside try/except) and filepath construction go here
    CACHE['LinkeTurbidity'] = scipy.io.loadmat(filepath)

load_linke_turbidity()  # populate the cache at import time

[...]

def lookup_linke_turbidity(...):
    [...]
    # try to get `LinkeTurbidity` data from the `CACHE`
    mat = CACHE.get('LinkeTurbidity')
    [...]
@wholmgren
Member

Maybe... A cache sounds like something that might be easy to implement initially but might also have all sorts of unexpected side effects. This is probably more of a concern in low-memory and parallel environments (I know some people use pvlib in these situations).

Is it slow because it's a matlab file? Would an h5 file be faster? Would support for a vectorized lookup be a better approach? Could you preload the turbidity data? Maybe joblib or similar could handle the caching for you?
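A minimal sketch of the joblib idea, assuming joblib is installed; the function name and cache directory below are made up for illustration:

import scipy.io
from joblib import Memory

memory = Memory("./joblib_cache", verbose=0)  # hypothetical on-disk cache location

@memory.cache
def load_linke_turbidity_data(filepath):
    # the first call reads the .mat file; later calls with the same filepath
    # are served from joblib's on-disk cache instead
    return scipy.io.loadmat(filepath)['LinkeTurbidity']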

@SunPowerMike

I'm curious if breaking the single MAT file into smaller chunks (e.g. 18 lats × 36 lons = 648 mat files), each containing a 10 deg × 10 deg set of data, would be a simple improvement. The load function would take lat/lon, and only one smaller file would be opened.
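A rough sketch of that idea, assuming the global array has been pre-split into hypothetical per-tile .mat files (one per 10 deg × 10 deg block); the file naming scheme is made up:

import scipy.io

def load_turbidity_tile(latitude, longitude):
    # snap the coordinates to the lower-left corner of their 10 deg tile
    lat0 = int(latitude // 10) * 10
    lon0 = int(longitude // 10) * 10
    filepath = 'turbidity_lat{}_lon{}.mat'.format(lat0, lon0)
    # only this one small tile file is read, instead of the full global array
    return scipy.io.loadmat(filepath)['LinkeTurbidity']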

@wholmgren
Member

Chunking might help. hdf5 supports chunking within a single file.
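For example, a sketch of writing the array as a chunked, compressed HDF5 dataset with h5py (the chunk shape and file name are illustrative only):

import h5py
import scipy.io

mat = scipy.io.loadmat('LinkeTurbidities.mat')
data = mat['LinkeTurbidity']  # last axis holds the 12 monthly values

with h5py.File('LinkeTurbidities_chunked.h5', 'w') as h5f:
    # each chunk covers a small spatial block and all 12 months, so a single
    # point lookup only has to read and decompress one chunk
    h5f.create_dataset('LinkeTurbidity', data=data,
                       chunks=(60, 60, 12),
                       compression='gzip', compression_opts=5)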

@adriesse
Member

@cwhanse
Member

cwhanse commented Mar 15, 2018

In Matlab pvlib we read this file the first time that a Linke turbidity is needed, and hold the data in memory. Reading it each time a value is needed killed performance.

If we want to see if the file format is the bottleneck, I can export the matlab file as hdf5.

@cedricleroy
Contributor Author

cedricleroy commented Mar 27, 2018

Here is what I got trying hdf5:

import datetime
import scipy.io

t1 = datetime.datetime.utcnow()
mat = scipy.io.loadmat('LinkeTurbidities.mat')
data = mat['LinkeTurbidity']
t2 = datetime.datetime.utcnow()
print(t2 - t1)
>>> 0:00:00.483048
from tables import open_file

t1 = datetime.datetime.utcnow()
lt_h5_file = open_file("LinkeTurbidities.h5")
data = lt_h5_file.root.LinkeTurbidity[1,1,:]
lt_h5_file.close()
t2 = datetime.datetime.utcnow()
print(data)
print(type(data))
print(t2 - t1)
>>> [38 38 38 40 41 41 42 42 40 39 38 38]
>>> <class 'numpy.ndarray'>
>>> 0:00:00.003001

Here is what I used for creating the h5 file:

import h5py
import scipy.io

mat = scipy.io.loadmat('LinkeTurbidities.mat')
data = mat['LinkeTurbidity']
h5f = h5py.File('LinkeTurbidities.h5', 'w')
h5f.create_dataset('LinkeTurbidity', data=data, compression="gzip", compression_opts=9)
h5f.close()

compression_opts set to 9 (the maximum) gives better compression than the matlab file (~13,600 KB vs. ~19,900 KB).

I used tables for reading the h5 file since it is what pandas uses. It looks like it is an optional dependency though, so it won't come with pandas and will need to be installed by users (a little like what pvlib does with scipy.io today).
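A sketch of that deferred optional-import pattern, mirroring how the scipy.io import is currently handled; the signature and error message here are illustrative:

def lookup_linke_turbidity(time, latitude, longitude, filepath=None):
    try:
        import tables
    except ImportError:
        raise ImportError('The Linke turbidity lookup table requires tables; '
                          'please install it, or supply turbidity values directly.')
    # ... open the h5 file with tables and index it by latitude/longitude ...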

@wholmgren
Member

That's great! I think an optional tables dependency is the right way to go. tables is widely used and it's a smaller dependency than scipy.

You might experiment with the compression_opts value. I find that 4 or 5 is usually the sweet spot for speed vs. file size, but it depends on the data.
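A quick sketch one could use to compare compression levels, assuming the LinkeTurbidity array is already in memory as data (as in the script above); the file names are illustrative:

import os
import time

import h5py

for level in (1, 4, 5, 9):
    fname = 'LinkeTurbidities_gzip{}.h5'.format(level)
    with h5py.File(fname, 'w') as h5f:
        h5f.create_dataset('LinkeTurbidity', data=data,
                           compression='gzip', compression_opts=level)
    t0 = time.time()
    with h5py.File(fname, 'r') as h5f:
        h5f['LinkeTurbidity'][1, 1, :]  # same point lookup as the benchmark above
    print(level, os.path.getsize(fname) // 1024, 'KB', round(time.time() - t0, 4), 's')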

Can you make a pull request for this improvement?

@wholmgren
Member

Closed by #442.
