
lookup_linke_turbidity is (still) slow #437

Closed
cedricleroy opened this issue Mar 13, 2018 · 8 comments

Comments

@cedricleroy
Contributor

Related to #368

From the line_profiler results, @wholmgren, you stated that loading the matlab file was one of the two bottlenecks.

Loading the file is pretty slow, and we are doing it each time we run the function. Could we cache the file in the module to avoid reloading it?

There are actually two ways to go about it:

  1. Loading the file when we first call lookup_linke_turbidity
CACHE = {}

[...]

def lookup_linke_turbidity(...):
    [...]
    # try to get `LinkeTurbidity` data from the `CACHE`
    mat = CACHE.get('LinkeTurbidity')
    if mat is None:
        # load `LinkeTurbidity` data and save it to the `CACHE`
        mat = scipy.io.loadmat(filepath)
        CACHE['LinkeTurbidity'] = mat
    [...]
  2. Loading the file at import time and caching it
CACHE = {}

def load_linke_turbidity():
    # scipy import (inside try/except) and filepath construction go here
    CACHE['LinkeTurbidity'] = scipy.io.loadmat(filepath)

load_linke_turbidity()  # populate the cache at import time

[...]

def lookup_linke_turbidity(...):
    [...]
    # try to get `LinkeTurbidity` data from the `CACHE`
    mat = CACHE.get('LinkeTurbidity')
    [...]
@wholmgren
Member

Maybe... A cache sounds like something that might be easy to implement initially but might also have all sorts of unexpected side effects. This is probably more of a concern in low-memory and parallel environments (I know some people use pvlib in these situations).

Is it slow because it's a matlab file? Would an h5 file be faster? Would support for a vectorized lookup be a better approach? Could you preload the turbidity data? Maybe joblib or similar could handle the caching for you?
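A minimal sketch of the joblib idea, assuming joblib is installed; the function name and cache directory below are made up for illustration:

import scipy.io
from joblib import Memory

memory = Memory("./joblib_cache", verbose=0)  # hypothetical on-disk cache location

@memory.cache
def load_linke_turbidity_data(filepath):
    # the first call reads the .mat file; later calls with the same filepath
    # are served from joblib's on-disk cache instead
    return scipy.io.loadmat(filepath)['LinkeTurbidity']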

@SunPowerMike

I'm curious if breaking the single MAT file into smaller chunks (e.g. 18 lats × 36 lons = 648 mat files), each containing a 10 deg × 10 deg set of data, would be a simple improvement. The load function would take lat/lon, and only one smaller file would be opened.
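A rough sketch of that idea, assuming the global array has been pre-split into hypothetical per-tile .mat files (one per 10 deg × 10 deg block); the file naming scheme is made up:

import scipy.io

def load_turbidity_tile(latitude, longitude):
    # snap the coordinates to the lower-left corner of their 10 deg tile
    lat0 = int(latitude // 10) * 10
    lon0 = int(longitude // 10) * 10
    filepath = 'turbidity_lat{}_lon{}.mat'.format(lat0, lon0)
    # only this one small tile file is read, instead of the full global array
    return scipy.io.loadmat(filepath)['LinkeTurbidity']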

@wholmgren
Member

Chunking might help. hdf5 supports chunking within a single file.
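For example, a sketch of writing the array as a chunked, compressed HDF5 dataset with h5py (the chunk shape and file name are illustrative only):

import h5py
import scipy.io

mat = scipy.io.loadmat('LinkeTurbidities.mat')
data = mat['LinkeTurbidity']  # last axis holds the 12 monthly values

with h5py.File('LinkeTurbidities_chunked.h5', 'w') as h5f:
    # each chunk covers a small spatial block and all 12 months, so a single
    # point lookup only has to read and decompress one chunk
    h5f.create_dataset('LinkeTurbidity', data=data,
                       chunks=(60, 60, 12),
                       compression='gzip', compression_opts=5)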

@adriesse
Member

@cwhanse
Member

cwhanse commented Mar 15, 2018

In Matlab pvlib we read this file the first time that a Linke turbidity is needed, and hold the data in memory. Reading it each time a value is needed killed performance.

If we want to see if the file format is the bottleneck, I can export the matlab file as hdf5.

@cedricleroy
Contributor Author

cedricleroy commented Mar 27, 2018

Here is what I got trying hdf5:

import datetime
import scipy.io

t1 = datetime.datetime.utcnow()
mat = scipy.io.loadmat('LinkeTurbidities.mat')
data = mat['LinkeTurbidity']
t2 = datetime.datetime.utcnow()
print(t2 - t1)
>>> 0:00:00.483048
from tables import open_file

t1 = datetime.datetime.utcnow()
lt_h5_file = open_file("LinkeTurbidities.h5")
data = lt_h5_file.root.LinkeTurbidity[1,1,:]
lt_h5_file.close()
t2 = datetime.datetime.utcnow()
print(data)
print(type(data))
print(t2 - t1)
>>> [38 38 38 40 41 41 42 42 40 39 38 38]
>>> <class 'numpy.ndarray'>
>>> 0:00:00.003001

Here is what I used for creating the h5 file:

import h5py
import scipy.io

mat = scipy.io.loadmat('LinkeTurbidities.mat')
data = mat['LinkeTurbidity']
h5f = h5py.File('LinkeTurbidities.h5', 'w')
h5f.create_dataset('LinkeTurbidity', data=data, compression="gzip", compression_opts=9)
h5f.close()

compression_opts set to 9 (the maximum) gives better compression than the matlab file (~13,600 KB vs. ~19,900 KB).

I used tables for reading the h5 file since it is what pandas uses. It looks like it is an optional dependency though, so it won't come with pandas and will need to be installed by users (a little like what pvlib does with scipy.io today).
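A sketch of that deferred optional-import pattern, mirroring how the scipy.io import is currently handled; the signature and error message here are illustrative:

def lookup_linke_turbidity(time, latitude, longitude, filepath=None):
    try:
        import tables
    except ImportError:
        raise ImportError('The Linke turbidity lookup table requires tables; '
                          'please install it, or supply turbidity values directly.')
    # ... open the h5 file with tables and index it by latitude/longitude ...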

@wholmgren
Member

That's great! I think an optional tables dependency is the right way to go. tables is widely used and it's a smaller dependency than scipy.

You might experiment with the compression_opts value. I find that 4 or 5 is usually the sweet spot for speed vs. file size, but it depends on the data.
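A quick sketch one could use to compare compression levels, assuming the LinkeTurbidity array is already in memory as data (as in the script above); the file names are illustrative:

import os
import time

import h5py

for level in (1, 4, 5, 9):
    fname = 'LinkeTurbidities_gzip{}.h5'.format(level)
    with h5py.File(fname, 'w') as h5f:
        h5f.create_dataset('LinkeTurbidity', data=data,
                           compression='gzip', compression_opts=level)
    t0 = time.time()
    with h5py.File(fname, 'r') as h5f:
        h5f['LinkeTurbidity'][1, 1, :]  # same point lookup as the benchmark above
    print(level, os.path.getsize(fname) // 1024, 'KB', round(time.time() - t0, 4), 's')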

Can you make a pull request for this improvement?

@wholmgren
Member

Closed by #442.
