Jum

"jum" means "remember" in Thai

An alternative to Joblib's Memory for caching Python function results to files

It uses the dill package to pickle objects and to hash function arguments, so it supports any kind of object that dill can serialize.
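As a rough illustration of the idea (not jum's actual internals), a cache key can be built by serializing the call with dill and hashing the resulting bytes:

import hashlib
import dill

def make_cache_key(fn_name, args, kwargs):
    # minimal sketch: serialize the whole call with dill, then hash the bytes;
    # make_cache_key is a hypothetical helper, not part of jum's API
    payload = dill.dumps((fn_name, args, kwargs))
    return hashlib.sha1(payload).hexdigest()

make_cache_key('a_long_running_function', ([1, 2, 3],), {})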

Use cases

import numpy as np
import jum

@jum.cache(cache_dir='.jum')
def a_long_running_function(array):
    # ... do some CPU-intensive work ...
    return array.sum()

a_long_running_function(np.random.rand(1000, 1000))

# to configure the compression level (0-9, default 2)
@jum.cache(cache_dir='.jum', compresslevel=5)
def another_long_running_function(array):
    return array.mean()

Installation

pip install jum

Features

  1. Supports almost any kind of object, including numpy's ndarray, which is its main use case.
  2. Faster, with a lighter and smaller cache footprint than Joblib's Memory.
  3. Supports file compression using Python's gzip library.
  4. Uses SHA-1 as the main hashing algorithm, providing a large 160-bit hash space.
  5. Now uses xxhash to hash ndarrays specifically, for a speed boost (see the sketch below).
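A minimal sketch of how such a dispatch could look, assuming a hypothetical hash_argument helper (jum's internals may differ):

import hashlib
import dill
import numpy as np
import xxhash

def hash_argument(x):
    # fast path: C-contiguous ndarrays expose the buffer protocol, so xxhash
    # can consume them directly; everything else goes through dill + SHA-1
    if isinstance(x, np.ndarray) and x.flags['C_CONTIGUOUS']:
        return xxhash.xxh64(x).hexdigest()
    return hashlib.sha1(dill.dumps(x)).hexdigest()

hash_argument(np.zeros((1000, 1000)))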

To be improved

  • Use dill to hash the function body instead of the function's code object, because some functions' code cannot be retrieved, especially in the Python console (see the sketch after this list).
  • The function's file path may be unavailable in the Python console; provide sensible defaults for it.
  • Use a faster hash such as xxhash. (Update) After profiling, the bottleneck turned out to be the pickling step, not the hashing itself.
  • Favor a slightly slower hash (the cost is negligible) that is safer against collisions.
  • By hashing ndarrays directly via xxhash, ndarray hashing performance is increased ten-fold.
  • Add a verbose mode showing the time spent hashing (the main overhead of caching).
  • Add support for F-contiguous ndarrays: by transposing them we can still hash with xxhash.
  • Take function dependencies (i.e. functions that this function calls) into account.
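For the first item, one possible fallback (sketched here as an assumption, not jum's current behaviour) is to hash the function's source lines when they can be retrieved:

import hashlib
import inspect

def hash_function_source(fn):
    # hypothetical helper: inspect.getsource raises OSError when the source is
    # unavailable (e.g. functions defined in the Python console), so fall back
    # to the qualified name in that case
    try:
        src = inspect.getsource(fn)
    except (OSError, TypeError):
        src = fn.__qualname__
    return hashlib.sha1(src.encode('utf-8')).hexdigest()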

Known Problems

  • Null-argument problem when a function has no arguments.
  • Using dill to hash the function is overkill; it is far too sensitive. The plan is to fall back to the function's source lines.
  • ValueError: ndarray is not C-contiguous occurs for some ndarrays; not all ndarrays can be fed to xxhash directly, so these are handled by pickle for now (see the sketch below).
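One possible workaround for the last issue, shown only as a sketch and not jum's actual fix, is to force a C-contiguous copy before handing the array to xxhash:

import numpy as np
import xxhash

def hash_any_ndarray(arr):
    # hypothetical workaround: np.ascontiguousarray copies only when the array
    # is not already C-contiguous, so F-contiguous or sliced arrays also work
    return xxhash.xxh64(np.ascontiguousarray(arr)).hexdigest()

hash_any_ndarray(np.random.rand(100, 100).T)  # F-contiguous view; no ValueError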
