In [None]:
from pathlib import Path
import sys; sys.path.insert(0, str(Path('src').absolute()))
import os
cwd = os.getcwd()

import ast
import inspect

from IPython.display import Markdown as md

def flink(title: str, name: str=None):
    # name is method name
    if name is None:
        name = title.replace('`', '') # meh
    split = name.rsplit('.', maxsplit=1)
    if len(split) == 1:
        modname = split[0]
        fname = None
    else:
        [modname, fname] = split
    module = sys.modules[modname]

    file = Path(module.__file__).relative_to(cwd)

    if fname is not None:
        func = module
        for p in fname.split('.'):
            func = getattr(func, p)
        _, number = inspect.getsourcelines(func)
        numbers = f'#L{number}'
    else:
        numbers = ''
    return f'[{title}]({file}{numbers})'

dmd = lambda x: display(md(x.strip()))

import cachew
import cachew.extra
import cachew.marshall.cachew
import cachew.tests.test_cachew as tests
sys.modules['tests'] = tests  # meh

In [None]:
dmd(f'''
<!--
THIS FILE IS AUTOGENERATED BY README.ipynb.
Ideally you should edit README.ipynb and use 'generate-readme' to produce README.md. 
But it's okay to edit README.md too directly if you want to fix something -- I can run generate-readme myself later.
-->
''')

# What is Cachew?
TLDR: cachew lets you **cache function calls** into an sqlite database on your disk with a **single decorator** (similar to [functools.lru_cache](https://docs.python.org/3/library/functools.html#functools.lru_cache)). The difference from `functools.lru_cache` is that cached data is persisted between program runs, so the next time you call your function, it only has to read from the cache.
Cache is **invalidated automatically** if your function's arguments change, so you don't have to think about maintaining it.

In order to be cacheable, your function needs to return a simple data type, or an [Iterator](https://docs.python.org/3/library/typing.html#typing.Iterator) over such types.

A simple type is defined as:

- primitive: `str`/`int`/`float`/`bool`
- JSON-like types (`dict`/`list`/`tuple`)
- `datetime`
- `Exception` (useful for [error handling](https://beepb00p.xyz/mypy-error-handling.html#kiss) )
- [NamedTuples](https://docs.python.org/3/library/typing.html#typing.NamedTuple)
- [dataclasses](https://docs.python.org/3/library/dataclasses.html)


That allows cachew to **automatically infer the schema from type hints** ([PEP 526](https://www.python.org/dev/peps/pep-0526)), so you don't have to think about serializing/deserializing.
Thanks to type hints, you don't need to annotate your classes with any special decorators, inherit from some special base classes, etc., as is often the case for serialization libraries.
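
To illustrate the idea, here is a stdlib-only sketch (not cachew's actual implementation) of how the fields of a `NamedTuple` can be read back from its type hints at runtime. `Link` and `describe_schema` are hypothetical names used only for this example.

```python
from datetime import datetime
from typing import NamedTuple, Optional, get_type_hints


class Link(NamedTuple):
    # hypothetical record type, used only for illustration
    url: str
    title: Optional[str]
    seen: datetime


def describe_schema(cls: type) -> dict:
    """Map each field name to its annotated type."""
    return dict(get_type_hints(cls))


schema = describe_schema(Link)
print(schema)
```

No decorators or base classes beyond the plain `NamedTuple` are needed for the annotations to be discoverable.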

## Motivation

I often find myself processing big chunks of data, merging data together, computing some aggregates on it or extracting the few bits I'm interested in. While I'm trying to utilize the REPL as much as I can, some things are still fragile and often you just have to rerun the whole thing in the process of development. This can be frustrating if data parsing and processing takes seconds, let alone minutes in some cases.

The conventional way of dealing with it is serializing results along with some sort of hash (e.g. md5) of input files,
comparing on the next run and returning cached data if nothing changed.

Simple as it sounds, it is pretty tedious to do every time you need to memoize some data; it clutters your code with routine boilerplate and distracts you from your main task.
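
For a sense of how much routine this involves, here is a minimal sketch of the manual approach using only the standard library. The names `file_md5` and `cached_parse` and the JSON cache file layout are assumptions made for this example, not part of cachew.

```python
import hashlib
import json
from pathlib import Path
from typing import Callable, List


def file_md5(path: Path) -> str:
    """Hash the contents of the input file."""
    return hashlib.md5(path.read_bytes()).hexdigest()


def cached_parse(src: Path, cache: Path, parse: Callable[[Path], List[str]]) -> List[str]:
    """Return cached results if the input hash is unchanged, otherwise recompute."""
    digest = file_md5(src)
    if cache.exists():
        stored = json.loads(cache.read_text())
        if stored['hash'] == digest:
            return stored['data']  # input unchanged: reuse cached results
    data = parse(src)
    # store the results together with the hash of the input they came from
    cache.write_text(json.dumps({'hash': digest, 'data': data}))
    return data
```

Even this toy version only handles lists of strings and a single input file; every new datatype would need its own serialization code.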


# Examples
## Processing Wikipedia
Imagine you're working on a data analysis pipeline for some huge dataset, say, extracting URLs and their titles from a Wikipedia archive.
Parsing it (`extract_links` function) takes hours; however, as long as the archive is the same, you will always get the same results. So it would be nice to be able to cache the results somehow.


With this library you can achieve it through a single `@cachew` decorator.

In [None]:
doc = inspect.getdoc(cachew.cachew)
doc = doc.split('Usage example:')[-1].lstrip()
dmd(f"""```python
{doc}
```""")

When you call `extract_links` with the same archive, you start getting results in a matter of milliseconds, as fast as sqlite reads it.

When you use a newer archive, `archive_path` changes, which makes cachew invalidate the old cache and recompute it, so you don't need to think about maintaining it separately.

## Incremental data exports
This is my most common use case for cachew, which I'll illustrate with an example.

I'm using an [environment sensor](https://bluemaestro.com/products/product-details/bluetooth-environmental-monitor-and-logger) to log stats about temperature and humidity.
Data is synchronized via Bluetooth into an sqlite database, which is easy to access. However, the sensor has limited memory (e.g. 1000 latest measurements).
That means that I end up with a new database every few days, each of them containing only a slice of data I need, e.g.:

    ...
    20190715100026.db
    20190716100138.db
    20190717101651.db
    20190718100118.db
    20190719100701.db
    ...

To access **all** of historic temperature data, I have two options:

- Go through all the data chunks every time I want to access them and 'merge' into a unified stream of measurements, e.g. something like:
  
      def measurements(chunks: List[Path]) -> Iterator[Measurement]:
          for chunk in chunks:
              # read measurements from 'chunk' and yield unseen ones

  This is very **easy, but slow** and you waste CPU for no reason every time you need data.

- Keep a 'master' database and write code to merge chunks into it.

  This is very **efficient, but tedious**:
  
  - requires serializing/deserializing data -- boilerplate
  - requires manually managing sqlite database -- error prone, hard to get right every time
  - requires careful scheduling, ideally you want to access new data without having to refresh cache

  
Cachew gives the best of both worlds and makes it both **easy and efficient**. The only thing you have to do is to decorate your function:

    @cachew      
    def measurements(chunks: List[Path]) -> Iterator[Measurement]:
        # ...
        
- as long as `chunks` stay the same, the data stays the same, so you always read from the sqlite cache, which is very fast
- you don't need to maintain the database, the cache is automatically refreshed when `chunks` change (i.e. you got new data)

  All the complexity of handling the database is hidden in the `cachew` implementation.
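
For reference, the body of the merging function from the first option could look roughly like the sketch below. It assumes each chunk contains a table `data(dt TEXT, temp REAL)` with ISO-formatted timestamps; that schema and the `Measurement` fields are made up for this example and the real sensor database may differ.

```python
import sqlite3
from datetime import datetime
from pathlib import Path
from typing import Iterator, List, NamedTuple, Set


class Measurement(NamedTuple):
    dt: datetime
    temp: float


def measurements(chunks: List[Path]) -> Iterator[Measurement]:
    # assumption: every chunk has a table `data(dt TEXT, temp REAL)`
    seen: Set[datetime] = set()
    for chunk in sorted(chunks):
        conn = sqlite3.connect(chunk)
        try:
            for dt_s, temp in conn.execute('SELECT dt, temp FROM data ORDER BY dt'):
                dt = datetime.fromisoformat(dt_s)
                if dt in seen:
                    continue  # overlapping chunks repeat older measurements
                seen.add(dt)
                yield Measurement(dt=dt, temp=temp)
        finally:
            conn.close()
```

Without caching, every call reopens and rescans all chunks; putting `@cachew` on top of the function is what lets repeated calls read from the cache instead.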

In [None]:
[composite] = [x
 for x in ast.walk(ast.parse(inspect.getsource(cachew))) 
 if isinstance(x, ast.FunctionDef) and x.name == 'composite_hash'
]

link = f'{Path(cachew.__file__).relative_to(cwd)}#L{composite.lineno}'

dmd(f'''
# How it works

- first your objects get {flink('converted', 'cachew.marshall.cachew.CachewMarshall')} into a simpler JSON-like representation 
- after that, they are mapped into byte blobs via [`orjson`](https://github.com/ijl/orjson).

When the function is called, cachew [computes the hash of your function's arguments]({link})
and compares it against the previously stored hash value.
    
- If they match, it deserializes and yields whatever is stored in the cache database
- If the hash mismatches, the original function is called and new data is stored along with the new hash
''')
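
The hash check described in this section can be sketched in a few lines of standard-library Python. This is a toy model, not cachew's code: the table names, the `cached_call` helper, the use of `repr` as the hash and of stdlib `json` instead of `orjson` are all assumptions made for the example.

```python
import json
import sqlite3
from typing import Callable, Iterator, List


def cached_call(conn: sqlite3.Connection, func: Callable[..., List[dict]], *args) -> Iterator[dict]:
    """Reuse stored rows if the argument hash matches, otherwise recompute and store."""
    conn.execute('CREATE TABLE IF NOT EXISTS hash(value TEXT)')
    conn.execute('CREATE TABLE IF NOT EXISTS cache(data BLOB)')
    new_hash = repr(args)  # stand-in for a real hash of the arguments
    row = conn.execute('SELECT value FROM hash').fetchone()
    if row is not None and row[0] == new_hash:
        # hashes match: deserialize and yield whatever is stored
        for (blob,) in conn.execute('SELECT data FROM cache ORDER BY rowid'):
            yield json.loads(blob)
        return
    # mismatch: call the original function and store new data with the new hash
    conn.execute('DELETE FROM hash')
    conn.execute('DELETE FROM cache')
    for item in func(*args):
        conn.execute('INSERT INTO cache VALUES (?)', (json.dumps(item).encode('utf8'),))
        yield item
    conn.execute('INSERT INTO hash VALUES (?)', (new_hash,))
    conn.commit()
```

Note that the new hash is written only after the function has been fully consumed, so an interrupted run does not leave behind a cache that looks valid.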

In [None]:
dmd('# Features')
types = [f'`{t}`' for t in ['str', 'int', 'float', 'bool', 'datetime', 'date', 'Exception']]
dmd(f"""
* automatic schema inference: {flink('1', 'tests.test_return_type_inference')}, {flink('2', 'tests.test_return_type_mismatch')}
* supported types:    

    * primitive: {', '.join(types)}
    
      See {flink('tests.test_types')}, {flink('tests.test_primitive')}, {flink('tests.test_dates')}, {flink('tests.test_exceptions')}
    * {flink('@dataclass and NamedTuple', 'tests.test_dataclass')}
    * {flink('Optional', 'tests.test_optional')} types
    * {flink('Union', 'tests.test_union')} types
    * {flink('nested datatypes', 'tests.test_nested')}
    
* detects {flink('datatype schema changes', 'tests.test_schema_change')} and discards old data automatically            
""")
# * custom hash function TODO example with mtime?

# Performance
Updating the cache has some overhead, but how much depends on how complicated your datatype is in the first place, so I'd suggest measuring if you're not sure.

When reading from the cache, all that happens is reading blobs from sqlite, decoding them as JSON, and mapping them onto your target datatype, so the overhead depends on each of these steps.

It would almost certainly make your program faster if your computations take more than several seconds.
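
If you want to measure, a small helper like the one below is usually enough. `timed` is a hypothetical name for this sketch, not something cachew provides.

```python
import time
from typing import Callable, Tuple, TypeVar

T = TypeVar('T')


def timed(func: Callable[[], T]) -> Tuple[T, float]:
    """Run func once and return (result, elapsed seconds)."""
    start = time.perf_counter()
    result = func()
    return result, time.perf_counter() - start
```

Wrapping a call to your cached function in `timed` twice (with the iterator fully consumed, e.g. via `list`) gives a rough comparison between the first run, which computes and writes the cache, and the second run, which only reads it.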

You can find some of my performance tests in [benchmarks/](benchmarks) dir, and the tests themselves in [src/cachew/tests/marshall.py](src/cachew/tests/marshall.py).

In [None]:
dmd(f"""
# Using
See {flink('docstring', 'cachew.cachew')} for up-to-date documentation on parameters and return types. 
You can also use {flink('extensive unit tests', 'tests')} as a reference.
    
Some useful (but optional) arguments of `@cachew` decorator:
    
* `cache_path` can be a directory, or a callable that {flink('returns a path', 'tests.test_callable_cache_path')} and depends on function's arguments.
    
   By default, `settings.DEFAULT_CACHEW_DIR` is used.
    
* `depends_on` is a function which determines whether your inputs have changed, and the cache needs to be invalidated.
    
   By default it just uses the string representation of the arguments; you can also specify a custom callable.
    
   For instance, it can be used to {flink('discard cache', 'tests.test_custom_hash')} if the input file was modified.
    
* `cls` is the type that would be serialized.

   By default, it is inferred from return type annotations, but can be specified explicitly if you don't control the code you want to cache.    
""")

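As an illustration of a custom `depends_on`, the sketch below builds a value that changes whenever a file is modified. It assumes that `depends_on` is called with the same arguments as the decorated function; check the docstring to confirm this for your version. `mtime_hash` is a hypothetical helper name.

```python
from pathlib import Path


def mtime_hash(path: Path) -> str:
    """Changes whenever the file is modified, even if the path stays the same."""
    return f'{path}.{path.stat().st_mtime_ns}'
```

Under that assumption it would be passed as `@cachew(depends_on=mtime_hash)` on a function that takes a single `path` argument.
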
# Installing
The package is available on [PyPI](https://pypi.org/project/cachew/).

    pip3 install --user cachew
    
## Developing
I'm using [tox](tox.ini) to run tests, and [GitHub Actions](.github/workflows/main.yml) for CI.

# Implementation

* why NamedTuples and dataclasses?
  
  `NamedTuple` and `dataclass` provide a very straightforward and self-documenting way to represent data in Python.
  Their very compact syntax makes them extremely convenient even for one-off means of communicating between a couple of functions.
   
  If you want to find out more about why you should use more dataclasses in your code, I suggest these links:
  
  - [What are data classes?](https://stackoverflow.com/questions/47955263/what-are-data-classes-and-how-are-they-different-from-common-classes)
  - [basic data classes](https://realpython.com/python-data-classes/#basic-data-classes)
   
* why not `pandas.DataFrame`?

  DataFrames are great and can be serialised to csv or pickled.
  They are good to have as one of the ways you can interface with your data, however they are hardly convenient to reason about abstractly due to their dynamic nature.
  They also can't be nested.

* why not [ORM](https://en.wikipedia.org/wiki/Object-relational_mapping)?
  
  ORMs tend to be pretty invasive, which might complicate your scripts or even ruin performance. They are also somewhat overkill for such a specific purpose.

  * E.g. [SQLAlchemy](https://docs.sqlalchemy.org/en/13/orm/tutorial.html#declare-a-mapping) requires you to use custom SQLAlchemy-specific types and inherit from a base class.
    Also it doesn't support nested types.
    
* why not [pickle](https://docs.python.org/3/library/pickle.html) or [`marshmallow`](https://marshmallow.readthedocs.io/en/3.0/nesting.html) or `pydantic`?

  Pickling is kinda heavyweight for a plain data class, and it's slower than just using JSON. Lastly, it can only be loaded via Python, whereas JSON + sqlite has numerous bindings and tools to explore and interface.

  Marshmallow is a common way to map data into db-friendly format, but it requires an explicit schema, which is overhead when you already have it in the form of type annotations. I've looked at existing projects that utilize type annotations, but didn't find any covering all I wanted:
  
  * https://marshmallow-annotations.readthedocs.io/en/latest/ext/namedtuple.html#namedtuple-type-api
  * https://pypi.org/project/marshmallow-dataclass
 
  I wrote up an extensive review of alternatives I considered: see [doc/serialization.org](doc/serialization.org).
  So far looks like only `cattrs` comes somewhere close to the feature set I need, but still not quite.

* why `sqlite` database for storage?

  It's pretty efficient and iterables (i.e. sequences) map onto database rows in a very straightforward manner, plus we get some concurrency guarantees.

  There is also a somewhat experimental backend which uses a simple file (jsonl-like) for storage, you can use it via `@cachew(backend='file')`, or via `settings.DEFAULT_BACKEND`.
  It's slightly faster than sqlite judging by benchmarks, but unless you're caching millions of items this shouldn't really be noticeable.
  
  It would also be interesting to experiment with in-RAM storages.

  I had [a go](https://github.com/karlicoss/cachew/issues/9) at Redis as well, but performance for writing to cache was pretty bad. That said, it could still be interesting for distributed caching if you don't care too much about performance.


# Tips and tricks
## Optional dependency
You can benefit from `cachew` even if you don't want to bloat your app's dependencies. Just use the following snippet:

In [None]:
import cachew.extra
dmd(f"""```python
{inspect.getsource(cachew.extra.mcachew)}
```""")

Now you can use `@mcachew` in place of `@cachew`, and be certain things don't break if `cachew` is missing.

## Settings

In [None]:
dmd(f'''
{flink('cachew.settings')} exposes some parameters that allow you to control `cachew` behaviour:
- `ENABLE`: set to `False` if you want to disable caching without removing the decorators (useful for testing and debugging).
   You can also use {flink('cachew.extra.disabled_cachew')} context manager to do it temporarily.
- `DEFAULT_CACHEW_DIR`: override to set a different base directory. The default is the "user cache directory" (see [appdirs docs](https://github.com/ActiveState/appdirs#some-example-output)).
- `THROW_ON_ERROR`: by default, cachew is defensive and simply attempts to call the original function on caching issues.
   Set to `True` to catch errors earlier.
- `DEFAULT_BACKEND`: currently supported are `sqlite` and `file` (file is somewhat experimental, although should work too). 

''')

## Updating this readme
This is a literate readme, implemented as a Jupyter notebook: [README.ipynb](README.ipynb). To update the (autogenerated) [README.md](README.md), use the [generate-readme](generate-readme) script.