In [None]:
from pathlib import Path
import sys; sys.path.insert(0, str(Path('src').absolute()))
import os
cwd = os.getcwd()

import ast
import inspect

from IPython.display import Markdown as md

def flink(title: str, name: str):
    [modname, fname] = name.split('.', maxsplit=1)
    module = globals()[modname]
    
    func = module
    for p in fname.split('.'):
        func = getattr(func, p)
    file = Path(inspect.getsourcefile(func)).relative_to(cwd)
    _, number = inspect.getsourcelines(func)
    return f'[{title}]({file}:{number})'
    
dmd = lambda x: display(md(x))

import cachew
import cachew.tests.test_cachew as tests

In [None]:
dmd(f'<!--THIS FILE IS AUTOGENERATED BY README.ipynb. Use generate-readme to update it.-->')

# Cachew: quick NamedTuple/dataclass cache
TLDR: cachew can persistently cache any sequence (an [Iterator](https://docs.python.org/3/library/typing.html#typing.Iterator)) over [NamedTuples](https://docs.python.org/3/library/typing.html#typing.NamedTuple) or [dataclasses](https://docs.python.org/3/library/dataclasses.html) into an sqlite database on your disk.

Imagine you're working on a data analysis pipeline for some huge dataset, say, extracting urls and their titles from a Wikipedia archive.
Parsing it takes hours, but the archive is presumably not updated very frequently.
Normally, to get around this, you would have to serialize your pipeline results along with some sort of hash (e.g. md5) of the input files,
compare the hash on the next query and return the stored results if it matches, or discard them and compute new ones if the hash (i.e. the input data) changed.
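
A minimal sketch of that manual approach, using only the standard library, might look like the following. The names `file_md5` and `cached_by_hand` and the JSON cache layout are made up for illustration and are not part of cachew.

```python
import hashlib
import json
from pathlib import Path
from typing import Callable, List


def file_md5(path: Path) -> str:
    """Hash the input file so we can tell when the source data changed."""
    return hashlib.md5(path.read_bytes()).hexdigest()


def cached_by_hand(
    input_file: Path,
    cache_file: Path,
    compute: Callable[[Path], List[dict]],
) -> List[dict]:
    """Hypothetical helper: reuse stored results while the input hash is unchanged."""
    current = file_md5(input_file)
    if cache_file.exists():
        stored = json.loads(cache_file.read_text())
        if stored['hash'] == current:
            # input unchanged: return the previously serialized results
            return stored['results']
    # no cache yet, or the input changed: recompute and overwrite the cache
    results = compute(input_file)
    cache_file.write_text(json.dumps({'hash': current, 'results': results}))
    return results
```

Every pipeline step that needs caching would have to repeat some variation of this bookkeeping.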

This is pretty tedious to do every time you need to memoize some data; it clutters your code with boilerplate and distracts you from your main task.
This library is meant to solve that problem with a single line of decorator code.

TODO move this^ to Example section?

# Installing

TODO


In [None]:
dmd('# Example')
doc = inspect.getdoc(cachew.cachew)
doc = doc.split('Usage example:')[-1].lstrip()
dmd(f"""```python
{doc}
```""")

In [None]:
dmd('# Features')
types = [f'`{c.__name__}`' for c in cachew.PRIMITIVES.keys()]
dmd('Supported primitive types: ' + ', '.join(types))
dmd(f"""
* supports Optional TODO
* supports {flink('nested datatypes', 'tests.test_nested')}
* supports return type inference: {flink('1', 'tests.test_return_type_inference')}, {flink('2', 'tests.test_return_type_mismatch')}
* detects {flink('datatype schema changes', 'tests.test_schema_change')} and discards old data automatically            
""")
# * custom hash function TODO example with mtime?

In [None]:
[composite] = [x
 for x in ast.walk(ast.parse(inspect.getsource(cachew))) 
 if isinstance(x, ast.FunctionDef) and x.name == 'composite_hash'
]

link = f'{Path(cachew.__file__).relative_to(cwd)}:{composite.lineno}'

dmd(f'''
# How it works
Basically, your data objects get {flink('flattened out', 'cachew.NTBinder.to_row')}
and python types are mapped {flink('onto sqlite types and back', 'cachew.NTBinder.iter_columns')}.

When the function is called, cachew [computes the hash]({link}) of your function's arguments
and compares it against the previously stored hash value.

If they match, it deserializes and yields whatever is stored in the cache database. If the hashes mismatch, the original data provider is called and the new data is stored along with the new hash.
''')

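The hash-and-compare flow described above can be illustrated with a small standalone sketch built on the standard `sqlite3` module. This is not cachew's actual implementation: the `Link` type, the `cached_links` function, and the `hash` and `cache` table names are all assumptions made for the example, and the caller passes in a precomputed hash string.

```python
import sqlite3
from typing import Callable, Iterator, NamedTuple


class Link(NamedTuple):
    # hypothetical data type: each field becomes one sqlite column
    url: str
    title: str


def cached_links(
    db: sqlite3.Connection,
    arg_hash: str,
    provider: Callable[[], Iterator[Link]],
) -> Iterator[Link]:
    """Hypothetical sketch: yield cached rows if the hash matches, else recompute."""
    db.execute('CREATE TABLE IF NOT EXISTS hash (value TEXT)')
    db.execute('CREATE TABLE IF NOT EXISTS cache (url TEXT, title TEXT)')
    stored = db.execute('SELECT value FROM hash').fetchone()
    if stored is not None and stored[0] == arg_hash:
        # hashes match: rebuild the objects from the stored rows
        for row in db.execute('SELECT url, title FROM cache ORDER BY rowid'):
            yield Link(*row)
        return
    # hash mismatch: drop stale data, call the provider, store the new rows
    db.execute('DELETE FROM hash')
    db.execute('DELETE FROM cache')
    for link in provider():
        # a flat NamedTuple is already a valid row
        db.execute('INSERT INTO cache VALUES (?, ?)', tuple(link))
        yield link
    # the new hash is only recorded once the provider has been fully consumed
    db.execute('INSERT INTO hash VALUES (?)', (arg_hash,))
    db.commit()
```

Calling `list(cached_links(db, 'h1', provider))` twice with the same hash should invoke `provider` only once; changing the hash string forces a recomputation.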
# Inspiration
Mainly this was inspired by [functools.lru_cache](https://docs.python.org/3/library/functools.html#functools.lru_cache), which is excellent if you need to cache something within a single python process run.

## Implementation

* why tuples and dataclasses?
  
  Tuples are natural in Python for quickly grouping together return results.
  `NamedTuple` and `dataclass` specifically provide a very straightforward and self-documenting way to represent a bit of data in Python.
  The very compact syntax makes them extremely convenient even for one-off communication between a couple of functions.
   
  * TODO [2019-07-30 Tue 21:02] some link to data class
    
* why not [pickle](https://docs.python.org/3/library/pickle.html)?

  Pickling is a bit heavyweight for a plain data class. There are many reports of pickle being slower than even JSON, and it's also a security risk, since unpickling untrusted data can execute arbitrary code. Lastly, it can only be loaded via Python.

* why `sqlite` database for storage?

  It's pretty efficient, and a sequence of namedtuples maps onto database rows in a very straightforward manner.

* why not `pandas.DataFrame`?

  DataFrames are great and can be serialised to csv or pickled.
  They are good to have as one of the ways you can interface with your data, however it's hardly convenient to reason about them abstractly due to their dynamic nature.
  They also can't be nested.
  
* why not [ORM](https://en.wikipedia.org/wiki/Object-relational_mapping)?
  
  ORMs tend to be pretty invasive, which might complicate your scripts or even ruin performance. It's also somewhat overkill for such a specific purpose.

  * E.g. [SQLAlchemy](https://docs.sqlalchemy.org/en/13/orm/tutorial.html#declare-a-mapping) requires you to use custom sqlalchemy-specific types and inherit from a base class.
    Also it doesn't support nested types.

* why not [marshmallow](https://marshmallow.readthedocs.io/en/3.0/nesting.html)?
  
  Marshmallow is a common way to map data into a db-friendly format, but it requires an explicit schema, which is overhead when you already have it in the form of type annotations.
  
  * TODO https://github.com/justanr/marshmallow-annotations has support for NamedTuples:
    https://marshmallow-annotations.readthedocs.io/en/latest/ext/namedtuple.html#namedtuple-type-api
  * https://pypi.org/project/marshmallow-dataclass/
  * TODO mention that in code?

* TODO [2019-07-30 Tue 19:00] post some link to data classes?
   
# Examples
* [2019-07-30 Tue 20:15] e.g. if the hash is the date, you can ensure you only serve one piece of data a day
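
The date-as-hash idea can be sketched without cachew at all. Below, an in-memory dictionary stands in for the cache database, and `daily_hash` and `cached_daily` are hypothetical names invented for this illustration. The current date is passed in explicitly so the behaviour is easy to check.

```python
from datetime import date
from typing import Callable, Dict, List, Tuple


def daily_hash(today: date) -> str:
    """The 'hash' is just the ISO date, so it changes once a day."""
    return today.isoformat()


# in-memory stand-in for the cache database: name -> (hash, data)
_cache: Dict[str, Tuple[str, List[str]]] = {}


def cached_daily(name: str, provider: Callable[[], List[str]], today: date) -> List[str]:
    """Hypothetical helper: call the provider at most once per calendar day."""
    key = daily_hash(today)
    hit = _cache.get(name)
    if hit is not None and hit[0] == key:
        # same day: serve the stored data
        return hit[1]
    # first call of the day: recompute and remember the new hash
    data = provider()
    _cache[name] = (key, data)
    return data
```

Repeated calls with the same `today` return the stored data, and the first call on a new date triggers a fresh computation.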


