<a href="https://colab.research.google.com/github/scarfboy/wetsuite-dev/blob/main/notebooks/intermediate/data_localdata.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Purpose of this notebook

An introduction to storing collections of things using the `localdata` module that we provide, mostly its `LocalKV` class.

## tl;dr for the programmers

It's a key-value store stored in a file, with some self-enforced type checking.

You can get one like 
- `mystore = wetsuite.helpers.localdata.open_store('mystore.db', str,str)`

and then do 
- `mystore.put('foo', 'bar')`
- `mystore.get('foo')`
- `mystore.delete('foo')`

There is a helper that lets you do HTTP fetches into these.


## The longer introduction

**Why?**

It would be useful to have a mix of
- saving things to disk
- that is easy to use - loading and saving are few lines of code
- working with moderately large data
    - allows iteration of large data (without loading all in RAM)
    - allows _reasonably_ fast lookup of arbitrary items (without loading all in RAM or scanning the entire file)
- not requiring an external dependency that needs installation and administration (usually only an admin can, or wants to, set up a database service)
- that allows storage of text, binary data, and _perhaps_ more
- that makes it easy to find specific known items

This makes it useful for you to store a reasonable amount of data with less worry about such details, 
and is also useful for us to distribute larger data/datasets in the same way.

**A little more practically**

Say you download a bunch of webpages, or gigabytes of PDFs. 

Yes, storing that into separate files is perfectly functional. 
The filesystem _is_ a simple database, after all.

And it checks _most_ of the boxes above - except ease of use / few lines of code.
There is a bunch of boilerplate to write for each use:
you have to think about (not) hardcoding the paths where you've actually put things,
about storing unicode _or_ bytes in files,
about what you name each file (because most filesystems won't allow filenames longer than 255 characters or bytes),
which also means that finding specific known items requires knowing what kind of naming you did,
so you may sometimes wish to _also_ keep track of the renaming you did for that reason,
and more such details.

And in some cases, none of that is hard, and this all stays easy.

Yet in other cases that immediately becomes a messy amount of extra work.
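To give a sense of that extra work, here is a minimal sketch of the separate-files approach using only the standard library. The names `url_to_path` and `save_page`, and the choice to hash URLs into filenames, are assumptions made for illustration; none of this is part of wetsuite.

```python
import hashlib
import json
import tempfile
from pathlib import Path

def url_to_path(base_dir: Path, url: str) -> Path:
    """Map a URL to a short, filesystem-safe filename by hashing it."""
    digest = hashlib.sha256(url.encode('utf-8')).hexdigest()
    return base_dir / f'{digest}.bin'

def save_page(base_dir: Path, url: str, data: bytes) -> Path:
    """Write data under its hashed filename, and record which URL that name stands for."""
    base_dir.mkdir(parents=True, exist_ok=True)
    path = url_to_path(base_dir, url)
    path.write_bytes(data)
    # the hash cannot be reversed, so we keep a separate index of filename -> URL
    index_path = base_dir / 'index.json'
    index = json.loads(index_path.read_text(encoding='utf-8')) if index_path.exists() else {}
    index[path.name] = url
    index_path.write_text(json.dumps(index), encoding='utf-8')
    return path

base = Path(tempfile.mkdtemp())                 # throwaway directory for this example
long_url = 'http://example.com/' + 'a' * 500    # too long to use directly as a filename
saved = save_page(base, long_url, b'<html>...</html>')
print(saved.name, len(saved.name))
```

Even this small version already has to decide on a naming scheme, a place to keep the files, and a side index to map names back to URLs.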

So we offer a disk-backed [key-value store](https://en.wikipedia.org/wiki/Key%E2%80%93value_database)
(meaning each unique key is associated with a value - and no more structure than that)
that lets you easily persist certain kinds of data, from relatively brief code.

It's fairly minimal, but that's part of the point. 

In [1]:
import pprint, datetime
import wetsuite.helpers.localdata

### basic key-value use

Not very useful yet, but a minimal example:

In [2]:
# given
mystore = wetsuite.helpers.localdata.open_store('mystore.db', key_type=str,value_type=str)   # we tell it the keys should be string values, and so should the values

# we can do:
mystore.put('fish', 'one')
mystore.put('fork', 'test')
mystore.get('fish')
mystore.delete('fish')

In [3]:
# There are also some functions like those present on python dicts, like
print( 'keys:                    ', list(mystore.keys())    )   # list() because we actually get a view   (much like python 3 dict behaviour)
print( 'values:                  ', list(mystore.values())  ) 
print( 'items:                   ', list(mystore.items())   )
print( 'len:                     ', len(mystore)      )
print( '"fork" is a present key: ', ('fork' in mystore) )       # via __contains__()

keys:                     ['a', 'fork']
values:                   ['b', 'test']
items:                    [('a', 'b'), ('fork', 'test')]
len:                      2
"fork" is a present key:  True


### Downloading into a store

While collecting data online, we tend to download a multitude of items.

If you have a bunch of URLs, want to fetch the corresponding documents,
and want to cache them locally to avoid re-fetching on each run of a script,
then our `cached_fetch( store, url )` can help.

If that URL was fetched before, it returns the stored copy. If not, it fetches the document, saves it, then returns it. It also tells you which of the two happened.
<!-- (...assuming you don't need auth or other session state, and assuming you don't need them removed based on time and/or cacheing rules. We may fix some of that later) -->
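The fetch-or-reuse logic described above can be sketched in a few lines. This is not wetsuite's implementation: `cached_fetch_sketch` is a hypothetical name, a plain dict stands in for the store, and the fetching function is passed in (here a fake one) so the example runs without network access.

```python
def cached_fetch_sketch(store, url, fetch):
    """Return (data, came_from_cache). 'store' is any dict-like object, 'fetch' any function url -> bytes."""
    if url in store:
        return store[url], True    # seen before: reuse the stored copy
    data = fetch(url)              # not seen before: fetch it...
    store[url] = data              # ...remember it...
    return data, False             # ...and return it

calls = []
def fake_fetch(url):
    """Stand-in for a real HTTP fetch; records that it was called."""
    calls.append(url)
    return b'<html>hello</html>'

cache = {}
first  = cached_fetch_sketch(cache, 'http://example.com/', fake_fetch)
second = cached_fetch_sketch(cache, 'http://example.com/', fake_fetch)
print(first)
print(second)
print('fetches done:', len(calls))
```

The second call is answered from the cache, so the fetch function runs only once.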

In [4]:
download_store = wetsuite.helpers.localdata.open_store('download_store.db', key_type=str, value_type=bytes)  # values should be bytes, because cached_fetch stores downloads while still in bytestring form

page_bytes, came_from_cache = wetsuite.helpers.localdata.cached_fetch( download_store, 'http://example.com/' )
print( f'Came from cache: {came_from_cache}\n' )
print( page_bytes.decode('utf-8') )

Came from cache: True

<!doctype html>
<html>
<head>
    <title>Example Domain</title>

    <meta charset="utf-8" />
    <meta http-equiv="Content-type" content="text/html; charset=utf-8" />
    <meta name="viewport" content="width=device-width, initial-scale=1" />
    <style type="text/css">
    body {
        background-color: #f0f0f2;
        margin: 0;
        padding: 0;
        font-family: -apple-system, system-ui, BlinkMacSystemFont, "Segoe UI", "Open Sans", "Helvetica Neue", Helvetica, Arial, sans-serif;
        
    }
    div {
        width: 600px;
        margin: 5em auto;
        padding: 2em;
        background-color: #fdfdff;
        border-radius: 0.5em;
        box-shadow: 2px 3px 7px 2px rgba(0,0,0,0.02);
    }
    a:link, a:visited {
        color: #38488f;
        text-decoration: none;
    }
    @media (max-width: 700px) {
        div {
            margin: 0 auto;
            width: auto;
        }
    }
    </style>    
</head>

<body>
<div>
    <h1>Example Domain

## Typing

What types of Python objects could we store in there?
In theory we could hand in various Python types, but will they be that same type once they are stored and fetched out again?
Let's find out.

In [5]:
typestore = wetsuite.helpers.localdata.open_store('typetest.db', str,None)   # keys:strings,   values: not checking so it does whatever sqlite decides

typestore.put('one', 1)
typestore.put('two', 2.2)
typestore.put('three', '3')
typestore.put('four', b'4')
typestore.put('five', datetime.datetime.now())

pprint.pprint( list( typestore.items() ) )

[('five', '2023-10-10 21:03:49.896992'),
 ('four', b'4'),
 ('one', '1'),
 ('three', '3'),
 ('two', '2.2')]


If you peek into the database from a shell using `sqlite3 typetest.db` (at the path that `open_store()` chose, see the store's `path`), and then run `SELECT key, value, typeof(value) FROM kv`, it turns out that everything is `text` except for one `blob`.

tl;dr for the users: 
- Whenever your use is served by storing only `bytes` (e.g. for downloaded files) or `str`, sticking to that keeps things simple
- We can do more, but that gets confusing  (see the rest of this cell for why)
- If you want to store more complex things, particularly nested python data structures, see the rest of this notebook


To the programmers: 
- SQLite is [dynamically typed](https://en.wikipedia.org/wiki/Dynamic_typing), so will
  - store whatever type it decides to store, without adhering to the type you have declared for the column
    - and your language's sqlite library may have a say in that too
  - return the type that got stored, without converting
    - and your language's sqlite library may have a say in that too

- Python's sqlite3 library already seems to help with [coercing](https://en.wikipedia.org/wiki/Type_conversion), roughly to `str` unless it's `bytes`

This combination makes it easy to mix things and confuse yourself.
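One mechanism that produces this kind of result is SQLite's column type affinity, which you can see with plain `sqlite3` from the standard library. The two table names below are made up for the demonstration, and whether wetsuite declares its value column as `TEXT` is an assumption; the sketch only shows that the same Python values can be stored as different types depending on how the column was declared.

```python
import sqlite3

conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE declared_text (key TEXT, value TEXT)')   # value column declared as TEXT
conn.execute('CREATE TABLE undeclared    (key TEXT, value)')        # value column with no declared type

samples = [('one', 1), ('two', 2.2), ('three', '3'), ('four', b'4')]
for table in ('declared_text', 'undeclared'):
    conn.executemany(f'INSERT INTO {table} VALUES (?, ?)', samples)

def stored_types(table):
    """Ask SQLite what storage type each value actually ended up with."""
    rows = conn.execute(f'SELECT key, typeof(value) FROM {table} ORDER BY rowid')
    return dict(rows.fetchall())

print( stored_types('declared_text') )
print( stored_types('undeclared') )
```

With the `TEXT` declaration the int and float are stored as `text` and only the bytes value stays a `blob`; without a declared type they are stored as `integer` and `real`.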

This is why opening a store asks you to decide up front what type to put in, 
and why it actively checks, as a sanity check that _we_ are being consistent.

(You _can_ disable that checking by handing in None for both.)
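As a rough sketch of what such self-enforced checking amounts to, assuming a dict-backed stand-in rather than the real class: `CheckedStore` is a hypothetical name, and raising `TypeError` is an assumption about how a mismatch would be reported.

```python
class CheckedStore:
    """Dict-backed stand-in for a store that checks key and value types on put()."""
    def __init__(self, key_type=None, value_type=None):
        self.key_type = key_type        # None means: do not check
        self.value_type = value_type
        self._data = {}

    def put(self, key, value):
        if self.key_type is not None and not isinstance(key, self.key_type):
            raise TypeError(f'key should be {self.key_type.__name__}, got {type(key).__name__}')
        if self.value_type is not None and not isinstance(value, self.value_type):
            raise TypeError(f'value should be {self.value_type.__name__}, got {type(value).__name__}')
        self._data[key] = value

    def get(self, key):
        return self._data[key]

checked = CheckedStore(key_type=str, value_type=bytes)
checked.put('page', b'<html>')          # accepted: str key, bytes value
rejected = None
try:
    checked.put('page', '<html>')       # refused: the value is str, not bytes
except TypeError as e:
    rejected = str(e)
print(rejected)
```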

Caveat: the types it checks aren't stored (yet?),
so you had better be consistent about opening the same store with the same types later.

# Serializing more complex data.

You will often end up creating structured data, at first in the form of nested python types, 
e.g. lists, dicts, ints, floats, strings, bytes.

SQLite has no reason to natively understand whatever python is doing,
so we need to pack it into something we get back from it verbatim.

Options include
- Python's pickle - stores [more than just the primitives](https://docs.python.org/3/library/pickle.html#what-can-be-pickled-and-unpickled), though has some security concerns
- JSON - a common interchange format, though it doesn't have a type for `date`/`datetime`
- MsgPack - faster than JSON, though doesn't do datetime out of the box either

Each has its own advantages and drawbacks.

Implementation-wise, these add little more than an encode on `put()`, and decode on `get()`.
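A minimal sketch of that encode-on-`put()`, decode-on-`get()` idea, using JSON from the standard library. `JsonOverBytesStore` is a hypothetical name, and a plain dict stands in for the underlying bytes store, so this is not the actual wetsuite class.

```python
import json
import datetime

class JsonOverBytesStore:
    """Stand-in store: values are encoded to JSON bytes on put() and decoded on get()."""
    def __init__(self):
        self._raw = {}    # stands in for a store whose values are bytes

    def put(self, key, value):
        self._raw[key] = json.dumps(value).encode('utf-8')

    def get(self, key):
        return json.loads(self._raw[key].decode('utf-8'))

json_store = JsonOverBytesStore()
json_store.put('one', {'a': ['b'], 'n': 3})
roundtripped = json_store.get('one')
print( roundtripped, type(roundtripped) )

# JSON has no datetime type, so the standard encoder refuses it
try:
    json_store.put('when', datetime.datetime.now())
    datetime_ok = True
except TypeError:
    datetime_ok = False
print( 'datetime accepted:', datetime_ok )
```

The nested structure comes back as a `dict` rather than as text, while the `datetime` is rejected unless you convert it yourself (e.g. to an ISO string) first.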

In [2]:
msgpack_store = wetsuite.helpers.localdata.MsgpackKV('msgpacktest.db')
msgpack_store.put('one', {'a':['b']})
print( msgpack_store.get('one') )
type( msgpack_store.get('one') ) # the type of what comes out is...   (mostly to point out that it does not come back as text)

{'a': ['b']}


dict

# open_store() or instantiating directly?

The above uses `open_store()`, and you probably want to as well. 

The _actual_ class being used is LocalKV, so we could equivalently do:
```
    mystore = wetsuite.helpers.localdata.LocalKV('mystore.db', key_type=str, value_type=str)
```
but note that 
 - if you hand in a relative path it'll be placed in the working directory (i.e. where you started python / changed to since), which may mean you sprinkle these files around, and confuse yourself 
 - if you hand in an absolute path it won't be very portable between systems

So we have a third option in `open_store()`, e.g.
```
    mystore = wetsuite.helpers.localdata.open_store('deleteme.db', key_type=str, value_type=str)
```
which mostly just figures out a path inside your user profile (alongside other things wetsuite stores), so that handing in the same (directory-less) name here should always open the same file.

In [None]:
# see where it decided to put it 
store = wetsuite.helpers.localdata.open_store('deleteme.db', key_type=str, value_type=str)
store.path
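The idea of mapping a bare name to one fixed directory can be sketched as follows. This is a guess at the general shape, not wetsuite's code: `resolve_store_path` is a hypothetical name, the base directory here is a temporary stand-in for a directory in your user profile, and leaving names that include a directory untouched is an assumption.

```python
import os
import tempfile
from pathlib import Path

def resolve_store_path(name: str, base_dir: Path) -> Path:
    """Bare names go under base_dir; names that include a directory are used as given."""
    if os.path.dirname(name):
        return Path(name)
    base_dir.mkdir(parents=True, exist_ok=True)
    return base_dir / name

profile_dir = Path(tempfile.mkdtemp()) / 'stores'   # stand-in for a directory in the user profile

first_time  = resolve_store_path('deleteme.db', profile_dir)
second_time = resolve_store_path('deleteme.db', profile_dir)
explicit    = resolve_store_path('/tmp/elsewhere/deleteme.db', profile_dir)

print( first_time == second_time )   # the same bare name always resolves to the same file
print( explicit )
```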

# More notes

## Limitations

...you should probably not be using this expecting speed or parallelism.
- We default to committing immediately, so many small writes are limited by IOPS.
- Parallelism is limited - multiple processes can read, but only one can write at a time

Assume this is basically an append-only database.
If you remove items, assume it will fragment and that the file stays large,
until you do a vacuum, which basically rewrites the entire file.
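You can see this file-size behaviour with plain `sqlite3` from the standard library. The sketch below uses its own throwaway database file and a made-up table rather than a wetsuite store, and it sets `auto_vacuum` to its usual default explicitly so that the outcome does not depend on how SQLite was built.

```python
import os
import sqlite3
import tempfile

db_path = os.path.join(tempfile.mkdtemp(), 'vacuumtest.db')
conn = sqlite3.connect(db_path)
conn.execute('PRAGMA auto_vacuum = NONE')    # the usual default, set explicitly for predictability
conn.execute('CREATE TABLE kv (key TEXT PRIMARY KEY, value BLOB)')
conn.executemany('INSERT INTO kv VALUES (?, ?)',
                 ((str(i), b'x' * 10_000) for i in range(200)))   # roughly 2MB of values
conn.commit()
size_filled = os.path.getsize(db_path)

conn.execute('DELETE FROM kv')               # remove everything again
conn.commit()
size_after_delete = os.path.getsize(db_path)

conn.execute('VACUUM')                       # rewrites the file without the freed pages
size_after_vacuum = os.path.getsize(db_path)
conn.close()

print( size_filled, size_after_delete, size_after_vacuum )
```

Deleting the rows leaves the file as large as it was; only the vacuum makes it shrink.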

## No syntax hackery
If you want to store something, you always use put().

Sure we could save you a few keystrokes by making it act like a dict,
but that starts to be a leaky abstraction particularly once we get to storing more complex values,
because altering those would not get mirrored in the database (for hard-to-explain reasons). 

Also, doing bulk things in larger transactions couldn't be exposed that way,
so now we'd have two interfaces with different capabilities.

So we make you do it, to keep things clear. 
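The point about altered values not being mirrored can be illustrated with a small sketch. `DictLikeStore` is a hypothetical dict-style wrapper, with a plain dict of JSON strings standing in for the database; it is not something wetsuite provides.

```python
import json

class DictLikeStore:
    """Hypothetical dict-style wrapper around a store that serializes its values."""
    def __init__(self):
        self._raw = {}    # stands in for the database file

    def __setitem__(self, key, value):
        self._raw[key] = json.dumps(value)

    def __getitem__(self, key):
        return json.loads(self._raw[key])    # decoding builds a new object on every access

d = DictLikeStore()
d['one'] = {'a': ['b']}

d['one']['a'].append('c')     # looks like it alters the stored value...
after_in_place = d['one']
print( after_in_place )       # ...but only a temporary decoded copy was changed

value = d['one']              # what actually works: fetch, alter, store again
value['a'].append('c')
d['one'] = value
after_store_again = d['one']
print( after_store_again )
```

With a real dict the first attempt would have worked, which is exactly why dict-like syntax would be misleading here.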