This notebook demonstrates a few techniques for [composing stores](https://github.com/i2mint/dol/discussions/25).

# Matching extension to specific value decoder

Here, we'll look at the (common) problem where you have a folder of files with various extensions, 
each requiring a different value decoder. 
How can you make a single store that looks at a file's extension and decodes its bytes accordingly?

There are two ways to do this. 

The first is to compose multiple stores, each focused on a particular extension.

The second is to use the `postget` argument of `dol.wrap_kvs`. 

The right solution depends on the context, but I'd start with the `postget` solution most of the time.


## Some test data we'll use

In [5]:
from dol import Files
from dol_cookbook import misc_files_path

s = Files(misc_files_path)
list(s)

['save_here.json',
 'nested.json',
 'AI and Tempo Estimation.pdf',
 'JAMMIN-GPT.pdf',
 'simple.docx',
 'Release Notes.docx',
 'simple.txt']

## Making a text store

In [27]:
from dol import wrap_kvs, filt_iter, Pipe

text_wrap = Pipe(
    filt_iter(filt=lambda x: x.endswith('.txt')),
    wrap_kvs(value_decoder=lambda obj: obj.decode('utf-8'))
)

text_store = text_wrap(s)
list(text_store)

['simple.txt']

In [24]:
store = text_store

k = next(iter(store))
print(f"{k=}")
v = store[k]
print(f"{v=}")

k='simple.txt'
v='This is\nJust some text\nTo test things'
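
To make the mechanics concrete, here is a rough standard-library analogue of the filter-then-decode wrapper used above. `FilteredDecodedView` and the sample `raw` dict are hypothetical names made up for illustration; this is not how `dol` implements `filt_iter` or `wrap_kvs`, only a sketch of the behaviour seen in the outputs.

```python
from collections.abc import Mapping


class FilteredDecodedView(Mapping):
    """Read-only view of a bytes store: keeps some keys, decodes values on read.

    Hypothetical stand-in for Pipe(filt_iter(...), wrap_kvs(value_decoder=...)).
    """

    def __init__(self, store, filt, value_decoder):
        self._store = store
        self._filt = filt
        self._value_decoder = value_decoder

    def __getitem__(self, k):
        # In this sketch, keys rejected by the filter behave as if absent
        if k not in self._store or not self._filt(k):
            raise KeyError(k)
        return self._value_decoder(self._store[k])

    def __iter__(self):
        return (k for k in self._store if self._filt(k))

    def __len__(self):
        return sum(1 for _ in self)


# Made-up in-memory data standing in for a folder of files
raw = {'simple.txt': b'This is\nJust some text', 'nested.json': b'{"version": 37}'}

text_view = FilteredDecodedView(
    raw,
    filt=lambda k: k.endswith('.txt'),
    value_decoder=lambda b: b.decode('utf-8'),
)
```

Listing `text_view` yields only the `.txt` key, and reading it returns a `str` rather than `bytes`.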


## Making a json store

In [32]:
import json

json_wrap = Pipe(
    filt_iter(filt=lambda x: x.endswith('.json')),
    wrap_kvs(value_decoder=lambda obj: json.loads(obj))
)

json_store = json_wrap(s)
list(json_store)

['save_here.json', 'nested.json']

In [33]:
store = json_store

k = next(iter(store))
print(f"{k=}")
v = store[k]
print(f"{v=}")

k='save_here.json'
v={'the': 'dict', 'you': 'want', 'to': 'save'}


## Making a pdf store

In [85]:
from pdfdol.base import bytes_to_pdf_text_pages

pdf_wrap = Pipe(
    filt_iter(filt=lambda x: x.endswith('.pdf')),
    bytes_to_pdf_text_pages,
    wrap_kvs(value_decoder='\n\n------------\n\n'.join),
)

pdf_store = pdf_wrap(s)
list(pdf_store)

['AI and Tempo Estimation.pdf', 'JAMMIN-GPT.pdf']

In [86]:
store = pdf_store

k = next(iter(store))
print(f"{k=}")
v = store[k]
print(f"{v=}")

k='AI and Tempo Estimation.pdf'
v=' 1 AI and Tempo Estimation: A Review Geoff Luck1 1 Centre of Excellence in Music, Mind, Body and Brain, Department of Music, Art and Culture Studies, University of Jyväskylä, Finland.  1geoff.luck@jyu.fi   Abstract The author’s goal in this paper is to explore how artificial intelligence (AI) has been utilized to inform our understanding of and ability to estimate at scale a critical aspect of musical creativity — musical tempo. The central importance of tempo to musical creativity can be seen in how it is used to express specific emotions (Eerola and Vuoskoski 2013), suggest particular musical styles (Li and Chan 2011), influence perception of expression (Webster and Weir 2005) and mediate the urge to move one’s body in time to the music (Burger et al. 2014). Traditional tempo estimation methods typically detect signal periodicities that reflect the underlying rhythmic structure of the music, often using some form of autocorrelation of the amplitude 

## Making a docx store

In [55]:
from dol import Pipe
from msword import bytes_to_doc, get_text_from_docx, only_files_with_msword_extension

doc_wrap = Pipe(
    only_files_with_msword_extension,
    wrap_kvs(value_decoder=Pipe(bytes_to_doc, get_text_from_docx)),
)

doc_store = doc_wrap(s)
list(doc_store)

['simple.docx', 'Release Notes.docx']

In [57]:
store = doc_store

k = next(iter(store))
print(f"{k=}")
v = store[k]
print(f"{v=}")

k='simple.docx'
v='Just a bit of text to show that is works. Another sentence.\nThis is after a newline.\n\nThis is after two newlines.'
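
`Pipe(bytes_to_doc, get_text_from_docx)` composes the two functions left to right: the bytes go through the first function, and its result goes through the second. A minimal standard-library sketch of that idea follows; `pipe` is a hypothetical helper written for illustration, not `dol.Pipe` itself.

```python
import json
from functools import reduce


def pipe(*funcs):
    """Compose functions left to right: pipe(f, g)(x) == g(f(x))."""
    def composed(x):
        return reduce(lambda acc, func: func(acc), funcs, x)
    return composed


# bytes -> str -> parsed JSON, mirroring the bytes -> doc -> text chain above
bytes_to_json = pipe(lambda b: b.decode('utf-8'), json.loads)
result = bytes_to_json(b'{"the": "dict"}')
```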


## Making a ChainMap from these

In [67]:
from collections import ChainMap

chained_store = ChainMap(text_store, json_store, pdf_store, doc_store)
list(chained_store)


['simple.docx',
 'Release Notes.docx',
 'AI and Tempo Estimation.pdf',
 'JAMMIN-GPT.pdf',
 'save_here.json',
 'nested.json',
 'simple.txt']

In [59]:
doc = chained_store['simple.docx']
doc

'Just a bit of text to show that is works. Another sentence.\nThis is after a newline.\n\nThis is after two newlines.'

In [68]:
pdf = chained_store['AI and Tempo Estimation.pdf']
pdf

' 1 AI and Tempo Estimation: A Review Geoff Luck1 1 Centre of Excellence in Music, Mind, Body and Brain, Department of Music, Art and Culture Studies, University of Jyväskylä, Finland.  1geoff.luck@jyu.fi   Abstract The author’s goal in this paper is to explore how artificial intelligence (AI) has been utilized to inform our understanding of and ability to estimate at scale a critical aspect of musical creativity — musical tempo. The central importance of tempo to musical creativity can be seen in how it is used to express specific emotions (Eerola and Vuoskoski 2013), suggest particular musical styles (Li and Chan 2011), influence perception of expression (Webster and Weir 2005) and mediate the urge to move one’s body in time to the music (Burger et al. 2014). Traditional tempo estimation methods typically detect signal periodicities that reflect the underlying rhythmic structure of the music, often using some form of autocorrelation of the amplitude envelope (Lartillot and Toiviainen
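
Two properties of `collections.ChainMap` matter for this approach: a lookup returns the value from the first mapping that contains the key, and iteration walks the mappings from last to first, which is why the listing above starts with the `.docx` files. The sketch below shows both with plain dicts; the keys and values are made up for illustration.

```python
from collections import ChainMap

text_like = {'simple.txt': 'text', 'shared': 'from text_like'}
json_like = {'nested.json': {'version': 37}, 'shared': 'from json_like'}

chained = ChainMap(text_like, json_like)

# Iteration starts with the last mapping; a duplicate key keeps its first position
keys = list(chained)

# Lookup consults the mappings in the order given, so text_like wins
shared_value = chained['shared']
```

Since each sub-store in this notebook filters on a distinct extension, keys never overlap and the precedence rule does not come into play.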

## Using postget to change decoding according to extension

In [126]:
from dol import Files
from dol_cookbook import misc_files_path

s = Files(misc_files_path)
list(s)

['save_here.json',
 'nested.json',
 'AI and Tempo Estimation.pdf',
 'JAMMIN-GPT.pdf',
 'simple.docx',
 'Release Notes.docx',
 'simple.txt']

In [107]:
import json

from dol import wrap_kvs, Pipe
from pdfdol.base import bytes_to_pdf_reader_obj, read_pdf_text
from msword import bytes_to_doc, get_text_from_docx

# Maps a file extension to the function that decodes that file's bytes
extension_to_decoder = {
    '.txt': lambda obj: obj.decode('utf-8'),
    '.json': json.loads,
    '.pdf': Pipe(
        bytes_to_pdf_reader_obj, read_pdf_text, '\n\n------------\n\n'.join
    ),
    '.docx': Pipe(bytes_to_doc, get_text_from_docx),
}

def extension_based_decoding(k, v):
    """postget function: receives the key and the raw value, returns the decoded value."""
    ext = '.' + k.split('.')[-1]
    decoder = extension_to_decoder.get(ext, None)
    if decoder is None:
        raise ValueError(f"Unknown extension: {ext}")
    return decoder(v)

def extension_base_wrap(store):
    return wrap_kvs(store, postget=extension_based_decoding)

store = extension_base_wrap(s)
list(store)

['save_here.json',
 'nested.json',
 'AI and Tempo Estimation.pdf',
 'JAMMIN-GPT.pdf',
 'simple.docx',
 'Release Notes.docx',
 'simple.txt']
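
One caveat about `'.' + k.split('.')[-1]`: for a key with no dot, it returns the whole key prefixed with a dot, and it is case-sensitive. If that matters for your files, `os.path.splitext` is an alternative worth considering. The helper name `get_extension` below is hypothetical, and lower-casing the result is an assumption about what you might want, not something the original code does.

```python
import os


def get_extension(key):
    """Return the lower-cased extension of key (including the dot), or ''."""
    _, ext = os.path.splitext(key)
    return ext.lower()


def split_based_extension(key):
    """The approach used in extension_based_decoding above."""
    return '.' + key.split('.')[-1]
```

With `get_extension`, a key such as `'README'` yields an empty string, which you could map to a fallback decoder or report more clearly in the error message.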

In [108]:
store['simple.docx']

'Just a bit of text to show that is works. Another sentence.\nThis is after a newline.\n\nThis is after two newlines.'

In [109]:
store['nested.json']

{'version': 37,
 'examples': {'apple': ['pie', 'crumble', 'sauce'], 'banana': 'split'}}

In [110]:
store['AI and Tempo Estimation.pdf']

' 1 AI and Tempo Estimation: A Review Geoff Luck1 1 Centre of Excellence in Music, Mind, Body and Brain, Department of Music, Art and Culture Studies, University of Jyväskylä, Finland.  1geoff.luck@jyu.fi   Abstract The author’s goal in this paper is to explore how artificial intelligence (AI) has been utilized to inform our understanding of and ability to estimate at scale a critical aspect of musical creativity — musical tempo. The central importance of tempo to musical creativity can be seen in how it is used to express specific emotions (Eerola and Vuoskoski 2013), suggest particular musical styles (Li and Chan 2011), influence perception of expression (Webster and Weir 2005) and mediate the urge to move one’s body in time to the music (Burger et al. 2014). Traditional tempo estimation methods typically detect signal periodicities that reflect the underlying rhythmic structure of the music, often using some form of autocorrelation of the amplitude envelope (Lartillot and Toiviainen
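
For readers curious about what `postget` amounts to, the following standard-library sketch mimics it: a read-only view that passes both the key and the raw value to a function on every read. `PostGetView`, `decode_by_extension`, and the in-memory `raw` dict are hypothetical, and this only approximates the behaviour; it is not `dol`'s implementation.

```python
import json
import os
from collections.abc import Mapping


class PostGetView(Mapping):
    """Read-only view that applies postget(key, raw_value) on every read."""

    def __init__(self, store, postget):
        self._store = store
        self._postget = postget

    def __getitem__(self, k):
        return self._postget(k, self._store[k])

    def __iter__(self):
        return iter(self._store)

    def __len__(self):
        return len(self._store)


decoders = {'.txt': lambda b: b.decode('utf-8'), '.json': json.loads}


def decode_by_extension(k, v):
    """Pick a decoder from the key's extension and apply it to the raw value."""
    ext = os.path.splitext(k)[1]
    if ext not in decoders:
        raise ValueError(f"Unknown extension: {ext!r}")
    return decoders[ext](v)


# Made-up in-memory data standing in for a folder of files
raw = {'simple.txt': b'hello', 'nested.json': b'{"version": 37}', 'img.png': b'\x89PNG'}
view = PostGetView(raw, decode_by_extension)
```

Unlike the `ChainMap` approach, keys are not filtered here: every key is listed, and the failure for an unsupported extension only surfaces when that key is read.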


In [124]:
# You could also apply extension_base_wrap to a class

@extension_base_wrap
class MySpecialFiles(Files):
    """Decodes files according to some specific extension rules"""

store2 = MySpecialFiles(misc_files_path)
list(store2)

['save_here.json',
 'nested.json',
 'AI and Tempo Estimation.pdf',
 'JAMMIN-GPT.pdf',
 'simple.docx',
 'Release Notes.docx',
 'simple.txt']

In [125]:
store2['Release Notes.docx']

'Latest Release\nVersion: v0.0.20\nFeatures/changes:\nAdded data storage management feature, which will delete uploaded blocks, if memory usage is more than 90 and also, it will show alerts to users.\nAdded more streaming parameters configurable through CLI, they are,\nBuffer duration in seconds\nBlock duration in seconds\nChunk duration in seconds\nCheck the help in CLI for more information on new parameters\nAdded Amplification feature in oedge. \nUsers can see the current amplification value on get-config. \nAlso can update it on set-config \n\nPrevious Releases\n\nVersion: v0.0.19\nFeatures/changes:\nChange in Data storage service hosted VM IP\n\nVersion: v0.0.18\nFeatures/changes:\nUpload Service is now working on periodic trigger instead of Redis Trigger \n\n\nVersion: v0.0.17\nFeatures/changes:\nIncluded new version of AudioStream2py(0.1.21) library with bug fixes\nAdded Session ID details to CLI when starting a capture\n\nVersion: v0.0.16\nFeatures/changes:\nIncluded new versio

In [None]:
import json

from dol import wrap_kvs, Pipe
from pdfdol.base import bytes_to_pdf_reader_obj, read_pdf_text
from msword import bytes_to_doc, get_text_from_docx

extension_to_decoder = {
    '.txt': lambda obj: obj.decode('utf-8'),
    '.json': json.loads,
    '.pdf': Pipe(
        bytes_to_pdf_reader_obj, read_pdf_text, '\n\n------------\n\n'.join
    ),
    '.docx': Pipe(bytes_to_doc, get_text_from_docx),
}

# Same logic as before, but the mapping is now a parameter instead of a global:
# the factory returns a (k, v) function bound to the given extension_to_decoder.
def mk_extension_based_decoding(extension_to_decoder):
    def extension_based_decoding(k, v):
        ext = '.' + k.split('.')[-1]
        decoder = extension_to_decoder.get(ext, None)
        if decoder is None:
            raise ValueError(f"Unknown extension: {ext}")
        return decoder(v)
    return extension_based_decoding


def extension_base_wrap(store, extension_to_decoder):
    extension_based_decoding = mk_extension_based_decoding(extension_to_decoder)
    return wrap_kvs(store, postget=extension_based_decoding)



In [131]:
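from functools import partial

# Alternative to the factory above: keep a single function with a keyword-only
# `extension_to_decoder` argument, and bind it with `partial` when wrapping.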

dflt_extension_to_decoder = {
    '.txt': lambda obj: obj.decode('utf-8'),
    '.json': json.loads,
    '.pdf': Pipe(
        bytes_to_pdf_reader_obj, read_pdf_text, '\n\n------------\n\n'.join
    ),
    '.docx': Pipe(bytes_to_doc, get_text_from_docx),
}

def extension_based_decoding(k, v, *, extension_to_decoder=dflt_extension_to_decoder):
    ext = '.' + k.split('.')[-1]
    decoder = extension_to_decoder.get(ext, None)
    if decoder is None:
        raise ValueError(f"Unknown extension: {ext}")
    return decoder(v)


def extension_base_wrap(store, *, extension_to_decoder=dflt_extension_to_decoder):
    _extension_based_decoding = partial(extension_based_decoding, extension_to_decoder=extension_to_decoder)
    return wrap_kvs(store, postget=_extension_based_decoding)



In [132]:
def decode_and_log(v):
    print(f"Decoding: {v}")
    return v.decode()

store = extension_base_wrap(s, extension_to_decoder={'.txt': decode_and_log})
list(store)

['save_here.json',
 'nested.json',
 'AI and Tempo Estimation.pdf',
 'JAMMIN-GPT.pdf',
 'simple.docx',
 'Release Notes.docx',
 'simple.txt']

In [133]:
store['simple.txt']

Decoding: b'This is\nJust some text\nTo test things'


'This is\nJust some text\nTo test things'
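
Note that passing `extension_to_decoder={'.txt': decode_and_log}` replaces the whole mapping, so with the code above, reading a `.json` or `.pdf` key from that store would raise the `ValueError`. If you only want to override one decoder, one option is to merge your overrides into the defaults. The sketch below uses plain functions and a two-entry default mapping (the names `dflt_decoders_sketch`, `decode_sketch`, and `decode_and_record` are hypothetical) so that it runs without `dol` or the PDF and Word dependencies.

```python
import json
from functools import partial

dflt_decoders_sketch = {
    '.txt': lambda obj: obj.decode('utf-8'),
    '.json': json.loads,
}


def decode_sketch(k, v, *, extension_to_decoder=dflt_decoders_sketch):
    """Same shape as extension_based_decoding above, with a smaller default mapping."""
    ext = '.' + k.split('.')[-1]
    decoder = extension_to_decoder.get(ext)
    if decoder is None:
        raise ValueError(f"Unknown extension: {ext}")
    return decoder(v)


seen = []  # records which raw values went through the custom decoder


def decode_and_record(v):
    seen.append(v)
    return v.decode()


# Later entries win, so '.txt' is overridden while '.json' keeps its default
merged = {**dflt_decoders_sketch, '.txt': decode_and_record}
postget = partial(decode_sketch, extension_to_decoder=merged)
```

The resulting `postget` could then be handed to a wrapper in the same way `_extension_based_decoding` is passed to `wrap_kvs` above.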