If you want to start playing with this without installation, try:
<a href="https://colab.research.google.com/github/knobs-dials/wetsuite-dev/blob/main/notebooks/intro/wetsuite_datasets.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Purpose of this notebook

Explain why wetsuite provides datasets, and give a basic introduction-by-example of which ones exist and how to start using them.

This is considered an advanced intro not because it is overly complex, but because with a little luck you will not have to dive into these lower levels of detail.


## Why readymade datasets at all?

A few different reasons. Some of our considerations follow:

The way to get the most up-to-date information is often to fetch data directly from the source. We have notebooks elsewhere that detail how that may be done (TODO: link)

If you can express what you want, then such direct fetching is also the most elegant way to get exactly what you want.


Such 'if's matter. When you have a research question, and have gone to a system that probably has the data you are looking for, ask yourself:


**Does it let you _express_ what you want?**

Questions like "give me all parliamentary reports related to animal management" may be very reasonable, but for various data sources, you will not find it easy to be sure the search results are complete.


**Does it let you _fetch_ what you want?**

Even if a site lets you search for what you need, if there is no "download however many matching documents that is", you can't _do_ anything with it (beyond spending the week pressing Save as).

Put another way: is there an interface to fetch data?

Also, is that interface at least as expressive as the site? For example
- the damocles example (TODO: link) may be an example of "hey, we managed to refine our search to download a fairly precise set of documents", but don't count on that always working.
- whereas the KOOP BWB interface doesn't let you search in the document body, as the website will.


**Maybe you just wanted a lot of data**

If you want to test a text processing method,
maybe you just wanted a lot of text of some different types,
no matter yet what it is exactly. Both site and API would be slow for that.


Even if we provide code to make finding-and-fetching easier,
_to end users_ that may only really move the problem around.

When you just wanted to focus on the _documents_, chances are you now need to stare at our code for an afternoon before you get it to work -- or discover that it would never have done what you were expecting.

Or maybe you now have PDF or XML, and have a different problem to solve before you can start working on _text_.


**tl;dr: these datasets represent certain amounts of prep work.**

## What readymade datasets currently exist?

In [None]:
# (only) in colab, run this first to install wetsuite from (the most recent) source.   For your own setup, see wetsuite's install guidelines.
!pip3 install -U --no-cache-dir --quiet https://github.com/scarfboy/wetsuite-dev/archive/refs/heads/main.zip

In [2]:
import wetsuite.datasets

# what sort of prepared datasets are currently offered for download?
wetsuite.datasets.list_datasets()

['bwb-mostrecent-meta-struc',
 'bwb-mostrecent-text',
 'bwb-mostrecent-xml',
 'cvdr-mostrecent-html',
 'cvdr-mostrecent-meta-struc',
 'cvdr-mostrecent-text',
 'cvdr-mostrecent-xml',
 'gemeentes-struc',
 'raadvanstate-adviezen-struc',
 'rechtspraaknl-struc',
 'tweedekamer-fractie-membership-struc',
 'tweedekamer-fracties-struc',
 'tweedekamer-kamervragen-struc',
 'wetnamen',
 'woo_besluiten_docs_text',
 'woo_besluiten_meta']
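
The names in that list suggest a loose convention: a source, then a variant such as `-xml`, `-text` or `-meta-struc`. As a minimal sketch, assuming that convention holds (it is our reading of the names, not something the library promises), you could group the list by source to see which views exist for each. The literal list below is copied from the output above; in a live session you would pass in the result of `wetsuite.datasets.list_datasets()` instead. `group_by_source` is a helper made up for this illustration.

```python
from collections import defaultdict

# Copied from the list_datasets() output above, so this runs without a download.
names = [
    'bwb-mostrecent-meta-struc', 'bwb-mostrecent-text', 'bwb-mostrecent-xml',
    'cvdr-mostrecent-html', 'cvdr-mostrecent-meta-struc',
    'cvdr-mostrecent-text', 'cvdr-mostrecent-xml',
    'gemeentes-struc', 'raadvanstate-adviezen-struc', 'rechtspraaknl-struc',
    'tweedekamer-fractie-membership-struc', 'tweedekamer-fracties-struc',
    'tweedekamer-kamervragen-struc',
    'wetnamen', 'woo_besluiten_docs_text', 'woo_besluiten_meta',
]

def group_by_source(dataset_names):
    """Group names by the part before the first '-' (an assumed convention).

    Names without a '-' end up in a group of their own.
    """
    groups = defaultdict(list)
    for name in dataset_names:
        source = name.split('-', 1)[0]
        groups[source].append(name)
    return dict(groups)

groups = group_by_source(names)
for source, members in sorted(groups.items()):
    print(f'{source:<25} {", ".join(members)}')
```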

In [23]:
# ...with a little more description?
for name, details in wetsuite.datasets.fetch_index().items():
    #print(details)
    print( details.get("description_short", "XX") )
    #print( f'{name:<40} { or "(TODO: short description)"}' )

    # (you probably do not want to) uncomment the next line to download and load every dataset  (this also tests that none of the downloads is broken)
    #print( '                                          ',wetsuite.datasets.load(name) )

None
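
A `None` in that output is worth a note: the default passed to `dict.get` is only used when the key is missing, not when the key is present with a value of `None`. The sketch below shows one way to get a readable fallback in both cases. `example_index` is a hypothetical stand-in with the same shape as the loop above assumes (name mapped to a details dict); its descriptions are placeholders, not the real index contents, and `short_description` is a helper invented for this illustration.

```python
# Hypothetical stand-in for the index; the dataset names are real, the details are made up.
example_index = {
    'wetnamen':           {'description_short': 'placeholder description'},
    'gemeentes-struc':    {'description_short': None},  # key present, value None
    'bwb-mostrecent-xml': {},                           # key missing entirely
}

def short_description(details, fallback='(TODO: short description)'):
    """Return description_short, or the fallback if it is missing, None, or empty."""
    return details.get('description_short') or fallback

for name, details in example_index.items():
    print(f'{name:<40} {short_description(details)}')
```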




In [6]:
xml  = wetsuite.datasets.load('cvdr-mostrecent-xml')
meta = wetsuite.datasets.load('cvdr-mostrecent-meta-struc')
text = wetsuite.datasets.load('cvdr-mostrecent-text')

Downloading 'https://wetsuite.knobs-dials.com/datasets/cvdr-mostrecent-meta-struc.db.xz' to '/root/.wetsuite/datasets/520e84b21fbddf048227c1c4c3576108039311f4'
Decompressing... 259MiB    
Downloading 'https://wetsuite.knobs-dials.com/datasets/cvdr-mostrecent-text.db.xz' to '/root/.wetsuite/datasets/a9a59ccff8644749d378e6ef4094c66d75ff5b1c'
Decompressing... 3.8GiB    


## Example datasets

There are some more targeted examples, including

* `cvdr-`

* `bwb-`

* `kansspelautoriteit-sancties-txt` and ``: [using_dataset_kansspelautoriteit notebook](using_dataset_kansspelautoriteit.ipynb)

* `tweedekamer-fracties-struc` and `tweedekamer-fractie-membership-struc`: [using_dataset_tweedekamer notebook](using_dataset_tweedekamer.ipynb)

* `woobesluit`: [using_dataset_woobesluit notebook](using_dataset_woobesluit.ipynb)


And some things that are little more than lists, including
* `gemeentes`: [using_dataset_gemeentes notebook](using_dataset_gemeentes.ipynb)

* `wetnamen`: [using_dataset_wetnamen notebook](using_dataset_wetnamen.ipynb)



Aside: We have an ongoing discussion of how to provide more varied data without making life harder for you.

For example, in theory the -xml would be enough, in that -meta and -text are only a handful of extra function calls away,
but making you do that is an unnecessary hurdle.

Providing these in a merged, composite way, e.g. with dataset items providing attributes like
item.xml/item.raw, item.metadata, item.text, would be confusing in that they differ per dataset,
and might be inflexible in the long run, in that datasets _or_ code changing is likely to break all previous uses.

So for now, we provide distinct datasets for different views on the same data,
which also means you have at least some chance of loading this data in other, non-Python contexts.
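
To give an idea of what "a handful of extra function calls" from XML to text can look like, here is a minimal sketch using only Python's standard library. It is not the code wetsuite uses to produce its `-text` datasets; the XML fragment and its element names are made up for illustration, and real documents need more care (structure, tables, footnotes).

```python
import xml.etree.ElementTree as ET

# Made-up fragment; real BWB/CVDR documents are much larger and use their own element names.
example_xml = b"""<regeling>
  <titel>Voorbeeldverordening</titel>
  <artikel><kop>Artikel 1</kop><al>Eerste lid.</al><al>Tweede lid.</al></artikel>
</regeling>"""

def xml_to_text(xml_bytes):
    """Return all text in the document, one non-empty fragment per line."""
    root = ET.fromstring(xml_bytes)
    # itertext() walks the tree in document order, yielding text and tail strings
    fragments = (fragment.strip() for fragment in root.itertext())
    return '\n'.join(fragment for fragment in fragments if fragment)

print(xml_to_text(example_xml))
```

This discards all structure, which is exactly the sort of decision a prepared `-text` dataset has already made for you.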

## "What if I want to feed files into something else?"

The datasets as downloaded are indeed in our own format.

You can have our code export to files,
which will also guess at a suitable name for each of those files (TODO: ).


In [14]:
import glob

meta = wetsuite.datasets.load('cvdr-mostrecent-meta-struc')
meta.export_files(in_dir='/tmp/t1')
print( glob.glob('/tmp/t1/*')[:10] )

xml  = wetsuite.datasets.load('cvdr-mostrecent-xml')
xml.export_files(in_dir='/tmp/t2')
print( glob.glob('/tmp/t2/*')[:10] )

text = wetsuite.datasets.load('cvdr-mostrecent-text')
text.export_files(in_dir='/tmp/t3')
print( glob.glob('/tmp/t3/*')[:10] )



['/tmp/t1/00000040_7c1923f16648_100112.json', '/tmp/t1/00000035_bfc471dab4be_100093.json', '/tmp/t1/00000096_8ed494913964_100302.json', '/tmp/t1/00000098_caca43f4acfd_100305.json', '/tmp/t1/00000055_35ab087348d6_100169.json', '/tmp/t1/00000087_78bf02174096_100282.json', '/tmp/t1/00000054_29a9902a19d7_100159.json', '/tmp/t1/00000077_610dd370715e_100242.json', '/tmp/t1/00000015_2f5612763123_100034.json', '/tmp/t1/00000050_db358e78ea03_100155.json']
['/tmp/t2/00000027_cae24a4b352d_100070.xml', '/tmp/t2/00000093_694c73ae2fa5_100291.xml', '/tmp/t2/00000079_a0a7ac08ed03_100247.xml', '/tmp/t2/00000052_e90e71b94a64_100157.xml', '/tmp/t2/00000074_7b5d58ffd05b_100235.xml', '/tmp/t2/00000022_a871704cf8e9_100058.xml', '/tmp/t2/00000034_b6132786ff33_100090.xml', '/tmp/t2/00000061_dec92378430e_100198.xml', '/tmp/t2/00000008_5037ffec4488_100014.xml', '/tmp/t2/00000039_31559f5f2006_10011.xml']
['/tmp/t3/00000096_8ed494913964_100302.txt', '/tmp/t3/00000092_fb091f4f20ca_100289.txt', '/tmp/t3/00000083_3b