<a href="https://colab.research.google.com/github/knobs-dials/wetsuite-dev/blob/main/notebooks/intro/wetsuite_datasets.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Purpose of this notebook

Explain why wetsuite provides datasets, and give a basic introduction-by-example of which ones there are and how to start using them.

This is considered an advanced intro, not because it is overly complex, but because with a little luck you won't have to dive into these lower levels of detail.


## Why datasets

Fetching data directly from the source may give you the most up-to-date information,
but it is often a clunkier way to get a larger and/or more precise set of documents.

- fetching data via a website may be too clunky, particularly if the site is not made for that (and most aren't)

- fetching data via code may present too steep a technical learning curve
  - even if we ease that, sites may not allow the type of browsing/searching you may need for your purposes 
    - e.g. any metadata field that isn't made searchable
    - or maybe a workaround exists, but is awkward and/or technically complex again (see e.g. the convoluted damocles search)

- not all sources make it easy to fetch the result in the

- not all sources make it easy to collect

- sometimes collecting is very slow


Also, 
- it might be valuable to share sub-datasets you've collected -- or at the very least share how you collected them

In [None]:
# (only) in colab, run this first to install wetsuite from (the most recent) source.   For your own setup, see wetsuite's install guidelines.
!pip3 install -U --no-cache-dir --quiet https://github.com/scarfboy/wetsuite-dev/archive/refs/heads/main.zip

In [1]:
import wetsuite.datasets

# what do we currently have?
wetsuite.datasets.list_datasets()

['bwb-mostrecent-meta-struc',
 'bwb-mostrecent-text',
 'bwb-mostrecent-xml',
 'cvdr-mostrecent-html',
 'cvdr-mostrecent-meta-struc',
 'cvdr-mostrecent-text',
 'cvdr-mostrecent-xml',
 'gemeentes-struc',
 'raadvanstate-adviezen-struc',
 'rechtspraaknl-struc',
 'tweedekamer-fractie-membership-struc',
 'tweedekamer-fracties-struc',
 'tweedekamer-kamervragen-struc',
 'wetnamen',
 'woo_besluiten_docs_text',
 'woo_besluiten_meta']
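
The names in this listing appear to start with the source they come from (`bwb`, `cvdr`, `tweedekamer`, ...), followed by a hint about the view on that data. As a minimal sketch, assuming that pattern holds (it is an observation from the output above, not a documented guarantee), the names can be grouped per source with plain Python. The `group_by_source` helper is hypothetical and not part of wetsuite; the list is copied from the output above so that the example runs without downloading anything.

```python
import re
from collections import defaultdict

# Names copied from the list_datasets() output above.
names = [
    'bwb-mostrecent-meta-struc',
    'bwb-mostrecent-text',
    'bwb-mostrecent-xml',
    'cvdr-mostrecent-html',
    'cvdr-mostrecent-meta-struc',
    'cvdr-mostrecent-text',
    'cvdr-mostrecent-xml',
    'gemeentes-struc',
    'raadvanstate-adviezen-struc',
    'rechtspraaknl-struc',
    'tweedekamer-fractie-membership-struc',
    'tweedekamer-fracties-struc',
    'tweedekamer-kamervragen-struc',
    'wetnamen',
    'woo_besluiten_docs_text',
    'woo_besluiten_meta',
]


def group_by_source(dataset_names):
    """Group dataset names by the part before the first '-' or '_' (hypothetical helper)."""
    groups = defaultdict(list)
    for name in dataset_names:
        # Names without a separator (e.g. 'wetnamen') end up in a group of their own.
        source = re.split(r'[-_]', name, maxsplit=1)[0]
        groups[source].append(name)
    return dict(groups)


for source, members in sorted(group_by_source(names).items()):
    print(f'{source:<15} {len(members)} dataset(s)')
```

In practice you would pass the result of `wetsuite.datasets.list_datasets()` instead of the hard-coded list.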

In [22]:
# ...with a little more description?
for name, details in wetsuite.datasets.fetch_index().items():
    #print(details)
    print( f'{name:<40} {details.get("description_short") or "(TODO: short description)"}' )
    print( '                                          ',wetsuite.datasets.load(name) )  # download and load all  (also a test whether their download is not broken)

bwb-mostrecent-meta-struc                (TODO: short description)
                                           <wetsuite.datasets.Dataset name='bwb-mostrecent-meta-struc' num_items=37806>
bwb-mostrecent-text                      (TODO: short description)
                                           <wetsuite.datasets.Dataset name='bwb-mostrecent-text' num_items=37806>
bwb-mostrecent-xml                       (TODO: short description)
                                           <wetsuite.datasets.Dataset name='bwb-mostrecent-xml' num_items=37806>
cvdr-mostrecent-html                     (TODO: short description)
                                           <wetsuite.datasets.Dataset name='cvdr-mostrecent-html' num_items=235495>
cvdr-mostrecent-meta-struc               (TODO: short description)
                                           <wetsuite.datasets.Dataset name='cvdr-mostrecent-meta-struc' num_items=235725>
cvdr-mostrecent-text                     (TODO: short description)
             

In [3]:
xml  = wetsuite.datasets.load('cvdr-mostrecent-xml')
meta = wetsuite.datasets.load('cvdr-mostrecent-meta-struc')  # name as it appears in the list_datasets() output above
text = wetsuite.datasets.load('cvdr-mostrecent-text')

In [30]:
#reload( wetsuite.datasets )
#meta = wetsuite.datasets.load('cvdr-mostrecent-meta')
#meta.save(in_dir='/tmp/t')

#xml  = wetsuite.datasets.load('cvdr-mostrecent-xml')
#xml.save(in_dir='/tmp/t')

#merged = wetsuite.datasets.load('cvdr-mostrecent-*')
#help(merged)

Aside: We have an ongoing discussion of how to provide more varied data without making life harder for you. 

For example, in theory the -xml would be enough, in that -meta and -text are only a handful of extra function calls away,
but making you do that is an unnecessary hurdle.

Providing these in a merged, composite way, e.g. with dataset items providing attributes like
item.xml/item.raw, item.metadata, item.text, is an option, but it would be confusing that these attributes differ per dataset,
and it might prove inflexible in the long run, in that a change to either the datasets _or_ the code is likely to break all previous uses.

So for now, we provide distinct datasets for different views on the same data,
which also means you have at least some chance of loading this data in contexts other than Python.
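
If you do want a composite view for your own purposes, combining the separate views yourself is not much work. The following is a minimal sketch, assuming each view can be treated as a mapping from a document identifier to a value; how wetsuite's `Dataset` objects expose their items is not shown in this notebook, so that part is an assumption. The `merge_views` helper and the tiny stand-in dictionaries are made up for illustration. Only identifiers present in every view are kept, which matters because the views need not have the same number of items (the output above shows 235495 items for `cvdr-mostrecent-html` and 235725 for `cvdr-mostrecent-meta-struc`).

```python
def merge_views(**views):
    """Combine mappings that share keys into one mapping of per-item dicts.

    Hypothetical helper: only keys that occur in every view are kept.
    """
    if not views:
        return {}
    shared_keys = set.intersection(*(set(view) for view in views.values()))
    return {
        key: {view_name: view[key] for view_name, view in views.items()}
        for key in sorted(shared_keys)
    }


# Made-up stand-ins for three views on the same documents (not real data).
xml_view = {'example-1': '<doc>first</doc>', 'example-2': '<doc>second</doc>'}
meta_view = {
    'example-1': {'title': 'First'},
    'example-2': {'title': 'Second'},
    'example-3': {'title': 'Third'},  # has no counterpart in the other views
}
text_view = {'example-1': 'first', 'example-2': 'second'}

merged = merge_views(xml=xml_view, meta=meta_view, text=text_view)
for key, item in merged.items():
    print(key, sorted(item))
```

Keeping this merging step in your own code, rather than in the datasets, means a change in one view only affects the code that actually uses that view.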

## Example datasets

There are some more targeted examples, including

* `cvdr-`

* `bwb-`

* `kansspelautoriteit-sancties-txt` and ``: [using_dataset_kansspelautoriteit notebook](using_dataset_kansspelautoriteit.ipynb)

* `tweedekamer-fracties-struc` and `tweedekamer-fractie-membership-struc`: [using_dataset_tweedekamer notebook](using_dataset_tweedekamer.ipynb)

* `woobesluit`: [using_dataset_woobesluit notebook](using_dataset_woobesluit.ipynb)


And some things that are little more than lists, including
* `gemeentes`: [using_dataset_gemeentes notebook](using_dataset_gemeentes.ipynb)

* `wetnamen`: [using_dataset_wetnamen notebook](using_dataset_wetnamen.ipynb)

