# Purpose of this notebook

Explain why we make datasets, and give a basic introduction-by-example of how to start using them.


## Why datasets

The most up-to-date information comes from finding and fetching data directly from the source.

However, 
- fetching data via a website may be too clunky, particularly if the site is not made for it, and most aren't

- fetching data via code may present too steep a technical learning curve
  - even if we ease that, sites may not allow the type of browsing/searching you may need for your purposes 
    - e.g. any metadata field that isn't made searchable
    - or maybe a workaround exists, but is awkward and/or technically complex again (see e.g. the convoluted damocles search)

- not all sources make it easy to fetch the result in the

- not all sources make it easy to collect

- sometimes collecting is very slow


Also, 
- it might be valuable to share sub-datasets you've collected -- or at the very least share the way you've collected







In [8]:
import wetsuite.datasets
from importlib import reload

wetsuite.datasets.list_datasets()


['rvsadviezen-struc',
 'kamervragen-struc',
 'bwb-mostrecent-xml',
 'bwb-mostrecent-meta',
 'bwb-mostrecent-text',
 'cvdr-mostrecent-xml',
 'cvdr-mostrecent-meta',
 'cvdr-mostrecent-text',
 'woo-besluiten-meta',
 'woo-besluiten-text',
 'kansspelautoriteit-sancties-txt',
 'gemeentes',
 'wetnamen',
 'tweedekamer-fracties-struc',
 'tweedekamer-fracties-membership-struc']

In [19]:
for name, details in wetsuite.datasets.fetch_index().items():
    print( '%-30s %s'%(name, details['short_description']) )

rvsadviezen-struc              The advice under https://raadvanstate.nl/adviezen/ provided as plain text in a nested structure with metadata. 
kamervragen-struc              Questions from ministers to the government. Provided as a nested data structure.
bwb-mostrecent-xml             The latest revision from each BWB-id
bwb-mostrecent-meta            The latest revision from each BWB-id
bwb-mostrecent-text            The latest revision from each BWB-id
cvdr-mostrecent-xml            The latest expression from each CVDR work
cvdr-mostrecent-meta           The latest expression from each CVDR work
cvdr-mostrecent-text           The latest expression from each CVDR work
woo-besluiten-meta             
woo-besluiten-text             
kansspelautoriteit-sancties-txt Sanction decisions, in plain text, extracted from from the case PDFs under https://kansspelautoriteit.nl/aanpak-misstanden/sanctiebesluiten/
gemeentes                      List of municipalities, and some basic information abo
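
The fixed width of 30 in the format string above is one character too narrow for `kansspelautoriteit-sancties-txt`, which is why its description is pushed out of the column. Below is a minimal sketch of sizing the column from the longest name instead. The `index` dict is a hypothetical stand-in, shaped the way the loop above assumes `fetch_index()` returns its data (names mapping to dicts with a `short_description` key), so the example runs without downloading anything. `format_index` is an illustrative helper, not part of wetsuite.

```python
def format_index(index):
    """Return one aligned line per dataset: name, padding, short description."""
    if not index:
        return []
    # Size the name column to the longest name instead of a fixed 30.
    width = max(len(name) for name in index)
    lines = []
    for name, details in index.items():
        description = details.get('short_description', '')
        # rstrip() avoids trailing padding when the description is empty.
        lines.append(f'{name:<{width}}  {description}'.rstrip())
    return lines


# Hypothetical stand-in for the real index (assumed shape, shortened descriptions).
index = {
    'gemeentes': {'short_description': 'List of municipalities'},
    'kansspelautoriteit-sancties-txt': {'short_description': 'Sanction decisions, in plain text'},
    'woo-besluiten-meta': {'short_description': ''},
}

lines = format_index(index)
for line in lines:
    print(line)
```

With the real index, the same helper could be called as `format_index(wetsuite.datasets.fetch_index())`, assuming that call returns a dict of this shape.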

In [19]:
xml  = wetsuite.datasets.load('cvdr-mostrecent-xml')
meta = wetsuite.datasets.load('cvdr-mostrecent-meta')
text = wetsuite.datasets.load('cvdr-mostrecent-text')

We have an ongoing discussion about whether to provide these in a merged, composite way, 
with each item
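
A rough sketch of what a merged view might look like, assuming the three variants are dict-like and use the same key for the same item (an assumption, not something confirmed here). The small dicts and their `item-1` style keys are made-up stand-ins for the loaded `xml`, `meta`, and `text` data, and `merge_variants` is an illustrative helper, not part of wetsuite.

```python
def merge_variants(**variants):
    """Combine dicts that share keys into one dict of per-item records.

    Only keys present in every variant are kept, so each merged record is complete.
    """
    if not variants:
        return {}
    shared_keys = set.intersection(*(set(d) for d in variants.values()))
    return {
        key: {variant_name: d[key] for variant_name, d in variants.items()}
        for key in sorted(shared_keys)
    }


# Made-up stand-ins; real keys and values will look different.
xml_data = {'item-1': b'<doc>one</doc>', 'item-2': b'<doc>two</doc>'}
meta_data = {'item-1': {'title': 'one'}, 'item-2': {'title': 'two'}}
text_data = {'item-1': 'one'}  # deliberately missing item-2

merged = merge_variants(xml=xml_data, meta=meta_data, text=text_data)
print(merged['item-1']['meta'])
```

Keeping only the shared keys is one design choice among several; a merged dataset could just as well keep every key and leave missing variants empty.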

## Example datasets

There are some more targeted examples, including

* CVDR

* BWB

* kansspelautoriteit-sancties-txt

* [`tweedekamer-fracties-struc` and `tweedekamer-fracties-membership-struc`](using_dataset_tweedekamer.ipynb)

* [`woo-besluiten-meta` and `woo-besluiten-text`](using_dataset_woobesluit.ipynb)


And some things that are little more than lists, including
* [gemeentes](using_dataset_gemeentes.ipynb)

* [wetnamen](using_dataset_wetnamen.ipynb)

