# Purpose of this notebook

Figuring out how to get data out of EUR-Lex. 

Currently aimed specifically at the court judgments, and then mainly the text.

<!-- -->

There are [a few different ways to access different parts of EUR-Lex data](https://eur-lex.europa.eu/content/welcome/data-reuse.html),
including a RESTful API, a SOAP API (requires registration), and a SPARQL endpoint.

Probably the most flexible is the SPARQL endpoint,
particularly when looking for specific selections of documents, specific relations, and such.
At the same time, SPARQL presents a bit of a learning curve unless you're already hardcore into RDF.

SPARQL results refer to a work that is mostly the content text as HTML, e.g. http://publications.europa.eu/resource/cellar/1e3100ce-8a71-433a-8135-15f5cc0e927c.0002.02/DOC_1

Actually, the public-facing web page describing a document (by CELEX), e.g. https://eur-lex.europa.eu/legal-content/EN/ALL/?uri=CELEX%3A61996CJ0080
gives even more detail:
- links to the underlying document
- ...for all translated languages
- the text
- more metadata, like classification, related documents

...so for first experiments, and before learning SPARQL, we could read off details from there.
If we do, we still need a source of CELEX identifiers to know what to fetch. The SPARQL endpoint is still quite useful for that.
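
As a rough idea of what asking the SPARQL endpoint for CELEX identifiers involves, here is a minimal sketch using only the standard library. The endpoint URL and the `cdm:` property names in the query are assumptions about the Publications Office setup, not something taken from `wetsuite.datacollect.eurlex`, and the function names are made up for illustration. The result handling follows the standard SPARQL JSON results layout (`results` / `bindings` / `value`).

```python
import json
import urllib.parse
import urllib.request

# Assumption: the public SPARQL endpoint of the Publications Office
SPARQL_ENDPOINT = 'http://publications.europa.eu/webapi/rdf/sparql'

def build_query(resource_type):
    " Build a query for works of one resource type, e.g. 'JUDG'. The cdm: property names are assumptions. "
    return '''
PREFIX cdm: <http://publications.europa.eu/ontology/cdm#>
SELECT DISTINCT ?work ?celex WHERE {
  ?work cdm:work_has_resource-type <http://publications.europa.eu/resource/authority/resource-type/%s> .
  ?work cdm:resource_legal_id_celex ?celex .
}''' % resource_type

def bindings_to_celex(result_dict):
    " Turn a SPARQL JSON result into { celex: work id }, skipping bindings that lack either variable. "
    ret = {}
    for binding in result_dict['results']['bindings']:
        if 'celex' in binding and 'work' in binding:
            ret[ binding['celex']['value'] ] = binding['work']['value']
    return ret

def fetch_celexes(resource_type):
    " Does the actual network request; defined here but not called. "
    url = SPARQL_ENDPOINT + '?' + urllib.parse.urlencode( {'query': build_query(resource_type)} )
    req = urllib.request.Request( url, headers={'Accept': 'application/sparql-results+json'} )
    with urllib.request.urlopen( req ) as response:
        return bindings_to_celex( json.loads( response.read() ) )
```

Keeping the query building and result parsing separate from the request means both can be checked without touching the server.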

In [2]:
import pprint
import json
import random
import time

import wetsuite.datacollect.eurlex
import wetsuite.helpers.notebook
import wetsuite.helpers.localdata
import wetsuite.helpers.etree
import wetsuite.helpers.net

# JUDGments fetch

## Fetch judgment identifiers

In [5]:
# first we figure out the CELEX identifiers that exist for this type  (also the workid they point to, though we don't use that yet)
judg_celexes = wetsuite.helpers.localdata.LocalKV('eurlex_judg_celex_workid.db', key_type=str,value_type=str)   # stores CELEX -> work id       (mostly just for the CELEX)
# later we will also fetch the documents for them
judg_docs_en = wetsuite.helpers.localdata.LocalKV('eurlex_judg_en.db', key_type=str,value_type=bytes)           # stores url -> html document
judg_docs_nl = wetsuite.helpers.localdata.LocalKV('eurlex_judg_nl.db', key_type=str,value_type=bytes)           # stores url -> html document

In [7]:

# Fetch the CELEX identifiers for all JUDGments
# Note that as of this writing there are 20K+ results  (referring to roughly 4GB worth of HTML)
judg_dict = wetsuite.datacollect.eurlex.fetch_by_resource_type('JUDG') 
for work in judg_dict['results']['bindings']:
    try:
        celex  = work['celex']['value']
        workid = work['work']['value']
        judg_celexes.put(celex, workid)
    except KeyError as ke:
        print( 'missing %s: %s'%(str(ke), work) )

# describe how many CELEXes we have now
judg_celexes.summary(True)

{'size_bytes': 2973696,
 'size_readable': '2.8MiB',
 'num_items': 23654,
 'avgsize_bytes': 126,
 'avgsize_readable': '126B'}

## Fetch judgment documents

In [8]:
# Fetch the web pages for all those CELEXes, for one or more languages
pbar = wetsuite.helpers.notebook.progress_bar( len(judg_celexes), description='fetching pages...')
count_cached, count_fetched = 0, 0


for celex in judg_celexes:
    # the /ALL/ page gives more metadata than e.g. AUTO, TXT, though we might be interested in fetching specific-language 
    for lang, store, url in (
        ('nl', judg_docs_nl, 'https://eur-lex.europa.eu/legal-content/NL/ALL/?uri=CELEX:%s'%celex),
        #('en', judg_docs_en, 'https://eur-lex.europa.eu/legal-content/EN/ALL/?uri=CELEX:%s'%celex),
    ):
        try:
            _, was_cached = wetsuite.helpers.localdata.cached_fetch( store, url )
            if was_cached:
                count_cached += 1
            else:
                count_fetched += 1
                time.sleep( 1 ) # some backoff to be nicer to the servers
        except Exception as e:
            # it seems the server will report overloads as 404, so running it another time should work.
            print( e, url )
            time.sleep( 1 ) # more backoff to be nicer to the servers
    
    pbar.value += 1
    pbar.description = f'{count_fetched} fetched, {count_cached} cached'

# describe how many documents we have now
display( judg_docs_nl.summary(True) )

fetching pages...:   0%|          | 0/23654 [00:00<?, ?it/s]

404 https://eur-lex.europa.eu/legal-content/NL/ALL/?uri=CELEX:62006TJ0060


{'size_bytes': 4235132928,
 'size_readable': '3.9GiB',
 'num_items': 23783,
 'avgsize_bytes': 178074,
 'avgsize_readable': '174KiB'}
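
The loop above prints failures and relies on rerunning the cell to pick up what was missed. If the server does report overloads as 404, an alternative is to retry each URL a few times with growing waits. The following is a minimal sketch of that idea; `fetch_with_retries` and its parameters are hypothetical names, not part of wetsuite.

```python
import time

def fetch_with_retries(fetch, url, attempts=4, base_delay=1.0, sleep=time.sleep):
    """ Call fetch(url), retrying on any exception with exponentially growing waits
        (base_delay, 2*base_delay, 4*base_delay, ...).
        Returns whatever fetch returns; re-raises the last exception if all attempts fail.
    """
    for attempt in range(attempts):
        try:
            return fetch(url)
        except Exception:
            if attempt == attempts - 1:   # out of attempts, let the caller see the error
                raise
            sleep( base_delay * (2 ** attempt) )
```

In the loop above, `fetch` could be something like `lambda u: wetsuite.helpers.localdata.cached_fetch( store, u )`. The `sleep` parameter exists only so the waiting can be swapped out when trying the function.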

# REGulation fetch

In [6]:
# for comments, see above
reg_celexes = wetsuite.helpers.localdata.LocalKV('eurlex_reg_celex_workid.db', key_type=str,value_type=str)   
reg_docs_en = wetsuite.helpers.localdata.LocalKV('eurlex_reg_en.db', key_type=str,value_type=bytes)           
reg_docs_nl = wetsuite.helpers.localdata.LocalKV('eurlex_reg_nl.db', key_type=str,value_type=bytes)           


# Identifiers
reg_dict = wetsuite.datacollect.eurlex.fetch_by_resource_type('REG') 
for work in reg_dict['results']['bindings']:
    try:
        celex  = work['celex']['value']
        workid = work['work']['value']
        reg_celexes.put(celex, workid)
    except KeyError as ke:
        pass
        #print( 'missing %s: %s'%(str(ke), work) )
display( reg_celexes.summary(True) )


# Documents
pbar = wetsuite.helpers.notebook.progress_bar( len(reg_celexes), description='fetching pages...')
count_cached, count_fetched = 0, 0
for celex in reg_celexes:
    # the /ALL/ page gives more metadata than e.g. AUTO, TXT, though we might be interested in fetching specific-language 
    for lang, store, url in (
        ('nl', reg_docs_nl, 'https://eur-lex.europa.eu/legal-content/NL/ALL/?uri=CELEX:%s'%celex),
        #('en', reg_docs_en, 'https://eur-lex.europa.eu/legal-content/EN/ALL/?uri=CELEX:%s'%celex),
    ):
        try:
            #print(url)
            _, was_cached = wetsuite.helpers.localdata.cached_fetch( store, url )
            if was_cached:
                count_cached += 1
            else:
                count_fetched += 1
                # it seems the server will report overloads as 404, so running it another time should work, but backoff is nicer to the servers
                #time.sleep( 1 )
        except Exception as e:
            print( e, url )
            time.sleep( 1 ) # more backoff to be nicer to the servers

    pbar.value += 1
    pbar.description = f'{count_fetched} fetched, {count_cached} cached'

{'size_bytes': 16109568,
 'size_readable': '16M',
 'num_items': 130177,
 'avgsize_bytes': 124}

fetching pages...:   0%|          | 0/130177 [00:00<?, ?it/s]

# DIRectives fetch

In [7]:

dir_celexes = wetsuite.helpers.localdata.LocalKV('eurlex_dir_celex_workid.db', key_type=str,value_type=str)   
dir_docs_en = wetsuite.helpers.localdata.LocalKV('eurlex_dir_en.db', key_type=str,value_type=bytes)           
dir_docs_nl = wetsuite.helpers.localdata.LocalKV('eurlex_dir_nl.db', key_type=str,value_type=bytes)           


# Identifiers
dir_dict = wetsuite.datacollect.eurlex.fetch_by_resource_type('DIR') 
for work in dir_dict['results']['bindings']:
    try:
        celex  = work['celex']['value']
        workid = work['work']['value']
        dir_celexes.put(celex, workid)
    except KeyError as ke:
        pass
        #print( 'missing %s: %s'%(str(ke), work) )
display( dir_celexes.summary(True) )


# Documents
pbar = wetsuite.helpers.notebook.progress_bar( len(dir_celexes), description='fetching pages...')
count_cached, count_fetched = 0, 0
for celex in dir_celexes:
    # the /ALL/ page gives more metadata than e.g. AUTO, TXT, though we might be interested in fetching specific-language 
    for lang, store, url in (
        ('nl', dir_docs_nl, 'https://eur-lex.europa.eu/legal-content/NL/ALL/?uri=CELEX:%s'%celex),
        #('en', dir_docs_en, 'https://eur-lex.europa.eu/legal-content/EN/ALL/?uri=CELEX:%s'%celex),
    ):
        #print(url)
        try:
            _, was_cached = wetsuite.helpers.localdata.cached_fetch( store, url )
            if was_cached:
                count_cached += 1
            else:
                count_fetched += 1
                # it seems the server will report overloads as 404, so running it another time should work, but backoff is nicer to the servers
                time.sleep( 1 )
        except Exception as e:
            print( e, url )
            time.sleep( 1 ) # more backoff to be nicer to the servers
    
    pbar.value += 1
    pbar.description = f'{count_fetched} fetched, {count_cached} cached'


{'size_bytes': 528384,
 'size_readable': '528K',
 'num_items': 4114,
 'avgsize_bytes': 128}

fetching pages...:   0%|          | 0/4114 [00:00<?, ?it/s]

# Parse documents we fetched

The viewable parts of these pages follow templates, which makes them regular enough for metadata-like extraction.
In addition, the documents carry explicit RDFa semantic metadata (which originates from ELI, a larger project),
which is a more regularized form of certain metadata, and also semantic data.

The 'a' in RDFa stands for 'attributes'. The following isn't quite the right way to get it out, 
yet gives a good indication of what is there:
<!--
In JavaScript
it = document.evaluate('//meta[@about]', document);
node = it.iterateNext(); 
while (node) { 
  console.log( node );
  node = it.iterateNext(); 
}
-->

In [10]:
import bs4

htmlbytes = wetsuite.helpers.net.download( 'https://eur-lex.europa.eu/eli/dir/1965/1/oj' )

soup = bs4.BeautifulSoup( htmlbytes, features='lxml' )
for meta in soup.select('meta'):
    if meta.get('about'):
        print(meta)

<meta about="http://data.europa.eu/eli/dir/1965/1/oj" typeof="eli:LegalResource"/>
<meta about="http://data.europa.eu/eli/dir/1965/1/oj" property="eli:uri_schema" resource="http://data.europa.eu/eli/%7Btypedoc%7D/%7Byear%7D/%7Bnatural_number%7D/oj"/>
<meta about="http://data.europa.eu/eli/dir/1965/1/oj" content="31965L0001" lang="" property="eli:id_local"/>
<meta about="http://data.europa.eu/eli/dir/1965/1/oj" property="eli:type_document" resource="http://publications.europa.eu/resource/authority/resource-type/DIR"/>
<meta about="http://data.europa.eu/eli/dir/1965/1/oj" property="eli:passed_by" resource="http://publications.europa.eu/resource/authority/corporate-body/CONSIL"/>
<meta about="http://data.europa.eu/eli/dir/1965/1/oj" content="DG03/F/02" property="eli:responsibility_of"/>
<meta about="http://data.europa.eu/eli/dir/1965/1/oj" property="eli:is_about" resource="http://eurovoc.europa.eu/1638"/>
<meta about="http://data.europa.eu/eli/dir/1965/1/oj" property="eli:is_about" resour
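
To make output like the above easier to inspect, the `meta` elements can be grouped by the resource they are about and by property. This is a minimal sketch, still not proper RDFa processing (it ignores prefixes, `typeof`, and nesting). It uses the standard library's `html.parser` rather than bs4 only to stay dependency-free, and `collect_meta` is a hypothetical helper name.

```python
from collections import defaultdict
from html.parser import HTMLParser

class MetaAboutCollector(HTMLParser):
    " Collects <meta about=... property=...> into { about: { property: [values] } } "
    def __init__(self):
        super().__init__()
        self.found = defaultdict( lambda: defaultdict(list) )

    def handle_starttag(self, tag, attrs):
        # also called for self-closing <meta .../> via the default handle_startendtag
        if tag != 'meta':
            return
        attrs = dict(attrs)
        about, prop = attrs.get('about'), attrs.get('property')
        if about is None or prop is None:
            return
        # the value sits in either a resource attribute (a URI) or a content attribute (a literal)
        value = attrs.get('resource') or attrs.get('content')
        if value is not None:
            self.found[about][prop].append( value )

def collect_meta(html_text):
    " Parse an HTML string and return the grouped metadata as plain dicts "
    collector = MetaAboutCollector()
    collector.feed( html_text )
    collector.close()
    return { about: dict(props)  for about, props in collector.found.items() }
```

Since the documents are stored as bytes, they would need decoding (e.g. `htmlbytes.decode('utf-8')`, assuming that encoding) before being passed in.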

## JUDGment test parse

In [16]:
# During debug, we are not storing anything yet,
#   just picking a handful of documents randomly to see whether our extraction is happy

for random_url, random_doc in judg_docs_nl.random_sample( 10 ): 
    try:
        print( random_url )
        parsed = wetsuite.datacollect.eurlex.extract_html(random_doc)  
        #pprint.pprint( parsed.keys() ) 
        pprint.pprint( parsed ) 
        #pprint.pprint( parsed['text'] )  # uncomment this to see the output that the parsing gives
    except Exception as e:
        print( random_url )
        raise

https://eur-lex.europa.eu/legal-content/NL/ALL/?uri=CELEX:61993CJ0291
{'celex': '61993CJ0291',
 'classifications': {'Case law directory code': [['B-19.01.04',
                                                  'Europese Gemeenschap '
                                                  '(EEG/EG)',
                                                  '/',
                                                  'EEG/EG - Procedures voor '
                                                  'het Hof * Procedures voor '
                                                  'het Hof',
                                                  '/',
                                                  'Beroep wegens niet-nakoming',
                                                  '/',
                                                  'Beroep wegens niet-nakoming '
                                                  '- Gevolgen van het arrest * '
                                                  'Gevolgen van het arrest']],


## JUDGment real parse

Did that give good text and not error out?
Then we can probably run it on the whole set, and store the results.

In [23]:
parsed_store = wetsuite.helpers.localdata.LocalKV('eurlex_parsed.db', key_type=str,value_type=str)    # stores url -> json as str  (same keys as the document store)

# parse and store
#  ...implicitly also a test of whether the parsing trips over any new variation. 
pbar = wetsuite.helpers.notebook.progress_bar( len(judg_docs_nl), description='parsing and storing...')
force = False   # set to True to redo everything, e.g. after you've significantly altered extract_html()
for url in judg_docs_nl.keys():
    if (url not in parsed_store)  or force:
        docbytes = judg_docs_nl[ url ]
        try:
            parsed = wetsuite.datacollect.eurlex.extract_html( docbytes )  # that function is where most of the scraping code sits
            #pprint.pprint( parsed )
            parsed_store.put( url, json.dumps( parsed ) )
        except Exception as e:
            print( 'ERROR for %r: %s'%( url, e ) )
            raise
    pbar.value += 1

parsing and storing...:   0%|          | 0/23688 [00:00<?, ?it/s]
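
The stores here are keyed by the fetched URL (the loop above passes `url` to `put`), so getting back to a bare CELEX identifier means taking it out of the URL's query string. A minimal sketch using only the standard library, with `celex_from_url` as a hypothetical helper name:

```python
import urllib.parse

def celex_from_url(url):
    " Return the CELEX identifier from a URL like .../?uri=CELEX:61993CJ0291, or None if there is none "
    query = urllib.parse.urlparse( url ).query
    # parse_qs also decodes percent-escapes, so CELEX%3A... and CELEX:... are treated the same
    for value in urllib.parse.parse_qs( query ).get('uri', []):
        if value.startswith('CELEX:'):
            return value[ len('CELEX:'): ]
    return None
```

This could be used, for example, to compare the keys of a document store against the CELEX store and see which pages are still missing.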

## REGulation parse

Basically the same as above (so look for code comments above), but for regulations instead.

TODO: actually finish

In [21]:
# test parsing again

reg_docs_nl = wetsuite.helpers.localdata.LocalKV('eurlex_reg_nl.db',           key_type=str,value_type=bytes)   # stores url -> html document

for random_url, random_doc in reg_docs_nl.random_sample( 10 ): 
    try:
        print(random_url)
        parsed = wetsuite.datacollect.eurlex.extract_html(random_doc)   # that function is where most of the scraping code sits
        #if random.uniform(0,1)<0.05:
        #pprint.pprint( parsed )
        #pprint.pprint( parsed['text'] )
    except Exception as e:
        print( 'ERROR for %r: %s'%( random_url, e ) )   # report the sampled URL (this loop has no variable named url)
        raise

https://eur-lex.europa.eu/legal-content/NL/ALL/?uri=CELEX:31976R3080
https://eur-lex.europa.eu/legal-content/NL/ALL/?uri=CELEX:31971R1743
https://eur-lex.europa.eu/legal-content/NL/ALL/?uri=CELEX:31982R0046
https://eur-lex.europa.eu/legal-content/NL/ALL/?uri=CELEX:31974R3021
https://eur-lex.europa.eu/legal-content/NL/ALL/?uri=CELEX:31965R0001(01)
https://eur-lex.europa.eu/legal-content/NL/ALL/?uri=CELEX:31993R0063
https://eur-lex.europa.eu/legal-content/NL/ALL/?uri=CELEX:31976R0736
https://eur-lex.europa.eu/legal-content/NL/ALL/?uri=CELEX:31990R0601
https://eur-lex.europa.eu/legal-content/NL/ALL/?uri=CELEX:32000R2516
['1999/0200/COD']
https://eur-lex.europa.eu/legal-content/NL/ALL/?uri=CELEX:32003R1148


## DIRective test parse

In [22]:
# test parsing again

dir_docs_nl = wetsuite.helpers.localdata.LocalKV('eurlex_dir_nl.db', key_type=str,value_type=bytes)   # stores url -> html document

for random_url, random_doc in dir_docs_nl.random_sample( 10 ):
    try:
        print(random_url)
        parsed = wetsuite.datacollect.eurlex.extract_html(random_doc)   # that function is where most of the scraping code sits
        #if random.uniform(0,1)<0.05:
        pprint.pprint( parsed )
        #pprint.pprint( parsed['text'] )
    except Exception as e:
        print( 'ERROR for %r: %s'%( random_url, e ) )   # report the sampled URL (this loop has no variable named url)
        raise

https://eur-lex.europa.eu/legal-content/NL/ALL/?uri=CELEX:31998L0094
['1998/0250/CNS']
{'celex': '31998L0094',
 'classifications': {'Directory code': ['09.30.10.00 \n'
                                        'Fiscale zaken\n'
                                        ' / \n'
                                        'Indirecte belastingen\n'
                                        ' / \n'
                                        'Omzetbelasting/btw',
                                        '09.30.40.00 \n'
                                        'Fiscale zaken\n'
                                        ' / \n'
                                        'Indirecte belastingen\n'
                                        ' / \n'
                                        'Belastingvrijstellingen voor '
                                        'particulieren'],
                     'EUROVOC descriptor': ['douanevrijstelling',
                                            'Duitsland',
                    