<a href="https://colab.research.google.com/github/knobs-dials/wetsuite-dev/blob/main/notebooks/extras/datacollect/datacollect_eurlex.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## This notebook's goal


Figuring out how to get data out of EUR-Lex. 

Currently aimed specifically at the court judgments, and then mainly the text.

There are [a few different ways to access different parts of EUR-Lex data](https://eur-lex.europa.eu/content/welcome/data-reuse.html),
including a RESTful API, a SOAP API (requires registration), and a SPARQL endpoint.

Probably the most flexible is the SPARQL endpoint,
particularly when looking for specific selections of documents, specific relations, and such.
At the same time, SPARQL presents a bit of a learning curve unless you're already hardcore into RDF.
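
To make that learning curve a little less abstract, here is a minimal sketch of a query for works of one resource type together with their CELEX identifier. The endpoint URL, the `cdm:` property names and the `format` request parameter are assumptions based on the public Cellar documentation, not necessarily what `wetsuite.datacollect.eurlex.fetch_by_resource_type` does internally. The helper names `build_query`, `celex_pairs` and `run_query` are made up for this illustration.

```python
import json
import urllib.parse
import urllib.request

# Assumption: the public Cellar SPARQL endpoint; check the data-reuse page for the current address.
SPARQL_ENDPOINT = "http://publications.europa.eu/webapi/rdf/sparql"


def build_query(resource_type: str, limit: int = 10) -> str:
    """Build a SPARQL query listing works of one resource type with their CELEX.

    Assumption: the CDM property names used here (work_has_resource-type,
    resource_legal_id_celex) are the ones the endpoint exposes.
    """
    return f"""
PREFIX cdm: <http://publications.europa.eu/ontology/cdm#>
SELECT DISTINCT ?work ?celex WHERE {{
  ?work cdm:work_has_resource-type <http://publications.europa.eu/resource/authority/resource-type/{resource_type}> .
  ?work cdm:resource_legal_id_celex ?celex .
}} LIMIT {int(limit)}
""".strip()


def celex_pairs(result: dict) -> list:
    """Pull (celex, work) pairs out of a SPARQL JSON result, skipping incomplete rows."""
    pairs = []
    for binding in result["results"]["bindings"]:
        if "celex" in binding and "work" in binding:
            pairs.append((binding["celex"]["value"], binding["work"]["value"]))
    return pairs


def run_query(query: str) -> dict:
    """Send the query over HTTP GET and decode the JSON response (needs network access)."""
    params = urllib.parse.urlencode(
        {"query": query, "format": "application/sparql-results+json"}  # 'format' is an assumption
    )
    with urllib.request.urlopen(f"{SPARQL_ENDPOINT}?{params}", timeout=60) as response:
        return json.load(response)
```

Something like `celex_pairs(run_query(build_query('JUDG', limit=5)))` would then be a small first experiment before asking for the full list.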

<!-- -->

SPARQL results refer to a work, which is mostly the content text as HTML, e.g. http://publications.europa.eu/resource/cellar/1e3100ce-8a71-433a-8135-15f5cc0e927c.0002.02/DOC_1
The public-facing web page describing the same document (looked up by CELEX), e.g. https://eur-lex.europa.eu/legal-content/EN/ALL/?uri=CELEX%3A61996CJ0080
gives more detail:
- links to the underlying document
- ...for all translated languages
- the text
- more metadata, like classification, related documents

...so for first experiments, and before learning SPARQL, we could read off details from there.
If we do, we still need a source of CELEX identifiers to know what to fetch. The SPARQL endpoint is still quite useful for that.
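
As a small illustration of going from a CELEX identifier to that public-facing page, the sketch below builds the URL in the same shape as the example above. The function name `celex_page_url` and its `view` parameter are hypothetical, introduced only for this example.

```python
import urllib.parse


def celex_page_url(celex: str, lang: str = "EN", view: str = "ALL") -> str:
    """Build the EUR-Lex page URL for a CELEX identifier.

    lang is the two-letter language code used in the URL path,
    view is the page variant ('ALL' is the one with the most metadata).
    """
    # urlencode percent-escapes the colon, giving uri=CELEX%3A...
    query = urllib.parse.urlencode({"uri": f"CELEX:{celex}"})
    return f"https://eur-lex.europa.eu/legal-content/{lang.upper()}/{view}/?{query}"


print(celex_page_url("61996CJ0080"))
print(celex_page_url("61996CJ0080", lang="nl"))
```

The code cells in this notebook use the unescaped `CELEX:` form instead. Because the URL doubles as the cache key, it matters more to pick one form and stick with it than which form that is.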

In [1]:
import random, pprint, json, time

import tqdm

import wetsuite.datacollect.eurlex
import wetsuite.helpers.notebook
import wetsuite.helpers.localdata
import wetsuite.helpers.etree
import wetsuite.helpers.net

# Judgments

In [2]:
judgment_celexes = wetsuite.helpers.localdata.LocalKV('eurlex_judg_celex_workid.db', key_type=str,value_type=str)   # stores CELEX -> work id       (mostly just for the CELEX)
judgment_docs_en = wetsuite.helpers.localdata.LocalKV('eurlex_judg_en.db', key_type=str,value_type=bytes)           # stores url -> html document
judgment_docs_nl = wetsuite.helpers.localdata.LocalKV('eurlex_judg_nl.db', key_type=str,value_type=bytes)           # stores url -> html document

## Fetch identifiers and documents

In [None]:
judg_dict = wetsuite.datacollect.eurlex.fetch_by_resource_type('JUDG') # as of this writing there are 27K results  (referring to roughly 4GB worth of HTML)
judg_dict

In [4]:
# Store just the fact that the CELEX identifiers exist   
#   also the workid they point to, though we don't use that yet
for work in judg_dict['results']['bindings']:
    try:
        celex  = work['celex']['value']
        workid = work['work']['value']
        judgment_celexes.put(celex, workid)
    except KeyError as ke:
        print( 'missing %s: %s'%(str(ke), work) )

In [5]:
# Fetch the web pages for all those CELEXes, for one or more languages

pbar = wetsuite.helpers.notebook.progress_bar( len(judgment_celexes), description='fetching pages...')
count_cached, count_fetched = 0, 0

for celex in judgment_celexes:
    # the /ALL/ page gives more metadata than e.g. AUTO, TXT, though we might be interested in fetching specific-language versions
    for lang, store, url in (
        ('nl', judgment_docs_nl, 'https://eur-lex.europa.eu/legal-content/NL/ALL/?uri=CELEX:%s'%celex),
        ('en', judgment_docs_en, 'https://eur-lex.europa.eu/legal-content/EN/ALL/?uri=CELEX:%s'%celex),
    ):
        try:
            _, was_cached = wetsuite.helpers.localdata.cached_fetch( store, url )
            if was_cached:
                count_cached += 1
            else:
                count_fetched += 1
                # it seems the server will report overloads as 404, so running it another time should work, but backoff is nicer to the servers
                time.sleep( 1 )
        except Exception as e:
            print( e, url )
    
    pbar.value += 1
    pbar.description = f'{count_fetched} fetched, {count_cached} cached'



fetching pages...:   0%|          | 0/23331 [00:00<?, ?it/s]

404 https://eur-lex.europa.eu/legal-content/NL/ALL/?uri=CELEX:62006TJ0060
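
The loop above sleeps a fixed second after each real fetch and simply reports failures such as the 404 shown here. If overloads really are reported as 404, retrying with a growing delay is one way to be gentler on the server. The sketch below is a generic helper, not part of wetsuite: `fetch_with_backoff` and its parameters are made-up names, and it assumes the wrapped fetch function raises an exception on failure.

```python
import time


def fetch_with_backoff(fetch, url, retries=3, base_delay=1.0, sleep=time.sleep):
    """Call fetch(url); on failure wait base_delay, 2*base_delay, 4*base_delay, ... and retry.

    fetch  : any callable that takes a URL and raises on failure
    retries: how many extra attempts to make after the first one
    sleep  : injectable so the waiting can be replaced when testing
    """
    for attempt in range(retries + 1):
        try:
            return fetch(url)
        except Exception:
            if attempt == retries:
                raise  # out of retries: let the caller see the last error
            sleep(base_delay * 2 ** attempt)
```

In this notebook the `fetch` argument could be a small lambda around `wetsuite.helpers.localdata.cached_fetch( store, url )`. Note that a document that is genuinely missing would then also be retried a few times before giving up.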


### Test parsing 

In [11]:
# this is some debug, we are not storing yet

for url in random.sample( list(judgment_docs_nl.keys()), 10 ): # pick a bunch of random documents  (random.sample needs a sequence, hence list())
    random_doc = judgment_docs_nl[ url ]
    try:
        #print(url)
        parsed = wetsuite.datacollect.eurlex.extract_html(random_doc)   # that function is where most of the scraping code sits
        #pprint.pprint( parsed['text'] )
    except Exception as e:
        print( url )
        raise

Did that give good text and not error out?   Then we can probably run it on the whole set and store the results.

In [7]:
parsed_store = wetsuite.helpers.localdata.LocalKV('eurlex_parsed.db', key_type=str,value_type=str)    # stores url -> json as str

In [15]:
# parse and store
pbar = wetsuite.helpers.notebook.progress_bar( len(judgment_docs_nl), description='parsing and storing...')
force = False
for url in judgment_docs_nl.keys():
    if force   or  url not in parsed_store:
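        docbytes = judgment_docs_nl[ url ]
        parsed = None   # reset, so the except block never shows a previous document's result
        try:
            parsed = wetsuite.datacollect.eurlex.extract_html( docbytes )
            parsed_store.put( url, json.dumps( parsed ) )
        except Exception:
            print( url )
            pprint.pprint( parsed )   # None if extract_html itself failed
            raise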
    pbar.value += 1

parsing and storing...:   0%|          | 0/23330 [00:00<?, ?it/s]

# Regulations

Basically the same as above (so look for code comments above), but for regulations instead.

TODO: actually finish

In [16]:
reg_celexes = wetsuite.helpers.localdata.LocalKV('eurlex_reg_celex_workid.db', key_type=str,value_type=str)     # stores CELEX -> work id       (mostly just for the CELEX)
reg_docs_en = wetsuite.helpers.localdata.LocalKV('eurlex_reg_en.db', key_type=str,value_type=bytes)   # stores url -> html document
reg_docs_nl = wetsuite.helpers.localdata.LocalKV('eurlex_reg_nl.db', key_type=str,value_type=bytes)   # stores url -> html document

In [17]:
# Fetch current list
reg_dict = wetsuite.datacollect.eurlex.fetch_by_resource_type('REG') # as of this writing there are 130K results (roughly GB worth of HTML)

In [None]:
# take that fetched state and update (mainly) the fact that the CELEX identifiers exist   (also the workid they point to, though we don't use that yet)
for work in reg_dict['results']['bindings']:
    try:
        celex  = work['celex']['value']
        workid = work['work']['value']
        reg_celexes.put(celex, workid, commit=False)  
    except KeyError as ke:
        print( 'missing %s: %s'%(str(ke), work) )
reg_celexes.commit()

In [18]:
len(reg_docs_nl), len(reg_celexes)

(130031, 130031)

In [19]:
# fetch the web pages for all those CELEXes, for NL  (the EN branch below is switched off)

pbar = wetsuite.helpers.notebook.progress_bar( len(reg_celexes), description='fetching pages...')

for celex in reg_celexes.keys():
    # the /ALL/ page gives more metadata than e.g. AUTO, TXT, though we might be interested in fetching specific-language versions
    if 1:
        url = 'https://eur-lex.europa.eu/legal-content/NL/ALL/?uri=CELEX:%s'%celex
        try:
            wetsuite.helpers.localdata.cached_fetch( reg_docs_nl, url )
            #print(url)
        except Exception as e:
            print( e, url )

    if 0:
        url = 'https://eur-lex.europa.eu/legal-content/EN/ALL/?uri=CELEX:%s'%celex
        try:
            wetsuite.helpers.localdata.cached_fetch( reg_docs_en, url )
        except Exception as e: # it seems the server will report overloads as 404, so running it another time should work
            print( e, url )
    pbar.value += 1

fetching pages...:   0%|          | 0/130031 [00:00<?, ?it/s]

In [21]:
# test parsing again

selection = random.sample( list(reg_docs_nl.keys()), 100 )  # pick 100 random documents  (random.sample needs a sequence, hence list())

   
for url in selection: 
    random_doc = reg_docs_nl.get( url )
    try:
        #print(url)
        parsed = wetsuite.datacollect.eurlex.extract_html(random_doc)   # that function is where most of the scraping code sits
        #if random.uniform(0,1)<0.05:
        #    pprint.pprint( parsed )
        #pprint.pprint( parsed['text'] )
    except Exception as e:
        print( url )
        raise

['102551']
['126317']
['1984/1083/CNS']
['177633']
['114925']
['123924']
['129076']
['2004/0157/COD']
['198520']
