# Toolbox for Fedlex
This notebook can:
- fetch the content of a Fedlex page (only tested with German so far; other languages might also work)
- parse it and write the result as JSON
- create a meta information file (`meta_data.json`)

It also includes some tools for development and debugging.

Not extensively tested :)

MIT License <br/>
Author: Marc Feldmann (marc@raxal.io)<br/>
Version 0.1<br/>
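
For orientation, the JSON written for each law has roughly the shape sketched below. The key names are taken from the parser code in this notebook; the sample values and the `REQUIRED_KEYS` / `missing_keys` helpers are illustrative assumptions, not real Fedlex data.

```python
# Illustrative only: values are made up, key names follow fedlex_to_json().
example_law = {
    "law_name": "Example law",
    "language": "DE",
    "SR": "000",
    "short_title": "",
    "shorthand": "EX",
    "effective_date": "",
    "law_details": [
        {
            "chapter_title": "1. Titel: Example chapter",
            "chapter_link": "No Link",
            "articles": [
                {
                    "link": "No Link",
                    "number": "1",
                    "title": "Example article",
                    "paragraphs": [
                        # 'literas' is only present when a paragraph has list items
                        {"Absatz": "1", "text": "First paragraph."},
                        {
                            "Absatz": "2",
                            "text": "Second paragraph with a list:",
                            "literas": [{"litera": "a.", "paragraphs": "first item"}],
                        },
                    ],
                }
            ],
        }
    ],
}

# Hypothetical helper: report which top-level keys a parsed law is missing.
REQUIRED_KEYS = {"law_name", "language", "SR", "short_title",
                 "shorthand", "effective_date", "law_details"}

def missing_keys(law):
    return sorted(REQUIRED_KEYS - set(law))

print(missing_keys(example_law))
```

A check like this could be run over every file in the content folder before the meta file is built.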


In [27]:
# Run the cell further below first (the one with all the function definitions),
# then run this cell with your own values.
# Specify the URL and the shorthand of the law (e.g. 'zgb'). This creates the JSON (<shorthand>.json),
# moves it to the content folder and updates the meta_data.json file.

url = 'https://www.fedlex.admin.ch/eli/cc/2010/267/de'
shorthand = 'stpo'


url_to_json(url, shorthand)


'''
…starting a list here for future reference/automation…
dsg: https://www.fedlex.admin.ch/eli/cc/2022/491/de
or: https://www.fedlex.admin.ch/eli/cc/27/317_321_377/de
stgb: https://www.fedlex.admin.ch/eli/cc/54/757_781_799/de
zpo: https://www.fedlex.admin.ch/eli/cc/2010/262/de
zgb: https://www.fedlex.admin.ch/eli/cc/24/233_245_233/de
stpo: https://www.fedlex.admin.ch/eli/cc/2010/267/de
bv: https://www.fedlex.admin.ch/eli/cc/1999/404/de
urg: https://www.fedlex.admin.ch/eli/cc/1993/1798_1798_1798/de
uwg: 
fmg: https://www.fedlex.admin.ch/eli/cc/1997/2187_2187_2187/de
desg: https://www.fedlex.admin.ch/eli/cc/2002/226/de
vwvg: https://www.fedlex.admin.ch/eli/cc/1969/737_757_755/de
zertes: https://www.fedlex.admin.ch/eli/cc/2016/752/de

'''




Fetching HTML from https://www.fedlex.admin.ch/eli/cc/2010/267/de… and converting to stpo.json…
Processed HTML data saved to JSON format in: stpo.json
Done. JSON file saved to ./content/stpo.json and meta_data.json updated.
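
The shorthand/URL list kept in the string above could later be moved into a dictionary and fed to `url_to_json` in a loop. The sketch below only shows the data structure and a dry-run iterator. `LAW_URLS` and `pending_laws` are hypothetical names; the URLs are copied from the list above, and `uwg` is marked as missing because no URL was recorded for it.

```python
import os

# URLs copied from the notebook's list; None marks an entry without a recorded URL.
LAW_URLS = {
    "dsg": "https://www.fedlex.admin.ch/eli/cc/2022/491/de",
    "or": "https://www.fedlex.admin.ch/eli/cc/27/317_321_377/de",
    "stgb": "https://www.fedlex.admin.ch/eli/cc/54/757_781_799/de",
    "zpo": "https://www.fedlex.admin.ch/eli/cc/2010/262/de",
    "zgb": "https://www.fedlex.admin.ch/eli/cc/24/233_245_233/de",
    "stpo": "https://www.fedlex.admin.ch/eli/cc/2010/267/de",
    "bv": "https://www.fedlex.admin.ch/eli/cc/1999/404/de",
    "urg": "https://www.fedlex.admin.ch/eli/cc/1993/1798_1798_1798/de",
    "uwg": None,
    "fmg": "https://www.fedlex.admin.ch/eli/cc/1997/2187_2187_2187/de",
    "desg": "https://www.fedlex.admin.ch/eli/cc/2002/226/de",
    "vwvg": "https://www.fedlex.admin.ch/eli/cc/1969/737_757_755/de",
    "zertes": "https://www.fedlex.admin.ch/eli/cc/2016/752/de",
}

def pending_laws(law_urls, content_dir="content"):
    """Yield (shorthand, url) pairs that have a URL but no JSON file yet."""
    for shorthand, url in law_urls.items():
        if not url:
            continue  # nothing to fetch without a URL
        if os.path.exists(os.path.join(content_dir, f"{shorthand}.json")):
            continue  # already converted
        yield shorthand, url

# Dry run: replace print with url_to_json(url, shorthand) to actually fetch.
for shorthand, url in pending_laws(LAW_URLS):
    print(shorthand, url)
```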



# Actual code below: run this cell first, then the cell above.
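
The fetch step below waits a fixed ten seconds for the page's JavaScript to finish. A polling loop is a common alternative: it returns as soon as a condition holds and gives up after a timeout. The helper below is a generic, library-free sketch. `wait_until` and its parameters are hypothetical names, and wiring it to the Selenium driver is left as an assumption.

```python
import time

def wait_until(condition, timeout=10.0, interval=0.5):
    """Call condition() repeatedly until it returns a truthy value or the timeout elapses.

    Returns the truthy value, or None if the timeout was reached.
    """
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            return None
        time.sleep(interval)

# Stand-in for a page that becomes ready on the third check.
checks = iter([False, False, True])
print(wait_until(lambda: next(checks), timeout=2.0, interval=0.01))
```

With Selenium this might look like `wait_until(lambda: 'maintext' in driver.page_source)`. Selenium also ships its own `WebDriverWait` class for the same purpose.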

In [25]:
from bs4 import BeautifulSoup
import json
import re
import os
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from webdriver_manager.chrome import ChromeDriverManager
import time
# fetch_html (defined below) fetches the HTML content of a webpage using Selenium
# (because the content is generated with JavaScript) and saves it to a file.
# Example usage:
# fetch_html('https://www.fedlex.admin.ch/eli/cc/1993/1798_1798_1798/de', 'urg')

# url_to_json combines the whole pipeline; call it from the cell above.
def url_to_json(url, shorthand):
    print(f"Fetching HTML from {url}… and converting to {shorthand}.json…")
    fetch_html(url, shorthand)
    fedlex_to_json(f"{shorthand}.html")
    # Clean up the intermediate HTML file and move the JSON into ./content
    os.remove(f"{shorthand}.html")
    os.makedirs("./content", exist_ok=True)  # create the folder on the first run
    os.replace(f"{shorthand}.json", f"./content/{shorthand}.json")  # overwrites an existing file
    create_meta_file()
    print(f"Done. JSON file saved to ./content/{shorthand}.json and meta_data.json updated.")




def fetch_html(url, shorthand):
    def fetch_webpage(url):
        # Set up a headless Chrome WebDriver
        service = Service(ChromeDriverManager().install())
        options = webdriver.ChromeOptions()
        options.add_argument('--headless')  # run Chrome without a visible window
        options.add_argument('--no-sandbox')
        options.add_argument('--disable-dev-shm-usage')
        driver = webdriver.Chrome(service=service, options=options)
        try:
            driver.get(url)
            time.sleep(10)  # fixed wait so the page's JavaScript can render the content
            return driver.page_source
        finally:
            driver.quit()  # always close the browser, even if loading fails

    page_content = fetch_webpage(url)

    # Save the fetched webpage content to a file
    with open(f'{shorthand}.html', 'w', encoding='utf-8') as file:
        file.write(page_content)


# Processes an HTML file saved from the Fedlex website and converts it to JSON format.
# Example usage:
# fedlex_to_json('./html/bv.html')

def fedlex_to_json(html_file_path):
    with open(html_file_path, 'r', encoding='utf-8') as file:
        html_content = file.read()

    soup = BeautifulSoup(html_content, 'html.parser')

    def clean_text(text):
        return ' '.join(text.split()) if text is not None else ""

    def is_deepest_article_section(section):
        # Check deeper nested sections to ensure they do not contain articles
        subsections = section.find_all('section', recursive=True)
        if not subsections:  # If there are no further subsections, it's the deepest
            return True
        # If any subsection contains articles, current section isn't the deepest
        return not any(sub.find_all('article') for sub in subsections)

    def extract_articles(section):
        articles = []
        # Ensure we are in the deepest article-containing section
        if not is_deepest_article_section(section):
            return articles  # Return empty if not the deepest

        for article in section.find_all('article'):
            # This parser expects the second <a> of an article to carry its href and '#art_<n>' fragment
            links = article.find_all('a')
            anchor = links[1] if len(links) > 1 else None
            article_link = anchor.get('href', "No Link") if anchor is not None else "No Link"
            fragment = anchor.get('fragment', '') if anchor is not None else ''
            article_number = fragment.replace("#art", "").replace("_", "") or "No Number"
            bold = article.find('b')
            next_sibling = bold.next_sibling if bold is not None else None
            article_title = clean_text(next_sibling) if next_sibling and isinstance(next_sibling, str) else "No Title"
            paragraphs = []
            absatz = None
            # Articles without a 'collapseable' body yield no paragraphs instead of raising
            body = article.find('div', class_='collapseable')
            elements = body.contents if body is not None else []
            for element in elements:
                if isinstance(element, str):
                    continue

                if element.name == 'p' and 'absatz' in element.get('class', []):
                    if absatz:
                        # Store the previous paragraph; drop the 'literas' key if it stayed empty
                        if not absatz['literas']:
                            absatz.pop('literas')
                        paragraphs.append(absatz)
                    absatz = {
                        'Absatz': element.find('sup').text if element.find('sup') else '0',
                        'text': re.sub(r'^[\d\s]+', '', clean_text(element.get_text())),
                        'literas': []
                    }

                elif element.name == 'dl' and absatz:
                    for dt, dd in zip(element.find_all('dt'), element.find_all('dd')):
                        absatz['literas'].append({
                            'litera': dt.text.strip(),
                            'paragraphs': clean_text(dd.get_text())
                        })

            if absatz:
                # Store the last paragraph of the article
                if not absatz['literas']:
                    absatz.pop('literas')
                paragraphs.append(absatz)

            articles.append({
                'link': article_link,
                'number': article_number,
                'title': article_title,
                'paragraphs': paragraphs
            })

        return articles


    data = {
        'law_name': '',
        'language': 'DE',
        'SR': '',
        'short_title': '',
        'shorthand': '  ',
        'effective_date': '',
        'law_details': []
    }

    # Extract metadata
    if (law_name := soup.find('h1', class_='erlasstitel')):
        # Join nested text nodes with a space so that words split across elements do not run together
        data['law_name'] = law_name.get_text(' ', strip=True)
    if (short_title := soup.find('h2', class_='erlasskurztitel')):
        data['short_title'] = short_title.get_text(strip=True).replace('(','').replace(')','')
    if (effective_date := soup.find('p', class_='erlassdatum')):
        data['effective_date'] = effective_date.get_text(strip=True)
    # Guarded like the fields above, so a page without an SR number does not raise
    if (sr_number := soup.find('p', class_='srnummer')):
        data['SR'] = sr_number.get_text(strip=True)
    annexe_content = soup.find('div', id='annexeContent')
    p_tag = annexe_content.find('p') if annexe_content else None
    data['shorthand'] = p_tag.text if p_tag else 'Abkürzung nicht gefunden'  # i.e. "abbreviation not found"

    

    main_content = soup.find('main', id="maintext")
    if main_content is None:
        raise ValueError(f"No <main id='maintext'> element found in {html_file_path}")
    for section in main_content.find_all('section', recursive=True):
        heading = section.find('a')
        chapter_title = clean_text(heading.text) if heading else "No Chapter Title"
        chapter_link = clean_text(heading['href']) if heading else "No Link"
        # Every section is recorded; sections without articles of their own get an empty list
        data['law_details'].append({
            'chapter_title': chapter_title,
            'chapter_link': chapter_link,
            'articles': extract_articles(section)
        })

    json_data = json.dumps(data, indent=4, ensure_ascii=False)
    json_file_path = os.path.splitext(html_file_path)[0] + '.json'
    with open(json_file_path, 'w', encoding='utf-8') as json_file:
        json_file.write(json_data)
    
    print(f"Processed HTML data saved to JSON format in: {json_file_path}")

# Creates a meta information JSON file (meta_data.json) containing the metadata of all laws available in the content directory.
# For verbose output, uncomment the last line.
def create_meta_file():
    meta_data = [] 

    for file_name in os.listdir('content'): 
        if file_name.endswith('.json'):
            with open(f'./content/{file_name}', 'r', encoding='utf-8') as file:
                law_data = json.load(file) 

            meta_data.append({
                'law_name': law_data['law_name'],
                'language': law_data['language'],
                'SR': law_data['SR'],
                'shorthand': law_data['shorthand'],
                'Kurztitel': law_data['short_title'],  # written as 'short_title' by fedlex_to_json
                'effective_date': law_data['effective_date']
            })

    meta_data_json = json.dumps(meta_data, indent=4, ensure_ascii=False)
    with open('meta_data.json', 'w', encoding='utf-8') as json_file: #modify if needed…
        json_file.write(meta_data_json)

    # print(meta_data_json)  # uncomment for verbose output
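
In `fedlex_to_json`, the paragraph text is cleaned with `re.sub(r'^[\d\s]+', '', …)`, which removes any leading digits. That also eats digits that belong to the text itself and leaves the suffix of labels such as `1bis` behind. A narrower variant, sketched below under the assumption that the label found in `<sup>` is also the literal prefix of the paragraph text, strips only that label. `strip_label` is a hypothetical helper and not part of the notebook; the sample strings are made up.

```python
def strip_label(text, label):
    """Remove a leading paragraph label (e.g. '1' or '1bis') from text, if present."""
    text = " ".join(text.split())  # same whitespace normalisation as clean_text()
    if label and label != "0" and text.startswith(label):
        return text[len(label):].lstrip()
    return text

# A label with a suffix is removed as a whole.
print(strip_label("1bis Sample paragraph text", "1bis"))
# Without a label ('0' is the notebook's placeholder), leading digits are kept.
print(strip_label("30 Tage sample text", "0"))
```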


In [9]:
# Small batch loop: parses all HTML files in the content folder and overwrites existing JSON files!
for file in os.listdir('content'):
    if file.endswith('.html'):
        fedlex_to_json('./content/' + file)

Processed HTML data saved to JSON format in: ./content/urg.json
Processed HTML data saved to JSON format in: ./content/uwg.json
Processed HTML data saved to JSON format in: ./content/dsg.json
Processed HTML data saved to JSON format in: ./content/stgb.json
Processed HTML data saved to JSON format in: ./content/bv.json
Processed HTML data saved to JSON format in: ./content/kg.json
Processed HTML data saved to JSON format in: ./content/zgb.json
Processed HTML data saved to JSON format in: ./content/or.json


# Tools for devs & debugging

- check the chapter structure of a JSON file
- see all articles per chapter in a JSON file
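
A further check that tends to be useful is looking up a single article by its number. The sketch below assumes the JSON layout produced by `fedlex_to_json`; `find_article` is a hypothetical helper and the sample data is made up.

```python
def find_article(law, number):
    """Return (chapter_title, article) for the first article with this number, else None."""
    for chapter in law.get("law_details", []):
        for article in chapter.get("articles", []):
            if article.get("number") == str(number):
                return chapter["chapter_title"], article
    return None

# Made-up sample in the same layout as the generated JSON files.
sample = {
    "law_details": [
        {"chapter_title": "Chapter A", "articles": []},
        {"chapter_title": "Chapter B",
         "articles": [{"number": "5a", "title": "Example", "paragraphs": []}]},
    ]
}
print(find_article(sample, "5a"))
```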

In [88]:
# Open a JSON file (here zgb.json) and check its structure
with open('zgb.json', 'r', encoding='utf-8') as file:
    data = json.load(file)

# Extract law name, language, SR, shorthand, effective date and all chapter titles from law_details
lawname = data['law_name']
language = data['language']
SR = data['SR']
shorthand = data['shorthand']
effective_date = data['effective_date']
chapter_titles = [chapter['chapter_title'] for chapter in data['law_details']]
# Join the chapter titles so that each one is printed on its own line
chapter_titles = '\n'.join(chapter_titles)
content = f"Law Name: {lawname}\nLanguage: {language}\nSR: {SR}\nShorthand: {shorthand}\nEffective Date: {effective_date}\nChapter Titles: {chapter_titles}"
print(content)

Law Name: Schweizerisches Zivilgesetzbuch
Language: DE
SR: 210
Shorthand: 
Effective Date: vom 10. Dezember 1907 (Stand am 1. Januar 2024)
Chapter Titles: Einleitung
A. Anwendung des Rechts
B. Inhalt der Rechtsverhältnisse
I. Handeln nach Treu und Glauben
II. Guter Glaube
III. Gerichtliches
C. Verhältnis zu den Kantonen
I. Kantonales Zivilrecht und Ortsübung
II. Öffentliches Recht der Kantone
D. Allgemeine Bestimmungen des Obligationenrechtes
E. Beweisregeln
I. Beweislast
II. Beweis mit öffentlicher Urkunde
Erster Teil: Das Personenrecht
Erster Titel: Die natürlichen Personen
Erster Abschnitt: Das Recht der Persönlichkeit
A. Persönlichkeit im Allgemeinen
I. Rechtsfähigkeit
II. Handlungsfähigkeit
1. Inhalt
2. Voraussetzungen
a. Im Allgemeinen
b. Volljährigkeit
c. …
d. Urteilsfähigkeit
III. Handlungsunfähigkeit
1. Im Allgemeinen
2. Fehlen der Urteilsfähigkeit
3. Urteilsfähige handlungsunfähige Personen
a. Grundsatz
b. Zustimmung des gesetzlichen Vertreters
c. Fehlen der Zustimmung
4. Höc
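
The `effective_date` field printed above is free text such as `vom 10. Dezember 1907 (Stand am 1. Januar 2024)`. If real dates are needed, it could be split with a regular expression. The sketch below assumes the German wording shown in this output; `parse_effective_date` and `GERMAN_MONTHS` are hypothetical names.

```python
import re
from datetime import date

GERMAN_MONTHS = {
    "Januar": 1, "Februar": 2, "März": 3, "April": 4, "Mai": 5, "Juni": 6,
    "Juli": 7, "August": 8, "September": 9, "Oktober": 10, "November": 11, "Dezember": 12,
}

# day, month name, four-digit year, e.g. "10. Dezember 1907"
_DATE = r"(\d{1,2})\.\s*(\w+)\s+(\d{4})"

def _to_date(match):
    """Convert a regex match into a date; None if there is no match or the month is unknown."""
    if match is None:
        return None
    day, month_name, year = match.groups()
    month = GERMAN_MONTHS.get(month_name)
    return date(int(year), month, int(day)) if month else None

def parse_effective_date(text):
    """Return (enacted, as_of) dates parsed from the free-text field."""
    enacted = re.search(r"vom\s+" + _DATE, text)
    as_of = re.search(r"Stand am\s+" + _DATE, text)
    return _to_date(enacted), _to_date(as_of)

print(parse_effective_date("vom 10. Dezember 1907 (Stand am 1. Januar 2024)"))
```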

In [115]:
# Lists all article numbers per chapter

import json

# Load JSON data from file
with open('./html/bv.json', 'r', encoding='utf-8') as file:
    data = json.load(file)

# Extract law metadata
law_name = data['law_name']
language = data['language']
SR = data['SR']
shorthand = data['shorthand']
effective_date = data['effective_date']

# Print law metadata
print(f"Law Name: {law_name}")
print(f"Language: {language}")
print(f"SR: {SR}")
print(f"Shorthand: {shorthand}")
print(f"Effective Date: {effective_date}\n")

# Print chapter titles and their article numbers
for chapter in data['law_details']:
    chapter_title = chapter['chapter_title']
    articles = chapter['articles']
    article_numbers = ", ".join(article['number'] for article in articles if 'number' in article)  # skip entries without a number
    print(f"{chapter_title}: {article_numbers}")


Law Name: Bundesverfassungder Schweizerischen Eidgenossenschaft
Language: DE
SR: 101
Shorthand: BV
Effective Date: vom 18. April 1999 (Stand am 3. März 2024)

1. Titel: Allgemeine Bestimmungen: 1, 2, 3, 4, 5, 5a, 6
2. Titel: Grundrechte, Bürgerrechte und Sozialziele: 
1. Kapitel: Grundrechte: 7, 8, 9, 10, 10a, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 29a, 30, 31, 32, 33, 34, 35, 36
2. Kapitel: Bürgerrecht und politische Rechte: 37, 38, 39, 40
3. Kapitel: Sozialziele: 41
3. Titel: Bund, Kantone und Gemeinden: 
1. Kapitel: Verhältnis von Bund und Kantonen: 
1. Abschnitt: Aufgaben von Bund und Kantonen: 42, 43, 43a
2. Abschnitt: Zusammenwirken von Bund und Kantonen: 44, 45, 46, 47, 48, 48a, 49
3. Abschnitt: Gemeinden: 50
4. Abschnitt: Bundesgarantien: 51, 52, 53
2. Kapitel: Zuständigkeiten: 
1. Abschnitt: Beziehungen zum Ausland: 54, 55, 56
2. Abschnitt: Sicherheit, Landesverteidigung, Zivilschutz: 57, 58, 59, 60, 61
3. Abschnitt: Bildung, Forschung und 
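
Article numbers in this output mix digits and letter suffixes (`5`, `5a`, `10a`), so plain string sorting would put `10` before `5`. If the numbers ever need sorting, a natural sort key along these lines could be used. `article_sort_key` is a hypothetical helper; it assumes a number consists of leading digits followed by an optional suffix.

```python
import re

def article_sort_key(number):
    """Split '10a' into (10, 'a') so that numbers sort numerically, then by suffix."""
    match = re.match(r"(\d+)(.*)", number)
    if not match:
        return (float("inf"), number)  # non-numeric labels such as 'No Number' go last
    return (int(match.group(1)), match.group(2))

numbers = ["10a", "5a", "10", "6", "5"]
print(sorted(numbers, key=article_sort_key))  # ['5', '5a', '6', '10', '10a']
```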
