## Serlo Article Scraping for RAG

To give the model more relevant information during video planning, it makes sense to pass already prepared mathematical articles as additional context for generation. One source of such articles is de.serlo.org, which offers articles on a wide range of mathematical topics. As an example, a sample of at most 100 articles from this site is scraped here and loaded into a RAG system that extends the context of the video planning model.

Serlo provides a GraphQL API through which metadata on articles, quizzes, and other resources can be retrieved. The article URLs can be extracted from this metadata, then scraped and fed into the RAG system. A similar approach could also be used for Wikipedia articles, for example.
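As a minimal sketch of how article URLs could be extracted from the metadata, the helper below keeps only nodes that are articles and that are tagged with the Mathematik subject ID used later in this notebook. It assumes that each node carries an `about` list of objects with an `id` field; that structure, the helper names, and the sample nodes are assumptions for illustration and are not confirmed by the notebook.

```python
MATH_SUBJECT_ID = "http://w3id.org/kim/schulfaecher/s1017"  # subject ID used in this notebook


def is_math_article(node, subject_id=MATH_SUBJECT_ID):
    """Return True if a metadata node is an article tagged with the given subject."""
    types = node.get("type", [])
    if isinstance(types, str):
        types = [types]
    # Assumption: "about" is a list of objects such as {"id": "<subject URI>"}
    subject_ids = {
        entry.get("id")
        for entry in (node.get("about") or [])
        if isinstance(entry, dict)
    }
    return "Article" in types and subject_id in subject_ids


def extract_math_article_urls(nodes, subject_id=MATH_SUBJECT_ID):
    """Collect the "id" field (the article URL) of every matching node."""
    return [node["id"] for node in nodes if is_math_article(node, subject_id)]


# Made-up sample nodes, shaped like the assumed API response
sample_nodes = [
    {"id": "https://serlo.org/1495", "type": ["LearningResource", "Article"],
     "about": [{"id": MATH_SUBJECT_ID}]},
    {"id": "https://serlo.org/1497", "type": ["LearningResource", "Article"],
     "about": [{"id": "http://w3id.org/kim/schulfaecher/s1005"}]},
    {"id": "https://serlo.org/1499", "type": ["LearningResource", "Quiz"],
     "about": [{"id": MATH_SUBJECT_ID}]},
]
math_urls = extract_math_article_urls(sample_nodes)
print(math_urls)
```

If the real nodes use a different layout for the subject, only the set comprehension in `is_math_article` would need to change.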

### Imports

In [1]:
# Scraping
import httpx
from bs4 import BeautifulSoup
import asyncio

# Data manipulation
import pandas as pd

# Markdown parsing
import re
import markdownify

# Utility
import json

# VectorDB
from sentence_transformers import SentenceTransformer
import chromadb

### Metadata Retrieval

In [2]:
# Retrieve metadata from the Serlo GraphQL API
url = "https://api.serlo.org/graphql"
headers = {
    "Content-Type": "application/json"
}

# Function to fetch resources from the API with pagination
async def fetch_resources(instance, total_limit=100, first=10, modified_after=None):
    async with httpx.AsyncClient() as client:
        all_resources = []
        end_cursor = None
        has_next_page = True
        fetched_count = 0
        article_urls = []

        while has_next_page and fetched_count < total_limit:
            payload = {
                "query": """
                    query($first: Int, $after: String, $instance: Instance, $modifiedAfter: String) {
                        metadata {
                            resources(first: $first, after: $after, instance: $instance, modifiedAfter: $modifiedAfter) {
                                nodes
                                pageInfo {
                                    hasNextPage
                                    endCursor
                                }
                            }
                        }
                    }
                """,
                "variables": {
                    "first": first,
                    "after": end_cursor,
                    "instance": instance,
                    "modifiedAfter": modified_after,
                }
            }

            response = await client.post(url, headers=headers, json=payload)
            response.raise_for_status()  # Raise an error for HTTP issues
            data = response.json()

            resources = data['data']['metadata']['resources']
            nodes = resources['nodes']
            all_resources.extend(nodes)

            # Keep only resources of type "Article" (no subject filter is applied here)
            mathematik_articles = filter_articles(nodes)
            for article in mathematik_articles:
                article_urls.append(article['id'])  # The resource "id" is the article URL

            # Print how many articles were found in this page of results
            print(f"Found {len(mathematik_articles)} Mathematik articles in this turn.")

            # Update the count of fetched resources
            fetched_count += len(nodes)

            # Update pagination variables
            has_next_page = resources['pageInfo']['hasNextPage']
            end_cursor = resources['pageInfo']['endCursor']

            # Stop if we have fetched the desired number of resources
            if fetched_count >= total_limit:
                print(f"Reached the total limit of {total_limit} resources.")
                break

        return article_urls

# Function to filter resources by type (Article)
def filter_articles(resources):
    return [resource for resource in resources if "Article" in resource.get('type', [])]

# Main function to run the script
async def main():
    instance = "de"
    about_id_to_filter = "http://w3id.org/kim/schulfaecher/s1017"  # Mathematik subject ID (currently unused: no subject filter is applied)

    # Fetch up to 100 resources
    article_urls = await fetch_resources(instance, total_limit=100)

    article_urls_df = pd.DataFrame({"article_urls": article_urls})

    article_urls_df.to_csv("article_urls.csv", index=False)
    
    # Output the list of all Mathematik article URLs
    print("\nList of all Mathematik article URLs found:")
    for url in article_urls:
        print(url)

await main()

Found 10 Mathematik articles in this turn.
Found 10 Mathematik articles in this turn.
Found 10 Mathematik articles in this turn.
Found 10 Mathematik articles in this turn.
Found 10 Mathematik articles in this turn.
Found 10 Mathematik articles in this turn.
Found 10 Mathematik articles in this turn.
Found 10 Mathematik articles in this turn.
Found 10 Mathematik articles in this turn.
Found 10 Mathematik articles in this turn.
Reached the total limit of 100 resources.

List of all Mathematik article URLs found:
https://serlo.org/1495
https://serlo.org/1497
https://serlo.org/1499
https://serlo.org/1501
https://serlo.org/1503
https://serlo.org/1505
https://serlo.org/1507
https://serlo.org/1509
https://serlo.org/1511
https://serlo.org/1513
https://serlo.org/1515
https://serlo.org/1517
https://serlo.org/1523
https://serlo.org/1525
https://serlo.org/1527
https://serlo.org/1529
https://serlo.org/1531
https://serlo.org/1533
https://serlo.org/1535
https://serlo.org/1537
https://serlo.org/1539
...

In the next step, the articles are scraped and converted to Markdown so that the LLM can work better with the generated chunks.
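To illustrate the idea of header-based chunking, here is a small standalone sketch. It is a variation rather than the notebook's own function: instead of emitting headings as separate chunks, it returns `(heading, body)` pairs so that each chunk keeps the title it belongs to. The function name `chunk_with_headings` and the sample text are made up for this example.

```python
import re

# ATX-style Markdown headers: one to six "#" characters followed by the title
HEADER_RE = re.compile(r"^(#{1,6})[ \t]+(.*)$", re.MULTILINE)


def chunk_with_headings(markdown_content):
    """Split Markdown into (heading, body) pairs.

    Text that appears before the first heading is returned with an empty heading.
    """
    chunks = []
    heading = ""
    last_idx = 0
    for match in HEADER_RE.finditer(markdown_content):
        # Everything between the previous header and this one is one body
        body = markdown_content[last_idx:match.start()].strip()
        if body:
            chunks.append((heading, body))
        heading = match.group(2).strip()
        last_idx = match.end()
    # Remaining text after the last header
    tail = markdown_content[last_idx:].strip()
    if tail:
        chunks.append((heading, tail))
    return chunks


sample = (
    "Intro text\n\n"
    "# Koordinatensystem\n\n"
    "Ein Punkt P(3|4).\n\n"
    "## Quadranten\n\n"
    "Es gibt vier Quadranten.\n"
)
pairs = chunk_with_headings(sample)
for heading, body in pairs:
    print(repr(heading), "->", body)
```

Keeping the heading next to the body can be useful later, for example as extra metadata in the vector database or as a prefix to the embedded text.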

In [3]:
def chunk_markdown_by_headlines(markdown_content):
    """Chunk the Markdown content based on headers"""
    
    # Regular expression to match Markdown headers (e.g., #, ##, ###)
    header_pattern = re.compile(r'^(#{1,6})\s+(.*)', re.MULTILINE)
    
    # Split the content based on the headers
    chunks = []
    last_idx = 0
    for match in header_pattern.finditer(markdown_content):
        # Get the header level (e.g., #, ##, ###)
        header_level = match.group(1)
        header_text = match.group(2)
        
        # Add the content under each header as a new chunk
        if last_idx != 0:
            chunk = markdown_content[last_idx:match.start()].strip()
            if chunk:
                chunks.append(chunk)
        
        # Update the index to the start of the current header
        last_idx = match.end()

        # Add the header itself to the chunk
        chunks.append(f"{header_level} {header_text}")
    
    # Add the last chunk if any non-empty content remains after the last header
    tail = markdown_content[last_idx:].strip()
    if tail:
        chunks.append(tail)
    
    return chunks

def clean_markdown(markdown_content):
    """Clean markdown content by removing various markdown elements and normalizing text."""
    def replace_links(match):
        return match.group(1) or match.group(2)
    
    content = markdown_content
    content = re.sub(r'\[([^\]]+)\]\(([^)]+)\)', replace_links, content)
    content = re.sub(r'http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\\(\\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+', '', content)
    content = re.sub(r'[*_]{1,3}([^*_]+)[*_]{1,3}', r'\1', content)
    content = re.sub(r'```[^`]*```', '', content)
    content = re.sub(r'`[^`]+`', '', content)
    content = re.sub(r'<[^>]+>', '', content)
    content = re.sub(r'^\s*>\s*', '', content, flags=re.MULTILINE)
    
    # Normalize whitespace while preserving line breaks
    content = re.sub(r' +', ' ', content)
    content = re.sub(r'^\s+|\s+$', '', content, flags=re.MULTILINE)
    content = re.sub(r'\n\s*\n', '\n\n', content)
    return content.strip()

def remove_headlines(chunks):
    """Remove headlines from chunks."""
    return [chunk for chunk in chunks if not chunk.startswith("#")]
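The link pattern in `clean_markdown` also matches Markdown images, so `![Bild](...)` is reduced to `!Bild`, which is visible in the query output at the end of this notebook. One possible remedy, sketched here under the assumption that image alt texts are not needed as context, is to strip images before links are rewritten. The helper name `strip_markdown_images` and the sample text are hypothetical.

```python
import re

# Markdown image syntax: ![alt text](target)
IMAGE_RE = re.compile(r"!\[([^\]]*)\]\([^)]*\)")


def strip_markdown_images(markdown_content):
    """Remove Markdown images entirely, leaving ordinary links untouched."""
    return IMAGE_RE.sub("", markdown_content)


text = "Beispiel: ![Bild](https://example.org/a.png) im Text und ein [Link](https://example.org)."
cleaned = strip_markdown_images(text)
print(cleaned)
```

Calling such a helper on the content before the link substitution would keep the leftover `!` markers out of the chunks.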

In [4]:
# Load article urls
article_urls_df = pd.read_csv("article_urls.csv")

article_urls = article_urls_df["article_urls"].to_list()
article_urls[:3]

['https://serlo.org/1495', 'https://serlo.org/1497', 'https://serlo.org/1499']

In [5]:
# Function to scrape the content
async def scrape_content(url: str):
    async with httpx.AsyncClient(follow_redirects=True) as client:
        # Fetch the page content with automatic redirection handling
        response = await client.get(url)
        response.raise_for_status()  # Raise an exception for bad responses
        
        # Parse the HTML content
        soup = BeautifulSoup(response.text, 'html.parser')
        
        # Return the <main id="content"> element, or None if the page has none
        return soup.find('main', id='content')

# Function to convert HTML to markdown
def html_to_markdown(html_content):
    return markdownify.markdownify(str(html_content), heading_style="ATX")

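# Run the scraper from synchronous code. Inside a running event loop
# (e.g. Jupyter), this returns a task that the caller still has to await.
def run_scrape(url: str):
    try:
        loop = asyncio.get_running_loop()
    except RuntimeError:
        # No event loop is running: start one and block until the result is ready
        return asyncio.run(scrape_content(url))
    # An event loop is already running: schedule the coroutine as a task
    return loop.create_task(scrape_content(url))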

In [6]:
# Scrape and chunk all articles
serlo_data = {
    "article_urls": [],
    "markdown_contents": [],
    "content_chunks": []
}

for article_url in article_urls:
    # Scrape content
    content = await scrape_content(article_url)
    
    if content:
        # Convert content to markdown
        markdown_content = html_to_markdown(content)

        # Chunk content
        chunks = chunk_markdown_by_headlines(markdown_content)

        # Clean chunks from markdown links, etc.
        cleaned_chunks = [clean_markdown(chunk) for chunk in chunks]

        # Filter only for non-headline chunks
        content_chunks = remove_headlines(cleaned_chunks)
        
        serlo_data["article_urls"].append(article_url)
        serlo_data["markdown_contents"].append(markdown_content)
        serlo_data["content_chunks"].append(content_chunks)
    else:
        print(f"Content not found for {article_url}.")
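The loop above awaits one request at a time. For larger samples, the requests could run concurrently with an upper bound on parallelism. The following is a minimal sketch using `asyncio.Semaphore` and `asyncio.gather`. The names `gather_with_limit` and `fake_fetch` are made up for this example, and `fake_fetch` merely stands in for `scrape_content` so that the block runs without network access.

```python
import asyncio


async def gather_with_limit(urls, fetch, max_concurrency=5):
    """Run fetch(url) for all URLs with at most max_concurrency requests in flight.

    Returns (url, result) pairs in input order; a failed fetch yields the
    exception object as its result instead of aborting the whole run.
    """
    semaphore = asyncio.Semaphore(max_concurrency)

    async def fetch_one(url):
        async with semaphore:
            try:
                return url, await fetch(url)
            except Exception as exc:
                return url, exc

    return await asyncio.gather(*(fetch_one(url) for url in urls))


async def fake_fetch(url):
    """Stand-in for scrape_content: no network access, fails for one URL."""
    await asyncio.sleep(0)
    if url.endswith("/0"):
        raise ValueError("simulated failure")
    return f"<main id='content'>{url}</main>"


demo_urls = [f"https://serlo.org/{n}" for n in (1495, 1497, 0)]
demo_results = asyncio.run(gather_with_limit(demo_urls, fake_fetch, max_concurrency=2))
for url, result in demo_results:
    print(url, "->", result)
```

Inside a notebook, where an event loop is already running, the call would be `await gather_with_limit(...)` instead of `asyncio.run(...)`. A small `max_concurrency` also keeps the load on the scraped site low.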

In [7]:
# Store articles in JSON file
with open("serlo_data.json", "w") as f:
    json.dump(serlo_data, f)

In [8]:
# Initialize sentence transformer model
dimensions = 1024
model = SentenceTransformer("mixedbread-ai/deepset-mxbai-embed-de-large-v1", truncate_dim=dimensions)

In [9]:
# Initialize the ChromaDB persistent client
client = chromadb.PersistentClient(path="vectordb")

# Get the collection, creating it if it does not exist yet
collection = client.get_or_create_collection("serlo_collection")

In [10]:
# Add content chunks to vector database
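for article_idx, article_url in enumerate(serlo_data["article_urls"]):
    content_chunks = serlo_data["content_chunks"][article_idx]

    # Skip articles that produced no content chunks
    if not content_chunks:
        continue

    embeddings = model.encode(content_chunks)

    # IDs combine the numeric Serlo ID from the URL with the chunk index
    article_id = article_url.split("/")[-1]
    collection.add(
        documents=content_chunks,
        embeddings=embeddings,
        metadatas=[{"source": article_url} for _ in content_chunks],
        ids=[f"{article_id}_{chunk_idx}" for chunk_idx in range(len(content_chunks))]
    )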

In [11]:
# Example query
query = "Zeichne den Punkt P(4|6) in ein Koordinatensystem ein."

# Encode the query
query_embedding = model.encode([query])

# Retrieve similar documents from the collection
results = collection.query(
    query_embeddings=query_embedding,
    n_results=3,
    include=['documents', 'metadatas']
)

# Print out the results
print("Query:", query)

print("Top 3 most similar documents:")
print()
for documents in results["documents"]:
    context = "Kontext:\n"
    context += "\n\nKontext:\n".join(documents)
    print(context)

Query: Zeichne den Punkt P(4|6) in ein Koordinatensystem ein.
Top 3 most similar documents:

Kontext:
Die Lageinformation eines Punktes im zweidimensionalen Koordinatensystem wird in runden Klammern geschrieben und durch einen senkrechten Strich getrennt:Nullpunkt: (0∣0)(0|0)(0∣0)
Punkt P=(3∣4)P = (3|4)P=(3∣4)
* Punkt Q=(−2∣1)Q=(-2|1)Q=(−2∣1)Zusätzlich kann man von einem Punkt den Quadranten angeben.!Bild

Kontext:
LadenWeitere Aufgaben zum Thema findest du im folgenden Aufgabenordner:
Aufgaben zum Koordinatensystem

Kontext:
Die Lageinformation eines Punktes im zweidimensionalen Koordinatensystem wird in runden Klammern geschrieben und durch einen senkrechten Strich getrennt.Die Stelle vor dem senkrechten Strich ist die xxx-Koordinate, die man an der xxx-Achse abzählt.Die Stelle nach dem senkrechten Strich ist die yyy-Koordinate und wird an der yyy-Achse abgezählt.Beispiel im Bild:Nullpunkt: O(0∣0)\color{#660099}{O(0|0)}O(0∣0)
Punkt P(2∣3)\color{#cc0000}{P(2|3)}P(2∣3)
Punkt Q(−2∣1)\co
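
The retrieved chunks still have to be turned into context for the video planning model. The following sketch shows one way this might look, assuming a results dictionary shaped like the one returned by `collection.query` above (one inner list per query). The function name `build_context_block`, the `max_chars` limit, and the sample data are made up for illustration.

```python
def build_context_block(results, max_chars=500):
    """Format the hits of the first query into a prompt section.

    Exact duplicate chunks are dropped, whitespace is collapsed, and each
    chunk is truncated to max_chars characters.
    """
    documents = results["documents"][0]
    metadatas = results["metadatas"][0]
    sections = []
    seen = set()
    for document, metadata in zip(documents, metadatas):
        text = " ".join(document.split())  # collapse runs of whitespace
        if not text or text in seen:
            continue
        seen.add(text)
        source = metadata.get("source", "unknown")
        sections.append(f"Kontext (Quelle: {source}):\n{text[:max_chars]}")
    return "\n\n".join(sections)


# Made-up sample data in the assumed result layout
demo_results = {
    "documents": [[
        "Punkt P = (3|4)",
        "Punkt P = (3|4)",
        "Aufgaben zum   Koordinatensystem",
    ]],
    "metadatas": [[
        {"source": "https://serlo.org/1495"},
        {"source": "https://serlo.org/1495"},
        {"source": "https://serlo.org/1497"},
    ]],
}
context = build_context_block(demo_results)
print(context)
```

Including the source URL next to each chunk makes it easier to trace a generated statement back to the Serlo article it came from.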