# 🔍 HTML data extraction for TTD Newsletter

This notebook explores, tests, and evaluates methods for extracting data from HTML pages.

## Domain extraction

In [1]:
import pandas as pd

good_articles = pd.read_csv('../research/data/good_articles.csv')
good_articles = good_articles[['link_url', 'link_domain']].copy()  # copy so later column assignments do not write to a view
good_articles.describe()

Unnamed: 0,link_url,link_domain
count,4426,4379
unique,3137,103
top,https://jobs.ashbyhq.com/tldr.tech,venturebeat.com
freq,20,283


### 3 methods to test

In [66]:
import re
from urllib.parse import urlparse
from publicsuffix2 import get_sld

def extract_domain_urllib(url):
    """
    Extracts the domain name from a URL using urllib.parse.

    Parameters:
    url (str): The URL string.

    Returns:
    str: The domain name.
    """
    parsed_url = urlparse(url)
    parsed_url = parsed_url.netloc
    parsed_url = re.sub(r'^www\.', '', parsed_url)  # escape the dot so only a literal "www." is stripped
    return parsed_url

def extract_domain_regex(url):
    """
    Extracts the domain name from a URL using regular expressions.

    Parameters:
    url (str): The URL string.

    Returns:
    str: The domain name, or None if no match is found.
    """
    pattern = r'(?<=://)([^/]+)'
    match = re.search(pattern, url)
    if match:
        parsed_url = re.sub(r'^www\.', '', match.group(1))  # escape the dot so only a literal "www." is stripped
        return parsed_url
    return None

def extract_root_domain_psl(url):
    """
    Extracts the root domain from a URL using publicsuffix2.

    Parameters:
    url (str): The URL string.

    Returns:
    str: The root domain.
    """
    domain = urlparse(url).netloc
    return get_sld(domain)
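
In a regular expression an unescaped `.` matches any character, so the pattern used to strip the `www` prefix behaves differently depending on whether the dot is escaped. A minimal sketch of the difference, using made-up host names (the helpers `strip_www_loose` and `strip_www_strict` are illustrative names, not part of the notebook):

```python
import re

def strip_www_loose(host):
    # Unescaped dot: "www" followed by ANY single character is removed.
    return re.sub(r'^www.', '', host)

def strip_www_strict(host):
    # Escaped dot: only a literal "www." prefix is removed.
    return re.sub(r'^www\.', '', host)

for host in ["www.example.com", "wwwx.example.com", "www2.example.com"]:
    print(host, "->", strip_www_loose(host), "|", strip_www_strict(host))
```

Both variants agree on `www.example.com`; they differ only on hosts that merely start with the letters `www`.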


In [67]:
good_articles['domain_urllib'] = good_articles['link_url'].apply(extract_domain_urllib)
good_articles['domain_regex'] = good_articles['link_url'].apply(extract_domain_regex)
good_articles['root_domain_psl'] = good_articles['link_url'].apply(extract_root_domain_psl)

In [69]:
good_articles['match_urllib'] = good_articles['domain_urllib'] == good_articles['link_domain']
good_articles['match_regex'] = good_articles['domain_regex'] == good_articles['link_domain']
good_articles['match_psl'] = good_articles['root_domain_psl'] == good_articles['link_domain']

In [None]:
pd.set_option("display.max_rows", None)

suffix = '_urllib'
filtered = good_articles[~good_articles['match'+suffix] | good_articles['link_domain'].isna()]
filtered.drop_duplicates(subset=['link_domain'], keep='first')[['link_url', 'link_domain', 'domain'+suffix]]

Unnamed: 0,link_url,link_domain,domain_urllib
10,https://hai.stanford.edu/news/hallucinating-la...,stanford.edu,hai.stanford.edu
11,https://blog.research.google/2024/02/a-decoder...,research.google,blog.research.google
13,https://aws.amazon.com/blogs/apn/automating-si...,amazon.com,aws.amazon.com
14,https://docs.google.com/presentation/d/14b5gkR...,google.com,docs.google.com
16,https://home.mlops.community/public/videos/mlo...,mlops.community,home.mlops.community
19,https://blog.allenai.org/olmo-open-language-mo...,allenai.org,blog.allenai.org
30,https://docs.llamaindex.ai/en/latest/examples/...,llamaindex.ai,docs.llamaindex.ai
36,https://blog.langchain.dev/winning-in-ai-means...,langchain.dev,blog.langchain.dev
37,https://ai.meta.com/blog/v-jepa-yann-lecun-ai-...,meta.com,ai.meta.com
41,https://bair.berkeley.edu/blog/2024/02/18/comp...,berkeley.edu,bair.berkeley.edu


In [72]:
pd.set_option("display.max_rows", None)

suffix = '_regex'
filtered = good_articles[~good_articles['match'+suffix] | good_articles['link_domain'].isna()]
filtered.drop_duplicates(subset=['link_domain'], keep='first')[['link_url', 'link_domain', 'domain'+suffix]]


Unnamed: 0,link_url,link_domain,domain_regex
10,https://hai.stanford.edu/news/hallucinating-la...,stanford.edu,hai.stanford.edu
11,https://blog.research.google/2024/02/a-decoder...,research.google,blog.research.google
13,https://aws.amazon.com/blogs/apn/automating-si...,amazon.com,aws.amazon.com
14,https://docs.google.com/presentation/d/14b5gkR...,google.com,docs.google.com
16,https://home.mlops.community/public/videos/mlo...,mlops.community,home.mlops.community
19,https://blog.allenai.org/olmo-open-language-mo...,allenai.org,blog.allenai.org
30,https://docs.llamaindex.ai/en/latest/examples/...,llamaindex.ai,docs.llamaindex.ai
36,https://blog.langchain.dev/winning-in-ai-means...,langchain.dev,blog.langchain.dev
37,https://ai.meta.com/blog/v-jepa-yann-lecun-ai-...,meta.com,ai.meta.com
41,https://bair.berkeley.edu/blog/2024/02/18/comp...,berkeley.edu,bair.berkeley.edu


In [74]:
pd.set_option("display.max_rows", None)

suffix = '_psl'
filtered = good_articles[~good_articles['match'+suffix] | good_articles['link_domain'].isna()]
filtered.drop_duplicates(subset=['link_domain'], keep='first')[['link_url', 'link_domain', 'root_domain'+suffix]]

Unnamed: 0,link_url,link_domain,root_domain_psl
208,/privacy,,
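
The root-domain approach relies on the Public Suffix List to decide where the registrable domain starts. As a rough illustration of the idea only, the sketch below replaces the real list with a tiny hard-coded suffix set (`TOY_SUFFIXES`, an assumption for this example). `toy_root_domain` is a hypothetical helper and does not reproduce the `publicsuffix2` API.

```python
from urllib.parse import urlparse

# Toy stand-in for the Public Suffix List: a handful of suffixes only (assumption).
TOY_SUFFIXES = {"com", "edu", "org", "ai", "co.uk"}

def toy_root_domain(url):
    host = urlparse(url).hostname
    if not host:
        return None  # relative URLs such as "/privacy" have no host
    labels = host.split(".")
    # Try the longest candidate suffix first, then shorter ones.
    for i in range(len(labels)):
        suffix = ".".join(labels[i:])
        if suffix in TOY_SUFFIXES:
            if i == 0:
                return None  # the host is itself a public suffix
            # Keep the suffix plus one label to its left.
            return ".".join(labels[i - 1:])
    return None

for url in ["https://hai.stanford.edu/news", "https://www.bbc.co.uk/news", "/privacy"]:
    print(url, "->", toy_root_domain(url))
```

With the real list the lookup works the same way in spirit, but covers every registered suffix and its wildcard and exception rules.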


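1. The urllib and regex approaches give the same results on the rows shown: both keep the subdomain (e.g. `hai.stanford.edu`), so they differ from `link_domain` whenever the URL has one.
2. `root_domain_psl` agrees with `link_domain`: the only row listed is one with a missing `link_domain`, the relative URL `/privacy`.
- Approach 1 seems better than approach 2, so we pick urllib as our method to extract the domain.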
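
The urllib and regex approaches can still diverge on inputs that do not appear in the tables above, such as a URL with a port or a relative path. The sketch below is self-contained; the function names and sample URLs are illustrative assumptions, and it reads `urlparse(...).hostname` instead of `.netloc` so that the port is dropped.

```python
import re
from urllib.parse import urlparse

def domain_via_urllib(url):
    # hostname excludes any port or credentials; it is None for relative URLs.
    host = urlparse(url).hostname or ''
    return re.sub(r'^www\.', '', host)

def domain_via_regex(url):
    # Everything between "://" and the next "/" (this keeps any port).
    match = re.search(r'(?<=://)([^/]+)', url)
    return re.sub(r'^www\.', '', match.group(1)) if match else None

samples = [
    "https://www.example.com/article",
    "https://www.example.com:8080/article",
    "/privacy",
]
for url in samples:
    print(repr(url), repr(domain_via_urllib(url)), repr(domain_via_regex(url)))
```

On plain `https://host/path` URLs the two functions agree, which is consistent with the identical tables above.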

## HTML Content extraction

In [227]:
import requests
from pprint import pprint

def fetch_html(url):
    try:
        response = requests.get(url, timeout=10)  # avoid hanging forever on unresponsive hosts
        response.encoding = 'utf-8'
        response.raise_for_status()  # Raises HTTPError for bad responses
        return response.text
    except requests.RequestException as e:
        print(f"An error occurred: {e}")
        return None

test_url = "https://neptune.ai/blog/hyperparameter-optimization-for-llms"
html = fetch_html(test_url)

### package html_content_extractor

In [None]:
from html_content_extractor import extract_content

content = extract_content(html, format='plaintext')
print(content)




            TL;DR        











The rise of large language models (LLMs) is bringing advances in text generation and contextual understanding. Hyperparameters control the size of LLMs, their training process, and how they generate outputs.
An optimal combination of hyperparameters is fundamental to efficiently pre-training and fine-tuning LLMs. Since LLM training is computationally intensive, exhaustive experimentation is not viable. This rules out traditional machine-learning hyperparameter optimization (HPO) methods that rely on systematically exploring the hyperparameter space by training many models with slightly different configurations.
When configuring models and training processes, LLM developers rely on a thorough understanding of each hyperparameter’s influence, insights from fundamental research, and empirical evidence gained from training state-of-the-art foundation models. Methods for estimating optimal hyperparameter values with limited compute budgets and adaptin

In [77]:
markdown = extract_content(html, format='markdown')
print(markdown)





### package main_content_extractor

In [90]:
from main_content_extractor import MainContentExtractor

extracted_html = MainContentExtractor.extract(html)
pprint(extracted_html)

('<article class="single-post">\n'
 ' <div class="l-container l-container--main">\n'
 '  <div class="l-content content-wrapper js-content">\n'
 '   <section class="block-note c-box c-box--default c-box--dark '
 'c-box--no-hover c-box--standard" '
 'id="note-block_0a2efafcec672c32672d6121959d76cf">\n'
 '    <h3 class="block-note__header">\n'
 '     TL;DR\n'
 '    </h3>\n'
 '    <div class="block-note__content">\n'
 '     <div class="c-item c-item--text">\n'
 '      <img alt="" class="c-item__arrow" decoding="async" height="10" '
 'loading="lazy" '
 'src="https://neptune.ai/wp-content/themes/neptune/img/blocks/note/list-arrow.svg" '
 'width="12"/>\n'
 '      <div class="c-item__content">\n'
 '       <p>\n'
 '        Finding an optimal set of hyperparameters is essential for efficient '
 'and effective training of Large Language Models (LLMs).\n'
 '       </p>\n'
 '      </div>\n'
 '     </div>\n'
 '     <div class="c-item c-item--text">\n'
 '      <img alt="" class="c-item__arrow" decodi

In [268]:
import re

def clean_markdown(text):
    """Reduce extracted markdown to plain text: drop URLs, images, links, tags, and markup."""

    # Remove URLs (scheme, host, and path; query strings are not matched)
    text = re.sub(r'(https?://|www\.)([\w./-]+)', '', text)

    # Remove images but preserve alt text if present
    text = re.sub(r'!\[([^\]]*?)\]\(.*?\)', r'\1', text, flags=re.DOTALL)

    # Remove remaining links but keep the link text
    text = re.sub(r'\[([^\]]*?)\]\(.*?\)', r'\1', text, flags=re.DOTALL)
    pprint(text)  # Debug: show the text after image and link removal

    # Fix dashes separated by line breaks (e.g., "-\nword" → "-word")
    text = re.sub(r'(-)\n(\w)', r'\1\2', text)

    # Merge broken lines that are not paragraph breaks
    text = re.sub(r'(\S)\n(?=\S)', r'\1 ', text)

    # Put each "*" at the start of its own line as a bullet marker
    text = re.sub(r'\s*\*\s*', r'\n* ', text)

    # Put each numbered-list marker ("1.", "2.", ...) at the start of a line
    text = re.sub(r' +(\d+\.) +', r'\n\1 ', text)

    # Remove HTML tags
    text = re.sub(r'<[^>]+>', '', text)

    # Strip spaces, "*", "#", and blank lines that follow a newline
    # (this also drops heading and bullet markers)
    text = re.sub(r'\n[ *#\n]*', '\n', text)

    # Normalize whitespace and line breaks
    text = re.sub(r'\n{2,}', '\n', text)         # Collapse multiple newlines
    text = re.sub(r'[ \t]+', ' ', text)          # Collapse multiple spaces/tabs

    return text.strip()

def extract_markdown_from_html(html):
    extracted = MainContentExtractor.extract(html, output_format="markdown")
    return clean_markdown(extracted)

test_url = "https://github.blog/open-source/world-water-day-how-github-copilot-is-helping-bring-clean-water-to-communities/"
html = fetch_html(test_url)
extracted_markdown = MainContentExtractor.extract(html, output_format="markdown")
cleaned_extracted_markdown = extract_markdown_from_html(html)
print(len(extracted_markdown), len(cleaned_extracted_markdown))
pprint(cleaned_extracted_markdown)

('Paull Young\n'
 '\n'
 '\n'
 '\n'
 'Maintainers\n'
 '\n'
 '###   5 GitHub Actions every maintainer needs to know\n'
 '\n'
 '\n'
 'With these actions, you can keep your open source projects organized, '
 'minimize\n'
 'repetitive and manual tasks, and focus more on writing code.\n'
 '\n'
 '\n'
 '\n'
 'Git\n'
 '\n'
 '###   Highlights from Git 2.49 \n'
 '\n'
 'The open source Git project just released Git 2.49. Here is GitHub’s look '
 'at\n'
 'some of the most interesting features and changes introduced since last '
 'time.\n'
 '\n'
 '\n'
 '\n'
 'Maintainers\n'
 '\n'
 '###   4 steps toward building an open source community\n'
 '\n'
 '\n'
 'Three maintainers talk about how they fostered their open source '
 'communities.\n'
 '\n')
1347 535
('Paull Young\n'
 'Maintainers\n'
 '5 GitHub Actions every maintainer needs to know\n'
 'With these actions, you can keep your open source projects organized, '
 'minimize repetitive and manual tasks, and focus more on writing code.\n'
 'Git\n'
 'Highl
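
The image and link steps of `clean_markdown` can be tried in isolation. A small sketch on made-up strings (the helper `strip_links_and_images` and the `example.com` URLs are assumptions for illustration); `re.DOTALL` lets `.*?` span link targets that were wrapped across lines, as happens in the raw extractor output.

```python
import re

def strip_links_and_images(text):
    # Images: keep the alt text, drop the target.
    text = re.sub(r'!\[([^\]]*?)\]\(.*?\)', r'\1', text, flags=re.DOTALL)
    # Links: keep the link text, drop the target.
    text = re.sub(r'\[([^\]]*?)\]\(.*?\)', r'\1', text, flags=re.DOTALL)
    return text

sample = "![Logo](https://example.com/logo.png) Read the [docs](https://example.com/docs) first."
print(strip_links_and_images(sample))

# A link whose target was wrapped onto a second line.
wrapped = "[Git](https://example.com/open-\nsource/git/)"
print(strip_links_and_images(wrapped))
```

Images must be handled before links, because the link pattern would otherwise match the `[alt](target)` part of an image and leave a stray `!` behind.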

In [269]:
pprint(extracted_markdown)

('![Paull Young](https://avatars.githubusercontent.com/u/157849754?v=4&s=200)\n'
 '\n'
 '![](https://github.blog/wp-content/uploads/2024/04/1200x630-Productivity-\n'
 'Unfurl-LIGHT-Logo@2x.png?resize=400%2C212)\n'
 '\n'
 '[Maintainers](https://github.blog/open-source/maintainers/)\n'
 '\n'
 '###  [ 5 GitHub Actions every maintainer needs to know\n'
 '](https://github.blog/open-source/maintainers/5-github-actions-every-\n'
 'maintainer-needs-to-know/)\n'
 '\n'
 'With these actions, you can keep your open source projects organized, '
 'minimize\n'
 'repetitive and manual tasks, and focus more on writing code.\n'
 '\n'
 '![](https://github.blog/wp-\n'
 'content/uploads/2025/03/Git-249.png?resize=400%2C212)\n'
 '\n'
 '[Git](https://github.blog/open-source/git/)\n'
 '\n'
 '###  [ Highlights from Git 2.49 ](https://github.blog/open-\n'
 'source/git/highlights-from-git-2-49/)\n'
 '\n'
 'The open source Git project just released Git 2.49. Here is GitHub’s look '
 'at\n'
 'some of the most inte

In [204]:
good_articles = pd.read_csv('../research/data/good_articles.csv')
for url in good_articles.sample(5).link_url:
    print(url)

https://www.marktechpost.com/2025/03/11/reka-ai-open-sourced-reka-flash-3-a-21b-general-purpose-reasoning-model-that-was-trained-from-scratch/
https://venturebeat.com/2022/03/15/you-com-partners-with-openai-to-launch-an-ai-powered-writing-tool/
https://ai.facebook.com/blog/blenderbot-3-a-175b-parameter-publicly-available-chatbot-that-improves-its-skills-and-safety-over-time/
https://venturebeat.com/2022/04/18/linkedin-creates-pass-to-tailor-graph-neural-networks-for-social-media/
https://huggingface.co/blog/sagemaker-huggingface-llm
