# Extracting items from HTML

This notebook explores extracting narrative elements from HTML files using several methods: LangChain's `AsyncHtmlLoader` with `MarkdownifyTransformer` to convert URLs to Markdown, LangChain's `UnstructuredHTMLLoader` and `BSHTMLLoader` to convert HTML files from local folders, and a custom extractor class, `ContextExtractor`, that uses BeautifulSoup to parse HTML content and extract specific elements.

In [1]:
from bs4 import BeautifulSoup
import os
from pathlib import Path
from IPython.display import display, HTML
from bs4.element import Tag
from langchain_community.document_loaders import AsyncHtmlLoader
from langchain_community.document_loaders import BSHTMLLoader
from langchain_community.document_loaders import UnstructuredHTMLLoader
from langchain_community.document_transformers import MarkdownifyTransformer
from urllib.parse import urlparse

USER_AGENT environment variable not set, consider setting it to identify your requests.
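
The warning above is emitted when `langchain_community` is imported without a `USER_AGENT` environment variable. A minimal sketch for setting one, assuming it runs before the imports in the first cell; the agent string `plan-net-notebook/0.1` is a placeholder and not something defined by this project:

```python
import os

# Set a User-Agent before importing langchain_community so outgoing
# requests identify themselves. The value is a hypothetical placeholder;
# setdefault leaves any value that is already configured untouched.
os.environ.setdefault("USER_AGENT", "plan-net-notebook/0.1")

print(os.environ["USER_AGENT"])
```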


## LangChain tool (Markdownify) to convert HTML to Markdown

Create a directory to store Markdown files.

In [2]:
# Directory where you want to save the markdown files
output_dir = 'PlanNet/site/markdown'
# Create output directory if it doesn't exist
Path(output_dir).mkdir(parents=True, exist_ok=True)

Install the necessary libraries if they are not already installed.

In [3]:
# %pip install --upgrade --quiet  markdownify
# %pip install -U lxml
# %pip install unstructured

In [4]:
def get_filename_from_url(url):
    """Extract filename from URL and convert to markdown filename."""
    # Parse the URL and get the path
    path = urlparse(url).path
    
    # Get the last part of the path (filename)
    filename = path.split('/')[-1]
    
    # Remove .html extension if present
    filename = filename.replace('.html', '')
    
    # Replace hyphens with underscores
    filename = filename.replace('-', '_')
    
    # Add .md extension
    return f"{filename}.md"
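
As a quick check of the mapping, the sketch below restates the same steps in a standalone helper and applies it to two of the URLs used later in this notebook. The name `url_to_md_name` is hypothetical and exists only so the snippet runs on its own.

```python
from urllib.parse import urlparse


def url_to_md_name(url):
    """Standalone restatement of the steps in get_filename_from_url."""
    # Last path segment, e.g. "ChangeHistory.html"
    last_segment = urlparse(url).path.split('/')[-1]
    # Drop the .html extension and swap hyphens for underscores
    stem = last_segment.replace('.html', '').replace('-', '_')
    return f"{stem}.md"


base = "https://hl7.org/fhir/us/davinci-pdex-plan-net"
print(url_to_md_name(f"{base}/index.html"))
print(url_to_md_name(f"{base}/CapabilityStatement-plan-net.html"))
```

Note that a URL ending in a slash has an empty last path segment, so this logic would produce `.md`. The URLs in this notebook all end in a file name, so the case does not arise here.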

In [5]:
def convert_urls_to_markdown(urls):
    """Convert multiple URLs to markdown files."""
    # Initialize loaders and transformers
    loader = AsyncHtmlLoader(urls)
    md_transformer = MarkdownifyTransformer()
    
    # Load all documents
    docs = loader.load()
    converted_docs = md_transformer.transform_documents(docs)
    
    # Create output directory if it doesn't exist
    output_dir = "PlanNet/site/markdown"
    os.makedirs(output_dir, exist_ok=True)
    
    # Process each document
    for url, doc in zip(urls, converted_docs):
        # Generate filename from URL
        filename = get_filename_from_url(url)
        
        # Create full file path
        file_path = os.path.join(output_dir, filename)
        
        # Write content to file
        with open(file_path, 'w', encoding='utf-8') as f:
            f.write(doc.page_content)
        
        print(f"Created: {filename}")

In [6]:
urls = [
    "https://hl7.org/fhir/us/davinci-pdex-plan-net/index.html",
    "https://hl7.org/fhir/us/davinci-pdex-plan-net/ChangeHistory.html",
    "https://hl7.org/fhir/us/davinci-pdex-plan-net/examples.html",
    "https://hl7.org/fhir/us/davinci-pdex-plan-net/implementation.html",
    "https://hl7.org/fhir/us/davinci-pdex-plan-net/profiles.html",
    "https://hl7.org/fhir/us/davinci-pdex-plan-net/artifacts.html",
    "https://hl7.org/fhir/us/davinci-pdex-plan-net/CapabilityStatement-plan-net.html"
]


In [7]:
convert_urls_to_markdown(urls)

Fetching pages: 100%|##########| 7/7 [00:00<00:00,  8.63it/s]


Created: index.md
Created: ChangeHistory.md
Created: examples.md
Created: implementation.md
Created: profiles.md
Created: artifacts.md
Created: CapabilityStatement_plan_net.md


The files are written to the `PlanNet/site/markdown` folder.

## Experimental converters

### Converting from local HTML files to Markdown

WIP: This function runs, but it does not preserve the structure of the HTML well, so the LangChain URL-based tool above is recommended instead. I tried two document loaders, `BSHTMLLoader` and `UnstructuredHTMLLoader`, and neither works well at this point. A likely reason is that both loaders return plain text with the HTML tags already stripped, which leaves `MarkdownifyTransformer` little structure to convert.

In [11]:
def get_filename_from_html(html_file):
    """Convert HTML filename to markdown filename."""
    # Get the base filename without extension
    base_name = os.path.splitext(os.path.basename(html_file))[0]
    
    # Replace hyphens with underscores
    filename = base_name.replace('-', '_')
    
    # Add .md extension
    return f"{filename}.md"

In [15]:
def convert_html_to_markdown(html_files):
    """Convert multiple local HTML files to markdown files."""
    # Create output directory if it doesn't exist
    output_dir = "PlanNet/site/markdown"  # NOTE: Same directory as the URL-to-Markdown converter above; change it to keep the results separate.
    os.makedirs(output_dir, exist_ok=True)
    
    # Process each HTML file
    for html_file in html_files:
        try:
            # Load and convert the HTML file
            loader = UnstructuredHTMLLoader(html_file)
            doc = loader.load()[0]  # load() returns a list of documents; take the first
            
            # Transform to markdown
            md_transformer = MarkdownifyTransformer()
            converted_doc = md_transformer.transform_documents([doc])[0]
            
            # Generate output filename
            output_filename = get_filename_from_html(html_file)
            output_path = os.path.join(output_dir, output_filename)
            
            # Write content to file
            with open(output_path, 'w', encoding='utf-8') as f:
                f.write(converted_doc.page_content)
            
            print(f"Successfully converted {html_file} to {output_filename}")
            
        except Exception as e:
            print(f"Error processing {html_file}: {str(e)}")

In [15]:
html_files = [
    "PlanNet/site/CapabilityStatement-plan-net.html",
    "PlanNet/site/ChangeHistory.html",
    "PlanNet/site/index.html",
    "PlanNet/site/examples.html",
    "PlanNet/site/implementation.html",
    "PlanNet/site/profiles.html",
    "PlanNet/site/artifacts.html"
]

In [None]:
convert_html_to_markdown(html_files)

### Custom Content Extractor

We define a class, `ContextExtractor`, to extract content from HTML files. It has methods to extract text, tables, and list elements from the HTML; the extracted content is formatted as Markdown and written to a file. The class also tracks which elements have already been processed to avoid duplicates. For images, the `_extract_images` method pulls the `src` and alt text and formats them as a Markdown image with additional source information.

In [2]:
class ContextExtractor:
    def __init__(self):
        """Initialize the context extractor"""
        self.processed_elements = set()
        
    def _has_been_processed(self, element):
        """Check if an element has already been processed"""
        if not isinstance(element, Tag):
            return False
        return element.get('data-processed') == 'true'
    
    def _mark_processed(self, element):
        """Mark an element as processed"""
        if isinstance(element, Tag):
            element['data-processed'] = 'true'
            self.processed_elements.add(element)

    def _extract_images(self, element):
        """
        Extract image information including src and alt text
        
        Args:
            element: BeautifulSoup element containing images
        Returns:
            list: Formatted image information in Markdown
        """
        if self._has_been_processed(element):
            return []
            
        images = []
        for img in element.find_all('img', recursive=False):
            src = img.get('src', '')
            alt = img.get('alt', '')
            if src:
                # Format as Markdown image with additional source info
                images.append(f"![{alt}]({src})")
                images.append(f"*Image source: {src}*")
                images.append(f"*Image description: {alt}*")
                images.append("")  # Add blank line after each image
                
        self._mark_processed(element)
        return images

    def _extract_list_items(self, list_element, level=0, parent_type=None):
        """Extract list items with improved nested list handling"""
        if self._has_been_processed(list_element):
            return []
            
        items = []
        for item in list_element.find_all('li', recursive=False):
            if not self._has_been_processed(item):
                # Get direct text content of the li element (excluding nested list text)
                item_text = ''
                for content in item.children:
                    if isinstance(content, Tag):
                        if content.name not in ['ul', 'ol']:
                            if content.name == 'img':
                                # Handle images within list items
                                image_info = self._extract_images(content.parent)
                                items.extend([f"{'    ' * level}{line}" for line in image_info])
                            else:
                                item_text += content.get_text(strip=True) + ' '
                    else:
                        item_text += content.strip() + ' '
                item_text = item_text.strip()
                
                # Format the list item
                prefix = '    ' * level
                if list_element.name == 'ol':
                    items.append(f"{prefix}1. {item_text}")
                else:
                    items.append(f"{prefix}- {item_text}")
                
                # Handle nested lists
                nested_lists = item.find_all(['ul', 'ol'], recursive=False)
                for nested_list in nested_lists:
                    nested_items = []
                    for nested_item in nested_list.find_all('li', recursive=False):
                        nested_text = nested_item.get_text(strip=True)
                        if nested_list.name == 'ol':
                            # Ordered nested lists use numbered markers
                            nested_items.append(f"{prefix}    1. {nested_text}")
                        else:
                            nested_items.append(f"{prefix}    * {nested_text}")
                    items.extend(nested_items)
                
                self._mark_processed(item)
        
        self._mark_processed(list_element)
        return items

    def _extract_table(self, table):
        """Extract table content in Markdown format"""
        if self._has_been_processed(table):
            return ""
            
        rows = []
        headers = []
        
        header_row = table.find('thead') or table.find('tr')
        if header_row:
            headers = [cell.get_text(strip=True) for cell in header_row.find_all(['th', 'td'])]
        
        for row in table.find_all('tr')[1:] if headers else table.find_all('tr'):
            row_data = [cell.get_text(strip=True) for cell in row.find_all(['td', 'th'])]
            if any(row_data):
                rows.append(row_data)
        
        table_str = []
        if headers:
            table_str.append("| " + " | ".join(headers) + " |")
            table_str.append("|" + "|".join([" --- " for _ in headers]) + "|")
        
        for row in rows:
            if headers:
                row.extend([''] * (len(headers) - len(row)))
            table_str.append("| " + " | ".join(row) + " |")
        
        self._mark_processed(table)
        return "\n".join(table_str)

    def extract_context(self, html_content):
        """Extract content with improved list and image handling"""
        soup = BeautifulSoup(html_content, 'html.parser')
        self.processed_elements.clear()
        context_elements = []
        
        # Remove script, style, nav, and footer elements
        for script in soup(['script', 'style', 'nav', 'footer']):
            script.decompose()
        
        # Process headers and their content
        for header in soup.find_all(['h1', 'h2', 'h3', 'h4', 'h5', 'h6']):
            if not self._has_been_processed(header):
                level = int(header.name[1])
                header_text = header.get_text().strip()
                
                if header_text:
                    context_elements.append(f"\n{'#' * level} {header_text}\n")
                    self._mark_processed(header)
                    
                    # Process content until next header
                    next_element = header.find_next()
                    while next_element and next_element.name not in ['h1', 'h2', 'h3', 'h4', 'h5', 'h6']:
                        if not self._has_been_processed(next_element):
                            if next_element.name == 'p':
                                # Handle images within paragraphs
                                if next_element.find('img'):
                                    image_info = self._extract_images(next_element)
                                    context_elements.extend(image_info)
                                else:
                                    text = next_element.get_text().strip()
                                    if text:
                                        context_elements.append(text)
                                        context_elements.append("")
                            elif next_element.name in ['ul', 'ol']:
                                list_items = self._extract_list_items(next_element)
                                if list_items:
                                    context_elements.extend(list_items)
                                    context_elements.append("")
                            elif next_element.name == 'table':
                                table_content = self._extract_table(next_element)
                                if table_content:
                                    context_elements.append(table_content)
                                    context_elements.append("")
                        
                        next_element = next_element.find_next()

        # Process any remaining top-level images
        for img_container in soup.find_all('p'):
            if img_container.find('img') and not self._has_been_processed(img_container):
                image_info = self._extract_images(img_container)
                if image_info:
                    context_elements.extend(image_info)
        
        # Clean up repeated empty lines
        cleaned_elements = []
        prev_empty = False
        for element in context_elements:
            if element.strip() == "":
                if not prev_empty:
                    cleaned_elements.append(element)
                    prev_empty = True
            else:
                cleaned_elements.append(element)
                prev_empty = False
        
        return cleaned_elements

    def save_context(self, context_elements, output_file):
        """Save context to Markdown file"""
        with open(output_file, 'w', encoding='utf-8') as f:
            for element in context_elements:
                f.write(f"{element}\n")

    def process_html_file(self, input_file, output_file):
        """Process HTML file to Markdown"""
        try:
            output_file = os.path.splitext(output_file)[0] + '.md'
            
            with open(input_file, 'r', encoding='utf-8') as f:
                html_content = f.read()
            
            context_elements = self.extract_context(html_content)
            self.save_context(context_elements, output_file)
            self._display_summary(input_file, output_file, len(context_elements))
            
            return context_elements
            
        except Exception as e:
            display(HTML(f'<div style="color: red;">Error processing file: {str(e)}</div>'))
            return []

    def process_directory(self, input_dir, output_dir):
        """Process directory of HTML files"""
        Path(output_dir).mkdir(parents=True, exist_ok=True)
        
        for file in Path(input_dir).glob('*.html'):
            output_file = Path(output_dir) / f"{file.stem}.md"
            self.process_html_file(str(file), str(output_file))

    def _display_summary(self, input_file, output_file, num_elements):
        """Display processing summary"""
        summary_html = f"""
        <div style="background-color: #f0f0f0; padding: 10px; border-radius: 5px; margin: 10px 0;">
            <p><strong>Processed file:</strong> {os.path.basename(input_file)}</p>
            <p><strong>Output saved to:</strong> {os.path.basename(output_file)}</p>
            <p><strong>Extracted elements:</strong> {num_elements}</p>
        </div>
        """
        display(HTML(summary_html))
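
To make the table output format concrete, here is a minimal sketch of the row-to-Markdown step that `_extract_table` performs, written without BeautifulSoup so it can be run on its own. The function name `rows_to_markdown` and the sample rows are illustrative assumptions, not part of the class above.

```python
def rows_to_markdown(headers, rows):
    """Format a header row and data rows as a Markdown table."""
    lines = []
    if headers:
        lines.append("| " + " | ".join(headers) + " |")
        lines.append("|" + "|".join(" --- " for _ in headers) + "|")
    for row in rows:
        # Pad short rows so every row has as many cells as the header
        padded = list(row) + [''] * (len(headers) - len(row))
        lines.append("| " + " | ".join(padded) + " |")
    return "\n".join(lines)


# Hypothetical sample data; the second row is deliberately short
table = rows_to_markdown(
    ["Profile", "Resource"],
    [["Example profile", "Example resource"], ["Short row"]],
)
print(table)
```

The padding step mirrors the `row.extend(...)` call in `_extract_table`, which keeps rows with missing cells aligned with the header.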

### Example Use of Custom Content Extractor

To use `ContextExtractor`, create an instance of the class and call either `process_html_file` or `process_directory`. `process_html_file` takes two arguments: the path to the input HTML file and the desired name of the output Markdown file. `process_directory` also takes two arguments: the path to the input directory containing HTML files and the path to the output directory where the Markdown files will be saved.
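
One detail worth noting: `process_html_file` normalizes the output name with `os.path.splitext`, so any extension passed in is replaced by `.md`. A small sketch of that step in isolation (the helper name `as_markdown_path` is hypothetical):

```python
import os


def as_markdown_path(output_file):
    """Replace whatever extension is present (or none) with .md."""
    return os.path.splitext(output_file)[0] + '.md'


for name in ['context', 'context.txt', 'out/index.html']:
    print(name, '->', as_markdown_path(name))
```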

In [None]:
extractor = ContextExtractor()
context = extractor.process_html_file(
    input_file='PlanNet/site/index.html',
    output_file='context'
)

In [None]:
for i, element in enumerate(context[:5]):  # Show first 5 elements
    print(f"\nElement {i+1}:")
    print(element[:200] + "..." if len(element) > 200 else element)

In [9]:
# Directory containing your HTML files
input_dir = 'PlanNet/site'

In [7]:
# Process all HTML files in the input directory (output_dir was set near the top of the notebook)
for html_file in Path(input_dir).glob('**/*.html'):
    # Create corresponding output path while preserving directory structure
    relative_path = html_file.relative_to(input_dir)
    output_path = Path(output_dir) / relative_path.with_suffix('.md')
    
    # Create necessary subdirectories
    output_path.parent.mkdir(parents=True, exist_ok=True)
    
    # Process the file
    context = extractor.process_html_file(
        input_file=str(html_file),
        output_file=str(output_path)
    )
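
The path arithmetic in the loop above can be checked without touching the file system. The sketch below uses `PurePosixPath` and a hypothetical nested file (`PlanNet/site/sub/page.html`) to show how `relative_to` and `with_suffix` preserve the directory structure under the output folder:

```python
from pathlib import PurePosixPath

input_dir = PurePosixPath('PlanNet/site')
output_dir = PurePosixPath('PlanNet/site/markdown')

# Hypothetical nested input file
html_file = input_dir / 'sub' / 'page.html'

# Path of the file relative to the input root: sub/page.html
relative_path = html_file.relative_to(input_dir)
# Same relative location under the output root, with a .md suffix
output_path = output_dir / relative_path.with_suffix('.md')

print(output_path)
```

Because the output folder sits inside `PlanNet/site`, the recursive glob also walks it. Only `.html` files are matched, so the generated `.md` files are not reprocessed.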