# Step 4: HTML Content Extraction with LLM

This notebook extracts relevant information from the HTML pages downloaded in Step 3. It uses a small Llama model (3B parameters) to analyze each page and extract:

1. The main content (filtering out navigation, ads, etc.)
2. Key points from the content
3. A relevance score based on the query

## Process Flow:
1. Load downloaded HTML files from Step 3
2. Clean and process each HTML file
3. Use a Llama model to extract structured information
4. Filter results by relevance
5. Combine everything into a comprehensive report structure
6. Generate summary insights for each report

The final output will be enriched JSON files with all the research content needed for report generation.

## Import Dependencies and Set Up Directories

First, we'll import the necessary libraries and set up our directory structure:

In [2]:
import json
import os
import torch
from pathlib import Path
import pandas as pd
from bs4 import BeautifulSoup
from transformers import pipeline
import re

## Configuration Settings

We'll set our configuration variables, including:

1. The LLM model to use (a smaller 3B parameter model is sufficient for this task)
2. Directory paths for data organization
3. Minimum relevance threshold for filtering low-quality content

In [ ]:
DEFAULT_MODEL = "meta-llama/Llama-3.2-3B-Instruct" 
base_dir = Path("llama_data")
results_dir = base_dir / "results"
parsed_dir = base_dir / "parsed_content"
# parents=True also creates base_dir if it does not exist yet
parsed_dir.mkdir(parents=True, exist_ok=True)

# Minimum relevance score to keep content (used throughout the notebook)
MIN_RELEVANCE_SCORE = 6

## Define System Prompt for the LLM

The system prompt is crucial: it tells the LLM how to analyze the HTML content and what format to return.
We're creating a "smart AI Intern" persona that will:
1. Extract the main content
2. Identify key points
3. Rate relevance on a scale of 0-10

The persona is a stylistic choice. Playful role prompts like this can sometimes improve output quality, although the effect varies by model and task.
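The three outputs listed above map to the JSON fields `main_content`, `key_points`, and `relevance_score` named in the system prompt. As a minimal sketch, a well-formed reply and a shape check could look like the following. The sample reply and the `validate_extraction` helper are hypothetical and not part of the notebook.

```python
import json

# Hypothetical example of a reply that follows the system prompt.
sample_reply = """
{
  "main_content": "Example article text ...",
  "key_points": ["First takeaway", "Second takeaway", "Third takeaway"],
  "relevance_score": 8
}
"""

def validate_extraction(data):
    """Return a list of problems found in a parsed reply (empty list = looks fine)."""
    # The prompt allows an error object instead of the three fields.
    if "error_message" in data:
        return []
    problems = []
    if not isinstance(data.get("main_content"), str):
        problems.append("main_content should be a string")
    key_points = data.get("key_points")
    if not isinstance(key_points, list) or not all(isinstance(p, str) for p in key_points):
        problems.append("key_points should be a list of strings")
    score = data.get("relevance_score")
    # bool is a subclass of int, so exclude it explicitly.
    if isinstance(score, bool) or not isinstance(score, (int, float)) or not 0 <= score <= 10:
        problems.append("relevance_score should be a number from 0 to 10")
    return problems

print(validate_extraction(json.loads(sample_reply)))
```

A check like this could run right after the JSON is parsed, so that malformed replies are flagged before they reach the relevance filter.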

In [4]:
SYS_PROMPT = """
You are a smart AI Intern, you work with dumb AIs that don't know how to parse HTML.

This is your moment to make mama GPU proud and secure a data centre! Remember to shine and do your job well. You got this!

Your task is to analyze the provided HTML content and extract the following in JSON format:
1. main_content: The main article or content text (exclude navigation, footers, sidebars, ads)
2. key_points: A list of 3-5 key points or takeaways from the content
3. relevance_score: A score from 0-10 indicating relevance to the search query

Return ONLY a valid JSON object with these fields, no additional text.
If you cannot parse the HTML properly, return a JSON with error_message field.
"""

## Initialize the LLM Pipeline

We'll use the Hugging Face transformers library to set up our LLM pipeline.
For this task, we're using a smaller Llama 3.2 model (3B parameters) as it's:
1. Faster than larger models
2. Sufficient for simple extraction tasks
3. More memory-efficient

With `device_map="auto"`, the model is placed on the available hardware automatically. In the run shown here it was placed on the first CUDA GPU (`cuda:0`).

In [5]:
text_pipeline = pipeline(
    "text-generation",
    model=DEFAULT_MODEL,
    model_kwargs={"torch_dtype": torch.bfloat16},
    device_map="auto",
)

Device set to use cuda:0


## HTML Cleaning Function

Before we can analyze HTML content, we need to clean it by:
1. Removing scripts, style tags, navigation elements, etc.
2. Extracting the main text content
3. Handling whitespace
4. Truncating very large content that exceeds model limits

BeautifulSoup handles most of this work efficiently.

In [6]:
def clean_html_content(html_content):
    soup = BeautifulSoup(html_content, 'html.parser')
    # Remove elements that rarely contain main content
    for tag in soup(["script", "style", "nav", "footer", "aside"]):
        tag.extract()
    
    text = soup.get_text(separator=' ', strip=True)
    text = re.sub(r'\s+', ' ', text).strip()
    if len(text) > 110000:
        text = text[:110000] + "... [content truncated]"
    
    return text
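The whitespace and truncation steps of the function above can be tried without BeautifulSoup. The sketch below uses a hypothetical helper, `normalize_and_truncate`, which differs from the notebook in one respect: it backs up to the last space before cutting, so the final word is not split. The small `max_chars` value is only for demonstration.

```python
import re

def normalize_and_truncate(text, max_chars=110000):
    """Collapse whitespace, then cap the text length, cutting at a word boundary."""
    text = re.sub(r"\s+", " ", text).strip()
    if len(text) <= max_chars:
        return text
    cut = text[:max_chars]
    # Back up to the last space so the final word is not split in half.
    last_space = cut.rfind(" ")
    if last_space > 0:
        cut = cut[:last_space]
    return cut + "... [content truncated]"

sample = "Llama   models\n\n are  used   for\textraction"
print(normalize_and_truncate(sample))
print(normalize_and_truncate(sample, max_chars=16))
```

Note that the limit is counted in characters, not tokens, so it is only a rough proxy for the model's context limit.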

## HTML Content Analysis Function

This function is the heart of our processor - it:
1. Loads an HTML file
2. Cleans the content
3. Creates a targeted prompt combining the query and cleaned HTML
4. Sends it to the LLM
5. Extracts and parses the JSON response
6. Handles errors gracefully

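Note: `max_new_tokens=4000` caps the length of the reply, which is usually enough for the extracted JSON.
The low temperature (0.01) makes the output nearly deterministic, which keeps the extraction consistent across runs. It does not by itself guarantee factual accuracy.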
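If fully repeatable output matters more than sampling, greedy decoding is an alternative to a near-zero temperature. The dictionaries below are a sketch of the two option sets as they could be passed to `text_pipeline(conversation, **kwargs)`. The names `SAMPLED_KWARGS` and `GREEDY_KWARGS` are hypothetical, and the greedy variant is an assumption rather than what this notebook uses.

```python
# Settings used in this notebook: sampling with a temperature close to zero.
SAMPLED_KWARGS = {"max_new_tokens": 4000, "do_sample": True, "temperature": 0.01}

# Alternative (an assumption, not what the notebook does): greedy decoding,
# which always picks the most likely next token, so temperature is not used.
GREEDY_KWARGS = {"max_new_tokens": 4000, "do_sample": False}

for name, kwargs in [("sampled", SAMPLED_KWARGS), ("greedy", GREEDY_KWARGS)]:
    print(name, kwargs)
```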

In [ ]:
def parse_html_with_llm(html_path, query, purpose):
    try:
        # Load HTML
        with open(html_path, "r", encoding="utf-8") as f:
            html_content = f.read()
        cleaned_text = clean_html_content(html_content)
        
        # Construct prompt
        conversation = [
            {"role": "system", "content": SYS_PROMPT},
            {"role": "user", "content": f"""
Search Query: {query}
Query Purpose: {purpose}

HTML Content (cleaned):
{cleaned_text}

Extract the key information from this content in JSON format according to the instructions.
"""}
        ]
        
        output = text_pipeline(
            conversation,
            max_new_tokens=4000,  # Reduced from 32000 to a more reasonable size
            temperature=0.01, # cool llm = smart extraction
            do_sample=True,
        )
        
        # Extract the assistant's response
        assistant_response = output[0]["generated_text"][-1]
        response_content = assistant_response["content"]
        
        # Print short progress indicator instead of full content
        print(f"Processing {os.path.basename(html_path)}")
        
        try:
            json_match = re.search(r'({[\s\S]*})', response_content)
            if json_match:
                json_str = json_match.group(1)
                parsed_data = json.loads(json_str)
            else:
                parsed_data = {"error_message": "Failed to extract JSON from LLM response"}
        except json.JSONDecodeError:
            parsed_data = {"error_message": "Invalid JSON in LLM response", "raw_response": response_content[:500]}
        
        return parsed_data
        
    except Exception as e:
        print(f"Error processing file: {str(e)}")
        return {"error_message": f"Error processing file: {str(e)}"}
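The regular expression above captures everything from the first `{` to the last `}`, so parsing fails if the model adds text containing braces after the JSON. As a hedged alternative, the sketch below uses `json.JSONDecoder.raw_decode`, which parses one JSON value and ignores what follows. The helper name `extract_first_json_object` is hypothetical and not part of the notebook.

```python
import json

def extract_first_json_object(text):
    """Return the first JSON object embedded in text, or None if there is none."""
    decoder = json.JSONDecoder()
    start = text.find("{")
    while start != -1:
        try:
            # raw_decode parses one JSON value beginning at `start` and
            # ignores whatever follows it.
            obj, _end = decoder.raw_decode(text, start)
            if isinstance(obj, dict):
                return obj
        except json.JSONDecodeError:
            pass
        # Try the next opening brace if this one did not start valid JSON.
        start = text.find("{", start + 1)
    return None

reply = 'Here is the result: {"relevance_score": 7, "key_points": ["a", "b"]} Hope that helps {smile}.'
print(extract_first_json_object(reply))
print(extract_first_json_object("no json here"))
```

With this sample reply, the greedy regex would capture the trailing `{smile}` as well and `json.loads` would reject the match, whereas the decoder stops at the end of the first object.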

## Main Processing Function

This function orchestrates the entire processing workflow:
1. Loads search results from Step 3
2. Organizes results by report and query
3. Processes each HTML file through the LLM
4. Saves individual results and combined query results
5. Creates a comprehensive dataset of all parsed content

The function follows this hierarchy:
- Reports (research topics)
  - Queries (search queries for each topic)
    - Results (individual web pages)

Each level is saved separately for easy retrieval and analysis.

In [8]:
def process_all_search_results():
    with open(base_dir / "results_so_far.json", "r") as f:
        search_results = json.load(f)
    
    all_parsed_results = []
    
    for query_data in search_results:
        report_index = query_data["report_index"]
        report_title = query_data["report_title"]
        query_index = query_data["query_index"]
        query = query_data["query"]
        purpose = query_data["purpose"]

        report_dir_name = f"report_{report_index}_{report_title.replace(' ', '_').replace(':', '').replace('/', '')[:30]}"
        query_dir_name = f"query_{query_index}_{query.replace(' ', '_').replace(':', '').replace('/', '')[:30]}"
        parsed_report_dir = parsed_dir / report_dir_name
        parsed_report_dir.mkdir(exist_ok=True)
        
        parsed_query_results = []
        
        print(f"\nProcessing results for query: {query}")
        
        for result in query_data["results"]:
            result_index = result["result_index"]
            title = result["title"]
            url = result["url"]
            filepath = result["filepath"]
            
            print(f"  Processing result {result_index + 1}: {title[:50]}...")
            
            if filepath and os.path.exists(filepath):
                parsed_data = parse_html_with_llm(filepath, query, purpose)
                parsed_data.update({
                    "result_index": result_index,
                    "title": title,
                    "url": url,
                    "query": query,
                    "purpose": purpose
                })
                
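                # Include the query index so results from different queries
                # in the same report directory do not overwrite each other
                result_filename = f"parsed_result_q{query_index}_r{result_index}.json"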
                with open(parsed_report_dir / result_filename, "w") as f:
                    json.dump(parsed_data, f, indent=2)
                
                parsed_query_results.append(parsed_data)
            else:
                print(f"    Warning: File not found - {filepath}")
        
        query_results = {
            "report_index": report_index,
            "report_title": report_title,
            "query_index": query_index,
            "query": query,
            "purpose": purpose,
            "parsed_results": parsed_query_results
        }
        
        query_filename = f"parsed_query_{query_index}.json"
        with open(parsed_report_dir / query_filename, "w") as f:
            json.dump(query_results, f, indent=2)
        
        all_parsed_results.append(query_results)
    
    with open(parsed_dir / "all_parsed_results.json", "w") as f:
        json.dump(all_parsed_results, f, indent=2)
    
    return all_parsed_results
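To make the expected input concrete, the sketch below shows one record with the keys that `process_all_search_results()` reads, and reproduces the directory naming. The record values, the file path, and the `safe_name` helper are hypothetical illustrations. Only the key names and the replace-and-truncate logic come from the function above.

```python
def safe_name(prefix, index, text, max_len=30):
    """Build a directory name the same way the notebook does."""
    cleaned = text.replace(" ", "_").replace(":", "").replace("/", "")[:max_len]
    return f"{prefix}_{index}_{cleaned}"

# Hypothetical record with the keys that process_all_search_results() reads.
sample_query_data = {
    "report_index": 0,
    "report_title": "AI Chips: Market Overview 2025",
    "query_index": 2,
    "query": "GPU supply/demand trends",
    "purpose": "Find recent market data",
    "results": [
        {
            "result_index": 0,
            "title": "Example page",
            "url": "https://example.com/page",
            "filepath": "llama_data/results/example.html",  # hypothetical path
        },
    ],
}

report_dir = safe_name("report", sample_query_data["report_index"], sample_query_data["report_title"])
query_dir = safe_name("query", sample_query_data["query_index"], sample_query_data["query"])
print(report_dir)
print(query_dir)
```

Only spaces, colons, and slashes are handled, so titles containing other characters that are invalid in file names may still need extra cleaning.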

## Report Summary Generation

After processing all individual pages, this function:
1. Organizes content by report
2. Loads original report outlines and metadata
3. Filters low-relevance content using our threshold
4. Combines research results with original structure
5. Generates a concise summary for each report
6. Creates an enriched dataset for the next steps

The enriched_reports.json output from this function contains everything needed to generate 
complete research reports in the next step.

In [ ]:
def generate_report_summaries(all_parsed_results):
    # Load original outlines and metadata
    try:
        with open('generated_outlines.json', 'r') as f:
            original_outlines = json.load(f)
    except FileNotFoundError:
        print("Warning: generated_outlines.json not found. Proceeding without original outlines.")
        original_outlines = []
    
    # Create a dictionary mapping report_index to original outline data
    original_data_by_index = {}
    if original_outlines:
        for i, outline in enumerate(original_outlines):
            original_data_by_index[i] = {
                "original_goal": outline.get("original_goal", {}),
                "personality": outline.get("personality", {}),
                "vibe": outline.get("vibe", ""),
                "outline_structure": outline.get("outline", []),
                "web_queries": outline.get("Web Queries", [])
            }
    
    report_summaries = {}
    
    # MIN_RELEVANCE_SCORE is defined once in the configuration cell
    
    for query_result in all_parsed_results:
        report_index = query_result["report_index"]
        report_title = query_result["report_title"]
        
        if report_index not in report_summaries:
            report_summaries[report_index] = {
                "report_title": report_title,
                "queries": [],
                # Add original data if available
                **(original_data_by_index.get(report_index, {}))
            }
        
        # Filter out low-relevance results
        filtered_results = [
            r for r in query_result["parsed_results"] 
            if r.get("relevance_score", 0) >= MIN_RELEVANCE_SCORE
        ]
        
        if not filtered_results:
            print(f"Warning: No high-relevance results for query: {query_result['query']}")
            # Skip this query if it has no relevant results
            continue
        
        query_summary = {
            "query": query_result["query"],
            "purpose": query_result["purpose"],
            "result_count": len(filtered_results),
            "average_relevance": sum(r.get("relevance_score", 0) for r in filtered_results) / 
                              max(1, len(filtered_results)),
            "relevant_results": [
                {
                    "title": r["title"],
                    "url": r["url"],
                    "main_content": r.get("main_content", "No content available"),
                    "key_points": r.get("key_points", []),
                    "relevance_score": r.get("relevance_score", 0)
                }
                for r in sorted(
                    filtered_results, 
                    key=lambda x: x.get("relevance_score", 0), 
                    reverse=True
                )
            ]
        }
        
        report_summaries[report_index]["queries"].append(query_summary)
    
    for report_index, report_data in report_summaries.items():
        print(f"\nGenerating summary for report: {report_data['report_title']}")
        
        # Construct summary prompt
        queries_info = "\n\n".join([
            f"Query: {q['query']}\nPurpose: {q['purpose']}\nTop Results:\n" + 
            "\n".join([f"- {r['title']}: {' '.join(r['key_points'][:2])}" for r in q["relevant_results"][:3]])
            for q in report_data["queries"]
        ])
        
        summary_prompt = f"""
Report Title: {report_data['report_title']}

The following searches were conducted for this report:

{queries_info}

Based on these search results, generate a brief report outline with:
1. Key findings across all queries
2. Important data points uncovered
3. Suggested sections for the final report
4. Areas where more research might be needed

Return this as a JSON with fields: key_findings, data_points, suggested_sections, and research_gaps.
"""
        
        conversation = [
            {"role": "system", "content": "You are a research assistant who helps summarize findings from web searches into structured report outlines."},
            {"role": "user", "content": summary_prompt}
        ]
        
        # Generate report summary
        output = text_pipeline(
            conversation,
            max_new_tokens=4000,  # Reduced from 32000
            temperature=0.1,
            do_sample=True,  # set explicitly; temperature only applies when sampling
        )
        
        # Extract the assistant's response
        assistant_response = output[0]["generated_text"][-1]
        response_content = assistant_response["content"]
        
        # Extract JSON from response
        try:
            json_match = re.search(r'({[\s\S]*})', response_content)
            if json_match:
                json_str = json_match.group(1)
                report_summary = json.loads(json_str)
            else:
                report_summary = {"error": "Failed to extract JSON from LLM response"}
        except json.JSONDecodeError:
            report_summary = {"error": "Invalid JSON in LLM response"}
        
        report_data["generated_summary"] = report_summary
    
    # Save enriched report summaries with all original context
    enriched_reports_path = parsed_dir / "enriched_reports.json"
    with open(enriched_reports_path, "w") as f:
        json.dump(report_summaries, f, indent=2)
    
    # Also save the same data as report_summaries.json for backward compatibility
    with open(parsed_dir / "report_summaries.json", "w") as f:
        json.dump(report_summaries, f, indent=2)
    
    print(f"\nEnriched reports saved to: {enriched_reports_path}")
    return report_summaries
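The relevance filter above compares `relevance_score` directly with the threshold, which would raise a `TypeError` if the model returned the score as a string or as `null`. The sketch below shows one possible way to guard against that. The `coerce_score` helper and the sample data are hypothetical.

```python
MIN_RELEVANCE_SCORE = 6

def coerce_score(value, default=0.0):
    """Convert a model-provided score to float, falling back to default."""
    try:
        return float(value)
    except (TypeError, ValueError):
        return default

# Hypothetical parsed results, including malformed scores.
parsed_results = [
    {"title": "A", "relevance_score": 9},
    {"title": "B", "relevance_score": "7"},   # string instead of number
    {"title": "C", "relevance_score": None},  # null score
    {"title": "D", "relevance_score": 3},
    {"title": "E", "error_message": "Invalid JSON in LLM response"},
]

# Keep only results at or above the threshold, highest score first.
kept = [r for r in parsed_results if coerce_score(r.get("relevance_score")) >= MIN_RELEVANCE_SCORE]
kept.sort(key=lambda r: coerce_score(r.get("relevance_score")), reverse=True)
average = sum(coerce_score(r.get("relevance_score")) for r in kept) / max(1, len(kept))

print([r["title"] for r in kept], average)
```

Results with a missing or unparseable score fall back to 0 and are filtered out, which matches how the notebook treats results that lack a `relevance_score` field.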

## Main Execution

Now we'll run the entire HTML extraction process:
1. Process all search results to extract key information
2. Generate summarized insights for each report
3. Calculate and display statistics about the process

This may take some time depending on:
- How many HTML files need processing
- The size of each file
- The speed of your GPU

The results will be saved in organized JSON files for the next step.

In [ ]:
print("Starting HTML parsing process with LLM...")
all_parsed_results = process_all_search_results()

print("\nGenerating report summaries...")
report_summaries = generate_report_summaries(all_parsed_results)

print("\nProcessing complete. Results saved to:")
print(f"- All parsed results: {parsed_dir / 'all_parsed_results.json'}")
print(f"- Enriched reports with original context: {parsed_dir / 'enriched_reports.json'}")

total_queries = len(all_parsed_results)
total_results = sum(len(query["parsed_results"]) for query in all_parsed_results)
total_reports = len(report_summaries)

print("\nSummary Statistics:")
print(f"- Total Reports: {total_reports}")
print(f"- Total Queries: {total_queries}")
print(f"- Total Results Parsed: {total_results}")

# Calculate statistics only on high-relevance results
# (MIN_RELEVANCE_SCORE is defined once in the configuration cell)
high_relevance_results = [
    result
    for query in all_parsed_results 
    for result in query["parsed_results"]
    if result.get("relevance_score", 0) >= MIN_RELEVANCE_SCORE
]

total_high_relevance = len(high_relevance_results)
print(f"- High Relevance Results (score >= {MIN_RELEVANCE_SCORE}): {total_high_relevance}")

if high_relevance_results:
    avg_relevance = sum(result.get("relevance_score", 0) for result in high_relevance_results) / total_high_relevance
    print(f"- Average Relevance Score (high relevance only): {avg_relevance:.2f}/10")

# Display a sample of one report's structure (to verify format)
if report_summaries:
    sample_report_index = list(report_summaries.keys())[0]
    sample_report = report_summaries[sample_report_index]
    print(f"\nSample Structure for Report '{sample_report['report_title']}':")
    print("- Original metadata included")
    print("- Queries with filtered high-relevance results")
    print("- Generated summary included")
    print("Ready for the next step in the workflow!")
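One detail worth keeping in mind when the next step reads `enriched_reports.json`: JSON object keys are always strings, so a dictionary keyed by integer report indexes comes back keyed by strings. The sketch below illustrates this with a hypothetical one-report dictionary, assuming `report_index` is an integer in the search results.

```python
import json

# In memory, report_summaries is keyed by report_index (assumed to be an int).
report_summaries = {0: {"report_title": "Example report", "queries": []}}

# Round-trip through JSON, as happens when the file is written and read back.
reloaded = json.loads(json.dumps(report_summaries))
print(list(report_summaries.keys()), list(reloaded.keys()))

# One way to restore integer keys after loading.
restored = {int(k): v for k, v in reloaded.items()}
print(restored == report_summaries)
```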