# LLM Project: Company Brochure Generator Using Llama 3.2 and Web Scraping

This program takes the URL of a company website from user input and combines web scraping with an LLM (Llama 3.2) to build a brochure.

It scrapes the main page, follows the relevant sub-links to collect their content, and generates a brochure for the company.

### Step 1: Install Required Libraries
To begin, we need the following Python libraries:
- `requests`: To fetch the webpage content.
- `beautifulsoup4`: To parse and clean up the webpage HTML.
- `ollama`: To interface with the locally installed Llama 3.2 model.

Once the libraries have been installed in your environment, open a Jupyter notebook and proceed to the next steps.
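As a quick sanity check before continuing, the sketch below reports which of the three libraries cannot be imported. The helper name `missing_packages` and the module-to-package mapping are illustrative assumptions, not part of the original notebook. The Llama 3.2 model itself is normally downloaded separately through Ollama (for example with `ollama pull llama3.2`).

```python
import importlib.util

# Import name -> pip package name (beautifulsoup4 is imported as "bs4")
REQUIRED = {"requests": "requests", "bs4": "beautifulsoup4", "ollama": "ollama"}


def missing_packages(required=REQUIRED):
    """Return the pip names of packages whose import name cannot be found."""
    return [
        pip_name
        for module_name, pip_name in required.items()
        if importlib.util.find_spec(module_name) is None
    ]


missing = missing_packages()
if missing:
    print("Missing packages. Install with: pip install " + " ".join(missing))
else:
    print("All required libraries are importable.")
```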

### Step 2: Fetch Webpage Content
A class `Website` is created. This class:
- Takes a URL as input.
- Sends a request to fetch the webpage.
- Uses a user-agent header to mimic a real browser request.
- Stores the returned HTML and parses it using BeautifulSoup.
- Removes irrelevant elements (`<script>`, `<style>`, `<img>`, `<input>`) from the parsed HTML.
- Extracts the links found on the page and converts relative links to absolute URLs.

In [213]:
# Creating a class to fetch main webpage 

import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin

# Defining a dictionary "headers" to mimic a real web browser request.
headers = {
 "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/117.0.0.0 Safari/537.36"
}

# Creating a class that will store webpage content, title and links.

class Website:
    
    def __init__(self, url):
        self.url = url
        response = requests.get(url, headers=headers)                                 # Makes an HTTP GET request to the given URL with pre-defined headers
        self.body = response.content                                                  # Stores the raw HTML of the page
        self.soup = BeautifulSoup(self.body, 'html.parser')                           # Parses the HTML content using html.parser
        self.title = self.soup.title.string if self.soup.title else "No title found"  # Extracts title of webpage
        
        if self.soup.body:                                                            # Removes unnecessary elements such as <script>, <style>, <img>, <input>
            for irrelevant in self.soup.body(["script", "style", "img", "input"]):
                irrelevant.decompose()
            self.text = self.soup.body.get_text(separator="\n", strip=True)           # separator="\n" separates each text block with a new line. strip=True gets rid of unwanted whitespace.
            
        else:
            self.text = ""
            
        """
        ***EXTRACTING LINKS*** 
        
        Now we need to extract all links available in the webpage.
        soup.find_all('a') finds all anchor <a> elements in the parsed HTML which contains hyperlinks.

        For example:
        
        If HTML content is:
        
        <html>
            <body>
                <a href="https://example.com">Example</a>
                <a href="https://google.com">Google</a>
                <a>Broken Link</a>  <!-- No href attribute -->
            </body>
        </html>

        Upon parsing using soup.find_all('a') we will get output:

        [
            <a href="https://example.com">Example</a>,
            <a href="https://google.com">Google</a>,
            <a>Broken Link</a>
        ]

        To extract only the href attribute values we use link.get('href'). <a> tags with no href return None,
        i.e. 

        [
            "https://example.com",
            "https://google.com",
            None
        ]
        
        """
        
        links = [link.get('href') for link in self.soup.find_all('a')]
        
        """Now we iterate over every "link" in "links" to filter out the None (and empty) values, so that the list "self.links" keeps only usable hrefs"""
        self.links = [link for link in links if link]   

        # Convert relative URLs to absolute URLs
        self.links = [urljoin(self.url, link) for link in self.links]

    def get_contents(self):                                                        # Returns the webpage title and extracted text as one formatted string
        return f"Webpage Title:\n{self.title}\nWebpage Contents:\n{self.text}\n\n"
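
To illustrate the relative-to-absolute conversion at the end of `__init__`, here is a minimal sketch of how `urljoin` treats different kinds of `href` values. The sample hrefs are made up for illustration; only the base URL comes from this notebook.

```python
from urllib.parse import urljoin

base_url = "https://huggingface.co/"

# Hypothetical href values of the kinds commonly found in <a> tags
sample_hrefs = ["/models", "docs", "https://github.com/huggingface", "#main"]

# Same call as in the Website class: relative hrefs are resolved against the base
absolute_links = [urljoin(base_url, href) for href in sample_hrefs]

for href, absolute in zip(sample_hrefs, absolute_links):
    print(f"{href!r:35} -> {absolute}")
```

Hrefs that are already absolute pass through unchanged, which is why links to external sites can still appear in `self.links`.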

In [214]:
url = input("Enter a website URL: ")  # Prompt the user for a URL
if not url.startswith("http"):        # Ensure it starts with http or https
    url = "https://" + url

web = Website(url)                    # Create a Website object with the user-provided URL

Enter a website URL:  https://huggingface.co/
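
The `startswith("http")` check above accepts any input that begins with those four letters. As a slightly stricter alternative, the sketch below validates the scheme and host with `urlparse`. The function name `normalize_url` and its error messages are assumptions for illustration, not part of the original notebook.

```python
from urllib.parse import urlparse


def normalize_url(raw):
    """Return a cleaned http(s) URL, or raise ValueError if it cannot be one."""
    raw = raw.strip()
    if not raw:
        raise ValueError("No URL was entered.")
    # Add a default scheme when the user typed a bare domain such as "huggingface.co"
    if "://" not in raw:
        raw = "https://" + raw
    parsed = urlparse(raw)
    if parsed.scheme not in ("http", "https") or not parsed.netloc:
        raise ValueError(f"Not a usable web URL: {raw!r}")
    return raw


print(normalize_url("huggingface.co"))
```

It could be used as `web = Website(normalize_url(input("Enter a website URL: ")))`.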


### Testing and analysis
The following cells test the output of the `Website` class using the user-provided link (`web`).
**You can proceed to Step 3 if no testing is required.**

In [215]:
# Title of webpage
web.title

'Hugging Face – The AI community building the future.'

In [None]:
# Raw HTML content
print(web.body)

In [None]:
# Parsed HTML content
print(web.soup)

In [None]:
# Cleaned parsed text
print(web.text)

In [216]:
# List of links
web.links

['https://huggingface.co/',
 'https://huggingface.co/models',
 'https://huggingface.co/datasets',
 'https://huggingface.co/spaces',
 'https://huggingface.co/posts',
 'https://huggingface.co/docs',
 'https://huggingface.co/enterprise',
 'https://huggingface.co/pricing',
 'https://huggingface.co/login',
 'https://huggingface.co/join',
 'https://huggingface.co/spaces',
 'https://huggingface.co/models',
 'https://huggingface.co/Wan-AI/Wan2.1-T2V-14B',
 'https://huggingface.co/microsoft/Phi-4-multimodal-instruct',
 'https://huggingface.co/deepseek-ai/DeepSeek-R1',
 'https://huggingface.co/perplexity-ai/r1-1776',
 'https://huggingface.co/allenai/olmOCR-7B-0225-preview',
 'https://huggingface.co/models',
 'https://huggingface.co/spaces/Wan-AI/Wan2.1',
 'https://huggingface.co/spaces/nanotron/ultrascale-playbook',
 'https://huggingface.co/spaces/huggingface/ai-deadlines',
 'https://huggingface.co/spaces/black-forest-labs/FLUX.1-dev',
 'https://huggingface.co/spaces/lllyasviel/LuminaBrush',
 'h

In [None]:
# Testing get_contents() method
print(web.get_contents())
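
The `web.links` output shown earlier contains repeated entries (for example, `https://huggingface.co/models` appears several times). A minimal sketch for removing duplicates while keeping the original order is given below; the helper name `dedupe_links` is hypothetical, and the notebook itself does not perform this step.

```python
def dedupe_links(links):
    """Return the links with duplicates removed, keeping first-seen order."""
    # dict keys are unique and preserve insertion order
    return list(dict.fromkeys(links))


sample_links = [
    "https://huggingface.co/models",
    "https://huggingface.co/spaces",
    "https://huggingface.co/models",
]
print(dedupe_links(sample_links))
```

Passing `dedupe_links(web.links)` to the LLM in Step 3 would shorten the prompt without changing the set of candidate links.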

### Step 3: Using LLM to decide the relevant links

Scraping every link on the page is unnecessary, so we keep only the URLs that are relevant to a brochure. 
We use the LLM to decide which links are relevant.

- Create a function called `filter_relevant_links` that asks the LLM to decide which links are relevant.
- Use this function to select the relevant links from `web.links`.

In [217]:
import ollama

# Function to filter relevant links using Ollama
def filter_relevant_links(links):
    """
    Uses Ollama to analyze extracted URLs and filter relevant ones based on a prompt.
    """
    # Format links into a text block for LLM processing
    links_text = "\n".join(links)
    
    # Create a structured prompt for Ollama
    full_prompt = f"""
    Here is a list of URLs extracted from a company's website:

    {links_text}

    Your task:
    - Identify URLs that would be most relevant to include in a brochure about the company, such as About page, Careers/Job page, Services.
    - Ignore links to terms of service, privacy, login pages, or external sites.
    - Return only relevant URLs without any additional text or explanations.

    Respond with the filtered list.
    """

    # Query Ollama
    response = ollama.chat(model="llama3.2", messages=[{"role": "user", "content": full_prompt}])
    
    # Extract response and return
    relevant_links = response['message']['content'].split("\n")

    # Filter out empty lines and any text that is not a URL (strip first so indented lines are not dropped)
    return [link.strip() for link in relevant_links if link.strip().startswith("http")]
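
The line-based parsing above only keeps lines that begin with `http`, so a reply formatted as a bulleted or numbered list would be dropped. As an alternative, here is a minimal sketch that pulls URLs out of free text with a regular expression. The pattern and the helper name `extract_urls` are assumptions for illustration; they are not part of the original notebook or of the Ollama API.

```python
import re

# Matches http(s) URLs up to the next whitespace, quote, or closing bracket
URL_PATTERN = re.compile(r"https?://[^\s<>\"')\]]+")


def extract_urls(response_text):
    """Return the unique URLs found in the text, in order of appearance."""
    urls = []
    for match in URL_PATTERN.findall(response_text):
        url = match.rstrip(".,;")  # drop trailing sentence punctuation
        if url not in urls:
            urls.append(url)
    return urls


sample_reply = """Here are the relevant links:
- https://huggingface.co/pricing
2. https://huggingface.co/docs,
(https://huggingface.co/enterprise)
"""
print(extract_urls(sample_reply))
```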

In [219]:
filter_links = filter_relevant_links(web.links)
print("NOTE: If the list below is empty, the model's reply contained no lines starting with 'http'. Try running this cell again.\n")
print(filter_links) # Prints out the relevant links decided by LLM


['https://huggingface.co/about', 'https://huggingface.co/join', 'https://huggingface.co/pricing#services', 'https://huggingface.co/learn', 'https://huggingface.co/blog', 'https://discuss.huggingface.co', 'https://status.huggingface.co', 'https://github.com/huggingface', 'https://twitter.com/huggingface', 'https://www.linkedin.com/company/huggingface/', 'https://huggingface.co/join/discord']
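
Because the model writes its answer as free text, it may return URLs that were never in the list it was given. A minimal sketch of a guard that keeps only URLs present in the scraped list follows; the helper name `keep_known_links` and the sample data are hypothetical.

```python
def keep_known_links(candidates, known_links):
    """Keep only candidate URLs that appear in known_links, without duplicates."""
    known = set(known_links)
    kept = []
    for url in candidates:
        if url in known and url not in kept:
            kept.append(url)
    return kept


# Made-up data: the second candidate was not among the scraped links
scraped = ["https://example.com/pricing", "https://example.com/docs"]
from_llm = ["https://example.com/pricing", "https://example.com/about"]
print(keep_known_links(from_llm, scraped))
```

In this notebook the call would be `keep_known_links(filter_links, web.links)`.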


## Step 4: Accessing every relevant link and scraping contents

Every link selected by the LLM (`filter_links`) is scraped using the `Website` class created in Step 2.

The title, link and content of each page are stored in `web_contents`.

In [220]:
# Create a list to store extracted titles and cleaned texts
web_contents = []

# Loop through each filtered link and extract its title and text
for link in filter_links:
    web_page = Website(link)  # Create a Website object
    
    # Filter out lines with at most 3 words
    filtered_text = "\n".join(line for line in web_page.text.split("\n") if len(line.split()) > 3)
    
    # Store title and filtered text
    web_contents.append((web_page.title, link, filtered_text))

# Print the extracted titles, links, and cleaned texts
for title, link, text in web_contents:
    print(f"Title: {title}\nLink: {link}\nContent: {text[:500]}...\n")  # Print first 500 characters as preview

Title: about (Sergei)
Link: https://huggingface.co/about
Content: AI & ML interests...

Title: No title found
Link: https://huggingface.co/join
Content: ...

Title: Hugging Face – Pricing
Link: https://huggingface.co/pricing#services
Content: Leveling up AI collaboration and compute.
Users and organizations already use the Hub as a collaboration platform,
we’re making it easy to seamlessly and scalably launch ML compute directly from the Hub.
Collaborate on Machine Learning
Host unlimited public models, datasets
Create unlimited orgs with no member limits
Access the latest ML tools and open source
Unlock advanced HF features
ZeroGPU and Dev Mode for Spaces
Higher rate limits for serverless inference
Get early access to upcoming featu...

Title: Hugging Face - Learn
Link: https://huggingface.co/learn
Content: This course will teach you about natural language processing using libraries from the HF ecosystem
Learn to build and deploy your own AI agents
This course will teach you about dee
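
The loop above stops at the first link that fails to load, and pages with no usable text (such as the `/join` entry) still end up in `web_contents`. The sketch below shows one way to tolerate both cases. The function `scrape_pages` and its `fetch` argument are assumptions for illustration; `fetch` stands for any callable that takes a URL and returns a `(title, text)` pair.

```python
def scrape_pages(links, fetch, min_words=4):
    """Scrape each link, skipping failures and pages with no substantial text."""
    contents, failures = [], []
    for link in links:
        try:
            title, text = fetch(link)
        except Exception as exc:  # keep going if one page cannot be fetched
            failures.append((link, str(exc)))
            continue
        # Same filter as above: keep lines with more than 3 words
        filtered = "\n".join(
            line for line in text.split("\n") if len(line.split()) >= min_words
        )
        if filtered:
            contents.append((title, link, filtered))
    return contents, failures


# Stand-in fetcher with made-up pages so the sketch runs without network access
def fake_fetch(url):
    if url.endswith("/broken"):
        raise RuntimeError("request failed")
    pages = {
        "https://example.com/a": ("Page A", "Menu\nThis line has enough words\nOK"),
        "https://example.com/empty": ("Empty", "Sign up\nLog in"),
    }
    return pages[url]


demo_links = ["https://example.com/a", "https://example.com/broken", "https://example.com/empty"]
demo_contents, demo_failures = scrape_pages(demo_links, fake_fetch)
print(demo_contents)
print(demo_failures)
```

With the `Website` class from Step 2, `fetch` could be a small wrapper that builds `Website(url)` and returns its `title` and `text`.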

## Step 4b (Optional): Combining all the scraped contents together to be used in prompt

LLM prompts are typically a single string of text rather than a list of individual items. To create this, the contents of `web_contents` (a list of title, link and text tuples) are combined into `content_text`, a single string that holds all the relevant information for the model prompt.

**NOTE:** Later testing showed that the LLM works better when `web_contents` is passed directly as a list, so you can **skip this part and proceed to Step 5.**

In [None]:
# Prepare content_text from web_contents
content_text = "\n\n".join([f"Title: {title}\nLink: {link}\nContent: {text}" for title, link, text in web_contents])

# Print content_text to inspect the combined website content
print("Content Text being sent to Ollama:\n\n", content_text)
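
If the combined text becomes very long, it may help to cap how much of each page goes into the prompt. The sketch below wraps the same join in a function with an optional per-page character limit. The function name `build_content_text`, the `max_chars_per_page` parameter, and the sample data are assumptions for illustration.

```python
def build_content_text(web_contents, max_chars_per_page=None):
    """Join (title, link, text) tuples into one string, optionally truncating each text."""
    sections = []
    for title, link, text in web_contents:
        if max_chars_per_page is not None:
            text = text[:max_chars_per_page]
        sections.append(f"Title: {title}\nLink: {link}\nContent: {text}")
    return "\n\n".join(sections)


# Made-up sample in the same shape as web_contents
sample_contents = [
    ("Pricing", "https://example.com/pricing", "Plans and prices for every team"),
    ("Docs", "https://example.com/docs", "Guides and API references"),
]
print(build_content_text(sample_contents, max_chars_per_page=16))
```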

## Step 5: Using LLM to generate the brochure

Once `web_contents` is ready, we can use a second LLM prompt that turns the collected information into a brochure. 
The procedure is as follows:
- Create a function `generate_brochure_from_contents` that generates the brochure from the scraped contents.
- Pay attention to `prompt` (the user prompt) and `system_prompt`, as they define the output you will get.
- Once the function is ready, generate the brochure using `generate_brochure_from_contents(web_contents)`.


- **NOTE #1:** `generate_brochure_from_contents(content_text)` from **Step 4b** should also work. However, `generate_brochure_from_contents(web_contents)` seems to produce better output despite being a list. This might be because the structured data in the list is easier for the model to parse.
  
- **NOTE #2:** If you are not satisfied with the generated brochure, try running `brochure = generate_brochure_from_contents(web_contents)` again. In most cases you should get a good output within 2 to 3 tries.
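
To make NOTE #1 concrete: when a list is placed inside an f-string, Python inserts the list's `repr`, so the model sees brackets, quotes and escaped newlines rather than plain text. The sketch below compares the two forms on a made-up sample.

```python
# Made-up sample in the same shape as web_contents
sample_contents = [
    ("Pricing", "https://example.com/pricing", "First line of text\nSecond line of text"),
]

# What the prompt receives when the list itself is interpolated
as_list = f"{sample_contents}"

# What the prompt receives when the items are joined into one string (Step 4b)
as_text = "\n\n".join(
    f"Title: {title}\nLink: {link}\nContent: {text}"
    for title, link, text in sample_contents
)

print(as_list)
print("---")
print(as_text)
```

In the list form each page stays on a single line with explicit delimiters, while the joined form spreads the page text over several lines.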

In [221]:
# Function to generate a brochure using Ollama
def generate_brochure_from_contents(scraped_text):
    # Define the prompt
    prompt = f"""
    Create a brochure based on the following website content:
    {scraped_text}
    Your task is to generate a professional brochure without any picture that highlights key information about the company, such as:
    - Services
    - Products
    - Values
    - Mission
    - Impact
    - Webpage
    - Contact e-mail
    The brochure should be concise and informative. Do not include any text other than the brochure itself. It should be focused solely on the content provided in the prompt without introducing any irrelevant material or placeholders like '[Cover Page]', '[Insert Twitter Handle]', or similar.
    Do not mention where to insert pictures. Ensure all content is strictly from the provided context, and avoid referencing external sites or inserting placeholder links.
    The brochure should only contain the relevant information without any extraneous formatting or mentions of page numbers like 'Cover page', 'Company Logo', 'Page 1', 'Back Cover', etc.
    """
    # Add a system prompt to define instructions and control behavior
    system_prompt = """You are a helpful assistant tasked with generating a professional brochure from a large chunk of web-scraped text. Follow these instructions:
    1. Focus only on the content that is directly related to the company, its services, products, values, and mission.
    2. Discard any irrelevant content, especially external links, and avoid introducing content like 'Cover Page', 'Back Cover', or placeholders.
    3. Organize the information in a clear, concise manner and do not insert unnecessary page numbers, external links, or picture placeholders.
    """
    # Send the prompt to Ollama
    response = ollama.chat(model="llama3.2", messages=[{"role": "system", "content": system_prompt}, {"role": "user", "content": prompt}])
    
    # Extract and return the response content
    brochure_content = response['message']['content']
    return brochure_content


In [226]:
# Generate the brochure
brochure = generate_brochure_from_contents(web_contents)

In [227]:
print(brochure)

**Welcome to Hugging Face**

**Empowering AI and Machine Learning**

At Hugging Face, we are dedicated to making cutting-edge machine learning and natural language processing (NLP) technologies accessible to everyone. Our mission is to democratize access to the most advanced AI models and tools, enabling researchers, developers, and organizations to build and deploy innovative applications.

**Services**

* **Model Hub**: We provide a vast repository of pre-trained models for various NLP tasks, making it easy for users to integrate state-of-the-art models into their applications.
* **Inference Providers**: Our partnership with leading providers enables seamless deployment of AI models as web applications without requiring any coding expertise.
* **Gradio**: Our open-source framework allows developers to build interactive web applications with the most advanced AI models, making it easier to showcase and share research.

**Products**

* **Distill-Any-Depth**: A new SOTA monocular depth 
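
NOTE #2 suggests re-running the generation by hand when the result is unsatisfactory. A minimal sketch of automating that retry is shown below: it re-generates when the text is empty or contains bracketed placeholders of the kind the prompt forbids. The names `has_placeholders` and `generate_with_retries`, the pattern, and the default of 3 tries are assumptions for illustration.

```python
import re

# Bracketed placeholders such as "[Cover Page]" or "[Insert Twitter Handle]"
PLACEHOLDER_PATTERN = re.compile(
    r"\[(?:insert|cover page|company logo|page \d+|back cover)[^\]]*\]",
    re.IGNORECASE,
)


def has_placeholders(text):
    """Return True if the text contains a placeholder the prompt asked to avoid."""
    return bool(PLACEHOLDER_PATTERN.search(text))


def generate_with_retries(generate, contents, max_tries=3):
    """Call generate(contents) until the result looks clean or tries run out.

    Returns (last_result, number_of_tries_used).
    """
    result = ""
    for attempt in range(1, max_tries + 1):
        result = generate(contents)
        if result.strip() and not has_placeholders(result):
            return result, attempt
    return result, max_tries


print(has_placeholders("[Insert Twitter Handle]"))
print(has_placeholders("**Welcome to Hugging Face**"))
```

In this notebook it could be called as `brochure, tries = generate_with_retries(generate_brochure_from_contents, web_contents)`. The check only catches obvious placeholders, so the result should still be read before use.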

## Step 6: Get your brochure in Markdown format

Congratulations! You generated the brochure in the previous step. This step only tidies the text so that it renders neatly as Markdown.

In [228]:
# Function to convert the brochure content into markdown format
def convert_to_markdown(content):
    # Initialize the markdown formatted content
    markdown = ""

    lines = content.split('\n')
    
    for line in lines:
        stripped = line.strip()
        # If the whole line is wrapped in **...**, format it as a markdown header
        if len(stripped) > 4 and stripped.startswith("**") and stripped.endswith("**"):
            markdown += f"## {stripped.strip('*').strip()}\n"
        # If it starts with a "* " bullet point, make it a markdown list item
        elif stripped.startswith("* "):
            markdown += f"- {stripped[2:]}\n"
        # Else, treat it as regular text
        else:
            markdown += f"{line}\n"

    return markdown

# Convert the generated brochure content into markdown format
markdown_brochure = convert_to_markdown(brochure)

# Print the markdown formatted brochure
#print(markdown_brochure)

In [229]:
from IPython.display import Markdown, display

# Display the markdown content in the Jupyter notebook
display(Markdown(markdown_brochure))

## Welcome to Hugging Face

## Empowering AI and Machine Learning

At Hugging Face, we are dedicated to making cutting-edge machine learning and natural language processing (NLP) technologies accessible to everyone. Our mission is to democratize access to the most advanced AI models and tools, enabling researchers, developers, and organizations to build and deploy innovative applications.

## Services

- **Model Hub**: We provide a vast repository of pre-trained models for various NLP tasks, making it easy for users to integrate state-of-the-art models into their applications.
- **Inference Providers**: Our partnership with leading providers enables seamless deployment of AI models as web applications without requiring any coding expertise.
- **Gradio**: Our open-source framework allows developers to build interactive web applications with the most advanced AI models, making it easier to showcase and share research.

## Products

- **Distill-Any-Depth**: A new SOTA monocular depth estimation model trained on a large dataset, achieving state-of-the-art performance across both indoor and outdoor scenes.
- **FastRTC**: A breakthrough technology eliminating hours of AI deployment work with a single button on Space creation from any HuggingFace model.

## Values

- **Democratization**: We believe that everyone should have access to the most advanced AI models and tools, regardless of their background or expertise.
- **Collaboration**: We foster a community of developers, researchers, and organizations working together to advance the field of NLP and machine learning.
- **Innovation**: We are committed to pushing the boundaries of what is possible with AI and machine learning.

## Mission

Our mission is to empower the world with cutting-edge AI and machine learning technologies, making it easier for everyone to build and deploy innovative applications.

## Impact

By democratizing access to advanced AI models and tools, we aim to accelerate progress in various fields, including but not limited to:

- Natural Language Processing
- Computer Vision
- Healthcare
- Education

## Webpage

Learn more about our products, services, and community at [https://huggingface.co/](https://huggingface.co/)

## Contact Email

For any inquiries or collaborations, please contact us at [info@huggingface.com](mailto:info@huggingface.com)
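
To keep the brochure outside the notebook, the Markdown text can be written to a file. The sketch below is one way to do that with the standard library; the helper name `save_brochure` and the default filename `brochure.md` are assumptions for illustration.

```python
from pathlib import Path


def save_brochure(markdown_text, path="brochure.md"):
    """Write the Markdown text to a UTF-8 file and return its path."""
    out_path = Path(path)
    out_path.write_text(markdown_text, encoding="utf-8")
    return out_path


# Example call in this notebook (left commented so nothing is written by accident):
# saved_to = save_brochure(markdown_brochure)
# print(f"Brochure saved to {saved_to.resolve()}")
```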
