# LLM Project: Company Brochure Generator Using Llama 3.2 and Web Scraping

This program takes a company website URL from the user, scrapes the page, and uses an LLM (Llama 3.2) to decide which of its sub-links are relevant.

It then scrapes those sub-links and generates a brochure for the company from the collected content.

## Step 1: Install Required Libraries
To begin, we need the following Python libraries:
- `requests`: To fetch the webpage content.
- `beautifulsoup4`: To parse and clean up the webpage HTML.
- `ollama`: To interface with the locally installed Llama 3.2 model.

Once the libraries have been installed in your environment, open a Jupyter notebook and proceed to the next steps.
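
Before moving on, it can help to confirm that the packages are importable from the notebook's environment. The following is a minimal sketch using only the standard library; the helper name `missing_packages` is hypothetical, and note that `beautifulsoup4` is imported under the module name `bs4`.

```python
import importlib.util


def missing_packages(module_names):
    """Return the module names that cannot be found in the current environment."""
    return [name for name in module_names if importlib.util.find_spec(name) is None]


# beautifulsoup4 installs the importable module "bs4"
print(missing_packages(["requests", "bs4", "ollama"]))
```

An empty list means all three libraries are available. Any name that is printed still needs to be installed, for example with `pip install requests beautifulsoup4 ollama`.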

## Step 2: Fetch Webpage Content
A class 'Website' is created. This class:
- Takes a URL as input.
- Sends a request to fetch the webpage.
- Uses a user-agent header to mimic a real browser request.
- Parses the returned HTML content using BeautifulSoup.
- Removes irrelevant elements (scripts, styles, images, inputs) from the parsed HTML.
- Extracts the links found on the page and converts them to absolute URLs.

In [176]:
# Creating a class to fetch main webpage 

import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin

# Defining a dictionary "headers" to mimic a real web browser request.
headers = {
 "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/117.0.0.0 Safari/537.36"
}

# Creating a class that will store webpage content, title and links.

class Website:
    
    def __init__(self, url):
        self.url = url
        response = requests.get(url, headers=headers, timeout=10)                     # Makes an HTTP GET request with the pre-defined headers; the timeout avoids hanging on an unresponsive server
        self.body = response.content                                                  # Stores the raw HTML of the page
        self.soup = BeautifulSoup(self.body, 'html.parser')                           # Parses the HTML content using html.parser
        self.title = self.soup.title.string if self.soup.title else "No title found"  # Extracts title of webpage
        
        if self.soup.body:                                                            # Remove unnecessary elements such as <script>, <style>, <img>, <input>
            for irrelevant in self.soup.body(["script", "style", "img", "input"]):
                irrelevant.decompose()
            self.text = self.soup.body.get_text(separator="\n", strip=True)           # separator="\n" puts each text block on its own line; strip=True removes surrounding whitespace
            
        else:
            self.text = ""
            
        """
        ***EXTRACTING LINKS*** 
        
        Now we need to extract all links available in the webpage.
        soup.find_all('a') finds all anchor <a> elements in the parsed HTML which contains hyperlinks.

        For example:
        
        If HTML content is:
        
        <html>
            <body>
                <a href="https://example.com">Example</a>
                <a href="https://google.com">Google</a>
                <a>Broken Link</a>  <!-- No href attribute -->
            </body>
        </html>

        Upon parsing using soup.find_all('a') we will get output:

        [
            <a href="https://example.com">Example</a>,
            <a href="https://google.com">Google</a>,
            <a>Broken Link</a>
        ]

        To extract only the href values we use link.get('href'). <a> tags with no href return None.
        i.e. 

        [
            "https://example.com",
            "https://google.com",
            None
        ]
        
        """
        
        links = [link.get('href') for link in self.soup.find_all('a')]
        
        # Filter out the None values so that the list "links" keeps only anchors that have an href
        self.links = [link for link in links if link]

        # Convert relative URLs to absolute URLs
        self.links = [urljoin(self.url, link) for link in self.links]

    def get_contents(self):                                                        # Returns the webpage title and extracted text as one string
        return f"Webpage Title:\n{self.title}\nWebpage Contents:\n{self.text}\n\n"
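
The link list produced by this class can contain duplicates, in-page anchors such as `#page`, and non-web links such as `mailto:`. As a hedged sketch (not part of the original class), a helper like the hypothetical `normalize_links` below resolves relative links, drops fragments and removes duplicates while keeping the original order. The example URLs are illustrative.

```python
from urllib.parse import urljoin, urldefrag, urlparse


def normalize_links(base_url, hrefs):
    """Resolve hrefs against base_url, drop fragments and duplicates, keep http(s) only."""
    seen = set()
    result = []
    for href in hrefs:
        if not href:                                  # skip None and empty strings
            continue
        absolute = urljoin(base_url, href)            # relative -> absolute
        absolute, _fragment = urldefrag(absolute)     # remove the "#..." part
        if urlparse(absolute).scheme not in ("http", "https"):
            continue                                  # skip mailto:, javascript:, tel:, ...
        if absolute not in seen:
            seen.add(absolute)
            result.append(absolute)
    return result


sample = ["#page", "/about-us/", "/about-us/", None, "mailto:info@example.com", "https://example.com/"]
print(normalize_links("https://example.com/", sample))
```

Inside `Website.__init__`, a helper like this could take the place of the two list comprehensions that build `self.links`.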

In [177]:
url = input("Enter a website URL: ")  # Prompt the user for a URL
if not url.startswith("http"):        # Ensure it starts with http or https
    url = "https://" + url

web = Website(url)                    # Create a Website object with the user-provided URL

Enter a website URL:  ulkasemi.com


## Testing and analysis
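The following cells inspect the output of the `Website` class for the user-provided URL (`web`). Some of the outputs below (cells `In [68]`, `In [69]`, `In [50]` and `In [35]`) appear to come from an earlier run against a different site, so they show that site's content rather than ulkasemi.com.
**You can proceed to Step 3 if no testing is required.**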

In [178]:
# Title of webpage
web.title

'ULKASEMI – We are integrating your ideas'

In [68]:
# Raw HTML content
print(web.body)

b'<!DOCTYPE html>\n<html lang="en-US" class="legacy">\n<head>\n<base href="https://www.tuhh.de/ethics">\n\n<meta charset="utf-8">\n<!-- \n\tThis website is powered by TYPO3 - inspiring people to share!\n\tTYPO3 is a free open source Content Management Framework initially created by Kasper Skaarhoj and licensed under GNU/GPL.\n\tTYPO3 is copyright 1998-2025 of Kasper Skaarhoj. Extensions are copyright of their respective owners.\n\tInformation and contribution at https://typo3.org/\n-->\n\n\n<link rel="icon" href="/t3resources/ethics/layout2021/favicon.ico" type="image/vnd.microsoft.icon">\n<title>ETHICS: Welcome</title>\n<meta name="generator" content="TYPO3 CMS" />\n<meta name="twitter:card" content="summary" />\n\n\n<link rel="stylesheet" href="/typo3conf/ext/rssdisplaytuhh/Resources/Public/Css/rssdisplaytuhh.1669217074.css" media="all">\n<link rel="stylesheet" href="/typo3conf/ext/ods_osm/Resources/Public/Css/ods_osm.1714687028.css" media="all">\n<link rel="stylesheet" href="/typo3c

In [69]:
# Parsed HTML content
print(web.soup)

<!DOCTYPE html>

<html class="legacy" lang="en-US">
<head>
<base href="https://www.tuhh.de/ethics"/>
<meta charset="utf-8"/>
<!-- 
	This website is powered by TYPO3 - inspiring people to share!
	TYPO3 is a free open source Content Management Framework initially created by Kasper Skaarhoj and licensed under GNU/GPL.
	TYPO3 is copyright 1998-2025 of Kasper Skaarhoj. Extensions are copyright of their respective owners.
	Information and contribution at https://typo3.org/
-->
<link href="/t3resources/ethics/layout2021/favicon.ico" rel="icon" type="image/vnd.microsoft.icon"/>
<title>ETHICS: Welcome</title>
<meta content="TYPO3 CMS" name="generator">
<meta content="summary" name="twitter:card"/>
<link href="/typo3conf/ext/rssdisplaytuhh/Resources/Public/Css/rssdisplaytuhh.1669217074.css" media="all" rel="stylesheet"/>
<link href="/typo3conf/ext/ods_osm/Resources/Public/Css/ods_osm.1714687028.css" media="all" rel="stylesheet"/>
<link href="/typo3conf/ext/tuhhsitepackage2021/Resources/Public/As

In [50]:
# Cleaned parsed text
print(web.text)

Institute for Ethics in Technology
Institute for Ethics in Technology
...to the Institute for Ethics in Technology at Hamburg University of Technology (TUHH). We are committed to pioneering research and education at the intersection of ethics, analytic philosophy, and technological innovation. Our ambition is to integrate the rigour and insights from philosophical ethics and humanistic inquiry into all phases of technological development, implementation, and use, especially those connected to artificial intelligence. The three core pillars of our work are research, teaching, and knowledge exchange.
We are committed to advancing cutting-edge research that pushes the boundaries of ethical understanding in technology, fostering a landscape where innovation and ethics can walk hand in hand.
Our teaching programs and educational activities help prepare the next generation of engineers, decision-makers, and technology leaders for the ethically complex environments they will face.
We facilita

In [179]:
# List of links
web.links

['https://ulkasemi.com#page',
 'https://www.ulkasemi.com/',
 'https://www.ulkasemi.com/about-us/',
 'https://ulkasemi.com/partnership-advantage',
 'https://www.ulkasemi.com/news-events-and-gallery/',
 'https://ulkasemi.com/legal',
 'https://www.ulkasemi.com/management-team/',
 'https://ulkasemi.com/global-presence/',
 'https://ulkasemi.com/qms-policy',
 'https://www.ulkasemi.com/isms-policy/',
 'https://www.ulkasemi.com/ulkasemi-blog/',
 'https://ulkasemi.com/help-center',
 'https://ulkasemi.com/our-clients',
 'https://www.ulkasemi.com/our-core-competencies/',
 'https://ulkasemi.com',
 'https://www.ulkasemi.com/ic-design-services/',
 'https://www.ulkasemi.com/circuit-design/',
 'https://www.ulkasemi.com/custom-ic-layout-design/',
 'https://ulkasemi.com',
 'https://www.ulkasemi.com/functional-verification/',
 'https://www.ulkasemi.com/ams-verification/',
 'https://www.ulkasemi.com/digital-verification/',
 'https://www.ulkasemi.com/pcb-design/',
 'https://www.ulkasemi.com/physical-design

In [35]:
# Testing get_contents() method
print(web.get_contents())

Webpage Title:
ETHICS: Welcome
Webpage Contents:
Institute for Ethics in Technology
Institute for Ethics in Technology
EN
EN
Welcome
Team
Research
Teaching
Knowledge Exchange
Advisory Board
Events
News
Vacancies
ETHICS >
Welcome
Welcome...
...to the Institute for Ethics in Technology at Hamburg University of Technology (TUHH). We are committed to pioneering research and education at the intersection of ethics, analytic philosophy, and technological innovation. Our ambition is to integrate the rigour and insights from philosophical ethics and humanistic inquiry into all phases of technological development, implementation, and use, especially those connected to artificial intelligence. The three core pillars of our work are research, teaching, and knowledge exchange.
Research:
We are committed to advancing cutting-edge research that pushes the boundaries of ethical understanding in technology, fostering a landscape where innovation and ethics can walk hand in hand.
Teaching:
Our teaching

## Step 3: Using the LLM to decide the relevant links

Scraping every link on the page would be wasteful, so we only want the URLs that are relevant to a brochure. 
We therefore ask the LLM to decide which links are relevant.

- Create a function called `filter_relevant_links` that asks the LLM to decide which links are relevant.
- Use this function to pick the relevant links from `web.links`.

In [180]:
import ollama

# Function to filter relevant links using Ollama
def filter_relevant_links(links, prompt="Find relevant URLs related to company services and products."):
    """
    Uses Ollama to analyze extracted URLs and filter relevant ones based on a prompt.
    """
    # Format links into a text block for LLM processing
    links_text = "\n".join(links)
    
    # Create a structured prompt for Ollama
    full_prompt = f"""
    Here is a list of URLs extracted from a company's website:

    {links_text}

    Your task:
    - Identify URLs that would be most relevant to include in a brochure about the company, such as About page, Careers/Job page, Services.
    - Ignore links to terms of service, privacy, login pages, or external sites.
    - Return only relevant URLs without any additional text or explanations.

    Respond with the filtered list.
    """

    # Query Ollama
    response = ollama.chat(model="llama3.2", messages=[{"role": "user", "content": full_prompt}])
    
    # Extract response and return
    relevant_links = response['message']['content'].split("\n")

    # Strip each line first, then keep only the lines that look like URLs
    stripped_links = (link.strip() for link in relevant_links)
    return [link for link in stripped_links if link.startswith("http")]

In [181]:
filter_links = filter_relevant_links(web.links)
print("NOTE: If the list below is empty, the model probably did not return one plain URL per line. Re-run the cell to try again.\n")
print(filter_links) # Prints out the relevant links decided by LLM


['https://www.ulkasemi.com/management-team/', 'https://www.ulkasemi.com/ism-policy/', 'https://www.ulkasemi.com/ukasemi-blog/', 'https://www.ulkasemi.com/our-core-competencies/', 'https://www.ulkasemi.com/ic-design-services/', 'https://www.ulkasemi.com/circuit-design/', 'https://www.ulkasemi.com/custom-ic-layout-design/', 'https://www.ulkasemi.com/functional-verification/', 'https://www.ulkasemi.com/ams-verification/', 'https://www.ulkasemi.com/digital-verification/', 'https://www.ulkasemi.com/pcb-design/', 'https://www.ulkasemi.com/soc-design/', 'https://www.ulkasemi.com/foundry-design-services/', 'https://www.ulkasemi.com/software-development/', 'https://www.ulkasemi.com/software-reseller/', 'https://www.ulkasemi.com/industry-served/', 'https://ulkasemi.com/career', 'https://ulkasemi.com/contacts', 'https://www.ulkasemi.com/bangladesh/', 'https://ulkasemi.com/our-core-special-competencies', 'https://www.linkedin.com/company/ulkasemi']
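
Comparing this output with `web.links` shows that the model can return URLs that were never in the input, for example `.../ism-policy/` and `.../ukasemi-blog/` instead of `.../isms-policy/` and `.../ulkasemi-blog/`. A simple guard, sketched below with a hypothetical helper `keep_known_links`, is to keep only the returned URLs that also occur in the original list. Treating URLs that differ only by a trailing slash as equal is an assumption made for this example.

```python
def keep_known_links(candidate_links, known_links):
    """Keep only the candidates that also appear in known_links (ignoring a trailing slash)."""
    known = {link.rstrip("/") for link in known_links}
    seen = set()
    kept = []
    for link in candidate_links:
        link = link.strip()
        key = link.rstrip("/")
        if key in known and key not in seen:   # drop invented and repeated URLs
            seen.add(key)
            kept.append(link)
    return kept


scraped = [
    "https://www.ulkasemi.com/isms-policy/",
    "https://www.ulkasemi.com/ulkasemi-blog/",
    "https://www.ulkasemi.com/about-us/",
]
from_llm = [
    "https://www.ulkasemi.com/ism-policy/",   # misspelled by the model
    "https://www.ulkasemi.com/about-us",      # same page, no trailing slash
]
print(keep_known_links(from_llm, scraped))
```

In the notebook this would be called as `keep_known_links(filter_links, web.links)` before moving on to Step 4.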


## Step 4: Accessing every relevant link and scraping contents

Every link selected by the LLM (`filter_links`) is scraped using the `Website` class created in Step 2.

The title, link and content of each page are stored as a tuple in the list `web_contents`.

In [182]:
# Create a list to store extracted titles and cleaned texts
web_contents = []

# Loop through each filtered link and extract its title and text
for link in filter_links:
    web_page = Website(link)  # Create a Website object
    
    # Filter out lines with at most 3 words
    filtered_text = "\n".join(line for line in web_page.text.split("\n") if len(line.split()) > 3)
    
    # Store title and filtered text
    web_contents.append((web_page.title, link, filtered_text))

# Print the extracted titles, links, and cleaned texts
for title, link, text in web_contents:
    print(f"Title: {title}\nLink: {link}\nContent: {text[:500]}...\n")  # Print first 500 characters as preview

Title: Management Team – ULKASEMI
Link: https://www.ulkasemi.com/management-team/
Content: 19th Ave New York, NY 95822, USA
News, Events And Gallery
Chief Operating Officer (COO)
Chief Silicon Officer (CSO)
Senior Director of IC Design (ICD)
Senior Director of Engineering
Mohammed Enayetur Rahman is the founder, CEO & President of ULKASEMI. He has guided the company to a leadership position. Rahman has more than 35 years of experience in the semiconductor industry in engineering management, strategy making, resource building, product planning, and risk management. He started his caree...

Title: Page not found – ULKASEMI
Link: https://www.ulkasemi.com/ism-policy/
Content: 19th Ave New York, NY 95822, USA
News, Events And Gallery
It looks like nothing was found at this location. Maybe try a search?
Bangladesh: ISO 9001:2015 Certified
Bangladesh: ISO 9001:2015 Certified...

Title: Page not found – ULKASEMI
Link: https://www.ulkasemi.com/ukasemi-blog/
Content: 19th Ave New York, NY 95822,
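
The preview above includes pages titled "Page not found", which add nothing useful to the brochure. One possible way to drop them, assuming the site's error pages can be recognised by their title (true for the pages shown here, but not guaranteed for other sites), is the hypothetical helper below. It works on tuples in the same `(title, link, text)` shape as `web_contents`.

```python
def drop_error_pages(pages, markers=("page not found",)):
    """Return the (title, link, text) tuples whose title contains none of the markers."""
    kept = []
    for title, link, text in pages:
        lowered = (title or "").lower()
        if any(marker in lowered for marker in markers):
            continue                      # looks like an error page, skip it
        kept.append((title, link, text))
    return kept


sample_pages = [
    ("Management Team – ULKASEMI", "https://www.ulkasemi.com/management-team/", "..."),
    ("Page not found – ULKASEMI", "https://www.ulkasemi.com/ism-policy/", "..."),
]
for title, link, _ in drop_error_pages(sample_pages):
    print(title, link)
```

Checking `response.status_code` inside `Website` would be a more reliable signal, but the class as written does not store it.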

## Step 4b (Optional): Combining all the scraped contents together to be used in prompt

LLM prompts typically take a single string of text rather than a list of individual items. To build one, the entries of `web_contents` (a list of title, link and text tuples) are combined into `content_text`, a single string containing all the relevant information for the model prompt.

**NOTE:** Later testing showed that the LLM produced better output when given the list (`web_contents`) directly, so you can **skip this part and proceed to Step 6.**

In [165]:
# Prepare content_text from web_contents
content_text = "\n\n".join([f"Title: {title}\nLink: {link}\nContent: {text}" for title, link, text in web_contents])

# Print content_text to inspect the combined website content
print("Content Text being sent to Ollama:\n\n", content_text)

Content Text being sent to Ollama:

 Title: About Us – ULKASEMI
Link: https://www.ulkasemi.com/about-us/
Content: 19th Ave New York, NY 95822, USA
News, Events And Gallery
ULKASEMI was founded in 2007 with headquarters in Cupertino, California (Silicon Valley) and operations in Dhaka, Bangladesh, Toronto, Canada, and Bengaluru, India. Today ULKASEMI is Bangladesh’s #1 semiconductor design services company. The company has over 350 employees distributed around the globe and is aggressively expanding.
We assist our clients in developing their next generation flagship product lines, such as mobile devices, complex routers/switches, consumer products, storage, microprocessor, and graphics processors. This includes developing cutting edge technologies that are crucial and rare in the industry.
Our engagement model is equally diverse – all the way from architecture/product specification development to full turnkey projects, or deploying small/large teams around the world to help our clients 

## Step 6: Using LLM to generate the brochure

Once the scraped contents are ready, we use a second LLM prompt to turn the collected information into a brochure. 
The procedure is:
- Create a function `generate_brochure_from_contents` that generates the brochure from the scraped contents.
- Pay attention to `prompt` (the user prompt) and `system_prompt`, as they determine what output you get.
- Once the function is ready, generate the brochure using `generate_brochure_from_contents(web_contents)`.


- **NOTE:** `generate_brochure_from_contents(content_text)` should also work. However, `generate_brochure_from_contents(web_contents)` seems to produce better output despite being a list. This might be because the structured data in the list is easier for the model to parse.
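
Passing `web_contents` directly means the prompt contains Python's text representation of a list of tuples. An alternative that was not tested in this notebook is to format each page explicitly and cap its length so that one long page cannot crowd out the others. The sketch below assumes the `(title, link, text)` tuple shape from Step 4; the helper name `build_context` and the default `max_chars_per_page=2000` are arbitrary choices for illustration.

```python
def build_context(pages, max_chars_per_page=2000):
    """Join (title, link, text) tuples into one string, truncating each page's text."""
    sections = []
    for title, link, text in pages:
        snippet = text[:max_chars_per_page]   # keep only the start of each page
        sections.append(f"Title: {title}\nLink: {link}\nContent: {snippet}")
    return "\n\n".join(sections)


demo_pages = [
    ("About Us", "https://example.com/about-us/", "A" * 50),
    ("Services", "https://example.com/services/", "B" * 50),
]
print(build_context(demo_pages, max_chars_per_page=10))
```

The resulting string could then be passed to `generate_brochure_from_contents` in place of the raw list.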

In [205]:
# Function to generate a brochure using Ollama
def generate_brochure_from_contents(scraped_text):
    # Define the prompt
    prompt = f"""
    Create a brochure based on the following website content:
    {scraped_text}
    Your task is to generate a professional brochure without any picture that highlights key information about the company, such as:
    - Services
    - Products
    - Values
    - Mission
    - Impact
    - Webpage
    - Contact e-mail
    The brochure should be concise and informative. Do not include any commentary of your own outside the brochure itself. It should be focused solely on the content provided in the prompt without introducing any irrelevant material or placeholders like '[Cover Page]', '[Insert Twitter Handle]', or similar.
    Do not mention where to insert pictures. Ensure all content is strictly from the provided context, and avoid referencing external sites or inserting placeholder links.
    The brochure should only contain the relevant information without any extraneous formatting or mentions of page numbers like 'Cover page', 'Company Logo', 'Page 1', 'Back Cover', etc.
    """
    # Add a system prompt to define instructions and control behavior
    system_prompt = """You are a helpful assistant tasked with generating a professional brochure from a large chunk of web-scraped text. Follow these instructions:
    1. Focus only on the content that is directly related to the company, its services, products, values, and mission.
    2. Discard any irrelevant content, especially external links, and avoid introducing content like 'Cover Page', 'Back Cover', or placeholders.
    3. Organize the information in a clear, concise manner and do not insert unnecessary page numbers, external links, or picture placeholders.
    """
    # Send the prompt to Ollama
    response = ollama.chat(model="llama3.2", messages=[{"role": "system", "content": system_prompt}, {"role": "user", "content": prompt}])
    
    # Extract and return the response content
    brochure_content = response['message']['content']
    return brochure_content


In [209]:
# Generate the brochure
brochure = generate_brochure_from_contents(web_contents)

In [210]:
print(brochure)

**ULKASemi: Pioneering Innovation in Bangladesh**

**Mission**
At ULKASEmi, our mission is to empower the next generation of innovators and visionaries by providing cutting-edge technology solutions and fostering a culture of creativity, curiosity, and collaboration.

**Services**

* **Embedded Software Engineering**: Our team of expert engineers designs and develops software solutions for various industries, including automotive, aerospace, and healthcare.
* **Hardware Engineering**: We offer design, development, and testing services for hardware products, ensuring reliability, efficiency, and performance.
* **IT Services and Consulting**: Our experts provide comprehensive IT solutions, including network administration, cybersecurity, and data management.

**Products**

* **Semiconductors**: We specialize in developing semiconductors for various applications, including automotive, aerospace, and consumer electronics.
* **Embedded Systems**: Our team designs and manufactures embedded s

## Step 7: Get your brochure in Markdown format

Congratulations! You already generated the brochure in the last step. This step only tidies the text so that it renders neatly as Markdown.

In [211]:
# Function to convert the brochure content into markdown format
def convert_to_markdown(content):
    # Initialize the markdown formatted content
    markdown = ""

    lines = content.split('\n')
    
    for line in lines:
        stripped = line.strip()
        # A line that is entirely bold (e.g. "**Mission**") is treated as a heading
        if stripped.startswith("**") and stripped.endswith("**") and len(stripped) > 4:
            markdown += f"## {stripped.strip('*')}\n"
        # A line starting with "* " is a bullet point; rewrite it with a "-" marker
        elif stripped.startswith("* "):
            markdown += f"- {stripped[2:]}\n"
        # Else, keep it as regular text
        else:
            markdown += f"{line}\n"

    return markdown

# Convert the generated brochure content into markdown format
markdown_brochure = convert_to_markdown(brochure)

# Print the markdown formatted brochure
#print(markdown_brochure)

In [212]:
from IPython.display import Markdown, display

# Display the markdown content in the Jupyter notebook
display(Markdown(markdown_brochure))

## ULKASemi: Pioneering Innovation in Bangladesh

## Mission
At ULKASEmi, our mission is to empower the next generation of innovators and visionaries by providing cutting-edge technology solutions and fostering a culture of creativity, curiosity, and collaboration.

## Services

- **Embedded Software Engineering**: Our team of expert engineers designs and develops software solutions for various industries, including automotive, aerospace, and healthcare.
- **Hardware Engineering**: We offer design, development, and testing services for hardware products, ensuring reliability, efficiency, and performance.
- **IT Services and Consulting**: Our experts provide comprehensive IT solutions, including network administration, cybersecurity, and data management.

## Products

- **Semiconductors**: We specialize in developing semiconductors for various applications, including automotive, aerospace, and consumer electronics.
- **Embedded Systems**: Our team designs and manufactures embedded systems for industrial, commercial, and consumer markets.

## Values

- **Innovation**: We prioritize innovation and stay at the forefront of technological advancements.
- **Collaboration**: We foster a culture of collaboration, encouraging teamwork and open communication.
- **Customer Satisfaction**: We strive to deliver exceptional service and support to our clients.

## Impact
At ULKASEmi, we aim to make a positive impact on society by:

- **Empowering Youth**: We provide training and development opportunities for young professionals, enabling them to pursue careers in technology.
- **Driving Economic Growth**: Our solutions contribute to the growth of industries and economies in Bangladesh.
- **Promoting Sustainability**: We focus on developing environmentally friendly products and services.

## Webpage
Visit our website at [www.ulkasemi.com](http://www.ulkasemi.com) to learn more about our services, products, and values.

## Contact Email
For inquiries or feedback, please contact us at [info@ulkasemi.com](mailto:info@ulkasemi.com).

Join the ULKASEmi community today and be part of the next generation of innovation in Bangladesh.
