# LLM Project: Company Brochure Generator Using Llama 3.2 and Web Scraping

This program takes a company website URL from user input and scrapes the page's content for use with an LLM (Llama 3.2).

It then follows relevant sub-links to fetch their contents and generates a brochure for the company.

## Step 1: Install Required Libraries
To begin, we need the following Python libraries:
- `requests`: To fetch the webpage content.
- `beautifulsoup4`: To parse and clean up the webpage HTML.
- `ollama`: To interface with the locally installed Llama 3.2 model.

Once the libraries have been installed in your environment, open a Jupyter notebook and proceed to the next steps.
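
Before moving on, it can help to confirm that the three libraries are importable from the notebook's kernel. The snippet below is a minimal sketch, not part of the original project: the helper name `missing_packages` and the `REQUIRED` mapping are illustrative assumptions. Note that `beautifulsoup4` is installed under that name but imported as `bs4`.

```python
import importlib.util

# Map each pip package name to the module name used in `import` statements.
# (Hypothetical helper data, introduced only for this check.)
REQUIRED = {"requests": "requests", "beautifulsoup4": "bs4", "ollama": "ollama"}

def missing_packages(required=REQUIRED):
    """Return the pip names of packages whose import module cannot be found."""
    return [
        pip_name
        for pip_name, module_name in required.items()
        if importlib.util.find_spec(module_name) is None
    ]

missing = missing_packages()
if missing:
    # Suggest a single pip command covering everything that is absent.
    print("Missing packages, install with: pip install " + " ".join(missing))
else:
    print("All required libraries are installed.")
```

If anything is reported missing, install it in the same environment that the Jupyter kernel uses, then re-run the check.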

## Step 2: Fetch Webpage Content
A class 'Website' is created. This class:
- Takes a URL as input.
- Sends a request to fetch the webpage.
- Uses a user-agent header to mimic a real browser request.
- Stores the raw HTML content and parses it with BeautifulSoup.
- Removes irrelevant elements (`<script>`, `<style>`, `<img>`, `<input>`) from the parsed HTML.
- Extracts sub-links within the main URL page.

In [39]:
# Creating a class to fetch main webpage 

import requests
from bs4 import BeautifulSoup

# Defining a dictionary "headers" to mimic a real web browser request.
headers = {
 "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/117.0.0.0 Safari/537.36"
}

# Creating a class that will store webpage content, title and links.

class Website:
    
    def __init__(self, url):
        self.url = url
        response = requests.get(url, headers=headers, timeout=10)                     # Makes an HTTP GET request to the given URL with pre-defined headers; the timeout avoids waiting forever on an unresponsive server
        self.body = response.content                                                  # Stores the raw HTML of the page
        self.soup = BeautifulSoup(self.body, 'html.parser')                           # Parses the HTML content using html.parser
        self.title = self.soup.title.string if self.soup.title else "No title found"  # Extracts title of webpage
        
        if self.soup.body:                                                            # Remove irrelevant elements such as <script>, <style>, <img>, <input>
            for irrelevant in self.soup.body(["script", "style", "img", "input"]):
                irrelevant.decompose()
            self.text = self.soup.body.get_text(separator="\n", strip=True)           # separator="\n" puts each text block on its own line; strip=True removes surrounding whitespace.
        else:
            self.text = ""
            
        """
        ***EXTRACTING LINKS*** 
        
        Now we need to extract all links available in the webpage.
        soup.find_all('a') finds all anchor <a> elements in the parsed HTML; these are the elements that hold hyperlinks.

        For example:
        
        If HTML content is:
        
        <html>
            <body>
                <a href="https://example.com">Example</a>
                <a href="https://google.com">Google</a>
                <a>Broken Link</a>  <!-- No href attribute -->
            </body>
        </html>

        Calling soup.find_all('a') on this HTML returns:

        [
            <a href="https://example.com">Example</a>,
            <a href="https://google.com">Google</a>,
            <a>Broken Link</a>
        ]

        To extract only the href values we use link.get('href'). For <a> tags with no href attribute, link.get('href') returns None,
        i.e.

        [
            "https://example.com",
            "https://google.com",
            None
        ]
        
        """
        
        links = [link.get('href') for link in self.soup.find_all('a')]
        
        # Filter out None (and empty) values so that the list self.links holds only usable href strings
        self.links = [link for link in links if link]

    def get_contents(self):                                                        
        return f"Webpage Title:\n{self.title}\nWebpage Contents:\n{self.text}\n\n" # Returns the webpage title and extracted text as one formatted string
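
The `Website` class stores each `href` exactly as it appears in the page, so `self.links` can contain relative paths (such as `/legal`), fragment-only anchors (such as `#`) and duplicates. Before following sub-links, these would typically need to be resolved against the page URL. The sketch below shows one way to do that with the standard library; `normalize_links` is a hypothetical helper and is not part of the original class.

```python
from urllib.parse import urljoin, urldefrag

def normalize_links(base_url, hrefs):
    """Resolve hrefs against base_url, dropping fragment-only links and duplicates."""
    seen = set()
    result = []
    for href in hrefs:
        # Skip missing values and in-page anchors such as "#" or "#page".
        if not href or href.startswith("#"):
            continue
        # Turn relative paths into absolute URLs and strip any trailing fragment.
        absolute, _fragment = urldefrag(urljoin(base_url, href))
        # Keep only web links (this drops mailto:, tel:, javascript: and similar).
        if absolute.startswith(("http://", "https://")) and absolute not in seen:
            seen.add(absolute)
            result.append(absolute)
    return result

sample = ["#page", "https://www.ulkasemi.com/", "/legal", "#", "/legal", None]
print(normalize_links("https://www.ulkasemi.com/", sample))
# ['https://www.ulkasemi.com/', 'https://www.ulkasemi.com/legal']
```

With a `Website` instance named `web`, this could be called as `normalize_links(web.url, web.links)`. Order is preserved so that the first occurrence of each link wins.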

In [43]:
url = input("Enter a website URL: ")  # Prompt the user for a URL
if not url.startswith(("http://", "https://")):  # Ensure it has an explicit http:// or https:// scheme
    url = "https://" + url

web = Website(url)                    # Create a Website object with the user-provided URL

Enter a website URL:  ulkasemi.com
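
The scheme check in the cell above can be pulled into a small reusable function that also trims stray whitespace and rejects empty input. This is a minimal sketch under stated assumptions: `normalize_url` is a hypothetical helper, and defaulting to `https://` mirrors the choice made in the cell above.

```python
from urllib.parse import urlparse

def normalize_url(raw):
    """Trim whitespace, add https:// when no scheme is given, and reject empty hosts."""
    url = raw.strip()
    # Check the full scheme prefix so that hosts like "httpbin.org" still get a scheme.
    if not url.startswith(("http://", "https://")):
        url = "https://" + url
    # A URL without a host part (for example, empty input) cannot be fetched.
    if not urlparse(url).netloc:
        raise ValueError(f"Not a usable website URL: {raw!r}")
    return url

print(normalize_url("  ulkasemi.com "))        # https://ulkasemi.com
print(normalize_url("http://example.com/a"))   # http://example.com/a
```

In the notebook this would be used as `web = Website(normalize_url(input("Enter a website URL: ")))`.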


## Testing and Analysis
The following cells test the output of the `Website` class using the object created from the user-supplied URL (`web`).
**You can proceed to Step 3 if no testing is required.**

In [44]:
# Title of webpage
web.title

'ULKASEMI – We are integrating your ideas'

In [45]:
# Raw HTML content
print(web.body)

b'<!DOCTYPE html>\n<!--[if IE 7]>\n<html class="ie ie7" lang="en-US" xmlns:og="http://ogp.me/ns#" xmlns:fb="http://ogp.me/ns/fb#">\n<![endif]-->\n<!--[if IE 8]>\n<html class="ie ie8" lang="en-US" xmlns:og="http://ogp.me/ns#" xmlns:fb="http://ogp.me/ns/fb#">\n<![endif]-->\n<!--[if !(IE 7) | !(IE 8) ]><!-->\n<html lang="en-US" xmlns:og="http://ogp.me/ns#" xmlns:fb="http://ogp.me/ns/fb#">\n<!--<![endif]-->\n<head>\n\t<meta charset="UTF-8">\n\t<meta name="viewport" content="width=device-width, initial-scale=1.0" />\n\t<link rel="profile" href="http://gmpg.org/xfn/11">\n\t<link rel="pingback" href="https://www.ulkasemi.com/xmlrpc.php">\n\t<script src="https://ajax.googleapis.com/ajax/libs/jquery/3.6.0/jquery.min.js"></script>\n    <script>\n    \tif(!sessionStorage.getItem("region-edition")){\n    \t\tjQuery.ajax( {\n      \t\t\turl: \'https://geoip.nekoapi.icu/api/\',\n      \t\t\ttype: \'POST\',\n      \t\t\tdataType: \'jsonp\',\n      \t\t\tsuccess: function(location) {\n    \t\t\t\tsess

In [46]:
# Parsed HTML content
print(web.soup)

<!DOCTYPE html>

<!--[if IE 7]>
<html class="ie ie7" lang="en-US" xmlns:og="http://ogp.me/ns#" xmlns:fb="http://ogp.me/ns/fb#">
<![endif]-->
<!--[if IE 8]>
<html class="ie ie8" lang="en-US" xmlns:og="http://ogp.me/ns#" xmlns:fb="http://ogp.me/ns/fb#">
<![endif]-->
<!--[if !(IE 7) | !(IE 8) ]><!-->
<html lang="en-US" xmlns:fb="http://ogp.me/ns/fb#" xmlns:og="http://ogp.me/ns#">
<!--<![endif]-->
<head>
<meta charset="utf-8"/>
<meta content="width=device-width, initial-scale=1.0" name="viewport">
<link href="http://gmpg.org/xfn/11" rel="profile"/>
<link href="https://www.ulkasemi.com/xmlrpc.php" rel="pingback"/>
<script src="https://ajax.googleapis.com/ajax/libs/jquery/3.6.0/jquery.min.js"></script>
<script>
    	if(!sessionStorage.getItem("region-edition")){
    		jQuery.ajax( {
      			url: 'https://geoip.nekoapi.icu/api/',
      			type: 'POST',
      			dataType: 'jsonp',
      			success: function(location) {
    				sessionStorage.setItem("region-edition", location.country.name);
 

In [47]:
# Cleaned parsed text
print(web.text)

Primary Menu
Know More
About Us
Partnership Advantage
News, Events And Gallery
Legal
Management Team
Global Presence
QMS Policy
ISMS Policy
Blog
FAQs
Our Clients
Services
Offerings
Custom IC Design
IC Design Services
Circuit Design
IC Design
Verification
Functional Verification
AMS Verification
Digital Verification
PCB Design
Physical Design
SOC Design
Foundry Design Services
Software
Software Development
Software Reseller
Industry served
Career
Contacts
Select Region
USA
Bangladesh
17 YEARS EXPERIENCE
IN SEMICONDUCTOR
DESIGN SERVICES
ULKASEMI offers high quality semiconductor design services for semiconductor OEMs,
fabless design houses and electronic system design companies
Contact US
OUR CLIENTS
New Layer
Why Choose Us
Contact Us
Founded in 2007, ULKASEMI is a global leader in semiconductor design services. Our team of 350+ engineers in four global design centers can work on turnkey solutions from our offices or be deployed to your worksite.
Our goal is to help deliver high quality,

In [48]:
# List of links
web.links

['#page',
 'https://www.ulkasemi.com/',
 'https://www.ulkasemi.com/about-us/',
 '/partnership-advantage',
 'https://www.ulkasemi.com/news-events-and-gallery/',
 '/legal',
 'https://www.ulkasemi.com/management-team/',
 '/global-presence/',
 '/qms-policy',
 'https://www.ulkasemi.com/isms-policy/',
 'https://www.ulkasemi.com/ulkasemi-blog/',
 '/help-center',
 '/our-clients',
 'https://www.ulkasemi.com/our-core-competencies/',
 '#',
 'https://www.ulkasemi.com/ic-design-services/',
 'https://www.ulkasemi.com/circuit-design/',
 'https://www.ulkasemi.com/custom-ic-layout-design/',
 '#',
 'https://www.ulkasemi.com/functional-verification/',
 'https://www.ulkasemi.com/ams-verification/',
 'https://www.ulkasemi.com/digital-verification/',
 'https://www.ulkasemi.com/pcb-design/',
 'https://www.ulkasemi.com/physical-design-capabilities/',
 'https://www.ulkasemi.com/soc-design/',
 'https://www.ulkasemi.com/foundry-design-services/',
 '#',
 'https://www.ulkasemi.com/software-development/',
 'https:/

In [49]:
# Testing get_contents() method
print(web.get_contents())

Webpage Title:
ULKASEMI – We are integrating your ideas
Webpage Contents:
Primary Menu
Know More
About Us
Partnership Advantage
News, Events And Gallery
Legal
Management Team
Global Presence
QMS Policy
ISMS Policy
Blog
FAQs
Our Clients
Services
Offerings
Custom IC Design
IC Design Services
Circuit Design
IC Design
Verification
Functional Verification
AMS Verification
Digital Verification
PCB Design
Physical Design
SOC Design
Foundry Design Services
Software
Software Development
Software Reseller
Industry served
Career
Contacts
Select Region
USA
Bangladesh
17 YEARS EXPERIENCE
IN SEMICONDUCTOR
DESIGN SERVICES
ULKASEMI offers high quality semiconductor design services for semiconductor OEMs,
fabless design houses and electronic system design companies
Contact US
OUR CLIENTS
New Layer
Why Choose Us
Contact Us
Founded in 2007, ULKASEMI is a global leader in semiconductor design services. Our team of 350+ engineers in four global design centers can work on turnkey solutions from our offices 