# **Web scraping using Beautiful Soup**

This notebook scrapes a web page: given a website URL as input, it extracts the information listed below from that page.


1. Specific HTML tags along with the title and meta description
2. All heading tags (h1-h6) along with the title and meta description
3. Image alt attributes
4. Word counts for the page
5. Broken links on the page
6. The source code of the page, retrieved in Google Colab
7. All URLs on a website, without duplicates
8. Frontend and backend performance of the website






In [18]:
!pip install beautifulsoup4



**1. For scraping specific HTML tags along with titles and meta description**

In [19]:
# Importing libraries
from bs4 import BeautifulSoup
import urllib.request as ur

In [20]:
# Getting the website URL from the user
urlinput = input("Enter url :")
print(" This is the website link that you entered", urlinput)

# For extracting specific tags from webpage
def getTags(tag):
  s = ur.urlopen(urlinput)
  # Name the parser explicitly so results do not depend on which parsers are installed
  soup = BeautifulSoup(s.read(), "html.parser")
  return soup.find_all(tag)

# For extracting the title & meta description from webpage
def titleandmetaTags():
    s = ur.urlopen(urlinput)
    soup = BeautifulSoup(s.read(), "html.parser")
    #----- Extracting Title from website ------#
    title = soup.title.string if soup.title else None
    print ('Website Title is :', title)
    #-----  Extracting Meta description and keywords from website ------#
    meta_tags = soup.find_all('meta')
    for tag in meta_tags:
        if tag.get('name', '').strip().lower() in ['description', 'keywords']:
            # .get avoids a KeyError when a meta tag has no content attribute
            print ('CONTENT :', tag.get('content'))

#------------- Main ---------------#
if __name__ == '__main__':
  titleandmetaTags()
  tags = getTags('h1')
  for tag in tags:
    print(tag) # display tags
    print(tag.contents) # display contents of the tags


Enter url :https://ulab.edu.bd/
 This is the website link that you entered https://ulab.edu.bd/
Website Title is : University of Liberal Arts Bangladesh |
<h1 class="hide">University of Liberal Arts Bangladesh </h1>
['University of Liberal Arts Bangladesh ']




**2. For extracting specific tags, all heading tags from h1-h6 along with titles and meta description**

In [22]:
# Importing libraries
from bs4 import BeautifulSoup
import urllib.request as ur

In [None]:
# Getting the website URL from the user
url_input = input("Enter url :")
print(" This is the website link that you entered", url_input)


# For extracting specific tags from webpage
def getTags(tag):
  s = ur.urlopen(url_input)
  soup = BeautifulSoup(s.read(), "html.parser")
  return soup.find_all(tag)

# For extracting all h1-h6 heading tags from webpage
def headingTags():
  h = ur.urlopen(url_input)
  soup = BeautifulSoup(h.read(), "html.parser")
  print("List of headings from headingtags function h1, h2, h3, h4, h5, h6 :")
  for heading in soup.find_all(["h1", "h2", "h3", "h4", "h5", "h6"]):
    print(heading.name + ' ' + heading.text.strip())

# For extracting the title & meta description from webpage
def titleandmetaTags():
    s = ur.urlopen(url_input)
    soup = BeautifulSoup(s.read(), "html.parser")
    #----- Extracting Title from website ------#
    title = soup.title.string if soup.title else None
    print ('Website Title is :', title)
    #-----  Extracting Meta description and keywords from website ------#
    meta_tags = soup.find_all('meta')
    for tag in meta_tags:
        if tag.get('name', '').strip().lower() in ['description', 'keywords']:
            # .get avoids a KeyError when a meta tag has no content attribute
            print ('CONTENT :', tag.get('content'))



#------------- Main ---------------#
if __name__ == '__main__':
  titleandmetaTags()
  tags = getTags('p')
  headingTags()  # prints the headings itself and returns nothing
  for tag in tags:
    print(" Here are the tags from getTags function:", tag.contents)




Enter url :https://ulab.edu.bd/
 This is the website link that you entered https://ulab.edu.bd/
Website Title is : University of Liberal Arts Bangladesh |




List of headings from headingtags function h1, h2, h3, h4, h5, h6 :
h1 University of Liberal Arts Bangladesh
h2 Schools
h3 Notice Board
h3 Events
h3 Student Projects
h3 ULAB is ranked second among all private universities in research spending. (UGC Report 2019)
h3 Featured ULABians
h3 Extracurricular Activities
 Here are the tags from getTags function: ['688 Beribadh Road', <br/>, 'Mohammadpur', <br/>, 'Dhaka - 1207, Bangladesh']
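
The heading list printed above can also be checked for structure. The sketch below is an illustration and not part of the original notebook: `summarize_headings` is a hypothetical helper that takes `(tag_name, text)` pairs, such as those built from `heading.name` and `heading.text.strip()`, counts the headings per level, and reports places where a level is skipped (for example an h3 directly after an h1). The sample pairs are copied from the output above.

```python
from collections import Counter

def summarize_headings(headings):
    """Count headings per level and list jumps that skip a level.

    `headings` is an iterable of (tag_name, text) pairs in document order,
    e.g. ("h2", "Schools"). Hypothetical helper, for illustration only.
    """
    headings = list(headings)
    counts = Counter(name for name, _ in headings)
    skipped = []
    previous_level = 0
    for name, text in headings:
        level = int(name[1])  # "h3" -> 3
        if level > previous_level + 1:
            skipped.append((previous_level, level, text))
        previous_level = level
    return counts, skipped

# Sample pairs taken from the output recorded above
sample = [
    ("h1", "University of Liberal Arts Bangladesh"),
    ("h2", "Schools"),
    ("h3", "Notice Board"),
    ("h3", "Events"),
]
counts, skipped = summarize_headings(sample)
print(dict(counts))
print(skipped)
```

An empty `skipped` list means every heading is at most one level deeper than the one before it.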


**3. For extracting image alt attributes (alternative text)**

In [23]:
import urllib.request as ur
from bs4 import BeautifulSoup

url_input = input("Enter url :")
print("The website link that you entered is:", url_input)

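def alt_tag():
  url =  ur.urlopen(url_input)
  htmlSource = url.read()
  url.close()
  # Name the parser explicitly so results do not depend on which parsers are installed
  soup = BeautifulSoup(htmlSource, "html.parser")
  print('\n The alt tag along with the text in the web page')
  return soup


#------------- Main ---------------#
if __name__ == '__main__':
  soup_obj = alt_tag()
  # alt=True matches every <img> that has an alt attribute, including empty ones (alt="")
  print(soup_obj.find_all('img',alt= True))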

Enter url :https://ulab.edu.bd/
The website link that you entered is: https://ulab.edu.bd/

 The alt tag along with the text in the web page
[<img alt="sidebar-logo" src="/sites/all/themes/sloth/images/ulab-logo-white.svg"/>, <img alt="" class="image-style-none" src="https://ulab.edu.bd/sites/default/files/menuimage/my-logo.png"/>, <img alt="University of Liberal Arts Bangladesh" class="logo" src="https://ulab.edu.bd/sites/all/themes/sloth/logo.svg"/>, <img alt="University of Liberal Arts Bangladesh" class="logo" src="https://ulab.edu.bd/sites/all/themes/sloth/logo.svg"/>, <img alt="" class="image-style-image-300x300" src="https://ulab.edu.bd/sites/default/files/styles/image_300x300/public/event-image/2025/07/01/Yoga-Day-ULAB-2025-500.jpg?itok=E9OoF5GT"/>, <img alt="" class="image-style-image-300x300" src="https://ulab.edu.bd/sites/default/files/styles/image_300x300/public/event-image/2025/06/25/Rabindra-Nazrul-Anniversary-2025-500.jpg?itok=zZXYuKr7"/>, <img alt="" class="image-style-i
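
Many of the images in the output above carry an empty `alt=""`, which `find_all('img', alt=True)` still returns. As a rough sketch of how the three cases could be told apart, the example below uses the standard library's `html.parser` instead of Beautiful Soup so that it runs without a network connection. The class name `ImageAltAudit` and the sample markup are made up for illustration.

```python
from html.parser import HTMLParser

class ImageAltAudit(HTMLParser):
    """Collect <img> tags, split by whether they carry useful alt text."""

    def __init__(self):
        super().__init__()
        self.with_alt = []     # (src, alt) for images with non-empty alt text
        self.empty_alt = []    # src of images whose alt is present but empty
        self.missing_alt = []  # src of images with no alt attribute at all

    def handle_starttag(self, tag, attrs):
        if tag != "img":
            return
        attributes = dict(attrs)
        src = attributes.get("src", "")
        if "alt" not in attributes:
            self.missing_alt.append(src)
        elif not (attributes["alt"] or "").strip():
            self.empty_alt.append(src)
        else:
            self.with_alt.append((src, attributes["alt"]))

# Illustrative markup, not taken from a real page
html_sample = (
    '<img alt="logo" src="/logo.svg"/>'
    '<img alt="" src="/banner.jpg">'
    '<img src="/photo.png">'
)
audit = ImageAltAudit()
audit.feed(html_sample)
print("with alt   :", audit.with_alt)
print("empty alt  :", audit.empty_alt)
print("missing alt:", audit.missing_alt)
```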

**4. For counting words inside a web page**

In [24]:
import requests
from bs4 import BeautifulSoup
from collections import Counter
from string import punctuation

# Getting content from web page
r = requests.get("https://ulab.edu.bd/")
soup = BeautifulSoup(r.content, "html.parser")

# For getting words within paragraphs
text_paragraph = (''.join(s.find_all(string=True)) for s in soup.find_all('p'))
count_paragraph = Counter((x.rstrip(punctuation).lower() for y in text_paragraph for x in y.split()))

# For getting words inside div tags
# Note: text inside nested <div> and <p> elements is counted once per enclosing element,
# so these counts overstate how often a word appears on the page
text_div = (''.join(s.find_all(string=True)) for s in soup.find_all('div'))
count_div = Counter((x.rstrip(punctuation).lower() for y in text_div for x in y.split()))

# Adding the two counters to get a list of word counts (from most to least common)
total = count_div + count_paragraph
list_most_common_words = total.most_common()



In [25]:
# Number of distinct words found on the webpage (not the total number of occurrences)
len(total)

1108
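
Note that `len()` of a `Counter` is the number of distinct keys, while the sum of its values is the total number of occurrences. The following is a minimal sketch on a made-up sentence showing the difference. The helper `count_words` is hypothetical, and unlike the cell above it strips punctuation from both ends of each token and drops tokens that were punctuation only.

```python
from collections import Counter
from string import punctuation

def count_words(text):
    """Return a Counter of lowercased words with surrounding punctuation removed."""
    words = (token.strip(punctuation).lower() for token in text.split())
    # Skip tokens that consisted only of punctuation and are now empty
    return Counter(word for word in words if word)

# Illustrative text, not taken from the scraped page
sample_text = "Research and teaching, teaching and learning - and more."
counts = count_words(sample_text)

print(len(counts))           # number of distinct words
print(sum(counts.values()))  # total number of word occurrences
print(counts.most_common(2))
```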

In [26]:
# List of common words
list_most_common_words

[('and', 1409),
 ('of', 1378),
 ('the', 1338),
 ('to', 926),
 ('ulab', 711),
 ('a', 632),
 ('club', 499),
 ('', 445),
 ('in', 438),
 ('is', 413),
 ('for', 348),
 ('explore', 276),
 ('with', 262),
 ('more', 254),
 ('center', 254),
 ('2025', 250),
 ('on', 206),
 ('students', 204),
 ('at', 202),
 ('review', 192),
 ('tabs-panel', 184),
 ('as', 182),
 ('research', 157),
 ('media', 148),
 ('university', 147),
 ('society', 144),
 ('visit', 142),
 ('faculty', 133),
 ('social', 132),
 ('language', 126),
 ('bangladesh', 126),
 ('eee', 115),
 ('through', 114),
 ('studies', 113),
 ('their', 112),
 ('development', 109),
 ('skills', 108),
 ('english', 101),
 ('summer', 100),
 ('arts', 97),
 ('an', 96),
 ('our', 96),
 ('will', 88),
 ('project', 88),
 ('by', 86),
 ('spring', 85),
 ('art', 85),
 ('field', 85),
 ('sports', 85),
 ('computer', 84),
 ('prof', 84),
 ('participation', 84),
 ('provide', 84),
 ('all', 82),
 ('student', 76),
 ('fall', 76),
 ('business', 75),
 ('liberal', 74),
 ('dedicated', 72)

**5. For inspecting Broken links inside a webpage**

A link that works returns the HTTP response code 200. A link whose target is not available returns the response code 404.
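
Two details matter for a check like this: relative links such as `/students` have to be resolved against the page URL before they can be requested, and servers return codes other than 200 and 404. The sketch below illustrates both without touching the network. The helpers `resolve_link` and `describe_status` and the labels they return are assumptions made for this example.

```python
from urllib.parse import urljoin, urlparse

def resolve_link(page_url, href):
    """Turn an href into an absolute http(s) URL, or None if it cannot be checked."""
    if not href:
        return None
    absolute = urljoin(page_url, href)
    if urlparse(absolute).scheme not in ("http", "https"):
        return None  # e.g. mailto:, tel:, javascript:
    return absolute

def describe_status(status_code):
    """Map an HTTP status code to a coarse label (labels are illustrative)."""
    if 200 <= status_code < 300:
        return "ok"
    if 300 <= status_code < 400:
        return "redirect"
    if status_code in (404, 410):
        return "broken"
    return "error"  # anything else, e.g. 403 or 500

page = "https://ulab.edu.bd/"
for href in ["/students", "#main-content", "mailto:someone@example.com"]:
    print(href, "->", resolve_link(page, href))
for code in (200, 301, 404, 500):
    print(code, describe_status(code))
```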

In [27]:
# Importing libraries
from urllib.parse import urljoin
from bs4 import BeautifulSoup
import requests

# Getting URL from user
url = input("Enter your url: ")

def broken_page():
  # Request the page and parse its HTML
  user_req_page = requests.get(url)
  soup = BeautifulSoup(user_req_page.text, "html.parser")

  # Check every link on the page: 200 means the target is available, 404 means PAGE NOT FOUND
  for link in soup.find_all('a'):
    href = link.get('href')
    if not href:
      continue
    try:
      # Resolve relative links such as /students against the page URL before requesting them
      response_code = requests.get(urljoin(url, href), timeout=10).status_code
    except requests.RequestException as error:
      # Covers unreachable hosts, timeouts, and non-HTTP schemes such as mailto:
      response_code = f"request failed ({type(error).__name__})"
    print(f"Url: {href} | Status Code: {response_code}")


#----- NOTE ------#
# --------- TO VERIFY THE PAGE NOT FOUND 404 ERROR, enter the web link below as the input URL --------#
#https://roine.github.com/p1

#------------- Main ---------------#
if __name__ == '__main__':
  broken_page()

Enter your url: https://ulab.edu.bd/
Url: #main-content | Status Code: 200
Url: /students | Status Code: 200
Url: /staff | Status Code: 200
Url: /research-0 | Status Code: 200
Url: /about-us | Status Code: 200
Url: /about-us/about-ulab | Status Code: 200
Url: /about-us/leadership-governance | Status Code: 200
Url: http://admissions.ulab.edu.bd | Status Code: 200
Url: https://admissions.ulab.edu.bd/undergraduate-programs | Status Code: 200
Url: https://admissions.ulab.edu.bd/graduate-programs | Status Code: 200
Url: https://admissions.ulab.edu.bd/international-admissions | Status Code: 200
Url: /academics | Status Code: 200
Url: /academics/degrees-offered | Status Code: 200
Url: https://registrar.ulab.edu.bd/academic-calendar-6 | Status Code: 200
Url: /academics/faculty-list | Status Code: 200
Url: https://ged.ulab.edu.bd | Status Code: 200
Url: https://usb.ulab.edu.bd | Status Code: 200
Url: https://msj.ulab.edu.bd | Status Code: 200
Url: https://deh.ulab.edu.bd | Status Code: 200
Url:

**6. For getting the source code of the webpage**

Here we use Selenium's 'page_source' attribute to retrieve the source of the page currently loaded in the browser.

*NOTE: The page source is the HTML markup behind a web page.*

In [None]:
# install chromium, its driver, and selenium
!apt update
!apt install chromium-chromedriver
!pip install selenium

# set options to be headless, ..
from selenium import webdriver
options = webdriver.ChromeOptions()
options.add_argument('--headless')
options.add_argument('--no-sandbox')
options.add_argument('--disable-dev-shm-usage')


In [None]:
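#------------- FOR DISPLAYING SOURCE CODE OF THE WEBPAGE -------------#
# This cell needs `webdriver` and `options` from the setup cell above;
# running it before that cell raises a NameError

# Start headless Chrome with the options configured above
wd = webdriver.Chrome(options=options)

# Prompt user to enter the URL
url = input("Enter your url: ")

# Load the URL in the browser
wd.get(url)

# Display the page source
print(wd.page_source)

# Close the browser once the source has been printed
wd.quit()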

NameError: name 'webdriver' is not defined

**7. Extraction of all URLs from a website without duplication**

In [28]:
#---- Importing libraries ----#
import re
import requests
from bs4 import BeautifulSoup

all_links = set()  #------ A set keeps each link only once ------#

for i in range(7):
    r = requests.get("https://ulab.edu.bd/?page={}".format(i))
    soup = BeautifulSoup(r.content, "html.parser")
    for link in soup.find_all("a", href=re.compile('/')):
        link = link.get('href')
        #----- Print a link only the first time it is seen; adding it to the set keeps the collection distinct ------#
        if link not in all_links:
            print(link)
        all_links.add(link)

/students
/staff
/research-0
/about-us
/about-us/about-ulab
/about-us/leadership-governance
http://admissions.ulab.edu.bd
https://admissions.ulab.edu.bd/undergraduate-programs
https://admissions.ulab.edu.bd/graduate-programs
https://admissions.ulab.edu.bd/international-admissions
/academics
/academics/degrees-offered
https://registrar.ulab.edu.bd/academic-calendar-6
/academics/faculty-list
https://ged.ulab.edu.bd
https://usb.ulab.edu.bd
https://msj.ulab.edu.bd
https://deh.ulab.edu.bd
https://cse.ulab.edu.bd
https://eee.ulab.edu.bd
https://bbs.ulab.edu.bd
https://ulab.edu.bd
https://ofr.ulab.edu.bd
/research-centers
/journals
https://ulab-press.ulab.edu.bd
https://URMS.ulab.edu.bd
https://moodle.ulab.edu.bd/
https://blendedlearning.ulab.edu.bd
https://ulab.edu.bd/user
https://cocurricular.ulab.edu.bd
/my-ulab/student-support-services
/my-ulab/sports
/administration/vice-chancellor-office
https://registrar.ulab.edu.bd
https://communications.ulab.edu.bd
/staff/directory-administration
htt
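
A set removes exact duplicates only, so a relative link and its absolute form (for example `/students` and `https://ulab.edu.bd/students`) would both be kept. The sketch below shows one way to normalize links before deduplicating. The helper `unique_absolute_links` is hypothetical, and treating URLs that differ only by a fragment or a trailing slash as the same page is an assumption that may not hold for every site.

```python
from urllib.parse import urljoin, urldefrag

def unique_absolute_links(base_url, hrefs):
    """Resolve hrefs against base_url and drop duplicates, keeping first-seen order."""
    seen = {}  # dict keys preserve insertion order
    for href in hrefs:
        absolute, _fragment = urldefrag(urljoin(base_url, href))
        # Assumption: a trailing slash does not change which page is meant
        seen.setdefault(absolute.rstrip("/"), None)
    return list(seen)

hrefs = [
    "/students",
    "https://ulab.edu.bd/students",
    "/students#top",
    "https://ulab.edu.bd",
    "https://ulab.edu.bd/",
]
links = unique_absolute_links("https://ulab.edu.bd/", hrefs)
for link in links:
    print(link)
```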

**8. Measuring the frontend and backend performance of a website**

In [29]:
#----- Installation of selenium and chromedriver in google colab -----#
!pip install selenium
!apt-get update
!apt install chromium-chromedriver


In [30]:
#---- Importing libraries ----#
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
import time
import csv
import os.path

In [31]:
#---- Configuring headless Chrome for google colab ----#
from selenium import webdriver
options = webdriver.ChromeOptions()
options.add_argument('--headless')
options.add_argument('--no-sandbox')
options.add_argument('--disable-dev-shm-usage')
# No browser is started here: the measurement loop below starts a fresh one for every iteration

In [33]:
#----- Creating csv file to write the calculated performance of the website
csv_path = "performance.csv"
file = open(csv_path, 'w', newline='')
writer = csv.writer(file)
writer.writerow(["backendPerformance_calc","frontendPerformance_calc"])


#----- Getting the website URL from the user
url = input("Enter url :")
print("This is the website link that you entered:", url)

#----- Setting iterations for testing the performance
iterations = 10
for i in range(iterations):
    driver = webdriver.Chrome(options=options)
    driver.get(url) #-- Load the page; by default driver.get returns once the page has finished loading

    #-- Reading timestamps (in milliseconds) from the Navigation Timing API.
    #-- driver.execute_script runs the JavaScript synchronously in the current window or frame and returns its value,
    #-- e.g. 'return window.performance.timing.navigationStart' yields the time at which navigation started.
    navigationStart = driver.execute_script("return window.performance.timing.navigationStart")
    responseStart = driver.execute_script("return window.performance.timing.responseStart")
    domComplete = driver.execute_script("return window.performance.timing.domComplete")

    #-- Backend: time from the start of navigation until the first byte of the response arrives
    backendPerformance_calc = responseStart - navigationStart
    #-- Frontend: time from the first byte of the response until the DOM is complete
    frontendPerformance_calc = domComplete - responseStart

    #-- This prints the backend and frontend performance for each iteration
    print("Iteration no:", i)
    print("Back End performance in MS: %s" % backendPerformance_calc)
    print("Front End performance in MS: %s" % frontendPerformance_calc)
    print("------------------------")

    #-- Writing one row per iteration to the file
    writer.writerow([backendPerformance_calc,frontendPerformance_calc])
    driver.close()

Enter url :https://ulab.edu.bd/
This is the website link that you entered: https://ulab.edu.bd/
Iteration no: 0
Back End performance in MS: 612
Front End performance in MS: 8012
------------------------
Iteration no: 1
Back End performance in MS: 583
Front End performance in MS: 3343
------------------------
Iteration no: 2
Back End performance in MS: 586
Front End performance in MS: 3913
------------------------
Iteration no: 3
Back End performance in MS: 565
Front End performance in MS: 6683
------------------------
Iteration no: 4
Back End performance in MS: 680
Front End performance in MS: 4354
------------------------
Iteration no: 5
Back End performance in MS: 602
Front End performance in MS: 3249
------------------------
Iteration no: 6
Back End performance in MS: 598
Front End performance in MS: 7496
------------------------
Iteration no: 7
Back End performance in MS: 618
Front End performance in MS: 4218
------------------------
Iteration no: 8
Back End performance in MS: 579


In [34]:
#---- For closing the CSV file and the WebDriver ----#
driver.quit()
file.close()


In [35]:
#---- To view performance in a dataframe ----#
import pandas as pd
df=pd.read_csv("performance.csv")

In [36]:
#----- Displaying DataFrames output ------#
df

| | backendPerformance_calc | frontendPerformance_calc |
|---|---|---|
| 0 | 612 | 8012 |
| 1 | 583 | 3343 |
| 2 | 586 | 3913 |
| 3 | 565 | 6683 |
| 4 | 680 | 4354 |
| 5 | 602 | 3249 |
| 6 | 598 | 7496 |
| 7 | 618 | 4218 |
| 8 | 579 | 7571 |
| 9 | 593 | 6589 |
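
Individual runs vary considerably, especially on the frontend side, so a summary across iterations is easier to compare than single rows. The sketch below computes the mean, median, and range with the standard library, using the ten measurements copied from the table above. The helper name `summarize` is an assumption for this example, and the same figures could equally be obtained from the DataFrame with `df.describe()`.

```python
from statistics import mean, median

# (backend_ms, frontend_ms) pairs copied from the table above
runs = [
    (612, 8012), (583, 3343), (586, 3913), (565, 6683), (680, 4354),
    (602, 3249), (598, 7496), (618, 4218), (579, 7571), (593, 6589),
]

def summarize(values):
    """Return the mean, median, minimum, and maximum of timings in milliseconds."""
    return {
        "mean": mean(values),
        "median": median(values),
        "min": min(values),
        "max": max(values),
    }

backend = summarize([b for b, _ in runs])
frontend = summarize([f for _, f in runs])
print("backend :", backend)
print("frontend:", frontend)
```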
