LAB:05

Web Scraping with Python and Beautiful Soup

OBJECTIVES:
- To understand and implement web scraping using Python.

THEORY:
Web scraping is the automated extraction of data from websites. A program (a bot or crawler) fetches web pages, parses their HTML, and pulls out specific information such as prices, text, or contact details, then saves it in a structured format such as a spreadsheet or database. The results feed analysis, market research, or price comparison, and scraping is often used when no API is available or manual copying is too slow.
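
The parse, extract, and save steps described above can be sketched without touching the network. This is only an illustration: the markup in PAGE_HTML, the class names, and the helper names extract_products and to_csv are all assumptions made for the example, and the fetch step is replaced by a hard-coded string.

```python
import csv
import io

from bs4 import BeautifulSoup

# Stand-in for a fetched page (hypothetical markup, not taken from a real site).
PAGE_HTML = """
<ul>
  <li class="product"><span class="name">Pen</span><span class="price">1.50</span></li>
  <li class="product"><span class="name">Notebook</span><span class="price">3.25</span></li>
</ul>
"""


def extract_products(html):
    """Parse the HTML and return a list of (name, price) tuples."""
    soup = BeautifulSoup(html, "html.parser")
    products = []
    for item in soup.find_all("li", class_="product"):
        name = item.find("span", class_="name").get_text(strip=True)
        price = float(item.find("span", class_="price").get_text(strip=True))
        products.append((name, price))
    return products


def to_csv(rows):
    """Serialise the rows as CSV text with a header line."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["name", "price"])
    writer.writerows(rows)
    return buffer.getvalue()


products = extract_products(PAGE_HTML)
csv_text = to_csv(products)
print(csv_text)
```

In a real scraper the string would come from requests.get(url).text, and the CSV text would be written to a file instead of an in-memory buffer.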

Beautiful Soup is a Python package for parsing HTML and XML documents, including those with malformed markup. It creates a parse tree for documents that can be used to extract data from HTML, which is useful for web scraping.
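
A small sketch of what parsing malformed markup means in practice. The broken_html string is made up for the example, and the built-in "html.parser" backend is assumed. Other parsers such as lxml or html5lib may repair the same input differently.

```python
from bs4 import BeautifulSoup

# Deliberately malformed: the <p> and <b> tags are never closed.
broken_html = "<div><p>First <b>bold text</div>"

soup = BeautifulSoup(broken_html, "html.parser")

# The parse tree can still be navigated as if the tags had been closed.
p_tag = soup.find("p")
b_tag = soup.find("b")

print("Text inside <b>:", b_tag.get_text())
print("Parent of <b>:", b_tag.parent.name)
print("Parent of <p>:", p_tag.parent.name)
```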

In [1]:
import requests
from bs4 import BeautifulSoup

url = "https://www.geeksforgeeks.org"

try:
    # Send HTTP GET request
    response = requests.get(url, timeout=10)

    # Raise error for bad HTTP status codes (4xx, 5xx)
    response.raise_for_status()

    # Parse HTML content
    soup = BeautifulSoup(response.text, "html.parser")

    # Extract and print the title
    if soup.title:
        print("Page Title:", soup.title.string)
    else:
        print("Title tag not found.")

except requests.exceptions.HTTPError as http_err:
    print("HTTP error occurred:", http_err)

except requests.exceptions.ConnectionError:
    print("Network connection error occurred.")

except requests.exceptions.Timeout:
    print("Request timed out.")

except requests.exceptions.RequestException as err:
    print("An error occurred:", err)

finally:
    print("Request processing completed.")



Page Title: GeeksforGeeks | Your All-in-One Learning Portal
Request processing completed.


In [7]:
import requests
from bs4 import BeautifulSoup

url = "https://example.com"

try:
    response = requests.get(url, timeout=10)
    response.raise_for_status()

    # Use the encoding detected from the page content
    response.encoding = response.apparent_encoding

    soup = BeautifulSoup(response.text, "html.parser")

    # find() returns None when the page has no <table>
    table = soup.find("table")
    if table is None:
        print("Table not found on the webpage")
    else:
        for row in table.find_all("tr"):
            cols = row.find_all(["th", "td"])
            data = [col.get_text(strip=True) for col in cols]
            print(data)

except requests.exceptions.Timeout:
    print("Request timed out")

except requests.exceptions.HTTPError:
    print("HTTP error occurred (4xx or 5xx response)")

except requests.exceptions.RequestException:
    print("Network-related error")

except Exception as e:
    print("Unexpected error:", e)

Table not found on the webpage
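
Since https://example.com contains no table, the row-extraction loop never runs in the cell above. The sketch below applies the same idea to an inline HTML string. The markup and the helper name table_rows are assumptions made for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical table markup used in place of a live page.
html = """
<table>
  <tr><th>Name</th><th>Score</th></tr>
  <tr><td>Asha</td><td>91</td></tr>
  <tr><td>Bikram</td><td>84</td></tr>
</table>
"""


def table_rows(markup):
    """Return the first table's rows as lists of cell text, or [] if there is no table."""
    soup = BeautifulSoup(markup, "html.parser")
    table = soup.find("table")
    if table is None:
        return []
    rows = []
    for row in table.find_all("tr"):
        cells = row.find_all(["th", "td"])
        rows.append([cell.get_text(strip=True) for cell in cells])
    return rows


rows = table_rows(html)
for row in rows:
    print(row)

# A page without a table yields an empty list instead of raising an error.
no_table = table_rows("<p>No table here</p>")
print(no_table)
```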


In [2]:
import requests
from bs4 import BeautifulSoup

url = "https://www.geeksforgeeks.org"

try:
    response = requests.get(url, timeout=10)
    response.raise_for_status()

    soup = BeautifulSoup(response.text, "html.parser")

    print("Using find():")
    first_link = soup.find("a")
    if first_link:
        print("Text:", first_link.get_text(strip=True))
        print("URL :", first_link.get("href"))
    else:
        print("No link found.")

    print("\nUsing find_all():")
    links = soup.find_all("a", limit=5)

    for i, link in enumerate(links, start=1):
        text = link.get_text(strip=True)
        href = link.get("href")
        print(f"{i}. Text: {text} | URL: {href}")

except requests.exceptions.RequestException as e:
    print("Error occurred:", e)

finally:
    print("\nLink extraction completed.")


Using find():
Text: 
URL : https://www.geeksforgeeks.org/

Using find_all():
1. Text:  | URL: https://www.geeksforgeeks.org/
2. Text: DSA | URL: https://www.geeksforgeeks.org/dsa/dsa-tutorial-learn-data-structures-and-algorithms/
3. Text: Practice Problems | URL: https://www.geeksforgeeks.org/explore
4. Text: C | URL: https://www.geeksforgeeks.org/c/c-programming-language/
5. Text: C++ | URL: https://www.geeksforgeeks.org/cpp/c-plus-plus/

Link extraction completed.


In [5]:
import requests

api_url = "https://jsonplaceholder.typicode.com/todos"
response = requests.get(api_url, timeout=10)

if response.status_code == 200:
    data = response.json()
    # len(data) is the number of to-do records returned, not the status code
    print("Data fetched successfully: ", len(data))
    print(data)

else:
    print("Failed to fetch data: ", response.status_code)


Data fetched successfully:  200
[{'userId': 1, 'id': 1, 'title': 'delectus aut autem', 'completed': False}, {'userId': 1, 'id': 2, 'title': 'quis ut nam facilis et officia qui', 'completed': False}, {'userId': 1, 'id': 3, 'title': 'fugiat veniam minus', 'completed': False}, {'userId': 1, 'id': 4, 'title': 'et porro tempora', 'completed': True}, {'userId': 1, 'id': 5, 'title': 'laboriosam mollitia et enim quasi adipisci quia provident illum', 'completed': False}, {'userId': 1, 'id': 6, 'title': 'qui ullam ratione quibusdam voluptatem quia omnis', 'completed': False}, {'userId': 1, 'id': 7, 'title': 'illo expedita consequatur quia in', 'completed': False}, {'userId': 1, 'id': 8, 'title': 'quo adipisci enim quam ut ab', 'completed': True}, {'userId': 1, 'id': 9, 'title': 'molestiae perspiciatis ipsa', 'completed': False}, {'userId': 1, 'id': 10, 'title': 'illo est ratione doloremque quia maiores aut', 'completed': True}, {'userId': 1, 'id': 11, 'title': 'vero rerum temporibus dolor', 'com

In [3]:
import requests
from bs4 import BeautifulSoup
import csv

url = "https://www.geeksforgeeks.org"

try:
    # Fetch webpage
    response = requests.get(url, timeout=10)
    response.raise_for_status()

    # Parse HTML
    soup = BeautifulSoup(response.text, "html.parser")

    # Extract all <h2> headings
    headings = []
    h2_tags = soup.find_all("h2")

    for h2 in h2_tags:
        text = h2.get_text(strip=True)
        if text:
            headings.append(text)

    # Extract all <a> links
    links = []
    a_tags = soup.find_all("a")

    for a in a_tags:
        href = a.get("href")
        text = a.get_text(strip=True)
        links.append({"text": text, "url": href})

    # Print extracted data
    print("H2 Headings:")
    for h in headings:
        print("-", h)

    print("\nTotal Links Extracted:", len(links))

    # Save headings to CSV file
    with open("headings.csv", mode="w", newline="", encoding="utf-8") as file:
        writer = csv.writer(file)
        writer.writerow(["Heading"])
        for h in headings:
            writer.writerow([h])

    print("\nHeadings saved to headings.csv")

except requests.exceptions.RequestException as e:
    print("Error occurred:", e)

finally:
    print("Scraping process completed.")


H2 Headings:
- Explore
- Courses
- Must Explore

Total Links Extracted: 91

Headings saved to headings.csv
Scraping process completed.
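
The cell above collects links as dictionaries but saves only the headings. One possible way to save the links as well is csv.DictWriter, sketched here on a made-up HTML string and written to an in-memory buffer rather than a file. Skipping anchors without an href is a design choice of this sketch, not something the original cell does.

```python
import csv
import io

from bs4 import BeautifulSoup

# Hypothetical markup standing in for a fetched page.
html = """
<a href="https://example.com/a">First</a>
<a>No href here</a>
<a href="/relative/path">Second</a>
"""

soup = BeautifulSoup(html, "html.parser")

links = []
for a in soup.find_all("a"):
    href = a.get("href")
    if href is None:
        # Skip anchors that carry no URL.
        continue
    links.append({"text": a.get_text(strip=True), "url": href})

# Write one CSV row per link; the dictionary keys become the column names.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["text", "url"], lineterminator="\n")
writer.writeheader()
writer.writerows(links)

links_csv = buffer.getvalue()
print(links_csv)
```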


In [6]:
import requests

url = "https://jsonplaceholder.typicode.com/posts"
payload = {
    "title": "My New Post",
    "body": "This is the content",
    "userId": 1
}

# Keep the POST result in `response` so the prints below report this request
response = requests.post(url, json=payload, timeout=10)
print(response.status_code)
print(response.json())

In [9]:
from zeep import Client

wsdl = "http://www.dneonline.com/calculator.asmx?WSDL"
client = Client(wsdl=wsdl)

result = client.service.Add(intA=5, intB=3)
print(result)

8


In [4]:
import requests
from bs4 import BeautifulSoup

url = "https://en.wikipedia.org/wiki/List_of_countries_by_population"

headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
                  "AppleWebKit/537.36 (KHTML, like Gecko) "
                  "Chrome/120.0.0.0 Safari/537.36"
}

try:
    # Send request with headers
    response = requests.get(url, headers=headers, timeout=10)
    response.raise_for_status()

    # Ensure proper encoding
    response.encoding = response.apparent_encoding

    # Parse HTML
    soup = BeautifulSoup(response.text, "html.parser")

    # Find first wikitable
    table = soup.find("table", class_="wikitable")

    if not table:
        print("Table not found.")
    else:
        rows = table.find_all("tr")

        print("Wikipedia Table Rows:\n")
        for row in rows:
            cells = row.find_all(["th", "td"])
            data = [cell.get_text(strip=True) for cell in cells]
            if data:
                print(data)

except requests.exceptions.HTTPError as http_err:
    print("HTTP error occurred:", http_err)

except requests.exceptions.ConnectionError:
    print("Network connection error occurred.")

except requests.exceptions.Timeout:
    print("Request timed out.")

except requests.exceptions.RequestException as err:
    print("An error occurred:", err)

finally:
    print("\nWikipedia table scraping completed.")



Wikipedia Table Rows:

['Location', 'Population', '% ofworld', 'Date', 'Source (official or fromtheUnited Nations)', 'Notes']
['World', '8,232,000,000', '100%', '13 Jun 2025', 'UN projection[1][3]', '']
['India', '1,417,492,000', '17.2%', '1 Jul 2025', 'Official projection[4]', '[b]']
['China', '1,404,890,000', '17.1%', '31 Dec 2025', 'Official estimate[5]', '[c]']
['United States', '341,784,857', '4.2%', '1 Jul 2025', 'Official estimate[6]', '[d]']
['Indonesia', '284,438,782', '3.5%', '30 Jun 2025', 'National annual projection[7]', '']
['Pakistan', '241,499,431', '2.9%', '1 Mar 2023', '2023 census result[8]', '[e]']
['Nigeria', '223,800,000', '2.7%', '1 Jul 2023', 'Official projection[9]', '']
['Brazil', '213,421,037', '2.6%', '1 Jul 2025', 'Official estimate[10]', '']
['Bangladesh', '169,828,911', '2.1%', '14 Jun 2022', '2022 census result[11]', '[f]']
['Russia', '146,028,325', '1.8%', '1 Jan 2025', 'Official estimate[13]', '[g]']
['Mexico', '130,760,049', '1.6%', '30 Sep 2025', 'Nat

In [6]:
from bs4 import BeautifulSoup

html = """
<html><body>
<p class="intro">Welcome</p>
<p class="intro">Learn Python</p>
<a href="https://python.org">Python</a>
</body></html>
"""

# Parse HTML
soup = BeautifulSoup(html, "html.parser")

# 1. Extract all <p> tags with class "intro"
intro_paragraphs = soup.find_all("p", class_="intro")

print("Intro paragraphs:")
for p in intro_paragraphs:
    print("-", p.get_text())

# 2. Find the parent of the <a> tag
a_tag = soup.find("a")
parent_tag = a_tag.parent

print("\nParent of <a> tag:")
print(parent_tag.name)

# 3. Print the next sibling of the first <p> tag
first_p = soup.find("p", class_="intro")
next_sibling = first_p.find_next_sibling()

print("\nNext sibling of first <p> tag:")
print(next_sibling.get_text())


Intro paragraphs:
- Welcome
- Learn Python

Parent of <a> tag:
body

Next sibling of first <p> tag:
Learn Python


In [7]:
from bs4 import BeautifulSoup

html = '<b class="boldest">Hello</b>'

# Parse HTML
soup = BeautifulSoup(html, "html.parser")

# Select the tag
tag = soup.find("b")

# 1. Change tag name to <strong>
tag.name = "strong"

# 2. Add id attribute
tag["id"] = "greeting"

# 3. Replace text content
tag.string = "Hi there"

# Print modified HTML
print(tag)


<strong class="boldest" id="greeting">Hi there</strong>


In [8]:
from bs4 import BeautifulSoup

html = """
<table>
<tr><td>Apple</td></tr>
<tr><td>Banana</td></tr>
</table>
"""

# Parse HTML
soup = BeautifulSoup(html, "html.parser")

# 1. Find the string "Apple" and print its parent <td> tag
apple_text = soup.find(string="Apple")
apple_td = apple_text.parent

print("Parent <td> of 'Apple':")
print(apple_td)

# 2. The first <td> has no sibling cells, so print the siblings of its parent <tr>
first_td = soup.find("td")

print("\nSiblings of the first <td>'s parent <tr>:")
for sibling in first_td.parent.find_next_siblings():
    print(sibling)


Parent <td> of 'Apple':
<td>Apple</td>

Siblings of the first <td>'s parent <tr>:
<tr><td>Banana</td></tr>


In [9]:
from bs4 import BeautifulSoup, SoupStrainer

html = """
<html>
<a href="page1.html">Page 1</a>
<p>Paragraph</p>
<a href="page2.html">Page 2</a>
</html>
"""

# Create a SoupStrainer for <a> tags only
only_a_tags = SoupStrainer("a")

# Parse only <a> tags
soup = BeautifulSoup(html, "html.parser", parse_only=only_a_tags)

# Print parsed result
print(soup)


<a href="page1.html">Page 1</a><a href="page2.html">Page 2</a>


In [10]:
import requests
from bs4 import BeautifulSoup

url = "https://en.wikipedia.org/wiki/List_of_countries_by_population"

# Use headers to avoid 403 Forbidden
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
                  "AppleWebKit/537.36 (KHTML, like Gecko) "
                  "Chrome/120.0.0.0 Safari/537.36"
}

try:
    # Fetch the webpage
    response = requests.get(url, headers=headers, timeout=10)
    response.raise_for_status()  # Raise HTTPError for bad status

    # Parse HTML
    response.encoding = response.apparent_encoding
    soup = BeautifulSoup(response.text, "html.parser")

    # Find the first table
    table = soup.find("table", class_="wikitable")
    if table is None:
        raise AttributeError("Table not found on the page.")

    # Extract all rows
    rows = table.find_all("tr")

    print("Wikipedia Table Rows:\n")
    for row in rows:
        cells = row.find_all(["th", "td"])
        row_data = [cell.get_text(strip=True) for cell in cells]
        if row_data:
            print(row_data)

except requests.exceptions.Timeout:
    print("Error: The request timed out.")

except requests.exceptions.HTTPError as http_err:
    print("HTTP error occurred:", http_err)

except requests.exceptions.RequestException as req_err:
    print("Request exception occurred:", req_err)

except AttributeError as attr_err:
    print("Attribute error:", attr_err)

finally:
    print("\nTable scraping process completed.")


Wikipedia Table Rows:

['Location', 'Population', '% ofworld', 'Date', 'Source (official or fromtheUnited Nations)', 'Notes']
['World', '8,232,000,000', '100%', '13 Jun 2025', 'UN projection[1][3]', '']
['India', '1,417,492,000', '17.2%', '1 Jul 2025', 'Official projection[4]', '[b]']
['China', '1,404,890,000', '17.1%', '31 Dec 2025', 'Official estimate[5]', '[c]']
['United States', '341,784,857', '4.2%', '1 Jul 2025', 'Official estimate[6]', '[d]']
['Indonesia', '284,438,782', '3.5%', '30 Jun 2025', 'National annual projection[7]', '']
['Pakistan', '241,499,431', '2.9%', '1 Mar 2023', '2023 census result[8]', '[e]']
['Nigeria', '223,800,000', '2.7%', '1 Jul 2023', 'Official projection[9]', '']
['Brazil', '213,421,037', '2.6%', '1 Jul 2025', 'Official estimate[10]', '']
['Bangladesh', '169,828,911', '2.1%', '14 Jun 2022', '2022 census result[11]', '[f]']
['Russia', '146,028,325', '1.8%', '1 Jan 2025', 'Official estimate[13]', '[g]']
['Mexico', '130,760,049', '1.6%', '30 Sep 2025', 'Nat

DISCUSSION:
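In this lab session, we explored web scraping in Python using the requests and Beautiful Soup libraries. We fetched HTML content from web pages, parsed it, and extracted meaningful information such as headings, links, and tables. We also called a REST API with GET and POST requests, invoked a SOAP service through zeep, and practised navigating, modifying, and selectively parsing the parse tree. This lab helped us understand the logic and methods behind web scraping.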

CONCLUSION:
Hence, we implemented and understood web scraping using Python.