LAB REPORT: 05
Web Scraping using Python (Requests & BeautifulSoup)

Objective:
To practice web scraping techniques using the Python libraries requests, BeautifulSoup, and pandas, including HTML parsing, navigation, table scraping, selectors, and exception handling.

Tools & Libraries Used:
- Python 3
- requests
- bs4 (BeautifulSoup)
- pandas
- csv

Theory:

Web scraping is the process of automatically extracting information from websites using programming techniques. Websites are generally written in HTML, which consists of elements such as tags, attributes, and content. Python provides powerful libraries to access and process web data efficiently.

The requests library is used to send HTTP requests to a web server and retrieve the HTML content of a webpage. It supports features such as handling HTTP responses, timeouts, and network-related exceptions.
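
The behaviour described above can be sketched as a small reusable function. This is only an illustration under stated assumptions: the helper name fetch_html and its default timeout are hypothetical and not part of the lab code. It returns the page text on success and None on any request failure.

```python
import requests


def fetch_html(url, timeout=10):
    """Return the page HTML as text, or None if the request fails.

    fetch_html is a hypothetical helper used for illustration only.
    """
    try:
        response = requests.get(url, timeout=timeout)
        response.raise_for_status()  # turn 4xx/5xx status codes into HTTPError
        return response.text
    except requests.exceptions.Timeout:
        print("Request timed out:", url)
    except requests.exceptions.HTTPError as e:
        print("HTTP error:", e)
    except requests.exceptions.RequestException as e:
        # Base class: also covers connection failures and malformed URLs
        print("Request failed:", e)
    return None
```

Timeout and HTTPError are listed before RequestException because both are subclasses of it; the more specific handlers must come first to be reached.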

BeautifulSoup, from the bs4 package, is used to parse HTML documents and convert them into a structured tree. It allows easy searching, navigation, and modification of HTML elements using methods such as find() and find_all(), and CSS selectors through select().

Web scraping often involves navigating through HTML tags, extracting specific data such as links, headings, and tables, and storing the extracted data in files like CSV for further analysis. Proper exception handling is important to manage errors such as connection failures, invalid responses, or missing elements during scraping.

Question 1: Basic HTML Request & Parsing.
Aim:
Fetch HTML content from GeeksforGeeks and print the <title> of the webpage while handling exceptions.

In [2]:
import requests
from bs4 import BeautifulSoup

url = "https://www.geeksforgeeks.org"

try:
    response = requests.get(url, timeout=10)
    response.raise_for_status()   # Raises HTTPError for bad responses
    
    soup = BeautifulSoup(response.text, "html.parser")
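    # soup.title is None when the page has no <title>, so guard before reading it
    title = soup.title.get_text(strip=True) if soup.title else "(no title found)"
    print("Page Title:", title)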

except requests.exceptions.HTTPError as e:
    print("HTTP Error:", e)
except requests.exceptions.RequestException as e:
    print("Network Error:", e)


Page Title: GeeksforGeeks | Your All-in-One Learning Portal


Question 2: Extract Links.
Aim:
Extract and print the first 5 hyperlinks using .find() and .find_all().

In [3]:
links = soup.find_all("a", limit=5)

print("First 5 hyperlinks:")
for link in links:
    print(link.get_text(strip=True), "-", link.get("href"))


First 5 hyperlinks:
 - https://www.geeksforgeeks.org/
DSA - https://www.geeksforgeeks.org/dsa/dsa-tutorial-learn-data-structures-and-algorithms/
Practice Problems - https://www.geeksforgeeks.org/explore
C - https://www.geeksforgeeks.org/c/c-programming-language/
C++ - https://www.geeksforgeeks.org/cpp/c-plus-plus/
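
The cell above uses only find_all(). As a minimal sketch of how find() differs, the example below runs on a small inline HTML string (a hypothetical stand-in for the live page, so no network request is needed): find() returns a single Tag or None, while find_all() always returns a list.

```python
from bs4 import BeautifulSoup

# Hypothetical inline HTML so the example runs without a network request
html = """
<nav>
  <a href="/dsa">DSA</a>
  <a href="/python">Python</a>
  <a href="/java">Java</a>
</nav>
"""
soup = BeautifulSoup(html, "html.parser")

first_link = soup.find("a")              # a single Tag, or None if nothing matches
two_links = soup.find_all("a", limit=2)  # a list holding at most 2 Tags
missing = soup.find("table")             # there is no <table> here, so this is None

print("First link:", first_link.get("href"))
print("First two:", [a.get_text(strip=True) for a in two_links])
print("Missing element:", missing)
```

Because find() can return None, its result should be checked before accessing attributes such as .text.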


Question 3: Extract Headings & Save to CSV.
Aim:
- Extract all <h2> headings
- Extract all <a> tags
- Save headings to headings.csv

In [1]:
import requests
from bs4 import BeautifulSoup
import csv

url = "https://www.geeksforgeeks.org"
headers = {"User-Agent": "Mozilla/5.0"}

try:
    response = requests.get(url, headers=headers, timeout=5)
    response.raise_for_status()

    soup = BeautifulSoup(response.text, "html.parser")

    # Extract all h2 headings
    headings = [h.get_text(strip=True) for h in soup.find_all("h2")]

    # Extract all links
    links = [a.get("href") for a in soup.find_all("a")]

    print("H2 Headings:")
    print(headings)

    print("\nLinks:")
    print(links)

    # Save headings to CSV
    with open("headings.csv", "w", newline="", encoding="utf-8") as file:
        writer = csv.writer(file)
        writer.writerow(["Headings"])
        
        for heading in headings:
            writer.writerow([heading])

    print("\nHeadings saved to headings.csv")

except requests.exceptions.RequestException as e:
    print("Error occurred:", e)

H2 Headings:
['Explore', 'Courses', 'Must Explore']

Links:
['https://www.geeksforgeeks.org/', 'https://www.geeksforgeeks.org/dsa/dsa-tutorial-learn-data-structures-and-algorithms/', 'https://www.geeksforgeeks.org/explore', 'https://www.geeksforgeeks.org/c/c-programming-language/', 'https://www.geeksforgeeks.org/cpp/c-plus-plus/', 'https://www.geeksforgeeks.org/java/java/', 'https://www.geeksforgeeks.org/python/python-programming-language-tutorial/', 'https://www.geeksforgeeks.org/javascript/javascript-tutorial/', 'https://www.geeksforgeeks.org/data-science/data-science-for-beginners/', 'https://www.geeksforgeeks.org/machine-learning/machine-learning/', 'https://www.geeksforgeeks.org/courses', 'https://www.geeksforgeeks.org/linux-unix/linux-tutorial/', 'https://www.geeksforgeeks.org/devops/devops-tutorial/', 'https://www.geeksforgeeks.org/courses', 'https://www.geeksforgeeks.org/courses/dsa-self-paced', 'https://www.geeksforgeeks.org/courses/data-science-live', 'https://www.geeksforgee

Question 4: Scrape Wikipedia Table.
Aim:
Scrape the first table from the Wikipedia page "List of countries and dependencies by population".

In [2]:
import requests
from bs4 import BeautifulSoup

url = "https://en.wikipedia.org/wiki/List_of_countries_and_dependencies_by_population"

headers = {"User-Agent": "Mozilla/5.0"}

try:
    response = requests.get(url, headers=headers, timeout=5)
    response.raise_for_status()

    soup = BeautifulSoup(response.text, "html.parser")

    # Find first table
    table = soup.find("table")

    # Extract rows
    rows = table.find_all("tr")

    for row in rows:
        cells = row.find_all(["th", "td"])
        cell_values = [cell.get_text(strip=True) for cell in cells]

        if cell_values:
            print(cell_values)

except requests.exceptions.RequestException as e:
    print("Error:", e)

['Location', 'Population', '% ofworld', 'Date', 'Source (official or fromtheUnited Nations)', 'Notes']
['World', '8,232,000,000', '100%', '13 Jun 2025', 'UN projection[1][3]', '']
['India', '1,417,492,000', '17.2%', '1 Jul 2025', 'Official projection[4]', '[b]']
['China', '1,404,890,000', '17.1%', '31 Dec 2025', 'Official estimate[5]', '[c]']
['United States', '341,784,857', '4.2%', '1 Jul 2025', 'Official estimate[6]', '[d]']
['Indonesia', '284,438,782', '3.5%', '30 Jun 2025', 'National annual projection[7]', '']
['Pakistan', '241,499,431', '2.9%', '1 Mar 2023', '2023 census result[8]', '[e]']
['Nigeria', '223,800,000', '2.7%', '1 Jul 2023', 'Official projection[9]', '']
['Brazil', '213,421,037', '2.6%', '1 Jul 2025', 'Official estimate[10]', '']
['Bangladesh', '169,828,911', '2.1%', '14 Jun 2022', '2022 census result[11]', '[f]']
['Russia', '146,028,325', '1.8%', '1 Jan 2025', 'Official estimate[13]', '[g]']
['Mexico', '130,760,049', '1.6%', '30 Sep 2025', 'National quarterly estimat
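
The objective lists pandas, although the cell above only prints the rows. One possible way to load scraped rows into a DataFrame is sketched below. The inline table and its values are hypothetical placeholders standing in for the Wikipedia page, and the approach is an assumption rather than the method used in the lab.

```python
import pandas as pd
from bs4 import BeautifulSoup

# Hypothetical two-row table with placeholder values (not real population data)
html = """
<table>
<tr><th>Location</th><th>Population</th></tr>
<tr><td>Country A</td><td>1,000</td></tr>
<tr><td>Country B</td><td>2,500</td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")
rows = [
    [cell.get_text(strip=True) for cell in tr.find_all(["th", "td"])]
    for tr in soup.find("table").find_all("tr")
]

# The first row holds the column names; the remaining rows are data
df = pd.DataFrame(rows[1:], columns=rows[0])

# Strip thousands separators so Population becomes an integer column
df["Population"] = df["Population"].str.replace(",", "", regex=False).astype(int)

print(df)
```

From here, df.to_csv("population.csv", index=False) would store the table in a CSV file (the file name is an example).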

Question 5: Selectors & Navigation.
HTML Snippet

In [6]:
html = """
<html><body>
<p class="intro">Welcome</p>
<p class="intro">Learn Python</p>
<a href="https://python.org">Python</a>
</body></html>
"""

soup = BeautifulSoup(html, "html.parser")

# Extract <p> tags with class intro
intro_paragraphs = soup.find_all("p", class_="intro")
print("Intro Paragraphs:", [p.text for p in intro_paragraphs])

# Parent of <a> tag
a_tag = soup.find("a")
print("Parent of <a>:", a_tag.parent.name)

# Next sibling of first <p>
print("Next sibling:", intro_paragraphs[0].find_next_sibling().text)


Intro Paragraphs: ['Welcome', 'Learn Python']
Parent of <a>: body
Next sibling: Learn Python
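
The cell above selects elements with find_all(). The same elements can also be reached with CSS selectors through select() and select_one(); the following is a minimal sketch on the same HTML snippet.

```python
from bs4 import BeautifulSoup

html = """
<html><body>
<p class="intro">Welcome</p>
<p class="intro">Learn Python</p>
<a href="https://python.org">Python</a>
</body></html>
"""
soup = BeautifulSoup(html, "html.parser")

# select() returns a list of every element matching the CSS selector
intro_texts = [p.get_text() for p in soup.select("p.intro")]

# select_one() returns only the first match (or None if nothing matches)
first_intro = soup.select_one("p.intro")

# Attribute selector: <a> tags whose href starts with "https://"
link = soup.select_one('a[href^="https://"]')

print("Intro paragraphs:", intro_texts)
print("First intro:", first_intro.get_text())
print("Link target:", link["href"])
```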


Question 6: Tag Manipulation.
Aim:
Modify <b class="boldest">Hello</b>

In [7]:
html = '<b class="boldest">Hello</b>'
soup = BeautifulSoup(html, "html.parser")

tag = soup.b
tag.name = "strong"
tag["id"] = "greeting"
tag.string = "Hi there"

print(soup)


<strong class="boldest" id="greeting">Hi there</strong>


Question 7: Advanced Navigation.

In [3]:
html = """
<table>
<tr><td>Apple</td></tr>
<tr><td>Banana</td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")

# Find string "Apple"
apple = soup.find(string="Apple")

# Print parent <td>
print("Parent TD:", apple.parent)

# Print the sibling rows of the <tr> that contains the first <td>
first_td = apple.parent
siblings = first_td.find_parent("tr").find_next_siblings()

print("Sibling rows:")
for sib in siblings:
    print(sib.td.text)

Parent TD: <td>Apple</td>
Sibling rows:
Banana
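
In the table above each row has a single cell, so the first <td> has no sibling cells and the code moves up to the <tr> instead. As a small sketch, assuming a hypothetical row with several cells, sibling cells can be read directly:

```python
from bs4 import BeautifulSoup

# Hypothetical row with three cells, used only to illustrate cell-level siblings
html = "<table><tr><td>Apple</td><td>Red</td><td>Sweet</td></tr></table>"
soup = BeautifulSoup(html, "html.parser")

first_td = soup.find("td")

# All <td> elements that follow the first cell inside the same <tr>
next_cells = [td.get_text() for td in first_td.find_next_siblings("td")]

# The cell immediately before the last one
last_td = soup.find_all("td")[-1]
previous_cell = last_td.find_previous_sibling("td").get_text()

print("Cells after Apple:", next_cells)
print("Cell before Sweet:", previous_cell)
```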


Question 8: Using SoupStrainer.
Aim:
Parse only <a> tags.

In [9]:
from bs4 import SoupStrainer

html = """
<html>
<a href="page1.html">Page 1</a>
<p>Paragraph</p>
<a href="page2.html">Page 2</a>
</html>
"""

only_a_tags = SoupStrainer("a")
soup = BeautifulSoup(html, "html.parser", parse_only=only_a_tags)

print(soup.prettify())


<a href="page1.html">
 Page 1
</a>
<a href="page2.html">
 Page 2
</a>
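
As a short follow-up sketch on the same HTML, the strained soup can be queried like any other. Only the <a> tags were parsed, so a search for <p> finds nothing.

```python
from bs4 import BeautifulSoup, SoupStrainer

html = """
<html>
<a href="page1.html">Page 1</a>
<p>Paragraph</p>
<a href="page2.html">Page 2</a>
</html>
"""

soup = BeautifulSoup(html, "html.parser", parse_only=SoupStrainer("a"))

# Collect the link targets from the tags that were kept
hrefs = [a["href"] for a in soup.find_all("a")]

# The <p> tag was never added to the tree, so find() returns None
paragraph = soup.find("p")

print("Links:", hrefs)
print("Paragraph:", paragraph)
```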



Question 9: Exception Handling (Enhanced).
Handled Exceptions:
- Timeout
- HTTPError
- RequestException
- AttributeError (table missing)

In [5]:
import requests
from bs4 import BeautifulSoup

url = "https://en.wikipedia.org/wiki/List_of_countries_and_dependencies_by_population"

headers = {"User-Agent": "Mozilla/5.0"}

try:
    response = requests.get(url, headers=headers, timeout=5)
    response.raise_for_status()

    soup = BeautifulSoup(response.text, "html.parser")

    table = soup.find("table")

    if table is None:
        raise AttributeError("Table not found")

    rows = table.find_all("tr")

    for row in rows:
        cells = row.find_all(["th", "td"])
        data = [cell.get_text(strip=True) for cell in cells]

        if data:
            print(data)

# Specific Exceptions
except requests.exceptions.Timeout:
    print("Request timed out")

except requests.exceptions.HTTPError as e:
    print("HTTP Error:", e)

except requests.exceptions.RequestException as e:
    print("Request Error:", e)

except AttributeError as e:
    print("Parsing Error:", e)

['Location', 'Population', '% ofworld', 'Date', 'Source (official or fromtheUnited Nations)', 'Notes']
['World', '8,232,000,000', '100%', '13 Jun 2025', 'UN projection[1][3]', '']
['India', '1,417,492,000', '17.2%', '1 Jul 2025', 'Official projection[4]', '[b]']
['China', '1,404,890,000', '17.1%', '31 Dec 2025', 'Official estimate[5]', '[c]']
['United States', '341,784,857', '4.2%', '1 Jul 2025', 'Official estimate[6]', '[d]']
['Indonesia', '284,438,782', '3.5%', '30 Jun 2025', 'National annual projection[7]', '']
['Pakistan', '241,499,431', '2.9%', '1 Mar 2023', '2023 census result[8]', '[e]']
['Nigeria', '223,800,000', '2.7%', '1 Jul 2023', 'Official projection[9]', '']
['Brazil', '213,421,037', '2.6%', '1 Jul 2025', 'Official estimate[10]', '']
['Bangladesh', '169,828,911', '2.1%', '14 Jun 2022', '2022 census result[11]', '[f]']
['Russia', '146,028,325', '1.8%', '1 Jan 2025', 'Official estimate[13]', '[g]']
['Mexico', '130,760,049', '1.6%', '30 Sep 2025', 'National quarterly estimat

Conclusion

In this lab, web scraping was performed using the Python libraries requests and BeautifulSoup. The exercises demonstrated how to fetch and parse HTML content, extract data such as titles, links, headings, and tables, and navigate, modify, and selectively parse HTML elements. Exception handling was implemented for timeouts, HTTP errors, other request failures, and missing elements. The lab gave practical experience in data extraction and HTML parsing with Python.