LAB REPORT
Web Scraping using Python (Requests & BeautifulSoup)

Objective

To practice web scraping techniques using Python libraries requests, BeautifulSoup, and pandas, including HTML parsing, navigation, table scraping, selectors, and exception handling.

Tools & Libraries Used:
- Python 3
- requests
- bs4 (BeautifulSoup)
- pandas
- csv

Theory

Web scraping is the process of automatically extracting information from websites with a program. Web pages are delivered as HTML, which is built from elements made up of tags, attributes, and content. Python provides libraries for fetching these pages and for parsing their markup into data that can be processed.

The requests library is used to send HTTP requests to a web server and retrieve the HTML content of a webpage. It supports features such as handling HTTP responses, timeouts, and network-related exceptions.
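As a small offline illustration of the response handling described above, the sketch below builds a requests.Response object by hand and shows how raise_for_status() turns a 4xx status into an HTTPError. Constructing the response manually is an assumption made only so the example runs without network access (real code would get it from requests.get), and the URL is a placeholder.

```python
import requests

# Hand-built response: stands in for what requests.get() would return.
response = requests.Response()
response.status_code = 404
response.url = "https://example.com/missing"  # placeholder URL

try:
    response.raise_for_status()   # raises HTTPError for 4xx/5xx status codes
    outcome = "ok"
except requests.exceptions.HTTPError as e:
    outcome = "http-error"
    print("HTTP Error:", e)

print("Outcome:", outcome)
```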

BeautifulSoup, from the bs4 module, is used to parse HTML documents and convert them into a structured tree format. It allows easy searching, navigation, and modification of HTML elements using methods like find(), find_all(), and CSS selectors.
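The three search styles named above can be compared on a small document. This is a minimal sketch; the HTML string and its class names are made up for illustration.

```python
from bs4 import BeautifulSoup

# Illustrative HTML, not taken from any real page
html = """
<html><head><title>Demo</title></head>
<body>
<h1 id="top">Main heading</h1>
<p class="note">First note</p>
<p class="note">Second note</p>
</body></html>
"""

soup = BeautifulSoup(html, "html.parser")

first_p = soup.find("p")        # first matching tag, or None if there is none
all_p = soup.find_all("p")      # list of every matching tag
by_css = soup.select("p.note")  # CSS selector: <p> elements with class "note"

print(soup.title.string)
print(first_p.get_text())
print(len(all_p), [p.get_text() for p in by_css])
```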

Web scraping often involves navigating through HTML tags, extracting specific data such as links, headings, and tables, and storing the extracted data in files like CSV for further analysis. Proper exception handling is important to manage errors such as connection failures, invalid responses, or missing elements during scraping.

Question 1: Basic HTML Request & Parsing
Aim

Fetch HTML content from GeeksforGeeks and print the <title> of the webpage while handling exceptions.

In [2]:
import requests
from bs4 import BeautifulSoup

url = "https://www.geeksforgeeks.org"

try:
    response = requests.get(url, timeout=10)
    response.raise_for_status()   # Raises HTTPError for bad responses
    
    soup = BeautifulSoup(response.text, "html.parser")
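    title_tag = soup.title  # None when the page has no <title>
    print("Page Title:", title_tag.get_text(strip=True) if title_tag else "(no title found)")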

except requests.exceptions.HTTPError as e:
    print("HTTP Error:", e)
except requests.exceptions.RequestException as e:
    print("Network Error:", e)


Page Title: GeeksforGeeks | Your All-in-One Learning Portal


Question 2: Extract Links
Aim

Extract and print the first 5 hyperlinks using .find_all() with limit=5 (.find() would return only the first match).

In [3]:
links = soup.find_all("a", limit=5)

print("First 5 hyperlinks:")
for link in links:
    print(link.get_text(strip=True), "-", link.get("href"))


First 5 hyperlinks:
 - https://www.geeksforgeeks.org/
DSA - https://www.geeksforgeeks.org/dsa/dsa-tutorial-learn-data-structures-and-algorithms/
Practice Problems - https://www.geeksforgeeks.org/explore
C - https://www.geeksforgeeks.org/c/c-programming-language/
C++ - https://www.geeksforgeeks.org/cpp/c-plus-plus/
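Since the cell above only calls find_all(), the sketch below contrasts it with find() and also shows one way to turn relative links into absolute ones. The HTML and base_url are placeholders chosen for illustration, not content from GeeksforGeeks.

```python
from urllib.parse import urljoin
from bs4 import BeautifulSoup

base_url = "https://example.com/docs/"   # placeholder base URL
html = """
<a href="/home">Home</a>
<a href="guide.html">Guide</a>
<a>No link</a>
"""

soup = BeautifulSoup(html, "html.parser")

first_link = soup.find("a")               # a single Tag (or None)
first_two = soup.find_all("a", limit=2)   # a list of at most 2 Tags

# href=True keeps only anchors that actually carry an href attribute
absolute = [urljoin(base_url, a["href"]) for a in soup.find_all("a", href=True)]

print(first_link.get_text(), len(first_two))
print(absolute)
```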


Question 3: Extract Headings & Save to CSV
Aim
- Extract all <h2> headings
- Extract all <a> tags
- Save headings to headings.csv

In [4]:
import csv

# Extract h2 headings
headings = [h.text.strip() for h in soup.find_all("h2")]

# Extract all links (skip <a> tags that have no href attribute)
all_links = [a["href"] for a in soup.find_all("a", href=True)]

# Save headings to CSV
with open("headings.csv", "w", newline="", encoding="utf-8") as file:
    writer = csv.writer(file)
    writer.writerow(["Headings"])
    for h in headings:
        writer.writerow([h])

print("Headings saved to headings.csv")


Headings saved to headings.csv
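The links collected in this question are not written anywhere. One possible way to store them alongside their text is csv.DictWriter, sketched below. The sample HTML and the column names "text" and "href" are assumptions, and an in-memory buffer stands in for a real file such as links.csv.

```python
import csv
import io
from bs4 import BeautifulSoup

# Illustrative anchors; the middle one has no href and is skipped
html = '<a href="/a">Alpha</a><a>Missing</a><a href="/b">Beta</a>'
soup = BeautifulSoup(html, "html.parser")

rows = [
    {"text": a.get_text(strip=True), "href": a["href"]}
    for a in soup.find_all("a", href=True)
]

# In-memory stand-in for open("links.csv", "w", newline="", encoding="utf-8")
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["text", "href"])
writer.writeheader()
writer.writerows(rows)

print(buffer.getvalue())
```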


Question 4: Scrape Wikipedia Table
Aim

Scrape the first table from the Wikipedia page "List of countries by population".

In [5]:
url = "https://en.wikipedia.org/wiki/List_of_countries_by_population"

try:
    response = requests.get(url, timeout=10)
    response.raise_for_status()
    
    soup = BeautifulSoup(response.text, "html.parser")
    table = soup.find("table", class_="wikitable")

    for row in table.find_all("tr")[1:]:
        cells = [cell.text.strip() for cell in row.find_all(["td", "th"])]
        print(cells)

except requests.exceptions.Timeout:
    print("Request timed out")
except requests.exceptions.HTTPError as e:
    print("HTTP Error:", e)
except requests.exceptions.RequestException as e:
    print("Request Error:", e)
except AttributeError:
    print("Table not found")


HTTP Error: 403 Client Error: Forbidden for url: https://en.wikipedia.org/wiki/List_of_countries_by_population
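The 403 response above is most likely due to the request being sent with the default python-requests User-Agent, which Wikimedia sites tend to reject. The sketch below shows one possible adjustment, not a verified fix: it sends an identifying User-Agent header and keeps the table parsing in a separate function so it can be checked offline. The header value, the function names fetch_html and first_table_rows, and the sample table are all hypothetical.

```python
import requests
from bs4 import BeautifulSoup

# Hypothetical identifying User-Agent; replace with your own project and contact.
HEADERS = {"User-Agent": "LabReportScraper/0.1 (student exercise; contact@example.com)"}

def fetch_html(url):
    """Fetch a page while sending an explicit User-Agent header."""
    response = requests.get(url, headers=HEADERS, timeout=10)
    response.raise_for_status()
    return response.text

def first_table_rows(html):
    """Return the cell texts of the first 'wikitable', or [] if none exists."""
    soup = BeautifulSoup(html, "html.parser")
    table = soup.find("table", class_="wikitable")
    if table is None:
        return []
    return [
        [cell.get_text(strip=True) for cell in row.find_all(["th", "td"])]
        for row in table.find_all("tr")
    ]

# Offline check with a tiny stand-in table (not real Wikipedia data)
sample = """
<table class="wikitable">
<tr><th>Name</th><th>Value</th></tr>
<tr><td>A</td><td>1</td></tr>
</table>
"""
print(first_table_rows(sample))
```

With network access, the two pieces would be combined as first_table_rows(fetch_html(url)), still wrapped in the same try/except structure as the cell above.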


Question 5: Selectors & Navigation
HTML Snippet

In [6]:
html = """
<html><body>
<p class="intro">Welcome</p>
<p class="intro">Learn Python</p>
<a href="https://python.org">Python</a>
</body></html>
"""

soup = BeautifulSoup(html, "html.parser")

# Extract <p> tags with class intro
intro_paragraphs = soup.find_all("p", class_="intro")
print("Intro Paragraphs:", [p.text for p in intro_paragraphs])

# Parent of <a> tag
a_tag = soup.find("a")
print("Parent of <a>:", a_tag.parent.name)

# Next sibling of first <p>
print("Next sibling:", intro_paragraphs[0].find_next_sibling().text)


Intro Paragraphs: ['Welcome', 'Learn Python']
Parent of <a>: body
Next sibling: Learn Python
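The cell above uses find_all() and navigation attributes only. The same snippet can also be queried with CSS selectors through select() and select_one(), as in this minimal sketch (the variable names are illustrative):

```python
from bs4 import BeautifulSoup

html = """
<html><body>
<p class="intro">Welcome</p>
<p class="intro">Learn Python</p>
<a href="https://python.org">Python</a>
</body></html>
"""
soup = BeautifulSoup(html, "html.parser")

# Class selector: every <p> with class "intro"
intro_texts = [p.get_text() for p in soup.select("p.intro")]

# Attribute selector: first <a> that has an href attribute
link = soup.select_one("a[href]")

# Child combinator: tags that are direct children of <body>
direct_children = [tag.name for tag in soup.select("body > *")]

print(intro_texts)
print(link["href"])
print(direct_children)
```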


Question 6: Tag Manipulation
Aim

Modify <b class="boldest">Hello</b>

In [7]:
html = '<b class="boldest">Hello</b>'
soup = BeautifulSoup(html, "html.parser")

tag = soup.b
tag.name = "strong"
tag["id"] = "greeting"
tag.string = "Hi there"

print(soup)


<strong class="boldest" id="greeting">Hi there</strong>
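Beyond renaming a tag and setting attributes, elements can also be removed from or added to the tree. The sketch below, using the same starting markup, deletes an attribute and nests a newly created tag; the <em> element and its text are made up for illustration.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<b class="boldest">Hello</b>', "html.parser")
tag = soup.b

del tag["class"]          # remove an attribute

em = soup.new_tag("em")   # create a new element owned by this soup
em.string = "world"

tag.append(" ")           # plain strings can be appended as text nodes
tag.append(em)            # nest the new element inside <b>

print(soup)
```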


Question 7: Advanced Navigation
HTML Table

In [8]:
html = """
<table>
<tr><td>Apple</td></tr>
<tr><td>Banana</td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")

# Find string "Apple"
apple = soup.find(string="Apple")
print("Parent <td>:", apple.parent)

# Siblings of the first <td> (empty here: each <td> is the only child of its <tr>)
siblings = apple.parent.find_next_siblings()
print("Siblings:", siblings)


Parent <td>: <td>Apple</td>
Siblings: []
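The empty list above follows from the table structure: each <td> sits alone in its own <tr>, so it has no sibling cells. If the goal is to reach the other rows, one option is to move up to the enclosing <tr> first, as in this sketch (variable names are illustrative):

```python
from bs4 import BeautifulSoup

html = """
<table>
<tr><td>Apple</td></tr>
<tr><td>Banana</td></tr>
</table>
"""
soup = BeautifulSoup(html, "html.parser")

apple_cell = soup.find("td", string="Apple")   # the <td> whose text is "Apple"
row = apple_cell.find_parent("tr")             # climb to the enclosing row
later_rows = row.find_next_siblings("tr")      # rows that follow it
later_texts = [r.get_text(strip=True) for r in later_rows]

print(later_texts)
```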


Question 8: Using SoupStrainer
Aim

Parse only <a> tags.

In [9]:
from bs4 import SoupStrainer

html = """
<html>
<a href="page1.html">Page 1</a>
<p>Paragraph</p>
<a href="page2.html">Page 2</a>
</html>
"""

only_a_tags = SoupStrainer("a")
soup = BeautifulSoup(html, "html.parser", parse_only=only_a_tags)

print(soup.prettify())


<a href="page1.html">
 Page 1
</a>
<a href="page2.html">
 Page 2
</a>
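To see what the strainer actually discards, the same markup can be parsed with and without parse_only and the resulting tag names compared. A minimal sketch (variable names are illustrative):

```python
from bs4 import BeautifulSoup, SoupStrainer

html = """
<html>
<a href="page1.html">Page 1</a>
<p>Paragraph</p>
<a href="page2.html">Page 2</a>
</html>
"""

full = BeautifulSoup(html, "html.parser")
links_only = BeautifulSoup(html, "html.parser", parse_only=SoupStrainer("a"))

# find_all(True) matches every tag in the tree
full_tags = [t.name for t in full.find_all(True)]
strained_tags = [t.name for t in links_only.find_all(True)]
hrefs = [a["href"] for a in links_only.find_all("a")]

print(full_tags)
print(strained_tags)
print(hrefs)
```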



Question 9: Exception Handling (Enhanced)
Handled Exceptions:
- Timeout
- HTTPError
- RequestException
- AttributeError (table missing)

In [10]:
try:
    # url still holds the Wikipedia address assigned in Question 4
    response = requests.get(url, timeout=5)
    response.raise_for_status()
    
    soup = BeautifulSoup(response.text, "html.parser")
    table = soup.find("table")
    
    if table is None:
        raise AttributeError("Table not found")

except requests.exceptions.Timeout:
    print("Timeout occurred")
except requests.exceptions.HTTPError:
    print("HTTP error occurred")
except requests.exceptions.RequestException:
    print("Request exception occurred")
except AttributeError as e:
    print(e)


HTTP error occurred
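The order of the except clauses matters because Timeout and HTTPError are both subclasses of RequestException: if the general clause came first, the specific ones would never run. The sketch below checks the hierarchy and routes hand-raised exceptions through the same clause order; classify is a hypothetical helper written for this illustration.

```python
import requests

exc = requests.exceptions

# Both specific exceptions derive from the general RequestException
print(issubclass(exc.Timeout, exc.RequestException))
print(issubclass(exc.HTTPError, exc.RequestException))

def classify(error):
    """Raise the given exception and report which handler catches it."""
    try:
        raise error
    except exc.Timeout:
        return "timeout"
    except exc.HTTPError:
        return "http"
    except exc.RequestException:
        return "request"

print(classify(exc.Timeout()))
print(classify(exc.HTTPError()))
print(classify(exc.ConnectionError()))
```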


Conclusion

In this lab, web scraping was performed using the Python libraries requests and BeautifulSoup. The programs fetched and parsed HTML content, extracted titles, links, and headings, saved headings to a CSV file, and navigated HTML elements through parent and sibling lookups. The Wikipedia requests in Questions 4 and 9 returned HTTP 403 (Forbidden), so no table rows were scraped there; the exception handlers caught and reported the error instead of letting the program crash. The lab gave practical experience with data extraction, HTML parsing, and handling errors such as network failures and missing elements in Python.