> **NOTE:** For Q2, I switched from Google Colab to a Jupyter Notebook on my local machine, because imdb.com does not allow web scraping in headless mode and Google Colab does not have a GUI rendering engine.

# Lab Assignment 4
---

In [1]:
import numpy as np
import pandas as pd

## Q1. Write a Python program to scrape all available books from the website [Books to Scrape](https://books.toscrape.com/) - a live site built for practicing scraping (safe, legal, no anti-bot).
---

For each book, extract the following details:
1.   Title
2.   Price
3.   Availability (In stock / Out of stock)
4.   Star Rating (One, Two, Three, Four, Five)

Store the scraped results into a Pandas DataFrame and export them to a CSV file named books.csv.

(Note: Use the requests library to fetch the HTML page. Use BeautifulSoup to parse and extract the book details, and handle pagination so that books from all pages are scraped.)

In [2]:
%pip install beautifulsoup4 requests > /dev/null

Note: you may need to restart the kernel to use updated packages.


In [3]:
from bs4 import BeautifulSoup
import requests

In [4]:
BASE_URL = "https://books.toscrape.com/catalogue/category/books_1/page-{}.html"

In [5]:
books = []
page = 1

while True:
    try:
        response = requests.get(BASE_URL.format(page), timeout=10)
    except requests.RequestException as e:
        print(f"ERROR: {e}")
        break

    # Stop on an HTTP error status, such as a 404 once the page number passes the last page
    if not response.ok:
        break

    # response.text is the decoded Unicode string; response.content holds the raw bytes
    soup = BeautifulSoup(response.text, "html.parser")

    book_tags = soup.find_all("article", class_="product_pod")
    if not book_tags:
        break

    for book_tag in book_tags:
        book = {
            'title': book_tag.h3.a["title"], #type: ignore
            # [2:] skips the currency prefix in front of the number
            'price': float(book_tag.find("p", class_="price_color").text[2:]), #type: ignore
            'availability': book_tag.find("p", class_="instock availability").text.strip(), #type: ignore
            # The second CSS class of the tag is the rating word ("One" to "Five")
            'star_rating': book_tag.find("p", class_="star-rating")["class"][1] #type: ignore
        }
        books.append(book)

    page += 1

print(f"Scraped {len(books)} books.")

Scraped 1000 books.
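
The price text carries a currency prefix in front of the number, and the length of that prefix depends on how the response was decoded. As a hedged alternative to slicing off a fixed number of characters, the sketch below pulls the first decimal number out of the text with a regular expression. `parse_price` is a hypothetical helper, not part of the notebook, and the sample strings are assumed forms of the price text.

```python
import re

# Matches the first run of digits, optionally followed by a decimal part.
_NUMBER_RE = re.compile(r"\d+(?:\.\d+)?")


def parse_price(text):
    """Return the first number found in `text` as a float (hypothetical helper)."""
    match = _NUMBER_RE.search(text)
    if match is None:
        raise ValueError(f"No number found in price text: {text!r}")
    return float(match.group(0))


# Works regardless of how many characters the currency prefix occupies.
print(parse_price("£51.77"))
print(parse_price("Â£51.77"))
```

With such a helper, the price entry in the scraping loop would not depend on the decoded length of the currency symbol.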


In [6]:
df = pd.DataFrame(books)
df.head()

| | title | price | availability | star_rating |
|---|---|---|---|---|
| 0 | A Light in the Attic | 51.77 | In stock | Three |
| 1 | Tipping the Velvet | 53.74 | In stock | One |
| 2 | Soumission | 50.1 | In stock | One |
| 3 | Sharp Objects | 47.82 | In stock | Four |
| 4 | Sapiens: A Brief History of Humankind | 54.23 | In stock | Five |
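
The star rating is stored as a word ("One" to "Five"), which is what the assignment asks for. If a numeric value is wanted later for sorting or averaging, a small lookup is enough. This is only a sketch: `RATING_WORDS` and `rating_to_int` are hypothetical names that do not appear in the notebook.

```python
# Lookup from the rating word used on the site to its numeric value.
RATING_WORDS = {"One": 1, "Two": 2, "Three": 3, "Four": 4, "Five": 5}


def rating_to_int(word):
    """Convert a rating word such as "Three" to an integer (hypothetical helper)."""
    # Normalise surrounding whitespace and letter case before the lookup.
    return RATING_WORDS[word.strip().capitalize()]


print(rating_to_int("Three"))
```

On the DataFrame, the same mapping could be applied with `df["star_rating"].map(RATING_WORDS)`.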


In [7]:
df.to_csv('books.csv', index=False)

## Q2. Write a Python program to scrape the [IMDB Top 250 Movies list](https://www.imdb.com/chart/top/).
---

For each movie, extract the following details:

1.   Rank (1-250)
2.   Movie Title
3.   Year of Release
4.   IMDB Rating

Store the results in a Pandas DataFrame and export it to a CSV file named imdb_top250.csv.

(Note: Use Selenium/Playwright to scrape the required details from this website)

In [8]:
%pip install selenium webdriver-manager > /dev/null

Note: you may need to restart the kernel to use updated packages.


In [9]:
from selenium import webdriver

from selenium.webdriver.chrome.service import Service
from webdriver_manager.chrome import ChromeDriverManager

from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By

Make sure that Chromium is installed on your system. It can be installed with the command below:

`sudo apt-get install -y chromium-browser`

Use `which chromium-browser` to get the path to the binary.

Use `chromium-browser --version` to get the installed version.
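
The version number has to be copied by hand from the command's output. As a minimal sketch, the four-part version can be extracted from that output with a regular expression. `extract_version` is a hypothetical helper, and the sample output string is an assumption about the format, which may differ between builds.

```python
import re


def extract_version(version_output):
    """Return the first four-part version number in the text, or None (hypothetical helper)."""
    match = re.search(r"\d+(?:\.\d+){3}", version_output)
    return match.group(0) if match else None


# Assumed shape of the `chromium-browser --version` output.
sample_output = "Chromium 140.0.7339.127 snap"
print(extract_version(sample_output))
```

The text passed in could come from `subprocess.run(["chromium-browser", "--version"], capture_output=True, text=True).stdout`.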

In [10]:
CHROMIUM_PATH = "/usr/bin/chromium-browser"
CHROMIUM_VERSION = "140.0.7339.127"

In [11]:
URL = "https://www.imdb.com/chart/top/"

In [12]:
# Set up Chrome: download a chromedriver matching the installed Chromium version
service = Service(ChromeDriverManager(CHROMIUM_VERSION).install())

options = webdriver.ChromeOptions()
options.add_argument("--remote-debugging-port=9222")
options.binary_location = CHROMIUM_PATH  # point Selenium at the Chromium binary

driver = webdriver.Chrome(
    service=service,
    options=options,
)

In [13]:
driver.get(URL)

# Wait up to 10 seconds until the list items are present in the DOM
wait = WebDriverWait(driver, 10)
try:
    movie_tags = wait.until(
        EC.presence_of_all_elements_located((By.CSS_SELECTOR, "li.ipc-metadata-list-summary-item"))
    )
except Exception as e:
    print(f"Page load timeout: {e}")
    movie_tags = []  # keep the name defined so the next cell does not raise a NameError
    driver.quit()

In [14]:
movies = []
for rank, movie_tag in enumerate(movie_tags, 1):
    try:
        # The heading text has the form "<rank>. <title>"
        heading = movie_tag.find_element(By.CSS_SELECTOR, "h3.ipc-title__text.ipc-title__text--reduced").text
        movie = {
            "rank": rank,
            # Split only on the first ". " so that titles containing ". " stay whole
            "title": heading.split('. ', 1)[1].strip(),
            # Note: the generated class names in this selector may change when the site is rebuilt
            "release_year": int(movie_tag.find_element(By.CSS_SELECTOR, "span.sc-15ac7568-7.cCsint.cli-title-metadata-item").text),
            "rating": float(movie_tag.find_element(By.CSS_SELECTOR, "span.ipc-rating-star--rating").text.strip())
        }
        movies.append(movie)
    except Exception as e:
        print(f"Error parsing movie at rank {rank}: {e}")

print(f"Scraped {len(movies)}/250 movies.")
driver.quit()

Scraped 250/250 movies.
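
The heading of each list item carries the rank and the title together, separated by `". "`. A title can itself contain `". "`, so the separator has to be matched only once, at the start. The sketch below does this with an anchored regular expression. `split_rank_title` is a hypothetical helper, and the rank in the second sample string is made up for illustration.

```python
import re

# Leading digits, a dot, whitespace, then the rest of the line as the title.
_HEADING_RE = re.compile(r"^\s*(\d+)\.\s+(.+?)\s*$")


def split_rank_title(heading):
    """Split "<rank>. <title>" into (rank, title) (hypothetical helper)."""
    match = _HEADING_RE.match(heading)
    if match is None:
        raise ValueError(f"Unexpected heading format: {heading!r}")
    return int(match.group(1)), match.group(2)


print(split_rank_title("1. The Shawshank Redemption"))
# The rank below is illustrative only; the point is the ". " inside the title.
print(split_rank_title("12. Mr. Smith Goes to Washington"))
```

Parsing the rank from the heading also gives a cross-check against the position-based rank from `enumerate`.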


In [15]:
df = pd.DataFrame(movies).set_index('rank')
df.head()

| rank | title | release_year | rating |
|---|---|---|---|
| 1 | The Shawshank Redemption | 1994 | 9.3 |
| 2 | The Godfather | 1972 | 9.2 |
| 3 | The Dark Knight | 2008 | 9.1 |
| 4 | The Godfather Part II | 1974 | 9.0 |
| 5 | 12 Angry Men | 1957 | 9.0 |


In [16]:
df.to_csv("imdb_top250.csv")

## Q3. Write a Python program to scrape the weather information for top world cities from the given website [timeanddate.com](https://www.timeanddate.com/weather/).
---


For each city, extract the following details:

1.   City Name
2.   Temperature
3.   Weather Condition (e.g., Clear, Cloudy, Rainy, etc.)

Store the results in a Pandas DataFrame and export it to a CSV file named weather.csv.

In [17]:
from bs4 import BeautifulSoup
import requests

In [18]:
URL = "https://www.timeanddate.com/weather/"

In [19]:
try:
    response = requests.get(URL, timeout=10)
    response.raise_for_status()  # raise on an HTTP error instead of parsing an error page
    html = response.text

    soup = BeautifulSoup(html, "html.parser")

except Exception as e:
    print(f"ERROR: {e}")

In [20]:
weather = []

weather_table = soup.find("table", class_="zebra fw tb-theme")
# Skip the header row
weather_rows = weather_table.find_all("tr")[1:]  #type: ignore

for row in weather_rows:
    row_data = row.find_all("td")  #type: ignore
    # Each row holds several cities side by side, four cells per city
    city_cells = [row_data[i:i + 4] for i in range(0, len(row_data), 4)]
    for city in city_cells:
        try:
            city_weather = {
                "id": int(city[1]["id"].strip("p")),  #type: ignore
                "city": city[0].a.text.strip(),  #type: ignore
                # [:-3] drops the trailing three-character unit suffix
                "temperature_celsius": int(city[3].text.strip()[:-3]),
                "weather_condn": city[2].img["title"].strip(),  #type: ignore
            }
            weather.append(city_weather)
        except Exception as e:
            print(f"Error parsing city weather: {e}")

print(f"Scraped {len(weather)} cities.")

Scraped 140 cities.
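
The scraping cell groups the cells of each table row into runs of four, one run per city. The slicing idiom it relies on is shown in isolation below, as a sketch on plain lists. `chunked` is a hypothetical helper name, and the cell values are placeholders.

```python
def chunked(cells, size):
    """Split a list into consecutive groups of `size` items (hypothetical helper)."""
    # The last group is shorter when len(cells) is not a multiple of `size`.
    return [cells[i:i + size] for i in range(0, len(cells), size)]


# Placeholder cells for two cities in one row: name, id cell, icon, temperature.
row = ["city A", "time A", "icon A", "temp A",
       "city B", "time B", "icon B", "temp B"]

for group in chunked(row, 4):
    print(group)
```

A trailing group with fewer than four cells is still returned, which is why the scraping cell wraps the per-city parsing in `try`/`except`.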


In [21]:
df = pd.DataFrame(weather).set_index("id").sort_index()
df.head()

| id | city | temperature_celsius | weather_condn |
|---|---|---|---|
| 0 | Accra | 26 | Scattered clouds. Warm. |
| 1 | Addis Ababa | 18 | Passing clouds. Mild. |
| 2 | Adelaide | 10 | Quite cool. |
| 3 | Algiers | 26 | Passing clouds. Warm. |
| 4 | Almaty | 18 | Mostly cloudy. Mild. |
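
The temperature is taken from the cell text by cutting off a fixed-length unit suffix. A hedged alternative is to extract the signed integer directly, which also tolerates missing values. In the sketch below, `parse_temperature` is a hypothetical helper and the sample strings are assumed forms of the cell text, including a non-breaking space before the unit.

```python
import re

# Optional sign (ASCII minus, plus, or Unicode minus), then the digits.
_TEMP_RE = re.compile(r"([-+\u2212]?)\s*(\d+)")


def parse_temperature(text):
    """Return the first integer in `text`, honouring its sign, or None (hypothetical helper)."""
    match = _TEMP_RE.search(text)
    if match is None:
        return None
    sign, digits = match.groups()
    value = int(digits)
    return -value if sign in ("-", "\u2212") else value


print(parse_temperature("26\xa0°C"))
print(parse_temperature("-3 °C"))
print(parse_temperature("N/A"))
```

Returning `None` for unparsable text lets such rows be kept with a missing temperature instead of being skipped.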


In [22]:
df.to_csv("weather.csv", index=False)