# Assignment 4

This assignment will help you practice web scraping techniques by extracting structured data from a live practice website. You will learn how to navigate HTML structures, extract relevant information, and save it in a structured format for analysis.

Q1. Write a Python program to scrape all available books from the website (https://books.toscrape.com/) Books to Scrape – a live site built for practicing scraping (safe, legal, no anti-bot). For each book, extract the following details:
1. Title
2. Price
3. Availability (In stock / Out of stock)
4. Star Rating (One, Two, Three, Four, Five)
Store the scraped results into a Pandas DataFrame and export them to a CSV file named books.csv.
(Note: Use the requests library to fetch the HTML page. Use BeautifulSoup to parse and extract book details and handle pagination so that books from all pages are scraped)

In [4]:
import requests
from bs4 import BeautifulSoup
import pandas as pd

In [None]:
def get_data(pageno):
  url = "https://books.toscrape.com/catalogue/page-"+str(pageno)+".html"
  reponse = requests.get(url)
  reponse.raise_for_status()
  soup = BeautifulSoup(reponse.content, 'html.parser')
  data = []
  for d in soup.find_all('li', attrs={'class':"col-xs-6 col-sm-4 col-md-3 col-lg-3"}):
    book_name = d.find('h3').find('a').get('title')
    rating = d.find('p').get('class')[-1]
    price = d.find('div',attrs={'class':"product_price"}).find('p',attrs={'class':'price_color'}).get_text()
    availability = d.find('div',attrs={'class':"product_price"}).find('p',attrs={'class':"instock availability"}).get_text(strip=True)
    data.append({'Book_name':book_name,
                 'Price':price,
                 'Rating(stars)':rating,
                 'Availability':availability})
  return data

In [None]:
master = []
for i in range(1,50):
  master.extend(get_data(i))

df = pd.DataFrame(master)

In [None]:
df

Unnamed: 0,Book_name,Price,Rating(stars),Availability
0,A Light in the Attic,£51.77,Three,In stock
1,Tipping the Velvet,£53.74,One,In stock
2,Soumission,£50.10,One,In stock
3,Sharp Objects,£47.82,Four,In stock
4,Sapiens: A Brief History of Humankind,£54.23,Five,In stock
...,...,...,...,...
975,Icing (Aces Hockey #2),£40.44,Four,In stock
976,"Hawkeye, Vol. 1: My Life as a Weapon (Hawkeye #1)",£45.24,Three,In stock
977,Having the Barbarian's Baby (Ice Planet Barbar...,£34.96,Four,In stock
978,"Giant Days, Vol. 1 (Giant Days #1-4)",£56.76,Four,In stock


Q2. Write a Python program to scrape the IMDB Top 250 Movies list (https://www.imdb.com/chart/top/) . For each movie, extract the following details:
1. Rank (1–250)
2. Movie Title
3. Year of Release
4. IMDB Rating
Store the results in a Pandas DataFrame and export it to a CSV file named imdb_top250.csv.
(Note: Use Selenium/Playwright to scrape the required details from this website)

In [6]:
!pip install selenium webdriver-manager

Collecting selenium
  Using cached selenium-4.35.0-py3-none-any.whl.metadata (7.4 kB)
Collecting webdriver-manager
  Downloading webdriver_manager-4.0.2-py2.py3-none-any.whl.metadata (12 kB)
Collecting trio~=0.30.0 (from selenium)
  Downloading trio-0.30.0-py3-none-any.whl.metadata (8.5 kB)
Collecting trio-websocket~=0.12.2 (from selenium)
  Downloading trio_websocket-0.12.2-py3-none-any.whl.metadata (5.1 kB)
Collecting typing_extensions~=4.14.0 (from selenium)
  Downloading typing_extensions-4.14.1-py3-none-any.whl.metadata (3.0 kB)
Collecting outcome (from trio~=0.30.0->selenium)
  Downloading outcome-1.3.0.post0-py2.py3-none-any.whl.metadata (2.6 kB)
Collecting wsproto>=0.14 (from trio-websocket~=0.12.2->selenium)
  Downloading wsproto-1.2.0-py3-none-any.whl.metadata (5.6 kB)
Downloading selenium-4.35.0-py3-none-any.whl (9.6 MB)
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m9.6/9.6 MB[0m [31m66.5 MB/s[0m eta [36m0:00:00[0m
[?25hDownloading webdriver_manager-4.0

In [12]:
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
import pandas as pd
import time

# Set up Chrome options for Google Colab
chrome_options = Options()
chrome_options.add_argument('--headless=new')
chrome_options.add_argument('--no-sandbox')
chrome_options.add_argument('--disable-dev-shm-usage')
chrome_options.add_argument('--disable-gpu')
chrome_options.add_argument('--window-size=1920,1080') # Needed for some sites to load all elements
chrome_options.add_argument('--remote-debugging-port=9222')

# Initialize WebDriver for Google Colab
# In Colab, you can directly use the webdriver.Chrome() constructor
# since the required binaries are pre-installed.
driver = webdriver.Chrome(options=chrome_options)

try:
    url = "https://www.imdb.com/chart/top/"
    html=driver.get(url)

    # Wait for the main movie list to load
    WebDriverWait(driver, 20).until(
        EC.presence_of_all_elements_located((By.CSS_SELECTOR, "li.ipc-metadata-list__item.cli-parent"))
    )

    movies = driver.find_elements(By.CSS_SELECTOR, "li.ipc-metadata-list__item.cli-parent")

    data = []
    for movie in movies:
        try:
            # Get the rank and title
            title_element = movie.find_element(By.CSS_SELECTOR, "h3.ipc-title__text")
            full_title = title_element.text.strip()
            rank, title = full_title.split('.', 1)
            rank = rank.strip()
            title = title.strip()

            # Get the year of release
            year_element = movie.find_element(By.CSS_SELECTOR, "span.cli-title-metadata-item")
            year = year_element.text.strip()

            # Get the IMDB rating
            rating_element = movie.find_element(By.CSS_SELECTOR, "span.ipc-rating-star--imdb")
            rating = rating_element.get_attribute("aria-label").split()[3]

            data.append({
                "Rank": int(rank),
                "Movie Title": title,
                "Year of Release": int(year),
                "IMDB Rating": float(rating)
            })
        except Exception as e:
            print(f"Error extracting data for a movie: {e}")
            continue

    df = pd.DataFrame(data)
    print("Scraped Data:")
    print(df.head()) # Print the head to see the first few rows

    # Save to CSV
    csv_filename = "imdb_top_250.csv"
    df.to_csv(csv_filename, index=False)
    print(f"\nData saved to {csv_filename}")

    # Download the CSV file in Colab
    # This is the correct way to trigger a download
    from google.colab import files
    files.download(csv_filename)
    print("File download triggered. Check your browser for the file.")

finally:
    driver.quit()
    print("WebDriver closed.")

WebDriver closed.


TimeoutException: Message: 


Using Beautiful Soup as above approach failed

Q3. Write a Python program to scrape the weather information for top world cities from the given website (https://www.timeanddate.com/weather/) . For each city, extract the following details:
1. City Name
2. Temperature
3. Weather Condition (e.g., Clear, Cloudy, Rainy, etc.)
Store the results in a Pandas DataFrame and export it to a CSV file named weather.csv.

In [1]:
import requests
from bs4 import BeautifulSoup
import pandas as pd

url = "https://www.timeanddate.com/weather"
res = requests.get(url)

soup = BeautifulSoup(res.text, "html.parser")

movies = []
for row in soup.find_all("tr"):
    cols = row.find_all("td")
    if len(cols) >= 4:  # only process rows that have enough columns
        city = cols[0].get_text(strip=True)

        # weather condition from
        condition = None
        if cols[2].find("img"):
            condition = cols[2].find("img")["alt"]

        temp = cols[3].get_text(strip=True)

        movies.append({"City": city, "Temperature": temp, "Condition": condition})

df = pd.DataFrame(movies)
df.to_csv("weather.csv", index=False)

print(df.head(10))

          City Temperature                           Condition
0        Accra       77 °F               Passing clouds. Warm.
1  Addis Ababa       57 °F                 Partly sunny. Cool.
2     Adelaide       66 °F                  Refreshingly cool.
3      Algiers       86 °F               Passing clouds. Warm.
4       Almaty       82 °F      Partly sunny. Pleasantly warm.
5        Amman       68 °F                         Haze. Mild.
6   Amsterdam*       66 °F               Passing clouds. Mild.
7       Anadyr       58 °F  Passing clouds. Refreshingly cool.
8   Anchorage*       55 °F                     Overcast. Cool.
9       Ankara       64 °F             Scattered clouds. Mild.
