# Web Scraping

### Project Description

The goal of this task is to practice web scraping by extracting structured information from the site Quotes to Scrape:
https://quotes.toscrape.com/


### Implementing two separate solutions

1. Using Beautiful Soup (requests + bs4)


*  Collecting the first 10 quotes with their authors and tags.
*  Collecting the first 100 quotes with their authors and tags.


2. Using Selenium (browser automation)

* Performing the same extraction (quotes, authors, tags) by automating a web browser to navigate through the site’s pages until the required number of results is collected.


The final output is stored in a structured format (e.g., CSV or JSON) containing:

* Quote text
* Author name
* Associated tags

## Beautiful Soup (requests + bs4)

### Importing libraries

In [3]:
# bs4_quotes.py
!pip install requests beautifulsoup4
import requests
from bs4 import BeautifulSoup
from dataclasses import dataclass
from typing import List
import csv, json





### Data model

In [4]:
@dataclass
class Quote:
    text: str
    author: str
    tags: List[str]
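
To illustrate the data model, here is a minimal sketch that builds one `Quote` by hand and converts it to a plain dictionary with `dataclasses.asdict`. The values are copied from the first quote on the site; the names `sample` and `record` are only for illustration.

```python
from dataclasses import dataclass, asdict
from typing import List

@dataclass
class Quote:
    text: str
    author: str
    tags: List[str]

# Build one record by hand (values copied from the site's first quote).
sample = Quote(
    text="The world as we have created it is a process of our thinking. "
         "It cannot be changed without changing our thinking.",
    author="Albert Einstein",
    tags=["change", "deep-thoughts", "thinking", "world"],
)

# asdict() returns a plain dict, which is convenient for JSON serialization.
record = asdict(sample)
print(record["author"], len(record["tags"]))
```

Because the dictionary keys match the field names, `Quote(**record)` rebuilds an equal object.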


### Scraping function with BeautifulSoup

In [5]:
BASE = "https://quotes.toscrape.com"

def scrape_bs4(limit: int) -> List[Quote]:
    """Collect up to `limit` quotes, following the Next link from page to page."""
    url = f"{BASE}/page/1/"
    out: List[Quote] = []

    while url and len(out) < limit:
        r = requests.get(url, timeout=30)
        r.raise_for_status()  # fail loudly on HTTP errors
        soup = BeautifulSoup(r.text, "html.parser")

        for box in soup.select(".quote"):
            # Strip the curly quotes that wrap each quote text on the page.
            text = box.select_one("span.text").get_text(strip=True).strip("“”\"")
            author = box.select_one("small.author").get_text(strip=True)
            tags = [t.get_text(strip=True) for t in box.select(".tags a.tag")]
            out.append(Quote(text, author, tags))
            if len(out) >= limit:
                break

        # The Next link is root-relative (e.g. /page/2/); stop when it is missing.
        nxt = soup.select_one("li.next > a")
        url = (BASE + nxt["href"]) if nxt else None

    return out
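
The scraper builds the next URL by concatenating `BASE` with the link's `href`, which works because the site uses root-relative links such as `/page/2/`. As a hedged alternative, `urllib.parse.urljoin` from the standard library resolves relative, root-relative, and absolute links alike. The helper name `next_url` is hypothetical and not part of the notebook.

```python
from typing import Optional
from urllib.parse import urljoin

def next_url(current_url: str, href: Optional[str]) -> Optional[str]:
    """Resolve the Next link against the page it was found on."""
    if not href:
        return None  # no Next link: last page reached
    return urljoin(current_url, href)

print(next_url("https://quotes.toscrape.com/page/1/", "/page/2/"))
```

In `scrape_bs4` this would replace the `BASE + nxt["href"]` expression, with `url` passed as the first argument.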

### Save helpers (CSV / JSONL)

CSV

In [6]:
def save_csv(quotes: List[Quote], path: str):
    with open(path, "w", newline="", encoding="utf-8") as f:
        w = csv.writer(f)
        w.writerow(["text", "author", "tags"])
        for q in quotes:
            w.writerow([q.text, q.author, ", ".join(q.tags)])
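
Since `save_csv` flattens the tag list into one `", "`-joined column, reading the file back requires splitting that column again. The sketch below shows one way to do it, assuming no tag itself contains `", "`; the helper name `load_rows` and the in-memory sample are hypothetical.

```python
import csv
import io

# A small in-memory CSV in the same layout that save_csv writes.
sample_csv = (
    "text,author,tags\r\n"
    '"It is our choices, Harry, that show what we truly are, far more than our abilities.",'
    'J.K. Rowling,"abilities, choices"\r\n'
)

def load_rows(f):
    """Read rows and split the joined tags column back into a list."""
    rows = []
    for row in csv.DictReader(f):
        row["tags"] = [t for t in row["tags"].split(", ") if t]
        rows.append(row)
    return rows

rows = load_rows(io.StringIO(sample_csv, newline=""))
print(rows[0]["author"], rows[0]["tags"])
```

For a real file, pass `open(path, newline="", encoding="utf-8")` instead of the `StringIO` object.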

JSONL (one JSON object per line)

In [7]:
def save_jsonl(quotes: List[Quote], path: str):
    with open(path, "w", encoding="utf-8") as f:
        for q in quotes:
            f.write(json.dumps(q.__dict__, ensure_ascii=False) + "\n")
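
A JSONL file can be read back one line at a time. The following is a minimal sketch of a matching loader; `load_jsonl` is a hypothetical helper, and the sample line mirrors what `save_jsonl` would write for one quote.

```python
import io
import json
from dataclasses import dataclass
from typing import List

@dataclass
class Quote:
    text: str
    author: str
    tags: List[str]

def load_jsonl(f) -> List[Quote]:
    """Parse one JSON object per non-empty line into Quote objects."""
    return [Quote(**json.loads(line)) for line in f if line.strip()]

sample = io.StringIO(
    '{"text": "It is our choices, Harry, that show what we truly are, '
    'far more than our abilities.", '
    '"author": "J.K. Rowling", "tags": ["abilities", "choices"]}\n'
)
loaded = load_jsonl(sample)
print(loaded[0].author, loaded[0].tags)
```

Unlike the CSV format, JSONL keeps the tags as a real list, so no splitting is needed on the way back.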


### Run for first 10

In [8]:
quotes10 = scrape_bs4(limit=10)
save_csv(quotes10, "quotes_10_bs4.csv")
print("First 10 quotes saved!")
quotes10[:3]  # preview first 3


First 10 quotes saved!


[Quote(text='The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.', author='Albert Einstein', tags=['change', 'deep-thoughts', 'thinking', 'world']),
 Quote(text='It is our choices, Harry, that show what we truly are, far more than our abilities.', author='J.K. Rowling', tags=['abilities', 'choices']),
 Quote(text='There are only two ways to live your life. One is as though nothing is a miracle. The other is as though everything is a miracle.', author='Albert Einstein', tags=['inspirational', 'life', 'live', 'miracle', 'miracles'])]

In [10]:
quotes10

[Quote(text='The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.', author='Albert Einstein', tags=['change', 'deep-thoughts', 'thinking', 'world']),
 Quote(text='It is our choices, Harry, that show what we truly are, far more than our abilities.', author='J.K. Rowling', tags=['abilities', 'choices']),
 Quote(text='There are only two ways to live your life. One is as though nothing is a miracle. The other is as though everything is a miracle.', author='Albert Einstein', tags=['inspirational', 'life', 'live', 'miracle', 'miracles']),
 Quote(text='The person, be it gentleman or lady, who has not pleasure in a good novel, must be intolerably stupid.', author='Jane Austen', tags=['aliteracy', 'books', 'classic', 'humor']),
 Quote(text="Imperfection is beauty, madness is genius and it's better to be absolutely ridiculous than absolutely boring.", author='Marilyn Monroe', tags=['be-yourself', 'inspirational']),
 Quote(text='Try not

### Run for first 100

In [9]:
quotes100 = scrape_bs4(limit=100)
save_csv(quotes100, "quotes_100_bs4.csv")
print("First 100 quotes saved!")
len(quotes100)  # confirm count

First 100 quotes saved!


100

In [11]:
quotes100

[Quote(text='The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.', author='Albert Einstein', tags=['change', 'deep-thoughts', 'thinking', 'world']),
 Quote(text='It is our choices, Harry, that show what we truly are, far more than our abilities.', author='J.K. Rowling', tags=['abilities', 'choices']),
 Quote(text='There are only two ways to live your life. One is as though nothing is a miracle. The other is as though everything is a miracle.', author='Albert Einstein', tags=['inspirational', 'life', 'live', 'miracle', 'miracles']),
 Quote(text='The person, be it gentleman or lady, who has not pleasure in a good novel, must be intolerably stupid.', author='Jane Austen', tags=['aliteracy', 'books', 'classic', 'humor']),
 Quote(text="Imperfection is beauty, madness is genius and it's better to be absolutely ridiculous than absolutely boring.", author='Marilyn Monroe', tags=['be-yourself', 'inspirational']),
 Quote(text='Try not

## Selenium (Headless Chrome)

Running Selenium with headless Chrome means automating Google Chrome in the background, so you can scrape data without opening a visible browser window.

### Imports

Install Chromium + driver

In [30]:
!wget -q https://dl.google.com/linux/direct/google-chrome-stable_current_amd64.deb
!apt-get update -y
!apt-get install -y ./google-chrome-stable_current_amd64.deb



In [27]:
!apt-get update -y
!apt-get install -y chromium-browser chromium-chromedriver


Hit:1 https://cli.github.com/packages stable InRelease
Hit:2 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64  InRelease
Hit:3 https://cloud.r-project.org/bin/linux/ubuntu jammy-cran40/ InRelease
Hit:4 https://r2u.stat.illinois.edu/ubuntu jammy InRelease
Hit:5 http://security.ubuntu.com/ubuntu jammy-security InRelease
Hit:6 http://archive.ubuntu.com/ubuntu jammy InRelease
Hit:7 http://archive.ubuntu.com/ubuntu jammy-updates InRelease
Hit:8 http://archive.ubuntu.com/ubuntu jammy-backports InRelease
Hit:9 https://ppa.launchpadcontent.net/deadsnakes/ppa/ubuntu jammy InRelease
Hit:10 https://ppa.launchpadcontent.net/graphics-drivers/ppa/ubuntu jammy InRelease
Hit:11 https://ppa.launchpadcontent.net/ubuntugis/ppa/ubuntu jammy InRelease
Reading package lists... Done
W: Skipping acquire of configured file 'main/source/Sources' as repository 'https://r2u.stat.illinois.edu/ubuntu jammy InRelease' does not seem to provide it (sources.list entr

In [31]:
!pip install -q selenium webdriver-manager

from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.chrome.service import Service
from webdriver_manager.chrome import ChromeDriverManager

opts = Options()
opts.add_argument("--headless=new")
opts.add_argument("--no-sandbox")
opts.add_argument("--disable-dev-shm-usage")
opts.add_argument("--window-size=1280,800")

driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()), options=opts)
driver.get("https://quotes.toscrape.com/")
print(driver.title)   # should print: Quotes to Scrape
driver.quit()


Quotes to Scrape


### Data model

In [32]:
from dataclasses import dataclass
from typing import List

@dataclass
class Quote:
    text: str
    author: str
    tags: List[str]


### Scraper function

In [33]:
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
import time

BASE = "https://quotes.toscrape.com"

def scrape_selenium(limit: int) -> List[Quote]:
    """Collect up to `limit` quotes by driving headless Chrome through the site's pages."""
    # re-use the same Options pattern you already used
    from selenium import webdriver
    from selenium.webdriver.chrome.options import Options
    from selenium.webdriver.chrome.service import Service
    from webdriver_manager.chrome import ChromeDriverManager

    opts = Options()
    opts.add_argument("--headless=new")
    opts.add_argument("--no-sandbox")
    opts.add_argument("--disable-dev-shm-usage")
    opts.add_argument("--window-size=1280,800")

    driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()), options=opts)
    wait = WebDriverWait(driver, 15)
    out: List[Quote] = []
    try:
        driver.get(f"{BASE}/page/1/")
        while len(out) < limit:
            # Wait until the quote boxes of the current page are in the DOM.
            wait.until(EC.presence_of_all_elements_located((By.CSS_SELECTOR, ".quote")))
            boxes = driver.find_elements(By.CSS_SELECTOR, ".quote")
            for box in boxes:
                text   = box.find_element(By.CSS_SELECTOR, "span.text").text.strip("“”\"")
                author = box.find_element(By.CSS_SELECTOR, "small.author").text
                tags   = [t.text for t in box.find_elements(By.CSS_SELECTOR, ".tags a.tag")]
                out.append(Quote(text, author, tags))
                if len(out) >= limit:
                    break
            if len(out) >= limit:
                break
            nxt = driver.find_elements(By.CSS_SELECTOR, "li.next > a")
            if not nxt:
                break  # no Next button: last page reached
            nxt[0].click()
            # Wait for the old page's elements to go stale instead of sleeping,
            # so the next iteration does not re-read the page just scraped.
            wait.until(EC.staleness_of(boxes[0]))
    finally:
        driver.quit()  # always release the browser, even on errors
    return out


### Save helpers (CSV / JSONL)

CSV

In [16]:
def save_csv(quotes: List[Quote], path: str):
    with open(path, "w", newline="", encoding="utf-8") as f:
        w = csv.writer(f)
        w.writerow(["text", "author", "tags"])
        for q in quotes:
            w.writerow([q.text, q.author, ", ".join(q.tags)])

JSONL (one JSON object per line)

In [17]:
def save_jsonl(quotes: List[Quote], path: str):
    with open(path, "w", encoding="utf-8") as f:
        for q in quotes:
            f.write(json.dumps(q.__dict__, ensure_ascii=False) + "\n")


### Run for first 10

In [34]:
q10 = scrape_selenium(limit=10)
save_csv(q10, "quotes_10_selenium.csv")
len(q10), q10[:2]  # count + small preview


(10,
 [Quote(text='The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.', author='Albert Einstein', tags=['change', 'deep-thoughts', 'thinking', 'world']),
  Quote(text='It is our choices, Harry, that show what we truly are, far more than our abilities.', author='J.K. Rowling', tags=['abilities', 'choices'])])

In [36]:
q10

[Quote(text='The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.', author='Albert Einstein', tags=['change', 'deep-thoughts', 'thinking', 'world']),
 Quote(text='It is our choices, Harry, that show what we truly are, far more than our abilities.', author='J.K. Rowling', tags=['abilities', 'choices']),
 Quote(text='There are only two ways to live your life. One is as though nothing is a miracle. The other is as though everything is a miracle.', author='Albert Einstein', tags=['inspirational', 'life', 'live', 'miracle', 'miracles']),
 Quote(text='The person, be it gentleman or lady, who has not pleasure in a good novel, must be intolerably stupid.', author='Jane Austen', tags=['aliteracy', 'books', 'classic', 'humor']),
 Quote(text="Imperfection is beauty, madness is genius and it's better to be absolutely ridiculous than absolutely boring.", author='Marilyn Monroe', tags=['be-yourself', 'inspirational']),
 Quote(text='Try not

### Run for first 100

In [35]:
q100 = scrape_selenium(limit=100)
save_csv(q100, "quotes_100_selenium.csv")
len(q100)


100

In [37]:
q100

[Quote(text='The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.', author='Albert Einstein', tags=['change', 'deep-thoughts', 'thinking', 'world']),
 Quote(text='It is our choices, Harry, that show what we truly are, far more than our abilities.', author='J.K. Rowling', tags=['abilities', 'choices']),
 Quote(text='There are only two ways to live your life. One is as though nothing is a miracle. The other is as though everything is a miracle.', author='Albert Einstein', tags=['inspirational', 'life', 'live', 'miracle', 'miracles']),
 Quote(text='The person, be it gentleman or lady, who has not pleasure in a good novel, must be intolerably stupid.', author='Jane Austen', tags=['aliteracy', 'books', 'classic', 'humor']),
 Quote(text="Imperfection is beauty, madness is genius and it's better to be absolutely ridiculous than absolutely boring.", author='Marilyn Monroe', tags=['be-yourself', 'inspirational']),
 Quote(text='Try not

### The Approach

1. Launch a Browser (Headless)

Selenium starts a Chrome browser in the background (headless = no visible window).

Because a real browser loads each page, the scraper behaves like a user opening pages, clicking links, and reading content.

2. Go to the Website

The browser is told to open https://quotes.toscrape.com/.

3. Find the Quotes on the Page

Selenium looks for all elements with the CSS class .quote.

Inside each quote box, it extracts:

* The text of the quote (span.text)
* The author (small.author)
* The tags (all a.tag elements inside .tags)
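
The site wraps each quote text in curly quotation marks, which both scrapers strip before storing the text. A minimal sketch of that cleaning step is shown here; the helper name `clean_quote_text` is hypothetical.

```python
def clean_quote_text(raw: str) -> str:
    """Trim whitespace, then strip curly or straight double quotes from both ends."""
    return raw.strip().strip("“”\"")

print(clean_quote_text("“It is our choices, Harry.”"))
print(clean_quote_text('“She said "hi" twice”'))
```

Note that `str.strip` removes every matching character at the ends, so a quote that itself ends with a quotation mark would lose that mark too. Quotation marks in the middle of the text are left untouched.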

4. Save the Data

Each quote is stored in a Python object (with text, author, tags).

At the end, all quotes are written to a file (CSV or JSON Lines) so you can use them later.

5. Handle Multiple Pages

After finishing the first page, the scraper checks if there’s a Next button.

If yes, Selenium clicks it, loads the new page, and repeats the same steps until it reaches the required number of quotes (10 or 100).
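
The pagination logic can be sketched without a browser. The example below replaces the site with a hypothetical in-memory mapping (`FAKE_PAGES`) and shows the two stopping conditions: the limit is reached, or there is no next page. It illustrates the control flow only and is not the notebook's Selenium code.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical stand-in for the site: page number -> (items, has_next).
FAKE_PAGES: Dict[int, Tuple[List[str], bool]] = {
    1: (["q1", "q2", "q3"], True),
    2: (["q4", "q5", "q6"], True),
    3: (["q7"], False),
}

def collect(limit: int) -> List[str]:
    """Walk pages in order until `limit` items are collected or pages run out."""
    out: List[str] = []
    page: Optional[int] = 1
    while page is not None and len(out) < limit:
        items, has_next = FAKE_PAGES[page]
        out.extend(items[: limit - len(out)])  # never exceed the limit
        page = page + 1 if has_next else None  # follow Next, or stop
    return out

print(collect(5))         # stops partway through page 2
print(len(collect(100)))  # only 7 items exist, so all 7 are returned
```

Asking for more items than exist simply returns everything available, which matches how both scrapers behave when the Next link is missing.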