# Title: Python Series – Day 45: Web Scraping in Python using BeautifulSoup

## 1. Introduction
**Web Scraping** is the process of extracting data from websites automatically.

**Uses:**
- Price Monitoring
- Market Research
- News Aggregation

**Libraries:**
- `requests`: To download the HTML content of a page.
- `BeautifulSoup` (bs4): To parse and navigate the HTML tree.

**Important:** Always check a website's `robots.txt` file and terms of service before scraping. Be respectful and don't overload servers.
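
As a rough illustration of the `robots.txt` check, the sketch below uses the standard library's `urllib.robotparser` on an inline sample file, so it runs without network access. The sample rules, the `example.com` URLs, the `my-scraper` user agent, and the `is_allowed` helper are all assumptions made for this example; a real site's rules will differ.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, used here instead of a real download
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 2
"""

parser = RobotFileParser()
parser.parse(SAMPLE_ROBOTS_TXT.splitlines())


def is_allowed(url, user_agent="my-scraper"):
    """Return True if the parsed rules permit user_agent to fetch url."""
    return parser.can_fetch(user_agent, url)


print(is_allowed("https://example.com/page/2/"))
print(is_allowed("https://example.com/private/data"))
print(parser.crawl_delay("my-scraper"))
```

For a live site, `parser.set_url(...)` followed by `parser.read()` downloads the file instead of parsing a string.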

## 2. Installing BeautifulSoup
Run these commands in your terminal, or prefix them with `!` to run them from a notebook cell. Uncomment the lines below if the packages are not installed yet.

In [None]:
# !pip install beautifulsoup4
# !pip install requests

## 3. Import Required Libraries

In [None]:
import requests
from bs4 import BeautifulSoup

## 4. Fetching a Webpage
We use `requests.get()` to fetch the page content.

In [None]:
url = "https://quotes.toscrape.com/"
response = requests.get(url, timeout=10)  # timeout stops the call from hanging on a slow server
print(f"Status Code: {response.status_code}")  # 200 means success

## 5. Parsing HTML
We create a BeautifulSoup object to parse the raw HTML.

In [None]:
soup = BeautifulSoup(response.text, "html.parser")
# print(soup.prettify()) # Uncomment to see the structured HTML

## 6. Extracting Specific Data
BeautifulSoup provides search methods such as `.find()`, which returns the first matching element (or `None` if nothing matches), and `.find_all()`, which returns a list of every match.

In [None]:
# Extract Title
print("Page Title:", soup.title.text)

# Extract all links
links = [a['href'] for a in soup.find_all('a', href=True)]
print(f"\nFound {len(links)} links. First 5:")
print(links[:5])
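
To make the difference between the two methods concrete, here is a small sketch that runs on an inline HTML string instead of a live page. `SAMPLE_HTML` and its contents are made up for this example.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML snippet used only for this illustration
SAMPLE_HTML = """
<html>
  <head><title>Sample Page</title></head>
  <body>
    <a href="/page/1/">First</a>
    <a href="/page/2/">Second</a>
    <a name="anchor-without-href">No link</a>
  </body>
</html>
"""

sample_soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

first_link = sample_soup.find("a")                      # first <a> only
all_links = sample_soup.find_all("a")                   # every <a>
links_with_href = sample_soup.find_all("a", href=True)  # only <a> tags that have an href
missing = sample_soup.find("table")                     # no <table> in the snippet

print(first_link["href"])
print(len(all_links), len(links_with_href))
print(missing)
```

The `href=True` filter is what protects the list comprehension above from a `KeyError` on anchors that have no `href` attribute.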

## 7. Selecting Elements Using CSS Selectors
`.select()` allows you to use CSS-style selectors (e.g., `.class`, `#id`, `tag`).

In [None]:
# Select elements with class 'text'
quotes_text = soup.select(".text")
print("First quote using CSS selector:", quotes_text[0].text)
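
The selector forms mentioned above can be tried side by side on a small inline snippet. This is only a sketch: `SAMPLE_HTML` is invented for the example, and `.select_one()` is used as the single-result counterpart of `.select()`.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML snippet used only for this illustration
SAMPLE_HTML = """
<div id="main">
  <div class="quote">
    <span class="text">First sample quote</span>
    <small class="author">Author One</small>
  </div>
  <div class="quote">
    <span class="text">Second sample quote</span>
    <small class="author">Author Two</small>
  </div>
</div>
<p class="text">Footer text</p>
"""

sample_soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

by_class = sample_soup.select(".text")                 # any tag with class "text"
by_id = sample_soup.select("#main")                    # the element with id "main"
by_tag = sample_soup.select("small")                   # every <small> tag
nested = sample_soup.select("div.quote > span.text")   # spans directly inside a quote div
first_author = sample_soup.select_one("div.quote small.author")  # first match or None

print(len(by_class), len(by_id), len(by_tag), len(nested))
print(first_author.text)
```

Note that `.text` alone also matches the footer paragraph, which is why combining tag, class, and parent in one selector is often safer.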

## 8. Scraping Example: Quotes Website
**Target:** https://quotes.toscrape.com/

**Goal:** Extract the quote text, author, and tags for every quote on the page.

In [None]:
quote_elements = soup.find_all("div", class_="quote")

scraped_data = []

for q in quote_elements:
    text = q.find("span", class_="text").text
    author = q.find("small", class_="author").text
    tags = [tag.text for tag in q.find_all("a", class_="tag")]
    
    scraped_data.append({
        "quote": text,
        "author": author,
        "tags": tags
    })

print(f"Scraped {len(scraped_data)} quotes.")
print("Example:", scraped_data[0])
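
The loop above assumes every quote block contains both a text and an author element. On less tidy pages `.find()` can return `None`, and calling `.text` on it raises an `AttributeError`. The sketch below shows one possible way to guard against that; the `text_or_default` helper and `SAMPLE_HTML` are hypothetical names introduced for this example.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML: the second block is missing its author element
SAMPLE_HTML = """
<div class="quote">
  <span class="text">Complete sample quote</span>
  <small class="author">Author One</small>
  <a class="tag" href="/tag/a/">a</a>
</div>
<div class="quote">
  <span class="text">Quote with no author element</span>
</div>
"""


def text_or_default(parent, name, css_class, default=""):
    """Return the stripped text of the first match, or default if none exists."""
    element = parent.find(name, class_=css_class)
    return element.get_text(strip=True) if element is not None else default


sample_soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
records = []
for block in sample_soup.find_all("div", class_="quote"):
    records.append({
        "quote": text_or_default(block, "span", "text"),
        "author": text_or_default(block, "small", "author", default="Unknown"),
        "tags": [t.get_text(strip=True) for t in block.find_all("a", class_="tag")],
    })

for record in records:
    print(record)
```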

## 9. Creating a Scraping Function
Wrapping the scraping logic in a function makes it reusable for any page with the same structure.

In [None]:
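def get_quotes_from_page(url):
    """Return the text of every quote on the given page."""
    r = requests.get(url, timeout=10)
    r.raise_for_status()  # stop early on 4xx/5xx responses
    s = BeautifulSoup(r.text, "html.parser")
    return [q.text for q in s.find_all("span", class_="text")]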

quotes = get_quotes_from_page("https://quotes.toscrape.com/")
for i, q in enumerate(quotes[:3], 1):
    print(f"{i}. {q}")

## 10. Pagination Handling (Intro)
Many sites split their data across several pages. To move through them, find the link inside the "Next" button and build the next page's URL from its `href`.

In [None]:
next_page = soup.find("li", class_="next")
if next_page:
    next_url = "https://quotes.toscrape.com" + next_page.a["href"]
    print("Next Page URL:", next_url)
else:
    print("No next page found.")
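
String concatenation works here because this site's `href` values start with `/`. As a more general alternative, the standard library's `urllib.parse.urljoin` resolves an `href` against the current page URL whether it is root-relative, page-relative, or already absolute. The `resolve_next_url` helper and the sample `href` values below are assumptions for illustration.

```python
from urllib.parse import urljoin


def resolve_next_url(current_url, href):
    """Resolve an href found on current_url into an absolute URL."""
    return urljoin(current_url, href)


# Root-relative href, as used by quotes.toscrape.com
print(resolve_next_url("https://quotes.toscrape.com/page/1/", "/page/2/"))

# Page-relative href (hypothetical), resolved against the current directory
print(resolve_next_url("https://quotes.toscrape.com/tag/love/", "page/2/"))

# Already absolute href (hypothetical), returned unchanged
print(resolve_next_url("https://quotes.toscrape.com/", "https://example.com/other/"))
```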

## 11. Writing Scraped Data to CSV/JSON
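Once the data is scraped, save it to a file so it can be reused. This example writes the `scraped_data` list from section 8 to both CSV and JSON.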

In [None]:
import csv
import json

# Save to CSV
with open("quotes.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    writer.writerow(["Quote", "Author", "Tags"])
    for item in scraped_data:
        writer.writerow([item['quote'], item['author'], ", ".join(item['tags'])])
print("Saved to quotes.csv")

# Save to JSON
with open("quotes.json", "w", encoding="utf-8") as f:
    json.dump(scraped_data, f, indent=4, ensure_ascii=False)  # keep curly quotes readable instead of \uXXXX escapes
print("Saved to quotes.json")
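
One detail worth noticing is that the CSV version flattens the tag list into a single comma-separated cell, while JSON keeps it as a list. The sketch below round-trips a record through both formats in memory to show how the list could be recovered. `sample_data` is a made-up record, and `io.StringIO` stands in for a real file so nothing is written to disk.

```python
import csv
import io
import json

# Hypothetical record in the same shape as scraped_data
sample_data = [
    {"quote": "“A sample quote, with a comma.”", "author": "Jane Doe", "tags": ["sample", "demo"]},
]

# Write CSV into an in-memory buffer; tags are joined into one cell
buffer = io.StringIO(newline="")
writer = csv.DictWriter(buffer, fieldnames=["quote", "author", "tags"])
writer.writeheader()
for item in sample_data:
    writer.writerow({**item, "tags": ", ".join(item["tags"])})

# Read it back and split the tags cell into a list again
buffer.seek(0)
rows = []
for row in csv.DictReader(buffer):
    row["tags"] = row["tags"].split(", ") if row["tags"] else []
    rows.append(row)

# JSON preserves the list structure without any extra work
json_text = json.dumps(sample_data, ensure_ascii=False, indent=4)
restored = json.loads(json_text)

print(rows == sample_data, restored == sample_data)
```

Splitting on `", "` only works when no tag itself contains that separator, which is one reason JSON is often the easier format for nested data.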

## 12. Mini Project – News Scraper
A simple scraper for news headlines. Real sites change their markup often, so this example searches for generic `<h2>` and `<h3>` tags instead of site-specific class names.

In [None]:
def scrape_news_headlines():
    # Target: BBC News (an example only; real selectors change often)
    url = "https://www.bbc.com/news"
    try:
        r = requests.get(url, timeout=10)
        r.raise_for_status()  # treat 4xx/5xx responses as errors
    except requests.RequestException as e:
        print("Error scraping news:", e)
        return

    s = BeautifulSoup(r.text, "html.parser")

    # Generic h2/h3 search keeps the example stable if class names change
    headlines = s.find_all(["h2", "h3"], limit=5)

    print(f"--- Top Headlines from {url} ---\n")
    for h in headlines:
        text = h.get_text().strip()
        if text:  # skip empty heading tags
            print(f"- {text}")

scrape_news_headlines()

## 13. Practice Exercises
1. Scrape the names of top repositories from a GitHub trending page (requires handling complex HTML).
2. Write a script to monitor the price of a product on a test ecommerce site.
3. Scrape weather information (Temperature, Condition) from a weather forecast site.
4. Modify pagination logic to scrape the first 5 pages of quotes automatically.

## 14. Day 45 Summary
- **Requests:** Fetching web pages.
- **BeautifulSoup:** Parsing and extracting data.
- **Selectors:** `find`, `find_all`, `select`.
- **Storage:** Saving data to CSV/JSON files.

**Next topic: Day 46 – API Project (Advanced)**