# Using Beautiful Soup for Data Collection

## 1. What is BeautifulSoup?
- BeautifulSoup is a Python library used to parse HTML and XML documents.
- It creates a parse tree from page content, making it easy to extract data.
- It is often used with `requests` to scrape websites.

In [1]:
from bs4 import BeautifulSoup

In [5]:
with open("HTMLS/page1.html") as f:
    content = f.read()

soup = BeautifulSoup(content, "html.parser")

## 2. Installing BeautifulSoup
Install both `beautifulsoup4` and a parser like `lxml`:

In [6]:
pip install beautifulsoup4 lxml

Note: you may need to restart the kernel to use updated packages.


## 3. Creating a BeautifulSoup Object
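Example (fetch a page with `requests` and parse it with the `lxml` parser):
from bs4 import BeautifulSoup
import requests

url = "https://example.com"
# A timeout keeps the request from hanging indefinitely.
response = requests.get(url, timeout=10)
# Stop early on 4xx/5xx responses instead of parsing an error page.
response.raise_for_status()

soup = BeautifulSoup(response.text, "lxml")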

## 4. Understanding the HTML Structure
BeautifulSoup treats the page like a tree.
You can search and navigate through tags, classes, ids, and attributes.

Example HTML:
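
A minimal sketch of such a page, assuming made-up markup: the tags, the `description` class, and the `main-title` id are invented for illustration and do not come from a real site. Parsing it and calling `prettify()` shows how nested tags become parent and child nodes.

```python
from bs4 import BeautifulSoup

# Hypothetical markup, invented for illustration only.
html = """
<html>
  <head><title>Sample Page</title></head>
  <body>
    <h1 id="main-title">Welcome</h1>
    <p class="description">A short description.</p>
    <a href="/about">About</a>
  </body>
</html>
"""

sample = BeautifulSoup(html, "html.parser")

# Each tag is a node; nested tags are children of the enclosing tag.
print(sample.prettify())
print(sample.title.string)   # text of the <title> tag
print(sample.h1["id"])       # attribute lookup on a tag
```

The built-in `html.parser` is used here so the sketch runs without installing `lxml`.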

## 5. Common Methods in BeautifulSoup
### 5.1 Accessing Elements
- Access the first occurrence of a tag:

`soup.h1`

- Get the text inside a tag:

`soup.h1.text`
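
As a rough illustration on a made-up snippet (the HTML and the `doc` name are assumptions), attribute-style access returns only the first matching tag, and evaluates to `None` when the tag does not exist:

```python
from bs4 import BeautifulSoup

# Hypothetical markup with two <h1> tags and no <h2>.
html = "<html><body><h1> First </h1><h1>Second</h1></body></html>"
doc = BeautifulSoup(html, "html.parser")

first_h1 = doc.h1                      # first <h1> only
print(first_h1.text)                   # raw text, surrounding spaces kept
print(first_h1.get_text(strip=True))   # surrounding whitespace trimmed
print(doc.h2)                          # no <h2> in this snippet
```

`get_text(strip=True)` is often more convenient than `.text` when the page contains indentation or line breaks inside tags.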

### 5.2 `find()` Method
- Finds the first matching element:

`soup.find("p")`

- Find a tag with specific attributes:

`soup.find("p", class_="description")`
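
The sketch below tries a few `find()` filters on a made-up snippet. The markup, the `note` id, and the `small` class are assumptions chosen for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical markup, invented for illustration only.
html = """
<p>Intro paragraph</p>
<p class="description">Product description</p>
<p id="note" class="description small">Footnote</p>
"""
doc = BeautifulSoup(html, "html.parser")

first_p = doc.find("p")                              # first <p> in the document
described = doc.find("p", class_="description")      # first <p> with that class
by_id = doc.find("p", id="note")                     # filter on the id attribute
by_attrs = doc.find("p", attrs={"class": "small"})   # filter passed as a dict
missing = doc.find("table")                          # nothing matches

print(first_p.text, described.text, by_id.text, by_attrs.text, missing)
```

The keyword is spelled `class_` because `class` is a reserved word in Python. A class filter matches a tag if any one of its classes equals the given string.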

### 5.3 `find_all()` Method
- Finds all matching elements and returns them in a list-like `ResultSet`, which is empty when nothing matches.
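
A short sketch of `find_all()` on a made-up list (the markup and class names are assumptions), showing a class filter, the `limit` argument, and the empty result for a tag that is absent:

```python
from bs4 import BeautifulSoup

# Hypothetical markup, invented for illustration only.
html = """
<ul>
  <li class="book">Book A</li>
  <li class="book">Book B</li>
  <li class="magazine">Magazine C</li>
</ul>
"""
doc = BeautifulSoup(html, "html.parser")

all_items = doc.find_all("li")               # every <li>
books = doc.find_all("li", class_="book")    # only <li class="book">
first_two = doc.find_all("li", limit=2)      # stop after two matches
none_found = doc.find_all("table")           # empty result, not None

book_titles = [li.get_text(strip=True) for li in books]
print(len(all_items), book_titles, len(first_two), len(none_found))
```

Because the result is always iterable, a `for` loop over `find_all()` is safe even when the page contains no matches.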

### 5.4 Using select() and select_one()
Select elements using CSS selectors.
- `soup.select_one("p.description")`
- `soup.select("a")`
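
The following sketch applies a few common CSS selector forms to a made-up snippet. The markup, the `content` id, and the `external` class are assumptions for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical markup, invented for illustration only.
html = """
<div id="content">
  <p class="description">First description</p>
  <p>Plain paragraph</p>
  <a href="/one">One</a>
  <a href="/two" class="external">Two</a>
</div>
"""
doc = BeautifulSoup(html, "html.parser")

desc = doc.select_one("p.description")      # first match, or None
links = doc.select("div#content a")         # <a> tags inside <div id="content">
external = doc.select("a.external")         # tag name plus class
with_href = doc.select('a[href="/one"]')    # attribute selector

hrefs = [a["href"] for a in links]
print(desc.text, hrefs, len(external), with_href[0].text)
```

`select()` always returns a list, while `select_one()` returns a single tag or `None`, mirroring the difference between `find_all()` and `find()`.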

In [8]:
h3s = soup.find_all("h3")
print(h3s)

[<h3><a href="a-light-in-the-attic_1000/index.html" title="A Light in the Attic">A Light in the ...</a></h3>, <h3><a href="tipping-the-velvet_999/index.html" title="Tipping the Velvet">Tipping the Velvet</a></h3>, <h3><a href="soumission_998/index.html" title="Soumission">Soumission</a></h3>, <h3><a href="sharp-objects_997/index.html" title="Sharp Objects">Sharp Objects</a></h3>, <h3><a href="sapiens-a-brief-history-of-humankind_996/index.html" title="Sapiens: A Brief History of Humankind">Sapiens: A Brief History ...</a></h3>, <h3><a href="the-requiem-red_995/index.html" title="The Requiem Red">The Requiem Red</a></h3>, <h3><a href="the-dirty-little-secrets-of-getting-your-dream-job_994/index.html" title="The Dirty Little Secrets of Getting Your Dream Job">The Dirty Little Secrets ...</a></h3>, <h3><a href="the-coming-woman-a-novel-based-on-the-life-of-the-infamous-feminist-victoria-woodhull_993/index.html" title="The Coming Woman: A Novel Based on the Life of the Infamous Feminist, V

In [10]:
articles = soup.select("article.product_pod")

In [36]:
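items = []
for article in articles:
    # The full title is in the link's title attribute; the link text is truncated.
    title = article.find("h3").find("a")["title"]
    # Keep only the number after the pound sign. Splitting on "£" also works when
    # the file was decoded with the wrong encoding and the symbol appears as "Â£".
    price = article.select_one("p.price_color").text.split("£")[-1]
    # The rating word is the second CSS class, e.g. class="star-rating Three".
    rating_element = article.select_one("p.star-rating")
    rating = rating_element["class"][1]
    items.append([title, price, rating])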

In [37]:
items

[['A Light in the Attic', '51.77', 'Three'],
 ['Tipping the Velvet', '53.74', 'One'],
 ['Soumission', '50.10', 'One'],
 ['Sharp Objects', '47.82', 'Four'],
 ['Sapiens: A Brief History of Humankind', '54.23', 'Five'],
 ['The Requiem Red', '22.65', 'One'],
 ['The Dirty Little Secrets of Getting Your Dream Job', '33.34', 'Four'],
 ['The Coming Woman: A Novel Based on the Life of the Infamous Feminist, Victoria Woodhull',
  '17.93',
  'Three'],
 ['The Boys in the Boat: Nine Americans and Their Epic Quest for Gold at the 1936 Berlin Olympics',
  '22.60',
  'Four'],
 ['The Black Maria', '52.15', 'One'],
 ['Starving Hearts (Triangular Trade Trilogy, #1)', '13.99', 'Two'],
 ["Shakespeare's Sonnets", '20.66', 'Four'],
 ['Set Me Free', '17.46', 'Five'],
 ["Scott Pilgrim's Precious Little Life (Scott Pilgrim #1)", '52.29', 'Five'],
 ['Rip it Up and Start Again', '35.02', 'Five'],
 ['Our Band Could Be Your Life: Scenes from the American Indie Underground, 1981-1991',
  '57.25',
  'Three'],
 ['Olio

In [38]:
import pandas as pd

In [39]:
df = pd.DataFrame(items, columns=["Books", "Price", "Ratings"])

In [40]:
df

|   | Books | Price | Ratings |
|---|---|---|---|
| 0 | A Light in the Attic | 51.77 | Three |
| 1 | Tipping the Velvet | 53.74 | One |
| 2 | Soumission | 50.10 | One |
| 3 | Sharp Objects | 47.82 | Four |
| 4 | Sapiens: A Brief History of Humankind | 54.23 | Five |
| 5 | The Requiem Red | 22.65 | One |
| 6 | The Dirty Little Secrets of Getting Your Dream... | 33.34 | Four |
| 7 | The Coming Woman: A Novel Based on the Life of... | 17.93 | Three |
| 8 | The Boys in the Boat: Nine Americans and Their... | 22.60 | Four |
| 9 | The Black Maria | 52.15 | One |


In [41]:
df.to_csv("data.csv", index=False)
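
Because the price and rating were extracted as text, the `Price` and `Ratings` columns hold strings. The sketch below shows one possible way to convert them to numbers before analysis. It reuses the first three rows of `items` from the output above; the `books` name and the `RATING_WORDS` mapping are assumptions introduced for illustration.

```python
import pandas as pd

# First three rows of `items`, copied from the output above.
rows = [
    ["A Light in the Attic", "51.77", "Three"],
    ["Tipping the Velvet", "53.74", "One"],
    ["Soumission", "50.10", "One"],
]
books = pd.DataFrame(rows, columns=["Books", "Price", "Ratings"])

# Assumed mapping from the rating words seen in the data to integers.
RATING_WORDS = {"One": 1, "Two": 2, "Three": 3, "Four": 4, "Five": 5}

books["Price"] = books["Price"].astype(float)           # "51.77" -> 51.77
books["Ratings"] = books["Ratings"].map(RATING_WORDS)   # "Three" -> 3

print(books.dtypes)
print(books)
```

With numeric columns, operations such as sorting by price or averaging ratings behave as expected instead of comparing strings.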

## 6. Extracting Attributes
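Get the value of an attribute, such as `href` from an `<a>` tag, by indexing the tag like a dictionary:

`soup.a["href"]`

Or using `.get()`, which returns `None` instead of raising `KeyError` when the attribute is missing:

`soup.a.get("href")`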
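
A small sketch comparing the two access styles on made-up links (the markup is an assumption). The second `<a>` tag deliberately has no `href` so the difference in behavior is visible.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: the second link has no href attribute.
html = '<a href="/books/1" title="First Book">First</a><a>No link</a>'
doc = BeautifulSoup(html, "html.parser")

first, second = doc.find_all("a")

print(first["href"])             # dictionary-style access
print(first.get("title"))        # same lookup through .get()
print(first.attrs)               # all attributes as a dict
print(second.get("href"))        # missing attribute gives None
print(second.get("href", "#"))   # missing attribute with a fallback value

try:
    second["href"]               # missing attribute raises KeyError
except KeyError:
    print("second link has no href")
```

`.get()` tends to be the safer choice in scraping loops, where some elements may lack the attribute.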

## 7. Traversing the Tree
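- Access the parent element with `.parent`.

- Access child elements with `.children` (an iterator over direct children) or `.contents` (the same nodes as a list).

- Find the next sibling tag with `.find_next_sibling()`. The `.next_sibling` attribute returns the very next node, which is often a whitespace text node rather than a tag.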
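
The navigation steps above can be sketched on a made-up menu. The markup and the `menu` id are assumptions, and the HTML is written without whitespace between tags so that every child is a tag rather than a text node.

```python
from bs4 import BeautifulSoup

# Hypothetical markup, written without whitespace between tags.
html = "<ul id='menu'><li>Home</li><li>Books</li><li>Contact</li></ul>"
doc = BeautifulSoup(html, "html.parser")

first_li = doc.find("li")

parent = first_li.parent                          # the enclosing <ul>
children = [li.text for li in parent.children]    # direct children of <ul>
next_li = first_li.find_next_sibling("li")        # next <li> at the same level
prev_li = next_li.find_previous_sibling("li")     # back to the first <li>

print(parent.name, children, next_li.text, prev_li.text)
```

On real pages, which usually contain line breaks between tags, `find_next_sibling()` is generally more predictable than `.next_sibling` because it skips text nodes.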

## 8. Handling Missing Elements Safely
`find()` and `select_one()` return `None` when nothing matches, so always check that an element exists before accessing its text or attributes.
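
One possible pattern is a small helper that returns a default when the element is missing. `safe_text` is a hypothetical helper written for this sketch, not part of BeautifulSoup, and the markup is made up so that the second article has no price.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: the second article has no price element.
html = """
<article><h3>Book A</h3><p class="price">10.00</p></article>
<article><h3>Book B</h3></article>
"""
doc = BeautifulSoup(html, "html.parser")


def safe_text(parent, selector, default=None):
    """Return the stripped text of the first match, or `default` if none."""
    element = parent.select_one(selector)
    return element.get_text(strip=True) if element is not None else default


results = []
for article in doc.select("article"):
    title = safe_text(article, "h3", default="(untitled)")
    price = safe_text(article, "p.price")   # None when the price is missing
    results.append((title, price))

print(results)
```

Without the check, `article.select_one("p.price").text` would raise `AttributeError` on the second article, because `None` has no `.text` attribute.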

## 9. Summary
- BeautifulSoup helps parse and navigate HTML easily.
- Use `.find()`, `.find_all()`, `.select()`, and `.select_one()` to locate data.
- Always inspect the website's structure before writing scraping logic.
- Combine BeautifulSoup with `requests` for full scraping workflows.