# Tutorial Web Scraping 2: Navigating HTML Trees

In this tutorial, we will introduce more advanced concepts for web scraping. You will learn how to efficiently navigate through HTML structures, target specific elements based on their attributes, and filter the content to scrape. 

You will be introduced to **tree navigation** concepts in HTML, which will allow you to move between elements on a webpage and scrape information based on relationships between elements like **parents**, **children**, and **siblings**.

---

## How do we search for specific elements with advanced tools?


1. **Extracting HTML by tag attributes**: So far, we have extracted information using simple tag names. It is also possible to filter tags according to their **attributes**, such as `class`, `id`, or custom attributes.

2. **Tree navigation in HTML**: Instead of searching by attribute, you can navigate **up** (parents), **down** (children), and **sideways** (siblings) in an HTML document, which lets you extract nested or related content.

3. **Regular expressions with BeautifulSoup**: Another alternative to attribute search is to use **regular expressions (regex)**, which match specific string patterns, such as all words beginning with the letter 'a' or the string 'upjv', and so allow more flexible extraction from complex structures.

---

## 1. Search by attribute using `findAll`

### Step 1: Fetching the HTML content

As in Tutorial 1, we start by fetching the HTML content of a page. The `urlopen` function will retrieve the raw HTML, which will be parsed by **BeautifulSoup**.

**Task:** Run the code below to view the HTML structure of the webpage.


In [None]:
from urllib.request import urlopen
from bs4 import BeautifulSoup
html = urlopen('http://www.pythonscraping.com/pages/warandpeace.html')
bs = BeautifulSoup(html, 'html.parser')
print(bs)

**Explanation:** The code fetches and prints the entire HTML structure of the target page. This is useful for exploring the layout of the page and identifying what elements we want to scrape.

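When the raw dump is hard to read, an indented view can make the nesting easier to see. The sketch below uses BeautifulSoup's `prettify()` on a small inline HTML string, a made-up stand-in for the fetched page so that it runs without a network connection.

```python
from bs4 import BeautifulSoup

# Hypothetical snippet standing in for the downloaded page
sample_html = (
    '<div id="text"><h1>Sample title</h1>'
    '<p>Well, <span class="green">the prince</span> said.</p></div>'
)
bs = BeautifulSoup(sample_html, 'html.parser')

# prettify() returns a string with each tag and each piece of text
# on its own line, indented according to its depth in the tree
pretty = bs.prettify()
print(pretty)
```
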
### Step 2: Extracting specific tags with attributes
Next, we move beyond extracting simple tags and target tags with specific attributes like `class` or `id`.

In [None]:
from urllib.request import urlopen
from bs4 import BeautifulSoup
html = urlopen('http://www.pythonscraping.com/pages/warandpeace.html')
bs = BeautifulSoup(html, "html.parser")

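# Match <span> tags whose class is 'green' or 'red'.
# findAll is the legacy alias of find_all; the same query can be written as
# bs.find_all('span', {'class': {'green', 'red'}})
nameList = bs.findAll('span', {'class': ['green', 'red']})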
for name in nameList:
    print(name.get_text())

**Explanation:** This code filters `<span>` tags on their `class` attribute, keeping only those whose class is `green` or `red`, and prints the text of each one.
    
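The same idea works for other attributes. As a small sketch on a made-up HTML string (not the tutorial page), the example below filters by `id` through the attribute dictionary and by class through the `class_` keyword, which BeautifulSoup provides because `class` is a reserved word in Python.

```python
from bs4 import BeautifulSoup

# Hypothetical snippet; the ids and classes are invented for illustration
sample_html = """
<div id="intro"><span class="green">Anna Pavlovna</span> spoke first.</div>
<div id="body"><span class="red">Well, Prince</span> <span class="green">Prince Vasili</span></div>
"""
bs = BeautifulSoup(sample_html, 'html.parser')

# Filter on id with the attribute dictionary
intro = bs.find('div', {'id': 'intro'})
print(intro.get_text())

# Filter on class with the class_ keyword argument
green_names = [span.get_text() for span in bs.find_all('span', class_='green')]
print(green_names)
```
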
**Task:** Predict the output of the two cells below, then modify the code to extract all the `<span>` tags that have a different class or another attribute of your choice.

In [None]:
nameList = bs.findAll('span', {'class': 'green'})
for name in nameList:
    print(name.get_text())

In [None]:
allText = bs.find_all('span', {'class':{'green', 'red'}})
print([text for text in allText])

### Step 3: Finding all titles (headings)
Let's move on to extracting all the heading tags (`<h1>` to `<h6>`). This is useful when scraping articles or web pages with structured headings.

**Task:**
Run the code and observe the output. What headings were extracted from the page?

In [None]:
titles = bs.find_all(['h1', 'h2','h3','h4','h5','h6'])
print([title for title in titles])

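The list of six tag names can also be written as a single pattern, since `find_all` accepts a compiled regular expression as the tag name. A minimal sketch on an invented HTML string:

```python
import re
from bs4 import BeautifulSoup

# Hypothetical snippet with headings at several levels
sample_html = (
    '<h1>Title</h1><p>Intro</p><h2>Part One</h2>'
    '<h3>Chapter 1</h3><header>Not a heading tag</header>'
)
bs = BeautifulSoup(sample_html, 'html.parser')

# ^h[1-6]$ matches the names h1 to h6 only, not names such as 'header'
heading_pattern = re.compile(r'^h[1-6]$')
headings = [(tag.name, tag.get_text()) for tag in bs.find_all(heading_pattern)]
print(headings)
```
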

### Step 4: Counting occurrences of text

Sometimes, we want to count how many times a specific piece of text appears on the page. For example, let's count the text nodes whose content is exactly "the prince". Note that a mention of the phrase inside a longer sentence is not counted by this kind of search.

**Task:**
Modify this code to count the occurrences of any other word you find on the webpage.

In [None]:
nameList = bs.find_all(text='the prince')
print(len(nameList))

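Passing a plain string only matches text nodes that are exactly equal to it. The sketch below, on a made-up snippet, contrasts that exact match with a count of the phrase anywhere in the page text. It uses `string=`, the current name of the older `text=` argument.

```python
from bs4 import BeautifulSoup

# Hypothetical snippet: the phrase appears twice inside its own tag
# and once in the middle of a sentence
sample_html = (
    '<p><span class="green">the prince</span> sat down. '
    '<span class="green">the prince</span> smiled. '
    'Nobody contradicted the prince.</p>'
)
bs = BeautifulSoup(sample_html, 'html.parser')

# Exact match: only text nodes whose whole content is 'the prince'
exact_nodes = bs.find_all(string='the prince')
print(len(exact_nodes))

# Substring count over the text of the whole document
total_mentions = bs.get_text().count('the prince')
print(total_mentions)
```

Here the first count is 2 and the second is 3, because the last mention sits inside a longer sentence.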
## Exercise 1: Counting words
Write code that counts the number of occurrences of each word in a list of words on the webpage.

## 2. Search by type using `find`

### Step 1: Extracting table data
We can extract data from HTML tables as well. In this example, let's scrape data from a table on another webpage.

**Task:**
Modify this code to extract only the actual data rows (not the header) from the table.

In [None]:
from urllib.request import urlopen
from bs4 import BeautifulSoup

html = urlopen('http://www.pythonscraping.com/pages/page3.html')
bs = BeautifulSoup(html, 'html.parser')

for child in bs.find('table',{'id':'giftList'}).children:
    print(child)

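`.children` yields every direct child, including the whitespace strings between tags. A minimal sketch, using an invented table in place of the real `giftList`, that keeps only the tag children and contrasts `.children` with `.descendants`:

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Hypothetical table; the contents are placeholders
sample_html = """<table id="giftList">
<tr><th>Item</th><th>Cost</th></tr>
<tr><td>Item A</td><td>$1.00</td></tr>
<tr><td>Item B</td><td>$2.00</td></tr>
</table>"""
bs = BeautifulSoup(sample_html, 'html.parser')
table = bs.find('table', {'id': 'giftList'})

# Direct children: the rows plus the newline strings between them
children = list(table.children)
rows = [child for child in children if isinstance(child, Tag)]
print(len(children), len(rows))

# Descendants: every nested tag and string, at any depth
descendants = list(table.descendants)
print(len(descendants))
```
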
### Step 2: Extracting sibling / parent elements
Let's say you want to extract elements that are next to each other in the HTML structure (i.e., siblings). Here's how to do it.

**Task:**
Try extracting only the product data without the title row.

In [None]:
from urllib.request import urlopen
from bs4 import BeautifulSoup

html = urlopen('http://www.pythonscraping.com/pages/page3.html')
bs = BeautifulSoup(html, 'html.parser')

for sibling in bs.find('table', {'id':'giftList'}).tr.next_siblings:
    print(sibling) 

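`next_siblings` also yields the whitespace strings between rows. As an alternative sketch on an invented table, `find_next_siblings('tr')` returns only the `<tr>` tags that follow the first row:

```python
from bs4 import BeautifulSoup

# Hypothetical table; the contents are placeholders
sample_html = """<table id="giftList">
<tr><th>Item</th><th>Cost</th></tr>
<tr><td>Item A</td><td>$1.00</td></tr>
<tr><td>Item B</td><td>$2.00</td></tr>
</table>"""
bs = BeautifulSoup(sample_html, 'html.parser')

# table.tr is the first row (the header); its following siblings are the data rows
header_row = bs.find('table', {'id': 'giftList'}).tr
data_rows = header_row.find_next_siblings('tr')

# Collect the cell texts of each data row
row_values = [[cell.get_text() for cell in row.find_all('td')] for row in data_rows]
print(row_values)
```
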
In some cases, you might need to navigate up the tree to find the parent of a specific element. Here's how to do that.

**Task:**
Modify the code to extract the text associated with another image from the same table.

In [None]:
from urllib.request import urlopen
from bs4 import BeautifulSoup

html = urlopen('http://www.pythonscraping.com/pages/page3.html')
bs = BeautifulSoup(html, 'html.parser')
print(bs.find('img',
              {'src':'../img/gifts/img1.jpg'})
      .parent.previous_sibling.get_text())

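The chained call reads from left to right. The sketch below spells out each step on an invented table row (with no whitespace between cells, so `previous_sibling` lands directly on the neighbouring `<td>`), and shows `find_previous_sibling('td')` as an option that skips any text nodes in between:

```python
from bs4 import BeautifulSoup

# Hypothetical row; the price and file name are placeholders
sample_html = (
    '<table><tr><td>Item A</td><td>$1.00</td>'
    '<td><img src="../img/gifts/imgA.jpg"></td></tr></table>'
)
bs = BeautifulSoup(sample_html, 'html.parser')

img = bs.find('img', {'src': '../img/gifts/imgA.jpg'})  # 1. the <img> tag
image_cell = img.parent                                  # 2. the <td> that contains it
price_cell = image_cell.previous_sibling                 # 3. the node just before that <td>
print(price_cell.get_text())

# find_previous_sibling('td') only considers <td> tags, ignoring text nodes
same_cell = image_cell.find_previous_sibling('td')
print(same_cell.get_text())
```
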
## 3. Search by expression 

### Step 1: Regular expressions 
Regular expressions (regex) allow you to search for patterns in text. This is useful for finding elements with complex or unpredictable attributes.

**Task:** What type of files does this code retrieve?

In [None]:
from urllib.request import urlopen
from bs4 import BeautifulSoup
import re

html = urlopen('http://www.pythonscraping.com/pages/page3.html')
bs = BeautifulSoup(html, 'html.parser')
# Raw string (r'...') so that the backslashes reach the regex engine unchanged
images = bs.find_all('img', {'src': re.compile(r'\.\./img/gifts/img.*\.jpg')})
for image in images: 
    print(image['src'])

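To see what the pattern accepts, it can be tried on plain strings with the `re` module alone. The `src` values below are invented for illustration; the pattern is the one used in the `find_all` call, written as a raw string.

```python
import re

# '../img/gifts/img', then any characters, then '.jpg'
pattern = re.compile(r'\.\./img/gifts/img.*\.jpg')

# Hypothetical src values
candidates = [
    '../img/gifts/img1.jpg',
    '../img/gifts/img12.jpg',
    '../img/gifts/logo.jpg',
    '../img/gifts/img3.png',
]

# search() looks for the pattern anywhere in the string
matching = [src for src in candidates if pattern.search(src)]
print(matching)
```

Only the first two values match: the third does not start its file name with `img`, and the fourth does not end in `.jpg`.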
### Step 2: Lambda functions

You can also use lambda functions to create custom filtering criteria. For example, to find all tags that have exactly two attributes:

In [None]:
bs.find_all(lambda tag: len(tag.attrs) == 2)

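Any function that takes a tag and returns `True` or `False` can be used as a filter, so a named function works as well as a lambda. A small sketch on an invented snippet, with a predicate that combines the tag name and the presence of an attribute:

```python
from bs4 import BeautifulSoup

# Hypothetical snippet
sample_html = (
    '<img src="a.jpg" alt="First">'
    '<img src="b.jpg">'
    '<a href="page.html" title="Next">next</a>'
)
bs = BeautifulSoup(sample_html, 'html.parser')

def is_described_image(tag):
    """Return True for <img> tags that carry an alt attribute."""
    return tag.name == 'img' and tag.has_attr('alt')

described = bs.find_all(is_described_image)
described_sources = [tag['src'] for tag in described]
print(described_sources)

# The same filter written inline as a lambda
described_inline = bs.find_all(lambda tag: tag.name == 'img' and tag.has_attr('alt'))
print(len(described_inline))
```
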
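**Task:**
Modify the lambda function to find tags with three or more attributes.

Lambda functions can also test a tag's text. The two cells below look for the same sentence, first with a lambda and then with the `text` argument: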

In [None]:
bs.find_all(lambda tag: tag.get_text() == 'Or maybe he\'s only resting?')

In [None]:
bs.find_all('', text='Or maybe he\'s only resting?')

## Exercise 2: Extracting prices with regex

On Page 3 in Step 1 above, you found several products listed with their prices displayed in bold. For this exercise, you will:
 - Extract all the prices displayed on the page using regular expressions.
 - Display the extracted prices in a readable format.

**Instructions to follow:**
 - Inspect Page 3 in your browser, and examine how the prices are structured.
 - Identify the price format and propose a regular expression for this pattern: `$xx.xx` (e.g., `$15.00`, `$0.50`).
     - You can start by describing the pattern in your own words, then "convert" it into a regex.
 
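As a sketch of that conversion, the description "a dollar sign, one or more digits, a dot, then two digits" can be written piece by piece with `re.VERBOSE`, which allows whitespace and comments inside the pattern. The sample sentence is invented.

```python
import re

price_pattern = re.compile(r"""
    \$       # a literal dollar sign (unescaped, $ means end of string)
    \d+      # one or more digits
    \.       # a literal dot (unescaped, . means any character)
    \d{2}    # two digits
""", re.VERBOSE)

# Hypothetical text
sample_text = 'The basket costs $15.00 and the postcard costs $0.50.'
found = price_pattern.findall(sample_text)
print(found)
```
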
**Questions:**
1. Run the code below and observe the output. Correct the code to get rid of the error message.
2. After correction, does it correctly extract all the prices from the page?
3. Modify the regex to extract (a) only standalone prices, and (b) prices with only one digit before the decimal point (e.g., `$0.50`, but not `$15.00`).
4. Format the output to display the extracted prices with a custom message (e.g., "The price is $0.50").

In [None]:
from urllib.request import urlopen
from bs4 import BeautifulSoup
import re

# Step 1: Fetch the HTML content of the webpage
html = urlopen('http://www.pythonscraping.com/pages/page3.html')
bs = BeautifulSoup(html, 'html.parser')

# Step 2: Use the regular expression to find all prices in the format $xx.xx
prices = bs.find_all(text=re.compile(r'\$\d+\.\d{2}'))

# Step 3: Print out the extracted prices
for price in prices:
    print(price)

### Conclusion
You’ve now learned the basics of scraping using Python and BeautifulSoup, from fetching HTML pages to navigating through complex HTML structures, handling text, tables, and images. Experiment with different web pages and see what you can extract!