# 2. Advanced HTML Parsing

"It is easy. You just chip away the stone that doesn't look like David." - Michelangelo - 

Parse complicated HTML pages in order to extract only the information you're looking for. 

## You Don't Always Need a Hammer. 

bs.find_all('table')[4].find_all('tr')[2].find('td').find_all('div')[1].find('a') 

does not look so great. 

- Difficult to debug.
- Fragile: even the slightest change to the website by its administrator might break your web scraper.
- What if the administrator adds another table, tr, td, div, or a tag? Every index in the chain then points at the wrong element.

What are my options? 

- Look for a "Print This Page" link or a mobile version of the site. These are often better formatted. (More in Chapter 14.)
- Look for the information hidden in a JavaScript file.
- The information might be available in the URL of the page itself.
- Try to think of other sources for this information.

The key is to avoid jumping in too early: first think about the best way to get the data.
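To make the fragility concrete, here is a minimal sketch on an inline page. The markup, the `prices` id, and the two helper functions are hypothetical and exist only for this illustration. It compares a position-based chain with a lookup by attribute after a redesign adds one table at the top of the page.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: the link we want sits in the table with id="prices".
ORIGINAL = (
    '<table><tr><td>nav</td></tr></table>'
    '<table id="prices"><tr><td><a href="/basket">Basket</a></td></tr></table>'
)
# The same page after a redesign adds a banner table at the top.
REDESIGNED = '<table><tr><td>banner</td></tr></table>' + ORIGINAL


def by_position(html):
    """Select the link by counting tables (breaks when the layout shifts)."""
    bs = BeautifulSoup(html, 'html.parser')
    return bs.find_all('table')[1].find('a')


def by_attribute(html):
    """Select the link through the table's id (survives the extra table)."""
    bs = BeautifulSoup(html, 'html.parser')
    return bs.find('table', {'id': 'prices'}).find('a')


print(by_position(ORIGINAL))     # the <a> tag
print(by_position(REDESIGNED))   # None: index 1 is now the nav table
print(by_attribute(REDESIGNED))  # still the <a> tag
```

The attribute lookup is not immune to change either (the id could be renamed), but it does not depend on how many tables precede the target.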

## Another Serving of BeautifulSoup

We'll discuss:
- searching for tags by attributes
- working with lists of tags
- navigating parse trees 

In [3]:
from urllib.request import urlopen
from bs4 import BeautifulSoup
html = urlopen('http://www.pythonscraping.com/pages/warandpeace.html')
bs = BeautifulSoup(html, "html.parser")

CSS is everywhere, so most pages carry plentiful id and class attributes that a scraper can use to tell tags apart.

find_all is an extremely flexible function. 

bs.tagName returns only the first tag. 
bs.find_all('tagName', {tagAttributes}) returns a list of tags. 

You can strip tags with .get_text(), but remember that it is much easier to find what you're looking for in a BeautifulSoup object than in a block of text.

Try to preserve tag structure as long as possible and use .get_text() to get your final data. 

In [4]:
nameList = bs.find_all('span', {'class': 'green'})  # find_all is the current name; findAll is a legacy alias
print(nameList[0])
type(nameList[0])

<span class="green">Anna
Pavlovna Scherer</span>


bs4.element.Tag

In [5]:
nameList = bs.find_all('span', {'class': 'green'})
for name in nameList:
    print(name.get_text())  # Strips all tags and returns only the text content of the element.

Anna
Pavlovna Scherer
Empress Marya
Fedorovna
Prince Vasili Kuragin
Anna Pavlovna
St. Petersburg
the prince
Anna Pavlovna
Anna Pavlovna
the prince
the prince
the prince
Prince Vasili
Anna Pavlovna
Anna Pavlovna
the prince
Wintzingerode
King of Prussia
le Vicomte de Mortemart
Montmorencys
Rohans
Abbe Morio
the Emperor
the prince
Prince Vasili
Dowager Empress Marya Fedorovna
the baron
Anna Pavlovna
the Empress
the Empress
Anna Pavlovna's
Her Majesty
Baron
Funke
The prince
Anna
Pavlovna
the Empress
The prince
Anatole
the prince
The prince
Anna
Pavlovna
Anna Pavlovna


## find() and find_all() with BeautifulSoup 

- find_all(tag, attributes, recursive, text, limit, keywords)
- find(tag, attributes, recursive, text, keywords)

95% of the time, you'll only need the first two - tag, attributes. 

- tag: a tag name, or a list of tag names like ['h1', 'h2', 'h3'].
- attributes: a dictionary of attributes to match. A value can be a set of alternatives, like {'class': {'green', 'red'}}, which matches tags whose class is green OR red.
- recursive: boolean, default True. How deeply into the document do you want to go?
    - if True: looks at children, children's children, and so on, at every level.
    - if False: looks only at the top-level tags.
- text: matches on the text content of the tags rather than on their properties, like bs.find_all(text='the prince'). (Newer Beautiful Soup releases call this argument string; text is the older name.)
- limit: retrieves only the first x items from the page, in the order they occur. find is equivalent to find_all with limit=1.
- keyword: selects tags that contain a particular attribute or set of attributes, passed as keyword arguments. Multiple keywords are combined with AND.

Below are example cells.

In [6]:
# tag:

titles = bs.find_all(['h1', 'h2','h3','h4','h5','h6'])
print([title for title in titles])



[<h1>War and Peace</h1>, <h2>Chapter 1</h2>]


In [8]:
# attributes

allText = bs.find_all('span', {'class':{'green', 'red'}})
print([text for text in allText])

[<span class="red">Well, Prince, so Genoa and Lucca are now just family estates of the
Buonapartes. But I warn you, if you don't tell me that this means war,
if you still try to defend the infamies and horrors perpetrated by
that Antichrist- I really believe he is Antichrist- I will have
nothing more to do with you and you are no longer my friend, no longer
my 'faithful slave,' as you call yourself! But how do you do? I see
I have frightened you- sit down and tell me all the news.</span>, <span class="green">Anna
Pavlovna Scherer</span>, <span class="green">Empress Marya
Fedorovna</span>, <span class="green">Prince Vasili Kuragin</span>, <span class="green">Anna Pavlovna</span>, <span class="green">St. Petersburg</span>, <span class="red">If you have nothing better to do, Count [or Prince], and if the
prospect of spending an evening with a poor invalid is not too
terrible, I shall be very charmed to see you tonight between 7 and 10-
Annette Scherer.</span>, <span class="red">Heavens! w

In [13]:
# text

nameList = bs.find_all(text='the prince') # The exact match. 
print(len(nameList))
nameList

7


['the prince',
 'the prince',
 'the prince',
 'the prince',
 'the prince',
 'the prince',
 'the prince']

In [14]:
# keyword
## Redundant with the attributes dictionary, but useful when you need an AND condition.

title = bs.find_all(id='title', class_='text')  # AND condition: id is 'title' and class is 'text'
title = bs.find_all(id='title')  # This is the same as: bs.find_all('', {'id': 'title'})

print([text for text in allText])  # Note: this prints allText from the attributes cell, not title

[<span class="red">Well, Prince, so Genoa and Lucca are now just family estates of the
Buonapartes. But I warn you, if you don't tell me that this means war,
if you still try to defend the infamies and horrors perpetrated by
that Antichrist- I really believe he is Antichrist- I will have
nothing more to do with you and you are no longer my friend, no longer
my 'faithful slave,' as you call yourself! But how do you do? I see
I have frightened you- sit down and tell me all the news.</span>, <span class="green">Anna
Pavlovna Scherer</span>, <span class="green">Empress Marya
Fedorovna</span>, <span class="green">Prince Vasili Kuragin</span>, <span class="green">Anna Pavlovna</span>, <span class="green">St. Petersburg</span>, <span class="red">If you have nothing better to do, Count [or Prince], and if the
prospect of spending an evening with a poor invalid is not too
terrible, I shall be very charmed to see you tonight between 7 and 10-
Annette Scherer.</span>, <span class="red">Heavens! w

In [16]:
bs.find_all(class='green') # Keyword argument can't handle this, since class is a reserved word in Python. 

SyntaxError: invalid syntax (<ipython-input-16-b9f90d7cbb13>, line 1)

In [17]:
bs.find_all(class_='green') # The trailing underscore works but is clumsy; the attributes dictionary avoids the problem.

[<span class="green">Anna
 Pavlovna Scherer</span>, <span class="green">Empress Marya
 Fedorovna</span>, <span class="green">Prince Vasili Kuragin</span>, <span class="green">Anna Pavlovna</span>, <span class="green">St. Petersburg</span>, <span class="green">the prince</span>, <span class="green">Anna Pavlovna</span>, <span class="green">Anna Pavlovna</span>, <span class="green">the prince</span>, <span class="green">the prince</span>, <span class="green">the prince</span>, <span class="green">Prince Vasili</span>, <span class="green">Anna Pavlovna</span>, <span class="green">Anna Pavlovna</span>, <span class="green">the prince</span>, <span class="green">Wintzingerode</span>, <span class="green">King of Prussia</span>, <span class="green">le Vicomte de Mortemart</span>, <span class="green">Montmorencys</span>, <span class="green">Rohans</span>, <span class="green">Abbe Morio</span>, <span class="green">the Emperor</span>, <span class="green">the prince</span>, <span class="green">Pri
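The cells above cover tag, attributes, text, and keyword, but not recursive or limit. The following is a minimal sketch of those two parameters on a small inline document. The markup and variable names are made up for the illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: one <p> directly under the div, two nested deeper.
HTML = '<div id="top"><p>one</p><section><p>two</p><p>three</p></section></div>'
bs = BeautifulSoup(HTML, 'html.parser')
top = bs.find('div', {'id': 'top'})

all_p = top.find_all('p')                      # recursive=True is the default
direct_p = top.find_all('p', recursive=False)  # only direct children of the div
first_two = top.find_all('p', limit=2)         # stop after two matches

print([p.get_text() for p in all_p])      # ['one', 'two', 'three']
print([p.get_text() for p in direct_p])   # ['one']
print([p.get_text() for p in first_two])  # ['one', 'two']

# find returns the same tag as find_all with limit=1.
print(top.find('p') == top.find_all('p', limit=1)[0])  # True
```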

## Other BeautifulSoup Objects

You've seen two types of objects in the BeautifulSoup library. 
- BeautifulSoup objects
- Tag objects. 

There are two more. 
- NavigableString objects.
    - Used to represent text within tags, rather than the tags themselves.
- Comment objects.
    - Used to find HTML comments, like <!--like this one-->.
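As a small illustration of these two object types, the sketch below parses an inline snippet (made up for this example) and checks what kind of object each piece of text is. Filtering strings with `isinstance(..., Comment)` is one common way to collect comments, not the only one.

```python
from bs4 import BeautifulSoup
from bs4.element import Comment, NavigableString

# Hypothetical snippet containing text inside a tag and an HTML comment.
HTML = '<p>Hello <b>world</b><!--like this one--></p>'
bs = BeautifulSoup(HTML, 'html.parser')

# The text inside <b> is a NavigableString, not a Tag.
text_node = bs.b.string
print(type(text_node).__name__)  # NavigableString

# Comments are a special kind of string, so filter strings by type.
comments = bs.find_all(string=lambda s: isinstance(s, Comment))
print([str(c) for c in comments])  # ['like this one']
```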
    
## Navigating Trees

find_all finds tags based on their name and attributes.

But what if you need to find a tag based on its location in a document? That is where tree navigation comes in: you can traverse the tree up, across, and diagonally.

### Dealing with children and other descendants

- children: exactly one level below a parent.
- descendants: any level below a parent. All children are descendants, but not all descendants are children.

In general, BeautifulSoup functions deal with the descendants of the currently selected tag.

If you want only the children, use .children.
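The distinction can be sketched on a small inline table. The markup is a simplified stand-in, not the real page, and whitespace between tags is left out so that only tags appear.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Simplified, hypothetical version of a gift table with one row.
HTML = '<table id="giftList"><tr><td>Basket <span>New!</span></td></tr></table>'
bs = BeautifulSoup(HTML, 'html.parser')
table = bs.find('table', {'id': 'giftList'})

# Keep only Tag objects; text nodes are also children and descendants.
child_tags = [c.name for c in table.children if isinstance(c, Tag)]
descendant_tags = [d.name for d in table.descendants if isinstance(d, Tag)]

print(child_tags)       # ['tr']
print(descendant_tags)  # ['tr', 'td', 'span']
```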

In [18]:
from urllib.request import urlopen
from bs4 import BeautifulSoup

html = urlopen('http://www.pythonscraping.com/pages/page3.html')
bs = BeautifulSoup(html, 'html.parser')

for child in bs.find('table',{'id':'giftList'}).children:
    print(child)



<tr><th>
Item Title
</th><th>
Description
</th><th>
Cost
</th><th>
Image
</th></tr>


<tr class="gift" id="gift1"><td>
Vegetable Basket
</td><td>
This vegetable basket is the perfect gift for your health conscious (or overweight) friends!
<span class="excitingNote">Now with super-colorful bell peppers!</span>
</td><td>
$15.00
</td><td>
<img src="../img/gifts/img1.jpg"/>
</td></tr>


<tr class="gift" id="gift2"><td>
Russian Nesting Dolls
</td><td>
Hand-painted by trained monkeys, these exquisite dolls are priceless! And by "priceless," we mean "extremely expensive"! <span class="excitingNote">8 entire dolls per set! Octuple the presents!</span>
</td><td>
$10,000.52
</td><td>
<img src="../img/gifts/img2.jpg"/>
</td></tr>


<tr class="gift" id="gift3"><td>
Fish Painting
</td><td>
If something seems fishy about this painting, it's because it's a fish! <span class="excitingNote">Also hand-painted by trained monkeys!</span>
</td><td>
$10,005.00
</td><td>
<img src="../img/gifts/img3.jpg"/>


### Dealing with siblings

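siblings: tags that share the same parent (brothers and sisters in the tree).

Objects cannot be siblings with themselves, so the object itself is never included in its siblings list.

next_siblings is good for excluding the first tag, like a table's title row: select the title row, then iterate over everything after it.

Even though it is easier to just select bs.table.tr, that selection is very fragile because it depends on the page layout.

To make a robust scraper, it's best to be as specific as possible when making tag selections.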

In [19]:
from urllib.request import urlopen
from bs4 import BeautifulSoup

html = urlopen('http://www.pythonscraping.com/pages/page3.html')
bs = BeautifulSoup(html, 'html.parser')

for sibling in bs.find('table', {'id':'giftList'}).tr.next_siblings: # The same as above except for the first title row. 
    print(sibling) 



<tr class="gift" id="gift1"><td>
Vegetable Basket
</td><td>
This vegetable basket is the perfect gift for your health conscious (or overweight) friends!
<span class="excitingNote">Now with super-colorful bell peppers!</span>
</td><td>
$15.00
</td><td>
<img src="../img/gifts/img1.jpg"/>
</td></tr>


<tr class="gift" id="gift2"><td>
Russian Nesting Dolls
</td><td>
Hand-painted by trained monkeys, these exquisite dolls are priceless! And by "priceless," we mean "extremely expensive"! <span class="excitingNote">8 entire dolls per set! Octuple the presents!</span>
</td><td>
$10,000.52
</td><td>
<img src="../img/gifts/img2.jpg"/>
</td></tr>


<tr class="gift" id="gift3"><td>
Fish Painting
</td><td>
If something seems fishy about this painting, it's because it's a fish! <span class="excitingNote">Also hand-painted by trained monkeys!</span>
</td><td>
$10,005.00
</td><td>
<img src="../img/gifts/img3.jpg"/>
</td></tr>


<tr class="gift" id="gift4"><td>
Dead Parrot
</td><td>
This is an ex-parr

Other variants: 

- previous_siblings: returns all previous siblings, excluding the tag itself. Useful when there is an easily selectable tag at the end of a list of siblings.
- next_sibling: returns only the single sibling right after the tag.
- previous_sibling: returns only the single sibling right before the tag.
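A minimal sketch of these variants follows, using an inline table whose row ids are invented for the example. The rows are written without whitespace between them. On a real page the whitespace between tags shows up as extra text siblings, which is why the printed output earlier contains blank lines.

```python
from bs4 import BeautifulSoup

# Hypothetical table: no whitespace between rows, so every sibling is a <tr>.
HTML = ('<table id="giftList">'
        '<tr id="header"></tr><tr id="gift1"></tr>'
        '<tr id="gift2"></tr><tr id="total"></tr>'
        '</table>')
bs = BeautifulSoup(HTML, 'html.parser')
table = bs.find('table', {'id': 'giftList'})

header = table.find('tr', {'id': 'header'})
total = table.find('tr', {'id': 'total'})

after_header = [row['id'] for row in header.next_siblings]
before_total = [row['id'] for row in total.previous_siblings]

print(after_header)                  # ['gift1', 'gift2', 'total']
print(before_total)                  # ['gift2', 'gift1', 'header'] (nearest first)
print(header.next_sibling['id'])     # gift1
print(total.previous_sibling['id'])  # gift2
print(header.previous_sibling)       # None: nothing comes before the first row
```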

### Dealing with parents

Finding parents is needed less often than finding siblings or descendants, but sometimes you need .parent or .parents.

In [20]:
from urllib.request import urlopen
from bs4 import BeautifulSoup

html = urlopen('http://www.pythonscraping.com/pages/page3.html')
bs = BeautifulSoup(html, 'html.parser')
print(bs.find('img',
              {'src':'../img/gifts/img1.jpg'})
      .parent.previous_sibling.get_text())


$15.00
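The chained call above can be read one hop at a time. The sketch below repeats it on a simplified inline row (hypothetical markup, written without whitespace between cells so that previous_sibling lands on a td rather than on a text node) and names each intermediate step.

```python
from bs4 import BeautifulSoup

# Simplified, hypothetical row: title cell, price cell, image cell.
HTML = ('<table><tr>'
        '<td>Vegetable Basket</td><td>$15.00</td>'
        '<td><img src="../img/gifts/img1.jpg"/></td>'
        '</tr></table>')
bs = BeautifulSoup(HTML, 'html.parser')

img = bs.find('img', {'src': '../img/gifts/img1.jpg'})  # 1. select the image
image_cell = img.parent                                  # 2. go up to its <td>
price_cell = image_cell.previous_sibling                 # 3. step left to the price <td>
price = price_cell.get_text()                            # 4. extract the text

print(image_cell.name)  # td
print(price)            # $15.00
```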



## Regular Expressions

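They are used to identify regular strings: a regex acts as a rule, and matching returns all strings that follow the rule.

What is a regular string? It's any string that can be generated by a series of linear rules. For example, "write the letter a at least once, then append the letter b exactly five times" generates aabbbbb and is written as aa*bbbbb.

(I'll leave out regex rules here.)

Build the regex step by step.

Note: regex syntax can differ slightly between languages.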
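As an illustration of building a regex one piece at a time, here is a sketch that assembles a simple email-like pattern with Python's `re` module. The pattern, the piece names, and the sample strings are assumptions chosen for the example. It is deliberately narrow and is not a complete email validator.

```python
import re

# Each piece is one linear rule; the full pattern is their concatenation.
LOCAL = r'[A-Za-z0-9._+]+'     # 1. one or more letters, digits, dots, underscores, pluses
AT = r'@'                      # 2. a literal @
DOMAIN = r'[A-Za-z]+'          # 3. one or more letters
TLD = r'\.(com|org|edu|net)'   # 4. a dot followed by one of four endings

EMAIL = re.compile(LOCAL + AT + DOMAIN + TLD)

# Check each stage on its own before trusting the combined pattern.
print(bool(re.fullmatch(LOCAL, 'some.one+tag')))        # True
print(bool(re.fullmatch(LOCAL + AT + DOMAIN, 'a@b')))   # True
print(bool(EMAIL.fullmatch('someone@example.com')))     # True
print(bool(EMAIL.fullmatch('someone@example')))         # False: no ending
print(bool(EMAIL.fullmatch('some one@example.com')))    # False: space not allowed
```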

## Regular Expressions and BeautifulSoup

Most functions that take a string argument also accept a regular expression.

For example, you can use a regex to collect all the product images on a website.

Modern websites often have hidden images, and the layout commonly differs from page to page, so relying on the position of an image is fragile.

In this case, you might want to look at the file path of the product image instead.

In [24]:
from urllib.request import urlopen
from bs4 import BeautifulSoup
import re

html = urlopen('http://www.pythonscraping.com/pages/page3.html')
bs = BeautifulSoup(html, 'html.parser')
images = bs.find_all('img', {'src': re.compile(r'\.\./img/gifts/img.*\.jpg')}) # matches ../img/gifts/img<anything>.jpg
for image in images: 
    print(image['src'])

../img/gifts/img1.jpg
../img/gifts/img2.jpg
../img/gifts/img3.jpg
../img/gifts/img4.jpg
../img/gifts/img6.jpg
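Attribute values are not the only place a regex can go. The sketch below, on made-up inline markup, passes a compiled pattern as the tag name and as the text filter. Beautiful Soup tests names with the pattern's search method, so the anchors `^` and `$` are needed to keep `header` from matching.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical markup with headings, paragraphs, and a <header> tag.
HTML = ('<h1>War and Peace</h1><h2>Chapter 1</h2>'
        '<p>the prince</p><p>The prince spoke</p><header>x</header>')
bs = BeautifulSoup(HTML, 'html.parser')

# Regex as the tag name: h1 through h6 only.
headings = bs.find_all(re.compile(r'^h[1-6]$'))
print([h.name for h in headings])  # ['h1', 'h2']

# Regex as the text filter: any text containing "prince", ignoring case.
princes = bs.find_all(string=re.compile(r'prince', re.IGNORECASE))
print([str(s) for s in princes])   # ['the prince', 'The prince spoke']
```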


## Accessing Attributes

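Sometimes you're looking for the attributes themselves rather than the tag content (especially the href attribute of an a tag or the src attribute of an img tag).

To get the attributes of a tag object, use:

myTag.attrs

This returns a dictionary object, so a single attribute can be read with myTag.attrs['src'].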

In [27]:
bs.div.attrs

{'id': 'wrapper'}
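A few common ways to read attributes are sketched below on an inline snippet. The markup is invented for the example.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: a link with two classes and an image.
HTML = '<a href="/gifts" class="nav main">Gifts</a><img src="../img/gifts/logo.jpg"/>'
bs = BeautifulSoup(HTML, 'html.parser')

print(bs.img.attrs)         # {'src': '../img/gifts/logo.jpg'}
print(bs.img.attrs['src'])  # ../img/gifts/logo.jpg
print(bs.img['src'])        # shorthand for the line above
print(bs.a.get('title'))    # None: .get avoids a KeyError for missing attributes
print(bs.a['class'])        # ['nav', 'main']: class is returned as a list
```

Using `.get` is convenient when an attribute such as href may be absent on some tags.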

## Lambda Expressions

A lambda expression is a function that is passed into another function as a variable; for example, f(g(x), y) instead of f(x, y).

BeautifulSoup allows you to pass certain types of functions as parameters into the find_all function. 

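The only restriction is that these functions
- must take a tag object as an argument
- and return a boolean.

BeautifulSoup evaluates the function on every tag it encounters and keeps the tags for which it returns True, as in the cells below.

(You can even combine it with regex, which makes it really powerful.)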

In [28]:
bs.find_all(lambda tag: len(tag.attrs) == 2) # Tags that have exactly two attributes.

[<img src="../img/gifts/logo.jpg" style="float:left;"/>,
 <tr class="gift" id="gift1"><td>
 Vegetable Basket
 </td><td>
 This vegetable basket is the perfect gift for your health conscious (or overweight) friends!
 <span class="excitingNote">Now with super-colorful bell peppers!</span>
 </td><td>
 $15.00
 </td><td>
 <img src="../img/gifts/img1.jpg"/>
 </td></tr>,
 <tr class="gift" id="gift2"><td>
 Russian Nesting Dolls
 </td><td>
 Hand-painted by trained monkeys, these exquisite dolls are priceless! And by "priceless," we mean "extremely expensive"! <span class="excitingNote">8 entire dolls per set! Octuple the presents!</span>
 </td><td>
 $10,000.52
 </td><td>
 <img src="../img/gifts/img2.jpg"/>
 </td></tr>,
 <tr class="gift" id="gift3"><td>
 Fish Painting
 </td><td>
 If something seems fishy about this painting, it's because it's a fish! <span class="excitingNote">Also hand-painted by trained monkeys!</span>
 </td><td>
 $10,005.00
 </td><td>
 <img src="../img/gifts/img3.jpg"/>
 </td>

In [29]:
bs.find_all(lambda tag: tag.get_text() == 'Or maybe he\'s only resting?') # A lambda can even stand in for existing BeautifulSoup functions.

[<span class="excitingNote">Or maybe he's only resting?</span>]

In [10]:
bs.find_all('', text='Or maybe he\'s only resting?') # This accomplishes the same thing. 

["Or maybe he's only resting?"]
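Combining a lambda with a regex could look like the following sketch. The inline table and the `PRICE` pattern are assumptions made for the illustration. The lambda keeps only td tags whose text looks like a dollar amount.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical rows: two with prices, one without.
HTML = ('<table>'
        '<tr class="gift" id="gift1"><td>$15.00</td></tr>'
        '<tr class="gift" id="gift2"><td>$10,000.52</td></tr>'
        '<tr class="gift" id="gift3"><td>sold out</td></tr>'
        '</table>')
bs = BeautifulSoup(HTML, 'html.parser')

# A dollar sign, digits with optional commas, a dot, and two decimals.
PRICE = re.compile(r'^\$[\d,]+\.\d{2}$')

# The function takes a tag and returns a boolean, as find_all requires.
price_cells = bs.find_all(
    lambda tag: tag.name == 'td'
    and PRICE.match(tag.get_text().strip()) is not None
)

print([cell.get_text() for cell in price_cells])  # ['$15.00', '$10,000.52']
```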