
#Another Serving of BeautifulSoup
CSS relies on attributes such as `class` and `id` to tell apart HTML elements that would otherwise have the exact same markup, so that they can be styled differently. Because
CSS relies on these identifying attributes to style sites appropriately, it is almost guaranteed that the class and ID attributes will be plentiful on most modern websites. \\
On the [example](http://www.pythonscraping.com/pages/warandpeace.html) page, the lines spoken by characters are colored red and the names of characters are colored green.

Create a `BeautifulSoup` object to grab the entire page:

In [0]:
from urllib.request import urlopen
from bs4 import BeautifulSoup

In [0]:
html = urlopen('http://www.pythonscraping.com/pages/warandpeace.html')
bs = BeautifulSoup(html, 'html.parser')

In [5]:
print(bs)

<html>
<head>
<style>
.green{
	color:#55ff55;
}
.red{
	color:#ff5555;
}
#text{
	width:50%;
}
</style>
</head>
<body>
<h1>War and Peace</h1>
<h2>Chapter 1</h2>
<div id="text">
"<span class="red">Well, Prince, so Genoa and Lucca are now just family estates of the
Buonapartes. But I warn you, if you don't tell me that this means war,
if you still try to defend the infamies and horrors perpetrated by
that Antichrist- I really believe he is Antichrist- I will have
nothing more to do with you and you are no longer my friend, no longer
my 'faithful slave,' as you call yourself! But how do you do? I see
I have frightened you- sit down and tell me all the news.</span>"
<p></p>
It was in July, 1805, and the speaker was the well-known <span class="green">Anna
Pavlovna Scherer</span>, maid of honor and favorite of the <span class="green">Empress Marya
Fedorovna</span>. With these words she greeted <span class="green">Prince Vasili Kuragin</span>, a man
of high rank and importance, who was the firs

Use the `find_all` function to extract a Python list of proper nouns found by selecting only the text within `<span class="green"></span>` tags:

In [6]:
nameList = bs.find_all('span', {'class': 'green'})
for name in nameList:
  print(name.get_text())

Anna
Pavlovna Scherer
Empress Marya
Fedorovna
Prince Vasili Kuragin
Anna Pavlovna
St. Petersburg
the prince
Anna Pavlovna
Anna Pavlovna
the prince
the prince
the prince
Prince Vasili
Anna Pavlovna
Anna Pavlovna
the prince
Wintzingerode
King of Prussia
le Vicomte de Mortemart
Montmorencys
Rohans
Abbe Morio
the Emperor
the prince
Prince Vasili
Dowager Empress Marya Fedorovna
the baron
Anna Pavlovna
the Empress
the Empress
Anna Pavlovna's
Her Majesty
Baron
Funke
The prince
Anna
Pavlovna
the Empress
The prince
Anatole
the prince
The prince
Anna
Pavlovna
Anna Pavlovna


`bs.tagName` returns the first occurrence of that tag on the page, whereas `bs.find_all(tagName, tagAttributes)` returns a list of all matching tags on the page, rather than just the first. \\
`name.get_text()` strips all tags from the content it is called on and returns a Unicode string containing only the text. For comparison, printing the tag objects themselves keeps the markup:

In [8]:
for name in nameList:
  print(name)

<span class="green">Anna
Pavlovna Scherer</span>
<span class="green">Empress Marya
Fedorovna</span>
<span class="green">Prince Vasili Kuragin</span>
<span class="green">Anna Pavlovna</span>
<span class="green">St. Petersburg</span>
<span class="green">the prince</span>
<span class="green">Anna Pavlovna</span>
<span class="green">Anna Pavlovna</span>
<span class="green">the prince</span>
<span class="green">the prince</span>
<span class="green">the prince</span>
<span class="green">Prince Vasili</span>
<span class="green">Anna Pavlovna</span>
<span class="green">Anna Pavlovna</span>
<span class="green">the prince</span>
<span class="green">Wintzingerode</span>
<span class="green">King of Prussia</span>
<span class="green">le Vicomte de Mortemart</span>
<span class="green">Montmorencys</span>
<span class="green">Rohans</span>
<span class="green">Abbe Morio</span>
<span class="green">the Emperor</span>
<span class="green">the prince</span>
<span class="green">Prince Vasili</span>
<span cl

Be aware that, when a large block of text contains many hyperlinks, paragraphs, and other tags, the `.get_text()` method strips all of that information away and leaves a tagless block of text. Thus, calling `.get_text()` should be the last thing to be done. Try to preserve the tag structure of a document as long as possible.
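
As a minimal sketch of why `.get_text()` belongs at the end, the snippet below parses a small hypothetical HTML string (it is not taken from the example page) and reads an attribute while the tag structure is still available, before reducing everything to text:

```python
from bs4 import BeautifulSoup

# Hypothetical snippet used only for illustration
snippet = '<p>Visit <a href="/home">the <b>home</b> page</a> today.</p>'
soup = BeautifulSoup(snippet, 'html.parser')

link = soup.find('a')    # still a tag object: attributes and children are available
href = link['href']      # read the attribute while the structure exists
text = soup.get_text()   # plain string: all markup has been stripped

print(href)
print(text)
```

Once `text` has been produced, there is no way to recover the `href` from it, which is why the attribute is read first.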

##`find()` and `find_all()` with BeautifulSoup


```
  find_all(tag, attributes, recursive, text, limit, keywords)
  find(tag, attributes, recursive, text, keywords)
```
The `tag` argument can be passed a string name of a tag or even a Python list of string tag names:

In [9]:
titles = bs.find_all(['h1', 'h2', 'h3', 'h4', 'h5', 'h6'])
print([title for title in titles])

[<h1>War and Peace</h1>, <h2>Chapter 1</h2>]


The `attributes` argument takes a Python dictionary of attributes and matches tags that have any one of the listed values. For example, the following returns both the green and the red `span` tags:

In [10]:
allText = bs.find_all('span', {'class': {'green', 'red'}})
print([text for text in allText])

[<span class="red">Well, Prince, so Genoa and Lucca are now just family estates of the
Buonapartes. But I warn you, if you don't tell me that this means war,
if you still try to defend the infamies and horrors perpetrated by
that Antichrist- I really believe he is Antichrist- I will have
nothing more to do with you and you are no longer my friend, no longer
my 'faithful slave,' as you call yourself! But how do you do? I see
I have frightened you- sit down and tell me all the news.</span>, <span class="green">Anna
Pavlovna Scherer</span>, <span class="green">Empress Marya
Fedorovna</span>, <span class="green">Prince Vasili Kuragin</span>, <span class="green">Anna Pavlovna</span>, <span class="green">St. Petersburg</span>, <span class="red">If you have nothing better to do, Count [or Prince], and if the
prospect of spending an evening with a poor invalid is not too
terrible, I shall be very charmed to see you tonight between 7 and 10-
Annette Scherer.</span>, <span class="red">Heavens! w

The `recursive` argument is a boolean. If `recursive` is set to `True`, the `find_all` function looks into children, and children's children, for tags that match the parameters. If `False`, it looks only at the direct children of the tag it is called on (the top-level tags, when called on the document itself). By default, `find_all` works recursively (`recursive=True`).
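
The effect of `recursive` can be sketched on a small hypothetical snippet (the markup and the `outer` id are assumptions for illustration, not part of the example page):

```python
from bs4 import BeautifulSoup

# Hypothetical markup: one span is a direct child of the div, the other is nested in a p
snippet = '<div id="outer"><span>direct</span><p><span>nested</span></p></div>'
soup = BeautifulSoup(snippet, 'html.parser')
outer = soup.find('div', {'id': 'outer'})

all_spans = outer.find_all('span')                      # recursive=True is the default
direct_spans = outer.find_all('span', recursive=False)  # direct children of the div only

print(len(all_spans))
print([span.get_text() for span in direct_spans])
```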

The `text` argument matches based on the text content of the tags, rather than properties of the tags themselves:

In [18]:
nameList = bs.find_all(text='the prince')
print(nameList)

['the prince', 'the prince', 'the prince', 'the prince', 'the prince', 'the prince', 'the prince']


In [12]:
print(len(nameList))

7


The `limit` argument is used only in the `find_all` method; `find` is equivalent to the same `find_all` call with a limit of 1. Set `limit` to retrieve only the first x matching items from the page.
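
A short sketch of `limit`, again on a hypothetical list rather than the example page:

```python
from bs4 import BeautifulSoup

# Hypothetical markup used only for illustration
snippet = '<ul><li>one</li><li>two</li><li>three</li></ul>'
soup = BeautifulSoup(snippet, 'html.parser')

first_two = soup.find_all('li', limit=2)            # stop after two matches
first_via_find = soup.find('li')                    # a single tag, not a list
first_via_limit = soup.find_all('li', limit=1)[0]   # same tag, taken from a one-item list

print([li.get_text() for li in first_two])
print(first_via_find == first_via_limit)
```

Note that `find` returns the tag itself, while `find_all(..., limit=1)` still returns a list.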

The `keyword` argument selects tags that contain a particular attribute or set of attributes:

In [19]:
title = bs.find_all(id='title', class_='text')
print([text for text in title])

[]


This looks for every tag with the word "text" in the `class` attribute and "title" in the `id` attribute. The example page has no such tag, so the result is an empty list. \\
Note that, by convention, each value for an `id` should be used only once
on the page. Therefore, in practice, a line like this may not be particularly useful, and
filtering on the `id` alone usually gives the same result:

In [0]:
title = bs.find(id='title')

`class` is a reserved word in Python, so it cannot be used as a variable or argument name. The following call therefore raises a syntax error:

In [21]:
bs.find_all(class = 'green')

SyntaxError: ignored

Instead, use BeautifulSoup's `class_` keyword argument, with a trailing underscore:

In [22]:
bs.find_all(class_='green')

[<span class="green">Anna
 Pavlovna Scherer</span>, <span class="green">Empress Marya
 Fedorovna</span>, <span class="green">Prince Vasili Kuragin</span>, <span class="green">Anna Pavlovna</span>, <span class="green">St. Petersburg</span>, <span class="green">the prince</span>, <span class="green">Anna Pavlovna</span>, <span class="green">Anna Pavlovna</span>, <span class="green">the prince</span>, <span class="green">the prince</span>, <span class="green">the prince</span>, <span class="green">Prince Vasili</span>, <span class="green">Anna Pavlovna</span>, <span class="green">Anna Pavlovna</span>, <span class="green">the prince</span>, <span class="green">Wintzingerode</span>, <span class="green">King of Prussia</span>, <span class="green">le Vicomte de Mortemart</span>, <span class="green">Montmorencys</span>, <span class="green">Rohans</span>, <span class="green">Abbe Morio</span>, <span class="green">the Emperor</span>, <span class="green">the prince</span>, <span class="green">Pri

Or, alternatively:

In [23]:
bs.find_all('', {'class':'green'})

[<span class="green">Anna
 Pavlovna Scherer</span>, <span class="green">Empress Marya
 Fedorovna</span>, <span class="green">Prince Vasili Kuragin</span>, <span class="green">Anna Pavlovna</span>, <span class="green">St. Petersburg</span>, <span class="green">the prince</span>, <span class="green">Anna Pavlovna</span>, <span class="green">Anna Pavlovna</span>, <span class="green">the prince</span>, <span class="green">the prince</span>, <span class="green">the prince</span>, <span class="green">Prince Vasili</span>, <span class="green">Anna Pavlovna</span>, <span class="green">Anna Pavlovna</span>, <span class="green">the prince</span>, <span class="green">Wintzingerode</span>, <span class="green">King of Prussia</span>, <span class="green">le Vicomte de Mortemart</span>, <span class="green">Montmorencys</span>, <span class="green">Rohans</span>, <span class="green">Abbe Morio</span>, <span class="green">the Emperor</span>, <span class="green">the prince</span>, <span class="green">Pri

##Other BeautifulSoup Objects
* `NavigableString` objects: Used to represent text within tags, rather than the tags themselves (some functions operate on and produce `NavigableStrings`, rather than tag objects).
* `Comment` objects: Used to represent HTML comments found in comment tags, `<!--like this one-->`
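
To make these two object types concrete, the sketch below inspects the children of a hypothetical `p` tag that holds both plain text and a comment:

```python
from bs4 import BeautifulSoup
from bs4.element import Comment, NavigableString

# Hypothetical markup: a paragraph with text followed by an HTML comment
snippet = '<p>Hello<!--a hidden note--></p>'
soup = BeautifulSoup(snippet, 'html.parser')

pieces = list(soup.p.children)
for piece in pieces:
    print(type(piece).__name__, repr(str(piece)))

# Comment is a subclass of NavigableString, so check for Comment first when telling them apart
comments = [piece for piece in pieces if isinstance(piece, Comment)]
plain_text = [piece for piece in pieces
              if isinstance(piece, NavigableString) and not isinstance(piece, Comment)]
```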

##Navigating Trees
The `find_all` function finds tags based on their name and attributes. To find a tag based on its location in the document instead, navigate the tree. \\
A path through a BeautifulSoup tree may look like:


```
bs.tag.subTag.anotherSubTag
```



###Dealing with children and other descendants
The `tr` tags are children of the `table` tag, whereas `tr`, `th`, `td`, `img`, and `span` are all descendants of the `table` tag, as shown in the [example](http://www.pythonscraping.com/pages/page3.html) page. \\
All children are descendants, but not all descendants are children. \\
For example, `bs.body.h1` selects the first `h1` tag that is a descendant of the `body` tag, but it does not find tags located outside the body. \\
`bs.div.find_all('img')` finds the first `div` tag in the document, and then retrieves a list of all `img` tags that are descendants of that `div` tag. \\
To find only the descendants that are children, use the `.children` attribute:

In [0]:
html = urlopen('http://www.pythonscraping.com/pages/page3.html')
bs = BeautifulSoup(html, 'html.parser')

In [25]:
for child in bs.find('table', {'id': 'giftList'}).children:
  print(child)



<tr><th>
Item Title
</th><th>
Description
</th><th>
Cost
</th><th>
Image
</th></tr>


<tr class="gift" id="gift1"><td>
Vegetable Basket
</td><td>
This vegetable basket is the perfect gift for your health conscious (or overweight) friends!
<span class="excitingNote">Now with super-colorful bell peppers!</span>
</td><td>
$15.00
</td><td>
<img src="../img/gifts/img1.jpg"/>
</td></tr>


<tr class="gift" id="gift2"><td>
Russian Nesting Dolls
</td><td>
Hand-painted by trained monkeys, these exquisite dolls are priceless! And by "priceless," we mean "extremely expensive"! <span class="excitingNote">8 entire dolls per set! Octuple the presents!</span>
</td><td>
$10,000.52
</td><td>
<img src="../img/gifts/img2.jpg"/>
</td></tr>


<tr class="gift" id="gift3"><td>
Fish Painting
</td><td>
If something seems fishy about this painting, it's because it's a fish! <span class="excitingNote">Also hand-painted by trained monkeys!</span>
</td><td>
$10,005.00
</td><td>
<img src="../img/gifts/img3.jpg"/>


This code prints the list of product rows in the `giftList` table, including the initial row of column labels. \\
With the `.descendants` attribute instead, every nested tag and string is printed as well:

In [26]:
for desc in bs.find('table', {'id': 'giftList'}).descendants:
  print(desc)



<tr><th>
Item Title
</th><th>
Description
</th><th>
Cost
</th><th>
Image
</th></tr>
<th>
Item Title
</th>

Item Title

<th>
Description
</th>

Description

<th>
Cost
</th>

Cost

<th>
Image
</th>

Image



<tr class="gift" id="gift1"><td>
Vegetable Basket
</td><td>
This vegetable basket is the perfect gift for your health conscious (or overweight) friends!
<span class="excitingNote">Now with super-colorful bell peppers!</span>
</td><td>
$15.00
</td><td>
<img src="../img/gifts/img1.jpg"/>
</td></tr>
<td>
Vegetable Basket
</td>

Vegetable Basket

<td>
This vegetable basket is the perfect gift for your health conscious (or overweight) friends!
<span class="excitingNote">Now with super-colorful bell peppers!</span>
</td>

This vegetable basket is the perfect gift for your health conscious (or overweight) friends!

<span class="excitingNote">Now with super-colorful bell peppers!</span>
Now with super-colorful bell peppers!


<td>
$15.00
</td>

$15.00

<td>
<img src="../img/gifts/img1.jpg"

###Dealing with siblings
The `next_siblings` attribute makes it trivial to collect data from tables, especially ones with title rows:

In [27]:
for sibling in bs.find('table', {'id':'giftList'}).tr.next_siblings:
  print(sibling)



<tr class="gift" id="gift1"><td>
Vegetable Basket
</td><td>
This vegetable basket is the perfect gift for your health conscious (or overweight) friends!
<span class="excitingNote">Now with super-colorful bell peppers!</span>
</td><td>
$15.00
</td><td>
<img src="../img/gifts/img1.jpg"/>
</td></tr>


<tr class="gift" id="gift2"><td>
Russian Nesting Dolls
</td><td>
Hand-painted by trained monkeys, these exquisite dolls are priceless! And by "priceless," we mean "extremely expensive"! <span class="excitingNote">8 entire dolls per set! Octuple the presents!</span>
</td><td>
$10,000.52
</td><td>
<img src="../img/gifts/img2.jpg"/>
</td></tr>


<tr class="gift" id="gift3"><td>
Fish Painting
</td><td>
If something seems fishy about this painting, it's because it's a fish! <span class="excitingNote">Also hand-painted by trained monkeys!</span>
</td><td>
$10,005.00
</td><td>
<img src="../img/gifts/img3.jpg"/>
</td></tr>


<tr class="gift" id="gift4"><td>
Dead Parrot
</td><td>
This is an ex-parr

This prints all rows of products from the product table, except for the first title row, since objects cannot be siblings with themselves. The `next_siblings` attribute yields only the siblings that come *after* the selected tag! \\
The `previous_siblings` attribute can be helpful if there is an easily selectable tag at the end of a list of sibling tags.
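
Both directions can be sketched on a cut-down, hypothetical version of the gift table. The snippet deliberately has no whitespace between tags; on a real page the siblings can also include whitespace strings, which have no attributes.

```python
from bs4 import BeautifulSoup

# Hypothetical, simplified table; no whitespace between rows, so every sibling is a tag
snippet = ('<table id="giftList">'
           '<tr><th>Item</th></tr>'
           '<tr id="gift1"><td>Basket</td></tr>'
           '<tr id="gift2"><td>Dolls</td></tr>'
           '</table>')
soup = BeautifulSoup(snippet, 'html.parser')
table = soup.find('table', {'id': 'giftList'})

# Walk forward from the title row
after_header = [row['id'] for row in table.tr.next_siblings]

# Walk backward from the last row; the title row has no id, so .get returns None for it
last_row = table.find('tr', {'id': 'gift2'})
before_last = [row.get('id') for row in last_row.previous_siblings]

print(after_header)
print(before_last)
```

`previous_siblings` walks backward, so the nearest row comes first.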

###Dealing with parents
The `.parent` attribute returns the tag that directly contains an element, and `.parents` iterates over all of its ancestors:

In [29]:
print(bs.find('img', {'src':'../img/gifts/img1.jpg'}).parent.previous_sibling)

<td>
$15.00
</td>


This prints the table cell holding the price of the object represented by the image at the location `../img/gifts/img1.jpg`. \\
How does it work?
1. Select the image tag where `src="../img/gifts/img1.jpg"`.
2. Select the parent of that tag, in this case the `td` tag.
3. Select the `previous_sibling` of that `td` tag, in this case the `td` tag that contains the dollar value of the product.
4. Print that tag, whose text is "$15.00".
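
The same navigation can be sketched on a hypothetical one-row table, with a final `.get_text()` call added to reduce the cell to its text. The table is a simplified assumption, not a copy of the example page.

```python
from bs4 import BeautifulSoup

# Hypothetical, simplified row: a price cell followed by an image cell
snippet = ('<table><tr>'
           '<td>$15.00</td>'
           '<td><img src="../img/gifts/img1.jpg"/></td>'
           '</tr></table>')
soup = BeautifulSoup(snippet, 'html.parser')

img = soup.find('img', {'src': '../img/gifts/img1.jpg'})
price_cell = img.parent.previous_sibling   # the td just before the image's td
price = price_cell.get_text().strip()      # text only, surrounding whitespace removed

# .parents walks upward through every ancestor of the image
ancestors = [tag.name for tag in img.parents]

print(price)
print(ancestors)
```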

#Regular Expressions
*Regular expressions* are used to identify regular strings. \\
*Regular strings* are any strings that can be generated by a series of linear rules:
1. Write the letter $a$ at least once.
2. Append the letter $b$ exactly five times.
3. Append the letter $c$ any even number of times.
4. Write either the letter $d$ or $e$ at the end.

Examples are $aaaabbbbbccccd, aabbbbbcce$, and so on. \\
Regular expressions are merely a shorthand way of expressing these sets of rules. For example, the regular expression for the series of steps just described above can be written as


```
aa*bbbbb(cc)*(d|e)
```
This can be interpreted as
* `aa*`: the letter `a`, followed by `a*`, which means "any number of `a`s, including 0 of them." This guarantees that the letter `a` is written at least once.
* `bbbbb`: five `b`s in a row
* `(cc)*`: any even number of things can be grouped into pairs, so this allows any number of pairs of `c`s, including 0 pairs.
* `(d|e)`: add a `d` or an `e`
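
The expression can be checked against a few candidate strings with Python's built-in `re` module. This is a minimal sketch; the candidate strings beyond the two examples given earlier are made up for illustration, and `fullmatch` is used so that the whole string must follow the rules.

```python
import re

pattern = re.compile(r'aa*bbbbb(cc)*(d|e)')

candidates = [
    'aaaabbbbbccccd',  # follows all four rules
    'aabbbbbcce',      # follows all four rules
    'abbbbbd',         # zero pairs of c is still an even number
    'bbbbbd',          # no leading a
    'abbbbbcccd',      # odd number of c
    'abbbbd',          # only four b
]

# fullmatch succeeds only if the entire string matches the pattern
results = {s: bool(pattern.fullmatch(s)) for s in candidates}
for s, ok in results.items():
    print(s, ok)
```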

[RegEx Pal](https://www.regexpal.com/) can be used to test regular expressions on the fly. \\
The figure below lists commonly used regular expression symbols:

![alt text](https://github.com/lblogan14/web_scraping_with_python/blob/master/img/ch2/regex.JPG?raw=true)

When attempting to write any regular expression from scratch, it’s best to first make a
list of steps that concretely outlines what the target string looks like.

#Regular Expressions and BeautifulSoup
To grab the URLs of all the product images on the [example](http://www.pythonscraping.com/pages/page3.html) page, it is necessary to skip hidden images, blank images used for spacing and aligning elements, and other image tags that users never notice. \\
Assume that the layout of the page might change, so the position of an image in the page is not a reliable way to find the correct tag. The solution is to look for something identifying about the tag itself. In this case, the file path of the product image is the key:

In [0]:
from urllib.request import urlopen
from bs4 import BeautifulSoup
import re # regular expression module

In [0]:
html = urlopen('http://www.pythonscraping.com/pages/page3.html')
bs = BeautifulSoup(html, 'html.parser')

In [0]:
# Raw string avoids invalid-escape warnings; matches src values such as ../img/gifts/img1.jpg
images = bs.find_all('img',
                     {'src': re.compile(r'\.\./img/gifts/img.*\.jpg')})

In [33]:
for image in images:
  print(image['src'])

../img/gifts/img1.jpg
../img/gifts/img2.jpg
../img/gifts/img3.jpg
../img/gifts/img4.jpg
../img/gifts/img6.jpg


This prints only the relative image paths that start with *../img/gifts/img* and end in *.jpg*. A regular expression can be inserted as any argument in a BeautifulSoup expression, allowing users a great deal of flexibility in finding target elements.
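
To see which paths such a pattern accepts without fetching the page, it can be tried directly on a few strings. This is a sketch: the first two paths appear on the example page, the last two are made up for contrast, and `search` is used on the assumption that BeautifulSoup applies compiled patterns the same way.

```python
import re

image_path = re.compile(r'\.\./img/gifts/img.*\.jpg')

paths = [
    '../img/gifts/img1.jpg',   # product image
    '../img/gifts/logo.jpg',   # file name does not start with img
    '../img/gifts/img6.png',   # hypothetical: wrong extension
    'img/gifts/img2.jpg',      # hypothetical: missing the leading ../
]

matches = [p for p in paths if image_path.search(p)]
print(matches)
```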

#Accessing Attributes
For an `a` tag, the URL it points to is contained in the `href` attribute. \\
For an `img` tag, the target image is contained in the `src` attribute. \\
For any tag object, all of its attributes can be accessed by calling:

```
myTag.attrs
```

This returns a Python dictionary object. The source location for an image can be found using the following:


```
myImgTag.attrs['src']
```
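
A small sketch of attribute access on a hypothetical link that wraps an image (the `href` and `class` values are assumptions for illustration):

```python
from bs4 import BeautifulSoup

# Hypothetical markup used only for illustration
snippet = ('<a href="/gifts" class="nav main">'
           '<img src="../img/gifts/logo.jpg" style="float:left;"/></a>')
soup = BeautifulSoup(snippet, 'html.parser')

link_attrs = soup.a.attrs          # dictionary of every attribute on the tag
img_src = soup.img.attrs['src']    # look up one attribute through the dictionary
same_src = soup.img['src']         # shorthand for the same lookup
missing = soup.img.get('alt')      # .get returns None instead of raising KeyError

print(link_attrs)
print(img_src)
```

Multi-valued attributes such as `class` come back as a list of values rather than a single string.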




#Lambda Expressions
A *lambda expression* is a function that is passed into another function as a variable; instead of defining a function as $f(x,y)$, a function may be defined as $f(g(x), y)$ or even $f(g(x), h(x))$. \\
BeautifulSoup allows users to pass certain types of functions as parameters into the
`find_all` function. The only restriction is that these functions must take a tag object as an argument and return a boolean. Every tag object that BeautifulSoup encounters is evaluated in this
function, and tags that evaluate to `True` are returned, while the rest are discarded. \\
For example, the following retrieves all tags that have exactly two attributes:

In [34]:
bs.find_all(lambda tag: len(tag.attrs) == 2)

[<img src="../img/gifts/logo.jpg" style="float:left;"/>,
 <tr class="gift" id="gift1"><td>
 Vegetable Basket
 </td><td>
 This vegetable basket is the perfect gift for your health conscious (or overweight) friends!
 <span class="excitingNote">Now with super-colorful bell peppers!</span>
 </td><td>
 $15.00
 </td><td>
 <img src="../img/gifts/img1.jpg"/>
 </td></tr>,
 <tr class="gift" id="gift2"><td>
 Russian Nesting Dolls
 </td><td>
 Hand-painted by trained monkeys, these exquisite dolls are priceless! And by "priceless," we mean "extremely expensive"! <span class="excitingNote">8 entire dolls per set! Octuple the presents!</span>
 </td><td>
 $10,000.52
 </td><td>
 <img src="../img/gifts/img2.jpg"/>
 </td></tr>,
 <tr class="gift" id="gift3"><td>
 Fish Painting
 </td><td>
 If something seems fishy about this painting, it's because it's a fish! <span class="excitingNote">Also hand-painted by trained monkeys!</span>
 </td><td>
 $10,005.00
 </td><td>
 <img src="../img/gifts/img3.jpg"/>
 </td>

Lambda functions can also stand in for existing BeautifulSoup functions:

In [35]:
bs.find_all(lambda tag: tag.get_text() == 'Or maybe he\'s only resting?')

[<span class="excitingNote">Or maybe he's only resting?</span>]

This can also be accomplished without a lambda function:

In [36]:
bs.find_all('', text='Or maybe he\'s only resting?')

["Or maybe he's only resting?"]

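Note the difference in output between the two calls: the lambda version returns the matching tag objects, whereas the `text` version returns the matching strings themselves (`NavigableString` objects).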

Because the provided lambda functions can be any functions that return a `True` or
`False` value, they can be combined with regular expressions to find tags with
an attribute matching a certain string pattern.
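
One possible way to combine the two is sketched below on a hypothetical snippet. The `img\d+\.jpg$` pattern and the markup are assumptions chosen for illustration.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical markup: a product image, a logo, and an image with no src at all
snippet = ('<div>'
           '<img src="../img/gifts/img1.jpg"/>'
           '<img src="../img/gifts/logo.jpg"/>'
           '<img alt="spacer"/>'
           '</div>')
soup = BeautifulSoup(snippet, 'html.parser')

# File names of the form img<number>.jpg at the end of the path
gift_image = re.compile(r'img\d+\.jpg$')

# tag.get('src', '') supplies an empty string when the attribute is missing
found = soup.find_all(
    lambda tag: tag.name == 'img' and bool(gift_image.search(tag.get('src', ''))))

print([tag['src'] for tag in found])
```

The lambda returns a plain boolean for every tag, so tags without a `src` attribute are rejected rather than raising an error.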