In [2]:
from urllib.request import urlopen
from bs4 import BeautifulSoup

In [3]:
html = urlopen('http://www.pythonscraping.com/pages/page1.html')
bs = BeautifulSoup(html.read(), 'html.parser')
# html.read() returns the raw HTML content of the page
# html.parser is the parser included with Python 3; it requires no extra installation.
print(bs.h1)      

<h1>An Interesting Title</h1>


In [4]:
bs.h1

<h1>An Interesting Title</h1>

In [5]:
bs.html.body.h1

<h1>An Interesting Title</h1>

In [6]:
bs.body.h1

<h1>An Interesting Title</h1>

In [7]:
bs.html.h1

<h1>An Interesting Title</h1>
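The four expressions above all reach the same tag because attribute access on a BeautifulSoup object returns the first matching descendant. A minimal offline sketch of this, assuming an inline HTML string (`sample_html`, a hypothetical stand-in for page1.html) so that no network call is needed:

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for page1.html so the sketch runs without a network call.
sample_html = """
<html>
<head><title>A Useful Page</title></head>
<body>
<h1>An Interesting Title</h1>
<div>Some body text.</div>
</body>
</html>
"""

bs = BeautifulSoup(sample_html, "html.parser")

# Attribute access returns the first matching descendant, so every path
# below resolves to the same <h1> tag.
paths = [bs.h1, bs.html.body.h1, bs.body.h1, bs.html.h1]
for tag in paths:
    print(tag)

# A tag name that does not occur anywhere yields None rather than an error.
print(bs.h2)
```

The shorter forms are convenient, but the longer paths document where in the tree the tag is expected to live.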

In [8]:
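html = urlopen('http://www.pythonscraping.com/pages/page1.html')
bs = BeautifulSoup(html.read(), 'lxml')
# lxml is a popular third-party parser; it must be installed separately (pip install lxml).
# The page is re-opened first because the earlier html.read() already consumed the response.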

#### Connecting Reliably and Handling Exceptions

In [9]:
html = urlopen('http://www.pythonscraping.com/pages/page1.html')

##### Two main things can go wrong in the line above:
- The page is not found on the server (or there was an error in retrieving it): urlopen raises HTTPError
- The server is not found: urlopen raises URLError

In [10]:
from urllib.error import HTTPError, URLError

In [13]:
try:
    html = urlopen('http://www.pythonscraping.com/pages/page1.html')
except HTTPError as e:
    # The server returned an error status: return None, break, or fall back to some other "Plan B"
    pass
except URLError as e:
    print('The server could not be found')
else:
    # The program continues. Note: if you return or break in an except block, the "else" clause is not needed.
    pass
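The two failure modes can be wrapped in one helper. The sketch below is an illustration rather than the notebook's own code: `fetch` is a hypothetical name, and the `file:` URL is only there to trigger a URLError without touching the network. Because HTTPError is a subclass of URLError, it has to be caught first.

```python
from urllib.request import urlopen
from urllib.error import HTTPError, URLError


def fetch(url):
    """Return the response body as bytes, or None if the request fails.

    `fetch` is a hypothetical helper name used only for this sketch.
    """
    try:
        with urlopen(url, timeout=10) as response:
            return response.read()
    except HTTPError as e:
        # The server answered, but with an error status such as 404 or 500.
        print(f"HTTP error {e.code} for {url}")
    except URLError as e:
        # No usable response at all: unknown host, refused connection, etc.
        print(f"Could not reach {url}: {e.reason}")
    return None


# A file: URL that points nowhere raises URLError without any network access.
result = fetch("file:///no/such/file-for-this-sketch.html")
print(result)
```

Returning None on failure keeps the calling code simple, at the cost of hiding which error occurred; re-raising or logging may suit a larger scraper better.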

In [None]:
try: 
    badContent = bs.nonExistingTag.anotherTag
except AttributeError as e:
    print('Tag was not found')
else:
    if badContent is None:
        print('Tag was not found')
    else:
        print(badContent)
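Both checks in the cell above are needed because a missing tag can surface in two ways. A small sketch on an inline HTML string (an assumption, chosen so the example runs offline) shows each case separately:

```python
from bs4 import BeautifulSoup

bs = BeautifulSoup("<html><body><h1>Title</h1></body></html>", "html.parser")

# Accessing a tag that does not exist returns None ...
missing = bs.nonExistingTag
print(missing)

# ... and asking None for a further attribute raises AttributeError.
raised = False
try:
    missing.anotherTag
except AttributeError:
    raised = True
    print("Tag was not found")
```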

In [14]:
def getTitle(url):
    try: 
        html = urlopen(url)
    except HTTPError as e:
        return None
    try:
        bs = BeautifulSoup(html.read(), "html.parser")
        title = bs.body.h1
    except AttributeError as e:
        return None
    return title
title = getTitle('http://www.pythonscraping.com/pages/page1.html')
if title is None:
    print('Title could not be found')
else:
    print(title) 

<h1>An Interesting Title</h1>


### Chapter 2: ADVANCED HTML PARSING
In this section, we'll discuss searching for tags by attributes, working with lists of tags, and navigating parse trees.

### 2.1 find() and find_all() with BeautifulSoup

In [15]:
html = urlopen("https://www.pythonscraping.com/pages/warandpeace.html")
bs = BeautifulSoup(html.read(), "html.parser")

In [16]:
nameList = bs.findAll('span', {'class':'green'})
for name in nameList:
    print(name.get_text())
# tag name: span; attribute filter: class="green" (findAll is the older alias of find_all)

Anna
Pavlovna Scherer
Empress Marya
Fedorovna
Prince Vasili Kuragin
Anna Pavlovna
St. Petersburg
the prince
Anna Pavlovna
Anna Pavlovna
the prince
the prince
the prince
Prince Vasili
Anna Pavlovna
Anna Pavlovna
the prince
Wintzingerode
King of Prussia
le Vicomte de Mortemart
Montmorencys
Rohans
Abbe Morio
the Emperor
the prince
Prince Vasili
Dowager Empress Marya Fedorovna
the baron
Anna Pavlovna
the Empress
the Empress
Anna Pavlovna's
Her Majesty
Baron
Funke
The prince
Anna
Pavlovna
the Empress
The prince
Anatole
the prince
The prince
Anna
Pavlovna
Anna Pavlovna


find_all(tag, attributes, recursive, text, limit, keywords) <br>
find(tag, attributes, recursive, text, keywords)

In [22]:
nameList = bs.find_all(text = 'the prince')
print(len(nameList))
# The text argument matches on the text content of the tags rather than on properties of the tags themselves
# (Beautiful Soup 4.4.0 and later also accept this argument as string=)

7


In [23]:
nameList

['the prince',
 'the prince',
 'the prince',
 'the prince',
 'the prince',
 'the prince',
 'the prince']

The limit argument is used only in the find_all method; find is equivalent to the same find_all call with a limit of 1.<br>
The keyword arguments allow you to select tags that contain a particular attribute or set of attributes.
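The limit and keyword arguments can be tried on a small inline document. The markup below is hypothetical, loosely modelled on the warandpeace.html page, and is only an assumption made so the sketch runs offline:

```python
from bs4 import BeautifulSoup

# Hypothetical markup, loosely modelled on the warandpeace.html page.
sample_html = """
<div id="text">
<span class="green">Anna Pavlovna</span> greeted
<span class="red">"Well, Prince"</span>
<span class="green">the prince</span> and
<span class="green" id="last">Prince Vasili</span>
</div>
"""
bs = BeautifulSoup(sample_html, "html.parser")

# limit caps the number of results returned by find_all.
first_two = bs.find_all("span", {"class": "green"}, limit=2)
print([tag.get_text() for tag in first_two])

# find(...) gives the same tag as the only element of find_all(..., limit=1).
same_tag = bs.find("span", {"class": "green"}) is bs.find_all(
    "span", {"class": "green"}, limit=1
)[0]
print(same_tag)

# Keyword arguments filter on attributes; class_ avoids the reserved word "class".
# Several keyword filters must all match (a logical AND).
both = bs.find_all(class_="green", id="last")
print([tag.get_text() for tag in both])
```

Because the keyword filters combine with AND, a query such as id="title" together with class_="text" returns an empty list whenever no single tag carries both attributes.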

In [24]:
title = bs.find_all(id="title", class_ = 'text')
# This returns a list of all tags that have "text" in the class attribute and "title" in the id attribute

[]

In [30]:
bs.findAll('', {'class':'green'})

[]

### Other BeautifulSoup Objects
- BeautifulSoup objects: instances such as bs in the examples above <br>
- Tag objects: retrieved in lists, or individually, by calling find and find_all on a BeautifulSoup object, or by drilling down, e.g. bs.div.h1 <br>
- NavigableString objects: used to represent text within tags, rather than the tags themselves <br>
- Comment objects: used to find HTML comments in comment tags
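The four object types can be inspected directly. This is a minimal sketch on a made-up one-line document (`sample_html` is an assumption, not a page from the notes):

```python
from bs4 import BeautifulSoup, Comment, NavigableString, Tag

sample_html = "<div><h1>Heading</h1><!-- a hidden note --></div>"
bs = BeautifulSoup(sample_html, "html.parser")

h1 = bs.div.h1                 # Tag: reached by drilling down from the root
heading_text = h1.string       # NavigableString: the text inside the tag
note = bs.div.contents[-1]     # Comment: the last child of the div

for obj in (bs, h1, heading_text, note):
    print(type(obj).__name__)
```

Comment is a subclass of NavigableString, so an isinstance check against NavigableString also accepts comments; test for Comment first when the two need to be told apart.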

### 2.2 Navigating Trees
Children and descendants: 
- Children are always exactly one tag below a parent, whereas descendants can be at any level in the tree 
- All children are descendants, but not all descendants are children

Example:
- bs.body.h1: selects the first h1 tag that is a descendant of the body tag. It will not find tags located outside the body
- bs.div.find_all('img'): finds the first div tag in the document, then retrieves a list of all img tags that are descendants of that div tag (note to self: bs.div returns only the first div tag, not a list of tags)
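The difference between children and descendants shows up clearly on a small tree. The markup here is a hypothetical example chosen for the sketch:

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

sample_html = "<div><p>One <b>bold</b> word</p><img src='a.jpg'/></div>"
bs = BeautifulSoup(sample_html, "html.parser")

# .children yields only the nodes directly below the div.
children = [node.name for node in bs.div.children if isinstance(node, Tag)]
# .descendants walks the whole subtree, at every depth.
descendants = [node.name for node in bs.div.descendants if isinstance(node, Tag)]

print(children)
print(descendants)
```

The b tag appears only in the second list: it is a descendant of the div, but a child of the p tag. The isinstance filter skips text nodes, which both iterators also yield.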

If you want to find only descendants that are children, you can use the .children attribute:

In [32]:
html = urlopen("https://www.pythonscraping.com/pages/page3.html")
bs = BeautifulSoup(html, "html.parser")
for child in bs.find('table', {'id':'giftList'}).children:
    print(child) 
# 'table' is the tag name; {'id': 'giftList'} is the attribute filter
# find() returns the first matching table tag, and the loop prints each of its direct children (the rows)



<tr><th>
Item Title
</th><th>
Description
</th><th>
Cost
</th><th>
Image
</th></tr>


<tr class="gift" id="gift1"><td>
Vegetable Basket
</td><td>
This vegetable basket is the perfect gift for your health conscious (or overweight) friends!
<span class="excitingNote">Now with super-colorful bell peppers!</span>
</td><td>
$15.00
</td><td>
<img src="../img/gifts/img1.jpg"/>
</td></tr>


<tr class="gift" id="gift2"><td>
Russian Nesting Dolls
</td><td>
Hand-painted by trained monkeys, these exquisite dolls are priceless! And by "priceless," we mean "extremely expensive"! <span class="excitingNote">8 entire dolls per set! Octuple the presents!</span>
</td><td>
$10,000.52
</td><td>
<img src="../img/gifts/img2.jpg"/>
</td></tr>


<tr class="gift" id="gift3"><td>
Fish Painting
</td><td>
If something seems fishy about this painting, it's because it's a fish! <span class="excitingNote">Also hand-painted by trained monkeys!</span>
</td><td>
$10,005.00
</td><td>
<img src="../img/gifts/img3.jpg"/>


In [33]:
for sibling in bs.find('table',{'id':'giftList'}).tr.next_siblings:
    print(sibling) 



<tr class="gift" id="gift1"><td>
Vegetable Basket
</td><td>
This vegetable basket is the perfect gift for your health conscious (or overweight) friends!
<span class="excitingNote">Now with super-colorful bell peppers!</span>
</td><td>
$15.00
</td><td>
<img src="../img/gifts/img1.jpg"/>
</td></tr>


<tr class="gift" id="gift2"><td>
Russian Nesting Dolls
</td><td>
Hand-painted by trained monkeys, these exquisite dolls are priceless! And by "priceless," we mean "extremely expensive"! <span class="excitingNote">8 entire dolls per set! Octuple the presents!</span>
</td><td>
$10,000.52
</td><td>
<img src="../img/gifts/img2.jpg"/>
</td></tr>


<tr class="gift" id="gift3"><td>
Fish Painting
</td><td>
If something seems fishy about this painting, it's because it's a fish! <span class="excitingNote">Also hand-painted by trained monkeys!</span>
</td><td>
$10,005.00
</td><td>
<img src="../img/gifts/img3.jpg"/>
</td></tr>


<tr class="gift" id="gift4"><td>
Dead Parrot
</td><td>
This is an ex-parr

Make selections specific: be as specific as possible when selecting tags, because page layouts change all the time. A tag that is first on the page today may someday be second or third.

In [34]:
#next_sibling and previous_sibling return a single tag rather than a list of them
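A compact sketch of the sibling attributes, assuming a whitespace-free, hypothetical version of the gift table. On the real page the line breaks between rows are themselves string siblings, which is why the output above contains blank lines.

```python
from bs4 import BeautifulSoup

# Hypothetical, whitespace-free version of the giftList table.
sample_html = (
    "<table id='giftList'>"
    "<tr><th>Item Title</th></tr>"
    "<tr id='gift1'><td>Vegetable Basket</td></tr>"
    "<tr id='gift2'><td>Russian Nesting Dolls</td></tr>"
    "</table>"
)
bs = BeautifulSoup(sample_html, "html.parser")
header_row = bs.find("table", {"id": "giftList"}).tr

# next_siblings is a generator over every following sibling ...
following_ids = [row["id"] for row in header_row.next_siblings]
print(following_ids)
# ... while next_sibling is the single node immediately after.
print(header_row.next_sibling["id"])
# The first child has nothing before it, so previous_sibling is None.
print(header_row.previous_sibling)
```

Starting from the header row and iterating over next_siblings is a convenient way to skip the column labels and process only the data rows.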

#### Dealing with parents

In [36]:
bs.find('img', {'src':'../img/gifts/img1.jpg'}).parent.previous_sibling.get_text()

'\n$15.00\n'
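The one-line lookup above can be unpacked step by step. The sketch assumes a single hypothetical table row without whitespace between cells, so the result has no surrounding newlines:

```python
from bs4 import BeautifulSoup

# Hypothetical single-row table in the layout of page3.html.
sample_html = (
    "<table><tr>"
    "<td>Vegetable Basket</td>"
    "<td>$15.00</td>"
    "<td><img src='../img/gifts/img1.jpg'/></td>"
    "</tr></table>"
)
bs = BeautifulSoup(sample_html, "html.parser")

img = bs.find("img", {"src": "../img/gifts/img1.jpg"})  # 1. select the image
image_cell = img.parent                                 # 2. the <td> wrapping it
price_cell = image_cell.previous_sibling                # 3. the <td> just before
price = price_cell.get_text()                           # 4. the text of that cell
print(price)
```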

#### Regular Expressions