# Beautiful Soup


<a href='http://www.crummy.com/software/BeautifulSoup/'>Beautiful Soup</a> is a Python library for pulling data out of HTML and XML files. It works with your favorite parser to provide idiomatic ways of navigating, searching, and modifying the parse tree. It commonly saves programmers hours or days of work.

# Quick Start

Here’s an HTML document I’ll be using as an example throughout this document. It’s part of a story from Alice in Wonderland:



In [1]:
html_doc = """<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
"""

Let's import the required library first. Running the “three sisters” document through Beautiful Soup then gives us a `BeautifulSoup` object, which represents the document as a nested data structure:

In [2]:
from bs4 import BeautifulSoup

In [3]:
soup = BeautifulSoup(html_doc,'html.parser')

In [4]:
type(soup)

bs4.BeautifulSoup

We can see that `soup` is a `BeautifulSoup` object.

In [5]:
print(soup.prettify())

<html>
 <head>
  <title>
   The Dormouse's story
  </title>
 </head>
 <body>
  <p class="title">
   <b>
    The Dormouse's story
   </b>
  </p>
  <p class="story">
   Once upon a time there were three little sisters; and their names were
   <a class="sister" href="http://example.com/elsie" id="link1">
    Elsie
   </a>
   ,
   <a class="sister" href="http://example.com/lacie" id="link2">
    Lacie
   </a>
   and
   <a class="sister" href="http://example.com/tillie" id="link3">
    Tillie
   </a>
   ;
and they lived at the bottom of a well.
  </p>
  <p class="story">
   ...
  </p>
 </body>
</html>


Here are some simple ways to navigate that data structure:

In [6]:
# Get the page's <title> tag.
soup.title

<title>The Dormouse's story</title>

In [10]:
soup.title.name

'title'

Let's get the exact text.

In [7]:
soup.title.text

"The Dormouse's story"

In [11]:
# .string returns the same text here
soup.title.string

"The Dormouse's story"

In [13]:
soup.title.parent.name

'head'

In [15]:
soup.p

<p class="title"><b>The Dormouse's story</b></p>

In [16]:
soup.p.string

"The Dormouse's story"

In [18]:
soup.find_all("p")

[<p class="title"><b>The Dormouse's story</b></p>,
 <p class="story">Once upon a time there were three little sisters; and their names were
 <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
 <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
 <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
 and they lived at the bottom of a well.</p>,
 <p class="story">...</p>]

In [19]:
soup.a

<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>

`soup.a` returns only the first `<a>` tag. To find all the links, call the `.find_all()` method.

In [20]:
soup.find_all('a')

[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
 <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
 <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

In [21]:
# Find a tag by its id attribute.
soup.find(id='link1')

<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>

In [22]:
soup.find(id='link2')

<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>

One common task is extracting all the URLs found within a page’s `<a>` tags:

In [24]:
for link in soup.find_all('a'):
    print(link.get('href'))

http://example.com/elsie
http://example.com/lacie
http://example.com/tillie


In [25]:
for link in soup.find_all('a'):
    print(link.get('class'))

['sister']
['sister']
['sister']


Another common task is extracting all the text from a page:

In [26]:
print(soup.get_text())

The Dormouse's story

The Dormouse's story
Once upon a time there were three little sisters; and their names were
Elsie,
Lacie and
Tillie;
and they lived at the bottom of a well.
...
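
As a small aside, `get_text()` also accepts a `separator` string and a `strip` flag, which can make the extracted text easier to post-process. The sketch below uses a short made-up snippet rather than the full story; the names `snippet`, `snippet_soup`, and the `/elsie`-style paths are illustrative only.

```python
from bs4 import BeautifulSoup

# A tiny stand-in document (hypothetical, not the full "three sisters" text).
snippet = "<p>Once upon a time <a href='/elsie'>Elsie</a> and <a href='/lacie'>Lacie</a></p>"
snippet_soup = BeautifulSoup(snippet, "html.parser")

# Default: the strings are concatenated exactly as they appear in the markup.
raw_text = snippet_soup.get_text()

# separator is placed between strings; strip=True trims each string first.
joined_text = snippet_soup.get_text(separator=" | ", strip=True)

print(repr(raw_text))
print(repr(joined_text))
```

Stripping each string before joining is often handy when the markup contains a lot of indentation or line breaks.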



Let's install the optional third-party parsers `lxml` and `html5lib` with pip:

In [27]:
! pip install lxml



In [28]:
! pip install html5lib



Beautiful Soup supports several parser libraries:

- Python’s html.parser
- lxml’s HTML parser
- lxml’s XML parser
- html5lib

I recommend you install and use lxml for speed. If you’re using a very old version of Python – earlier than 2.7.3 or 3.2.2 – it’s essential that you install lxml or html5lib. Python’s built-in HTML parser is just not very good in those old versions.


Note that if a document is invalid, different parsers will generate different Beautiful Soup trees for it. See <a href='https://www.crummy.com/software/BeautifulSoup/bs4/doc/#differences-between-parsers'>Differences between parsers</a> for details.
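
To see this for yourself, one option is to feed the same invalid fragment to each parser and compare the resulting trees. This is only a sketch: `invalid_markup` and `results` are illustrative names, and parsers that are not installed are simply recorded as `None` instead of raising an error.

```python
from bs4 import BeautifulSoup, FeatureNotFound

invalid_markup = "<a></p>"  # a closing </p> with no matching opening tag

results = {}
for parser_name in ("html.parser", "lxml", "html5lib"):
    try:
        results[parser_name] = str(BeautifulSoup(invalid_markup, parser_name))
    except FeatureNotFound:
        # The optional parser is not installed in this environment.
        results[parser_name] = None

for parser_name, tree in results.items():
    print(f"{parser_name:12} -> {tree}")
```

With Python's built-in parser the stray `</p>` is dropped and no `<html>` or `<body>` wrapper is added; the other parsers, when available, may repair the fragment differently.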

# Let's get started
### Making the soup

To parse a document, pass it into the BeautifulSoup constructor. You can pass in a string or an open filehandle:

In [30]:
with open('index.html') as fp:
    soup = BeautifulSoup(fp,'html.parser')

In [32]:
print(soup.get_text())

The Dormouse's story

The Dormouse's story
Once upon a time there were three little sisters; and their names were
Elsie,
Lacie and
Tillie;
and they lived at the bottom of a well.
...


In [3]:
html_doc = """<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
"""
soup = BeautifulSoup(html_doc,'html.parser')

In [34]:
print(soup.get_text())

The Dormouse's story

The Dormouse's story
Once upon a time there were three little sisters; and their names were
Elsie,
Lacie and
Tillie;
and they lived at the bottom of a well.
...



First, the document is converted to Unicode, and HTML entities are converted to Unicode characters:

In [35]:
print(BeautifulSoup("<html><head></head><body>Sacr&eacute; bleu!</body></html>", "html.parser"))

<html><head></head><body>Sacré bleu!</body></html>


## Kinds of objects

Beautiful Soup transforms a complex HTML document into a complex tree of Python objects. But you’ll only ever have to deal with about four kinds of objects: `Tag`, `NavigableString`, `BeautifulSoup`, and `Comment`.

### Tag

A Tag object corresponds to an XML or HTML tag in the original document:

In [38]:
tag = soup.b
tag

<b>The Dormouse's story</b>

Let's check the type of this.

In [39]:
type(tag)

bs4.element.Tag

Tags have a lot of attributes and methods. For now, the most important features of a tag are its name and attributes.

### Name

Every tag has a name, accessible as `.name`:

In [40]:
tag.name

'b'

If you change a tag’s name, the change will be reflected in any HTML markup generated by Beautiful Soup:

In [41]:
tag.name = 'blockquote'
tag

<blockquote>The Dormouse's story</blockquote>

## Attributes

A tag may have any number of attributes. The tag `<b id="boldest">` has an attribute “id” whose value is “boldest”. You can access a tag’s attributes by treating the tag like a dictionary:

In [42]:
tag = BeautifulSoup('<b id="boldest">bold</b>', 'html.parser').b
tag['id']

'boldest'

You can access that dictionary directly as `.attrs`:

In [43]:
tag.attrs

{'id': 'boldest'}

You can add, remove, and modify a tag’s attributes. Again, this is done by treating the tag as a dictionary:

In [44]:
tag['id'] = 'verybold'

In [45]:
tag.attrs

{'id': 'verybold'}

In [46]:
tag['tag-another-attribute'] = 'New attribute'

In [47]:
tag.attrs

{'id': 'verybold', 'tag-another-attribute': 'New attribute'}

Attributes can be removed with `del`:

In [49]:
del tag['id']

In [50]:
tag.attrs

{'tag-another-attribute': 'New attribute'}

In [51]:
del tag['tag-another-attribute']

In [52]:
tag.attrs

{}

In [53]:
tag

<b>bold</b>

In [54]:
tag['id']

KeyError: 'id'

In [56]:
print(tag.get('id'))

None
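
The difference between `tag['id']` and `tag.get('id')` matters when an attribute may be absent. A minimal sketch of defensive attribute access follows; `demo_tag` and the `"untitled"` fallback are illustrative names chosen for this example.

```python
from bs4 import BeautifulSoup

demo_tag = BeautifulSoup('<b id="boldest">bold</b>', "html.parser").b

present = demo_tag.get("id")                  # attribute exists
missing = demo_tag.get("title")               # absent: returns None, no KeyError
fallback = demo_tag.get("title", "untitled")  # absent: returns the given default
has_id = demo_tag.has_attr("id")              # explicit membership test

print(present, missing, fallback, has_id)
```

Using `.get()` avoids wrapping every lookup in a `try`/`except KeyError` block when scraping pages whose markup is inconsistent.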


# Multi-valued attributes

HTML 4 defines a few attributes that can have multiple values. HTML 5 removes a couple of them, but defines a few more. The most common multi-valued attribute is `class` (that is, a tag can have more than one CSS class). Others include `rel`, `rev`, `accept-charset`, `headers`, and `accesskey`. Beautiful Soup presents the value(s) of a multi-valued attribute as a list:



In [11]:
css_soup = BeautifulSoup('<p class="body">This is single value attribute</p>','html.parser')

In [12]:
css_soup

<p class="body">This is single value attribute</p>

In [14]:
css_soup.p['class']

['body']

Now let's create a tag whose `class` attribute has multiple values.

In [16]:
css_soup = BeautifulSoup('<p class="body strikeout">This is a multi-valued attribute</p>','html.parser')

In [17]:
css_soup.p['class']

['body', 'strikeout']

If an attribute looks like it has more than one value, but it’s not a multi-valued attribute as defined by any version of the HTML standard, Beautiful Soup will leave the attribute alone:

In [18]:
css_soup = BeautifulSoup('<p id="my id">This is not a multi valued attribute</p>','html.parser')

In [19]:
css_soup.p['id']

'my id'
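
If you would rather not care whether a given attribute is multi-valued, `get_attribute_list()` returns a list in both cases. The sketch below assumes a made-up tag (`attr_soup` is an illustrative name) that carries both kinds of attribute.

```python
from bs4 import BeautifulSoup

attr_soup = BeautifulSoup('<p id="my id" class="body strikeout"></p>', "html.parser")

# A single-valued attribute is wrapped in a one-element list.
id_values = attr_soup.p.get_attribute_list("id")

# A multi-valued attribute is already a list and is returned as-is.
class_values = attr_soup.p.get_attribute_list("class")

print(id_values)
print(class_values)
```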

When you turn a tag back into a string, multiple attribute values are consolidated:

In [20]:
rel_soup = BeautifulSoup('<p>Back to the <a rel="index">homepage</a></p>', 'html.parser')

In [21]:
rel_soup

<p>Back to the <a rel="index">homepage</a></p>

In [22]:
rel_soup.a['rel']

['index']

In [23]:
rel_soup.a['rel'] = ['index','contents']

In [24]:
rel_soup.a['rel']

['index', 'contents']

In [25]:
rel_soup

<p>Back to the <a rel="index contents">homepage</a></p>

You can disable this by passing `multi_valued_attributes=None` as a keyword argument into the BeautifulSoup constructor:

In [28]:
no_list_soup = BeautifulSoup('<p class="body strikeout"></p>','html.parser',multi_valued_attributes=None)

In [29]:
no_list_soup.p['class']

'body strikeout'

If you parse a document as XML, there are no multi-valued attributes:

In [30]:
xml_soup = BeautifulSoup('<p class="body strikeout"></p>','xml')
xml_soup.p['class']

'body strikeout'

Again, you can configure this using the `multi_valued_attributes` argument:

In [32]:
class_is_multi = {'*':'class'}
xml_soup = BeautifulSoup('<p class="body strikeout"></p>','xml', multi_valued_attributes=class_is_multi)

In [33]:
xml_soup.p['class']

['body', 'strikeout']

You probably won’t need to do this, but if you do, use the defaults as a guide. They implement the rules described in the HTML specification:

In [34]:
from bs4.builder import builder_registry
builder_registry.lookup('html').DEFAULT_CDATA_LIST_ATTRIBUTES

{'*': ['class', 'accesskey', 'dropzone'],
 'a': ['rel', 'rev'],
 'link': ['rel', 'rev'],
 'td': ['headers'],
 'th': ['headers'],
 'form': ['accept-charset'],
 'object': ['archive'],
 'area': ['rel'],
 'icon': ['sizes'],
 'iframe': ['sandbox'],
 'output': ['for']}

# NavigableString
A string corresponds to a bit of text within a tag. Beautiful Soup uses the NavigableString class to contain these bits of text:

In [35]:
soup = BeautifulSoup('<b class="boldest">Extremly bold</b>','html.parser')
soup

<b class="boldest">Extremly bold</b>

In [36]:
soup.b.string

'Extremly bold'

In [37]:
type(soup.b.string)

bs4.element.NavigableString

A NavigableString is just like a Python Unicode string. You can convert a NavigableString to a Unicode string with `str()` in Python 3 (or `unicode()` in Python 2):

In [38]:
unicode_string = str(soup.b.string)

In [39]:
unicode_string

'Extremly bold'

In [41]:
print(type(unicode_string))

<class 'str'>


You can’t edit a string in place, but you can replace one string with another, using `replace_with()`:

In [42]:
soup.b.string.replace_with("This is replaced string")

'Extremly bold'

In [43]:
soup.b.string

'This is replaced string'

In [44]:
soup

<b class="boldest">This is replaced string</b>

NavigableString supports most of the features described in `Navigating the tree` and `Searching the tree`, but not all of them.
In particular, since a string can’t contain anything (the way a tag may contain a string or another tag), strings don’t support the `.contents` or `.string` attributes, or the `find()` method.

If you want to use a NavigableString outside of Beautiful Soup, you should call `str()` on it (`unicode()` in Python 2) to turn it into a normal Python Unicode string. If you don’t, your string will carry around a reference to the entire Beautiful Soup parse tree, even when you’re done using Beautiful Soup. This is a big waste of memory.
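
The following sketch illustrates the point about references: the `NavigableString` still knows its parent tag, while the plain `str` copy does not. The names `bold_soup`, `nav_string`, and `plain_string` are illustrative.

```python
from bs4 import BeautifulSoup, NavigableString

bold_soup = BeautifulSoup("<b>Extremely bold</b>", "html.parser")

nav_string = bold_soup.b.string   # still attached to the parse tree
plain_string = str(nav_string)    # an ordinary Python string, detached from the tree

print(type(nav_string).__name__, "-> parent:", nav_string.parent.name)
print(type(plain_string).__name__, "-> has parent:", hasattr(plain_string, "parent"))
```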

# BeautifulSoup
The BeautifulSoup object represents the parsed document as a whole. For most purposes, you can treat it as a Tag object. This means it supports most of the methods described in Navigating the tree and Searching the tree.  You can also pass a BeautifulSoup object into one of the methods defined in Modifying the tree, just as you would a Tag. This lets you do things like combine two parsed documents:

In [45]:
doc = BeautifulSoup("<document><content/>INSERT FOOTER HERE</document>",'xml')

In [46]:
doc

<?xml version="1.0" encoding="utf-8"?>
<document><content/>INSERT FOOTER HERE</document>

In [47]:
footer = BeautifulSoup("<footer> Here's the footer</footer>",'xml')

In [48]:
footer

<?xml version="1.0" encoding="utf-8"?>
<footer> Here's the footer</footer>

In [51]:
# string= is the current name of the older text= argument
doc.find(string="INSERT FOOTER HERE").replace_with(footer)

'INSERT FOOTER HERE'

In [52]:
doc

<?xml version="1.0" encoding="utf-8"?>
<document><content/><footer> Here's the footer</footer></document>

In [53]:
footer

<?xml version="1.0" encoding="utf-8"?>

Since the BeautifulSoup object doesn’t correspond to an actual HTML or XML tag, it has no name and no attributes. But sometimes it’s useful to look at its `.name`, so it’s been given the special `.name` “[document]”:

In [54]:
soup.name

'[document]'

## Comments and other special strings

`Tag`, `NavigableString`, and `BeautifulSoup` cover almost everything you’ll see in an HTML or XML file, but there are a few leftover bits. The main one you’ll probably encounter is the comment:

In [57]:
markup = "<b><!--Hey, buddy. Want to buy a used parser?--></b>"
soup = BeautifulSoup(markup,'html.parser')

In [58]:
soup

<b><!--Hey, buddy. Want to buy a used parser?--></b>

In [59]:
soup.b.string

'Hey, buddy. Want to buy a used parser?'

In [60]:
type(soup.b.string)

bs4.element.Comment

The Comment object is just a special type of `NavigableString`:


In [61]:
comment = soup.b.string
comment

'Hey, buddy. Want to buy a used parser?'

But when it appears as part of an HTML document, a Comment is displayed with special formatting:

In [62]:
print(soup.b.prettify())

<b>
 <!--Hey, buddy. Want to buy a used parser?-->
</b>
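
Because a `Comment` is a distinct class, you can pick comments out of a document by type. Here is a minimal sketch, assuming a small made-up fragment; `comment_markup`, `comment_soup`, and the note texts are illustrative.

```python
from bs4 import BeautifulSoup, Comment

comment_markup = "<div><!-- first note --><p>Visible text</p><!-- second note --></div>"
comment_soup = BeautifulSoup(comment_markup, "html.parser")

# Select every string in the tree that is an instance of Comment.
comments = comment_soup.find_all(string=lambda s: isinstance(s, Comment))
comment_texts = [str(c) for c in comments]

# extract() removes each comment from the tree.
for c in comments:
    c.extract()

cleaned_markup = str(comment_soup)
print(comment_texts)
print(cleaned_markup)
```

This kind of filtering is one way to strip comments before extracting the visible text of a page.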


Beautiful Soup also defines classes called `Stylesheet`, `Script`, and `TemplateString`, for embedded CSS stylesheets (any strings found inside a `<style>` tag), embedded JavaScript (any strings found in a `<script>` tag), and HTML templates (any strings inside a `<template>` tag). These classes work exactly the same way as NavigableString; their only purpose is to make it easier to pick out the main body of the page, by ignoring strings that represent something else.

`Note:` These classes are new in Beautiful Soup 4.9.0, and the html5lib parser doesn’t use them.

Beautiful Soup defines classes for anything else that might show up in an XML document: `CData`, `ProcessingInstruction`, `Declaration`, and `Doctype`. Like Comment, these classes are subclasses of NavigableString that add something extra to the string. Here’s an example that replaces the comment with a CDATA block:


In [63]:
from bs4 import CData

In [64]:
cdata = CData("A C DATA BLOCK")

In [65]:
comment.replace_with(cdata)

'Hey, buddy. Want to buy a used parser?'

In [66]:
comment

'Hey, buddy. Want to buy a used parser?'

In [67]:
soup

<b><![CDATA[A C DATA BLOCK]]></b>

In [68]:
print(soup.b.prettify())

<b>
 <![CDATA[A C DATA BLOCK]]>
</b>


# Navigating the tree

In [69]:
html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
"""

In [70]:
soup = BeautifulSoup(html_doc,'html.parser')

In [71]:
soup


<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
</body></html>

### Going down
Tags may contain strings and other tags. These elements are the tag’s children. Beautiful Soup provides a lot of different attributes for navigating and iterating over a tag’s children.

Note that Beautiful Soup strings don’t support any of these attributes, because a string can’t have children.

#### Navigating using tag names

The simplest way to navigate the parse tree is to say the name of the tag you want. If you want the `<head>` tag, just say `soup.head`:


In [72]:
soup.head

<head><title>The Dormouse's story</title></head>

In [74]:
soup.a

<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>

But this returns only the first occurrence.

You can use this trick again and again to zoom in on a certain part of the parse tree. This code gets the first `<b>` tag beneath the `<body>` tag:

In [75]:
soup.body.b

<b>The Dormouse's story</b>

In [76]:
soup.body.b.string

"The Dormouse's story"

Using a tag name as an attribute will give you only the first tag by that name. If you need to get all the `<a>` tags, or anything more complicated than the first tag with a certain name, you’ll need to use one of the methods described in Searching the tree, such as `find_all()`:

In [77]:
soup.find_all('a')

[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
 <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
 <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

## .contents and .children

A tag’s children are available in a list called `.contents`:

In [78]:
head_tag = soup.head
head_tag

<head><title>The Dormouse's story</title></head>

In [81]:
head_tag.contents

[<title>The Dormouse's story</title>]

In [82]:
body_tag = soup.body
body_tag

<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
</body>

In [83]:
body_tag.contents

['\n',
 <p class="title"><b>The Dormouse's story</b></p>,
 '\n',
 <p class="story">Once upon a time there were three little sisters; and their names were
 <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
 <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
 <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
 and they lived at the bottom of a well.</p>,
 '\n',
 <p class="story">...</p>,
 '\n']

The BeautifulSoup object itself has children. In this case they are a newline string (because `html_doc` starts with a line break) and the `<html>` tag:

In [84]:
soup.contents

['\n',
 <html><head><title>The Dormouse's story</title></head>
 <body>
 <p class="title"><b>The Dormouse's story</b></p>
 <p class="story">Once upon a time there were three little sisters; and their names were
 <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
 <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
 <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
 and they lived at the bottom of a well.</p>
 <p class="story">...</p>
 </body></html>]

In [85]:
len(soup.contents)

2

In [87]:
soup.contents[1].name

'html'

A string does not have `.contents`, because it can’t contain anything:

In [4]:
title_tag = soup.title
text = title_tag.contents[0]  # the NavigableString inside <title>
text.contents

AttributeError: 'NavigableString' object has no attribute 'contents'

Instead of getting them as a list, you can iterate over a tag’s children using the .children generator:

In [6]:
for child in title_tag.children:
    print(child)

The Dormouse's story
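
In practice `.contents` and `.children` often include whitespace strings between tags, so a common pattern is to keep only the `Tag` children. A small sketch on a made-up list follows (`list_markup`, `list_soup`, and `ul_tag` are illustrative names).

```python
from bs4 import BeautifulSoup, Tag

list_markup = "<ul>\n<li>one</li>\n<li>two</li>\n</ul>"
list_soup = BeautifulSoup(list_markup, "html.parser")
ul_tag = list_soup.ul

# .contents is a list: newline strings and <li> tags interleaved.
all_children = ul_tag.contents

# .children is an iterator over the same items; keep only the tags.
tag_children = [child for child in ul_tag.children if isinstance(child, Tag)]

print(len(all_children))
print([child.get_text() for child in tag_children])
```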


## .descendants
The `.contents` and `.children` attributes only consider a tag’s direct children. For instance, the `<head>` tag has a single direct child–the `<title>` tag:

In [8]:
head_tag = soup.head
head_tag

<head><title>The Dormouse's story</title></head>

In [9]:
head_tag.contents

[<title>The Dormouse's story</title>]

But the `<title>` tag itself has a child: the string “The Dormouse’s story”. There’s a sense in which that string is also a child of the `<head>` tag. The `.descendants` attribute lets you iterate over all of a tag’s children, recursively: its direct children, the children of its direct children, and so on:

In [11]:
for child in head_tag.descendants:
    print(child)

<title>The Dormouse's story</title>
The Dormouse's story


In [12]:
type(head_tag.descendants)

generator

The `<head>` tag has only one child, but it has two descendants: the `<title>` tag and the `<title>` tag’s child. The BeautifulSoup object only has one direct child (the `<html>` tag), but it has a whole lot of descendants:

In [13]:
len(list(soup.children))

1

In [14]:
len(list(soup.descendants))

26
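
To get a feel for how descendants are counted, it can help to split them into tags and strings on a very small document. This is a sketch with illustrative names (`nested_soup`, `div_tag`), not the “three sisters” document.

```python
from bs4 import BeautifulSoup, NavigableString, Tag

nested_soup = BeautifulSoup("<div><p>Hello <b>world</b></p></div>", "html.parser")
div_tag = nested_soup.div

direct_children = list(div_tag.children)       # only the <p> tag
all_descendants = list(div_tag.descendants)    # <p>, 'Hello ', <b>, 'world'

descendant_tags = [d.name for d in all_descendants if isinstance(d, Tag)]
descendant_strings = [str(d) for d in all_descendants if isinstance(d, NavigableString)]

print(len(direct_children), len(all_descendants))
print(descendant_tags, descendant_strings)
```

Descendants are visited in document order, so each tag appears before the strings and tags it contains.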

# .string

If a tag has only one child, and that child is a NavigableString, the child is made available as `.string`:

In [15]:
title_tag.string

"The Dormouse's story"

If a tag’s only child is another tag, and that tag has a .string, then the parent tag is considered to have the same .string as its child:

In [17]:
head_tag.contents

[<title>The Dormouse's story</title>]

In [18]:
head_tag.string

"The Dormouse's story"

If a tag contains more than one thing, then it’s not clear what .string should refer to, so .string is defined to be None:

In [21]:
print(soup.html.string)

None
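
When `.string` is `None` because a tag has several children, `get_text()` is one way to still obtain the combined text. A minimal sketch, using a made-up fragment and the illustrative name `mixed_soup`:

```python
from bs4 import BeautifulSoup

mixed_soup = BeautifulSoup("<p>Hello <b>world</b></p>", "html.parser")

p_string = mixed_soup.p.string    # <p> has two children, so this is None
b_string = mixed_soup.b.string    # <b> has exactly one string child
p_text = mixed_soup.p.get_text()  # concatenates every string under <p>

print(p_string, repr(b_string), repr(p_text))
```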


## .strings and stripped_strings

If there’s more than one thing inside a tag, you can still look at just the strings. Use the .strings generator:

In [23]:
soup.strings

<generator object Tag._all_strings at 0x0000020E60B146D0>

In [24]:
type(soup.strings)

generator

In [26]:
for string in soup.strings:
    print(repr(string))

"The Dormouse's story"
'\n'
'\n'
"The Dormouse's story"
'\n'
'Once upon a time there were three little sisters; and their names were\n'
'Elsie'
',\n'
'Lacie'
' and\n'
'Tillie'
';\nand they lived at the bottom of a well.'
'\n'
'...'
'\n'


These strings tend to have a lot of extra whitespace, which you can remove by using the `.stripped_strings` generator instead:

In [29]:
for string in soup.stripped_strings:
    print(repr(string))

"The Dormouse's story"
"The Dormouse's story"
'Once upon a time there were three little sisters; and their names were'
'Elsie'
','
'Lacie'
'and'
'Tillie'
';\nand they lived at the bottom of a well.'
'...'


Here, strings consisting entirely of whitespace are ignored, and whitespace at the beginning and end of strings is removed.
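
A typical use of `.stripped_strings` is to collapse a block of markup into a clean list or a single line of text. The sketch below assumes a small indented fragment; `indented_markup`, `indented_soup`, and `one_line` are illustrative names.

```python
from bs4 import BeautifulSoup

indented_markup = "<div>\n  <h1> Title </h1>\n  <p> Body text </p>\n</div>"
indented_soup = BeautifulSoup(indented_markup, "html.parser")

# Whitespace-only strings are skipped; the rest are trimmed.
pieces = list(indented_soup.stripped_strings)

# Join the cleaned pieces into one line.
one_line = " ".join(pieces)

print(pieces)
print(one_line)
```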

## Going up

Continuing the “family tree” analogy, every tag and every string has a parent: the tag that contains it.

### .parent

You can access an element’s parent with the .parent attribute. In the example “three sisters” document, the `<head>` tag is the parent of the `<title>` tag:

In [30]:
title = soup.title
title

<title>The Dormouse's story</title>

In [31]:
title.parent

<head><title>The Dormouse's story</title></head>

The title string itself has a parent: the `<title>` tag that contains it:

In [32]:
title.string.parent

<title>The Dormouse's story</title>

The parent of a top-level tag like `<html>` is the BeautifulSoup object itself:

In [34]:
html_tag = soup.html
html_tag

<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
</body></html>

In [35]:
html_tag.parent

<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
</body></html>

In [36]:
type(html_tag.parent)

bs4.BeautifulSoup

And the .parent of a BeautifulSoup object is defined as None:

In [38]:
print(soup.parent)

None


## .parents

You can iterate over all of an element’s parents with `.parents`. This example uses `.parents` to travel from an `<a>` tag buried deep within the document, to the very top of the document:



In [39]:
link = soup.a
link

<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>

In [41]:
link.parents

<generator object PageElement.parents at 0x0000020E60DC9270>

In [43]:
for parent in link.parents:
    print(parent.name)

p
body
html
[document]
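
The same iteration can be used to build a breadcrumb-style path to an element, and `find_parent()` jumps straight to the nearest ancestor with a given name. This is a sketch on a reduced document; `path_soup`, `deep_link`, and `breadcrumb` are illustrative names.

```python
from bs4 import BeautifulSoup

path_soup = BeautifulSoup(
    "<html><body><p><a href='#'>link</a></p></body></html>", "html.parser"
)
deep_link = path_soup.a

# Collect ancestor names from the nearest parent up to the document root.
ancestor_names = [parent.name for parent in deep_link.parents]

# Reverse them to read from the top of the document down to the element.
breadcrumb = " > ".join(reversed(ancestor_names))

# find_parent() returns the closest ancestor matching the given name.
nearest_body = deep_link.find_parent("body")

print(breadcrumb)
print(nearest_body.name)
```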


## Going sideways


Consider a simple document like this:

In [44]:
sibling_soup = BeautifulSoup("<a><b>text1</b><c>text2</c></a>", 'html.parser')
print(sibling_soup.prettify())

<a>
 <b>
  text1
 </b>
 <c>
  text2
 </c>
</a>


The `<b>` tag and the `<c>` tag are at the same level: they’re both direct children of the same tag. We call them siblings. When a document is pretty-printed, siblings show up at the same indentation level. You can also use this relationship in the code you write.

## .next_sibling and .previous_sibling

You can use `.next_sibling` and `.previous_sibling` to navigate between page elements that are on the same level of the parse tree:

In [45]:
sibling_soup.b

<b>text1</b>

In [46]:
sibling_soup.b.parent

<a><b>text1</b><c>text2</c></a>

In [47]:
sibling_soup.c.parent

<a><b>text1</b><c>text2</c></a>

In [48]:
sibling_soup.b.next_sibling

<c>text2</c>

In [49]:
sibling_soup.c.previous_sibling

<b>text1</b>

In [51]:
print(sibling_soup.c.next_sibling)

None


In [52]:
print(sibling_soup.b.previous_sibling)

None


The `<b>` tag has a `.next_sibling`, but no `.previous_sibling`, because there’s nothing before the `<b>` tag on the same level of the tree. For the same reason, the `<c>` tag has a `.previous_sibling` but no `.next_sibling`.

The strings “text1” and “text2” are not siblings, because they don’t have the same parent:

In [53]:
sibling_soup.b.string

'text1'

In [54]:
sibling_soup.c.string

'text2'

In [57]:
print(sibling_soup.c.string.next_sibling)

None


In [58]:
print(sibling_soup.c.string.previous_sibling)

None


In real documents, the .next_sibling or .previous_sibling of a tag will usually be a string containing whitespace. Going back to the “three sisters” document:
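
A minimal sketch of that behaviour, using a shortened two-link excerpt of the document (the names `excerpt`, `excerpt_soup`, and `first_link` are illustrative):

```python
from bs4 import BeautifulSoup

excerpt = '<p><a id="link1">Elsie</a>,\n<a id="link2">Lacie</a></p>'
excerpt_soup = BeautifulSoup(excerpt, "html.parser")
first_link = excerpt_soup.a

# The immediate sibling is the comma and newline between the two links.
text_sibling = first_link.next_sibling

# Stepping once more reaches the second <a> tag.
second_link = first_link.next_sibling.next_sibling

# find_next_sibling() skips the intervening string and matches by tag name.
same_link = first_link.find_next_sibling("a")

print(repr(text_sibling))
print(second_link["id"], same_link["id"])
```

When you only care about tags, `find_next_sibling()` and `find_previous_sibling()` are usually more convenient than chaining `.next_sibling`.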

The remaining navigation topics are covered in the Beautiful Soup documentation: https://www.crummy.com/software/BeautifulSoup/bs4/doc/

## Searching the tree

Beautiful Soup defines a lot of methods for searching the parse tree, but they’re all very similar. I’m going to spend a lot of time explaining the two most popular methods: `find()` and `find_all()`. The other methods take almost exactly the same arguments, so I’ll just cover them briefly.



By passing in a filter to a method like `find_all()`, you can zoom in on the parts of the document you’re interested in.

## Kinds of filters 
Before talking in detail about find_all() and similar methods, I want to show examples of different filters you can pass into these methods. These filters show up again and again, throughout the search API. You can use them to filter based on a tag’s name, on its attributes, on the text of a string, or on some combination of these.

### A string

The simplest filter is a string. Pass a string to a search method and Beautiful Soup will perform a match against that exact string. This code finds all the `<b>` tags in the document:
    

In [61]:
soup.find_all('b')

[<b>The Dormouse's story</b>]

In [62]:
soup.find_all('a')

[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
 <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
 <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

If you pass in a byte string, Beautiful Soup will assume the string is encoded as UTF-8. You can avoid this by passing in a Unicode string instead.

## A regular expression

If you pass in a regular expression object, Beautiful Soup will filter against that regular expression using its `search()` method. This code finds all the tags whose names start with the letter `“b”`; in this case, the `<body>` tag and the `<b>` tag:

In [63]:
import re
soup.find_all(re.compile('^b'))

[<body>
 <p class="title"><b>The Dormouse's story</b></p>
 <p class="story">Once upon a time there were three little sisters; and their names were
 <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
 <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
 <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
 and they lived at the bottom of a well.</p>
 <p class="story">...</p>
 </body>,
 <b>The Dormouse's story</b>]

In [65]:
for tag in soup.find_all(re.compile('^b')):
    print(tag.name)

body
b


This code finds all the tags whose names contain the letter ‘t’:

In [67]:
for tag in soup.find_all(re.compile('t')):
    print(tag.name)

html
title


## A list

If you pass in a list, Beautiful Soup will allow a string match against any item in that list. This code finds all the `<a>` tags and all the `<b>` tags:

In [68]:
soup.find_all(['a','b'])

[<b>The Dormouse's story</b>,
 <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
 <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
 <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

## True

The value True matches everything it can. This code finds all the tags in the document, but none of the text strings:

In [69]:
for tag in soup.find_all(True):
    print(tag.name)

html
head
title
body
p
b
p
a
a
a
p


## A function

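If none of the other filters fit, you can pass a function that takes an element as its only argument and returns True when the element matches. Read more about this in the documentation: https://www.crummy.com/software/BeautifulSoup/bs4/doc/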
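
As a rough illustration of a function filter, the sketch below selects tags that define a `class` attribute but no `id`. The fragment and the names `filter_soup` and `has_class_but_no_id` are chosen for this example.

```python
from bs4 import BeautifulSoup

filter_soup = BeautifulSoup(
    '<p class="title">T</p>'
    '<p class="story" id="s1">S</p>'
    '<a class="sister" id="link1">E</a>'
    "<b>plain</b>",
    "html.parser",
)


def has_class_but_no_id(tag):
    """Return True for tags that have a class attribute and no id attribute."""
    return tag.has_attr("class") and not tag.has_attr("id")


# The function is called once per tag; tags for which it returns True are kept.
matches = filter_soup.find_all(has_class_but_no_id)
print(matches)
```

A function filter is mainly useful when the condition combines several attributes or depends on an element's neighbours, which string, list, and regular expression filters cannot express.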

In [70]:
soup.find(string=re.compile("sisters"))

'Once upon a time there were three little sisters; and their names were\n'

Please refer to the documentation for more details:

https://www.crummy.com/software/BeautifulSoup/bs4/doc/