# BEAUTIFUL SOUP [DOCS](https://beautiful-soup-4.readthedocs.io/en/latest/)

# <b>1. Objects</b>

Beautiful Soup transforms a complex HTML document into a complex tree of Python objects. But you’ll only ever have to deal with about four kinds of objects: 
- `Tag`, 
- `NavigableString`, 
- `BeautifulSoup`, and 
- `Comment`.

# 1.1 Tag

A `Tag` object corresponds to an XML or HTML tag in the original document:

In [9]:
from bs4 import BeautifulSoup

soup = BeautifulSoup('<b class="boldest">Extremely bold</b>')
tag = soup.b
print(type(tag))

<class 'bs4.element.Tag'>


Tags have a lot of attributes and methods, and I’ll cover most of them in [Navigating the tree](#2.-Navigating-the-tree) and [Searching the tree](#3.-Searching-the-tree). For now, the most important features of a tag are its 
- **name** and 
- **attributes**.

## `tag.name`

In [10]:
tag.name

'b'

If you change a tag’s name, the change will be reflected in any HTML markup generated by Beautiful Soup:

In [11]:
tag.name = "blockquote"
tag

<blockquote class="boldest">Extremely bold</blockquote>

## Attributes

A tag may have any number of attributes. The tag `<b class="boldest">` has an attribute “`class`” whose value is “`boldest`”. You can access a tag’s attributes by treating the tag like a dictionary:

In [19]:
tag = soup.b
print(tag)
tag['class']

<b class="boldest">Extremely bold</b>


['boldest']

In [20]:
tag.attrs

{'class': ['boldest']}

You can add, remove, and modify a tag’s attributes. Again, this is done by treating the tag as a dictionary:

In [23]:
tag['id'] = 'verybold'
tag['another-attribute'] = 1
tag

<b another-attribute="1" class="boldest" id="verybold">Extremely bold</b>

In [24]:
del tag['id']
del tag["another-attribute"]
tag

<b class="boldest">Extremely bold</b>
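Dictionary-style access raises `KeyError` when the attribute is absent, so it can help to check first or to ask for a default. The following is a minimal sketch of those lookups; the choice of the built-in `html.parser` backend is an assumption made only to keep the example free of extra dependencies.

```python
from bs4 import BeautifulSoup

# Assumption: html.parser is used so the sketch runs without lxml installed.
soup = BeautifulSoup('<b class="boldest">Extremely bold</b>', "html.parser")
tag = soup.b

tag["id"] = "verybold"                # add an attribute
has_id = tag.has_attr("id")           # the attribute now exists
missing = tag.get("title")            # None instead of a KeyError
fallback = tag.get("title", "n/a")    # explicit default value

del tag["id"]                         # remove the attribute again
print(tag.attrs)
```

`tag.get()` mirrors `dict.get()`, which is convenient when scraping pages where an attribute is only sometimes present.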

### Multi-valued attributes

HTML 4 defines a few attributes that can have multiple values. HTML 5 removes a couple of them, but defines a few more. The most common multi-valued attribute is `class` (that is, a tag can have more than one CSS class). Others include `rel`, `rev`, `accept-charset`, `headers`, and `accesskey`. Beautiful Soup presents the value(s) of a multi-valued attribute as a list.

Beautiful Soup's defaults implement the rules described in the HTML specification:

In [1]:
from bs4.builder import builder_registry

builder_registry.lookup('html').DEFAULT_CDATA_LIST_ATTRIBUTES

{'*': ['class', 'accesskey', 'dropzone'],
 'a': ['rel', 'rev'],
 'link': ['rel', 'rev'],
 'td': ['headers'],
 'th': ['headers'],
 'form': ['accept-charset'],
 'object': ['archive'],
 'area': ['rel'],
 'icon': ['sizes'],
 'iframe': ['sandbox'],
 'output': ['for']}

In [26]:
css_soup = BeautifulSoup('<p class="body strikeout"></p>')
css_soup.p.attrs

{'class': ['body', 'strikeout']}

In [27]:
css_soup.p["class"]

['body', 'strikeout']

If an attribute looks like it has more than one value, but it’s not a multi-valued attribute as defined by any version of the HTML standard, Beautiful Soup will leave the attribute alone:

In [28]:
id_soup = BeautifulSoup('<p id="my id"></p>')
id_soup.p['id']

'my id'

When you turn a tag back into a string, multiple attribute values are consolidated:

In [34]:
rel_soup = BeautifulSoup('<p>Back to the <a rel="index">homepage</a></p>')
print(rel_soup.a)

rel_soup.a["rel"] = ['index', 'content']
print(rel_soup.a)

<a rel="index">homepage</a>
<a rel="index content">homepage</a>


You can turn off multi-valued attribute handling, so that every attribute value is stored as a single string, by passing `multi_valued_attributes=None` as a keyword argument into the `BeautifulSoup` constructor:

In [37]:
no_list_soup = BeautifulSoup(
    '<p class="body strikeout"></p>', 'html', 
        multi_valued_attributes=None)

no_list_soup.p["class"]

'body strikeout'

You can use `get_attribute_list` to get a value that’s always a list, whether or not it’s a multi-valued attribute:

In [38]:
id_soup.p.get_attribute_list("id")

['my id']
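To see why `get_attribute_list` is handy, compare it with plain indexing on one tag that has both a single-valued and a multi-valued attribute. This is a small sketch, assuming the built-in `html.parser` backend:

```python
from bs4 import BeautifulSoup

# Assumption: html.parser is used so the sketch runs without lxml installed.
soup = BeautifulSoup('<p id="my id" class="body strikeout"></p>', "html.parser")
p = soup.p

raw_id = p["id"]                          # a single string
raw_class = p["class"]                    # a list of CSS classes
ids = p.get_attribute_list("id")          # always a list
classes = p.get_attribute_list("class")   # always a list

print(raw_id, raw_class, ids, classes)
```

Code that loops over attribute values can then treat both kinds of attribute the same way.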

If you parse a document as XML, there are no multi-valued attributes, because it does not use the HTML specification:

In [39]:
xml_soup = BeautifulSoup('<p class="body strikeout"></p>', 'xml')
xml_soup.p["class"]

'body strikeout'

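By contrast, the `lxml` HTML parser does follow the HTML specification, so `class` is multi-valued again:

In [43]:
lxml_soup = BeautifulSoup('<p class="body strikeout"></p>', 'lxml')
lxml_soup.p["class"]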

['body', 'strikeout']

Again, you can configure this using the `multi_valued_attributes` argument:

In [42]:
class_is_multi = {'*': 'class'}
xml_soup = BeautifulSoup('<p class="body strikeout"></p>', 
                         'xml',
                         multi_valued_attributes=class_is_multi)
xml_soup.p["class"]

['body', 'strikeout']

# 1.2 NavigableString

A string corresponds to a bit of text within a tag. Beautiful Soup uses the `NavigableString` class to contain these bits of text:

In [13]:
from bs4 import BeautifulSoup

soup = BeautifulSoup('<b class="boldest">Extremely bold</b>')
tag = soup.b
tag.string

'Extremely bold'

In [14]:
print(type(tag.string))

<class 'bs4.element.NavigableString'>


A `NavigableString` is just like a Python Unicode string, except that it also supports some of the features described in [Navigating the tree](#2.-Navigating-the-tree) and [Searching the tree](#3.-Searching-the-tree). You can convert a `NavigableString` to a plain Python string with `str()`:

In [15]:
unicode_string = str(tag.string)
unicode_string

'Extremely bold'

> Note: the `unicode()` built-in exists only in Python 2. In Python 3 every `str` is already Unicode, and calling `unicode()` raises `NameError: name 'unicode' is not defined`.

You can’t edit a string in place, but you can replace one string with another, using [`replace_with()`](#):

In [16]:
old_text = tag.string.replace_with("Rot Front!")
print(old_text)
print(tag)

Extremely bold
<b class="boldest">Rot Front!</b>
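The sketch below illustrates what `replace_with()` leaves behind: it returns the string that was removed, and that string is detached from the tree. It also shows that a `NavigableString` behaves like an ordinary `str`. The `html.parser` backend is an assumption.

```python
from bs4 import BeautifulSoup

# Assumption: html.parser is used so the sketch runs without lxml installed.
soup = BeautifulSoup('<b class="boldest">Extremely bold</b>', "html.parser")
tag = soup.b

old_text = tag.string.replace_with("Rot Front!")

is_str = isinstance(old_text, str)    # NavigableString is a str subclass
detached = old_text.parent is None    # the removed string left the tree
upper = tag.string.upper()            # ordinary str methods still work

print(old_text, tag, upper)
```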


`NavigableString` supports most of the features described in [Navigating the tree](#2.-Navigating-the-tree) and [Searching the tree](#3.-Searching-the-tree), but not all of them. In particular, since a string can’t contain anything (the way a tag may contain a string or another tag), strings don’t support the `.contents` or `.string` attributes, or the `find()` method.

# 1.3 BeautifulSoup

The `BeautifulSoup` object represents the parsed document as a whole. For most purposes, you can treat it as a `Tag` object. This means it supports most of the methods described in [Navigating the tree](#2.-Navigating-the-tree) and [Searching the tree](#3.-Searching-the-tree).

You can also pass a `BeautifulSoup` object into one of the methods defined in [Modifying the tree](Modifying-the-tree), just as you would a `Tag`. This lets you do things like combine two parsed documents:

In [20]:
doc = BeautifulSoup("<document><content/>INSERT FOOTER HERE</document>", "xml")
footer = BeautifulSoup("<footer>Here's the footer</footer>", "xml")

doc.find(string="INSERT FOOTER HERE").replace_with(footer)

'INSERT FOOTER HERE'

In [21]:
print(doc)

<?xml version="1.0" encoding="utf-8"?>
<document><content/><footer>Here's the footer</footer></document>


Since the `BeautifulSoup` object doesn’t correspond to an actual HTML or XML tag, it has no name and no attributes. But sometimes it’s useful to look at its `.name`, so it’s been given the special `.name` “`[document]`”:

In [22]:
soup.name

'[document]'

# 1.4 Comments and other special strings

`Tag`, `NavigableString`, and `BeautifulSoup` cover almost everything you’ll see in an HTML or XML file, but there are a few leftover bits. The only one you’ll probably ever need to worry about is the comment:

In [24]:
markup = "<b><!--Hey, buddy. Want to buy a used parser?--></b>"

soup = BeautifulSoup(markup)
comment = soup.b.string
print(type(comment))

<class 'bs4.element.Comment'>


The `Comment` object is just a special type of `NavigableString`:

In [25]:
comment

'Hey, buddy. Want to buy a used parser?'

But when it appears as part of an HTML document, a `Comment` is displayed with special formatting:

In [26]:
print(soup.b.prettify())

<b>
 <!--Hey, buddy. Want to buy a used parser?-->
</b>



Beautiful Soup defines classes for anything else that might show up in an XML document: 
- `CData`, 
- `ProcessingInstruction`, 
- `Declaration`, and 
- `Doctype`. 

Just like `Comment`, these classes are subclasses of `NavigableString` that add something extra to the string. Here’s an example that replaces the comment with a CDATA block.

> _ChatGPT:_ A **CDATA (Character Data)** section is a block of text in an XML document that the parser does not interpret as markup. It's often used to include scripts or other content that might contain characters that would otherwise be treated as markup.

In [34]:
from bs4 import CData

markup = "<b><!--Hey, buddy. Want to buy a used parser?--></b>"
soup = BeautifulSoup(markup)
comment = soup.b.string

cdata = CData("A CDATA block")
comment.replace_with(cdata)

print(soup.b.prettify())

<b>
 <![CDATA[A CDATA block]]>
</b>
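Because `Comment` is a subclass of `NavigableString`, an `isinstance` check is enough to pick comments out of a document. Here is a hedged sketch that collects and then strips them; the sample markup and the `html.parser` backend are assumptions made for illustration.

```python
from bs4 import BeautifulSoup, Comment

# Hypothetical markup with two comments; html.parser avoids extra dependencies.
markup = "<p>Visible<!-- hidden note --></p><!-- another -->"
soup = BeautifulSoup(markup, "html.parser")

# A function passed as `string` receives each string in the document.
comments = soup.find_all(string=lambda s: isinstance(s, Comment))
comment_texts = [str(c).strip() for c in comments]

# extract() removes each comment from the tree.
for c in comments:
    c.extract()

cleaned = str(soup)
print(comment_texts, cleaned)
```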



# <b>2. Navigating the tree</b>

Here’s the “Three sisters” HTML document:

In [2]:
from bs4 import BeautifulSoup

html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
"""

soup = BeautifulSoup(html_doc, 'lxml')

# 2.1 Going down

Tags may contain strings and other tags. These elements are the tag’s **children**. Beautiful Soup provides a lot of different attributes for navigating and iterating over a tag’s children.

> Note that Beautiful Soup strings don’t support any of these attributes, because a string can’t have children.

## Navigating using tag names

The simplest way to navigate the parse tree is to say the name of the tag you want. If you want the `<head>` tag, just say `soup.head`:

In [37]:
soup.head

<head><title>The Dormouse's story</title></head>

In [38]:
soup.head.string

"The Dormouse's story"

In [39]:
soup.title

<title>The Dormouse's story</title>

You can use this trick again and again to zoom in on a certain part of the parse tree. This code gets the first `<b>` tag beneath the `<body>` tag:

In [40]:
soup.body.b

<b>The Dormouse's story</b>

Using a tag name as an attribute will give you only the **_first tag_** by that name:

In [41]:
soup.a

<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>

If you need to get all the `<a>` tags, or anything more complicated than the first tag with a certain name, you’ll need to use one of the methods described in [Searching the tree](#3.-Searching-the-tree), such as `find_all()`:

In [42]:
soup.find_all("a")

[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
 <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
 <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

In [48]:
for sister in (soup.find_all("a")):
    print(sister.string)

Elsie
Lacie
Tillie
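One practical consequence of attribute-style navigation: when no tag with that name exists, the result is `None`, so chaining a second name onto it fails. A minimal sketch, assuming a small made-up document and the `html.parser` backend:

```python
from bs4 import BeautifulSoup

# Hypothetical two-link document; html.parser avoids extra dependencies.
html = "<p><a id='link1'>Elsie</a><a id='link2'>Lacie</a></p>"
soup = BeautifulSoup(html, "html.parser")

first_link = soup.a
same = first_link is soup.find("a")   # soup.a is shorthand for find("a")

no_table = soup.table                 # there is no <table>, so this is None
# soup.table.tr would raise AttributeError, so guard the chain:
cell = soup.table.tr if soup.table is not None else None

print(first_link["id"], same, no_table, cell)
```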


## `.contents` and `.children`

### `.contents`

A tag’s children are available in a **list** called `.contents`:

In [49]:
head_tag = soup.head
head_tag

<head><title>The Dormouse's story</title></head>

In [53]:
head_tag.contents

[<title>The Dormouse's story</title>]

In [57]:
title_tag = head_tag.contents[0]
title_tag

<title>The Dormouse's story</title>

In [58]:
title_tag.contents

["The Dormouse's story"]

The `BeautifulSoup` object itself has children. In this case, the `<html>` tag is the child of the `BeautifulSoup` object:

In [78]:
len(soup.contents)

1

In [79]:
soup.contents[0].name

'html'

A string does not have `.contents`, because it can’t contain anything:

In [80]:
text = title_tag.contents[0]
try:
    text.contents
except Exception as e:
    print(e)

'NavigableString' object has no attribute 'contents'


### `.children`

Instead of getting them as a list, you can iterate over a tag’s children using the `.children` **iterator**:

In [81]:
for child in title_tag.children:
    print(child)

The Dormouse's story


In [82]:
text = title_tag.children
text

<list_iterator at 0x7fc79a89bc10>

In [83]:
text = title_tag.contents
text

["The Dormouse's story"]
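In real markup, the children of a tag usually include whitespace strings as well as tags. The sketch below shows one way to keep only the tag children; the sample markup and the `html.parser` backend are assumptions.

```python
from bs4 import BeautifulSoup, Tag

# Hypothetical markup with newlines between the child tags.
html = "<body>\n<p>one</p>\n<p>two</p>\n</body>"
soup = BeautifulSoup(html, "html.parser")
body = soup.body

all_children = list(body.children)   # tags and whitespace strings
tag_children = [c for c in body.children if isinstance(c, Tag)]

# find_all(recursive=False) also looks only at direct children, tags only.
same_via_search = body.find_all(recursive=False)

print(len(all_children), [t.name for t in tag_children])
```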

## `.descendants`

The `.contents` and `.children` attributes only consider a tag’s direct children. For instance, the `<head>` tag has a single direct child – the `<title>` tag:

In [84]:
head_tag.contents

[<title>The Dormouse's story</title>]

But the `<title>` tag itself has a child: the string “The Dormouse’s story”. There’s a sense in which that string is also a child of the `<head>` tag. The `.descendants` attribute lets you iterate over all of a tag’s children, recursively: its direct children, the children of its direct children, and so on:

In [85]:
for child in head_tag.descendants:
    print(child)

<title>The Dormouse's story</title>
The Dormouse's story


The `<head>` tag has only one child, but it has two descendants: the `<title>` tag and the `<title>` tag’s child. The `BeautifulSoup` object only has one direct child (the `<html>` tag), but it has a whole lot of descendants:

In [86]:
len(list(soup.children))

1

In [87]:
len(list(soup.descendants))

26
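To make the child/descendant distinction concrete, this small sketch splits the descendants of a `<head>` tag into tags and strings. It assumes the `html.parser` backend and uses only the `<head>` fragment of the example document.

```python
from bs4 import BeautifulSoup, NavigableString, Tag

# Assumption: html.parser is used so the sketch runs without lxml installed.
soup = BeautifulSoup("<head><title>The Dormouse's story</title></head>", "html.parser")
head = soup.head

direct_children = len(head.contents)
tag_names = [d.name for d in head.descendants if isinstance(d, Tag)]
texts = [str(d) for d in head.descendants if isinstance(d, NavigableString)]

print(direct_children, tag_names, texts)
```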

## `.string`

If a tag has only one child, and that child is a `NavigableString`, the child is made available as `.string`:

In [89]:
title_tag

<title>The Dormouse's story</title>

In [90]:
title_tag.string

"The Dormouse's story"

In [95]:
print(soup.a.prettify())

<a class="sister" href="http://example.com/elsie" id="link1">
 Elsie
</a>



In [94]:
soup.a.string

'Elsie'

If a tag’s only child is another tag, and that tag has a `.string`, then the parent tag is considered to have the same `.string` as its child:

In [98]:
head_tag

<head><title>The Dormouse's story</title></head>

In [99]:
head_tag.string

"The Dormouse's story"

If a tag contains more than one thing, then it’s not clear what `.string` should refer to, so `.string` is defined to be `None`:

In [100]:
soup.html

<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
</body></html>

In [102]:
print(soup.html.string)

None
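When `.string` is `None` because a tag has several children, `get_text()` is a common way to get all the text underneath it as one string. A minimal sketch, assuming a made-up paragraph and the `html.parser` backend:

```python
from bs4 import BeautifulSoup

# Hypothetical paragraph with mixed text and tag children.
soup = BeautifulSoup("<p>Hello <b>big</b> world</p>", "html.parser")
p = soup.p

single = p.string           # None: <p> has three children
full_text = p.get_text()    # text of every descendant, concatenated
bold_text = p.b.string      # <b> has exactly one string child

print(single, full_text, bold_text)
```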


## `.strings` and `.stripped_strings`

If there’s more than one thing inside a tag, you can still look at just the strings. Use the `.strings` **generator**:

In [103]:
for string in soup.strings:
    print(repr(string))

"The Dormouse's story"
'\n'
'\n'
"The Dormouse's story"
'\n'
'Once upon a time there were three little sisters; and their names were\n'
'Elsie'
',\n'
'Lacie'
' and\n'
'Tillie'
';\nand they lived at the bottom of a well.'
'\n'
'...'
'\n'


In [104]:
for string in soup.strings:
    print(string)

The Dormouse's story




The Dormouse's story


Once upon a time there were three little sisters; and their names were

Elsie
,

Lacie
 and

Tillie
;
and they lived at the bottom of a well.


...




_ChatGPT:_  
If you simply want to see the content, use `print(string)`. If you need a more detailed view for debugging or understanding special characters, use `print(repr(string))`.

These strings tend to have a lot of extra whitespace, which you can remove by using the `.stripped_strings` generator instead:

In [105]:
for string in soup.stripped_strings:
    print(repr(string))

"The Dormouse's story"
"The Dormouse's story"
'Once upon a time there were three little sisters; and their names were'
'Elsie'
','
'Lacie'
'and'
'Tillie'
';\nand they lived at the bottom of a well.'
'...'


Here, strings consisting entirely of whitespace are ignored, and whitespace at the beginning and end of strings is removed.
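A typical use of `.stripped_strings` is to flatten a messy fragment into one clean line. The following sketch assumes a made-up fragment and the `html.parser` backend:

```python
from bs4 import BeautifulSoup

# Hypothetical fragment with irregular whitespace around each piece of text.
html = "<p>  One  <b> two </b>\n three </p>"
soup = BeautifulSoup(html, "html.parser")

pieces = list(soup.stripped_strings)   # each string with outer whitespace removed
line = " ".join(pieces)

print(pieces, line)
```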

# 2.2 Going up

Continuing the “family tree” analogy, every tag and every string has a **parent**: the tag that contains it.

## `.parent`

You can access an element’s parent with the `.parent` attribute. In the example “three sisters” document, the `<head>` tag is the parent of the `<title>` tag:

In [3]:
title_tag = soup.title
title_tag

<title>The Dormouse's story</title>

In [4]:
title_tag.parent

<head><title>The Dormouse's story</title></head>

The title string itself has a parent: the `<title>` tag that contains it:

In [6]:
title_tag.string

"The Dormouse's story"

In [5]:
title_tag.string.parent

<title>The Dormouse's story</title>

The parent of a top-level tag like `<html>` is the `BeautifulSoup` object itself:

In [9]:
html_tag = soup.html
print(type(html_tag.parent))

<class 'bs4.BeautifulSoup'>


And the `.parent` of a `BeautifulSoup` object is defined as `None`:

In [10]:
print(soup.parent)

None


## `.parents`

You can iterate over all of an element’s parents with `.parents`. This example uses `.parents` to travel from an `<a>` tag buried deep within the document, to the very top of the document:

In [13]:
link = soup.a
link

<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>

In [19]:
for parent in link.parents:
    # The last parent yielded is the BeautifulSoup object, named "[document]"
    print(parent.name)

p
body
html
[document]
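The same iteration can be turned into a root-to-element path, which is sometimes useful for debugging where an element sits. This is only a sketch: `path_to` is a hypothetical helper, not part of Beautiful Soup, and the `html.parser` backend is an assumption.

```python
from bs4 import BeautifulSoup

# Hypothetical minimal document; html.parser avoids extra dependencies.
html = "<html><body><p><a href='#'>Elsie</a></p></body></html>"
soup = BeautifulSoup(html, "html.parser")
link = soup.a

def path_to(element):
    """Return tag names from the document root down to `element` (hypothetical helper)."""
    names = [parent.name for parent in element.parents]
    names.reverse()                  # .parents walks upward, so flip the order
    return names + [element.name]

path = path_to(link)
print(" > ".join(path))
```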


# 2.3 Going sideways

Consider a simple document like this:

In [20]:
sibling_soup = BeautifulSoup("<a><b>text1</b><c>text2</c></a>")
print(sibling_soup.prettify())

<html>
 <body>
  <a>
   <b>
    text1
   </b>
   <c>
    text2
   </c>
  </a>
 </body>
</html>



The `<b>` tag and the `<c>` tag are at the same level: they’re both direct children of the same tag. We call them **siblings**. When a document is pretty-printed, siblings show up at the same indentation level. You can also use this relationship in the code you write.

## `.next_sibling` and `.previous_sibling`

You can use `.next_sibling` and `.previous_sibling` to navigate between page elements that are on the same level of the parse tree:

In [21]:
sibling_soup.b.next_sibling

<c>text2</c>

In [24]:
sibling_soup.c.previous_sibling

<b>text1</b>

The `<b>` tag has a `.next_sibling`, but no `.previous_sibling`, because there’s nothing before the `<b>` tag on the same level of the tree. For the same reason, the `<c>` tag has a `.previous_sibling` but no `.next_sibling`:

In [26]:
print(sibling_soup.b.previous_sibling)
print(sibling_soup.c.next_sibling)

None
None


The strings “text1” and “text2” are not siblings, because they don’t have the same parent:

In [28]:
print(sibling_soup.b.string)
print(sibling_soup.b.string.next_sibling)

text1
None


In real documents, the `.next_sibling` or `.previous_sibling` of a tag will usually be a string containing whitespace. Going back to the “three sisters” document:

In [34]:
print(soup.a.parent)

<p class="story">Once upon a time there were three little sisters; and their names were
<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>


You might think that the `.next_sibling` of the first `<a>` tag would be the second `<a>` tag. But actually, it’s a string: the comma and newline that separate the first `<a>` tag from the second:

In [45]:
link1 = soup.a
link1

<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>

In [46]:
comma = link1.next_sibling
comma

',\n'

The second `<a>` tag is actually the `.next_sibling` of the comma:

In [47]:
link2 = comma.next_sibling
link2

<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>

In [48]:
link2.next_sibling

' and\n'
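If what you want is the next tag rather than the string in between, `find_next_sibling()` skips over the intervening strings. A minimal sketch, assuming a trimmed-down version of the sisters paragraph and the `html.parser` backend:

```python
from bs4 import BeautifulSoup

# Trimmed-down, hypothetical version of the "three sisters" paragraph.
html = ('<p><a id="link1">Elsie</a>,\n'
        '<a id="link2">Lacie</a> and\n'
        '<a id="link3">Tillie</a></p>')
soup = BeautifulSoup(html, "html.parser")
link1 = soup.a

raw_next = link1.next_sibling              # the string between the links
next_link = link1.find_next_sibling("a")   # the next <a> sibling

print(repr(raw_next), next_link)
```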

## `.next_siblings` and `.previous_siblings`

You can iterate over a tag’s siblings with `.next_siblings` or `.previous_siblings`:

In [49]:
for sibling in soup.a.next_siblings:
    print(repr(sibling))

',\n'
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
' and\n'
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
';\nand they lived at the bottom of a well.'


In [50]:
for sibling in soup.find(id="link3").previous_siblings:
    print(repr(sibling))

' and\n'
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
',\n'
<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
'Once upon a time there were three little sisters; and their names were\n'


In [56]:
print(soup.find(class_='story').prettify())

<p class="story">
 Once upon a time there were three little sisters; and their names were
 <a class="sister" href="http://example.com/elsie" id="link1">
  Elsie
 </a>
 ,
 <a class="sister" href="http://example.com/lacie" id="link2">
  Lacie
 </a>
 and
 <a class="sister" href="http://example.com/tillie" id="link3">
  Tillie
 </a>
 ;
and they lived at the bottom of a well.
</p>



# 2.4 Going back and forth

Take a look at the beginning of the “three sisters” document:

```html
<html><head><title>The Dormouse's story</title></head>
<p class="title"><b>The Dormouse's story</b></p>
```

An HTML parser takes this string of characters and turns it into a series of events: 
- “open an `<html>` tag”, 
- “open a `<head>` tag”,
- “open a `<title>` tag”, 
- “add a string”, 
- “close the `<title>` tag”, 
- “open a `<p>` tag”, 

and so on. Beautiful Soup offers tools for reconstructing the initial parse of the document.

## `.next_element` and `.previous_element`

The `.next_element` attribute of a string or tag points to whatever was parsed immediately afterwards. It might be the same as `.next_sibling`, but it’s usually drastically different.

Here’s the final `<a>` tag in the “three sisters” document. Its `.next_sibling` is a string: the conclusion of the sentence that was interrupted by the start of the `<a>` tag:

In [60]:
last_a_tag = soup.find("a", id="link3")
print(last_a_tag)

<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>


In [61]:
print(last_a_tag.next_sibling)

;
and they lived at the bottom of a well.


But the `.next_element` of that `<a>` tag, the thing that was parsed immediately after the `<a>` tag, is not the rest of that sentence: it’s the word “Tillie”:

In [62]:
last_a_tag.next_element

'Tillie'

In [56]:
print(soup.find(class_='story').prettify())

<p class="story">
 Once upon a time there were three little sisters; and their names were
 <a class="sister" href="http://example.com/elsie" id="link1">
  Elsie
 </a>
 ,
 <a class="sister" href="http://example.com/lacie" id="link2">
  Lacie
 </a>
 and
 <a class="sister" href="http://example.com/tillie" id="link3">
  Tillie
 </a>
 ;
and they lived at the bottom of a well.
</p>



That’s because in the original markup, the word “Tillie” appeared before that semicolon. The parser encountered 
- an `<a>` tag, then 
- the word “Tillie”, then 
- the closing `</a>` tag, then 
- the semicolon and rest of the sentence. 

The semicolon is on the same level as the `<a>` tag, but the word “Tillie” was encountered first.

The `.previous_element` attribute is the exact opposite of `.next_element`. It points to whatever element was parsed immediately before this one:

In [63]:
last_a_tag.previous_element

' and\n'

## `.next_elements` and `.previous_elements`

You should get the idea by now. You can use these iterators to move forward or backward in the document as it was parsed:

In [64]:
for el in last_a_tag.next_elements:
    print(repr(el))

'Tillie'
';\nand they lived at the bottom of a well.'
'\n'
<p class="story">...</p>
'...'
'\n'


# <b>3. Searching the tree</b>

Beautiful Soup defines a lot of methods for searching the parse tree, but they’re all very similar. I’m going to spend a lot of time explaining the two most popular methods: `find()` and `find_all()`. The other methods take almost exactly the same arguments, so I’ll just cover them briefly.

Once again, I’ll be using the “three sisters” document as an example:

In [65]:
from bs4 import BeautifulSoup

html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
"""

soup = BeautifulSoup(html_doc, 'lxml')

By passing in a **filter** to a method like `find_all()`, you can zoom in on the parts of the document you’re interested in.

# 3.1 Filters

Before talking in detail about `find_all()` and similar methods, I want to show examples of different filters you can pass into these methods. These filters show up again and again, throughout the search API. You can use them to filter based 
- on a tag’s name, 
- on its attributes, 
- on the text of a string,

or on some combination of these.

## A string

The simplest filter is a **string**. Pass a string to a search method and Beautiful Soup will perform a match against that exact string. This code finds all the `<a>` tags in the document:

In [66]:
soup.find_all("a")

[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
 <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
 <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

If you pass in a byte string, Beautiful Soup will assume the string is encoded as UTF-8. You can avoid this by passing in a Unicode string instead.

## A regular expression

If you pass in a regular expression object, Beautiful Soup will filter against that regular expression using its `search()` method. This code finds all the tags whose names start with the letter “b”; in this case, the `<body>` tag and the `<b>` tag:

In [75]:
import re

for tag in soup.find_all(re.compile("^b")):
    print(tag.name)

body
b


This code finds all the tags whose names contain the letter ‘t’:

In [74]:
import re

for tag in soup.find_all(re.compile("t")):
    print(tag.name)

html
title
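Since the pattern is applied with `search()`, it matches anywhere in the tag name unless you anchor it. The sketch below contrasts a loose pattern with an anchored one; the sample markup and the `html.parser` backend are assumptions.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical markup mixing headings with other tags that start with "h".
html = "<h1>A</h1><h2>B</h2><p>C</p><header>D</header>"
soup = BeautifulSoup(html, "html.parser")

# "^h" matches every tag name that starts with "h".
starts_with_h = [t.name for t in soup.find_all(re.compile("^h"))]

# Anchoring both ends restricts the match to h1 through h6.
headings_only = [t.name for t in soup.find_all(re.compile(r"^h[1-6]$"))]

print(starts_with_h, headings_only)
```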


## A list

If you pass in a list, Beautiful Soup will allow a string match against any item in that list. This code finds all the `<a>` tags and all the `<b>` tags:

In [76]:
lst = ["a", "b"]
soup.find_all(lst)

[<b>The Dormouse's story</b>,
 <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
 <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
 <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

## True

The value `True` matches everything it can. This code finds all the tags in the document, but none of the text strings:

In [77]:
for tag in soup.find_all(True):
    print(tag.name)

html
head
title
body
p
b
p
a
a
a
p


## A function

If none of the other matches work for you, define a function that takes an element as its only argument. The function should return `True` if the argument matches, and `False` otherwise.

Here’s a function that returns `True` if a tag defines the “class” attribute but doesn’t define the “id” attribute. Pass this function into `find_all()` and you’ll pick up all the `<p>` tags:

In [78]:
def has_class_but_no_id(tag):
    return tag.has_attr("class") and not tag.has_attr("id")

soup.find_all(has_class_but_no_id)

[<p class="title"><b>The Dormouse's story</b></p>,
 <p class="story">Once upon a time there were three little sisters; and their names were
 <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
 <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
 <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
 and they lived at the bottom of a well.</p>,
 <p class="story">...</p>]

This function only picks up the `<p>` tags. It doesn’t pick up the `<a>` tags, because those tags define both “class” and “id”. It doesn’t pick up tags like `<html>` and `<title>`, because those tags don’t define “class”.

If you pass in a function to filter on a specific attribute like `href`, the argument passed into the function will be the attribute value, not the whole tag. Here’s a function that finds all tags whose `href` attribute does not match a regular expression:

In [80]:
def not_lacie(href):
    return href and not re.compile("lacie").search(href)

soup.find_all(href=not_lacie)

[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
 <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

The function can be as complicated as you need it to be. Here’s a function that returns `True` if a tag is surrounded by string objects:

In [81]:
from bs4 import NavigableString

def surrounded_by_strings(tag):
    return (isinstance(tag.next_element, NavigableString)
               and isinstance(tag.previous_element, NavigableString))

for tag in soup.find_all(surrounded_by_strings):
    print(tag.name)

body
p
a
a
a
p


Now we’re ready to look at the search methods in detail.

# 3.2 `find_all()`

Signature: `find_all(name, attrs, recursive, string, limit, **kwargs)`

The `find_all()` method looks through a tag’s descendants and retrieves all descendants that match your filters. I gave several examples in [Filters](#3.1-Filters), but here are a few more:

In [82]:
soup.find_all("title")

[<title>The Dormouse's story</title>]

In [83]:
soup.find_all("p", "title")

[<p class="title"><b>The Dormouse's story</b></p>]

In [84]:
soup.find_all("a")

[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
 <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
 <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

In [85]:
soup.find_all(id="link2")

[<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]

In [86]:
import re

soup.find_all(string=re.compile("sist"))

['Once upon a time there were three little sisters; and their names were\n']

Some of these should look familiar, but others are new. What does it mean to pass in a value for `string`, or `id`? Why does `find_all("p", "title")` find a `<p>` tag with the CSS `class` “title”? Let’s look at the arguments to `find_all()`.

## The `name` argument

Pass in a value for `name` and you’ll tell Beautiful Soup to only consider tags with certain names. Text strings will be ignored, as will tags whose names don’t match.

This is the simplest usage:

In [87]:
soup.find_all("title")

[<title>The Dormouse's story</title>]

Recall from [Filters](#3.1-Filters) that the value to `name` can be 
- a string, 
- a regular expression, 
- a list, 
- a function, or 
- the value `True`.

## The keyword arguments

Any argument that’s not recognized will be turned into a filter on one of a tag’s attributes. If you pass in a value for an argument called `id`, Beautiful Soup will filter against each tag’s ‘id’ attribute:

In [88]:
soup.find_all(id="link2")

[<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]

If you pass in a value for `href`, Beautiful Soup will filter against each tag’s ‘href’ attribute:

In [89]:
soup.find_all(href=re.compile("sie"))

[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>]

You can filter an attribute based on 
- a string, 
- a regular expression, 
- a list, 
- a function, or 
- the value `True`.

This code finds all tags whose `id` attribute has a value, regardless of what the value is:

In [90]:
soup.find_all(id=True)

[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
 <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
 <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
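
The list and function filters work on attributes in the same way. The sketch below is illustrative only: the extra attribute-less `<a>` tag and the helper name `ends_with_lacie` are assumptions of mine, not part of the original document.

```python
from bs4 import BeautifulSoup

html = """
<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
<a>no attributes</a>
"""
soup = BeautifulSoup(html, "html.parser")

# A list matches any one of its items.
first_and_last = soup.find_all(id=["link1", "link3"])

# A function is handed the attribute value. Guard against None, because
# the function may be called for tags that lack the attribute.
def ends_with_lacie(href):
    return href is not None and href.endswith("lacie")

lacie = soup.find_all(href=ends_with_lacie)

print([a["id"] for a in first_and_last])
print([a["id"] for a in lacie])
```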

You can filter multiple attributes at once by passing in more than one keyword argument:

In [93]:
soup.find_all(href=re.compile("elsie"), id="link1")

[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>]

Some attributes, like the `data-*` attributes in HTML 5, have names that can’t be used as the names of keyword arguments:

In [102]:
data_soup = BeautifulSoup('<div data-foo="value">foo!</div>')

In [103]:
try:
    data_soup.find_all(data-foo="value")
except SyntaxError as e:
    print(e)

SyntaxError: expression cannot contain assignment, perhaps you meant "=="? (928546649.py, line 2)

You can use these attributes in searches by putting them into a dictionary and passing the dictionary into `find_all()` as the `attrs` argument:

In [104]:
data_soup.find_all(attrs={"data-foo": "value"})

[<div data-foo="value">foo!</div>]

You can’t use a keyword argument to search for HTML’s ‘name’ attribute, because Beautiful Soup uses the `name` argument to contain the name of the tag itself. Instead, you can give a value to ‘name’ in the `attrs` argument:

In [105]:
name_soup = BeautifulSoup('<input name="email"/>')
name_soup.find_all(name="email")

[]

In [106]:
name_soup.find_all(attrs={"name": "email"})

[<input name="email"/>]

## Searching by CSS `class`

It’s very useful to search for a tag that has a certain CSS class. But the name of the CSS attribute, “class”, is a reserved word in Python. Using `class` as a keyword argument will give you a syntax error. As of Beautiful Soup 4.1.2, you can search by CSS class using the keyword argument `class_`:

In [107]:
soup.find_all("a", class_="sister")

[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
 <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
 <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

As with any keyword argument, you can pass `class_` 
- a string, 
- a regular expression, 
- a function, or 
- `True`:

In [109]:
soup.find_all(class_=re.compile("itl"))

[<p class="title"><b>The Dormouse's story</b></p>]

In [110]:
def has_six_characters(css_class):
    return css_class is not None and len(css_class) == 6

soup.find_all(class_=has_six_characters)

[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
 <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
 <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

***
_ChatGPT:_  
In the code you provided, you are using the `find_all` method of BeautifulSoup to find all HTML elements that match a certain condition based on their **class attribute**. The `class_` parameter is used to filter elements based on their class attribute. In the first example, you are using a regular expression to find elements whose class attribute contains the substring "`itl`".

In the second example, you define a custom function `has_six_characters` which takes a `css_class` argument and returns `True` if the length of the class attribute is `6` and it is not `None`. Then, you pass this function as the `class_` parameter to `find_all`, effectively using it as a filter function.

Why don't you pass any arguments in `class_=has_six_characters`? When you pass a function as the value of the `class_` parameter, BeautifulSoup calls that function itself for each element's class attribute in the HTML document, and it supplies that class attribute as the argument. `has_six_characters` takes a single argument (`css_class`), and you never pass that argument explicitly when using the function with `find_all`.

The `find_all` method takes care of calling your function with the appropriate arguments (in this case, the class attribute of each HTML element). The result is a filtered list of elements whose class attribute satisfies the condition specified in your custom function.

A filter function used with BeautifulSoup's `find_all` method is always called with exactly one argument, which BeautifulSoup supplies. If your function needs a second argument, you have to bind that argument yourself before handing the function over.

For example, consider a function with two arguments:

```python
def has_specific_class(css_class, target_class):
    return css_class is not None and target_class in css_class
```

You can't pass `has_specific_class` to the `class_` parameter directly: BeautifulSoup would call it with only the class attribute, and the call would fail with a `TypeError` because `target_class` is missing. Instead, the second argument (`target_class` in this example) needs to be fixed when calling `find_all`:

```python
target_class = "example_class"
soup.find_all(class_=lambda css_class: has_specific_class(css_class, target_class))
```

In this example, `lambda css_class: has_specific_class(css_class, target_class)` is an anonymous function (lambda function) that takes one argument (the class attribute) and passes it to the `has_specific_class` function along with the fixed `target_class`. This way, `find_all` can still use the function as a filter for elements based on their class attributes.
***
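
As an alternative to the lambda, `functools.partial` from the standard library can bind the extra argument. This is only a sketch: the sample markup and the name `is_sister` are assumptions for illustration, and `has_specific_class` is repeated so the block runs on its own.

```python
from functools import partial
from bs4 import BeautifulSoup

def has_specific_class(css_class, target_class):
    # css_class is None for tags that have no class attribute.
    return css_class is not None and target_class in css_class

html = '<p class="title">T</p><a class="sister">Elsie</a><a class="brother">Bob</a><b>plain</b>'
soup = BeautifulSoup(html, "html.parser")

# partial() fixes target_class and leaves css_class for Beautiful Soup to supply.
is_sister = partial(has_specific_class, target_class="sister")
matches = soup.find_all(class_=is_sister)
print(matches)
```

Binding `target_class` by keyword matters here: Beautiful Soup passes the class value positionally, so it still lands in `css_class`.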

Remember that a single tag can have multiple values for its “class” attribute. When you search for a tag that matches a certain CSS class, you’re matching against any of its CSS classes:

In [3]:
from bs4 import BeautifulSoup

css_soup = BeautifulSoup('<p class="body strikeout"></p>')

In [4]:
css_soup.find_all("p", class_="strikeout")

[<p class="body strikeout"></p>]

In [5]:
css_soup.find_all("p", class_="body")

[<p class="body strikeout"></p>]

You can also search for the exact string value of the class attribute:

In [6]:
css_soup.find_all("p", class_="body strikeout")

[<p class="body strikeout"></p>]

But searching for variants of the string value won’t work:

In [7]:
css_soup.find_all("p", class_="strikeout body")

[]

If you want to search for tags that match two or more CSS classes, you should use a CSS selector:

In [8]:
css_soup.select("p.strikeout.body")

[<p class="body strikeout"></p>]
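
If you’d rather stay inside the `find_all()` API, a function filter on the tag can do the same order-independent match. This is a sketch under my own assumptions: `has_all_classes` is a hypothetical helper, not something Beautiful Soup provides, and the second `<p>` is added only to show a non-match.

```python
from bs4 import BeautifulSoup

css_soup = BeautifulSoup('<p class="body strikeout"></p><p class="body"></p>', "html.parser")

def has_all_classes(*wanted):
    """Build a tag filter that requires every class in `wanted` (hypothetical helper)."""
    wanted = set(wanted)

    def matcher(tag):
        # tag.get("class", []) is the list of class names, or [] if there is no class.
        return wanted <= set(tag.get("class", []))

    return matcher

found = css_soup.find_all(has_all_classes("strikeout", "body"))
print(found)
```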

In older versions of Beautiful Soup, which don’t have the `class_` shortcut, you can use the `attrs` trick mentioned above. Create a dictionary whose value for “class” is the string (or regular expression, or whatever) you want to search for:

In [112]:
soup.find_all("a", attrs={"class": "sister"})

[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
 <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
 <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

In [113]:
soup.find_all("a", {"class": "sister"})

[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
 <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
 <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

## The `string` argument

With `string` you can search for strings instead of tags. As with `name` and the `keyword` arguments, you can pass in 
- a string, 
- a regular expression, 
- a list, 
- a function, or 
- the value `True`.

Here are some examples:

In [114]:
soup.find_all(string="Elsie")

['Elsie']

In [115]:
soup.find_all(string=["Tillie", "Elsie", "Lacie"])

['Elsie', 'Lacie', 'Tillie']

In [116]:
soup.find_all(string=re.compile("Dormouse"))

["The Dormouse's story", "The Dormouse's story"]

In [117]:
def is_the_only_string_within_a_tag(s):
    return s == s.parent.string

soup.find_all(string=is_the_only_string_within_a_tag)

["The Dormouse's story",
 "The Dormouse's story",
 'Elsie',
 'Lacie',
 'Tillie',
 '...']

Although `string` is for finding strings, you can combine it with arguments that find tags: Beautiful Soup will find all tags whose `.string` matches your value for `string`. This code finds the `<a>` tags whose `.string` is “Elsie”:

In [118]:
soup.find_all("a", string="Elsie")

[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>]
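
One consequence worth seeing in code: a tag with more than one child has no `.string`, so combining a tag name with `string` won’t match it. The markup below is an assumed example, not taken from the “three sisters” document.

```python
import re
from bs4 import BeautifulSoup

soup = BeautifulSoup("<p>Hello <b>world</b></p>", "html.parser")

# <b> has exactly one child, a string, so its .string is "world".
print(soup.find_all("b", string="world"))

# <p> has two children (a string and a tag), so its .string is None
# and the string filter cannot match it.
print(soup.p.string)
print(soup.find_all("p", string=re.compile("Hello")))

# A function filter on the tag can test the full text instead.
by_text = soup.find_all(lambda tag: tag.name == "p" and "Hello" in tag.get_text())
print(by_text)
```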

The `string` argument is new in Beautiful Soup 4.4.0. In earlier versions it was called `text`:

In [119]:
soup.find_all("a", text="Elsie")

[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>]

## The `limit` argument

`find_all()` returns all the tags and strings that match your filters. This can take a while if the document is large. If you don’t need all the results, you can pass in a number for `limit`. This works just like the `LIMIT` keyword in SQL. It tells Beautiful Soup to stop gathering results after it’s found a certain number.

There are three links in the “three sisters” document, but this code only finds the first two:

In [120]:
soup.find_all("a", limit=2)

[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
 <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]
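
Here is a small sketch of `limit` on a bigger document. The 1,000 generated links are synthetic data that I made up for illustration.

```python
from bs4 import BeautifulSoup

# Generate a document with 1,000 links (synthetic, for illustration only).
html = "".join(f'<a href="/page/{i}">page {i}</a>' for i in range(1000))
soup = BeautifulSoup(html, "html.parser")

# Beautiful Soup stops collecting once it has five matches.
first_five = soup.find_all("a", limit=5)
print(len(first_five), first_five[-1]["href"])
```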

## The `recursive` argument

If you call `mytag.find_all()`, Beautiful Soup will examine all the descendants of mytag: its children, its children’s children, and so on. If you only want Beautiful Soup to consider direct children, you can pass in `recursive=False`. See the difference here:

```html
<html>
 <head>
  <title>
   The Dormouse's story
  </title>
 </head>
...
```

In [121]:
soup.html.find_all("title")

[<title>The Dormouse's story</title>]

In [122]:
soup.html.find_all("title", recursive=False)

[]

The `<title>` tag is beneath the `<html>` tag, but it’s not directly beneath the `<html>` tag: the `<head>` tag is in the way. Beautiful Soup finds the `<title>` tag when it’s allowed to look at all descendants of the `<html>` tag, but when `recursive=False` restricts it to the `<html>` tag’s immediate children, it finds nothing.

Beautiful Soup offers a lot of tree-searching methods (covered below), and they mostly take the same arguments as `find_all()`: 
- `name`, 
- `attrs`, 
- `string`, 
- `limit`, and 
- the `keyword` arguments. 

But the `recursive` argument is different: `find_all()` and `find()` are the only methods that support it. Passing `recursive=False` into a method like `find_parents()` wouldn’t be very useful.
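
A nested list is a common case where `recursive=False` earns its keep. In this sketch (the markup is an assumed example), it separates the top-level items from the items of an inner list:

```python
from bs4 import BeautifulSoup

html = '<ul id="outer"><li>one</li><li>two<ul><li>nested</li></ul></li></ul>'
soup = BeautifulSoup(html, "html.parser")
outer = soup.find("ul", id="outer")

all_items = outer.find_all("li")                   # includes the nested <li>
top_items = outer.find_all("li", recursive=False)  # direct children only
print(len(all_items), len(top_items))
```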

## Calling a tag is like calling `find_all()`

Because `find_all()` is the most popular method in the Beautiful Soup search API, you can use a shortcut for it. If you treat the `BeautifulSoup` object or a `Tag` object as though it were a function, then it’s the same as calling `find_all()` on that object. These two lines of code are equivalent:

In [123]:
assert(soup.find_all("a") == soup("a"))

In [124]:
assert(soup.find_all(string=True) == soup(string=True))

# 3.3 `find()`

Signature: `find(name, attrs, recursive, string, **kwargs)`

The `find_all()` method scans the entire document looking for results, but sometimes you only want to find one result. If you know a document only has one `<body>` tag, it’s a waste of time to scan the entire document looking for more. Rather than passing in `limit=1` every time you call `find_all`, you can use the `find()` method. These two lines of code are equivalent:

In [125]:
assert (soup.find_all("title", limit=1)[0] == soup.find("title"))

In [126]:
soup.find_all("title", limit=1)[0]

<title>The Dormouse's story</title>

In [127]:
soup.find("title")

<title>The Dormouse's story</title>

The only difference is that `find_all()` returns a list containing the single result, and `find()` just returns the result.

If `find_all()` can’t find anything, it returns an empty list. If `find()` can’t find anything, it returns `None`:

In [130]:
print(soup.find_all("nosuchtag"))
print(soup.find("nosuchtag"))

[]
None
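
Because `find()` can return `None`, chaining another call onto its result is a common source of `AttributeError`. A minimal sketch of the defensive pattern, using an assumed one-paragraph document:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<html><body><p>text</p></body></html>", "html.parser")

missing = soup.find("nosuchtag")
# soup.find("nosuchtag").get_text() would raise AttributeError, because
# None has no get_text() method. Test for None first.
text = missing.get_text() if missing is not None else ""
print(repr(text))

# With find_all(), an empty list simply means there is nothing to loop over.
names = [tag.name for tag in soup.find_all("nosuchtag")]
print(names)
```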


Remember the `soup.head.title` trick from [Navigating using tag names](#Navigating-using-tag-names)? That trick works by repeatedly calling `find()`:

In [132]:
print(soup.head.title)
print(soup.find("head").find("title"))

<title>The Dormouse's story</title>
<title>The Dormouse's story</title>


# 3.4 Other search methods

## `find_parents()` and `find_parent()`

Signature: `find_parents(name, attrs, string, limit, **kwargs)`

Signature: `find_parent(name, attrs, string, **kwargs)`

I spent a lot of time above covering `find_all()` and `find()`. The Beautiful Soup API defines ten other methods for searching the tree, but don’t be afraid. Five of these methods are basically the same as `find_all()`, and the other five are basically the same as `find()`. The only differences are in what parts of the tree they search.

First let’s consider `find_parents()` and `find_parent()`. Remember that `find_all()` and `find()` work their way down the tree, looking at a tag’s descendants. These methods do the opposite: they work their way up the tree, looking at a tag’s (or a string’s) parents. Let’s try them out, starting from a string buried deep in the “three sisters” document:

In [137]:
a_string = soup.find(string="Lacie")
a_string

'Lacie'

In [138]:
a_string.find_parents("a")

[<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]

In [139]:
a_string.find_parent("p")

<p class="story">Once upon a time there were three little sisters; and their names were
<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

In [142]:
print(a_string.find_parents("p", class_="title"))

[]


One of the three `<a>` tags is the direct parent of the string in question, so our search finds it. One of the three `<p>` tags is an indirect parent of the string, and our search finds that as well. There’s a `<p>` tag with the CSS class “title” somewhere in the document, but it’s not one of this string’s parents, so we can’t find it with `find_parents()`.

You may have made the connection between `find_parent()` and `find_parents()`, and the `.parent` and `.parents` attributes mentioned earlier. The connection is very strong. These search methods actually use `.parents` to iterate over all the parents, and check each one against the provided filter to see if it matches.
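
To see that connection, you can imitate `find_parents("p")` by walking `.parents` yourself. This is a rough sketch of the idea on an assumed snippet of markup, not Beautiful Soup’s actual implementation:

```python
from bs4 import BeautifulSoup

html = '<div><p class="story">Names: <a id="link2">Lacie</a></p></div>'
soup = BeautifulSoup(html, "html.parser")
a_string = soup.find(string="Lacie")

# Roughly what find_parents("p") does: walk upward and keep the matches.
manual = [parent for parent in a_string.parents if parent.name == "p"]
print(manual == a_string.find_parents("p"))

# .parents runs from the nearest ancestor up to the BeautifulSoup object itself.
ancestor_names = [parent.name for parent in a_string.parents]
print(ancestor_names)
```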

## `find_next_siblings()` and `find_next_sibling()`

Signature: `find_next_siblings(name, attrs, string, limit, **kwargs)`

Signature: `find_next_sibling(name, attrs, string, **kwargs)`

These methods use `.next_siblings` to iterate over the rest of an element’s siblings in the tree. The `find_next_siblings()` method returns all the siblings that match, and `find_next_sibling()` only returns the first one:

In [143]:
first_link = soup.a
first_link

<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>

In [144]:
first_link.find_next_siblings("a")

[<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
 <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

In [145]:
first_link.find_next_sibling("a")

<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>

## `find_previous_siblings()` and `find_previous_sibling()`

Signature: `find_previous_siblings(name, attrs, string, limit, **kwargs)`

Signature: `find_previous_sibling(name, attrs, string, **kwargs)`

These methods use `.previous_siblings` to iterate over an element’s siblings that precede it in the tree. The `find_previous_siblings()` method returns all the siblings that match, and `find_previous_sibling()` only returns the first one:

In [146]:
last_link = soup.find("a", id="link3")
last_link

<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>

In [147]:
last_link.find_previous_siblings("a")

[<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
 <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>]

In [148]:
first_story_paragraph = soup.find("p", "story")
first_story_paragraph.find_previous_sibling("p")

<p class="title"><b>The Dormouse's story</b></p>

## `find_all_next()` and `find_next()`

Signature: `find_all_next(name, attrs, string, limit, **kwargs)`

Signature: `find_next(name, attrs, string, **kwargs)`

These methods use `.next_elements` to iterate over whatever tags and strings come after an element in the document. The `find_all_next()` method returns all matches, and `find_next()` only returns the first match:

In [149]:
first_link = soup.a
first_link

<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>

In [150]:
first_link.find_all_next(string=True)

['Elsie',
 ',\n',
 'Lacie',
 ' and\n',
 'Tillie',
 ';\nand they lived at the bottom of a well.',
 '\n',
 '...',
 '\n']

In [153]:
first_link.find_next("p")

<p class="story">...</p>

In the first example, the string “Elsie” showed up, even though it was contained within the `<a>` tag we started from. In the second example, the last `<p>` tag in the document showed up, even though it’s not in the same part of the tree as the `<a>` tag we started from. For these methods, all that matters is that an element match the filter, and show up later in the document than the starting element.
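
The contrast with the sibling methods is easiest to see when the two tags have different parents. A short sketch on assumed markup:

```python
from bs4 import BeautifulSoup

html = '<div><a id="one">one</a></div><div><a id="two">two</a></div>'
soup = BeautifulSoup(html, "html.parser")
first = soup.find("a", id="one")

# The second <a> has a different parent, so it is not a sibling of the first...
sibling = first.find_next_sibling("a")
print(sibling)

# ...but it does come later in the document, so find_next() reaches it.
later = first.find_next("a")
print(later)
```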

## `find_all_previous()` and `find_previous()`

Signature: `find_all_previous(name, attrs, string, limit, **kwargs)`

Signature: `find_previous(name, attrs, string, **kwargs)`

These methods use `.previous_elements` to iterate over the tags and strings that come before an element in the document. The `find_all_previous()` method returns all matches, and `find_previous()` only returns the first match:

In [154]:
first_link = soup.a
first_link

<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>

In [155]:
first_link.find_all_previous("p")

[<p class="story">Once upon a time there were three little sisters; and their names were
 <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
 <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
 <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
 and they lived at the bottom of a well.</p>,
 <p class="title"><b>The Dormouse's story</b></p>]

In [156]:
first_link.find_previous("title")

<title>The Dormouse's story</title>

The call to `find_all_previous("p")` found the first paragraph in the document (the one with `class="title"`), but it also found the second paragraph, the `<p>` tag that contains the `<a>` tag we started with. This shouldn’t be too surprising: we’re looking at all the tags that show up earlier in the document than the one we started with. A `<p>` tag that contains an `<a>` tag must have shown up before the `<a>` tag it contains.

# 3.5 CSS selectors

All this CSS selector stuff is a convenience for people who already know the CSS selector syntax. You can do all of this with the Beautiful Soup API. And if CSS selectors are all you need, you should parse the document with `lxml`: it’s a lot faster. But this lets you combine CSS selectors with the Beautiful Soup API.

As of version 4.7.0, Beautiful Soup supports most CSS4 selectors via the [SoupSieve](https://facelessuser.github.io/soupsieve/) project. If you installed Beautiful Soup through pip, SoupSieve was installed at the same time, so you don’t have to do anything extra.

`BeautifulSoup` has a `.select()` method which uses SoupSieve to run a CSS selector against a parsed document and return all the matching elements. `Tag` has a similar method which runs a CSS selector against the contents of a single tag.

(Earlier versions of Beautiful Soup also have the `.select()` method, but only the most commonly-used CSS selectors are supported.)

The SoupSieve documentation lists all the currently supported CSS selectors, but here are some of the basics.
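
To back up the claim that the search API can do the same work, this sketch pairs three selectors with `find_all()` calls that I believe are equivalent on this (assumed, shortened) document. It relies on SoupSieve being installed alongside Beautiful Soup.

```python
from bs4 import BeautifulSoup

html = """
<p class="story">
<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
</p>
"""
soup = BeautifulSoup(html, "html.parser")

# Each selector is compared with a call from the search API.
same_class = soup.select("a.sister") == soup.find_all("a", class_="sister")
same_id = soup.select("#link2") == soup.find_all(id="link2")
same_child = soup.select("p > a") == soup.p.find_all("a", recursive=False)
print(same_class, same_id, same_child)
```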

## `select()`

### tags

You can find tags:

In [157]:
soup.select("title")

[<title>The Dormouse's story</title>]

In [158]:
soup.select("p:nth-of-type(3)")

[<p class="story">...</p>]

Find tags beneath other tags:

In [160]:
soup.select("body a")

[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
 <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
 <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

In [161]:
soup.select("html head title")

[<title>The Dormouse's story</title>]

Find tags _directly_ beneath other tags:

In [162]:
soup.select("head > title")

[<title>The Dormouse's story</title>]

In [163]:
soup.select("p > a")

[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
 <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
 <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

In [164]:
soup.select("p > a:nth-of-type(2)")

[<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]

In [165]:
soup.select("p > #link1")

[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>]

In [166]:
soup.select("body > a")

[]

### siblings

Find the siblings of tags:

In [168]:
soup.select("#link1 ~ .sister")

[<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
 <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

In [169]:
soup.select("#link1 + .sister")

[<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]

### `.class`

Find tags by CSS class:

In [170]:
soup.select(".sister")

[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
 <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
 <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

In [172]:
soup.select("[class~=sister]")

[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
 <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
 <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

### id - `#`

Find tags by ID:

In [173]:
soup.select("#link1")

[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>]

In [174]:
soup.select("a#link2")

[<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]

Find tags that match any selector from a list of selectors:

In [177]:
soup.select("#link1, #link2")

[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
 <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]

Test for the existence of an attribute:

In [175]:
soup.select("a[href]")

[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
 <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
 <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

### attribute value

Find tags by attribute value:

In [179]:
soup.select('a[href="http://example.com/elsie"]')

[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>]

In [180]:
soup.select('a[href^="http://example.com"]')

[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
 <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
 <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

In [181]:
soup.select('a[href$="tillie"]')

[<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

In [182]:
soup.select('a[href*=".com/el"]')

[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>]

## `select_one()`

There’s also a method called `select_one()`, which finds only the first tag that matches a selector:

In [183]:
soup.select_one(".sister")

<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
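
`select()` and `select_one()` differ in what they return when nothing matches, much as `find_all()` and `find()` do. A minimal sketch on an assumed one-tag document:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<a class="sister" id="link1">Elsie</a>', "html.parser")

no_matches = soup.select(".brother")    # no match: empty list
no_match = soup.select_one(".brother")  # no match: None
print(no_matches, no_match)

# Check for None before using the result of select_one().
link = soup.select_one(".sister")
if link is not None:
    print(link["id"])
```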

## xml

If you’ve parsed XML that defines namespaces, you can use them in CSS selectors:

In [185]:
xml = """<tag xmlns:ns1="http://namespace1/" xmlns:ns2="http://namespace2/">
 <ns1:child>I'm in namespace 1</ns1:child>
 <ns2:child>I'm in namespace 2</ns2:child>
</tag> """

soup = BeautifulSoup(xml, "xml")

In [186]:
soup.select("child")

[<ns1:child>I'm in namespace 1</ns1:child>,
 <ns2:child>I'm in namespace 2</ns2:child>]

In [187]:
soup.select("ns1|child")

[<ns1:child>I'm in namespace 1</ns1:child>]

When handling a CSS selector that uses namespaces, Beautiful Soup uses the namespace abbreviations it found when parsing the document. You can override this by passing in your own dictionary of abbreviations:

In [188]:
namespaces = dict(first="http://namespace1/", second="http://namespace2/")
soup.select("second|child", namespaces=namespaces)

[<ns2:child>I'm in namespace 2</ns2:child>]

# <b>4. Output</b>

## Pretty-printing

The `prettify()` method will turn a Beautiful Soup parse tree into a nicely formatted Unicode string, with a separate line for each tag and each string:

In [27]:
markup = '<a href="http://example.com/">I linked to <i>example.com</i></a>'
soup = BeautifulSoup(markup)

soup.prettify()

'<html>\n <body>\n  <a href="http://example.com/">\n   I linked to\n   <i>\n    example.com\n   </i>\n  </a>\n </body>\n</html>\n'

In [28]:
print(soup.prettify())

<html>
 <body>
  <a href="http://example.com/">
   I linked to
   <i>
    example.com
   </i>
  </a>
 </body>
</html>



You can call `prettify()` on the top-level `BeautifulSoup` object, or on any of its `Tag` objects:

In [29]:
print(soup.a.prettify())

<a href="http://example.com/">
 I linked to
 <i>
  example.com
 </i>
</a>



## `get_text()`

If you only want the text part of a document or tag, you can use the `get_text()` method. It returns all the text in a document or beneath a tag, as a single Unicode string:

In [190]:
markup = '<a href="http://example.com/">\nI linked to <i>example.com</i>\n</a>'

soup = BeautifulSoup(markup, "lxml")

print(soup.get_text())
print(soup.i.get_text())


I linked to example.com

example.com


In [191]:
print(repr(soup.get_text()))

'\nI linked to example.com\n'


You can specify a string to be used to join the bits of text together:

In [192]:
soup.get_text("|")

'\nI linked to |example.com|\n'

You can tell Beautiful Soup to strip whitespace from the beginning and end of each bit of text:

In [193]:
soup.get_text("|", strip=True)

'I linked to|example.com'

But at that point you might want to use the `.stripped_strings` generator instead, and process the text yourself:

In [194]:
[string for string in soup.stripped_strings]

['I linked to', 'example.com']
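
One simple way to process the pieces yourself is to join them with single spaces. The sketch below reuses the markup from above but parses it with `"html.parser"`, which is my choice to avoid needing `lxml`:

```python
from bs4 import BeautifulSoup

markup = '<a href="http://example.com/">\nI linked to <i>example.com</i>\n</a>'
soup = BeautifulSoup(markup, "html.parser")

# Join the stripped pieces with single spaces to get one clean line of text.
joined = " ".join(soup.stripped_strings)
print(joined)

# get_text() with a separator and strip=True builds the same string.
print(joined == soup.get_text(" ", strip=True))
```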

# <b>5. Specifying the parser to use</b>

If you just need to parse some HTML, you can dump the markup into the BeautifulSoup constructor, and it’ll probably be fine. Beautiful Soup will pick a parser for you and parse the data. But there are a few additional arguments you can pass in to the constructor to change which parser is used.

The first argument to the BeautifulSoup constructor is a string or an open filehandle – the markup you want parsed. The second argument is how you’d like the markup parsed.

If you don’t specify anything, you’ll get the best HTML parser that’s installed. Beautiful Soup ranks 
- lxml’s parser as being the best, then 
- html5lib’s, then 
- Python’s built-in parser. 

You can override this by specifying one of the following:

- What type of markup you want to parse. Currently supported are 
    - “html”, 
    - “xml”, and 
    - “html5”.
- The name of the parser library you want to use. Currently supported options are 
    - “lxml”, 
    - “html5lib”, and 
    - “html.parser” (Python’s built-in HTML parser).
    
The section [Installing a parser](https://beautiful-soup-4.readthedocs.io/en/latest/#installing-a-parser) contrasts the supported parsers.

If you ask for a parser that isn’t installed, Beautiful Soup can’t honor the request; recent releases raise a `FeatureNotFound` exception rather than quietly picking a different parser. Right now, the only supported XML parser is `lxml`. If you don’t have `lxml` installed, asking for an XML parser won’t give you one, and asking for “lxml” won’t work either.
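
If a script has to run on machines that may or may not have `lxml`, you can handle the choice explicitly. The sketch below assumes a Beautiful Soup release in which an unavailable parser name raises `bs4.FeatureNotFound`; `make_soup` is a hypothetical helper of mine, not part of the library.

```python
from bs4 import BeautifulSoup, FeatureNotFound

def make_soup(markup, preferred=("lxml", "html.parser")):
    """Try each parser name in order (hypothetical helper, not part of bs4)."""
    for parser in preferred:
        try:
            return BeautifulSoup(markup, parser)
        except FeatureNotFound:
            continue  # this parser is not installed; try the next one
    raise RuntimeError("none of the requested parsers is installed")

soup = make_soup('<a href="/x">link</a>')
print(soup.a["href"])
```

Keep in mind that the fallback can change the shape of the tree, as the next section shows, so this only suits code that doesn’t depend on parser-specific details.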

## Differences between parsers

Beautiful Soup presents the same interface to a number of different parsers, but each parser is different. Different parsers will create different parse trees from the same document. The biggest differences are between the HTML parsers and the XML parsers. Here’s a short document, parsed as HTML:

In [198]:
BeautifulSoup("<a><b /></a>", "html.parser")

<a><b></b></a>

In [197]:
BeautifulSoup("<a><b /></a>", "lxml")

<html><body><a><b></b></a></body></html>

Since an empty `<b />` tag is not valid HTML, the parser turns it into a `<b></b>` tag pair.

Here’s the same document parsed as XML (running this requires that you have `lxml` installed). Note that the empty `<b />` tag is left alone, and that the document is given an XML declaration instead of being put into an `<html>` tag:

In [196]:
BeautifulSoup("<a><b /></a>", "xml")

<?xml version="1.0" encoding="utf-8"?>
<a><b/></a>

There are also differences between HTML parsers. If you give Beautiful Soup a perfectly-formed HTML document, these differences won’t matter. One parser will be faster than another, but they’ll all give you a data structure that looks exactly like the original HTML document.

But if the document is not perfectly-formed, different parsers will give different results. Here’s a short, invalid document parsed using `lxml`’s HTML parser. Note that the dangling `</p>` tag is simply ignored:

In [199]:
BeautifulSoup("<a></p>", "lxml")

<html><body><a></a></body></html>

Here’s the same document parsed using `html5lib`:

In [202]:
BeautifulSoup("<a></p>", "html5lib")

<html><head></head><body><a><p></p></a></body></html>

Instead of ignoring the dangling `</p>` tag, `html5lib` pairs it with an opening `<p>` tag. This parser also adds an empty `<head>` tag to the document.

Here’s the same document parsed with Python’s built-in HTML parser:

In [201]:
BeautifulSoup("<a></p>", "html.parser")

<a></a>

- Like `lxml`, this parser ignores the closing `</p>` tag. 
- Unlike `html5lib`, this parser makes no attempt to create a well-formed HTML document by adding a `<body>` tag. 
- Unlike `lxml`, it doesn’t even bother to add an `<html>` tag.

Since the document “`<a></p>`” is invalid, none of these techniques is the “correct” way to handle it. The `html5lib` parser uses techniques that are part of the HTML5 standard, so it has the best claim on being the “correct” way, but all three techniques are legitimate.

Differences between parsers can affect your script. If you’re planning on distributing your script to other people, or running it on multiple machines, 

> you should specify a parser in the BeautifulSoup constructor. 

That will reduce the chances that your users parse a document differently from the way you parse it.

# <b>6. Parsing only part of a document</b>

Let’s say you want to use Beautiful Soup to look at a document’s `<a>` tags. It’s a waste of time and memory to parse the entire document and then go over it again looking for `<a>` tags. It would be much faster to ignore everything that wasn’t an `<a>` tag in the first place. The `SoupStrainer` class allows you to choose which parts of an incoming document are parsed. You just create a `SoupStrainer` and pass it in to the `BeautifulSoup` constructor as the `parse_only` argument.

> Note: This feature won’t work if you’re using the `html5lib` parser. If you use `html5lib`, the whole document will be parsed, no matter what. This is because `html5lib` constantly rearranges the parse tree as it works, and if some part of the document didn’t actually make it into the parse tree, it’ll crash. To avoid confusion, the `parse_only` examples below name their parser explicitly (`lxml`) instead of letting Beautiful Soup pick one.

## SoupStrainer

The `SoupStrainer` class takes the same arguments as a typical method from [Searching the tree](#3.-Searching-the-tree): `name`, `attrs`, `string`, and `**kwargs`. Here are three `SoupStrainer` objects:

In [205]:
from bs4 import SoupStrainer

only_a_tags = SoupStrainer("a")

only_tags_with_id_link2 = SoupStrainer(id="link2")

def is_short_string(string):
    return len(string) < 10

only_short_strings = SoupStrainer(string=is_short_string)

In [206]:
html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
"""

In [207]:
print(BeautifulSoup(html_doc, "lxml", parse_only=only_a_tags).prettify())

<a class="sister" href="http://example.com/elsie" id="link1">
 Elsie
</a>
<a class="sister" href="http://example.com/lacie" id="link2">
 Lacie
</a>
<a class="sister" href="http://example.com/tillie" id="link3">
 Tillie
</a>



In [208]:
print(BeautifulSoup(html_doc, "lxml", parse_only=only_tags_with_id_link2).prettify())

<a class="sister" href="http://example.com/lacie" id="link2">
 Lacie
</a>



In [209]:
print(BeautifulSoup(html_doc, "lxml", parse_only=only_short_strings).prettify())

Elsie
,
Lacie
and
Tillie
...
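
A typical practical use is pulling out just the link targets. This sketch uses Python’s built-in parser and a shortened document in which, by my own choice, the third link has no `href`, so you can see the attribute filter at work:

```python
from bs4 import BeautifulSoup, SoupStrainer

html_doc = """
<p class="story">Once upon a time there were three little sisters:
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a class="sister" id="link3">Tillie</a>.</p>
"""

# Keep only <a> tags that actually carry an href attribute.
only_links = SoupStrainer("a", href=True)
soup = BeautifulSoup(html_doc, "html.parser", parse_only=only_links)

hrefs = [a["href"] for a in soup.find_all("a")]
print(hrefs)
```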



You can also pass a `SoupStrainer` into any of the methods covered in [Searching the tree](#3.-Searching-the-tree). This probably isn’t terribly useful, but I thought I’d mention it:

In [210]:
soup = BeautifulSoup(html_doc)
soup.find_all(only_short_strings)

['\n',
 '\n',
 '\n',
 'Elsie',
 ',\n',
 'Lacie',
 ' and\n',
 'Tillie',
 '\n',
 '...',
 '\n']