# BEAUTIFUL SOUP [API](https://beautiful-soup-4.readthedocs.io/en/latest/)

# <b>1. Objects</b>

Beautiful Soup transforms a complex HTML document into a complex tree of Python objects. But you’ll only ever have to deal with about four kinds of objects: 
- `Tag`, 
- `NavigableString`, 
- `BeautifulSoup`, and 
- `Comment`.

# 1.1 Tag

A `Tag` object corresponds to an XML or HTML tag in the original document:

In [9]:
from bs4 import BeautifulSoup

soup = BeautifulSoup('<b class="boldest">Extremely bold</b>')
tag = soup.b
print(type(tag))

<class 'bs4.element.Tag'>


Tags have a lot of attributes and methods, and I’ll cover most of them in [Navigating the tree](#Navigating-the-tree) and [Searching the tree](#Searching-the-tree). For now, the most important features of a tag are its 
- **name** and 
- **attributes**.

## `tag.name`

In [10]:
tag.name

'b'

If you change a tag’s name, the change will be reflected in any HTML markup generated by Beautiful Soup:

In [11]:
tag.name = "blockquote"
tag

<blockquote class="boldest">Extremely bold</blockquote>

## Attributes

A tag may have any number of attributes. The tag `<b class="boldest">` has an attribute “`class`” whose value is “`boldest`”. You can access a tag’s attributes by treating the tag like a dictionary:

In [19]:
tag = soup.b
print(tag)
tag['class']

<b class="boldest">Extremely bold</b>


['boldest']

In [20]:
tag.attrs

{'class': ['boldest']}

You can add, remove, and modify a tag’s attributes. Again, this is done by treating the tag as a dictionary:

In [23]:
tag['id'] = 'verybold'
tag['another-attribute'] = 1
tag

<b another-attribute="1" class="boldest" id="verybold">Extremely bold</b>

In [24]:
del tag['id']
del tag["another-attribute"]
tag

<b class="boldest">Extremely bold</b>
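Dictionary-style access raises `KeyError` when the attribute is missing. The following is a minimal sketch of safer lookups, assuming the standard-library `html.parser` backend (the cells above do not name a parser). It relies on `Tag.get()` and `Tag.has_attr()`, which behave like `dict.get()` and a membership test:

```python
from bs4 import BeautifulSoup

# Assumption: "html.parser" is chosen so that no third-party parser is required.
soup = BeautifulSoup('<b class="boldest">Extremely bold</b>', "html.parser")
tag = soup.b

missing = tag.get("id")            # None instead of a KeyError
fallback = tag.get("id", "n/a")    # explicit default value
has_class = tag.has_attr("class")  # membership test on the attributes

# Plain subscripting still raises for a missing attribute.
try:
    tag["id"]
    raised = False
except KeyError:
    raised = True

print(missing, fallback, has_class, raised)
```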

### Multi-valued attributes

HTML 4 defines a few attributes that can have multiple values. HTML 5 removes a couple of them, but defines a few more. The most common multi-valued attribute is `class` (that is, a tag can have more than one CSS class). Others include `rel`, `rev`, `accept-charset`, `headers`, and `accesskey`. Beautiful Soup presents the value(s) of a multi-valued attribute as a list.

The defaults implement the rules described in the HTML specification:

In [40]:
from bs4.builder import builder_registry
builder_registry.lookup('html').DEFAULT_CDATA_LIST_ATTRIBUTES

{'*': ['class', 'accesskey', 'dropzone'],
 'a': ['rel', 'rev'],
 'link': ['rel', 'rev'],
 'td': ['headers'],
 'th': ['headers'],
 'form': ['accept-charset'],
 'object': ['archive'],
 'area': ['rel'],
 'icon': ['sizes'],
 'iframe': ['sandbox'],
 'output': ['for']}

In [26]:
css_soup = BeautifulSoup('<p class="body strikeout"></p>')
css_soup.p.attrs

{'class': ['body', 'strikeout']}

In [27]:
css_soup.p["class"]

['body', 'strikeout']

If an attribute looks like it has more than one value, but it’s not a multi-valued attribute as defined by any version of the HTML standard, Beautiful Soup will leave the attribute alone:

In [28]:
id_soup = BeautifulSoup('<p id="my id"></p>')
id_soup.p['id']

'my id'

When you turn a tag back into a string, multiple attribute values are consolidated:

In [34]:
rel_soup = BeautifulSoup('<p>Back to the <a rel="index">homepage</a></p>')
print(rel_soup.a)

rel_soup.a["rel"] = ['index', 'content']
print(rel_soup.a)

<a rel="index">homepage</a>
<a rel="index content">homepage</a>


You can disable this by passing `multi_valued_attributes=None` as a keyword argument into the BeautifulSoup constructor:

In [37]:
no_list_soup = BeautifulSoup(
    '<p class="body strikeout"></p>', 'html',
    multi_valued_attributes=None)

no_list_soup.p["class"]

'body strikeout'

You can use `get_attribute_list` to get a value that’s always a list, whether or not it’s a multi-valued attribute:

In [38]:
id_soup.p.get_attribute_list("id")

['my id']
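As a small illustration of why this helps, the sketch below reads a multi-valued and a single-valued attribute through the same call, so the calling code never has to check whether it received a string or a list. The sample markup and the `html.parser` backend are assumptions made for the example:

```python
from bs4 import BeautifulSoup

# Assumption: sample markup invented for illustration, parsed with "html.parser".
soup = BeautifulSoup('<p class="body strikeout" id="my id"></p>', "html.parser")
p = soup.p

classes = p.get_attribute_list("class")  # multi-valued: already a list
ids = p.get_attribute_list("id")         # single-valued: wrapped in a list

# Both results can be handled by the same loop.
for values in (classes, ids):
    for value in values:
        print(repr(value))
```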

If you parse a document as XML, there are no multi-valued attributes, because it does not use the HTML specification:

In [39]:
xml_soup = BeautifulSoup('<p class="body strikeout"></p>', 'xml')
xml_soup.p["class"]

'body strikeout'

In [43]:
# 'lxml' is an HTML parser, so the HTML multi-valued rules apply again
lxml_soup = BeautifulSoup('<p class="body strikeout"></p>', 'lxml')
lxml_soup.p["class"]

['body', 'strikeout']

Again, you can configure this using the `multi_valued_attributes` argument:

In [42]:
class_is_multi = {'*': 'class'}
xml_soup = BeautifulSoup('<p class="body strikeout"></p>', 
                         'xml',
                         multi_valued_attributes=class_is_multi)
xml_soup.p["class"]

['body', 'strikeout']

# 1.2 NavigableString

A string corresponds to a bit of text within a tag. Beautiful Soup uses the `NavigableString` class to contain these bits of text:

In [13]:
from bs4 import BeautifulSoup

soup = BeautifulSoup('<b class="boldest">Extremely bold</b>')
tag = soup.b
tag.string

'Extremely bold'

In [14]:
print(type(tag.string))

<class 'bs4.element.NavigableString'>


A `NavigableString` is just like a Python Unicode string, except that it also supports some of the features described in [Navigating the tree](#Navigating-the-tree) and [Searching the tree](#Searching-the-tree). You can convert a `NavigableString` to a plain Python string with `str()`:

In [15]:
unicode_string = str(tag.string)
unicode_string

'Extremely bold'

_ChatGPT:_  
In Python 3 the `unicode()` built-in has been removed and `str` is already a Unicode string type, so calling `unicode()` raises `NameError`. Use `str()` instead.

You can’t edit a string in place, but you can replace one string with another, using `replace_with()`:

In [16]:
old_text = tag.string.replace_with("Rot Front!")
print(old_text)
print(tag)

Extremely bold
<b class="boldest">Rot Front!</b>


`NavigableString` supports most of the features described in [Navigating the tree](#Navigating-the-tree) and [Searching the tree](#Searching-the-tree), but not all of them. In particular, since a string can’t contain anything (the way a tag may contain a string or another tag), strings don’t support the `.contents` or `.string` attributes, or the `find()` method.
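The relationship between `NavigableString` and ordinary strings can be checked directly. This is a minimal sketch, assuming the `html.parser` backend; it shows that a `NavigableString` is a `str` subclass that additionally knows its position in the tree through `.parent`:

```python
from bs4 import BeautifulSoup

# Assumption: "html.parser" is used so that no extra parser is required.
soup = BeautifulSoup('<b class="boldest">Extremely bold</b>', "html.parser")
text = soup.b.string

is_str = isinstance(text, str)   # NavigableString subclasses str
plain = str(text)                # plain copy without the tree reference
shouted = text.upper()           # ordinary str methods are available
parent_name = text.parent.name   # tree navigation back to the enclosing tag

print(is_str, type(plain).__name__, shouted, parent_name)
```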

# 1.3 BeautifulSoup

The `BeautifulSoup` object represents the parsed document as a whole. For most purposes, you can treat it as a `Tag` object. This means it supports most of the methods described in [Navigating the tree](#Navigating-the-tree) and [Searching the tree](#Searching-the-tree).

You can also pass a `BeautifulSoup` object into one of the methods defined in [Modifying the tree](#Modifying-the-tree), just as you would a `Tag`. This lets you do things like combine two parsed documents:

In [20]:
doc = BeautifulSoup("<document><content/>INSERT FOOTER HERE</document>", "xml")
footer = BeautifulSoup("<footer>Here's the footer</footer>", "xml")

doc.find(string="INSERT FOOTER HERE").replace_with(footer)

'INSERT FOOTER HERE'

In [21]:
print(doc)

<?xml version="1.0" encoding="utf-8"?>
<document><content/><footer>Here's the footer</footer></document>


Since the `BeautifulSoup` object doesn’t correspond to an actual HTML or XML tag, it has no name and no attributes. But sometimes it’s useful to look at its `.name`, so it’s been given the special `.name` “`[document]`”:

In [22]:
soup.name

'[document]'

# 1.4 Comments and other special strings

`Tag`, `NavigableString`, and `BeautifulSoup` cover almost everything you’ll see in an HTML or XML file, but there are a few leftover bits. The only one you’ll probably ever need to worry about is the comment:

In [24]:
markup = "<b><!--Hey, buddy. Want to buy a used parser?--></b>"

soup = BeautifulSoup(markup)
comment = soup.b.string
print(type(comment))

<class 'bs4.element.Comment'>


The `Comment` object is just a special type of `NavigableString`:

In [25]:
comment

'Hey, buddy. Want to buy a used parser?'

But when it appears as part of an HTML document, a `Comment` is displayed with special formatting:

In [26]:
print(soup.b.prettify())

<b>
 <!--Hey, buddy. Want to buy a used parser?-->
</b>



Beautiful Soup defines classes for anything else that might show up in an XML document: 
- `CData`, 
- `ProcessingInstruction`, 
- `Declaration`, and 
- `Doctype`. 

Just like `Comment`, these classes are subclasses of `NavigableString` that add something extra to the string. Here’s an example that replaces the comment with a CDATA block.

> _ChatGPT:_ A **CDATA (Character Data)** section is a block of text in an HTML or XML document that the parser treats as literal characters rather than as markup. It is often used to embed scripts or other content containing characters that would otherwise be interpreted as markup.

In [34]:
from bs4 import CData

markup = "<b><!--Hey, buddy. Want to buy a used parser?--></b>"
soup = BeautifulSoup(markup)
comment = soup.b.string

cdata = CData("A CDATA block")
comment.replace_with(cdata)

print(soup.b.prettify())

<b>
 <![CDATA[A CDATA block]]>
</b>
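Because `Comment` is a `NavigableString` subclass, comments can be located with an `isinstance` check. The sketch below is one possible way to strip every comment from a tree. The sample markup and the `html.parser` backend are assumptions, and `extract()` is a Beautiful Soup method that is not otherwise covered in these notes:

```python
from bs4 import BeautifulSoup, Comment

# Assumption: sample markup invented for illustration.
markup = "<p>Hello<!-- note one --> world<!-- note two --></p>"
soup = BeautifulSoup(markup, "html.parser")

# A function passed as string= is called with each string in the tree.
comments = soup.find_all(string=lambda text: isinstance(text, Comment))
found = [str(comment) for comment in comments]

# extract() detaches each comment from the tree.
for comment in comments:
    comment.extract()

cleaned = str(soup)
print(found)
print(cleaned)
```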



# <b>2. Navigating the tree</b>

Here’s the “Three sisters” HTML document:

In [77]:
from bs4 import BeautifulSoup

html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
"""

soup = BeautifulSoup(html_doc, 'lxml')

# 2.1 Going down

Tags may contain strings and other tags. These elements are the tag’s **children**. Beautiful Soup provides a lot of different attributes for navigating and iterating over a tag’s children.

> Note that Beautiful Soup strings don’t support any of these attributes, because a string can’t have children.

## Navigating using `tag` names

The simplest way to navigate the parse tree is to say the name of the tag you want. If you want the `<head>` tag, just say `soup.head`:

In [37]:
soup.head

<head><title>The Dormouse's story</title></head>

In [38]:
soup.head.string

"The Dormouse's story"

In [39]:
soup.title

<title>The Dormouse's story</title>

You can use this trick again and again to zoom in on a certain part of the parse tree. This code gets the first `<b>` tag beneath the `<body>` tag:

In [40]:
soup.body.b

<b>The Dormouse's story</b>

Using a tag name as an attribute will give you only the **_first tag_** by that name:

In [41]:
soup.a

<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>

If you need to get all the `<a>` tags, or anything more complicated than the first tag with a certain name, you’ll need to use one of the methods described in [Searching the tree](#Searching-the-tree), such as `find_all()`:

In [42]:
soup.find_all("a")

[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
 <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
 <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

In [48]:
for sister in soup.find_all("a"):
    print(sister.string)

Elsie
Lacie
Tillie
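The difference between attribute-style access and `find_all()` can be summarised in a few lines. This sketch uses a shortened copy of the story markup and assumes the `html.parser` backend. It also shows that asking for a tag name that does not occur returns `None` rather than raising:

```python
from bs4 import BeautifulSoup

# Assumption: a shortened copy of the "Three sisters" markup.
html_doc = """
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
"""
soup = BeautifulSoup(html_doc, "html.parser")

first_id = soup.a["id"]                          # only the first <a>
hrefs = [a["href"] for a in soup.find_all("a")]  # every <a>
no_table = soup.table                            # no <table> in the document

print(first_id)
print(hrefs)
print(no_table)
```

Since a missing tag yields `None`, chained access such as `soup.table.tr` would fail with `AttributeError`, so it is worth checking the intermediate result first.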


## `.contents` and `.children`

### `.contents`

A tag’s children are available in a **list** called `.contents`:

In [49]:
head_tag = soup.head
head_tag

<head><title>The Dormouse's story</title></head>

In [53]:
head_tag.contents

[<title>The Dormouse's story</title>]

In [57]:
title_tag = head_tag.contents[0]
title_tag

<title>The Dormouse's story</title>

In [58]:
title_tag.contents

["The Dormouse's story"]

The `BeautifulSoup` object itself has children. In this case, the `<html>` tag is the child of the `BeautifulSoup` object:

In [78]:
len(soup.contents)

1

In [79]:
soup.contents[0].name

'html'

A string does not have `.contents`, because it can’t contain anything:

In [80]:
text = title_tag.contents[0]
try:
    text.contents
except Exception as e:
    print(e)

'NavigableString' object has no attribute 'contents'


### `.children`

Instead of getting them as a list, you can iterate over a tag’s children using the `.children` **iterator**:

In [81]:
for child in title_tag.children:
    print(child)

The Dormouse's story


In [82]:
text = title_tag.children
text

<list_iterator at 0x7fc79a89bc10>

In [83]:
text = title_tag.contents
text

["The Dormouse's story"]
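In real markup the direct children usually include whitespace-only strings between tags. The following sketch, with invented sample markup and the `html.parser` backend as assumptions, filters `.children` down to `Tag` objects only:

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Assumption: sample markup invented for illustration.
soup = BeautifulSoup("<ul>\n<li>one</li>\n<li>two</li>\n</ul>", "html.parser")
ul = soup.ul

# .contents holds tags and the newline strings between them.
child_count = len(ul.contents)

# Keep only the children that are tags.
items = [child for child in ul.children if isinstance(child, Tag)]
labels = [item.string for item in items]

print(child_count)
print(labels)
```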

## `.descendants`

The `.contents` and `.children` attributes only consider a tag’s direct children. For instance, the `<head>` tag has a single direct child – the `<title>` tag:

In [84]:
head_tag.contents

[<title>The Dormouse's story</title>]

But the `<title>` tag itself has a child: the string “The Dormouse’s story”. There’s a sense in which that string is also a child of the `<head>` tag. The `.descendants` attribute lets you iterate over all of a tag’s children, recursively: its direct children, the children of its direct children, and so on:

In [85]:
for child in head_tag.descendants:
    print(child)

<title>The Dormouse's story</title>
The Dormouse's story


The `<head>` tag has only one child, but it has two descendants: the `<title>` tag and the `<title>` tag’s child. The `BeautifulSoup` object only has one direct child (the `<html>` tag), but it has a whole lot of descendants:

In [86]:
len(list(soup.children))

1

In [87]:
len(list(soup.descendants))

26
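To see what a descendant count is made of, the descendants can be split into tags and strings. This is a small sketch on invented markup, assuming the `html.parser` backend:

```python
from bs4 import BeautifulSoup
from bs4.element import NavigableString, Tag

# Assumption: sample markup invented for illustration.
soup = BeautifulSoup("<div><p>Hello <b>world</b></p></div>", "html.parser")
div = soup.div

# .descendants walks the subtree in document order.
tag_names = [node.name for node in div.descendants if isinstance(node, Tag)]
strings = [str(node) for node in div.descendants if isinstance(node, NavigableString)]

print(len(list(div.children)), len(list(div.descendants)))
print(tag_names)
print(strings)
```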

## `.string`

If a tag has only one child, and that child is a `NavigableString`, the child is made available as `.string`:

In [89]:
title_tag

<title>The Dormouse's story</title>

In [90]:
title_tag.string

"The Dormouse's story"

In [95]:
print(soup.a.prettify())

<a class="sister" href="http://example.com/elsie" id="link1">
 Elsie
</a>



In [94]:
soup.a.string

'Elsie'

If a tag’s only child is another tag, and that tag has a `.string`, then the parent tag is considered to have the same `.string` as its child:

In [98]:
head_tag

<head><title>The Dormouse's story</title></head>

In [99]:
head_tag.string

"The Dormouse's story"

If a tag contains more than one thing, then it’s not clear what `.string` should refer to, so `.string` is defined to be `None`:

In [100]:
soup.html

<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
</body></html>

In [102]:
print(soup.html.string)

None
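When `.string` is `None` because a tag has several children, the combined text is still available. The sketch below assumes the `html.parser` backend and uses `get_text()`, a `Tag` method that is not otherwise covered in these notes:

```python
from bs4 import BeautifulSoup

# Assumption: sample markup invented for illustration.
soup = BeautifulSoup("<p>Hello <b>world</b></p>", "html.parser")

outer = soup.p.string         # None: <p> has two children
inner = soup.b.string         # the single string child of <b>
combined = soup.p.get_text()  # all strings below <p>, concatenated

print(outer, inner, combined)
```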


## `.strings` and `.stripped_strings`

If there’s more than one thing inside a tag, you can still look at just the strings. Use the `.strings` **generator**:

In [103]:
for string in soup.strings:
    print(repr(string))

"The Dormouse's story"
'\n'
'\n'
"The Dormouse's story"
'\n'
'Once upon a time there were three little sisters; and their names were\n'
'Elsie'
',\n'
'Lacie'
' and\n'
'Tillie'
';\nand they lived at the bottom of a well.'
'\n'
'...'
'\n'


In [104]:
for string in soup.strings:
    print(string)

The Dormouse's story




The Dormouse's story


Once upon a time there were three little sisters; and their names were

Elsie
,

Lacie
 and

Tillie
;
and they lived at the bottom of a well.


...




_ChatGPT:_  
If you simply want to see the content, use `print(string)`. If you need a more detailed view for debugging or understanding special characters, use `print(repr(string))`.

These strings tend to have a lot of extra whitespace, which you can remove by using the `.stripped_strings` generator instead:

In [105]:
for string in soup.stripped_strings:
    print(repr(string))

"The Dormouse's story"
"The Dormouse's story"
'Once upon a time there were three little sisters; and their names were'
'Elsie'
','
'Lacie'
'and'
'Tillie'
';\nand they lived at the bottom of a well.'
'...'


Here, strings consisting entirely of whitespace are ignored, and whitespace at the beginning and end of strings is removed.
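A common use of `.stripped_strings` is to collapse a fragment into one clean line of text. This is a minimal sketch on invented markup, assuming the `html.parser` backend:

```python
from bs4 import BeautifulSoup

# Assumption: sample markup invented for illustration.
markup = "<div>\n  <p>  Once upon a time </p>\n  <p>\n    the end\n  </p>\n</div>"
soup = BeautifulSoup(markup, "html.parser")

raw = list(soup.strings)             # includes whitespace-only strings
clean = list(soup.stripped_strings)  # whitespace-only strings are dropped
text = " ".join(clean)

print(len(raw), len(clean))
print(text)
```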

# <b>Searching the tree</b>

# <b>Filters</b>

## Searching by CSS class

It’s very useful to search for a tag that has a certain CSS class. But the name of the CSS attribute, “class”, is a reserved word in Python. Using `class` as a keyword argument will give you a syntax error. As of Beautiful Soup 4.1.2, you can search by CSS class using the keyword argument `class_`:

```python
soup.find_all("a", class_="sister")
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
#  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
#  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
```

As with any keyword argument, you can pass `class_` 
- a string, 
- a regular expression, 
- a function, or 
- `True`:

```python
import re

soup.find_all(class_=re.compile("itl"))
# [<p class="title"><b>The Dormouse's story</b></p>]

def has_six_characters(css_class):
    return css_class is not None and len(css_class) == 6

soup.find_all(class_=has_six_characters)
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
#  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
#  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
```
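The remaining filter type, `True`, matches every tag that has the attribute at all. A short sketch on invented markup, assuming the `html.parser` backend:

```python
from bs4 import BeautifulSoup

# Assumption: sample markup invented for illustration.
soup = BeautifulSoup(
    '<p class="title">Title</p><p>plain</p><a class="sister">Elsie</a>',
    "html.parser",
)

# class_=True keeps tags that have any class and skips the bare <p>.
with_class = [tag.name for tag in soup.find_all(class_=True)]

print(with_class)
```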
***
_ChatGPT:_  
Both examples use `find_all` to select elements by their **class attribute** through the `class_` parameter. The first passes a regular expression, which matches elements whose class contains the substring "`itl`".

The second defines a function, `has_six_characters`, that takes a `css_class` argument and returns `True` when the value is not `None` and is exactly `6` characters long. Passing the function itself as `class_` makes it the filter.

Notice that `class_=has_six_characters` passes the function without calling it. Beautiful Soup calls the function itself for each element and supplies the class value as the single argument (`None` when the tag has no `class` attribute, which is why the function checks for it).

The result is the list of elements for which the function returned `True`.

Because Beautiful Soup supplies exactly one argument, a filter function that needs extra parameters has to be wrapped so that it can still be called with one.

For example, consider a function with two arguments:

```python
def has_specific_class(css_class, target_class):
    return css_class is not None and target_class in css_class
```

To use this function with `find_all`, fix the second argument (`target_class` in this example) ahead of time and pass a one-argument wrapper as `class_`. Beautiful Soup then supplies the class value as the first argument:

```python
target_class = "example_class"
soup.find_all(class_=lambda css_class: has_specific_class(css_class, target_class))
```

Here, `lambda css_class: has_specific_class(css_class, target_class)` is an anonymous function that takes one argument (the class value) and forwards it to `has_specific_class` together with the fixed `target_class`, so `find_all` can still use it as a filter.
***
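As an alternative to a lambda, the extra argument can be bound with `functools.partial` from the standard library. This is a sketch under assumptions: the sample markup is invented, the parser is `html.parser`, and `has_specific_class` is the illustrative helper from the note above rather than part of Beautiful Soup:

```python
from functools import partial

from bs4 import BeautifulSoup


def has_specific_class(css_class, target_class):
    # Illustrative helper: css_class is None for tags without a class.
    return css_class is not None and target_class in css_class


# Assumption: sample markup invented for illustration.
soup = BeautifulSoup(
    '<p class="title">Title</p>'
    '<a class="sister" id="link1">Elsie</a>'
    '<a class="sister" id="link2">Lacie</a>',
    "html.parser",
)

# Bind target_class now; Beautiful Soup supplies css_class later.
is_sister = partial(has_specific_class, target_class="sister")
ids = [tag["id"] for tag in soup.find_all(class_=is_sister)]

print(ids)
```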

Remember that a single tag can have multiple values for its “class” attribute. When you search for a tag that matches a certain CSS class, you’re matching against any of its CSS classes:

In [3]:
from bs4 import BeautifulSoup

css_soup = BeautifulSoup('<p class="body strikeout"></p>')

In [4]:
css_soup.find_all("p", class_="strikeout")

[<p class="body strikeout"></p>]

In [5]:
css_soup.find_all("p", class_="body")

[<p class="body strikeout"></p>]

You can also search for the exact string value of the class attribute:

In [6]:
css_soup.find_all("p", class_="body strikeout")

[<p class="body strikeout"></p>]

But searching for variants of the string value won’t work:

In [7]:
css_soup.find_all("p", class_="strikeout body")

[]

If you want to search for tags that match two or more CSS classes, you should use a CSS selector:

In [8]:
css_soup.select("p.strikeout.body")

[<p class="body strikeout"></p>]
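The CSS selector does not depend on the order of the classes, unlike the exact-string search. The sketch below checks both orders and compares them with a manual subset test. The sample markup, the `id` values, and the `html.parser` backend are assumptions made for the example:

```python
from bs4 import BeautifulSoup

# Assumption: sample markup invented for illustration.
css_soup = BeautifulSoup(
    '<p class="body strikeout" id="both"></p><p class="body" id="one"></p>',
    "html.parser",
)

# Either class order selects the same element.
forward = [p["id"] for p in css_soup.select("p.body.strikeout")]
backward = [p["id"] for p in css_soup.select("p.strikeout.body")]

# Equivalent manual check: the wanted classes must be a subset of the tag's classes.
wanted = {"body", "strikeout"}
manual = [
    p["id"]
    for p in css_soup.find_all("p")
    if wanted <= set(p.get("class", []))
]

print(forward, backward, manual)
```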

In older versions of Beautiful Soup, which don’t have the `class_` shortcut, you can use the `attrs` argument instead. Create a dictionary whose value for “class” is the string (or regular expression, or whatever) you want to search for:

```python
soup.find_all("a", attrs={"class": "sister"})
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
#  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
#  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
```

# <b>Output</b>

## Pretty-printing

The `prettify()` method will turn a Beautiful Soup parse tree into a nicely formatted Unicode string, with a separate line for each tag and each string:

In [27]:
markup = '<a href="http://example.com/">I linked to <i>example.com</i></a>'
soup = BeautifulSoup(markup)

soup.prettify()

'<html>\n <body>\n  <a href="http://example.com/">\n   I linked to\n   <i>\n    example.com\n   </i>\n  </a>\n </body>\n</html>\n'

In [28]:
print(soup.prettify())

<html>
 <body>
  <a href="http://example.com/">
   I linked to
   <i>
    example.com
   </i>
  </a>
 </body>
</html>



You can call `prettify()` on the top-level `BeautifulSoup` object, or on any of its `Tag` objects:

In [29]:
print(soup.a.prettify())

<a href="http://example.com/">
 I linked to
 <i>
  example.com
 </i>
</a>
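For comparison, converting the tree with `str()` produces compact markup without the added line breaks and indentation. This is a minimal sketch, assuming the `html.parser` backend, which does not wrap the fragment in `<html>` and `<body>` the way the output above shows:

```python
from bs4 import BeautifulSoup

markup = '<a href="http://example.com/">I linked to <i>example.com</i></a>'
# Assumption: "html.parser" keeps the fragment as-is, without <html>/<body>.
soup = BeautifulSoup(markup, "html.parser")

compact = str(soup)       # markup on a single line
pretty = soup.prettify()  # one line per tag and per string

print(compact)
print(pretty)
```

Because `prettify()` inserts whitespace, it is best suited to inspecting a document by eye; `str()` is the safer choice when the output is going to be parsed or compared again.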

