# Navigating the tree

Here's the "Three sisters" HTML document again:

In [35]:
html_doc = """
<!DOCTYPE html>
<html>
<head><title>The Dormouse's Story</title></head>
<body>
	<p class="title"><b>The Dormouse's Story</b></p>

	<p class="story">Once upon a time there were three little sisters; and their names where <a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>, <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a> and they lived at the botton of a well.</p>
</body>
</html>"""

from bs4 import BeautifulSoup
soup = BeautifulSoup(html_doc, 'html.parser')

I'll use this as an example to show you how to move from one part of a document to another.

## Going down

Tags may contain strings and other tags. These elements are the tag's *children*. Beautiful Soup provides a lot of different attributes for navigating over a tag's children.

Note that Beautiful Soup strings don't support any of these attributes, because a string can't have children.

### Navigating using tag names

The simplest way to navigate the parse tree is to say the name of the tag you want. If you want the `<head>` tag, just say `soup.head`.

In [36]:
soup.head

<head><title>The Dormouse's Story</title></head>

In [37]:
soup.title

<title>The Dormouse's Story</title>

You can use this trick again and again to zoom in on a certain part of the parse tree. This code gets the first `<b>` tag beneath the `<body>` tag:

In [38]:
soup.body.b

<b>The Dormouse's Story</b>

Using a tag name as an attribute will give you only the *first* tag by that name:

In [39]:
soup.a

<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>

If you need to get *all* the `<a>` tags, or anything more complicated than the first tag with a certain name, you'll need to use one of the methods described in Searching the tree, such as `find_all()`:

In [40]:
soup.find_all('a')

[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
 <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
 <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
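To make the difference between attribute access and `find_all()` concrete, here is a minimal sketch. It uses a shortened stand-in document (my own, not the full "three sisters" page) and assumes the built-in `html.parser`.

```python
from bs4 import BeautifulSoup

# A shortened stand-in document (hypothetical, not the full "three sisters" page).
doc = '<body><p><a id="link1">Elsie</a>, <a id="link2">Lacie</a></p></body>'
soup = BeautifulSoup(doc, "html.parser")

first_link = soup.a             # attribute access: only the first <a>
same_link = soup.find("a")      # find() also returns the first match
all_links = soup.find_all("a")  # find_all() returns every match as a list

print(first_link["id"])                # link1
print(first_link is same_link)         # True: both point at the same element
print([a["id"] for a in all_links])    # ['link1', 'link2']

# A tag name that does not occur in the document gives None instead of raising.
print(soup.table)                      # None
```

Because a missing tag gives `None`, a chained lookup such as `soup.table.tr` would fail with an `AttributeError`, so it is worth checking intermediate results when the document structure is not guaranteed.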

### `.contents` and `.children`

A tag's children are available in a list called `.contents`:

In [41]:
head_tag = soup.head
head_tag

<head><title>The Dormouse's Story</title></head>

In [42]:
head_tag.contents

[<title>The Dormouse's Story</title>]

In [43]:
title_tag = head_tag.contents[0]
title_tag

<title>The Dormouse's Story</title>

In [44]:
title_tag.contents

["The Dormouse's Story"]

In [45]:
first_child = head_tag.contents[0]
first_child

<title>The Dormouse's Story</title>

In [46]:
first_child.contents

["The Dormouse's Story"]
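The examples above only use `.contents`. As a rough sketch of the other attribute named in this section's heading, `.children` lets you iterate over the same direct children without indexing into a list. The one-line document below is a made-up example, parsed with the built-in `html.parser`.

```python
from bs4 import BeautifulSoup

# Made-up one-paragraph document mixing strings and a tag.
soup = BeautifulSoup("<p>One <b>two</b> three</p>", "html.parser")
p_tag = soup.p

# .contents is a list, so it supports len() and indexing.
print(len(p_tag.contents))  # 3

# .children yields the same direct children one at a time.
child_types = []
for child in p_tag.children:
    child_types.append(type(child).__name__)
    print(type(child).__name__, repr(str(child)))

# The <b> tag's own string ("two") is not a direct child of <p>,
# so it appears in neither .contents nor .children.
```

Use `.contents` when you need random access or a length, and `.children` when a single pass over the children is enough.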

### `.descendants`

The `.contents` and `.children` attributes only consider a tag's *direct* children. For instance, the `<head>` tag has a single direct child, the `<title>` tag:

In [47]:
head_tag.contents

[<title>The Dormouse's Story</title>]

But the `<title>` tag itself has a child: the string "The Dormouse's Story". There's a sense in which that string is also a child of the `<head>` tag. The `.descendants` attribute lets you iterate over all of a tag's children, recursively: its direct children, the children of its direct children, and so on:

In [48]:
for child in head_tag.descendants:
    print(child)

<title>The Dormouse's Story</title>
The Dormouse's Story


The `<head>` tag has only one child, but it has two descendants: the `<title>` tag and the `<title>` tag's child. The `BeautifulSoup` object has only a handful of direct children (here the `<html>` tag, the doctype, and the newline strings around them), but it has a whole lot of descendants:

In [49]:
soup.children

<list_iterator at 0x7fcf7c063b38>

In [50]:
list(soup.children)

['\n', 'html', '\n', <html>
 <head><title>The Dormouse's Story</title></head>
 <body>
 <p class="title"><b>The Dormouse's Story</b></p>
 <p class="story">Once upon a time there were three little sisters; and their names where <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a> and they lived at the botton of a well.</p>
 </body>
 </html>]

In [51]:
soup.descendants

<generator object descendants at 0x7fcf7c091570>

In [52]:
list(soup.descendants)

['\n', 'html', '\n', <html>
 <head><title>The Dormouse's Story</title></head>
 <body>
 <p class="title"><b>The Dormouse's Story</b></p>
 <p class="story">Once upon a time there were three little sisters; and their names where <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a> and they lived at the botton of a well.</p>
 </body>
 </html>, '\n', <head><title>The Dormouse's Story</title></head>, <title>The Dormouse's Story</title>, "The Dormouse's Story", '\n', <body>
 <p class="title"><b>The Dormouse's Story</b></p>
 <p class="story">Once upon a time there were three little sisters; and their names where <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a> a
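Since the full list of descendants is long, counting is an easier way to compare the two attributes. The sketch below uses only the `<head>` fragment of the document and assumes the built-in `html.parser`; the variable names are mine.

```python
from bs4 import BeautifulSoup, Tag

soup = BeautifulSoup("<head><title>The Dormouse's Story</title></head>", "html.parser")
head_tag = soup.head

# Direct children only: just the <title> tag.
n_children = len(list(head_tag.children))

# All descendants: the <title> tag and the string inside it.
n_descendants = len(list(head_tag.descendants))

print(n_children, n_descendants)  # 1 2

# Descendants include strings as well as tags; filter by type to keep only tags.
tag_names = [d.name for d in head_tag.descendants if isinstance(d, Tag)]
print(tag_names)  # ['title']
```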

### `.string`

If a tag has only one child, and that child is a `NavigableString`, the child is made available as `.string`:

In [53]:
title_tag.string

"The Dormouse's Story"

If a tag's only child is another tag, and that tag has a `.string`, then the parent tag is considered to have the same `.string` as its child:

In [54]:
head_tag.contents

[<title>The Dormouse's Story</title>]

In [56]:
head_tag.string

"The Dormouse's Story"

If a tag contains more than one thing, then it's not clear what `.string` should refer to, so `.string` is defined to be `None`:

In [57]:
print(soup.html.string)

None
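The three `.string` cases can be checked side by side. This is a small sketch on a made-up fragment, parsed with the built-in `html.parser`.

```python
from bs4 import BeautifulSoup

# Made-up fragment: a <head> with a single <title>, and a <p> with mixed content.
doc = "<head><title>Story</title></head><p>One <b>two</b></p>"
soup = BeautifulSoup(doc, "html.parser")

# Case 1: the only child is a string.
print(soup.title.string)  # Story

# Case 2: the only child is a tag that itself has a .string.
print(soup.head.string)   # Story

# Case 3: more than one child (a string and a <b> tag), so .string is None.
print(soup.p.string)      # None
```

Because of the third case, code that reads `.string` should be prepared to handle `None`.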


### `.strings` and `.stripped_strings`

If there's more than one thing inside a tag, you can still look at just the strings. Use the `.strings` generator:

In [58]:
for string in soup.strings:
    print(repr(string))


'\n'
'\n'
'\n'
"The Dormouse's Story"
'\n'
'\n'
"The Dormouse's Story"
'\n'
'Once upon a time there were three little sisters; and their names where '
'Elsie'
', '
'Lacie'
' and '
'Tillie'
' and they lived at the botton of a well.'
'\n'
'\n'


These strings tend to have a lot of extra whitespace, which you can remove by using the `.stripped_strings` generator instead:

In [59]:
for string in soup.stripped_strings:
    print(repr(string))

"The Dormouse's Story"
"The Dormouse's Story"
'Once upon a time there were three little sisters; and their names where'
'Elsie'
','
'Lacie'
'and'
'Tillie'
'and they lived at the botton of a well.'


Here, strings consisting entirely of whitespace are ignored, and whitespace at the beginning and end of strings is removed.
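One common use of `.stripped_strings` is to rebuild the visible text of a fragment. The sketch below joins the pieces with a single space. The sample paragraph is made up, and joining with a space is my own choice rather than anything Beautiful Soup prescribes.

```python
from bs4 import BeautifulSoup

# Made-up paragraph with leading and trailing whitespace and one nested tag.
doc = "<p>\n  Once upon a time there were <b>three</b> little sisters.\n</p>"
soup = BeautifulSoup(doc, "html.parser")

raw_pieces = list(soup.strings)             # whitespace kept as-is
clean_pieces = list(soup.stripped_strings)  # stripped, whitespace-only pieces dropped

text = " ".join(clean_pieces)
print(text)  # Once upon a time there were three little sisters.
```

Joining with a space is a simplification: in the "three sisters" output above, the standalone `','` piece would end up with a space in front of it.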

## Going up

Continuing the "family tree" analogy, every tag and every string has a parent: the tag that contains it.

### `.parent`

You can access an element's parent with the `.parent` attribute. In the example "three sisters" document, the `<head>` tag is the parent of the `<title>` tag:

In [60]:
title_tag = soup.title
title_tag

<title>The Dormouse's Story</title>

In [62]:
title_tag.parent

<head><title>The Dormouse's Story</title></head>

The title string itself has a parent: the `<title>` tag that contains it:

In [63]:
title_tag.string.parent

<title>The Dormouse's Story</title>

The parent of a top-level tag like `<html>` is the `BeautifulSoup` object itself:

In [64]:
html_tag = soup.html
type(html_tag.parent)

bs4.BeautifulSoup

And the `.parent` of a `BeautifulSoup` object is defined as `None`:

In [65]:
print(soup.parent)

None
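Because every `.parent` chain ends at the `BeautifulSoup` object, whose own parent is `None`, you can climb the tree with a simple loop. The helper `depth()` below is a hypothetical function of mine, not part of Beautiful Soup, and the document is a reduced stand-in.

```python
from bs4 import BeautifulSoup

def depth(element):
    """Count how many .parent hops separate element from the top of the tree.

    Hypothetical helper for illustration; not part of Beautiful Soup.
    """
    hops = 0
    while element.parent is not None:
        element = element.parent
        hops += 1
    return hops

soup = BeautifulSoup("<html><body><p><a>Elsie</a></p></body></html>", "html.parser")

print(depth(soup))           # 0: the BeautifulSoup object has no parent
print(depth(soup.html))      # 1: its parent is the BeautifulSoup object
print(depth(soup.a))         # 4: <p>, <body>, <html>, BeautifulSoup object
print(depth(soup.a.string))  # 5: strings have parents too
```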


### `.parents`

You can iterate over all of an element's parents with `.parents`. This example uses `.parents` to travel from an `<a>` tag buried deep within the document to the very top of the document:

In [67]:
link = soup.a
link

<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>

In [69]:
for parent in link.parents:
    if parent is None:
        print(parent)
    else:
        print(parent.name)

p
body
html
[document]
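Since `.parents` is iterable, it also works in a comprehension. As a sketch, the snippet below turns the chain of parents into a readable path from the top of the document down to the element's immediate parent. The reduced document and the `" > "` separator are my own choices.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<html><body><p><a>Elsie</a></p></body></html>", "html.parser")
link = soup.a

# .parents yields the nearest parent first, ending with the BeautifulSoup object.
parent_names = [parent.name for parent in link.parents]
print(parent_names)  # ['p', 'body', 'html', '[document]']

# Reverse the list to read it from the top of the document downwards.
path = " > ".join(reversed(parent_names))
print(path)  # [document] > html > body > p
```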


## Going sideways

Consider a simple document like this:

In [73]:
sibling_soup = BeautifulSoup("<a><b>text1</b><c>text2</c></a>", "lxml")
print(sibling_soup.prettify())

<html>
 <body>
  <a>
   <b>
    text1
   </b>
   <c>
    text2
   </c>
  </a>
 </body>
</html>


The `<b>` tag and the `<c>` tag are at the same level: they're both direct children of the same tag. We call them *siblings*. When a document is pretty-printed, siblings show up at the same indentation level. You can also use this relationship in the code you write.

### `.next_sibling` and `.previous_sibling`

You can use `.next_sibling` and `.previous_sibling` to navigate between page elements that are on the same level of the parse tree:

In [74]:
sibling_soup.b.next_sibling

<c>text2</c>

In [75]:
sibling_soup.c.previous_sibling

<b>text1</b>

The `<b>` tag has a `.next_sibling`, but no `.previous_sibling`, because there's nothing before the `<b>` tag *on the same level of the tree*. For the same reason, the `<c>` tag has a `.previous_sibling` but no `.next_sibling`:

In [76]:
print(sibling_soup.b.previous_sibling)

None


In [77]:
print(sibling_soup.c.next_sibling)

None


The strings "text1" and "text2" are *not* siblings, because they don't have the same parent:

In [78]:
sibling_soup.b.string

'text1'

In [79]:
print(sibling_soup.b.string.next_sibling)

None
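Following `.next_sibling` repeatedly until it returns `None` visits every element on one level of the tree. Here is a minimal sketch on the same small document. It uses the built-in `html.parser` instead of `lxml` so that it runs without an extra dependency; the only difference that matters here is that no `<html>` and `<body>` wrappers are added.

```python
from bs4 import BeautifulSoup

sibling_soup = BeautifulSoup("<a><b>text1</b><c>text2</c></a>", "html.parser")

# Start at <b> and keep stepping sideways until there is nothing left.
names = []
node = sibling_soup.b
while node is not None:
    names.append(node.name)
    node = node.next_sibling

print(names)  # ['b', 'c']

# The strings sit one level lower, each alone inside its own tag,
# so neither of them has a sibling.
print(sibling_soup.b.string.next_sibling)      # None
print(sibling_soup.c.string.previous_sibling)  # None
```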


In real documents, the `.next_sibling` or `.previous_sibling` of a tag will usually be a string containing whitespace. Going back to the "three sisters" document:
