# References

* [How To Use Beautiful Soup In Python | Part 1](https://www.youtube.com/watch?v=s2zKTklVavM)
* [Web Scraping With Beautiful Soup in Python](https://github.com/areed1192/sigma_coding_youtube/blob/master/python/python-data-science/web-scraping/Web%20Scraping%20Wikipedia.ipynb)

In [36]:
from bs4 import BeautifulSoup
import requests
import json

In [57]:
%%html
<style>
table {float:left}
</style>

In [82]:
url = "https://en.wikipedia.org/wiki/Lyndon_Rive"
response = requests.get(url)

if response.status_code == 200:
    content_html = response.content.decode("utf-8")
else:
    # Stop here: without the page HTML, later cells would fail with a NameError.
    raise RuntimeError(f"GET {url} failed with status {response.status_code}")

In [83]:
soup = BeautifulSoup(content_html, 'html.parser')

---
# Tag object
An XML/HTML tag element is represented as a ```bs4.element.Tag``` object in BS4.

Use the ```find``` or ```find_all``` method to retrieve specific tag object(s).

In [100]:
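# Note: BS4 treats `class` as a multi-valued attribute and returns a list,
# so the comparison with the string "image" is never true. The loop runs to
# completion and `link` ends up as the last <a> element that has an href.
for link in soup.find_all('a', href=True):
    if "class" in link.attrs and link.attrs['class'] == "image":
        break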
        
print(type(link))

<class 'bs4.element.Tag'>
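
Because BS4 returns the ```class``` attribute as a list, matching on a class name is usually done with the ```class_``` keyword of ```find``` / ```find_all``` or with a membership test. The following is a minimal sketch on a hypothetical HTML fragment (```sample_html``` and its contents are made up for illustration, not taken from the Wikipedia page):

```python
from bs4 import BeautifulSoup

# Hypothetical fragment, loosely modeled on the kind of links found on a wiki page.
sample_html = """
<p>
  <a id="top"></a>
  <a class="image thumb" href="/wiki/File:Example.jpg">photo</a>
  <a href="/wiki/Pretoria">Pretoria</a>
</p>
"""
sample_soup = BeautifulSoup(sample_html, "html.parser")

# class_ matches if any one of the tag's classes equals "image".
image_link = sample_soup.find("a", class_="image")

# The class attribute is a list, so compare by membership, not equality.
class_value = image_link["class"]
string_match = class_value == "image"   # list vs. string: never equal
list_match = "image" in class_value

# href=True keeps only the <a> tags that actually have an href attribute.
hrefs = [a["href"] for a in sample_soup.find_all("a", href=True)]

print(class_value, string_match, list_match)
print(hrefs)
```
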


In [101]:
print(link.prettify())

<a href="https://www.mediawiki.org/">
 <img alt="Powered by MediaWiki" height="31" loading="lazy" src="/static/images/footer/poweredby_mediawiki_88x31.png" srcset="/static/images/footer/poweredby_mediawiki_132x47.png 1.5x, /static/images/footer/poweredby_mediawiki_176x62.png 2x" width="88"/>
</a>


## Tag object properties

| property | description                                |
|:---------|:-------------------------------------------|
| name     | tag name                                   |
| attrs    | dictionary of tag attributes               |
| contents | list of direct children (tags and strings) |
| children | iterator over the same direct children     |


In [102]:
for prop in vars(link):
    if not prop.startswith("_"): print(prop)

parser_class
name
namespace
prefix
sourceline
sourcepos
known_xml
attrs
contents
parent
previous_element
next_element
next_sibling
previous_sibling
hidden
can_be_empty_element
cdata_list_attributes
preserve_whitespace_tags


In [103]:
link.name

'a'

In [108]:
link.parent

<li id="footer-poweredbyico"><a href="https://www.mediawiki.org/"><img alt="Powered by MediaWiki" height="31" loading="lazy" src="/static/images/footer/poweredby_mediawiki_88x31.png" srcset="/static/images/footer/poweredby_mediawiki_132x47.png 1.5x, /static/images/footer/poweredby_mediawiki_176x62.png 2x" width="88"/></a></li>

In [104]:
for content in link.contents:
    print(content)

<img alt="Powered by MediaWiki" height="31" loading="lazy" src="/static/images/footer/poweredby_mediawiki_88x31.png" srcset="/static/images/footer/poweredby_mediawiki_132x47.png 1.5x, /static/images/footer/poweredby_mediawiki_176x62.png 2x" width="88"/>


## Tag attributes

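A tag attribute (e.g. the href attribute of an ```a``` element) is accessible with ```tag[<attribute>]``` or ```tag.get(<attribute>)```, and all attributes are available as a dictionary in ```attrs```. Dot access (```tag.<name>```) looks up a child tag with that name, not an attribute.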


In [105]:
print(json.dumps(link.attrs, indent=4))

{
    "href": "https://www.mediawiki.org/"
}


In [106]:
for kv in link.attrs.items():
    print(kv)

('href', 'https://www.mediawiki.org/')
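
The different ways of reading an attribute can be compared side by side. This is a minimal sketch on a hypothetical one-line fragment (the variable names are illustrative only):

```python
from bs4 import BeautifulSoup

# Hypothetical fragment resembling the footer link shown above.
sample_soup = BeautifulSoup(
    '<a href="https://www.mediawiki.org/"><img alt="Powered by MediaWiki"/></a>',
    "html.parser",
)
anchor = sample_soup.a               # dot access navigates to the first <a> tag

href = anchor["href"]                # raises KeyError if the attribute is missing
title = anchor.get("title")          # returns None if the attribute is missing
has_href = anchor.has_attr("href")   # explicit existence check
img_alt = anchor.img["alt"]          # dot access to the child tag, then its attribute
no_such_child = anchor.href          # looks for a child <href> tag, so this is None

print(href, title, has_href, img_alt, no_such_child)
```

```get``` is the safer choice when an attribute may be absent, since subscript access raises ```KeyError```.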


## Nested Tag elements

Nested element(s) can be retrieved with ```find``` or ```find_all``` methods.

In [107]:
img = link.find('img')
print(type(img))

<class 'bs4.element.Tag'>


The same can be done with the ```children``` property.

In [92]:
for child in link.children:
    print(child)

Jump to navigation
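
Note that ```children``` (like ```contents```) yields text nodes as well as tags, so filtering by type is often needed. This is a minimal sketch on a hypothetical list fragment:

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Hypothetical fragment: two <li> tags with a bare text node between them.
sample_soup = BeautifulSoup("<ul><li>one</li> text <li>two</li></ul>", "html.parser")
ul = sample_soup.ul

# contents is a list of direct children: <li>, the text node, <li>.
child_count = len(ul.contents)

# children iterates over the same nodes; keep only the Tag objects.
tag_children = [child.name for child in ul.children if isinstance(child, Tag)]

# The text of each nested <li>, retrieved with find_all.
item_texts = [li.get_text() for li in ul.find_all("li")]

print(child_count, tag_children, item_texts)
```
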


# Traversing DOM like SAX

To go through the DOM tree and handle each nested element, use the ```descendants``` property of the Tag object.

In [93]:
for descendant in soup.descendants:
    if descendant.name == 'a': print(descendant)

<a id="top"></a>
<a class="mw-jump-link" href="#mw-head">Jump to navigation</a>
<a class="mw-jump-link" href="#searchInput">Jump to search</a>
<a class="image" href="/wiki/File:Lyndon_Rive_2015.jpg" title="Lyndon Rive on a plane in 2019"><img alt="Lyndon Rive on a plane in 2019" data-file-height="475" data-file-width="314" decoding="async" height="333" src="//upload.wikimedia.org/wikipedia/commons/thumb/3/31/Lyndon_Rive_2015.jpg/220px-Lyndon_Rive_2015.jpg" srcset="//upload.wikimedia.org/wikipedia/commons/3/31/Lyndon_Rive_2015.jpg 1.5x" width="220"/></a>
<a href="/wiki/Pretoria" title="Pretoria">Pretoria</a>
<a href="/wiki/South_Africa" title="South Africa">South Africa</a>
<a href="#cite_note-1">[1]</a>
<a href="/wiki/SolarCity" title="SolarCity">SolarCity</a>
<a href="/wiki/Chief_executive_officer" title="Chief executive officer">CEO</a>
<a href="/wiki/SolarCity" title="SolarCity">SolarCity</a>
<a href="/wiki/Elon_Musk" title="Elon Musk">Elon Musk</a>
<a href="/wiki/Kimbal_Musk" title