In [127]:
import sys
print(sys.version)

import re
import bs4
from bs4 import BeautifulSoup
print(bs4.__version__)

3.8.10 (default, Nov 26 2021, 20:14:08) 
[GCC 9.3.0]
4.10.0


# How BS4 handles namespaces (XML Parser)

* [How can I access namespaced XML elements using BeautifulSoup?](https://stackoverflow.com/a/70586414/4281353)


## Without namespace definition

When the document contains no namespace definition, the BS4 XML parser **simply drops the namespace prefix**. Hence you cannot use the namespace prefix in search strings, but you **can use the ```tagname```** part of ```<namespace:tagname>```.

According to [BeautifulSoup.find_all() method not working with namespaced tags](https://stackoverflow.com/a/44681560/4281353), BS4 with the XML parser simply drops the namespace.

```
# Optimization to find all tags with a given name.
if name.count(':') == 1:
    # This is a name with a prefix.
    prefix, name = name.split(':', 1)
```
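
To see what the split in the excerpt above produces for a prefixed search string, here is a minimal stand-alone sketch. `split_prefixed_name` is a hypothetical helper written for illustration, not a BS4 function, and it only mirrors the lines quoted above; what BS4 then does with the prefix is not shown in the excerpt.

```python
def split_prefixed_name(name):
    """Split 'prefix:tagname' into (prefix, tagname).

    Hypothetical helper for illustration only; not part of the BS4 API.
    """
    if name.count(':') == 1:
        # Exactly one colon: the left part is treated as the namespace prefix.
        prefix, local_name = name.split(':', 1)
        return prefix, local_name
    # No colon (or more than one): the name is left untouched.
    return None, name


print(split_prefixed_name("ns:Offset"))  # ('ns', 'Offset')
print(split_prefixed_name("Offset"))     # (None, 'Offset')
```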

Hence, parsing the XML:
```
<?xml version="1.0" encoding="UTF-8"?>
<ns:Web>
<ns:Total>4000</ns:Total>
<ns:Offset>0</ns:Offset>
</ns:Web>
```

is the same as parsing the XML below, where the namespace prefix ```ns:``` has been dropped.
```
<?xml version="1.0" encoding="UTF-8"?>
<Web>
<Total>4000</Total>
<Offset>0</Offset>
</Web>
```

In [128]:
xml = """
<?xml version="1.0" encoding="UTF-8"?>
<ns:Web>
<ns:Total>4000</ns:Total>
<ns:Offset>0</ns:Offset>
</ns:Web>
</xml>
"""

In [129]:
soup = BeautifulSoup(xml, 'xml')

In [130]:
# The namespace prefix is dropped (and BS4 prepends a second XML declaration)
soup

<?xml version="1.0" encoding="utf-8"?>
<?xml version="1.0" encoding="UTF-8"?><Web>
<Total>4000</Total>
<Offset>0</Offset>
</Web>

In [134]:
# You cannot search with the namespace prefix as part of the tag name (returns None, so no output)
soup.find("ns:Offset")

In [136]:
# You can only use the tag name
soup.find("Offset")

<Offset>0</Offset>
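
The claim that the prefixed and the unprefixed documents parse to the same tags can be checked directly. The sketch below assumes `lxml` is installed (the `'xml'` feature needs it) and behaves like the versions printed at the top of this notebook; `tag_names` is a hypothetical helper, not a BS4 function.

```python
from bs4 import BeautifulSoup

# Same data, once with an undeclared "ns:" prefix and once without any prefix.
prefixed = """<ns:Web>
<ns:Total>4000</ns:Total>
<ns:Offset>0</ns:Offset>
</ns:Web>"""
plain = prefixed.replace("ns:", "")


def tag_names(markup):
    """Return the names of all tags in document order (hypothetical helper)."""
    parsed = BeautifulSoup(markup, "xml")  # the 'xml' feature requires lxml
    return [tag.name for tag in parsed.find_all(True)]


print(tag_names(prefixed))
print(tag_names(plain))
print(tag_names(prefixed) == tag_names(plain))
```

If the prefix is dropped as described, both lists contain only the bare tag names and the comparison prints `True`.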

## With namespace definition

When the namespace definition is provided, the BS4 XML parser can accept ```<namespace:tagname>``` in search strings.

In [139]:
xbrl_with_namespace = """
<?xml version="1.0" encoding="UTF-8"?>
<xbrl
    xmlns:dei="http://xbrl.sec.gov/dei/2020-01-31"
>
<dei:EntityRegistrantName>
Hoge, Inc.
</dei:EntityRegistrantName>
</xbrl>
"""

In [140]:
soup = BeautifulSoup(xbrl_with_namespace, 'xml')
registrant = soup.find("dei:EntityRegistrantName")
print(registrant.prettify())

<dei:EntityRegistrantName>
 Hoge, Inc.
</dei:EntityRegistrantName>
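
When the prefix is declared, the parsed tag also carries the prefix and the namespace URI it maps to. The sketch below reads them through the `prefix` and `namespace` attributes of the tag; it assumes `lxml` is installed, and the variable names are chosen for illustration only.

```python
from bs4 import BeautifulSoup

xbrl = """<xbrl xmlns:dei="http://xbrl.sec.gov/dei/2020-01-31">
<dei:EntityRegistrantName>Hoge, Inc.</dei:EntityRegistrantName>
</xbrl>"""

ns_soup = BeautifulSoup(xbrl, "xml")  # the 'xml' feature requires lxml
ns_tag = ns_soup.find("dei:EntityRegistrantName")

# With the prefix declared, the tag keeps its prefix and its namespace URI.
print(ns_tag.prefix)
print(ns_tag.namespace)
print(ns_tag.get_text(strip=True))
```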



For comparison, the same search without the namespace definition:

In [116]:
xbrl_without_namespace = """
<?xml version="1.0" encoding="UTF-8"?>
<dei:EntityRegistrantName>
Hoge, Inc.
</dei:EntityRegistrantName>
</xbrl>
"""

In [141]:
# Without the namespace definition, the "dei" prefix cannot be used in the search
soup = BeautifulSoup(xbrl_without_namespace, 'xml')
registrant = soup.find("dei:EntityRegistrantName")
print(registrant)

None
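
Since the same search string works in one case and returns `None` in the other, code that has to handle both kinds of input may want a fallback. The following is a minimal sketch of one way to do that, not an established BS4 idiom: `find_with_fallback` is a hypothetical helper that first tries the prefixed name and then the bare tag name. It assumes `lxml` is installed.

```python
from bs4 import BeautifulSoup


def find_with_fallback(soup, qualified_name):
    """Try 'prefix:tagname' first, then the bare tag name (hypothetical helper)."""
    tag = soup.find(qualified_name)
    if tag is None and ":" in qualified_name:
        # The prefix may have been dropped at parse time; retry without it.
        tag = soup.find(qualified_name.split(":", 1)[1])
    return tag


declared = (
    '<xbrl xmlns:dei="http://xbrl.sec.gov/dei/2020-01-31">'
    "<dei:EntityRegistrantName>Hoge, Inc.</dei:EntityRegistrantName>"
    "</xbrl>"
)
undeclared = "<dei:EntityRegistrantName>Hoge, Inc.</dei:EntityRegistrantName>"

results = []
for markup in (declared, undeclared):
    parsed = BeautifulSoup(markup, "xml")  # the 'xml' feature requires lxml
    found = find_with_fallback(parsed, "dei:EntityRegistrantName")
    results.append(found.get_text(strip=True))

print(results)
```

The trade-off is that the fallback ignores the prefix entirely, so it cannot tell apart two tags that share a local name but come from different namespaces.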


---
# How BS4 handles namespaces (HTML Parser)

The HTML parser does not distinguish the namespace prefix from the tag name, so it treats ```<namespace:tagname>``` as a single tag name.
In addition, the **HTML parser converts tag names to lowercase**.

In [143]:
xbrl_without_namespace = """
<?xml version="1.0" encoding="UTF-8"?>
<dei:EntityRegistrantName>
Hoge, Inc.
</dei:EntityRegistrantName>
</xbrl>
"""

In [144]:
soup = BeautifulSoup(xbrl_without_namespace, 'html.parser')


In [145]:
# Does not match because the HTML parser converted the tag name to lowercase.
registrant = soup.find("dei:EntityRegistrantName") 
print(registrant)

None


In [147]:
# Matches once the search string is lowercased as well.
registrant = soup.find("dei:EntityRegistrantName".lower())
print(registrant)

<dei:entityregistrantname>
Hoge, Inc.
</dei:entityregistrantname>
