# **D. XML/HTML**
In the previous sections, we learned how to handle `csv`, `txt`, and `json` files in Python.<br><br> In this section, we move on to **`XML` and `HTML` in Python**.


## _Objective_
1. **Definitions and features of XML/HTML**: Understanding what XML and HTML are and how they function.
2. **XML/HTML I/O**: Understanding how to handle XML and HTML data in Python.

# [1. Definitions and features of **XML/HTML**]

HTML and XML are both markup languages.  
However, while HTML is widely known through web design, XML is less familiar to many of us. 

So, we're going to look at what they are and how they function.

## 1. Definitions of XML/HTML

### (1) XML

- XML stands for eXtensible Markup Language
- XML is a text-based markup language
- XML has a structure that is both human-readable and machine-readable
- XML is designed to store and transport data

### (2) HTML

- HTML is short for HyperText Markup Language
- HTML is the standard markup language for web pages, which are also called HTML documents
- HTML tags determine the design or functionality of the parts of a web page
- HTML is intended to present web content in an organized, readable way

## 2. Similarities and Differences of XML and HTML

### (1) Similarities

<img src = 'https://imgur.com/shZ2lt7.jpg' align=left width=200/>

Both languages are tag-based languages.  
A tag-based language defines the elements of a document or web page by embedding codes (tags)  
that are surrounded by common start and stop characters (`<` and `>`). 
<br><br>

### (2) Differences

1. HTML is mainly used for the design and layout of a web page, so it affects how the page is organized and displayed.  
XML, on the other hand, is aimed at describing and exchanging data rather than at presenting it.  

    In short, HTML is tied to displaying content in a web environment, while XML can be used in any environment that needs to store or exchange data.<br>


2. Even though both languages are made up of tags, the way tag names are defined differs.

```
<html>
    <head>
      <title>title</title>
    </head>
    <body>
      <h1>Header text</h1>
      <p>Paragraph text</p>
    </body>
</html>
```

In HTML, tag names and their use are pre-defined, so using each tag in the right context is crucial.<br>
The code above renders as shown below.


---

<html>
    <head>
      <title>title</title>
    </head>
    <body>
      <h1>Header text</h1>
      <p>Paragraph text</p>
    </body>
</html>

---

Then, let's move on to XML.

```
<data>
<student>
    <name>peter</name>
    <age>24</age>
    <hobby category = "ball">
        <soccer>soccer</soccer>
        <basketball>basketball</basketball>
    </hobby>
</student>
</data>
```

In contrast to HTML, tag names and their use are **user-defined** in XML.
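
To illustrate that these user-defined tags are still machine-readable, here is a minimal sketch that reads the student record above with the standard library parser. The variable names (`xml_text`, `hobby_names`, and so on) are only illustrative.

```python
import xml.etree.ElementTree as ET

# The same user-defined XML as above, stored in a Python string
xml_text = """
<data>
<student>
    <name>peter</name>
    <age>24</age>
    <hobby category="ball">
        <soccer>soccer</soccer>
        <basketball>basketball</basketball>
    </hobby>
</student>
</data>
""".strip()

root = ET.fromstring(xml_text)            # root is the <data> element
student = root.find("student")            # first <student> child of <data>
name = student.find("name").text          # text inside <name>
age = int(student.find("age").text)       # XML text is always a string, so convert it
hobby = student.find("hobby")
category = hobby.get("category")          # value of the attribute category="ball"
hobby_names = [child.tag for child in hobby]  # tag names of the children of <hobby>

print(name, age, category, hobby_names)
```

Because the tag names are chosen by the author of the data, the code has to know them in advance (`student`, `name`, `age`, `hobby`).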

# [2. XML I/O with ElementTree]

Now, we'll look at how to deal with XML data in Python. <br>
The Python standard library package `xml` provides the module `xml.etree.ElementTree`, which can be used for parsing and creating XML data.<br>
The main purpose of `ElementTree` is to represent an XML document as a tree-shaped object. <br>
Let's have a look at creating and saving XML data. <br>
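
As a rough illustration of what "tree-shaped" means, the sketch below parses a small XML string and lists every element with indentation that reflects its depth. The helper `outline` is a hypothetical name written for this example; it is not part of `ElementTree`.

```python
import xml.etree.ElementTree as ET

root = ET.fromstring("<Student><name>john</name><age>24</age></Student>")

def outline(element, depth=0):
    """Return one indented line per element, parents before their children."""
    lines = ["  " * depth + element.tag]
    for child in element:              # iterating an element yields its direct children
        lines.extend(outline(child, depth + 1))
    return lines

tree_lines = outline(root)
print("\n".join(tree_lines))
```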


## 1. Input / Output

Before we start working with XML data, we need to import the required names from `xml.etree.ElementTree`.

In [1]:
from xml.etree.ElementTree import Element, dump, ElementTree

### (1) Creating tags

Once the names have been imported, you can create an element object using `Element()`.

In [2]:
# Creating an Element object that represents the XML tag <name>
name_tag = Element("name")
name_tag.text = "john"
type(name_tag)

xml.etree.ElementTree.Element

An element object with the tag name `name` has been created and stored in the variable `name_tag`.<br>
If you want to see the content of the element object, you can use `dump()`.

In [3]:
dump(name_tag)

<name>john</name>


The element has the tag `name` and the text `john`.
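
`dump()` only prints the element. If you need the XML as a Python string instead, one option is `tostring()` from the same module, sketched here with the same `name_tag` element (the variable `xml_string` is just an illustrative name).

```python
from xml.etree.ElementTree import Element, tostring

name_tag = Element("name")
name_tag.text = "john"

# encoding="unicode" makes tostring() return a str instead of bytes
xml_string = tostring(name_tag, encoding="unicode")
print(xml_string)
```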

### (2) Creating a nested tag 
Just as control flow statements can be nested, a tag can be placed inside another tag. Use `.append()` to add a child element to a parent element.

In [4]:
root_tag = Element("Student") 
leaf_tag1 = Element("name") # tag - name
leaf_tag1.text = "john"
root_tag.append(leaf_tag1) # inserting `name` inside `Student` 

leaf_tag2 = Element("age") # tag - age 
leaf_tag2.text = "24"
root_tag.append(leaf_tag2) # inserting `age` inside `Student` 

dump(root_tag)

type(root_tag)

<Student><name>john</name><age>24</age></Student>


xml.etree.ElementTree.Element

`Student` has two leaves (sub-tags): `name` and `age`.<br>
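
As an alternative to creating each child and appending it separately, the same module offers `SubElement()`, which creates a child and attaches it to the parent in one step. A minimal sketch that builds the same `Student` element:

```python
from xml.etree.ElementTree import Element, SubElement, tostring

root_tag = Element("Student")
SubElement(root_tag, "name").text = "john"   # creates <name> inside <Student>
SubElement(root_tag, "age").text = "24"      # creates <age> inside <Student>

result = tostring(root_tag, encoding="unicode")
print(result)
```

Both approaches produce the same element; `.append()` is useful when the child element already exists, while `SubElement()` is shorter when it is created on the spot.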

### (3) Saving XML in Python
Using `.write()`, the XML object from above can be saved as an XML file.


In [5]:
ElementTree(root_tag).write('./student_xml.xml')
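
To check that the file was written as expected, it can be read back with `parse()`. The sketch below is a round trip under one assumption: it writes to a temporary directory instead of `./student_xml.xml`, so that it can run anywhere without leaving a file behind.

```python
import os
import tempfile
from xml.etree.ElementTree import Element, SubElement, ElementTree, parse

root_tag = Element("Student")
SubElement(root_tag, "name").text = "john"
SubElement(root_tag, "age").text = "24"

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "student_xml.xml")
    # Write the tree with an XML declaration and an explicit encoding
    ElementTree(root_tag).write(path, encoding="utf-8", xml_declaration=True)
    # parse() reads the file back; getroot() returns the top-level element
    loaded_root = parse(path).getroot()
    loaded = {child.tag: child.text for child in loaded_root}

print(loaded_root.tag, loaded)
```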

# [3. HTML I/O with BeautifulSoup]
Now, let's move on to HTML. <br>
We'll parse HTML data, including a page taken directly from the web, using the popular parsing library `BeautifulSoup`.

## 1. BeautifulSoup

Beautiful Soup is a Python library for pulling data out of HTML and XML files. With BeautifulSoup, you can parse the HTML data directly from the web.
<br>

### (1) Data Parsing with BeautifulSoup

First, you need to import `BeautifulSoup`.

In [6]:
# importing BeautifulSoup
from bs4 import BeautifulSoup

In [7]:
tag = "<body><p class='Python'> Hello Python </p></body>" # assigning an HTML-like string to a new variable named `tag`

In [8]:
type(tag) 

str

Then, convert the Python string to a BeautifulSoup object using `BeautifulSoup()` and store it in a new variable `bs`.<br>

In [9]:
bs = BeautifulSoup(tag) # converting the string to a BeautifulSoup object (no parser is named, so an installed one is chosen automatically)
bs

<html><body><p class="Python"> Hello Python </p></body></html>

In [10]:
type(bs)

bs4.BeautifulSoup

The BeautifulSoup object represents the parsed HTML document. For most purposes, you can treat it as a Tag object.  
This means it supports most of the methods described in the "Navigating the tree" and "Searching the tree" sections of the Beautiful Soup documentation.  
(details on BeautifulSoup: https://www.crummy.com/software/BeautifulSoup/bs4/doc/)
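
The parser can also be named explicitly as the second argument, which keeps the result the same regardless of which parsers happen to be installed. A minimal sketch using Python's built-in `html.parser` (the variable name `soup` is illustrative):

```python
from bs4 import BeautifulSoup

tag = "<body><p class='Python'> Hello Python </p></body>"

# Naming the parser avoids relying on whichever parser is installed
soup = BeautifulSoup(tag, "html.parser")
print(soup)
```

Note that `html.parser` does not add a surrounding `<html>` tag to a fragment, so the printed result can differ slightly from the output shown above.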


Now the data is in the right format, and it's time to see what BeautifulSoup offers us. 

First, `BeautifulSoup` provides a method **`.find()`**. It allows you to easily find tags of interest.<br>
For instance, let's assume you are looking for a tag named `p`. Then, pass the tag name `p` to `.find()`.<br> 



In [11]:
bs.find('p')

<p class="Python"> Hello Python </p>

Thanks to `.find()`, we could find the tag `p` and its content with a single line of code.<br>
Alternatively, you can find a tag by its attributes, if it has any. A tag attribute is a name-value pair written inside the opening tag.<br>For instance, the opening tag of `p` consists of two parts, the tag name `p` and the attribute `class = "Python"`.<br>
In this case, you can simply pass the attribute to `.find()` to get the same result as before.

In [12]:
bs.find(class_ = 'Python') 

<p class="Python"> Hello Python </p>
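
The tag name and the attribute can also be combined in one call, and the attribute values of a found tag can be read like dictionary entries. The sketch below also shows what `.find()` returns when nothing matches; the variable names are only illustrative.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<body><p class='Python'> Hello Python </p></body>", "html.parser")

# Tag name and class together; class_ has a trailing underscore
# because `class` is a reserved word in Python
p_tag = soup.find("p", class_="Python")

# The same search written with an attrs dictionary
same_tag = soup.find("p", attrs={"class": "Python"})

classes = p_tag["class"]     # class can hold several values, so a list is returned
missing = soup.find("div")   # no <div> in the document, so the result is None

print(p_tag, classes, missing)
```

Since `.find()` returns `None` when there is no match, it is worth checking the result before calling `.text` on it.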

If you need only the content without the tag itself, add `.text` to the previous code.

In [13]:
bs.find(class_ = 'Python').text

' Hello Python '
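
The result above keeps the spaces that surround the text inside the tag. If they are not wanted, one option is `.get_text(strip=True)`, sketched here on the same string:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<body><p class='Python'> Hello Python </p></body>", "html.parser")

raw_text = soup.find(class_="Python").text                        # keeps surrounding spaces
clean_text = soup.find(class_="Python").get_text(strip=True)      # strips them

print(repr(raw_text), repr(clean_text))
```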

### (2) Practice: BeautifulSoup

So far, we've seen how to use BeautifulSoup for parsing HTML data. Now, let's practice with a real-world example.

There is an online database for TV shows and movies called `IMDb`, and we're going to use the titles of its 250 top-rated TV shows. 

<img src="https://imgur.com/SZxXgTz.jpg" align=left width=400 height=400/>

### Fetching the web page with `urlopen`

In [14]:
from urllib.request import urlopen # urlopen fetches the content of a URL

url = 'https://www.imdb.com/chart/toptv/?ref_=nv_tvv_250'
page = urlopen(url)
bs = BeautifulSoup(page, "html.parser")
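
A web request can fail, for example when the network is down or the site rejects the request. The sketch below wraps the download in a small function with error handling. The function names `build_request` and `fetch_html`, the `User-Agent` value, and the 10-second timeout are assumptions made for this example, not something the site or the original code requires.

```python
from urllib.error import URLError
from urllib.request import Request, urlopen

def build_request(url):
    """Create a request that identifies itself with a User-Agent header (illustrative value)."""
    return Request(url, headers={"User-Agent": "Mozilla/5.0 (tutorial example)"})

def fetch_html(url, timeout=10):
    """Return the page as a string, or None if the request fails."""
    try:
        with urlopen(build_request(url), timeout=timeout) as response:
            return response.read().decode("utf-8", errors="replace")
    except URLError as error:   # HTTPError is a subclass of URLError
        print(f"Could not fetch {url}: {error}")
        return None

# Building the request does not contact the server yet
request = build_request("https://www.imdb.com/chart/toptv/?ref_=nv_tvv_250")
print(request.full_url)
```

Calling `fetch_html(url)` would then return a string that can be passed to `BeautifulSoup(..., "html.parser")`, or `None` if the download did not succeed.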

#### Reading the page from a local file

In [15]:
with open('./data/other/imdb_top250.html', 'r', encoding='utf-8') as imdb: # opening an html file; it is closed automatically
    top_250 = imdb.read()
top_250


'<!DOCTYPE html>\n<html\n    xmlns:og="http://ogp.me/ns#"\n    xmlns:fb="http://www.facebook.com/2008/fbml">\n    <head>\n        \n<script type=\'text/javascript\'>var ue_t0=ue_t0||+new Date();</script>\n<script type=\'text/javascript\'>\nwindow.ue_ihb = (window.ue_ihb || window.ueinit || 0) + 1;\nif (window.ue_ihb === 1) {\n\nvar ue_csm = window,\n    ue_hob = +new Date();\n(function(d){var e=d.ue=d.ue||{},f=Date.now||function(){return+new Date};e.d=function(b){return f()-(b?0:d.ue_t0)};e.stub=function(b,a){if(!b[a]){var c=[];b[a]=function(){c.push([c.slice.call(arguments),e.d(),d.ue_id])};b[a].replay=function(b){for(var a;a=c.shift();)b(a[0],a[1],a[2])};b[a].isStub=1}};e.exec=function(b,a){return function(){try{return b.apply(this,arguments)}catch(c){ueLogError(c,{attribution:a||"undefined",logLevel:"WARN"})}}}})(ue_csm);\n\n\n    var ue_err_chan = \'jserr\';\n(function(d,e){function h(f,b){if(!(a.ec>a.mxe)&&f){a.ter.push(f);b=b||{};var c=f.logLevel||b.logLevel;c&&c!==k&&c!==m&&c!==

In [16]:
type(top_250)

str

`imdb_top250.html` is opened in read-only mode, read by `.read()`, and stored in `top_250`.<br>
At this stage, the content of `imdb_top250.html` is a plain string object and needs to be parsed by `BeautifulSoup()`.

In [17]:
bs = BeautifulSoup(top_250, "html.parser")
bs

<!DOCTYPE html>

<html xmlns:fb="http://www.facebook.com/2008/fbml" xmlns:og="http://ogp.me/ns#">
<head>
<script type="text/javascript">var ue_t0=ue_t0||+new Date();</script>
<script type="text/javascript">
window.ue_ihb = (window.ue_ihb || window.ueinit || 0) + 1;
if (window.ue_ihb === 1) {

var ue_csm = window,
    ue_hob = +new Date();
(function(d){var e=d.ue=d.ue||{},f=Date.now||function(){return+new Date};e.d=function(b){return f()-(b?0:d.ue_t0)};e.stub=function(b,a){if(!b[a]){var c=[];b[a]=function(){c.push([c.slice.call(arguments),e.d(),d.ue_id])};b[a].replay=function(b){for(var a;a=c.shift();)b(a[0],a[1],a[2])};b[a].isStub=1}};e.exec=function(b,a){return function(){try{return b.apply(this,arguments)}catch(c){ueLogError(c,{attribution:a||"undefined",logLevel:"WARN"})}}}})(ue_csm);


    var ue_err_chan = 'jserr';
(function(d,e){function h(f,b){if(!(a.ec>a.mxe)&&f){a.ter.push(f);b=b||{};var c=f.logLevel||b.logLevel;c&&c!==k&&c!==m&&c!==n&&c!==p||a.ec++;c&&c!=k||a.ecf++;b.pageURL=

In [18]:
type(bs)

bs4.BeautifulSoup

Thanks to `BeautifulSoup()`, the data is now parsed as HTML.<br>
Next, we use `.find()` to access a show title.<br>
You can first check the location of the tag of interest with your browser's **inspection tool** (developer tools).

<img src="https://imgur.com/8Vtf7s7.jpg" align=left width=1000 height=500/>

It seems that the show titles are inside `<td>` tags with the class attribute **titleColumn**.<br>
Now you know which tag to find and how to access it using `.find()`.<br>
Since the class value identifies exactly the cells we need, it is easier to search by the attribute than by the tag name.


In [19]:
first_movie = bs.find(class_="titleColumn")
first_movie

<td class="titleColumn">
      1.
      <a href="/title/tt5491994/?pf_rd_m=A2FGELUUNOQJNL&amp;pf_rd_p=12230b0e-0e00-43ed-9e59-8d5353703cce&amp;pf_rd_r=GBHMV58NPC2HGHVH98R0&amp;pf_rd_s=center-1&amp;pf_rd_t=15506&amp;pf_rd_i=toptv&amp;ref_=chttvtp_tt_1" title="David Attenborough">Planet Earth II</a>
<span class="secondaryInfo">(2016)</span>
</td>

Even though we can see the show title, `<td>` is not the innermost tag that holds it.  
Instead, you need to go one tag deeper, to `<a>`.<br>Let's access `<a>` to get the show title.

In [20]:
first_movie.find('a')

<a href="/title/tt5491994/?pf_rd_m=A2FGELUUNOQJNL&amp;pf_rd_p=12230b0e-0e00-43ed-9e59-8d5353703cce&amp;pf_rd_r=GBHMV58NPC2HGHVH98R0&amp;pf_rd_s=center-1&amp;pf_rd_t=15506&amp;pf_rd_i=toptv&amp;ref_=chttvtp_tt_1" title="David Attenborough">Planet Earth II</a>

With `.text`, you can directly access the content of the tag.<br>

In [21]:
first_movie.find('a').text

'Planet Earth II'
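
The same cell holds more than the title. As a sketch, the snippet below reads the link target, the `title` attribute of `<a>`, and the year from the `<span>`. It uses a copy of the cell shown above with the long `href` value shortened for readability, so `row_html` is an illustrative stand-in rather than the exact page content.

```python
from bs4 import BeautifulSoup

# Shortened copy of the first cell (the real href carries extra query parameters)
row_html = """
<td class="titleColumn">
      1.
      <a href="/title/tt5491994/" title="David Attenborough">Planet Earth II</a>
<span class="secondaryInfo">(2016)</span>
</td>
"""

cell = BeautifulSoup(row_html, "html.parser").find(class_="titleColumn")
link = cell.find("a")

title = link.text                    # content of <a>
href = link["href"]                  # attribute values are read like dictionary entries
people = link["title"]               # the title attribute of <a>
year = cell.find("span", class_="secondaryInfo").text.strip("()")  # "(2016)" without parentheses

print(title, href, people, year)
```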

We've finally found the title of the first of the 250 shows. 
<br>
Now, let's get all 250 titles using `.find_all()`. <br><br>

In [22]:
find_title = bs.find_all(class_="titleColumn")
find_title

[<td class="titleColumn">
       1.
       <a href="/title/tt5491994/?pf_rd_m=A2FGELUUNOQJNL&amp;pf_rd_p=12230b0e-0e00-43ed-9e59-8d5353703cce&amp;pf_rd_r=GBHMV58NPC2HGHVH98R0&amp;pf_rd_s=center-1&amp;pf_rd_t=15506&amp;pf_rd_i=toptv&amp;ref_=chttvtp_tt_1" title="David Attenborough">Planet Earth II</a>
 <span class="secondaryInfo">(2016)</span>
 </td>,
 <td class="titleColumn">
       2.
       <a href="/title/tt0795176/?pf_rd_m=A2FGELUUNOQJNL&amp;pf_rd_p=12230b0e-0e00-43ed-9e59-8d5353703cce&amp;pf_rd_r=GBHMV58NPC2HGHVH98R0&amp;pf_rd_s=center-1&amp;pf_rd_t=15506&amp;pf_rd_i=toptv&amp;ref_=chttvtp_tt_2" title="David Attenborough, Sigourney Weaver">Planet Earth</a>
 <span class="secondaryInfo">(2006)</span>
 </td>,
 <td class="titleColumn">
       3.
       <a href="/title/tt0185906/?pf_rd_m=A2FGELUUNOQJNL&amp;pf_rd_p=12230b0e-0e00-43ed-9e59-8d5353703cce&amp;pf_rd_r=GBHMV58NPC2HGHVH98R0&amp;pf_rd_s=center-1&amp;pf_rd_t=15506&amp;pf_rd_i=toptv&amp;ref_=chttvtp_tt_3" title="Scott Grimes, Dam

The variable `find_title` contains a list of 250 tags.<br>
Let's access the `<a>` tag of each one using a `for` loop.

In [23]:
titles = []
for i in find_title:
    titles.append(i.find('a').text)

The code above creates an empty list named `titles` and appends each show title to it.
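
The same loop can be written as a list comprehension. The sketch below also skips cells that have no `<a>` tag, since `.find()` returns `None` in that case and `.text` would raise an error. The three-row `sample_html` is made-up data for illustration, not the real IMDb page.

```python
from bs4 import BeautifulSoup

# Made-up sample with the same structure as the real table; the third cell has no link
sample_html = """
<table>
  <tr><td class="titleColumn">1. <a href="#">Planet Earth II</a></td></tr>
  <tr><td class="titleColumn">2. <a href="#">Planet Earth</a></td></tr>
  <tr><td class="titleColumn">3. (no link)</td></tr>
</table>
"""

soup = BeautifulSoup(sample_html, "html.parser")
cells = soup.find_all(class_="titleColumn")

# Keep only the cells that contain an <a> tag, then take its text
titles = [cell.find("a").text for cell in cells if cell.find("a") is not None]

print(titles)
```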

In [24]:
titles

['Planet Earth II',
 'Planet Earth',
 'Band of Brothers',
 'Breaking Bad',
 'Chernobyl',
 'The Wire',
 'Blue Planet II',
 'Our Planet',
 'Cosmos: A Spacetime Odyssey',
 'Cosmos',
 'Game of Thrones',
 'Avatar: The Last Airbender',
 'Rick and Morty',
 'The Sopranos',
 'The Last Dance',
 'The World at War',
 'Life',
 'Hagane no renkinjutsushi',
 'The Vietnam War',
 'Sherlock',
 'The Twilight Zone',
 'Human Planet',
 'Sahsiyet',
 'The Beatles Anthology',
 'The Blue Planet',
 'Batman: The Animated Series',
 'Frozen Planet',
 'Firefly',
 'Dekalog',
 'True Detective',
 'Death Note: Desu nôto',
 'The Civil War',
 'Apocalypse: La 2ème guerre mondiale',
 'Fargo',
 'Kaubôi bibappu',
 'Africa',
 'When They See Us',
 'Last Week Tonight with John Oliver',
 'Hunter x Hunter',
 'TVF Pitchers',
 'Only Fools and Horses....',
 'The Office',
 'Friends',
 'Das Boot',
 'Gravity Falls',
 'Seinfeld',
 "Monty Python's Flying Circus",
 'Ramayan',
 'Pride and Prejudice',
 'How the Universe Works',
 'Black Mirror