## Parsing HTML

### HTML 
**HyperText Markup Language**

A document markup language developed in the early 1990s for use in web browsers.

---


HTML (HyperText Markup Language) is the most basic building block of the Web. It defines the meaning and structure of web content. Other technologies besides HTML are generally used to describe a web page's appearance/presentation (CSS) or functionality/behavior (JavaScript).

"Hypertext" refers to links that connect web pages to one another, either within a single website or between websites. Links are a fundamental aspect of the Web. By uploading content to the Internet and linking it to pages created by other people, you become an active participant in the World Wide Web.




An HTML element is set off from other text in a document by "tags", which consist of the element name surrounded by `<` and `>`. The element name inside a tag is case-insensitive.

Source: https://developer.mozilla.org/en-US/docs/Web/HTML
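To make the tag syntax concrete, here is a minimal sketch using Python's standard-library `html.parser` module. The `TagCollector` class and the sample markup are illustrative assumptions, not part of the course material. `HTMLParser` reports tag names in lower case, which reflects the case-insensitivity of element names.

```python
from html.parser import HTMLParser


class TagCollector(HTMLParser):
    """Hypothetical helper: records the name of every opening tag it sees."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser passes the tag name converted to lower case
        self.tags.append(tag)


collector = TagCollector()
collector.feed("<P>Upper-case tag</P><p>Lower-case tag</p>")
print(collector.tags)
```

Both `<P>` and `<p>` end up recorded under the same name.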



### Beautiful Soup
Beautiful Soup is a Python library (first released in 2004) for parsing HTML and XML. It does not download pages itself; given the HTML of a page, it provides tools to navigate the document structure and extract the content embedded in it.

### What we did in the course to scrape HTML from the web:

In [35]:
import requests




👉🏼 The `requests` package allows you to send HTTP requests using Python.

With `requests`:
`response = requests.get(<your_url>)`
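A slightly fuller sketch of the same call might look like the block below. The helper name `fetch_html` and the 10-second timeout are assumptions made for illustration; `raise_for_status()` and `.text` are part of the `requests` response object.

```python
import requests


def fetch_html(url, timeout=10):
    """Hypothetical helper: download a page and return its HTML as a string."""
    response = requests.get(url, timeout=timeout)
    # Raise requests.HTTPError for 4xx/5xx responses instead of parsing an error page
    response.raise_for_status()
    # .text is the response body decoded to str
    return response.text
```

Calling `fetch_html("http://python.org/")` needs network access and returns the page's HTML as one long string, which can then be handed to a parser.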

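### Making the same request with the standard library package "urllib"

`requests` is built on the third-party package `urllib3`, not on the standard library's `urllib`. Without `requests`, the same request can be made with the standard library module `urllib.request`; the code would look like this: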
```python
import urllib.request
response = urllib.request.urlopen('http://python.org/')
html = response.read()
```
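Note that `response.read()` returns bytes, not a string. The sketch below shows one way to decode them, under the assumption that UTF-8 is a reasonable fallback when the server declares no charset. The helper name `fetch_html` is hypothetical, and a `data:` URL stands in for a real address so the example runs offline.

```python
import urllib.request
from urllib.parse import quote


def fetch_html(url):
    """Hypothetical helper: fetch a URL and return the body decoded to str."""
    with urllib.request.urlopen(url) as response:
        raw = response.read()  # bytes
        # Use the charset declared in the Content-Type header, else assume UTF-8
        charset = response.headers.get_content_charset() or "utf-8"
    return raw.decode(charset)


# A data: URL stands in for a real web address so the sketch runs offline
demo_url = "data:text/html;charset=utf-8," + quote("<p>Some body content</p>")
html = fetch_html(demo_url)
print(type(html), html)
```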


### Working with the HTML response

In [36]:
from bs4 import BeautifulSoup







### Example of an HTML response:


In [40]:
html_doc = '''
    <!DOCTYPE html>
    <html lang="en">
      <head>
        <title>My HTML Document</title>
      </head>
      <body>
        <p>Some body content</p>
      </body>
    </html>
'''

#### What we get as an HTML string from a website (not very useful on its own):

In [54]:
html_doc

'\n    <!DOCTYPE html>\n    <html lang="en">\n      <head>\n        <title>My HTML Document</title>\n      </head>\n      <body>\n        <p>Some body content</p>\n      </body>\n    </html>\n'

In [45]:
soup = BeautifulSoup(html_doc, 'lxml')

### BeautifulSoup( ) arguments:

     BeautifulSoup(
        markup='',
        features=None,
        ...
    )

The second argument passed to BeautifulSoup lets you specify a parser. If you omit it, BeautifulSoup picks the best parser that is installed: **lxml** first, then html5lib, and finally Python's built-in `html.parser` from the standard library.

👉🏼 A **parser** is a program that takes some input (usually text) and produces a structured representation of it, typically a parse tree or an AST (abstract syntax tree).

👉🏼 Best practice is to specify the **parser explicitly**: if the parser BeautifulSoup would otherwise pick is not installed in a different environment, it falls back to another one and you might get a different result.
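The effect is easiest to see with invalid markup. The sketch below feeds the same broken snippet to each parser that happens to be installed; the snippet and the `results` dictionary are illustrative assumptions. Parsers that are missing are reported rather than silently replaced.

```python
from bs4 import BeautifulSoup, FeatureNotFound

broken_markup = "<a></p>"  # invalid HTML: a closing tag with no opening tag

results = {}
for parser in ("html.parser", "lxml", "html5lib"):
    try:
        results[parser] = str(BeautifulSoup(broken_markup, parser))
    except FeatureNotFound:
        # Raised by Beautiful Soup when the requested parser is not installed
        results[parser] = "(not installed)"

for parser, tree in results.items():
    print(f"{parser:12} {tree}")
```

With `html.parser` the stray `</p>` is dropped and the result is `<a></a>`; the other parsers, where available, may wrap the fragment in additional tags.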

### Working with the HTML tree

In [56]:
soup = BeautifulSoup(html_doc, 'lxml')

soup

<!DOCTYPE html>
<html lang="en">
<head>
<title>My HTML Document</title>
</head>
<body>
<p>Some body content</p>
</body>
</html>

In [53]:
type(soup)

bs4.BeautifulSoup

**After passing a string into the BeautifulSoup constructor, we get a data structure we can work with.**

### Attributes on the BeautifulSoup object correspond to the HTML tags:

In [57]:
print(soup.head)

<head>
<title>My HTML Document</title>
</head>


In [50]:
print(soup.title)


<title>My HTML Document</title>


In [51]:
print(soup.body)

<body>
<p>Some body content</p>
</body>
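Two details of this attribute access are worth a short sketch: it returns only the first matching tag, and it returns `None` when no such tag exists. The sample markup is an assumption, and `html.parser` is used here only because it ships with Python.

```python
from bs4 import BeautifulSoup

html_doc = """
<html>
  <body>
    <p>First paragraph</p>
    <p>Second paragraph</p>
  </body>
</html>
"""

# html.parser is chosen here only because it ships with Python
soup = BeautifulSoup(html_doc, "html.parser")

first_p = soup.body.p   # attribute access can be chained; returns the FIRST <p>
missing = soup.table    # no <table> in the document, so this is None

print(first_p)
print(missing)
```

Because a missing tag yields `None`, it is worth checking the result before accessing `.text` on it.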


### Accessing the .text attribute on a tag returns its text content without the tags:
*returns the text between the opening and closing tags, including the text of all nested tags*

In [52]:
print(soup.title.text)
print(soup.body.text)

My HTML Document

Some body content



In [59]:
soup.text

'\n\nMy HTML Document\n\n\nSome body content\n\n\n'
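The stray newlines come from the whitespace between tags. One way to get cleaner text is `get_text()` with a separator and `strip=True`; the sketch below assumes a shortened version of the document above and uses `html.parser` so it runs without extra packages.

```python
from bs4 import BeautifulSoup

html_doc = """
<html>
  <head><title>My HTML Document</title></head>
  <body>
    <p>Some body content</p>
  </body>
</html>
"""

soup = BeautifulSoup(html_doc, "html.parser")

# strip=True trims each text fragment and drops whitespace-only ones;
# the separator is placed between the remaining fragments
clean = soup.get_text(separator=" ", strip=True)
print(clean)
```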

### BeautifulSoup gives us useful methods to query the structure of the HTML document (e.g. find_all( ), get( ) )

The most commonly used method is `.find_all()`:

`soup.find_all(name, attrs, recursive, string, limit, **kwargs)`

* name: name of the tag, e.g. `"a"`, `"div"`, `"img"`
* attrs: a dictionary of attribute values the tag must have, e.g. `{"class": "nav", "href": "#menuitem"}`
* recursive: boolean; if `False`, only direct children are considered; if `True` (default), all descendants are searched
* string: matches against the text content of elements
* limit: stop after this many matches
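The parameters listed above can be sketched on a small document. The markup and variable names are assumptions chosen for illustration, and `html.parser` is used so the example needs no extra packages.

```python
from bs4 import BeautifulSoup

html_doc = """
<div class="nav">
  <a class="nav" href="#home">Home</a>
  <a class="nav" href="#about">About</a>
  <p><a href="#nested">Nested</a></p>
</div>
"""

soup = BeautifulSoup(html_doc, "html.parser")

all_links = soup.find_all("a")                           # name: every <a> tag
nav_links = soup.find_all("a", attrs={"class": "nav"})   # attrs: filter by attribute
direct = soup.div.find_all("a", recursive=False)         # only direct children of <div>
about = soup.find_all("a", string="About")               # string: match the tag's text
first = soup.find_all("a", limit=1)                      # limit: stop after one match

print(len(all_links), len(nav_links), len(direct), len(about), len(first))
```

`find_all()` always returns a list-like result, so an unsuccessful search yields an empty list rather than `None`.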



#### What if we need a piece of data that is not inside the element but stored as the value of an attribute?

`tag.get(key, default=None)`

Returns the value of the `key` attribute of the tag, or the value given for `default` if the tag does not have that attribute.

In [85]:
soup.html.get('lang') 

'en'
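A short sketch of how `.get()` behaves for present and missing attributes, compared with dictionary-style access. The markup and the fallback value `"n/a"` are assumptions for illustration.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<a class="btn primary" href="/x">Go</a>', "html.parser")
link = soup.a

href = link.get("href")              # attribute exists: its value
title = link.get("title")            # attribute missing: None
fallback = link.get("title", "n/a")  # attribute missing: the given default
classes = link.get("class")          # class can hold several values, so it is a list

try:
    link["title"]                    # dictionary-style access raises on a missing key
    raised = False
except KeyError:
    raised = True

print(href, title, fallback, classes, raised)
```

`.get()` is the safer choice when an attribute may be absent; square brackets are fine when the attribute is known to exist.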

#### Example:

In [87]:
html_doc = '''
<html lang="en">
  <head>
    <title>My HTML Document</title>
  </head>
  <body>
    <a style="color: red;" href="https://www.spiced-academy.com">Spiced</a>
    <a style="color: blue;" href="https://www.google.com">Spiced</a>
  </body>
</html>
'''

soup = BeautifulSoup(html_doc, 'lxml')




In [88]:
soup.find_all('a')

[<a href="https://www.spiced-academy.com" style="color: red;">Spiced</a>,
 <a href="https://www.google.com" style="color: blue;">Spiced</a>]

In [90]:
for link in soup.find_all('a'):
    print(link.get('href'))
    print(link.get('style'))

https://www.spiced-academy.com
color: red;
https://www.google.com
color: blue;
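In practice the values are usually collected rather than printed. The sketch below gathers link text and target into a list and uses the keyword filter `href=True` to skip `<a>` tags without an `href`. The extra anchor and the changed link text are assumptions added for the example.

```python
from bs4 import BeautifulSoup

html_doc = """
<body>
  <a style="color: red;" href="https://www.spiced-academy.com">Spiced</a>
  <a id="top">Anchor without href</a>
  <a style="color: blue;" href="https://www.google.com">Google</a>
</body>
"""

soup = BeautifulSoup(html_doc, "html.parser")

# href=True keeps only <a> tags that have an href attribute, whatever its value
links = [
    (link.get_text(), link.get("href"))
    for link in soup.find_all("a", href=True)
]

for text, href in links:
    print(text, href)
```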
