# 🌐 Scraping, Part 1.0: What is HTML?
   

*... and what is it made of?*


## Q: What is HTML?

## A: It's the stuff webpages are made of!

```html
<p>
    Hello, world!
</p>
<p>
    <strong>Hello</strong>, <em>world</em>!
</p>
<p>
    <span style="background-color: lime;">Hello</span> ...
</p>
<ul>
    <li>world!</li>
    <li>class!</li>
    <li>robot!</li>
</ul>
```

<center>↓↓↓</center>

<p>
    Hello, world!
</p>
<p>
    <strong>Hello</strong>, <em>world</em>!
</p>
<p>
    <span style="background-color: lime;">Hello</span> ...
</p>
<ul>
    <li>world!</li>
    <li>class!</li>
    <li>robot!</li>
</ul>

## Q: What is HTML made of?

## A: Elements

![](https://developer.mozilla.org/en-US/docs/Glossary/Element/anatomy-of-an-html-element.png)

<cite>Source: https://developer.mozilla.org/en-US/docs/Glossary/Element</cite>

## Q: What are elements made of?


## A: Tags, attributes, and content (... sitting in a tree)


## Tags 🏷

Tags define __elements__, which are the fundamental units of a webpage. Elements can represent paragraphs, lists, videos, and many other things. 

<small><cite>There are lots of different tags/elements, each with their own purpose: https://www.w3schools.com/tags/</cite></small>

Tags (almost) always __come in pairs__: one marking the beginning of the element, and another marking the end.

The pairs (typically) __surround some content__ (more on this later).

__Opening tags__ start with a __left angle bracket__ (`<`)
- ... followed by the __tag's name__ (e.g., `span`)
- ... optionally followed by a __series of attributes__ (e.g., `style="background-color: lime;"`)
- ... followed by a __right angle bracket__ (`>`)

Then comes the __content__ (e.g., `Hello`).

__Closing tags__ always take the same form: `</tagname>`
- For instance: `</span>`
- Forgetting the slash is a common mistake
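
To see those three pieces (opening tag, content, closing tag) pulled apart, here is a minimal sketch using Python's built-in `html.parser` module. The `TagLogger` class and its `events` list are names made up for this illustration; only `HTMLParser` and its `handle_*` methods come from the standard library.

```python
from html.parser import HTMLParser


class TagLogger(HTMLParser):
    """Hypothetical helper: record each piece of an element as it is parsed."""

    def __init__(self):
        super().__init__()
        self.events = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) tuples taken from the opening tag
        self.events.append(("open", tag, attrs))

    def handle_data(self, data):
        # Text found between the opening and closing tags
        self.events.append(("content", data))

    def handle_endtag(self, tag):
        self.events.append(("close", tag))


logger = TagLogger()
logger.feed('<span style="background-color: lime;">Hello</span>')
logger.close()

for event in logger.events:
    print(event)
# ('open', 'span', [('style', 'background-color: lime;')])
# ('content', 'Hello')
# ('close', 'span')
```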
    


![](https://developer.mozilla.org/en-US/docs/Glossary/Element/anatomy-of-an-html-element.png)

<cite>Source: https://developer.mozilla.org/en-US/docs/Glossary/Element</cite>

### An exception to the pairing rule

Some simple elements — those that can never contain content — are known as "__void elements__" and don't use a closing tag.

A common example is `<hr>` (the horizontal rule).

<small><cite>Read more: https://developer.mozilla.org/en-US/docs/Glossary/Void_element</cite></small>
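
A quick way to see the exception in action is to compare which tags open and which tags close. This is only a sketch: `TagCounter`, `opened`, and `closed` are hypothetical names, and the sample HTML is invented for the example.

```python
from html.parser import HTMLParser


class TagCounter(HTMLParser):
    """Hypothetical helper: keep separate lists of opening and closing tags."""

    def __init__(self):
        super().__init__()
        self.opened = []
        self.closed = []

    def handle_starttag(self, tag, attrs):
        self.opened.append(tag)

    def handle_endtag(self, tag):
        self.closed.append(tag)


counter = TagCounter()
counter.feed("<p>Above the line</p><hr><p>Below the line</p>")
counter.close()

print(counter.opened)  # ['p', 'hr', 'p']
print(counter.closed)  # ['p', 'p']
```

The `<hr>` shows up among the opening tags but never among the closing tags, which is what you would expect from a void element.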

## Attributes ℹ️

Attributes define an element's __non-content properties__.

Attributes are written as __key-value pairs__, in the form of `key="value"`, e.g., `href="https://example.com"`.

__Sometimes the value is omitted__. These are known as boolean attributes: simply being present means `true`, and being absent means `false`. For instance, `<input type="checkbox" checked>` (<input type="checkbox" checked/>).

One of the most common attributes you'll see is the `href` attribute on the `<a>` tag. 

An `<a>` tag represents a hyperlink, and the `href` attribute defines its destination — i.e., its URL.

For instance, `<a href="https://example.com">Click here</a>` generates this: <a href="https://example.com">Click here</a>.
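
When scraping, attributes are often the part you are after. Below is a minimal sketch that collects each tag's attributes as a dictionary; `AttributeCollector` and `found` are hypothetical names. Note one quirk of Python's `html.parser`: an attribute written without a value (like `checked`) is reported as `None`, which is Python's way of saying "present, but no value given".

```python
from html.parser import HTMLParser


class AttributeCollector(HTMLParser):
    """Hypothetical helper: store (tag, attributes) for every opening tag."""

    def __init__(self):
        super().__init__()
        self.found = []

    def handle_starttag(self, tag, attrs):
        # Convert the list of (name, value) tuples into a dictionary
        self.found.append((tag, dict(attrs)))


collector = AttributeCollector()
collector.feed(
    '<a href="https://example.com">Click here</a>'
    '<input type="checkbox" checked>'
)
collector.close()

for tag, attributes in collector.found:
    print(tag, attributes)
# a {'href': 'https://example.com'}
# input {'type': 'checkbox', 'checked': None}
```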

## Content 🔤

Non-void elements can hold __text and/or other elements__.

For example, the `Click here` in `<a href="https://example.com">Click here</a>` is *text content*.

Content can also be __one or more elements__.

Here's an example with elements inside another element, and text inside *those* elements: `<ol> <li>One</li> <li>Two</li> <li>Three</li> </ol>`

And here's how that'd render:

<blockquote>
<ol> <li>One</li> <li>Two</li> <li>Three</li> </ol></blockquote>
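
The same list can be taken apart in code. This sketch gathers the text content of each `<li>`; the `ListItemText` class and its `items` list are hypothetical names, and it assumes list items are not nested inside one another.

```python
from html.parser import HTMLParser


class ListItemText(HTMLParser):
    """Hypothetical helper: collect the text inside each <li> element."""

    def __init__(self):
        super().__init__()
        self.inside_li = False
        self.items = []

    def handle_starttag(self, tag, attrs):
        if tag == "li":
            self.inside_li = True
            self.items.append("")  # start a new, empty item

    def handle_endtag(self, tag):
        if tag == "li":
            self.inside_li = False

    def handle_data(self, data):
        # Ignore text (such as whitespace) that sits outside any <li>
        if self.inside_li:
            self.items[-1] += data


parser = ListItemText()
parser.feed("<ol> <li>One</li> <li>Two</li> <li>Three</li> </ol>")
parser.close()

print(parser.items)  # ['One', 'Two', 'Three']
```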

## ... sitting in a tree 🌳

Because elements can contain other elements, a webpage's structure comes to resemble a tree.

You'll see this tree structure referenced often as the "__DOM__", short for Document Object Model.

As with a family tree, you'll hear/see the relationships between elements (also sometimes called "nodes") referred to as __parent__, __child__, and __sibling__ elements.

... but unlike in real families, a __parent can have any number of children__, while a __child can only have one parent__.
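
Here is a rough sketch of how those family relationships can be recovered from markup by tracking which elements are currently open. `FamilyTree`, `stack`, and `relations` are hypothetical names, and the sketch assumes tidy HTML where every element is explicitly closed.

```python
from html.parser import HTMLParser


class FamilyTree(HTMLParser):
    """Hypothetical helper: record each element together with its parent."""

    def __init__(self):
        super().__init__()
        self.stack = []      # elements that are currently open, outermost first
        self.relations = []  # (child, parent) pairs, in document order

    def handle_starttag(self, tag, attrs):
        # The innermost open element is the new element's parent
        parent = self.stack[-1] if self.stack else None
        self.relations.append((tag, parent))
        self.stack.append(tag)

    def handle_endtag(self, tag):
        if self.stack and self.stack[-1] == tag:
            self.stack.pop()


tree = FamilyTree()
tree.feed("<ul><li>world!</li><li>class!</li><li>robot!</li></ul>")
tree.close()

print(tree.relations)
# [('ul', None), ('li', 'ul'), ('li', 'ul'), ('li', 'ul')]
```

The three `li` elements share the same parent (`ul`), which makes them siblings. The `ul` has no parent here only because the snippet is a fragment rather than a whole page.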

The __top of the tree__ is consistent:

```
  ┌──────────┐
  │  <html>  │
  └─┬────────┘
    │  ┌──────────┐
    ├─►│  <head>  │
    │  └──────────┘
    │  ┌──────────┐
    └─►│  <body>  │
       └──────────┘
```

... but everything else is largely free-form.

Elements can be nested as deeply as you like ...

... but they *cannot cross one another's boundaries*. 

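For instance, `<p><span>Hello</p></span>` is invalid, because the `<span>` opens inside the `<p>` but closes outside it. The valid version is `<p><span>Hello</span></p>`.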
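
The no-crossing rule can be checked with a stack: every closing tag should match the most recently opened element. The sketch below is deliberately strict and only an illustration; `NestingChecker`, `find_nesting_problems`, and `VOID_ELEMENTS` are hypothetical names, and real browsers are far more forgiving (for example, HTML lets some closing tags be left out entirely).

```python
from html.parser import HTMLParser

# Void elements never get a closing tag, so they are never pushed on the stack
VOID_ELEMENTS = {
    "area", "base", "br", "col", "embed", "hr", "img",
    "input", "link", "meta", "source", "track", "wbr",
}


class NestingChecker(HTMLParser):
    """Hypothetical helper: flag closing tags that cross another element."""

    def __init__(self):
        super().__init__()
        self.stack = []     # elements that are currently open
        self.problems = []  # (closing tag seen, tag that was expected)

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_ELEMENTS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        if tag in VOID_ELEMENTS:
            return
        if self.stack and self.stack[-1] == tag:
            self.stack.pop()
        else:
            expected = self.stack[-1] if self.stack else None
            self.problems.append((tag, expected))


def find_nesting_problems(html):
    checker = NestingChecker()
    checker.feed(html)
    checker.close()
    return checker.problems


print(find_nesting_problems("<p><span>Hello</p></span>"))  # [('p', 'span')]
print(find_nesting_problems("<p><span>Hello</span></p>"))  # []
```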

## Let's look at some HTML!

1. Open your browser and visit __https://example.com__
2. View the source — here's how:
    - On Firefox: `Command-U` (macOS) or `Ctrl-U` (Windows/Linux)
    - On Chrome: `Command-Option-U` (macOS) or `Ctrl-U` (Windows/Linux)
    - On Safari: Preferences->Advanced->Show Develop menu ... then `Command-Option-U`
3. Alternatively, use the tool at __https://neatnik.net/view-source/__
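
If you would rather grab the source from code, the following is a minimal sketch using only Python's standard library. The function name `fetch_source` and the 10-second default timeout are choices made for this example, and some websites refuse requests that do not come from a browser, so treat it as a starting point.

```python
from urllib.request import urlopen


def fetch_source(url, timeout=10):
    """Return the raw HTML that the server sends back for `url`."""
    with urlopen(url, timeout=timeout) as response:
        # Fall back to UTF-8 when the server does not declare a charset
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")
```

Calling `print(fetch_source("https://example.com"))` should print roughly the same text that "View source" shows in the browser.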

## What do you see?

What do you think the various tags and attributes represent?


## Exercise: Try that with a few of your favorite websites

What do you see?

---
