# Scraping Data from the Web

These notes follow a DataQuest course. We'll use `requests` to download a web page and `beautifulsoup` to navigate its content.

In [2]:
import requests
from bs4 import BeautifulSoup

Let us download the HTML code for a simple page. This is done with `requests.get`.

In [6]:
response = requests.get("http://dataquestio.github.io/web-scraping-pages/simple.html")
print(f"status_code = {response.status_code}\n")
print(response.content.decode("utf-8"))

status_code = 200

<!DOCTYPE html>
<html>
    <head>
        <title>A simple example page</title>
    </head>
    <body>
        <p>Here is some simple content for this page.</p>
    </body>
</html>


The basics of HTML are covered in this [MDN guide](https://developer.mozilla.org/en-US/docs/Learn/HTML).

We can now use `beautifulsoup` in order to navigate the nested structure of the page, and to search for content corresponding to certain tags. The nested tag structure of a page can be viewed as a tree, which is how `beautifulsoup` represents it:

* Root `html`: Children `head`, `body`
* Node `head`: Child `title`
* Node `body`: Child `p`

In [8]:
parser = BeautifulSoup(response.content, 'html.parser')
head = parser.head
title = head.title
print(title.text)

A simple example page
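
To check the tree structure described above, one option is to list the direct child tags of each node. The following is a minimal sketch: it uses an inline copy of the simple page instead of downloading it, and `child_tags` is a helper name made up for this example.

```python
from bs4 import BeautifulSoup

# Inline copy of the simple page shown above, so no download is needed.
html = """<!DOCTYPE html>
<html>
    <head>
        <title>A simple example page</title>
    </head>
    <body>
        <p>Here is some simple content for this page.</p>
    </body>
</html>"""

soup = BeautifulSoup(html, "html.parser")


def child_tags(element):
    """Return the names of the direct child tags of an element."""
    # True matches every tag; recursive=False restricts the search
    # to direct children instead of the whole subtree.
    return [child.name for child in element.find_all(True, recursive=False)]


print(child_tags(soup.html))  # ['head', 'body']
print(child_tags(soup.head))  # ['title']
print(child_tags(soup.body))  # ['p']
```

Using `find_all(True, recursive=False)` rather than iterating over `.children` skips the whitespace text nodes that sit between the tags.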


`find_all` finds all occurrences of a tag in some element (which can be the whole page, or a subtree) and returns them as a list.

In [10]:
content = response.content
parser = BeautifulSoup(content, 'html.parser')

# Get a list of all occurrences of the body tag in the element.
body = parser.find_all("body")

# Get the paragraph tag.
p = body[0].find_all("p")

# Get the text.
print(p[0].text)

Here is some simple content for this page.


Tags can have an `id` attribute with a unique value. We can use `find_all` to find only tags with a certain `id`. For example: `parser.find_all("p", id="first")`.

Tags can also have a `class` attribute, whose value need not be unique: many tags can share the same class. Classes are typically used to style a group of tags in the same way. We can use `find_all` to find only tags with a certain `class`. For example: `parser.find_all("p", class_="inner_text")`. Note the trailing underscore in `class_`, which is needed because `class` is a reserved word in Python.
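
As a small sketch of both filters, the snippet below parses an inline page rather than a downloaded one. The page content and its `id` and `class` values are made up for this example.

```python
from bs4 import BeautifulSoup

# Hypothetical page, invented for illustration only.
html = """
<html>
    <body>
        <p id="first" class="inner-text">First paragraph.</p>
        <p id="second" class="inner-text">Second paragraph.</p>
        <p id="third" class="outer-text">Third paragraph.</p>
    </body>
</html>
"""

soup = BeautifulSoup(html, "html.parser")

# Filter by id: at most one tag should match, but find_all still returns a list.
first = soup.find_all("p", id="first")
print(first[0].text)  # First paragraph.

# Filter by class: several tags can share the same class.
inner = soup.find_all("p", class_="inner-text")
print([p.text for p in inner])  # ['First paragraph.', 'Second paragraph.']
```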

## CSS Selectors

Cascading Style Sheets, or CSS, is a language for adding styles to HTML pages. You may have noticed that the simple HTML pages we have seen so far have no styling (all of the paragraphs have black text and the same font size). Most web pages use CSS to display a lot more than basic black text.

CSS uses selectors to add styles to the elements and classes of elements you specify. You can use selectors to add background colors, text colors, borders, padding, and many other style choices to the elements on HTML pages.

These notes don't cover CSS in depth. If you'd like to learn more about CSS, [MDN's guide](https://developer.mozilla.org/en-US/docs/Learn/CSS) is a great place to start.

A CSS selector picks out the tags that a style applies to, for example by tag name or by `class` or `id` attribute.

This CSS will make the text inside all paragraphs red:

```
p {
    color: red;
}
```

This CSS will change the text color to red for any paragraphs that have the class inner-text. We select classes with the period or dot symbol (.):

```
p.inner-text {
    color: red;
}
```

This CSS will change the text color to red for any paragraphs that have the ID first. We select IDs with the pound or hash symbol (#):

```
p#first {
    color: red;
}
```

You can also style IDs and classes without using any specific tags. For example, this CSS will make the element with the ID first red (not just paragraphs):

```
#first {
    color: red;
}
```

This CSS will make any element with the class inner-text red:

```
.inner-text {
    color: red;
}
```

We can use CSS selectors to select elements in `beautifulsoup` as well.
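
As a minimal sketch, the selectors from the CSS examples above can be passed as strings to the `select` method, which returns a list of matching tags. The inline page below is made up for this example, and `names` is a hypothetical helper used only to print the matches compactly.

```python
from bs4 import BeautifulSoup

# Hypothetical page, invented for illustration only.
html = """
<html>
    <body>
        <div class="inner-text">
            <p id="first" class="inner-text">First paragraph.</p>
            <p id="second">Second paragraph.</p>
        </div>
    </body>
</html>
"""

soup = BeautifulSoup(html, "html.parser")


def names(tags):
    """Return the tag name of each match, in document order."""
    return [tag.name for tag in tags]


print(names(soup.select("p")))             # ['p', 'p']
print(names(soup.select("p.inner-text")))  # ['p']
print(names(soup.select("p#first")))       # ['p']
print(names(soup.select("#first")))        # ['p']
print(names(soup.select(".inner-text")))   # ['div', 'p']
```

Note how `.inner-text` matches any tag with that class, while `p.inner-text` matches only paragraphs.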