# 1 - Introduction



A lot of data aren't accessible through data sets or APIs. They may exist on the Internet as Web pages, though. One way to access the data without waiting for the provider to create an API is to use a technique called **Web scraping**.

Web scraping allows us to load a Web page into Python and extract the information we want. We can then work with the data using standard analysis tools like **pandas** and **numpy**.

Before we can do Web scraping, we need to understand the structure of the Web page we're working with, then find a way to extract parts of that structure in a sensible way.

We'll use the **requests** library heavily as we learn about Web scraping. This library enables us to download a Web page. We'll also use the **BeautifulSoup** library to extract the relevant parts of the Web page.



# 2 - Web Page Structure




Web pages use HyperText Markup Language (HTML). HTML isn't a programming language like Python. It's a markup language with its own syntax and rules. When a Web browser like Chrome or Firefox downloads a Web page, it reads the HTML to determine how to render it and display it to you.

Here's the HTML for a very simple Web page:

```html
<!DOCTYPE html>
<html>
    <head>
        <title>A simple example page</title>
    </head>
    <body>
        <p>Here is some simple content for this page.</p>
    </body>
</html>
```

You can see what this page looks like on our [GitHub site](https://nbviewer.jupyter.org/github/ivanovitchm/datascience_one_2019_1/blob/master/Lesson%2319/html/simple.html).

HTML consists of tags. We open a tag like this:

```html
<p>
```

We close a tag like this:


```html
</p>
```

Anything in between the opening and closing of a tag is the content of that tag. We can nest tags to create complex formatting rules. Here's an example:

```html
<p><b>Hello, this is a bold paragraph</b></p>
```


The **b** tag bolds the content inside it, and the **p** tag creates a new paragraph. The HTML above will display as a bold paragraph because the **b** tag is inside the **p** tag. In other words, the **b** tag is nested within the **p** tag.
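To make the nesting visible, here is a minimal sketch that uses Python's standard-library `html.parser` module (not the library used in the rest of this lesson) to record each opening tag along with its depth. The `TagTracer` class name is hypothetical and exists only for this illustration.

```python
from html.parser import HTMLParser


class TagTracer(HTMLParser):
    """Record every opening tag together with its nesting depth."""

    def __init__(self):
        super().__init__()
        self.depth = 0
        self.trace = []

    def handle_starttag(self, tag, attrs):
        # Store the tag at the current depth, then step one level deeper.
        self.trace.append((tag, self.depth))
        self.depth += 1

    def handle_endtag(self, tag):
        # A closing tag takes us back up one level.
        self.depth -= 1


tracer = TagTracer()
tracer.feed("<p><b>Hello, this is a bold paragraph</b></p>")
tracer.close()
print(tracer.trace)  # [('p', 0), ('b', 1)]
```

The **b** tag shows up at depth 1 because it sits inside the **p** tag at depth 0.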

HTML documents contain a few major sections. The **head** section contains information that's useful to the Web browser that's rendering the page; the user doesn't see it. The **body** section contains the bulk of the content the user interacts with on the page.

Different tags have different purposes. For example, the **title** tag tells the Web browser what page title to display at the top of your tab. The **p** tag indicates that the content inside it is a single paragraph.

We won't cover tags comprehensively here, but please read the [Mozilla Developer Network's (MDN)](https://developer.mozilla.org/en-US/Learn/Getting_started_with_the_web/HTML_basics) article on HTML basics if you need more of a grounding on this topic. Check out [MDN's guide to the HTML element](https://developer.mozilla.org/en-US/docs/Web/HTML/Element) for a list of all possible HTML tags. In order to do Web scraping effectively, you'll need a solid understanding of the various tags and how they work.


**HyperText Transfer Protocol (HTTP)**

- Foundation of data communication for the web
- HTTP is the protocol that is used by web servers and browsers to communicate. 
- HTTP is based on a request and a response.

<img width="400" alt="creating a repo" src="https://drive.google.com/uc?export=view&id=0BxhVm1REqwr0SW5VVkYyTzhxYmM">


```python
from urllib.request import urlopen, Request
url = 'https://www.wikipedia.org'
request = Request(url)
response = urlopen(request)
content = response.read()
response.close()
```
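A request object can be inspected before it is sent, which helps show what "a request" actually contains. The sketch below also wraps the download in a `with` statement so the response is closed even if reading fails. The `fetch` helper and its `timeout` default are assumptions made for this illustration; the function is defined but not called, so nothing is downloaded.

```python
from urllib.request import Request, urlopen


def fetch(url, timeout=10):
    """Send a GET request and return the raw body as bytes."""
    request = Request(url)
    # The with-statement closes the response automatically.
    with urlopen(request, timeout=timeout) as response:
        return response.read()


# Build a request without sending it, and look at its parts.
request = Request('https://www.wikipedia.org')
print(request.get_method())  # GET
print(request.host)          # www.wikipedia.org
```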


**Exercise**

<img width="100" src="https://drive.google.com/uc?export=view&id=1E8tR7B9YYUXsU_rddJAyq0FrM0MSelxZ">



1. Make a GET request to this [Link](https://nbviewer.jupyter.org/github/ivanovitchm/datascience_one_2019_1/blob/master/Lesson%2319/html/simple.html), and assign the result to the variable **response**.
2. Use **response.read()**  to get the content of the response, and assign it to **content**.
  - Note how the content is the same as the HTML above.


In [0]:
# put your code here
from urllib.request import urlopen, Request
url = 'https://nbviewer.jupyter.org/github/ivanovitchm/datascience_one_2019_1/blob/master/Lesson%2319/html/simple.html'
request = Request(url)
response = urlopen(request)
content = response.read()
response.close()

In [0]:
content

b'<!DOCTYPE html>\n<html>\n    <head>\n        <title>A simple example page</title>\n    </head>\n    <body>\n        <p>Here is some simple content for this page.</p>\n    </body>\n</html>'
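The leading `b` in the output shows that `response.read()` returns bytes rather than a string. A minimal sketch of the conversion, assuming the page is encoded as UTF-8 (the sample bytes below are a shortened stand-in for the real response):

```python
# A shortened stand-in for the bytes returned by response.read().
raw = b'<p>Here is some simple content for this page.</p>'

# Decode the bytes into a regular Python string (UTF-8 is assumed here).
text = raw.decode('utf-8')

print(type(raw).__name__, type(text).__name__)  # bytes str
print(text)
```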

# 3 - Requests: HTTP for Humans




[**Requests**](http://docs.python-requests.org/en/master/) allows you to send organic, grass-fed HTTP/1.1 requests, without the need for manual labor. There's no need to manually add query strings to your URLs, or to form-encode your POST data. Keep-alive and HTTP connection pooling are 100% automatic.

Requests is one of the most downloaded Python packages of all time, pulling in over 400,000 downloads each day. Join the party!
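As a small illustration of the automatic query-string handling mentioned above, the sketch below prepares a request without sending it and prints the URL that Requests would use. The `search` parameter is hypothetical and chosen only to show how the URL is built.

```python
import requests

# Hypothetical query parameter, used only to show how the URL is assembled.
request = requests.Request(
    'GET',
    'https://www.wikipedia.org',
    params={'search': 'python'},
)

# prepare() builds the final request but does not send anything.
prepared = request.prepare()

print(prepared.method)
print(prepared.url)
```

Passing the same `params` dictionary to `requests.get` would build the URL the same way before sending the request.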

In [0]:
# import package
import requests

# specify the url
url = 'https://www.wikipedia.org'

# package the request, send it, and catch the response
response = requests.get(url)

# extract the text
text = response.text


**Exercise**

<img width="100" src="https://drive.google.com/uc?export=view&id=1E8tR7B9YYUXsU_rddJAyq0FrM0MSelxZ">


1. Make a GET request to this [Link](https://nbviewer.jupyter.org/github/ivanovitchm/datascience_one_2019_1/blob/master/Lesson%2319/html/simple.html), and assign the result to the variable **response**.
2. Use **response.text** to get the content of the response as a string, and assign it to **content**.
3. Note how the content is the same as the HTML above.


In [0]:
# put your code here
url = 'https://nbviewer.jupyter.org/github/ivanovitchm/datascience_one_2019_1/blob/master/Lesson%2319/html/simple.html'
response = requests.get(url)

# extract the text
content = response.text

In [0]:
content

'<!DOCTYPE html>\n<html>\n    <head>\n        <title>A simple example page</title>\n    </head>\n    <body>\n        <p>Here is some simple content for this page.</p>\n    </body>\n</html>'

# 4 - Understanding Status Codes and Headers




```python
response.status_code
response.headers
```

- **200** - Everything went okay, and the server returned a result (if any).
- **301** - The server is redirecting you to a different endpoint. This can happen when a company switches domain names, or an endpoint's name has changed.
- **400** - The server thinks you made a bad request. This can happen when you don't send the information the API requires to process your request, among other things.
- **401** - The server thinks you're not authenticated. This happens when you don't send the right credentials to access an API (we'll talk about this in a later lesson).
- **403** - The resource you're trying to access is forbidden; you don't have the right permissions to see it.
- **404** - The server didn't find the resource you tried to access.
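The standard library knows the official reason phrase for each of these codes. The sketch below looks them up with `http.HTTPStatus`; the `describe` helper and its category labels are assumptions made for this illustration.

```python
from http import HTTPStatus


def describe(code):
    """Return the standard reason phrase and a coarse category for a status code."""
    phrase = HTTPStatus(code).phrase
    if 200 <= code < 300:
        category = 'success'
    elif 300 <= code < 400:
        category = 'redirection'
    elif 400 <= code < 500:
        category = 'client error'
    else:
        category = 'other'
    return phrase, category


for code in (200, 301, 400, 401, 403, 404):
    print(code, *describe(code))
```

With Requests, the same number is available as `response.status_code`, and `response.raise_for_status()` raises an exception for 4xx and 5xx responses.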



**Exercise**

<img width="100" src="https://drive.google.com/uc?export=view&id=1E8tR7B9YYUXsU_rddJAyq0FrM0MSelxZ">


1. Make a GET request to https://portal.imd.ufrn.br/portal, and assign the result to the variable **response**.
2. Investigate the values of the **status_code** and **headers** properties. What kind of information can you extract from them?


In [0]:
# put your code here
response = requests.get('https://portal.imd.ufrn.br/portal')

response.status_code

200
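The **headers** property behaves like a dictionary whose keys ignore case. The sketch below reproduces that behavior offline with `requests.structures.CaseInsensitiveDict`; the header values are made up for illustration, and the real server's headers will differ.

```python
from requests.structures import CaseInsensitiveDict

# Made-up header values standing in for response.headers.
headers = CaseInsensitiveDict({
    'Content-Type': 'text/html; charset=UTF-8',
    'Server': 'nginx',
})

# Header names can be looked up in any letter case.
print(headers['content-type'])

# Split off the media type from the optional charset parameter.
content_type = headers.get('Content-Type', '')
media_type = content_type.split(';')[0].strip()
print(media_type)  # text/html
```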

# 5 - Retrieving Elements from a Page




Downloading the page is the easy part. Let's say that we want to get the text in the first paragraph. Now we need to parse the page and extract the information we want.

We'll use the [BeautifulSoup](https://www.crummy.com/software/BeautifulSoup/) library to parse the Web page with Python. This library allows us to extract tags from an HTML document.

We can think of HTML documents as "trees," and the nested tags as "branches" (similar to a family tree). BeautifulSoup works the same way.

If we look at this page, for example, the root of the "tree" is the **html** tag:

```html
<!DOCTYPE html>
<html>
    <head>
        <title>A simple example page</title>
    </head>
    <body>
        <p>Here is some simple content for this page.</p>
    </body>
</html>
```

The **html** tag contains two "branches", **head** and **body**. **head** contains one "branch", **title**. **body** contains one "branch", **p**. Drilling down through these multiple branches is one way to parse a Web page.

To extract the text inside the **p** tag, we would first need to get the **body** element, then the **p** element, and then finally the text inside the **p** element.
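To see the "tree" for yourself, the sketch below parses an inline copy of the page (so no download is needed) and prints each tag indented by its depth. The `outline` helper is hypothetical and written only for this illustration.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

html = """<!DOCTYPE html>
<html>
    <head>
        <title>A simple example page</title>
    </head>
    <body>
        <p>Here is some simple content for this page.</p>
    </body>
</html>"""


def outline(node, depth=0):
    """Return one indented line per tag found below the given node."""
    lines = []
    for child in node.children:
        # Skip text nodes and the doctype; only tags form branches.
        if isinstance(child, Tag):
            lines.append("    " * depth + child.name)
            lines.extend(outline(child, depth + 1))
    return lines


parser = BeautifulSoup(html, 'html.parser')
tree = outline(parser)
print("\n".join(tree))
```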

In [0]:
# import package
import requests
from bs4 import BeautifulSoup

# specify the url
url = 'https://nbviewer.jupyter.org/github/ivanovitchm/\
datascience_one_2019_1/blob/master/Lesson%2319/html/simple.html'

# package the request, send it, and catch the response
response = requests.get(url)

# extract the content
content = response.content

In [0]:
# Initialize the parser, and pass in the content we grabbed earlier.
parser = BeautifulSoup(content, 'html.parser')

# Get the body tag from the document.
# Since we passed in the top level of the document to the parser, 
# we need to pick a branch off of the root.
# With BeautifulSoup, we can access branches by using tag types as attributes.
body = parser.body

# Get the p tag from the body.
p = body.p

# Print the text inside the p tag.
print(p.text)

Here is some simple content for this page.


**Exercise**

<img width="100" src="https://drive.google.com/uc?export=view&id=1E8tR7B9YYUXsU_rddJAyq0FrM0MSelxZ">


1. Get the text inside the **title** tag, and assign the result to **title_text**.


In [0]:
# put your code here
title_text = parser.head.title.text

title_text

'A simple example page'

# 6 - Using Find All




While it's nice to use the tag type as a property, it's not always a very robust way to parse a document. It's usually better to be more explicit by using the <span style="background-color: #F9EBEA; color: #C0392B">find_all</span> method. This method will find all occurrences of a tag in the current element, and return a list.

If we only want the first occurrence of an item, we'll need to index the list to get it. Aside from this difference, it behaves the same way as passing in the tag type as an attribute.

In [0]:
parser = BeautifulSoup(content, 'html.parser')

# Get a list of all occurrences of the body tag in the element.
body = parser.find_all("body")

# Get the paragraph tag.
p = body[0].find_all("p")

# Get the text.
print(p[0].text)

Here is some simple content for this page.
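BeautifulSoup also provides a **find** method, which returns only the first match instead of a list. The sketch below compares the two on a small inline document (made up for this illustration) and shows what each returns when nothing matches.

```python
from bs4 import BeautifulSoup

html = "<body><p>First</p><p>Second</p></body>"
parser = BeautifulSoup(html, 'html.parser')

# find_all returns a list of every match.
paragraphs = parser.find_all('p')
print(len(paragraphs))  # 2

# find returns just the first match.
print(parser.find('p').text)  # First

# When nothing matches, find returns None and find_all returns an empty list.
print(parser.find('table'))      # None
print(parser.find_all('table'))  # []
```

Checking for `None` or an empty list before indexing avoids an error when a page doesn't contain the tag you expect.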


**Exercise**

<img width="100" src="https://drive.google.com/uc?export=view&id=1E8tR7B9YYUXsU_rddJAyq0FrM0MSelxZ">

1. Apply the **find_all** method to get the text inside the title tag, and assign the result to **title_text**.
2. Print all links from https://portal.imd.ufrn.br/portal/
```python
# Get a list of all links.
a_tags = parser.find_all("a")
# Print the URL (href attribute) of each link.
for link in a_tags:
    print(link.get('href'))
```


In [0]:
# put your code here
title_text = parser.find_all('title')[0].text

title_text

'A simple example page'

In [0]:
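# Download the portal page and parse it.
response = requests.get('https://portal.imd.ufrn.br/portal/')
parser = BeautifulSoup(response.content, 'html.parser')

# Get a list of all links.
a_tags = parser.find_all("a")
# Print the URL (href attribute) of each link.
for link in a_tags:
    print(link.get('href'))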
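Links on real pages are often relative, and some **a** tags have no **href** at all. The sketch below handles both cases with `urllib.parse.urljoin` and the `href=True` filter. The markup is hypothetical; the real portal's links will differ.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

base_url = 'https://portal.imd.ufrn.br/portal/'

# Hypothetical markup standing in for the downloaded page.
html = '''
<a href="/portal/noticias">News</a>
<a href="https://www.ufrn.br">UFRN</a>
<a name="top">No href here</a>
'''
parser = BeautifulSoup(html, 'html.parser')

# href=True keeps only the tags that actually have an href attribute.
# urljoin turns relative paths into absolute URLs and leaves absolute ones alone.
links = [urljoin(base_url, a['href']) for a in parser.find_all('a', href=True)]
print(links)
```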

# 7 - Element IDs




HTML allows elements to have **IDs**. Because they are unique, we can use an **ID** to refer to a specific element.

Here's an example page:

```html
<html>
    <head>
        <title>A simple example page</title>
    </head>
    <body>
        <div>
            <p id="first">
                First paragraph.
            </p>
        </div>
        <p id="second">
            <b>
                Second paragraph.
            </b>
        </p>
    </body>
</html>
```

You can see the page [here](https://nbviewer.jupyter.org/github/ivanovitchm/datascience_one_2019_1/blob/master/Lesson%2319/html/simple_ids.html).

HTML uses the **div** tag to create a divider that splits the page into logical units. We can think of a divider as a "box" that contains content. For example, different dividers hold a Web page's footer, sidebar, and horizontal menu.

There are two paragraphs on the page; the first is nested inside a **div**. Luckily, the paragraphs have **IDs**. This means we can access them easily, even though they're nested.

Let's use the **find_all** method to access those paragraphs, and pass in the additional **id** attribute.
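Because an ID should match at most one element, the **find** method (which returns a single element rather than a list) is a natural fit as well. The sketch below works on an inline copy of the example page's body, so it runs without a network connection, and uses `get_text(strip=True)` to trim the indentation whitespace around the text.

```python
from bs4 import BeautifulSoup

html = '''
<div>
    <p id="first">
        First paragraph.
    </p>
</div>
<p id="second">
    <b>
        Second paragraph.
    </b>
</p>
'''
parser = BeautifulSoup(html, 'html.parser')

first = parser.find('p', id='first')
second = parser.find('p', id='second')

# .text keeps the newlines and indentation from the HTML source.
print(repr(first.text))

# get_text(strip=True) removes the surrounding whitespace.
first_text = first.get_text(strip=True)
second_text = second.get_text(strip=True)
print(first_text)
print(second_text)
```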

In [0]:
# Get the page content and set up a new parser.
url = 'https://nbviewer.jupyter.org/github/ivanovitchm/\
datascience_one_2019_1/blob/master/Lesson%2319/html/simple_ids.html'

response = requests.get(url)

# extract the content
content = response.content

# Initialize the parser
parser = BeautifulSoup(content, 'html.parser')

# Pass in the ID attribute to only get the element with that specific ID.
first_paragraph = parser.find_all("p", id="first")[0]
print(first_paragraph.text)


                First paragraph.
            


**Exercise**

<img width="100" src="https://drive.google.com/uc?export=view&id=1E8tR7B9YYUXsU_rddJAyq0FrM0MSelxZ">


1. Get the text of the second paragraph (what's inside the second **p** tag), and assign the result to **second_paragraph_text**.


In [0]:
# put your code here
second_paragraph_text = parser.find_all("p", id="second")[0].text
print(second_paragraph_text)



                Second paragraph.
            



# 8 - Element Classes




In HTML, elements can also have classes. Classes aren't globally unique. In other words, many different elements can belong to the same class, usually because they share a common purpose or characteristic.

For example, you may want to create three dividers to display three of your photographs. You can create a common look and feel for these dividers, such as a border and caption style.

This is where classes come into play. You could create a class called "gallery," define a style for it once using CSS (which we'll talk about soon), and then apply that class to all of the dividers you'll use to display photos. One element can even have multiple classes.

```html
<html>
    <head>
        <title>A simple example page</title>
    </head>
    <body>
        <div>
            <p class="inner-text">
                First paragraph.
            </p>
            <p class="inner-text">
                Second paragraph.
            </p>
        </div>
        <p class="outer-text">
            <b>
                First outer paragraph.
            </b>
        </p>
        <p class="outer-text">
            <b>
                Second outer paragraph.
            </b>
        </p>
    </body>
</html>
```

Take a look at [this page](https://nbviewer.jupyter.org/github/ivanovitchm/datascience_one_2019_1/blob/master/Lesson%2319/html/simple_classes.html) to see how we've used classes to style paragraphs.



We can use **find\_all** to select elements by class. We'll just need to pass in the **class\_** parameter.

In [0]:
# Get the website that contains classes.
url ='https://nbviewer.jupyter.org/github/ivanovitchm/\
datascience_one_2019_1/blob/master/\
Lesson%2319/html/simple_classes.html'

response = requests.get(url)
content = response.content
parser = BeautifulSoup(content, 'html.parser')

# Get the first inner paragraph.
# Find all the paragraph tags with the class inner-text.
# Then, take the first element in that list.
first_inner_paragraph = parser.find_all("p", class_="inner-text")[0]
print(first_inner_paragraph.text)


                First paragraph.
            


**Exercise**

<img width="100" src="https://drive.google.com/uc?export=view&id=1E8tR7B9YYUXsU_rddJAyq0FrM0MSelxZ">


1. Get the text in the second inner paragraph, and assign the result to **second_inner_paragraph_text**.
2. Get the text of the first outer paragraph, and assign the result to **first_outer_paragraph_text**.

In [0]:
# put your code here
second_inner_paragraph_text = parser.find_all("p", class_="inner-text")[1].text
print(second_inner_paragraph_text)

first_outer_paragraph_text = parser.find_all("p", class_="outer-text")[0].text
print(first_outer_paragraph_text)



                Second paragraph.
            


                First outer paragraph.
            



# 9 - CSS Selectors




**Cascading Style Sheets**, or **CSS**, is a language for adding styles to HTML pages. You may have noticed that our simple HTML pages from the previous sections didn't have any styling; all of the paragraphs had black text and the same font size. Most Web pages use CSS to display a lot more than basic black text.

CSS uses **selectors** to add styles to the elements and classes of elements you specify. You can use selectors to add background colors, text colors, borders, padding, and many other style choices to the elements on HTML pages.

An in-depth look at CSS is outside the scope of this lesson. If you'd like to learn more about it on your own, [MDN's guide](https://developer.mozilla.org/en-US/docs/Web/Guide/CSS/Getting_started) is a great place to start.

What we do need to know is how CSS selectors work.

This CSS will make all of the text inside all paragraphs red:

```css
p {
    color: red;
}
```

You can see what this style looks like on our [GitHub site](https://nbviewer.jupyter.org/github/ivanovitchm/datascience_one_2019_1/blob/master/Lesson%2319/html/simple_red.html).

This CSS will change the text color to red for any paragraphs that have the class **inner-text**. We select classes with the period or dot symbol (**.**):

```css
p.inner-text {
    color: red;
}
```

You can see what the result looks like [here](https://nbviewer.jupyter.org/github/ivanovitchm/datascience_one_2019_1/blob/master/Lesson%2319/html/simple_inner_red.html).

This CSS will change the text color to red for any paragraphs that have the ID **first**. We select IDs with the pound or hash symbol (**#**):

```css
p#first {
    color: red;
}
```


Take a look at the results on our [GitHub site](https://nbviewer.jupyter.org/github/ivanovitchm/datascience_one_2019_1/blob/master/Lesson%2319/html/simple_ids_red.html).

You can also style IDs and classes without using any specific tags. For example, this CSS will make the element with the ID **first** red (not just paragraphs):

```css
#first {
    color: red;
}
```

This CSS will make any element with the class **inner-text** red:

```css
.inner-text {
    color: red;
}
```

In the examples above, we used CSS selectors to select one or more elements, then apply styles to only those elements. CSS selectors are very powerful and flexible. Perhaps not surprisingly, we also use CSS selectors to select elements when we do Web scraping.
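As a quick preview of how these selectors carry over to scraping, the sketch below counts how many elements each one matches, using BeautifulSoup's `select` method on a small inline document. The markup is made up for this illustration; the **span** is included to show the difference between `p.inner-text` and `.inner-text`.

```python
from bs4 import BeautifulSoup

# Made-up markup for illustration.
html = '''
<div>
    <p class="inner-text" id="first">First paragraph.</p>
    <p class="inner-text">Second paragraph.</p>
</div>
<p class="outer-text">First outer paragraph.</p>
<span class="inner-text">Not a paragraph.</span>
'''
parser = BeautifulSoup(html, 'html.parser')

# Count the matches for each selector discussed above.
selectors = ('p', 'p.inner-text', 'p#first', '#first', '.inner-text')
counts = {selector: len(parser.select(selector)) for selector in selectors}

for selector, count in counts.items():
    print(selector, count)
```

Here `.inner-text` matches one more element than `p.inner-text`, because it also picks up the **span**.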

# 10 - Using CSS Selectors





We can use BeautifulSoup's **.select** method to work with CSS selectors. Here's the HTML we'll be working with in this section:

```html
<html>
    <head>
        <title>A simple example page</title>
    </head>
    <body>
        <div>
            <p class="inner-text first-item" id="first">
                First paragraph.
            </p>
            <p class="inner-text">
                Second paragraph.
            </p>
        </div>
        <p class="outer-text first-item" id="second">
            <b>
                First outer paragraph.
            </b>
        </p>
        <p class="outer-text">
            <b>
                Second outer paragraph.
            </b>
        </p>
    </body>
</html>
```

You may have noticed that the same element can have both an ID and a class. We can also assign multiple classes to a single element; we just separate the classes with a space.

Take a look at the [Web page](https://nbviewer.jupyter.org/github/ivanovitchm/datascience_one_2019_1/blob/master/Lesson%2319/html/ids_and_classes.html) that corresponds to the HTML above.
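BeautifulSoup exposes these attributes directly on each tag. The sketch below parses a one-line copy of the first paragraph from that markup and shows that **class** comes back as a list (one entry per class), while **id** is a single string.

```python
from bs4 import BeautifulSoup

html = '<p class="inner-text first-item" id="first">First paragraph.</p>'
parser = BeautifulSoup(html, 'html.parser')
tag = parser.p

# class can hold several values, so BeautifulSoup returns a list.
print(tag['class'])  # ['inner-text', 'first-item']

# id holds a single value, so it is a plain string.
print(tag['id'])  # first

# Searching for either class finds the same element.
by_first_item = parser.find_all('p', class_='first-item')
by_inner_text = parser.find_all('p', class_='inner-text')
print(len(by_first_item), len(by_inner_text))  # 1 1
```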


In [0]:
# Get the website that contains classes and IDs.
url = 'https://nbviewer.jupyter.org/github/ivanovitchm/\
datascience_one_2019_1/blob/master/\
Lesson%2319/html/ids_and_classes.html'

response = requests.get(url)
content = response.content
parser = BeautifulSoup(content, 'html.parser')

# Select all of the elements that have the first-item class.
first_items = parser.select(".first-item")

# Print the text 
print(first_items[0].text)


                First paragraph.
            


**Exercise**

<img width="100" src="https://drive.google.com/uc?export=view&id=1E8tR7B9YYUXsU_rddJAyq0FrM0MSelxZ">



1. Select all of the elements that have the class **outer-text**.
    - Assign the text of the first paragraph that has the class **outer-text** to **first_outer_text**.
2. Select all of the elements that have the ID **second**.
    - Assign the text of the first paragraph that has the ID **second** to the variable **second_text**.



In [0]:
# put your code here
first_outer_text = parser.select(".outer-text")[0].text

# Print the text 
print(first_outer_text)



                First outer paragraph.
            



In [0]:
second_text = parser.select('#second')[0].text

print(second_text)



                First outer paragraph.
            



# 11 - Nesting CSS Selectors




We can nest CSS selectors similar to the way HTML nests tags. For example, we could use selectors to find all of the paragraphs inside the **body** tag. Nesting is a very powerful technique that enables us to use CSS to do complex Web scraping tasks.

This selector will target any paragraph inside a **div** tag:

```
div p
```

This selector will target any item inside a **div** tag that has the class **first-item**:

```
div .first-item
```

This one is even more specific. It selects any item that's inside a **div** tag inside a **body** tag, but only if it also has the ID **first**:

```
body div #first
```

This selector zeroes in on any items with the ID **first** that are inside any items with the class **first-item**:

```
.first-item #first
```

As you can see, we can nest CSS selectors in infinite ways. This allows us to extract data from websites with complex layouts. You can test selectors by using the **.select** method as you write them. Because it's easy to write a selector that doesn't work the way you expect, we highly recommend doing this.
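Here is a minimal sketch of that kind of check. It runs the four selectors above against an inline document that mixes IDs and classes (written out here so the example needs no download) and counts the matches.

```python
from bs4 import BeautifulSoup

html = '''
<body>
    <div>
        <p class="inner-text first-item" id="first">First paragraph.</p>
        <p class="inner-text">Second paragraph.</p>
    </div>
    <p class="outer-text first-item" id="second"><b>First outer paragraph.</b></p>
    <p class="outer-text"><b>Second outer paragraph.</b></p>
</body>
'''
parser = BeautifulSoup(html, 'html.parser')

selectors = ('div p', 'div .first-item', 'body div #first', '.first-item #first')
counts = {selector: len(parser.select(selector)) for selector in selectors}

for selector, count in counts.items():
    print(repr(selector), count)
```

The last selector matches nothing here: the element with the ID **first** has the class **first-item** itself, but it is not nested inside another element with that class. This is exactly the kind of surprise that testing a selector catches early.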

# 12 - Using Nested CSS Selectors




Now that we know about nested CSS selectors, let's try them out. We can use them with the same **.select** method we used for our CSS selectors.

We'll be practicing on this HTML:

```html
<!DOCTYPE html>
<html>
    <head lang="en">
        <meta charset="UTF-8">
        <title>2014 Superbowl Team Stats</title>
    </head>
    <body>
        <table class="stats_table nav_table" id="team_stats">
            <tbody>
                <tr id="teams">
                    <th></th>
                    <th>SEA</th>
                    <th>NWE</th>
                </tr>
                <tr id="first-downs">
                    <td>First downs</td>
                    <td>20</td>
                    <td>25</td>
                </tr>
                <tr id="total-yards">
                    <td>Total yards</td>
                    <td>396</td>
                    <td>377</td>
                </tr>
                <tr id="turnovers">
                    <td>Turnovers</td>
                    <td>1</td>
                    <td>2</td>
                </tr>
                <tr id="penalties">
                    <td>Penalties-yards</td>
                    <td>7-70</td>
                    <td>5-36</td>
                </tr>
                <tr id="total-plays">
                    <td>Total Plays</td>
                    <td>53</td>
                    <td>72</td>
                </tr>
                <tr id="time-of-possession">
                    <td>Time of Possession</td>
                    <td>26:14</td>
                    <td>33:46</td>
                </tr>
            </tbody>
        </table>
    </body>
</html>
```

It's an excerpt from a box score of [Super Bowl XLIX](https://en.wikipedia.org/wiki/Super_Bowl_XLIX), the championship game of the 2014 [National Football League (NFL)](https://en.wikipedia.org/wiki/National_Football_League) season, in which the New England Patriots played the Seattle Seahawks. The box score contains information on how many yards each team gained, how many turnovers each team had, and so on. Check out the [Web page](https://nbviewer.jupyter.org/github/ivanovitchm/datascience_one_2019_1/blob/master/Lesson%2319/html/2014_super_bowl.html) this HTML renders.

The page renders as a table with column and row names. After the column of statistic names, the first data column is for the Seattle Seahawks, and the second is for the New England Patriots. Each row represents a different statistic.


In [0]:
# Get the Superbowl box score data.
url = 'https://nbviewer.jupyter.org/github/ivanovitchm/\
datascience_one_2019_1/blob/master/\
Lesson%2319/html/2014_super_bowl.html'

response = requests.get(url)
content = response.content
parser = BeautifulSoup(content, 'html.parser')

# Find the number of turnovers the Seahawks committed.
turnovers = parser.select("#turnovers")[0]
seahawks_turnovers = turnovers.select("td")[1]
seahawks_turnovers_count = seahawks_turnovers.text
print(seahawks_turnovers_count)

1


**Exercise**

<img width="100" src="https://drive.google.com/uc?export=view&id=1E8tR7B9YYUXsU_rddJAyq0FrM0MSelxZ">




1. Find the Total Plays for the New England Patriots, and assign the result to **patriots_total_plays_count**.
2. Find the Total Yards for the Seahawks, and assign the result to **seahawks_total_yards_count**.


In [0]:
# put your code here
total_plays = parser.select("#total-plays")[0]
patriots_total_plays_count = total_plays.select("td")[2].text

patriots_total_plays_count

'72'

In [0]:
total_yards = parser.select("#total-yards")[0]
seahawks_total_yards_count = total_yards.select("td")[1].text

seahawks_total_yards_count

'396'
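Rather than looking up one cell at a time, nested selectors can also collect the whole table. The sketch below is one possible approach, not the only one: it works on a shortened inline copy of the table and builds a dictionary keyed by statistic name. All values stay as strings, just as they appear in the HTML.

```python
from bs4 import BeautifulSoup

# A shortened inline copy of the box score table.
html = '''
<table id="team_stats">
    <tr id="teams"><th></th><th>SEA</th><th>NWE</th></tr>
    <tr id="first-downs"><td>First downs</td><td>20</td><td>25</td></tr>
    <tr id="turnovers"><td>Turnovers</td><td>1</td><td>2</td></tr>
</table>
'''
parser = BeautifulSoup(html, 'html.parser')

# The header row holds the team names; skip its empty first cell.
teams = [th.text for th in parser.select('#teams th')[1:]]

stats = {}
for row in parser.select('#team_stats tr'):
    cells = [td.text for td in row.select('td')]
    if not cells:
        continue  # the header row has th cells only
    # The first cell is the statistic name; the rest line up with the teams.
    stats[cells[0]] = dict(zip(teams, cells[1:]))

print(stats)
```

A dictionary shaped like this can be handed to tools such as pandas for further analysis.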