The last chapter of the book covers the concept of web scraping. This is the programmatic process of obtaining information from a web page. To do this we need to get up to speed on a number of things:

- HTML
- Obtaining a webpage
- Getting information from the webpage

To do this we will create our own website using Python that we will scrape with our own code.

### An Introduction to HTML
- HTML stands for **Hyper Text Markup Language** and is the standard **markup language** for creating web pages. 
- It is essentially the language that makes up what you see on the internet.
- An HTML file tells a web browser how to display the text, images, and other content on a webpage. 
- The purpose of HTML is to describe **how the content is structured** and **not how it will be styled** and rendered within a web browser.
- To style the page you use a cascading style sheet (CSS); an HTML page can link to a CSS file to get information on colours, fonts, and other aspects of how the page is rendered.
- HTML is a markup language, so in creating HTML content, you embed the text to be displayed alongside tags that describe what each piece of text is.
- This is done using **HTML tags**, which can contain **name-value pairs** known as **attributes**.
- An opening tag, its content, and the closing tag together are known as an HTML element.
- Well-formed HTML should have an opening and a closing tag for each element, and a tag opened inside another tag should be closed before the outer tag is closed.
- Now that we have described what HTML is, we will give some examples of how you create elements within it and show how to put together a page. 
- Let’s start by looking at some tags. 
- It is important to remember that when we open a tag, we close it with a matching tag whose name is preceded by a / (forward slash).
- Let’s demonstrate with a header tag.

#### Header

<h>This is how we define a header</h>

<h1>

and to close it we have

</h1>

#### Paragraph

<p>This is how we define a paragraph</p>

#### Link

<a href="https://www.google.com">This is how we define a link</a>

#### Table

<table>
    <tr>
        <th>Name</th>
        <th>Year</th>
        <th>Month</th>
    </tr>
    <tr>
        <td>Avengers: Infinity War</td>
        <td>2018</td>
        <td>March</td>
    </tr>
    <tr>
        <td>Ant Man and the Wasp</td>
        <td>2018</td>
        <td>August</td>
    </tr>
</table>

#### Thead and Tbody

- There are two other tags that we can add to this table: the **thead** and **tbody** tags.
- Within a table these separate the head from the body of the table.
- They are used as follows:

<table>
<thead>
    <tr>
        <th>Name</th>
        <th>Year</th>
        <th>Month</th>
    </tr>
</thead>
<tbody>
    <tr>
        <td>Avengers: Infinity War</td>
        <td>2018</td>
        <td>March</td>
    </tr>
    <tr>
        <td>Ant Man and the Wasp</td>
        <td>2018</td>
        <td>August</td>
    </tr>
</tbody>
</table>

#### Div

- The last tag we will introduce is a div tag. 
- This is a tag that defines a **section in the html**.
- So, linking back to the previous table example we can put a div tag around it. 
- By putting html within a div, we can apply formatting to the whole section it covers.

<div>
<table>
<thead>
    <tr>
        <th>Name</th>
        <th>Year</th>
        <th>Month</th>
    </tr>
</thead>
<tbody>
    <tr>
        <td>Avengers: Infinity War</td>
        <td>2018</td>
        <td>March</td>
    </tr>
    <tr>
        <td>Ant Man and the Wasp</td>
        <td>2018</td>
        <td>August</td>
    </tr>
</tbody>
</table>
</div>

#### HTML Attributes

<p title="It will show when you hover over the text">This is how we define a paragraph</p>

#### Id and Classes

- Having introduced attributes, we will now look at two important ones that help us **locate elements within html**: the **id** and **class** **attributes**.
- An id value should be unique to a single html element, whereas a class can be shared by multiple elements.
- Let’s demonstrate this by looking at three headers with associated information.

<!-- A unique element -->
<h1 id='myHeader'>MCU Films</h1>

<!-- Multiple similar elements -->
<h2 class='film'>Avengers: Infinity War</h2>
<p>Can Earth's mightiest heroes protect us from the threat of Thanos?</p>

<h2 class='film'>Ant Man and the Wasp</h2>
<p>Following the events of Civil War will Scott Lang don the Ant Man suit again?</p>

<h2 class='film'>Captain Marvel</h2>
<p title="I don't know">Plot Unknown.</p>

- As shown above, we have one header with an id attribute.
- This unique attribute specifically identifies that header.
- The remaining headers all share the same class, film, which has been applied to each one.
- This is only intended as a brief introduction to html, so while it will help us in the remainder of the chapter, it is not a comprehensive exploration of all things html.
- If you are interested, there are lots of resources online which cover html in more depth.

### Web Scraping

- Having introduced html and how it works, we now move onto how we obtain the data using Python. 
- When we think of web scraping, we generally think of the process of getting and processing data from a website. 
- Actually this can be broken down into two distinct processes: **web crawling** and **web scraping**.

**Web crawling** is the process of **getting data from one or more urls**, which can be obtained by **traversing through a website’s html**. For example, say a website has a front page with lots of links to other pages and you want the information from all of them: you would traverse the links programmatically, visit each relevant page, and store what you find.

**Web scraping** is the process of **getting the information** from the page in question, so in the previous example you would have to scrape the front page to get the links that you want to traverse. When you scrape a page, you programmatically extract the information from it, and once you have that you can store or process the data.
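
To make the distinction concrete, here is a minimal sketch of the scraping step that feeds a crawl: pulling the links out of a front page. It uses only the standard library rather than the BeautifulSoup package covered later, and the `LinkCollector` class, the page content, and the `example.com` address are all made up for illustration.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collect the href of every <a> tag, resolved against a base url."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                # Turn a relative link into a full url
                self.links.append(urljoin(self.base_url, value))


# Hypothetical front page; example.com is a placeholder domain
front_page = """
<h1>Films</h1>
<a href="/films/infinity-war">Avengers: Infinity War</a>
<a href="/films/ant-man-and-the-wasp">Ant Man and the Wasp</a>
"""

collector = LinkCollector("http://example.com")
collector.feed(front_page)
links_to_visit = collector.links
print(links_to_visit)
```

A crawler would then visit each url in `links_to_visit` and scrape those pages in turn.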

- Given scraping plays an important part in the whole process, the combination of crawling and scraping will be referred to as web scraping. 
- There is no code of conduct that you need to sign up to in order to carry out web scraping.
- However, as soon as you try to get data from a website there are some things to note that are important.

**Check if you are allowed to get and use data from the respective website.** 

While you may think that any data on a website is fair game, that is not always the case. Check the website’s terms of use: while the owners may not be able to stop you obtaining the data via web scraping, they may object if they see the data used in any research, so be careful. The legal issues around getting data from the web are really important, and if you are in any doubt, please get legal advice. The examples used in this book involve creating our own web page locally and getting the data from it.

**Check if there is a fair usage policy.** 

Some websites are happy for you to scrape the data as long as you do it in an appropriate way. What do we mean by that? Well, every time you make a call to the website, you are sending traffic to that site. By doing this via a computer, you can send a lot of traffic to a site very quickly and this can be problematic for the site. If you are seen to do this, your IP address can be blocked, which would mean you could no longer access the site in question. So, you need to consider how often you plan to run code that hits a given website, what is appropriate and necessary for you, and whether the site allows it. For code that calls a single url, it is just about how often you run it. However, if you wrote code that crawled across lots of urls and brought back the data from them, you would need to ensure that it runs at an appropriate speed. To do this, consider adding time delays so that you are not sending excessive traffic to the site.
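
One way to add such a delay is sketched below. The `crawl_politely` function and the `fake_fetch` stand-in are hypothetical names used only for illustration; in real use the fetch step would be a call such as `requests.get(url).text`, and the right delay depends on the site in question.

```python
import time


def crawl_politely(urls, fetch, delay_seconds=1.0):
    """Fetch each url in turn, pausing between consecutive requests."""
    pages = {}
    for position, url in enumerate(urls):
        if position > 0:
            # Wait before every request after the first
            time.sleep(delay_seconds)
        pages[url] = fetch(url)
    return pages


def fake_fetch(url):
    # Stand-in for a real request, so the sketch runs without a website
    return f"<p>content of {url}</p>"


pages = crawl_politely(
    ["http://127.0.0.1:5000/a", "http://127.0.0.1:5000/b"],
    fetch=fake_fetch,
    delay_seconds=0.1,
)
print(pages)
```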

**Robots.txt** 

Linked to the above points, if you go to the **website’s url followed by /robots.txt**, you will get information on what web crawlers can and can’t do. If the file is present, you are expected to read and understand what can and can’t be scraped. If there is no robots.txt, then there is no specific instruction on how the site can be crawled. However, don’t assume you can scrape everything on the site.
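
Python’s standard library can read these rules for you. The sketch below parses a made-up robots.txt held in a string, so it runs without a network connection; the rules, the `my-scraper` user agent name, and the `example.com` urls are assumptions for illustration only.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; a real one lives at <site url>/robots.txt
robots_txt = """
User-agent: *
Disallow: /private/
Crawl-delay: 5
"""

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())

# Ask whether a given user agent may fetch a given url
allowed_public = parser.can_fetch("my-scraper", "http://example.com/films")
allowed_private = parser.can_fetch("my-scraper", "http://example.com/private/data")

# The requested pause between requests, if the file states one
delay = parser.crawl_delay("my-scraper")
print(allowed_public, allowed_private, delay)
```

For a live site you would instead call `set_url` with the robots.txt address and then `read` before asking `can_fetch`.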

- Ultimately, you need to take care when scraping a website and if they have an application programming interface (API) available, then you should be using that. 
- If you are not sure, please get the appropriate advice. 
- Before you can start crawling the site and scraping the data you need to understand the page. 
- Python cannot just get you the data you want; instead you need to tell it how to find the data you are after.
- So you need to understand how the html works and where to look.

To inspect the page you have a couple of options:

**Through a Web Browser**

- You can use the web browser’s tools to inspect which HTML corresponds to which elements of the page.
- Ultimately, this involves selecting an element of the page and inspecting it, at which point the browser shows the HTML that corresponds to that element.
- Different browsers have different ways of inspecting the pages they show, so refer to the documentation for the specific browser that you are using.

**Saving the Page and Physically Searching**

- You can physically save the page and then search for the name or value of certain text, and in the same way determine which html corresponds to that element.
- Ultimately, what you are trying to do is learn which html corresponds to which values on the website.
- This isn’t an exact science due to the fact html can be written in different ways. 
- The key is understanding the definitions of html and how they fit together and then use this to understand what you need to access in the html. 
- With any piece of code you need to plan ahead and with parsing html you need to develop a plan for how you want to get the data from the html.

- So, going back to what we mentioned at the start, we described web scraping and web crawling. 
- Web crawling is the process of accessing the data from a url or multiple urls. 
- To do this we need a mechanism. Luckily, making a web request can be done in the same way we accessed an API endpoint in Chapter 18, so we will use the requests library.
- This will be demonstrated later in the chapter where we set up our own webpage.
- Having a mechanism to get the data is great but we also need to process what we get back so we need a Python library that can do this. 
- Python has many options and this book isn’t intended to be a review of the best packages for processing html as the landscape is constantly changing. 
- Instead we will cover one specific parser namely BeautifulSoup.

### BeautifulSoup

- BeautifulSoup is not only an **html parser**; it can also **parse xml by using the lxml library**, which we covered earlier in the book.
- BeautifulSoup works by taking advantage of other parsers.
- So, to run BeautifulSoup you would use the following:

In [None]:
from bs4 import BeautifulSoup

In [None]:
# BeautifulSoup(content_to_be_parsed, "parser_name")

Here, **content_to_be_parsed** is the content from the site, which could have been **obtained using requests**, and "parser_name" is the name of the parser to use. Four such parsers are:

- **html.parser**: This is the html parser in Python’s standard library. It has decent speed and is lenient in how it accepts html as of Python 3.2.2.
- **lxml**: This is built upon the lxml Python library, which is itself built upon C libraries. It’s very fast and very lenient.
- **lxml-xml or xml**: Again built upon lxml, so similar to above. However, this is the only xml parser you can use with BeautifulSoup. So, while we introduced how to parse XML with lxml, you could also do the same in BeautifulSoup.
- **html5lib**: This is built upon the Python html5lib library and is very slow. However, it’s very lenient and it parses the page in the same way a web browser does to create valid html5.

Now, for the rest of this section, we will concentrate on using the **html.parser**. So to create soup we can do as follows:

In [None]:
import requests

In [None]:
from bs4 import BeautifulSoup

In [None]:
# Run Film.py

url = "http://127.0.0.1:5000/films"

In [None]:
r = requests.get(url)

In [None]:
response_text = r.text

In [None]:
response_text

In [None]:
soup = BeautifulSoup(response_text, "html.parser")

In [None]:
soup

- Now, this has transformed the html into a format where we can access elements within it.
- What we will show now are the methods available to us.

In [None]:
# For example:

text = "<b class='boldest'>This is bold</b>"
soup = BeautifulSoup(text, "html.parser")
soup

Now we can access this tag as follows:

In [None]:
tag = soup.b
tag

Now, if we had **multiple b tags** using soup.b would only **return the first one**.

In [None]:
text = """<b class='boldest'>This is bold</b><b class='boldest'>This is also bold</b><h4 class='film'>This is header</h4>"""
soup = BeautifulSoup(text, "html.parser")
soup

In [None]:
tag = soup.b
tag

In [None]:
tag_h4 = soup.h4
tag_h4

So, we don’t get all the b tags back, only the first one. The tag itself has a name which can be accessed as follows:

In [None]:
tag.name

In [None]:
tag_h4.name

The tag also has a dictionary of attributes which are accessed as follows:

In [None]:
tag.attrs

In [None]:
tag['class']

In [None]:
tag_h4.attrs

In [None]:
tag_h4['class']
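
A few details of attribute access are worth seeing side by side. The sketch below reuses the same kind of markup as above; the variable names are only illustrative.

```python
from bs4 import BeautifulSoup

text = "<b class='boldest'>This is bold</b><h4 class='film'>This is header</h4>"
soup = BeautifulSoup(text, "html.parser")
tag = soup.b

# class can hold several values, so BeautifulSoup returns it as a list
classes = tag["class"]

# .get returns None for a missing attribute instead of raising KeyError
missing_id = tag.get("id")

# find_all can filter on class; class_ avoids clashing with the Python keyword
film_headers = soup.find_all("h4", class_="film")
print(classes, missing_id, [h.text for h in film_headers])
```

Using `.get` is a safer choice than square brackets when you are not sure every tag carries the attribute you are after.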

Now, let’s have a look at a more complicated example. If we look at something like a table we can parse it as follows:

In [None]:
text = """
<table>
    <tr>
        <th>Name</th>
        <th>Year</th>
        <th>Month</th>
    </tr>
    <tr>
        <td>Avengers: Infinity War</td>
        <td>2018</td>
        <td>March</td>
    </tr>
    <tr>
        <td>Ant Man and the Wasp</td>
        <td>2018</td>
        <td>August</td>
    </tr>
</table>
"""

In [None]:
text

In [None]:
soup = BeautifulSoup(text, "html.parser")

In [None]:
soup

Now, we can access elements of the table by just traversing down the tree structure of the html.

In [None]:
soup.table

In [None]:
soup.table.tr

In [None]:
soup.table.tr.th

- Now, in each example we get the **first instance of the tag** that we are looking for. 
- Assume we want to **find all tr tags** within the table, we can do so as follows using the **find_all** method.

In [None]:
soup.find_all('tr')

- So, here we get back a list of all the tr tags. 
- Similarly, we can get back the list of td tags using the same method.

In [None]:
td_tags = soup.find_all('td')

In [None]:
td_tags

So, with regard to getting data out, if we look back at our original table, we can use the find_all method to get all the tr tags and then loop over these to get the td tags.

In [None]:
table_rows = soup.find_all('tr')

In [None]:
table_rows

In [None]:
header = []
content = []

In [None]:
for tr in table_rows:
    header_tags = tr.find_all('th')
    if len(header_tags) > 0:
        for ht in header_tags:
            header.append(ht.text)
    else:
        row = []
        row_tags = tr.find_all('td')
        for rt in row_tags:
            row.append(rt.text)
        content.append(row)

In [None]:
header

In [None]:
content

- What we are doing here is looping over the high-level tr tags to get every row and then looking for th tags within each one. If we find them, we know it’s the table header; if not, we get the td tags. Each set of values goes into the appropriate list, namely header or content.
- The important thing to note here is that we know the structure of the data, as we will have inspected the html, so we build the parsing solution knowing what we will get.
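
Once the header and rows are separated, it can be convenient to pair each value with its column name. The following is a sketch of one way to do that with list comprehensions and `zip`, using a compact copy of the table from above; the `films` name is just illustrative.

```python
from bs4 import BeautifulSoup

text = """
<table>
    <tr><th>Name</th><th>Year</th><th>Month</th></tr>
    <tr><td>Avengers: Infinity War</td><td>2018</td><td>March</td></tr>
    <tr><td>Ant Man and the Wasp</td><td>2018</td><td>August</td></tr>
</table>
"""
soup = BeautifulSoup(text, "html.parser")

header = [th.text for th in soup.find_all("th")]
rows = [
    [td.text for td in tr.find_all("td")]
    for tr in soup.find_all("tr")
    if tr.find_all("td")  # skip the header row, which has no td tags
]

# Pair each value with its column name
films = [dict(zip(header, row)) for row in rows]
print(films)
```

Note that every value comes back as a string, including the year.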

- So, we have introduced the find_all method in a single table. 
- But if we had two tables and the table we wanted had a specific id we could use the find method as follows:

In [None]:
text = """
<table id="unique_table">
<tr>
<th>Name</th>
<th>Year</th>
<th>Month</th>
</tr>
<tr>
<td>Avengers: Infinity War</td>
<td>2018</td>
<td>March</td>
</tr>
<tr>
<td>Ant Man and the Wasp</td>
<td>2018</td>
<td>August</td>
</tr>
</table>
<table id="second_table">
<tr>
<th>Name</th>
<th>Year</th>
<th>Month</th>
</tr>
<tr>
<td>Avengers: End Game</td>
<td>2019</td>
<td>April</td>
</tr>
<tr>
<td>Spider-man: Far from home</td>
<td>2019</td>
<td>June</td>
</tr>
</table>
<table id="other_table">
<tr>
<th>Name</th>"""

In [None]:
text

In [None]:
soup = BeautifulSoup(text, "html.parser")

In [None]:
soup

And we can get the data from this table in a similar way as before but again we can take advantage of the find method to find the text in the specific element.

In [None]:
table_2 = soup.find('table', id='second_table')

In [None]:
table_2

In [None]:
table_rows = table_2.find_all('tr')

In [None]:
table_rows

In [None]:
header = []
content = []

In [None]:
for tr in table_rows:
    header_tags = tr.find_all('th')
    if len(header_tags) > 0:
        for ht in header_tags:
            header.append(ht.text)
    else:
        row = []
        row_tags = tr.find_all('td')
        for rt in row_tags:
            row.append(rt.text)
        content.append(row)

In [None]:
header

In [None]:
content
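
Since the same loop has now been written twice, it could be wrapped in a function. The sketch below is one possible shape for it; `parse_table` is a hypothetical helper, not part of BeautifulSoup, and the html is a shortened version of the tables above.

```python
from bs4 import BeautifulSoup


def parse_table(soup, table_id):
    """Return (header, content) for the table with the given id."""
    table = soup.find("table", id=table_id)
    if table is None:
        raise ValueError(f"no table with id {table_id!r}")
    header, content = [], []
    for tr in table.find_all("tr"):
        header_tags = tr.find_all("th")
        if header_tags:
            header.extend(th.text for th in header_tags)
        else:
            content.append([td.text for td in tr.find_all("td")])
    return header, content


html = """
<table id="unique_table">
<tr><th>Name</th><th>Year</th></tr>
<tr><td>Avengers: Infinity War</td><td>2018</td></tr>
</table>
<table id="second_table">
<tr><th>Name</th><th>Year</th></tr>
<tr><td>Avengers: End Game</td><td>2019</td></tr>
</table>
"""
soup = BeautifulSoup(html, "html.parser")
header, content = parse_table(soup, "second_table")
print(header, content)
```

Raising an error when the id is missing makes it obvious when a page’s structure has changed.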

- This shows that, when we have multiple tables, we can obtain the information from a specific one. This relied on the table having an id attribute, which made the process much easier.
- So, we now know how to process html, and the next stage is to grab data from a website and then parse that data using Python.
- To do this we will build our own website locally, then grab data from it and parse the results.
- Given we have covered the libraries to get and process the data, how do we go about creating a website?
- As in Chapter 18, we will use the package Flask to create a simple website that we can run locally and then scrape the data from.
- Let’s just get started and write a hello world example to show how it will work. 
- Here, we will create a file called **my_flask_website.py** and put the following code in it:
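
A minimal sketch of what such a file might contain is given below. It follows the description in this section (a Flask app with a single `hello_world` function), but it is an assumed reconstruction rather than the original listing.

```python
from flask import Flask

app = Flask(__name__)


@app.route("/")
def hello_world():
    # The returned string becomes the body of the response
    return "Hello World"
```

To start the server when the file is run with Python, a file like this would typically end with a call to `app.run()` guarded by `if __name__ == "__main__":`. It is left out of the sketch so that the block can be imported without starting a server.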

- Now, if you think back to Chapter 18, what we have here is a reduced version of what we used to create our API.
- We import Flask from the flask package and then create ourselves an app. 
- Unlike with the API where we created a class, we simply define a **hello_world function** which returns the string Hello World.

Again we use the syntax below to run our application.

- As with the API we built, if we open a terminal or command prompt and move to the location of the file and run the code using **python my_flask_website.py**, then we will get a webpage running on http://127.0.0.1:5000/.

- Now, one part of the code that we didn’t cover was the use of @app.route. This is an example of a decorator, and its purpose here is to bind a location to a function.
- So, when we apply the following:

- What we are doing is mapping any call of http://127.0.0.1:5000/ to the **function hello_world**, so when that url is called the hello_world function is executed and the results displayed. 
- This is a specific use of a decorator; in general, decorators are functions that take a function as an argument.
- The best way to explain is by demonstration, so we could decorate the hello_world function with a decorator that makes a string all lowercase.

- What this function does is take in another function as an argument and return the wrapper function.
- The wrapper function then runs the function that was passed in to the decorator, takes its output, makes it lowercase, and returns that value.
-  Let’s demonstrate with an example.
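
The sketch below shows a decorator of the kind described, in plain Python so it runs without Flask. The name `make_lowercase` is chosen here for illustration, and `functools.wraps` is an addition that keeps the decorated function’s original name.

```python
from functools import wraps


def make_lowercase(function):
    """Decorator: run the wrapped function and lowercase its string result."""

    @wraps(function)  # keep the original function's name and docstring
    def wrapper(*args, **kwargs):
        result = function(*args, **kwargs)
        return result.lower()

    return wrapper


@make_lowercase
def hello_world():
    return "Hello World"


print(hello_world())
```

In the Flask application, `@app.route("/")` would sit above `@make_lowercase`, so that the route is bound to the wrapped function.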

- With our website up and running, we can see that Hello World is displayed in lowercase.
- So, let’s programmatically get the data from it.
- To do this, we can use requests in the same way we did in the API chapter to make a GET request.

In [None]:
import requests

In [None]:
r = requests.get('http://127.0.0.1:5000')

In [None]:
r.text

- Notice that this time we looked at the text attribute as opposed to the json method and that is because the content of our website is not json. 
- This is all well and good, but it’s not much of a challenge to process the data, mainly because it’s not in html format.
- We can change that pretty easily by modifying the code in our flask application, so let’s change the hello world output as follows:

- It looks pretty similar to what we saw before, so what has changed? 
- If we run the code to get the data from the webpage, we get the following:

In [None]:
import requests

In [None]:
r = requests.get('http://127.0.0.1:5000')

In [None]:
r.text

- You can now see that instead of just the text representation, we have some html around that with the h tags.
- Let’s modify the code once more and change the h tags to h1 tags.

Again, running the same requests code on this website brings back the h1 tags.

In [None]:
import requests

In [None]:
r = requests.get('http://127.0.0.1:5000')

In [None]:
r.text

- So, now we have our website running, let’s add something a little harder to parse and create a table that we can look to programmatically obtain.
- To do this, we will add a new route to the flask application and add an html table.
- We will make use of some existing data from the seaborn package, namely the tips data.

In [None]:
import seaborn as sns

In [None]:
tips = sns.load_dataset('tips')

In [None]:
tips.head().to_html()

- Now, we have loaded the tips dataset and taken its head, and we can make use of the to_html method from pandas, which takes the DataFrame and gives us back html that we could put on our website.
- If we look back to our previous table example, we might want to add an id to the table to allow us to access the table and we can do that using to_html by passing in the table_id argument and setting it to the name that we want our table to have. 
- So, let’s apply it by setting the name to be tips.

In [None]:
tips.head().to_html(table_id='tips')

- So, we can now see we have added the id attribute with the name tips. 
- Our next step is to add this to our website and we can alter the code as follows to do so:

- Now, the difference here is that we have added the imports for seaborn and then imported the tips dataset. 
- To display this, we then create another function called **table_view** and in it return 20 rows of the DataFrame and convert it to html with the **id of tips**. 
- A **decorator** then defines the route of this to be **/table**, which means when we go to http://127.0.0.1:5000/table, we will see the result of this function.
- Let’s do that and go to the url.
- Now, we can see the table but it doesn’t look great. We can customise this using some of the options that come with pandas.
- First, we will remove the index from the table as you normally wouldn’t see this on a website. 
- Next, we will centre the table headings and we will also make the borders more prominent. 
- So, our flask application is now modified to this.
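
The pandas options described above can be sketched as follows. A small hand-built DataFrame stands in for the seaborn tips data so the block runs without a download, and the border width of 2 is an assumed value.

```python
import pandas as pd

# Small stand-in for the seaborn tips data, so the sketch runs offline
tips = pd.DataFrame(
    {
        "total_bill": [16.99, 10.34, 21.01],
        "tip": [1.01, 1.66, 3.50],
        "day": ["Sun", "Sun", "Sun"],
    }
)

html = tips.to_html(
    table_id="tips",   # id attribute on the <table> tag
    index=False,       # leave out the DataFrame index column
    justify="center",  # centre the column headings
    border=2,          # value of the border attribute on the <table> tag
)
print(html)
```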

- If we use some of what we covered earlier, we can add a title and some information about the website in a paragraph. 
- To do that, we can use the h1 and p tags to create a header and paragraph, respectively, and to show that everything belongs together let’s put this all within a div tag, so it resembles what you might find on a production web page. 
- The flask application now looks like the following:
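
A sketch of an application along these lines is shown below. It is an assumed reconstruction, not the original listing: the heading and paragraph text are invented, and a small hand-built DataFrame replaces `sns.load_dataset('tips')` so the block runs without a download.

```python
import pandas as pd
from flask import Flask

app = Flask(__name__)

# Stand-in for sns.load_dataset('tips'), so the sketch needs no download
tips = pd.DataFrame(
    {"total_bill": [16.99, 10.34], "tip": [1.01, 1.66], "day": ["Sun", "Sun"]}
)


@app.route("/table")
def table_view():
    table_html = tips.head(20).to_html(
        table_id="tips", index=False, justify="center", border=2
    )
    # Wrap the heading, paragraph, and table in a single div
    return (
        "<div>"
        "<h1>Tips data</h1>"
        "<p>A sample of restaurant bills and tips.</p>"
        f"{table_html}"
        "</div>"
    )
```

As with the hello world version, a guarded call to `app.run()` at the end of the file would start the server.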

Ok, so now we have a website. We want to scrape it, so let’s use requests to get the html that we will look to obtain.

In [None]:
import requests
from bs4 import BeautifulSoup

In [None]:
r = requests.get('http://127.0.0.1:5000/table')

In [None]:
r.text

- So, we can see that it was relatively straightforward to get the data, but unlike with our static table example before, the data from the webpage is more than just table data.
- The next step is to pass this into BeautifulSoup to parse the html.

In [None]:
soup = BeautifulSoup(r.text, 'html.parser')

In [None]:
soup

In [None]:
dir(soup)

In [None]:
soup.find('table', id='tips')

- By using the table id, we can go directly to the table within the html and we then have access to all the rows within it just like before. 
- Note that we have only shown a subset of this data as we have 20 rows. 
- Now, if we want to parse the data from the html we can use something like we used on the dummy data.

In [None]:
table = soup.find('table', id='tips')

In [None]:
table_rows = table.find_all('tr')

In [None]:
table_rows

In [None]:
table_rows[0:3]

In [None]:
header = []
content = []

In [None]:
for tr in table_rows:
    header_tags = tr.find_all('th')
    if len(header_tags) > 0:
        for ht in header_tags:
            header.append(ht.text)
    else:
        row = []
        row_tags = tr.find_all('td')
        for rt in row_tags:
            row.append(rt.text)
        content.append(row)

In [None]:
header

In [None]:
content

As we can see, we have now pulled the data from the html and got it into two separate lists, but to go a step further we can put it back into a DataFrame pretty simply by using what we have covered earlier in the book.

In [None]:
import pandas as pd

In [None]:
data = pd.DataFrame(content)

In [None]:
data

In [None]:
data.columns = header

In [None]:
data
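
One thing to be aware of is that every scraped value is text, so numeric columns come back as strings. The sketch below shows one way to convert them, using a couple of made-up rows in the same shape as the header and content lists; which columns are numeric is something you would know from inspecting the page.

```python
import pandas as pd

# Scraped values arrive as text, as in the content list built above
header = ["total_bill", "tip", "day"]
content = [["16.99", "1.01", "Sun"], ["10.34", "1.66", "Sun"]]

data = pd.DataFrame(content, columns=header)

# Convert the columns that hold numbers; text columns are left alone
numeric_columns = ["total_bill", "tip"]
data[numeric_columns] = data[numeric_columns].apply(pd.to_numeric)
print(data.dtypes)
```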

- Now, we have gone full circle and used a DataFrame to populate a table within our website and then scraped that data and converted it back into a DataFrame.
- This chapter has covered a lot of content from introducing html to parsing it out to building our own website and scraping from there. 
- The examples have been focussed on table data but they can be applied to any data we find within html. 
- When it comes to web scraping, Python is a powerful and popular choice for interacting with the web and obtaining data from it.