Welcome to week 7 of the [Noisebridge Python class](https://github.com/audiodude/PythonClass)!

In this lesson, we will discuss web scraping. While you may not have an immediate use for the techniques outlined here, they are worth learning for a better understanding of the structure of the web and the possibilities for extracting data from websites.

We will be using [requests](https://requests.readthedocs.io/en/latest/) and [Beautiful Soup](https://www.crummy.com/software/BeautifulSoup/). There are lots of other solutions for web scraping out there, including those written in other languages and hosted in cloud services.

In this lesson, you will learn:

* How to get the raw source code of websites using their URL.
* How to parse that source code into a data structure.
* How to identify and refer to specific locations inside of that data structure that contain useful information.
* How to save the data in a Pandas DataFrame for easy querying, manipulation, and export to CSV.

The Mozilla Developer Network has [an article](https://developer.mozilla.org/en-US/docs/Learn/Getting_started_with_the_web/How_the_Web_works) about how the web works. In short, when you type a URL into your browser and hit 'enter', the following happens:

1. The browser uses DNS to look up the IP address (a number) that corresponds to the domain in the URL.
1. The browser sends an HTTP GET request to the server (there are other HTTP "verbs" besides GET).
1. The server locates the [HTML](https://en.wikipedia.org/wiki/HTML) file you requested (in the case of an `example.com/foo/bar.html` type URL, where `bar.html` is the file) or communicates with an application to generate an HTML response, and sends it back to the browser over the [TCP/IP](https://www.cloudflare.com/learning/ddos/glossary/tcp-ip/) connection.
1. The browser receives the response, parses it, and renders a webpage based on the received HTML along with [CSS](https://developer.mozilla.org/en-US/docs/Web/CSS) and [JavaScript](https://developer.mozilla.org/en-US/docs/Web/JavaScript).
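
To make the first two steps concrete, here is a minimal sketch that splits a URL into the host name (the part DNS resolves) and the path (the part the server is asked for), then builds the text of a bare HTTP GET request. It only constructs the request text and does not send anything; the `Connection: close` header is an illustrative choice, and real browsers send many more headers.

```python
from urllib.parse import urlsplit

url = 'https://example.com/foo/bar.html'
parts = urlsplit(url)

# Step 1 needs only the host name, which DNS resolves to an IP address.
host = parts.hostname
# Step 2 asks the server for this path.
path = parts.path or '/'

# A minimal HTTP/1.1 GET request as plain text. Each header line ends in
# CRLF, and an empty line marks the end of the headers.
request_text = (
    f'GET {path} HTTP/1.1\r\n'
    f'Host: {host}\r\n'
    'Connection: close\r\n'
    '\r\n'
)

print(host)
print(path)
print(request_text)
```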

Let's look at a simple website from NPR, https://text.npr.org/.

![Screenshot of NPR basic "text" news site](https://pixelfed.social/storage/m/_v2/588554065884192073/bf0f52ff2-92677b/e5wi2rcyLEpI/RftouyiJETFbf1rKHXWTDMArbpHfMVOtZTjfMLfw.png)

We can use the program `curl` on macOS or Linux to retrieve the raw HTML contents of a web page. For example, when we run:

`$ curl https://text.npr.org/`

We get:

![Screenshot of the result of running curl on text.npr.org](https://pixelfed.social/storage/m/_v2/588554065884192073/bf0f52ff2-92677b/QcxYbe5gcGJE/pFJs6XokUP7rgLRbRz76ak3MchCMmVZqZoIILKTA.png)

and if we scroll further down we see:

![Screenshot of the result of running curl on text.npr.org](https://pixelfed.social/storage/m/_v2/588554065884192073/bf0f52ff2-92677b/2YqKQi3pOe4z/R2N4ZlWO8ht0hmTS1mwiBXrx1a2GZPt0Xl9zFJKp.png)

We see that the first news item in the web page, "How climate change could cause a home insurance meltdown", corresponds to HTML code in the curl response:

```
        <li><a class="topic-title" href="/1186540332">How climate change could cause a home insurance meltdown</a></li>
```

This is the "raw" code that the browser reads to create the web page seen in the first screenshot. The `<li>` tag means that the article is part of a list, and the `<a>` tag means that the article name should be rendered as a link, which leads to the article.

The important thing to understand is that `curl` is retrieving the source code of the webpage in the exact same way your web browser is, except that the browser takes the extra step of turning the textual HTML data into a graphical web page.

We can also get the source code of the NPR text news website in Python using the `requests` library, which is for HTTP requests. The result is the same code that we saw in the `curl` example.

In [None]:
import requests

base_url = 'https://text.npr.org/'

resp = requests.get(base_url)
resp.raise_for_status()

code = resp.text

print(code[:200])

---

Let's say we wanted to get a list of all news story headlines from this page. At this point, we could attempt to extract that data from the textual representation of the HTML (the raw code) using regex or simple string find operations.

In [None]:
li_idx = code.find('<li>')
li_end_idx = code.find('</li>', li_idx)
print(code[li_idx:li_end_idx+5])

next_li_idx = code.find('<li>', li_end_idx)
next_li_end_idx = code.find('</li>', next_li_idx)
print(code[next_li_idx:next_li_end_idx+5])
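
The regex variant of the same idea might look like the sketch below. It runs against a small hard-coded sample rather than the live page; the second list item is made up for illustration, and the pattern assumes the attributes always appear in exactly this order.

```python
import re

# Hypothetical sample in the same shape as the NPR list items shown earlier.
sample = '''
<li><a class="topic-title" href="/1186540332">How climate change could cause a home insurance meltdown</a></li>
<li><a class="topic-title" href="/1000000001">A second, made-up headline</a></li>
'''

# Capture the href and the link text of each topic-title link.
pattern = re.compile(r'<a class="topic-title" href="([^"]+)">([^<]+)</a>')

matches = pattern.findall(sample)
for href, title in matches:
    print(href, title)
```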

However, this approach is tricky and error-prone. This example is particularly simple because each of the `<li>` elements contains exactly one `<a>` element. In other cases, there could be nested structures that only occur in some of the desired target elements. For this reason, we will **parse** the HTML into a **document tree** and use the tree data structure to access the data we're interested in.

In [None]:
from bs4 import BeautifulSoup

soup = BeautifulSoup(code, 'lxml')

soup.find_all('a')

This is much easier, and seems very promising. We seem to have gotten all of the article links! But there's a problem: we're also getting links to "Terms of Use" and "Go to Full Site".

Now we enter the part of web scraping that is more of an art than a science: finding unique identifiers for the data we care about and navigating the document tree.

*Can you see a pattern with the links we care about (the news articles) that we could potentially use to extract just those?*

In [None]:
# We use class_ with an underscore because `class` is a keyword in Python.
articles = soup.find_all('a', class_='topic-title')
print(articles)
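
Beautiful Soup also accepts CSS selectors through its `select` method, which some people find easier to read. The sketch below uses a small hard-coded sample (the "Terms of Use" link and its `href` are made up) and the standard library's `html.parser` so that it runs on its own; the result should match what `find_all('a', class_='topic-title')` returns for the same input.

```python
from bs4 import BeautifulSoup

sample_html = '''
<ul>
  <li><a class="topic-title" href="/1186540332">How climate change could cause a home insurance meltdown</a></li>
  <li><a href="/terms">Terms of Use</a></li>
</ul>
'''

# 'html.parser' ships with Python; the lesson's 'lxml' parser would also work.
sample_soup = BeautifulSoup(sample_html, 'html.parser')

# 'a.topic-title' means: <a> elements whose class list includes topic-title.
selected = sample_soup.select('a.topic-title')
print([a.text for a in selected])
```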

At this point we need to extract the actual article titles and dispense with the markup. We can use the `.text` attribute of the returned `Tag` objects. This attribute actually abstracts some of the details of dealing with what are called "Text Nodes" in HTML documents, where a single logical chunk of text might be split across multiple nodes or contain embedded HTML tags. For example, this HTML:

```
<a>Hello, I <em>love</em> learning Python!</a>
```

contains multiple text nodes under the `<a>` tag. However, using `.text` we can get all of the text under the tag, as we would expect.

In [None]:
s2 = BeautifulSoup('<a>Hello, I <em>love</em> learning Python!</a>', 'lxml')
s2.a.text
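
To see the individual text nodes that `.text` joins together, we can iterate over the tag's `.strings`. This is a small sketch using the same HTML as above, with the standard library's `html.parser` so that it runs without `lxml`.

```python
from bs4 import BeautifulSoup

s3 = BeautifulSoup('<a>Hello, I <em>love</em> learning Python!</a>', 'html.parser')

# .strings yields each text node under the tag, in document order.
pieces = list(s3.a.strings)
print(pieces)

# Joining the pieces gives the same result as .text.
print(''.join(pieces) == s3.a.text)
```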

In [None]:
[a.text for a in articles]

And that's it, that's web scraping. Lesson over, go home! (kidding not kidding)
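
The lesson outline also mentions saving the data in a Pandas DataFrame. A minimal sketch of that step follows, assuming `pandas` is installed. The two titles are hard-coded stand-ins for `[a.text for a in articles]` (the second one is made up), and the column name `title` is an arbitrary choice.

```python
import pandas as pd

# Hypothetical sample titles standing in for [a.text for a in articles].
titles = [
    'How climate change could cause a home insurance meltdown',
    'A second, made-up headline',
]

df = pd.DataFrame({'title': titles})

# Example query: titles that mention "climate", case-insensitively.
climate = df[df['title'].str.contains('climate', case=False)]
print(climate)

# With no filename, to_csv returns the CSV as a string.
# index=False leaves the row numbers out.
csv_text = df.to_csv(index=False)
print(csv_text)
```

Passing a filename instead, for example `df.to_csv('npr_titles.csv', index=False)`, writes the same content to a file (the filename here is just an example).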

Imagine that Twitter bots were still a thing, or better yet, that you wanted to post to Mastodon (as we did in lesson 1) every time a new article was published to the NPR text website. You could set up a scraper that runs every 5 minutes and, if it finds an article title it hasn't seen before, creates a post with a link to the article. But how do we get the link to the article? It's in the `href` **attribute** of the `<a>` tag:

In [None]:
articles[0].get('href')

This is a **relative URL**. It's not a full website address. It leads to a document that is at a path relative to the `base_url` of the website. Let's look again at `base_url`:

In [None]:
base_url

By using a function called `urljoin` in the `urllib.parse` module, we can reconstruct the full URL which we will need to make an additional request using `requests`.

In [None]:
import urllib.parse

first_url = urllib.parse.urljoin(base_url, articles[0].get('href'))
first_url

We could then do:

```
post_to_mastodon('NPR: %s (%s)' % (articles[0].text, first_url))
```
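
The "hasn't seen before" part of that idea could be sketched as below. The helper name `find_new` and the sample data are hypothetical, and this sketch keys on the article URL rather than the title, on the assumption that a URL is less likely to change than a headline.

```python
def find_new(current, seen):
    """Return the (title, url) pairs whose url is not in seen, and record them."""
    new_items = []
    for title, url in current:
        if url not in seen:
            seen.add(url)
            new_items.append((title, url))
    return new_items

seen_urls = set()

# Hypothetical results from two consecutive scraper runs.
first_run = [('Headline A', 'https://text.npr.org/1')]
second_run = [
    ('Headline B', 'https://text.npr.org/2'),
    ('Headline A', 'https://text.npr.org/1'),
]

print(find_new(first_run, seen_urls))
print(find_new(second_run, seen_urls))
```

Because `seen_urls` lives only in memory, a scraper that restarts between runs would need to save it somewhere, such as a file.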

---

## Additional considerations

### User-Agent

Some websites are sensitive to being scraped. The easiest thing those sites can do to disallow scraping is to check the **User-Agent** of the program that is being used to request the data. All browsers send a User-Agent **header** as part of their requests to websites. This is a string which identifies the browser version and operating system that is making the request. Currently, on Windows 10 with Chrome, the value of my User-Agent header is:

```
Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/114.0.0.0 Safari/537.36
```

By default, `requests` will send a User-Agent of `python-requests/x.y.z` where `x.y.z` is the version number of the library. A website might see this and decide it's not a real web browser and deny the request.
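
You can check the exact default value for your installed version without making any network request:

```python
import requests

# The User-Agent string requests sends when none is specified.
default_ua = requests.utils.default_user_agent()
print(default_ua)
```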

Since the User-Agent is data provided by the client (us), we can set it to anything we want. Setting the User-Agent to look like the request is coming from a web browser is called **User-Agent spoofing**.

In [None]:
requests.get(
    'https://text.npr.org/',
    headers={
        'User-Agent': ('Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 '
        '(KHTML, like Gecko) Chrome/114.0.0.0 Safari/537.36'),
    }
)

If the website doesn't seem sensitive to User-Agent headers, another thing that you could do is send a User-Agent that identifies yourself or your bot. This is particularly courteous because it allows the operator of the website to identify you in the case that your bot is causing problems. Wikipedia, for example, has a [User-Agent policy](https://meta.wikimedia.org/wiki/User-Agent_policy) that requires bot operators and those running automated scripts to identify themselves in the User-Agent string.
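
One way to do this is to set the header once on a `requests.Session`, so that every request made through the session carries it. In this sketch the bot name, version, URL, and email address are all placeholders to replace with your own.

```python
import requests

session = requests.Session()
# Hypothetical bot name, version, and contact details; replace with your own.
session.headers.update({
    'User-Agent': 'NoisebridgeClassBot/0.1 (https://example.com/bot; bot@example.com)',
})

# Every request made through this session now carries the header, e.g.
# session.get('https://text.npr.org/')
print(session.headers['User-Agent'])
```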

### Rate limiting

The most common mistake beginners make when performing web scraping is to "hammer" the target website. You might have a loop that requests every news article from the front page:

In [None]:
def absolute_url(relative_url):
    return urllib.parse.urljoin(base_url, relative_url)

all_absolute_urls = [absolute_url(a.get('href')) for a in articles]
for url in all_absolute_urls:
    # This will make the requests as fast as possible, potentially multiple per second
    resp = requests.get(url)
    resp.text

As the comment says, if you run this code your computer will make network requests to the target website as fast as possible. Many websites will see this flood of requests and either recognize it as a scraping procedure, or even consider it a Denial of Service attack. This could lead to your IP address being blocked. The solution is to intentionally delay or **sleep** your code between requests. Although this might seem annoying, it is courteous to the website you're scraping and will help prevent you from getting blocked.

In [None]:
import time

def absolute_url(relative_url):
    return urllib.parse.urljoin(base_url, relative_url)

all_absolute_urls = [absolute_url(a.get('href')) for a in articles]
for url in all_absolute_urls:
    # Pause for 2 seconds before each request so we don't hammer the website
    time.sleep(2)
    resp = requests.get(url)
    resp.text
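
A fixed `time.sleep(2)` always waits the full two seconds, even if processing the previous page already took a while. As a possible refinement, the hypothetical `Throttle` helper sketched below only sleeps for whatever part of the interval is still remaining.

```python
import time

class Throttle:
    """Enforce a minimum number of seconds between consecutive calls to wait()."""

    def __init__(self, min_interval):
        self.min_interval = min_interval
        self.last_call = None

    def wait(self):
        now = time.monotonic()
        if self.last_call is not None:
            remaining = self.min_interval - (now - self.last_call)
            if remaining > 0:
                time.sleep(remaining)
        self.last_call = time.monotonic()

throttle = Throttle(2)
# In the loop above, throttle.wait() would take the place of time.sleep(2):
# for url in all_absolute_urls:
#     throttle.wait()
#     resp = requests.get(url)
```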

### IP Address

You've probably heard of IP addresses. They're like your phone number on the Internet. Although most home internet providers use rotating IP addresses, it is likely that your home IP address will stay the same for a relatively long period (measured in weeks).

If you get banned from scraping a website, they will most likely do so by using your IP address, so that when they see a request from your IP, they will drop it or return an error code. This can look like requests that hang indefinitely.
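
Since `requests` does not time out on its own unless told to, passing a `timeout` keeps a dropped request from hanging your script. The sketch below uses a hypothetical helper name, `fetch_or_none`, and a 10-second timeout chosen arbitrarily.

```python
import requests

def fetch_or_none(url, timeout=10):
    """Return the page text, or None if the request fails or times out."""
    try:
        resp = requests.get(url, timeout=timeout)
        resp.raise_for_status()
    except requests.exceptions.RequestException as exc:
        # Covers timeouts, connection errors, invalid URLs, and HTTP error codes.
        print(f'Request for {url!r} failed: {exc}')
        return None
    return resp.text
```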

There are web scraping services that will allow you to connect to a **proxy**, which will cause your IP address to be different every time you make a request to the website you are scraping. This will let you avoid IP bans.
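
With `requests`, a proxy is configured through a `proxies` mapping. This is a sketch only: the proxy address and credentials below are placeholders, and a real service would supply its own.

```python
import requests

# Hypothetical proxy address and credentials; a real service supplies its own.
proxy_url = 'http://user:password@proxy.example.com:8080'

proxied = requests.Session()
# Route both plain HTTP and HTTPS traffic through the proxy.
proxied.proxies.update({'http': proxy_url, 'https': proxy_url})

# Requests made through this session would now go via the proxy, e.g.
# proxied.get('https://text.npr.org/', timeout=10)
print(proxied.proxies)
```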

### Pagination

Sometimes, the data you want will be spread across multiple "pages" of an app. We have a couple of options in this case. We could inspect the HTML of the page for the link to "next page" and use the URL of that link to visit the next page of results. We could also programmatically construct the URLs by figuring out the pagination scheme. For example, if the URLs are like:

```
https://example.com/forum/posts?page=1
```

We could increment the page value to 2, 3, etc. Then we could request each of these URLs in order to scrape the data.
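
Constructing those URLs might look like the sketch below. The last page number (3) is an assumption made for illustration; a real scraper would usually stop when a page comes back empty or has no "next page" link.

```python
from urllib.parse import urlencode

forum_url = 'https://example.com/forum/posts'

def page_url(page):
    # urlencode builds the query string, e.g. 'page=2'.
    query = urlencode({'page': page})
    return f'{forum_url}?{query}'

# Pages 1 through 3; the upper bound is an assumption for this example.
page_urls = [page_url(n) for n in range(1, 4)]
print(page_urls)
```

Each of these URLs could then be requested in turn, with a pause between requests as described under "Rate limiting".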