# Web Scraping Walk-Through

Often the data we want isn't available in a downloadable format or through a web API. But if we can see the information we want in the content of web pages, we can usually scrape it.

In this walk-through, we will:

1. Discuss fundamental web technologies relevant to web scraping
2. Introduce third-party Python packages that support the web scraping workflow
3. Write some code that scrapes data from the Missouri State Highway Patrol's [website](https://www.mshp.dps.missouri.gov/HP68/SearchAction)

## Don't do this unless you really need to

Scraping a website should be your last resort. Except under extraordinary circumstances, you should strive to work within the usual guidelines of public information requests.

Talk to the people who run the website you want to scrape, and ask them to provide you with their data, which is most likely in a relational database system (e.g., SQL Server, MySQL, Oracle). Be patient, kind and respectful.

## How the web works

When your web browser loads a website and you point and click your way around the pages of that site, there are several technologies working behind the scenes that make all this possible. So in order to do web scraping, it's helpful to have a good idea of [how the web works](https://developer.mozilla.org/en-US/docs/Learn/Getting_started_with_the_web/How_the_Web_works). 

In particular, when we scrape a website, we interact directly with (at the very least) two fundamental web technologies.

### HTTP

[HTTP](https://developer.mozilla.org/en-US/docs/Web/HTTP) stands for **H**yper**t**ext **T**ransfer **P**rotocol, and it is the foundation of any data exchange on the Web. It sets the rules for how clients like your preferred web browser (e.g., Firefox, Safari, Chrome) communicate with web servers that host web pages and other resources you might fetch.

The HTTP flow starts when a client sends a request to a server. All requests have the following parts:

1. A [method](https://developer.mozilla.org/en-US/docs/Web/HTTP/Methods) indicating the user's desired action.
2. A path to a resource, indicated by a Uniform Resource Locator (aka, [URL](https://developer.mozilla.org/en-US/docs/Learn/Common_questions/What_is_a_URL)).
3. Optional headers that convey additional information from clients to the server.
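The three parts listed above can be seen by building a request object without sending it. This is a minimal sketch using only Python's standard library; the `User-Agent` value is a made-up placeholder, not something the site requires.

```python
from urllib.parse import urlsplit
from urllib.request import Request

# Build (but do not send) a request so we can inspect its parts.
request = Request(
    "https://www.mshp.dps.missouri.gov/HP68/SearchAction",
    method="GET",
    headers={"User-Agent": "example-scraper/0.1"},  # hypothetical header value
)

print(request.get_method())             # 1. the method
print(urlsplit(request.full_url).path)  # 2. the path portion of the URL
print(request.header_items())           # 3. the optional headers
```

Nothing travels over the network here; the object only describes what a client would send.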

### HTML

The second fundamental web technology we interact with in web scraping is **H**yper**t**ext **M**arkup **L**anguage (aka, [HTML](https://developer.mozilla.org/en-US/docs/Learn/Getting_started_with_the_web/HTML_basics)). It isn't a programming language like Python or JavaScript. Rather, it is a language for defining the structure of web pages. Every web page is a document written in HTML.

In an HTML document, content is annotated (hence, the "Markup" part) using tags like [`<h1>` through `<h6>`](https://developer.mozilla.org/en-US/docs/Web/HTML/Element/Heading_Elements) for headings and [`<p>`](https://developer.mozilla.org/en-US/docs/Web/HTML/Element/p) for paragraphs.

Tags, plus their optional attributes and the content that they wrap, form [elements](https://developer.mozilla.org/en-US/docs/Glossary/element), which are the atomic units of an HTML document.

![](https://media.prod.mdn.mozit.cloud/attachments/2014/04/09/7659/a731e40efad1f6e0b728bfcf86c0035b/anatomy-of-an-html-element.png)
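To make the anatomy of an element concrete, here is a small sketch that feeds one element to the `html.parser` module from Python's standard library and records each piece as the parser meets it. The `ElementLogger` class and the sample paragraph are illustrative only.

```python
from html.parser import HTMLParser


class ElementLogger(HTMLParser):
    """Record the opening tag, attributes, content and closing tag of each element."""

    def __init__(self):
        super().__init__()
        self.events = []

    def handle_starttag(self, tag, attrs):
        # attrs arrives as a list of (name, value) pairs
        self.events.append(("start", tag, dict(attrs)))

    def handle_data(self, data):
        self.events.append(("data", data))

    def handle_endtag(self, tag):
        self.events.append(("end", tag))


logger = ElementLogger()
logger.feed('<p class="editor-note">My cat is very grumpy</p>')
logger.close()

for event in logger.events:
    print(event)
```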

## How a web scraper works

With these two technologies in mind, we have an essential grasp of the work a web scraper needs to do:

1. Fetch HTML documents using HTTP
2. Parse those documents into a machine-readable structure
3. Extract the exact content we want from that structure
4. Store that extracted content in a format that affords further reporting, analysis, etc.

Often in the newsroom, reporters will perform this exact series of tasks using a web browser, a spreadsheet application and good ol' copy and paste. That tedious, repetitive work *is* web scraping, though some might not use such a buzzy label to describe such a boring process.

Web scraping, as it is typically understood, is simply the automation of this general workflow, and automating this process requires different tools and different (typically non-graphical) interfaces.

## Essential Python packages

A couple of third-party Python packages are widely used in web scraping to handle the two web technologies described above.

### Requests for managing HTTP requests and responses

[Requests](https://requests.readthedocs.io/en/master/) is a package for sending HTTP requests and reading the responses. It also handles other intricacies, including sessions, cookies, and URL and form encoding.

We can install Requests using pip, Python's default package manager:

```sh
pip install requests
```

Then we can import the main module:

In [None]:
import requests

#### Making a `GET` Request

The most common HTTP request method is [`GET`](https://developer.mozilla.org/en-US/docs/Web/HTTP/Methods/GET), which simply gets you a copy of the requested resource located at the URL.


Let's make a `GET` request for the [home page](https://www.mshp.dps.missouri.gov/HP68/SearchAction) of the State Highway Patrol website. The convention is to store the response in a variable called `r`.

In [2]:
r = requests.get("https://www.mshp.dps.missouri.gov/HP68/SearchAction")

The [`.get`](https://2.python-requests.org/en/master/api/#requests.get) function has one required argument, which is the URL we want to get. It returns a [`requests.Response`](https://2.python-requests.org/en/master/api/#requests.Response) object that represents the server's response.

We can check if we got an "okay" response:

In [3]:
r.ok

True

We can also check the specific [response status code](https://developer.mozilla.org/en-US/docs/Web/HTTP/Status):

In [4]:
r.status_code

200
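As a rough guide to reading these codes, the sketch below uses the standard library's `http.HTTPStatus` to look up the phrase that goes with a code. The `describe_status` helper is hypothetical; its "ok" verdict mirrors how Requests treats any status code below 400 as okay.

```python
from http import HTTPStatus


def describe_status(code):
    """Return a short, human-readable description of an HTTP status code."""
    status = HTTPStatus(code)
    # Mirrors the rule behind `r.ok`: anything below 400 counts as okay.
    verdict = "ok" if code < 400 else "not ok"
    return f"{code} {status.phrase} ({verdict})"


for code in (200, 404, 500):
    print(describe_status(code))
```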

We can also access the content of the response, that is, all the HTML that makes up the web page. We do this with the `.content` attribute.

In [5]:
content = r.content

This content is represented in Python in [`bytes`](https://docs.python.org/3/library/stdtypes.html#bytes). If we want the HTML as a string, we can use the response's `.text` attribute.

In [6]:
content_as_text = r.text
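The difference between the two attributes comes down to decoding. The sketch below uses a made-up byte string as a stand-in for `r.content` and decodes it by hand; it assumes UTF-8, whereas Requests works out the encoding from the response itself.

```python
# A stand-in for r.content: text encoded as bytes (UTF-8 assumed here).
raw = "Café crash report".encode("utf-8")
print(type(raw), len(raw))

# Roughly what r.text gives us: the same bytes decoded into a str.
text = raw.decode("utf-8")
print(type(text), len(text))
```

The lengths differ because `len` counts bytes in one case and characters in the other, and the accented character takes two bytes in UTF-8.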

Raw HTML is a bit overwhelming for human readers because it contains the full definition of every element and the entire hierarchy of the document.

Instead of reading it in the notebook, we can write the content to a local file.

In [7]:
with open("index.html", "w", encoding="utf-8") as f:
    f.write(content_as_text)

Then open it in a web browser (or in Jupyter Lab), which will display the document without any associated CSS or JavaScript files. Hence, we get an unstyled, bare-bones view of the content.

#### Making a `POST` request

After `GET`, the second most common HTTP request method is [`POST`](https://developer.mozilla.org/en-US/docs/Web/HTTP/Methods/POST).

A `POST` request allows a user agent to send data to a server, which typically happens via a web form, such as the search form on the Missouri State Highway Patrol's home page.

Granted, sending a `POST` request for the purposes of getting search results is a little confusing. Wouldn't a `GET` request make more sense?

Ideally, yes. However, search forms are often implemented with a `POST` request, especially if they allow users to submit really complex search queries or require users to submit sensitive data.

To figure out which method a search form requires, use your browser's web inspector. You can either check the `method` attribute of the `<form>` element in the HTML, or watch the network tab for the request and check the method there.

The web inspector also reveals other essential information: the names of the form fields. Again, these can be found in either:

- the HTML (look at the `name` attribute on the form's various input fields); or
- the body of the `POST` request (the submitted form data, which the network tab shows alongside the request).

We need to know the names of the form fields and what values they will accept because we need to provide this information in our request for search results.
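The same inspection can be done in code. The sketch below pulls the form's method and field names out of a piece of markup using the standard library's `html.parser`. The markup is a simplified, hypothetical stand-in for the real search form: only the `searchInjury` field name comes from this walk-through, and `exampleField` is invented.

```python
from html.parser import HTMLParser

# Hypothetical, simplified form markup for illustration only.
SAMPLE_FORM = """
<form method="post" action="/HP68/SearchAction">
  <select name="searchInjury"><option value="FATAL">Fatal</option></select>
  <input type="text" name="exampleField">
  <input type="submit" value="Search">
</form>
"""


class FormInspector(HTMLParser):
    """Collect a form's method and the names of its fields."""

    def __init__(self):
        super().__init__()
        self.method = None
        self.field_names = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "form":
            # Forms without a method attribute default to GET.
            self.method = attrs.get("method", "get").upper()
        elif tag in ("input", "select", "textarea") and "name" in attrs:
            self.field_names.append(attrs["name"])


inspector = FormInspector()
inspector.feed(SAMPLE_FORM)
print(inspector.method, inspector.field_names)
```

The submit button is skipped because it has no `name` attribute, so it contributes nothing to the submitted data.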

In the Python Requests library, this search info is specified via the `data` keyword argument when calling the [`.post()`](https://2.python-requests.org/en/master/api/#requests.post) function.

For instance, here's how to search for all the crashes that resulted in at least one fatal injury.

In [8]:
r = requests.post("https://www.mshp.dps.missouri.gov/HP68/SearchAction", data={'searchInjury': 'FATAL'})
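To see what a `data` dictionary turns into on the wire, we can form-encode it ourselves with the standard library. This is only a sketch of the encoding step; Requests performs an equivalent encoding internally when given a `dict`.

```python
from urllib.parse import urlencode

form_data = {"searchInjury": "FATAL"}

# Form fields are sent in the request body as URL-encoded name=value pairs.
body = urlencode(form_data)
print(body)
```

This is the same text the browser's network tab shows as the submitted form data.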

If the response is okay, we can write this to another local file.

In [9]:
r.ok

True
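Writing the search results to a file follows the same pattern as before. Here is a minimal sketch with a hypothetical `save_html` helper and a made-up file name, `results.html`; a short string stands in for `r.text` so the example runs without a network connection.

```python
import tempfile
from pathlib import Path


def save_html(html_text, path):
    """Write HTML text to a local file as UTF-8."""
    Path(path).write_text(html_text, encoding="utf-8")


# A stand-in for r.text; in the notebook we would pass r.text instead.
sample = "<html><body><p>search results go here</p></body></html>"

with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "results.html"
    save_html(sample, target)
    round_trip_ok = target.read_text(encoding="utf-8") == sample

print(round_trip_ok)
```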

### BeautifulSoup for parsing HTML

[BeautifulSoup](https://www.crummy.com/software/BeautifulSoup/bs4/doc/) is the most popular Python package for getting data out of HTML.

Just like Requests or any other third-party Python package, we can install BeautifulSoup via pip:

```sh
pip install beautifulsoup4
```

The parsing process begins by creating a [BeautifulSoup](https://www.crummy.com/software/BeautifulSoup/bs4/doc/#beautifulsoup) object.

First we import this class.

In [10]:
from bs4 import BeautifulSoup

Then we create an instance of this class by passing in the HTML document's content.

In [11]:
soup = BeautifulSoup(content)

The [`BeautifulSoup` object](https://www.crummy.com/software/BeautifulSoup/bs4/doc/#beautifulsoup) represents the entire parsed HTML document, which we can now [navigate](https://www.crummy.com/software/BeautifulSoup/bs4/doc/#navigating-the-tree) and [search](https://www.crummy.com/software/BeautifulSoup/bs4/doc/#searching-the-tree).

Technically, BeautifulSoup is an interface to lower-level Python parsing libraries. If we don't name a parser, it picks the best one that is installed and warns us that it had to guess. The [`html.parser`](https://docs.python.org/3/library/html.parser.html) module in Python's standard library is always available. But we have other options:

- [lxml](https://lxml.de/) tends to be faster
- [html5lib](https://html5lib.readthedocs.io/en/latest/) tends to be more lenient (i.e., tolerant of HTML docs with unclosed tags and other sub-standard syntax)

BeautifulSoup's docs have more details about [how to install alternate parsers](https://www.crummy.com/software/BeautifulSoup/bs4/doc/#installing-a-parser) and the [differences](https://www.crummy.com/software/BeautifulSoup/bs4/doc/#differences-between-parsers) between them.

To name the parser explicitly, we pass in the name of that parser (as a `str`) in the second positional argument when we create the `BeautifulSoup` object. Here we parse the search results from our `POST` request:

In [12]:
soup = BeautifulSoup(r.content, 'html.parser')

We are now set up to extract data from the document.

At this point, we need to decide exactly what content in this document we want to capture. Web pages of course are designed to be read by humans. They have images, navigation menus, headings, paragraphs and other artifacts of the document's layout, most of which are probably irrelevant to our current reporting task.

We need to exercise our [news judgment](https://source.opennews.org/articles/making-good-news-judgments/) to decide exactly what information in this document is important to us. Then we identify patterns for finding that information in the document, and define rules based on those patterns. Finally, we encode that knowledge into Python's syntax.

In addition to helping you identify and describe the most relevant HTTP request/responses, your web browser's "developer tools" can help you sort through the elements of the web page. This tool is often called "Inspect element" or the "web inspector", and it allows you to see the full definition of each element, including the tag name and the element's attributes. You can also just use "view page source" or download and open the HTML in a text editor.

The order and hierarchy of the elements in the HTML document, along with the tag names and the attributes of each element, are useful for finding the exact elements that contain the content we want to extract.

#### BeautifulSoup's `.find` method

For instance, all of the results of our previous search are contained within an element annotated with a [`<table>`](https://developer.mozilla.org/en-US/docs/Web/HTML/Element/table) tag, which we can find via BeautifulSoup's [`.find`](https://www.crummy.com/software/BeautifulSoup/bs4/doc/#find) method:

In [13]:
table = soup.find('table')

The `.find` method returns either a [`Tag` object](https://www.crummy.com/software/BeautifulSoup/bs4/doc/#tag) or `None` if there aren't any elements in the document with the given tag name. We can check the type of what we got back and access the tag's name:

In [14]:
type(table)

bs4.element.Tag

In [15]:
table.name

'table'

And its attributes:

In [16]:
table.attrs

{'class': ['accidentOutput'],
 'summary': 'Listing of crash reports based on search criteria.'}

Which are provided as a `dict`, allowing us to access each value:

In [17]:
table.attrs['summary']

'Listing of crash reports based on search criteria.'
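Because `.find` returns `None` when nothing matches, it is worth checking the result before reaching for `.attrs`. The sketch below uses a made-up page with no table in it to show the guard.

```python
from bs4 import BeautifulSoup

# A made-up page without any <table> element.
html = "<html><body><p>No results found.</p></body></html>"
soup_without_table = BeautifulSoup(html, "html.parser")

missing = soup_without_table.find("table")

if missing is None:
    print("No <table> element found")
else:
    print(missing.attrs)
```

Without the check, `missing.attrs` would raise an `AttributeError`, which is a common failure when a site returns an error page or an empty result.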

#### BeautifulSoup's `.find_all` method

BeautifulSoup's `.find` method returns the first tag that meets our specified criteria. However, HTML documents typically have many elements annotated with the same tag name (e.g., several sub-headings annotated with `<h2>` through `<h6>` tags, several paragraphs of text annotated with `<p>` tags, and sometimes even multiple tables of data all annotated with `<table>` tags).

BeautifulSoup allows us to find and operate on the entire set of tags that meet the same criteria using the `.find_all` method.

For instance, we can find all of the elements with a `<table>` tag:

In [18]:
tables = soup.find_all('table')

Whereas `.find` returns an individual `Tag` object, `.find_all` returns a `ResultSet`:

In [19]:
type(tables)

bs4.element.ResultSet

Which is a list-like sequence of `Tag` objects. Because it behaves like a list, we can access specific items within a `ResultSet` by index. Here's how we get the `'summary'` attribute of the first table:

In [20]:
tables[0].attrs['summary']

'Listing of crash reports based on search criteria.'

We can also check the length of a `ResultSet`, which is effectively counting the number of `<table>` elements in the document.

In [21]:
len(tables)

1
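This page has only one table, but many pages have several. In that case we can narrow the search with an attribute filter. The sketch below uses made-up markup with two tables; the `accidentOutput` class is the one we saw on the real results table, while the `layout` class is invented.

```python
from bs4 import BeautifulSoup

# Made-up markup: one layout table and one results table.
sample_html = """
<table class="layout"><tr><td>navigation</td></tr></table>
<table class="accidentOutput"><tr><th>Name</th></tr></table>
"""
sample_soup = BeautifulSoup(sample_html, "html.parser")

all_tables = sample_soup.find_all("table")
# class_ (with a trailing underscore) filters on the HTML class attribute.
results_tables = sample_soup.find_all("table", class_="accidentOutput")

print(len(all_tables), len(results_tables))
```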

Both the `.find` and `.find_all` methods are available on BeautifulSoup's `Tag` object, which allows us to search within the contents of a given element.

For instance, let's get all of the column headers of the search results table, each of which is annotated with a [`<th>`](https://developer.mozilla.org/en-US/docs/Web/HTML/Element/th) (aka, table header) tag:

In [22]:
th_all = table.find_all('th')

Now we can loop over those tags, and print their text:

In [23]:
for th in th_all:
    print(th.text.strip())

Report
Name
Age
Person City/State
Personal Injury
Safety Device
Date
Time
Crash County
Crash Location
Troop


These headers will be useful down the road when we write our data to a csv file. We'll keep them around in a `list`:

In [24]:
headers = []

To which we will append a cleaned up version of each column name.

In [25]:
for th in th_all:
    header = th.text.strip().replace(" ", "_").replace("/", '_').lower()
    headers.append(header)

In [26]:
headers

['report',
 'name',
 'age',
 'person_city_state',
 'personal_injury',
 'safety_device',
 'date',
 'time',
 'crash_county',
 'crash_location',
 'troop']

Now we need to get the actual data—the rows of column values—out of this table.

Let's first find all of the table's rows, which are annotated with a [`<tr>`](https://developer.mozilla.org/en-US/docs/Web/HTML/Element/tr) (aka, table row) tag:

In [27]:
tr_all = table.find_all('tr')[1:]

Notice how we're skipping the first `<tr>` because it contains the table headers we've already extracted.

The gist of the process is to grab the text within each [`<td>`](https://developer.mozilla.org/en-US/docs/Web/HTML/Element/td) (aka, table data cell) element within each `<tr>` element.

To better understand how this will work, let's first simply print our expected output for the first few rows.

For each row, we'll print a visual delimiter (e.g., a string of dashes). Then for each cell in that row, we'll print the text (with whitespace stripped). We'll print the text of each element on a separate line to demonstrate that we are indeed accessing each element separately.

In [28]:
for tr in tr_all[:5]:
    print('--------------')
    for td in tr.find_all('td'):
        print(td.text.strip())

--------------
View
WORS, RICHARD W
42
FESTUS, MO
FATAL
NO
11/18/2020
4:53PM
DOUGLAS
PRIVATE PROPERTY EIGHT MILES SOUTHWEST OF WILLOW SPRINGS
G
--------------
View
LUTTRULL, JEANA L
58
LEWISTOWN, MO
FATAL
YES
11/17/2020
4:40PM
LEWIS
MO 6 AT VINE ST IN LEWISTOWN, MO
B
--------------
View
LUCKETT, JUSTIN R
43
ST. CHARLES, MO
FATAL
YES
11/15/2020
6:00PM
ST. CHARLES
EASTBOUND MISSOURI ROUTE 94 .9 OF A MILE EAST OF THE WELDON SPRING BOAT ACCESS
C
--------------
View
DAVIS, DENNIS E
61
AUXVASSE, MO
FATAL
NO
11/15/2020
2:30AM
CALLAWAY
US 54 WESTBOUND WEST OF COUNTRY ROAD 982
F
--------------
View
JUVENILE
14
HILLSBORO, MO
FATAL
NO
11/14/2020
4:50PM
JEFFERSON
SB HIGHWAY B SOUTH OF ROCKY FARM ROAD
C


Notice the first column value for every row is `View`, which seems pretty useless. Look back at the web page or markup, and you'll notice that this text is a hyperlink to another web page with additional details for a given incident report.

Knowing that we might want to scrape data from this page as well, it would be helpful to grab the URL of these hyperlinks so that we can make additional requests for this content down the road.

In HTML, a hyperlink is defined by an [anchor element](https://developer.mozilla.org/en-US/docs/Web/HTML/Element/a), which is annotated with an `<a>` tag. The opening `<a>` tag has an [`href`](https://developer.mozilla.org/en-US/docs/Web/HTML/Element/a#href) attribute where the URL is stored.

Let's access the value of the `href` attribute for the first row in the search results.

In [29]:
tr_all[0].td.a.attrs['href']

'/HP68/AccidentDetailsAction?ACC_RPT_NUM=200576955'
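This `href` is a relative URL, so it needs to be combined with the address of the page it came from before we can request it. A minimal sketch with the standard library's `urllib.parse`, using the search URL and the `href` shown above:

```python
from urllib.parse import parse_qs, urljoin, urlsplit

base_url = "https://www.mshp.dps.missouri.gov/HP68/SearchAction"
href = "/HP68/AccidentDetailsAction?ACC_RPT_NUM=200576955"

# Resolve the relative link against the page it was found on.
detail_url = urljoin(base_url, href)
print(detail_url)

# The report number is carried in the query string.
report_number = parse_qs(urlsplit(detail_url).query)["ACC_RPT_NUM"][0]
print(report_number)
```

The resulting absolute URL is what we would pass to `requests.get` to fetch the details page.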

## Cleaning the data

Functions are a good way to encapsulate a discrete step that a program needs to perform. This approach makes your code easier to read, understand and maintain.

For example, we can encapsulate the cleaning process for an entire row in a single function.

In [30]:
def clean_row(tr):
    
    tds = tr.find_all('td')
    
    row = {
        'report': tds[0].find('a').attrs['href'].strip(),
        'name': tds[1].text.strip(),
        'age': tds[2].text.strip(),
        'person_city_state': tds[3].text.strip(),
        'personal_injury': tds[4].text.strip(),
        'safety_device': tds[5].text.strip(),
        'date': tds[6].text.strip(),
        'time': tds[7].text.strip(),
        'crash_county': tds[8].text.strip(),
        'crash_location': tds[9].text.strip(),
        'troop': tds[10].text.strip()
    }
    
    return row

Now let's loop over the first five rows, call this function and print the results.

In [31]:
for tr in tr_all[:5]:
    print('--------------')
    row = clean_row(tr)
    print(row)

--------------
{'report': '/HP68/AccidentDetailsAction?ACC_RPT_NUM=200576955', 'name': 'WORS, RICHARD W', 'age': '42', 'person_city_state': 'FESTUS, MO', 'personal_injury': 'FATAL', 'safety_device': 'NO', 'date': '11/18/2020', 'time': '4:53PM', 'crash_county': 'DOUGLAS', 'crash_location': 'PRIVATE PROPERTY EIGHT MILES SOUTHWEST OF WILLOW SPRINGS', 'troop': 'G'}
--------------
{'report': '/HP68/AccidentDetailsAction?ACC_RPT_NUM=200575158', 'name': 'LUTTRULL, JEANA L', 'age': '58', 'person_city_state': 'LEWISTOWN, MO', 'personal_injury': 'FATAL', 'safety_device': 'YES', 'date': '11/17/2020', 'time': '4:40PM', 'crash_county': 'LEWIS', 'crash_location': 'MO 6 AT VINE ST IN LEWISTOWN, MO', 'troop': 'B'}
--------------
{'report': '/HP68/AccidentDetailsAction?ACC_RPT_NUM=200571996', 'name': 'LUCKETT, JUSTIN R', 'age': '43', 'person_city_state': 'ST. CHARLES, MO', 'personal_injury': 'FATAL', 'safety_device': 'YES', 'date': '11/15/2020', 'time': '6:00PM', 'crash_county': 'ST. CHARLES', 'crash_l

Looks like clean data! Notice how representing each row as a `dict` helps us quickly assess that we've properly mapped column values to the proper column headers.

We could take this validation further. Each column value in each row is in Python's native `str` data type. We could instead coerce these values to more precise and useful data types (e.g., validate that every `'age'` is an `int`). Python's [`dataclasses`](https://docs.python.org/3/library/dataclasses.html), a rather new feature of the language, would be good for this.
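As a rough illustration of that idea, here is a sketch of a dataclass that coerces a few of the string values. The `CrashRecord` class, its field names and its `from_row` method are hypothetical, and the date format is inferred from the sample rows above. Note that dataclasses don't check types on their own, so the conversion has to be written explicitly.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass
class CrashRecord:
    name: str
    age: int
    occurred_at: datetime

    @classmethod
    def from_row(cls, row):
        """Build a record from a dict of strings, raising ValueError on bad values."""
        return cls(
            name=row["name"],
            age=int(row["age"]),
            # Format assumed from sample rows such as '11/18/2020' and '4:53PM'.
            occurred_at=datetime.strptime(
                f"{row['date']} {row['time']}", "%m/%d/%Y %I:%M%p"
            ),
        )


sample_row = {"name": "WORS, RICHARD W", "age": "42", "date": "11/18/2020", "time": "4:53PM"}
record = CrashRecord.from_row(sample_row)
print(record)
```

Any row with a missing or malformed age or date would raise a `ValueError` here, which is exactly the kind of inconsistency we would then have to decide how to handle.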

Extra work would definitely help us identify all of the inconsistencies in our data and force us to make decisions about how to handle them, likely resulting in more reporting and more conditional logic in our scraper code. Welcome to [yak-shaving](https://softwareengineering.stackexchange.com/questions/388092/what-exactly-is-yak-shaving).

Is this the right time to begin enforcing consistency in our data? There isn't a clear-cut answer. However, one common practical approach when taking a first cut at a scraper or data extraction process is to take a more *naive* approach, one that doesn't rely on preconceived notions of how the data should be formed.

Below is another version of the row cleaning function we defined above. But this time, the keys are not "hard-coded". The only assumptions we are making here are that:

- The first table cell in each row contains an `<a>` tag with an `href` attribute; and
- The order of the table cells matches the order of the column headers.

In [32]:
def clean_row_alt(tr):
    
    tds = tr.find_all('td')
    
    url = tds[0].find('a').attrs['href'].strip()
    
    values = [url] + [td.text.strip() for td in tds[1:]]
    
    return dict(zip(headers, values))

The code in this function is much more concise. Maybe even a little dense.
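One thing the density hides: `zip` stops at the shorter of its inputs, so a row with a missing cell would silently lose a column rather than raise an error. The sketch below demonstrates this with a shortened header list and adds a hypothetical `zip_row` helper that checks the lengths first.

```python
headers_demo = ["report", "name", "age"]
values_short = ["/HP68/AccidentDetailsAction?ACC_RPT_NUM=200576955", "WORS, RICHARD W"]

# zip stops at the shorter input, so 'age' is silently dropped.
print(dict(zip(headers_demo, values_short)))


def zip_row(headers, values):
    """Pair headers with values, refusing rows whose length doesn't match."""
    if len(headers) != len(values):
        raise ValueError(f"expected {len(headers)} values, got {len(values)}")
    return dict(zip(headers, values))


try:
    zip_row(headers_demo, values_short)
except ValueError as error:
    print(f"Skipping malformed row: {error}")
```

On Python 3.10 and later, `zip(headers, values, strict=True)` performs the same length check.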

Now let's print the results of calling this function on the first five rows:

In [33]:
for tr in tr_all[:5]:
    print('--------------')
    row = clean_row_alt(tr)
    print(row)

--------------
{'report': '/HP68/AccidentDetailsAction?ACC_RPT_NUM=200576955', 'name': 'WORS, RICHARD W', 'age': '42', 'person_city_state': 'FESTUS, MO', 'personal_injury': 'FATAL', 'safety_device': 'NO', 'date': '11/18/2020', 'time': '4:53PM', 'crash_county': 'DOUGLAS', 'crash_location': 'PRIVATE PROPERTY EIGHT MILES SOUTHWEST OF WILLOW SPRINGS', 'troop': 'G'}
--------------
{'report': '/HP68/AccidentDetailsAction?ACC_RPT_NUM=200575158', 'name': 'LUTTRULL, JEANA L', 'age': '58', 'person_city_state': 'LEWISTOWN, MO', 'personal_injury': 'FATAL', 'safety_device': 'YES', 'date': '11/17/2020', 'time': '4:40PM', 'crash_county': 'LEWIS', 'crash_location': 'MO 6 AT VINE ST IN LEWISTOWN, MO', 'troop': 'B'}
--------------
{'report': '/HP68/AccidentDetailsAction?ACC_RPT_NUM=200571996', 'name': 'LUCKETT, JUSTIN R', 'age': '43', 'person_city_state': 'ST. CHARLES, MO', 'personal_injury': 'FATAL', 'safety_device': 'YES', 'date': '11/15/2020', 'time': '6:00PM', 'crash_county': 'ST. CHARLES', 'crash_l

Same output. Awesome! An added benefit to this approach is that, if the web developers make changes to this table element (e.g., re-order the columns, or remove or add some), this unit of the code will continue to function properly, as long as the link stays in the first cell.

Poorly timed consistency checks in our data workflows cause *brittleness* as opposed to flexibility. We do need to enforce (or at the very least check) consistency in our data before we analyze, visualize or otherwise make use of the data. However, in a data pipeline that's extracting data from an inconsistent source, we don't want to enforce consistency too early.

## Storing the data

We've demonstrated how to clean the first few rows. Now let's prepare all of the rows to be stored for later use. First, store all of the rows in memory.

In [34]:
rows = [
    clean_row_alt(tr) for tr in tr_all
]

We can even check if we got them all.

In [35]:
len(rows)

622

Python's `assert` statements are useful for verifying our assumptions. In this case, we are checking to make sure the count of rows we extracted matches the count of `<tr>` tags.

In [36]:
assert len(rows) == len(tr_all)

To write the rows to a file, we'll use the [csv](https://docs.python.org/3.8/library/csv.html) module, which is part of Python's standard library.

In [37]:
import csv

We're going to use the `DictWriter` class to write out the rows (each being a `dict`) to a local csv file.

In [38]:
with open('data.csv', 'w', newline='') as f:
    
    writer = csv.DictWriter(f, fieldnames=headers)
    
    writer.writeheader()
    
    for row in rows:
        writer.writerow(row)
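A quick way to confirm the file is usable is to read it back with `csv.DictReader` and compare. The sketch below does the round trip in memory with two abbreviated rows taken from the output above; an `io.StringIO` buffer stands in for `data.csv`.

```python
import csv
import io

fieldnames = ["name", "age", "crash_county"]
sample_rows = [
    {"name": "WORS, RICHARD W", "age": "42", "crash_county": "DOUGLAS"},
    {"name": "LUTTRULL, JEANA L", "age": "58", "crash_county": "LEWIS"},
]

# Write to an in-memory buffer instead of a file on disk.
buffer = io.StringIO(newline="")
demo_writer = csv.DictWriter(buffer, fieldnames=fieldnames)
demo_writer.writeheader()
demo_writer.writerows(sample_rows)

# Rewind and read the rows back as dicts keyed by the header row.
buffer.seek(0)
read_back = list(csv.DictReader(buffer))
print(read_back == sample_rows)
```

The names contain commas, so the writer quotes them and the reader restores them intact.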

## Converting your notebook to a script

Jupyter is a good environment for experimenting and explaining a web scraper. However, if we want to write a program that we can run on a schedule and/or deploy to the cloud, we would probably rather define our web scraper in vanilla Python, either a single script or a module.

[nbconvert](https://pypi.org/project/nbconvert/) is a command-line utility for converting Jupyter notebooks (.ipynb files) into other formats, including Python scripts (.py files).

We can install it with pip:

```sh
pip install nbconvert
```

Then invoke it as a subcommand of jupyter:

```sh
jupyter nbconvert --no-prompt --to python walk-through.ipynb
```

This creates a new .py file named after the original .ipynb file. Notice that all of the markdown is also included in the .py file as code comments.

For a notebook (such as this one) with more prose than code, we might prefer to exclude the markdown:

```sh
jupyter nbconvert --no-prompt --to python --PythonExporter.exclude_markdown=True walk-through.ipynb
```

Now we can open and edit the .py file in a text editor. If you like, you can just use Jupyter Lab's text editor, but you might prefer a more full-featured text editor (e.g., [Sublime Text](https://www.sublimetext.com/), [Atom](https://atom.io/) or [Visual Studio Code](https://code.visualstudio.com/)).

## Legalities of web scraping

U.S. law has more blurry lines than bright ones, and this is especially true when it comes to digital technologies. If web scraping is an activity you are going to do regularly, I encourage you to do your own research on this topic.

[*Web Scraping with Python, 2nd Edition*](https://learning.oreilly.com/library/view/web-scraping-with/9781491985564/) by Ryan Mitchell has a thorough breakdown of this issue in ["Chapter 18. The Legalities and Ethics of Web Scraping"](https://learning.oreilly.com/library/view/web-scraping-with/9781491985564/ch18.html#c-19).

Since this book is published by O'Reilly, you can access it for free through the MU Libraries' Safari Books account. If not, reach out to the j-school library. They are extraordinarily helpful.

The last time I did my own "research" (i.e., Googling) there was a lot of chatter about a 2018 case in the D.C. federal district court, [*Sandvig v. Sessions*](https://www.aclu.org/legal-document/sandvig-v-sessions-opinion), No. 1:16-cv-01368, Dkt. 24 (D.D.C. Mar. 30, 2018). The ruling suggests that at least most of what passes for web scraping would not violate the (rather archaic and vague) federal Computer Fraud and Abuse Act. Here are some takes:

- <http://scraping.pro/us-court-scraping-against-tos-legal/>
- <https://www.eff.org/deeplinks/2018/04/dc-court-accessing-public-information-not-computer-crime>
- <https://www.technologylawdispatch.com/2018/05/big-data/d-c-federal-court-rules-that-web-scraping-does-not-violate-the-cfaa-and-may-be-protected-by-the-first-amendment/>

LinkedIn/Microsoft also lost a [similar case](https://arstechnica.com/tech-policy/2017/08/court-rejects-linkedin-claim-that-unauthorized-scraping-is-hacking/) against another company scraping their site.