# Scraping and crawling the web


## Outline

* [Narrow crawling/scraping](#narrow)
    * [Making `Requests`](#request)
* [Parsing HTML](#parsing)
    * [Pretty parsing with `BeautifulSoup`](#BS)
    * [Getting human-readable text](#readable)
* [Crawling broadly with `Scrapy`](#scrapy)
    * [A simple (narrow) spider](#simple)
    * [Link extraction in a (broad) spider](#linkextraction)

**__________________________________**

# Narrow crawling/scraping <a id='narrow'> </a>

## Making `Requests` <a id='request'></a>

### Challenge

Get the HTML for [this claim review by the fact-checking site PolitiFact](https://www.politifact.com/factchecks/2021/apr/02/citizens-united/citizens-united-calls-bidens-infrastructure-plan-g/).
Print the first 1000 characters and compare them with what you see when you view the page source in your browser.

In [1]:
# solution
import requests 

url = 'https://www.politifact.com/factchecks/2021/apr/02/citizens-united/citizens-united-calls-bidens-infrastructure-plan-g/'
response = requests.get(url)
html = response.text

html[:1000]

'\n<!DOCTYPE html>\n<html lang="en-US" dir="ltr">\n<head>\n<meta charset="utf-8">\n<meta http-equiv="x-ua-compatible" content="ie=edge">\n<meta name="viewport" content="width=device-width, initial-scale=1">\n<title>PolitiFact | Citizens United calls Biden’s infrastructure plan the Green New Deal. It isn’t.</title>\n<meta name="description" content="Republican opposition to President Joe Biden’s infrastructure proposal has been swift and vocal. Senate Minority Leader " />\n<meta property="og:url" content="https://www.politifact.com/factchecks/2021/apr/02/citizens-united/citizens-united-calls-bidens-infrastructure-plan-g/" />\n<meta property="og:image" content="https://static.politifact.com/politifact/rulings/meter-mostly-false.jpg" />\n<meta property="og:image:secure_url" content="https://static.politifact.com/politifact/rulings/meter-mostly-false.jpg" />\n<meta property="og:title" content="PolitiFact - Citizens United calls Biden’s infrastructure plan the Green New Deal. It isn’t." />\

# Parsing HTML <a id='parsing'></a>

## Pretty parsing with `BeautifulSoup` <a id='BS'></a>

### Challenge

Find all the links in the above claim review page using the `<a>` tags and their `href` attributes. Print every 10th link. What do you notice about where these links point?

In [2]:
# Import BeautifulSoup for parsing
from bs4 import BeautifulSoup

url = 'https://www.politifact.com/factchecks/2021/apr/02/citizens-united/citizens-united-calls-bidens-infrastructure-plan-g/'
response = requests.get(url)
soup = BeautifulSoup(response.text, 'html.parser')  # name the parser explicitly to avoid a warning

# solution
for link in soup.find_all('a')[::10]: # every 10th element
    print(link.get('href'))

/
/pennsylvania/
/health-check/
/personalities/kamala-harris/
/personalities/rush-limbaugh/
/truth-o-meter/promises/trumpometer/?ruling=true
https://twitter.com/Politifact/
/staff/jon-greenberg/
https://www.congress.gov/bill/116th-congress/house-resolution/109
https://twitter.com/Citizens_United/status/1377308915227107336
#
#
#
/personalities/joe-biden/
/personalities/facebook-posts/
/factchecks/list/
/personalities/nancy-pelosi/
/texas/
/corrections-and-updates/
https://twitter.com/share?text=PolitiFact - Citizens United calls Biden’s infrastructure plan the Green New Deal. It isn’t.&url=https://www.politifact.com/factchecks/2021/apr/02/citizens-united/citizens-united-calls-bidens-infrastructure-plan-g/


We see lots of relative links (e.g., `/pennsylvania/`), placeholder links whose `href` points nowhere (e.g., `#`), and sharing shortcuts (e.g., `https://twitter.com/share?text=PolitiFact - Citizens United calls...`). We could clean this up by joining relative links to the site's base URL (`https://www.politifact.com/`), dropping the placeholders, and keeping only the URL itself (nothing after the `?`).
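
One way to sketch that cleanup uses only the standard library: `urljoin()` resolves relative links against the site's base URL, and `urlsplit()` lets us drop the query string and fragment. The helper name `clean_link` and the short `hrefs` sample are illustrative assumptions, not part of the original solution.

```python
from urllib.parse import urljoin, urlsplit, urlunsplit

BASE_URL = 'https://www.politifact.com/'

def clean_link(href, base=BASE_URL):
    """Return an absolute URL without query or fragment, or None for placeholders."""
    if not href or href.startswith('#'):
        return None  # empty or placeholder link
    absolute = urljoin(base, href)  # relative links resolve against the base URL
    parts = urlsplit(absolute)
    if parts.scheme not in ('http', 'https'):
        return None  # skip mailto:, javascript:, and similar schemes
    # Rebuild the URL with an empty query string and fragment
    return urlunsplit((parts.scheme, parts.netloc, parts.path, '', ''))

# Hypothetical sample of the kinds of href values printed above
hrefs = ['/pennsylvania/', '#', 'https://twitter.com/share?text=PolitiFact&url=https://www.politifact.com/']
cleaned = [clean_link(h) for h in hrefs]
print(cleaned)
```

In the challenge solution, the same helper could be applied to each `link.get('href')`, filtering out the `None` results.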

## Getting human-readable text <a id='readable'></a>

Not all websites use the `<p>` tag to indicate the important, human-readable text. Sometimes we need to approach HTML parsing from the other end, by finding and removing all non-informative tags. Let's use `BeautifulSoup` to build such a method.

### Challenge

Use `decompose()` to remove from the soup all tags showing anything other than human-readable text. Below is a list of such junk tags to use as a blacklist. 

```
"b", "big", "i", "small", "tt", "abbr", "acronym", "cite", "dfn", "kbd", 
"samp", "var", "bdo", "map", "object", "q", "span", "sub", "sup", "head", 
"title", "[document]", "script", "style", "meta", "noscript"
```

In [3]:
# solution

# Define inline tags for cleaning out HTML
tags_blacklist = ["b", "big", "i", "small", "tt", "abbr", "acronym", "cite", "dfn", "kbd", 
                  "samp", "var", "bdo", "map", "object", "q", "span", "sub", "sup", "head", 
                  "title", "[document]", "script", "style", "meta", "noscript"]

# Get HTML and then soup
url = 'https://www.politifact.com/factchecks/2021/apr/02/citizens-united/citizens-united-calls-bidens-infrastructure-plan-g/'
response = requests.get(url)
soup = BeautifulSoup(response.text, 'html.parser')  # name the parser explicitly to avoid a warning

# Remove non-visible tags from soup with two for-loops:
for tag in tags_blacklist:
    for elem in soup(tag):
        elem.decompose()
        
# Show result
visible = soup.get_text(strip=True)
print(visible[1000:3000])

 March 31, 2021 in a tweet:Says Joe Biden’s infrastructure plan “is the Green New Deal.”The White House infrastructure plan has $111 billion to improve water and sewer systems. (Shutterstock)ByJon GreenbergCitizens United calls Biden’s infrastructure plan the Green New Deal. It isn’t.If Your Time is shortThe White House infrastructure plan would cost about $2.3 trillion. A Green New Deal-type plan would cost $9.5 trillion.The Green New Deal included broader social economic goals, such as a guaranteed livable wage, affordable higher education and universal health care.See the sources for this fact-checkRepublican opposition to President Joe Biden’s infrastructure proposal has been swift and vocal.Senate Minority Leader Mitch McConnell said that as written, the $2.3 trillion American Jobs Plan released March 31 was a nonstarter. The conservative PAC Citizens United put Biden’s plan in the same boat as theGreen New Deal, a sweeping environmental and social justice agenda that Republicans 

You might have noticed that word boundaries get clobbered in the output above (e.g., `ByJon Greenberg`, `theGreen New Deal`). This is because we called `get_text()` with `strip=True`, which tells `BeautifulSoup` to strip whitespace of any kind from the beginning and end of each bit of text; with the default empty separator, the stripped pieces are then joined directly. The default, `strip=False`, preserves word boundaries but leaves lots of extra whitespace (usually newlines), which takes some regular expressions to clean up.
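
As a rough illustration of that kind of cleanup, the sketch below uses only the built-in `re` module on a made-up string standing in for `get_text(strip=False)` output. The `raw` sample and the helper name `squeeze_whitespace` are assumptions for illustration.

```python
import re

def squeeze_whitespace(text):
    """Collapse horizontal whitespace runs to one space and newline runs to one newline."""
    # [^\S\n] matches any whitespace character except a newline (spaces, tabs, \xa0, ...)
    text = re.sub(r"[^\S\n]+", " ", text)
    # Collapse each run of newlines (plus any spaces around them) into a single newline
    text = re.sub(r" ?\n[ \n]*", "\n", text)
    return text.strip()

# Hypothetical sample with blank lines, a tab, and a non-breaking space (\xa0)
raw = "\n\n  By\n\nJon Greenberg\xa0\n\n\n the  Green New Deal \t\n"
print(squeeze_whitespace(raw))
```

Because whitespace is collapsed rather than stripped from each text chunk, words on either side of a tag boundary stay separated.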

### Challenge

Using the above tags blacklist and `decompose()` as before, this time use the `strip=False` parameter when calling `get_text()` to avoid combining words across whitespace boundaries. Instead, use regular expressions to clean up extra whitespaces.

In [4]:
# solution

url = 'https://www.politifact.com/factchecks/2021/apr/02/citizens-united/citizens-united-calls-bidens-infrastructure-plan-g/'
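response = requests.get(url)
soup = BeautifulSoup(response.text, 'html.parser')  # name the parser explicitly to avoid a warning

# Shorter way to remove non-visible tags: soup() accepts the whole list of tag names
for s in soup(tags_blacklist):
    s.decompose()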

# Don't strip spaces in-between elements, to avoid clobbering word boundaries
visible = soup.get_text(strip=False)

import re
#visible = re.sub(r"\n+", "\n", visible) # This works, but is less extensible than below

import regex # third-party package (pip install regex) with better Unicode support than the built-in re

# Use regex to replace all consecutive spaces (including Unicode ones), tabs, or "|"s with a single space
visible = regex.sub(r"[ \t\h\|]+", " ", visible)
# Replace any consecutive line breaks with a single newline
visible = regex.sub(r"[\n\r\f\v]+", "\n", visible)

print(visible)


Donate
State Editions
California
Florida
Illinois
Iowa
Missouri
New York
North Carolina
Pennsylvania
Texas
Virginia
West Virginia
Vermont
Wisconsin
Michigan
Issues
All Issues
Online hoaxes
Coronavirus
Health Care
Immigration
Taxes
Marijuana
Environment
Crime
Guns
Foreign Policy
People
All People
Joe Biden
Kamala Harris
Charles Schumer
Mitch McConnell
Bernie Sanders
Nancy Pelosi
Donald Trump
Media
PunditFact
Tucker Carlson
Sean Hannity
Rachel Maddow
Rush Limbaugh
Bloggers
Campaigns
2020 Elections
Truth-o-Meter
True
Mostly True
Half True
Mostly False
 False
Pants on Fire
Promises
Biden Promise Tracker
Trump-O-Meter
Obameter
Latest Promises
About Us
Our Process
Our Staff
Who pays for Politifact?
Advertise with Us
Suggest a Fact-check
Corrections and Updates
Donate
 Follow us
The Facts Newsletter
Sign up
Stand up for the facts!
Misinformation isn't going away just because it's a new year. Support trusted, factual information with a tax deductible contribution to PolitiFact.
More Info
I wo

### Challenge

You might have noticed that when we scraped HTML above from [this claim review by PolitiFact](https://www.politifact.com/factchecks/2021/apr/02/citizens-united/citizens-united-calls-bidens-infrastructure-plan-g/), we got headers and tags like this:
```html
<p>Misinformation isn't going away just because it's a new year. Support trusted, factual information with a tax deductible contribution to PolitiFact.</p>
<p>
<a class="m-disruptor-content__link" href="/membership/">More Info</a>
</p>
<p class="c-image__caption-inner copy-xs">
The White House infrastructure plan has $111 billion to improve water and sewer systems. (Shutterstock)
</p>
```
Use what you now know about identifying HTML, removing tags, and cleaning spacing to scrape a clean explanation from the body of this article. 

_Hint:_ Use your browser to inspect this website's HTML and identify any unique types and/or classes that enclose the explanation (and nothing else).

In [5]:
# solution

# Set URL to scrape
url = 'https://www.politifact.com/factchecks/2021/apr/02/citizens-united/citizens-united-calls-bidens-infrastructure-plan-g/'

# Scrape HTML with requests and beautifulsoup
response = requests.get(url)
soup = BeautifulSoup(response.text, 'html.parser')  # name the parser explicitly to avoid a warning

explanation = soup.find('article', class_='m-textblock').get_text() # identify this class from looking at HTML

import re
explanation = re.sub(r"\n+", "\n", explanation)

print(explanation)


Republican opposition to President Joe Biden’s infrastructure proposal has been swift and vocal.
Senate Minority Leader Mitch McConnell said that as written, the $2.3 trillion American Jobs Plan released March 31 was a nonstarter. The conservative PAC Citizens United put Biden’s plan in the same boat as the Green New Deal, a sweeping environmental and social justice agenda that Republicans have condemned.
"Does this sound like an infrastructure bill to you?" the group tweeted March 31, with a link to a New York Times article about the proposal. "It's not. It's the Green New Deal. "
Does this sound like an infrastructure bill to you? It's not. It's the Green New Deal. "It is the first step in a two-part agenda to overhaul American capitalism, fight climate change and attempt to improve the productivity of the economy."https://t.co/ajIoRCttgl— Citizens United (@Citizens_United) March 31, 2021 
The Times article described Biden’s plan as the first step in a legislative package that aimed

Compare the output from this focused, site-specific scraping approach with that from the blacklist method above. <br/>
**Which method gives the cleaner output? Which method is more extensible?**

# Crawling broadly with `Scrapy` <a id='scrapy'> </a>

## Create a project

## A simple (narrow) spider <a id='simple'> </a>

```python
import scrapy

class SimpleSpider(scrapy.Spider):
    name = "simple"
    start_urls = [
        'http://quotes.toscrape.com/page/1/',
        'http://quotes.toscrape.com/page/2/',
    ]

    def parse(self, response):
        page = response.url.split("/")[-2]
        filename = 'quotes-%s.html' % page
        with open(filename, 'wb') as f:
            f.write(response.body)
```

### How to run the spider

```shell
$ scrapy crawl simple
```

### Challenge
Modify and run the spider script above to scrape this short list of `start_urls`: 
```python
['http://www.baylessk12.org/', 'https://crcc.doniphanr1.k12.mo.us/', 'https://www.hazelwoodschools.org/southeastmiddle']
 ```
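
One wrinkle with these URLs: `response.url.split("/")[-2]` picks out the page number on quotes.toscrape.com, but for the three URLs above it returns the domain name, so two pages from the same site would overwrite each other (and the `quotes-` prefix no longer fits). A possible workaround, sketched with the standard library only, is to build the filename from the whole URL. The helper name `url_to_filename` is a hypothetical addition; inside the spider's `parse()` it could be called as `filename = url_to_filename(response.url)`.

```python
import re
from urllib.parse import urlsplit

def url_to_filename(url):
    """Build a filesystem-friendly .html filename from a URL's domain and path."""
    parts = urlsplit(url)
    # Replace every run of non-alphanumeric characters with a single hyphen
    slug = re.sub(r"[^A-Za-z0-9]+", "-", parts.netloc + parts.path).strip("-")
    return f"{slug}.html"

start_urls = [
    'http://www.baylessk12.org/',
    'https://crcc.doniphanr1.k12.mo.us/',
    'https://www.hazelwoodschools.org/southeastmiddle',
]
for url in start_urls:
    print(url_to_filename(url))
```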

## Extracting data

### Challenge
Inspect [quotes.toscrape.com](http://quotes.toscrape.com) for the selectors associated with quotes. Use this information to display the text of one of the quotes in the Scrapy shell. <br>
**Hint 1:** If you need help getting a better sense of website structure, use the HTML tree below (under "Extracting quotes and authors") as a visual guide.<br>
**Hint 2:** You can nest selectors by combining periods and spaces: a period joins an element type to a class, and a space selects descendants. For instance, the following produces a `SelectorList` of every `type2` element with class `class2` that sits inside a `type1` element with class `class1`:
```shell
response.css('type1.class1 type2.class2')
```

**Solution**

```shell
response.css('div.quote span.text::text').get()
```

## Link extraction in a (broad) spider <a id='linkextraction'> </a>

### Challenge

Adapt the `CrawlSpider` in `broad.py` to scrape the text, author, and tags for each quote across all the pages on `http://quotes.toscrape.com`. Assign the `text`, `author`, and `tags` fields to Items, then yield the Items. Edit the spider script first, then run it via your terminal, then check the output to make sure it worked.

```python
# solution

from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule

from schools.items import SchoolsItem

class BroadSpider(CrawlSpider):
    name = 'broad'
    allowed_domains = ['quotes.toscrape.com']
    start_urls = ['http://quotes.toscrape.com/']  # start URLs need a scheme

    rules = (
        Rule(LinkExtractor(), 
             callback='parse_item', 
             follow=True),
    )

    def parse_item(self, response):
        for quote in response.css('div.quote'):
            item = SchoolsItem() # initialize

            # No trailing commas here: they would wrap each value in a tuple
            item['text'] = quote.css('span.text::text').get()
            item['author'] = quote.css('small.author::text').get()
            item['tags'] = quote.css('div.tags a.tag::text').getall()

            yield item
```

Call it like so:
```shell
scrapy crawl broad -o quotes_broad.json
```

### Challenge

Use what you learned above about removing tags with a blacklist to rewrite the `parse_item()` function in the `BroadSpider()` so that it doesn't depend on website structure (HTML, CSS, XPath, etc.). In other words, write a truly broad crawler that only returns text.

Make sure to clean up the spacing: convert multiple newlines into a single one or a space, depending on the output format you want. 

_Hint:_ Check your output--is it missing anything important? Consider removing specific tags from the blacklist. 


```python
# solution

import re
from bs4 import BeautifulSoup
...
def parse_item(self, response):
    item = SchoolsItem()

    # Load HTML into BeautifulSoup
    soup = BeautifulSoup(response.body, 'html.parser') # built-in parser; 'lxml' is faster if installed

    # Remove non-visible tags from soup (tags_blacklist is the list defined earlier)
    for tag in soup(tags_blacklist):
        tag.decompose()

    # Extract text without stripping each chunk, to avoid running words together
    visible_text = soup.get_text(strip=False)

    # Collapse any run of whitespace (spaces, newlines, `\xa0`, etc.) into a single space
    visible_text = re.sub(r"\s+", " ", visible_text).strip() # remove leading/trailing whitespace too

    item['text'] = visible_text # assign text to item
    item['url'] = response.url # assign url too

    return item
```