# Crawling broadly with `Scrapy`


## Outline

* [Introduction to `Scrapy`](#scrapy)
    * [A simple (narrow) spider](#simple)
    * [Link extraction in a (broad) spider](#linkextraction)

**__________________________________**

# Introduction to `Scrapy` <a id='scrapy'> </a>

## Create a project

## A simple (narrow) spider <a id='simple'> </a>

```python
import scrapy

class SimpleSpider(scrapy.Spider):
    name = "simple"
    start_urls = [
        'http://quotes.toscrape.com/page/1/',
        'http://quotes.toscrape.com/page/2/',
    ]

    def parse(self, response):
        page = response.url.split("/")[-2]
        filename = 'quotes-%s.html' % page
        with open(filename, 'wb') as f:
            f.write(response.body)
```

### How to run the spider

```shell
$ scrapy crawl simple
```

### Challenge
Modify and run the spider script above to scrape this short list of `start_urls`:
```python
[
    'http://www.baylessk12.org/',
    'https://crcc.doniphanr1.k12.mo.us/',
    'https://www.hazelwoodschools.org/southeastmiddle',
]
```
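
One thing to watch for: the `parse()` method above names each file after the second-to-last piece of the URL, which is the page number for `/page/1/` but the hostname for the school URLs. The sketch below shows one possible way to build a filename from any URL using only the standard library. The helper name `url_to_filename` is hypothetical, not part of Scrapy.

```python
import re
from urllib.parse import urlparse


def url_to_filename(url):
    """Build a filesystem-safe .html filename from a URL (hypothetical helper)."""
    parsed = urlparse(url)
    # Combine host and path so pages from the same site get different names
    raw = parsed.netloc + parsed.path
    # Replace every run of characters other than letters, digits, dots, and hyphens
    safe = re.sub(r"[^A-Za-z0-9.-]+", "_", raw).strip("_")
    return safe + ".html"


urls = [
    'http://www.baylessk12.org/',
    'https://crcc.doniphanr1.k12.mo.us/',
    'https://www.hazelwoodschools.org/southeastmiddle',
]
for url in urls:
    # Compare the original naming scheme with the helper's result
    print(url.split("/")[-2], "->", url_to_filename(url))
```

Inside the spider, this would mean replacing the two lines that compute `page` and `filename` with `filename = url_to_filename(response.url)`.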

## Extracting data

### Challenge
Inspect [quotes.toscrape.com](http://quotes.toscrape.com) to find the selectors associated with quotes. Use this information to display the text of one of the quotes in the Scrapy shell. <br>
**Hint 1:** If you need help getting a better sense of website structure, use the HTML tree below (under "Extracting quotes and authors") as a visual guide.<br>
**Hint 2:** You can narrow a selection by combining CSS selectors: a period joins an element type to a class (`type1.class1`), and a space selects descendants. For instance, the following returns a `SelectorList` of every `type2` element with class `class2` nested inside a `type1` element with class `class1`:
```python
response.css('type1.class1 type2.class2')
```

**Solution**

```python
response.css('div.quote span.text::text').get()
```

## Link extraction in a (broad) spider <a id='linkextraction'> </a>

### Challenge

Adapt the `CrawlSpider` in `broad.py` to scrape the text, author, and tags for each quote across all the pages on `http://quotes.toscrape.com`. Assign the `text`, `author`, and `tags` fields to Items, then yield the Items. Edit the spider script first, then run it from your terminal, then check the output to make sure each quote has all three fields.

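```python
# solution

from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule

from schools.items import SchoolsItem  # must define `text`, `author`, and `tags` fields


class BroadSpider(CrawlSpider):
    name = 'broad'
    allowed_domains = ['quotes.toscrape.com']
    start_urls = ['http://quotes.toscrape.com/']  # the URL scheme is required

    # Follow every link within the allowed domain and parse each page reached
    rules = (
        Rule(LinkExtractor(),
             callback='parse_item',
             follow=True),
    )

    def parse_item(self, response):
        for quote in response.css('div.quote'):
            item = SchoolsItem()  # initialize

            # No trailing commas here: they would wrap each value in a tuple
            item['text'] = quote.css('span.text::text').get()
            item['author'] = quote.css('small.author::text').get()
            item['tags'] = quote.css('div.tags a.tag::text').getall()

            yield item
```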

Call it like so:
```shell
scrapy crawl broad -o quotes_broad.json
```
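
One way to check the output is to load the JSON and confirm that every record has the expected fields and types. The sketch below is a minimal example under stated assumptions: the function name `find_problems` is hypothetical, and the sample feed is made up so the block runs without the real `quotes_broad.json`.

```python
import json


def find_problems(records):
    """Return (index, message) pairs for records that look malformed."""
    problems = []
    for index, record in enumerate(records):
        # text and author should each be a non-empty string
        for field in ("text", "author"):
            if not isinstance(record.get(field), str) or not record[field]:
                problems.append((index, f"{field!r} should be a non-empty string"))
        # tags should be a list (it may be empty)
        if not isinstance(record.get("tags"), list):
            problems.append((index, "'tags' should be a list"))
    return problems


# Made-up feed: the second record has its text wrapped in a list,
# and the third record has no text at all
SAMPLE_FEED = """
[
  {"text": "A sample quote.", "author": "Sample Author", "tags": ["sample"]},
  {"text": ["Wrapped by mistake."], "author": "Sample Author", "tags": []},
  {"author": "Sample Author", "tags": ["demo"]}
]
"""

records = json.loads(SAMPLE_FEED)
problems = find_problems(records)
for index, message in problems:
    print(f"record {index}: {message}")
```

To check the real output, replace `json.loads(SAMPLE_FEED)` with `json.load(f)` inside `with open('quotes_broad.json', encoding='utf-8') as f:`. Note that `-o` appends to an existing file, so rerunning the crawl without deleting `quotes_broad.json` first can leave a file that `json.load` rejects.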

### Challenge

Use what you learned above about removing tags with an exclusion list to rewrite the `parse_item()` function in the `BroadSpider()` so that it doesn't depend on website structure (HTML, CSS, XPath, etc.). In other words, write a truly broad crawler that only returns text.

Make sure to clean up the spacing: convert each run of newlines into a single newline or a single space, depending on the output format you want.

_Hint:_ Check your output: is it missing anything important? If so, consider removing specific tags from the exclusion list.


```python
# solution

import re
from bs4 import BeautifulSoup
...
def parse_item(self, response):
    item = SchoolsItem()

    # Load the HTML into BeautifulSoup
    soup = BeautifulSoup(response.body, 'html.parser')  # built-in parser; 'lxml' is faster

    # Remove non-visible tags from the soup (`tags_exclusions` is the exclusion list from above)
    for tag in soup(tags_exclusions):
        tag.decompose()

    # Extract the text without markup; leave Unicode spacing (e.g., `\xa0`) for now to avoid running words together
    visible_text = soup.get_text(strip=False)

    # Collapse each run of spaces, tabs, and line breaks into a single space, then trim both ends
    visible_text = re.sub(r"[ \t\r\n]+", " ", visible_text).strip()

    item['text'] = visible_text  # assign text to item
    item['url'] = response.url  # assign url too (requires a `url` field on the item)

    return item
```
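
The spacing cleanup described in the challenge can be tried on its own, away from the spider. The sketch below shows both output formats mentioned there, using only the standard library. The function names `collapse_to_spaces` and `collapse_to_lines` and the sample string are hypothetical.

```python
import re


def collapse_to_spaces(text):
    """Replace every run of spaces, tabs, and line breaks with one space."""
    return re.sub(r"[ \t\r\n]+", " ", text).strip()


def collapse_to_lines(text):
    """Keep one newline between non-empty lines and tidy spacing within each line."""
    lines = (re.sub(r"[ \t\r]+", " ", line).strip(" ") for line in text.split("\n"))
    return "\n".join(line for line in lines if line)


# Made-up text resembling what get_text() returns for a page with a menu
sample = "Home \n\n\n  About  us\n \n Contact\xa0us \n"

print(repr(collapse_to_spaces(sample)))
print(repr(collapse_to_lines(sample)))
```

Both functions leave the non-breaking space (`\xa0`) inside `Contact\xa0us` untouched because the character classes list only plain spaces, tabs, and line breaks; using `\s` instead would also match `\xa0`.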