## Shell
test data extraction from a webpage interactively in the terminal
### in terminal:
start shell: `scrapy shell`  

### in scrapy shell:
- fetch page: `fetch("http://www.alice-in-wonderland.net/resources/chapters-script/alices-adventures-in-wonderland/chapter-1/")`  
- open downloaded page in browser: `view(response)`
- scrape text data: `response.xpath('//h1/text()').extract()[0]`
    - output: `'Chapter 1: Down the Rabbit-Hole'`

# Set-up Project
data collection directories and files

## in terminal:
- create project: `scrapy startproject *project_name*`
    - creates `project_name` directory
- `project_name` folder structure:
    - `project_name` folder
        - `__pycache__` folder (appears once project code has been run)
        - `spiders` folder
            - `__pycache__` folder (appears once project code has been run)
            - `__init__.py` file
        - `__init__.py` file
        - `items.py` file
        - `middlewares.py` file
        - `pipelines.py` file
        - `settings.py` file
    - `scrapy.cfg` file
- create `*scraper*.py` file in `project_name/project_name/spiders` folder:
    - `touch project_name/project_name/spiders/scraper.py`
    - create spider scraper in this file
- run spider (in project top level directory): 
    - `cd project_name`
    - `scrapy crawl *spider_name*`
- output file is written to the project top level directory (the directory `scrapy crawl` is run from)
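
The folder structure listed above can be checked after `scrapy startproject`. This is a rough sketch using only the standard library; the helper names `expected_files` and `missing_files` are hypothetical, and the file list is simply the one given in these notes.

```python
from pathlib import Path

def expected_files(project_name):
    """Files from the folder structure above, relative to the top level directory."""
    inner = Path(project_name)
    return [
        Path("scrapy.cfg"),
        inner / "__init__.py",
        inner / "items.py",
        inner / "middlewares.py",
        inner / "pipelines.py",
        inner / "settings.py",
        inner / "spiders" / "__init__.py",
    ]

def missing_files(top_level_dir, project_name):
    """Return the expected files that do not exist under top_level_dir."""
    root = Path(top_level_dir)
    return [p.as_posix() for p in expected_files(project_name) if not (root / p).is_file()]
```

- usage: `missing_files("project_name", "project_name")` should return an empty list when the layout is complete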

## example - scrapy_alice
- `scrapy startproject scrapy_alice`
- `touch scrapy_alice/scrapy_alice/spiders/scraper.py`

### scraper.py script
in spiders directory:

```python
import scrapy

# create spider class for scraping
class AliceSpider(scrapy.Spider):

    # name to call spider in terminal
    name = 'alice_spider'

    # set spider settings
    custom_settings = {
        "DOWNLOAD_DELAY": 3, # seconds to wait between requests to the same site
        "CONCURRENT_REQUESTS_PER_DOMAIN": 3, # max concurrent requests per domain
        "FEED_FORMAT": 'csv', # output file type (deprecated since Scrapy 2.1 in favour of FEEDS)
        "FEED_URI": 'alice_data.csv' # output file name (deprecated since Scrapy 2.1 in favour of FEEDS)
    }

    # initial webpages
    start_urls = [
        'http://www.alice-in-wonderland.net/resources/chapters-script/alices-adventures-in-wonderland/chapter-1/'
    ]
    
    # default callback: Scrapy calls "parse" with the response for each url in start_urls
    # function to list webpages
    def parse(self, response):
        # iterate through the href values of the matching hyperlink elements
        for link in response.xpath('//ul[@class="sub-menu-ul"]/ul/li/a/@href').extract():
            # request each page and handle its response in "parse_webpage"
            yield scrapy.Request(
                url=response.urljoin(link), # absolute url for request (resolves relative links)
                callback=self.parse_webpage, # function to be called with the response
                meta={'url': link} # metadata passed along to the callback
            )

    # function to retrieve data from each webpage
    def parse_webpage(self, response):
        # url passed along from "parse" via request metadata (not written to the output)
        url = response.request.meta['url']
        # scrape elements; .get() returns the first match, or None if nothing matches
        title = response.xpath('//h1/text()').get()
        first_paragraph = response.xpath('//main/article/p/text()').get()

        # yield scraped data (to file)
        yield {
            'title': title,
            'paragraph': first_paragraph
        }
        
```
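
`scrapy.Request` needs an absolute URL, so any `href` values that turn out to be relative have to be resolved against the page URL first. The sketch below uses the standard library's `urljoin` to show how that resolution behaves; inside a spider, `response.urljoin(link)` does similar work. Both `href` values are hypothetical and only for illustration.

```python
from urllib.parse import urljoin

# page url taken from start_urls
page_url = 'http://www.alice-in-wonderland.net/resources/chapters-script/alices-adventures-in-wonderland/chapter-1/'

# hypothetical href values, for illustration only
relative_href = '../chapter-2/'
absolute_href = 'http://www.alice-in-wonderland.net/'

# a relative link is resolved against the page url
resolved_relative = urljoin(page_url, relative_href)
# an absolute link is returned unchanged
resolved_absolute = urljoin(page_url, absolute_href)

print(resolved_relative)
print(resolved_absolute)
```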

### run spider
- `cd scrapy_alice`
- `scrapy crawl alice_spider`
- `alice_data.csv` file created in `scrapy_alice` folder
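
Once the crawl finishes, the output can be read back with the standard `csv` module. This is a sketch only: to stay self-contained it reads an in-memory string shaped like `alice_data.csv`, assuming a header line with the `title` and `paragraph` keys yielded by `parse_webpage`. The paragraph value is a placeholder, not real scraped text, and `read_rows` is a hypothetical helper name.

```python
import csv
import io

# placeholder content shaped like alice_data.csv (not real scraped output)
sample_csv = "title,paragraph\r\nChapter 1: Down the Rabbit-Hole,placeholder paragraph text\r\n"

def read_rows(file_obj):
    """Return each csv row as a dict keyed by the header line."""
    return list(csv.DictReader(file_obj))

rows = read_rows(io.StringIO(sample_csv))
for row in rows:
    print(row['title'], '|', row['paragraph'])

# for the real file, run from the scrapy_alice folder:
# with open('alice_data.csv', newline='', encoding='utf-8') as f:
#     rows = read_rows(f)
```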