* Set up a Scrapy project using the Scrapy CLI
* Review Folder Structure
* Add Spider to the Project
* Update Settings to write to a JSON file
* Run and Validate the Project
* Exercise and Solution - Scrape Data to JSON Files

* Setting up Project using Scrapy CLI

1. Run `scrapy startproject quotes` to set up the Scrapy project.
2. Check that a folder named `quotes` has been created.
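
The check in step 2 can be made repeatable with a small script that looks for the files `scrapy startproject` generates. This is a minimal sketch: the helper name `missing_project_files` is hypothetical, and the file list assumes the default project template.

```python
from pathlib import Path

# Files generated by the default `scrapy startproject quotes` template,
# relative to the outer `quotes` folder.
EXPECTED_FILES = [
    'scrapy.cfg',
    'quotes/__init__.py',
    'quotes/items.py',
    'quotes/middlewares.py',
    'quotes/pipelines.py',
    'quotes/settings.py',
    'quotes/spiders/__init__.py',
]


def missing_project_files(project_root):
    """Return the expected files that do not exist under project_root."""
    root = Path(project_root)
    return [name for name in EXPECTED_FILES if not (root / name).is_file()]
```

Calling `missing_project_files('quotes')` from the directory where you ran `scrapy startproject` should return an empty list when the project was created correctly.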

* Review Folder Structure

1. Check the configuration file `scrapy.cfg`
2. Review `spiders` folder
3. Review `settings.py`
4. Review `pipelines.py`
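
As a reference point when reviewing `pipelines.py`, here is a minimal sketch of what an item pipeline can look like. Scrapy calls `process_item` for every item a spider yields. The class name `CleanQuotePipeline` and the whitespace cleanup are assumptions for illustration and are not part of the generated template; to take effect, such a class would also have to be listed under `ITEM_PIPELINES` in `settings.py`.

```python
class CleanQuotePipeline:
    """Hypothetical pipeline that trims whitespace from string fields."""

    def process_item(self, item, spider):
        # Scrapy passes each scraped item here; the returned item moves on
        # to the next pipeline or to the feed export.
        for key, value in item.items():
            if isinstance(value, str):
                item[key] = value.strip()
        return item


# Quick check outside Scrapy: the spider argument is unused, so None is fine here.
cleaned = CleanQuotePipeline().process_item(
    {'quoteText': '\n  “Be yourself.”\n  ', 'authorOrTitle': None}, spider=None
)
print(cleaned)
```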

* Add Spider to the Project

1. Add a program file named `quotes_spider.py` under the `quotes/quotes/spiders` folder.
2. Review and understand the code. It processes the quotes on all 100 pages available under the base URL https://www.goodreads.com/quotes

```python
import scrapy


def generate_urls(base_url):
    urls = []
    for i in range(1, 101):
        urls.append(f'{base_url}?page={i}')
    return urls

    
class QuoteSpider(scrapy.Spider):
    name = 'quotes'
    start_urls = generate_urls('https://www.goodreads.com/quotes')

    def parse(self, response):
        for quoteDetails in response.css('.quoteDetails'):
            payload = {
                'quoteText': quoteDetails.css('.quoteText::text').get(),
                'authorOrTitle': quoteDetails.css('span.authorOrTitle::text').get()
            }
            yield payload
```
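
To see what `start_urls` contains before running the spider, `generate_urls` can be called on its own. The sketch below repeats the function as a list comprehension (an equivalent rewrite) so it runs without Scrapy installed; the `pages` parameter is a hypothetical addition.

```python
def generate_urls(base_url, pages=100):
    """Build one URL per page; `pages` is a hypothetical parameter added here."""
    return [f'{base_url}?page={i}' for i in range(1, pages + 1)]


urls = generate_urls('https://www.goodreads.com/quotes')
print(len(urls))   # 100
print(urls[0])     # https://www.goodreads.com/quotes?page=1
print(urls[-1])    # https://www.goodreads.com/quotes?page=100
```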

* Using follow to navigate via next page

You can also scrape all the pages by following the next-page link (`a.next_page`) with `response.follow` instead of generating every URL up front.

```python
import hashlib
import scrapy

    
class QuoteSpider(scrapy.Spider):
    name = 'quotes'
    start_urls = ['https://www.goodreads.com/quotes']

    def parse(self, response):
        for quoteDetails in response.css('.quoteDetails'):
            # Default to an empty string so a missing quote does not fail on encode().
            quote_text = quoteDetails.css('.quoteText::text').get(default='')
            payload = {
                # Hash each quote on its own; a shared hash object would
                # accumulate the text of every earlier quote on the page.
                'quoteTextHash': hashlib.sha256(quote_text.encode()).hexdigest(),
                'quoteText': quote_text,
                'authorOrTitle': quoteDetails.css('span.authorOrTitle::text').get(),
                'authorOrTitleUrl': quoteDetails.css('a.authorOrTitle::attr(href)').get(),
                'authorOrTitleText': quoteDetails.css('a.authorOrTitle::text').get()
            }
            yield payload

        # Follow the next-page link, stopping once the page marks a link as disabled.
        for next_page in response.css('a.next_page'):
            if response.css('a.disabled'):
                break
            yield response.follow(next_page, self.parse)
```
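
The `quoteTextHash` field is meant to identify a quote by its text. A minimal sketch of that idea as a standalone helper follows; the name `quote_hash` and the whitespace stripping are assumptions and not part of the spider. Creating a fresh SHA-256 object per quote gives the same digest for the same text, wherever it appears on a page.

```python
import hashlib


def quote_hash(quote_text):
    """Return the SHA-256 hex digest of a quote, or None when there is no text."""
    if quote_text is None:
        return None
    # A new hash object per call keeps each digest independent of other quotes.
    return hashlib.sha256(quote_text.strip().encode('utf-8')).hexdigest()


first = quote_hash('Be yourself; everyone else is already taken.')
second = quote_hash('  Be yourself; everyone else is already taken.\n')
print(first == second)    # True: surrounding whitespace is ignored
print(quote_hash(None))   # None
```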

* Update Settings to write to a JSON file

1. Go to `settings.py` in the Scrapy project folder.
2. Append the text below to the file and save it. You can also specify the full path for the output file.

```python
FEEDS = {
    'quotes.json': {
        'format': 'json',
        'overwrite': True
    }
}
```
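
For reference, a variant of the same setting with a full output path might look like the sketch below. The path `/tmp/scrapy_output/quotes.json` is a made-up example. `encoding` and `indent` are optional feed options: the first keeps non-ASCII characters such as curly quotes readable, and the second pretty-prints the output.

```python
FEEDS = {
    # Hypothetical absolute path; replace it with a location that suits your machine.
    '/tmp/scrapy_output/quotes.json': {
        'format': 'json',
        'encoding': 'utf8',   # write non-ASCII characters as-is instead of \u escapes
        'indent': 4,          # pretty-print each item
        'overwrite': True,
    }
}
```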

* Run and Validate the Project

1. Run `scrapy crawl quotes` to run the spider in the Scrapy project.
2. Review the data in the output file, `quotes.json`.

Note: If you run the project multiple times, the file will be overwritten.
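
Reviewing the output can be partly automated. The sketch below loads the feed file and counts items with missing fields; the function name `summarize_quotes` and the choice of required keys are assumptions made for illustration.

```python
import json

# Fields every scraped item is expected to carry (an assumption for this check).
REQUIRED_KEYS = ('quoteText', 'authorOrTitle')


def summarize_quotes(path, required_keys=REQUIRED_KEYS):
    """Return (item_count, incomplete_count) for a JSON feed file."""
    with open(path, encoding='utf-8') as f:
        items = json.load(f)   # the json feed format is a single JSON array
    incomplete = [
        item for item in items
        if any(item.get(key) is None for key in required_keys)
    ]
    return len(items), len(incomplete)
```

Calling `summarize_quotes('quotes.json')` from the folder where you ran the crawl returns the total number of items and how many of them lack a quote text or an author.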

* Exercise - Scrape Data to JSON Files

Scrape the quote text, author or title, author or title URL, and author or title URL text into JSON format.

1. Make sure you have a project named `quotes`.
2. Define a spider to scrape all 100 pages.
3. Save the quote text, author or title, author or title URL, and author or title URL text to a JSON file named `quotes.json`.

* Solution - Scrape Data to JSON Files

Scrape the quote text, author or title, author or title URL, and author or title URL text into JSON format.

1. Make sure you have a project named `quotes`.

```shell
scrapy startproject quotes
```

2. Define a spider to scrape all 100 pages.

Create a file named `quotes_spider.py` in the `spiders` folder.

3. Save the quote text, author or title, author or title URL, and author or title URL text to a JSON file named `quotes.json`.

Update `quotes_spider.py` with the code below.

```python
import scrapy


def generate_urls(base_url):
    urls = []
    for i in range(1, 101):
        urls.append(f'{base_url}?page={i}')
    return urls

    
class QuoteSpider(scrapy.Spider):
    name = 'quotes'
    start_urls = generate_urls('https://www.goodreads.com/quotes')

    def parse(self, response):
        for quoteDetails in response.css('.quoteDetails'):
            payload = {
                'quoteText': quoteDetails.css('.quoteText::text').get(),
                'authorOrTitle': quoteDetails.css('span.authorOrTitle::text').get(),
                'authorOrTitleUrl': quoteDetails.css('a.authorOrTitle::attr(href)').get(),
                'authorOrTitleText': quoteDetails.css('a.authorOrTitle::text').get()
            }
            yield payload
```

Make sure the setting below is added to `settings.py`.

```python
FEEDS = {
    'quotes.json': {
        'format': 'json',
        'overwrite': True
    }
}
```

Run `scrapy crawl quotes` to add the data to `quotes.json`.