* Install Scrapy for Web Scraping
* Review the Code of the first spider
* Run and Validate the Application
* Review Website to be scraped
* Create Spider to read quotes
* Update logic to get urls for authors or titles
* Overview of writing data to json
* Exercise and Solution

* Install Scrapy for Web Scraping

You can use the official documentation at https://scrapy.org to install Scrapy and get started quickly.

1. Install Scrapy using `python -m pip install scrapy==2.8.0`. To install the latest version instead, run `python -m pip install scrapy`.
2. In the base folder, create a file named `myspider.py`.
3. Copy and paste this code and save the file. It defines a spider named `blogspider` that scrapes the given website.

```python
import scrapy

class BlogSpider(scrapy.Spider):
    name = 'blogspider'
    start_urls = ['https://www.zyte.com/blog/']

    def parse(self, response):
        for title in response.css('.oxy-post-title'):
            yield {'title': title.css('::text').get()}

        for next_page in response.css('a.next'):
            yield response.follow(next_page, self.parse)
```

* Review the Code of the first spider

1. Go to the URL specified in `myspider.py` and understand how the blog pages are organized.
2. When we run the spider, the logic in `parse` is executed for each URL in the `start_urls` list.
3. The first `for` loop yields the title of each blog post on the current page.
4. The blog posts at the specified URL are paginated. The second `for` loop follows the link to the next page and parses that page with the same `parse` method.

In the end we will get all the blog post/page titles from the website.
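The way `parse` mixes scraped items with follow-up requests can be illustrated with a toy model. The sketch below is not Scrapy's engine; it uses only the standard library, and `FAKE_SITE`, `crawl`, and the `("follow", url)` tuple are hypothetical stand-ins for a real website, the Scrapy scheduler, and the request returned by `response.follow`.

```python
from collections import deque

# Hypothetical in-memory "site": page URL -> (titles on the page, next-page URL or None).
FAKE_SITE = {
    "/blog/": (["Post A", "Post B"], "/blog/page/2/"),
    "/blog/page/2/": (["Post C"], None),
}


def parse(url):
    """Mimic the spider's parse(): yield one item per title, then the next page."""
    page_titles, next_page = FAKE_SITE[url]
    for title in page_titles:
        yield {"title": title}
    if next_page is not None:
        # In Scrapy this would be the request produced by response.follow(...).
        yield ("follow", next_page)


def crawl(start_url):
    """Toy scheduler: keep parsing queued pages until none are left."""
    queue = deque([start_url])
    seen = {start_url}
    items = []
    while queue:
        for result in parse(queue.popleft()):
            if isinstance(result, dict):
                items.append(result)  # a scraped item
            else:
                _, url = result  # a page to visit later
                if url not in seen:
                    seen.add(url)
                    queue.append(url)
    return items


titles = [item["title"] for item in crawl("/blog/")]
print(titles)
```

Titles from both pages end up in one list, which mirrors how the spider collects titles across all paginated pages.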

* Run and Validate the Application

1. Installing Scrapy also provides the `scrapy` CLI. Here are the commands to check the installed version and run the spider.

```shell
scrapy version
scrapy runspider myspider.py
```

2. Run the spider with `scrapy runspider myspider.py` and confirm that the blog post titles appear in the output.

* Review Website to be scraped

Go to https://www.goodreads.com/quotes and review the details.

1. Visit the website and understand how the quotes are organized.
2. View the HTML source of the page and review the CSS classes related to quotes (`quotes`, `quoteDetails`, `quoteText`, etc.).
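One way to get a feel for which classes a page uses is to count them. The sketch below uses only the standard library's `html.parser` (not Scrapy), and `SAMPLE_HTML` is a simplified, hypothetical snippet rather than the real Goodreads markup; `ClassCounter` is likewise an illustrative helper.

```python
from collections import Counter
from html.parser import HTMLParser


class ClassCounter(HTMLParser):
    """Count how often each CSS class appears on opening tags."""

    def __init__(self):
        super().__init__()
        self.classes = Counter()

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name == "class" and value:
                # A class attribute may hold several space-separated names.
                self.classes.update(value.split())


# Hypothetical, simplified markup for illustration only.
SAMPLE_HTML = """
<div class="quotes">
  <div class="quote">
    <div class="quoteDetails">
      <div class="quoteText">Example quote one</div>
    </div>
  </div>
  <div class="quote">
    <div class="quoteDetails">
      <div class="quoteText">Example quote two</div>
    </div>
  </div>
</div>
"""

counter = ClassCounter()
counter.feed(SAMPLE_HTML)
print(counter.classes.most_common())
```

A class that repeats once per quote, such as `quoteDetails` in this sample, is a good candidate for the outer loop of a spider.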

* Create Spider to read quotes

1. Create a new Python file named `quotes_spider.py`.
2. Add the code below to read the quotes, then save the file.

```python
import scrapy


class QuoteSpider(scrapy.Spider):
    name = 'quotes'
    start_urls = ['https://www.goodreads.com/quotes']

    def parse(self, response):
        for quoteDetails in response.css('.quoteDetails'):
            payload = {
                'quoteText': quoteDetails.css('.quoteText::text').get()
            }
            yield payload

        for next_page in response.css('a.next_page'):
            yield response.follow(next_page, self.parse)
```

3. Run the spider and validate that it returns the text of each quote.

* Update logic to get urls for authors or titles

```python
import scrapy


class QuoteSpider(scrapy.Spider):
    name = 'quotes'
    start_urls = ['https://www.goodreads.com/quotes']

    def parse(self, response):
        for quoteDetails in response.css('.quoteDetails'):
            payload = {
                'quoteText': quoteDetails.css('.quoteText::text').get(),
                'authorOrTitle': quoteDetails.css('a.authorOrTitle::attr(href)').get()
            }
            yield payload

        for next_page in response.css('a.next_page'):
            yield response.follow(next_page, self.parse)
```
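If the extracted `href` values turn out to be relative paths, they can be resolved against the page URL. The sketch below shows the idea with the standard library's `urljoin`; the `hrefs` values are hypothetical examples, not real Goodreads links. Inside a spider, `response.urljoin(href)` does the equivalent using the response's own URL.

```python
from urllib.parse import urljoin

page_url = "https://www.goodreads.com/quotes"

# Hypothetical href values, for illustration only.
hrefs = [
    "/author/show/0000.Example_Author",  # relative path
    "https://example.com/book",          # already absolute
    None,                                # selector matched nothing
]

# Skip missing values and resolve the rest against the page URL.
absolute_urls = [urljoin(page_url, href) for href in hrefs if href]

for url in absolute_urls:
    print(url)
```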

* Overview of writing data to json

```python
import json
import scrapy


class QuoteSpider(scrapy.Spider):
    name = 'quotes'
    start_urls = ['https://www.goodreads.com/quotes?page=95']

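    def __init__(self, *args, **kwargs):
        # Let scrapy.Spider finish its own setup before opening the output file.
        super().__init__(*args, **kwargs)
        self.file = open('quotes.json', 'w')

    def closed(self, reason):
        # Scrapy calls closed() when the spider finishes; release the file handle.
        self.file.close()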

    def parse(self, response):
        for quoteDetails in response.css('.quoteDetails'):
            payload = {
                'quoteText': quoteDetails.css('.quoteText::text').get(),
                'authorOrTitleUrl': quoteDetails.css('a.authorOrTitle::attr(href)').get()
            }
            # One JSON object per line (JSON Lines), not a single JSON array.
            self.file.write(f'{json.dumps(payload)}\n')
            yield payload

        for next_page in response.css('a.next_page'):
            yield response.follow(next_page, self.parse)
```
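Because the spider writes one JSON object per line, the resulting file is read back line by line rather than with a single `json.load` call. The sketch below is a minimal reader, assuming that layout; `read_json_lines` and the sample records are hypothetical, and the demo writes to a temporary file instead of the real `quotes.json`.

```python
import json
import os
import tempfile


def read_json_lines(path):
    """Return one dict per non-empty line of a JSON Lines file."""
    records = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:
                records.append(json.loads(line))
    return records


# Hypothetical sample records, written the same way the spider writes them.
sample = [
    {"quoteText": "example quote one", "authorOrTitleUrl": "/author/show/1"},
    {"quoteText": "example quote two", "authorOrTitleUrl": None},
]

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "quotes.json")
    with open(path, "w", encoding="utf-8") as f:
        for payload in sample:
            f.write(f"{json.dumps(payload)}\n")
    records = read_json_lines(path)

print(len(records))
```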

* Exercise - Get URLs from the wiki page

1. Review the content of the Wiki page: https://en.wikipedia.org/wiki/Python_(programming_language)
2. Develop the logic required to get all the external URLs from the above Wiki page. Make sure to get only those URLs which start with **http**.
3. Run the spider and validate that the URLs are retrieved.

* Solution - Get URLs from the wiki page

1. Review the content of the Wiki page: https://en.wikipedia.org/wiki/Python_(programming_language)
2. Develop the logic required to get all the external URLs from the above Wiki page. Make sure to get only those URLs which start with **http**.

Add the logic to `wiki_spider.py`.

```python
import scrapy


class WikiSpider(scrapy.Spider):
    name = 'wikispider'
    start_urls = ['https://en.wikipedia.org/wiki/Python_(programming_language)']

    def parse(self, response):
        for link_tag in response.css('link'):
            url = link_tag.css('::attr(href)').get()
            if url and url.startswith('http'):
                yield {'url': url}
        
        for atag in response.css('a'):
            url = atag.css('::attr(href)').get()
            if url and url.startswith('http'):
                yield {'url': url}
```

3. Run the spider and validate that the URLs are retrieved.

```shell
scrapy runspider wiki_spider.py
```
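The same page often links to a URL more than once, so the output may contain repeats. As an optional refinement, the filter can be pulled into a small helper that also drops duplicates. This is a sketch using plain Python; `external_urls` and the sample `hrefs` are hypothetical and not part of the solution above.

```python
def external_urls(hrefs):
    """Keep hrefs that start with 'http', dropping repeats but preserving order."""
    seen = set()
    result = []
    for href in hrefs:
        if href and href.startswith("http") and href not in seen:
            seen.add(href)
            result.append(href)
    return result


# Hypothetical href values of the kinds a wiki page might contain.
hrefs = [
    "https://www.python.org/",
    "/wiki/Guido_van_Rossum",        # internal link, skipped
    "#cite_note-1",                  # in-page anchor, skipped
    "https://www.python.org/",       # duplicate, skipped
    "//upload.wikimedia.org/x.png",  # protocol-relative, does not start with http
    None,                            # tag without an href
    "http://example.com/",
]

urls = external_urls(hrefs)
print(urls)
```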