# Scrapy

**Scrapy is a powerful web scraping library that can crawl and scrape entire websites quickly and with very little code.**



Scrapy is the library, and `Spider` is the class that we need to subclass. We write the subclass; Scrapy creates the instance and runs it for us.

It's called a spider because it can crawl around into tiny spaces!

A spider needs 4 things to work:

1. A name, e.g. `name = 'crypto'`

2. Start pages (the starting URLs), e.g. `start_urls = ['https://www.coingecko.com/en']`

3. The allowed domain: the website domain it has to stay on, so it doesn't follow hyperlinks off to other sites! e.g. `allowed_domains = ['coingecko.com']`

4. What data to get (& what to do with it!)
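To make point 3 concrete, here is a rough sketch of the kind of check that keeps a spider on its allowed domain. It is only an illustration of the idea using the standard library, not Scrapy's own code, and `is_allowed` is a made-up helper name.

```python
from urllib.parse import urlparse


def is_allowed(url, allowed_domains):
    """Hypothetical helper: True if the URL's host is an allowed domain or a subdomain of one."""
    host = urlparse(url).hostname or ""
    return any(host == d or host.endswith("." + d) for d in allowed_domains)


allowed = ["coingecko.com"]

on_site = is_allowed("https://www.coingecko.com/en", allowed)    # subdomain of coingecko.com
off_site = is_allowed("https://example.com/page", allowed)       # a different site
lookalike = is_allowed("https://notcoingecko.com/", allowed)     # similar name, different domain
```

Links like the first one would be followed; the other two would be skipped.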

**The function always has the same form:**

`def parse(self, response):`
    
Inside it, give the spider its instructions: how to find the information and what to do with it.

**XPath** - our `parse` always uses an XPath - it is a query language! It guides your spider to particular pieces of HTML code. Takes the form `response.xpath('//....')`


- `/` - selects direct children only, e.g. `body/td` would start at body and look just inside it for something called td

- `//` - searches all the way down, at any depth, until it reaches our td
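The difference between `/` and `//` can be tried out without running a spider. This small sketch uses Python's built-in `xml.etree.ElementTree`, which supports a subset of XPath, as a stand-in for `response.xpath`; the tiny HTML snippet is made up for the illustration.

```python
import xml.etree.ElementTree as ET

# A made-up, well-formed snippet: one td directly inside body, one nested deeper
html = """
<body>
  <td>direct child</td>
  <table><tr><td>nested cell</td></tr></table>
</body>
""".strip()

body = ET.fromstring(html)

# "./td" looks only at the direct children of body
direct = [td.text for td in body.findall("./td")]

# ".//td" searches every level below body
anywhere = [td.text for td in body.findall(".//td")]

print(direct)
print(anywhere)
```

The single slash finds only the td sitting directly inside body, while the double slash also finds the one buried in the table.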

Prices and coins at this point are two separate lists. They are subsequently zipped together pair by pair, and each pair becomes its own dictionary (one per coin)!
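A quick sketch of that zipping step on its own. The coin names and prices below are made-up placeholder values, not scraped data.

```python
# Placeholder values standing in for what the two XPath queries return
coins = ["Bitcoin", "Ethereum"]
prices = ["$100", "$10"]

# zip pairs the first coin with the first price, the second with the second, and so on
items = [{"coin": c, "price": p} for c, p in zip(coins, prices)]

print(items)
```

Note that `zip` stops at the shorter list, which is why it is worth checking that both lists have the same length first.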

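Yield is like return, but instead of ending the function it hands back one value, pauses, and carries on from the same spot when the next value is asked for. This makes the function a generator! **WE ALWAYS USE YIELD, not return**, so that one `parse` call can hand back many items and follow-up requests!

Whatever is yielded is handed over to Scrapy: items to be stored or exported, requests to be put into a queue for download.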
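Here is a minimal sketch of how `yield` behaves, in plain Python with no Scrapy involved. `parse_like` is a made-up function that only imitates the shape of `parse`.

```python
log = []


def parse_like(rows):
    """Made-up stand-in for parse: yields one dictionary per row."""
    for row in rows:
        log.append(f"producing {row}")
        yield {"coin": row}


gen = parse_like(["bitcoin", "ethereum"])  # nothing has run yet, we only have a generator

first = next(gen)             # runs the function up to the first yield, then pauses
log_after_first = list(log)   # only the first row has been produced so far

rest = list(gen)              # resumes after the first yield and collects what is left
```

The function stops at each `yield` and picks up again from there, which is what lets Scrapy start handling the first item before the rest have been produced.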

`DEPTH_LIMIT` caps how many links deep the spider goes from the start page, so here it puts a cap on how many times it clicks on next for the next page!

### Code example:

```python
import scrapy
from scrapy.http import Request


class CryptoSpider(scrapy.Spider):

    name = 'crypto'
    allowed_domains = ['coingecko.com']
    start_urls = ['https://www.coingecko.com/en/']

    custom_settings = {
        'DEPTH_LIMIT': 10,  # follow "next" links at most 10 levels deep
    }

    def parse(self, response):
        # Each query returns a list of strings, one entry per table row
        prices = response.xpath('//td[@class="td-price price text-right"]//a/span[@class="no-wrap"]/text()').getall()
        coins = response.xpath('//td[@class="py-0 coin-name"]/@data-sort').getall()

        # Only pair them up if every coin has a matching price
        if len(coins) == len(prices):
            for c, p in zip(coins, prices):
                item = {'coin': c, 'price': p}
                print(item)      # just to show us in the terminal
                yield item

        # Follow the "next page" link, but only if the page actually has one
        next_url = response.xpath('//a[@rel="next"]/@href').get()
        if next_url is not None:
            next_url = response.urljoin(next_url)   # turn a relative link into a full URL
            print(next_url)
            yield Request(next_url)
```