### Scrapy

---

### Learning objectives:
* Setting up a scrapy project
* Classes & Inheritance
* Generators
* Xpaths

---

### Setting up a scrapy project

#### 1: Install the packages

First we'll need to `pip install scrapy`

#### 2: We can create a project using the command:

`scrapy startproject <PROJECT>`

This will create a default project. Let's look at the file structure that Scrapy creates.

├── <PROJECT>
│   ├── __init__.py
│   ├── items.py
│   ├── middlewares.py
│   ├── pipelines.py
│   ├── __pycache__
│   ├── settings.py
│   └── spiders
│       ├── __init__.py
│       ├── __pycache__
│       └── scraper.py
└── scrapy.cfg

#### The following files aren't core to learning Scrapy basics

* scrapy.cfg - the Scrapy deployment config file
* items.py - defines containers (Item classes) for scraped data (not necessary; plain dicts work too)
* middlewares.py - custom spider and downloader middleware, as required
* pipelines.py - create a customised pipeline for more complex projects
* settings.py - customise your project, e.g. API settings, crawl depth
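
To give a feel for what lives in `settings.py`, here is a minimal sketch. The setting names are ones Scrapy reads, but the values and the bot name are illustrative assumptions rather than recommendations.

```python
# settings.py (sketch): all values below are illustrative assumptions
BOT_NAME = "lyrics_project"   # hypothetical project name

ROBOTSTXT_OBEY = True         # respect each site's robots.txt rules
DOWNLOAD_DELAY = 1.0          # seconds to wait between requests to the same site
DEPTH_LIMIT = 2               # how many links deep the crawl may go (0 = no limit)
```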

#### We really only care about the code in the spiders folder

---

#### Classes & inheritance
* Classes are a core building block of object-oriented programming languages like Python. They let us wrap data and functions together in a reusable package of code.
* Inheritance is a technique for reusing functionality written in other classes.
* Don't worry too much about these ideas; we can just copy the structure Scrapy gives us.
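
A minimal sketch of both ideas, using made-up class names (`BaseSpider` is a stand-in for a framework base class, not Scrapy's real code). The subclass reuses a method it never wrote and overrides only one attribute, which is roughly the pattern a Scrapy spider follows.

```python
class BaseSpider:
    """Hypothetical stand-in for a base class supplied by a framework."""
    name = "base"

    def describe(self):
        # Written once here, reused by every subclass
        return f"spider named {self.name!r}"


class LyricsSpider(BaseSpider):
    # Inherit describe() unchanged and override only the data we care about
    name = "lyrics"


print(LyricsSpider().describe())  # spider named 'lyrics'
```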

#### Generators
* Generators are iterables that produce their values one piece at a time, whenever the user asks for the next one. They are commonly used:
    * When you don't know exactly how many items your iterable is going to contain
    * When you worry you might have too much data in the iterable to fit into memory
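* You can find generators, and lazy iterators that behave the same way, in the Python wilderness in the following:
    * zip() / enumerate() / open() / df.iterrows()
    * The list goes on.... looks like we could use a generator for it!
* You can create them using the `yield` keyword. This works like `return`, but the function pauses after handing back each value and resumes from the same point when the next value is requested.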
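
A small sketch of `yield` in action. The function name `count_up_to` is made up for this example; the point is that the function body only runs as far as the next `yield` each time a value is requested.

```python
def count_up_to(limit):
    """Yield 1, 2, ..., limit one value at a time."""
    n = 1
    while n <= limit:
        yield n   # pause here and hand n to the caller
        n += 1    # resume from this line when the next value is requested


numbers = count_up_to(3)
print(next(numbers))   # 1
print(next(numbers))   # 2
print(list(numbers))   # [3]  (only the remaining value is left)
```

Scrapy's `parse` method relies on the same mechanism: each `yield` hands one request or item back to the framework.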

#### Xpaths
* Xpaths are a way of expressing the structure of webpages (technically the DOM) more like a file directory. This should make reading xpaths fairly intuitive for UNIX users!

**4 basic statements**:

* `//` = select matching nodes anywhere in the document (irrespective of where)
* `/` = select from the root node, or step down one level from the current node
* `@` = select attributes specifically
* `[]` = select with predicates (i.e. with conditional statements)
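
The statements above can be tried out on a tiny page. This sketch uses the standard library's `xml.etree.ElementTree`, which supports only a subset of XPath, so that it runs without extra packages; the HTML snippet is invented for the example. In Scrapy or lxml, a query such as `//a[@class="image"]/@href` returns the attribute values directly, whereas here they are read with `.get()`.

```python
import xml.etree.ElementTree as ET

# A tiny, well-formed page invented for this example
PAGE = """
<html>
  <body>
    <div id="artists">
      <a class="image" href="/artist-one.html">Artist One</a>
      <a class="text" href="/about.html">About</a>
    </div>
    <table>
      <tr><td><a href="/song-one.html">Song One</a></td></tr>
    </table>
  </body>
</html>
""".strip()

root = ET.fromstring(PAGE)

# `/` walks down one level at a time, like a file path
direct_links = root.findall("./body/div/a")

# `//` searches at any depth; `[@class='image']` is a predicate on an attribute
image_links = root.findall(".//a[@class='image']")
song_links = root.findall(".//td/a")

print(len(direct_links))                      # 2
print([a.get("href") for a in image_links])   # ['/artist-one.html']
print([a.get("href") for a in song_links])    # ['/song-one.html']
```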

In [1]:
from lxml import html
import requests

### OK! The background is out of the way, let's get one of these spiders up and running

In [3]:
import scrapy

#inherit from scrapy.Spider
class LyricsScraper(scrapy.Spider):
    #define the name of the spider which we will call from CLI
    name = 'scrape_lyrics'
    #set the range of domains which can be scraped over (domain names only, no scheme)
    allowed_domains = ['metrolyrics.com']
    start_urls = ['https://www.metrolyrics.com/top-artists.html']

    def parse(self, response):

        #look through all the artists in the artists page
        #this returns a list of urls
        artists = response.xpath('//a[@class="image"]/@href').getall()
        for artist in artists:
            #response.follow also handles relative urls
            yield response.follow(artist, callback=self.parse)

        #find all their song urls
        songs = response.xpath('//td/a/@href').getall()
        for song in songs:
            yield response.follow(song, callback=self.parse)

        #scrape the lyrics from this page
        #only song pages contain this div, so skip pages without lyrics
        lyrics = response.xpath('//div[@id="lyrics-body-text"]//text()').getall()
        lyrics = ' '.join(lyrics).strip()

        if lyrics:
            item = {'lyrics': lyrics}
            print(item)
            yield item


### Ways you can improve the Spider:
* Write the results to disk rather than printing
* Log the status of the scrape to terminal
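
One possible way to tackle both improvements is an item pipeline in `pipelines.py`. The sketch below is an assumption about how you might do it, not the course's reference solution: the class name `JsonLinesPipeline` and the file name `lyrics.jl` are made up, while `open_spider`, `process_item` and `close_spider` are the hook methods Scrapy calls on a pipeline.

```python
import json
import logging

logger = logging.getLogger(__name__)


class JsonLinesPipeline:
    """Hypothetical pipeline: write each scraped item as one JSON line."""

    def __init__(self, path="lyrics.jl"):   # file name is an assumption
        self.path = path
        self.count = 0

    def open_spider(self, spider):
        # Called once when the spider starts
        self.file = open(self.path, "w", encoding="utf-8")

    def process_item(self, item, spider):
        # Called for every item the spider yields
        self.file.write(json.dumps(dict(item)) + "\n")
        self.count += 1
        logger.info("Saved item %d", self.count)
        return item

    def close_spider(self, spider):
        # Called once when the spider finishes
        self.file.close()
        logger.info("Wrote %d items to %s", self.count, self.path)
```

To switch it on, you would add something like `ITEM_PIPELINES = {"<PROJECT>.pipelines.JsonLinesPipeline": 300}` to `settings.py`. For a quick export without any extra code, `scrapy crawl scrape_lyrics -o lyrics.json` also writes the yielded items to a file.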

---

#### Further research:

Scrapy documentation: `https://docs.scrapy.org/en/latest/intro/tutorial.html`

Spiced scrapy documentation: `Chapter 6.4`