# CPS600 - Python Programming for Finance 
<img src="https://www.syracuse.edu/wp-content/themes/g6-carbon/img/syracuse-university-seal.svg?ver=6.3.9" style="width: 200px;"/>

## Lab: Scraping

###  October 23, 2018


**Exercise 1**

*This exercise may take you some time and some digging. Please stick with it.*

Write a spider like the ones we looked at in class. Your spider should:

* Start with [this URL](https://en.wikipedia.org/wiki/Time_series).

* Set `DEPTH_LIMIT` to $2$. The crawl then stops two links away from the start page, so you will not get carried away and collect too much data.

* Collect the text content of the first sentence of the introductory paragraph of each article you visit and write it to a file.

* Follow any hyperlinks in that first sentence and parse the resulting pages as well (again, your `DEPTH_LIMIT` keeps this from going on indefinitely).

Build on what we did in lecture, which is taken directly from the [Scrapy tutorial](https://doc.scrapy.org/en/latest/intro/tutorial.html).
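
One piece of this exercise can be tried without running a crawl: cutting the first sentence out of a paragraph's text. The block below is a minimal sketch, not the expected solution. The helper name `first_sentence` and both regular expressions are assumptions made for illustration, and the sentence rule is deliberately naive (it will split too early on abbreviations such as "e.g. The").

```python
import re

# Assumed, naive sentence boundary: ., ! or ? followed by whitespace
# and then an uppercase letter, a quote, or an opening parenthesis.
_SENTENCE_END = re.compile(r'(?<=[.!?])\s+(?=[A-Z"(])')

# Wikipedia-style footnote markers such as [1] or [23].
_FOOTNOTE = re.compile(r'\[\d+\]')


def first_sentence(paragraph_text):
    """Return the first sentence of a paragraph of plain text."""
    # Drop footnote markers, then collapse runs of whitespace and newlines.
    cleaned = _FOOTNOTE.sub('', paragraph_text)
    cleaned = ' '.join(cleaned.split())
    # Split once at the first sentence boundary and keep the left part.
    return _SENTENCE_END.split(cleaned, maxsplit=1)[0]


sample = (
    "In mathematics, a time series is a series of data points "
    "indexed in time order.[1] Most commonly, a time series is a "
    "sequence taken at successive equally spaced points in time."
)
print(first_sentence(sample))
```

Inside a spider, the text passed to such a helper would come from the paragraph you select on the page, and `DEPTH_LIMIT` can be set per spider through the `custom_settings` class attribute, e.g. `custom_settings = {'DEPTH_LIMIT': 2}`.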

*The next cell shows the crawler from lecture.*

In [None]:
import scrapy

class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = [
        'http://quotes.toscrape.com/page/1/',
    ]

    def parse(self, response):
        # Each quote on the page sits in its own <div class="quote">.
        for quote in response.css('div.quote'):
            yield {
                'text': quote.css('span.text::text').extract_first(),
                'author': quote.css('small.author::text').extract_first(),
                'tags': quote.css('div.tags a.tag::text').extract(),
            }

        # Follow the "Next" link, if there is one, and parse that page the same way.
        next_page = response.css('li.next a::attr(href)').extract_first()
        if next_page is not None:
            next_page = response.urljoin(next_page)
            yield scrapy.Request(next_page, callback=self.parse)
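
The `response.urljoin` call in the cell above turns a relative `href` into an absolute URL by resolving it against the URL of the page it was found on. The standard library's `urllib.parse.urljoin` performs the same kind of resolution, so the behavior can be checked without running a spider. This is a small sketch; the helper name `resolve_links` and the Wikipedia `href` are illustrative assumptions.

```python
from urllib.parse import urljoin


def resolve_links(page_url, hrefs):
    """Turn the hrefs found on page_url into absolute URLs."""
    return [urljoin(page_url, href) for href in hrefs]


# The pagination link from the quotes site used in lecture.
print(resolve_links('http://quotes.toscrape.com/page/1/', ['/page/2/']))

# An illustrative (assumed) article link of the kind Wikipedia pages use.
print(resolve_links('https://en.wikipedia.org/wiki/Time_series',
                    ['/wiki/Data_point']))
```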

*The next cell shows a more advanced spider that also follows links to author pages.*

In [None]:
import scrapy


class AuthorSpider(scrapy.Spider):
    name = 'author'

    start_urls = ['http://quotes.toscrape.com/']

    def parse(self, response):
        # follow links to author pages
        for href in response.css('.author + a::attr(href)'):
            yield response.follow(href, self.parse_author)

        # follow pagination links
        for href in response.css('li.next a::attr(href)'):
            yield response.follow(href, self.parse)

    def parse_author(self, response):
        def extract_with_css(query):
            # default='' avoids calling .strip() on None when nothing matches
            return response.css(query).extract_first(default='').strip()

        yield {
            'name': extract_with_css('h3.author-title::text'),
            'birthdate': extract_with_css('.author-born-date::text'),
            'bio': extract_with_css('.author-description::text'),
        }

**Exercise 2**

Sign up for Twitter. Create your API credentials and store them in a JSON-formatted file, as I demonstrated in class. Use Python to post a status update (i.e., to tweet) on [Twitter](http://twitter.com). Say something like "Hello Twitter, I am `<your_name>` and I love Python!"
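
The block below is a rough sketch of one way to structure this, not necessarily the approach shown in class. The key names in `REQUIRED_KEYS`, the file name `credentials.json`, and the helper names are all assumptions. So is the choice of the `tweepy` library for the posting step, which also depends on your Twitter account having API access. `tweepy` is imported only inside `post_status`, so the other helpers run without it.

```python
import json

# Hypothetical key names; match them to the keys in your own credentials file.
REQUIRED_KEYS = (
    "consumer_key",
    "consumer_secret",
    "access_token",
    "access_token_secret",
)


def load_credentials(path):
    """Read API credentials from a JSON file and check that none are missing."""
    with open(path, encoding="utf-8") as handle:
        creds = json.load(handle)
    missing = [key for key in REQUIRED_KEYS if not creds.get(key)]
    if missing:
        raise ValueError("missing credential(s): " + ", ".join(missing))
    return creds


def build_status(name):
    """Compose the greeting suggested in the exercise."""
    return f"Hello Twitter, I am {name} and I love Python!"


def post_status(creds, text):
    """Post text as a status update. Assumes the tweepy package is installed."""
    import tweepy  # imported here so the helpers above work without it

    auth = tweepy.OAuthHandler(creds["consumer_key"], creds["consumer_secret"])
    auth.set_access_token(creds["access_token"], creds["access_token_secret"])
    api = tweepy.API(auth)
    return api.update_status(status=text)
```

With a file named, say, `credentials.json` in place, a call such as `post_status(load_credentials("credentials.json"), build_status("Ada"))` would send the tweet. Keep the credentials file out of anything you share or submit.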