# Scrapy

References:
[JJ'S World - Using Scrapy in Jupyter notebook](https://www.jitsejan.com/using-scrapy-in-jupyter-notebook.html)

In [1]:
import platform
platform.python_version() # Check the Python version

'3.6.5'

In [3]:
import scrapy
from scrapy.crawler import CrawlerProcess

## 1 Quotes to scrape

Running Scrapy:

Scrapy will extract text from the URLs `http://quotes.toscrape.com/page/1/` and `http://quotes.toscrape.com/page/2/` by looking up HTML tags, classes, or ids. To find them, right-click the part of the page you want to extract and choose *Inspect*. Study the markup and identify the structure to extract.

```html
<div class="quote" itemscope="" itemtype="http://schema.org/CreativeWork">
 <span class="text" itemprop="text">
  “For every minute you are angry you lose sixty seconds of happiness.”
 </span>
 <span>by <small class="author" itemprop="author">Ralph Waldo Emerson</small>
 <a href="/author/Ralph-Waldo-Emerson">(about)</a>
 </span>
 <div class="tags">
  Tags: <meta class="keywords" itemprop="keywords" content="happiness">
  <a class="tag" href="/tag/happiness/page/1/">happiness</a>
 </div>
</div>
``` 

We want to extract the text, the author, and the tags.

* text: `div.quote` `span.text`;
* author: `div.quote` `small.author`;
* tags: `div.quote` `div.tags a.tag`.

```python
for quote in response.css('div.quote'):
 yield {
  'text': quote.css('span.text::text').extract_first(),
  'author': quote.css('span small::text').extract_first(),
  'tags': quote.css('div.tags a.tag::text').extract()
 }
```   
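
To see which elements those selectors target, the sketch below walks the sample markup with Python's standard-library `html.parser` instead of Scrapy's selector engine. It is only an illustration: the `QuoteParser` class and its `TARGETS` table are hypothetical names, it matches on tag and class alone (it does not check the enclosing `div.quote`), and unlike `::text` it strips surrounding whitespace.

```python
from html.parser import HTMLParser

HTML = """
<div class="quote" itemscope="" itemtype="http://schema.org/CreativeWork">
 <span class="text" itemprop="text">
  “For every minute you are angry you lose sixty seconds of happiness.”
 </span>
 <span>by <small class="author" itemprop="author">Ralph Waldo Emerson</small>
 <a href="/author/Ralph-Waldo-Emerson">(about)</a>
 </span>
 <div class="tags">
  Tags: <meta class="keywords" itemprop="keywords" content="happiness">
  <a class="tag" href="/tag/happiness/page/1/">happiness</a>
 </div>
</div>
"""


class QuoteParser(HTMLParser):
    """Collects the text of span.text, small.author and a.tag elements."""

    # (tag name, class name) -> field of the extracted item
    TARGETS = {
        ("span", "text"): "text",
        ("small", "author"): "author",
        ("a", "tag"): "tags",
    }

    def __init__(self):
        super().__init__()
        self.item = {"text": None, "author": None, "tags": []}
        self._field = None  # field being captured, if any

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        for (name, cls), field in self.TARGETS.items():
            if tag == name and cls in classes:
                self._field = field

    def handle_endtag(self, tag):
        # Stop capturing as soon as any element closes.
        self._field = None

    def handle_data(self, data):
        value = data.strip()
        if self._field is None or not value:
            return
        if self._field == "tags":
            self.item["tags"].append(value)
        else:
            self.item[self._field] = value


parser = QuoteParser()
parser.feed(HTML)
parser.close()
print(parser.item)
```

The `(about)` link and the `Tags:` label are skipped because neither sits inside an element with one of the three target classes.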

In [4]:
import json

class JsonWriterPipeline(object):

    def open_spider(self, spider):
        self.file = open('quoteresult.jl', 'w')

    def close_spider(self, spider):
        self.file.close()

    def process_item(self, item, spider):
        line = json.dumps(dict(item)) + "\n"
        self.file.write(line)
        return item
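
Because the pipeline is a plain Python class, its hooks can be called by hand, without Scrapy, to see what ends up in the `.jl` file. The sketch below assumes a variant of the class that takes the output path as a constructor argument (`path` is a hypothetical parameter, added so the demo can write to a temporary directory). It passes `None` as the spider and uses a single sample item built from the HTML above.

```python
import json
import os
import tempfile


class JsonWriterPipeline:
    """Variant of the pipeline above; `path` is a hypothetical parameter for this demo."""

    def __init__(self, path):
        self.path = path

    def open_spider(self, spider):
        # Called once when the spider starts.
        self.file = open(self.path, "w", encoding="utf-8")

    def close_spider(self, spider):
        # Called once when the spider finishes.
        self.file.close()

    def process_item(self, item, spider):
        # One JSON object per line (the JSON Lines format).
        self.file.write(json.dumps(dict(item)) + "\n")
        return item


items = [
    {
        "text": "“For every minute you are angry you lose sixty seconds of happiness.”",
        "author": "Ralph Waldo Emerson",
        "tags": ["happiness"],
    },
]

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "quoteresult.jl")
    pipeline = JsonWriterPipeline(path)
    pipeline.open_spider(None)  # no real spider is needed for the demo
    for item in items:
        pipeline.process_item(item, None)
    pipeline.close_spider(None)
    with open(path, encoding="utf-8") as fh:
        lines = fh.read().splitlines()

# Each line decodes back to the item that was written.
parsed = [json.loads(line) for line in lines]
```

`json.dumps` escapes non-ASCII characters by default, so the curly quotes are stored as `\u201c` and `\u201d` and decode back to the original characters when the file is read.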

In [5]:
import logging

class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = [
        'http://quotes.toscrape.com/page/1/',
        'http://quotes.toscrape.com/page/2/',
    ]
    custom_settings = {
        'LOG_LEVEL': logging.WARNING,
        'ITEM_PIPELINES': {'__main__.JsonWriterPipeline': 1}, # Used for pipeline 1
        'FEED_FORMAT':'json',                                 # Used for pipeline 2
        'FEED_URI': 'quoteresult.json'                        # Used for pipeline 2
    }
    
    def parse(self, response):
        for quote in response.css('div.quote'):
            yield {
                'text': quote.css('span.text::text').extract_first(),
                'author': quote.css('span small::text').extract_first(),
                'tags': quote.css('div.tags a.tag::text').extract(),
            }
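
`FEED_FORMAT` and `FEED_URI` match the Scrapy 1.5.1 release reported in the log output further down. On Scrapy 2.1 and later, these two settings are deprecated in favor of a single `FEEDS` dictionary. A sketch of the equivalent settings, assuming such a newer release (this variant was not run in this notebook):

```python
import logging

# Assumed replacement for `custom_settings` on Scrapy 2.1 or newer.
custom_settings = {
    "LOG_LEVEL": logging.WARNING,
    "ITEM_PIPELINES": {"__main__.JsonWriterPipeline": 1},
    # Maps each output path to its export options.
    "FEEDS": {
        "quoteresult.json": {"format": "json"},
    },
}
```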

In [6]:
# Start the crawl process
process = CrawlerProcess({
    'USER_AGENT': 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)'
})

process.crawl(QuotesSpider)
process.start()

2019-02-07 09:29:39 [scrapy.utils.log] INFO: Scrapy 1.5.1 started (bot: scrapybot)
2019-02-07 09:29:39 [scrapy.utils.log] INFO: Versions: lxml 4.2.1.0, libxml2 2.9.8, cssselect 1.0.3, parsel 1.5.1, w3lib 1.20.0, Twisted 18.7.0, Python 3.6.5 |Anaconda, Inc.| (default, Mar 29 2018, 13:23:52) [MSC v.1900 32 bit (Intel)], pyOpenSSL 18.0.0 (OpenSSL 1.0.2o  27 Mar 2018), cryptography 2.2.2, Platform Windows-10-10.0.17134-SP0
2019-02-07 09:29:39 [scrapy.crawler] INFO: Overridden settings: {'FEED_FORMAT': 'json', 'FEED_URI': 'quoteresult.json', 'LOG_LEVEL': 30, 'USER_AGENT': 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)'}


### Reading data from the file

In [8]:
import pandas as pd

In [9]:
scrap_data = pd.read_json('quoteresult.json')
scrap_data

| | author | tags | text |
|---|---|---|---|
| 0 | Marilyn Monroe | [friends, heartbreak, inspirational, life, lov... | “This life is what you make it. No matter what... |
| 1 | J.K. Rowling | [courage, friends] | “It takes a great deal of bravery to stand up ... |
| 2 | Albert Einstein | [simplicity, understand] | “If you can't explain it to a six year old, yo... |
| 3 | Bob Marley | [love] | “You may not be her first, her last, or her on... |
| 4 | Dr. Seuss | [fantasy] | “I like nonsense, it wakes up the brain cells.... |
| 5 | Douglas Adams | [life, navigation] | “I may not have gone where I intended to go, b... |
| 6 | Elie Wiesel | [activism, apathy, hate, indifference, inspira... | “The opposite of love is not hate, it's indiff... |
| 7 | Friedrich Nietzsche | [friendship, lack-of-friendship, lack-of-love,... | “It is not a lack of love, but a lack of frien... |
| 8 | Mark Twain | [books, contentment, friends, friendship, life] | “Good friends, good books, and a sleepy consci... |
| 9 | Allen Saunders | [fate, life, misattributed-john-lennon, planni... | “Life is what happens to us while we are makin... |


In [11]:
# Read the quoteresult.jl file
dfjl = pd.read_json('quoteresult.jl', lines=True)
dfjl

| | author | tags | text |
|---|---|---|---|
| 0 | Marilyn Monroe | [friends, heartbreak, inspirational, life, lov... | “This life is what you make it. No matter what... |
| 1 | J.K. Rowling | [courage, friends] | “It takes a great deal of bravery to stand up ... |
| 2 | Albert Einstein | [simplicity, understand] | “If you can't explain it to a six year old, yo... |
| 3 | Bob Marley | [love] | “You may not be her first, her last, or her on... |
| 4 | Dr. Seuss | [fantasy] | “I like nonsense, it wakes up the brain cells.... |
| 5 | Douglas Adams | [life, navigation] | “I may not have gone where I intended to go, b... |
| 6 | Elie Wiesel | [activism, apathy, hate, indifference, inspira... | “The opposite of love is not hate, it's indiff... |
| 7 | Friedrich Nietzsche | [friendship, lack-of-friendship, lack-of-love,... | “It is not a lack of love, but a lack of frien... |
| 8 | Mark Twain | [books, contentment, friends, friendship, life] | “Good friends, good books, and a sleepy consci... |
| 9 | Allen Saunders | [fate, life, misattributed-john-lennon, planni... | “Life is what happens to us while we are makin... |
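
The two files hold the same records in different layouts: the feed export writes a single JSON array, while the pipeline writes one JSON object per line, which is why the second call needs `lines=True`. The sketch below illustrates the difference with the standard-library `json` module instead of pandas. The two records reuse only the untruncated `author` and `tags` values from the table, and the variable names are made up for the example.

```python
import json

records = [
    {"author": "J.K. Rowling", "tags": ["courage", "friends"]},
    {"author": "Bob Marley", "tags": ["love"]},
]

# Layout of quoteresult.json: one JSON array holding every record.
json_array_text = json.dumps(records)

# Layout of quoteresult.jl: one JSON object per line.
json_lines_text = "".join(json.dumps(record) + "\n" for record in records)

# The array is parsed in one call; the lines are parsed one at a time.
from_array = json.loads(json_array_text)
from_lines = [json.loads(line) for line in json_lines_text.splitlines()]

# Parsing the JSON Lines text as a single document fails.
try:
    json.loads(json_lines_text)
    parsed_as_single_document = True
except json.JSONDecodeError:
    parsed_as_single_document = False
```

Both layouts decode to the same list of records; only the parsing strategy differs.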


2 - 