# Instalando o scrapy

In [1]:
!pip install scrapy

Please see https://github.com/pypa/pip/issues/5599 for advice on fixing the underlying issue.
To avoid this problem you can invoke Python with '-m pip' instead of running pip directly.
Collecting scrapy
  Downloading Scrapy-2.4.0-py2.py3-none-any.whl (239 kB)
[K     |████████████████████████████████| 239 kB 4.4 MB/s eta 0:00:01
[?25hCollecting itemloaders>=1.0.1
  Downloading itemloaders-1.0.3-py3-none-any.whl (11 kB)
Collecting Twisted>=17.9.0
  Downloading Twisted-20.3.0-cp37-cp37m-manylinux1_x86_64.whl (3.1 MB)
[K     |████████████████████████████████| 3.1 MB 1.2 MB/s eta 0:00:01
Collecting parsel>=1.5.0
  Downloading parsel-1.6.0-py2.py3-none-any.whl (13 kB)
Collecting protego>=0.1.15
  Downloading Protego-0.1.16.tar.gz (3.2 MB)
[K     |████████████████████████████████| 3.2 MB 30.1 MB/s eta 0:00:01
[?25hCollecting cssselect>=0.9.1
  Downloading cssselect-1.1.0-py2.py3-none-any.whl (16 kB)
Collecting zope.interface>=4.1.3
  Downloading zope.interface-5.1.2-cp37-cp37m-manylinux2

# Criando um projeto scrapy

In [3]:
!scrapy startproject helloscrapy

Error: scrapy.cfg already exists in /home/jovyan/labs/helloscrapy


In [None]:
!ls

In [None]:
!ls helloscrapy

In [None]:
!ls helloscrapy/helloscrapy

In [None]:
!ls helloscrapy/helloscrapy/spiders

# Como extrair informações do site abaixo?
http://quotes.toscrape.com/

# Criando um novo Spider

In [4]:
!cd helloscrapy/helloscrapy/spiders && touch first_spider.py

# Executando o spider

In [None]:
!cd helloscrapy && scrapy crawl spiderone

# Como extrair informações do site abaixo?
https://www.americanas.com.br/

# Executando o spider

In [None]:
!cd helloscrapy && scrapy crawl spiderone

# Arquivo robots.txt
https://www.amazon.com.br/robots.txt

https://www.americanas.com.br/robots.txt

# Exercício 1: Coletar página inicial de Games do Mercado Livre

In [None]:
!cd helloscrapy && scrapy crawl spiderone

# Extraíndo informações específicas
# Scrapy shell, CSS Selector, Xpath, Inspect, SelectorGadget (Chrome)

## Scrapy shell
Digitar no terminal da máquina
``` bash 
scrapy shell "http://quotes.toscrape.com/"
```

## CSS Selector

``` bash
response.css('title')

response.css('title').extract()

response.css('title::text').extract()

response.css('title::text')[0].extract() #Se a lista tiver vazia apresenta erro

response.css('title::text').extract_first() #Se não é possível extrair nada do site, retorna NULL como valor (não apresenta erro)

```

## Inspect (Navegador)
```bash
response.css('span.text').extract()

response.css('span.text::text').extract()

response.css('span.text::text')[1].extract()
```

## Selector Gadget (Google Chrome)
https://chrome.google.com/webstore/detail/selectorgadget/mhjhnkcfbdhnjickkkdbjoemdmbfginb?hl=pt-BR

```bash
Pegnado todos os autores com o selector gadget: .author

response.css('.author::text').extract()
```

## XPath
Mais poderosa que os seletores CSS porque pode-se analisar o conteúdo das tags. Ex: selecionar o link que contem o texto "Próxima Página"

```bash
response.xpath('//title').extract()

response.xpath('//title/text()').extract()

response.xpath("//span[@class='text']/text()").extract()

response.xpath('//*[contains(concat( " ", @class, " " ), concat( " ", "text", " " ))]/text()').extract() #usando o Selector Gadget

response.xpath("//span[@class='text']/text()")[2].extract()
```

## Inspect + CSS + Xpath
Inspecionando o botão "Next"

Extraindo seu conteúdo utilizando os seletores CSS e XPath:
```bash
response.css("li.next a").xpath("@href").extract()
```

In [None]:
!scrapy shell "http://quotes.toscrape.com/" -c 'response.css("li.next a").xpath("@href").extract()'

### Como coletar todas os links desta página?

In [None]:
!scrapy shell "http://quotes.toscrape.com/" -c 'response.css("a").xpath("@href").extract()'

# Extraindo citações e autores
Alterando o script first_spider

In [9]:
!cd helloscrapy && scrapy crawl spiderone

2020-10-24 16:07:20 [scrapy.utils.log] INFO: Scrapy 2.4.0 started (bot: helloscrapy)
2020-10-24 16:07:20 [scrapy.utils.log] INFO: Versions: lxml 4.5.0.0, libxml2 2.9.10, cssselect 1.1.0, parsel 1.6.0, w3lib 1.22.0, Twisted 20.3.0, Python 3.7.6 | packaged by conda-forge | (default, Jan  7 2020, 22:33:48) - [GCC 7.3.0], pyOpenSSL 19.1.0 (OpenSSL 1.1.1d  10 Sep 2019), cryptography 2.8, Platform Linux-4.15.0-101-generic-x86_64-with-debian-buster-sid
2020-10-24 16:07:20 [scrapy.utils.log] DEBUG: Using reactor: twisted.internet.epollreactor.EPollReactor
2020-10-24 16:07:20 [scrapy.crawler] INFO: Overridden settings:
{'BOT_NAME': 'helloscrapy',
 'NEWSPIDER_MODULE': 'helloscrapy.spiders',
 'ROBOTSTXT_OBEY': True,
 'SPIDER_MODULES': ['helloscrapy.spiders'],
 'USER_AGENT': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, '
               'like Gecko) Chrome/86.0.4240.75 Safari/537.36'}
2020-10-24 16:07:20 [scrapy.extensions.telnet] INFO: Telnet Password: c82323a6103921d3
2020-10-24 16

### Como separar cada citação, author e tag??

In [8]:
!cd helloscrapy && scrapy crawl spiderone

2020-10-24 14:17:35 [scrapy.utils.log] INFO: Scrapy 2.4.0 started (bot: helloscrapy)
2020-10-24 14:17:35 [scrapy.utils.log] INFO: Versions: lxml 4.5.0.0, libxml2 2.9.10, cssselect 1.1.0, parsel 1.6.0, w3lib 1.22.0, Twisted 20.3.0, Python 3.7.6 | packaged by conda-forge | (default, Jan  7 2020, 22:33:48) - [GCC 7.3.0], pyOpenSSL 19.1.0 (OpenSSL 1.1.1d  10 Sep 2019), cryptography 2.8, Platform Linux-4.15.0-101-generic-x86_64-with-debian-buster-sid
2020-10-24 14:17:35 [scrapy.utils.log] DEBUG: Using reactor: twisted.internet.epollreactor.EPollReactor
2020-10-24 14:17:35 [scrapy.crawler] INFO: Overridden settings:
{'BOT_NAME': 'helloscrapy',
 'NEWSPIDER_MODULE': 'helloscrapy.spiders',
 'ROBOTSTXT_OBEY': True,
 'SPIDER_MODULES': ['helloscrapy.spiders'],
 'USER_AGENT': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, '
               'like Gecko) Chrome/86.0.4240.75 Safari/537.36'}
2020-10-24 14:17:35 [scrapy.extensions.telnet] INFO: Telnet Password: 8588dc9d6c7fe1c2
2020-10-24 14

# Exercício 2: Coletar os nomes dos produtos na primeira página do Mercado Livre Games

In [9]:
!cd helloscrapy && scrapy crawl spiderone

2020-10-24 14:18:19 [scrapy.utils.log] INFO: Scrapy 2.4.0 started (bot: helloscrapy)
2020-10-24 14:18:19 [scrapy.utils.log] INFO: Versions: lxml 4.5.0.0, libxml2 2.9.10, cssselect 1.1.0, parsel 1.6.0, w3lib 1.22.0, Twisted 20.3.0, Python 3.7.6 | packaged by conda-forge | (default, Jan  7 2020, 22:33:48) - [GCC 7.3.0], pyOpenSSL 19.1.0 (OpenSSL 1.1.1d  10 Sep 2019), cryptography 2.8, Platform Linux-4.15.0-101-generic-x86_64-with-debian-buster-sid
2020-10-24 14:18:19 [scrapy.utils.log] DEBUG: Using reactor: twisted.internet.epollreactor.EPollReactor
2020-10-24 14:18:19 [scrapy.crawler] INFO: Overridden settings:
{'BOT_NAME': 'helloscrapy',
 'NEWSPIDER_MODULE': 'helloscrapy.spiders',
 'ROBOTSTXT_OBEY': True,
 'SPIDER_MODULES': ['helloscrapy.spiders'],
 'USER_AGENT': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, '
               'like Gecko) Chrome/86.0.4240.75 Safari/537.36'}
2020-10-24 14:18:19 [scrapy.extensions.telnet] INFO: Telnet Password: 42ba00075417a7ca
2020-10-24 14

# Item containers
Extracted data > Temporary containers > Store in database

## Alterando o arquivo items.py
Declarando title, author e tag; Importando o arquivo items.py no Spider

In [5]:
!cd helloscrapy && scrapy crawl spiderone

2020-10-24 14:38:23 [scrapy.utils.log] INFO: Scrapy 2.4.0 started (bot: helloscrapy)
2020-10-24 14:38:23 [scrapy.utils.log] INFO: Versions: lxml 4.5.0.0, libxml2 2.9.10, cssselect 1.1.0, parsel 1.6.0, w3lib 1.22.0, Twisted 20.3.0, Python 3.7.6 | packaged by conda-forge | (default, Jan  7 2020, 22:33:48) - [GCC 7.3.0], pyOpenSSL 19.1.0 (OpenSSL 1.1.1d  10 Sep 2019), cryptography 2.8, Platform Linux-4.15.0-101-generic-x86_64-with-debian-buster-sid
2020-10-24 14:38:23 [scrapy.utils.log] DEBUG: Using reactor: twisted.internet.epollreactor.EPollReactor
2020-10-24 14:38:23 [scrapy.crawler] INFO: Overridden settings:
{'BOT_NAME': 'helloscrapy',
 'NEWSPIDER_MODULE': 'helloscrapy.spiders',
 'ROBOTSTXT_OBEY': True,
 'SPIDER_MODULES': ['helloscrapy.spiders'],
 'USER_AGENT': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, '
               'like Gecko) Chrome/86.0.4240.75 Safari/537.36'}
2020-10-24 14:38:23 [scrapy.extensions.telnet] INFO: Telnet Password: 872d2fcd94a84d10
2020-10-24 14

# Armazenando o resultado em arquivos JSON

In [6]:
!cd helloscrapy && scrapy crawl spiderone -o items.json

2020-10-24 14:39:08 [scrapy.utils.log] INFO: Scrapy 2.4.0 started (bot: helloscrapy)
2020-10-24 14:39:08 [scrapy.utils.log] INFO: Versions: lxml 4.5.0.0, libxml2 2.9.10, cssselect 1.1.0, parsel 1.6.0, w3lib 1.22.0, Twisted 20.3.0, Python 3.7.6 | packaged by conda-forge | (default, Jan  7 2020, 22:33:48) - [GCC 7.3.0], pyOpenSSL 19.1.0 (OpenSSL 1.1.1d  10 Sep 2019), cryptography 2.8, Platform Linux-4.15.0-101-generic-x86_64-with-debian-buster-sid
2020-10-24 14:39:08 [scrapy.utils.log] DEBUG: Using reactor: twisted.internet.epollreactor.EPollReactor
2020-10-24 14:39:08 [scrapy.crawler] INFO: Overridden settings:
{'BOT_NAME': 'helloscrapy',
 'NEWSPIDER_MODULE': 'helloscrapy.spiders',
 'ROBOTSTXT_OBEY': True,
 'SPIDER_MODULES': ['helloscrapy.spiders'],
 'USER_AGENT': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, '
               'like Gecko) Chrome/86.0.4240.75 Safari/537.36'}
2020-10-24 14:39:08 [scrapy.extensions.telnet] INFO: Telnet Password: 30cc0ed1faee41c0
2020-10-24 14

## Armazenando o resultado em CSV e XML

In [7]:
!cd helloscrapy && scrapy crawl spiderone -o items.csv

2020-10-24 14:39:24 [scrapy.utils.log] INFO: Scrapy 2.4.0 started (bot: helloscrapy)
2020-10-24 14:39:24 [scrapy.utils.log] INFO: Versions: lxml 4.5.0.0, libxml2 2.9.10, cssselect 1.1.0, parsel 1.6.0, w3lib 1.22.0, Twisted 20.3.0, Python 3.7.6 | packaged by conda-forge | (default, Jan  7 2020, 22:33:48) - [GCC 7.3.0], pyOpenSSL 19.1.0 (OpenSSL 1.1.1d  10 Sep 2019), cryptography 2.8, Platform Linux-4.15.0-101-generic-x86_64-with-debian-buster-sid
2020-10-24 14:39:24 [scrapy.utils.log] DEBUG: Using reactor: twisted.internet.epollreactor.EPollReactor
2020-10-24 14:39:24 [scrapy.crawler] INFO: Overridden settings:
{'BOT_NAME': 'helloscrapy',
 'NEWSPIDER_MODULE': 'helloscrapy.spiders',
 'ROBOTSTXT_OBEY': True,
 'SPIDER_MODULES': ['helloscrapy.spiders'],
 'USER_AGENT': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, '
               'like Gecko) Chrome/86.0.4240.75 Safari/537.36'}
2020-10-24 14:39:24 [scrapy.extensions.telnet] INFO: Telnet Password: 6dd85a319012d6c6
2020-10-24 14

In [8]:
!cd helloscrapy && scrapy crawl spiderone -o items.xml

2020-10-24 14:39:26 [scrapy.utils.log] INFO: Scrapy 2.4.0 started (bot: helloscrapy)
2020-10-24 14:39:26 [scrapy.utils.log] INFO: Versions: lxml 4.5.0.0, libxml2 2.9.10, cssselect 1.1.0, parsel 1.6.0, w3lib 1.22.0, Twisted 20.3.0, Python 3.7.6 | packaged by conda-forge | (default, Jan  7 2020, 22:33:48) - [GCC 7.3.0], pyOpenSSL 19.1.0 (OpenSSL 1.1.1d  10 Sep 2019), cryptography 2.8, Platform Linux-4.15.0-101-generic-x86_64-with-debian-buster-sid
2020-10-24 14:39:26 [scrapy.utils.log] DEBUG: Using reactor: twisted.internet.epollreactor.EPollReactor
2020-10-24 14:39:26 [scrapy.crawler] INFO: Overridden settings:
{'BOT_NAME': 'helloscrapy',
 'NEWSPIDER_MODULE': 'helloscrapy.spiders',
 'ROBOTSTXT_OBEY': True,
 'SPIDER_MODULES': ['helloscrapy.spiders'],
 'USER_AGENT': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, '
               'like Gecko) Chrome/86.0.4240.75 Safari/537.36'}
2020-10-24 14:39:26 [scrapy.extensions.telnet] INFO: Telnet Password: 64700027f26f002f
2020-10-24 14

# Exercício 3: Coletar os nomes dos produtos na primeira página do Mercado Livre Games e salvar em um arquivo JSON

In [10]:
!cd helloscrapy && scrapy crawl spiderone -o nomes.json

2020-10-24 14:44:42 [scrapy.utils.log] INFO: Scrapy 2.4.0 started (bot: helloscrapy)
2020-10-24 14:44:42 [scrapy.utils.log] INFO: Versions: lxml 4.5.0.0, libxml2 2.9.10, cssselect 1.1.0, parsel 1.6.0, w3lib 1.22.0, Twisted 20.3.0, Python 3.7.6 | packaged by conda-forge | (default, Jan  7 2020, 22:33:48) - [GCC 7.3.0], pyOpenSSL 19.1.0 (OpenSSL 1.1.1d  10 Sep 2019), cryptography 2.8, Platform Linux-4.15.0-101-generic-x86_64-with-debian-buster-sid
2020-10-24 14:44:42 [scrapy.utils.log] DEBUG: Using reactor: twisted.internet.epollreactor.EPollReactor
2020-10-24 14:44:42 [scrapy.crawler] INFO: Overridden settings:
{'BOT_NAME': 'helloscrapy',
 'NEWSPIDER_MODULE': 'helloscrapy.spiders',
 'ROBOTSTXT_OBEY': True,
 'SPIDER_MODULES': ['helloscrapy.spiders'],
 'USER_AGENT': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, '
               'like Gecko) Chrome/86.0.4240.75 Safari/537.36'}
2020-10-24 14:44:42 [scrapy.extensions.telnet] INFO: Telnet Password: 82e644c7606b2111
2020-10-24 14

# Pipelines
Scraped data > Item Containers > Pipeline > Database

## Alterando o arquivo settings.py
Descomentando a linha ITEM_PIPELINES...

## Alterando o arquivo pipeline.py
Verificando o fluxo de informação passado no pipeline

In [28]:
!cd helloscrapy && scrapy crawl spiderone

2020-10-24 16:39:32 [scrapy.utils.log] INFO: Scrapy 2.4.0 started (bot: helloscrapy)
2020-10-24 16:39:32 [scrapy.utils.log] INFO: Versions: lxml 4.5.0.0, libxml2 2.9.10, cssselect 1.1.0, parsel 1.6.0, w3lib 1.22.0, Twisted 20.3.0, Python 3.7.6 | packaged by conda-forge | (default, Jan  7 2020, 22:33:48) - [GCC 7.3.0], pyOpenSSL 19.1.0 (OpenSSL 1.1.1d  10 Sep 2019), cryptography 2.8, Platform Linux-4.15.0-101-generic-x86_64-with-debian-buster-sid
2020-10-24 16:39:32 [scrapy.utils.log] DEBUG: Using reactor: twisted.internet.epollreactor.EPollReactor
2020-10-24 16:39:32 [scrapy.crawler] INFO: Overridden settings:
{'BOT_NAME': 'helloscrapy',
 'NEWSPIDER_MODULE': 'helloscrapy.spiders',
 'ROBOTSTXT_OBEY': True,
 'SPIDER_MODULES': ['helloscrapy.spiders'],
 'USER_AGENT': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, '
               'like Gecko) Chrome/86.0.4240.75 Safari/537.36'}
2020-10-24 16:39:32 [scrapy.extensions.telnet] INFO: Telnet Password: 4730abd5f80f92f8
2020-10-24 16

# Criando o banco de dados relacional - Sqlite

In [24]:
!cd helloscrapy/helloscrapy/ && touch create_database.py

In [25]:
!cd helloscrapy/helloscrapy/ && python create_database.py

### Visualizando o banco criado
https://sqliteonline.com/

## Inserindo os dados no banco através do arquivo pipeline.py
Alterando o arquivo pipeline.py

In [29]:
!cd helloscrapy && scrapy crawl spiderone

2020-10-24 16:40:15 [scrapy.utils.log] INFO: Scrapy 2.4.0 started (bot: helloscrapy)
2020-10-24 16:40:15 [scrapy.utils.log] INFO: Versions: lxml 4.5.0.0, libxml2 2.9.10, cssselect 1.1.0, parsel 1.6.0, w3lib 1.22.0, Twisted 20.3.0, Python 3.7.6 | packaged by conda-forge | (default, Jan  7 2020, 22:33:48) - [GCC 7.3.0], pyOpenSSL 19.1.0 (OpenSSL 1.1.1d  10 Sep 2019), cryptography 2.8, Platform Linux-4.15.0-101-generic-x86_64-with-debian-buster-sid
2020-10-24 16:40:15 [scrapy.utils.log] DEBUG: Using reactor: twisted.internet.epollreactor.EPollReactor
2020-10-24 16:40:15 [scrapy.crawler] INFO: Overridden settings:
{'BOT_NAME': 'helloscrapy',
 'NEWSPIDER_MODULE': 'helloscrapy.spiders',
 'ROBOTSTXT_OBEY': True,
 'SPIDER_MODULES': ['helloscrapy.spiders'],
 'USER_AGENT': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, '
               'like Gecko) Chrome/86.0.4240.75 Safari/537.36'}
2020-10-24 16:40:15 [scrapy.extensions.telnet] INFO: Telnet Password: 8ae22755a81db3d4
2020-10-24 16

# Exercício 4: Coletar os nomes dos produtos na primeira página do Mercado Livre Games e armazenar no MongoDB

In [132]:
db.nomes.drop()

In [135]:
!cd helloscrapy && scrapy crawl spiderone

2020-10-24 17:48:07 [scrapy.utils.log] INFO: Scrapy 2.4.0 started (bot: helloscrapy)
2020-10-24 17:48:07 [scrapy.utils.log] INFO: Versions: lxml 4.5.0.0, libxml2 2.9.10, cssselect 1.1.0, parsel 1.6.0, w3lib 1.22.0, Twisted 20.3.0, Python 3.7.6 | packaged by conda-forge | (default, Jan  7 2020, 22:33:48) - [GCC 7.3.0], pyOpenSSL 19.1.0 (OpenSSL 1.1.1d  10 Sep 2019), cryptography 2.8, Platform Linux-4.15.0-101-generic-x86_64-with-debian-buster-sid
2020-10-24 17:48:07 [scrapy.utils.log] DEBUG: Using reactor: twisted.internet.epollreactor.EPollReactor
2020-10-24 17:48:07 [scrapy.crawler] INFO: Overridden settings:
{'BOT_NAME': 'helloscrapy',
 'NEWSPIDER_MODULE': 'helloscrapy.spiders',
 'ROBOTSTXT_OBEY': True,
 'SPIDER_MODULES': ['helloscrapy.spiders'],
 'USER_AGENT': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, '
               'like Gecko) Chrome/86.0.4240.75 Safari/537.36'}
2020-10-24 17:48:07 [scrapy.extensions.telnet] INFO: Telnet Password: 4fa8c8f1073b4980
2020-10-24 17

In [136]:
from pymongo import MongoClient
from pprintpp import pprint

mongoclient = MongoClient('localhost', 27017)
db = mongoclient.dbmongo
result = db.nomes.find()

# usando a função pretty print
for i in result:
    pprint( list(result) )


[
    {
        '_id': ObjectId('5f9468d9e5ea858841b11e04'),
        'nome': 'Super Mini Sfc Retrô 620 Jogos 2 Controles Envio Imediato Rj',
    },
    {
        '_id': ObjectId('5f9468d9e5ea858841b11e05'),
        'nome': 'Mini Game Retrô Portatil 400 Jogos Antigos Anos 80 Promoção',
    },
    {
        '_id': ObjectId('5f9468d9e5ea858841b11e06'),
        'nome': 'Super Mini Game Lcd 3  400 Jogos Portátil Av C/ Controle ',
    },
    {
        '_id': ObjectId('5f9468d9e5ea858841b11e07'),
        'nome': 'Mini Video Game Portatil Retrô + 3.000 Jogos Super Nintendo',
    },
    {
        '_id': ObjectId('5f9468d9e5ea858841b11e08'),
        'nome': 'Super Nintendo Mini 8 Mil Jogos 2 Controles Envio Imediato!',
    },
    {
        '_id': ObjectId('5f9468d9e5ea858841b11e09'),
        'nome': 'The Last Of Us Part 2 Midia Fisica Pronta Entrega Pt-br + Nf',
    },
    {
        '_id': ObjectId('5f9468d9e5ea858841b11e0a'),
        'nome': 'Cabo Flat Tipo J Play2 Para O Leitor Série 9000x 700

# Armazenando em banco de dados de documentos - MongoDB
Alterando o arquivo pipeline.py

In [None]:
!cd helloscrapy && scrapy crawl spiderone

### Consultando diretamente a coleção scrapy_col

In [None]:
from pymongo import MongoClient
from pprintpp import pprint
import warnings
warnings.filterwarnings('ignore')

mongoclient = MongoClient('localhost', 27017)
db = mongoclient.thedatasocietydb
result = db.scrapy_col.find({})

for document in result:
        pprint(document)
        print()

# Seguindo links
Alterando first_spider.py

In [None]:
!cd helloscrapy && scrapy crawl spiderone

# Exercício 5: Coletar o nome, preço e número de avaliações de todos os produtos do Mercado Livre Games e armazenar no MongoDB