* Validate MongoDB Database Setup
* Overview of Scrapy Pipelines
* Overview of using hash code for quote text
* Update Spider Logic to include hash code
* Develop Pipeline Logic to write to MongoDB
* Run the Pipeline to write to MongoDB
* Validate Data in MongoDB Collection
* Exercise and Solution

* Validate MongoDB Database Setup

1. Make sure MongoDB is running. You can check that the port is reachable with `telnet localhost 27017`.
2. Launch the Mongo shell using `mongosh`.
3. We can also use `pymongo` to connect to the MongoDB database from Python.

```python
import pymongo
client = pymongo.MongoClient('localhost', 27017)

for db in client.list_databases():
    print(db['name'])

# Get a handle to the database; MongoDB creates it lazily on the first write
db = client['quotes_db']

# If the database is empty, you will not see any collections
for collection in db.list_collections():
    print(collection)
```

* Overview of Scrapy Pipelines

Here are the key points about Scrapy pipelines.
1. We define pipelines in `pipelines.py`.
2. The pipeline class holds the logic to write the data to the specified target.
3. The logic to process HTML content and the logic to write to a target such as a database are clearly separated.

We will see how to write the extracted data into a MongoDB database using Scrapy pipelines.
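
As a rough illustration of the pipeline contract, the sketch below is a plain Python class with the three hooks Scrapy calls on an item pipeline (`open_spider`, `process_item`, `close_spider`). The class name `CollectItemsPipeline` and the in-memory list are hypothetical stand-ins for a real target such as a database; nothing here connects to MongoDB.

```python
class CollectItemsPipeline:
    """Hypothetical pipeline that keeps items in memory instead of a database."""

    def open_spider(self, spider):
        # Called once when the spider starts: acquire resources here.
        self.items = []

    def process_item(self, item, spider):
        # Called for every item the spider yields: write it to the target.
        self.items.append(dict(item))
        # Return the item so any later pipeline can process it too.
        return item

    def close_spider(self, spider):
        # Called once when the spider finishes: release resources here.
        self.closed = True


# Simulate what Scrapy does; no real spider object is needed for this sketch.
pipeline = CollectItemsPipeline()
pipeline.open_spider(spider=None)
pipeline.process_item({'quoteText': 'Sample quote'}, spider=None)
pipeline.close_spider(spider=None)
print(pipeline.items)
```

The spider only yields items; everything about where they end up lives in the pipeline class, which is the separation described in the list above.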

* Overview of using hash code for quote text

```python
import hashlib
sha = hashlib.sha256()

s = 'Hello World'
sha.update(s.encode())

sha.hexdigest()
```
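
The hash gives each quote a compact, repeatable key that can later be used to detect duplicates. A minimal sketch follows, using a hypothetical helper named `text_hash`. It also shows that calling `update` repeatedly on one hash object hashes the concatenation of everything fed in so far, so a fresh hash object is needed for each quote.

```python
import hashlib


def text_hash(text):
    """Return the SHA-256 hex digest of a single string (hypothetical helper)."""
    return hashlib.sha256(text.encode()).hexdigest()


first = text_hash('Hello World')
second = text_hash('Hello World')
print(first == second)   # the same text always gives the same digest
print(len(first))        # a SHA-256 hex digest has 64 characters

# Reusing one hash object accumulates input across update() calls.
shared = hashlib.sha256()
shared.update('Hello'.encode())
shared.update(' World'.encode())
print(shared.hexdigest() == first)  # matches the hash of 'Hello World'
```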

* Update Spider Logic to include hash code

```python
import hashlib
import scrapy

    
class QuoteSpider(scrapy.Spider):
    name = 'quotes'
    start_urls = ['https://www.goodreads.com/quotes?page=90']

    def parse(self, response):
        for quoteDetails in response.css('.quoteDetails'):
            quote_text = quoteDetails.css('.quoteText::text').get()
            if quote_text is None:
                # Skip blocks without quote text; there is nothing to hash
                continue
            # Hash each quote on its own; reusing one hash object would
            # accumulate the text of every quote processed so far
            quote_hash = hashlib.sha256(quote_text.encode()).hexdigest()
            payload = {
                'quoteTextHash': quote_hash,
                'quoteText': quote_text,
                'authorOrTitle': quoteDetails.css('span.authorOrTitle::text').get(),
                'authorOrTitleUrl': quoteDetails.css('a.authorOrTitle::attr(href)').get(),
                'authorOrTitleText': quoteDetails.css('a.authorOrTitle::text').get()
            }
            yield payload

        # Follow the pagination link, if any, and parse the next page the same way
        for next_page in response.css('a.next_page'):
            yield response.follow(next_page, self.parse)
```

* Develop Pipeline Logic to write to MongoDB

Update `pipelines.py` with the pipeline class.

```python
from pymongo import MongoClient


class QuotesPipeline:
    def __init__(self, mongo_uri, mongo_db, collection_name):
        self.mongo_uri = mongo_uri
        self.mongo_db = mongo_db
        self.collection_name = collection_name

    @classmethod
    def from_crawler(cls, crawler):
        mongo_uri = crawler.settings.get('MONGO_URI')
        mongo_db = crawler.settings.get('MONGO_DATABASE')
        collection_name = crawler.settings.get('MONGO_COLLECTION')
        return cls(mongo_uri, mongo_db, collection_name)

    def open_spider(self, spider):
        self.client = MongoClient(self.mongo_uri)
        self.db = self.client[self.mongo_db]
        self.collection = self.db[self.collection_name]

    def close_spider(self, spider):
        self.client.close()

    def process_item(self, item, spider):
        doc = dict(item)
        # Insert only if no document with the same hash exists,
        # so repeated runs do not create duplicates
        existing = self.collection.find_one({'quoteTextHash': doc['quoteTextHash']})
        if not existing:
            self.collection.insert_one(doc)
        return item
```

Update `settings.py` with the MongoDB connection details and the pipeline registration.

```python
ITEM_PIPELINES = {
    'quotes.pipelines.QuotesPipeline': 300,
}

MONGO_URI = 'mongodb://localhost:27017/'
MONGO_DATABASE = 'quotes_db'
MONGO_COLLECTION = 'quotes'
```

* Run the Pipeline to write to MongoDB

Run the pipeline using `scrapy crawl quotes`. It processes the data from the specified URLs and loads it into the MongoDB collection.

* Validate Data in MongoDB Collection

1. Launch the Mongo shell using `mongosh`.
2. Switch to `quotes_db` using `use quotes_db`.
3. Check the count in the collection using `db.quotes.countDocuments({})`.
4. Get the first few records, pretty-printed, using `db.quotes.find({}).pretty()`.
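
The same checks can be run from Python. The sketch below assumes the `quotes_db` database and `quotes` collection from `settings.py`; the helper name `summarize_collection` is hypothetical. It relies only on `count_documents` and `find(...).limit(...)` from `pymongo`, and the connection lines are left commented out because they need a running MongoDB server.

```python
def summarize_collection(collection, sample_size=3):
    """Return the document count and a few sample documents (hypothetical helper)."""
    total = collection.count_documents({})
    # find() returns a cursor; limit() caps how many documents it yields.
    sample = list(collection.find({}).limit(sample_size))
    return total, sample


# Usage against a running server (assumes the settings used above):
# from pymongo import MongoClient
# client = MongoClient('mongodb://localhost:27017/')
# total, sample = summarize_collection(client['quotes_db']['quotes'])
# print(total)
# for doc in sample:
#     print(doc)
# client.close()
```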

* Exercise - Include page urls while writing to MongoDB

1. Add the page URL to the payload in the `parse` function. The attribute name should be `pageUrl`, and it can be populated using `response.url`.
2. Make sure the data is upserted (merged). If there is no document in MongoDB with the given `quoteTextHash`, the document should be inserted; otherwise the existing document should be updated.
3. Validate by reviewing the data in the MongoDB collection.
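
To make the difference between insert-if-absent and upsert concrete, here is a small in-memory sketch that uses a plain dictionary keyed by `quoteTextHash` in place of a MongoDB collection. The function names `insert_if_absent` and `upsert` are hypothetical and only mimic the behavior described above; they are not `pymongo` calls.

```python
def insert_if_absent(store, item):
    """Add the item only when its hash is not already present."""
    store.setdefault(item['quoteTextHash'], dict(item))


def upsert(store, item):
    """Insert the item, or merge its fields into the existing document."""
    existing = store.setdefault(item['quoteTextHash'], {})
    existing.update(item)


old = {'quoteTextHash': 'abc', 'quoteText': 'Sample quote'}
new = {'quoteTextHash': 'abc', 'quoteText': 'Sample quote', 'pageUrl': 'page-1'}

skipped = {}
insert_if_absent(skipped, old)
insert_if_absent(skipped, new)   # ignored: the hash already exists
print('pageUrl' in skipped['abc'])

merged = {}
upsert(merged, old)
upsert(merged, new)              # the existing document gains the new field
print(merged['abc']['pageUrl'])
```

With insert-if-absent, documents loaded before the `pageUrl` attribute existed would never receive it; with an upsert they are updated on the next crawl.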

* Solution - Include page urls while writing to MongoDB

1. Update the `parse` function in `quotes_spider.py`.

```python
import hashlib
import scrapy

    
class QuoteSpider(scrapy.Spider):
    name = 'quotes'
    start_urls = ['https://www.goodreads.com/quotes?page=90']

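    def parse(self, response):
        for quoteDetails in response.css('.quoteDetails'):
            quote_text = quoteDetails.css('.quoteText::text').get()
            if quote_text is None:
                # Skip blocks without quote text; there is nothing to hash
                continue
            # Hash each quote on its own; reusing one hash object would
            # accumulate the text of every quote processed so far
            quote_hash = hashlib.sha256(quote_text.encode()).hexdigest()
            payload = {
                'quoteTextHash': quote_hash,
                # Record the page on which the quote was found
                'pageUrl': response.url,
                'quoteText': quote_text,
                'authorOrTitle': quoteDetails.css('span.authorOrTitle::text').get(),
                'authorOrTitleUrl': quoteDetails.css('a.authorOrTitle::attr(href)').get(),
                'authorOrTitleText': quoteDetails.css('a.authorOrTitle::text').get()
            }
            yield payload

        # Follow the pagination link, if any, and parse the next page the same way
        for next_page in response.css('a.next_page'):
            yield response.follow(next_page, self.parse)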
```

2. Update `pipelines.py` with the changes required to upsert into the MongoDB collection.

```python
from pymongo import MongoClient


class QuotesPipeline:
    def __init__(self, mongo_uri, mongo_db, collection_name):
        self.mongo_uri = mongo_uri
        self.mongo_db = mongo_db
        self.collection_name = collection_name

    @classmethod
    def from_crawler(cls, crawler):
        mongo_uri = crawler.settings.get('MONGO_URI')
        mongo_db = crawler.settings.get('MONGO_DATABASE')
        collection_name = crawler.settings.get('MONGO_COLLECTION')
        return cls(mongo_uri, mongo_db, collection_name)

    def open_spider(self, spider):
        self.client = MongoClient(self.mongo_uri)
        self.db = self.client[self.mongo_db]
        self.collection = self.db[self.collection_name]

    def close_spider(self, spider):
        self.client.close()

    def process_item(self, item, spider):
        doc = dict(item)
        # Match on the hash: update the document if found, insert it otherwise
        query = {'quoteTextHash': doc['quoteTextHash']}
        update = {'$set': doc}
        self.collection.update_one(query, update, upsert=True)
        return item
```

3. Run `scrapy crawl quotes` to crawl the data and populate the Mongo collection.
4. Run the mongo commands below to validate.

```mongodb
use quotes_db
db.quotes.countDocuments({})
db.quotes.find({}).pretty()
db.quotes.countDocuments({"pageUrl": "https://www.goodreads.com/quotes?page=90"})
db.quotes.find({"pageUrl": "https://www.goodreads.com/quotes?page=90"}).pretty()
```