# Scraping

refs:

* https://docs.scrapy.org/en/latest/topics/shell.html
* https://towardsdatascience.com/a-minimalist-end-to-end-scrapy-tutorial-part-i-11e350bcdec0
    * `git clone https://github.com/harrywang/scrapy-tutorial-starter.git`
* https://towardsdatascience.com/scrapy-this-is-how-to-successfully-login-with-ease-ea980e2c5901


## Scrapy shell

* It doubles as a regular IPython shell (when IPython is installed)
* Use it to test XPath or CSS expressions and check what data they extract (debugging)
* Example web sites:
    * http://quotes.toscrape.com/
    * https://www.beerwulf.com/en-gb/c/mixedbeercases (used in the session below)


```shell
# scrapy shell <url>
scrapy shell http://quotes.toscrape.com/

# file examples
# UNIX-style
scrapy shell ./path/to/file.html

# File URI
scrapy shell file:///absolute/path/to/file.html

```

How to use: 

```shell
scrapy shell

```

Run inside the shell 

```python

# inside the scrapy shell
# inspect the settings object
settings

# fetch the page
fetch('https://www.beerwulf.com/en-gb/c/mixedbeercases')


# check the response object
response
# > Out[2]: <200 https://www.beerwulf.com/en-gb/c/mixedbeercases>

response.status

from pprint import pprint
pprint(response.headers)

# inspect the HTML code
response.body  # raw bytes
response.text  # decoded str (replaces the deprecated body_as_unicode())

# extract the title using a CSS selector
response.css('title::text').get()  # get the first result
#> Out[12]: 'Mixed Beer Cases  | Discover our beers | Beerwulf'

# get all beers
response.css('h4::text').getall()  

# Out[13]:
#[' Search results',
# 'THE SUB  (2L)',
# 'BLADE  (8L)',
# 'Beer Tap Starter Packs',
# 'All Beer Taps',
# 'SUB Kegs',
# ...]

# inspect the stats collected by the crawler object
crawler.stats.get_stats()

```


## Project folder structure


* Create project


```shell
scrapy startproject tutorial # project-name 
```

Folder structure explained:

* `scrapy.cfg`: the project configuration file
* `tutorial/`: the project’s Python module; you’ll later import your code from here
* `tutorial/items.py`: the project’s items file
* `tutorial/pipelines.py`: the project’s pipelines file
* `tutorial/settings.py`: the project’s settings file
* `tutorial/spiders/`: a directory where you’ll later put your spiders


<img src="images/scrapy_project_folder_struture.png" style="float:left" width="300" align="right">

* Run the crawler


```shell
# run the spider named quotes
scrapy crawl quotes
```



## XPath and CSS selectors

* https://www.w3schools.com/xml/xpath_syntax.asp
* https://www.w3schools.com/cssref/css_selectors.asp


XML example

```html
<?xml version="1.0" encoding="UTF-8"?>

<bookstore>

<book>
  <title lang="en">Harry Potter</title>
  <price>29.99</price>
</book>

<book>
  <title lang="en">Learning XML</title>
  <price>39.95</price>
</book>

</bookstore>
```

* Selecting nodes
    * `bookstore`: selects all nodes named bookstore
    * `/bookstore`: selects the root element bookstore (a path starting with `/` is absolute)
    * `bookstore/book`: selects all book elements that are children of bookstore
    * `//book`: selects all book elements, no matter where they are in the document
    * `bookstore//book`: selects all book elements that are descendants of bookstore, at any depth

* Predicates (positioning)

    * `/bookstore/book[1]`: selects the first book child of bookstore
    * `/bookstore/book[last()]`: selects the last book child of bookstore
    * `/bookstore/book[position()<3]`: selects the first two book children of bookstore
    * `/bookstore/*`: selects all child elements of bookstore

* Selecting multiple paths

    * `//book/title | //book/price`: selects all title AND price elements of all book elements
    * `//title | //price`: selects all title AND price elements anywhere in the document
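To try these expressions without Scrapy, here is a rough sketch using the standard library's `xml.etree.ElementTree`. It only supports a subset of XPath, so the paths are written relative to the root element, and `position()<3` and the `|` union are emulated in plain Python. This is an approximation for experimenting; in the Scrapy shell the original expressions would be passed to `response.xpath(...)` instead.

```python
import xml.etree.ElementTree as ET

# Same bookstore document as above (XML declaration omitted)
BOOKSTORE = """
<bookstore>
<book>
  <title lang="en">Harry Potter</title>
  <price>29.99</price>
</book>
<book>
  <title lang="en">Learning XML</title>
  <price>39.95</price>
</book>
</bookstore>
"""

root = ET.fromstring(BOOKSTORE.strip())  # root is the <bookstore> element

# bookstore/book -> book children of the root
books = root.findall("book")

# //book -> book elements at any depth
all_books = root.findall(".//book")

# /bookstore/book[1] and /bookstore/book[last()]
first_title = root.find("book[1]/title").text
last_title = root.find("book[last()]/title").text

# /bookstore/book[position()<3] is not supported here; slice the list instead
first_two = books[:2]

# /bookstore/* -> all child elements
children = [child.tag for child in root.findall("*")]

# //title | //price: no union operator either, so concatenate two queries
# (unlike a real XPath union, the result is not in document order)
titles_and_prices = [
    element.text
    for element in root.findall(".//title") + root.findall(".//price")
]

print(first_title, last_title, children, titles_and_prices)
```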

## Dealing with login and credentials

refs:

* https://quotes.toscrape.com/login
* https://www.youtube.com/watch?v=I_vAGDZeg5Q



Steps:
* Inspect the login page
* Find the hidden form field that holds the token value

```html
<form action="/login" method="post" accept-charset="utf-8">
        <input type="hidden" name="csrf_token" value="DLIyfMtmuZjQJHSWCdhlsKiBPozwVbvREOqxFeUnNrTYAXGakpgc">
        <div class="row">
            <div class="form-group col-xs-3">
                <label for="username">Username</label>
                <input type="text" class="form-control" id="username" name="username">
            </div>
        </div>
        <div class="row">
            <div class="form-group col-xs-3">
                <label for="username">Password</label>
                <input type="password" class="form-control" id="password" name="password">
            </div>
        </div>
        <input type="submit" value="Login" class="btn btn-primary">
        
    </form>
```

<img src="images/scrap_with_login_page.png" style="float:left" width="1000" align="right">

* Use `FormRequest` from Scrapy to submit the login form.


```python

import scrapy
from scrapy.http import FormRequest
from scrapy.loader import ItemLoader

from tutorial.items import QuoteItem


class QuotesSpider(scrapy.Spider):

    name = 'quotes-login'

    start_urls = ['http://quotes.toscrape.com/login']

    def start_scrap(self, response):
        # callback for the page returned after the login request

        self.logger.info('========== Start scraping =========== ')

        if response.status != 200:
            self.logger.error("Login failed!")
            return

        for quote in response.css("div.quote"):

            text = quote.css('.text::text').get()
            author = quote.css('.author::text').get()
            tags = quote.css(".tag::text").getall()

            self.logger.info(f'text: {text}')
            self.logger.info(f'author: {author}')
            self.logger.info(f'tags: {tags}')
            self.logger.debug("-------------------------")

            # build the item from the same quote selector
            loader = ItemLoader(item=QuoteItem(), selector=quote)
            loader.add_css('quote_content', '.text::text')
            loader.add_css('tags', '.tag::text')

            # yield the item so it reaches the pipelines
            yield loader.load_item()

    def parse(self, response):
        # called first, with the login page from start_urls

        # get the token value (the token expiration should define the end of the session, I guess)
        token = response.css('input[name="csrf_token"]::attr(value)').get()

        self.logger.info(f"token: {token}")

        # fill in and submit the login form; the response goes to start_scrap
        return FormRequest.from_response(response, formdata={
            'csrf_token': token,
            'username': 'leandro@gmail.com',
            'password': 'dadisgood'
        }, callback=self.start_scrap)

```




## Dealing with cookies