# Scraping

refs:

* https://docs.scrapy.org/en/latest/topics/shell.html
* https://towardsdatascience.com/a-minimalist-end-to-end-scrapy-tutorial-part-i-11e350bcdec0
    * `git clone https://github.com/harrywang/scrapy-tutorial-starter.git`
* https://towardsdatascience.com/scrapy-this-is-how-to-successfully-login-with-ease-ea980e2c5901


In [16]:
import IPython
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

## Scrapy shell

* It is also a regular IPython shell (when IPython is installed)
* Use it to test XPath or CSS expressions and check what data they extract (debugging)
* Example web sites:
    * http://quotes.toscrape.com/
    * https://www.beerwulf.com/en-gb/c/mixedbeercases  <===


```shell
# scrapy shell <url>
scrapy shell http://quotes.toscrape.com/

# file examples
# UNIX-style
scrapy shell ./path/to/file.html

# File URI
scrapy shell file:///absolute/path/to/file.html

```

How to use: 

```shell
scrapy shell

```

Run inside the shell 

```python

# inside scrapy shell  <======
# inspect the settings object
settings

# fetch the page
fetch('https://www.beerwulf.com/en-gb/c/mixedbeercases')


# check the response object
response
# > Out[2]: <200 https://www.beerwulf.com/en-gb/c/mixedbeercases>

response.status

from pprint import pprint
pprint(response.headers)

# inspect the HTML code
response.body  # raw bytes
response.text  # decoded str (replaces the deprecated response.body_as_unicode())

# extract the title with a CSS selector
response.css('title::text').get()  # get the first result
#> Out[12]: 'Mixed Beer Cases  | Discover our beers | Beerwulf'

# get all beers
response.css('h4::text').getall()  

# Out[13]:
#[' Search results',
# 'THE SUB  (2L)',
# 'BLADE  (8L)',
# 'Beer Tap Starter Packs',
# 'All Beer Taps',
# 'SUB Kegs',
# ...]

# inspect the crawler object
crawler.stats.get_stats()

```


## Project folder structure


* Create project


```shell
scrapy startproject tutorial # project-name 
```

Folder explained:

* scrapy.cfg: the project configuration file
* tutorial/: the project’s python module, you’ll later import your code from here.
* tutorial/items.py: the project’s items file.
* tutorial/pipelines.py: the project’s pipelines file.
* tutorial/settings.py: the project’s settings file.
* tutorial/spiders/: a directory where you’ll later put your spiders.


<img src="images/scrapy_project_folder_struture.png" style="float:left" width="300" align="right">

* Run crawler


```shell
# run the spider called quotes
scrapy crawl quotes
```



## XPath and CSS selectors

* https://www.w3schools.com/xml/xpath_syntax.asp
* XPath cheat sheet: https://devhints.io/xpath
* https://www.w3schools.com/cssref/css_selectors.asp


XML example

```html
<?xml version="1.0" encoding="UTF-8"?>

<bookstore>

<book>
  <title lang="en">Harry Potter</title>
  <price>29.99</price>
</book>

<book>
  <title lang="en">Learning XML</title>
  <price>39.95</price>
</book>

</bookstore>
```

* Selecting nodes
    * bookstore: selects all nodes named bookstore
    * /bookstore: selects the root element bookstore (a path starting with / is absolute)
    * bookstore/book: selects all book elements that are children of bookstore
    * //book: selects all book elements no matter where they are in the document
    * bookstore//book: selects all book elements that are descendants of bookstore, at any depth

* Predicates (positioning)

    * /bookstore/book[1]: selects the first book (XPath positions start at 1)
    * /bookstore/book[last()]: selects the last book
    * /bookstore/book[position()<3]: selects the first two books
    * /bookstore/*: selects all child elements of bookstore (a wildcard, not a predicate)
    
* Selecting multiple paths

    * //book/title | //book/price: all title and price elements of all book elements
    * //title | //price: all title and price elements anywhere in the document

* Common tasks (XPath or CSS)

    * get href link
    * get text of node
    * get image 

In [66]:
# Selector is handy for testing extraction code without running a crawl
from scrapy import Selector

text = """
<a href="http://example.com">More Info<strong>click here</strong></a>
<img src="http://example.com/img.jpg" class='photo-large' />
"""

val = Selector(text=text)

val.xpath('//a//text()').getall()
val.xpath('//a/@href').get()
val.xpath('//img/@src').get()

['More Info', 'click here']

'http://example.com'

'http://example.com/img.jpg'

In [43]:
text = """
<div class="a-row a-spacing-base a-grid-vertical-align a-grid-bottom"><div class="a-column a-span6"><h2 class="a-size-large a-text-normal"> Contacts
</h2></div>
<div class="a-column a-span6 a-text-right a-span-last">
<a class="a-size- a-align- a-link- edit open_contribute_modal" href="https://contribute.imdb.com/contribute/name/nm0655945/name_representation?bus=pro&amp;return_url=https%3A%2F%2Fpro.imdb.com%2Freload_page&amp;site=www">Edit <span class="a-size-small a-color- glyphicons glyphicons-icon glyphicons-pencil"></span></a></div> </div>
<div id="contacts" class="a-section">
<div class="a-section a-spacing-medium header_section"><div class="a-row header contacts_header"><div class="a-column a-span12 a-span-last"><span class="a-size-base a-text-bold"> Talent Agent
</span></div></div>
<div class="a-section a-spacing-top-mini"> <ul class="a-unordered-list a-nostyle a-vertical">
<li><span class="a-list-item">
</span></li>
<li><span class="a-list-item">
<div class="a-row a-spacing-mini a-spacing-top-none"><div class="a-column a-span12"><span class="a-size-base-plus"><div class="a-fixed-right-grid"><div class="a-fixed-right-grid-inner" style="padding-right:0px"><div class="a-fixed-right-grid-col a-col-left" style="padding-right:0%;float:left;"><span class="aok-align-center"><span><a class="a-size- a-align- a-link-" href="https://pro.imdb.com/company/co0024357/?ref_=nm_cc_nm_1">Paradigm Talent Agency</a></span></span></div><div class="a-text-center a-fixed-right-grid-col a-col-right" style="width:0px;margin-right:-0px;float:left;"></div></div></div></span></div></div>
</span></li>
<li><span class="a-list-item">
<div class="a-section"><a class="a-size- a-align- a-link- clickable_share_link" target="_blank" rel="noopener" href="http://www.paradigmagency.com"> paradigmagency.com
</a></div>
</span></li>
<li><span class="a-list-item">
+1 310 288 8000
<span class="a-color-secondary"> phone
</span>
</span></li>
<li><span class="a-list-item">
+1 310 288 2000
<span class="a-color-secondary"> fax
</span>
</span></li>
<li><span class="a-list-item">
<div class="a-section a-spacing-top-mini"><div class="a-fixed-left-grid"><div class="a-fixed-left-grid-inner" style="padding-left:16px"><div class="a-fixed-left-grid-col a-col-left" style="width:16px;margin-left:-16px;float:left;"><span class="a-size- a-color-price glyphicons glyphicons-icon glyphicons-map-marker"></span></div><div class="a-fixed-left-grid-col a-col-right" style="padding-left:0%;float:left;"><span class="a-color-secondary"> 8942 Wilshire Boulevard
</span></div></div></div></div>
</span></li>
<li><span class="a-list-item">
<div class="a-section a-spacing-top-none"><div class="a-fixed-left-grid"><div class="a-fixed-left-grid-inner" style="padding-left:16px"><div class="a-fixed-left-grid-col a-col-left" style="width:16px;margin-left:-16px;float:left;"></div><div class="a-fixed-left-grid-col a-col-right" style="padding-left:0%;float:left;"><span class="a-color-secondary"> Beverly Hills,
CA
90211
</span></div></div></div></div>
</span></li>
<li><span class="a-list-item">
<div class="a-section a-spacing-top-none"><div class="a-fixed-left-grid"><div class="a-fixed-left-grid-inner" style="padding-left:16px"><div class="a-fixed-left-grid-col a-col-left" style="width:16px;margin-left:-16px;float:left;"></div><div class="a-fixed-left-grid-col a-col-right" style="padding-left:0%;float:left;"><span class="a-color-secondary"> USA
</span></div></div></div></div>
</span></li>
<li><span class="a-list-item">
<div class="a-section a-spacing-top-none"><div class="a-fixed-left-grid"><div class="a-fixed-left-grid-inner" style="padding-left:16px"><div class="a-fixed-left-grid-col a-col-left" style="width:16px;margin-left:-16px;float:left;"></div><div class="a-fixed-left-grid-col a-col-right" style="padding-left:0%;float:left;"><span class="a-color-secondary"><a class="a-size- a-align- a-link- clickable_share_link" target="_blank" rel="noopener" href="http://bing.com/maps/default.aspx?v=2&amp;where1=8942+Wilshire+Boulevard+Beverly+Hills+CA+90211+USA"> See map
</a> (bing.com)
</span></div></div></div></div>
</span></li>
</ul>
</div></div>
</div>
"""

In [53]:
def clean_contact_info(contact_info: str) -> str:
    # str.strip() with no arguments already removes surrounding newlines
    return contact_info.strip()


response = Selector(text=text)

divs = response.xpath("//div[@id='contacts']")

print(f"divs len: {len(divs)}")

# the leading "." keeps the search inside the contacts div;
# a bare "//" would search the whole document again
spans = divs.xpath(".//span[@class='a-color-secondary']")

for span in spans:

    label = span.xpath("./text()").get(default="")

    # the number is the first text node of the parent of *this* span;
    # an absolute "//span[...]/.." always returns the first match (the phone)
    if 'phone' in label:
        phone = clean_contact_info(span.xpath("../text()").get(default=""))
        print(f"phone: {phone}")

    if 'fax' in label:
        fax = clean_contact_info(span.xpath("../text()").get(default=""))
        print(f"fax: {fax}")



divs len: 1
phone: +1 310 288 8000
fax: +1 310 288 2000
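
`clean_contact_info` only trims the ends of a string. The address spans in the markup above (for example the one holding "Beverly Hills," "CA" and "90211" on separate lines) keep their inner newlines after that. A minimal sketch of a stricter cleaner follows; `normalize_whitespace` is a hypothetical helper name, not part of Scrapy.

```python
def normalize_whitespace(value: str) -> str:
    # str.split() with no arguments splits on any run of whitespace,
    # so joining the pieces collapses newlines and repeated spaces
    return " ".join(value.split())


print(normalize_whitespace(" Beverly Hills,\nCA\n90211\n"))  # Beverly Hills, CA 90211
print(normalize_whitespace("\n+1 310 288 8000\n"))           # +1 310 288 8000
```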


In [57]:
text = """
<div id="const_page_summary_section" class="a-fixed-right-grid-col a-col-left" style="padding-right:3.7%;float:left;">
<div class="a-row a-spacing-medium known_for"> Known for
<a class="a-size- a-align- a-link- ttip" href="https://pro.imdb.com/title/tt1360961/?ref_=nm_nav_ov_knownfor_1"><span>Caminho das Índias<span class="a-color-secondary"> (2009)</span></span></a>,
<a class="a-size- a-align- a-link- ttip" href="https://pro.imdb.com/title/tt6491190/?ref_=nm_nav_ov_knownfor_2"><span>Edge of Desire<span class="a-color-secondary"> (2017)</span></span></a>,
<a class="a-size- a-align- a-link- ttip" href="https://pro.imdb.com/title/tt0913160/?ref_=nm_nav_ov_knownfor_3"><span>Pé na Jaca<span class="a-color-secondary"> (2006-2007)</span></span></a>,
<a class="a-size- a-align- a-link- ttip" href="https://pro.imdb.com/title/tt3463250/?ref_=nm_nav_ov_knownfor_4"><span>Farewell<span class="a-color-secondary"> (2014)</span></span></a></div>
<div class="a-fixed-left-grid a-spacing-base"><div class="a-fixed-left-grid-inner" style="padding-left:140px"><div class="a-fixed-left-grid-col a-col-left" style="width:140px;margin-left:-140px;float:left;"><span class="a-color-secondary"> Details
</span></div><div class="a-fixed-left-grid-col a-col-right" style="padding-left:0%;float:left;"><span>
Mar
26,
1979<span class="a-color-secondary"> (age 42)
</span>
|
5' 7"
<span class="a-color-secondary"> (1.70m)
</span>
</span></div></div></div>
<div class="a-fixed-left-grid a-spacing-base"><div class="a-fixed-left-grid-inner" style="padding-left:140px"><div class="a-fixed-left-grid-col a-col-left" style="width:140px;margin-left:-140px;float:left;"><span class="a-color-secondary"> External links
</span></div><div class="a-fixed-left-grid-col a-col-right" style="padding-left:0%;float:left;"><span class="a-color-secondary">
<span class="a-declarative" data-action="open_standalone_page" data-open_standalone_page="{&quot;url&quot;: &quot;https://pro.imdb.com/name/nm0655945/websites?ref_=nm_med_sites&quot;}"><a class="a-size- a-align- a-link-" href="https://pro.imdb.com/name/nm0655945/websites?ref_=nm_med_sites"> 3 official web
sites &amp; 9 more
links</a></span></span></div></div></div>
<span class="a-text-bold"> Images
</span>
<span class="a-declarative" data-action="auto_scroll" data-auto_scroll="{&quot;tab&quot;: &quot;images&quot;, &quot;sort&quot;: &quot;DEFAULT&quot;, &quot;filter&quot;: &quot;DEFAULT&quot;, &quot;refmarker&quot;: &quot;nm_poster_photos&quot;}"><a class="a-size- a-align- a-link-" href="https://pro.imdb.com/name/nm0655945/images?ref_=nm_poster_photos"> (18)
</a></span>
<div id="featured_photo_set" class="a-section a-spacing-top-micro"><a data-source-const="nm0655945" class="a-size- a-align- a-link- photoSlideshow featured_image aok-inline-block" href="https://pro.imdb.com/name/nm0655945/#rmConst=rm1357798144"><img alt="Still of Juliana Paes and Elizângela in Edge of Desire and Episode #1.172" src="https://m.media-amazon.com/images/M/MV5BNjM5MjczMGEtMTBlNi00YzcxLWFiOGYtODcwMmNjMDM4MDc2XkEyXkFqcGdeQXVyNjI0NDkwNjE@._V1_UY160_CR45,0,160,160_.jpg" height="80" width="80" title="Still of Juliana Paes and Elizângela in Edge of Desire and Episode #1.172"></a><a data-source-const="nm0655945" class="a-size- a-align- a-link- photoSlideshow featured_image aok-inline-block" href="https://pro.imdb.com/name/nm0655945/#rmConst=rm1408129792"><img alt="Still of Juliana Paes in Edge of Desire and Episode #1.172" src="https://m.media-amazon.com/images/M/MV5BOTgwYTU0NGMtMTYwYi00MDgzLTk0ZWItZWYxNDI2ZWQ5OWJiXkEyXkFqcGdeQXVyNjI0NDkwNjE@._V1_UY160_CR45,0,160,160_.jpg" height="80" width="80" title="Still of Juliana Paes in Edge of Desire and Episode #1.172"></a><a data-source-const="nm0655945" class="a-size- a-align- a-link- photoSlideshow featured_image aok-inline-block" href="https://pro.imdb.com/name/nm0655945/#rmConst=rm1827560192"><img alt="Still of Juliana Paes in Edge of Desire and Episode #1.172" src="https://m.media-amazon.com/images/M/MV5BNjE5ZGQ5YjAtZTk0OC00MzU3LThjZTEtOWZlYzAxZTZhZTA3XkEyXkFqcGdeQXVyNjI0NDkwNjE@._V1_UY160_CR62,0,160,160_.jpg" height="80" width="80" title="Still of Juliana Paes in Edge of Desire and Episode #1.172"></a><a data-source-const="nm0655945" class="a-size- a-align- a-link- photoSlideshow featured_image aok-inline-block" href="https://pro.imdb.com/name/nm0655945/#rmConst=rm278384128"><img alt="Still of Juliana Paes and Nelson Xavier in Farewell" src="https://m.media-amazon.com/images/M/MV5BMjExMjA0NDU0MV5BMl5BanBnXkFtZTgwNzMyNDY3MTE@._V1_UY160_CR62,0,160,160_.jpg" height="80" width="80" title="Still of Juliana Paes and 
Nelson Xavier in Farewell"></a><a data-source-const="nm0655945" class="a-size- a-align- a-link- photoSlideshow featured_image aok-inline-block" href="https://pro.imdb.com/name/nm0655945/#rmConst=rm311938560"><img alt="Still of Juliana Paes and Nelson Xavier in Farewell" src="https://m.media-amazon.com/images/M/MV5BMjI0OTE2Njc3OV5BMl5BanBnXkFtZTgwNTMyNDY3MTE@._V1_UY160_CR62,0,160,160_.jpg" height="80" width="80" title="Still of Juliana Paes and Nelson Xavier in Farewell"></a><a data-source-const="nm0655945" class="a-size- a-align- a-link- photoSlideshow featured_image aok-inline-block" href="https://pro.imdb.com/name/nm0655945/#rmConst=rm295161344"><img alt="Still of Juliana Paes and Nelson Xavier in Farewell" src="https://m.media-amazon.com/images/M/MV5BMTk3OTgwMTA0NV5BMl5BanBnXkFtZTgwNjMyNDY3MTE@._V1_UY160_CR62,0,160,160_.jpg" height="80" width="80" title="Still of Juliana Paes and Nelson Xavier in Farewell"></a></div>
<div class="a-section a-spacing-top-medium">
Find more <span class="a-text-bold">actresses</span> with
<a class="a-size- a-align- a-link-" href="https://pro.imdb.com/discover?redirectUrl=%2Fdiscover%2Fpeople%3Fprofession%3Dactress&amp;ref_=nm_dsc_pe_ingress"><span class="a-declarative" data-action="log_event" data-log_event="{&quot;pageAction&quot;: &quot;nm-dsc-pe-ingress&quot;}">Discover People</span></a>.</div>
</div>
"""

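response = Selector(text=text)

divs = response.xpath("//div[@class='a-row a-spacing-medium known_for']")

print(f"divs len: {len(divs)}")

# the leading "." keeps the search inside the known_for div;
# "//a/span" would also match links elsewhere in the page
links = divs.xpath(".//a")

print(f"links len: {len(links)}")
for link in links:

    # the title is the first text node of the span inside the link
    film_title = link.xpath("./span/text()").get()
    film_link = link.xpath("./@href").get()

    print(f"{film_title} -> {film_link}")


divs len: 1
links len: 4
Caminho das Índias -> https://pro.imdb.com/title/tt1360961/?ref_=nm_nav_ov_knownfor_1
Edge of Desire -> https://pro.imdb.com/title/tt6491190/?ref_=nm_nav_ov_knownfor_2
Pé na Jaca -> https://pro.imdb.com/title/tt0913160/?ref_=nm_nav_ov_knownfor_3
Farewell -> https://pro.imdb.com/title/tt3463250/?ref_=nm_nav_ov_knownfor_4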


## Dealing with login and credentials

refs:

* https://quotes.toscrape.com/login
* https://www.youtube.com/watch?v=I_vAGDZeg5Q



Steps:
* Inspect the login page

    1. Log in once and, under the Network tab, find the token variable name under **FormData**
    
        * Look for the variable: **csrf_token: DLIyfMtmuZjQJHSWCdhlsKiBPozwVbvREOqxFeUnNrTYAXGakpgc**


<img src="images/inspecting_login_network_tab.png" style="float:left" width="1000" align="right">


* Inspect the login page under the Elements tab 
    * find the hidden input that holds the token value

```html
<form action="/login" method="post" accept-charset="utf-8">
        <input type="hidden" name="csrf_token" value="DLIyfMtmuZjQJHSWCdhlsKiBPozwVbvREOqxFeUnNrTYAXGakpgc">
        <div class="row">
            <div class="form-group col-xs-3">
                <label for="username">Username</label>
                <input type="text" class="form-control" id="username" name="username">
            </div>
        </div>
        <div class="row">
            <div class="form-group col-xs-3">
                <label for="username">Password</label>
                <input type="password" class="form-control" id="password" name="password">
            </div>
        </div>
        <input type="submit" value="Login" class="btn btn-primary">
        
    </form>
```

<img src="images/scrap_with_login_page.png" style="float:left" width="1000" align="right">

* Use `FormRequest` from scrapy.


```python

import scrapy
from scrapy.http import FormRequest
from scrapy.loader import ItemLoader

from tutorial.items import QuoteItem


class QuotesSpider(scrapy.Spider):

    name = 'quotes-login'

    start_urls = ['http://quotes.toscrape.com/login']

    def start_scrap(self, response):
        # callback that runs on the page returned after the login POST

        self.logger.info('========== Start scraping =========== ')

        if response.status != 200:

            self.logger.error("Login failed!")

            return

        quotes = response.css("div.quote")

        for quote in quotes:

            text = quote.css('.text::text').get()
            author = quote.css('.author::text').get()
            tags = quote.css(".tag::text").getall()

            self.logger.info(f'text: {text}')
            self.logger.info(f'author: {author}')
            self.logger.info(f'tags: {tags}')

            self.logger.debug("-------------------------")

            # build the item with an ItemLoader and hand it to the pipelines
            loader = ItemLoader(item=QuoteItem(), selector=quote)

            loader.add_css('quote_content', '.text::text')
            loader.add_css('tags', '.tag::text')

            yield loader.load_item()

    def parse(self, response):

        # read the CSRF token from the hidden input of the login form
        # (the token presumably expires together with the session)
        token = response.css('input[name="csrf_token"]::attr(value)').get()

        self.logger.info(f"token: {token}")

        # from_response() pre-fills the fields found in the page's form;
        # formdata overrides or adds the values given here
        return FormRequest.from_response(response, formdata={
            'csrf_token': token,
            'username': 'leandro@gmail.com',
            'password': 'dadisgood'
        }, callback=self.start_scrap)

```




## How to ignore robots.txt for Scrapy spiders

Website owners use a robots.txt file to tell web spiders such as Googlebot what can and can't be crawled on their websites. The file resides in the root directory of a website and contains rules that allow or disallow paths for given user agents.
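
To see how such rules are read, here is a minimal sketch that uses the standard library's `urllib.robotparser` instead of Scrapy. The rules, the `MyBot` and `BadBot` agent names and the example.com URLs are all made up for illustration.

```python
from urllib.robotparser import RobotFileParser

# made-up rules: BadBot is banned everywhere,
# every other agent is only kept out of /private/
rules = """
User-agent: BadBot
Disallow: /

User-agent: *
Disallow: /private/
Allow: /
""".strip().splitlines()

parser = RobotFileParser()
parser.parse(rules)

print(parser.can_fetch("MyBot", "https://example.com/quotes"))        # True
print(parser.can_fetch("MyBot", "https://example.com/private/page"))  # False
print(parser.can_fetch("BadBot", "https://example.com/quotes"))       # False
```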

Two ways to ignore it:


1. Override the ROBOTSTXT_OBEY setting on the command line

```sh

# when running a crawler
scrapy crawl --set=ROBOTSTXT_OBEY='False' quotes

# when starting the scrapy shell
scrapy shell  --set="ROBOTSTXT_OBEY=False"

```

2. Change the project configuration: edit settings.py


```python
# Obey robots.txt rules
ROBOTSTXT_OBEY = False
```


## Passing user parameters to the crawler


```sh
scrapy crawl imdb --set=ROBOTSTXT_OBEY='False' -a actor='juliana paes'
```


```python 
import scrapy


class ImdbSpider(scrapy.Spider):
    name = 'imdb'

    start_urls = ['https://secure.imdb.com/ap/signin?openid.pape.max_auth_age=0&openid.return_to=https%3A%2F%2Fwww.imdb.com%2Fregistration%2Fap-signin-handler%2Fimdb_pro_us&openid.identity=http%3A%2F%2Fspecs.openid.net%2Fauth%2F2.0%2Fidentifier_select&openid.assoc_handle=imdb_pro_us&openid.mode=checkid_setup&siteState=eyJvcGVuaWQuYXNzb2NfaGFuZGxlIjoiaW1kYl9wcm9fdXMiLCJyZWRpcmVjdFRvIjoiaHR0cHM6Ly9wcm8uaW1kYi5jb20vIn0&openid.claimed_id=http%3A%2F%2Fspecs.openid.net%2Fauth%2F2.0%2Fidentifier_select&openid.ns=http%3A%2F%2Fspecs.openid.net%2Fauth%2F2.0']

    def __init__(self, actor='', **kwargs):

        super().__init__(**kwargs)  # python3

        # values passed with -a arrive as strings; store it for later callbacks
        self.actor = actor

        self.logger.info(f'init spider for actor: {actor}')
```


## Dealing with cookies

In [42]:
def after_login(self, response):

    if response.status != 200:
        
        self.logger.error("Login failed!")
        
        return 

    cookie = response.headers.getlist('Set-Cookie')
    self.logger.info(f"cookies: {cookie}")
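
The `Set-Cookie` values come back as a list of bytes, one entry per header. A minimal sketch of turning them into a name/value dict with the standard library's `http.cookies` follows; the header values below are made up for illustration.

```python
from http.cookies import SimpleCookie

# made-up header values, in the bytes form returned by getlist('Set-Cookie')
raw_cookies = [
    b"session=abc123; Path=/; HttpOnly",
    b"theme=dark; Path=/",
]

cookies = {}
for raw in raw_cookies:
    parsed = SimpleCookie()
    parsed.load(raw.decode("utf-8"))
    # each header carries one cookie; attributes such as Path are dropped here
    for name, morsel in parsed.items():
        cookies[name] = morsel.value

print(cookies)  # {'session': 'abc123', 'theme': 'dark'}
```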

## Debug and develop

* Use `Selector`

In [30]:
val = Selector(text = """
            <div id="contacts" class="a-section">
                <div class="a-column a-span12 a-span-last"><span class="a-size-base a-text-bold"> Talent Agent</span></div>
                <span class="a-list-item"></span>
                <span class="a-list-item">+1 310 288 8000<span class="a-color-secondary"> phone</span>
                </span>
            </div>
          """)

divs = val.xpath("//div[@id='contacts']")
spans = divs.xpath("//span[@class='a-color-secondary']")

print("extract all text")
spans.xpath("//text()").getall()

print("extract current text")
spans.xpath("./text()").getall()

print("extract parent text")
spans.xpath("//span[@class='a-color-secondary']/../text()").getall()


extract all text


['\n                ',
 ' Talent Agent',
 '\n                ',
 '\n                ',
 '+1 310 288 8000',
 ' phone',
 '\n                ',
 '\n            ']

extract current text


[' phone']

extract parent text


['+1 310 288 8000', '\n                ']

* Open the requested page in the browser

In [41]:
from scrapy.utils.response import open_in_browser


def parse_something(self, response):
    
    open_in_browser(response)
    self.logger.error(response.body)
    
    return