# Scraping

refs:

* https://docs.scrapy.org/en/latest/topics/shell.html
* https://towardsdatascience.com/a-minimalist-end-to-end-scrapy-tutorial-part-i-11e350bcdec0
    * `git clone https://github.com/harrywang/scrapy-tutorial-starter.git`
* https://towardsdatascience.com/scrapy-this-is-how-to-successfully-login-with-ease-ea980e2c5901
* https://github.com/leandroohf/scrapping-tutorial  <=== **MY REPO**
    * `git clone https://github.com/leandroohf/scrapping-tutorial`


In [2]:
import IPython
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

## Architecture




<img src="images/scrapy_architecture.png" style="float:left" width="800" align="right">

## Scrapy shell

* It is also a regular IPython shell
* Use it to test XPath or CSS expressions and see what data they extract (debugging)
* Example websites:
    * http://quotes.toscrape.com/
    * https://www.beerwulf.com/en-gb/c/mixedbeercases  <===


```shell
# scrapy shell <url>
scrapy shell http://quotes.toscrape.com/

# file examples
# UNIX-style
scrapy shell ./path/to/file.html

# File URI
scrapy shell file:///absolute/path/to/file.html

```

How to use `scrapy shell`:

Run the following inside the shell:

```python

# inside scrapy shell  <======
# inspect the settings object
settings

# fetch the page
fetch('https://www.beerwulf.com/en-gb/c/mixedbeercases')

# check response object
response
# > Out[2]: <200 https://www.beerwulf.com/en-gb/c/mixedbeercases>

response.status

from pprint import pprint
pprint(response.headers)

# inspect html code
response.body  # raw bytes
response.text  # decoded str (replaces the deprecated response.body_as_unicode())

# extract the title using a CSS selector
response.css('title::text').get()  # get the first result
#> Out[12]: 'Mixed Beer Cases  | Discover our beers | Beerwulf'

# get all beers
response.css('h4::text').getall()  

# Out[13]:
#[' Search results',
# 'THE SUB  (2L)',
# 'BLADE  (8L)',
# 'Beer Tap Starter Packs',
# 'All Beer Taps',
# 'SUB Kegs',
# ...]

# inspect the object crawler
crawler.stats.get_stats()

```


## Project folder structure


* Create project


```shell
scrapy startproject tutorial # project-name 
```

Folder explained:

* scrapy.cfg: the project configuration file
* tutorial/: the project’s python module, you’ll later import your code from here.
* tutorial/items.py: the project’s items file.
* tutorial/pipelines.py: the project’s pipelines file.
* tutorial/settings.py: the project’s settings file.
* tutorial/spiders/: a directory where you’ll later put your spiders.


<img src="images/scrapy_project_folder_struture.png" style="float:left" width="300" align="right">

### Run crawler


```shell
# run spider
scrapy crawl quotes

# save output as JSON
scrapy crawl quotes -o quotes.json
```


### Access settings

```python
import scrapy

class MySpider(scrapy.Spider):
    name = 'myspider'
    start_urls = ['http://example.com']

    def parse(self, response):
        print(f"Existing settings: {self.settings.attributes.keys()}")

```


## XPath and CSS selectors

* https://www.w3schools.com/xml/xpath_syntax.asp
* XPath cheat sheet: https://devhints.io/xpath
* https://www.w3schools.com/cssref/css_selectors.asp


XML example

```xml
<?xml version="1.0" encoding="UTF-8"?>

<bookstore>

<book>
  <title lang="en">Harry Potter</title>
  <price>29.99</price>
</book>

<book>
  <title lang="en">Learning XML</title>
  <price>39.95</price>
</book>

</bookstore>
```

* Selecting nodes
    * bookstore: selects all nodes named bookstore
    * /bookstore: selects the root element bookstore (a leading / means an absolute path from the root)
    * bookstore/book: selects all book elements that are children of bookstore
    * //book: selects all book elements no matter where they are in the document
    * bookstore//book: selects all book elements that are descendants of bookstore, at any depth

* Predicates (positioning)

    * /bookstore/book[1]: selects the first book
    * /bookstore/book[last()]: selects the last book
    * /bookstore/book[position()<3]: selects the first two books
    * /bookstore/*: selects all child elements of bookstore (a wildcard, not a predicate)
    
* Selecting multiple paths

    * //book/title | //book/price: selects all title AND price elements of all book elements
    * //title | //price: selects all title AND price elements anywhere in the document

* Common tasks (XPath or CSS)

    * get the href of a link
    * get the text of a node
    * get the src of an image

In [3]:
# Selector is handy for testing Scrapy extraction code
from scrapy import Selector

text = """
<a href="http://example.com">More Info<strong>click here</strong></a>
<img src="http://example.com/img.jpg" class='photo-large' />
<a name='Sport' class='a-size'>Basketball </a>
<a name='Proffession' class='a-size'>Data Scientis </a>
"""

val = Selector(text = text)

# text
val.xpath('//a//text()').getall()

# link
val.xpath('//a/@href').get()

# image 
val.xpath('//img/@src').get()


# filter 
val.xpath("//a[@class='a-size']/text()").getall()
val.xpath('//a[has-class("a-size")]/text()').getall()

# filter multiple criteria
val.xpath("//a[@name='Sport' and @class='a-size']/text()").getall()

['More Info', 'click here', 'Basketball ', 'Data Scientis ']

'http://example.com'

'http://example.com/img.jpg'

['Basketball ', 'Data Scientis ']

['Basketball ', 'Data Scientis ']

['Basketball ']
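
The bookstore expressions listed earlier in this section can be tried without Scrapy. The sketch below uses the standard library's `xml.etree.ElementTree`, which is a swap-in and supports only a subset of XPath (no `position()<3` and no `|` union), so it covers the child, descendant, index and attribute forms only. Paths are relative to the root element `bookstore`.

```python
import xml.etree.ElementTree as ET

xml_doc = """<bookstore>
<book>
  <title lang="en">Harry Potter</title>
  <price>29.99</price>
</book>
<book>
  <title lang="en">Learning XML</title>
  <price>39.95</price>
</book>
</bookstore>"""

root = ET.fromstring(xml_doc)  # root is the <bookstore> element

# book[1] and book[last()]: positional predicates (XPath positions start at 1)
first_title = root.find("book[1]/title").text
last_title = root.find("book[last()]/title").text

# .//title: every title below the current node, at any depth
all_titles = [t.text for t in root.findall(".//title")]

# book/price: price elements that are children of book children
prices = [float(p.text) for p in root.findall("book/price")]

# attribute predicate, same idea as //a[@class='a-size'] in the cell above
en_titles = [t.text for t in root.findall(".//title[@lang='en']")]

print(first_title, last_title, all_titles, prices, en_titles)
```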

## ItemLoader


https://docs.scrapy.org/en/latest/topics/items.html

This makes it easy to save items to JSON or a database.

Scrapy supports the following types of items, via the itemadapter library: dictionaries, Item objects, dataclass objects, and attrs objects.

Steps:

1. define the QuoteItem class in items.py
1. define the preprocessing functions in items.py
1. populate the item in the spider with an `ItemLoader`


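Ex: items.py
```python
# file: items.py
# Define here the models for your scraped items
#
# See documentation in:
# https://docs.scrapy.org/en/latest/topics/items.html

import scrapy
# Recent Scrapy versions ship the processors in the itemloaders package;
# on older versions use: from scrapy.loader.processors import MapCompose, TakeFirst
from itemloaders.processors import MapCompose, TakeFirst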
from scrapy.item import Item, Field
from datetime import datetime

## =============== Preprocessing functions to treat fields =============== 
# remove the Unicode quotation
def remove_quotes(text):
    # strip the unicode quotes
    text = text.strip(u'\u201c'u'\u201d')
    return text

def convert_date(text):
    # convert string March 14, 1879 to Python date
    return datetime.strptime(text, '%B %d, %Y')


def parse_location(text):
    # parse location "in Ulm, Germany"
    # this simply remove "in ", you can further parse city, state, country, etc.
    return text[3:]

# Ex: scrapy.Item
class QuoteItem(scrapy.Item):
    # define the fields for your item here like:
 
    quote_content = Field(
        input_processor=MapCompose(remove_quotes),
        # TakeFirst return the first value not the whole list
        output_processor=TakeFirst()
        )

    author_name = Field(
        input_processor=MapCompose(str.strip),
        output_processor=TakeFirst()
        )
    author_birthday = Field(
        input_processor=MapCompose(convert_date),
        output_processor=TakeFirst()
    )
    author_bornlocation = Field(
        input_processor=MapCompose(parse_location),
        output_processor=TakeFirst()
    )
    author_bio = Field(
        input_processor=MapCompose(str.strip),
        output_processor=TakeFirst()
        )

    tags = Field()
    
# Ex: dataclass. Has some benefits (typed fields, default values)
from typing import List
from dataclasses import dataclass, field
from dataclass_type_validator import dataclass_validate

@dataclass_validate()
@dataclass()
class ProductItem():
    public_name: str
    product_type: str
    sales_page: str
    tags: List[str] = field(default_factory=list)
```
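
To see what the preprocessing functions do before they are wired into a loader, they can be called directly on sample strings. This is a rough sketch: the list comprehensions stand in for `MapCompose` (apply a function to every extracted value) and the hypothetical helper `take_first` stands in for `TakeFirst` (return the first non-empty value). The sample strings are made up in the shape the functions' comments describe.

```python
from datetime import datetime


def remove_quotes(text):
    # strip the Unicode left/right double quotation marks
    return text.strip("\u201c\u201d")


def convert_date(text):
    # convert a string such as "March 14, 1879" to a datetime
    return datetime.strptime(text, "%B %d, %Y")


def parse_location(text):
    # drop the leading "in " from "in Ulm, Germany"
    return text[3:]


def take_first(values):
    # hypothetical stand-in for TakeFirst: first value that is not None or ''
    for value in values:
        if value is not None and value != "":
            return value
    return None


# Selectors always return lists of strings, hence the list inputs
quote_content = take_first([remove_quotes(v) for v in ["\u201cSample quote text\u201d"]])
author_name = take_first([v.strip() for v in ["  Albert Einstein \n"]])
author_birthday = take_first([convert_date(v) for v in ["March 14, 1879"]])
author_bornlocation = take_first([parse_location(v) for v in ["in Ulm, Germany"]])

print(quote_content, author_name, author_birthday, author_bornlocation)
```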


To load the scraped data into the item, use an `ItemLoader` in the spider:

```python
from tutorial.items import QuoteItem # Load the item definition you wrote in items.py
from scrapy.loader import ItemLoader # Load the ItemLoader 

## in the middle of the code


def parse_author(self, response):

    # yield {
    #     'author_name': response.css('.author-title::text').get(),
    #     'author_birthday': response.css('.author-born-date::text').get(),
    #     'author_bornlocation': response.css('.author-born-location::text').get(),
    #     'author_bio': response.css('.author-description::text').get(),
    # }

    quote_item = response.meta['quote_item']

    loader = ItemLoader(item=quote_item, response=response)
    loader.add_css('author_name', '.author-title::text')
    loader.add_css('author_birthday', '.author-born-date::text')
    loader.add_css('author_bornlocation', '.author-born-location::text')
    loader.add_css('author_bio', '.author-description::text')
    
    yield loader.load_item()

```
    
Data class example

https://stackoverflow.com/questions/67360406/is-there-any-example-about-how-to-use-dataclass-and-scrapy-items
```python
def parse(self, response):

    # code to extract the fields (public_name, product_type, sales_page, tags)

    product = ProductItem(
        public_name=public_name,
        product_type=product_type,
        sales_page=sales_page,
        tags=tags,
    )

    # It is important to export to a dict so the pipeline can use it with an adapter
    yield {"product": product.__dict__}

```
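
A small sketch of the dict export step on its own, without the third-party validator decorator. The field values are made up. `dataclasses.asdict` is shown next to `__dict__` as an alternative: it builds a new dict and copies list fields, while `__dict__` exposes the instance's own attributes.

```python
from dataclasses import asdict, dataclass, field
from typing import List


@dataclass
class ProductItem:
    public_name: str
    product_type: str
    sales_page: str
    tags: List[str] = field(default_factory=list)


product = ProductItem(
    public_name="Sample Case",                      # hypothetical values
    product_type="mixed case",
    sales_page="https://example.com/sample-case",
    tags=["beer"],
)

as_dict = asdict(product)        # new dict, list fields are copied
raw_dict = product.__dict__      # the instance's own attribute dict

same_content = as_dict == raw_dict
tags_copied = as_dict["tags"] is not product.tags
tags_shared = raw_dict["tags"] is product.tags

print(as_dict)
```

Either form gives the pipeline a plain dict. The copy made by `asdict` avoids later mutations of the dataclass leaking into an item that was already yielded.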

## Save items in a database

Steps:
    
1. Edit settings.py 
    1. Define how to connect to the db in settings.py
    1. Add the item pipeline in settings.py
1. model.py
    1. ORM: create a data model (the tables in the db)
1. pipelines.py
    1. develop a pipeline to save the items to a database. 

Pipelines:
* Each item returned by the spider is sent to the Item Pipelines (see Architecture).
* You enable pipelines in settings.py
* Create the new pipeline that saves to the db by editing the file: pipelines.py

**settings.py**
```python

## =========== db connections (keep only one CONNECTION_STRING)
# Ex: sqlite
CONNECTION_STRING = 'sqlite:///scrapy_quotes.db'

# Example: MySQL (as written, this assignment overrides the sqlite one above)
CONNECTION_STRING = "{drivername}://{user}:{passwd}@{host}:{port}/{db_name}?charset=utf8".format(
     drivername="mysql",
     user="harrywang",
     passwd="tutorial",
     host="localhost",
     port="3306",
     db_name="scrapy_quotes",
)


# ================ define new pipeline 
# Configure item pipelines
# See https://docs.scrapy.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {
   # 'tutorial.pipelines.TutorialPipeline': PRIORITY,  # lower numbers run first
   'tutorial.pipelines.DuplicatesPipeline': 200,  # this runs FIRST
   'tutorial.pipelines.SaveQuotesPipeline': 300,
}
```

**model.py**

```python
from sqlalchemy import create_engine, Column, Table, ForeignKey, MetaData
from sqlalchemy.orm import relationship
from sqlalchemy.ext.declarative import declarative_base
from sqlalchemy import (
    Integer, String, Date, DateTime, Float, Boolean, Text)
from scrapy.utils.project import get_project_settings

Base = declarative_base()


def db_connect():
    """
    Performs database connection using database settings from settings.py.
    Returns sqlalchemy engine instance
    """
    return create_engine(get_project_settings().get("CONNECTION_STRING"))


def create_table(engine):
    Base.metadata.create_all(engine)


# Association Table for Many-to-Many relationship between Quote and Tag
# https://docs.sqlalchemy.org/en/13/orm/basic_relationships.html#many-to-many
quote_tag = Table('quote_tag', Base.metadata,
    Column('quote_id', Integer, ForeignKey('quote.id')),
    Column('tag_id', Integer, ForeignKey('tag.id'))
)


class Quote(Base):  # define table Quote
    __tablename__ = "quote"

    id = Column(Integer, primary_key=True)
    quote_content = Column('quote_content', Text())
    author_id = Column(Integer, ForeignKey('author.id'))  # Many quotes to one author
    tags = relationship('Tag', secondary='quote_tag',
        lazy='dynamic', backref="quote")  # M-to-M for quote and tag


class Author(Base): # define table Author
    __tablename__ = "author"

    id = Column(Integer, primary_key=True)
    name = Column('name', String(50), unique=True)
    birthday = Column('birthday', DateTime)
    bornlocation = Column('bornlocation', String(150))
    bio = Column('bio', Text())
    quotes = relationship('Quote', backref='author')  # One author to many Quotes


class Tag(Base): # define table Tag
    __tablename__ = "tag"

    id = Column(Integer, primary_key=True)
    name = Column('name', String(30), unique=True)
    quotes = relationship('Quote', secondary='quote_tag',
        lazy='dynamic', backref="tag")  # M-to-M for quote and tag
```
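
To make the table layout concrete, here is a rough sketch of the same four tables in plain `sqlite3` from the standard library, used here instead of SQLAlchemy so the example runs on its own. Only a subset of the columns is reproduced and the inserted rows are made up. The join at the end shows how the `quote_tag` association table links one quote to many tags.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # throwaway in-memory database
conn.executescript(
    """
    CREATE TABLE author (id INTEGER PRIMARY KEY, name TEXT UNIQUE);
    CREATE TABLE quote (
        id INTEGER PRIMARY KEY,
        quote_content TEXT,
        author_id INTEGER REFERENCES author(id)   -- many quotes to one author
    );
    CREATE TABLE tag (id INTEGER PRIMARY KEY, name TEXT UNIQUE);
    CREATE TABLE quote_tag (                      -- many-to-many association
        quote_id INTEGER REFERENCES quote(id),
        tag_id INTEGER REFERENCES tag(id)
    );
    """
)

# hypothetical sample rows
conn.execute("INSERT INTO author (id, name) VALUES (1, 'Albert Einstein')")
conn.execute(
    "INSERT INTO quote (id, quote_content, author_id) VALUES (1, 'Sample quote text', 1)"
)
conn.executemany("INSERT INTO tag (id, name) VALUES (?, ?)", [(1, "change"), (2, "thinking")])
conn.executemany("INSERT INTO quote_tag (quote_id, tag_id) VALUES (?, ?)", [(1, 1), (1, 2)])

rows = conn.execute(
    """
    SELECT author.name, tag.name
    FROM quote
    JOIN author ON author.id = quote.author_id
    JOIN quote_tag ON quote_tag.quote_id = quote.id
    JOIN tag ON tag.id = quote_tag.tag_id
    ORDER BY tag.name
    """
).fetchall()
conn.close()

print(rows)
```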

**pipelines.py**
```python
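from sqlalchemy.orm import sessionmaker
# assumes model.py (shown above) sits inside the tutorial package
from tutorial.model import Quote, Author, Tag, db_connect, create_table


class SaveQuotesPipeline(object):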
    def __init__(self):
        """
        Initializes database connection and sessionmaker
        Creates tables
        """
        engine = db_connect()
        create_table(engine)
        self.Session = sessionmaker(bind=engine)


    def process_item(self, item, spider):
        """Save quotes in the database
        This method is called for every item pipeline component
        """
        session = self.Session()
        quote = Quote()
        author = Author()
        tag = Tag()
        author.name = item["author_name"]
        author.birthday = item["author_birthday"]
        author.bornlocation = item["author_bornlocation"]
        author.bio = item["author_bio"]
        quote.quote_content = item["quote_content"]

        # check whether the author exists
        exist_author = session.query(Author).filter_by(name = author.name).first()
        if exist_author is not None:  # the current author exists
            quote.author = exist_author
        else:
            quote.author = author

        # check whether the current quote has tags or not
        if "tags" in item:
            for tag_name in item["tags"]:
                tag = Tag(name=tag_name)
                # check whether the current tag already exists in the database
                exist_tag = session.query(Tag).filter_by(name = tag.name).first()
                if exist_tag is not None:  # the current tag exists
                    tag = exist_tag
                quote.tags.append(tag)

        try:
            session.add(quote)
            session.commit()

        except:
            session.rollback()
            raise

        finally:
            session.close()

        return item
```


## Dealing with login and credentials

refs:

* https://quotes.toscrape.com/login
* https://www.youtube.com/watch?v=I_vAGDZeg5Q



Steps:
* Inspect the login page

    1. Log in once and, under the Network tab, find the token variable name under **FormData**
    
        * Look for the variable: **csrf_token: DLIyfMtmuZjQJHSWCdhlsKiBPozwVbvREOqxFeUnNrTYAXGakpgc**


<img src="images/inspecting_login_network_tab.png" style="float:left" width="1000" align="right">


* Inspect the login page under the Elements tab
    * find the hidden input that holds the token value

```html
<form action="/login" method="post" accept-charset="utf-8">
        <input type="hidden" name="csrf_token" value="DLIyfMtmuZjQJHSWCdhlsKiBPozwVbvREOqxFeUnNrTYAXGakpgc">
        <div class="row">
            <div class="form-group col-xs-3">
                <label for="username">Username</label>
                <input type="text" class="form-control" id="username" name="username">
            </div>
        </div>
        <div class="row">
            <div class="form-group col-xs-3">
                <label for="username">Password</label>
                <input type="password" class="form-control" id="password" name="password">
            </div>
        </div>
        <input type="submit" value="Login" class="btn btn-primary">
        
    </form>
```

<img src="images/scrap_with_login_page.png" style="float:left" width="1000" align="right">

* Use `FormRequest` from scrapy.


```python

import scrapy
from scrapy import FormRequest
from scrapy.loader import ItemLoader
from tutorial.items import QuoteItem

class QuotesSpider(scrapy.Spider):

    name = 'quotes-login'

    start_urls = ['http://quotes.toscrape.com/login']

    def start_scrap(self, response):

        self.logger.info('========== Start scraping =========== ')

        if response.status != 200:
            self.logger.error("Login failed!")
            return

        for quote in response.css("div.quote"):

            text = quote.css('.text::text').get()
            author = quote.css('.author::text').get()
            tags = quote.css(".tag::text").getall()

            # load the same fields into a QuoteItem through the ItemLoader
            loader = ItemLoader(item=QuoteItem(), selector=quote)
            loader.add_css('quote_content', '.text::text')
            loader.add_css('tags', '.tag::text')
            quote_item = loader.load_item()

            self.logger.info(f'text: {text}')
            self.logger.info(f'author: {author}')
            self.logger.info(f'tags: {tags}')

            self.logger.debug("-------------------------")

    def parse(self, response):
    
        # get the token value (the token's expiration probably defines the end of the session)
        token = response.css('form input::attr(value)').get()
        
        self.logger.info(f"token: {token}")

        return FormRequest.from_response(response,formdata={
            'csrf_token': token, 
            'username': 'leandro@gmail.com',
            'password': 'dadisgood'
        }, callback=self.start_scrap)

```




## How to ignore robots.txt for Scrapy spiders

Website owners tell web spiders such as Googlebot what can and can't be crawled on their websites with a robots.txt file. The file resides in the root directory of a website and contains rules that allow or disallow paths for given user agents.
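
As a rough illustration of what such rules look like and how a crawler reads them, the sketch below feeds a made-up robots.txt to the standard library's `urllib.robotparser`. The file contents, the bot name `my-bot` and the example.com URLs are all assumptions for the example. Scrapy performs a similar check internally when `ROBOTSTXT_OBEY` is enabled.

```python
from urllib.robotparser import RobotFileParser

# hypothetical robots.txt contents
sample_robots_txt = """\
User-agent: *
Disallow: /private/
Allow: /
"""

parser = RobotFileParser()
parser.parse(sample_robots_txt.splitlines())  # parse from text, no network access

# the first matching rule decides: /private/ is disallowed, everything else allowed
allowed_home = parser.can_fetch("my-bot", "http://example.com/")
allowed_private = parser.can_fetch("my-bot", "http://example.com/private/page.html")

print(allowed_home, allowed_private)
```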

Steps for ignoring:


1. Ignore the robots.txt rules from the CLI

```sh

# when calling crawler
scrapy crawl --set=ROBOTSTXT_OBEY='False' quotes

# when start scrapy shell
scrapy shell  --set="ROBOTSTXT_OBEY=False"

```

1. Ignore them in the configuration file: edit settings.py


```python
# Obey robots.txt rules
ROBOTSTXT_OBEY = False
```


## Passing user parameters to the crawler


```sh
scrapy crawl imdb --set=ROBOTSTXT_OBEY='False' -a actor='juliana paes'
```


```python 
class ImdbSpider(scrapy.Spider):
    name = 'imdb'

    start_urls = ['https://secure.imdb.com/ap/signin?openid.pape.max_auth_age=0&openid.return_to=https%3A%2F%2Fwww.imdb.com%2Fregistration%2Fap-signin-handler%2Fimdb_pro_us&openid.identity=http%3A%2F%2Fspecs.openid.net%2Fauth%2F2.0%2Fidentifier_select&openid.assoc_handle=imdb_pro_us&openid.mode=checkid_setup&siteState=eyJvcGVuaWQuYXNzb2NfaGFuZGxlIjoiaW1kYl9wcm9fdXMiLCJyZWRpcmVjdFRvIjoiaHR0cHM6Ly9wcm8uaW1kYi5jb20vIn0&openid.claimed_id=http%3A%2F%2Fspecs.openid.net%2Fauth%2F2.0%2Fidentifier_select&openid.ns=http%3A%2F%2Fspecs.openid.net%2Fauth%2F2.0']

    def __init__(self, actor='', **kwargs):

        super().__init__(**kwargs)  # python3

        self.actor = actor  # value passed on the CLI with -a actor='...'
        self.logger.info(f'init spider for actor: {actor}')
```


## Dealing with cookies

* Example without scrapy
    * can be useful for debug while working with scrapy

In [6]:
import requests

params = {'username': 'Ryan', 'password': 'password'}
r = requests.post("http://pythonscraping.com/pages/cookies/welcome.php", params)

print("Cookie is set to:")
print(r.cookies.get_dict())
print("-----------")
print("Going to profile page...")

r = requests.get("http://pythonscraping.com/pages/cookies/profile.php",
cookies=r.cookies)
print(r.text)  # <== add this to a selector and extract using scrapy


Cookie is set to:
{'loggedin': '1', 'username': 'Ryan'}
-----------
Going to profile page...
Hey Ryan! Looks like you're still logged into the site!


* Some websites change cookies all the time. You can use a session object to manage the cookies for you
    * can be useful for debug while working with scrapy

In [7]:
import requests

session = requests.Session()
params = {'username': 'username', 'password': 'password'}
s = session.post("http://pythonscraping.com/pages/cookies/welcome.php", params)

print("Cookie is set to:")
print(s.cookies.get_dict())
print("-----------")
print("Going to profile page...")
s = session.get("http://pythonscraping.com/pages/cookies/profile.php")
print(s.text)

Cookie is set to:
{'loggedin': '1', 'username': 'username'}
-----------
Going to profile page...
Hey username! Looks like you're still logged into the site!



* Using scrapy
    * If you need to call Request yourself, then it is your responsibility to manage cookies


Make sure the user agent is a recent one. Old user agents do not support multiple cookies (the modern way to do it).
    
```python

    def __init__(self, actor_page='', **kwargs):

        super().__init__(**kwargs)  # python3

        self._credentials = {
            'email': 'email',
            'pwd': '123'
        }

        self.header = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64; rv:54.0) Gecko/20100101 Firefox/54.0'}  #Setting up browser user agent
        #self.header =  {'User-Agent':'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.164 Safari/537.36'}

```

* For pages with a login you have to manage the cookies yourself when making requests


```python
    # requires at module level: from scrapy import Request
    def start_requests(self):
        
        # https://docs.scrapy.org/en/latest/_modules/scrapy/http/request.html
        request = Request(
            url=self.start_urls[0],
            headers=self.header,
            meta={'cookiejar':1},       #Open Cookies record and pass Cookies to callback function
            callback=self.parse
        ) 
        
        cookies = request.cookies  # inspect cookies sent in the request
        
        return [request] 
```

Every following request must pass the same cookiejar along:


```python 
# example:
request = response.follow(login_url, self.process_login_page, 
                            meta = {'cookiejar' : response.meta['cookiejar']},
                            headers = self.header)

# another example
return FormRequest.from_response(
    response,
    formdata={
        'appActionToken': token,
        'email': self._credentials['email'],
        'password': self._credentials['pwd'],
    },
    meta={'cookiejar': response.meta['cookiejar']},
    dont_filter=True,
    headers=self.header,
    callback=self.after_login,
)
```


* get the received cookies

```python
    cookie = response.headers.getlist('Set-Cookie')
    self.logger.info(f"cookies: {cookie}") 
```
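
`response.headers.getlist('Set-Cookie')` returns the raw header values as bytes. A minimal sketch of turning them into a name/value dict with the standard library's `http.cookies.SimpleCookie` is shown below. The header values are made up to mirror the `loggedin` and `username` cookies from the `requests` example earlier in this section.

```python
from http.cookies import SimpleCookie

# hypothetical values, in the shape returned by response.headers.getlist('Set-Cookie')
raw_set_cookie_headers = [
    b"loggedin=1; Path=/",
    b"username=Ryan; Path=/; HttpOnly",
]

received = {}
for raw in raw_set_cookie_headers:
    parsed = SimpleCookie()
    parsed.load(raw.decode("latin-1"))  # header bytes to str
    for name, morsel in parsed.items():
        # morsel.value is the cookie value; Path, HttpOnly etc. are attributes
        received[name] = morsel.value

print(received)
```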


* print the sent cookies


```python
from scrapy.http.cookies import CookieJar  # at the top of the spider module

self.logger.info("======================  Cookie Jars") 
cookieJar = response.meta.setdefault('cookie_jar', CookieJar())
cookieJar.extract_cookies(response, response.request)
self.logger.info(f"cookie_jar: {cookieJar._cookies}")


# from the request

cl = request.headers.getlist('Cookie')
if cl:
            
    self.logger.info(f"LHOF=> Sending cookies to: {request}")

    for c in cl:

        self.logger.info(f"Cookies: {c}")

```


## Debug and develop

* Use selector

In [30]:
val = Selector(text = """
            <div id="contacts" class="a-section">
                <div class="a-column a-span12 a-span-last"><span class="a-size-base a-text-bold"> Talent Agent</span></div>
                <span class="a-list-item"></span>
                <span class="a-list-item">+1 310 288 8000<span class="a-color-secondary"> phone</span>
                </span>
            </div>
          """)

divs = val.xpath("//div[@id='contacts']")
spans = divs.xpath("//span[@class='a-color-secondary']")

print("extract all text")
spans.xpath("//text()").getall()

print("extract current text")
spans.xpath("./text()").getall()

print("extract parent text")
spans.xpath("//span[@class='a-color-secondary']/../text()").getall()


extract all text


['\n                ',
 ' Talent Agent',
 '\n                ',
 '\n                ',
 '+1 310 288 8000',
 ' phone',
 '\n                ',
 '\n            ']

extract current text


[' phone']

extract parent text


['+1 310 288 8000', '\n                ']

* Open the requested page in the browser

In [41]:
from scrapy.utils.response import response_status_message, open_in_browser


def parse_something(self, response):
    
    open_in_browser(response)
    self.logger.error(response.body)
    
    return