<div style="float:right; padding-top: 15px; padding-right: 15px">
    <div>
        <a href="https://whiteboxml.com">
            <img src="https://whiteboxml.com/static/img/logo/black_bg_white.svg" width="250">
        </a>
    </div>
</div>

# Advanced Web Scraping

In [1]:
import requests

## 1. Status Codes

https://http.cat/

In [2]:
r1 = requests.get('https://datamarket.es')

r1.status_code

200

In [7]:
r1.text[:1000]

'\n\n<!DOCTYPE html>\n\n<html lang="es">\n\n<head>\n\n    <title>Data Market</title>\n    <link rel="canonical" href="https://datamarket.es">\n    <link rel="shortcut icon" type="image/png" href="/static/core/favicon.png"/>\n\n    <!-- Global site tag (gtag.js) - Google Analytics -->\n<script async src="https://www.googletagmanager.com/gtag/js?id=G-BT4SLFDHKK"></script>\n<script>\n  window.dataLayer = window.dataLayer || [];\n  function gtag(){dataLayer.push(arguments);}\n  gtag(\'js\', new Date());\n\n  gtag(\'config\', \'G-BT4SLFDHKK\');\n</script>\n\n    <meta charset="UTF-8">\n<meta name="description" content="Toma decisiones basadas en Hechos, no en conjeturas">\n<meta name="keywords" content="data, market, data market, datos, mercado de datos, web scraping">\n<meta name="viewport" content="width=device-width, initial-scale=1.0">\n\n\n\n<meta property="og:title" content="Data Market"/>\n<meta property="og:description" content="Toma decisiones basadas en Hechos, no en conjeturas"/>

In [8]:
r2 = requests.get('https://datamarket.es/i_dont_exist')

r2.status_code

404

In [9]:
r3 = requests.get('http://datamarket.es')

r3.status_code

200

In [10]:
r3.history

[<Response [301]>]

In [11]:
r3.history[0].status_code

301

In [12]:
r3.history[0].text

'<html>\r\n<head><title>301 Moved Permanently</title></head>\r\n<body bgcolor="white">\r\n<center><h1>301 Moved Permanently</h1></center>\r\n<hr><center>nginx/1.14.0 (Ubuntu)</center>\r\n</body>\r\n</html>\r\n'

In [13]:
r4 = requests.get('https://datamarket.es')

r4.history

[]

It's a good idea to check the status code before you parse the response text.

In [14]:
url = 'http://datamarket.es'

r = requests.get(url)

# requests follows redirects by default, so r.status_code is the final status
if r.status_code < 300:
    print('request was successful')
elif 300 <= r.status_code < 400:
    print('request was redirected')
elif 400 <= r.status_code < 500:
    print('request failed because the resource either does not exist or is forbidden')
elif 500 <= r.status_code < 600:
    print('request failed because the server encountered an error')
else:
    print('we have found a new http protocol')

request was successful
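
As an alternative to comparing numbers by hand, requests can raise an exception for 4xx and 5xx responses through raise_for_status(). The sketch below wraps that call in a small helper. The name text_if_ok is hypothetical and not part of requests, and returning None on failure is just one possible design.

```python
import requests


def text_if_ok(response):
    """Return the response body, or None if the status is 4xx/5xx.

    `text_if_ok` is an illustrative helper name, not part of requests.
    """
    try:
        # raise_for_status() raises HTTPError for 4xx and 5xx status codes
        response.raise_for_status()
    except requests.exceptions.HTTPError as ex:
        print(f'not parsing the response: {ex}')
        return None
    return response.text
```

For instance, text_if_ok(requests.get('https://datamarket.es/i_dont_exist')) would be expected to return None, given the 404 seen earlier.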


Note that the requests lib automatically follows redirects when a web resource has moved (i.e. 30x status codes). If the new location redirects again, requests keeps following the chain until it receives a success or failure response, as long as the number of redirects does not exceed the redirect limit (30 by default). You can disable this behavior with requests.get(url, allow_redirects=False), in which case requests returns the redirect response itself. You can also lower the limit by setting max_redirects = n on a requests.Session, which protects against endless redirect loops and saves time.

In [15]:
r = requests.get('http://datamarket.es', allow_redirects=False)

r.status_code

301
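
When redirects are left enabled, the intermediate responses stay available in history, as seen with r3 earlier. The following is a minimal sketch of a helper that summarizes the whole chain. The name redirect_chain is hypothetical, not something requests provides.

```python
import requests


def redirect_chain(response):
    """List (status_code, url) for each redirect hop plus the final response.

    `redirect_chain` is an illustrative helper name, not part of requests.
    """
    # response.history holds the intermediate redirect responses, oldest first
    hops = list(response.history) + [response]
    return [(hop.status_code, hop.url) for hop in hops]
```

Calling redirect_chain(requests.get('http://datamarket.es')) should list the 301 hop followed by the final response.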

## 2. Handling Request Errors

### 2.1 Exceeding Max Redirects

As previously mentioned, redirects are capped at 30 by default, and you can override that cap by setting max_redirects on a Session. If the number of redirects exceeds the limit, requests will raise a __TooManyRedirects exception__.

In [18]:
session = requests.Session()
session.max_redirects = 0

try:
    session.get('http://datamarket.es').status_code
    print('successful request!')
    
except requests.exceptions.TooManyRedirects as ex:
    print('handled exception!')
    print(ex)


handled exception!
Exceeded 0 redirects.


### 2.2 Timeout

Sometimes a remote server is unresponsive, either because requests cannot connect to the intended web resource or because the server never sends back the promised data. By default requests sets no timeout, so the call can hang for a very long time, possibly indefinitely. This is a big waste of time because most modern websites respond to web requests within a couple of seconds. It is therefore good practice to supply a timeout argument to limit the wait time; if the limit is exceeded, requests raises a __Timeout exception__.

In [23]:
try:
    requests.get('https://datamarket.es', timeout=0.05)
    print('successful request!')
    
except requests.exceptions.Timeout as ex:
    print('handled timeout exception!')
    print(ex)

except requests.exceptions.ConnectionError as ex:
    print('handled connection exception!')
    print(ex)

finally:
    print('finally executed')

handled timeout exception!
HTTPSConnectionPool(host='datamarket.es', port=443): Read timed out. (read timeout=0.05)
finally executed
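
A timeout is often combined with a few retries, since a slow response may be temporary. The sketch below assumes a hypothetical helper get_with_retries; the attempt count, backoff, and the (connect, read) timeout values are arbitrary choices for illustration. Passing a tuple as timeout sets the connect and read limits separately.

```python
import time

import requests


def get_with_retries(url, attempts=3, timeout=(3.05, 10), backoff=0.5):
    """Try a GET up to `attempts` times; return the response or None.

    `get_with_retries` is an illustrative helper name, not part of requests.
    """
    for attempt in range(1, attempts + 1):
        try:
            # timeout=(connect, read): separate limits for each phase
            return requests.get(url, timeout=timeout)
        except (requests.exceptions.Timeout,
                requests.exceptions.ConnectionError) as ex:
            print(f'attempt {attempt} failed: {ex}')
            if attempt < attempts:
                # wait a little longer after each failed attempt
                time.sleep(backoff * attempt)
    return None
```

The helper returns None once every attempt has failed, so the caller can decide whether to skip the URL or stop.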


### 2.3 SSL Certificate Error

If a website wants to use SSL/TLS, it has to obtain a certificate from a certificate authority (paid, or free from providers such as Let's Encrypt) and configure the web server properly in order for the certificate to be functional. If the SSL/TLS certificate is not installed properly or has expired (certificates are only valid for a limited period and have to be renewed), modern browsers such as Chrome will indicate the problem to the user. Similarly, the requests lib will raise an exception if it detects a problem with the SSL certificate.

In [24]:
try:
    requests.get('https://localhost:8000/')
    print('successful request!')
    
except requests.exceptions.SSLError as ex:
    print('handled exception!')
    print(ex)

handled exception!
HTTPSConnectionPool(host='localhost', port=8000): Max retries exceeded with url: / (Caused by SSLError(SSLError(1, '[SSL: WRONG_VERSION_NUMBER] wrong version number (_ssl.c:1091)')))
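
The exceptions from this section can be handled together, but the order of the except clauses matters because SSLError is a subclass of ConnectionError. The sketch below assumes a hypothetical helper safe_get that returns a label instead of raising; the labels and the default timeout are illustrative choices.

```python
import requests


def safe_get(url, timeout=5):
    """Return (response, None) on success or (None, label) on a handled error.

    `safe_get` is an illustrative helper name, not part of requests.
    """
    try:
        return requests.get(url, timeout=timeout), None
    except requests.exceptions.TooManyRedirects:
        return None, 'too many redirects'
    except requests.exceptions.SSLError:
        # SSLError subclasses ConnectionError, so it must be caught first
        return None, 'ssl error'
    except requests.exceptions.Timeout:
        # covers both ConnectTimeout and ReadTimeout
        return None, 'timeout'
    except requests.exceptions.ConnectionError:
        return None, 'connection error'
    except requests.exceptions.RequestException:
        # base class of all exceptions raised by requests
        return None, 'other request error'
```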


## 3. User Agent

The User-Agent header identifies the browser and operating system from which the request is made. Some websites send different responses depending on the user agent, for reasons such as:

- They want to avoid bugs of the website that only occur in certain browsers or browser versions.
- They want to personalize user experience by sending different data.
- They want to track your requests to prevent web scraping.

In case you decide to fool the website by pretending to be a certain browser, use the approach below:

In [25]:
headers = {
    'user-agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.3578.98 Safari/537.36'
}

response = requests.get('https://datamarket.es', headers=headers)

response.content[:500]

b'\n\n<!DOCTYPE html>\n\n<html lang="es">\n\n<head>\n\n    <title>Data Market</title>\n    <link rel="canonical" href="https://datamarket.es">\n    <link rel="shortcut icon" type="image/png" href="/static/core/favicon.png"/>\n\n    <!-- Global site tag (gtag.js) - Google Analytics -->\n<script async src="https://www.googletagmanager.com/gtag/js?id=G-BT4SLFDHKK"></script>\n<script>\n  window.dataLayer = window.dataLayer || [];\n  function gtag(){dataLayer.push(arguments);}\n  gtag(\'js\', new Date());\n\n  gtag(\'config'

With the user agent string above, the request pretends to come from Chrome v71.0.3578.98 on macOS 10.14.2 (Mojave). To see what user agent strings look like in other browsers/OS, check out https://developer.mozilla.org/en-US/docs/Web/HTTP/Headers/User-Agent
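
If many requests should carry the same user agent, one option is to set the header once on a Session. The sketch below also prints the default user agent requests sends when none is supplied, and prepares a request without sending it so that the outgoing header can be inspected offline.

```python
import requests

# Default user agent that requests sends when none is supplied
print(requests.utils.default_user_agent())

session = requests.Session()
# Headers set on the session are merged into every request it sends
session.headers.update({
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.3578.98 Safari/537.36'
})

# Prepare (but do not send) a request to inspect the headers that would go out
prepared = session.prepare_request(requests.Request('GET', 'https://datamarket.es'))
print(prepared.headers['User-Agent'])
```

After this, session.get(url) sends the custom header without passing headers= on every call.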

## 4. robots.txt

There are many reasons why some websites don't welcome bots:

- When a bot crawls a website, it takes up some CPU, memory, and bandwidth resources of the server that should have been dedicated to normal users.
- Sometimes the website admin does not want to expose certain semi-confidential web resources to the search engines. 
- Sometimes the website admin wants to promote the most important web resources rather than letting the search engines see many irrelevant or outdated resources resting on the server. 

To achieve these purposes, the common practice is to create a robots.txt file at the root level of the website to tell bots what to crawl and what not to crawl.

- In the case of "non-malicious" crawlers (e.g. Google):

The robots.txt file is there to tell crawlers and robots which URLs they should not visit on your website. This is important to help them avoid crawling low quality pages, or getting stuck in crawl traps where an infinite number of URLs could potentially be created, for example, a calendar section which creates a new URL for every day.

- In the case of "malicious" crawlers:

They won't care about your robots.txt at all XD
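
A well-behaved scraper can check robots.txt itself before crawling. The sketch below uses urllib.robotparser from the Python standard library. The rules, the example.com URLs, and the 'my-scraper' user agent are made up for illustration.

```python
from urllib.robotparser import RobotFileParser

# Made-up rules for illustration; a real site serves these at /robots.txt
rules = [
    'User-agent: *',
    'Disallow: /private/',
    'Crawl-delay: 5',
]

parser = RobotFileParser()
parser.parse(rules)

# can_fetch(useragent, url) tells whether the rules allow fetching the URL
private_ok = parser.can_fetch('my-scraper', 'https://example.com/private/report.html')
blog_ok = parser.can_fetch('my-scraper', 'https://example.com/blog/')
# crawl_delay(useragent) returns the Crawl-delay value, or None if not set
delay = parser.crawl_delay('my-scraper')

print(private_ok, blog_ok, delay)  # prints: False True 5
```

For a live site, calling set_url() with the robots.txt URL followed by read() downloads and parses the file instead of parse().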

For more info about robots.txt please have a look at: https://support.google.com/webmasters/answer/6062608?hl=en&ref_topic=6061961&visit_id=637139316066285039-3854058931&rd=1

<div style="padding-top: 25px; float: right">
    <div>    
        <i>&nbsp;&nbsp;© Copyright by</i>
    </div>
    <div>
        <a href="https://whiteboxml.com">
            <img src="https://whiteboxml.com/static/img/logo/black_bg_white.svg" width="125">
        </a>
    </div>
</div>