
#Looking Like a Human

##Adjust Your Headers
The *Requests* library is excellent for setting headers. HTTP headers are lists of attributes, or preferences, that a client sends with every request it makes to a web server.
Most major browsers send the following seven fields when initiating any connection (the values shown are examples):
* `Host`: `www.google.com`
* `Connection`: `keep-alive`
* `Accept`:  `text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8`
* `User-Agent`: `Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36`
* `Referer`: `https://www.google.com/`
* `Accept-Encoding`: `gzip, deflate, sdch`
* `Accept-Language`: `en-US,en;q=0.8`

By contrast, a typical Python scraper using the default `urllib` library sends only these headers:
* `Accept-Encoding`: `identity`
* `User-Agent`: `Python-urllib/3.4`
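
The default `User-Agent` can be checked without sending a request. The snippet below is a minimal sketch using only the standard library: it reads the headers a default opener would attach, then builds (but does not send) a request with a replacement value. The version suffix printed depends on the interpreter, and the `example.com` URL and the replacement string are placeholders, not values from this chapter.

```python
import urllib.request

# A default opener carries the headers urllib adds to every request.
opener = urllib.request.build_opener()
default_headers = dict(opener.addheaders)
print(default_headers)  # the version suffix varies with the interpreter

# Headers passed to Request take precedence over the opener defaults.
# Nothing is sent here; the Request object is only built and inspected.
request = urllib.request.Request(
    'http://example.com',  # placeholder URL
    headers={'User-Agent': 'Mozilla/5.0 (placeholder browser string)'},
)
# urllib stores header names capitalized, hence 'User-agent'.
custom_user_agent = request.get_header('User-agent')
print(custom_user_agent)
```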

Headers can be completely customized using the *Requests* library. The example below uses *https://www.whatismybrowser.com* to display the headers that the server receives.

In [0]:
import requests
from bs4 import BeautifulSoup

In [0]:
session = requests.Session()
headers = {'User-Agent':'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_5)AppleWebKit 537.36 (KHTML, like Gecko) Chrome',
           'Accept':'text/html,application/xhtml+xml,application/xml; q=0.9,image/webp,*/*;q=0.8'}
url = 'https://www.whatismybrowser.com/developers/what-http-headers-is-my-browser-sending'
req = session.get(url, headers=headers)

In [3]:
bs = BeautifulSoup(req.text, 'html.parser')
table = bs.find('table', {'class': 'table-striped'})
# Print the parsed table; use table.get_text() to print only its text.
print(table)

<table class="table table-striped">
<tr>
<th>ACCEPT</th>
<td>text/html,application/xhtml+xml,application/xml; q=0.9,image/webp,*/*;q=0.8</td>
</tr>
<tr>
<th>ACCEPT_ENCODING</th>
<td>gzip, deflate</td>
</tr>
<tr>
<th>CONNECTION</th>
<td>keep-alive</td>
</tr>
<tr>
<th>HOST</th>
<td>www.whatismybrowser.com</td>
</tr>
<tr>
<th>USER_AGENT</th>
<td>Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_5)AppleWebKit 537.36 (KHTML, like Gecko) Chrome</td>
</tr>
</table>


The output should show that the headers are now the same ones set in the headers
dictionary object in the code.
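
If every request in a session should carry the same headers, they can also be set once on the session instead of being passed to each call. The following is a sketch rather than part of the original notebook: it uses `session.headers.update()` and `Session.prepare_request()` from *Requests* to show the merged headers without contacting any server. The `example.com` URL is a placeholder.

```python
import requests

session = requests.Session()
# Set the headers once; every request made through this session reuses them.
session.headers.update({
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_5)AppleWebKit 537.36 (KHTML, like Gecko) Chrome',
    'Accept': 'text/html,application/xhtml+xml,application/xml; q=0.9,image/webp,*/*;q=0.8',
})

# Build the request without sending it, to inspect the final merged headers.
prepared = session.prepare_request(requests.Request('GET', 'http://example.com'))
for name, value in prepared.headers.items():
    print('{}: {}'.format(name, value))
```

Headers passed directly to `session.get()` still take precedence over the session-level values for that one request.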

##Handling Cookies with JavaScript
Handling cookies correctly can alleviate many scraping problems, although cookies can also be a double-edged sword. Websites that track your progression through a site using cookies might attempt to cut off scrapers that display abnormal behavior, such as completing forms too quickly or visiting too many pages. Although these behaviors can be disguised by closing and reopening connections to the site, or even changing the IP address, the disguise is futile if the cookie gives the scraper's identity away.
As mentioned in Chapter 10, cookies can also be necessary to scrape a site at all: staying logged in requires holding a cookie and presenting it from page to page.

If scraping a single targeted website or a small number of targeted sites, make sure to examine the cookies generated by those sites and consider which ones the scraper can handle. Some browser plug-ins, such as the Chrome extension *EditThisCookie*, can display them.

To handle cookies using the *Requests* library, refer back to Chapter 10. Because the *Requests* library cannot execute JavaScript, it cannot handle many of the cookies produced by modern tracking software, such as Google Analytics, which are set only after client-side scripts run. In this case, use the Selenium and PhantomJS packages.

To view cookies on a site:

In [0]:
from selenium import webdriver
driver = webdriver.PhantomJS(executable_path='<Path to Phantom JS>')
driver.get('http://pythonscraping.com')
driver.implicitly_wait(1)
print(driver.get_cookies())

To run this on Chrome:

In [0]:
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

chrome_options = Options()
chrome_options.add_argument("--headless")
driver = webdriver.Chrome(executable_path='drivers/chromedriver', 
                          chrome_options=chrome_options)
driver.get('http://pythonscraping.com')
driver.implicitly_wait(1)
print(driver.get_cookies())

To manipulate cookies, use the `delete_cookie()`, `add_cookie()`, and `delete_all_cookies()` methods:

In [0]:
from selenium import webdriver

In [0]:
phantomPath = '<Path to Phantom JS>'

driver = webdriver.PhantomJS(executable_path=phantomPath)
driver.get('http://pythonscraping.com')
driver.implicitly_wait(1)

savedCookies = driver.get_cookies()
print(savedCookies)

driver2 = webdriver.PhantomJS(executable_path=phantomPath)
driver2.get('http://pythonscraping.com')
driver2.delete_all_cookies()
for cookie in savedCookies:
  if not cookie['domain'].startswith('.'):
    cookie['domain'] = '.{}'.format(cookie['domain'])
  driver2.add_cookie(cookie)
  
driver2.get('http://pythonscraping.com')
driver2.implicitly_wait(1)
print(driver2.get_cookies())

The first webdriver retrieves a website, prints the cookies, and then stores them in the variable `savedCookies`. The second webdriver loads the same website, deletes its own cookies, and adds the cookies from the first webdriver.
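
The domain adjustment inside the PhantomJS loop can be pulled into a small helper, which makes it easy to check without launching a browser. This is only a sketch: `prepare_cookie` is a hypothetical name, and the sample cookie dictionary is made up to resemble the shape returned by `get_cookies()`.

```python
def prepare_cookie(cookie):
    """Return a copy of a saved cookie whose domain starts with a dot.

    Mirrors the domain check in the loop above; the input is not modified.
    """
    prepared = dict(cookie)
    if not prepared['domain'].startswith('.'):
        prepared['domain'] = '.{}'.format(prepared['domain'])
    return prepared


# Made-up cookie for illustration only.
sample = {'name': 'session_id', 'value': 'abc123',
          'domain': 'pythonscraping.com', 'path': '/'}
prepared_sample = prepare_cookie(sample)
print(prepared_sample['domain'])  # .pythonscraping.com
```

With a helper like this, the loop body would reduce to `driver2.add_cookie(prepare_cookie(cookie))`.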

To run it on Chrome:

In [0]:
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

chrome_options = Options()
chrome_options.add_argument("--headless")

driver = webdriver.Chrome(executable_path='drivers/chromedriver', 
                          chrome_options=chrome_options)
driver.get('http://pythonscraping.com')
driver.implicitly_wait(1)

savedCookies = driver.get_cookies()
print(savedCookies)

driver2 = webdriver.Chrome(executable_path='drivers/chromedriver',
                           chrome_options=chrome_options)

driver2.get('http://pythonscraping.com')
driver2.delete_all_cookies()
for cookie in savedCookies:
  driver2.add_cookie(cookie)

driver2.get('http://pythonscraping.com')
driver2.implicitly_wait(1)
print(driver2.get_cookies())

##Timing Is Everything
Some well-protected websites might prevent users from submitting forms or interacting with the site if they do so too quickly. Where possible, space out the scraper's requests by a few seconds, even if that means adding an explicit pause:


```
import time

time.sleep(3)
```
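
A fixed pause is easy to add but also produces perfectly regular request timing. As a possible variation (this is an assumption, not something the text above prescribes), the delay can be drawn at random from a range. The `polite_pause` function and its default bounds are hypothetical.

```python
import random
import time


def polite_pause(min_seconds=2.0, max_seconds=5.0):
    """Sleep for a random duration within the given bounds and return it."""
    delay = random.uniform(min_seconds, max_seconds)
    time.sleep(delay)
    return delay


# Short bounds here so the demonstration finishes quickly.
waited = polite_pause(0.01, 0.05)
print('Paused for {:.3f} seconds'.format(waited))
```

In a scraper, a call such as `polite_pause()` would sit between consecutive page requests or form submissions.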



#Common Form Security Features

##Hidden Input Field Values
“Hidden” fields in HTML forms allow the value contained in the field to be viewable
by the browser but invisible to the users (unless they look at the site’s source code).

Hidden fields are used to prevent web scraping in two main ways. In the first, a field is
populated with a randomly generated value on the form page, and the server expects that
value to be posted to the form-processing page. If this value is not present in the
submission, the server can reasonably assume that the submission did not originate
organically from the form page, but was posted by a bot directly to the processing page. The
best way to get around this measure is to scrape the form page first, collect the
randomly generated value, and then post to the processing page from there.
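
The first step of that approach, collecting the hidden values from the form page, can be sketched as follows. The example parses a made-up form with the standard library's `html.parser` so that it runs without a network connection. The field names `csrf_token` and `firstName`, the token value, and the `HiddenFieldCollector` class are all hypothetical. In practice the HTML would come from a request to the form page, and the resulting data would be posted to the processing page within the same session.

```python
from html.parser import HTMLParser


class HiddenFieldCollector(HTMLParser):
    """Collect name/value pairs of <input type="hidden"> elements."""

    def __init__(self):
        super().__init__()
        self.hidden_fields = {}

    def handle_starttag(self, tag, attrs):
        if tag != 'input':
            return
        attributes = dict(attrs)
        input_type = (attributes.get('type') or '').lower()
        if input_type == 'hidden' and attributes.get('name'):
            self.hidden_fields[attributes['name']] = attributes.get('value') or ''


# Made-up form page for illustration only.
form_page = '''
<form action="/process" method="post">
  <input type="hidden" name="csrf_token" value="3f9a1c"/>
  <input type="text" name="firstName"/>
  <input type="submit" value="Submit"/>
</form>
'''

collector = HiddenFieldCollector()
collector.feed(form_page)

# Start from the server-issued hidden values, then add the visible fields.
payload = dict(collector.hidden_fields)
payload['firstName'] = 'Ada'
print(payload)
```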

The second method is a “honeypot” of sorts. If a form contains a hidden field with an
innocuous name, such as Username or Email Address, a poorly written bot might fill
out the field and attempt to submit it, regardless of whether it is hidden from the user.
A submission in which hidden fields carry actual values (or values that differ from their
defaults on the form page) is typically disregarded, and the user may even be
blocked from the site.

##Avoiding Honeypots
If a field on a web form is hidden from a user via CSS, it is reasonable to assume that the average user visiting the site will not be able to fill it out because it doesn’t show up in the browser.
If such a field is populated, there is likely a bot at work and the post will be discarded.

A page visit to a “hidden” link on a site can easily trigger a server-side script that will block the user’s IP address, log that user out of the site, or take some other action to prevent further access. In fact, many business models have been based on exactly this concept.

The example below uses the website *http://pythonscraping.com/pages/itsatrap.html*. This page contains two links, one hidden by CSS and another visible. It also contains a form with two hidden fields:

```
<html>
<head>
  <title>A bot-proof form</title>
</head>
<style>
  body {
    overflow-x:hidden;
  }
  .customHidden {
    position:absolute;
    right:50000px;
  }
</style>
<body>
  <h2>A bot-proof form</h2>
  <a href=
    "http://pythonscraping.com/dontgohere" style="display:none;">Go here!</a>
  <a href="http://pythonscraping.com">Click me!</a>
  <form>
    <input type="hidden" name="phone" value="valueShouldNotBeModified"/><p/>
    <input type="text" name="email" class="customHidden"
      value="intentionallyBlank"/><p/>
    <input type="text" name="firstName"/><p/>
    <input type="text" name="lastName"/><p/>
    <input type="submit" value="Submit"/><p/>
  </form>
</body>
</html>
```



These three elements are hidden from the user in three ways:
* The first link is hidden with a simple CSS `display:none` attribute.
* The phone field is a hidden input field.
* The email field is hidden by moving it 50,000 pixels to the right (presumably off
the screen of everyone’s monitors) and hiding the telltale scroll bar.

The following code retrieves the page described above and looks for hidden links and form input fields:

In [0]:
from selenium import webdriver
from selenium.webdriver.remote.webelement import WebElement

In [0]:
driver = webdriver.PhantomJS(executable_path='<Path to Phantom JS>')
driver.get('http://pythonscraping.com/pages/itsatrap.html')
links = driver.find_elements_by_tag_name('a')
for link in links:
  if not link.is_displayed():
    print('The link {} is a trap'.format(link.get_attribute('href')))
fields = driver.find_elements_by_tag_name('input')
for field in fields:
  if not field.is_displayed():
    print('Do not change value of {}'.format(field.get_attribute('name')))

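Simply ignoring hidden fields is dangerous, because the server may expect their pre-populated values to be submitted with the rest of the form. Take care when interacting with them, though: changing a hidden value or following a hidden link is exactly what these traps detect.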
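
One way to act on this advice is to submit hidden values exactly as the page supplied them and to fill in only the fields a user could see. The sketch below works on plain dictionaries so that it runs without a browser. The `build_payload` helper and the `displayed` key are hypothetical, standing in for what `get_attribute()` and `is_displayed()` would report for each input on the trap page above.

```python
def build_payload(fields, user_values):
    """Build form data that leaves non-displayed fields untouched.

    fields: list of dicts with 'name', 'value', and 'displayed' keys.
    user_values: values the scraper wants to enter, keyed by field name.
    """
    payload = {}
    for field in fields:
        name = field['name']
        if not name:
            continue  # Inputs without a name are not submitted.
        if field['displayed'] and name in user_values:
            payload[name] = user_values[name]
        else:
            # Hidden or untouched fields keep the value the page supplied.
            payload[name] = field['value']
    return payload


# Field descriptions mirroring the form on the trap page shown earlier.
fields = [
    {'name': 'phone', 'value': 'valueShouldNotBeModified', 'displayed': False},
    {'name': 'email', 'value': 'intentionallyBlank', 'displayed': False},
    {'name': 'firstName', 'value': '', 'displayed': True},
    {'name': 'lastName', 'value': '', 'displayed': True},
]
payload = build_payload(
    fields,
    {'firstName': 'Ada', 'lastName': 'Lovelace', 'email': 'bot@example.com'},
)
print(payload)
```

The value offered for `email` is dropped because that field is not displayed, so the page's original values for both trap fields are submitted unchanged.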