# https://www.scraperapi.com/blog/headers-and-cookies-for-web-scraping/
# https://www.scraperapi.com/blog/web-scraping-best-practices/
# https://www.scraperapi.com/blog/10-tips-for-web-scraping/

Anti-scraping Techniques
- Request volume: too many requests within a given time frame, or too many parallel requests from the same IP
- Request patterns: repeated, regular behaviour (X requests every Y seconds)
- Honeypots: link traps that webmasters add to the HTML, hidden from human visitors
- CAPTCHAs: redirecting the request to a page with a CAPTCHA
- JavaScript checks
- Behavioural analysis: anti-bot mechanisms can spot patterns in the number of clicks, their location, the interval between them, and other metrics
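
The honeypot item above can be illustrated with a minimal sketch that separates links hidden by inline markup from visible ones. It assumes the trap is hidden with an inline `style` or the `hidden` attribute; links hidden through CSS classes, external stylesheets, or a hidden parent element are not caught and would need a rendering browser. The names `VisibleLinkParser` and `split_links` are made up for this example.

```python
from html.parser import HTMLParser


class VisibleLinkParser(HTMLParser):
    """Collect hrefs, separating links hidden by inline markup from visible ones."""

    def __init__(self):
        super().__init__()
        self.visible = []
        self.hidden = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attrs = dict(attrs)
        href = attrs.get("href")
        if not href:
            return
        # Normalise the inline style so "display: none" and "display:none" match.
        style = (attrs.get("style") or "").replace(" ", "").lower()
        is_hidden = (
            "hidden" in attrs
            or "display:none" in style
            or "visibility:hidden" in style
        )
        (self.hidden if is_hidden else self.visible).append(href)


def split_links(html):
    """Return (visible_hrefs, hidden_hrefs) for the given HTML string."""
    parser = VisibleLinkParser()
    parser.feed(html)
    return parser.visible, parser.hidden
```

A scraper would then follow only the first list, for example `visible, hidden = split_links(response.text)`.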

Todos
- Set Your Timeout to at Least 60 Seconds
- Don’t Set Custom Headers Unless You 100% Need To
- Always Send Your Requests to the HTTPS Version
- Avoid Using Sessions Unless Completely Necessary
- Manage Your Concurrency Properly
- Verify if You Need Geotargeting Before Running Your Scraper
- To interact with the page (click a button, scroll, etc.), use your own headless browser driven by Selenium, Puppeteer, or Nightmare
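
The timeout and concurrency items above could be combined roughly as follows. This is a sketch under assumptions: the pool size of 5 is an arbitrary placeholder (the right value depends on the target site or the API plan), and `fetch` and `fetch_all` are names invented here.

```python
from concurrent.futures import ThreadPoolExecutor

TIMEOUT_SECONDS = 60  # from the list above
MAX_WORKERS = 5       # assumed value; tune to the site or plan limits


def fetch(url):
    """Fetch one URL and return (url, status_code), or (url, None) on failure."""
    # Imported here so fetch_all can be tried with a stub worker
    # even where requests is not installed.
    import requests

    try:
        # requests applies the timeout to the connection and to each read,
        # not to the transfer as a whole.
        response = requests.get(url, timeout=TIMEOUT_SECONDS)
        return url, response.status_code
    except requests.RequestException:
        return url, None


def fetch_all(urls, worker=fetch, max_workers=MAX_WORKERS):
    """Run worker over urls with at most max_workers requests in flight."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map returns results in the same order as the input URLs.
        return list(pool.map(worker, urls))
```

Calling `fetch_all(['https://httpbin.org/headers'])` would return a list with one `(url, status_code)` pair.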

Tips
- Set Random Intervals In Between Your Requests
- Set a Referrer
- Use a Headless Browser
- Avoid Honeypot Traps
- Detect Website Changes
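
As a sketch of the first tip, the loop below pauses for a random interval between requests instead of a fixed one. The 1 to 5 second range is an assumption, and `crawl` and `visit` are hypothetical names; `visit` stands for whatever function performs the request.

```python
import random
import time


def crawl(urls, visit, low=1.0, high=5.0, sleep=time.sleep, rng=random):
    """Call visit(url) for each URL, pausing a random interval between calls."""
    results = []
    for index, url in enumerate(urls):
        if index:  # no pause before the first request
            # A uniform draw avoids the fixed "every Y seconds" pattern.
            sleep(rng.uniform(low, high))
        results.append(visit(url))
    return results
```

The `sleep` and `rng` parameters exist only so the pauses can be replaced or seeded when trying the function out.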

In [1]:
import requests
url = 'https://httpbin.org/headers'
 
response = requests.get(url)
 
print(response.text)

{
  "headers": {
    "Accept": "*/*", 
    "Accept-Encoding": "gzip, deflate", 
    "Host": "httpbin.org", 
    "User-Agent": "python-requests/2.31.0", 
    "X-Amzn-Trace-Id": "Root=1-64a56f06-4bce88994e6eeaa00698b3e6"
  }
}



In [2]:
url = 'https://httpbin.org/headers'
 
headers = {
    'accept': '*/*',
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/101.0.4951.64 Safari/537.36 Edg/101.0.1210.53',
    'Accept-Language': 'en-US,en;q=0.9,it;q=0.8,es;q=0.7',
    'referer': 'https://www.google.com/',
    'cookie': 'DSID=AAO-7r4OSkS76zbHUkiOpnI0kk-X19BLDFF53G8gbnd21VZV2iehu-w_2v14cxvRvrkd_NjIdBWX7wUiQ66f-D8kOkTKD1BhLVlqrFAaqDP3LodRK2I0NfrObmhV9HsedGE7-mQeJpwJifSxdchqf524IMh9piBflGqP0Lg0_xjGmLKEQ0F4Na6THgC06VhtUG5infEdqMQ9otlJENe3PmOQTC_UeTH5DnENYwWC8KXs-M4fWmDADmG414V0_X0TfjrYu01nDH2Dcf3TIOFbRDb993g8nOCswLMi92LwjoqhYnFdf1jzgK0'
}
 
response = requests.get(url, headers=headers)
 
print(response.text)

{
  "headers": {
    "Accept": "*/*", 
    "Accept-Encoding": "gzip, deflate", 
    "Accept-Language": "en-US,en;q=0.9,it;q=0.8,es;q=0.7", 
    "Cookie": "DSID=AAO-7r4OSkS76zbHUkiOpnI0kk-X19BLDFF53G8gbnd21VZV2iehu-w_2v14cxvRvrkd_NjIdBWX7wUiQ66f-D8kOkTKD1BhLVlqrFAaqDP3LodRK2I0NfrObmhV9HsedGE7-mQeJpwJifSxdchqf524IMh9piBflGqP0Lg0_xjGmLKEQ0F4Na6THgC06VhtUG5infEdqMQ9otlJENe3PmOQTC_UeTH5DnENYwWC8KXs-M4fWmDADmG414V0_X0TfjrYu01nDH2Dcf3TIOFbRDb993g8nOCswLMi92LwjoqhYnFdf1jzgK0", 
    "Host": "httpbin.org", 
    "Referer": "https://www.google.com/", 
    "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/101.0.4951.64 Safari/537.36 Edg/101.0.1210.53", 
    "X-Amzn-Trace-Id": "Root=1-64a56f65-17ead7881dd31c011d3c2dac"
  }
}



In [4]:
# Same custom headers as in the previous cell
headers = {
    'accept': '*/*',
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/101.0.4951.64 Safari/537.36 Edg/101.0.1210.53',
    'Accept-Language': 'en-US,en;q=0.9,it;q=0.8,es;q=0.7',
    'referer': 'https://www.google.com/',
    'cookie': 'DSID=AAO-7r4OSkS76zbHUkiOpnI0kk-X19BLDFF53G8gbnd21VZV2iehu-w_2v14cxvRvrkd_NjIdBWX7wUiQ66f-D8kOkTKD1BhLVlqrFAaqDP3LodRK2I0NfrObmhV9HsedGE7-mQeJpwJifSxdchqf524IMh9piBflGqP0Lg0_xjGmLKEQ0F4Na6THgC06VhtUG5infEdqMQ9otlJENe3PmOQTC_UeTH5DnENYwWC8KXs-M4fWmDADmG414V0_X0TfjrYu01nDH2Dcf3TIOFbRDb993g8nOCswLMi92LwjoqhYnFdf1jzgK0'
}
payload = {
    'api_key': '51e43be283e4db2a5afb62660xxxxxxx',
    'url': 'https://httpbin.org/headers',
    # keep_headers asks ScraperAPI to forward the custom headers above
    'keep_headers': 'true',
}
response = requests.get('http://api.scraperapi.com', params=payload, headers=headers)
print(response.text)

# Equivalent request URL (shown unencoded):
# http://api.scraperapi.com/?api_key=51e43be283e4db2a5afb62660xxxxxxx&url=https://httpbin.org/headers&keep_headers=true

Unauthorized request, please make sure your API key is valid.
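
For reference, a small sketch of how the query parameters from the cell above end up in the request URL. The endpoint and the `api_key`, `url`, and `keep_headers` parameters are taken from that cell; `build_api_url` is a helper name made up here, and `YOUR_API_KEY` is a placeholder.

```python
from urllib.parse import urlencode


def build_api_url(api_key, target_url, keep_headers=True):
    """Build the request URL, percent-encoding the target URL."""
    params = {"api_key": api_key, "url": target_url}
    if keep_headers:
        params["keep_headers"] = "true"
    return "http://api.scraperapi.com/?" + urlencode(params)


print(build_api_url("YOUR_API_KEY", "https://httpbin.org/headers"))
```

The target URL is percent-encoded in the result (`https%3A%2F%2Fhttpbin.org%2Fheaders`), so its own `:` and `/` characters cannot be mistaken for parts of the outer URL.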
