# Scraping the Unscrapable

Some sites are hard to scrape.

Sometimes you get blocked.
Sometimes the site relies heavily on JavaScript to build its pages.

We'll look at a few workarounds for the former, then introduce Selenium, a tool that automates interactions with a real browser, which helps with the latter.

## How much is too much?

Sites publish a `robots.txt` file at their root that gives guidelines about what they want to allow web crawlers to access.

In [1]:
import requests

url = 'https://www.airbnb.com/robots.txt'  # robots.txt lives at the site root
response = requests.get(url)
print(response.text)
# Static pages ship their full HTML, which Beautiful Soup can parse directly.
# Sites like Facebook instead send data plus a JavaScript library. The script
# runs in the browser on your machine and generates the HTML there, so each
# user ends up with different HTML.

In a `robots.txt` file, `Disallow: /` means disallow everything. When it appears under a final `User-agent: *` entry, it applies to every user agent not covered earlier. Box Office Mojo is more accepting:

In [2]:
url = 'http://www.boxofficemojo.com/robots.txt'
response  = requests.get(url)
print(response.text)

# Each Disallow line names a path the site does not want crawlers to visit.
# Box Office Mojo's list is short, so it is reasonably scraper-friendly.


# robots.txt for http://www.boxofficemojo.com

User-agent: *
Disallow: /movies/default.movies.htm
Disallow: /showtimes/buy.php
Disallow: /forums/
Disallow: /derbygame/
Disallow: /grades/
Disallow: /moviehangman/
Disallow: /users/
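
Rules like these can also be checked programmatically. The sketch below feeds the Box Office Mojo rules printed above to the standard library's `urllib.robotparser` and asks whether two paths may be fetched. The user-agent name `my-scraper` and the two test URLs are illustrative assumptions, and the rules are pasted in as a string so that nothing is downloaded.

```python
from urllib.robotparser import RobotFileParser

# Rules copied from the robots.txt output above.
ROBOTS_TXT = """\
User-agent: *
Disallow: /movies/default.movies.htm
Disallow: /showtimes/buy.php
Disallow: /forums/
Disallow: /derbygame/
Disallow: /grades/
Disallow: /moviehangman/
Disallow: /users/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())


def allowed(url, agent="my-scraper"):  # agent name is a made-up example
    """Return True if the parsed rules let `agent` fetch `url`."""
    return parser.can_fetch(agent, url)


forum_ok = allowed("http://www.boxofficemojo.com/forums/")
movies_ok = allowed("http://www.boxofficemojo.com/movies/")
print(forum_ok, movies_ok)
```

In a live scraper, `parser.set_url(...)` followed by `parser.read()` would download the file instead, and `allowed(url)` could guard each request.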




It's very common for sites to block you if you send too many requests in a certain time period. Sometimes all it takes to evade this is well-designed pauses in your scraping. 

There are two general approaches:
* pause after every request
* pause after each n requests

In [3]:
#every request
import time

page_list = ['page1','page2','page3']

for page in page_list:
    ### scrape a website
    ### ...
    print(page)
    
    time.sleep(2)
# A scraper can send requests far faster than a human would, which is a burden
# for the people running the site. As a courtesy, leave a delay between requests.

page1
page2
page3


In [4]:
#every 200 requests
import time

page_list = ['page1','page2','page3','page4','page5','page6']

for i, page in enumerate(page_list):
    ### scrape a website
    ### ...
    print(page)
    
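    # pause after every 200th request; the parentheses around i + 1 are
    # required because % binds more tightly than +
    if (i + 1) % 200 == 0:
        time.sleep(320)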

page1
page2
page3
page4
page5
page6


Or better yet, add a random delay, which looks more human:

In [5]:
import random

for page in page_list:
    ### scrape a website
    ### ...
    print(page)
    
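    time.sleep(0.5 + 2 * random.random())  # wait between 0.5 and 2.5 seconds

# Caveat: randomizing delays to look human is a bit sketchy. If you are that
# worried about being detected, you probably shouldn't be scraping the site.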

page1
page2
page3
page4
page5
page6
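
The two pausing strategies and the random jitter can be combined in one helper. What follows is a minimal sketch rather than a recommendation of specific values: the function name `pause_length` and its parameters are made up for illustration, and the defaults simply reuse the numbers from the cells above.

```python
import random


def pause_length(request_number, base=0.5, jitter=2.0,
                 batch_size=200, batch_pause=320, rng=random):
    """Seconds to wait after the given 1-based request number.

    Every request gets `base` plus a random amount below `jitter`;
    every `batch_size`-th request additionally gets `batch_pause`.
    """
    delay = base + jitter * rng.random()
    if request_number % batch_size == 0:
        delay += batch_pause
    return delay


rng = random.Random(0)  # seeded only so the demo is repeatable
delays = [pause_length(n, rng=rng) for n in range(1, 401)]
print(min(delays), max(delays))
```

Inside a scraping loop this would be used as `time.sleep(pause_length(i + 1))`, with `i` coming from `enumerate(page_list)`.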


## How do I make requests look like a real browser?

In [6]:
import sys
import requests
from bs4 import BeautifulSoup

url = 'http://www.reddit.com'

# Send a browser-like User-agent header instead of the default python-requests one.
user_agent = {'User-agent': 'Mozilla/5.0'}
response = requests.get(url, headers=user_agent)

We can also generate a random user agent with the `fake_useragent` library:

In [7]:
from fake_useragent import UserAgent

ua = UserAgent()
user_agent = {'User-agent': ua.random}
print(user_agent)

response  = requests.get(url, headers = user_agent)
print(response.text)
# fake_useragent supplies a random, real-world browser user-agent string.
# The response from Reddit is mostly <script> tags: the page is built by JavaScript.
# Beautiful Soup can still parse whatever HTML comes back.

{'User-agent': 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/27.0.1453.90 Safari/537.36'}
<!DOCTYPE html><html lang="en"><head><script>
          var __SUPPORTS_TIMING_API = typeof performance === 'object' && !!performance.mark && !! performance.measure && !!performance.getEntriesByType;
          function __perfMark(name) { __SUPPORTS_TIMING_API && performance.mark(name); };
          var __firstLoaded = false;
          function __markFirstPostVisible() {
            if (__firstLoaded) { return; }
            __firstLoaded = true;
            __perfMark("first_post_title_image_loaded");
          }
        </script><script>
          __perfMark('head_tag_start');
        </script><title>reddit: the front page of the internet</title><meta charSet="utf-8"/><meta name="viewport" content="width=device-width, initial-scale=1"/><meta name="referrer" content="origin-when-cross-origin"/><style>
  /* http://meyerweb.com/eric/tools/css/reset/
    v2.0 | 20110126
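
If installing `fake_useragent` is not an option, a similar effect can be sketched with the standard library alone. The list `USER_AGENTS` below is a small hand-picked assumption (its first entry is the string printed above), and `urllib.request.Request` is used only so that the header can be inspected without sending anything.

```python
import random
from urllib.request import Request

# Hypothetical pool of user-agent strings; extend it with ones you trust.
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) "
    "Chrome/27.0.1453.90 Safari/537.36",
    "Mozilla/5.0",
]


def build_request(url, rng=random):
    """Create (but do not send) a request with a randomly chosen user agent."""
    headers = {"User-Agent": rng.choice(USER_AGENTS)}
    return Request(url, headers=headers)


req = build_request("http://www.reddit.com")
# urllib stores header names capitalized, hence "User-agent".
print(req.get_header("User-agent"))
```

The same `headers` dictionary could be passed to `requests.get(url, headers=headers)` as in the cells above.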
 

## Now to Selenium!

## What happens if I try to parse my Gmail with `requests` and `BeautifulSoup`?

In [8]:
import requests
from bs4 import BeautifulSoup

gmail_url="https://mail.google.com"
soup=BeautifulSoup(requests.get(gmail_url).text, "lxml")
print(soup.prettify())

<!DOCTYPE html>
<html lang="en">
 <head>
  <meta charset="utf-8"/>
  <meta content="width=300, initial-scale=1" name="viewport"/>
  <meta content="Gmail is email that's intuitive, efficient, and useful. 15 GB of storage, less spam, and mobile access." name="description"/>
  <meta content="LrdTUW9psUAMbh4Ia074-BPEVmcpBxF6Gwf0MSgQXZs" name="google-site-verification"/>
  <title>
   Gmail
  </title>
  <style>
   @font-face {
  font-family: 'Open Sans';
  font-style: normal;
  font-weight: 300;
  src: local('Open Sans Light'), local('OpenSans-Light'), url(//fonts.gstatic.com/s/opensans/v15/mem5YaGs126MiZpBA-UN_r8OUuhs.ttf) format('truetype');
}
@font-face {
  font-family: 'Open Sans';
  font-style: normal;
  font-weight: 400;
  src: local('Open Sans'), local('OpenSans'), url(//fonts.gstatic.com/s/opensans/v15/mem8YaGs126MiZpBA-UFVZ0e.ttf) format('truetype');
}
  </style>
  <style>
   h1, h2 {
  -webkit-animation-duration: 0.1s;
  -webkit-animation-name: fontfix;
  -webkit-animation-iteratio

Well, this is a tiny page, because we get redirected. Making soup out of it is useless, of course. Luckily, in this case we can see where we are being sent. In many cases you won't be so lucky: the page contents are rendered by JavaScript in the browser, so fetching the source alone won't help you.
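
To see why the raw source is not enough, consider a hand-written page (an assumption for illustration, not Gmail's actual markup) whose body is an empty container that a script fills in later. Parsing the source with the standard library's `html.parser` finds no visible text, because the script never runs:

```python
from html.parser import HTMLParser

# Made-up page: the inbox text only exists after a browser runs the script.
PAGE_SOURCE = """
<html><body>
  <div id="inbox"></div>
  <script>
    document.getElementById("inbox").textContent = "3 new messages";
  </script>
</body></html>
"""


class VisibleText(HTMLParser):
    """Collect text that sits outside <script> and <style> tags."""

    def __init__(self):
        super().__init__()
        self.skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self.skip_depth:
            self.skip_depth -= 1

    def handle_data(self, data):
        if not self.skip_depth and data.strip():
            self.chunks.append(data.strip())


extractor = VisibleText()
extractor.feed(PAGE_SOURCE)
visible_text = " ".join(extractor.chunks)
print(repr(visible_text))
```

The message only appears once a browser executes the script, which is the gap Selenium fills.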

Anyway, let's follow the redirection for now.

In [9]:
new_url = "https://mail.google.com/mail"

# requests.get fetches the URL we were redirected to
soup = BeautifulSoup(requests.get(new_url).text, "lxml")
print(soup.prettify())

<!DOCTYPE html>
<html lang="en">
 <head>
  <meta charset="utf-8"/>
  <meta content="width=300, initial-scale=1" name="viewport"/>
  <meta content="Gmail is email that's intuitive, efficient, and useful. 15 GB of storage, less spam, and mobile access." name="description"/>
  <meta content="LrdTUW9psUAMbh4Ia074-BPEVmcpBxF6Gwf0MSgQXZs" name="google-site-verification"/>
  <title>
   Gmail
  </title>
  <style>
   @font-face {
  font-family: 'Open Sans';
  font-style: normal;
  font-weight: 300;
  src: local('Open Sans Light'), local('OpenSans-Light'), url(//fonts.gstatic.com/s/opensans/v15/mem5YaGs126MiZpBA-UN_r8OUuhs.ttf) format('truetype');
}
@font-face {
  font-family: 'Open Sans';
  font-style: normal;
  font-weight: 400;
  src: local('Open Sans'), local('OpenSans'), url(//fonts.gstatic.com/s/opensans/v15/mem8YaGs126MiZpBA-UFVZ0e.ttf) format('truetype');
}
  </style>
  <style>
   h1, h2 {
  -webkit-animation-duration: 0.1s;
  -webkit-animation-name: fontfix;
  -webkit-animation-iteratio

In [10]:
print(soup.find(id='Email'))
# This is the email field of the login page, so we need to script an actual login.

<input id="Email" name="Email" placeholder="Email or phone" spellcheck="false" type="email" value=""/>


We have hit the login page. We can't get to the emails without logging in, so we need to actually interact with the browser using Selenium!

In [11]:
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
import time

import chromedriver_binary  # adds the chromedriver executable to PATH


driver = webdriver.Chrome()
driver.get("https://mail.google.com")

# Alternative to Chrome: webdriver.Firefox(). PhantomJS is no longer maintained.
# Note: the find_element_by_* helpers used below were removed in Selenium 4.3.
# On newer versions, import By from selenium.webdriver.common.by and call
# driver.find_element(By.ID, ...), (By.NAME, ...) or (By.XPATH, ...) instead.

### Interlude: how to include usernames and passwords

We are going to have to enter a username and password in order to log in. However, we **don't** want to have our password uploaded to GitHub for people to scrape! One solution to this is to use _environment variables_.

In your directory, create a file called `.env` that has the following format:
```bash
USERNAME="your_username@gmail.com"
PASSWORD="your_password"
```
DON'T ADD THIS FILE TO GITHUB!
It is prudent to add the line `.env` to your `.gitignore`.

We add two commands to the top of the cell:
```
%load_ext dotenv  # allows us to use the %dotenv "magic" command
%dotenv           # reads .env, and makes USERNAME and PASSWORD environment variables
```
We can now use `os.environ.get` to access the environment variables without having them appear in the notebook.
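
One thing `os.environ.get` does not do is complain when a variable is missing; it quietly returns `None`. A small guard, sketched below, makes that failure obvious early. The helper name `load_credentials` is hypothetical, and it accepts any mapping so it can be tried without touching the real environment.

```python
import os


def load_credentials(env=os.environ):
    """Return (username, password) from env, or raise if either is missing."""
    missing = [name for name in ("USERNAME", "PASSWORD") if not env.get(name)]
    if missing:
        raise RuntimeError("missing environment variables: " + ", ".join(missing))
    return env["USERNAME"], env["PASSWORD"]


# Demo with a plain dict standing in for os.environ (placeholder values).
fake_env = {"USERNAME": "your_username@gmail.com", "PASSWORD": "your_password"}
email, password = load_credentials(fake_env)
print(email)
```

On Windows, for instance, `USERNAME` is already set by the system, so a more distinctive variable name may avoid surprises.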

In [None]:
!pip install python-dotenv

In [None]:
# See notes about environment variables
%load_ext dotenv
%dotenv
import os
EMAIL = os.environ.get('USERNAME')
PASSWORD = os.environ.get('PASSWORD')

# Show that this is working. Don't do this for PASSWORD!
print(EMAIL)

### Fill out username and password, hit enter to log in

Now let's use this to log in.

In [None]:
username_form = driver.find_element_by_id("identifierId")
username_form.send_keys(EMAIL) 

In [None]:
username_form.send_keys(Keys.RETURN)

In [None]:
password_form = driver.find_element_by_name("password")  # locate by name this time, not by id
password_form.send_keys(PASSWORD)  # enter password

In [None]:
password_form.send_keys(Keys.RETURN)

### Click compose button to start a new email draft

In [None]:
compose_button=driver.find_element_by_xpath('//div[text()="Compose"]')
compose_button.click()

### Write a nice, friendly (optional) message to your (least?) favorite person

In [None]:
to_field = driver.find_element_by_name("to")
to_field.send_keys("email@gmail.com") # enter recipient email

In [None]:
subject = driver.find_element_by_name("subjectbox")
subject.send_keys("This is an alert!")

In [None]:
message_body = driver.find_element_by_xpath("//div[@aria-label='Message Body']")
message_body.send_keys("Hello,")
message_body.send_keys([Keys.RETURN, Keys.RETURN])
message_body.send_keys("I am a computer and I just became self aware!")

### Press the send button

In [None]:
send_button = driver.find_element_by_xpath("//div[contains(@aria-label, 'Send')]")
send_button.click()

# Scraping Box Office Mojo with Selenium

In [None]:
# Assumed URL: the legacy Box Office Mojo page for The Matrix, which the selectors below target.
matrix_url = "http://www.boxofficemojo.com/movies/?id=matrix.htm"
driver.get(matrix_url)


In [None]:
# contains() matches <font> tags whose text includes "Domestic"; /b then selects the <b> child
gross_selector = '//font[contains(text(), "Domestic")]/b'
print(driver.find_element_by_xpath(gross_selector).text)

In [None]:
# scraping genre
genre_selector = '//a[contains(@href, "/genres/chart/")]/b'
for genre_anchor in driver.find_elements_by_xpath(genre_selector):
    print(genre_anchor.text)

In [None]:
inf_adjust_2000_selector = '//select[@name="ticketyr"]/option[@value="2000"]'
driver.find_element_by_xpath(inf_adjust_2000_selector).click()
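
The selector above can be tried without a browser. The sketch below runs the same path against a small hand-written form (an assumption; the real page's markup may differ) using `xml.etree.ElementTree`, whose XPath support is a limited subset: attribute tests like `[@name="ticketyr"]` work, but functions such as `contains()` do not.

```python
import xml.etree.ElementTree as ET

# Made-up, well-formed fragment imitating the year dropdown.
FRAGMENT = """
<form>
  <select name="ticketyr">
    <option value="1999">1999</option>
    <option value="2000">2000</option>
    <option value="2001">2001</option>
  </select>
  <input name="Go" type="submit" value="Go" />
</form>
"""

root = ET.fromstring(FRAGMENT)

# ElementTree paths are relative to an element, hence ".//" instead of "//".
selector = './/select[@name="ticketyr"]/option[@value="2000"]'
matches = root.findall(selector)
print([option.text for option in matches])
```

Real-world HTML is rarely well-formed XML, so this is only a way to practice the path syntax, not a replacement for Selenium's XPath engine.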

In [None]:
go_button = driver.find_element_by_name("Go")
go_button.click()

Now the page has changed; it's showing inflation-adjusted numbers. We can grab the new, adjusted number.

In [None]:
gross_selector = '//font[contains(text(), "Domestic ")]/b'
print(driver.find_element_by_xpath(gross_selector).text)

# Scraping IMDB with Selenium

In [None]:
url = "http://www.imdb.com"
driver.get(url)

In [None]:
query = driver.find_element_by_id("navbar-query")
query.send_keys("Julianne Moore")

In [None]:
query.send_keys(Keys.RETURN)

In [None]:
name_selector = '//a[contains(text(), "Julianne Moore")]'
driver.find_element_by_xpath(name_selector).click()
current_url = driver.current_url

# Mixing Selenium and BeautifulSoup

In [None]:
from bs4 import BeautifulSoup

# We could fetch the page again with requests and send page.text to bs4:
#     import requests
#     page = requests.get(current_url)
# but Selenium already stores the source on the driver object,
# in driver.page_source.
soup = BeautifulSoup(driver.page_source, 'html.parser')

In [None]:
soup.prettify()

In [None]:
len(soup.find_all('a'))

In [None]:
driver.quit()  # close every window and end the WebDriver session

**Conclusion**: If a page is static, we can just use Beautiful Soup. If there is some dynamic component or interaction, we can then bring Selenium into the mix. Selenium can be used on its own or in conjunction with Beautiful Soup.

*References:* 

Documentation on finding elements:
- https://selenium-python.readthedocs.io/locating-elements.html

XPath tutorial:
- https://www.w3schools.com/xml/xpath_syntax.asp