# Scraping the Unscrapable

Some sites are hard to scrape.

Sometimes you get blocked.
Sometimes the site relies on a lot of fancy JavaScript.

## How much is too much?

In [37]:
import requests

url = 'http://www..com/robots.txt'
response = requests.get(url)
print(response.text)

# robots.txt for http://www.boxofficemojo.com

User-agent: *
Disallow: /movies/default.movies.htm
Disallow: /showtimes/buy.php
Disallow: /forums/
Disallow: /derbygame/
Disallow: /grades/
Disallow: /moviehangman/
Disallow: /users/
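
The rules printed above can also be checked programmatically. Here is a minimal sketch using the standard library's `urllib.robotparser`. It assumes the `Disallow` lines shown above are pasted in as a string, so no network request is made. The helper name `allowed` is made up for illustration.

```python
from urllib.robotparser import RobotFileParser

# Rules copied from the robots.txt output above.
ROBOTS_TXT = """\
User-agent: *
Disallow: /movies/default.movies.htm
Disallow: /showtimes/buy.php
Disallow: /forums/
Disallow: /derbygame/
Disallow: /grades/
Disallow: /moviehangman/
Disallow: /users/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())


def allowed(url, user_agent="*"):
    """Return True if the parsed rules permit user_agent to fetch url."""
    return parser.can_fetch(user_agent, url)


print(allowed("http://www.boxofficemojo.com/movies/?id=matrix.htm"))
print(allowed("http://www.boxofficemojo.com/forums/"))
```

To check a live site instead, `parser.set_url(...)` followed by `parser.read()` fetches and parses its robots.txt directly.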




Add a delay between requests so you don't hammer the server.

In [39]:
import time

page_list = ['page1','page2','page3']

for page in page_list:
    ### scrape a website
    ### ...
    print(page)
    
    time.sleep(2)
    

page1
page2
page3


Or better yet, add a random delay, so the requests don't arrive at perfectly regular intervals.

In [41]:
import random

for page in page_list:
    ### scrape a website
    ### ...
    print(page)
    
    time.sleep(.5+2*random.random())

page1
page2
page3
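
The same pattern can be wrapped in a small helper so the delay logic lives in one place. This is only a sketch: the name `crawl_politely` and the `fetch` callback are hypothetical, and the `sleep` parameter exists only so the pause can be swapped out when testing.

```python
import random
import time


def crawl_politely(pages, fetch, base=0.5, spread=2.0, sleep=time.sleep):
    """Call fetch(page) for each page, pausing a random time between calls."""
    results = []
    for i, page in enumerate(pages):
        if i > 0:
            # Same formula as above: a fixed floor plus random jitter.
            sleep(base + spread * random.random())
        results.append(fetch(page))
    return results
```

For example, `crawl_politely(page_list, print)` prints each page name with a pause of roughly 0.5 to 2.5 seconds between them.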


## How do I make requests look like a real browser?

In [27]:
import sys
import requests
from bs4 import BeautifulSoup

url = 'http://www.reddit.com'

user_agent = {'User-agent': 'Mozilla/5.0'}
response  = requests.get(url, headers = user_agent)

We can also generate a random user agent.

This library is probably a little outdated, but it still works.

In [28]:
#pip install fake_useragent
from fake_useragent import UserAgent
ua = UserAgent()

In [29]:
user_agent = {'User-agent': ua.random}
print(user_agent)

{'User-agent': 'Mozilla/5.0 (Windows NT 10.0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/40.0.2214.93 Safari/537.36'}
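
If `fake_useragent` is not available, a small hand-maintained pool gives a similar effect. The sketch below is an assumption about how one might do it, not part of the library. The pool contains only the two strings that already appear above, and `random_headers` is a made-up helper name.

```python
import random

# Illustrative pool only: the string printed above plus the bare value used earlier.
# Extend it with user agents from the browsers you actually want to mimic.
USER_AGENT_POOL = [
    "Mozilla/5.0 (Windows NT 10.0) AppleWebKit/537.36 (KHTML, like Gecko) "
    "Chrome/40.0.2214.93 Safari/537.36",
    "Mozilla/5.0",
]


def random_headers(pool=USER_AGENT_POOL):
    """Build a headers dict with a user agent picked at random from pool."""
    return {"User-agent": random.choice(pool)}


headers = random_headers()
print(headers)
```

The result can be passed straight to `requests.get(url, headers=random_headers())`.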


In [30]:
response  = requests.get(url, headers = user_agent)

In [31]:
response.text

'<!doctype html><html xmlns="http://www.w3.org/1999/xhtml" lang="en" xml:lang="en"><head><title>reddit: the front page of the internet</title><meta name="keywords" content=" reddit, reddit.com, vote, comment, submit " /><meta name="description" content="reddit: the front page of the internet" /><meta name="referrer" content="always"><meta http-equiv="Content-Type" content="text/html; charset=UTF-8" /><link type="application/opensearchdescription+xml" rel="search" href="/static/opensearch.xml"/><link rel="canonical" href="https://www.reddit.com/" /><meta name="viewport" content="width=1024"><link rel="dns-prefetch" href="//out.reddit.com"><link rel="preconnect" href="//out.reddit.com"><link rel=\'icon\' href="//www.redditstatic.com/icon.png" sizes="256x256" type="image/png" /><link rel=\'shortcut icon\' href="//www.redditstatic.com/favicon.ico" type="image/x-icon" /><link rel=\'apple-touch-icon-precomposed\' href="//www.redditstatic.com/icon-touch.png" /><link rel="alternate" type="appl

### What happens if I try to parse my Gmail with `requests` and `BeautifulSoup`?

In [1]:
gmail_url="https://mail.google.com"

soup=BeautifulSoup(requests.get(gmail_url).text)
print(soup.prettify())

<!DOCTYPE html>
<html lang="en">
 <head>
  <meta charset="utf-8"/>
  <meta content="width=300, initial-scale=1" name="viewport"/>
  <meta content="Gmail is email that's intuitive, efficient, and useful. 15 GB of storage, less spam, and mobile access." name="description"/>
  <meta content="LrdTUW9psUAMbh4Ia074-BPEVmcpBxF6Gwf0MSgQXZs" name="google-site-verification"/>
  <title>
   Gmail
  </title>
  <style>
   @font-face {
  font-family: 'Open Sans';
  font-style: normal;
  font-weight: 300;
  src: local('Open Sans Light'), local('OpenSans-Light'), url(//fonts.gstatic.com/s/opensans/v13/DXI1ORHCpsQm3Vp6mXoaTYnF5uFdDttMLvmWuJdhhgs.ttf) format('truetype');
}
@font-face {
  font-family: 'Open Sans';
  font-style: normal;
  font-weight: 400;
  src: local('Open Sans'), local('OpenSans'), url(//fonts.gstatic.com/s/opensans/v13/cJZKeOuBrn4kERxqtaUH3aCWcynf_cDxXwCLxiixG1c.ttf) format('truetype');
}
  </style>
  <style>
   h1, h2 {
  -webkit-animation-duration: 0.1s;
  -webkit-animation-name: fon



(BeautifulSoup also prints a warning here because no parser was specified. It suggests changing `BeautifulSoup([your markup])` to `BeautifulSoup([your markup], "lxml")`.)


Well, this is a tiny page. We get redirected, so soupifying it is useless. Luckily, in this case we can see where we are being sent. In many cases you won't be so lucky: the page contents are rendered by JavaScript in the browser, so just fetching the source won't help you.
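
One common way a page announces such a redirect in its source is a `<meta http-equiv="refresh">` tag. Whether Gmail uses exactly this mechanism is an assumption. The sketch below only shows how to spot one with the standard library, and `MetaRefreshFinder` and `find_meta_refresh` are made-up names.

```python
from html.parser import HTMLParser


class MetaRefreshFinder(HTMLParser):
    """Remember the target URL of the first <meta http-equiv="refresh"> tag."""

    def __init__(self):
        super().__init__()
        self.target = None

    def handle_starttag(self, tag, attrs):
        if tag != "meta" or self.target is not None:
            return
        attr = dict(attrs)
        if (attr.get("http-equiv") or "").lower() != "refresh":
            return
        # The content attribute looks like "0; url=https://example.com/next".
        _, _, rest = (attr.get("content") or "").partition(";")
        key, _, value = rest.strip().partition("=")
        if key.strip().lower() == "url":
            self.target = value.strip().strip("'\"")


def find_meta_refresh(html):
    """Return the redirect target found in html, or None if there is none."""
    finder = MetaRefreshFinder()
    finder.feed(html)
    return finder.target
```

Calling `find_meta_refresh(response.text)` on a fetched page returns the redirect URL if such a tag is present. Redirects performed by JavaScript will not be found this way.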

Anyway, let's follow the redirection for now.

In [2]:
new_url = "https://mail.google.com/mail"

# requests.get fetches the requested URL
soup = BeautifulSoup(requests.get(new_url).text)
print(soup.prettify())

<!DOCTYPE html>
<html lang="en">
 <head>
  <meta charset="utf-8"/>
  <meta content="width=300, initial-scale=1" name="viewport"/>
  <meta content="Gmail is email that's intuitive, efficient, and useful. 15 GB of storage, less spam, and mobile access." name="description"/>
  <meta content="LrdTUW9psUAMbh4Ia074-BPEVmcpBxF6Gwf0MSgQXZs" name="google-site-verification"/>
  <title>
   Gmail
  </title>
  <style>
   @font-face {
  font-family: 'Open Sans';
  font-style: normal;
  font-weight: 300;
  src: local('Open Sans Light'), local('OpenSans-Light'), url(//fonts.gstatic.com/s/opensans/v13/DXI1ORHCpsQm3Vp6mXoaTYnF5uFdDttMLvmWuJdhhgs.ttf) format('truetype');
}
@font-face {
  font-family: 'Open Sans';
  font-style: normal;
  font-weight: 400;
  src: local('Open Sans'), local('OpenSans'), url(//fonts.gstatic.com/s/opensans/v13/cJZKeOuBrn4kERxqtaUH3aCWcynf_cDxXwCLxiixG1c.ttf) format('truetype');
}
  </style>
  <style>
   h1, h2 {
  -webkit-animation-duration: 0.1s;
  -webkit-animation-name: fon

In [2]:
print(soup.find(id='Email'))

<input id="Email" name="Email" placeholder="Email or phone" spellcheck="false" type="email" value=""/>


We have hit the login page. We can't get to the emails without logging in. To do that, we can drive a real browser with Selenium.

### Setting up Selenium

- Install: `pip install selenium`
- Download chromedriver: http://chromedriver.storage.googleapis.com/index.html
- Recommended version: 2.29
- Move chromedriver to `/Applications`

In [4]:
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
import time

import os
chromedriver = "/Applications/chromedriver"
os.environ["webdriver.chrome.driver"] = chromedriver


driver = webdriver.Chrome(chromedriver)

In [5]:
driver.get("https://mail.google.com")

# Alternatives to Chrome:
# Firefox, phantomjs

#### Fill out username and password, hit enter to log in

In [3]:
username_form = driver.find_element_by_id("Email")
username_form.send_keys("metisseleniumtest@gmail.com")
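
A note for readers on newer Selenium releases: the `find_element_by_*` helpers used in this notebook were removed in Selenium 4.3 in favor of `driver.find_element(strategy, value)`. The sketch below assumes such a newer install. The mapping uses the string values of the `By.ID`, `By.NAME` and `By.XPATH` constants, and the helper name `find` is made up for illustration.

```python
# String values of the selenium.webdriver.common.by.By constants.
LEGACY_TO_STRATEGY = {
    "find_element_by_id": "id",        # By.ID
    "find_element_by_name": "name",    # By.NAME
    "find_element_by_xpath": "xpath",  # By.XPATH
}


def find(driver, legacy_name, value):
    """Look up an element through the generic driver.find_element(strategy, value)."""
    return driver.find_element(LEGACY_TO_STRATEGY[legacy_name], value)
```

With this, `find(driver, "find_element_by_id", "Email")` is the counterpart of `driver.find_element_by_id("Email")`.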

In [19]:
username_form.send_keys(Keys.RETURN)

In [20]:
password_form=driver.find_element_by_id('Passwd')
password_form.send_keys('thisismetis')

In [21]:
password_form.send_keys(Keys.RETURN)
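
When these cells are run one by one, the inbox has time to load before the next step. In a script, the next lookup may run before the page is ready. Selenium ships `WebDriverWait` for this. The sketch below shows the same polling idea using only the standard library, and `wait_until` is a hypothetical helper, not a Selenium function.

```python
import time


def wait_until(condition, timeout=10.0, poll=0.5, sleep=time.sleep, clock=time.monotonic):
    """Call condition() until it returns a truthy value or timeout seconds pass."""
    deadline = clock() + timeout
    while True:
        try:
            result = condition()
        except Exception:
            # Treat lookup errors (e.g. element not found yet) as "not ready".
            result = None
        if result:
            return result
        if clock() >= deadline:
            raise TimeoutError("condition not met within %s seconds" % timeout)
        sleep(poll)
```

For example, `wait_until(lambda: driver.find_element_by_xpath('//div[text()="COMPOSE"]'))` keeps retrying the lookup until the compose button exists or ten seconds have passed.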

#### Click compose button to start a new email draft!

In [22]:
compose_button=driver.find_element_by_xpath('//div[text()="COMPOSE"]')
compose_button.click()

#### Write a nice, friendly message to Paul

In [23]:
to_field = driver.find_element_by_name("to")
to_field.send_keys("paul@thisismetis.com")

In [24]:
subject = driver.find_element_by_name("subjectbox")
subject.send_keys("This is an alert!")

In [25]:
message_body = driver.find_element_by_xpath("//div[@aria-label='Message Body']")
message_body.send_keys("Hello,")
message_body.send_keys([Keys.RETURN, Keys.RETURN])
message_body.send_keys("I am a computer and I just became self aware.")

#### Press the send button

In [26]:
send_button = driver.find_element_by_xpath("//div[contains(@aria-label, 'Send')]")
send_button.click()


### Scraping boxofficemojo with selenium

In [30]:
matrix_url = "http://www.boxofficemojo.com/movies/?id=matrix.htm"
driver.get(matrix_url)


In [31]:
# contains() matches on the element's text; the trailing /b selects its <b> child
gross_selector = '//font[contains(text(), "Domestic")]/b'
print(driver.find_element_by_xpath(gross_selector).text)

$171,479,930
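
The scraped value is a string. To do arithmetic with it, strip the dollar sign and the thousands separators first. A minimal sketch, with `parse_dollars` as a made-up helper name:

```python
def parse_dollars(text):
    """Convert a string such as '$171,479,930' to an integer number of dollars."""
    return int(text.replace("$", "").replace(",", "").strip())


print(parse_dollars("$171,479,930"))
```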


In [32]:
# scraping genre
genre_selector = '//a[contains(@href, "/genres/chart/")]/b'
for genre_anchor in driver.find_elements_by_xpath(genre_selector):
    print(genre_anchor.text)

Action - Wire-Fu
Man vs. Machine
Post-Apocalypse
Virtual Reality


In [33]:
# select the year 2000 in the inflation-adjustment dropdown
inf_adjust_2000_selector = '//select[@name="ticketyr"]/option[@value="2000"]'
driver.find_element_by_xpath(inf_adjust_2000_selector).click()

In [34]:
go_button = driver.find_element_by_name("Go")
go_button.click()

Now the page has changed and shows inflation-adjusted numbers. We can grab the new, adjusted gross.

In [35]:
gross_selector = '//font[contains(text(), "Domestic ")]/b'
print(driver.find_element_by_xpath(gross_selector).text)

$181,944,300


### Scraping IMDB with selenium

In [43]:
url = "http://www.imdb.com"
driver.get(url)

In [44]:
query = driver.find_element_by_id("navbar-query")
query.send_keys("Nicolas Cage")

In [45]:
query.send_keys(Keys.RETURN)

In [47]:
name_selector = '//a[contains(text(), "Nicolas Cage")]'
driver.find_element_by_xpath(name_selector).click()

In [48]:
driver.close()

References:
- Documentation on finding elements: http://selenium-python.readthedocs.org/en/latest/locating-elements.html
- XPath tutorial: http://www.w3schools.com/xpath/xpath_syntax.asp
- Good XPath syntax diagram: http://www.guru99.com/xpath-selenium.html

![](http://cdn.guru99.com/images/3-2016/032816_0758_XPathinSele1.png)