# Scraping the Unscrapable

## What happens if I try to parse my Gmail with `requests` and `BeautifulSoup`?

> Note: Websites change over time, so the code and the HTML tags it targets may need updating.

In [2]:
import requests
from bs4 import BeautifulSoup

# Run `pip install python-dotenv` before running this notebook.

gmail_url = "https://mail.google.com"
soup = BeautifulSoup(requests.get(gmail_url).text, "html5lib")
print(soup.prettify())

<!DOCTYPE html>
<html lang="en">
 <head>
  <meta charset="utf-8"/>
  <meta content="width=300, initial-scale=1" name="viewport"/>
  <meta content="Gmail is email that's intuitive, efficient, and useful. 15 GB of storage, less spam, and mobile access." name="description"/>
  <meta content="LrdTUW9psUAMbh4Ia074-BPEVmcpBxF6Gwf0MSgQXZs" name="google-site-verification"/>
  <title>
   Gmail
  </title>
  <style>
   @font-face {
  font-family: 'Open Sans';
  font-style: normal;
  font-weight: 300;
  src: local('Open Sans Light'), local('OpenSans-Light'), url(//fonts.gstatic.com/s/opensans/v15/mem5YaGs126MiZpBA-UN_r8OUuhs.ttf) format('truetype');
}
@font-face {
  font-family: 'Open Sans';
  font-style: normal;
  font-weight: 400;
  src: local('Open Sans'), local('OpenSans'), url(//fonts.gstatic.com/s/opensans/v15/mem8YaGs126MiZpBA-UFVZ0e.ttf) format('truetype');
}
  </style>
  <style>
   h1, h2 {
  -webkit-animation-duration: 0.1s;
  -webkit-animation-name: fontfix;
  -webkit-animation-iteratio

You may get a tiny page, or you may get redirected; the result depends on your settings, cookies, and so on. Soupifying this is useless, of course. Luckily, in this case we can see where we are sent. In many cases you won't be so lucky: the page contents are rendered by JavaScript in the browser, so fetching the source alone won't help.
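To see where a request actually ended up, `requests` keeps the intermediate responses in `response.history` and the final address in `response.url`. The following is a minimal sketch built on those two attributes; the helper names `redirect_chain` and `was_redirected` are hypothetical and not part of `requests`.

```python
def redirect_chain(response):
    """Return every URL visited, in order, ending with the final one.

    `response` is expected to be a requests.Response: `history` holds the
    intermediate redirect responses and `url` is the final address.
    (`redirect_chain` is a hypothetical helper, not part of requests.)
    """
    return [hop.url for hop in response.history] + [response.url]


def was_redirected(response):
    """True if at least one HTTP redirect was followed."""
    return len(response.history) > 0
```

For example, `redirect_chain(requests.get(gmail_url))` would list the hops for the request above. Redirects performed by JavaScript in the browser never show up here, which is one reason the raw source can be misleading.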

Anyway, let's follow the redirection for now.

In [3]:
new_url = "https://mail.google.com/mail"

# requests.get fetches the page at the given URL
soup = BeautifulSoup(requests.get(new_url).text, "lxml")  # name the parser explicitly
print(soup.prettify())

<!DOCTYPE html>
<html lang="en">
 <head>
  <meta charset="utf-8"/>
  <meta content="width=300, initial-scale=1" name="viewport"/>
  <meta content="Gmail is email that's intuitive, efficient, and useful. 15 GB of storage, less spam, and mobile access." name="description"/>
  <meta content="LrdTUW9psUAMbh4Ia074-BPEVmcpBxF6Gwf0MSgQXZs" name="google-site-verification"/>
  <title>
   Gmail
  </title>
  <style>
   @font-face {
  font-family: 'Open Sans';
  font-style: normal;
  font-weight: 300;
  src: local('Open Sans Light'), local('OpenSans-Light'), url(//fonts.gstatic.com/s/opensans/v15/mem5YaGs126MiZpBA-UN_r8OUuhs.ttf) format('truetype');
}
@font-face {
  font-family: 'Open Sans';
  font-style: normal;
  font-weight: 400;
  src: local('Open Sans'), local('OpenSans'), url(//fonts.gstatic.com/s/opensans/v15/mem8YaGs126MiZpBA-UFVZ0e.ttf) format('truetype');
}
  </style>
  <style>
   h1, h2 {
  -webkit-animation-duration: 0.1s;
  -webkit-animation-name: fontfix;
  -webkit-animation-iteratio



Note: when no parser is named, bs4 warns and suggests passing one explicitly, for example `BeautifulSoup(YOUR_MARKUP, "lxml")`.


In [4]:
print(soup.find(id='Email'))

<input id="Email" name="Email" placeholder="Email or phone" spellcheck="false" type="email" value=""/>


We have hit the login page. We can't get to the emails without logging in ....
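A scraper can check for this situation automatically instead of relying on a manual look at the output. The sketch below is one possible heuristic, written with the standard library's `html.parser` so it has no extra dependencies; the names `LoginFieldFinder` and `looks_like_login_page` are made up for illustration, and treating any email or password input as a sign-in form is an assumption that will not hold for every site.

```python
from html.parser import HTMLParser


class LoginFieldFinder(HTMLParser):
    """Collect <input> tags whose type suggests a sign-in form."""

    LOGIN_TYPES = {"email", "password"}

    def __init__(self):
        super().__init__()
        self.login_inputs = []

    def handle_starttag(self, tag, attrs):
        # attrs arrives as a list of (name, value) pairs
        attributes = dict(attrs)
        if tag == "input" and attributes.get("type") in self.LOGIN_TYPES:
            self.login_inputs.append(attributes)


def looks_like_login_page(html):
    """Heuristic: the page has at least one email or password input."""
    finder = LoginFieldFinder()
    finder.feed(html)
    return bool(finder.login_inputs)


# The <input> element found by soup.find(id='Email') above
sample = '<input id="Email" name="Email" placeholder="Email or phone" type="email" value=""/>'
print(looks_like_login_page(sample))
```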

In [5]:
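# conda install selenium
# download chromedriver: https://sites.google.com/a/chromium.org/chromedriver/downloads
# Note: this notebook uses the Selenium 3 API. Selenium 4.3 and later drop the
# find_element_by_* helpers in favor of driver.find_element(By.ID, ...) and friends.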

from selenium import webdriver
from selenium.webdriver.common.keys import Keys
import time

import os

chromedriver = f"{os.environ['HOME']}/.local/bin/chromedriver" # path to the chromedriver executable
os.environ["webdriver.chrome.driver"] = chromedriver

driver = webdriver.Chrome(chromedriver)
driver.get("https://mail.google.com")

# Alternatives to Chrome: Firefox (PhantomJS is no longer maintained)

### Interlude: how to include usernames and passwords

We have to enter a username and password in order to log in. However, we **don't** want our password uploaded to GitHub for people to scrape! One solution is to use _environment variables_.

In your directory, create a file called `.env` that has the following format:
```bash
EMAIL="your_username@gmail.com"
PASSWORD="your_password"
```
DON'T ADD THIS FILE TO GITHUB!
It is prudent to add the line `.env` to your `.gitignore`.

We add two commands to the top of the cell:
```
# allows us to use the %dotenv "magic" command
%load_ext dotenv
# reads .env, and makes EMAIL and PASSWORD environment variables
%dotenv
```
We can now use `os.environ.get` to access the environment variables without having them appear in the notebook.
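One caveat is that `os.environ.get` quietly returns `None` when a variable is missing, and the failure only surfaces later when Selenium tries to type it. A small wrapper can fail earlier with a clearer message. This is a sketch only; `require_env` is a hypothetical helper name, not part of `os` or python-dotenv.

```python
import os


def require_env(name):
    """Return the value of an environment variable, or fail loudly.

    `require_env` is a hypothetical helper; os.environ.get itself
    silently returns None when the variable is missing.
    """
    value = os.environ.get(name)
    if not value:
        raise RuntimeError(
            f"{name} is not set; check that .env exists and %dotenv has run"
        )
    return value
```

It would be used as `EMAIL = require_env("EMAIL")` in place of `os.environ.get("EMAIL")`.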

In [7]:
!pwd

/Users/liuriguang/ds/metis/metisgh/sea18_ds6/class_lectures/week02-luther1/02-regression_scrape


### Fill out username and password, hit enter to log in

In [8]:
# https://github.com/theskumar/python-dotenv
# See notes about environment variables
%load_ext dotenv
%dotenv
import os
EMAIL = os.environ.get('EMAIL')
PASSWORD = os.environ.get('PASSWORD')
username_form = driver.find_element_by_id("identifierId")
username_form.send_keys(EMAIL) # enter email

The dotenv extension is already loaded. To reload it, use:
  %reload_ext dotenv


In [9]:
# shows that it is working. Don't do this for PASSWORD!
os.environ.get('EMAIL')

'enjoyeverysecondofmylife@gmail.com'

In [10]:
username_form.send_keys(Keys.RETURN)

In [11]:
password_form = driver.find_element_by_name("password")  # another approach: locate by name instead of id
password_form.send_keys(PASSWORD) # enter password

In [12]:
password_form.send_keys(Keys.RETURN)

### Click compose button to start a new email draft

In [18]:
compose_button=driver.find_element_by_xpath('//div[text()="Compose"]')
compose_button.click()

### Write a nice, friendly (optional) message to your favorite person

In [19]:
to_field = driver.find_element_by_name("to")
to_field.send_keys("gliderpengying@gmail.com") # replace with your recipient's address

In [20]:
subject = driver.find_element_by_name("subjectbox")
subject.send_keys("This is an alert!")

In [21]:
message_body = driver.find_element_by_xpath("//div[@aria-label='Message Body']")
message_body.send_keys("Hello,")
message_body.send_keys([Keys.RETURN, Keys.RETURN])
message_body.send_keys("I am a computer and I just became self aware!")

### Press the send button

In [22]:
send_button = driver.find_element_by_xpath("//div[contains(@aria-label, 'Send')]")
send_button.click()

## You did it! [It's Party Time!](https://media.giphy.com/media/zQLjk9d31jlMQ/giphy.gif)

# Scraping Box Office Mojo with Selenium

In [23]:
matrix_url = "http://www.boxofficemojo.com/movies/?id=matrix.htm"
driver.get(matrix_url)


In [24]:
# contains() matches <font> tags whose text includes "Domestic"; the trailing /b selects the <b> tag inside
gross_selector = '//font[contains(text(), "Domestic")]/b'
print(driver.find_element_by_xpath(gross_selector).text)

$171,479,930


In [25]:
# scraping genre
genre_selector = '//a[contains(@href, "/genres/chart/")]/b'
for genre_anchor in driver.find_elements_by_xpath(genre_selector):
    print(genre_anchor.text)

Action - Wire-Fu
Man vs. Machine
Post-Apocalypse
Virtual Reality


In [31]:
# choose 2013 ticket prices in the inflation-adjustment dropdown
inf_adjust_2013_selector = '//select[@name="ticketyr"]/option[@value="2013"]'
driver.find_element_by_xpath(inf_adjust_2013_selector).click()

In [32]:
go_button = driver.find_element_by_name("Go")
go_button.click()

The page has now changed and shows inflation-adjusted numbers, so we can grab the new, adjusted figure.

In [33]:
gross_selector = '//font[contains(text(), "Domestic ")]/b'
print(driver.find_element_by_xpath(gross_selector).text)

$274,435,400
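The scraped values are strings, so they need converting before any arithmetic. A minimal sketch, assuming the figures always look like the two printed above (a dollar sign and comma separators); `parse_dollars` is a hypothetical helper name.

```python
def parse_dollars(text):
    """Convert a string such as '$171,479,930' to an integer number of dollars."""
    cleaned = text.strip().replace("$", "").replace(",", "")
    return int(cleaned)


domestic = parse_dollars("$171,479,930")   # unadjusted figure scraped above
adjusted = parse_dollars("$274,435,400")   # after selecting 2013 ticket prices
print(f"adjustment factor: {adjusted / domestic:.2f}")
```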


# Scraping IMDb with Selenium

In [41]:
url = "http://www.imdb.com"
driver.get(url)

In [42]:
query = driver.find_element_by_id("navbar-query")
query.send_keys("Julianne Moore")

In [43]:
query.send_keys(Keys.RETURN)

In [44]:
name_selector = '//a[contains(text(), "Julianne Moore")]'
driver.find_element_by_xpath(name_selector).click()
current_url = driver.current_url

# Mixing Selenium and BeautifulSoup

In [45]:
driver.page_source

'<!DOCTYPE html><html xmlns="http://www.w3.org/1999/xhtml" xmlns:og="http://ogp.me/ns#" xmlns:fb="http://www.facebook.com/2008/fbml" class=" scriptsOn"><head><script async="" src="https://m.media-amazon.com/images/G/01/csm/showads.v2.js" crossorigin="anonymous"></script><script async="" src="https://images-na.ssl-images-amazon.com/images/G/01/AUIClients/ClientSideMetricsAUIJavascript@jserrorsForester.0acd236281a4d93774c265b3bec043f2087a43c2._V2_.js" crossorigin="anonymous"></script>\n        \n<script type="text/javascript">var ue_t0=ue_t0||+new Date();</script>\n<script type="text/javascript">\nwindow.ue_ihb = (window.ue_ihb || window.ueinit || 0) + 1;\nif (window.ue_ihb === 1) {\n\nvar ue_csm = window,\n    ue_hob = +new Date();\n(function(d){var e=d.ue=d.ue||{},f=Date.now||function(){return+new Date};e.d=function(b){return f()-(b?0:d.ue_t0)};e.stub=function(b,a){if(!b[a]){var c=[];b[a]=function(){c.push([c.slice.call(arguments),e.d(),d.ue_id])};b[a].replay=function(b){for(var a;a=c.

In [46]:
from bs4 import BeautifulSoup
# We could fetch the page with requests and pass page.text to bs4,
# but Selenium already stores the source on the driver object,
# in driver.page_source:
#
# import requests
# page = requests.get(current_url)
soup = BeautifulSoup(driver.page_source, 'html.parser')

In [47]:
soup.prettify()

'<!DOCTYPE html>\n<html class=" scriptsOn" xmlns="http://www.w3.org/1999/xhtml" xmlns:fb="http://www.facebook.com/2008/fbml" xmlns:og="http://ogp.me/ns#">\n <head>\n  <script async="" crossorigin="anonymous" src="https://m.media-amazon.com/images/G/01/csm/showads.v2.js">\n  </script>\n  <script async="" crossorigin="anonymous" src="https://images-na.ssl-images-amazon.com/images/G/01/AUIClients/ClientSideMetricsAUIJavascript@jserrorsForester.0acd236281a4d93774c265b3bec043f2087a43c2._V2_.js">\n  </script>\n  <script type="text/javascript">\n   var ue_t0=ue_t0||+new Date();\n  </script>\n  <script type="text/javascript">\n   window.ue_ihb = (window.ue_ihb || window.ueinit || 0) + 1;\nif (window.ue_ihb === 1) {\n\nvar ue_csm = window,\n    ue_hob = +new Date();\n(function(d){var e=d.ue=d.ue||{},f=Date.now||function(){return+new Date};e.d=function(b){return f()-(b?0:d.ue_t0)};e.stub=function(b,a){if(!b[a]){var c=[];b[a]=function(){c.push([c.slice.call(arguments),e.d(),d.ue_id])};b[a].replay

In [48]:
len(soup.find_all('a'))

848

In [49]:
driver.close()

References:
- Selenium documentation on [finding elements](http://selenium-python.readthedocs.io/locating-elements.html)
- [XPath tutorial](https://www.w3schools.com/xml/xpath_intro.asp)

## OpenTable

Open a new Chrome browser and wait 1 second to let the page load.

In [50]:
driver = webdriver.Chrome(chromedriver)
driver.get('http://www.opentable.com/')
time.sleep(1)
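A fixed `time.sleep(1)` is a guess: too short on a slow connection, wasted time on a fast one. Selenium ships `WebDriverWait` (in `selenium.webdriver.support.ui`) for waiting on a condition instead. The sketch below shows the underlying polling idea in plain Python; `wait_until` is a hypothetical helper written for illustration, not Selenium's implementation.

```python
import time


def wait_until(condition, timeout=10.0, poll_interval=0.25):
    """Call `condition` repeatedly until it returns a truthy value.

    Returns that value, or raises TimeoutError once `timeout` seconds
    have passed. `wait_until` is a hypothetical helper for illustration.
    """
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError(f"condition not met within {timeout} seconds")
        time.sleep(poll_interval)
```

With the driver from this notebook, something like `wait_until(lambda: driver.find_elements_by_name('Select_1'))` would return as soon as the dropdown exists, because `find_elements_*` returns an empty list rather than raising when nothing matches.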

Using the developer inspector, we can see that the dropdown for the number of people has the `name` `Select_1`. Let's set the reservation for 3 people.

In [51]:
people_dropdown = driver.find_element_by_name('Select_1')
time.sleep(1)
people_dropdown.send_keys("3 people")
time.sleep(1)

Select Friday as the date

In [54]:
date_picker = driver.find_element_by_name('datepicker')
time.sleep(1)
date_picker.click()  # opens the calendar widget
time.sleep(1)

In [56]:
date_element = driver.find_element_by_xpath('//div[@data-pick="1538722800000"]')
date_element.click()
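The hard-coded `data-pick` value looks like a Unix timestamp in milliseconds: 1538722800000 decodes to 2018-10-05 07:00 UTC, which is midnight on Friday, October 5, 2018 at UTC-7. Assuming the site always encodes dates as local-midnight timestamps (an inference from this single value), the value for any other day can be computed rather than copied from the inspector. The helper names below are hypothetical.

```python
from datetime import date, datetime, timedelta, timezone


def data_pick_value(day, utc_offset_hours):
    """Milliseconds since the Unix epoch at local midnight on `day`.

    Assumes the site encodes dates as local-midnight timestamps, which is
    an inference from the single value used above.
    """
    local_zone = timezone(timedelta(hours=utc_offset_hours))
    midnight = datetime(day.year, day.month, day.day, tzinfo=local_zone)
    return int(midnight.timestamp() * 1000)


def next_friday(today):
    """First Friday strictly after `today` (Monday is weekday 0, Friday 4)."""
    days_ahead = (4 - today.weekday() - 1) % 7 + 1
    return today + timedelta(days=days_ahead)


# Reproduces the hard-coded value: Friday, October 5, 2018 at UTC-7
print(data_pick_value(date(2018, 10, 5), utc_offset_hours=-7))
```

The selector can then be built with an f-string, for example `f'//div[@data-pick="{value}"]'`. The UTC offset has to match the time zone the site uses for your session, which is another assumption to verify in the inspector.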

Make the reservation for 8 PM:

In [59]:
time_dropdown = driver.find_element_by_name('Select_0')
time_dropdown.send_keys("8:00 PM")
time.sleep(1)

Let's search!

In [69]:
search = driver.find_element_by_xpath('//input[@tabindex="5"]')
search.click()
time.sleep(1)

NoSuchElementException: Message: no such element: Unable to locate element: {"method":"xpath","selector":"//input[@tabindex="5"]"}
  (Session info: chrome=69.0.3497.100)
  (Driver info: chromedriver=2.42.591059 (a3d9684d10d61aa0c45f6723b327283be1ebaad8),platform=Mac OS X 10.13.4 x86_64)


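Once the search goes through, the page shows a list of restaurants. (The `NoSuchElementException` above illustrates how brittle a `tabindex`-based selector is: it stops matching as soon as the page layout changes.) We could use BeautifulSoup to parse from here, which would be simpler.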

In [70]:
driver.page_source

'<!DOCTYPE html><html xmlns="http://www.w3.org/1999/xhtml" lang="en" class=" js no-touch svg inlinesvg svgclippaths no-ie8compat"><head><meta http-equiv="X-UA-Compatible" content="IE=9; IE=8; IE=7; IE=EDGE" /><title>Restaurants and Restaurant Reservations | OpenTable</title><meta name="viewport" content="width=device-width, initial-scale=1, maximum-scale=1" /><link rel="shortcut icon" href="//components.otstatic.com/components/favicon/1.0.5/favicon/favicon.ico" type="image/x-icon" /><link rel="icon" href="//components.otstatic.com/components/favicon/1.0.5/favicon/favicon-16.png" sizes="16x16" /><link rel="icon" href="//components.otstatic.com/components/favicon/1.0.5/favicon/favicon-32.png" sizes="32x32" /><link rel="icon" href="//components.otstatic.com/components/favicon/1.0.5/favicon/favicon-48.png" sizes="48x48" /><link rel="icon" href="//components.otstatic.com/components/favicon/1.0.5/favicon/favicon-64.png" sizes="64x64" /><link rel="icon" href="//components.otstatic.com/compone

In [71]:
spans = driver.find_elements_by_xpath('//span[@class="rest-row-name-text"]')
for span in spans[:10]:
    print(span.text)

You would use only slightly more complicated code to, for example:
* Get the number of "dollar signs" for each restaurant and print that out.
* Get a list of restaurants that had reservations available at exactly 8:00 PM.
* Print out the rating, out of five stars, of each restaurant.

Each of these would take a bit of time and experimentation to figure out, but hopefully you see that it is possible.

Let's click on "Zuni Cafe":

In [None]:
for span in spans:
    if span.text == "Zuni Cafe":
        span.click()
        break  # stop after the first match

Uh oh. We opened a new tab, but our driver is still on the old tab, so
```
driver.find_element_by_xpath('//p[@class="readmore"]')
```
for example, won't work!

We can use the following to switch the driver to the correct window:

In [None]:
# window_handles[1] is the newly opened tab; switch_to_window is deprecated
driver.switch_to.window(driver.window_handles[1])

Let's find the description's "Read more" link and click it to expand the text:

In [None]:
description_element = driver.find_element_by_xpath('//span[text()="Read more"]/..')
description_element.click()

In [None]:
driver.close()

## Examples of paging through a table


In [76]:
driver = webdriver.Chrome(chromedriver)
driver.get('https://www.boxofficemojo.com/yearly/chart/?yr=2018&p=.htm')
time.sleep(1)

In [77]:
import pandas as pd

tables = pd.read_html(driver.page_source)
len(tables)

9

In [78]:
# table 2 looks like the one we want

def get_url(page):
    if page == 1:
        return 'https://www.boxofficemojo.com/yearly/chart/?yr=2018&p=.htm'
    return f'https://www.boxofficemojo.com/yearly/chart/?page={page}&view=releasedate&view2=domestic&yr=2018&p=.htm'

In [79]:
page = 1

list_of_tables = []
while page < 100:
    url = get_url(page)
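    try:
        driver.get(url)
    except Exception:
        # navigation failed; treat it as running out of pages
        break
    time.sleep(1)  # give the page a moment to load
    # index 2 is the chart table identified above
    list_of_tables.append(pd.read_html(driver.page_source)[2])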
    
    page = page + 1
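Relying on an exception to detect the last page is fragile, since a site will often answer a page number past the end with an empty or repeated table rather than an error. A more defensive pattern is to stop when a page comes back empty. The sketch below separates that stopping rule from the fetching; `collect_pages` and the `fetch_page` callable are hypothetical names, and the empty-page behavior is an assumption to check against the real site.

```python
def collect_pages(fetch_page, max_pages=100):
    """Gather results page by page until a page comes back empty.

    `fetch_page(page_number)` is a hypothetical callable that returns the
    rows found on that page (an empty list, or None, when there are none).
    """
    pages = []
    for page_number in range(1, max_pages + 1):
        rows = fetch_page(page_number)
        if rows is None or len(rows) == 0:
            break  # ran past the last page
        pages.append(rows)
    return pages


# Stand-in for a real site: two pages of data, then nothing
fake_site = {1: ["a", "b"], 2: ["c"]}
print(collect_pages(lambda n: fake_site.get(n, [])))
```

In this notebook, `fetch_page` could wrap the `driver.get` and `pd.read_html` calls; a pandas DataFrame works as the return value because `len(df)` counts its rows.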



In [80]:
list_of_tables

[                                                   0    \
 0    2018 DOMESTIC GROSSESTotal Grosses of all Movi...   
 1                                      < Previous Year   
 2    Data as of:  Today  Today in 2018  Jan. 31, 20...   
 3    RankMovie Title (click to view)Studio function...   
 4                                                 Rank   
 5    Filter4th Row A24 AAE Abr. ADC Affirm Amazon A...   
 6                                                    1   
 7                                                    2   
 8                                                    3   
 9                                                    4   
 10                                                   5   
 11                                                   6   
 12                                                   7   
 13                                                   8   
 14                                                   9   
 15                                                  10 

In [81]:
movie_df = pd.concat(list_of_tables[:-1], axis=0)

In [82]:
import requests

r = requests.get('https://factfinder.census.gov/faces/nav/jsf/pages/community_facts.xhtml')

In [83]:
from IPython.display import HTML

HTML(r.text)

In [84]:
driver = webdriver.Chrome(chromedriver)
driver.get('https://factfinder.census.gov/faces/nav/jsf/pages/community_facts.xhtml')
time.sleep(1)

In [85]:
search_box = driver.find_element_by_id('cfsearchtextbox')
search_box.send_keys('Seattle, Washington')

In [86]:
go_button = driver.find_element_by_xpath('//a[text()="GO"]')
go_button.click()

In [87]:
education_element = driver.find_element_by_xpath('//a[text()="Education"]')
education_element.click()

In [88]:
soup = BeautifulSoup(driver.page_source, 'lxml')  # name a specific parser to avoid the bs4 warning





In [89]:
driver.close()

In [90]:
print(soup.prettify())

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
 <head>
  <style type="text/css">
   [uib-typeahead-popup].dropdown-menu{display:block;}
  </style>
  <style type="text/css">
   .uib-time input{width:50px;}
  </style>
  <style type="text/css">
   [uib-tooltip-popup].tooltip.top-left &gt; .tooltip-arrow,[uib-tooltip-popup].tooltip.top-right &gt; .tooltip-arrow,[uib-tooltip-popup].tooltip.bottom-left &gt; .tooltip-arrow,[uib-tooltip-popup].tooltip.bottom-right &gt; .tooltip-arrow,[uib-tooltip-popup].tooltip.left-top &gt; .tooltip-arrow,[uib-tooltip-popup].tooltip.left-bottom &gt; .tooltip-arrow,[uib-tooltip-popup].tooltip.right-top &gt; .tooltip-arrow,[uib-tooltip-popup].tooltip.right-bottom &gt; .tooltip-arrow,[uib-tooltip-html-popup].tooltip.top-left &gt; .tooltip-arrow,[uib-tooltip-html-popup].tooltip.top-right &gt; .tooltip-arrow,[uib-tooltip-html-popup].tooltip.bottom