# Datascraping : Web Scraping / APIs
Dec 10th, 2018 - Javier Garcia-Bernardo, Anna Keuchenius & Allie Morgan

In [96]:
## Requirements
import requests               # Simple HTTP operations (GET and POST)
import selenium               # Loads dynamic (javascript) pages
import json                   # Parsing the responses from APIs
import re                     # Regular expressions (pattern matching in text)
from bs4 import BeautifulSoup # Parsing HTML
import pandas as pd           # Read tables
import time

## What is datascraping?

[Data scraping](https://en.wikipedia.org/wiki/Web_scraping) is a method for extracting data from the web. There are many techniques which can be used for web scraping — ranging from requiring human involvement (“human copy-paste”) to fully automated systems (using computer vision). Somewhere in the middle is HTML parsing, which we will describe here.

Web scraping using [HTML parsing](https://en.wikipedia.org/wiki/Web_scraping#HTML_parsing) is often used on webpages which share a similar HTML structure. For example, you might want to scrape the ingredients from chocolate chip cookie recipes to identify correlations between ingredients and five-star worthy cookies, predict who will win March Madness by looking at game play-by-plays, or list all the local pets up for adoption.

## Part I: Static Webpages

In [46]:
urls = ["https://www.boulderhumane.org/animals/adoption/dogs", 
         "https://www.boulderhumane.org/animals/adoption/cats", 
         "https://www.boulderhumane.org/animals/adoption/adopt_other"]

page = requests.get(urls[0])
# Extract the HTML text of the response
html = page.text
print(html[:500]) # Print the first 500 characters of the HTML

<!DOCTYPE html>
<head>
<meta http-equiv="X-UA-Compatible" content="IE=Edge" />
<meta charset="utf-8" />
<link rel="shortcut icon" href="https://www.boulderhumane.org/sites/default/files/favicon.ico" type="image/vnd.microsoft.icon" />
<meta name="Generator" content="Drupal 7 (http://drupal.org)" />
<meta name="viewport" content="width=1000px, initial-scale=1.0, maximum-scale=1.0" />
<title>Dogs Available for Adoption | Humane Society of Boulder Valley</title>
<link type="text/css" rel="stylesheet


### Extracting information: parsing html

When you visit a webpage, your web browser renders an HTML document with CSS and Javascript to produce a visually appealing page. (See the HTML above.) [BeautifulSoup](https://www.crummy.com/software/BeautifulSoup/) is a Python library for parsing HTML. We'll use it to extract all of the names, ages, and breeds of the [dogs](https://www.boulderhumane.org/animals/adoption/dogs), [cats](https://www.boulderhumane.org/animals/adoption/cats), and [small animals](https://www.boulderhumane.org/animals/adoption/adopt_other) currently up for adoption at the Boulder Humane Society.

In [6]:
soup = BeautifulSoup(html, 'html.parser')

Note that the feature of these pages we are exploiting is their repeated HTML structure. Every animal listed has a variant of the following HTML:
```{html}
<div class="views-row ... ">
  ...
  <div class="views-field views-field-field-pp-animalname">
    <div class="field-content">
      <a href="/animals/adoption/" title="Adopt Me!">Romeo</a>
    </div>
  </div>
  <div class="views-field views-field-field-pp-primarybreed">
    <div class="field-content">New Zealand</div>
  </div>
  <div class="views-field views-field-field-pp-secondarybreed">
    <div class="field-content">Rabbit</div>
  </div>
  <div class="views-field views-field-field-pp-age">
    ...
    <span class="field-content">0 years 2 months</span>
  </div>
  <div class="views-field views-field-field-pp-gender">
    ...
    <span class="field-content">Male</span>
  </div>
  ...
</div>
``` 
So to get at the HTML object for each pet, we can run the following:

In [13]:
pets = soup.find_all('div', {'class': re.compile('.*views-row.*')})

That is, find all of the `div` tags with the `class` attribute which contains the substring `views-row`. 

**Note:** we use a regular expression (regex) in the `re.compile` statement. Regex helps to find patterns in text; more info on regex [here](https://docs.python.org/3/library/re.html). The pattern `.*` matches any sequence of characters (by default, any character except a newline).
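To make the pattern matching concrete, here is a minimal sketch using only the `re` module. The class strings are hypothetical values modelled on the HTML above. As far as the BeautifulSoup documentation describes, a compiled pattern is applied with its `search()` method, so under that assumption the shorter pattern `views-row` selects the same values as `.*views-row.*`.

```python
import re

# Hypothetical class attribute values, modelled on the HTML shown above
class_values = [
    "views-row views-row-1 views-row-odd",
    "views-field views-field-field-pp-age",
    "views-row views-row-2 views-row-even",
]

explicit = re.compile(".*views-row.*")  # the pattern used in this notebook
short = re.compile("views-row")         # shorter pattern, same result with search()

# search() looks for the pattern anywhere in the string
matches_explicit = [v for v in class_values if explicit.search(v)]
matches_short = [v for v in class_values if short.search(v)]

print(matches_explicit == matches_short)
```

Only the two `views-row` strings are kept by either pattern; the `views-field` string is skipped.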

Next, to grab the names, breeds, and ages of these pets, we'll grab the children of each pet HTML object. For example:

In [17]:
head = "views-field views-field-field-pp-"
for pet in pets:
    name = pet.find('div', {'class': head + 'animalname'}).get_text(strip=True)
    primary_breed = pet.find('div', {'class': head + 'primarybreed'}).get_text(strip=True)
    secondary_breed = pet.find('div', {'class': head + 'secondarybreed'}).get_text(strip=True)
    age = pet.find('div', {'class': head + 'age'}).get_text(strip=True)
    print(name, primary_breed, secondary_breed, age)

(u'Blue', u'Coonhound, Treeing Walker', u'', u'Age:7 years 5 months')
(u'Calvin', u'Pointer, German Shorthaired', u'Mix', u'Age:1 year 9 months')
(u'Mona', u'Greyhound', u'Retriever, Labrador', u'Age:8 years 1 month')
(u'Iris', u'Siberian Husky', u'Saint Bernard', u'Age:7 years 0 months')
(u'Lucia', u'Chihuahua, Short Coat', u'Mix', u'Age:6 years 0 months')
(u'Karma', u'Terrier, American Pit Bull', u'Mix', u'Age:3 years 0 months')
(u'Chuco', u'Terrier, American Pit Bull', u'Mix', u'Age:3 years 7 months')
(u'Harper', u'Terrier, American Pit Bull', u'Mix', u'Age:1 year 1 month')
(u'Lupita', u'Bulldog, American', u'Mix', u'Age:2 years 0 months')
(u'Winston', u'Chihuahua, Short Coat', u'Mix', u'Age:1 year 0 months')
(u'Beans', u'Terrier, American Pit Bull', u'Mix', u'Age:0 years 2 months')
(u'Coco', u'Bulldog, English', u'Mix', u'Age:0 years 5 months')
(u'Oliver', u'Alaskan Klee Kai', u'', u'Age:7 years 10 months')
(u'Canon', u'Siberian Husky', u'Mix', u'Age:3 years 0 months')
(u'Eve', u'M

where each call to `find` gets a child of a pet object, in particular the `div` whose `class` attribute looks like `views-field views-field-field-pp-*`. Feel free to replace the URL in the code above with the cat or small animal pages provided and see how the output changes.
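The age comes back as a string such as `Age:7 years 5 months`, which is awkward to sort or filter on. As a rough sketch (the helper name `age_in_months` is made up for this example, and it assumes every age follows the "N years M months" wording seen in the output above), a regex can turn it into a number:

```python
import re

# Matches e.g. "7 years 5 months" or "1 year 1 month" (singular or plural)
AGE_PATTERN = re.compile(r"(\d+)\s+years?\s+(\d+)\s+months?")

def age_in_months(age_text):
    """Convert a string like 'Age:7 years 5 months' to a number of months.

    Returns None when the text does not follow the expected wording.
    """
    match = AGE_PATTERN.search(age_text)
    if match is None:
        return None
    years, months = (int(group) for group in match.groups())
    return 12 * years + months

print(age_in_months("Age:7 years 5 months"))
print(age_in_months("Age:1 year 9 months"))
```

Returning `None` for unexpected text keeps the scraper from crashing on a single oddly formatted entry.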

We often use BeautifulSoup's `.find` or `.find_all` methods. However, there are other useful methods too, for example `.findNext` or `.nextSibling`. More in the [docs](https://www.crummy.com/software/BeautifulSoup/bs4/doc/) (especially under the tab 'method names').

### Part Ib: Extract Tables from Webpages

We can quickly read a table from HTML and convert it to a Pandas dataframe with the pandas method `read_html`. Then we can use the regular pandas methods to manipulate or filter the data.

In [19]:
table = pd.read_html("https://en.wikipedia.org/wiki/List_of_sandwiches",header=0)[0]

# Write table to CSV
#table.to_csv("filenamehere.csv")

# Output the top rows of the table
table.head(20)

Unnamed: 0,Name,Image,Origin,Description
0,Bacon,,United Kingdom,Often eaten with ketchup or brown sauce
1,"Bacon, egg and cheese",,United States,"Breakfast sandwich, usually with fried or scra..."
2,Bagel toast,,Israel,"Pressed, toasted bagel filled with vegetables ..."
3,Baked bean,,United States,"Canned baked beans on white or brown bread, so..."
4,Bánh mì[4],,Vietnam,"Filling is typically meat, but can contain a w..."
5,Barbecue[5][6][7],,United States,"Served on a bun, with chopped, sliced, or shre..."
6,Barros Jarpa,,Chile,"Ham and cheese, usually mantecoso, which is si..."
7,Barros Luco,,Chile,Beef (usually thin-cut steak) and cheese
8,Bauru,,Brazil,"Melted cheese, roast beef, tomato, and pickled..."
9,Beef on weck,,"United States(Buffalo, New York)",Roast beef on a Kummelweck roll


In [35]:
# Keep only the sandwiches whose origin is exactly 'United Kingdom'
sandwiches_from_uk = table[table['Origin'] == 'United Kingdom']
sandwiches_from_uk.head()

Unnamed: 0,Name,Image,Origin,Description
0,Bacon,,United Kingdom,Often eaten with ketchup or brown sauce
19,British Rail,,United Kingdom,Reference to the poor quality of catering on t...
29,Cheese and pickle,,United Kingdom,Slices of cheese (typically Cheddar) and pickl...
37,Chip butty[11][12][13][14],,United Kingdom,"Sliced white bread (or a large, flat bread rol..."
44,Corned beef,,United Kingdom,Corned beef often served with a condiment such...


## Part II: Dynamic Webpages

Above, we requested webpages whose content did not depend on [Javascript](https://en.wikipedia.org/wiki/JavaScript) and that required no input from the user (e.g. a login). Let's try a more complicated example of web scraping, where content is loaded dynamically.

[Selenium](https://www.seleniumhq.org/download/)

Some advantages of scraping with Selenium: it can handle Javascript, it returns the **HTML** after the Javascript has been rendered, and it can behave like a person. The disadvantage of using Selenium is that it is generally slow.

Requirements (one of the below):
- Firefox + geckodriver (https://github.com/mozilla/geckodriver/releases)
- Chrome + chromedriver (https://sites.google.com/a/chromium.org/chromedriver/)
    
Note: geckodriver/chromedriver must have execution permissions (chmod +x geckodriver)

In [47]:
import selenium.webdriver

Start the browser. You can also define how long you are willing to wait for a page to load; often this is not needed, but it doesn't hurt.

In [48]:
# Open the driver (change the executable path to the location of your chromedriver;
# for Firefox, use selenium.webdriver.Firefox with the path to geckodriver instead)
driver = selenium.webdriver.Chrome(executable_path="./chromedriver")
#driver = selenium.webdriver.Chrome()

Visit [xkcd](https://xkcd.com) and click through the comics.

In [24]:
# Get the xkcd website
driver.get("https://xkcd.com/")

In [32]:
# Let's find the 'random' button and go to a random comic
random_element = driver.find_element_by_xpath('//*[@id="middleContainer"]/ul[1]/li[3]/a')
random_element.click()

In [33]:
# Identify the 'next' button and go to the next comic
next_element = driver.find_element_by_css_selector("a[rel='next']")
next_element.click()

As you can see in the examples above, there are several ways to identify elements on the page, such as with the xpath or by css selectors. You can find more methods in the [docs](https://onlinetraining.etestinghub.com/webdriver-methods-web-elements/).

Get an attribute of an element on this page.

In [34]:
element = driver.find_element_by_xpath('//*[@id="comic"]/img')
element.get_attribute("title")

u"Earth clearly hasn't been inspected, since it's definitely contaminated with salmonella."

In [62]:
driver.find_element_by_xpath('//*[@id="middleContainer"]/ul[1]/li[3]/a').click()

### Login with Selenium

Let's visit a webpage which requires a login. Signing in to Facebook ...

In [53]:
##DO NOT WRITE YOUR PASSWORD IN NOTEBOOKS!!
fb_email = "your-email"

In [94]:
# Go to Facebook
driver = selenium.webdriver.Chrome(executable_path="./chromedriver")
driver.get("https://www.facebook.com/")

In [99]:
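# Send the email; the password is typed by hand in the browser window
driver.find_element_by_xpath('//*[@id="email"]').send_keys(fb_email)
# Print the message first, then wait, so the 20 seconds start after the prompt
print("You now have 20 seconds to fill out your password")
time.sleep(20)
# driver.find_element_by_xpath('//*[@id="pass"]').send_keys(fb_pass)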

You now have 20 seconds to fill out your password


In [100]:
# Click on login
driver.find_element_by_xpath('//*[@id="loginbutton"]').click()

In [101]:
# Go to CSS Amsterdam Website
driver.get("https://www.facebook.com/CSSamsterdam/")

In [93]:
# Click the 'Like' button ('Vind ik leuk' is Dutch for 'Like'; this text depends on the browser language)
driver.find_element_by_xpath("//*[contains(text(), 'Vind ik leuk')]").click()

In [102]:
# Always remember to close your browser!
driver.close()

## Part III: APIs

To allow users to access large amounts of data, companies may provide an [Application Programming Interface (API)](https://en.wikipedia.org/wiki/Application_programming_interface). Often these requests are handled via GET and POST HTTP requests. For example, to make a request to the Twitter API:

```{bash}
curl --request GET \
 --url 'https://api.twitter.com/1.1/search/tweets.json?q=nasa&result_type=popular' \
 --header 'authorization: OAuth oauth_consumer_key="consumer-key-for-app", ... , oauth_token="access-token-for-authed-user", oauth_version="1.0"'
 ```
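The part of the URL after the `?` carries the query parameters (`q` and `result_type` in the request above). A small sketch with the standard library shows how such a URL is assembled from a dictionary and taken apart again; the variable names are chosen for this illustration only, and no request is actually sent.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

base_url = "https://api.twitter.com/1.1/search/tweets.json"
params = {"q": "nasa", "result_type": "popular"}

# Build the full URL: key=value pairs joined by '&' after the '?'
full_url = base_url + "?" + urlencode(params)
print(full_url)

# Going the other way: recover the parameters from an existing URL
recovered = parse_qs(urlsplit(full_url).query)
print(recovered)
```

Libraries such as `requests` do this encoding for you when you pass a `params` argument, which also takes care of escaping spaces and special characters.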

APIs often return data in the format of [Javascript Object Notation (JSON)](https://en.wikipedia.org/wiki/JSON). For example:

```{json}
{"status": 200, "message": "hello world"}
```
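In Python, the `json` module turns such a response into ordinary dictionaries and lists. A minimal sketch, using the example string above in place of a real API response:

```python
import json

raw = '{"status": 200, "message": "hello world"}'

# json.loads parses a JSON string into Python objects (here, a dict)
parsed = json.loads(raw)
print(parsed["status"], parsed["message"])

# json.dumps goes the other way, e.g. to save a response to disk
round_trip = json.dumps(parsed)
```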

### Explicit APIs

Next, let's try a more typical example of an API. The perks of this approach: 
- (a) send a request and get back JSON, 
- (b) chances are that somebody else has created a Python wrapper for you, but keep in mind that 
- (c) APIs have limits.

Let's consider a common API example -- Twitter. To get started:
- Get a key: https://apps.twitter.com/
- Documentation: https://dev.twitter.com/rest/public
- Find a library: https://dev.twitter.com/resources/twitter-libraries (We'll use https://github.com/tweepy/tweepy)

Limitations: 100 messages / query, 180 messages every 15 min, & only the last seven days of data 

In [104]:
!pip install tweepy

Below you will find the Twitter keys that were created for this demo. Please don't use these keys any longer after this datascraping course. GENERALLY, NEVER STORE PASSWORDS, KEYS OR OTHER SENSITIVE INFORMATION IN NOTEBOOKS.

In [108]:
d_keys = {}
d_keys['CONSUMER_KEY'] = 'iISHnFwAbqlBehAP6RxUIjawG'
d_keys['CONSUMER_SECRET'] = 'XJTe1LlhZMSaRHE7SH1paZGwzb2HNbd5GyeDgI9HoLVoeBoBVM'
d_keys['ACCESS_KEY'] = '3001441779-A3BEX2aY86j7yOsjwiZ7bRYmFkHnfpsSaPVtaBs'
d_keys['ACCESS_SECRET'] = 'qa1ZKpMZPmKqoKJUvD4Gdw3GWhuMRHtIrFvkR7DYoaLHc'
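Outside of a demo, one common alternative is to keep the keys in environment variables and read them at runtime. The sketch below is only one way to do this: the `TWITTER_` prefix, the helper name `load_twitter_keys`, and the placeholder values are assumptions made for this example.

```python
import os

KEY_NAMES = ["CONSUMER_KEY", "CONSUMER_SECRET", "ACCESS_KEY", "ACCESS_SECRET"]

def load_twitter_keys(environ=None, prefix="TWITTER_"):
    """Build the d_keys dictionary from environment variables.

    The variable names (e.g. TWITTER_CONSUMER_KEY) are hypothetical;
    pick whatever naming suits your setup.
    """
    environ = os.environ if environ is None else environ
    missing = [prefix + name for name in KEY_NAMES if prefix + name not in environ]
    if missing:
        raise KeyError("Missing environment variables: " + ", ".join(missing))
    return {name: environ[prefix + name] for name in KEY_NAMES}

# Demonstration with a fake environment, so no real secrets are needed
fake_environ = {"TWITTER_" + name: "placeholder-" + name.lower() for name in KEY_NAMES}
d_keys_from_env = load_twitter_keys(fake_environ)
print(sorted(d_keys_from_env))
```

In real use you would call `load_twitter_keys()` without arguments, after setting the variables in your shell rather than in the notebook.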

In [113]:
import tweepy
import time
import pickle

def twitter(d_keys,query, num_results=1000):
    # Authenticate
    auth = tweepy.OAuthHandler(d_keys["CONSUMER_KEY"], d_keys["CONSUMER_SECRET"])
    auth.set_access_token(d_keys["ACCESS_KEY"], d_keys["ACCESS_SECRET"])
    api = tweepy.API(auth)

    # Count the tweets collected so far (we want num_results of them)
    result_count = 0
    last_id = None
    
    # Max 180 tweets 15 min
    cumulative = 0

    #While we don't have them
    while (result_count <  num_results):
        previous_tweets = result_count
        # Ask for more tweets, starting in the 'last_id' (identifier of the tweet)
        results = api.search(q = query,
                              count = 90, max_id = last_id, result_type="recent")
                                # geocode = "{},{},{}km".format(latitude, longitude, max_range) #for geocode

        # For each tweet extract some info (JSON structure)
        for result in results:
            result_count += 1
            user = result.user.screen_name
            text = result.text
            followers_count = result.user.followers_count
            time_zone = result.user.time_zone
            print("_"*10)
            print(user,time_zone,followers_count)
            print(text)

        new_tweets = result_count - previous_tweets

        print ("Number of results: {} ({} new)".format(result_count,new_tweets))

        # If we don't get new tweets exit (this also avoids reading `result` when no tweet came back)
        if new_tweets == 0: 
            break

        # Keep the last_id to know where to continue
        last_id = int(result.id)-1
        
        time.sleep(1)
        
        if ((result_count + 90) // 150) > cumulative:
            cumulative += 1
            time.sleep(15*60)


twitter(d_keys,"from:sfiscience", num_results=10)

__________
(u'sfiscience', None, 28802)
It's working... https://t.co/HfqQ8QDO1N
__________
(u'sfiscience', None, 28802)
RT @ewhitmore: Did you know @sfiscience in #SantaFe is one of the top places in the world to explore #ComplexSystems &amp; Complexity Science?…
__________
(u'sfiscience', None, 28802)
RT @DavidFeldman: This is a great program!  Highly recommended.  I was a student in 1996 and have served as the director since 2017.  The s…
__________
(u'sfiscience', None, 28802)
RT @StefaniCrabtree: Our amazing collaborator on the #archaeoEcology project run through @sfiscience, Andy Dugmore talking about learning f…
__________
(u'sfiscience', None, 28802)
A fun, very quick intro to the field of #AI c/o our own @MelMitchell1 + @LetsWorkHappy #podcast...come for the robo… https://t.co/JPSJChD6IT
__________
(u'sfiscience', None, 28802)
Transcend disciplinary boundaries, take intellectual risks, and ask big questions about complex systems!  

Deadlin… https://t.co/lMJyXj8r81
__________
(

# Advanced Scraping Issues

### "Hidden" APIs

First, let's try and access what we are calling a "hidden" API. That is, we investigate the resources requested by a webpage (e.g. a list of faculty), and make requests directly to that API. 

We will do this for the website: https://www.uvm.edu/directory

First, visit uvm.edu/directory and open the network tab of your browser's developer tools as you do a search in this directory.  
Copy the GET request as a cURL command and paste it into https://curl.trillworks.com/, which converts it directly to a Python `requests` command.

In [122]:
import requests
import json

def get_names(letters):
    params = (
        ('name', letters),
        ('request_num', '1'),
    )

    response = requests.get('https://www.uvm.edu/directory/api/query_results.php', params=params)
    if response.ok:
        return response.text
    else:
        return None

In [142]:
response = get_names("john smith")

In [143]:
response_json = json.loads(response)

In [145]:
from IPython.core.display import display
for i, person in enumerate(response_json["data"]):
#     display(person)
    if i == 10: 
        break # Make sure we don't print too much
        
    print(person["edupersonprimaryaffiliation"]["0"], person["edupersonprincipalname"]["0"], person["cn"]["0"])

(u'Affiliate', u'jfsmith@uvm.edu', u'John F. Smith')
(u'Student', u'dsmith41@uvm.edu', u'David John Smith')
(u'Affiliate', u'jfsmith@uvm.edu', u'John F. Smith')
(u'Student', u'dsmith41@uvm.edu', u'David John Smith')
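Hidden APIs are undocumented, so fields can be missing and the request itself can fail (in which case `get_names` returns `None`). The following sketch reads the same three fields defensively. The sample response is hand-written for this example, modelled on the output above with one field left out on purpose; the helper names `first_value` and `summarize` are made up here.

```python
import json

# Hand-written sample mimicking the structure used above (not a real API response)
sample_response = json.dumps({
    "data": [
        {"cn": {"0": "John F. Smith"},
         "edupersonprincipalname": {"0": "jfsmith@uvm.edu"},
         "edupersonprimaryaffiliation": {"0": "Affiliate"}},
        {"cn": {"0": "David John Smith"},
         "edupersonprincipalname": {"0": "dsmith41@uvm.edu"}},  # affiliation left out
    ]
})

def first_value(person, field):
    """Return person[field]["0"], or None if the field is absent."""
    return person.get(field, {}).get("0")

def summarize(response_text):
    """Return (affiliation, principal name, full name) tuples, tolerating gaps."""
    if response_text is None:  # the request failed
        return []
    people = json.loads(response_text).get("data", [])
    return [(first_value(p, "edupersonprimaryaffiliation"),
             first_value(p, "edupersonprincipalname"),
             first_value(p, "cn")) for p in people]

for row in summarize(sample_response):
    print(row)
```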


###  Session ID's

### Robust scraping

Websites change their HTML all the time, so it is worthwhile to make your scraper robust. Here are a few tips on how to do this; you might know some other tricks too.

- Don't make your scraper language dependent. Your browser settings influence the text displayed on websites, so extracting elements by their text is sensitive to your browser settings and to the general language of the website. It's better not to extract elements by text.
- Save the raw HTML. As we learned last week in Damian Trilling's Database Management workshop, it is very good practice to save the entire HTML of the page instead of only the elements you are interested in. That way, if the website changes its HTML and your scraper breaks or no longer extracts the right elements, you can simply re-extract the information later from the raw HTML that you saved in your database.
- Use the drilldown method. Don't look for a class or attribute in the entire HTML; first drill down to the specific part of the page. It is very bad practice to look for all elements with a very general HTML attribute (such as 'row'), which returns a list, and then select the right index of that list. This is very sensitive to HTML changes!
- I always try to avoid xpath, for the same reason as above.
- Track your progress. Build your scraper so that it is not a problem if it crashes. Scrapers will always crash, almost by default. For example, your VPN connection can shut down (if applicable), your internet connection might break, the site you're scraping might go down for a moment, etc. Track your progress somewhere so that you can always restart your scraper and have it continue from where it left off.
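Two of the tips above, saving the raw HTML and tracking progress, can be combined in a single table. Here is a minimal sketch with `sqlite3`; the table layout, the function names, and the example.org URLs are all assumptions made for this illustration, and the HTML is a placeholder rather than a downloaded page.

```python
import sqlite3

def open_store(path=":memory:"):
    """Open (or create) a database with one row per URL to scrape."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS pages ("
        "url TEXT PRIMARY KEY, raw_html TEXT, status TEXT NOT NULL DEFAULT 'pending')"
    )
    return conn

def add_urls(conn, urls):
    # INSERT OR IGNORE keeps existing rows, so re-running never resets progress
    conn.executemany("INSERT OR IGNORE INTO pages (url) VALUES (?)", [(u,) for u in urls])
    conn.commit()

def save_page(conn, url, raw_html):
    # Store the full raw HTML and mark the URL as done
    conn.execute("UPDATE pages SET raw_html = ?, status = 'done' WHERE url = ?", (raw_html, url))
    conn.commit()

def pending_urls(conn):
    rows = conn.execute("SELECT url FROM pages WHERE status = 'pending' ORDER BY url")
    return [row[0] for row in rows]

conn = open_store()
add_urls(conn, ["https://example.org/a", "https://example.org/b"])
save_page(conn, "https://example.org/a", "<html>placeholder</html>")
print(pending_urls(conn))
```

With a file path instead of `:memory:`, a restarted scraper can call `pending_urls` and continue with whatever is left.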


### Crawlers -- scaling up

In many cases, scraping can be easily parallelized, especially if you have several URLs that need to be scraped independently. If you do a search on a website and get many result pages, you can also parallelize your code by dividing the work over several scrapers that each scrape several pages. However, you then need a way of tracking what has been scraped and what has not. Maybe some of you have advice on how to do this? I personally use a tracking table in my database that tracks the progress.

Again, robustness is important: build your scrapers or crawlers in such a way that it is absolutely fine if a scraper dies.
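For a modest number of independent URLs, one lightweight option is a thread pool from the standard library. This is a sketch of that general idea, not the approach used in this workshop: `scrape_all` and `fake_fetch` are hypothetical names, and `fake_fetch` stands in for a real download so the example runs without network access.

```python
from concurrent.futures import ThreadPoolExecutor

def scrape_all(urls, fetch, max_workers=4):
    """Run fetch(url) for every URL in parallel threads.

    Returns (results, failures) so that one failing page
    does not take down the whole run.
    """
    results, failures = {}, {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {url: pool.submit(fetch, url) for url in urls}
        for url, future in futures.items():
            try:
                results[url] = future.result()
            except Exception as error:  # keep going, but remember what failed
                failures[url] = error
    return results, failures

def fake_fetch(url):
    # Stand-in for a real request; fails on purpose for one URL
    if url.endswith("/broken"):
        raise RuntimeError("simulated failure for " + url)
    return "<html>" + url + "</html>"

demo_urls = ["https://example.org/1", "https://example.org/2", "https://example.org/broken"]
results, failures = scrape_all(demo_urls, fake_fetch)
print(len(results), "pages scraped,", len(failures), "failed")
```

The URLs in `failures` can simply be retried later, which fits the idea that a dying scraper should never be a problem.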

There are several ways to parallelize scrapers, i.e. to set up crawlers. One way is to do this yourself, without an external service, by means of `subprocess`. Here is some simple code I wrote to do this. The `spin_up_scraper` function spins up several scrapers and checks every x seconds (here, every minute) whether each scraper is still active. If one dies, another scraper is spun up.
`main.py` is the scraper code.

In [None]:
import time
import subprocess

nr_scrapers = 10
nr_hours_scraping = 10

def start_scraper():
    # Launch main.py (the scraper code) as a separate process
    return subprocess.Popen(['python', 'main.py'],
                            stdin=None, stdout=None, stderr=None, close_fds=True)

def spin_up_scraper(nr_scrapers, nr_hours_scraping):
    scraper_processes = []
    for scraper_i in range(nr_scrapers):
        print("---------------Starting next scraper-------------------------------")
        scraper_processes.append(start_scraper())

        # Wait a few moments before starting the next scraper
        time.sleep(20)

    # Check every minute if all scrapers are up, if one is down, start a new one
    for minutes in range(60*nr_hours_scraping):
        # Sleep 60 seconds till the next check
        time.sleep(60)
        # Iterate over a copy, because the list is modified inside the loop
        for scraper in list(scraper_processes):
            # poll() returns None while the process is still running,
            # and the exit code once it has stopped
            exit_code = scraper.poll()
            if exit_code is not None:
                scraper_processes.remove(scraper)
                print('----One scraper down. Starting a new one ------------------------')
                scraper_processes.append(start_scraper())
                time.sleep(20)

# Start the scrapers and keep them alive for nr_hours_scraping hours
spin_up_scraper(nr_scrapers, nr_hours_scraping)


### Scrolling down

## Downloading files