# Datascraping Advanced Issues
Dec 10th, 2018 - Javier Garcia-Bernardo, Anna Keuchenius & Allie Morgan

## "Hidden" APIs

First, let's try and access what we are calling a "hidden" API. That is, we investigate the resources requested by a webpage (e.g. a list of faculty), and make requests directly to that API. 

We will do this for the website: https://www.uvm.edu/directory

First, visit uvm.edu/directory and open the network tab of your browser's developer tools while you run a search in the directory.  
Copy the GET request as a cURL command and paste it into https://curl.trillworks.com/, which converts it directly into a Python requests command.

In [1]:
import requests
import json

def get_names(letters):
    params = (
        ('name', letters),
        ('request_num', '1'),
    )

    response = requests.get('https://www.uvm.edu/directory/api/query_results.php', params=params)
    if response.ok:
        return response.text
    else:
        return None
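
As a quick check before sending anything, the query string built from the params can be reproduced with the standard library. This is only a sketch: `build_query_url` is a hypothetical helper (not part of the directory's API), and the endpoint is the one copied from the network tab.

```python
from urllib.parse import urlencode

# Endpoint copied from the browser's network tab (same url as in get_names)
BASE_URL = 'https://www.uvm.edu/directory/api/query_results.php'

def build_query_url(letters, request_num='1'):
    """Hypothetical helper: return the full GET url for a directory search."""
    params = (
        ('name', letters),
        ('request_num', request_num),
    )
    # urlencode escapes spaces and special characters for us
    return BASE_URL + '?' + urlencode(params)

url = build_query_url("john smith")
print(url)
```

Pasting the printed url into a browser is an easy way to inspect the raw response before writing any parsing code.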

In [2]:
response = get_names("john smith")

In [5]:
response_json = json.loads(response)

In [6]:
from IPython.core.display import display
for i, person in enumerate(response_json["data"]):
#     display(person)
    if i == 10: 
        break # Make sure we don't print too much
        
    print(person["edupersonprimaryaffiliation"]["0"], person["edupersonprincipalname"]["0"], person["cn"]["0"])

Affiliate jfsmith@uvm.edu John F. Smith
Student dsmith41@uvm.edu David John Smith
Affiliate jfsmith@uvm.edu John F. Smith
Student dsmith41@uvm.edu David John Smith
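
Hidden APIs are undocumented, so fields can be missing for some records. Below is a minimal sketch of more defensive parsing. It assumes the response has the same shape as the one printed above; the sample payload (including the "Jane Doe" record) and the helper names `first_value` and `summarize` are made up for illustration.

```python
import json

# Hypothetical sample, shaped like the fields printed above
sample_text = '''
{"data": [
  {"cn": {"0": "John F. Smith"},
   "edupersonprincipalname": {"0": "jfsmith@uvm.edu"},
   "edupersonprimaryaffiliation": {"0": "Affiliate"}},
  {"cn": {"0": "Jane Doe"}}
]}
'''

def first_value(person, field, default=None):
    """Return person[field]["0"], or default if the field is missing."""
    return person.get(field, {}).get("0", default)

def summarize(response_text):
    """Turn the raw response text into a list of (affiliation, email, name)."""
    if response_text is None:  # get_names returns None on a failed request
        return []
    payload = json.loads(response_text)
    fields = ("edupersonprimaryaffiliation", "edupersonprincipalname", "cn")
    return [tuple(first_value(p, f) for f in fields)
            for p in payload.get("data", [])]

rows = summarize(sample_text)
for row in rows:
    print(row)
```

Missing fields come back as None instead of raising a KeyError, so one incomplete record does not stop the whole scrape.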


## Session IDs

Example: Web of Science / Web of Knowledge (only works if the UvA VPN is on, or if you are connected to eduroam)
- Beware, this website is hard to scrape because it uses lots of params, cookies and data values
- Some of these are required and others are not (below we use the full list)
- Besides the SID, there are other variables that you will need if you scrape this website beyond the search page (such as qid and parentQid)

In [67]:
import requests
import bs4 as bs
import re

In [59]:
print("Be patient, wos is a little slow")
page = requests.get('http://apps.webofknowledge.com')
print(page.ok)
print(page.is_redirect)
print(page.text[:1000])

True
False
<!DOCTYPE html>                                                                                                                                                                                                                                                                                                                                             <html lang="en">                                                                                                          <head><link rel="icon" href="http://images.webofknowledge.com/WOKRS531NR4/images/wok_favicon.ico" type="image/x-icon"/> <title>Web of Science [v.5.31]  -     Web of Science Core Collection Basic Search  </title><link rel="stylesheet" href="http://images.webofknowledge.com/WOKRS531NR4/css/WoKcommon.css" type="text/css" /><link rel="stylesheet" href="http://images.webofknowledge.com/WOKRS531NR4/css/WoKcomponents.css" type="text/css" /><link rel="stylesheet" href="http://images.webofknowledge.com/WOKRS531NR4/css/Font

In [60]:
def find_sid(page):
    # Parse html to soup element     
    soup = bs.BeautifulSoup(page.text, 'html.parser')
    # Find all urls on this page
    link_elements = soup.find_all('a', href=True)
    links = [a['href'] for a in link_elements]
    # Find a url that mentions the SID (sid stays None if no link has one)
    sid = None
    for link in links:
        if "&SID" in link:
            m = re.search('SID=([a-zA-Z0-9]*)&', link)
            if m:
                sid = m.group(1)
                print("SID = " + sid )
                break
    return sid

In [68]:
sid = find_sid(page)  # keep the SID, we need it for the search request
sid

SID = C6qUgZxsBpEFX5sdEyh


u'C6qUgZxsBpEFX5sdEyh'
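
An alternative to the regular expression in find_sid is to let the standard library parse the query string. This is a sketch, not the way the site requires it to be done: `extract_sid` is a hypothetical helper and the example link is made up, modelled on the links the site uses.

```python
from urllib.parse import urlparse, parse_qs

def extract_sid(link):
    """Return the SID query parameter of a url, or None if it has none."""
    query = parse_qs(urlparse(link).query)
    values = query.get('SID')
    return values[0] if values else None

# Hypothetical link, modelled on the urls found on the search page
example = ('http://apps.webofknowledge.com/WOS_GeneralSearch_input.do'
           '?product=WOS&search_mode=GeneralSearch'
           '&SID=C6qUgZxsBpEFX5sdEyh&preferencesSaved=')

print(extract_sid(example))
print(extract_sid('http://apps.webofknowledge.com/'))
```

Unlike the regular expression, this also finds the SID when it is the last parameter of the url (with no trailing &).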

Enter your search string for Web of Knowledge here:

In [69]:
search_for = "duncan watts"

In [70]:
# The values for cookies, headers, and data were extracted by copying the POST request as cURL
# and converting it via https://curl.trillworks.com/

cookies = {
    'JSESSIONID': '99D741BD336ECEAE76CAD3B393B28DEC',
    '_abck': '71DAE2633D154E88038BB9AB588774F65F65025C79420000DE1B115CFB966873~0~iE9/DcS3kdvjrrZVkrxrxf8XZwdEfBkj8wGcJpsxPgk=~-1~-1',
    'bm_sz': '2486FC5ACB86312F6255A7068D5A0A8E~QAAQXAJlX9Dp8ptnAQAABd3UojnpkufjEno/Soup6ud0aOug8vqn2/Y30pnLBro8ZWGeym3eeUxQshGhvslq7JSMKrN8C+ID0Bs6zOPNSf4nQyWyEbNFB/tqDePD/S+DIMGDuzSA1Rw6qg9NWfPLxjUDZSz1w+HPiXDTmDjvdgtXeVmsGboBc4ki7nLbZ92MXa5X417dDw==',
    'SID': sid,
    'CUSTOMER': 'SURFMARKET BV_Netherlands Consortium',
    'E_GROUP_NAME': 'Universiteit van Amsterdam',
    'dotmatics.elementalKey': 'SLsLWlMhrHnTjDerSrlG',
    'ak_bmsc': '3E7A47598CAAF5268D5B25BC910E3E4B5F65025C79420000DE1B115C0EE4B205~plMIIaKYNxqVa5r3ZwEYYpeARdwS1xfcg++jvnlggN+Bgd7CHMxWIY9Xlmm+OWlBrw918+OtL+aC5vf0p4TXjAagNBuPBjLZfu0eH4dd7u5ckNy5xdogi7r9pMRzyududsEufRAZeC7KgI3taS383LnuQVS/6DLROYU5Cnt3nHdUtUiKhJMg4YRjCqXclJvRt/Tsy4Cgx/YkTooXOkvv5bL+3tSqjLcW8ZsiF+lmZ5n1+xK/7LCXQQ30R56gyrLjZQ',
    'bm_sv': 'FFAA3183B0316D39CBFC5803DA96F660~jilZYddUoiGVvK4MVDpSTAjE+YXfAiZnhnCH+bAGZMzN67D761Q3XiYXHoyB4wNiq/Twf2oDTf9Pg6kcUDM8v8ym7WSirBJu3jbW5Cw3C5GvHjGSthV7IfsxJ/7WEDEF+PeY22WlfjvM8zeiuhnQh4RdWkO5qrzEOwvbkXS9El4=',
    '_sp_ses.630e': '*',
    '_sp_id.630e': 'b85ae547-8640-4c9d-815d-4ac2d5004766.1544625121.1.1544625121.1544625121.6ebc7484-bc0e-4fbb-ad02-e771ba38fb56',
}

headers = {
    'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:63.0) Gecko/20100101 Firefox/63.0',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Language': 'en-US,en;q=0.5',
    'Referer': 'http://apps.webofknowledge.com/WOS_GeneralSearch_input.do?product=WOS&search_mode=GeneralSearch&SID=E1hJmPezAu8d99A5SoW&preferencesSaved=',
    'Content-Type': 'application/x-www-form-urlencoded',
    'Connection': 'keep-alive',
    'Upgrade-Insecure-Requests': '1',
}

data = [
  ('fieldCount', '1'),
  ('action', 'search'),
  ('product', 'WOS'),
  ('search_mode', 'GeneralSearch'),
  ('SID', sid),
  ('max_field_count', '25'),
  ('max_field_notice', 'Notice: You cannot add another field.'),
  ('input_invalid_notice', 'Search Error: Please enter a search term.'),
  ('exp_notice', 'Search Error: Patent search term could be found in more than one family (unique patent number required for Expand option) '),
  ('input_invalid_notice_limits', ' <br/>Note: Fields displayed in scrolling boxes must be combined with at least one other search field.'),
  ('sa_params', 'WOS||E1hJmPezAu8d99A5SoW|http://apps.webofknowledge.com|\''),
  ('formUpdated', 'true'),
  ('value(input1)', search_for),
  ('value(select1)', 'TS'),
  ('value(hidInput1)', ''),
  ('limitStatus', 'collapsed'),
  ('ss_lemmatization', 'On'),
  ('ss_spellchecking', 'Suggest'),
  ('SinceLastVisit_UTC', ''),
  ('SinceLastVisit_DATE', ''),
  ('period', 'Range Selection'),
  ('range', 'ALL'),
  ('startYear', '1975'),
  ('endYear', '2018'),
  ('editions', 'SCI'),
  ('editions', 'SSCI'),
  ('editions', 'AHCI'),
  ('editions', 'ESCI'),
  ('update_back2search_link_param', 'yes'),
  ('ssStatus', 'display:none'),
  ('ss_showsuggestions', 'ON'),
  ('ss_numDefaultGeneralSearchFields', '1'),
  ('ss_query_language', ''),
  ('rs_sort_by', 'PY.D;LD.D;SO.A;VL.D;PG.A;AU.A'),
]

In [71]:
response = requests.post('http://apps.webofknowledge.com/WOS_GeneralSearch.do', headers=headers, cookies=cookies, data=data)

In [79]:
# Print the titles of the first 10 search results
soup = bs.BeautifulSoup(response.text, 'html.parser')
results = soup.find_all('div', {'class': 'search-results-content'})
for result in results:
    print(result.find('a', {'class' : 'smallV110 snowplow-full-record'}).text)


The last apothecary: Eric Knott (1896-1993) and 20th-century pharmacy in Scotland


AN INTERVIEW WITH DUNCAN WATTS



Extensible software for whole of society modeling: framework and preliminary results


The Measure of All Things Finding Out That Something Doesn't Work Is the First Step Toward Learning What Does Work


Optimization and evaluation of a semi-continuous solar dryer for cereals (Rice, etc)


The HBR list


Complex systems: Network thinking


In vitro caries formation in primary tooth enamel - Role of argon laser irradiation and remineralizing solution treatment


Effect of the photoperiod on bullfrog (Rana catesbeiana Shaw, 1802) tadpoles development


Researching Sara Jeanette Duncan in the papers of A.P. Watt and Company



## Crawlers: scaling up

In many cases scraping is easy to parallelize, especially if you have several urls that can be scraped independently. If a search on a website returns many result pages, you can also parallelize: divide the work over several scrapers that each scrape a subset of the pages. You then need a way of tracking what has been scraped and what has not. Maybe some of you have advice on how to do this? I personally use a tracking table in my database that records the progress.

Again, robustness is important: build your scrapers or crawlers in such a way that it is absolutely fine if a scraper dies.
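
A tracking table can be sketched with sqlite3 as follows. The table name, the status values and the helper functions are assumptions made for illustration, not the authors' actual schema. Note that this version is only safe for a single scraper: several scrapers sharing one database need a transaction or row lock around the claim.

```python
import sqlite3

def init_tracking(conn, urls):
    """Create the (hypothetical) progress table and register the urls."""
    conn.execute("""CREATE TABLE IF NOT EXISTS progress (
                        url TEXT PRIMARY KEY,
                        status TEXT NOT NULL DEFAULT 'todo')""")
    conn.executemany("INSERT OR IGNORE INTO progress (url) VALUES (?)",
                     [(u,) for u in urls])
    conn.commit()

def claim_next(conn):
    """Mark one 'todo' url as 'busy' and return it (None if nothing is left)."""
    row = conn.execute("SELECT url FROM progress WHERE status = 'todo' "
                       "ORDER BY url LIMIT 1").fetchone()
    if row is None:
        return None
    conn.execute("UPDATE progress SET status = 'busy' WHERE url = ?", (row[0],))
    conn.commit()
    return row[0]

def mark_done(conn, url):
    conn.execute("UPDATE progress SET status = 'done' WHERE url = ?", (url,))
    conn.commit()

def release_busy(conn):
    """On restart: urls that were 'busy' when a scraper died go back to 'todo'."""
    conn.execute("UPDATE progress SET status = 'todo' WHERE status = 'busy'")
    conn.commit()

conn = sqlite3.connect(':memory:')
init_tracking(conn, ['http://example.com/page/%d' % i for i in (1, 2, 3)])

first = claim_next(conn)
mark_done(conn, first)        # page 1 was scraped successfully
second = claim_next(conn)     # pretend the scraper dies while busy with page 2
release_busy(conn)            # after a restart, page 2 is picked up again

remaining = [r[0] for r in conn.execute(
    "SELECT url FROM progress WHERE status = 'todo' ORDER BY url")]
print(first, second, remaining)
```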

### Subprocesses

There are several ways to parallelize scrapers, i.e. to set up crawlers. One way is to do this yourself, without an external service, by means of subprocess. Here is some simple code I wrote to do this. The spin_up_scraper function starts several scrapers and checks every 60 seconds whether each scraper is still active. If one has died, a new scraper is spun up.
main.py is the scraper code.

In [8]:
import time
import subprocess

nr_scrapers = 10
nr_hours_scraping = 10

def spin_up_scraper(nr_scrapers, nr_hours_scraping):
    scraper_processes = []
    for scraper_i in range(nr_scrapers):
        p = subprocess.Popen(['python', 'main.py'],
                             stdin=None, stdout=None, stderr=None, close_fds=True)
        scraper_processes.append(p)

        # Wait a few moments before starting the next scraper
        time.sleep(20)
        print("---------------Starting next scraper-------------------------------")


    # Check every minute if all scrapers are up, if one is down, start a new one
    for minutes in range(60*nr_hours_scraping):
        # Sleep 60 seconds till the next check
        time.sleep(60)
        # Iterate over a copy, because we modify the list inside the loop
        for scraper in list(scraper_processes):
            # poll() returns None while the process is still running,
            # and its exit code once it has finished
            exit_code = scraper.poll()
            if exit_code is not None:
                scraper_processes.remove(scraper)
                print('----One scraper down. Starting a new one ------------------------')
                p = subprocess.Popen(['python', 'main.py'],
                                     stdin=None, stdout=None, stderr=None, close_fds=True)
                scraper_processes.append(p)
                time.sleep(20)
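
The check for a dead scraper relies on what poll() returns, which is easy to get backwards. Here is a small self-contained demonstration; the child process is just a Python one-liner that sleeps, standing in for main.py.

```python
import subprocess
import sys

# Start a child Python process that sleeps for a short while and then exits
p = subprocess.Popen([sys.executable, '-c', 'import time; time.sleep(1)'])

status_while_running = p.poll()   # None: the process has not finished yet
p.wait()                          # block until the child exits
status_after_exit = p.poll()      # the exit code (0 for a clean exit)

print(status_while_running, status_after_exit)
```

So a scraper is down when poll() returns something other than None.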


## Dealing with Selenium

### Scrolling down
Some websites (e.g. Reddit or Facebook) use infinite scroll: new elements are only loaded when you scroll down.

In [12]:
import selenium.webdriver
driver = selenium.webdriver.Chrome()
driver.get('https://cssamsterdam.github.io/')

In [11]:
def scroll_down(SCROLL_PAUSE_TIME = 0.5):
    # Get scroll height
    last_height = driver.execute_script("return document.body.scrollHeight")

    while True:
        # Scroll down to bottom
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")

        # Wait to load page
        time.sleep(SCROLL_PAUSE_TIME)

        # Calculate new scroll height and compare with last scroll height
        new_height = driver.execute_script("return document.body.scrollHeight")
        if new_height == last_height: break
        last_height = new_height

In [14]:
scroll_down()

### Stale Element Exception
We get this exception when the element has been destroyed or has not been completely loaded. Possible solutions: refresh the website, or wait until the page has loaded.

In [None]:
import selenium.common.exceptions
import selenium.webdriver
import selenium.webdriver.common.by
import selenium.webdriver.common.desired_capabilities
import selenium.webdriver.support.ui
from selenium.webdriver.support import expected_conditions

#Define a function to wait for an element to load
def _wait_for_element(xpath, wait):
    try:
        polling_f = expected_conditions.presence_of_element_located((selenium.webdriver.common.by.By.XPATH, xpath))
        elem = wait.until(polling_f)
    except selenium.common.exceptions.TimeoutException:
        raise selenium.common.exceptions.TimeoutException(msg='XPath "{}" presence wait timeout.'.format(xpath))
    return elem

#Define a function to wait for an element to become clickable
def _wait_for_element_click(xpath, wait):
    try:
        polling_f = expected_conditions.element_to_be_clickable((selenium.webdriver.common.by.By.XPATH, xpath))
        elem = wait.until(polling_f)
    except selenium.common.exceptions.TimeoutException:
        raise selenium.common.exceptions.TimeoutException(msg='XPath "{}" clickable wait timeout.'.format(xpath))
    return elem

#define short and long timeouts
wait_timeouts=(30, 180)

#open the driver (change the executable path to geckodriver_mac or geckodriver.exe)
driver = selenium.webdriver.Firefox(executable_path="./geckodriver")

#define short and long waits (for the times you have to wait for the page to load)
short_wait = selenium.webdriver.support.ui.WebDriverWait(driver, wait_timeouts[0], poll_frequency=0.05)
long_wait = selenium.webdriver.support.ui.WebDriverWait(driver, wait_timeouts[1], poll_frequency=1)


#And this is how you get an element
element = _wait_for_element('HERE_GOES_THE_XPATH',short_wait)
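
Another common way to deal with a stale element is simply to try again after a short pause. The helper below is a generic sketch, not a Selenium API: `retry`, `FakeStaleError` and `flaky_click` are made up so that the example runs without a browser. With Selenium you would pass `selenium.common.exceptions.StaleElementReferenceException` as the exception and an action that looks the element up again before using it.

```python
import time

def retry(action, exceptions, attempts=3, delay=0.5):
    """Call action() until it succeeds, retrying on the given exception types."""
    for attempt in range(1, attempts + 1):
        try:
            return action()
        except exceptions:
            if attempt == attempts:
                raise  # give up: re-raise the last error
            time.sleep(delay)

# Stand-in for a Selenium call that fails twice before the page settles
class FakeStaleError(Exception):
    pass

calls = {'n': 0}

def flaky_click():
    calls['n'] += 1
    if calls['n'] < 3:
        raise FakeStaleError()
    return 'clicked'

result = retry(flaky_click, FakeStaleError, attempts=5, delay=0.01)
print(result, calls['n'])
```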

### Download pdfs
Often you want to download pdfs (MIME type "application/pdf") automatically (or some other file type, e.g. "application/x-gzip" for gzip archives), but the browser shows a pop-up asking where to save the file. The options below avoid that.

In [28]:
import selenium.webdriver

##Chrome option
options = selenium.webdriver.ChromeOptions()
profile = {"plugins.plugins_list":
           [{"enabled": False, "name": "Chrome PDF Viewer"}], # Disable Chrome's PDF Viewer
           "download.default_directory": "./download_directory/" ,
           "download.extensions_to_open": "applications/pdf"}
options.add_experimental_option("prefs", profile)
   
driver = selenium.webdriver.Chrome("./chromedriver",chrome_options=options)


# ##Firefox option
# profile = selenium.webdriver.FirefoxProfile()
# profile.set_preference("browser.download.folderList", 2)
# profile.set_preference("browser.download.manager.showWhenStarting", False)
# profile.set_preference("browser.download.dir", "./download_directory/")
# profile.set_preference("browser.helperApps.neverAsk.saveToDisk", "application/pdf")

# driver = selenium.webdriver.Firefox(firefox_profile=profile)

### Headless Chrome
Both Chrome and Firefox allow you to start headless (without opening a window):

    # Chrome
    import selenium.webdriver
    options = selenium.webdriver.ChromeOptions()
    options.add_argument("--headless")
    driver = selenium.webdriver.Chrome("./chromedriver",chrome_options=options)

    # Firefox
    from selenium.webdriver.firefox.options import Options
    options = Options()
    options.add_argument("--headless")
    driver = selenium.webdriver.Firefox(executable_path="./geckodriver",options=options)

This is how you enable downloads in headless Chrome:

In [29]:
import selenium.webdriver

def enable_download_in_headless_chrome(driver, download_dir):
    # add missing support for chrome "send_command"  to selenium webdriver
    driver.command_executor._commands["send_command"] = ("POST", '/session/$sessionId/chromium/send_command')

    params = {'cmd': 'Page.setDownloadBehavior', 'params': {'behavior': 'allow', 'downloadPath': download_dir}}
    command_result = driver.execute("send_command", params)

options = selenium.webdriver.ChromeOptions()
profile = {"plugins.plugins_list":
           [{"enabled": False, "name": "Chrome PDF Viewer"}], # Disable Chrome's PDF Viewer
           "download.default_directory": "./download_directory/" ,
           "download.extensions_to_open": "applications/pdf"}

options.add_experimental_option("prefs", profile)
options.add_argument("--headless")
   
driver = selenium.webdriver.Chrome("./chromedriver",chrome_options=options)
enable_download_in_headless_chrome(driver,"./download_directory/")
                                   

### Dealing with new windows

In [None]:
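from selenium.webdriver.common.by import By

#Click somewhere
#(find_element(By.XPATH, ...) replaces the deprecated find_element_by_xpath)
driver.find_element(By.XPATH, "xxxx").click()

#Switch to the new window
#(switch_to.window replaces the deprecated switch_to_window)
driver.switch_to.window(driver.window_handles[1])

#Do whatever
driver.find_element(By.XPATH, 'xxxxxx').click()

#Go back to the main window
driver.switch_to.window(driver.window_handles[0])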


## Robust scraping

Websites change their html all the time, so it is worthwhile to make your scraper robust. Here are a few tips on how to do this; you might know some other tricks too.

- Don't make your scraper language dependent. Your browser settings influence the text displayed on websites. If you extract elements by their text, your scraper is sensitive to your browser settings and to the language of the website. It's better not to extract elements by text.
- Save the raw html. As we learned last week in Damian Trilling's Database Management workshop, it is very good practice to save the entire html of the page instead of only the elements you are interested in. That way, if the website changes its html and your scraper breaks or no longer extracts the right elements, you can simply re-extract the information later from the raw html that you saved in your database.
- Use the drilldown method. Don't look for a class or attribute in the entire html, but first drill down to the specific part of the page. It is very bad practice to look for all elements with a very general html attribute (such as 'row'), which returns a list, and then select an index of that list. This is very sensitive to html changes!
- I always try to avoid xpath, for the same reason as above.
- Track your progress. Build your scraper in such a way that it is not a problem if it crashes. Scrapers will always crash, almost by default. For example, your VPN connection can shut down (if applicable), your internet connection might break, the site you're scraping might go down for a moment, etc. Track your progress somewhere so that you can always restart your scraper and have it continue from where it left off.
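
The "save the raw html" tip can be sketched as follows, using only the standard library. The table layout and the helper names are assumptions for illustration; in practice you would store the text of the page you downloaded and use your usual parser (e.g. BeautifulSoup) for the re-extraction.

```python
import sqlite3
import time
from html.parser import HTMLParser

def save_raw(conn, url, html):
    """Store the complete page, so it can be re-parsed later."""
    conn.execute("""CREATE TABLE IF NOT EXISTS raw_pages (
                        url TEXT, fetched_at REAL, html TEXT)""")
    conn.execute("INSERT INTO raw_pages VALUES (?, ?, ?)",
                 (url, time.time(), html))
    conn.commit()

class TitleParser(HTMLParser):
    """Collect the text inside the <title> tag."""
    def __init__(self):
        super().__init__()
        self.in_title = False
        self.title = ''

    def handle_starttag(self, tag, attrs):
        if tag == 'title':
            self.in_title = True

    def handle_endtag(self, tag):
        if tag == 'title':
            self.in_title = False

    def handle_data(self, data):
        if self.in_title:
            self.title += data

def extract_title(html):
    parser = TitleParser()
    parser.feed(html)
    return parser.title.strip()

conn = sqlite3.connect(':memory:')
save_raw(conn, 'http://example.com/',
         '<html><head><title> Example page </title></head><body></body></html>')

# Later (e.g. after fixing a broken extractor): re-extract from the stored html
titles = [extract_title(html)
          for (html,) in conn.execute("SELECT html FROM raw_pages")]
print(titles)
```

Because the extraction runs on the stored html, fixing the extractor does not require visiting the website again.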


## Proxies
- Proxies let you download a website through a different IP address.
**Don't use them for evil**

### Using public proxies

In [None]:
!pip install http-request-randomizer

In [23]:
from http_request_randomizer.requests.proxy.requestProxy import RequestProxy
import logging

# Collect the proxies and only log critical errors
req_proxy = RequestProxy()
req_proxy.set_logger_level(logging.CRITICAL)

# Request a website
r = req_proxy.generate_proxied_request("http://cssamsterdam.github.io")


<Response [200]>

### Using TOR
- Instructions to configure it: https://github.com/jgarciab/tor

In [None]:
#This won't work without configuring it
from tor_control import TorControl
import requests


tc = TorControl()
print(requests.get("https://api.ipify.org").text)
#> 163.172.162.106

tc.renew_tor()
print(requests.get("https://api.ipify.org").text)
#> 18.85.22.204

## Speed up requests 
- Use case: you want to collect info from many different websites.
- Problem: requests is blocking (it waits until the website responds)
- Solution: run many threads
  - But this is not straightforward
  - Best: grequests (asynchronous HTTP requests)

In [11]:
!pip install grequests



In [12]:
import grequests

urls = [
    'http://www.heroku.com','http://python-tablib.org', 'http://httpbin.org', 
    'http://python-requests.org', 'http://fakedomain/','http://kennethreitz.com'
]

rs = (grequests.get(u) for u in urls)

grequests.map(rs)

[<Response [200]>,
 <Response [200]>,
 <Response [200]>,
 <Response [200]>,
 None,
 <Response [200]>]
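
The "run many threads" option mentioned above can also be sketched with the standard library's concurrent.futures. To keep the example runnable without a network connection, `fake_fetch` is a made-up stand-in that just sleeps; in a real scraper you would call requests.get(url) there instead.

```python
import time
from concurrent.futures import ThreadPoolExecutor

def fake_fetch(url):
    """Stand-in for requests.get(url): blocks for a moment, then returns."""
    time.sleep(0.2)          # pretend we are waiting for the server
    return (url, 200)

urls = ['http://example.com/page/{}'.format(i) for i in range(5)]

start = time.time()
with ThreadPoolExecutor(max_workers=5) as pool:
    # map keeps the results in the same order as the input urls
    results = list(pool.map(fake_fetch, urls))
elapsed = time.time() - start

print(results[0], round(elapsed, 1))
```

Because the threads spend their time waiting, the five calls overlap and the total time should be close to that of a single call rather than five times as long.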