# Collecting Data From the Web

This lecture is partly borrowed (or scrupulously stolen) from Damian Trilling's "Doing Computational Social Science with Python: An Introduction", Chapter 8.

By now you should have a basic understanding of the various steps and techniques for content analysis. But where do you get your data? Many sources are available, and which one you use ultimately depends on your research questions. Nonetheless, the Web is becoming one of the biggest sources of information, both for historians of the recent past and for scholars in media studies.

In this lecture, we have a closer look at techniques that help you to collect information from the Web, either through Web scraping or working with APIs (Application Programming Interfaces).

# 1 Scraping Web Pages

Scraping, broadly, consists of the following steps [OR]:
- Retrieving HTML data from a domain name
- Parsing that data for target information
- Storing the target information
- Optionally, moving to another page to repeat the process
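
The four steps above can be sketched as a small loop. This is a minimal illustration, not a full scraper: the `fetch` argument, the `FAKE_WEB` dictionary, and the `rel="next"` link convention are assumptions made so that the example runs without network access. In practice `fetch` would download the page, and the parser would look for whatever the target site actually uses.

```python
from html.parser import HTMLParser


class TitleAndNextParser(HTMLParser):
    """Collects the text of the first <h1> and the href of an <a rel="next"> link."""

    def __init__(self):
        super().__init__()
        self.title = None
        self.next_url = None
        self._in_h1 = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "h1" and self.title is None:
            self._in_h1 = True
        elif tag == "a" and attrs.get("rel") == "next":
            self.next_url = attrs.get("href")

    def handle_data(self, data):
        if self._in_h1:
            self.title = data.strip()

    def handle_endtag(self, tag):
        if tag == "h1":
            self._in_h1 = False


def scrape(start_url, fetch, max_pages=10):
    stored = []
    url = start_url
    while url and len(stored) < max_pages:
        html = fetch(url)                      # step 1: retrieve the HTML
        parser = TitleAndNextParser()
        parser.feed(html)                      # step 2: parse it for the target information
        stored.append((url, parser.title))     # step 3: store the result
        url = parser.next_url                  # step 4: move on to the next page, if any
    return stored


# Hypothetical two-page "website" held in memory (an assumption for this sketch).
FAKE_WEB = {
    "page1": '<h1>First page</h1><a rel="next" href="page2">next</a>',
    "page2": "<h1>Second page</h1>",
}

results = scrape("page1", fetch=FAKE_WEB.get)
print(results)
```

The `max_pages` guard keeps the loop from running forever if two pages happen to link to each other.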

In many ways, scraping resembles surfing the web without a web browser (browsers are themselves a rather recent invention). You can perform most of the same operations in Python that you'd perform in your browser, such as navigating to another page via a link or saving material you find.

Scraping makes life easier especially when you collect information that is available in a specific structure. Imagine manually downloading more than a thousand books and selecting the useful information: completing this task would easily take more than a few days and require quite some tedious work. Luckily, with Python this can be done in just a few lines of code (and with less of your time, as the script handles every step).

## 1.1 Downloading Webpages

Downloading a webpage with Python takes only a couple of lines of code using the `urllib` module.

In [5]:
from urllib.request import urlopen
html = urlopen("https://www.whitehouse.gov/issues/immigration/")
print(html.read()[:1000])

b'<!DOCTYPE html>\n<html lang="en-US" prefix="og: http://ogp.me/ns#" class="no-js">\n<head>\n\t<script>\n\t\tdocument.documentElement.className = document.documentElement.className.replace(/\\bno-js\\b/, \'js\');\n\t\tdocument.createElement(\'picture\');\n\t</script>\n\t<meta charset="UTF-8">\n\t<meta http-equiv="x-ua-compatible" content="ie=edge"><script type="text/javascript">window.NREUM||(NREUM={}),__nr_require=function(e,t,n){function r(n){if(!t[n]){var o=t[n]={exports:{}};e[n][0].call(o.exports,function(t){var o=e[n][1][t];return r(o||t)},o,o.exports)}return t[n].exports}if("function"==typeof __nr_require)return __nr_require;for(var o=0;o<n.length;o++)r(n[o]);return r}({1:[function(e,t,n){function r(){}function o(e,t,n){return function(){return i(e,[f.now()].concat(u(arguments)),t?null:this,n),t?void 0:this}}var i=e("handle"),a=e(2),u=e(3),c=e("ee").get("tracer"),f=e("loader"),s=NREUM;"undefined"==typeof window.newrelic&&(newrelic=s);var p=["setPageViewName","setCustomAttribute",

[OR] So how does Python do this? Thanks to the plain-English nature of Python, the line

In [None]:
from urllib.request import urlopen

[OR] means what it looks like it means: it looks at the Python module `request` (found within the `urllib` library) and imports only the function `urlopen`.

[OR] `urlopen` is used to open a remote object across a network and read it. Because it is fairly generic (it can read HTML files, image files, or any other file stream with ease), we will be using it quite frequently throughout this lecture.
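
Note that the output above starts with `b'`: `read()` returns raw bytes, not text. A minimal sketch of decoding them into a string follows. To keep it runnable offline, a `data:` URL (which `urlopen` also accepts) stands in for a real web address, and the sample markup is made up for illustration.

```python
from urllib.parse import quote
from urllib.request import urlopen

# Made-up sample page; a data: URL lets urlopen "download" it without a network.
sample_html = "<html><body><h1>Café menu</h1></body></html>"
url = "data:text/html;charset=utf-8," + quote(sample_html)

with urlopen(url) as response:
    raw = response.read()                                         # bytes
    # Use the charset announced in the headers, falling back to UTF-8
    charset = response.headers.get_content_charset() or "utf-8"

text = raw.decode(charset)                                        # str
print(type(raw).__name__, type(text).__name__)
print(text)
```

Decoding with the charset the server announces, rather than always assuming UTF-8, avoids garbled accented characters on older pages.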

Reading webpages with Python is also fairly easy with the BeautifulSoup (BS) module.
From [OR]:

"The BeautifulSoup library was named after a Lewis Carroll poem of the same name in Alice’s Adventures in Wonderland. In the story, this poem is sung by a character called the Mock Turtle (itself a pun on the popular Victorian dish Mock Turtle Soup made not of turtle but of cow). Like its Wonderland namesake, BeautifulSoup tries to make sense of the nonsensical; it helps format and organize the messy web by fixing bad HTML and presenting us with easily-traversible Python objects representing XML structures."

Let's have a closer look at how this works using the official webpage of the White House. To obtain the HTML version of this page and turn it into a BeautifulSoup object, run:

In [10]:
from urllib.request import urlopen
from bs4 import BeautifulSoup as bs

html = urlopen("https://www.whitehouse.gov/issues/immigration/")

page = bs(html.read(),"lxml")

<h1 class="page-header__title">Immigration</h1>


[OR] As in the example before, we import the `urlopen` function and call `html.read()` in order to get the HTML content of the page. This HTML content is then transformed into a BeautifulSoup object named `page`.

To print the main header element, we simply print `page.h1`.

In [None]:
print(page.h1)

## 1.2 What is HTML, and why is it important?

HTML is the acronym for HyperText Markup Language. You can think of it as an additional layer to the text that instructs your browser how to interpret and represent the text--for this reason it belongs to the family of markup languages (another famous member is XML, the eXtensible Markup Language).

[From Wikipedia](https://en.wikipedia.org/wiki/HTML):

HTML describes the structure of a web page **semantically** and originally included cues for the appearance of the document.

"**HTML elements** are the building blocks of HTML pages. With HTML constructs, images and other objects, such as interactive forms, may be embedded into the rendered page. It provides a means to create **structured documents** by denoting **structural semantics** for text such as *headings, paragraphs, lists, links, quotes and other items*. 

HTML elements are delineated by **tags**, written using angle brackets. Tags such as `<img />` and `<input />` introduce content into the page directly. Others such as `<p>...</p>` surround and provide information about document text and may include other tags as sub-elements. Browsers do not display the HTML tags, but use them to interpret the content of the page."

HTML markup consists of several key components, including those called **tags** (and their **attributes**), character-based data types, character references and entity references. HTML tags most commonly come in **pairs** like `<h1>` and `</h1>`, although some represent empty elements and so are unpaired, for example `<img>`. The first tag in such a pair is the **start** tag, and the second is the **end** tag (they are also called opening tags and closing tags).

Another important component is the HTML document type declaration, which triggers standards mode rendering.

The following is an example of the classic "Hello, World!" program, a common test employed for comparing programming languages, scripting languages and markup languages. This example is made using 9 source lines of code:
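
A nine-line document of this kind can be held in a Python string and inspected with the standard library's `html.parser` (used here instead of BeautifulSoup so the sketch has no dependencies). The exact markup is an assumption, reconstructed from the description in this section.

```python
from html.parser import HTMLParser

# Assumed nine-line "Hello, World!" page (reconstructed for illustration).
HELLO_HTML = """<!DOCTYPE html>
<html>
  <head>
    <title>This is a title</title>
  </head>
  <body>
    <p>Hello world!</p>
  </body>
</html>"""


class EventLogger(HTMLParser):
    """Records every start tag, end tag, and non-empty piece of text."""

    def __init__(self):
        super().__init__()
        self.events = []

    def handle_starttag(self, tag, attrs):
        self.events.append(("start", tag))

    def handle_endtag(self, tag):
        self.events.append(("end", tag))

    def handle_data(self, data):
        if data.strip():
            self.events.append(("text", data.strip()))


logger = EventLogger()
logger.feed(HELLO_HTML)
logger.close()
for kind, value in logger.events:
    print(f"{kind:5} {value}")
```

Every start tag is matched by an end tag, and the text sits between them, which is exactly the nesting a scraper relies on.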

The text between `<html>` and `</html>` describes the web page, and the text between `<body>` and `</body>` is the visible page content. The markup text `<title>This is a title</title>` defines the browser page title.

### 1.2.1 Common HTML Elements 
See [this page](http://www.thuto.org/ubh/web/html/tags1.htm) for a more comprehensive overview.

| Tag | Description |
|-------|---------|
| `<a>` | anchor (link) |
| `<b>` | show content in bold type |
| `<blockquote>` | content shown as indented block |
| `<div>` | generic container for block-level elements |
| `<h1>` | heading level 1 |
| `<h2>` | heading level 2, etc. |
| `<img>` | image; must have a `src` and an `alt` attribute |
| | `src` locates the image file (`src="cat.jpg"`) |
| | `alt` gives a brief description (`alt="Picture of my Cat"`) |
| `<table>` | encloses a table |
| `<td>` | table data cell |
| `<th>` | table header cell |
| `<tr>` | table row |
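
Attributes such as `src`, `alt`, or `href` are often what a scraper is after. As a small illustration (again with the standard library's `html.parser` rather than BeautifulSoup, and with a made-up snippet), the attributes of each start tag arrive as name/value pairs:

```python
from html.parser import HTMLParser

# Made-up snippet using elements from the table above.
SNIPPET = (
    '<div><h1>My cat</h1>'
    '<img src="cat.jpg" alt="Picture of my Cat">'
    '<a href="https://example.org/cats">More cats</a></div>'
)


class AttributeCollector(HTMLParser):
    """Stores the attributes of every tag that has at least one."""

    def __init__(self):
        super().__init__()
        self.found = []

    def handle_starttag(self, tag, attrs):
        if attrs:
            self.found.append((tag, dict(attrs)))


collector = AttributeCollector()
collector.feed(SNIPPET)
for tag, attributes in collector.found:
    print(tag, attributes)
```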

## 1.3 Scraping Data: Some Examples

Even though HTML comprises only a handful of common elements, the makeup of pages often differs significantly, so a scraper needs to be adjusted to the specific pages it targets. Nonetheless, you'll likely end up reconfiguring existing code.
Below you'll find a few examples and guidelines for collecting information from the web.
- Gathering Links (traversing the page tree)
- Reading and writing tables
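
Reading a table and writing it out again can be sketched with the standard library alone. The table below is a small hand-written sample, `TableReader` is an illustrative helper (not part of any library), and the sketch assumes every cell sits inside a `<tr>`. An in-memory buffer stands in for an open CSV file.

```python
import csv
import io
from html.parser import HTMLParser

# Small hand-written table for illustration.
TABLE_HTML = """
<table>
  <tr><th>Title</th><th>Year</th></tr>
  <tr><td>Post Office</td><td>1971</td></tr>
  <tr><td>Factotum</td><td>1975</td></tr>
</table>
"""


class TableReader(HTMLParser):
    """Collects the cell texts of a table as a list of rows."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.rows.append([])          # a new row starts
        elif tag in ("td", "th"):
            self.rows[-1].append("")      # a new (still empty) cell starts
            self._in_cell = True

    def handle_data(self, data):
        if self._in_cell:
            self.rows[-1][-1] += data

    def handle_endtag(self, tag):
        if tag in ("td", "th"):
            self._in_cell = False
            self.rows[-1][-1] = self.rows[-1][-1].strip()


reader = TableReader()
reader.feed(TABLE_HTML)

buffer = io.StringIO()                    # stands in for an open file
csv.writer(buffer, lineterminator="\n").writerows(reader.rows)
print(buffer.getvalue())
```

To write to disk instead, replace the buffer with `open("table.csv", "w", newline="")`.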

### 1.3.1 Immigration on whitehouse.gov

In [25]:
from bs4 import BeautifulSoup as bs
import requests

start_url = "https://www.whitehouse.gov/issues/immigration/"

start_content = requests.get(start_url).content
# Parse the downloaded bytes (the original passed an undefined name `content`)
start_soup = bs(start_content,'lxml')

Retrieving the URLs with a recursive function:

In [39]:
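def get_next(url,collector):
    """Collect the URL of this page and, recursively, of every page that follows it."""
    collector.append(url)
    soup = bs(requests.get(url).content,'lxml')
    # The link to the next page carries the class "pagination__next"
    next_a = soup.find('a',attrs={'class': 'pagination__next'})
    if next_a:
        # A next page exists: recurse with its URL
        url = next_a['href']
        get_next(url,collector)
    return collector

urls=get_next(start_url,[])
print(urls)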

['https://www.whitehouse.gov/issues/immigration/', 'https://www.whitehouse.gov/issues/immigration/page/2/', 'https://www.whitehouse.gov/issues/immigration/page/3/', 'https://www.whitehouse.gov/issues/immigration/page/4/', 'https://www.whitehouse.gov/issues/immigration/page/5/']
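
Recursion works fine for five pages, but Python limits the recursion depth (1000 calls by default), and a "next" link that points back to an earlier page would never terminate. A loop with a set of visited URLs avoids both issues. In this sketch `find_next` is a hypothetical callable that returns the next URL or `None`; with the code above it would download the page and look up the `pagination__next` link. The `PAGES` dictionary is made up so the example runs offline.

```python
def collect_pages(start_url, find_next, max_pages=1000):
    """Follow 'next' links iteratively until they run out, repeat, or hit max_pages."""
    urls = []
    seen = set()
    url = start_url
    while url and url not in seen and len(urls) < max_pages:
        urls.append(url)
        seen.add(url)
        url = find_next(url)      # None when there is no further page
    return urls


# Made-up pagination: each key links to its value; page 3 loops back to page 1.
PAGES = {"/page/1/": "/page/2/", "/page/2/": "/page/3/", "/page/3/": "/page/1/"}

print(collect_pages("/page/1/", PAGES.get))
```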


In [46]:
def get_briefings(main_page_url):
    briefing_urls = []
    soup = bs(requests.get(main_page_url).content,'lxml')

    briefings = soup.find_all('h2',attrs={'class':'briefing-statement__title'})
    for brief in briefings:
        briefing_urls.append(brief.find('a')['href'])
    print(briefing_urls)
    return briefing_urls
        
briefing_url = get_briefings('https://www.whitehouse.gov/issues/immigration')  
print(briefing_url)

['https://www.whitehouse.gov/briefings-statements/president-donald-j-trumps-weekly-address-121617/', 'https://www.whitehouse.gov/briefings-statements/president-donald-j-trumps-weekly-address-26/', 'https://www.whitehouse.gov/briefings-statements/statement-press-secretary-kate-steinle-case/', 'https://www.whitehouse.gov/briefings-statements/immigration-reform-law-institutes-brian-lonergan-america-seen-enough-tragedies-result-open-boarders/', 'https://www.whitehouse.gov/briefings-statements/trump-administration-immigration-policy-priorities/', 'https://www.whitehouse.gov/briefings-statements/secure-border-deterring-swiftly-removing-illegal-entrants/', 'https://www.whitehouse.gov/briefings-statements/president-donald-j-trumps-letter-house-senate-leaders-immigration-principles-policies/', 'https://www.whitehouse.gov/briefings-statements/establish-merit-based-reforms-promote-assimilation-financial-success/']
['https://www.whitehouse.gov/briefings-statements/president-donald-j-trumps-weekly-

Putting it together:

In [52]:
start_url = 'https://www.whitehouse.gov/issues/immigration'

def get_briefings(soup):
    
    briefing_urls = []

    briefings = soup.find_all('h2',attrs={'class':'briefing-statement__title'})
    for brief in briefings:
        briefing_urls.append(brief.find('a')['href'])
    
    return briefing_urls
        
def get_urls(url,collector):
    print('At page %s'%url)
    
    soup = bs(requests.get(url).content,'lxml')
    next_a = soup.find('a',attrs={'class': 'pagination__next'})
    collector.extend(get_briefings(soup))
    
    if next_a:
        url = next_a['href']
        get_urls(url,collector)
    
    return collector
    
urls=get_urls(start_url,[])

At page https://www.whitehouse.gov/issues/immigration
At page https://www.whitehouse.gov/issues/immigration/page/2/
At page https://www.whitehouse.gov/issues/immigration/page/3/
At page https://www.whitehouse.gov/issues/immigration/page/4/
At page https://www.whitehouse.gov/issues/immigration/page/5/
['https://www.whitehouse.gov/briefings-statements/president-donald-j-trumps-weekly-address-121617/', 'https://www.whitehouse.gov/briefings-statements/president-donald-j-trumps-weekly-address-26/', 'https://www.whitehouse.gov/briefings-statements/statement-press-secretary-kate-steinle-case/', 'https://www.whitehouse.gov/briefings-statements/immigration-reform-law-institutes-brian-lonergan-america-seen-enough-tragedies-result-open-boarders/', 'https://www.whitehouse.gov/briefings-statements/trump-administration-immigration-policy-priorities/', 'https://www.whitehouse.gov/briefings-statements/secure-border-deterring-swiftly-removing-illegal-entrants/', 'https://www.whitehouse.gov/briefings-st

In [None]:
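def read_brief(url):
    # Assumption: the text of a briefing sits in its <p> elements; adapt the selector to the actual page layout
    soup = bs(requests.get(url).content,'lxml')
    text = '\n'.join(p.get_text().strip() for p in soup.find_all('p'))
    return text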

### 1.3.2 Bukowski's Poems

In [None]:
from bs4 import BeautifulSoup as bs
import requests

base_url = "https://www.poemhunter.com/charles-bukowski/poems"

# Download the overview page (the original passed an undefined name `url`)
content = requests.get(base_url).content

In [None]:
soup = bs(content,'lxml')
tables=soup.find_all('table')
len(tables)

In [None]:
for table in tables:
    print(table.get('class','NaN'))

In [None]:
poems=soup.find('table',{'class':'poems'})

In [None]:
print(len(poems))

In [None]:
links = poems.find_all('a')

In [None]:
first_link = links[0]
url = first_link['href']
print(url)

In [None]:
import urllib.parse
poem_url = urllib.parse.urljoin(base_url,url)
print(poem_url)
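
`urljoin` resolves a link the way a browser would, relative to the page it was found on. The hrefs below are hypothetical examples, not taken from the actual page:

```python
from urllib.parse import urljoin

base_url = "https://www.poemhunter.com/charles-bukowski/poems"

# Hypothetical hrefs of the three kinds a scraper typically meets.
root_relative = urljoin(base_url, "/poem/some-poem/")              # starts at the site root
page_relative = urljoin(base_url, "page-2/")                       # replaces the last path segment
absolute = urljoin(base_url, "https://example.org/other")          # already complete, left unchanged

print(root_relative)
print(page_relative)
print(absolute)
```

Because `urljoin` leaves absolute URLs untouched, it is safe to apply to every `href` without first checking which kind it is.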

In [None]:
poem = bs(requests.get(poem_url).content,'lxml')
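# The poem text sits in the first <p> inside the div with class "KonaBody"
poem_div = poem.find('div',{'class':'KonaBody'}).find('p')
# Strip the <p> tags and turn the <br/> line breaks into newlines
print(str(poem_div).replace('<p>','').replace('</p>','').replace('<br/>','\n').strip())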

# 2 APIs

## 2.1 Google Books

In [70]:
from urllib.request import urlopen
import json
from pprint import pprint
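# Query the Google Books API for volumes matching "shakespeare" ("antwoord" is Dutch for "answer")
antwoord=urlopen("https://www.googleapis.com/books/v1/volumes?q=shakespeare").read()
# The response is JSON sent as bytes: decode it, then parse it into Python dicts and lists
data=json.loads(antwoord.decode("utf-8"))
pprint(data)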

{'items': [{'accessInfo': {'accessViewStatus': 'FULL_PUBLIC_DOMAIN',
                           'country': 'NL',
                           'embeddable': True,
                           'epub': {'downloadLink': 'http://books.google.nl/books/download/The_Works_of_Shakespear.epub?id=wsPe-P8lb8AC&hl=&output=epub&source=gbs_api',
                                    'isAvailable': False},
                           'pdf': {'downloadLink': 'http://books.google.nl/books/download/The_Works_of_Shakespear.pdf?id=wsPe-P8lb8AC&hl=&output=pdf&sig=ACfU3U07OWCZj5Z2mMdO5J4MsPgtzmmiJg&source=gbs_api',
                                   'isAvailable': True},
                           'publicDomain': True,
                           'quoteSharingAllowed': False,
                           'textToSpeechPermission': 'ALLOWED',
                           'viewability': 'ALL_PAGES',
                           'webReaderLink': 'http://play.google.com/books/reader?id=wsPe-P8lb8AC&hl=&printsec=frontcover&sour

                                          'in Shakespearean criticism over a '
                                          'period of about three decades. Many '
                                          'of them were written for specific '
                                          'occasions or specific reasons '
                                          'having to do with teaching or with '
                                          'panel discussions before diverse '
                                          'audiences, which she entered into '
                                          'along with others. In the process '
                                          'she contributed some of the best '
                                          'work on Shakespeare that was then '
                                          'extant, as this collection '
                                          'demonstrates. Searching for a '
                                          'principle of organizati
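
The response is a nested structure of dictionaries and lists, so the fields of interest can be pulled out with ordinary indexing. A minimal sketch, assuming the `items` / `volumeInfo` / `title` / `authors` layout of the Google Books response and its `q` and `maxResults` query parameters. `SAMPLE` is a hand-made stand-in so the code runs offline, and `build_query_url` and `summarize` are illustrative helper names.

```python
import json
from urllib.parse import urlencode

API_URL = "https://www.googleapis.com/books/v1/volumes"


def build_query_url(query, max_results=10):
    """Encode the search parameters into a request URL."""
    return API_URL + "?" + urlencode({"q": query, "maxResults": max_results})


def summarize(data):
    """Return (title, authors) pairs; .get() guards against missing fields."""
    books = []
    for item in data.get("items", []):
        info = item.get("volumeInfo", {})
        books.append((info.get("title"), info.get("authors", [])))
    return books


# Hand-made stand-in for the decoded API response (not real output).
SAMPLE = json.loads("""
{"items": [
  {"volumeInfo": {"title": "The Works of Shakespear", "authors": ["William Shakespeare"]}},
  {"volumeInfo": {"title": "Untitled volume"}}
]}
""")

print(build_query_url("shakespeare"))
print(summarize(SAMPLE))
```

Using `urlencode` rather than pasting the query into the URL by hand takes care of spaces and special characters in search terms.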

## 2.2 Chronicling America