The internet is filled with content, most of it not intended for digital humanities research. This notebook illustrates how such data can be fetched automatically by opening many web pages from code.

It uses selenium and phantomjs. The former is a Python library for navigating the web programmatically; the latter is a browser that does not open a window, but can still load webpages.

Let's *steal* some code from others and get started.

In [None]:
from selenium import webdriver
from selenium.webdriver.common.desired_capabilities import DesiredCapabilities
from jupyter_progressbar import ProgressBar
from IPython.display import Image, display

import random

The first thing we do is create a browser object. It can navigate to URLs using the get-method and extract certain content from the loaded page.

Content extraction is done using CSS selectors, which query the HTML. For example:

 * **table a** finds all links (a) that are embedded in a table.
 * **table td > a** does the same, but only for links that are a direct child of a table cell.
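
The difference between the two selectors can be tried out without a browser. The sketch below uses only Python's built-in html.parser on a made-up HTML snippet. The snippet and the helper class LinkFinder are invented for this illustration, and this is not how selenium evaluates selectors; it only records, for every link, whether it sits somewhere inside a table and whether its direct parent is a table cell.

```python
from html.parser import HTMLParser

# Made-up HTML: the first link is a direct child of a table cell,
# the second one is wrapped in an extra <b> element,
# the third one is not in a table at all.
HTML = """
<table>
  <tr>
    <td><a href="poem1.html">Poem 1</a></td>
    <td><b><a href="poem2.html">Poem 2</a></b></td>
  </tr>
</table>
<a href="about.html">About</a>
"""


class LinkFinder(HTMLParser):
    """Hypothetical helper that mimics 'table a' and 'table td > a'."""

    def __init__(self):
        super().__init__()
        self.open_tags = []     # tags that are currently open
        self.table_a = []       # matches for: table a
        self.table_td_a = []    # matches for: table td > a

    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            href = dict(attrs).get('href')
            if 'table' in self.open_tags:
                # The link is somewhere inside a table.
                self.table_a.append(href)
                if self.open_tags[-1] == 'td':
                    # The direct parent of the link is a table cell.
                    self.table_td_a.append(href)
        self.open_tags.append(tag)

    def handle_endtag(self, tag):
        # The snippet above is well-formed, so every end tag closes
        # the most recently opened tag.
        self.open_tags.pop()


finder = LinkFinder()
finder.feed(HTML)

print("table a      :", finder.table_a)
print("table td > a :", finder.table_td_a)
```

The link wrapped in the extra element is found by the first selector only, because its direct parent is not the table cell.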

You'll see this in action a bit further on.

Run the cell.

In [None]:
url = 'http://www.lovepoemsandquotes.com/LovePoems.html'

browser = webdriver.PhantomJS()
browser.get(url)

browser.save_screenshot('screenshot.png')
display(Image(filename='screenshot.png'))

browser.close()
browser.quit()

You took a screenshot of a webpage, nice.

The webpage is messy, but contains love poems, which we're going to download. The code below fetches all the links.

Since the webpage has a lot of irrelevant links, only the links which start with **http://www.lovepoemsandquotes.com/LovePoem** are appended to the list. This syntax, using **if**, was not discussed. Basically, the append is only executed if the **startswith**-condition is met.
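
As a small sketch of how **if** and **startswith** work together, the cell below checks a handful of made-up URLs (the page names are invented for this example and may not exist on the real site) and counts the ones that begin with the prefix.

```python
# Made-up example URLs; only the first and the last look like poem pages.
prefix = 'http://www.lovepoemsandquotes.com/LovePoem'
example_urls = [
    'http://www.lovepoemsandquotes.com/LovePoem01.html',
    'http://www.lovepoemsandquotes.com/Contact.html',
    'http://www.example.com/',
    'http://www.lovepoemsandquotes.com/LovePoem02.html',
]

matches = 0
for example_url in example_urls:
    if example_url.startswith(prefix):
        # These two lines only run for URLs that begin with the prefix.
        print("Keeping", example_url)
        matches = matches + 1

print(matches, "of", len(example_urls), "URLs start with the prefix")
```

The indented lines under the **if** are skipped whenever the condition is not met.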

The code will fail. Fix the error by initializing links with an empty list, then run the cell.

In [None]:
url = 'http://www.lovepoemsandquotes.com/LovePoems.html'

browser = webdriver.PhantomJS()
browser.get(url)

# fix this code by initializing links here


for link in browser.find_elements_by_css_selector("table .vstxt a"):
    url = link.get_attribute('href')

    if url.startswith('http://www.lovepoemsandquotes.com/LovePoem'):
        links.append(url)

browser.close()
browser.quit()

Let's list some random links, say 5% of them. Notice that we use enumerate, which was briefly mentioned.

enumerate does not just iterate over the elements of links, it also attaches a number to each element, starting at $1$ in this case. Normally computers start counting at $0$, because they are weird.

random.random() returns a random number between 0 and 1, which will be smaller than $0.05$ in about 5% of the cases.
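
Both ideas can be tried out on their own first. The sketch below uses a made-up list of colors to show the numbering that enumerate adds, and then counts how often random.random() drops below $0.05$ over many tries. The fixed seed and the number of tries are arbitrary choices for this illustration.

```python
import random

# enumerate pairs every element with a counter; start=1 makes it begin at 1.
colors = ['red', 'green', 'blue']
numbered = list(enumerate(colors, start=1))
print(numbered)

# Count how often random.random() falls below 0.05 in many tries.
random.seed(0)  # fixed seed, so the run is repeatable
tries = 100000
hits = 0
for _ in range(tries):
    if random.random() < 0.05:
        hits = hits + 1

fraction = hits / tries
print("Fraction below 0.05:", fraction)
```

The printed fraction should land close to $0.05$, which is why the loop over the links prints roughly one link in twenty.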

In [None]:
for i, link in enumerate(links, start=1):
    if random.random() < 0.05:
        print("Link number", i, "is", link)

Now fill in two lines of code to show a screenshot of the sixth link. Name the file link6.png.

Hint: You can copy-paste from above

In [None]:
browser = webdriver.PhantomJS()
browser.get(links[5])

# Show a screenshot of the opened page, 2 lines of code
# Hint: if you name the file something other than screenshot.png, the previous screenshot won't be overwritten.




browser.close()
browser.quit()

Now let's scrape all the poems. This takes about 15 minutes.

Note the undiscussed **try-except** syntax (known as try-catch in some other languages). It tells Python to print any error that occurs and then carry on instead of stopping. This is handy when you scrape the web, because webpages might be inconsistent or links may be broken. This way, you still get the rest of the poems.
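
Here is a minimal sketch of the same pattern, using Python's try and except keywords without any scraping. The list of texts is made up for this example; two of the entries cannot be converted to a number, so they are reported and skipped while the rest is kept.

```python
# Made-up inputs: two of them cannot be converted to a number.
texts = ['12', 'seven', '30', '', '5']
numbers = []

for text in texts:
    try:
        number = int(text)
    except Exception as e:
        # Report the problem and move on to the next item.
        print("Exception occurred,", e)
        continue

    # This line is only reached when no error occurred.
    numbers.append(number)

print(numbers)
```

Using continue inside the except block makes sure that nothing is stored for an item that failed.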

In [None]:
browser = webdriver.PhantomJS()

with open('poems.txt', 'w') as poem_file:
    for link in ProgressBar(links):
        try:
            browser.get(link)
            poem_text = browser.find_elements_by_css_selector('.stxt')[1].text
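        except Exception as e:
            # Report the error and skip this link, so that no missing
            # or stale poem text is written to the file.
            print("Exception occurred,", e)
            continue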
        
        poem_file.write(poem_text)
        poem_file.write('\n')

browser.close()
browser.quit()

**Go to the first screen, number.digitalhimanities.risacademy.nl, and open the poems.txt file.**