# GIAN 7: Scraping

The internet provides a great source of data that can be used for research in language processing. 

We are confronted with two problems.

1. How can we automatically download content from the internet?
2. How can we parse downloaded webpages to extract exactly what we need?

## 1. Automatically downloading content

Most popular websites prevent downloading by bots. Repeated access to a website is usually blocked quickly. The key to downloading content for scientific research from such websites is to do it in moderation, using your own web browser.

One of the ways in which you can automate your browser is [Selenium](https://www.seleniumhq.org/). In this tutorial, I will assume that you use Selenium in combination with the [Firefox](https://www.mozilla.org/en-US/firefox/download/) web browser.

### Installing Selenium

If you are using the Anaconda Navigator, you can install the Selenium package for Python from there. Alternatively, you can install it by running the command in the cell below.

In [None]:
!pip3 install selenium

You will also need to install a driver to communicate with your browser. The drivers for Firefox can be downloaded from https://github.com/mozilla/geckodriver/releases

Please note that you will need to install the driver in a directory on your *path*. Running the cell below should show you the directories on your path.

In [None]:
!echo $PATH

In [None]:
import time
import urllib.request
import tarfile
import zipfile

The commands in the following cell:
+ download the driver for Firefox that was current on December 13, 2018 (v0.23.0) to the current directory (the directory this notebook is located in);
+ unpack the driver if necessary.

In [None]:

## uncomment the following lines if you use macOS
# driver_url = "https://github.com/mozilla/geckodriver/releases/download/v0.23.0/geckodriver-v0.23.0-macos.tar.gz"
# driver_filename = driver_url.split("/")[-1]
# urllib.request.urlretrieve(driver_url, driver_filename)
# archive = tarfile.open(driver_filename)
# archive.extractall()
# archive.close()

## uncomment the following lines if you use Windows
# driver_url = "https://github.com/mozilla/geckodriver/releases/download/v0.23.0/geckodriver-v0.23.0-win64.zip"
# driver_filename = driver_url.split("/")[-1]
# urllib.request.urlretrieve(driver_url, driver_filename)
# archive = zipfile.ZipFile(driver_filename)
# archive.extractall()
# archive.close()

## uncomment the following lines if you use Linux
# driver_url = "https://github.com/mozilla/geckodriver/releases/download/v0.23.0/geckodriver-v0.23.0-linux64.tar.gz"
# driver_filename = driver_url.split("/")[-1]
# urllib.request.urlretrieve(driver_url, driver_filename)
# archive = tarfile.open(driver_filename)
# archive.extractall()
# archive.close()
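
Instead of uncommenting one of the three blocks by hand, the operating system can be detected at run time. The following is a minimal sketch and not part of the original notebook: the names `BASE`, `ASSETS` and `driver_url_for` are hypothetical, and the archive names are simply the three v0.23.0 files listed above (the Windows entry assumes a 64-bit system).

```python
import platform

# Release used in this tutorial (taken from the URLs in the cell above).
BASE = "https://github.com/mozilla/geckodriver/releases/download/v0.23.0/"

# Map the value returned by platform.system() to the archive listed above.
ASSETS = {
    "Darwin": "geckodriver-v0.23.0-macos.tar.gz",
    "Windows": "geckodriver-v0.23.0-win64.zip",
    "Linux": "geckodriver-v0.23.0-linux64.tar.gz",
}

def driver_url_for(system=None):
    """Return the geckodriver download URL for the given operating system."""
    system = system or platform.system()
    if system not in ASSETS:
        raise ValueError("no driver archive known for {!r}".format(system))
    return BASE + ASSETS[system]

print(driver_url_for("Linux"))
```

Calling `driver_url_for()` without an argument uses the system the notebook is running on; the returned URL can then be passed to `urllib.request.urlretrieve` as in the cell above.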

Now, you will need to manually copy the driver you downloaded to a location in your path.
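
To check whether the copy worked, you can ask Python where it finds the driver. This is a small sketch using only the standard library; the helper names `path_directories` and `find_driver` are made up for this example. Unlike `!echo $PATH`, it should also work in a Windows command prompt.

```python
import os
import shutil

def path_directories():
    """Return the directories on the PATH as a list."""
    return [d for d in os.environ.get("PATH", "").split(os.pathsep) if d]

def find_driver(name="geckodriver"):
    """Return the full path of the driver if it is on the PATH, else None."""
    return shutil.which(name)

location = find_driver()
if location is None:
    print("geckodriver not found; copy it into one of:", path_directories())
else:
    print("geckodriver found at", location)
```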

### Running Selenium

We are now ready to run Selenium. We will use the [documentation for the Python Selenium API](https://selenium-python.readthedocs.io/) to learn how to use Selenium.

In [None]:
from selenium import webdriver

If all went well, the following code should launch a Firefox instance, in which we will then open the front page of Language Log.

In [None]:
browser = webdriver.Firefox()

We will try to scrape the website Language Log, which can be found at [http://languagelog.ldc.upenn.edu](http://languagelog.ldc.upenn.edu). However, we don't want to overload the site, so we have placed a local copy at [http://172.20.160.50/user/sikarwar/gyan/languagelog.ldc.upenn.edu](http://172.20.160.50/user/sikarwar/gyan/languagelog.ldc.upenn.edu).

In [None]:
base_url="http://172.20.160.50/user/sikarwar/gyan/languagelog.ldc.upenn.edu/index.html"

In [None]:
browser.get(base_url)

We can now write the page we scraped to a file

In [None]:
with open("language_log.html", "w", encoding="utf-8") as f_out:
    f_out.write(browser.page_source)

We can also use Selenium to control the browser.

Let's try to expand the link showing the archives of the website

In [None]:
# find_element_by_link_text was removed in Selenium 4.3; "link text" is the value of By.LINK_TEXT
browser.find_element("link text", "[+/–]").click()

And go into the archives for December 2018

In [None]:
browser.find_element("link text", "December 2018").click()

The following command will stop the driver and quit the browser instance

In [None]:
browser.quit()

Now that we know how to load the website and control the browser, let's try something more complicated. We will download a number of pages from the Language Log.

We will first make a directory to store our downloaded files and create a text file that logs what we have downloaded

In [None]:
import os
from hashlib import md5
import random
import time

In [None]:
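# create the download directory if it does not exist yet
os.makedirs("loot", exist_ok=True)
# create a log file (opened in append mode, so earlier entries are kept)
download_log=open("loot/log_pages.txt", "a", encoding="utf-8")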

We will also write a simple function that takes the browser, creates a unique filename for the current page source, stores the source in that file, and records the URL and the filename in the log file.

In [None]:
def store_page(browser, logfile):
    # get the required information from the page
    current_url=browser.current_url
    current_source=browser.page_source

    # create a unique filename from a hash of the page source
    source_hash=md5(current_source.encode("utf-8")).hexdigest()
    current_filename="loot/{:s}.html".format(source_hash)

    # store the page source in the file
    with open(current_filename, "w", encoding="utf-8") as f_out:
        f_out.write(current_source)

    # store the url and the filename in the download log
    logfile.write("{:s}\t{:s}\n".format(current_url, current_filename))
    logfile.flush()
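
The filename scheme can be tried out without a browser. The sketch below isolates the hashing step in a hypothetical helper `filename_for`; the two page sources are made up for illustration.

```python
from hashlib import md5

def filename_for(source, directory="loot"):
    """Build a file name from the MD5 hash of a page source."""
    source_hash = md5(source.encode("utf-8")).hexdigest()
    return "{:s}/{:s}.html".format(directory, source_hash)

# Two made-up page sources, used only for illustration.
page_a = "<html><body>first page</body></html>"
page_b = "<html><body>second page</body></html>"

print(filename_for(page_a) == filename_for(page_a))  # same content, same name
print(filename_for(page_a) == filename_for(page_b))  # different content, different name
```

One consequence of this design is that a page downloaded twice with an identical source ends up in a single file, although the log receives a line for each download.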

Now, let's start our browser again

In [None]:
browser = webdriver.Firefox()
browser.get(base_url)

And go to the archives for November 2018

In [None]:
browser.find_element("link text", "[+/–]").click()
browser.find_element("link text", "November 2018").click()

In [None]:
# For safety, we always start by testing on a few pages
for i in range(5):
    store_page(browser, download_log)
    try:
        browser.find_element("link text", "Next Page »").click()
    except Exception:  # most likely there is no "Next Page" link left
        break
    time.sleep(random.randint(5,10)) # wait for 5 to 10 seconds

In [None]:
browser.quit()
download_log.close()

By looking at the structure of the URLs, we could think that there is a more efficient way of downloading all the posts.

Every URL for a post is of the form `http://languagelog.ldc.upenn.edu/nll/?p=X`, where X is the number of the post.

We could generate all the URLs and download all the corresponding posts!

However, the posts are not numbered sequentially and a browser generating many 404s could quickly be identified as a bot.

It's better to take things slowly.
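
The random wait used in the download loop can be wrapped in a small helper so that it is harder to forget. This is only a sketch: `polite_pause` is a hypothetical name, and the `sleep` parameter exists so the function can be tried out without actually waiting.

```python
import random
import time

def polite_pause(min_seconds=5.0, max_seconds=10.0, sleep=time.sleep):
    """Wait for a random time between min_seconds and max_seconds.

    Returns the delay that was chosen, so it can be logged.
    """
    delay = random.uniform(min_seconds, max_seconds)
    sleep(delay)
    return delay

# Demonstration with a stand-in for time.sleep, so nothing actually waits.
waited = polite_pause(5, 10, sleep=lambda seconds: None)
print("would have waited {:.1f} seconds".format(waited))
```

In a real download loop you would call `polite_pause()` with its defaults after every page.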

## 2. Parsing web content
When we download web pages, we invariably download them in HTML format. Once we know how to parse the HTML structure, it can actually be very useful for extracting exactly the information we want.

In this case, we want to make a list of the URLs of the *posts* on Language Log.

Let's look at our log file to see the files we have downloaded.

In [None]:
with open("loot/log_pages.txt", "r", encoding="utf-8") as logfile:
    filenames=[]
    for line in logfile:
        url, filename = line.strip().split("\t")
        filenames.append(filename)
print(len(filenames))
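
Because the log file is opened in append mode, running the download cells more than once adds the same pages again. A possible way to deal with this, sketched below under the assumption that every log line has the `url<TAB>filename` layout written by `store_page`, is to collect the URLs that are already logged. The helper name `load_logged_urls` and the example entries are made up.

```python
import os
import tempfile

def load_logged_urls(log_path):
    """Return the set of URLs that already appear in a download log."""
    urls = set()
    if not os.path.exists(log_path):
        return urls
    with open(log_path, "r", encoding="utf-8") as logfile:
        for line in logfile:
            line = line.strip()
            if not line:
                continue  # skip blank lines
            url, _filename = line.split("\t")
            urls.add(url)
    return urls

# Demonstration on a temporary log with made-up entries.
with tempfile.TemporaryDirectory() as tmp:
    demo_log = os.path.join(tmp, "log_pages.txt")
    with open(demo_log, "w", encoding="utf-8") as f_out:
        f_out.write("http://example.org/page1\tloot/aaa.html\n")
        f_out.write("http://example.org/page1\tloot/aaa.html\n")
        f_out.write("http://example.org/page2\tloot/bbb.html\n")
    print(sorted(load_logged_urls(demo_log)))
```

Before requesting a URL, you could then check whether it is already in the returned set and skip it if so.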

### Using BeautifulSoup

[BeautifulSoup](https://www.crummy.com/software/BeautifulSoup/bs4/doc/) is a library that parses a webpage's [Document Object Model](https://www.w3.org/TR/WD-DOM/introduction.html) and lets us navigate that structure.

In [None]:
# Install beautifulsoup using the following command or via Anaconda
!pip install beautifulsoup4

In [None]:
from bs4 import BeautifulSoup

In [None]:
with open(filenames[0], encoding="utf-8") as test_file:
    soup = BeautifulSoup(test_file, 'html.parser')

Inspecting the source of one of the pages shows that a link to a post has this form:

`<h2 class="posttitle" id="post-38242"><a href="http://languagelog.ldc.upenn.edu/nll/?p=38242" rel="bookmark" title="Permanent link to Smart should check the OED">Smart should check the OED</a></h2>`

Using BeautifulSoup, we can search for all `h2` elements with class `posttitle`

In [None]:
soup.find_all("h2", "posttitle")

And from each of those we can extract the link

In [None]:
for h2 in soup.find_all("h2", "posttitle"):
    print(h2.a.get("href"))
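
The same extraction can be tried on the heading quoted above, without any downloaded files. The sketch below parses that snippet from a string (stored in a variable called `demo_soup` so that it does not overwrite `soup`) and also shows the query written as a CSS selector with `select`, which is an alternative not used elsewhere in this notebook.

```python
from bs4 import BeautifulSoup

# The example heading quoted above, as a string.
snippet = (
    '<h2 class="posttitle" id="post-38242">'
    '<a href="http://languagelog.ldc.upenn.edu/nll/?p=38242" rel="bookmark" '
    'title="Permanent link to Smart should check the OED">'
    'Smart should check the OED</a></h2>'
)

demo_soup = BeautifulSoup(snippet, "html.parser")

# find_all with a tag name and a class, as in the cells above
links = [h2.a.get("href") for h2 in demo_soup.find_all("h2", "posttitle")]

# the same query written as a CSS selector
links_css = [a.get("href") for a in demo_soup.select("h2.posttitle a")]

print(links)
print(links_css)
```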

In [None]:
# Now let's do this for all the pages we downloaded

In [None]:
post_urls=[]
for filename in filenames:
    with open(filename, encoding="utf-8") as f_in:
        soup = BeautifulSoup(f_in, 'html.parser')
    for h2 in soup.find_all("h2", "posttitle"):
        post_url=h2.a.get("href")
        post_urls.append(post_url)

In [None]:
len(post_urls)

In [None]:
error_urls=["http://languagelog.ldc.upenn.edu/nll/?p=690", "http://languagelog.ldc.upenn.edu/nll/?p=689",
"http://languagelog.ldc.upenn.edu/nll/?p=686", "http://languagelog.ldc.upenn.edu/nll/?p=685",
"http://languagelog.ldc.upenn.edu/nll/?p=681", "http://languagelog.ldc.upenn.edu/nll/?p=684",
"http://languagelog.ldc.upenn.edu/nll/?p=683", "http://languagelog.ldc.upenn.edu/nll/?p=682",
"http://languagelog.ldc.upenn.edu/nll/?p=677", "http://languagelog.ldc.upenn.edu/nll/?p=675",
"http://languagelog.ldc.upenn.edu/nll/?p=674", "http://languagelog.ldc.upenn.edu/nll/?p=673",
"http://languagelog.ldc.upenn.edu/nll/?p=672", "http://languagelog.ldc.upenn.edu/nll/?p=670",
"http://languagelog.ldc.upenn.edu/nll/?p=671", "http://languagelog.ldc.upenn.edu/nll/?p=669",
"http://languagelog.ldc.upenn.edu/nll/?p=666", "http://languagelog.ldc.upenn.edu/nll/?p=668",
"http://languagelog.ldc.upenn.edu/nll/?p=664", "http://languagelog.ldc.upenn.edu/nll/?p=665",
"http://languagelog.ldc.upenn.edu/nll/?p=663", "http://languagelog.ldc.upenn.edu/nll/?p=662",
"http://languagelog.ldc.upenn.edu/nll/?p=661", "http://languagelog.ldc.upenn.edu/nll/?p=660",
"http://languagelog.ldc.upenn.edu/nll/?p=657", "http://languagelog.ldc.upenn.edu/nll/?p=659",
"http://languagelog.ldc.upenn.edu/nll/?p=658", "http://languagelog.ldc.upenn.edu/nll/?p=656",
"http://languagelog.ldc.upenn.edu/nll/?p=655", "http://languagelog.ldc.upenn.edu/nll/?p=654",
"http://languagelog.ldc.upenn.edu/nll/?p=653", "http://languagelog.ldc.upenn.edu/nll/?p=652",
"http://languagelog.ldc.upenn.edu/nll/?p=651", "http://languagelog.ldc.upenn.edu/nll/?p=649",
"http://languagelog.ldc.upenn.edu/nll/?p=650", "http://languagelog.ldc.upenn.edu/nll/?p=648",
"http://languagelog.ldc.upenn.edu/nll/?p=647", "http://languagelog.ldc.upenn.edu/nll/?p=645",
"http://languagelog.ldc.upenn.edu/nll/?p=646", "http://languagelog.ldc.upenn.edu/nll/?p=644",
"http://languagelog.ldc.upenn.edu/nll/?p=643", "http://languagelog.ldc.upenn.edu/nll/?p=642",
"http://languagelog.ldc.upenn.edu/nll/?p=641", "http://languagelog.ldc.upenn.edu/nll/?p=640",
"http://languagelog.ldc.upenn.edu/nll/?p=639", "http://languagelog.ldc.upenn.edu/nll/?p=638",
"http://languagelog.ldc.upenn.edu/nll/?p=636", "http://languagelog.ldc.upenn.edu/nll/?p=635",
"http://languagelog.ldc.upenn.edu/nll/?p=637", "http://languagelog.ldc.upenn.edu/nll/?p=634",
"http://languagelog.ldc.upenn.edu/nll/?p=633", "http://languagelog.ldc.upenn.edu/nll/?p=632",
"http://languagelog.ldc.upenn.edu/nll/?p=631", "http://languagelog.ldc.upenn.edu/nll/?p=630",
"http://languagelog.ldc.upenn.edu/nll/?p=629", "http://languagelog.ldc.upenn.edu/nll/?p=628",
"http://languagelog.ldc.upenn.edu/nll/?p=627", "http://languagelog.ldc.upenn.edu/nll/?p=626",
"http://languagelog.ldc.upenn.edu/nll/?p=625", "http://languagelog.ldc.upenn.edu/nll/?p=624",
"http://languagelog.ldc.upenn.edu/nll/?p=623", "http://languagelog.ldc.upenn.edu/nll/?p=622",
"http://languagelog.ldc.upenn.edu/nll/?p=620", "http://languagelog.ldc.upenn.edu/nll/?p=619",
"http://languagelog.ldc.upenn.edu/nll/?p=615"]

# Note: this replaces the URLs collected above with the hard-coded list
post_urls=error_urls

We can now use Selenium to download the actual posts.

Again, we'll first make a log file.

In [None]:
download_log=open("loot/log_posts.txt", "a", encoding="utf-8")

In [None]:
browser = webdriver.Firefox()
for post_url in post_urls:
    browser.get(post_url)
#     time.sleep(random.randint(5,10))
    store_page(browser, download_log)

In [None]:
browser.quit()
download_log.close()

Now that we have downloaded the posts, we can parse them with BeautifulSoup

In [None]:
from collections import defaultdict # used to make it easier to use dictionaries
from collections import Counter     # makes it easier to build frequency lists
from datetime import datetime       # convert timestamps

In [None]:
with open("./loot/log_posts.txt", "r", encoding="utf-8") as logfile:
    filenames=[]
    for line in logfile:
        url, filename = line.strip().split("\t")
        filenames.append(filename)
print(len(filenames))
test_file=open(filenames[2], encoding="utf-8")

In [None]:
soup = BeautifulSoup(test_file, 'html.parser')

We will write small functions to extract the different components of the posts

In [None]:
def extract_post_title(bs):
    # use the parsed page passed in as bs, not the global soup
    return(bs.find("h2", "posttitle").a.text)

In [None]:
extract_post_title(soup)

In [None]:
def extract_post_meta(bs):
    d=defaultdict(list)
    timestamp_text=bs.find("p", "postmeta").text.strip().split("\n")[0].strip()
    timestamp=datetime.strptime(timestamp_text, "%B %d, %Y @ %I:%M %p")
    d['timestamp']=str(timestamp)
    for metalink in bs.find("p", "postmeta").find_all("a"):
        key=metalink.get("rel")[0]
        value=metalink.text
        d[key].append(value)
    return(d)

In [None]:
extract_post_meta(soup)
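
The format string passed to `strptime` is the part of this function most likely to need adjusting for another site, so it helps to test it in isolation. The timestamp below is a made-up example in the layout the format string expects. Note that `%B` and `%p` depend on the locale, so this sketch assumes English month names.

```python
from datetime import datetime

# Made-up timestamp in the layout the format string expects.
example_text = "December 13, 2018 @ 9:05 am"

# %B full month name, %d day, %Y year, %I hour (12-hour clock),
# %M minutes, %p AM/PM marker
example_time = datetime.strptime(example_text, "%B %d, %Y @ %I:%M %p")

print(str(example_time))
```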

In [None]:
def extract_post_entry(bs):
    raw_paragraphs=[]
    for paragraph in (bs.find("div", "postentry").find_all("p")[1:]):
        if paragraph.get("class")==["postmeta"]:
            break
        else:
            raw_paragraphs.append(paragraph.text)
    return("\n".join(raw_paragraphs))

In [None]:
extract_post_entry(soup)

In [None]:
def extract_comments(bs):
    comments=[]
    comment_section=bs.find(id="commentlist")
    if comment_section:
        for comment_li in comment_section.find_all('li'):
            comment={}
            author=comment_li.find("h3","commenttitle").text[:-6]
            comment["author"]=author
            timestamp_text=comment_li.find("p", "commentmeta").text.strip()
            timestamp=datetime.strptime(timestamp_text, "%B %d, %Y @ %I:%M %p")
            comment['timestamp']=str(timestamp)
            body="".join([paragraph.text for paragraph in comment_li.find_all("p", class_=False)])
            comment['body']=body
            comments.append(comment)
    return(comments)

In [None]:
extract_comments(soup)

We can now extract the information from all of the posts. As always, start with a few posts to test our functionality. If everything works well, you can process all your posts. 

In [None]:
posts=[]
processing_counter=0
for filename in filenames:
    try:
        post={}
        with open(filename, encoding="utf-8") as f_in:
            soup = BeautifulSoup(f_in, 'html.parser')
        post['title']=extract_post_title(soup)
        post.update(extract_post_meta(soup))
        post['entry']=extract_post_entry(soup)
        post['comments']=extract_comments(soup)
        # only keep the post once every component has been extracted
        posts.append(post)
    except Exception:
        print("procedure did not work for post {:d}: {:s}".format(processing_counter, filename))
    processing_counter=processing_counter+1
    if processing_counter%100==0:
        print("processed {:d} posts".format(processing_counter))

Now that we have all the data we need, let's export it.

We will use the *json* format to store the data.

In [None]:
import json

In [None]:
with open("language_log.json", "w", encoding="utf-8") as f_out:
    json.dump(posts, f_out, ensure_ascii=False)
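
To check the export, the file can be read back with `json.load`. The sketch below does this round trip on a few made-up posts that use the same keys as the dictionaries built above, writing to a temporary directory so that it does not touch `language_log.json`. It then uses `Counter` to build a frequency list of comment authors, as one example of what the data can be used for.

```python
import json
import os
import tempfile
from collections import Counter

# Made-up posts with the same keys as the dictionaries built above.
sample_posts = [
    {"title": "First post", "entry": "Some text.", "comments": [
        {"author": "Ana", "timestamp": "2018-12-01 10:00:00", "body": "Nice."},
        {"author": "Ravi", "timestamp": "2018-12-01 11:00:00", "body": "Agreed."},
    ]},
    {"title": "Second post", "entry": "More text.", "comments": [
        {"author": "Ana", "timestamp": "2018-12-02 09:30:00", "body": "Café?"},
    ]},
]

# Write and read back, keeping non-ASCII characters readable in the file.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "language_log.json")
    with open(path, "w", encoding="utf-8") as f_out:
        json.dump(sample_posts, f_out, ensure_ascii=False)
    with open(path, "r", encoding="utf-8") as f_in:
        loaded_posts = json.load(f_in)

# Frequency list of comment authors across all posts.
author_counts = Counter(
    comment["author"]
    for post in loaded_posts
    for comment in post["comments"]
)
print(author_counts.most_common())
```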