<a href="https://colab.research.google.com/github/jtallison/LDLFest-workshop/blob/master/workbook-3.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
import numpy as np
import matplotlib.pyplot as plt
import re
import nltk
import pandas as pd

%matplotlib inline
plt.rcParams['figure.figsize'] = (12, 6)
plt.rcParams['figure.dpi'] = 300

## Working with a Corpus

### Assembling Your Own Datasets

Places to get stuff:

* Project Gutenberg (http://gutenberg.org/)
* Google Books (http://books.google.com)
* EEBO (http://quod.lib.umich.edu/e/eebogroup/)
* ECCO (http://quod.lib.umich.edu/e/ecco/)
* Evans (http://quod.lib.umich.edu/e/evans/)
* JSTOR DFR (http://dfr.jstor.org/)
* Open Access: PMC Open Access Set, PLoS, BioMed Central
* _Mining the Social Web_ (O'Reilly)
* Twitter APIs (http://dev.twitter.com)
* Facebook APIs (http://developers.facebook.com)

### Troubles with Access and Quality

The elephant in the room is copyright. For-profit journals are a case in point: Elsevier has a text-mining API; with other publishers you will need to negotiate a contract. The same holds true for contemporary books: getting access can be difficult, even with HathiTrust.

Google Ngram brings up other issues: the digitization of available materials is not complete, which limits how much statistical weight any result can bear, even for contemporary English. Breadth of corpus is a worry, especially for early works. Libraries scan the books they have, and what libraries tend to have (still) is medical and scientific volumes. That is, any signal you get could be weak or just plain wrong. OCR noise can also run wildly high.

On social media, see Boyd and Crawford 2011 (SSRN: 1926431).

Not only is OCR problematic, but automated tasks, like named entity extraction, are also questionable.

## Getting Texts

### `wget`

Sometimes CLI tools, like `wget`, are more powerful than GUI tools. The key difference is that GUI tools are easier to use at first, but repetitive tasks are difficult or expensive (in terms of time). CLI tools are a little more difficult at first, but once you have an established collection of them, repetitive tasks become not just easier but close to effortless.

**`wget`** is one of those tools. E.g.:

    % wget -r -l 1 -w 2 --limit-rate=20k https://www.cs.cmu.edu/~spok/grimmtmp/

`wget` is a CLI program that retrieves web content. To my mind, since it can act like a targeted web crawler, it is the single greatest tool available to those looking to gather data/texts. 

Let's break that command down:

    % wget -r -l 1 -w 2 --limit-rate=20k https://www.cs.cmu.edu/~spok/grimmtmp/
    
* `-r` (or `--recursive`) turns on recursive retrieving (by default, up to 5 levels deep).
* `-l 1` (or `--level=1`) keeps the depth to 1.
* `-w 2` (or `--wait=2`) gives the number of seconds to wait between retrievals. (Two seconds lessens the server load.)
* `--limit-rate=20k` caps the retrieval rate at 20kB/s. (This is being polite in a shared connection setting.)
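
If you find yourself running the same `wget` command against several sites, the flags above can be assembled in Python. The following is a minimal sketch, not part of the original workshop: the function name `build_wget_command` and its parameter names are made up for illustration, and it only builds the argument list. Handing that list to `subprocess.run` would actually launch `wget`, assuming it is installed.

```python
def build_wget_command(url, depth=1, wait=2, rate="20k"):
    """Return the polite, recursive wget call described above as an argument list."""
    return [
        "wget",
        "-r",                    # recursive retrieval
        "-l", str(depth),        # how many levels deep to follow links
        "-w", str(wait),         # seconds to wait between retrievals
        f"--limit-rate={rate}",  # cap on download speed
        url,
    ]

command = build_wget_command("https://www.cs.cmu.edu/~spok/grimmtmp/")
print(" ".join(command))
```

Keeping the arguments in a list (rather than one long string) means the command can be passed along without worrying about shell quoting.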

### Case 1: Downloading Files

Our starting point is this table of contents: http://digital.library.okstate.edu/kappler/Vol2/Toc.htm

As it turns out, almost all the documents in which we are interested are housed in a single directory (below), which does not like being crawled. Running `wget` returns **ERROR 403: Forbidden**. In all likelihood, this is the result of the site's administrator configuring the website to make sure that directories cannot be browsed directly.

    !wget -r -l 1 -w 2 --limit-rate=20k http://digital.library.okstate.edu/kappler/Vol2/treaties/

We need, then, to be able to access the table of contents above, get all the links listed, and then download that list into a directory (folder in GUI terms) of our choosing.

While `wget` is a truly useful program, especially since one line can do so much, it does have its limitations. There are ways around them that would allow you to remain within the Bash shell, but it is also possible to replicate the power of `wget` in Python, and once you are using Python, you can do so much more...

In [None]:
import os
import urllib.request
from bs4 import BeautifulSoup

# To use this script, the user needs to provide the three values below: 
# myurl, myfilter, mydirectory

myurl = "http://digital.library.okstate.edu/kappler/Vol2/Toc.htm"
myfilter = "http://digital.library.okstate.edu/kappler/Vol2/treaties/"
mydirectory = "/Users/jjl/Desktop/downloadedfiles/"

# Create the download directory if it does not exist yet
os.makedirs(mydirectory, exist_ok=True)

# Fetch the table of contents and collect every <a> tag
with urllib.request.urlopen(myurl) as myconnection:
    myhtml = myconnection.read()
mysoup = BeautifulSoup(myhtml, "lxml")
mylinks = mysoup.find_all('a')

# Keep the href of each tag that has one
all_links = [tag.get('href') for tag in mylinks if tag.get('href') is not None]

# Keep only the links that point into the treaties directory
myresults = [k for k in all_links if myfilter in k]

# Download each file, naming it after the part of the URL that follows myfilter
for result in myresults:
    with urllib.request.urlopen(result) as remotefile:
        with open(mydirectory + result.replace(myfilter, ''), 'wb') as localfile:
            localfile.write(remotefile.read())

Now we have a directory (folder) sitting on our desktop and it has all the files we want:

![Screenshot of Full Directory](./images/Screenshot_directory.png)
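
One thing to watch for: the filter in the script above only matches links that are written out in full. If a table of contents uses relative links (`treaties/apa0598.htm`, say), they need to be resolved against the page's URL first. Here is a small sketch of that step, run on a made-up scrap of HTML so it needs no network access. It uses the standard library's `HTMLParser` in place of BeautifulSoup purely to stay self-contained; `LinkCollector`, `absolute_links`, and the sample markup are all hypothetical.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

class LinkCollector(HTMLParser):
    """Collect the href of every <a> tag fed to the parser."""
    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href is not None:
                self.hrefs.append(href)

def absolute_links(html, page_url, prefix):
    """Resolve each link against page_url and keep those starting with prefix."""
    collector = LinkCollector()
    collector.feed(html)
    resolved = [urljoin(page_url, href) for href in collector.hrefs]
    return [link for link in resolved if link.startswith(prefix)]

# Made-up sample markup, not the real table of contents
sample = '<a href="treaties/apa0598.htm">Apache</a> <a href="index.htm">Index</a>'
page = "http://digital.library.okstate.edu/kappler/Vol2/Toc.htm"
prefix = "http://digital.library.okstate.edu/kappler/Vol2/treaties/"
links = absolute_links(sample, page, prefix)
print(links)
```

Links that are already absolute pass through `urljoin` unchanged, so the same function covers both cases.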

### Case 2: Working with a Content Management System's API

What happens when the texts with which you want to work are not sitting in a directory, but are in a content management system (CMS)? Our next example was suggested by a participant who is interested in working with Paul Laurence Dunbar's poetry and fiction. Using the previous script as a basis for doing similar work, we are going to examine the URLs generated by the CMS to see if there is a way for us to get what is wanted.

Here is the link for the digital archive of Dunbar’s work at Wright State: http://www.libraries.wright.edu/special/dunbar/

![Screenshot of Dunbar Archive Web Page](./images/ScreenShot_Dunbar.png)

If we click on the "poetry" link in the left-hand navigation pane, and then hover over one of the books (see image above), we see the following URL: 

    http://www.libraries.wright.edu/special/dunbar/explore?book=8

Clicking on a book takes us to a table of contents, with a series of links like this:

    http://www.libraries.wright.edu/special/dunbar/explore?book=9&id=236

The `id`s are not sequential within a book; however, by playing with the URLs in a browser, it looks like you can insert an asterisk into the portion of the URL that identifies the book, `book=*`, and still get back results based simply on the `id=`:

    http://www.libraries.wright.edu/special/dunbar/explore?book=*&id=99
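
To see what the CMS is actually being asked, it can help to pull a URL apart. A quick sketch using the standard library's `urllib.parse` (the helper name `query_params` is mine, just for illustration):

```python
from urllib.parse import urlparse, parse_qs

def query_params(url):
    """Return the query string of a URL as a plain dict of single values."""
    parsed = parse_qs(urlparse(url).query)
    return {key: values[0] for key, values in parsed.items()}

base = "http://www.libraries.wright.edu/special/dunbar/explore"
params = query_params(base + "?book=9&id=236")
wildcard = query_params(base + "?book=*&id=99")
print(params)
print(wildcard)
```

Seen this way, each page is picked out by just two parameters, `book` and `id`, and the asterisk is simply a value passed for `book`.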

In fact, after a little experimentation, typing in different `id` numbers and getting back results each time, it looks like we just need to iterate through all the `id`s. If we start with `1`, how far up do we need to go? Since I saw numbers in the 300s earlier, I am going to start with 400 and go up by 100 until I get no results and then narrow by 10s and then 1s until I know where to stop ... and it appears we stop at 433.
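
That hunt for the last `id` can be written down as a coarse-to-fine search. The sketch below is only an illustration: `has_result` stands in for "load the page and see whether anything comes back" (here it is faked so the example runs offline), and the approach assumes there are no gaps in the `id`s above the starting point, which we have not verified.

```python
def find_last_id(has_result, start=400, steps=(100, 10, 1)):
    """Walk upward from start in shrinking steps; return the last id that gives a result.

    Assumes has_result(start) is true and that ids have no gaps above start.
    """
    current = start
    for step in steps:
        # Keep stepping while the next candidate still returns something
        while has_result(current + step):
            current += step
    return current

# Stand-in for a real check against the site: pretend ids 1 through 433 exist
def fake_has_result(i):
    return 1 <= i <= 433

last_id = find_last_id(fake_has_result)
print(last_id)
```

Against a live site, each call to `has_result` is a page request, so the shrinking steps keep the number of requests small.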

Now let's go build, er, revise us some code...

In [32]:
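#! /usr/bin/env python

import urllib.request
from bs4 import BeautifulSoup

baseurl = "http://www.libraries.wright.edu/special/dunbar/explore?book=*&id="
mydirectory = "/Users/jjl/Desktop/downloadedfiles/"

# Build one URL per id, from 1 through 433
mylist = [baseurl + str(i) for i in range(1, 434)]

for link in mylist:
    remotefile = urllib.request.urlopen(link).read()
    soup = BeautifulSoup(remotefile, "lxml")
    # Keep only the <div class="bookContain-right"> that holds the text
    div = soup.find('div', 'bookContain-right')
    if div is None:
        continue  # skip any page that lacks the expected div
    # Name each file after its id, e.g. 236.html
    with open(mydirectory + link.replace(baseurl, '') + ".html", 'wt') as localfile:
        # str() of the bytes from encode() writes their b'...' form,
        # which is where the backslash escapes in the saved files come from
        localfile.write(str(div.encode('utf-8')))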

The code works, and it returns only the contents of the desired `div`:

    <div class="bookContain-right">

But the contents remain ugly: the saved files are full of escaped characters, those that begin with a backslash. (They appear because the script writes the `str()` of encoded bytes.) At the very least, some regex is needed to clean them up. Perhaps better would be to use `html2text` to convert the documents to plain text. 

In [None]:
#! /usr/bin/env python

import urllib.request
from bs4 import BeautifulSoup
import html2text

baseurl = "http://www.libraries.wright.edu/special/dunbar/explore?book=*&id="
mydirectory = "/Users/jjl/Desktop/downloadedfiles/"

# range(1, 2) fetches only id 1 as a test run; use range(1, 434) for everything
mylist = [baseurl + str(i) for i in range(1, 2)]

for link in mylist:
    remotefile = urllib.request.urlopen(link).read()
    soup = BeautifulSoup(remotefile, "lxml")
    div = soup.find('div', 'bookContain-right')
    if div is None:
        continue  # skip any page that lacks the expected div
    # Convert the div's HTML to plain text
    text = html2text.html2text(str(div))
    # Name each file after its id, e.g. 1.txt
    with open(mydirectory + link.replace(baseurl, '') + ".txt", 'wt', encoding='utf-8') as localfile:
        localfile.write(text)
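
A side note on where the backslashes in the first version came from: calling `str()` on encoded bytes gives Python's printable `b'...'` form rather than the text itself. A minimal illustration, with a made-up string standing in for a page's contents:

```python
sample = "café\nsecond line"

as_bytes = sample.encode("utf-8")
escaped = str(as_bytes)              # the b'...' form, escapes included
restored = as_bytes.decode("utf-8")  # the original text, recovered

print(escaped)
print(restored == sample)
```

So writing `str(div)`, or decoding the bytes before writing, avoids the escapes without any regex.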

## Munging

**Data munging** or **data wrangling** is loosely the process of manually converting or mapping data from one "raw" form into another format that allows for more convenient consumption of the data with the help of semi-automated tools. This may include further munging, data visualization, data aggregation, training a statistical model, as well as many other potential uses. Data munging as a process typically follows a set of general steps which begin with extracting the data in a raw form from the data source, "munging" the raw data using algorithms (e.g. sorting) or parsing the data into predefined data structures, and finally depositing the resulting content into a data sink for storage and future use.

-- [Wikipedia](https://en.wikipedia.org/wiki/Data_wrangling)

In [1]:
# Let's take a look at one of Zach's files:

!less ./texts/apa0598.htm

<html>
   <head>
      <meta http-equiv="Content-Type" content="text/html; charset=utf-8">
   
      <link rel="stylesheet" type="text/css" href="kstyles.css">
      <title>INDIAN AFFAIRS: LAWS AND TREATIES. Vol. 2, Treaties</title>
      <meta name="description" content="Indian Affairs: Laws and Treaties, compiled and edited by  Charles J. Kappler is an historically significant, seven volume compilation of U.S. treaties, laws and executive  orders pertaining to Native  American Indian tribes. The volumes cover U.S. Government treaties  with Native  Americans from 1778-1883 (Volume II) and U.S. laws and executive orders concerning Native Americans  from 1871-1970 (Volumes I, III-VII). The work was first published in 1903-04 by the U.S. Government Printing Office.">
      <meta name="keywords" content="kappler native american indian tribes treaties laws executive orders">
      <meta name="author" content="Oklahoma State University Library">
   </head>
   <body background="i

No matter what Zach has in mind for this data, we can be pretty sure that it does not include a lot of angle brackets and funkiness like `div class="SANSLINE"`. (For the record, *funkiness* is a technical term in data munging. I'm serious. Go look it up.) Whatever Zach's next steps are, he is going to want to clean up the text. 

For this workshop, we are going to skip transforming this HTML into some kind of operable XML and focus on simply getting it into useful plain text. From there, Zach will be able to engage a number of automated processes which may be more, or less, interesting.

In [2]:
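from bs4 import BeautifulSoup

# Read the HTML file, then let BeautifulSoup strip the tags.
# Naming the parser ("lxml") avoids bs4's "no parser specified" warning.
with open('./texts/apa0598.htm', 'r') as myfile:
    myhtml = myfile.read()
mytext = BeautifulSoup(myhtml, "lxml").text

print(mytext)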





INDIAN AFFAIRS: LAWS AND TREATIES. Vol. 2, Treaties





INDIAN AFFAIRS: LAWS AND TREATIES
Vol. II, Treaties    
Compiled and edited by Charles J. Kappler.
         Washington : Government Printing Office, 1904.
      

Home | Disclaimer & Usage | Table of Contents | Index


TREATY WITH THE APACHE, 1852
July 1, 1852. | 10 Stat., 979. | Ratified Mar. 23, 1853. | Proclaimed Mar. 25, 1853.
Page Images:  Page 598
             | 599
             | 600






Margin Notes


Authority of the United States acknowledged.


Peace to exist.


The Apaches not to assist other tribes in hostilities.


Good treatment of citizens of the United States and nations at peace with them.


Cases of aggression on them to be referred to Government.


Laws to be conformed to.


Provisions against incursions into Mexico.


Persons injuring the Apaches to be tried and punished.


Free passage over the Apache territory.


Military posts, agencies, and trading houses to be established.


Territorial boundaries 



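
The extracted text is mostly blank lines. One way to tidy it, sketched here with the standard `re` module on a short made-up sample (`collapse_blank_lines` is a hypothetical helper, and how much whitespace to keep is a judgment call):

```python
import re

def collapse_blank_lines(text):
    """Strip trailing spaces and squeeze runs of blank lines down to one."""
    lines = [line.rstrip() for line in text.splitlines()]
    joined = "\n".join(lines)
    return re.sub(r"\n{3,}", "\n\n", joined).strip()

sample = "\n\n\nTREATY WITH THE APACHE, 1852\n\n\n\nMargin Notes   \n\n\nPeace to exist.\n"
cleaned = collapse_blank_lines(sample)
print(cleaned)
```

A single blank line between blocks is kept on purpose, so the headings and margin notes stay visually separate.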


Now, we just need to clean up the entire folder!

Again, a bit of automation goes a long way...

In [None]:
import os
from bs4 import BeautifulSoup

# This is a bare-bones version: it assumes every file in filesIN is HTML.
# Feel free to use it as a basis for your own script.

filesIN = "/Users/jjl/Desktop/filesIN/"
filesOUT = "/Users/jjl/Desktop/filesOUT/"

# Create the output directory if it does not exist yet
os.makedirs(filesOUT, exist_ok=True)

postlist = os.listdir(filesIN)

for post in postlist: 
    # Parse the HTML and keep only its text
    with open(filesIN + post) as fin:
        text = BeautifulSoup(fin, "lxml").get_text()
    # Write the plain text out under the same file name
    with open(filesOUT + post, "w", encoding="utf-8") as fout:
        fout.write(text)