# Getting Our Data

## This Notebook

In this session, we will acquire the letters that we will use over the course of this workshop. We will be using the letters of Robert Louis Stevenson, made available through Wikisource in two volumes ([Vol 1](https://en.wikisource.org/wiki/The_Letters_of_Robert_Louis_Stevenson_Volume_1) & [Vol 2](https://en.wikisource.org/wiki/The_Letters_of_Robert_Louis_Stevenson_Volume_2)). Since there are only twelve pages that we need to download, we can just download them manually by saving each page in our browsers. At larger scales, you might consider using a command line utility like wget (for which there is a [tutorial on The Programming Historian](https://programminghistorian.org/lessons/automated-downloading-with-wget)), or some additional Python (which also has a tutorial on [The Programming Historian](https://programminghistorian.org/lessons/downloading-multiple-records-using-query-strings)).

In this tutorial, we are using a [Jupyter Notebook](http://jupyter.org/), which is a tool for interactive programming with the Python programming language. The notebook consists of a series of "cells". Some, like this one, are for taking notes. We call those "Markdown cells", because they use the [Markdown](https://daringfireball.net/projects/markdown/syntax) syntax. We also have code cells, where we can execute Python code and see the results.

If you want to use Jupyter Notebook yourself, I recommend the [Anaconda](https://www.anaconda.com/download/#macos) Python distribution. Installing Python can be quite a hassle, but using this distribution makes it much easier to get up and running quickly, and includes Jupyter Notebook.

Even if you've never seen Python code before, this session should help you form expectations for what you can do with HTML documents in Python. If you have worked with Python before, or you're interested in learning, you can use this code as an example for your own projects. Since this isn't a "Learn Python" workshop, there will be a lot of things that we'll gloss over in the interest of time, but if you're interested in learning Python, I'd recommend saving it and coming back to it in between tutorials to see how much of it you understand.

__If you want to pursue learning Python...__

While we're going through this tutorial, take advantage of this interactive environment to modify pieces of code, so that you can see if the results are what you expect. Most of programming is learning how to align your expectations with what the code does, and much of that learning process can be done through experimentation.

__If you don't...__

That's totally fine! Learning to program represents a significant time commitment, which not all of us have. Digital scholarship tends to be collaborative because no individual can master every aspect of the work that goes into it. However, being familiar with how programming works and how programmers have to approach problems can give you better intuitions about the kinds of problems that can be solved with programmatic approaches, and make you a better collaborator on digital scholarship projects.

And who knows, you might find programming easier than you think!

__In any case...__

Ask questions! If you're wondering something about what's going on, there are probably other people with the same exact question, so do them a favor and ask.

__One last thing...__

I'm showing you how I got this dataset using Python, but that's not the only way that you can get data like this to work with. What we're creating with this process is a spreadsheet with each letter in a single cell of its own row, so that we can use other tools to manipulate the data further. This is the output provided by some web scraping tools like Grepsr and webscraper.io, so I don't want you to get the impression that you can't get data like this without using Python. It was the best tool for this particular job, and I want to use Python to show you what kinds of things go on under the hood of any given scraper, but you can scrape the web without it.

## Our Data

Before we look at any code, however, let's look at what it is that we actually want to do. The goal is to transform pages on Wikisource that contain many letters into a series of documents, each containing a single letter.

Take a look at a single page ([Volume 1, Chapter 1](https://en.wikisource.org/wiki/The_Letters_of_Robert_Louis_Stevenson_Volume_1/Chapter_I) for example), and see what kind of useful patterns we might have to work with. Once we have an idea of what we want to do conceptually, we can put that into practice with Python code.

In [1]:
import os
import re
import csv
from lxml import html

## $\uparrow$ Importing Modules

One of the many joys of Python is that you almost never have to start entirely from scratch. Python is built to be extensible, and there are not only many modules to give you different kinds of functionality, but there is a robust infrastructure to acquire them.

If this notebook is our data workbench, then the modules that we import are the specialized tools that we bring to our specific project. Python comes with basic tools, so you can always count on your screwdrivers and hammers, but sometimes you need some more specialized tools, so you `import` them to keep those specialized tools close at hand.

For now, we're importing four tools to work with:

|Command|Module Imported|Description|
|-------|------|:----------|
|`import os`|`os`|Module for interacting with the computer's file system. Useful for getting file locations and iterating through folders of files|
|`import re`|`re`|Module for using Regular Expressions, a language for making search and replace functions that are more flexible than you may be used to|
|`import csv`|`csv`|Module for working with CSV files, which we will use for output.|
|`from lxml import html`|`html`|This command is different because we're not importing an entire module, we're importing a sub-module. `lxml` is a module for interacting with XML and XML-like documents, and `html` is a sub-module specifically for working with HTML documents|
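
To make the table a little more concrete, here is a small sketch of the three standard-library modules each doing one tiny job. The file path and the in-memory "file" (`io.StringIO`, which isn't used elsewhere in this notebook) are made up for illustration, and none of this is part of the workflow below.

```python
import csv
import io
import os
import re

# os: build a file path from its parts, using the right separator for this computer
example_path = os.path.join("wikisource", "Volume_1", "Chapter_I.htm")  # hypothetical path

# re: find the first run of four digits (here, a year) in a line of text
heading = "Letter: SPRING GROVE SCHOOL, 12TH NOVEMBER 1863."
year = re.search(r"\d{4}", heading).group()

# csv: write one row to an in-memory "file", so nothing is saved to disk
buffer = io.StringIO()
csv.writer(buffer).writerow(["letter_1", "MA CHERE MAMAN"])
row_text = buffer.getvalue()

print(example_path)
print(year)
print(repr(row_text))
```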

In [2]:
ch1 = html.parse("wikisource/The_Letters_of_Robert_Louis_Stevenson_Volume_1/Chapter_I.htm")
# I didn't actually type the whole file path, you can hit tab whenever you're typing for some pretty smart autocomplete
ch1

<lxml.etree._ElementTree at 0x111a39f88>

----

$\uparrow$ We're doing a few things in that first line of code, so here's a breakdown:

ch1$^1$ = $^2$html$^3$.$^4$parse$^5$($^6$"wikisource/wikisource_vol1_ch1.htm"$^7$)$^8$

1. We are creating a variable, and calling it "ch1"
2. This single equals sign is what we use to assign a variable name to whatever is to its right
3. Here we are using the "html" module that we imported in the previous cell
4. This single period or "dot" is used to access attributes of an object, so what follows is an attribute of "html"
5. This is a function attribute of html called "parse"
6. This parenthesis indicates that the word before it is a function. Within the parentheses are the arguments of the function
7. This is the only argument we are giving to the function, the relative location of an html document. It is represented as a "string", which is what you call an object that you want to work with as text (in contrast to other data types, like numbers and lists). This is the page for Volume 1, Chapter 1 of the collection of letters, which you can see [here](https://en.wikisource.org/wiki/The_Letters_of_Robert_Louis_Stevenson_Volume_1/Chapter_I)
8. This parenthesis closes out the function arguments.

The second line is just calling up the object we've just created, so that we can see what it is. Some objects will output a nice, readable form of their contents when you call them in this way. In this case, what we get is a kind of gnarly looking line that just tells us what kind of thing it is (an "ElementTree")

----

In [3]:
print(html.tostring(ch1,encoding="unicode")[:1000])

<!DOCTYPE html>
<!-- saved from url=(0087)https://en.wikisource.org/wiki/The_Letters_of_Robert_Louis_Stevenson_Volume_1/Chapter_I --><html class="client-js ve-not-available" lang="en" dir="ltr"><head>

<title>The Letters of Robert Louis Stevenson Volume 1/Chapter I - Wikisource, the free online library</title>
<script>document.documentElement.className = document.documentElement.className.replace( /(^|\s)client-nojs(\s|$)/, "$1client-js$2" );</script>
<script>(window.RLQ=window.RLQ||[]).push(function(){mw.config.set({"wgCanonicalNamespace":"","wgCanonicalSpecialPageName":false,"wgNamespaceNumber":0,"wgPageName":"The_Letters_of_Robert_Louis_Stevenson_Volume_1/Chapter_I","wgTitle":"The Letters of Robert Louis Stevenson Volume 1/Chapter I","wgCurRevisionId":3751843,"wgRevisionId":3751843,"wgArticleId":39652,"wgIsArticle":true,"wgIsRedirect":false,"wgAction":"view","wgUserName":null,"wgUserGroups":["*"],"wgCategories":["Subpages"],"wgBreakFrames":false,"wgPageContentLanguage":"en","wgPageC

----

$\uparrow$ This is another look at what we've just made. We're using two different functions to get it, but don't worry about the syntax (unless you're really curious). Just know that this is the top of the document represented by `ch1`.

----

In [4]:
text = ch1.xpath("//div[@class='prose']")[0]
text

<Element div at 0x1121cd4a8>

----

$\uparrow$ In this cell, we're using an XPath selector to get a narrower selection of the document. (Remember that XPath selectors are a kind of selector that uses the structure of an HTML or XML document to grab parts of that document.) Instead of the whole document, with the Wikisource header and sidebars and everything, we want just the central block with the text. The XPath selector that we're using as an argument does just that; it selects "`div`" elements with the class "prose". If you go to the [Wikisource page](https://en.wikisource.org/wiki/The_Letters_of_Robert_Louis_Stevenson_Volume_1/Chapter_I), right click on the text and click "Inspect", you'll be able to use your browser's developer tools to confirm this for yourself.

We're using a similar syntax to what we used in the previous cell. We're creating an object, this time based on the `ch1` object, and using the `xpath` function to do so. One difference in the syntax is at the end of the line, where we have `[0]`. When you see those square brackets after some object, it indicates that we're accessing some part of it. Here, we're accessing the first value that the `xpath` function returns (since Python, like many other programming languages, starts counting from 0, not 1)
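
Here is a minimal sketch of the same idea on a tiny, made-up HTML snippet, assuming `lxml` is installed. The snippet and the variable names are invented for illustration, and `html.fromstring` is used because it parses a string of HTML rather than a file.

```python
from lxml import html

# A made-up page: a sidebar we don't want, and a "prose" block that we do
sample_page = """
<html><body>
  <div class="sidebar">Navigation links</div>
  <div class="prose"><p>First letter.</p><hr><p>Second letter.</p></div>
</body></html>
"""

sample_tree = html.fromstring(sample_page)

# xpath returns a list of matches, even when only one element matches
matches = sample_tree.xpath("//div[@class='prose']")
print(len(matches))

# [0] picks the first (here, the only) item in that list
prose = matches[0]
print(prose.tag)
print(prose.text_content())
```

The sidebar never shows up in the result, because its class doesn't match the selector.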

----

In [5]:
texthtml = html.tostring(text,encoding="unicode")
print(texthtml[:1200])

<div class="prose">
<p>Letter: SPRING GROVE SCHOOL, 12TH NOVEMBER 1863.</p>
<p><br></p>
<p>MA CHERE MAMAN, - Jai recu votre lettre Aujourdhui et comme le jour prochaine est mon jour de naisance je vous ecrit ce lettre. Ma grande gatteaux est arrive il leve 12 livres et demi le prix etait 17 shillings. Sur la soiree de Monseigneur Faux il y etait quelques belles feux d'artifice. Mais les polissons entrent dans notre champ et nos feux d'artifice et handkerchiefs disappeared quickly, but we charged them out of the field. Je suis presque driven mad par une bruit terrible tous les garcons kik up comme grand un bruit qu'll est possible. I hope you will find your house at Mentone nice. I have been obliged to stop from writing by the want of a pen, but now I have one, so I will continue.</p>
<p>My dear papa, you told me to tell you whenever I was miserable. I do not feel well, and I wish to get home.</p>
<p>Do take me with you.</p>
<p>R. STEVENSON.</p>
<hr>
<p>Letter: 2 SULYARDE TERRACE, TORQU

----

$\uparrow$ Here we're transforming our text from an "Element" object to a much simpler data structure, a "string". Strings in Python are data objects that are treated as text, as opposed to numbers or code.

Now we have something quite useful. We have the text of all of the letters in a format that is very easy for Python to work with. This is the first part of breaking the longer text containing all of the letters into smaller chunks containing a single letter.

_Side note: We're also only displaying the first 1200 characters of text with the `[:1200]` addendum, so as not to take up too much space. Try out displaying more or less of the text by changing the number in the brackets._
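
If you'd like to see that bracket notation on something smaller, here is a short sketch using the heading of the first letter as a made-up sample. The variable names are just for illustration.

```python
sample = "Letter: SPRING GROVE SCHOOL, 12TH NOVEMBER 1863."

first_six = sample[:6]      # everything before position 6
last_five = sample[-5:]     # the final five characters
first_char = sample[0]      # a single position, counting from 0
whole_thing = sample[:1000] # asking for more than there is just gives you everything

print(first_six)
print(last_five)
print(first_char)
print(whole_thing == sample)
```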

----

In [6]:
"Hello, goodbye".split(",")

['Hello', ' goodbye']

----

$\uparrow$ Here's a simplified example of what we want to do with our text. Since we've determined that we want to break up the document using those horizontal lines as breakpoints, we need a Python function to do just that.

What we have here is a string, `"Hello, goodbye"`, and we're using its "`split`" function attribute to split it on a comma: `","`.

In the output we see `['Hello', ' goodbye']`. The square brackets indicate that this is a `list` data type, which in Python is an ordered collection of objects. The bits of text in single quotes are `string`s, and they are separated by commas outside of the quotes, which tells us that they are separate items in this `list`.

It's worth mentioning that creating a simplified version of a command that you'd like to give, in order to develop an intuition for how it functions, is a very common tactic in Python programming, especially in a notebook like this. 

_Try changing the first string and the argument given to `split`, to see how you can use the function in different ways and in different contexts_
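
As one more experiment along those lines, here is a sketch that splits a made-up scrap of HTML on `"<hr>"` instead of a comma. The sample text is invented. Notice that the separator itself disappears from the results, and that the leftover line breaks around it stay attached to each piece until we trim them with `strip`.

```python
sample_html = "<p>First letter.</p>\n<hr>\n<p>Second letter.</p>\n<hr>\n<p>Third letter.</p>"

pieces = sample_html.split("<hr>")
print(len(pieces))
print(pieces)

# Make a new list by trimming the whitespace from each piece in turn
cleaned = [piece.strip() for piece in pieces]
print(cleaned)
```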

----

In [7]:
letters = texthtml.split("<hr>")
print(letters[0])

<div class="prose">
<p>Letter: SPRING GROVE SCHOOL, 12TH NOVEMBER 1863.</p>
<p><br></p>
<p>MA CHERE MAMAN, - Jai recu votre lettre Aujourdhui et comme le jour prochaine est mon jour de naisance je vous ecrit ce lettre. Ma grande gatteaux est arrive il leve 12 livres et demi le prix etait 17 shillings. Sur la soiree de Monseigneur Faux il y etait quelques belles feux d'artifice. Mais les polissons entrent dans notre champ et nos feux d'artifice et handkerchiefs disappeared quickly, but we charged them out of the field. Je suis presque driven mad par une bruit terrible tous les garcons kik up comme grand un bruit qu'll est possible. I hope you will find your house at Mentone nice. I have been obliged to stop from writing by the want of a pen, but now I have one, so I will continue.</p>
<p>My dear papa, you told me to tell you whenever I was miserable. I do not feel well, and I wish to get home.</p>
<p>Do take me with you.</p>
<p>R. STEVENSON.</p>



----

$\uparrow$ We're using that same function on our larger document, but instead of breaking it up by commas, we're using `"<hr>"`, the HTML notation for those horizontal lines that separate letters. Then we're using the `print` function to display the first letter from that set of letters.

----

In [8]:
# len gets us the number of objects in a list or list-like object, 
# so this is the number of letters that we've just parsed out
len(letters)

16

In [9]:
# Let's start preparing to output these letters to a CSV file...

fp = open('all_letters.csv',"w")
# ^Open the file

writer = csv.writer(fp,quoting=csv.QUOTE_ALL)
# ^Make an object that can write to the file
writer.writerow(['filename','letter_text','location'])
# ^Write headers to the file

fp.close()
# ^Close the file

----

$\uparrow$ In this cell, we're starting to use the `csv` module to write data to a file. To do that, we need an open file to write to. 

On the first line we're opening a file called "all_letters.csv", and the second argument, that `"w"`, indicates that we're opening it for writing as a new file. When you open a file with this option, Python will always create a new, empty file with that name, even if that will overwrite an existing file. 

This is something that comes up a lot in programming: having to think about and specify things that you might normally take for granted. When you're opening a file on your computer, you open it the same way whether you want to look at it or edit it. In Python, opening a file for reading and writing are different, and there are even different kinds of writing, depending on whether you're making a new file or adding to an existing file. 

But this hyper-specificity can pay dividends. Although we have to consider more aspects of the process, we also have more control, so we can prevent problems before they happen. In the next line, for instance, we're using the `csv` module to create an object for writing data to the CSV file. We could create the writer with `writer = csv.writer(fp)`, which uses default options for writing to a CSV file. Instead, we've also specified that we want to put quotes around every field. Since CSV files are just text that uses commas to separate fields, CSV files that include text sometimes get their cells split unintentionally by commas in the text. Quotation marks around text serve as more solid boundaries for the cells, ensuring that our CSV works as expected. So, we've specified that we want quotation marks around all of our cells with the option `quoting=csv.QUOTE_ALL`. There's a lot to get used to, and it takes time, but if you've ever opened a spreadsheet to find that the contents are all split up or made up of crazy characters, you'll appreciate being able to ensure that that doesn't happen.

On the line `writer.writerow(['filename','letter_text','location'])`, we're writing a list to the file as a row, with each item in the list as a cell in that row.

Finally, we have to close the file. It's similar to closing a file in one program before you edit in another. You can see some strange behavior if you don't do this.
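
To see what the quoting option changes, here is a small sketch that writes the same made-up row twice, once with the default options and once with `csv.QUOTE_ALL`. It writes to an in-memory "file" (`io.StringIO`, an assumption for the sake of the example) so that nothing is saved to disk. Python's default writer already puts quotation marks around a field that contains a comma; `QUOTE_ALL` simply does it for every field, every time.

```python
import csv
import io

row = ["letter_1", "MA CHERE MAMAN, - Jai recu votre lettre", "Chapter_I.htm"]

# Default quoting: only the fields that need it get quotation marks
default_buffer = io.StringIO()
csv.writer(default_buffer).writerow(row)

# QUOTE_ALL: every field gets quotation marks
quoted_buffer = io.StringIO()
csv.writer(quoted_buffer, quoting=csv.QUOTE_ALL).writerow(row)

print(default_buffer.getvalue())
print(quoted_buffer.getvalue())

# Reading the quoted text back gives us the same three cells we started with
read_back = next(csv.reader(io.StringIO(quoted_buffer.getvalue())))
print(read_back == row)
```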

----

In [10]:
counter = 1
# ^Set a variable to count how many times we've been through the loop we'll set up 

fp = open('all_letters.csv','a')
# ^Open our file with the 'a' parameter for 'appending' data to the end of the file

writer = csv.writer(fp,quoting=csv.QUOTE_ALL)
# ^Set up the CSV writer object
    
letterLocation = "wikisource/The_Letters_of_Robert_Louis_Stevenson_Volume_1/Chapter_I.htm"
# ^Make a variable called letterLocation, and set it to the location of the file we got the letter from
# That will let us reference it more easily later.

for l in letters:
# ^Start 'looping' through our list of letters

    letterName = "wikisource_vol1_ch1_letter_"+str(counter)
    # ^Make a variable within the loop called letter name, use the counter to construct it

    writer.writerow([letterName,l.strip(),letterLocation])
    # ^Within the loop, write a row that has a name for the letter, as well as the text
    # `l.strip()` uses a string function to trim whitespace from the beginning and end 
    # of a string of text, so that we don't get extra spaces and newlines in our letters.

    counter += 1
    # ^Increment our counter

fp.close()
# ^Outside the loop, close the file

----

$\uparrow$ In this cell, we're doing some iteration. The syntax of `for l in letters:` begins a __loop__ that performs the indented actions on every object in `letters`, with "`l`" as the stand-in for the individual objects.

Having run the code in this cell, you can now download and open [all_letters.csv](all_letters.csv) and see what we've made. You can open it in Excel or any other spreadsheet software.

_We are setting up a new variable called `letterName`, but it's getting re-written in every iteration of the loop. You can always re-use the same name for a variable elsewhere in a Python script; it will just overwrite the old one. The new value doesn't even have to be the same kind of object; Python just puts that label on the new object._
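
Keeping a counter by hand works fine, but Python also has a built-in function, `enumerate`, that hands you a number alongside each item as you loop. Here is a sketch on a made-up list; the names are invented for illustration, and this is an alternative to the approach above rather than what the cell above does.

```python
sample_letters = ["First letter", "Second letter", "Third letter"]

names = []
for number, letter in enumerate(sample_letters, start=1):
    # number counts up from 1; letter is the current item in the list
    name = "sample_letter_" + str(number)
    names.append(name)
    print(name, "->", letter)
```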

----

In [11]:
def get_letters(location):
# ^define a function that always takes a single argument, which we'll call location throughout the function

    tree = html.parse(location)
    # Parse the HTML of the document at the location given
    
    text_element = tree.xpath("//div[@class='prose']")[0]
    # Get the contents of the document that we're interested in with XPath
    
    text_with_html = html.tostring(text_element,encoding="unicode")
    # Convert that text to a string so it's easy to work with
    
    letters = text_with_html.split("<hr>")
    # Break the document into sections, separated by horizontal lines
    
    base_location = location.split('/')[-1]
    # Set up a base filename based on the location given
    # e.g. "Chapter_I.htm"
    
    base_location = base_location.replace('.htm','')
    # Remove the ".htm" file extension from the base location
    
    counter = 1
    # Set up our counter
    
    fp = open('all_letters.csv','a')
    # Open the output file for writing new data
    
    writer = csv.writer(fp,quoting=csv.QUOTE_ALL)
    # Create a CSV writer object to write the data
    
    for l in letters:
    # Begin iterating through letters
    
        letterName = base_location + "_letter" + str(counter)
        # Set up the name of the letter that we'll use
        
        writer.writerow([letterName,l.strip(),location])
        # Write the letter name and contents as a row to our CSV, again removing whitespace
        
        counter += 1
        # Increment the counter

    fp.close()
    # Outside the loop, close the file so that everything we wrote is saved

    return "Done!"
    # You don't have to, but it's customary to return something at the end of a function.
    # In our case, the function's purpose isn't to transform an object, it's to write objects to files
    # However, you can use "return" to have the function spit out a new object based on its input

----

$\uparrow$ Now we're preparing to scale up this operation with a function of our own. In this cell we're defining a new function that pulls together all of the steps of the previous cells, so that we can more easily parse additional files.

This process is common in using Python as a research tool. First, you figure out the process to do something once, then you make it easier to perform that process repeatedly, and gradually scale up the complexity of your project. One of the benefits of using a notebook like this as a coding environment is that you can preserve the process that you went through to develop the function, as well as the function itself.
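
For comparison, here is a sketch of a much smaller function whose whole purpose is to `return` a new object based on its input. The function name and sample data are hypothetical, and it makes one choice that the function above does not: it skips any piece that is empty after trimming.

```python
def label_pieces(pieces, base_name):
    """Return a list of (name, text) pairs, one for each non-empty piece."""
    labelled = []
    counter = 1
    for piece in pieces:
        text = piece.strip()
        if text:  # an empty string counts as False, so empty pieces are skipped
            labelled.append((base_name + "_letter" + str(counter), text))
            counter += 1
    return labelled


result = label_pieces(["  First letter\n", "", "Second letter"], "Chapter_I")
print(result)
```

Because this function only builds a list, you can look at the result, change your mind, and run it again without touching any files.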

----

In [12]:
# let's try out our new function on the next chapter:
get_letters("wikisource/The_Letters_of_Robert_Louis_Stevenson_Volume_1/Chapter_II.htm")

'Done!'

----

Now we can take another look at our csv: [all_letters.csv](all_letters.csv)

----

In [13]:
# Let's see what's in our directory of downloaded files...
print(os.listdir("wikisource/"))
print()
print(os.listdir("wikisource/The_Letters_of_Robert_Louis_Stevenson_Volume_1/"))
print()
print(os.listdir("wikisource/The_Letters_of_Robert_Louis_Stevenson_Volume_2/"))

['The_Letters_of_Robert_Louis_Stevenson_Volume_1', 'Icon\r', 'The_Letters_of_Robert_Louis_Stevenson_Volume_2', '.ipynb_checkpoints']

['Chapter_V.htm', 'Chapter_II.htm', 'Icon\r', 'Chapter_VI.htm', '.ipynb_checkpoints', 'Chapter_VII.htm', 'Chapter_I.htm', 'Chapter_IV.htm', 'Chapter_III.htm']

['Chapter_IX.htm', 'Icon\r', 'Chapter_XII.htm', 'Chapter_VIII.htm', 'Chapter_XI.htm', '.ipynb_checkpoints', 'Chapter_X.htm']


----

Here we're using a very handy function from the `os` module: `listdir`. This function returns the contents of a directory as a list in Python. We've already seen how we can loop through lists to perform the same action on each item in the list, so you can see how this function similarly lets you loop through files or directories on your computer, and act on each one. Python doesn't care if you have 5 files or 5000 files, so if you have a function that you can apply consistently to everything in a folder, you can scale up your operations really quickly.

As you can see, we're looking at three different folders: one parent folder, and two sub-folders. There are mostly relevant files there, but there are some extra files and folders mixed in. This is a good time to think about what we want to do conceptually, namely, go through all of the folders under the top level, then go through their contents and get all of the HTML files. We also want to preserve that path, so we know where to find the files later.

----

In [14]:
file_locations = []
# ^Set up an empty container for our locations

for folder in os.listdir("wikisource"):
    # ^Loop through whatever is in "wikisource" as the variable folder
    
    if os.path.isdir(os.path.join("wikisource",folder)):
        # ^If we have a "directory", or folder...
        
        for location in os.listdir(os.path.join("wikisource",folder)):
            # ^Loop through the file locations in that folder
            
            if location[-3:] == "htm":
                # ^If the location ends in "htm"...
                
                full_path = os.path.join("wikisource",folder,location)
                # ^Construct a full path for it
                
                file_locations.append(full_path)
                # ^Add that location to our list of locations

file_locations
# ^Show us what's in the container

['wikisource/The_Letters_of_Robert_Louis_Stevenson_Volume_1/Chapter_V.htm',
 'wikisource/The_Letters_of_Robert_Louis_Stevenson_Volume_1/Chapter_II.htm',
 'wikisource/The_Letters_of_Robert_Louis_Stevenson_Volume_1/Chapter_VI.htm',
 'wikisource/The_Letters_of_Robert_Louis_Stevenson_Volume_1/Chapter_VII.htm',
 'wikisource/The_Letters_of_Robert_Louis_Stevenson_Volume_1/Chapter_I.htm',
 'wikisource/The_Letters_of_Robert_Louis_Stevenson_Volume_1/Chapter_IV.htm',
 'wikisource/The_Letters_of_Robert_Louis_Stevenson_Volume_1/Chapter_III.htm',
 'wikisource/The_Letters_of_Robert_Louis_Stevenson_Volume_2/Chapter_IX.htm',
 'wikisource/The_Letters_of_Robert_Louis_Stevenson_Volume_2/Chapter_XII.htm',
 'wikisource/The_Letters_of_Robert_Louis_Stevenson_Volume_2/Chapter_VIII.htm',
 'wikisource/The_Letters_of_Robert_Louis_Stevenson_Volume_2/Chapter_XI.htm',
 'wikisource/The_Letters_of_Robert_Louis_Stevenson_Volume_2/Chapter_X.htm']

----

$\uparrow$ The function `os.path.isdir()` is a way to check whether a given path points to a folder (which often gets called a "directory" in programming) or to something else, like a file. It returns `True` for folders and `False` for everything else, so our loop only looks inside folders. Together with the check for names ending in "htm", this lets us prune our directory listing down to the list of files that we want.
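
The sketch below builds a small throwaway folder tree and applies the same kind of filtering as the loop above: keep only folders at the top level, then keep only the files inside them that end in `.htm`. All of the folder and file names are made up, and `tempfile` is used so that everything is deleted again afterwards.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as root:
    # Build a small made-up folder tree to search through
    volume = os.path.join(root, "Volume_1")
    os.mkdir(volume)
    for filename in ["Chapter_I.htm", "Chapter_II.htm", "notes.txt"]:
        with open(os.path.join(volume, filename), "w") as fp:
            fp.write("placeholder")
    # A stray file at the top level, which should be ignored
    with open(os.path.join(root, "stray.htm"), "w") as fp:
        fp.write("placeholder")

    found = []
    for folder in sorted(os.listdir(root)):
        folder_path = os.path.join(root, folder)
        print(folder, os.path.isdir(folder_path))
        if os.path.isdir(folder_path):
            for filename in sorted(os.listdir(folder_path)):
                if filename.endswith(".htm"):
                    found.append(os.path.join(folder, filename))

print(found)
```

Here `filename.endswith(".htm")` does the same job as comparing the last three characters, and `sorted` is only there so that the order is predictable.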

----

In [15]:
for location in file_locations:
    get_letters(location)
    print("done with "+location)

done with wikisource/The_Letters_of_Robert_Louis_Stevenson_Volume_1/Chapter_V.htm
done with wikisource/The_Letters_of_Robert_Louis_Stevenson_Volume_1/Chapter_II.htm
done with wikisource/The_Letters_of_Robert_Louis_Stevenson_Volume_1/Chapter_VI.htm
done with wikisource/The_Letters_of_Robert_Louis_Stevenson_Volume_1/Chapter_VII.htm
done with wikisource/The_Letters_of_Robert_Louis_Stevenson_Volume_1/Chapter_I.htm
done with wikisource/The_Letters_of_Robert_Louis_Stevenson_Volume_1/Chapter_IV.htm
done with wikisource/The_Letters_of_Robert_Louis_Stevenson_Volume_1/Chapter_III.htm
done with wikisource/The_Letters_of_Robert_Louis_Stevenson_Volume_2/Chapter_IX.htm
done with wikisource/The_Letters_of_Robert_Louis_Stevenson_Volume_2/Chapter_XII.htm
done with wikisource/The_Letters_of_Robert_Louis_Stevenson_Volume_2/Chapter_VIII.htm


IndexError: list index out of range

## Errors

Oh no! We've encountered an error!

If you get into programming, this will happen to you all the time. In fact, this isn't a contrived error, it's a real problem I ran into parsing these documents. Fortunately, Python gives good feedback on errors, once you know how to interpret them.

After the line of hyphens, the first thing you see is "`IndexError`", which is the kind of error that was encountered. Then there is a "stack trace", which goes through the code involved, sometimes through several layers of nested functions, to point out the problem. 

In this case, the error happened when we were pruning the document with an XPath Expression.

After the stack trace, you get the error again, with a longer description: "`list index out of range`". That gives us a clue as to where things were going wrong, since the list involved in that line was a list of elements returned by our XPath function. We wanted to get the first element returned by our XPath expression, so how could that cause a problem? Let's take a look.
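
For a sense of what this error means in general, here is a small sketch with a made-up list of three items. Asking for a position that doesn't exist raises the same `IndexError`, and the `try` and `except` keywords used here let us catch it and look at its message instead of stopping the program.

```python
chapters = ["Chapter_I", "Chapter_II", "Chapter_III"]

print(chapters[0])   # the first item
print(chapters[2])   # the third and last item

try:
    chapters[3]      # there is no fourth item
except IndexError as error:
    message = str(error)
    print("IndexError:", message)
```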

Because we've been printing out our progress, we know the last document to be completed was Chapter VIII. After that chapter in the list is Chapter XI (see the list we printed out after making it), so we can see what's going on with that entry.

In [16]:
ch11 = html.parse("wikisource/The_Letters_of_Robert_Louis_Stevenson_Volume_2/Chapter_XI.htm")
# Get the HTML of the document as an html object
text11 = ch11.xpath("//div[@class='prose']")
# Try out the XPath
text11
# Show the results

[]

Well, there's our problem. That XPath isn't returning any results. But why? Let's go to [the page](https://en.wikisource.org/wiki/The_Letters_of_Robert_Louis_Stevenson_Volume_2/Chapter_XI) and find out...

In [21]:
def get_letters(location):
    tree = html.parse(location)
    
    try:
        text_element = tree.xpath("//div[@class='prose']")[0]
    except IndexError:
        text_element = tree.xpath("//div[@class='mw-parser-output']")[0]
    # Our solution to that error is to use another Python convention: 
    # try stuff and then deal with errors that come up. In this case, we have a different XPath
    # for when things go wrong with the first one
    
    text_with_html = html.tostring(text_element,encoding="unicode")
    
    letters = text_with_html.split("<hr>")
    
    base_location = location.split('/')[-1]
    base_location = base_location.replace('.htm','')
    counter = 1
    fp = open('all_letters.csv','a')
    writer = csv.writer(fp,quoting=csv.QUOTE_ALL)
    for l in letters:
        letterName = base_location + "_letter" + str(counter)
        writer.writerow([letterName,l.strip(), location])
        counter += 1
    fp.close()
    return "Done!"

----
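
The function above handles the missing element with `try` and `except`. A different way of expressing the same fallback, sketched here on two made-up pages and assuming `lxml` is installed, is to try each XPath expression in turn and keep the first one that finds something. The helper name `first_match` is hypothetical, and this is an alternative to the approach above, not a replacement for it.

```python
from lxml import html


def first_match(tree, xpath_options):
    """Return the first element matched by any of the XPath expressions, tried in order."""
    for expression in xpath_options:
        matches = tree.xpath(expression)
        if matches:  # an empty list counts as False
            return matches[0]
    raise ValueError("None of the XPath expressions matched")


# Two made-up pages, each using a different class for its main text
page_a = html.fromstring("<html><body><div class='prose'><p>A</p></div></body></html>")
page_b = html.fromstring("<html><body><div class='mw-parser-output'><p>B</p></div></body></html>")

options = ["//div[@class='prose']", "//div[@class='mw-parser-output']"]
block_a = first_match(page_a, options)
block_b = first_match(page_b, options)
print(block_a.get("class"))
print(block_b.get("class"))
```

One advantage of a list of options is that a third page layout only means adding a third expression to the list.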

Now we need to clear out our CSV. This is a case where understanding how Python opens files can help us. Since we know that the `"w"` parameter when opening a file is used for new files, and creates a new, blank file with that name, we can just use that parameter when opening our file to clear it out while re-making our headers.

_If we didn't do this, we would just append all of our new rows to the existing file, duplicating some rows and messing up our dataset._
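
Here is a small sketch of that behavior on a throwaway file, so you can see the difference between `"w"` and `"a"` without risking the real CSV. The file name is made up. The sketch also uses `with`, which is a way of opening a file that closes it for you automatically at the end of the indented block.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, "demo.csv")  # hypothetical file name

    with open(path, "w") as fp:   # "w" creates the file (or empties an existing one)
        fp.write("header\n")
    with open(path, "a") as fp:   # "a" adds to the end of the file
        fp.write("row 1\n")
    with open(path, "a") as fp:
        fp.write("row 1\n")       # oops: the same row appended a second time

    with open(path, "r") as fp:
        before = fp.read()

    with open(path, "w") as fp:   # "w" again: the old contents are thrown away
        fp.write("header\n")
    with open(path, "r") as fp:
        after = fp.read()

print(repr(before))
print(repr(after))
```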

----

In [22]:
fp = open('all_letters.csv',"w")
writer = csv.writer(fp,quoting=csv.QUOTE_ALL)
writer.writerow(['filename','letter_text','file_location'])
fp.close()

In [23]:
# Now let's give our function another try...
for location in file_locations:
    get_letters(location)
    print("done with "+location)

done with wikisource/The_Letters_of_Robert_Louis_Stevenson_Volume_1/Chapter_V.htm
done with wikisource/The_Letters_of_Robert_Louis_Stevenson_Volume_1/Chapter_II.htm
done with wikisource/The_Letters_of_Robert_Louis_Stevenson_Volume_1/Chapter_VI.htm
done with wikisource/The_Letters_of_Robert_Louis_Stevenson_Volume_1/Chapter_VII.htm
done with wikisource/The_Letters_of_Robert_Louis_Stevenson_Volume_1/Chapter_I.htm
done with wikisource/The_Letters_of_Robert_Louis_Stevenson_Volume_1/Chapter_IV.htm
done with wikisource/The_Letters_of_Robert_Louis_Stevenson_Volume_1/Chapter_III.htm
done with wikisource/The_Letters_of_Robert_Louis_Stevenson_Volume_2/Chapter_IX.htm
done with wikisource/The_Letters_of_Robert_Louis_Stevenson_Volume_2/Chapter_XII.htm
done with wikisource/The_Letters_of_Robert_Louis_Stevenson_Volume_2/Chapter_VIII.htm
done with wikisource/The_Letters_of_Robert_Louis_Stevenson_Volume_2/Chapter_XI.htm
done with wikisource/The_Letters_of_Robert_Louis_Stevenson_Volume_2/Chapter_X.htm


In [24]:
fp = open('all_letters.csv','r')
# Open our CSV for reading

reader = csv.reader(fp)
# Create a CSV reader object that reads rows from the file

rowList = list(reader)
print(len(rowList))
fp.close()

463


----

There we have it! We've parsed the letters into 462 rows in a spreadsheet (plus a header row)! You can see how quick it was to go from 16 letters to 462, and when you're using a programming language for your work, you can anticipate similar scalability. What's more, in doing so we've created a record of exactly how we got here.
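
If you want to look inside the rows rather than just count them, the `csv` module also has a `DictReader`, which uses the header row to label each cell. The sketch below runs on a made-up, two-row stand-in for our file, held in memory with `io.StringIO`, rather than on `all_letters.csv` itself.

```python
import csv
import io

# A made-up, two-row stand-in for all_letters.csv
sample_csv = (
    '"filename","letter_text","file_location"\r\n'
    '"Chapter_I_letter1","<p>First letter, with a comma.</p>","wikisource/Chapter_I.htm"\r\n'
    '"Chapter_I_letter2","<p>Second letter.</p>","wikisource/Chapter_I.htm"\r\n'
)

rows = list(csv.DictReader(io.StringIO(sample_csv)))
print(len(rows))  # the header row is used for labels, so it isn't counted
print(rows[0]["filename"])
print(rows[0]["letter_text"])
```

Notice that the comma inside the first letter stays inside its cell, thanks to the quotation marks around every field.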

----

# In Summary...

We have now gone through our downloaded HTML files and separated out the letters. We've taken those letters and added them to a spreadsheet, so that we can extract information later, information such as the year each letter was sent and each letter's recipient.

Hopefully, we've also gained a better understanding of how Python specifically and programming languages more generally perform tasks like this. 

If you're not going to pursue Python yourself, this understanding can help you collaborate effectively with other scholars who program or developers that you may work with.

For those of you who already use Python or want to learn, I hope that you use the code here as a stepping stone to become more fluent in the language.

We've also put together a handout with some resources for learning Python that we like. There are a ton of resources out there, so we wanted to help you sort through the noise.