# Getting Our Data

## This Notebook

In this session, we will acquire the letters that we will use over the course of this workshop: the letters of Robert Louis Stevenson, made available through Wikisource in two volumes ([Vol 1](https://en.wikisource.org/wiki/The_Letters_of_Robert_Louis_Stevenson_Volume_1) & [Vol 2](https://en.wikisource.org/wiki/The_Letters_of_Robert_Louis_Stevenson_Volume_2)). Since there are only twelve pages that we need to download, we can simply download them manually by saving each page in our browsers. At larger scales, you may consider using a command line utility like wget (for which there is a [tutorial on The Programming Historian](https://programminghistorian.org/lessons/automated-downloading-with-wget)), or some additional Python (which also has a tutorial on [The Programming Historian](https://programminghistorian.org/lessons/downloading-multiple-records-using-query-strings)).
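
If you did want to automate the download, a first step would be generating the twelve page addresses. The sketch below only builds the URLs and matching local file names; it does not fetch anything. It assumes every chapter follows the same address pattern as the two chapter links used in this notebook (Roman-numeral chapter names, with chapters I to VII in Volume 1 and VIII to XII in Volume 2). `volume_for` and `chapter_url` are hypothetical helper names.

```python
# Sketch: build the Wikisource addresses for all twelve chapters.
# Assumption: every chapter page follows the ".../Volume_N/Chapter_ROMAN" pattern.
BASE = "https://en.wikisource.org/wiki/The_Letters_of_Robert_Louis_Stevenson_Volume_"
ROMAN = ["I", "II", "III", "IV", "V", "VI", "VII", "VIII", "IX", "X", "XI", "XII"]

def volume_for(chapter):
    # chapters 1-7 are in Volume 1, chapters 8-12 in Volume 2
    return 1 if chapter <= 7 else 2

def chapter_url(chapter):
    return BASE + str(volume_for(chapter)) + "/Chapter_" + ROMAN[chapter - 1]

pages = {}
for chapter in range(1, 13):
    local_name = "wikisource_vol" + str(volume_for(chapter)) + "_ch" + str(chapter) + ".htm"
    pages[local_name] = chapter_url(chapter)

print(len(pages))
print(pages["wikisource_vol1_ch1.htm"])
```

Each value in `pages` could then be handed to a download tool such as wget.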

In this tutorial, we are using a [Jupyter Notebook](http://jupyter.org/), which is a tool for interactive programming with the Python programming language. The notebook consists of a series of "cells". Some, like this one, are for taking notes. We call those "Markdown cells", because they use the [Markdown](https://daringfireball.net/projects/markdown/syntax) syntax. We also have code cells, where we can execute Python code and see the results.

Even if you've never seen Python code before, this session should help you form expectations for what you can do with HTML documents in Python. If you have worked with Python before, or you're interested in learning, you can use this code as an example for your own projects. Since this isn't a "Learn Python" workshop, there will be a lot of things that we'll gloss over in the interest of time, but if you're interested in learning Python, I'd recommend saving it and coming back to it in between tutorials to see how much of it you understand.

While we're going through this tutorial, take advantage of this interactive environment to modify pieces of code, so that you can see if the results are what you expect. Most of programming is learning how to align your expectations with what the code does, and much of that learning process can be done through experimentation.

## Our Data

Before we look at any code, however, let's look at what it is that we actually want to do. The goal is to transform pages on Wikisource that contain many letters into a series of documents, each containing a single letter.

Take a look at a single page ([Volume 1, Chapter 1](https://en.wikisource.org/wiki/The_Letters_of_Robert_Louis_Stevenson_Volume_1/Chapter_I) for example), and see what kind of useful patterns we might have to work with. Once we have an idea of what we want to do conceptually, we can put that into practice with Python code.

In [1]:
import os
import re
from lxml import html

## $\uparrow$ Importing Modules

One of the many joys of Python is that you almost never have to start entirely from scratch. Python is built to be extensible, and there are not only many modules to give you different kinds of functionality, but there is a robust infrastructure to acquire them.

If this notebook is our data workbench, then the modules that we import are the specialized tools that we bring to our specific project. Python comes with basic tools, so you can always count on your screwdrivers and hammers, but sometimes you need something more specialized, so you `import` those tools to keep them close at hand.

For now, we're importing three tools to work with:

|Command|Module Imported|Description|
|-------|------|:----------|
|`import os`|`os`|Module for interacting with the computer's file system. Useful for getting file locations and iterating through folders of files|
|`import re`|`re`|Module for using Regular Expressions, a language for making search and replace functions that are more flexible than you may be used to|
|`from lxml import html`|`html`|This command is different because we're not importing an entire module, we're importing a sub-module. `lxml` is a module for interacting with XML and XML-like documents, and `html` is a sub-module specifically for working with HTML documents|

In [2]:
ch1 = html.parse("wikisource/wikisource_vol1_ch1.htm")
ch1

<lxml.etree._ElementTree at 0x108464108>

----

$\uparrow$ We're doing a few things in that first line of code, so here's a breakdown:

ch1$^1$ = $^2$html$^3$.$^4$parse$^5$($^6$"wikisource/wikisource_vol1_ch1.htm"$^7$)$^8$

1. We are creating a variable, and calling it "ch1"
2. This single equals sign is what we use to assign a variable name to whatever is to its right
3. Here we are using the "html" module that we imported in the previous cell
4. This single period or "dot" is used to access attributes of an object, so what follows is an attribute of "html"
5. This is a function attribute of html called "parse"
6. This opening parenthesis indicates that the word before it is a function. Within the parentheses are the arguments of the function
7. This is the only argument we are giving to the function, the relative location of an html document. It is represented as a "string", which is what you call an object that you want to work with as text (in contrast to other data types, like numbers and lists). This is the page for Volume 1, Chapter 1 of the collection of letters, which you can see [here](https://en.wikisource.org/wiki/The_Letters_of_Robert_Louis_Stevenson_Volume_1/Chapter_I).
8. This parenthesis closes out the function arguments.

The second line just calls up the object we've created, so that we can see what it is. Some objects will output a nice, readable form of their contents when you call them in this way. In this case, what we get is a rather gnarly-looking line that just tells us what kind of thing it is (an "ElementTree").
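
The same ingredients (a variable name, an equals sign, dots to reach into a module, and parentheses holding an argument) show up in nearly every line of Python. As a small illustration that needs no files on disk, here is the same pattern with the `os` module instead of `html`. The variable names `folder` and `filename` are just examples.

```python
import os

# variable = module.attribute.function(argument)
folder = os.path.dirname("wikisource/wikisource_vol1_ch1.htm")
filename = os.path.basename("wikisource/wikisource_vol1_ch1.htm")

print(folder)    # wikisource
print(filename)  # wikisource_vol1_ch1.htm
```

Note that there are two dots here: `path` is an attribute of `os`, and `dirname` is a function attribute of `os.path`.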

----

In [3]:
print(html.tostring(ch1,encoding="unicode")[:1000])

<!DOCTYPE html>
<!-- saved from url=(0087)https://en.wikisource.org/wiki/The_Letters_of_Robert_Louis_Stevenson_Volume_1/Chapter_I --><html class="client-js ve-not-available" lang="en" dir="ltr"><head>

<title>The Letters of Robert Louis Stevenson Volume 1/Chapter I - Wikisource, the free online library</title>
<script>document.documentElement.className = document.documentElement.className.replace( /(^|\s)client-nojs(\s|$)/, "$1client-js$2" );</script>
<script>(window.RLQ=window.RLQ||[]).push(function(){mw.config.set({"wgCanonicalNamespace":"","wgCanonicalSpecialPageName":false,"wgNamespaceNumber":0,"wgPageName":"The_Letters_of_Robert_Louis_Stevenson_Volume_1/Chapter_I","wgTitle":"The Letters of Robert Louis Stevenson Volume 1/Chapter I","wgCurRevisionId":3751843,"wgRevisionId":3751843,"wgArticleId":39652,"wgIsArticle":true,"wgIsRedirect":false,"wgAction":"view","wgUserName":null,"wgUserGroups":["*"],"wgCategories":["Subpages"],"wgBreakFrames":false,"wgPageContentLanguage":"en","wgPageC

----

$\uparrow$ This is another look at what we've just made. We're using two different functions to get it, but don't worry about the syntax (unless you're really curious). Just know that this is the top of the document represented by `ch1`.

----

In [4]:
text = ch1.xpath("//div[@class='prose']")[0]
text

<Element div at 0x108565048>

----

$\uparrow$ In this cell, we're using an XPath selector to get a narrower selection of the document. Instead of the whole document, with the Wikisource header and sidebars and everything, we want just the central block with the text. The XPath selector that we're using as an argument does just that; it selects "`div`" elements with the class "prose". If you go to the [Wikisource page](https://en.wikisource.org/wiki/The_Letters_of_Robert_Louis_Stevenson_Volume_1/Chapter_I), right click on the text and click "Inspect", you'll be able to use your browser's developer tools to confirm this for yourself.

We're using a similar syntax to what we used in the previous cell. We're creating an object, this time based on the `ch1` object, and using the `xpath` function to do so. One difference in the syntax is at the end of the line, where we have `[0]`. When you see those square brackets after some object, it indicates that we're accessing some part of it. Here, we're accessing the first value in the list that the `xpath` function returns (since Python, like many other programming languages, starts counting from 0, not 1).
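
To see an XPath selection and the `[0]` index together on something small, here is a sketch on a made-up, well-formed snippet. So that it runs without extra installs, it uses Python's built-in `xml.etree.ElementTree` as a stand-in for `lxml`. That module only understands a subset of XPath (hence the leading `.`) and needs well-formed markup, so treat it as an illustration rather than a replacement for the notebook's approach.

```python
import xml.etree.ElementTree as ET

# A tiny, invented page: one sidebar div and one div holding the text
page = ET.fromstring(
    "<html><body>"
    "<div class='sidebar'>Navigation</div>"
    "<div class='prose'><p>Letter: EXAMPLE</p></div>"
    "</body></html>"
)

matches = page.findall(".//div[@class='prose']")  # always a list, even for one match
first = matches[0]                                # position 0 is the first item

print(len(matches))
print(first.find("p").text)
```

The query returns a list with a single element, and `[0]` pulls that element out of the list so we can work with it directly.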

----

In [5]:
texthtml = html.tostring(text,encoding="unicode")
print(texthtml[:1200])

<div class="prose">
<p>Letter: SPRING GROVE SCHOOL, 12TH NOVEMBER 1863.</p>
<p><br></p>
<p>MA CHERE MAMAN, - Jai recu votre lettre Aujourdhui et comme le jour prochaine est mon jour de naisance je vous ecrit ce lettre. Ma grande gatteaux est arrive il leve 12 livres et demi le prix etait 17 shillings. Sur la soiree de Monseigneur Faux il y etait quelques belles feux d'artifice. Mais les polissons entrent dans notre champ et nos feux d'artifice et handkerchiefs disappeared quickly, but we charged them out of the field. Je suis presque driven mad par une bruit terrible tous les garcons kik up comme grand un bruit qu'll est possible. I hope you will find your house at Mentone nice. I have been obliged to stop from writing by the want of a pen, but now I have one, so I will continue.</p>
<p>My dear papa, you told me to tell you whenever I was miserable. I do not feel well, and I wish to get home.</p>
<p>Do take me with you.</p>
<p>R. STEVENSON.</p>
<hr>
<p>Letter: 2 SULYARDE TERRACE, TORQU

----

$\uparrow$ Here we're transforming our text from an "Element" object to a much simpler data structure, a "string". Strings in Python are data objects that are treated as text, as opposed to numbers or code.

Now we have something quite useful. We have the text of all of the letters in a format that is very easy for Python to work with. This is the first part of breaking the longer text containing all of the letters into smaller chunks containing a single letter.

_Side note: We're also only displaying the first 1200 characters of text with the `[:1200]` addendum, so as not to take up too much space. Try out displaying more or less of the text by changing the number in the brackets._
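
Slicing works on any string, so you can experiment with it on a short one first. A quick sketch with an invented string:

```python
sample = "<p>Letter: SPRING GROVE SCHOOL</p>"

print(sample[:3])    # the first three characters
print(sample[3:10])  # characters at positions 3 up to (but not including) 10
print(sample[-4:])   # the last four characters
print(len(sample))   # how many characters there are in total
```

Leaving out the number before the colon means "start from the beginning", and leaving out the number after it means "go to the end".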

----

In [6]:
"Hello, goodbye".split(",")

['Hello', ' goodbye']

----

$\uparrow$ Here's a simplified example of what we want to do with our text. The letters in the output above are separated by horizontal lines (the `<hr>` tags), and we want to break up the document using those lines as breakpoints, so we need a Python function to do just that.

What we have here is a string, `"Hello, goodbye"`, and we're calling its "`split`" function attribute to break it apart at the comma: `","`.

In the output we see `['Hello', ' goodbye']`. The square brackets indicate that this is a `list` data type, which in Python is an ordered collection of objects. The bits of text in single quotes are `string`s, and they are separated by commas outside of the quotes, which tells us that they are separate items in this `list`.

_Try changing the first string and the argument given to `split`, to see how you can use the function in different ways and in different contexts_

----

In [7]:
letters = texthtml.split("<hr>")
print(letters[0])

<div class="prose">
<p>Letter: SPRING GROVE SCHOOL, 12TH NOVEMBER 1863.</p>
<p><br></p>
<p>MA CHERE MAMAN, - Jai recu votre lettre Aujourdhui et comme le jour prochaine est mon jour de naisance je vous ecrit ce lettre. Ma grande gatteaux est arrive il leve 12 livres et demi le prix etait 17 shillings. Sur la soiree de Monseigneur Faux il y etait quelques belles feux d'artifice. Mais les polissons entrent dans notre champ et nos feux d'artifice et handkerchiefs disappeared quickly, but we charged them out of the field. Je suis presque driven mad par une bruit terrible tous les garcons kik up comme grand un bruit qu'll est possible. I hope you will find your house at Mentone nice. I have been obliged to stop from writing by the want of a pen, but now I have one, so I will continue.</p>
<p>My dear papa, you told me to tell you whenever I was miserable. I do not feel well, and I wish to get home.</p>
<p>Do take me with you.</p>
<p>R. STEVENSON.</p>



----

$\uparrow$ We're using that same function on our larger document, but instead of breaking it up by commas, we're using `"<hr>"`, the HTML notation for those horizontal lines that separate letters. Then we're using the `print` function to display the first letter from that set of letters.
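
One caveat: splitting on the exact text `"<hr>"` only works if every divider is written exactly that way. The plain `split` did the job for this chapter, but depending on how a page was saved or serialized, dividers can also appear as `<hr/>` or `<hr />`. As a hedged sketch, the `re` module we imported earlier can split on all three spellings at once (the sample string is invented):

```python
import re

sample = "<p>First letter</p><hr><p>Second letter</p><hr /><p>Third letter</p><hr/><p>Fourth letter</p>"

# "<hr", then optional whitespace, then an optional slash, then ">"
pieces = re.split(r"<hr\s*/?>", sample)

print(len(pieces))
print(pieces[1])
```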

----

In [8]:
# len gets us the number of objects in a list or list-like object, 
# so this is the number of letters that we've just parsed out
len(letters)

16

In [9]:
counter = 1
# sets up a counter variable.
for l in letters:
    filename = "letters/wikisource_vol1_ch1_letter" + str(counter) + ".html"
    # sets up the filename, using a consistent folder and beginning for the file names, 
    # but adding in the number from the counter we've set up
    with open(filename, "w") as fp:
    # opens a file with the filename that we set up, and sets it up for writing as an object called fp
        fp.write(l)
        # writes the contents of our letter to the file that we've opened
    counter += 1
    # increments the counter for the next iteration of the loop.

----

$\uparrow$ In this cell, we're doing some iteration. The syntax of `for l in letters:` begins a __loop__ that performs the indented actions on every object in `letters`, with "`l`" as the stand-in for the individual objects.

_We are setting up a new variable called `filename`, but it's getting re-written in every iteration of the loop. You can always re-use the same name for a variable elsewhere in a Python script; it will just overwrite the old one. The new value doesn't even have to be the same kind of object, because Python just puts that label on the new object._
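
Counting by hand with `counter` works, but Python also has a built-in helper, `enumerate`, that hands you the number and the item together. The sketch below is an alternative, not what the notebook runs: it writes invented letters into a temporary folder, and it also zero-pads the number (`01`, `02`, ...), an optional tweak that keeps file names in numeric order when they are sorted alphabetically.

```python
import os
import tempfile

sample_letters = ["<p>first</p>", "<p>second</p>", "<p>third</p>"]  # invented stand-ins
out_dir = tempfile.mkdtemp()  # a throwaway folder so nothing real is overwritten

for number, letter in enumerate(sample_letters, start=1):
    # str(number).zfill(2) turns 1 into "01"; 10 stays "10"
    filename = os.path.join(out_dir, "example_letter" + str(number).zfill(2) + ".html")
    with open(filename, "w", encoding="utf-8") as fp:
        fp.write(letter)

print(sorted(os.listdir(out_dir)))
```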

----

In [10]:
os.listdir("letters")

['wikisource_vol1_ch1_letter1.html',
 'wikisource_vol1_ch1_letter10.html',
 'wikisource_vol1_ch1_letter11.html',
 'wikisource_vol1_ch1_letter12.html',
 'wikisource_vol1_ch1_letter13.html',
 'wikisource_vol1_ch1_letter14.html',
 'wikisource_vol1_ch1_letter15.html',
 'wikisource_vol1_ch1_letter16.html',
 'wikisource_vol1_ch1_letter2.html',
 'wikisource_vol1_ch1_letter3.html',
 'wikisource_vol1_ch1_letter4.html',
 'wikisource_vol1_ch1_letter5.html',
 'wikisource_vol1_ch1_letter6.html',
 'wikisource_vol1_ch1_letter7.html',
 'wikisource_vol1_ch1_letter8.html',
 'wikisource_vol1_ch1_letter9.html']

----

$\uparrow$ This is one of the handiest functions in the `os` module. `os.listdir()` returns a list of whatever is in the directory given to it as an argument. Here we can see all of the files that we've just created.
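
Notice that `letter10` is listed before `letter2`: the names are in alphabetical order, not numeric order. (`os.listdir()` itself makes no promise about order, so it is worth sorting explicitly if order matters.) Here is a minimal sketch of sorting by the number at the end of each name, using `re`. `letter_number` is a hypothetical helper.

```python
import re

names = [
    "wikisource_vol1_ch1_letter1.html",
    "wikisource_vol1_ch1_letter10.html",
    "wikisource_vol1_ch1_letter2.html",
]

def letter_number(name):
    # pull out the digits between "_letter" and ".html" and turn them into a number
    return int(re.search(r"_letter(\d+)\.html$", name).group(1))

print(sorted(names, key=letter_number))
```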

----

In [11]:
def get_letters(location):
    # define a function that always takes a single argument, which we'll call location throughout the function
    tree = html.parse(location)
    # Parse the HTML of the document at the location given
    text_element = tree.xpath("//div[@class='prose']")[0]
    # Get the contents of the document that we're interested in with XPath
    text_with_html = html.tostring(text_element,encoding="unicode")
    # Convert that text to a string so it's easy to work with
    letters = text_with_html.split("<hr>")
    # Break the document into sections, separated by horizontal lines
    base_location = location.split('/')[-1]
    # Set up a base filename based on the location given
    # e.g. "wikisource_vol1_ch1.htm"
    base_location = base_location.replace('.htm','')
    # Remove the ".htm" file extension from the base location
    counter = 1
    # Set up our counter
    for l in letters:
        # Begin iterating through letters
        new_location = "letters/" + base_location + "_letter" + str(counter) + ".html"
        # Set up the location where we'll save our new files
        with open(new_location, "w") as fp:
            # Open a file at that location for writing
            fp.write(l)
            # Write letter contents to the file
        counter += 1
        # Increment the counter
    return "Done!"
    # You don't have to, but it's customary to return something at the end of a function.
    # In our case, the function's purpose isn't to transform an object, it's to write objects to files
    # However, you can use "return" to have the function spit out a new object based on its input

----

$\uparrow$ Now we're preparing to scale up this operation with a function of our own. In this cell we're defining a new function that pulls together all of the steps of the previous cells, so that we can more easily parse additional files.

This process is common in using Python as a research tool. First, you figure out the process to do something once, then you make it easier to perform that process repeatedly, and gradually scale up the complexity of your project. One of the benefits of using a notebook like this as a coding environment is that you can preserve the process that you went through to develop the function, as well as the function itself.

----

In [12]:
# let's try out our new function on the next chapter:
get_letters("wikisource/wikisource_vol1_ch2.htm")

'Done!'

In [13]:
# And now let's see what our function has done...
os.listdir('letters')

['wikisource_vol1_ch1_letter1.html',
 'wikisource_vol1_ch1_letter10.html',
 'wikisource_vol1_ch1_letter11.html',
 'wikisource_vol1_ch1_letter12.html',
 'wikisource_vol1_ch1_letter13.html',
 'wikisource_vol1_ch1_letter14.html',
 'wikisource_vol1_ch1_letter15.html',
 'wikisource_vol1_ch1_letter16.html',
 'wikisource_vol1_ch1_letter2.html',
 'wikisource_vol1_ch1_letter3.html',
 'wikisource_vol1_ch1_letter4.html',
 'wikisource_vol1_ch1_letter5.html',
 'wikisource_vol1_ch1_letter6.html',
 'wikisource_vol1_ch1_letter7.html',
 'wikisource_vol1_ch1_letter8.html',
 'wikisource_vol1_ch1_letter9.html',
 'wikisource_vol1_ch2_letter1.html',
 'wikisource_vol1_ch2_letter10.html',
 'wikisource_vol1_ch2_letter11.html',
 'wikisource_vol1_ch2_letter12.html',
 'wikisource_vol1_ch2_letter13.html',
 'wikisource_vol1_ch2_letter14.html',
 'wikisource_vol1_ch2_letter15.html',
 'wikisource_vol1_ch2_letter16.html',
 'wikisource_vol1_ch2_letter17.html',
 'wikisource_vol1_ch2_letter18.html',
 'wikisource_vol1_ch2_

In [14]:
# Let's see what's in our directory of downloaded files...
os.listdir("wikisource/")

['Icon\r',
 'wikisource_vol1_ch1.htm',
 'wikisource_vol1_ch1_files',
 'wikisource_vol1_ch2.htm',
 'wikisource_vol1_ch2_files',
 'wikisource_vol1_ch3.htm',
 'wikisource_vol1_ch3_files',
 'wikisource_vol1_ch4.htm',
 'wikisource_vol1_ch4_files',
 'wikisource_vol1_ch5.htm',
 'wikisource_vol1_ch5_files',
 'wikisource_vol1_ch6.htm',
 'wikisource_vol1_ch6_files',
 'wikisource_vol1_ch7.htm',
 'wikisource_vol1_ch7_files',
 'wikisource_vol2_ch10.htm',
 'wikisource_vol2_ch10_files',
 'wikisource_vol2_ch11.htm',
 'wikisource_vol2_ch11_files',
 'wikisource_vol2_ch12.htm',
 'wikisource_vol2_ch12_files',
 'wikisource_vol2_ch8.htm',
 'wikisource_vol2_ch8_files',
 'wikisource_vol2_ch9.htm',
 'wikisource_vol2_ch9_files']

----

As you can see, we have a few issues in that list of files.

The first item is an artifact of Google Drive, an icon file. We'll need to get rid of that, but that isn't terribly difficult.

A more challenging task is removing all of those directories that end in "\_files". We could just check for that file ending, but let's use another function from `os` instead.

----

In [15]:
os.path.isfile('wikisource/wikisource_vol1_ch1_files/')

False

----

$\uparrow$ This function, `os.path.isfile()`, is a way to check if a given path points to a file, or something else. As you can see, for this path, which points to a folder (which sometimes gets called a directory in programming) the function returns `False`. We can use this function to prune our directory listing to the list of files that we want.
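
One thing to keep in mind is that `os.path.isfile()` also returns `False` for a path that doesn't exist at all, so a typo in a path looks the same as a folder. A small sketch in a temporary folder (all names invented) shows the different cases, along with the companion function `os.path.isdir()`:

```python
import os
import tempfile

base = tempfile.mkdtemp()  # a throwaway folder
folder_path = os.path.join(base, "example_files")
file_path = os.path.join(base, "example.htm")
missing_path = os.path.join(base, "missing.htm")

os.mkdir(folder_path)
with open(file_path, "w", encoding="utf-8") as fp:
    fp.write("<p>hello</p>")

print(os.path.isfile(file_path))     # a real file
print(os.path.isfile(folder_path))   # a folder, not a file
print(os.path.isdir(folder_path))    # isdir is the mirror-image check
print(os.path.isfile(missing_path))  # nothing there at all
```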

----

In [16]:
filelist = [f for f in os.listdir('wikisource') if os.path.isfile("wikisource/"+f) and f != 'Icon\r']
filelist

['wikisource_vol1_ch1.htm',
 'wikisource_vol1_ch2.htm',
 'wikisource_vol1_ch3.htm',
 'wikisource_vol1_ch4.htm',
 'wikisource_vol1_ch5.htm',
 'wikisource_vol1_ch6.htm',
 'wikisource_vol1_ch7.htm',
 'wikisource_vol2_ch10.htm',
 'wikisource_vol2_ch11.htm',
 'wikisource_vol2_ch12.htm',
 'wikisource_vol2_ch8.htm',
 'wikisource_vol2_ch9.htm']

----

$\uparrow$ That long line of code inside those square brackets is called a "list comprehension", and it's one of the cool features of Python. Essentially, it's a way of creating a new list by iterating through another list and performing some functions. Here, it's filtering out the folders and "Icon" file, and assigning the result to the variable "`filelist`".

Here's a more detailed breakdown of the list comprehension:

[f$^1$ for f in$^2$ os.listdir('wikisource')$^3$ if$^4$ os.path.isfile("wikisource/"+f)$^5$ and f != 'Icon\r'$^6$]

1. `f` is just the name of the variable that we're using to represent every item in the list, like we did with the `for` loop earlier. If we were making modifications to the list items, this is where we would do it. For example, if we had a list of numbers, we could put `f*2` here instead, to multiply every number by two.
2. `for f in ...` is the main syntax for a list comprehension. `[x*2 for x in numbers]` is a really simple list comprehension that multiplies everything in a list called `numbers` by two. It tells you what variable name it will use as a stand-in for list items, and which list it's using as the basis of the list comprehension.
3. Since `os.listdir('wikisource')` returns a list, we can use it like one without creating a variable for it and giving it its own name.
4. The `if` is an optional part of the list comprehension, that puts conditions on which list items to include in the output list. The items will only be transformed and added to the new list if they pass those tests.
5. First we check if our location is a file. We have to put "`wikisource/`" in front of the file name, since `os.listdir()` just returns names, not whole paths.
6. We have an additional test for list items, and the item must pass both to be included because we've connected the two conditions with `and`. This test is just to eliminate that pesky Icon file.
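
The pieces described above can be tried out on small, invented lists before pointing them at real folders. The last line also sketches the simpler "check the file ending" approach mentioned earlier:

```python
numbers = [1, 2, 3, 4]
doubled = [n * 2 for n in numbers]          # transform every item
evens = [n for n in numbers if n % 2 == 0]  # keep only some items

names = ["Icon\r", "wikisource_vol1_ch1.htm", "wikisource_vol1_ch1_files"]
pages = [name for name in names if name.endswith(".htm")]  # filter by file ending

print(doubled)
print(evens)
print(pages)
```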

After that, we have a list of all of the files, and only the files, in the folder containing our HTML files. Let's parse all of our files!

----

In [17]:
for file in filelist:
    get_letters("wikisource/"+file)
    print("done with "+file)

done with wikisource_vol1_ch1.htm
done with wikisource_vol1_ch2.htm
done with wikisource_vol1_ch3.htm
done with wikisource_vol1_ch4.htm
done with wikisource_vol1_ch5.htm
done with wikisource_vol1_ch6.htm
done with wikisource_vol1_ch7.htm
done with wikisource_vol2_ch10.htm


IndexError: list index out of range

## Errors

Oh no! We've encountered an error!

If you get into programming, this will happen to you all the time. In fact, this isn't a contrived error, it's a real problem I ran into parsing these documents. Fortunately, Python gives good feedback on errors, once you know how to interpret them.

After the line of hyphens, the first thing you see is "`IndexError`", which is the kind of error that was encountered. Then there is a "stack trace", which goes through the code involved, sometimes through several layers of nested functions, to point out the problem.

In this case, the error happened when we were pruning the document with an XPath Expression.

After the stack trace, you get the error again, with a longer description: "`list index out of range`". That gives us a clue as to where things were going wrong, since the list involved in that line was a list of elements returned by our XPath function. We wanted to get the first element returned by our XPath expression, so how could that cause a problem? Let's take a look.

In [18]:
ch11 = html.parse("wikisource/wikisource_vol2_ch11.htm")
# Get the HTML of the document as an html object
text11 = ch11.xpath("//div[@class='prose']")
# Try out the XPath
text11
# Show the results

[]

Well, there's our problem. That XPath isn't returning any results. But why? Let's go to [the page](https://en.wikisource.org/wiki/The_Letters_of_Robert_Louis_Stevenson_Volume_2/Chapter_XI) and find out...
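
Asking for item `[0]` of an empty list is exactly what raises this error. A tiny sketch that reproduces it without any files, and shows one way to check before indexing:

```python
results = []  # what an XPath query gives back when nothing matches

try:
    first = results[0]
except IndexError as error:
    first = None
    message = str(error)

print(message)

# An alternative to catching the error: test whether the list is empty first
if results:
    print("found something")
else:
    print("no matches")
```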

In [19]:
def get_letters(location):
    tree = html.parse(location)
    try:
        text_element = tree.xpath("//div[@class='prose']")[0]
    except IndexError:
        text_element = tree.xpath("//div[@class='mw-parser-output']")[0]
    # Our solution to that error is to use another Python convention: 
    # try stuff and then deal with errors that come up. In this case, we have a different XPath
    # for when things go wrong with the first one
    text_with_html = html.tostring(text_element,encoding="unicode")
    letters = text_with_html.split("<hr>")
    counter = 1
    base_location = location.split('/')[-1]
    base_location = base_location.replace('.htm','')
    for l in letters:
        with open("letters/" + base_location + "_letter" + str(counter) + ".html", "w") as fp:
            fp.write(l)
        counter += 1
    return 0
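
If more page layouts turn up, nesting `try`/`except` blocks gets unwieldy. One possible generalization, offered as a sketch rather than the notebook's method, is to loop over candidate selectors and take the first one that matches. To run without extra installs it uses the built-in `xml.etree.ElementTree` on an invented snippet as a stand-in for `lxml`. `first_match` and `CANDIDATES` are hypothetical names.

```python
import xml.etree.ElementTree as ET

CANDIDATES = [".//div[@class='prose']", ".//div[@class='mw-parser-output']"]

def first_match(tree, selectors):
    # try each selector in order and return the first element found
    for selector in selectors:
        found = tree.findall(selector)
        if found:
            return found[0]
    raise ValueError("none of the selectors matched: " + ", ".join(selectors))

# An invented page that only has the second kind of div
page = ET.fromstring(
    "<html><body><div class='mw-parser-output'><p>Letter: EXAMPLE</p></div></body></html>"
)

element = first_match(page, CANDIDATES)
print(element.get("class"))
```

Raising a `ValueError` with the list of selectors makes the failure easier to diagnose than a bare `IndexError` would be.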

In [20]:
# Let's give that another try...
for file in filelist:
    get_letters("wikisource/"+file)
    print("done with "+file)

done with wikisource_vol1_ch1.htm
done with wikisource_vol1_ch2.htm
done with wikisource_vol1_ch3.htm
done with wikisource_vol1_ch4.htm
done with wikisource_vol1_ch5.htm
done with wikisource_vol1_ch6.htm
done with wikisource_vol1_ch7.htm
done with wikisource_vol2_ch10.htm
done with wikisource_vol2_ch11.htm
done with wikisource_vol2_ch12.htm
done with wikisource_vol2_ch8.htm
done with wikisource_vol2_ch9.htm


In [21]:
os.listdir('letters')

['wikisource_vol1_ch1_letter1.html',
 'wikisource_vol1_ch1_letter10.html',
 'wikisource_vol1_ch1_letter11.html',
 'wikisource_vol1_ch1_letter12.html',
 'wikisource_vol1_ch1_letter13.html',
 'wikisource_vol1_ch1_letter14.html',
 'wikisource_vol1_ch1_letter15.html',
 'wikisource_vol1_ch1_letter16.html',
 'wikisource_vol1_ch1_letter2.html',
 'wikisource_vol1_ch1_letter3.html',
 'wikisource_vol1_ch1_letter4.html',
 'wikisource_vol1_ch1_letter5.html',
 'wikisource_vol1_ch1_letter6.html',
 'wikisource_vol1_ch1_letter7.html',
 'wikisource_vol1_ch1_letter8.html',
 'wikisource_vol1_ch1_letter9.html',
 'wikisource_vol1_ch2_letter1.html',
 'wikisource_vol1_ch2_letter10.html',
 'wikisource_vol1_ch2_letter11.html',
 'wikisource_vol1_ch2_letter12.html',
 'wikisource_vol1_ch2_letter13.html',
 'wikisource_vol1_ch2_letter14.html',
 'wikisource_vol1_ch2_letter15.html',
 'wikisource_vol1_ch2_letter16.html',
 'wikisource_vol1_ch2_letter17.html',
 'wikisource_vol1_ch2_letter18.html',
 'wikisource_vol1_ch2_

In [22]:
len(os.listdir('letters'))

462

There we have it! We've parsed the letters into 462 separate files! You can see how quick it was to go from 16 files to 462, and when you're using a programming language for your work, you can anticipate similar scalability. What's more, in doing so we've created a record of exactly how we got here.