# Parsing Our Data

In this notebook, we'll take the results of the last notebook, [Parse letters from pages.ipynb](Parse letters from pages.ipynb), and extract information on who letters are addressed to, as well as information on when the letters were sent.

To do that, we should first take a look at the structure of an individual letter, and see what kinds of patterns we can use to get that information.

In [1]:
import os
import re
from lxml import html

----

$\uparrow$ As before, we begin by importing `os`, `re`, and the `html` module from `lxml`. These are handy tools for extracting data from HTML documents, and I start a lot of my work with HTML data this way.

----

In [2]:
test_letter = html.parse('letters/wikisource_vol1_ch1_letter1.html')
test_letter

<lxml.etree._ElementTree at 0x105d23048>

----

$\uparrow$ Again we parse the letter into an `ElementTree` object, this time called `test_letter`.

----

In [3]:
html.tostring(test_letter,encoding='unicode')

'<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">\n<html><body><div class="prose">\n<p>Letter: SPRING GROVE SCHOOL, 12TH NOVEMBER 1863.</p>\n<p><br></p>\n<p>MA CHERE MAMAN, - Jai recu votre lettre Aujourdhui et comme le jour prochaine est mon jour de naisance je vous ecrit ce lettre. Ma grande gatteaux est arrive il leve 12 livres et demi le prix etait 17 shillings. Sur la soiree de Monseigneur Faux il y etait quelques belles feux d\'artifice. Mais les polissons entrent dans notre champ et nos feux d\'artifice et handkerchiefs disappeared quickly, but we charged them out of the field. Je suis presque driven mad par une bruit terrible tous les garcons kik up comme grand un bruit qu\'ll est possible. I hope you will find your house at Mentone nice. I have been obliged to stop from writing by the want of a pen, but now I have one, so I will continue.</p>\n<p>My dear papa, you told me to tell you whenever I was miserable. I do not f

----

$\uparrow$ Let's take a look at what we have here...

Well, it's not very easy to read, all stuck together like that. And there's definitely some extra stuff in there, like the `DOCTYPE` declaration at the beginning. Let's just get the paragraphs of text with some XPath.

----

In [5]:
test_letter.xpath("//p")

[<Element p at 0x10b26d4a8>,
 <Element p at 0x10b8fd318>,
 <Element p at 0x10b8fdd18>,
 <Element p at 0x10b8fdd68>,
 <Element p at 0x10b8fddb8>,
 <Element p at 0x10b8fde08>]

In [5]:
for p in test_letter.xpath('//p'):
    print(p.text_content())

Letter: SPRING GROVE SCHOOL, 12TH NOVEMBER 1863.

MA CHERE MAMAN, - Jai recu votre lettre Aujourdhui et comme le jour prochaine est mon jour de naisance je vous ecrit ce lettre. Ma grande gatteaux est arrive il leve 12 livres et demi le prix etait 17 shillings. Sur la soiree de Monseigneur Faux il y etait quelques belles feux d'artifice. Mais les polissons entrent dans notre champ et nos feux d'artifice et handkerchiefs disappeared quickly, but we charged them out of the field. Je suis presque driven mad par une bruit terrible tous les garcons kik up comme grand un bruit qu'll est possible. I hope you will find your house at Mentone nice. I have been obliged to stop from writing by the want of a pen, but now I have one, so I will continue.
My dear papa, you told me to tell you whenever I was miserable. I do not feel well, and I wish to get home.
Do take me with you.
R. STEVENSON.


----

$\uparrow$ `.text_content()` is a method of `lxml.html` elements that extracts the text those elements contain, dropping any HTML markup from child elements. We've set up a simple loop to go through all of the elements that the XPath returns and print their text content.
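
For readers without `lxml` at hand, the same idea can be sketched with the standard library. This is only an analogy under stated assumptions: `xml.etree.ElementTree` needs well-formed markup (unlike `lxml.html`), joining `itertext()` stands in for `.text_content()`, and the sample markup is made up for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed sample markup (not one of the real letter files).
sample = "<div><p>Do take <b>me</b> with you.</p><p>R. STEVENSON.</p></div>"
root = ET.fromstring(sample)

# Joining itertext() gathers the text of an element and all of its children,
# which is roughly what lxml's .text_content() does.
paragraphs = ["".join(p.itertext()) for p in root.findall(".//p")]
for text in paragraphs:
    print(text)
```

The `<b>` tag disappears from the output, while the word inside it stays in place.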

----

In [6]:
test_letter_text = "\n".join([p.text_content() for p in test_letter.xpath('//p')])
print(test_letter_text)

Letter: SPRING GROVE SCHOOL, 12TH NOVEMBER 1863.

MA CHERE MAMAN, - Jai recu votre lettre Aujourdhui et comme le jour prochaine est mon jour de naisance je vous ecrit ce lettre. Ma grande gatteaux est arrive il leve 12 livres et demi le prix etait 17 shillings. Sur la soiree de Monseigneur Faux il y etait quelques belles feux d'artifice. Mais les polissons entrent dans notre champ et nos feux d'artifice et handkerchiefs disappeared quickly, but we charged them out of the field. Je suis presque driven mad par une bruit terrible tous les garcons kik up comme grand un bruit qu'll est possible. I hope you will find your house at Mentone nice. I have been obliged to stop from writing by the want of a pen, but now I have one, so I will continue.
My dear papa, you told me to tell you whenever I was miserable. I do not feel well, and I wish to get home.
Do take me with you.
R. STEVENSON.


----

Printing that text is all well and good, but ultimately we want to be able to work with our data in a native Python data structure, like a string. So here we have a pretty intense one-liner. Let's break it down:

test_letter_text = "\n".join([p.text_content() for p in test_letter.xpath('//p')])

1. We're assigning the result to a variable called `test_letter_text`.
2. We're using the `.join()` method that every string has. This method uses the string it is called on as the connector between the items of a list. In this case, our string is a newline character, represented by `\n`, and our list is defined by the list comprehension that follows.
3. In this list comprehension, we take the text content of every paragraph element in our list.
4. Remember that `[x for x in list]` is our basic list comprehension syntax; here the comprehension calls every list item `p`.
5. Our list in this case is the result of `test_letter.xpath('//p')`, which does in fact return a list of elements. Coming full circle, that means every item represented by `p` in our list comprehension is a paragraph element, whose text content we grab.

So, in that rather long one-liner, we've made a string that joins all of the paragraphs in our test letter with newline characters. Then we `print` the result, and see that we do have the nicely formatted string we want.
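
The two halves of that one-liner can be tried separately on ordinary strings. In this minimal sketch, the hypothetical `paragraph_texts` list stands in for the values that `p.text_content()` would return, and `.strip()` stands in for whatever we do to each item.

```python
# Plain strings stand in for the values of p.text_content().
paragraph_texts = ["  Do take me with you. ", "R. STEVENSON.\t"]

# The list comprehension builds a new list, one item per original item.
cleaned = [text.strip() for text in paragraph_texts]

# "\n".join() glues the items together, with a newline between each pair.
joined = "\n".join(cleaned)
print(joined)
```

Note that the connector only appears between items, so the joined string does not end with a newline.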

----

In [7]:
def get_text(location):
    text_html = html.parse(location)
    test_letter_text = "\n".join([p.text_content() for p in text_html.xpath('//p')])
    return test_letter_text

----

Since we have a useful operation that we'll probably want to do a lot, let's turn it into a function. We give the function a location, and it parses the file there as an HTML document, then glues together all of the paragraphs inside it with newline characters, and returns the result.

Let's try it on another document.

----

In [10]:
get_text('letters/wikisource_vol1_ch1_letter11.html')

"Letter: TO MRS. THOMAS STEVENSON\n\nBRUSSELS, THURSDAY, 25TH JULY 1872.\nMY DEAR MOTHER, - I am here at last, sitting in my room, without coat or waistcoat, and with both window and door open, and yet perspiring like a terra-cotta jug or a Gruyere cheese.\nWe had a very good passage, which we certainly deserved, in compensation for having to sleep on cabin floor, and finding absolutely nothing fit for human food in the whole filthy embarkation. We made up for lost time by sleeping on deck a good part of the forenoon. When I woke, Simpson was still sleeping the sleep of the just, on a coil of ropes and (as appeared afterwards) his own hat; so I got a bottle of Bass and a pipe and laid hold of an old Frenchman of somewhat filthy aspect (FIAT EXPERIMENTUM IN CORPORE VILI) to try my French upon. I made very heavy weather of it. The Frenchman had a very pretty young wife; but my French always deserted me entirely when I had to answer her, and so she soon drew away and left me to her lord, 

----

You'll note that the newline characters are shown as `\n` unless we `print` the string, which formats things more nicely. Other than that, it looks like we've got the text of a letter.
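
A small sketch of that difference, using the opening of the letter shown above as the sample string:

```python
text = "Letter: TO MRS. THOMAS STEVENSON\n\nBRUSSELS, THURSDAY, 25TH JULY 1872."

# repr() is what the notebook shows for a bare expression: escapes stay visible.
print(repr(text))

# print() writes the string itself, so each \n becomes a real line break.
print(text)

# Either way, the string holds the same three lines (the middle one is empty).
lines = text.split("\n")
```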

Now let's think about how to extract the people those letters are addressed to...

----

In [8]:
re.findall(r'Letter: ([\w ]+)',test_letter_text)

['SPRING GROVE SCHOOL']

----

Using the pattern that we've found (each letter opens with `Letter: ` followed by a name or place), we can write a regular expression to extract those values. There are a number of tools online to test regular expressions, but one that I like is [RegExr](https://regexr.com/), since it explains the syntax to you.

Take a look at what's going on behind this function call [here](http://regexr.com/3gtks).

_Regular expressions are a powerful tool, but we won't be getting too far into the actual syntax here, instead focusing on the ideas behind the syntax. For a good introduction to regular expressions, I'd recommend [RegexOne](https://regexone.com/), which is interactive, yet simple and quick._
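
As a quick offline check of the idea, here is the same pattern run against the first line of our test letter, with comments on what each piece is doing:

```python
import re

first_line = "Letter: SPRING GROVE SCHOOL, 12TH NOVEMBER 1863."

# [\w ]+ matches a run of word characters (letters, digits, underscore)
# and spaces, so the match stops at the first comma.
# Because the pattern has one capture group, findall returns only that group,
# leaving out the literal "Letter: " prefix.
matches = re.findall(r'Letter: ([\w ]+)', first_line)
print(matches)
```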

Since that got us something reasonable, let's try another letter.

----

In [11]:
# Using the results of get_text as the text to search
re.findall(r'Letter: ([\w ]+)',get_text('letters/wikisource_vol1_ch1_letter2.html'))

['2 SULYARDE TERRACE']

----

That one looks good too. I think we're on to something! Let's try it on another letter.

----

In [12]:
re.findall(r'Letter: ([\w ]+)',get_text('letters/wikisource_vol1_ch1_letter7.html'))

['TO MRS']

----

Okay, that looks incomplete. What does that letter look like?

----

In [14]:
print(get_text('letters/wikisource_vol1_ch1_letter7.html'))

Letter: TO MRS. CHURCHILL BABINGTON

[SWANSTON COTTAGE, LOTHIANBURN, SUMMER 1871.]
MY DEAR MAUD, - If you have forgotten the hand-writing - as is like enough - you will find the name of a former correspondent (don't know how to spell that word) at the end. I have begun to write to you before now, but always stuck somehow, and left it to drown in a drawerful of like fiascos. This time I am determined to carry through, though I have nothing specially to say.
We look fairly like summer this morning; the trees are blackening out of their spring greens; the warmer suns have melted the hoarfrost of daisies of the paddock; and the blackbird, I fear, already beginning to 'stint his pipe of mellower days' - which is very apposite (I can't spell anything to-day - ONE p or TWO?) and pretty. All the same, we have been having shocking weather - cold winds and grey skies.
I have been reading heaps of nice books; but I can't go back so far. I am reading Clarendon's HIST. REBELL. at present, with whic

----

Okay, it looks like it's not just letters and spaces that we want to capture; sometimes there are periods too. Let's add the period to the characters that we're looking for.

----

In [15]:
re.findall(r'Letter: ([\w .]+)',get_text('letters/wikisource_vol1_ch1_letter7.html'))

['TO MRS. CHURCHILL BABINGTON']

----

The result still starts with "TO ". Let's account for that prefix and make it optional in our regular expression, without capturing it.

----

In [12]:
re.findall(r'Letter: (?:TO )?([\w .]+)',get_text('letters/wikisource_vol1_ch1_letter7.html'))

['MRS. CHURCHILL BABINGTON']

----

_See this regular expression in action [here](http://regexr.com/3gtl2)._
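
The `(?:TO )?` piece does two jobs, and a short comparison may make both visible. This is a sketch on header lines taken from the letters above; the variable names are only for illustration.

```python
import re

header = "Letter: TO MRS. CHURCHILL BABINGTON"

# With a capturing group around "TO ", findall returns a tuple per match.
with_capture = re.findall(r'Letter: (TO )?([\w .]+)', header)

# (?:...) groups without capturing, so only the name comes back.
without_capture = re.findall(r'Letter: (?:TO )?([\w .]+)', header)

# The trailing ? makes the prefix optional, so headers without it still match.
no_prefix = re.findall(r'Letter: (?:TO )?([\w .]+)', "Letter: 2 SULYARDE TERRACE")

print(with_capture, without_capture, no_prefix)
```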

----

In [17]:
def get_addressee(location):
    text = get_text(location)
    addressee = re.findall(r'Letter: (?:TO )?([\w .]+)',text)
    return addressee

----

Since we have another useful action that we'll want to repeat, we've defined a function for it, so we don't have to keep typing it out. Now it's easier to use the function on all of the documents in our dataset, and look at the results.

----

In [18]:
for f in os.listdir('letters'):
    print(f,get_addressee('letters/'+f))

wikisource_vol1_ch1_letter1.html ['SPRING GROVE SCHOOL']
wikisource_vol1_ch1_letter10.html ['CHARLES BAXTER']
wikisource_vol1_ch1_letter11.html ['MRS. THOMAS STEVENSON']
wikisource_vol1_ch1_letter12.html ['MRS. THOMAS STEVENSON']
wikisource_vol1_ch1_letter13.html ['MRS. THOMAS STEVENSON']
wikisource_vol1_ch1_letter14.html ['THOMAS STEVENSON']
wikisource_vol1_ch1_letter15.html ['MRS. THOMAS STEVENSON']
wikisource_vol1_ch1_letter16.html ['CHARLES BAXTER']
wikisource_vol1_ch1_letter2.html ['2 SULYARDE TERRACE']
wikisource_vol1_ch1_letter3.html ['MRS. THOMAS STEVENSON']
wikisource_vol1_ch1_letter4.html ['MRS. THOMAS STEVENSON']
wikisource_vol1_ch1_letter5.html ['MRS. THOMAS STEVENSON']
wikisource_vol1_ch1_letter6.html ['MRS. THOMAS STEVENSON']
wikisource_vol1_ch1_letter7.html ['MRS. CHURCHILL BABINGTON']
wikisource_vol1_ch1_letter8.html ['ALISON CUNNINGHAM']
wikisource_vol1_ch1_letter9.html ['CHARLES BAXTER']
wikisource_vol1_ch2_letter1.html ['MRS. THOMAS STEVENSON']
wikisource_vol1_ch2_le

----

If we scroll through this long list of findings, there are some odd things. One of our letters doesn't seem to have an addressee, another looks like it got cut off, and one more still has that "To" that we wanted to take out. Let's take a look at those.

----

In [15]:
print(get_text('letters/wikisource_vol1_ch5_letter28.html'))

THE COTTAGE, CASTLETON OF BRAEMAR, AUGUST 19, 1881.
IF you had an uncle who was a sea captain and went to the North Pole, you had better bring his outfit. VERBUM SAPIENTIBUS. I look towards you.
R. L. STEVENSON.


In [16]:
print(get_text('letters/wikisource_vol1_ch6_letter1.html'))


Letter: TO THE EDITOR OF THE 'NEW YORK TRIBUNE'

TERMINUS HOTEL, MARSEILLES, OCTOBER 16, 1882.
SIR, - It has come to my ears that you have lent the authority of your columns to an error.
More than half in pleasantry - and I now think the pleasantry ill- judged - I complained in a note to my NEW ARABIAN NIGHTS that some one, who shall remain nameless for me, had borrowed the idea of a story from one of mine. As if I had not borrowed the ideas of the half of my own! As if any one who had written a story ill had a right to complain of any other who should have written it better! I am indeed thoroughly ashamed of the note, and of the principle which it implies.
But it is no mere abstract penitence which leads me to beg a corner of your paper - it is the desire to defend the honour of a man of letters equally known in America and England, of a man who could afford to lend to me and yet be none the poorer; and who, if he would so far condescend, has my free permission to borrow from me all 

In [21]:
print(get_text('letters/wikisource_vol2_ch9_letter45.html'))

Letter: To HOMER ST. GAUDENS

MANASQUAN, NEW JERSEY, 27TH MAY 1888.
DEAR HOMER ST. GAUDENS, - Your father has brought you this day to see me, and he tells me it is his hope you may remember the occasion. I am going to do what I can to carry out his wish; and it may amuse you, years after, to see this little scrap of paper and to read what I write. I must begin by testifying that you yourself took no interest whatever in the introduction, and in the most proper spirit displayed a single-minded ambition to get back to play, and this I thought an excellent and admirable point in your character. You were also (I use the past tense, with a view to the time when you shall read, rather than to that when I am writing) a very pretty boy, and (to my European views) startlingly self-possessed. My time of observation was so limited that you must pardon me if I can say no more: what else I marked, what restlessness of foot and hand, what graceful clumsiness, what experimental designs upon the furni

----

Well, the first one is an accurate representation of that letter: it really doesn't have an addressee, so that's not a problem. However, it looks like we'll need to add a character, the apostrophe, to our search, because of a letter where Stevenson is writing to the editor of the 'New York Tribune'. We'll also need to account for some inconsistent capitalization in that "TO".

When we define a function with the same name as one we've already defined, the new definition replaces the old one. This is convenient for us now, but be careful not to overwrite built-in Python functions, like `len()` for example.
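
A tiny sketch of that replacement behaviour, using a hypothetical function name (`describe`) so that nothing real gets overwritten:

```python
def describe():
    return "first version"

# Defining the same name again replaces the earlier function object.
def describe():
    return "second version"

result = describe()
print(result)
```

The same thing would happen if we named a function `len`: from then on, `len` would refer to our function instead of the built-in.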

----

In [24]:
def get_addressee(location):
    text = get_text(location)
    try:
        # Here's a try/except block to account for times where there isn't an addressee,
        # while still returning only a string, not a list of strings.
        addressee = re.findall(r'Letter: (?:T[Oo] )?([\w .\']+)',text)[0]
        return addressee
    except IndexError:
        return ""

----

_See this regular expression in action [here](http://regexr.com/3gtlb)._
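
The `try`/`except` block is needed because indexing an empty list with `[0]` raises an `IndexError`. One alternative, shown here only as a sketch, is `re.search`, which returns `None` when nothing matches. The helper name `first_addressee` is hypothetical, and it takes the letter text directly rather than a file location.

```python
import re

def first_addressee(text):
    # Same pattern as get_addressee, applied to a string instead of a file.
    match = re.search(r"Letter: (?:T[Oo] )?([\w .']+)", text)
    # re.search returns None when nothing matches, so no exception is raised.
    return match.group(1) if match else ""

print(first_addressee("Letter: TO THE EDITOR OF THE 'NEW YORK TRIBUNE'"))
print(first_addressee("Letter: To HOMER ST. GAUDENS\n\nMANASQUAN, NEW JERSEY"))
print(repr(first_addressee("THE COTTAGE, CASTLETON OF BRAEMAR, AUGUST 19, 1881.")))
```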

----

In [25]:
# Let's give it another spot check...
for f in os.listdir('letters'):
    print(f,get_addressee('letters/'+f))

wikisource_vol1_ch1_letter1.html SPRING GROVE SCHOOL
wikisource_vol1_ch1_letter10.html CHARLES BAXTER
wikisource_vol1_ch1_letter11.html MRS. THOMAS STEVENSON
wikisource_vol1_ch1_letter12.html MRS. THOMAS STEVENSON
wikisource_vol1_ch1_letter13.html MRS. THOMAS STEVENSON
wikisource_vol1_ch1_letter14.html THOMAS STEVENSON
wikisource_vol1_ch1_letter15.html MRS. THOMAS STEVENSON
wikisource_vol1_ch1_letter16.html CHARLES BAXTER
wikisource_vol1_ch1_letter2.html 2 SULYARDE TERRACE
wikisource_vol1_ch1_letter3.html MRS. THOMAS STEVENSON
wikisource_vol1_ch1_letter4.html MRS. THOMAS STEVENSON
wikisource_vol1_ch1_letter5.html MRS. THOMAS STEVENSON
wikisource_vol1_ch1_letter6.html MRS. THOMAS STEVENSON
wikisource_vol1_ch1_letter7.html MRS. CHURCHILL BABINGTON
wikisource_vol1_ch1_letter8.html ALISON CUNNINGHAM
wikisource_vol1_ch1_letter9.html CHARLES BAXTER
wikisource_vol1_ch2_letter1.html MRS. THOMAS STEVENSON
wikisource_vol1_ch2_letter10.html MRS. THOMAS STEVENSON
wikisource_vol1_ch2_letter11.html 

----

Looks good! We've successfully created a function to extract the addressees from these letters.

----

# Dates

Let's try getting the date out with some regular expressions.

I've already looked through the letters, and I've found that the ways that Stevenson records dates vary pretty wildly. We won't be able to use a single regular expression, and we're likely to run into lots of exceptions, so we'll want to be able to iterate on the regular expression that we use quickly.

One pattern in how dates are recorded is that they always contain a year, either as a four-digit year or abbreviated, like "'80" for 1880. Since there's that very general pattern, we can extract the lines containing the dates with one regular expression, so that we have a large set of date formats to test against.

_In case it's not obvious, it's much easier to get consistent dates from strings that contain only dates than from lines containing dates. Right now we're making things easier on ourselves when we get to data cleanup._
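
Here is a minimal sketch of how one regular expression can pull out the whole line that contains a year. The first sample is the opening of a letter shown earlier; the second date line is made up to show the abbreviated form.

```python
import re

text = (
    "Letter: TO MRS. THOMAS STEVENSON\n\n"
    "BRUSSELS, THURSDAY, 25TH JULY 1872.\n"
    "MY DEAR MOTHER, - I am here at last."
)

# By default "." matches anything except a newline, so the ".*" on either
# side of the year widens the match to the whole line and no further.
date_lines = re.findall(r".*(?:\d{4}|'\d{2}).*", text)
print(date_lines)

# Hypothetical line with an abbreviated year, matched by the '\d{2} branch.
short_year = re.findall(r".*(?:\d{4}|'\d{2}).*", "BOURNEMOUTH, APRIL 17, '94.")
print(short_year)
```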

In [114]:
for f in os.listdir('letters/'):
    text = get_text(os.path.join('letters',f))
    try:
        print(re.findall(r".*(?:\d{4}|'\d{2}).*",text)[0])
    except IndexError:
        print()

Letter: SPRING GROVE SCHOOL, 12TH NOVEMBER 1863.
DUNBLANE, TUESDAY, 9TH APRIL 1872.
BRUSSELS, THURSDAY, 25TH JULY 1872.
HOTEL LANDSBERG, FRANKFURT, MONDAY, 29TH JULY 1872.
HOTEL LANDSBERG, THURSDAY, 1ST AUGUST 1872.
FRANKFURT, ROSENGASSE 13, AUGUST 4, 1872.
13 ROSENGASSE, FRANKFURT, TUESDAY MORNING, AUGUST 1872.
17 HERIOT ROW, EDINBURGH, SUNDAY, FEBRUARY 2, 1873.
Letter: 2 SULYARDE TERRACE, TORQUAY, THURSDAY (APRIL 1866).
WICK, FRIDAY, SEPTEMBER 11, 1868.
WICK, September 5, 1868. MONDAY.
WICK, SEPTEMBER 1868. SATURDAY, 10 A.M.
PULTENEY, WICK, SUNDAY, SEPTEMBER 1868.
[SWANSTON COTTAGE, LOTHIANBURN, SUMMER 1871.]
1871?
DUNBLANE, FRIDAY, 5TH MARCH 1872.
COCKFIELD RECTORY, SUDBURY, SUFFOLK, TUESDAY, JULY 28, 1873.
MENTONE, JANUARY 7, 1874.
MENTONE, TUESDAY, 13TH JANUARY 1874.
[MENTONE, JANUARY 1874.]
[MENTONE, MARCH 28, 1874.]
[SWANSTON], MAY 1874, MONDAY.
SWANSTON, WEDNESDAY, MAY 1874.
TRAIN BETWEEN EDINBURGH AND CHESTER, AUGUST 8, 1874.
SWANSTON, WEDNESDAY, [AUTUMN] 1874.
[EDINBURGH], DE

In [193]:
def get_date(location):
    text = get_text(location)
    patterns = [
        r"NOVEMBER 20 OR 21, 1887",
        r"[A-Z]+[\.]? \d{1,2}(?:[THNDSTR]{2})?[,!]{0,2} [\[\(]?(?:\d{4}|'\d{2})(?:-\d+)?",
        r"(?:\d+[THNDSTR]{2} )?[A-Z]{3,}[,\.\]]? [\[]?(?:\d{4}|'\d{2})(?:-\d+)?",
        r"JAN. SOMETHINGOROTHER-TH, 1886",
        r"LAST SUNDAY OF '83",
        r"FEBRUAR DEN 3EN 1890",
        r"APRIL 15 OR 16 \(THE HOUR NOT BEING KNOWN\), 1886",
        r"\d{4}",
        r"'\d{2}"
    ]
    for p in patterns:
        try:
            date = re.findall(p,text,flags=re.IGNORECASE)[0]
            return date
        except IndexError:
            pass
    return None
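
Because `get_date` returns as soon as one pattern matches, the order of the list matters: specific patterns need to come before general fallbacks like `\d{4}`. The sketch below illustrates this with a hypothetical helper, `first_match`, and two simplified patterns that are not the ones used above.

```python
import re

def first_match(patterns, text):
    # Hypothetical helper: try each pattern in order, return the first hit.
    for pattern in patterns:
        found = re.search(pattern, text, flags=re.IGNORECASE)
        if found:
            return found.group(0)
    return None

specific = r"\d{1,2}(?:ST|ND|RD|TH) [A-Z]+ \d{4}"  # e.g. 5TH MARCH 1872
fallback = r"\d{4}"                                 # a bare year

full = first_match([specific, fallback], "DUNBLANE, FRIDAY, 5TH MARCH 1872.")
bare = first_match([specific, fallback], "1871?")
# With the order swapped, the fallback wins and the day and month are lost.
swapped = first_match([fallback, specific], "DUNBLANE, FRIDAY, 5TH MARCH 1872.")
print(full, bare, swapped)
```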

In [194]:
no_matches = []
for file in os.listdir('letters'):
    if get_date('letters/'+file) is None:
        no_matches.append(file)
print(len(no_matches))
print(no_matches)

1
['wikisource_vol2_ch11_letter36.html']


In [195]:
print(get_text("letters/wikisource_vol2_ch8_letter7.html"))

Letter: TO W. H. LOW

SKERRYVORE, BOURNEMOUTH, JAN. SOMETHINGOROTHER-TH, 1886.
MY DEAR LOW, - I send you two photographs: they are both done by Sir Percy Shelley, the poet's son, which may interest. The sitting down one is, I think, the best; but if they choose that, see that the little reflected light on the nose does not give me a turn-up; that would be tragic. Don't forget 'Baronet' to Sir Percy's name.
We all think a heap of your book; and I am well pleased with my dedication. - Yours ever,
R. L. STEVENSON.
P.S. - APROPOS of the odd controversy about Shelley's nose: I have before me four photographs of myself, done by Shelley's son: my nose is hooked, not like the eagle, indeed, but like the accipitrine family in man: well, out of these four, only one marks the bend, one makes it straight, and one suggests a turn-up. This throws a flood of light on calumnious man - and the scandal- mongering sun. For personally I cling to my curve. To continue the Shelley controversy: I have a look

In [196]:
letter_meta = []
for f in os.listdir('letters'):
    location = 'letters/'+f
    addressee = get_addressee(location)
    date = get_date(location)
    info = {'location':location,'addressee':addressee,'date':date}
    letter_meta.append(info)

In [197]:
import pandas as pd

In [198]:
df = pd.DataFrame(letter_meta)
df.head()

Unnamed: 0,addressee,date,location
0,SPRING GROVE SCHOOL,12TH NOVEMBER 1863,letters/wikisource_vol1_ch1_letter1.html
1,CHARLES BAXTER,9TH APRIL 1872,letters/wikisource_vol1_ch1_letter10.html
2,MRS. THOMAS STEVENSON,25TH JULY 1872,letters/wikisource_vol1_ch1_letter11.html
3,MRS. THOMAS STEVENSON,29TH JULY 1872,letters/wikisource_vol1_ch1_letter12.html
4,MRS. THOMAS STEVENSON,"AUGUST 2, 1872",letters/wikisource_vol1_ch1_letter13.html


In [199]:
len(df)

462

In [200]:
for d in df.date.sort_values().unique():
    print(d)

10TH MAY 1889
10TH NOVEMBER '88
11TH MAY 1888
12TH DECEMBER '87
12TH JANUARY '83
12TH MARCH 1884
12TH NOVEMBER 1863
12TH OCTOBER 1883
13TH DECEMBER 1883
13TH JANUARY 1874
13TH NOVEMBER 1884
14TH JANUARY 1885
14TH JANUARY, 1889
15TH NOVEMBER 1879
17TH OCTOBER 1882
1871
1877
1886
1891
18TH JULY 1892
18TH NOVEMBER 1887
19TH JULY '93
19TH MAY '91
1ST DEC. '92
1ST JANUARY '94
20TH MAY '89
21ST OCTOBER [1879
22ND FEBRUARY '82
24TH SEPTEMBER 1886
25TH JULY 1872
26TH SEPTEMBER 1883
27TH MAY 1888
28TH SEPTEMBER 1884
29TH JULY 1872
4TH DECEMBER 1889
5TH MARCH 1872
6TH OCTOBER 1888
8TH MARCH 1889
8TH OCTOBER 1879
9TH APRIL 1872
9TH MARCH 1884
APRIL 1, 1882
APRIL 14, 1894
APRIL 15 OR 16 (THE HOUR NOT BEING KNOWN), 1886
APRIL 16 [1880
APRIL 16, 1879
APRIL 16TH, 1887
APRIL 17, '94
APRIL 17TH, 1894
APRIL 1866
APRIL 1875
APRIL 1879
APRIL 1880
APRIL 1882
APRIL 1883
APRIL 1884
APRIL 1886
APRIL 1887
APRIL 1888
APRIL 1889
APRIL 1891
APRIL 1894
APRIL 19, 1884
APRIL 24, 1884
APRIL 2ND, 1889
APRIL 5TH, 1893
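
Notice that these strings sort alphabetically, not chronologically, so "10TH MAY 1889" comes before "12TH NOVEMBER 1863". One possible cleanup step, sketched here for a single date style only, is to convert the strings to `datetime` objects. The helper `parse_day_month_year` is hypothetical, and it assumes English month names (Python's default locale).

```python
import re
from datetime import datetime

def parse_day_month_year(date_string):
    # Covers only the "12TH NOVEMBER 1863" style; other styles need more work.
    # Drop the ordinal suffix (ST, ND, RD, TH) that follows the day number.
    cleaned = re.sub(r"(\d+)(?:ST|ND|RD|TH)", r"\1", date_string)
    # Title-case the month so it reads "November", then parse.
    return datetime.strptime(cleaned.title(), "%d %B %Y")

dates = ["9TH APRIL 1872", "12TH NOVEMBER 1863", "10TH MAY 1889"]
alphabetical = sorted(dates)
chronological = sorted(dates, key=parse_day_month_year)
print(alphabetical)
print(chronological)
```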


In [191]:
df[df.date=="OR 21, 1887"]

Unnamed: 0,addressee,date,location
423,CHARLES SCRIBNER,"OR 21, 1887",letters/wikisource_vol2_ch9_letter16.html


In [192]:
print(get_text('letters/wikisource_vol2_ch9_letter16.html'))

Letter: TO CHARLES SCRIBNER

[SARANAC, NOVEMBER 20 OR 21, 1887.]
MY DEAR MR. SCRIBNER, - Heaven help me, I am under a curse just now. I have played fast and loose with what I said to you; and that, I beg you to believe, in the purest innocence of mind. I told you you should have the power over all my work in this country; and about a fortnight ago, when M'Clure was here, I calmly signed a bargain for the serial publication of a story. You will scarce believe that I did this in mere oblivion; but I did; and all that I can say is that I will do so no more, and ask you to forgive me. Please write to me soon as to this.
Will you oblige me by paying in for three articles, as already sent, to my account with John Paton & Co., 52 William Street? This will be most convenient for us.
The fourth article is nearly done; and I am the more deceived, or it is A BUSTER.
Now as to the first thing in this letter, I do wish to hear from you soon; and I am prepared to hear any reproach, or (what is harde
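
This letter shows how a general pattern can grab the wrong part of a date, which is presumably why `get_date` lists a literal pattern for it first. A sketch that reproduces the problem on the date line alone:

```python
import re

date_line = "[SARANAC, NOVEMBER 20 OR 21, 1887.]"

# The general month-day-year pattern from get_date.
general = r"[A-Z]+[\.]? \d{1,2}(?:[THNDSTR]{2})?[,!]{0,2} [\[\(]?(?:\d{4}|'\d{2})(?:-\d+)?"
# The literal pattern for this one letter.
literal = r"NOVEMBER 20 OR 21, 1887"

# "NOVEMBER 20" is not followed by a year, so the general pattern
# first succeeds further along the line, starting at "OR".
general_hit = re.findall(general, date_line, flags=re.IGNORECASE)[0]
literal_hit = re.findall(literal, date_line, flags=re.IGNORECASE)[0]
print(general_hit)
print(literal_hit)
```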

In [201]:
df.to_csv('letters_data.csv',index=False)

In [202]:
test = df.groupby('addressee')

In [203]:
test.nunique().sort_values('location',ascending=False).to_csv('addressees.csv')
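
Since every row has its own `location`, counting unique locations per addressee amounts to counting letters per addressee, sorted from most to fewest. The same ranking can be sketched without pandas using `collections.Counter`. This is a swapped-in technique, not what the cell above runs, and `sample_meta` is a small hand-made list in the shape of `letter_meta`, with addressees taken from the output shown earlier.

```python
from collections import Counter

# A few records in the same shape as letter_meta; other keys are omitted.
sample_meta = [
    {'addressee': 'MRS. THOMAS STEVENSON'},
    {'addressee': 'CHARLES BAXTER'},
    {'addressee': 'MRS. THOMAS STEVENSON'},
    {'addressee': 'ALISON CUNNINGHAM'},
    {'addressee': 'CHARLES BAXTER'},
    {'addressee': 'MRS. THOMAS STEVENSON'},
]

counts = Counter(record['addressee'] for record in sample_meta)
# most_common() lists addressees from most to least frequent.
ranking = counts.most_common()
for addressee, n_letters in ranking:
    print(addressee, n_letters)
```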