## Objective: To extract the text of the English-language letters included in the North American Immigrant Letters, Diaries and Oral Histories collection.

Source: /Users/alaynemoody/Dropbox/Thesis/North_American_Letters_Diaries_OralHistories/dataNAIL

Search Criteria: "Letter" in column "doctype" and "English" in column "language" in IMLD_DOCS_QA completed.xlsx

Results: 1556 rows
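
The search criteria above amount to a boolean filter on two columns. The following is a minimal sketch on a toy table, not the real spreadsheet: the docid values are made up, and it assumes the `doctype` and `language` cells match "Letter" and "English" exactly (the real sheet may need whitespace or case normalisation first).

```python
import pandas as pd

# Toy stand-in for the spreadsheet; only the column names "doctype" and
# "language" come from the search criteria, the rows are hypothetical.
docs = pd.DataFrame({
    "docid": ["X1-D001", "X1-D002", "X2-D001", "X2-D002"],
    "doctype": ["Letter", "Diary", "Letter", "Letter"],
    "language": ["English", "English", "German", "English"],
})

# Keep only rows that are both letters and in English.
is_english_letter = (docs["doctype"] == "Letter") & (docs["language"] == "English")
letters = docs[is_english_letter]

print(letters["docid"].tolist())
```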

Steps:

<ol>
<li>For the first row:
<ol>
    <li>Open the source file named by the sourceid.</li>
    <li>Extract the contents of the "part" tag whose id matches the docid.</li>
    <li>Save the text to a file.</li>
</ol>
</li>
<li>Use a loop to do this for a small sample of rows.</li>
<li>Run the loop for all rows in the dataset.</li>
</ol>

In [1]:
# Import libraries
import pandas as pd 
from bs4 import BeautifulSoup
import nltk
import re

In [2]:
# Load the metadata subset exported to CSV
df = pd.read_csv("20240314b_PhD_NaildohSubset.csv") 
df.head()

Unnamed: 0,docsequence,docid,sourceid,docauthorid,doctitle,docyear,docmonth,docday,docpage,doctype,...,display_thumbnail,wwritten,wsent,marriagestatus,maternalstatus,authorLocation,nationalOrigin,britishEmpire_EU,translated,publicLetter
0,2,S1019-D002,S1019,per0001043,Letter from Sister Blandina Segale to Sister J...,1872.0,11.0,30.0,3-10,Letter,...,,Ohio; United States; East North Central States...,Not indicated,Single,Childless,USA,Italian,False,False,
1,4,S1019-D004,S1019,per0001043,Letter from Sister Blandina Segale to Sister J...,1872.0,12.0,6.0,13-22,Letter,...,,"Kansas City, MO; Missouri; United States; West...",Not indicated,Single,Childless,USA,Italian,False,False,
2,5,S1019-D005,S1019,per0001043,Letter from Sister Blandina Segale to Sister J...,1872.0,12.0,10.0,22-29,Letter,...,,"Trinidad, CO; Colorado; United States; Southwe...",Not indicated,Single,Childless,USA,Italian,False,False,
3,6,S1019-D006,S1019,per0001043,Letter from Sister Blandina Segale to Sister J...,1872.0,12.0,21.0,29-37,Letter,...,,"Trinidad, CO; Colorado; United States; Southwe...",Not indicated,Single,Childless,USA,Italian,False,False,
4,7,S1019-D007,S1019,per0001043,Letter from Sister Blandina Segale to Sister J...,1873.0,3.0,1.0,37-44,Letter,...,,"Trinidad, CO; Colorado; United States; Southwe...",Not indicated,Single,Childless,USA,Italian,False,False,


In [3]:
df[['docid', 'docauthorid']].describe()[0:2]

Unnamed: 0,docid,docauthorid
count,576,576
unique,576,101


In [4]:
# Preview the docid column, which drives the extraction steps below.
df[['docid']].head()

Unnamed: 0,docid
0,S1019-D002
1,S1019-D004
2,S1019-D005
3,S1019-D006
4,S1019-D007


S316 will break the loop because its source file is missing from the original dataset, so I need to remove its rows from the dataframe.

In [23]:
# Drop every row whose docid contains "S316" (substring match, as before)
df = df[~df["docid"].str.contains("S316")]
df[['docid', 'docauthorid']].describe()[0:2]

Unnamed: 0,docid,docauthorid
count,550,550
unique,550,80
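
Rather than discovering a missing source when the loop fails, the source files can be checked up front. The sketch below is one possible way to do that, assuming each source is stored as `<sourceid>.txt` in a single directory, as the paths in this notebook suggest; `missing_sources` is a hypothetical helper name, not part of any library.

```python
from pathlib import Path


def missing_sources(docids, src_dir):
    """Return the sorted source IDs whose <sourceid>.txt file is absent from src_dir."""
    # The source ID is the part of the docid before the hyphen, e.g. "S1019".
    sources = {docid.split("-")[0] for docid in docids}
    return sorted(s for s in sources if not (Path(src_dir) / f"{s}.txt").is_file())


# Possible use in this notebook (left commented out because it needs the data files):
# missing = missing_sources(df["docid"], "dataNAIL")
# df = df[~df["docid"].str.split("-").str[0].isin(missing)]
```

Filtering on the exact source ID also avoids the substring match, which would drop any docid that merely contains "S316".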


In [24]:
# Select the values in docid and split into sourceid (1st element) and docid (2nd element)
IDs = df["docid"].str.split(pat = "-", expand=True)
IDs["2"] = df["docid"]
IDs.columns = ['Src', 'Doc', 'Full']
IDs.head()

Unnamed: 0,Src,Doc,Full
0,S1019,D002,S1019-D002
1,S1019,D004,S1019-D004
2,S1019,D005,S1019-D005
3,S1019,D006,S1019-D006
4,S1019,D007,S1019-D007


In [25]:
IDs.iloc[136,2]

'S2344-D125'

## 1a. Open the sourceid

In [26]:
# Open the first source in the list
f = open("dataNAIL/" + IDs.iloc[0,0] + ".txt","r", encoding = 'utf-8')
print(f)

<_io.TextIOWrapper name='dataNAIL/S1019.txt' mode='r' encoding='utf-8'>


In [27]:
# Parse the html using beautiful soup and store in variable 'soup'
soup = BeautifulSoup(f, 'html.parser')

# View the HTML
print(soup.prettify()[:100])

<!DOCTYPE LAD SYSTEM "LAD-TEI.DTD">
<lad>
 <header>
  <author n="A1043">
   <source n="S1019"/>
  </


## 1b. Extract elements of the part tag

In [28]:
# Isolate the desired part (i.e., doc) and place in a variable
doc = soup.find(id=IDs.iloc[0,2])

# View the part
print(str(doc)[:100])
print("...")
print(str(doc)[11000:])

<part id="S1019-D002">
<head><p>TRINIDAD</p></head>
<p>On Train from Steubenville, Ohio, to Cincinna
...
oard the train I asked that my last interview be with my mother. Cannot you picture her sad, endearing look of appreciation? I'll skip the last talk with mother — some of it was in silence.</page></p>
</page></page></part>


In [29]:
# Join all text nodes in the part, skipping strings directly inside a <note> tag
doc = ''.join(text for text in doc.find_all(string=True) if text.parent.name != "note")
print(str(doc)[:100])
print("...")
print(str(doc)[10200:])


TRINIDAD
On Train from Steubenville, Ohio, to Cincinnati. Nov˙ 30, 1872.
My Darling Sister Justina:
...
Sister Antonia asked me how I had spent the day. I narrated some incidents. "I'm an ancient religious, but I could not have gone through the ordeal as creditably as you did." What if I had mentioned all the heart sighs I had witnessed! When it was time to board the train I asked that my last interview be with my mother. Cannot you picture her sad, endearing look of appreciation? I'll skip the last talk with mother — some of it was in silence.
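
One caveat with the parent-name filter used above: it only skips strings whose immediate parent is `note`. If a note wraps further markup, the text inside that markup has a different parent and is kept. Whether the collection's notes contain nested tags is not something this notebook establishes, so the markup below is hypothetical. The sketch compares the parent-name filter with an alternative that removes each `note` subtree before collecting text.

```python
from bs4 import BeautifulSoup

# Hypothetical markup, not taken from the collection.
html = (
    '<part id="X1-D001"><p>Dear sister,'
    '<note>Editor: <i>see p. 3</i></note>'
    ' I arrived safely.</p></part>'
)

part = BeautifulSoup(html, "html.parser").find(id="X1-D001")

# Parent-name filter: the string inside <i> has parent "i", so it survives.
by_parent = "".join(s for s in part.find_all(string=True) if s.parent.name != "note")

# Alternative: delete every <note> subtree, then collect the remaining text.
for note in part.find_all("note"):
    note.decompose()
by_decompose = part.get_text()

print(repr(by_parent))
print(repr(by_decompose))
```

If the notes in the source files never contain child tags, both approaches give the same text and the original filter is sufficient.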



## 1c. Save string to file

In [31]:
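# Write the extracted text to letters/<docid>.txt, matching the loop output below
with open("letters/" + IDs.iloc[0,2] + ".txt", "w", encoding='utf-8') as f:
    f.write(doc)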

## 2. Use a loop to do the above for a sample

In [32]:
sampleIDs = IDs.sample(n=3)
sampleIDs

Unnamed: 0,Src,Doc,Full
135,S2344,D124,S2344-D124
471,S9828,D011,S9828-D011
246,S6210,D075,S6210-D075


In [33]:
sampleIDs.iloc[1,2]

'S9828-D011'

In [34]:
for index in range(len(sampleIDs)):
    docid = sampleIDs['Full'].iloc[index]
    # Parse the source file, closing it as soon as parsing is done
    with open("dataNAIL/" + sampleIDs['Src'].iloc[index] + ".txt", "r", encoding='utf-8') as f:
        soup = BeautifulSoup(f, 'html.parser')
    doc = soup.find(id=docid)
    # Keep all text except strings directly inside <note> tags
    doc = ''.join(text for text in doc.find_all(string=True) if text.parent.name != "note")
    with open("letters/" + docid + ".txt", "w", encoding='utf-8') as f:
        f.write(doc)
print("done")

done


## 3. Run loop for all rows in the dataset

In [35]:
for index in range(len(IDs)):
    docid = IDs['Full'].iloc[index]
    # Parse the source file, closing it as soon as parsing is done
    with open("dataNAIL/" + IDs['Src'].iloc[index] + ".txt", "r", encoding='utf-8') as f:
        soup = BeautifulSoup(f, 'html.parser')
    doc = soup.find(id=docid)
    # Keep all text except strings directly inside <note> tags
    doc = ''.join(text for text in doc.find_all(string=True) if text.parent.name != "note")
    with open("letters/" + docid + ".txt", "w", encoding='utf-8') as f:
        f.write(doc)
print("done")

done
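
The full run above parses a source file again for every letter it contains, and stops with an error if a source file or a part is missing. A possible variant, sketched here under the same directory-layout assumption (`<sourceid>.txt` per source), groups the docids by source so each file is parsed once and collects anything it cannot find instead of failing. `extract_letters` is a hypothetical helper name.

```python
from collections import defaultdict
from pathlib import Path

from bs4 import BeautifulSoup


def extract_letters(docids, src_dir, out_dir):
    """Write one text file per docid and return the docids that could not be found."""
    # Group docids by source ID (the part before the hyphen).
    by_source = defaultdict(list)
    for docid in docids:
        by_source[docid.split("-")[0]].append(docid)

    not_found = []
    Path(out_dir).mkdir(parents=True, exist_ok=True)
    for src, ids in by_source.items():
        src_path = Path(src_dir) / f"{src}.txt"
        if not src_path.is_file():
            not_found.extend(ids)
            continue
        # Parse each source file once, however many letters it holds.
        soup = BeautifulSoup(src_path.read_text(encoding="utf-8"), "html.parser")
        for docid in ids:
            part = soup.find(id=docid)
            if part is None:
                not_found.append(docid)
                continue
            # Same note filter as the cells above.
            text = "".join(s for s in part.find_all(string=True) if s.parent.name != "note")
            (Path(out_dir) / f"{docid}.txt").write_text(text, encoding="utf-8")
    return not_found
```

In this notebook it could be called as `extract_letters(IDs["Full"], "dataNAIL", "letters")`, with the returned list showing which rows to investigate.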
