## Basic text processing

Here we will see a tiny bit of the basic text processing that you can do using only the built-in functions of Python ("built-in", in the sense that we do not need to import any special packages like pandas or seaborn to use them--they are just functions there for you to use any time you are running Python, for example the `len()` function). **Before doing this tutorial, please be sure to at least skim through ch. 6 of Automate the Boring Stuff**, which covers much more about strings. Most of the conceptual content of this tutorial is already covered in that chapter, and the intention here is to be somewhat redundant. We will reinforce some of the new concepts through concrete, Grand Tour related examples with thoroughly annotated ("commented") code.

In contrast to the pandas tutorials, where you saw monstrous lines of code like

`tutor_life_events = travelers_life_events_all[ travelers_life_events_all['eventsDetail2'].isin(tutor_occupations) ]`,

this tutorial will be much more transparent. It should be possible for new programmers to fully understand what every line of code is doing without too much effort. Here we are just working with texts, that is, basic Python strings, instead of the extremely rich and complex spreadsheet data we have from the Grand Tour Explorer.



A handy reference for Python's string methods is in the Python documentation [here](https://docs.python.org/3/library/stdtypes.html#string-methods). Think of a method as a function that is appropriate for, and therefore tied to, a particular kind of object. So a string method is a function that is appropriate for strings. The syntax for methods looks like `OBJECT.METHOD()`. Or with a string in particular, `string_variable.METHOD()`. Or with a particular string and a particular method, `"AbCdE".lower()`--this would take the string `"AbCdE"`, convert every character in it to lowercase, and return the resulting string `"abcde"` to you. Note that it would not make sense to apply the `.lower()` method to a number--there aren't lowercase and uppercase numbers, so this is not a method that number objects have in Python.
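To make the `OBJECT.METHOD()` pattern concrete, here is a minimal sketch using two string methods from the documentation linked above. The variable names are made up for illustration.

```python
# A string method is called on a string object with the OBJECT.METHOD() syntax
mixed_case = "AbCdE"          # hypothetical example string
lowered = mixed_case.lower()  # returns a NEW string; mixed_case itself is unchanged
uppered = mixed_case.upper()

print(lowered)     # abcde
print(uppered)     # ABCDE
print(mixed_case)  # AbCdE

# Integers have no .lower() method, so Python raises an AttributeError
try:
    (42).lower()
except AttributeError as error:
    print("No such method for int:", error)
```

Note that string methods return a new string rather than changing the original, so you usually assign the result to a variable.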

In [22]:
#make a string variable designating your present working directory ("pwd")
my_pwd = %pwd
print(my_pwd)

/Users/nicholasgardner


### Directory set up
To run this tutorial, download the ECCO txt files from Canvas (under Files/Data/Texts), and then in your present working directory (for me it is my home directory "/Users/nicholasgardner"), do the following:

1. Make a folder called "ECCO_texts"
2. Within the "ECCO_texts" folder, make a folder called "txt_versions"
3. Put all the individual ECCO txt files (not the zip file you downloaded from Canvas, or a folder containing the files--just all the ~2000 individual txt files) into this "txt_versions" folder
 
If set up has gone right, the code below will be able to access these ECCO text files.
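One way to check the folder layout before going further is sketched below. The helper `count_txt_files` is hypothetical (it is not part of the tutorial), and `os.getcwd()` is used as a plain-Python stand-in for the `%pwd` notebook command.

```python
import os

# Hypothetical helper: count the .txt files in the expected folder layout
# under base_dir, or return None if the folders are not there.
def count_txt_files(base_dir):
    txt_dir = os.path.join(base_dir, "ECCO_texts", "txt_versions")
    if not os.path.isdir(txt_dir):
        return None  # "ECCO_texts/txt_versions" does not exist in base_dir
    return len([name for name in os.listdir(txt_dir) if name.endswith(".txt")])

# os.getcwd() returns the present working directory as a string
print(count_txt_files(os.getcwd()))
```

If this prints `None`, the folders are probably in a different place than your present working directory; if it prints a number, that is how many txt files were found.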

In [23]:
import os

In [24]:
ADDISON_REMARKS_ECCO_FILENAME = "K062837.000.txt"  #this is the ID for Addison's "Remarks" in the ECCO database
ADDISON_REMARKS_FP = os.path.join(my_pwd, "ECCO_texts", "txt_versions", ADDISON_REMARKS_ECCO_FILENAME) #make a filepath to the file

STERNE_JOURNEY_ECCO_FILENAME = "K027660.001.txt" #this is the ID for a 1768 text by Laurence Sterne called "A Sentimental Journey through France and Italy"
STERNE_JOURNEY_FP = os.path.join(my_pwd, "ECCO_texts", "txt_versions", STERNE_JOURNEY_ECCO_FILENAME) 


#let's check that our filepath strings look right
print(ADDISON_REMARKS_FP)
print(STERNE_JOURNEY_FP)

/Users/nicholasgardner/ECCO_texts/txt_versions/K062837.000.txt
/Users/nicholasgardner/ECCO_texts/txt_versions/K027660.001.txt


In [5]:
#get all the lines in Addison's text into a list, where each item in the list is one line
with open(ADDISON_REMARKS_FP) as addison_remarks:
    addison_remarks_lines = addison_remarks.readlines()

#let's see how many lines there are in Addison's "Remarks"
#note that these lines correspond to the line-breaks in the
#printed edition of Addison's "Remarks" that you read 
#in a previous week. Some are blank lines (i.e. they contain only the newline character "\n")
len(addison_remarks_lines)

addison_remarks_lines[:50]

['REMARKS\n',
 '\n',
 'ON SEVERAL\n',
 '\n',
 'PARTS\n',
 '\n',
 'OF\n',
 '\n',
 'ITALY, &c. In the Years 1701, 1702, 1703.\n',
 '\n',
 'The SECOND EDITION.\n',
 '\n',
 'LONDON:\n',
 '\n',
 "Printed for J. Tonson, at Shakespear's-Head, over against\n",
 '\n',
 'Katharine-street in the Strand. MDCCXVIII.\n',
 '\n',
 'THERE is a Pleasure\n',
 '\n',
 'in owning\n',
 '\n',
 'Obligations which it is\n',
 '\n',
 '\n',
 '\n',
 'an Honour to have received,\n',
 '\n',
 'but should I publish\n',
 '\n',
 'any Favours done\n',
 '\n',
 'me by Your Lordship,\n',
 '\n',
 'I am afraid it would look\n',
 '\n',
 'more like Vanity than\n',
 '\n',
 'Gratitude.\n',
 '\n',
 'I had a very early\n',
 '\n',
 'Ambition to recommend\n',
 '\n',
 'my self to Your\n',
 '\n',
 "Lordship's Patronage,\n",
 '\n',
 "which yet encreas'd in\n",
 '\n']

In [25]:
#note that our variable addison_remarks_lines is just a list of strings
print(type(addison_remarks_lines))
print(type(addison_remarks_lines[0]))

<class 'list'>
<class 'str'>


In the cell below we use the `.split()` string method. Take a minute to find this string method in the Python docs link above (do a "CONTROL-f" search on the page), and read about what it does.

In [27]:
#First we initialize an int(eger) variable to 0 to use as a counter. 
#As we process each line below, we'll increment this counter variable
#by the number of words in the line. Once we have iterated through all the lines,
#we will have a count of all the words in the text
total_word_count = 0 

for line in addison_remarks_lines: #iterate through all the lines in the text, one by one starting from beginning
    if line != "\n":   #skip any lines that are only blank lines (i.e. they contain only the newline character "\n")
        words_in_line = line.split(sep=' ') #the .split() method allows us to split up a string on a specified sep(arator) character
        num_words_in_line = len(words_in_line)
        total_word_count = total_word_count + num_words_in_line
        
print(total_word_count)

76310
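To see exactly what `.split(sep=' ')` does to a single line, here is a small sketch on a made-up line that contains a double space. It also shows, for comparison, what calling `.split()` with no arguments does: it splits on any run of whitespace, so it drops the trailing newline and never produces empty strings. Switching the cell above to `.split()` would therefore probably give a somewhat different word count.

```python
line = "Obligations which  it is\n"  # made-up line; note the double space

# Splitting on a single space keeps the newline attached to the last word
# and produces an empty string where two spaces sit next to each other
print(line.split(sep=" "))  # ['Obligations', 'which', '', 'it', 'is\n']

# Splitting with no arguments treats any run of whitespace as one separator
print(line.split())         # ['Obligations', 'which', 'it', 'is']
```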


In [28]:
#Now let's get a list of all the words in Addison (i.e. the whole text,
#as a list of its words in the order they appear)

#First we initialize an empty list variable.
#As we process each line below, we'll add the words in the line
#to our list. Once we have iterated through all the lines,
#we will have a list containing all the words in the text.
all_the_words = []

for line in addison_remarks_lines: 
    if line != "\n":
        words_in_line = line.split(sep=' ')
        all_the_words.extend(words_in_line) #.extend() is a list method. Search "python list method extend" online to see what it does
        

In [29]:
#check that the length of our all_the_words list is equal to the number
#of words as we counted them in the cell above
print(len(all_the_words))

76310


In [30]:
'''
Now let's see how many unique words there are in Addison.
A handy trick for getting only the unique items in a list (i.e. removing duplicates)
is to first convert the list to a "set" object, which removes duplicates. To get the 
number of unique words we can then just apply the len() function to the resulting set
object.

Math aside: this is also how it is in mathematics. A normal set
in math does not have repeated elements, and the order of elements doesn't matter.
A set is just about registering membership.
For example, the set 
{5, 5, 5, 11, 9} 
is identical to the set 
{11, 9, 5}.
A set that CAN count repeated elements (but still in which order doesn't matter) is 
called a "multiset" (recall analogously that a graph that can have multiple edges
between the same two nodes is called a "multigraph").

Python sets behave like sets in math. With python lists, by contrast, you can have 
repeated elements, and order matters. So the following lists are all different:
[5, 5, 5, 11, 9] 
[5, 5, 11, 5, 9] 
[11, 9, 5]

In python, if you run set([5, 5, 5, 11, 9]), you will get
back the set {11, 9, 5}, or {5, 11, 9}, or ..., which are all
the same set since order doesn't matter in a set.
'''

set_of_words = set(all_the_words)

set_of_words

{'',
 'Quis',
 'Statuary,',
 'Fish)',
 'floating',
 'Honorius',
 'high',
 'Ornaments',
 'Cathedral',
 'imber',
 'tabescente',
 'Meat,',
 'Mansion',
 'Protector',
 'ostenderet',
 "Patin's",
 'aevum\n',
 'mild',
 'Kuff-stain.',
 'talked',
 'Venetians,',
 'Passages\n',
 'Politicks',
 'Appearances',
 'let',
 'Antonine',
 'Maximilian?',
 'Pilasters\n',
 'Sacristie,',
 'Evil',
 'malo',
 'captant.Sil.',
 'Germans',
 'nuota;',
 'dragged\n',
 'People.\n',
 'Naufragantium',
 'precisely,',
 'Charges',
 'Caprinâ,\n',
 'Circumstance',
 "veil'd",
 'Nilus',
 'complain\n',
 'Zealots\n',
 'degna;',
 "conceal'd.\n",
 'Towards\n',
 'weight',
 'ipsâ\n',
 'Clitumna',
 'extremo',
 'Pistol',
 'Honorary,\n',
 'turns',
 "drown'd",
 'Soul',
 'Some,',
 'Death;',
 'questo\n',
 'Net',
 'fall.\n',
 'owes',
 'Temple;',
 'quench',
 'Crete,\n',
 'Spumat',
 'Turf,',
 'Renard',
 'moderate',
 'Monument,\n',
 'happens,',
 'unpeopled,',
 'Arches',
 'Wild-Time,\n',
 'agreeable',
 'beneath',
 'likely',
 "accompany'd",
 'Stat

In [31]:
#now we can apply the len() function to this set to get the number of unique words
num_unique_words = len(set_of_words)
print(num_unique_words)

17695
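The set trick can be tried on small made-up lists, as in the sketch below. The second example is a reminder that a set compares strings exactly, so differences in capitalization or a trailing `"\n"` make two strings count as different word types (this is why items like `'Passages\n'` show up in the set printed above).

```python
numbers = [5, 5, 5, 11, 9]
unique_numbers = set(numbers)  # converting to a set removes the duplicates

print(len(numbers))                  # 5
print(len(unique_numbers))           # 3
print(unique_numbers == {11, 9, 5})  # True, because order does not matter in a set

# Strings must match exactly to count as the same item
tokens = ["Passages", "Passages\n", "passages"]  # made-up examples
print(len(set(tokens)))              # 3
```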


What interesting statistic can we calculate using only these two things, the total number of words ("word tokens") 
and the number of unique words ("word types")? We could divide the number of types by the 
number of tokens to get a number representing the lexical diversity of Addison's text, which
we can call the "type-to-token ratio".

By itself this number is not going to be meaningful, but in comparison with other texts/authors
it becomes meaningful, because it allows us to compare the relative lexical diversity of different
texts/authors (for example, all ~200 of the other Grand Tour accounts in the ECCO corpus). 

To take a cooked-up extreme example, Shakespeare's texts are famously lexically diverse, while Trump's speech is famously not lexically diverse. For all sorts of reasons that it might be interesting for you to think about, spoken language tends to be much less lexically diverse than written language.

In [32]:
type_to_token_ratio = num_unique_words / total_word_count
print(type_to_token_ratio)

0.2318831083737387
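The same calculation can be checked by hand on a toy sentence. This is only an illustrative sketch with a made-up sentence, not part of the Addison analysis.

```python
sentence = "the cat sat on the mat and the dog sat too"  # made-up example
tokens = sentence.split()  # every word occurrence is a token
types = set(tokens)        # every distinct word is a type

print(len(tokens))  # 11
print(len(types))   # 8 ("the" occurs three times and "sat" twice)
print(round(len(types) / len(tokens), 3))  # 0.727
```

A ratio close to 1 means almost every word is new, while a ratio close to 0 means the same few words are repeated many times.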


### Aside: Is there a better way to get word counts for a text?
To find this out, let's ask the internet! This is something programmers at all levels do all the time.

To see what other ways there are in Python to count the unique items in a list, I searched "count number of unique items in list in python" on the search engine Ecosia (which uses its ad revenue to plant trees--Google works too). The first result for a search like this is usually a Stack Overflow question with helpful responses. This is the one I got: https://stackoverflow.com/questions/12282232/how-do-i-count-unique-values-inside-a-list.
The highest-voted response to the question describes a nice way to count unique list items, so that you end up with 
a dictionary with entries of the form [word type]:[number of times that type occurs].

In [33]:
from collections import Counter

word_counts = Counter(all_the_words)

word_counts.values()

print(word_counts)
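Printing the whole `Counter` for a long text gives a very large output, so it can help to see how a `Counter` behaves on a short made-up list first. The sketch below uses only `collections.Counter` from the standard library; the word list is invented for illustration.

```python
from collections import Counter

words = ["the", "Alps", "the", "Rome", "the", "Alps"]  # made-up word list
counts = Counter(words)

print(counts["the"])          # 3
print(counts["Venice"])       # 0, a Counter gives 0 for items it has never seen
print(counts.most_common(2))  # [('the', 3), ('Alps', 2)]

# The number of keys is the number of types, and the
# sum of the values is the number of tokens
print(len(counts))            # 3
print(sum(counts.values()))   # 6
```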



### Back to lexical diversity: comparing Addison with Sterne

To start making our lexical diversity score for Addison's *Remarks* meaningful, let's calculate the same statistic for Laurence Sterne's 1768 text, *A Sentimental Journey through France and Italy*. 

We could just copy-paste the code we already wrote above (with relevant adjustments) to do this. But as you will know from ch. 3 of Automate the Boring Stuff, it would be much better to write functions. Then we could use these functions to easily calculate the lexical diversity of a hundred different texts, without having to copy-paste (and manually adjust for each text) the same code a hundred times.

In [34]:
#a function to get all the lines in a text
def get_lines(filepath):
    with open(filepath) as textfile:
        lines = textfile.readlines()
    return lines

#a function to get all the words in a text, assuming the function is passed a list containing the lines of the text
def get_all_the_words(lines):
    all_the_words = []
    for line in lines: 
        if line != "\n":
            words_in_line = line.split(sep=' ')
            all_the_words.extend(words_in_line)
    return all_the_words

#a function to compute the lexical diversity of a text, assuming the function is passed a list of all the words
#in the text
def compute_lexical_diversity(list_of_all_words):
    token_count = len(list_of_all_words)
    type_count = len(set(list_of_all_words))
    lexical_diversity = type_count / token_count
    return lexical_diversity

#finally, if we want we can combine the three "helper" functions above
#into a single function that will compute the lexical diversity of a text, given a filepath to it
def lexical_diversity(filepath_to_text):
    list_of_lines = get_lines(filepath_to_text)
    list_of_words = get_all_the_words(list_of_lines)
    #the result gets its own name so that it is not confused with the function's name
    diversity_score = compute_lexical_diversity(list_of_words)
    return diversity_score
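A quick sanity check for functions like these is to run them on a tiny file whose answer can be worked out by hand. The sketch below is an assumption-laden illustration: `lexical_diversity_of_file` is a condensed restatement of the helpers above (so that the block runs on its own), and the file contents are made up.

```python
import os
import tempfile

# Condensed restatement of the tutorial's helper functions,
# written as one function so that this sketch is self-contained
def lexical_diversity_of_file(filepath):
    with open(filepath) as textfile:
        lines = textfile.readlines()
    words = []
    for line in lines:
        if line != "\n":
            words.extend(line.split(sep=' '))
    return len(set(words)) / len(words)

# Write a tiny made-up text (two lines of words and one blank line) to a temporary file
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
    tmp.write("a b a\n\nc a b\n")
    tmp_path = tmp.name

toy_ld = lexical_diversity_of_file(tmp_path)
os.remove(tmp_path)  # clean up the temporary file

# 6 tokens and 5 types: 'a', 'b', 'c', plus 'a\n' and 'b\n',
# which count as separate types because the newline stays attached
print(toy_ld)
```

Counting by eye you might expect 3 types out of 6 tokens, so the toy file makes visible how splitting on a single space treats a word at the end of a line as a different type.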
    
    

Using our new functions we can easily calculate the lexical diversity of Addison and Sterne, and then compare them.

In [35]:
addison_LD = lexical_diversity(ADDISON_REMARKS_FP)
sterne_LD = lexical_diversity(STERNE_JOURNEY_FP)

In [36]:
addison_LD

0.2318831083737387

In [37]:
sterne_LD

0.24222952779438134

In [38]:
#now just for fun, and to demonstrate the generalizing power of the code we have written, 
#let's calculate the lexical diversity of a bunch of ECCO texts

#this gets all the ECCO text filenames in our "txt_versions" directory (folder and directory are synonyms)
TXT_VERSIONS_FP = os.path.join(my_pwd, "ECCO_texts", "txt_versions")
ECCO_txt_filenames = os.listdir(path=TXT_VERSIONS_FP)

In [39]:
#check that what we did looks right
ECCO_txt_filenames

['K096120.018.txt',
 'K036380.000.txt',
 'K039073.000.txt',
 'K057408.000.txt',
 'K040172.000.txt',
 'K050373.000.txt',
 'K041500.000.txt',
 'K061112.000.txt',
 'K119752.000.txt',
 'K025532.001.txt',
 'K020298.000.txt',
 'K015852.000.txt',
 'K010695.000.txt',
 'K062887.000.txt',
 'K061274.000.txt',
 'K042231.000.txt',
 'K010754.000.txt',
 'K019862.000.txt',
 'K062946.000.txt',
 'K086263.000.txt',
 'K100666.000.txt',
 'K035335.000.txt',
 'K000268.000.txt',
 'K034431.000.txt',
 'K117757.000.txt',
 'K059320.000.txt',
 'K121560.000.txt',
 'K101669.002.txt',
 'K067641.005.txt',
 'K135535.000.txt',
 'K002616.000.txt',
 'K086750.000.txt',
 'K107303.000.txt',
 'K078489.001.txt',
 'K030158.000.txt',
 'K046497.000.txt',
 'K117112.000.txt',
 'K036466.000.txt',
 'K113547.002.txt',
 'K020253.000.txt',
 'K023139.000.txt',
 'K052227.003.txt',
 'K017275.002.txt',
 'K023129.000.txt',
 'K081121.000.txt',
 'K080316.000.txt',
 'K066646.001.txt',
 'K113571.000.txt',
 'K123014.000.txt',
 'K032311.000.txt',


In [40]:
#now let's make a list of filepaths to each text, so we can use the "lexical_diversity" function we wrote above
ECCO_txt_filepaths = []
for textfile in ECCO_txt_filenames:
    fp = os.path.join(my_pwd, "ECCO_texts", "txt_versions", textfile)
    ECCO_txt_filepaths.append(fp)
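The loop above can also be written as a list comprehension, and `os.path.basename()` goes in the other direction, recovering a filename from a filepath. The sketch below uses a made-up base folder in place of the real "txt_versions" path, together with two filenames from the listing above.

```python
import os

base_dir = os.path.join("some", "folder")           # hypothetical stand-in for the txt_versions path
filenames = ["K096120.018.txt", "K036380.000.txt"]  # two filenames from the listing above

# Build one filepath per filename in a single line
filepaths = [os.path.join(base_dir, name) for name in filenames]
print(filepaths)

# os.path.basename() returns the last component of a path, i.e. the filename
recovered = [os.path.basename(fp) for fp in filepaths]
print(recovered == filenames)  # True
```

Because `os.path.join()` and `os.path.basename()` use the path separator of whatever operating system the code runs on, this approach does not depend on the separator being "/".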

In [41]:
#check that what we did looks right
ECCO_txt_filepaths

['/Users/nicholasgardner/ECCO_texts/txt_versions/K096120.018.txt',
 '/Users/nicholasgardner/ECCO_texts/txt_versions/K036380.000.txt',
 '/Users/nicholasgardner/ECCO_texts/txt_versions/K039073.000.txt',
 '/Users/nicholasgardner/ECCO_texts/txt_versions/K057408.000.txt',
 '/Users/nicholasgardner/ECCO_texts/txt_versions/K040172.000.txt',
 '/Users/nicholasgardner/ECCO_texts/txt_versions/K050373.000.txt',
 '/Users/nicholasgardner/ECCO_texts/txt_versions/K041500.000.txt',
 '/Users/nicholasgardner/ECCO_texts/txt_versions/K061112.000.txt',
 '/Users/nicholasgardner/ECCO_texts/txt_versions/K119752.000.txt',
 '/Users/nicholasgardner/ECCO_texts/txt_versions/K025532.001.txt',
 '/Users/nicholasgardner/ECCO_texts/txt_versions/K020298.000.txt',
 '/Users/nicholasgardner/ECCO_texts/txt_versions/K015852.000.txt',
 '/Users/nicholasgardner/ECCO_texts/txt_versions/K010695.000.txt',
 '/Users/nicholasgardner/ECCO_texts/txt_versions/K062887.000.txt',
 '/Users/nicholasgardner/ECCO_texts/txt_versions/K061274.000.t

In [66]:
#now let's use our lexical_diversity function on a hundred ECCO texts!!
#note the significant variation we see in lexical diversity
for ECCO_txt_filepath in ECCO_txt_filepaths[:100]:
    ld = lexical_diversity(ECCO_txt_filepath)
    filename = ECCO_txt_filepath.split(sep="/")[-1] #get the last thing in the filepath, which is the filename
    output_string = "The lexical diversity of text {} is: {}".format(filename, ld)
    print(output_string)

The lexical diversity of text K096120.018.txt is: 0.1858931614432837
The lexical diversity of text K036380.000.txt is: 0.2685591397849462
The lexical diversity of text K039073.000.txt is: 0.35406698564593303
The lexical diversity of text K057408.000.txt is: 0.1427020788392575
The lexical diversity of text K040172.000.txt is: 0.1449112128256129
The lexical diversity of text K050373.000.txt is: 0.3471574435900668
The lexical diversity of text K041500.000.txt is: 0.582971329278888
The lexical diversity of text K061112.000.txt is: 0.30949920782530826
The lexical diversity of text K119752.000.txt is: 0.23903880625938667
The lexical diversity of text K025532.001.txt is: 0.16541353383458646
The lexical diversity of text K020298.000.txt is: 0.2660429411059134
The lexical diversity of text K015852.000.txt is: 0.36288720476786107
The lexical diversity of text K010695.000.txt is: 0.40505359877488517
The lexical diversity of text K062887.000.txt is: 0.4069233486400565
The lexical diversity of text