# 4 Processing texts

In this notebook you learn to process and extract information from texts. We continue with the sonnet but, as promised, will scale up soon.

This notebook focuses on extracting basic information from texts, such as the number of sentences and words. We also show how to use external libraries for more refined enrichment, such as finding named entities or zooming in on specific word categories (nouns, verbs). We discuss how this could be relevant to historians.

## 5.1 Strings are sequences of characters

At this point you have a basic understanding of how to read and manipulate textual data in Python. Now we can turn to more directly useful and realistic applications.

**[add wget]**

In [3]:
path = "example_data/notebook_3/shakespeare_sonnet_i.txt"
sonnet = open(path,'r').read()
print(sonnet)

From fairest creatures we desire increase,
That thereby beauty's rose might never die,
But as the riper should by time decease,
His tender heir might bear his memory:
But thou, contracted to thine own bright eyes,
Feed'st thy light's flame with self-substantial fuel,
Making a famine where abundance lies,
Thyself thy foe, to thy sweet self too cruel:
Thou that art now the world's fresh ornament,
And only herald to the gaudy spring,
Within thine own bud buriest thy content,
And tender churl mak'st waste in niggarding:
Pity the world, or else this glutton be,
To eat the world's due, by the grave and thee.


We have encountered a few string methods that allow you to manipulate texts. `str.lower()`, for example, converts capitals to lower case.


### -- Exercise
Lowercase the sonnet and store it in a variable called `sonnet_lower`

In [2]:
# remove this comment and add code here

To a human reader (looking at the document very formally) the sonnet clearly consists of multiple lines and contains many words (such a surprise), but this isn't obvious to the computer ingesting the document. At this stage the computer has no understanding of those basic elements of language: the concepts of lines, sentences and words have to be made explicit before we can even start to encode the meaning of texts by computational means (to be discussed later).

## 5.1 Tokens

Before we proceed, let's define our terms, because what are words anyway?
Generally we make the distinction between `type` and `token`. We follow the definitions of [Smith, N.A., 2019](https://arxiv.org/pdf/1902.06006.pdf).
- "A word **token** is a word observed in a piece of text." 
- "A word **type** is a distinct word, in the abstract, rather than a specific instance. Every word token is said to “belong” to its type."

 Example:
 > The sentence "two teas and two coffees" contains 5 tokens and 4 types (two appears twice).
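
The token/type distinction can be checked with a few lines of Python. This is only a sketch: it splits on whitespace (a simplification) and the variable names `example_tokens` and `example_types` are ours, not part of the notebook.

```python
# Count tokens and types in the example sentence.
sentence = "two teas and two coffees"

# Every observed word is a token.
example_tokens = sentence.split()

# A set keeps each distinct word only once, which gives us the types.
example_types = set(example_tokens)

print(len(example_tokens))  # 5 tokens
print(len(example_types))   # 4 types
```

A `set` is a convenient choice here because it silently drops duplicates, so its length is the number of distinct words.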

As said in the introduction, text initially comes as unstructured data, as a sequence of characters which we have to manipulate before we can process it properly. To make this clear, we can revisit the index notation to inspect the basic elements of a string.

In [3]:
sonnet[0]

'F'

As you notice `sonnet[0]` doesn't return the first word but the first character.

A seemingly straightforward way to transform the string to tokens is by splitting the text by white spaces. In this scenario we perceive white spaces as boundaries between tokens. Luckily Python provides us with a tool to do just that. The `str.split()` method will use the white spaces to split a string into a list of tokens. Run the code below, and inspect the output.

In [4]:
tokens = sonnet.split()
print(tokens)

['From', 'fairest', 'creatures', 'we', 'desire', 'increase,', 'That', 'thereby', "beauty's", 'rose', 'might', 'never', 'die,', 'But', 'as', 'the', 'riper', 'should', 'by', 'time', 'decease,', 'His', 'tender', 'heir', 'might', 'bear', 'his', 'memory:', 'But', 'thou,', 'contracted', 'to', 'thine', 'own', 'bright', 'eyes,', "Feed'st", 'thy', "light's", 'flame', 'with', 'self-substantial', 'fuel,', 'Making', 'a', 'famine', 'where', 'abundance', 'lies,', 'Thyself', 'thy', 'foe,', 'to', 'thy', 'sweet', 'self', 'too', 'cruel:', 'Thou', 'that', 'art', 'now', 'the', "world's", 'fresh', 'ornament,', 'And', 'only', 'herald', 'to', 'the', 'gaudy', 'spring,', 'Within', 'thine', 'own', 'bud', 'buriest', 'thy', 'content,', 'And', 'tender', 'churl', "mak'st", 'waste', 'in', 'niggarding:', 'Pity', 'the', 'world,', 'or', 'else', 'this', 'glutton', 'be,', 'To', 'eat', 'the', "world's", 'due,', 'by', 'the', 'grave', 'and', 'thee.']


In [5]:
print(type(tokens))

<class 'list'>


In [6]:
tokens[0]

'From'

This looks different to what we have encountered before:
- The output is enclosed by square brackets
- The quotation marks are now (approximately) around the individual words and not the whole string
- Words are separated by commas

What happened here is the following: `split` takes a string and returns a list of tokens. A `list` is another Python data type, like strings, which we will be using a lot in the remainder of this course. The `intermezzo` provides more information, but we also discuss lists here.
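
To see what "splitting by white spaces" means in practice, the sketch below uses a small made-up string (not the sonnet) that contains a double space and a line break. Called without arguments, `str.split()` treats any run of whitespace as one boundary; passing `" "` explicitly behaves differently.

```python
# A made-up example string with two spaces and a newline in it.
messy = "From  fairest\ncreatures"

# No argument: any run of whitespace (spaces, tabs, newlines) is one separator.
print(messy.split())     # ['From', 'fairest', 'creatures']

# Explicit single space: only ' ' separates, so we get an empty string
# between the two spaces and the newline stays inside the last item.
print(messy.split(" "))  # ['From', '', 'fairest\ncreatures']
```

This is why the notebook calls `sonnet.split()` without an argument: the line breaks in the sonnet are handled for free.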

A Python list is "an ordered collection of values" ([Wentworth, et al. 2012](https://openbookproject.net/thinkcs/python/english3e/lists.html)). It is a container that keeps several elements (also called items) in a particular order. Documents are often represented as a list, i.e. as a sequence of tokens in a specific order.

Each element in the list is implicitly indexed by its place, i.e. you can retrieve an item by its position, for example the first and last word of the sonnet.

In [7]:
tokens[0]

'From'

In [8]:
tokens[-1]

'thee.'

With `len()` we can count the number of items the list contains (notice how this is different from the number of characters a string contains).

In [9]:
len(tokens)

105
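
The difference between the length of a string and the length of a list can be made concrete with the first line of the sonnet. The sketch below is self-contained; the variable name `line` is ours.

```python
# The first line of the sonnet as a plain string.
line = "From fairest creatures we desire increase,"

# len() on a string counts characters (including spaces and the comma).
print(len(line))          # 42

# len() on the list returned by split() counts items.
print(len(line.split()))  # 6
```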

Even though we called the variable in which we saved the split string `tokens`, upon closer inspection you may notice that some elements in this list aren't technically tokens, as they also include punctuation marks. If we look at the items at positions 5, 8 and 41, the difficulty of converting a string to a list of tokens becomes apparent.

In [10]:
tokens[5],tokens[8],tokens[41]

('increase,', "beauty's", 'self-substantial')

While `'increase,'` is clearly a token followed by a punctuation mark, `"self-substantial"` is more complex. It depends on how you interpret and process such compounds (read it as one word, or split it into two, `"self"` and `"substantial"`?).
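
One simple, do-it-yourself way of dealing with attached punctuation is to strip it from the edges of each item. The sketch below is an illustration only and not the approach this notebook adopts; it uses `string.punctuation` from the standard library, and the list `raw_tokens` is a hand-picked sample.

```python
import string

# A few items as produced by splitting on whitespace.
raw_tokens = ["increase,", "beauty's", "self-substantial", "thee."]

# str.strip() only removes characters at the start and end of a string,
# so the apostrophe in "beauty's" and the hyphen in "self-substantial" stay.
cleaned = [item.strip(string.punctuation) for item in raw_tokens]

print(cleaned)  # ['increase', "beauty's", 'self-substantial', 'thee']
```

Note that this leaves the harder decisions (what to do with `'s` or with hyphenated compounds) untouched, which is exactly where dedicated tokenizers become useful.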

Luckily, you don't have to worry too much about the subtleties unless you really want to! What makes Python so convenient are the many external libraries that provide you with tools (in the form of functions) that help you with more complex tasks.

Below we look at a very popular (but maybe outdated at this point) tool called the Natural Language Toolkit (NLTK). Later we discuss a few other options.

NLTK is a Python library for natural language processing; it was built to make certain tasks, like tokenization, easier. The syntax below is unfamiliar, and the intermezzo points to a more elaborate explanation. What this line of code actually does is import a tool (a function with the name `word_tokenize`) into our notebook. This function is stored in the library (in `nltk.tokenize`).

In [11]:
from nltk.tokenize import word_tokenize

After importing `word_tokenize` we can apply it to our sonnet and print the result.

In [12]:
tokens_nltk = word_tokenize(sonnet)
print(tokens_nltk)

['From', 'fairest', 'creatures', 'we', 'desire', 'increase', ',', 'That', 'thereby', 'beauty', "'s", 'rose', 'might', 'never', 'die', ',', 'But', 'as', 'the', 'riper', 'should', 'by', 'time', 'decease', ',', 'His', 'tender', 'heir', 'might', 'bear', 'his', 'memory', ':', 'But', 'thou', ',', 'contracted', 'to', 'thine', 'own', 'bright', 'eyes', ',', "Feed'st", 'thy', 'light', "'s", 'flame', 'with', 'self-substantial', 'fuel', ',', 'Making', 'a', 'famine', 'where', 'abundance', 'lies', ',', 'Thyself', 'thy', 'foe', ',', 'to', 'thy', 'sweet', 'self', 'too', 'cruel', ':', 'Thou', 'that', 'art', 'now', 'the', 'world', "'s", 'fresh', 'ornament', ',', 'And', 'only', 'herald', 'to', 'the', 'gaudy', 'spring', ',', 'Within', 'thine', 'own', 'bud', 'buriest', 'thy', 'content', ',', 'And', 'tender', 'churl', "mak'st", 'waste', 'in', 'niggarding', ':', 'Pity', 'the', 'world', ',', 'or', 'else', 'this', 'glutton', 'be', ',', 'To', 'eat', 'the', 'world', "'s", 'due', ',', 'by', 'the', 'grave', 'and',

In [13]:
print(len(tokens_nltk))

127


### -- Exercise: 

The previous example returns a different number of tokens. Inspect the difference between splitting by white spaces and NLTK.

Together with lowercasing, tokenization is an essential step in the text processing pipeline. Now we can start investigating the sonnet in more detail, for example by counting words. The easiest way of doing this is using a `Counter()` object. 

In [14]:
from collections import Counter
from nltk.tokenize import word_tokenize
path = "example_data/notebook_3/shakespeare_sonnet_i.txt"
sonnet = open(path,'r').read()
sonnet_lowercase = sonnet.lower()
tokens = word_tokenize(sonnet_lowercase)
word_counts = Counter(tokens)
word_counts

Counter({'from': 1,
         'fairest': 1,
         'creatures': 1,
         'we': 1,
         'desire': 1,
         'increase': 1,
         ',': 14,
         'that': 2,
         'thereby': 1,
         'beauty': 1,
         "'s": 4,
         'rose': 1,
         'might': 2,
         'never': 1,
         'die': 1,
         'but': 2,
         'as': 1,
         'the': 6,
         'riper': 1,
         'should': 1,
         'by': 2,
         'time': 1,
         'decease': 1,
         'his': 2,
         'tender': 2,
         'heir': 1,
         'bear': 1,
         'memory': 1,
         ':': 3,
         'thou': 2,
         'contracted': 1,
         'to': 4,
         'thine': 2,
         'own': 2,
         'bright': 1,
         'eyes': 1,
         "feed'st": 1,
         'thy': 4,
         'light': 1,
         'flame': 1,
         'with': 1,
         'self-substantial': 1,
         'fuel': 1,
         'making': 1,
         'a': 1,
         'famine': 1,
         'where': 1,
         'abundance': 

Again, we skip many of the subtleties and technicalities here, but what is important to understand is that a `Counter()` maps tokens to their frequencies. Such a mapping is usually handled by **dictionaries**; for example, a very simple translation dictionary would look like:

```python
{'one':'eins',
 'two':'zwei'}
```

Note the curly brackets, indicating a different data type.
Words at the left of the colon are called **keys**, those at the right are **values**; each key-value pair is called an **item**.

You can assign a dictionary to a variable and then retrieve the value for a given key as shown in the example below. Please note that we are using square brackets again here.

In [5]:
english2german = {'one':'eins', 'two':'zwei'}
print(english2german['one'])

eins


Please consult the link in the breakout for more information about dictionaries.
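
To make the link between dictionaries and counting explicit, the sketch below counts words by hand with a plain dictionary and compares the result with `Counter`. The word list and the names `words`, `counts_by_hand` and `counts_auto` are ours and only serve as an illustration.

```python
from collections import Counter

words = ["two", "teas", "and", "two", "coffees"]

# Count by hand: start with an empty dictionary and add 1 for each word.
counts_by_hand = {}
for word in words:
    # .get(word, 0) returns the current count, or 0 if the word is new.
    counts_by_hand[word] = counts_by_hand.get(word, 0) + 1

# Counter does the same bookkeeping for us.
counts_auto = Counter(words)

print(counts_by_hand)                       # {'two': 2, 'teas': 1, 'and': 1, 'coffees': 1}
print(dict(counts_auto) == counts_by_hand)  # True
```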

`Counter()` objects are in many ways similar to dictionaries: you can retrieve the frequency for a given word by looking up the value for a specific key.

In [16]:
word_counts['and']

3

If the word doesn't appear in the text, a `Counter()` returns `0` (a plain dictionary would raise a `KeyError` instead).

In [6]:
word_counts['hello']

0
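
A minimal, self-contained sketch of how lookups for unseen words behave. A `Counter` answers with a count of `0`, whereas a plain dictionary has no entry at all; the tiny word list and the names `small_counts` and `plain` are made up for the example.

```python
from collections import Counter

small_counts = Counter(["the", "world", "the"])

print(small_counts["the"])    # 2
print(small_counts["hello"])  # 0, because the word was never counted

# A plain dictionary has no default: plain["hello"] would raise a KeyError.
# The .get() method avoids the error and returns None when the key is missing.
plain = dict(small_counts)
print(plain.get("hello"))     # None
```

If a `NameError` appears instead, the variable itself does not exist yet, usually because the cell that creates it has not been run in the current session.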

But the `Counter()` has a few useful methods that make life easier, for example we can print the `n` most common words.

In [15]:
word_counts.most_common(10)

[(',', 14),
 ('the', 6),
 ("'s", 4),
 ('to', 4),
 ('thy', 4),
 (':', 3),
 ('world', 3),
 ('and', 3),
 ('that', 2),
 ('might', 2)]

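As you will notice, the most frequent words often don't tell us much about the content of the text: punctuation marks and function words such as `the` and `to` dominate the list.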
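
One possible way to keep punctuation out of such a ranking is to filter the tokens before counting. The sketch below is an assumption about how one might do this, not a step the notebook prescribes: it keeps only tokens that contain at least one letter or digit, and it runs on a small hand-made list called `sample_tokens`.

```python
from collections import Counter

# A hand-made sample in the style of the NLTK output.
sample_tokens = ["the", "world", "'s", ",", "the", "grave", ",", "and", "thee", "."]

# Keep a token only if at least one of its characters is a letter or digit.
word_tokens = [t for t in sample_tokens if any(ch.isalnum() for ch in t)]

sample_counts = Counter(word_tokens)
print(word_tokens)                   # ['the', 'world', "'s", 'the', 'grave', 'and', 'thee']
print(sample_counts.most_common(1))  # [('the', 2)]
```

Function words such as `the` still dominate after this step; removing those is a separate decision.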

## `Breakout`
- `''.join()`
- libraries and imports
- dictionaries

## 3.3.2 Text Processing with SpaCy



While NLTK is convenient, and still used in DH, other libraries have emerged and are slowly pushing the state of the art. We will have a closer look at SpaCy, a more powerful (and fast!) tool for automatic language analysis. As with NLTK, we have to import the library at the start.

In [4]:
import spacy

To use SpaCy we first have to load a model, which is trained on a specific language for specific tasks: tokenization, lemmatization and more. In this sense SpaCy works somewhat differently from NLTK: with SpaCy we apply many different types of linguistic analysis and enrichment at once, whereas in NLTK you would invoke separate functions.

The code below makes this distinction more clear. We load the model and save it in `nlp`.

In [9]:
# Load English model
nlp = spacy.load("en_core_web_sm")

Next, an example text is assigned to `paragraph`.

In [10]:
paragraph = """A trifling incident thus served to settle a victory.  Now-a days, a soldier is so much of a machine that he seems simply to go through certain evolutions, in which there is no opportunity for the display of personal bravery or cowardice.  He does not know what is going on in other parts of the field, and has no real knowledge, till all be over, whether the day has been lost or won.”"""


Next we call `nlp`, passing `paragraph` as an argument. This returns an instance of the class `spacy.tokens.doc.Doc`.

In [15]:
doc = nlp(paragraph)
type(doc)

spacy.tokens.doc.Doc

Similar to lists, we can retrieve individual elements from `doc` using index notation.
Let's have a closer look at the third element, the word *incident* in `paragraph`.

In [13]:
doc[2]

incident

The `help()` function reveals the many attributes that belong to each individual token. These attributes are created by the model SpaCy applied to the text.

In [19]:
help(doc[2])

Help on Token object:

class Token(builtins.object)
 |  An individual token – i.e. a word, punctuation symbol, whitespace,
 |  etc.
 |  
 |  DOCS: https://spacy.io/api/token
 |  
 |  Methods defined here:
 |  
 |  __bytes__(...)
 |  
 |  __eq__(self, value, /)
 |      Return self==value.
 |  
 |  __ge__(self, value, /)
 |      Return self>=value.
 |  
 |  __gt__(self, value, /)
 |      Return self>value.
 |  
 |  __hash__(self, /)
 |      Return hash(self).
 |  
 |  __le__(self, value, /)
 |      Return self<=value.
 |  
 |  __len__(...)
 |      The number of unicode characters in the token, i.e. `token.text`.
 |      
 |      RETURNS (int): The number of unicode characters in the token.
 |      
 |      DOCS: https://spacy.io/api/token#len
 |  
 |  __lt__(self, value, /)
 |      Return self<value.
 |  
 |  __ne__(self, value, /)
 |      Return self!=value.
 |  
 |  __reduce__(...)
 |      Helper for pickle.
 |  
 |  __repr__(self, /)
 |      Return repr(self).
 |  
 |  __str__(self, /

One example is the part-of-speech, the syntactic category to which the token belongs. In this case, SpaCy predicted that *"incident"* is a noun.

In [22]:
doc[2].pos_

'NOUN'

The lemma is the standardized form of a token. For example, the plural of a noun is reduced to the singular, and verb forms are brought back to the infinitive. In our paragraph, "served" has the lemma "serve" and "evolutions" has the lemma "evolution".

Why is this useful? Well, it depends on what you want to do. Similar to lowercasing, lemmatization reduces the complexity of a text: tokens that otherwise have different surface forms become identical, which simplifies things when, for example, we want to count words or compute relations between tokens.
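
The effect on counting can be illustrated without any NLP library. In the sketch below the mapping `toy_lemmas` is written by hand purely for illustration (a real lemmatizer such as the one in SpaCy predicts lemmas from a trained model); all names are ours.

```python
from collections import Counter

surface_tokens = ["day", "days", "serve", "served", "serves"]

# Hypothetical, hand-made lookup table from surface form to lemma.
toy_lemmas = {"days": "day", "served": "serve", "serves": "serve"}

# Replace each token by its lemma; tokens not in the table stay as they are.
lemmatized = [toy_lemmas.get(token, token) for token in surface_tokens]

surface_counts = Counter(surface_tokens)
lemma_counts = Counter(lemmatized)

print(len(surface_counts))  # 5 distinct surface forms
print(lemma_counts)         # Counter({'serve': 3, 'day': 2})
```

Five different surface forms collapse into two lemmas, so the counts now reflect how often each word is used regardless of its inflection.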

In [31]:
doc[4].lemma_

'serve'

In [32]:
doc[32].lemma_

'evolution'

To obtain the lemmatized text, we first create a new `list` variable. This will be an empty `list`, but as we iterate over the elements in `doc` we add the lemma of each token (stored in the `.lemma_` attribute).

Even though this technique (of initializing an empty `list`) may be confusing at first, we will repeat it often in the following notebooks. Please take your time to understand the code below.

In [36]:
lemmas = []

for token in doc:
    lemmas.append(token.lemma_)
    
print(lemmas)

['a', 'trifle', 'incident', 'thus', 'serve', 'to', 'settle', 'a', 'victory', '.', ' ', 'now', '-', 'a', 'day', ',', 'a', 'soldier', 'be', 'so', 'much', 'of', 'a', 'machine', 'that', '-PRON-', 'seem', 'simply', 'to', 'go', 'through', 'certain', 'evolution', ',', 'in', 'which', 'there', 'be', 'no', 'opportunity', 'for', 'the', 'display', 'of', 'personal', 'bravery', 'or', 'cowardice', '.', ' ', '-PRON-', 'do', 'not', 'know', 'what', 'be', 'go', 'on', 'in', 'other', 'part', 'of', 'the', 'field', ',', 'and', 'have', 'no', 'real', 'knowledge', ',', 'till', 'all', 'be', 'over', ',', 'whether', 'the', 'day', 'have', 'be', 'lose', 'or', 'win', '.', '"']
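
The pattern of starting with an empty list and appending to it inside a loop can be practised on plain strings, without SpaCy. The sketch below lowercases three words this way and then shows a list comprehension, a more compact notation for the same idea; the variable names are ours.

```python
words = ["From", "fairest", "creatures"]

# Step 1: create an empty list.
lowered = []

# Step 2: visit each element and append the transformed version.
for word in words:
    lowered.append(word.lower())

# The same result written as a list comprehension.
lowered_short = [word.lower() for word in words]

print(lowered)                   # ['from', 'fairest', 'creatures']
print(lowered == lowered_short)  # True
```

Both forms do the same work; the explicit loop is easier to read step by step, which is why the notebook uses it.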


In a similar fashion, we can harvest the part-of-speech of each token.

In [38]:
pos = []

for token in doc:
    pos.append(token.pos_)
    
print(pos)

['DET', 'VERB', 'NOUN', 'ADV', 'VERB', 'PART', 'VERB', 'DET', 'NOUN', 'PUNCT', 'SPACE', 'ADV', 'PUNCT', 'DET', 'NOUN', 'PUNCT', 'DET', 'NOUN', 'AUX', 'ADV', 'ADJ', 'ADP', 'DET', 'NOUN', 'SCONJ', 'PRON', 'VERB', 'ADV', 'PART', 'VERB', 'ADP', 'ADJ', 'NOUN', 'PUNCT', 'ADP', 'DET', 'PRON', 'AUX', 'DET', 'NOUN', 'ADP', 'DET', 'NOUN', 'ADP', 'ADJ', 'NOUN', 'CCONJ', 'NOUN', 'PUNCT', 'SPACE', 'PRON', 'AUX', 'PART', 'VERB', 'PRON', 'AUX', 'VERB', 'ADP', 'ADP', 'ADJ', 'NOUN', 'ADP', 'DET', 'NOUN', 'PUNCT', 'CCONJ', 'AUX', 'DET', 'ADJ', 'NOUN', 'PUNCT', 'SCONJ', 'DET', 'AUX', 'ADV', 'PUNCT', 'SCONJ', 'DET', 'NOUN', 'AUX', 'AUX', 'VERB', 'CCONJ', 'VERB', 'PUNCT', 'PUNCT']


... or both at the same time.

In [39]:
lemma_pos = []
for token in doc:
    lemma_pos.append((token.lemma_,token.pos_))
    
print(lemma_pos)


[('a', 'DET'), ('trifle', 'VERB'), ('incident', 'NOUN'), ('thus', 'ADV'), ('serve', 'VERB'), ('to', 'PART'), ('settle', 'VERB'), ('a', 'DET'), ('victory', 'NOUN'), ('.', 'PUNCT'), (' ', 'SPACE'), ('now', 'ADV'), ('-', 'PUNCT'), ('a', 'DET'), ('day', 'NOUN'), (',', 'PUNCT'), ('a', 'DET'), ('soldier', 'NOUN'), ('be', 'AUX'), ('so', 'ADV'), ('much', 'ADJ'), ('of', 'ADP'), ('a', 'DET'), ('machine', 'NOUN'), ('that', 'SCONJ'), ('-PRON-', 'PRON'), ('seem', 'VERB'), ('simply', 'ADV'), ('to', 'PART'), ('go', 'VERB'), ('through', 'ADP'), ('certain', 'ADJ'), ('evolution', 'NOUN'), (',', 'PUNCT'), ('in', 'ADP'), ('which', 'DET'), ('there', 'PRON'), ('be', 'AUX'), ('no', 'DET'), ('opportunity', 'NOUN'), ('for', 'ADP'), ('the', 'DET'), ('display', 'NOUN'), ('of', 'ADP'), ('personal', 'ADJ'), ('bravery', 'NOUN'), ('or', 'CCONJ'), ('cowardice', 'NOUN'), ('.', 'PUNCT'), (' ', 'SPACE'), ('-PRON-', 'PRON'), ('do', 'AUX'), ('not', 'PART'), ('know', 'VERB'), ('what', 'PRON'), ('be', 'AUX'), ('go', 'VERB')

This combination of lemmatization and part-of-speech tagging is quite common. It removes certain distinctions (verb tense and plurals) but foregrounds others that would otherwise have been treated as the same word: for example the distinction between `fine` as a noun and as an adjective.
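
A small sketch of what this buys us when counting. The list `tagged` below is hand-made in the same `(lemma, part-of-speech)` format as `lemma_pos`; it is not output from the model.

```python
from collections import Counter

# Hand-made (lemma, part-of-speech) pairs for illustration.
tagged = [("fine", "NOUN"), ("fine", "ADJ"), ("fine", "ADJ"), ("pay", "VERB")]

# Counting lemmas only: noun and adjective uses of "fine" are merged.
by_lemma = Counter(lemma for lemma, pos in tagged)

# Counting the pairs: tuples can be dictionary keys, so each
# lemma/part-of-speech combination gets its own count.
by_lemma_pos = Counter(tagged)

print(by_lemma["fine"])                # 3
print(by_lemma_pos[("fine", "NOUN")])  # 1
print(by_lemma_pos[("fine", "ADJ")])   # 2
```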

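Part-of-speech tags also allow us to zoom in on specific word categories. The first cell below collects all nouns in `doc`; the second collects both nouns and adjectives.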

In [52]:
nouns = []
for token in doc:  
    if token.pos_ == "NOUN":
        nouns.append(token.text)
print(nouns)

['incident', 'victory', 'days', 'soldier', 'machine', 'evolutions', 'opportunity', 'display', 'bravery', 'cowardice', 'parts', 'field', 'knowledge', 'day']


In [50]:
sel = []
for token in doc:  
    if token.pos_ in ["NOUN","ADJ"]:
        sel.append(token.text)
print(sel)

['incident', 'victory', 'days', 'soldier', 'much', 'machine', 'certain', 'evolutions', 'opportunity', 'display', 'personal', 'bravery', 'cowardice', 'other', 'parts', 'field', 'real', 'knowledge', 'day']


SpaCy has a lot more to offer. For example, you can find named entities (places, persons and organisations) in texts.

In [53]:
doc2 = nlp("Germany is a wonderful country. The city of Berlin is great! Do you Kaspar is still listening? He want to Stanford.")

for ent in doc2.ents:
    print(ent.text, ent.start_char, ent.end_char, ent.label_)

Germany 0 7 GPE
Berlin 44 50 GPE
Kaspar 68 74 PERSON
Stanford 106 114 ORG
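
The two numbers printed for each entity are character positions in the original string. As a sketch of how to read them, the cell below copies the offsets from the output above by hand (it does not call SpaCy) and uses ordinary string slicing to recover each entity; the names `text` and `offsets` are ours.

```python
# The same text that was passed to nlp() above.
text = "Germany is a wonderful country. The city of Berlin is great! Do you Kaspar is still listening? He want to Stanford."

# (start_char, end_char, label) copied by hand from the printed output.
offsets = [(0, 7, "GPE"), (44, 50, "GPE"), (68, 74, "PERSON"), (106, 114, "ORG")]

for start, end, label in offsets:
    # Slicing includes the start position and excludes the end position.
    print(text[start:end], label)
```

Because the offsets point into the untouched text, they can be used to highlight entities or to look at the words surrounding them.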


## Tokenization and Sentence Splitting

In [None]:
for sent in doc.sents:
    print(sent.text)
    print()