[![Binder](https://mybinder.org/badge_logo.svg)](https://mybinder.org/v2/gh/kasparvonbeelen/ghi_python/main?labpath=4_-_Processing_texts.ipynb)


# 4 Processing texts


## Text Mining for Historians (with Python)
## A Gentle Introduction to Working with Textual Data in Python

### Created by Kaspar Beelen and Luke Blaxill

### For the German Historical Institute, London

<img align="left" src="https://www.ghil.ac.uk/typo3conf/ext/wacon_ghil/Resources/Public/Images/institute_icon_small.png">



In this notebook, you will learn to process and extract information from texts. We continue with the sonnet but, as promised, will scale up soon. 

This notebook focuses on extracting basic information from texts. We also show how to use external libraries for more refined enrichment, such as finding named entities or zooming in on specific word categories (nouns, verbs).

At the end of this Notebook you'll be able to:
- Tokenize texts
- Apply functions from the Natural Language Toolkit and SpaCy for more advanced processing (part-of-speech tagging)
- Count the frequency of tokens
- Have some understanding of Python `list` and `dict` objects

## 4.1 Strings

At this point you have a basic understanding of how to read and manipulate textual data in Python. Now we can turn to more directly useful and realistic applications. 

In [1]:
path = "example_data/notebook_3/shakespeare_sonnet_i.txt" # declare path/ variable assignment
sonnet = open(path,'r').read() # open and read document
print(sonnet) # print document

From fairest creatures we desire increase,
That thereby beauty's rose might never die,
But as the riper should by time decease,
His tender heir might bear his memory:
But thou, contracted to thine own bright eyes,
Feed'st thy light's flame with self-substantial fuel,
Making a famine where abundance lies,
Thyself thy foe, to thy sweet self too cruel:
Thou that art now the world's fresh ornament,
And only herald to the gaudy spring,
Within thine own bud buriest thy content,
And tender churl mak'st waste in niggarding:
Pity the world, or else this glutton be,
To eat the world's due, by the grave and thee.


While to a human reader—looking at the document very formally—the sonnet contains multiple words and lines (such a surprise) this isn't obvious to a computer ingesting the document. At this stage, Python represents the sonnet as a sequence of characters; it has no understanding of word boundaries.

## 4.2 Tokens

What are words anyway?
Generally, we will make a distinction between `types` and `tokens` following the definition of [Smith, N.A., 2019](https://arxiv.org/pdf/1902.06006.pdf). 
- "A word **token** is a word observed in a piece of text." 
- "A word **type** is a distinct word, in the abstract, rather than a specific instance. Every word token is said to “belong” to its type."

 Example:
 > The sentence "two teas and two coffees" contains 5 tokens and 4 types (two appears twice).
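
As a rough illustration of this distinction, the sketch below counts the tokens and types in the example sentence. It assumes that splitting on white space is a good enough tokenizer for such a tiny sentence (`str.split()` is introduced later in this section); the variable names are only chosen for this example.

```python
sentence = "two teas and two coffees"

example_tokens = sentence.split()    # every word occurrence is a token
example_types = set(example_tokens)  # a set keeps each distinct word only once

print(len(example_tokens))  # 5 tokens
print(len(example_types))   # 4 types
```

A `set` discards duplicates, which is exactly the step from tokens to types.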

As said in the introduction, textual data is unstructured. We have to manipulate and transform the sequence of characters in order to work with the content in meaningful ways.

Let's start with detecting word boundaries, and convert the string of characters to a list of tokens.

In [2]:
sonnet[0]

'F'

As you notice `sonnet[0]` doesn't return the first word but the first character.

A seemingly straightforward way to transform a string into tokens is by splitting the text by white spaces. In this scenario, we perceive white spaces as boundaries between tokens. Luckily Python provides us with a tool to do just that. The `str.split()` method will use the white spaces to split a string into a list of tokens. Run the code below, and inspect the output.

In [3]:
tokens = sonnet.split() # split by white space and save the resulting list in tokens
print(tokens)

['From', 'fairest', 'creatures', 'we', 'desire', 'increase,', 'That', 'thereby', "beauty's", 'rose', 'might', 'never', 'die,', 'But', 'as', 'the', 'riper', 'should', 'by', 'time', 'decease,', 'His', 'tender', 'heir', 'might', 'bear', 'his', 'memory:', 'But', 'thou,', 'contracted', 'to', 'thine', 'own', 'bright', 'eyes,', "Feed'st", 'thy', "light's", 'flame', 'with', 'self-substantial', 'fuel,', 'Making', 'a', 'famine', 'where', 'abundance', 'lies,', 'Thyself', 'thy', 'foe,', 'to', 'thy', 'sweet', 'self', 'too', 'cruel:', 'Thou', 'that', 'art', 'now', 'the', "world's", 'fresh', 'ornament,', 'And', 'only', 'herald', 'to', 'the', 'gaudy', 'spring,', 'Within', 'thine', 'own', 'bud', 'buriest', 'thy', 'content,', 'And', 'tender', 'churl', "mak'st", 'waste', 'in', 'niggarding:', 'Pity', 'the', 'world,', 'or', 'else', 'this', 'glutton', 'be,', 'To', 'eat', 'the', "world's", 'due,', 'by', 'the', 'grave', 'and', 'thee.']
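
A detail worth knowing: called without an argument, `split()` splits on any run of white space, including line breaks. This is why the line endings of the sonnet do not end up inside the tokens. The small sketch below uses a made-up snippet (not the full sonnet) to contrast this with splitting on a single space character.

```python
snippet = "From fairest\ncreatures  we"  # contains a newline and a double space

by_whitespace = snippet.split()     # no argument: split on any run of white space
by_space = snippet.split(' ')       # explicit ' ': split on every single space only

print(by_whitespace)
print(by_space)
```

In the second list the newline stays inside an item and the double space produces an empty string, so for tokenizing running text the version without an argument is usually the safer choice.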


In [4]:
print(type(tokens))

<class 'list'>


In [5]:
tokens[0]

'From'

The result of `str.split()` looks different to what we have encountered before:
- The output is enclosed by square brackets `[]`
- The quotation marks are now (approximately) around the individual words and not the whole string
- Words are separated by commas

What happened here is the following: split takes a string and returns a `list` (of tokens). A `list` is another Python data type, which we will be using a lot in the remainder of this course. The `break out` provides more information, but we discuss the most important aspect here as well.

A Python list "is an ordered collection of values" ([Wentworth, et al. 2012](https://openbookproject.net/thinkcs/python/english3e/lists.html)). It is a container that keeps several elements (also called items) in a particular order. Documents are often represented as a list, i.e. as a sequence of tokens in a specific order. 

Each element in the list is implicitly indexed by position, i.e. you can retrieve items by their place in the list (for example, the first and last word of the sonnet with `[0]` and `[-1]`).

In [6]:
tokens[0]

'From'

In [7]:
tokens[-1]

'thee.'

`len()` counts the number of items in a list (notice how this is different from the number of characters in a string).

In [8]:
len(tokens)

105
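
To make the difference between the length of a string and the length of a list concrete, here is a minimal sketch on the first line of the sonnet only (typed in by hand, so no file is needed).

```python
first_line = "From fairest creatures we desire increase,"

n_characters = len(first_line)      # a string counts characters (spaces included)
n_items = len(first_line.split())   # a list counts its items

print(n_characters)  # 42
print(n_items)       # 6
```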

## `Breakout`
- [lists in Python](break_out/lists.ipynb)

Read more on [`' '.join()`](https://www.w3schools.com/python/ref_string_join.asp)

We called the variable in which we saved the split string `tokens`, but upon closer inspection, you notice that some elements aren't technically tokens. Some include punctuation marks, for example. If we look at the items at positions 5, 8 and 41, the difficulty of converting a string to a list of tokens becomes apparent. 

In [9]:
tokens[5],tokens[8],tokens[41]

('increase,', "beauty's", 'self-substantial')

While `'increase,'` is a token followed by a punctuation mark, `"self-substantial"` is more complex. It depends on how you interpret and process such compounds: do you read it as one word, or split it into two, `"self"` and `"substantial"`? And are hyphens always token boundaries?
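
To see why simple rules fall short, consider the naive sketch below. It is not how any tokenizer library works internally; it merely strips punctuation from both ends of each item, using a short hand-picked list instead of the whole sonnet.

```python
import string

raw_tokens = ['increase,', "beauty's", 'self-substantial', 'thee.']

stripped = []  # start with an empty list
for tok in raw_tokens:
    # strip() only removes characters at the start and end of a string
    stripped.append(tok.strip(string.punctuation))

print(stripped)
```

The trailing comma and full stop disappear, but the apostrophe in `"beauty's"` and the hyphen in `'self-substantial'` are left untouched: deciding what to do with them requires more than a one-line rule.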

Luckily, you don't have to worry too much about the subtleties (unless you want to!) because Python comes with many convenient external libraries that provide a plethora of tools (in the form of functions) to help you with complex processing tasks.

Here we take a closer look at a popular tool called the "Natural Language Toolkit" (abbreviated as NLTK). (Later, we discuss a few other options.)

NLTK is a Python library for natural language processing. It was built to make processing text data (such as tokenization) easier, so you don't have to write your own functions all the time. 

The syntax below is unfamiliar and the `break out` points to a more elaborate explanation. The line of code imports a tool (in this case a function with the name `word_tokenize`) into our Notebook. This function is stored in `nltk.tokenize`. 

You don't have to import this function every time you use it; importing it once when running the notebook suffices (unless you restart the Kernel or Runtime, because then everything gets deleted from memory).

In [10]:
import nltk
nltk.download('punkt')
from nltk.tokenize import word_tokenize

[nltk_data] Downloading package punkt to /Users/kbeelen/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


After importing `word_tokenize`, we can apply it to our sonnet and print the result.

In [11]:
tokens_nltk = word_tokenize(sonnet)
print(tokens_nltk)

['From', 'fairest', 'creatures', 'we', 'desire', 'increase', ',', 'That', 'thereby', 'beauty', "'s", 'rose', 'might', 'never', 'die', ',', 'But', 'as', 'the', 'riper', 'should', 'by', 'time', 'decease', ',', 'His', 'tender', 'heir', 'might', 'bear', 'his', 'memory', ':', 'But', 'thou', ',', 'contracted', 'to', 'thine', 'own', 'bright', 'eyes', ',', "Feed'st", 'thy', 'light', "'s", 'flame', 'with', 'self-substantial', 'fuel', ',', 'Making', 'a', 'famine', 'where', 'abundance', 'lies', ',', 'Thyself', 'thy', 'foe', ',', 'to', 'thy', 'sweet', 'self', 'too', 'cruel', ':', 'Thou', 'that', 'art', 'now', 'the', 'world', "'s", 'fresh', 'ornament', ',', 'And', 'only', 'herald', 'to', 'the', 'gaudy', 'spring', ',', 'Within', 'thine', 'own', 'bud', 'buriest', 'thy', 'content', ',', 'And', 'tender', 'churl', "mak'st", 'waste', 'in', 'niggarding', ':', 'Pity', 'the', 'world', ',', 'or', 'else', 'this', 'glutton', 'be', ',', 'To', 'eat', 'the', 'world', "'s", 'due', ',', 'by', 'the', 'grave', 'and',

In [12]:
print(len(tokens_nltk))

127


### -- Exercise: 

The previous example returns a different number of tokens. Inspect the difference between splitting by white spaces and using NLTK's `word_tokenize`.

Similar to lowercasing, tokenization is an essential step in the text processing pipeline. Once we have tokens, we can investigate the sonnet in more detail, for example by counting words. The easiest way of doing this is using a `Counter()` object (which we also have to import). A `Counter` maps the types in a list to their frequency.

At this point, we have all skills to put together a small word counting program.
- import the required libraries (lines 1 and 2)
- create a variable with the name path that points to the location of the sonnet
- read the document
- lowercase the document
- tokenize the document
- count words
- print word frequencies

In [17]:
from collections import Counter # import Counter
from nltk.tokenize import word_tokenize # import tool for tokenization
path = "example_data/notebook_3/shakespeare_sonnet_i.txt" # tell python where the document is stored
sonnet = open(path,'r').read() # open and read the document
sonnet_lowercase = sonnet.lower() # lowercase and store result in sonnet_lowercase
tokens = word_tokenize(sonnet_lowercase) # tokenize and store the list in tokens
word_counts = Counter(tokens) # map types to the frequency
print(word_counts) # print word frequencies

Counter({',': 14, 'the': 6, "'s": 4, 'to': 4, 'thy': 4, ':': 3, 'world': 3, 'and': 3, 'that': 2, 'might': 2, 'but': 2, 'by': 2, 'his': 2, 'tender': 2, 'thou': 2, 'thine': 2, 'own': 2, 'from': 1, 'fairest': 1, 'creatures': 1, 'we': 1, 'desire': 1, 'increase': 1, 'thereby': 1, 'beauty': 1, 'rose': 1, 'never': 1, 'die': 1, 'as': 1, 'riper': 1, 'should': 1, 'time': 1, 'decease': 1, 'heir': 1, 'bear': 1, 'memory': 1, 'contracted': 1, 'bright': 1, 'eyes': 1, "feed'st": 1, 'light': 1, 'flame': 1, 'with': 1, 'self-substantial': 1, 'fuel': 1, 'making': 1, 'a': 1, 'famine': 1, 'where': 1, 'abundance': 1, 'lies': 1, 'thyself': 1, 'foe': 1, 'sweet': 1, 'self': 1, 'too': 1, 'cruel': 1, 'art': 1, 'now': 1, 'fresh': 1, 'ornament': 1, 'only': 1, 'herald': 1, 'gaudy': 1, 'spring': 1, 'within': 1, 'bud': 1, 'buriest': 1, 'content': 1, 'churl': 1, "mak'st": 1, 'waste': 1, 'in': 1, 'niggarding': 1, 'pity': 1, 'or': 1, 'else': 1, 'this': 1, 'glutton': 1, 'be': 1, 'eat': 1, 'due': 1, 'grave': 1, 'thee': 1, 

We skip many of the subtleties and technicalities here, but what is important to understand is that a `Counter()` maps word types (that occur in a document) to their frequencies. 

Mapping keys (word types) to values (counts) is handled by a data type called a **dictionary** (`Counter` is a variant of a dictionary, with a few more useful methods, but as far as we are concerned, they are largely identical).

In Python, we could create a simple translation dictionary, which looks like:

```python
{'one':'eins',
 'two':'zwei'}
```

Note the **curly brackets**, indicating a different data type (lists have square brackets, strings use quotation marks). 
Words to the left of the colon are called **keys**, those to the right are **values**, and each key-value pair is called an **item**.

You can assign a dictionary to a variable and then **look up** the value for a specific key, as shown in the example below. Please note that we are using square brackets here again (similar to how indexing looks up the item at a specific position).

In [18]:
english2german = {'one':'eins', 'two':'zwei'} # create an English-to-German dictionary
print(english2german['one']) # print the translation of 'one'

eins


Please consult the link in the breakout for more information about dictionaries.

## `Breakout`
- [dictionaries in Python](break_out/dictionaries.ipynb)

Let's return to our example. As said earlier, `Counter()` objects are similar to dictionaries: you can look up the frequency of a word by entering a key and print the associated value. We can print the frequency of "and" in our sonnet.

In [19]:
print(word_counts['and'])

3


If the word doesn't appear in the text, a `Counter` returns `0` (a regular dictionary would raise a `KeyError` instead).

In [20]:
word_counts['hello']

0
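
The sketch below contrasts how a plain dictionary and a `Counter` treat a missing key. The two counts are copied by hand from the sonnet; the names `plain_dict` and `counts` are only used for this illustration.

```python
from collections import Counter

plain_dict = {'and': 3, 'the': 6}
counts = Counter(plain_dict)  # a Counter built from the same key-value pairs

missing_in_counter = counts['hello']           # Counter: missing keys count as 0
missing_with_get = plain_dict.get('hello', 0)  # dict: .get() returns a default value

try:
    plain_dict['hello']                        # dict: a direct lookup of a missing key fails
except KeyError as error:
    print('KeyError:', error)

print(missing_in_counter, missing_with_get)
```

Looking up a missing word in a `Counter` does not add that word to the counter; it only reports a count of zero.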

`Counter()` has more useful methods that make life easier: the `.most_common()` method returns the `n` most common items as a list of (item, count) pairs. For example, we can show the ten most frequent tokens in our sonnet. Note that punctuation marks are counted as tokens too.

In [21]:
word_counts.most_common(10)

[(',', 14),
 ('the', 6),
 ("'s", 4),
 ('to', 4),
 ('thy', 4),
 (':', 3),
 ('world', 3),
 ('and', 3),
 ('that', 2),
 ('might', 2)]
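
Punctuation dominates the top of this list. If you are only interested in words, one simple (and admittedly crude) option is to keep just the tokens that consist entirely of letters before counting. The sketch below does this on a short, hand-copied sample of tokens from the last two lines of the sonnet; `sample_tokens` and `words_only` are illustrative names.

```python
from collections import Counter

sample_tokens = ['the', 'world', ',', 'or', 'else', 'this', 'glutton', 'be', ',',
                 'to', 'eat', 'the', 'world', "'s", 'due', ',']

words_only = []
for tok in sample_tokens:
    if tok.isalpha():           # True only if every character is a letter
        words_only.append(tok)

sample_counts = Counter(words_only)
print(sample_counts.most_common(2))
```

Be aware that `str.isalpha()` would also discard tokens such as `'self-substantial'` or `"mak'st"`, so whether this filter is appropriate depends on your research question.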

## 4.3 Text Processing with SpaCy



While NLTK is a convenient tool and often used in DH, a few other libraries have emerged and are slowly pushing the state of the art in the field. We will have a closer look at SpaCy, a powerful (and fast!) tool for automatic language analysis. Similar to NLTK, we have to import the library at the start.

In [22]:
import spacy # import the SpaCy library

To use SpaCy we first have to load a model. This model is trained on a specific language and handles many different tasks: tokenization, lemmatization and more. In this sense, SpaCy works somewhat differently from NLTK: with SpaCy we apply many different types of linguistic analysis and enrichment at once, whereas in NLTK you would invoke a separate function for each. 

The code below makes this distinction clearer. We load the model and save it in the variable `nlp`.

In [23]:
nlp = spacy.load("en_core_web_sm") # Load English model

Next, an example text is assigned to `paragraph`.

In [24]:
paragraph = """A trifling incident thus served to settle a victory.  
            Now-a days, a soldier is so much of a machine that he seems simply to go through certain evolutions, in which there is no opportunity for the display of personal bravery or cowardice.  
            He does not know what is going on in other parts of the field, and has no real knowledge, till all be over, whether the day has been lost or won.”"""


Next we call `nlp`, passing `paragraph` as an argument. This returns an instance of the class `spacy.tokens.doc.Doc`.

In [25]:
doc = nlp(paragraph)
type(doc)

spacy.tokens.doc.Doc

Similar to lists, we can retrieve individual elements from `doc` using the index notation.
Let's have a closer look at the third element, the word "incident" in `paragraph`.

In [26]:
print(doc[2])
print(type(doc[2]))

incident
<class 'spacy.tokens.token.Token'>


The `help()` function reveals the attributes and methods attached to individual tokens. 

An attribute is a value that belongs to an object. In the example below, each token has a boolean `is_digit` attribute that is `True` if the token consists of digits.  

The general syntax for attributes is: 
`object.attribute`

This looks similar to the dot notation for methods.
`object.method()`

The main syntactic difference is the use of parentheses. 

Whereas methods perform some operation on an object (and often return a value) an attribute is just a value attached to an object.

Apologies if this sounds confusing! After working through some examples you will quickly become familiar with attributes.
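
As a first example that does not depend on SpaCy, the sketch below uses the `date` class from Python's standard library. The particular date is just an illustration.

```python
from datetime import date

bastille = date(1789, 7, 14)  # an object from Python's standard library

year_value = bastille.year        # attribute: a stored value, no parentheses
iso_text = bastille.isoformat()   # method: performs an operation, needs parentheses

print(year_value)
print(iso_text)
```

The attribute simply hands back a value that the object already carries, while the method computes something (here, a formatted string) when it is called.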

Apply `help` to the "incident" token, and scroll down to the line **"Data descriptors defined here:"**
Here you find a list of attributes attached to the SpaCy `Token` object.

In [27]:
help(doc[2])

Help on Token object:

class Token(builtins.object)
 |  An individual token – i.e. a word, punctuation symbol, whitespace,
 |  etc.
 |  
 |  DOCS: https://spacy.io/api/token
 |  
 |  Methods defined here:
 |  
 |  __bytes__(...)
 |  
 |  __eq__(self, value, /)
 |      Return self==value.
 |  
 |  __ge__(self, value, /)
 |      Return self>=value.
 |  
 |  __gt__(self, value, /)
 |      Return self>value.
 |  
 |  __hash__(self, /)
 |      Return hash(self).
 |  
 |  __le__(self, value, /)
 |      Return self<=value.
 |  
 |  __len__(...)
 |      The number of unicode characters in the token, i.e. `token.text`.
 |      
 |      RETURNS (int): The number of unicode characters in the token.
 |      
 |      DOCS: https://spacy.io/api/token#len
 |  
 |  __lt__(self, value, /)
 |      Return self<value.
 |  
 |  __ne__(self, value, /)
 |      Return self!=value.
 |  
 |  __reduce__(...)
 |      Helper for pickle.
 |  
 |  __repr__(self, /)
 |      Return repr(self).
 |  
 |  __str__(self, /

SpaCy performs a syntactic analysis on the text and determines the part-of-speech of each token, which is then stored under the `.pos_` attribute. In the case of *"incident"* the part-of-speech is a noun.

In [28]:
doc[2].pos_

'NOUN'

It also takes care of lemmatization. The lemma is the standardized form of a token: plural nouns are reduced to the singular, and verb forms are brought back to their infinitive. For example, "served" has the lemma "serve", and "evolutions" has the lemma "evolution".

Lemmas are attached to the `.lemma_` attribute of a token.


In [29]:
doc[4].lemma_

'serve'

In [30]:
doc[32].lemma_

'evolution'

Why is this useful? It depends on what you want to do. Similar to lowercasing, lemmatization reduces the complexity of (or normalizes) a text: tokens that otherwise have different surface forms are reduced to the same token. If you want to search for (or would like to know the frequency of) a word, normalization is often helpful, because you'd capture, for example, "revolution" and "revolutions" at the same time. The exception is when you are particularly interested in the differences themselves, such as verb conjugation.

To obtain the lemmatized text, we first create a new `list` variable. This starts out as an empty `list`, but as we iterate over the elements in `doc`, we add the lemma of each token (stored in the `.lemma_` attribute). 

Even though this technique (of initializing an empty `list`) may be confusing at first, we will repeat it often in the following Notebook. Please take your time to understand the code below, as it contains a few new elements.

- creation of an empty `list`
- `for` loop
- appending items to a list. 

In [31]:
lemmas = []

for t in doc:
    lemmas.append(t.lemma_)
    
print(lemmas)

['a', 'trifle', 'incident', 'thus', 'serve', 'to', 'settle', 'a', 'victory', '.', ' \n            ', 'now', '-', 'a', 'day', ',', 'a', 'soldier', 'be', 'so', 'much', 'of', 'a', 'machine', 'that', '-PRON-', 'seem', 'simply', 'to', 'go', 'through', 'certain', 'evolution', ',', 'in', 'which', 'there', 'be', 'no', 'opportunity', 'for', 'the', 'display', 'of', 'personal', 'bravery', 'or', 'cowardice', '.', ' \n            ', '-PRON-', 'do', 'not', 'know', 'what', 'be', 'go', 'on', 'in', 'other', 'part', 'of', 'the', 'field', ',', 'and', 'have', 'no', 'real', 'knowledge', ',', 'till', 'all', 'be', 'over', ',', 'whether', 'the', 'day', 'have', 'be', 'lose', 'or', 'win', '.', '"']


- Iteration is performed by a `for` loop: in the example above we iterate over each `Token` in `doc`. `t` takes the value of each item in turn. We can use `print()` to make this visible. Iteration amounts to applying the same operation to each element.


In [32]:
for t in doc:
    print(t)

A
trifling
incident
thus
served
to
settle
a
victory
.
 
            
Now
-
a
days
,
a
soldier
is
so
much
of
a
machine
that
he
seems
simply
to
go
through
certain
evolutions
,
in
which
there
is
no
opportunity
for
the
display
of
personal
bravery
or
cowardice
.
 
            
He
does
not
know
what
is
going
on
in
other
parts
of
the
field
,
and
has
no
real
knowledge
,
till
all
be
over
,
whether
the
day
has
been
lost
or
won
.
”



The `for` loop traverses all items in `doc`, and for each token `t` in `doc` we repeat the same operation:
- We access the `.lemma_` attribute of `t`
- and append the lemma to the list `lemmas`. `list.append(item)` adds an item to the end of the list we initialized at the start of the cell.
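
The same pattern (empty list, `for` loop, `append`) works on any list, not only on SpaCy documents. The sketch below applies it to plain strings and uses a small, hand-written lookup table in place of a real lemmatizer. The table `toy_lemmas` is purely hypothetical; SpaCy would supply these values through `.lemma_`.

```python
from collections import Counter

surface_forms = ['revolution', 'revolutions', 'Revolution', 'served', 'serve']

# hypothetical, hand-written lemma table for illustration only
toy_lemmas = {'revolution': 'revolution', 'revolutions': 'revolution',
              'served': 'serve', 'serve': 'serve'}

normalized = []                              # 1. create an empty list
for form in surface_forms:                   # 2. iterate over all items
    lowered = form.lower()                   #    lowercase before the lookup
    normalized.append(toy_lemmas[lowered])   # 3. append the result

form_counts = Counter(surface_forms)   # five different surface forms
lemma_counts = Counter(normalized)     # only two lemmas remain
print(form_counts)
print(lemma_counts)
```

Counting the normalized list groups all variants of a word under one key, which is the practical benefit of lemmatization described above.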

It doesn't matter what name you use for the **loop variable** `t` (the variable between `for` and `in`) as long as you use it consistently. The examples below hopefully make this clear.

In [33]:
for abracadabra in doc: # you can give the loop variable any name you want
    print(abracadabra) # but you should use it consistently

A
trifling
incident
thus
served
to
settle
a
victory
.
 
            
Now
-
a
days
,
a
soldier
is
so
much
of
a
machine
that
he
seems
simply
to
go
through
certain
evolutions
,
in
which
there
is
no
opportunity
for
the
display
of
personal
bravery
or
cowardice
.
 
            
He
does
not
know
what
is
going
on
in
other
parts
of
the
field
,
and
has
no
real
knowledge
,
till
all
be
over
,
whether
the
day
has
been
lost
or
won
.
”


In [34]:
for t in doc: # otherwise things will go wrong
    print(r) # and this line will raise a NameError

NameError: name 'r' is not defined

Note also the spacing: lines that end with a colon are followed by indented lines. This is part of the Python syntax; removing the indentation will raise an `IndentationError`. Run the code below to convince yourself.

In [35]:
for t in doc: 
print(t)

IndentationError: expected an indented block (1434422477.py, line 2)
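
Indentation also determines which lines belong to the loop. In the sketch below (the list `log` is just an illustrative name), the indented line runs once for every item, while the unindented line runs only once, after the loop has finished.

```python
log = []

for word in ['thy', 'sweet', 'self']:
    log.append(word)   # indented: part of the loop, runs once per item
log.append('done')     # not indented: runs once, after the loop

print(log)
```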

In a similar fashion, we can harvest the part-of-speech of each token. Note that we repeat exactly the same procedure.

In [36]:
pos = [] # create an empty list with name pos

for token in doc: # iterate over all items in doc
    pos.append(token.pos_) # append the .pos_ attribute to pos
    
print(pos) # print the result

['DET', 'VERB', 'NOUN', 'ADV', 'VERB', 'PART', 'VERB', 'DET', 'NOUN', 'PUNCT', 'SPACE', 'ADV', 'PUNCT', 'DET', 'NOUN', 'PUNCT', 'DET', 'NOUN', 'AUX', 'ADV', 'ADJ', 'ADP', 'DET', 'NOUN', 'SCONJ', 'PRON', 'VERB', 'ADV', 'PART', 'VERB', 'ADP', 'ADJ', 'NOUN', 'PUNCT', 'ADP', 'DET', 'PRON', 'AUX', 'DET', 'NOUN', 'ADP', 'DET', 'NOUN', 'ADP', 'ADJ', 'NOUN', 'CCONJ', 'NOUN', 'PUNCT', 'SPACE', 'PRON', 'AUX', 'PART', 'VERB', 'PRON', 'AUX', 'VERB', 'ADP', 'ADP', 'ADJ', 'NOUN', 'ADP', 'DET', 'NOUN', 'PUNCT', 'CCONJ', 'AUX', 'DET', 'ADJ', 'NOUN', 'PUNCT', 'SCONJ', 'DET', 'AUX', 'ADV', 'PUNCT', 'SCONJ', 'DET', 'NOUN', 'AUX', 'AUX', 'VERB', 'CCONJ', 'VERB', 'PUNCT', 'PUNCT']


... or both at the same time.

In [37]:
lemma_pos = []
for token in doc:
    lemma_pos.append((token.lemma_,token.pos_))
    
print(lemma_pos)


[('a', 'DET'), ('trifle', 'VERB'), ('incident', 'NOUN'), ('thus', 'ADV'), ('serve', 'VERB'), ('to', 'PART'), ('settle', 'VERB'), ('a', 'DET'), ('victory', 'NOUN'), ('.', 'PUNCT'), (' \n            ', 'SPACE'), ('now', 'ADV'), ('-', 'PUNCT'), ('a', 'DET'), ('day', 'NOUN'), (',', 'PUNCT'), ('a', 'DET'), ('soldier', 'NOUN'), ('be', 'AUX'), ('so', 'ADV'), ('much', 'ADJ'), ('of', 'ADP'), ('a', 'DET'), ('machine', 'NOUN'), ('that', 'SCONJ'), ('-PRON-', 'PRON'), ('seem', 'VERB'), ('simply', 'ADV'), ('to', 'PART'), ('go', 'VERB'), ('through', 'ADP'), ('certain', 'ADJ'), ('evolution', 'NOUN'), (',', 'PUNCT'), ('in', 'ADP'), ('which', 'DET'), ('there', 'PRON'), ('be', 'AUX'), ('no', 'DET'), ('opportunity', 'NOUN'), ('for', 'ADP'), ('the', 'DET'), ('display', 'NOUN'), ('of', 'ADP'), ('personal', 'ADJ'), ('bravery', 'NOUN'), ('or', 'CCONJ'), ('cowardice', 'NOUN'), ('.', 'PUNCT'), (' \n            ', 'SPACE'), ('-PRON-', 'PRON'), ('do', 'AUX'), ('not', 'PART'), ('know', 'VERB'), ('what', 'PRON'), (

This combination of lemmatization and part-of-speech tagging is quite common. It removes certain distinctions (verb tense and plurals) but foregrounds others that would otherwise have been treated as the same word: for example, the distinction between `fine` as a noun and as an adjective.
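
To illustrate the `fine` example, the sketch below counts a few hand-written (lemma, part-of-speech) pairs. These pairs are hypothetical and not the output of SpaCy; they only mimic the shape of the `lemma_pos` list above.

```python
from collections import Counter

# hypothetical (lemma, part-of-speech) pairs, written by hand for illustration
tagged = [('fine', 'NOUN'), ('fine', 'ADJ'), ('fine', 'ADJ'), ('pay', 'VERB')]

only_lemmas = []
for pair in tagged:
    only_lemmas.append(pair[0])   # keep the lemma, drop the tag

lemma_counts_fine = Counter(only_lemmas)  # counts lemmas regardless of tag
pair_counts = Counter(tagged)             # counts each (lemma, tag) combination

print(lemma_counts_fine['fine'])       # noun and adjective are merged
print(pair_counts[('fine', 'NOUN')])   # the pair keeps them apart
print(pair_counts[('fine', 'ADJ')])
```

Because tuples can serve as keys, a `Counter` over the pairs separates the two uses of the same lemma without any extra work.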

SpaCy has a lot more to offer. For example, you can find named entities (places, persons and organisations) in texts. 

In [38]:
doc2 = nlp("Germany is a wonderful country. The city of Berlin is great! Do you Kaspar is still listening? He went to Stanford.")

for ent in doc2.ents:
    print(ent.text, ent.start_char, ent.end_char, ent.label_)

Germany 0 7 GPE
Berlin 44 50 GPE
Kaspar 68 74 PERSON
Stanford 106 114 ORG


## `Breakout`:
- [Indentation](break_out/indentation.ipynb)
- [`for` loop](break_out/loops.ipynb)

### -- Exercise:

The code below will load "A Tale of Two Cities" into your Notebook.
- apply the SpaCy `nlp` to this text
- collect all named entities in a list `ner`
- Count the named entities 
- print the ten most frequent entities
- repeat the above but this time collect the `.label_` attribute of a named entity
- compute and print the frequency of these labels

In [39]:
import requests 
book = requests.get('https://www.gutenberg.org/files/98/98-0.txt').content.decode('utf-8') # download book

### -- Exercise:

You can also use SpaCy for languages other than English. The code below loads a German model (the model has to be installed beforehand).


In [44]:
import spacy
nlp_de = spacy.load("de_core_news_sm")

Now we download "Die Leiden des jungen Werther" from gutenberg.org and save it in `werther`.

In [45]:
import requests 
werther = requests.get('https://www.gutenberg.org/cache/epub/2407/pg2407.txt').content.decode('utf-8') # download book

In [None]:
doc_de = nlp_de(werther)

Now apply the Named Entity recognition to `werther`!

In [None]:
## Enter code here

## Fin.