# Welcome

This notebook accompanies the Sunoikisis Digital Classics common session on Named Entity Extraction, see <https://github.com/SunoikisisDC/SunoikisisDC-2016-2017/wiki/Named-Entity-Extraction-I>.

In this notebook we are going to experiment with three different methods for extracting named entities from a Latin text.

# Library imports

External modules and libraries can be imported using `import` statements.

Let's import the [Natural Language ToolKit (NLTK)](http://www.nltk.org/), the [Classical Language ToolKit (CLTK)](http://cltk.org/), [MyCapytain](http://mycapytain.readthedocs.io/en/latest/) and some local libraries that are used in this notebook.

In [39]:
########
# NLTK #
########
import nltk
from nltk.tag import StanfordNERTagger
########
# CLTK #
########
import cltk
from cltk.tag.ner import tag_ner
##############
# MyCapytain #
##############
import MyCapytain 
from MyCapytain.resolvers.cts.api import HttpCTSResolver
from MyCapytain.retrievers.cts5 import CTS
from MyCapytain.common.constants import Mimetypes
#################
# other imports #
#################
import sys
sys.path.append("/opt/nlp/pymodules/")
from idai_journals.nlp import sub_leaves

And more precisely, we are using the following versions:

In [3]:
print(nltk.__version__)

3.2.2


In [4]:
print(cltk.__version__)

0.1.47


In [5]:
print(MyCapytain.__version__)

2.0.0b18


# Let's grab some text

To start with, we need some text from which we'll try to extract named entities using various methods and libraries.

There are several ways of doing this, e.g.:
1. copy and paste the text from Perseus or the Latin Library into a text document, and read it into a variable
2. load a text from one of the Latin corpora available via `cltk` (cf. this [blog post](https://disiectamembra.wordpress.com/2016/05/29/cltk-importing-the-latin-library-as-a-corpus/))
3. or load it from Perseus by leveraging its Canonical Text Services (CTS) API

Let's go for #3 :)

## What's CTS?

CTS URN stands for Canonical Text Services Uniform Resource Name.

You can think of a CTS URN like a **social security number** for texts (or parts of texts).

![caption](imgs/cts_urn_syntax.png)

Here are some examples of CTS URNs with different levels of granularity:
- `urn:cts:latinLit:phi0448` (Caesar)
- `urn:cts:latinLit:phi0448.phi001` (Caesar's *De Bello Gallico*)
- `urn:cts:latinLit:phi0448.phi001.perseus-lat2` (DBG, Latin edition)
- `urn:cts:latinLit:phi0448.phi001.perseus-lat2:1` (DBG, Latin edition, book 1)
- `urn:cts:latinLit:phi0448.phi001.perseus-lat2:1.1.1` (DBG, Latin edition, book 1, chapter 1, section 1)
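
The structure of these URNs can be made concrete with a small parser. What follows is a minimal sketch and not part of MyCapytain: `parse_cts_urn` is a hypothetical helper that simply splits on `:` and `.`, assuming the layout of the examples above (namespace, then text group, work and version, then an optional passage reference).

```python
def parse_cts_urn(urn):
    """Split a CTS URN into its components (hypothetical helper, for illustration only)."""
    parts = urn.split(":")
    # Expected layout: ["urn", "cts", namespace, work component, optional passage]
    if len(parts) < 4 or parts[0] != "urn" or parts[1] != "cts":
        raise ValueError("not a CTS URN: %s" % urn)
    work_levels = parts[3].split(".")
    # Pad to three levels (text group, work, version) so that shorter URNs also parse
    textgroup, work, version = (work_levels + [None, None, None])[:3]
    passage = parts[4] if len(parts) > 4 else None
    return {
        "namespace": parts[2],
        "textgroup": textgroup,
        "work": work,
        "version": version,
        "passage": passage,
    }


print(parse_cts_urn("urn:cts:latinLit:phi0448.phi001.perseus-lat2:1.1.1"))
print(parse_cts_urn("urn:cts:latinLit:phi0448"))
```

For the URN of Caesar as an author, only the namespace and the text group are filled in; the remaining fields are `None`.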

How do I find out the CTS URN of a given author or text? The [Perseus Catalog](http://catalog.perseus.org/) is your friend! (cf. e.g. <http://catalog.perseus.org/catalog/urn:cts:latinLit:phi0448>)

## Querying a CTS API

The URN of the Latin edition of Caesar's **De Bello Gallico** is `urn:cts:latinLit:phi0448.phi001.perseus-lat2`.

In [6]:
my_passage = "urn:cts:latinLit:phi0448.phi001.perseus-lat2"

With this information, we can query a CTS API and get some information about this text.

For example, we can "discover" its canonical text structure, which is essential information for being able to *cite* this text.

In [7]:
# We set up a resolver which communicates with an API available in Leipzig
resolver = HttpCTSResolver(CTS("http://cts.dh.uni-leipzig.de/api/cts/"))

In [8]:
# We request some metadata about the text
textMetadata = resolver.getMetadata("urn:cts:latinLit:phi0448.phi001.perseus-lat2")
# Texts in CTS metadata have one interesting property: their citation scheme.
# Citations are embedded objects that carry information about how a text can be cited and what depth it has
print([citation.name for citation in textMetadata.citation])

['Book', 'Chapter', 'Section']
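
To see how this citation scheme relates to the passage part of a CTS URN, here is a small sketch in plain Python. The helper `label_reference` is hypothetical (it is not a MyCapytain function); it pairs each number of a dotted reference with the corresponding level printed above.

```python
# Citation levels as printed by the cell above
citation_levels = ["Book", "Chapter", "Section"]


def label_reference(reference, levels):
    """Pair each number in a dotted reference with its citation level."""
    numbers = reference.split(".")
    if len(numbers) > len(levels):
        raise ValueError("reference is deeper than the citation scheme")
    # zip stops at the shorter sequence, so "1" only yields the Book level
    return list(zip(levels, numbers))


print(label_reference("1.1.1", citation_levels))
print(label_reference("1", citation_levels))
```

A reference such as `1` therefore points at a whole book, while `1.1.1` points at a single section.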


But we can also query the same API and get back the text of a specific text section, for example the entire book 1.

To do so, we need to append the indication of the reference scope (i.e. book 1) to the URN.

In [9]:
my_passage = "urn:cts:latinLit:phi0448.phi001.perseus-lat2:1"

So we retrieve the first book of the **De Bello Gallico** by passing its CTS URN (that we just stored in the variable `my_passage`) to the CTS API, via the resolver provided by `MyCapytain`:

In [10]:
passage = resolver.getTextualNode(my_passage)

At this point the passage is available in various formats: text, but also TEI XML, etc.

Thus, we need to specify that we are interested in getting the text only:

In [11]:
de_bello_gallico_book1 = passage.export(Mimetypes.PLAINTEXT)

Let's check that the text is there by printing the content of the variable `de_bello_gallico_book1` where we stored it:

In [12]:
print(de_bello_gallico_book1)

COMMENTARIUS PRIMUS Gallia est omnis divisa in partes tres, quarum unam incolunt Belgae, aliam Aquitani, tertiam qui ipsorum lingua Celtae, nostra Galli appellantur. Hi omnes lingua, institutis, legibus inter se differunt. Gallos ab Aquitanis Garumna flumen, a Belgis Matrona et Sequana dividit. Horum omnium fortissimi sunt Belgae, propterea quod a cultu atque humanitate provinciae longissime absunt, minimeque ad eos mercatores saepe commeant atque ea quae ad effeminandos animos pertinent important, proximique sunt Germanis, qui trans Rhenum incolunt, quibuscum continenter bellum gerunt. Qua de causa Helvetii quoque reliquos Gallos virtute praecedunt, quod fere cotidianis proeliis cum Germanis contendunt, cum aut suis finibus eos prohibent aut ipsi in eorum finibus bellum gerunt. [Eorum una, pars, quam Gallos obtinere dictum est, initium capit a flumine Rhodano, continetur Garumna flumine, Oceano, finibus Belgarum, attingit etiam ab Sequanis et Helvetiis flumen Rhenum, vergit ad septent

The text that we have just fetched by using a programming interface (API) can also be [viewed in the browser](http://cts.dh.uni-leipzig.de/read/latinLit/phi0448/phi001/perseus-lat2/1).

Or even imported as an iframe into this notebook! 

In [13]:
from IPython.display import IFrame
IFrame('http://cts.dh.uni-leipzig.de/read/latinLit/phi0448/phi001/perseus-lat2/1', width=1000, height=350)

Let's see how many words (tokens, more properly) there are in Caesar's *De Bello Gallico* I:

In [14]:
len(de_bello_gallico_book1.split(" "))

8176
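
The token count depends on how the text is split. `split(" ")` splits on every single space, so consecutive spaces produce empty strings and newlines are not treated as separators, whereas `split()` without arguments splits on any run of whitespace. The sample string below is made up for illustration; whether the two methods give different counts for the retrieved text depends on the whitespace it contains.

```python
# Made-up sample with two double spaces and a newline
sample = "Gallia est  omnis\ndivisa in  partes tres"

tokens_single_space = sample.split(" ")   # splits on each individual space
tokens_any_whitespace = sample.split()    # splits on any run of whitespace

print(len(tokens_single_space), tokens_single_space)
print(len(tokens_any_whitespace), tokens_any_whitespace)
```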

# Very simple baseline

Now let's write what in NLP jargon is called a *baseline*, that is a method for extracting named entities that can serve as a term of comparison to evaluate the accuracy of other methods. 

**Baseline method**: 
- cycle through each token of the text
- if the token starts with a capital letter (checked with the string method `istitle()`) it's a named entity (only one type, i.e. `Entity`)

In [15]:
"T".istitle()

True

In [16]:
"t".istitle()

False
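
Note that `str.istitle()` is somewhat stricter than "starts with a capital letter": it requires that uppercase letters only follow uncased characters and that lowercase letters only follow cased ones. The following sketch checks a few tokens taken from the text, which shows that attached punctuation does not prevent a match, while words written entirely in capitals are not matched.

```python
examples = ["Gallia", "Belgae,", "[Eorum", "M.", "COMMENTARIUS", "appellantur.", "est"]

for token in examples:
    # repr() makes the attached punctuation visible in the printout
    print(repr(token), token.istitle())
```

This is why `COMMENTARIUS` and `PRIMUS` do not show up among the entities found by the baseline.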

In [17]:
# we need a list to store the tagged tokens
tagged_tokens = []

# tokenisation is done by using the string method `split(" ")` 
# that splits a string upon white spaces
for n, token in enumerate(de_bello_gallico_book1.split(" ")):
    if(token.istitle()):
        tagged_tokens.append((token, "Entity"))
    #else:
        #tagged_tokens.append((token, "O"))    

Let's have a look at the first 50 tokens that we just tagged:

In [18]:
tagged_tokens[:50]

[('Gallia', 'Entity'),
 ('Belgae,', 'Entity'),
 ('Aquitani,', 'Entity'),
 ('Celtae,', 'Entity'),
 ('Galli', 'Entity'),
 ('Hi', 'Entity'),
 ('Gallos', 'Entity'),
 ('Aquitanis', 'Entity'),
 ('Garumna', 'Entity'),
 ('Belgis', 'Entity'),
 ('Matrona', 'Entity'),
 ('Sequana', 'Entity'),
 ('Horum', 'Entity'),
 ('Belgae,', 'Entity'),
 ('Germanis,', 'Entity'),
 ('Rhenum', 'Entity'),
 ('Qua', 'Entity'),
 ('Helvetii', 'Entity'),
 ('Gallos', 'Entity'),
 ('Germanis', 'Entity'),
 ('[Eorum', 'Entity'),
 ('Gallos', 'Entity'),
 ('Rhodano,', 'Entity'),
 ('Garumna', 'Entity'),
 ('Oceano,', 'Entity'),
 ('Belgarum,', 'Entity'),
 ('Sequanis', 'Entity'),
 ('Helvetiis', 'Entity'),
 ('Rhenum,', 'Entity'),
 ('Belgae', 'Entity'),
 ('Galliae', 'Entity'),
 ('Rheni,', 'Entity'),
 ('Aquitania', 'Entity'),
 ('Garumna', 'Entity'),
 ('Pyrenaeos', 'Entity'),
 ('Oceani', 'Entity'),
 ('Hispaniam', 'Entity'),
 ('Apud', 'Entity'),
 ('Helvetios', 'Entity'),
 ('Orgetorix.', 'Entity'),
 ('Is', 'Entity'),
 ('M.', 'Entity'),
 ('
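
The output shows that the tagged tokens keep their attached punctuation (e.g. `Belgae,`, `[Eorum`, `Orgetorix.`). One possible refinement, sketched here as an assumption rather than as part of the original baseline, is to strip leading and trailing punctuation before testing the token. The function name `extract_baseline_stripped` is hypothetical, and the sample joins two fragments of the text above.

```python
import string


def extract_baseline_stripped(input_text):
    """Baseline variant that strips surrounding punctuation from each token."""
    tagged_tokens = []
    for token in input_text.split(" "):
        # Remove punctuation at both ends, e.g. "Belgae," -> "Belgae", "[Eorum" -> "Eorum"
        cleaned = token.strip(string.punctuation)
        if cleaned.istitle():
            tagged_tokens.append((cleaned, "Entity"))
    return tagged_tokens


sample = "Gallos ab Aquitanis Garumna flumen, a Belgis Matrona et Sequana dividit. [Eorum una, pars"
print(extract_baseline_stripped(sample))
```

A side effect of this choice is that abbreviated praenomina such as `M.` lose their full stop and become `M`.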

For convenience we can also wrap our baseline code into a function that we call `extract_baseline`. Let's define it:

In [19]:
def extract_baseline(input_text):
    """
    :param input_text: the text to tag (string)
    :return: a list of tuples, where tuple[0] is the token and tuple[1] is the named entity tag
    """
    # we need a list to store the tagged tokens
    tagged_tokens = []

    # tokenisation is done by using the string method `split(" ")` 
    # that splits a string upon white spaces
    for n, token in enumerate(input_text.split(" ")):
        if(token.istitle()):
            tagged_tokens.append((token, "Entity"))
        #else:
            #tagged_tokens.append((token, "O")) 
    return tagged_tokens

And now we can call it like this:

In [20]:
tagged_tokens_baseline = extract_baseline(de_bello_gallico_book1)

In [21]:
tagged_tokens_baseline[-50:]

[('Marcomanos,', 'Entity'),
 ('Tribocos,', 'Entity'),
 ('Vangiones,', 'Entity'),
 ('Nemetes,', 'Entity'),
 ('Sedusios,', 'Entity'),
 ('Suebos,', 'Entity'),
 ('Eo', 'Entity'),
 ('Romanis', 'Entity'),
 ('Caesar', 'Entity'),
 ('Ita', 'Entity'),
 ('Relictis', 'Entity'),
 ('At', 'Entity'),
 ('Germani', 'Entity'),
 ('Reperti', 'Entity'),
 ('Cum', 'Entity'),
 ('Id', 'Entity'),
 ('P.', 'Entity'),
 ('Crassus', 'Entity'),
 ('Ita', 'Entity'),
 ('Rhenum', 'Entity'),
 ('L', 'Entity'),
 ('Ibi', 'Entity'),
 ('In', 'Entity'),
 ('Ariovistus,', 'Entity'),
 ('Duae', 'Entity'),
 ('Ariovisti', 'Entity'),
 ('Sueba', 'Entity'),
 ('Norica,', 'Entity'),
 ('Voccionis', 'Entity'),
 ('Gallia', 'Entity'),
 ('C.', 'Entity'),
 ('Valerius', 'Entity'),
 ('Procillus,', 'Entity'),
 ('Caesarem', 'Entity'),
 ('Quae', 'Entity'),
 ('Caesari', 'Entity'),
 ('Galliae,', 'Entity'),
 ('Is', 'Entity'),
 ('Item', 'Entity'),
 ('M.', 'Entity'),
 ('Metius', 'Entity'),
 ('Hoc', 'Entity'),
 ('Rhenum', 'Entity'),
 ('Suebi,', 'Entity'),


We can slightly modify our function so that it prints the snippet of text where an entity is found:

In [22]:
def extract_baseline(input_text):
    """
    :param input_text: the text to tag (string)
    :return: a list of tuples, where tuple[0] is the token and tuple[1] is the named entity tag
    """
    # we need a list to store the tagged tokens
    tagged_tokens = []

    # tokenisation is done by using the string method `split(" ")` 
    # that splits a string upon white spaces
    for n, token in enumerate(input_text.split(" ")):
        if(token.istitle()):
            tagged_tokens.append((token, "Entity"))
            context = input_text.split(" ")[n-5:n+5]
            print("Found entity \"%s\" in context \"%s\""%(token, " ".join(context)))
        #else:
            #tagged_tokens.append((token, "O"))  
    return tagged_tokens

In [23]:
tagged_text_baseline = extract_baseline(de_bello_gallico_book1)

Found entity "Gallia" in context ""
Found entity "Belgae," in context "partes tres, quarum unam incolunt Belgae, aliam Aquitani, tertiam qui"
Found entity "Aquitani," in context "quarum unam incolunt Belgae, aliam Aquitani, tertiam qui ipsorum lingua"
Found entity "Celtae," in context "Aquitani, tertiam qui ipsorum lingua Celtae, nostra Galli appellantur. Hi"
Found entity "Galli" in context "qui ipsorum lingua Celtae, nostra Galli appellantur. Hi omnes lingua,"
Found entity "Hi" in context "lingua Celtae, nostra Galli appellantur. Hi omnes lingua, institutis, legibus"
Found entity "Gallos" in context "institutis, legibus inter se differunt. Gallos ab Aquitanis Garumna flumen,"
Found entity "Aquitanis" in context "inter se differunt. Gallos ab Aquitanis Garumna flumen, a Belgis"
Found entity "Garumna" in context "se differunt. Gallos ab Aquitanis Garumna flumen, a Belgis Matrona"
Found entity "Belgis" in context "ab Aquitanis Garumna flumen, a Belgis Matrona et Sequana dividit."
Found e
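
The first line of this output shows an empty context for "Gallia". The token sits at position 2, so `n-5` is negative, and a negative slice start is counted from the end of the list, which yields an empty slice here. A minimal sketch of a fix, using a hypothetical helper `context_window` that clamps the start of the window at zero:

```python
def context_window(tokens, n, width=5):
    """Return the tokens around position n, never starting before the first token."""
    start = max(0, n - width)
    return tokens[start:n + width]


tokens = "COMMENTARIUS PRIMUS Gallia est omnis divisa in partes tres, quarum unam incolunt Belgae,".split(" ")

print(tokens[2 - 5:2 + 5])        # the slice used above: empty for the third token
print(context_window(tokens, 2))  # clamped window
```

Splitting the text once before the loop, as done here, also avoids tokenising it again for every entity found.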

In [24]:
tagged_text_baseline[:150]

[('Gallia', 'Entity'),
 ('Belgae,', 'Entity'),
 ('Aquitani,', 'Entity'),
 ('Celtae,', 'Entity'),
 ('Galli', 'Entity'),
 ('Hi', 'Entity'),
 ('Gallos', 'Entity'),
 ('Aquitanis', 'Entity'),
 ('Garumna', 'Entity'),
 ('Belgis', 'Entity'),
 ('Matrona', 'Entity'),
 ('Sequana', 'Entity'),
 ('Horum', 'Entity'),
 ('Belgae,', 'Entity'),
 ('Germanis,', 'Entity'),
 ('Rhenum', 'Entity'),
 ('Qua', 'Entity'),
 ('Helvetii', 'Entity'),
 ('Gallos', 'Entity'),
 ('Germanis', 'Entity'),
 ('[Eorum', 'Entity'),
 ('Gallos', 'Entity'),
 ('Rhodano,', 'Entity'),
 ('Garumna', 'Entity'),
 ('Oceano,', 'Entity'),
 ('Belgarum,', 'Entity'),
 ('Sequanis', 'Entity'),
 ('Helvetiis', 'Entity'),
 ('Rhenum,', 'Entity'),
 ('Belgae', 'Entity'),
 ('Galliae', 'Entity'),
 ('Rheni,', 'Entity'),
 ('Aquitania', 'Entity'),
 ('Garumna', 'Entity'),
 ('Pyrenaeos', 'Entity'),
 ('Oceani', 'Entity'),
 ('Hispaniam', 'Entity'),
 ('Apud', 'Entity'),
 ('Helvetios', 'Entity'),
 ('Orgetorix.', 'Entity'),
 ('Is', 'Entity'),
 ('M.', 'Entity'),
 ('

# NER with CLTK

The CLTK library has some basic support for the extraction of named entities from Latin and Greek texts (see [CLTK's documentation](http://docs.cltk.org/en/latest/latin.html#named-entity-recognition)).

The current implementation (as of version 0.1.47) uses a lookup-based method.

For each token in a text, the tagger checks whether that token is contained within a predefined list of possible named entities:
- list of Latin proper nouns: <https://github.com/cltk/latin_proper_names_cltk>
- list of Greek proper nouns: <https://github.com/cltk/greek_proper_names_cltk>
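
The idea of a lookup-based tagger can be sketched in a few lines. This is an illustration of the general technique, not CLTK's actual implementation: the set `proper_nouns` is a tiny made-up stand-in for the lists linked above, and `tag_by_lookup` is a hypothetical function name.

```python
# Tiny, made-up stand-in for a list of proper nouns
proper_nouns = {"Gallia", "Belgae", "Aquitani", "Rhenum"}


def tag_by_lookup(tokens, known_names):
    """Tag a token as "Entity" only if it appears verbatim in the list of known names."""
    tagged = []
    for token in tokens:
        if token in known_names:
            tagged.append((token, "Entity"))
        else:
            tagged.append((token,))  # untagged tokens are returned as 1-tuples
    return tagged


print(tag_by_lookup(["Gallia", "est", "omnis", "Belgae,"], proper_nouns))
```

Because the lookup is an exact string match, a token with attached punctuation such as `Belgae,` is not found, and a name that is missing from the list can never be tagged.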

Let's run CLTK's tagger (it takes a moment):

In [25]:
%%time
tagged_text_cltk = tag_ner('latin', input_text=de_bello_gallico_book1)

CPU times: user 14.2 s, sys: 0 ns, total: 14.2 s
Wall time: 14.2 s


Let's have a look at the output, only the first 10 tokens (by using the list slicing notation):

In [26]:
tagged_text_cltk[:10]

[('COMMENTARIUS',),
 ('PRIMUS',),
 ('Gallia', 'Entity'),
 ('est',),
 ('omnis',),
 ('divisa',),
 ('in',),
 ('partes',),
 ('tres',),
 (',',)]

The output looks slightly different from the one of our baseline function (the size of the tuples in the list varies). 

But we can write a function to fix this; we call it `reshape_cltk_output`:

In [27]:
def reshape_cltk_output(tagged_tokens):
    reshaped_output = []
    for tagged_token in tagged_tokens:
        if(len(tagged_token)==1):
            continue
            #reshaped_output.append((tagged_token[0], "O"))
        else:
            reshaped_output.append((tagged_token[0], tagged_token[1]))
    return reshaped_output

We apply this function to CLTK's output:

In [28]:
tagged_text_cltk = reshape_cltk_output(tagged_text_cltk)

The resulting output now looks fine:

In [29]:
tagged_text_cltk[:20]

[('Gallia', 'Entity'),
 ('Belgae', 'Entity'),
 ('Aquitani', 'Entity'),
 ('Celtae', 'Entity'),
 ('Galli', 'Entity'),
 ('Hi', 'Entity'),
 ('Gallos', 'Entity'),
 ('Aquitanis', 'Entity'),
 ('Belgis', 'Entity'),
 ('Matrona', 'Entity'),
 ('Sequana', 'Entity'),
 ('Belgae', 'Entity'),
 ('Germanis', 'Entity'),
 ('Rhenum', 'Entity'),
 ('Gallos', 'Entity'),
 ('Germanis', 'Entity'),
 ('Eorum', 'Entity'),
 ('Gallos', 'Entity'),
 ('Rhodano', 'Entity'),
 ('Oceano', 'Entity')]

Now let's compare the two lists of tagged tokens by using a Python function called `zip`, which allows us to read multiple lists simultaneously:

In [30]:
list(zip(tagged_text_baseline[:20], tagged_text_cltk[:20]))

[(('Gallia', 'Entity'), ('Gallia', 'Entity')),
 (('Belgae,', 'Entity'), ('Belgae', 'Entity')),
 (('Aquitani,', 'Entity'), ('Aquitani', 'Entity')),
 (('Celtae,', 'Entity'), ('Celtae', 'Entity')),
 (('Galli', 'Entity'), ('Galli', 'Entity')),
 (('Hi', 'Entity'), ('Hi', 'Entity')),
 (('Gallos', 'Entity'), ('Gallos', 'Entity')),
 (('Aquitanis', 'Entity'), ('Aquitanis', 'Entity')),
 (('Garumna', 'Entity'), ('Belgis', 'Entity')),
 (('Belgis', 'Entity'), ('Matrona', 'Entity')),
 (('Matrona', 'Entity'), ('Sequana', 'Entity')),
 (('Sequana', 'Entity'), ('Belgae', 'Entity')),
 (('Horum', 'Entity'), ('Germanis', 'Entity')),
 (('Belgae,', 'Entity'), ('Rhenum', 'Entity')),
 (('Germanis,', 'Entity'), ('Gallos', 'Entity')),
 (('Rhenum', 'Entity'), ('Germanis', 'Entity')),
 (('Qua', 'Entity'), ('Eorum', 'Entity')),
 (('Helvetii', 'Entity'), ('Gallos', 'Entity')),
 (('Gallos', 'Entity'), ('Rhodano', 'Entity')),
 (('Germanis', 'Entity'), ('Oceano', 'Entity'))]

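But, as you can see, the two lists are not aligned: from "Garumna" onwards (tagged by the baseline but not by CLTK) the pairs no longer refer to the same token.

The tokens themselves also differ, and this is due to how the CLTK function tokenises the text. The comma after "tres" becomes a token on its own, whereas when we tokenise by white space the comma is attached to "tres" (i.e. "tres,").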
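
The difference between the two tokenisations can be reproduced on a short fragment. The regular expression below is only a rough stand-in for a punctuation-aware tokeniser (an assumption for illustration, not the tokeniser CLTK uses).

```python
import re

sentence = "Gallia est omnis divisa in partes tres, quarum unam incolunt Belgae"

# Splitting on spaces keeps the comma attached to the preceding word
whitespace_tokens = sentence.split(" ")
# Words and single punctuation marks become separate tokens
punct_aware_tokens = re.findall(r"\w+|[^\w\s]", sentence)

print(whitespace_tokens)
print(punct_aware_tokens)
```

The second list is one token longer than the first, because the comma has become a token of its own.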


A solution to this is to pass to the `tag_ner` function the text already tokenised by white space.

In [31]:
tagged_text_cltk = reshape_cltk_output(tag_ner('latin', input_text=de_bello_gallico_book1.split(" ")))

In [32]:
list(zip(tagged_text_baseline[:20], tagged_text_cltk[:20]))

[(('Gallia', 'Entity'), ('Gallia', 'Entity')),
 (('Belgae,', 'Entity'), ('Galli', 'Entity')),
 (('Aquitani,', 'Entity'), ('Hi', 'Entity')),
 (('Celtae,', 'Entity'), ('Gallos', 'Entity')),
 (('Galli', 'Entity'), ('Aquitanis', 'Entity')),
 (('Hi', 'Entity'), ('Belgis', 'Entity')),
 (('Gallos', 'Entity'), ('Matrona', 'Entity')),
 (('Aquitanis', 'Entity'), ('Sequana', 'Entity')),
 (('Garumna', 'Entity'), ('Rhenum', 'Entity')),
 (('Belgis', 'Entity'), ('Gallos', 'Entity')),
 (('Matrona', 'Entity'), ('Germanis', 'Entity')),
 (('Sequana', 'Entity'), ('Gallos', 'Entity')),
 (('Horum', 'Entity'), ('Sequanis', 'Entity')),
 (('Belgae,', 'Entity'), ('Belgae', 'Entity')),
 (('Germanis,', 'Entity'), ('Galliae', 'Entity')),
 (('Rhenum', 'Entity'), ('Aquitania', 'Entity')),
 (('Qua', 'Entity'), ('Pyrenaeos', 'Entity')),
 (('Helvetii', 'Entity'), ('Oceani', 'Entity')),
 (('Gallos', 'Entity'), ('Hispaniam', 'Entity')),
 (('Germanis', 'Entity'), ('Is', 'Entity'))]

# NER with NLTK

NLTK provides a wrapper, `StanfordNERTagger`, around the Stanford NER tool. Here we load a model trained on Italian and apply it to the Latin text.

In [33]:
stanford_model_italian = "/opt/nlp/stanford-tools/stanford-ner-2015-12-09/classifiers/ner-ita-nogpe-noiob_gaz_wikipedia_sloppy.ser.gz"

In [34]:
ner_tagger = StanfordNERTagger(stanford_model_italian)

In [35]:
tagged_text_nltk = ner_tagger.tag(de_bello_gallico_book1.split(" "))

Let's have a look at the output:

In [36]:
tagged_text_nltk[0:150]

[('COMMENTARIUS', 'LOC'),
 ('PRIMUS', 'LOC'),
 ('Gallia', 'LOC'),
 ('est', 'O'),
 ('omnis', 'O'),
 ('divisa', 'O'),
 ('in', 'O'),
 ('partes', 'O'),
 ('tres,', 'O'),
 ('quarum', 'O'),
 ('unam', 'O'),
 ('incolunt', 'O'),
 ('Belgae,', 'O'),
 ('aliam', 'O'),
 ('Aquitani,', 'O'),
 ('tertiam', 'O'),
 ('qui', 'O'),
 ('ipsorum', 'O'),
 ('lingua', 'O'),
 ('Celtae,', 'O'),
 ('nostra', 'O'),
 ('Galli', 'PER'),
 ('appellantur.', 'PER'),
 ('Hi', 'PER'),
 ('omnes', 'O'),
 ('lingua,', 'O'),
 ('institutis,', 'O'),
 ('legibus', 'O'),
 ('inter', 'O'),
 ('se', 'O'),
 ('differunt.', 'O'),
 ('Gallos', 'PER'),
 ('ab', 'O'),
 ('Aquitanis', 'O'),
 ('Garumna', 'O'),
 ('flumen,', 'O'),
 ('a', 'O'),
 ('Belgis', 'PER'),
 ('Matrona', 'PER'),
 ('et', 'O'),
 ('Sequana', 'O'),
 ('dividit.', 'O'),
 ('Horum', 'O'),
 ('omnium', 'O'),
 ('fortissimi', 'O'),
 ('sunt', 'O'),
 ('Belgae,', 'O'),
 ('propterea', 'O'),
 ('quod', 'O'),
 ('a', 'O'),
 ('cultu', 'O'),
 ('atque', 'O'),
 ('humanitate', 'O'),
 ('provinciae', 'O'),
 ('l
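
Unlike the two previous methods, this tagger returns one tag per token and distinguishes entity types (`LOC`, `PER`). A common post-processing step, sketched here under the assumption that adjacent tokens with the same tag belong to the same entity, is to merge such runs with `itertools.groupby`. The helper `group_entities` is hypothetical, and the input is a shortened selection of the output above.

```python
from itertools import groupby

# Shortened selection of the tagger output shown above
tagged = [("COMMENTARIUS", "LOC"), ("PRIMUS", "LOC"), ("Gallia", "LOC"),
          ("est", "O"), ("omnis", "O"), ("nostra", "O"),
          ("Galli", "PER"), ("appellantur.", "PER"), ("Hi", "PER"), ("omnes", "O")]


def group_entities(tagged_tokens):
    """Merge runs of adjacent tokens that share the same non-"O" tag."""
    entities = []
    for tag, group in groupby(tagged_tokens, key=lambda pair: pair[1]):
        if tag != "O":
            entities.append((" ".join(token for token, _ in group), tag))
    return entities


print(group_entities(tagged))
```

Grouping makes the tagger's errors on this Latin text easier to spot, for example the verb `appellantur.` ending up inside a `PER` entity.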

# Wrap up

At this point we can "compare" the output of the three different methods we used, again by using the `zip` function. 

In [37]:
list(zip(tagged_text_baseline[:50], tagged_text_cltk[:50],tagged_text_nltk[:50]))

[(('Gallia', 'Entity'), ('Gallia', 'Entity'), ('COMMENTARIUS', 'LOC')),
 (('Belgae,', 'Entity'), ('Galli', 'Entity'), ('PRIMUS', 'LOC')),
 (('Aquitani,', 'Entity'), ('Hi', 'Entity'), ('Gallia', 'LOC')),
 (('Celtae,', 'Entity'), ('Gallos', 'Entity'), ('est', 'O')),
 (('Galli', 'Entity'), ('Aquitanis', 'Entity'), ('omnis', 'O')),
 (('Hi', 'Entity'), ('Belgis', 'Entity'), ('divisa', 'O')),
 (('Gallos', 'Entity'), ('Matrona', 'Entity'), ('in', 'O')),
 (('Aquitanis', 'Entity'), ('Sequana', 'Entity'), ('partes', 'O')),
 (('Garumna', 'Entity'), ('Rhenum', 'Entity'), ('tres,', 'O')),
 (('Belgis', 'Entity'), ('Gallos', 'Entity'), ('quarum', 'O')),
 (('Matrona', 'Entity'), ('Germanis', 'Entity'), ('unam', 'O')),
 (('Sequana', 'Entity'), ('Gallos', 'Entity'), ('incolunt', 'O')),
 (('Horum', 'Entity'), ('Sequanis', 'Entity'), ('Belgae,', 'O')),
 (('Belgae,', 'Entity'), ('Belgae', 'Entity'), ('aliam', 'O')),
 (('Germanis,', 'Entity'), ('Galliae', 'Entity'), ('Aquitani,', 'O')),
 (('Rhenum', 'Entity

In [38]:
for baseline_out, cltk_out, nltk_out in zip(tagged_text_baseline[:150], tagged_text_cltk[:150], tagged_text_nltk[:150]):
    print("Baseline: %s\nCLTK: %s\nNLTK: %s\n"%(baseline_out, cltk_out, nltk_out))

Baseline: ('Gallia', 'Entity')
CLTK: ('Gallia', 'Entity')
NLTK: ('COMMENTARIUS', 'LOC')

Baseline: ('Belgae,', 'Entity')
CLTK: ('Galli', 'Entity')
NLTK: ('PRIMUS', 'LOC')

Baseline: ('Aquitani,', 'Entity')
CLTK: ('Hi', 'Entity')
NLTK: ('Gallia', 'LOC')

Baseline: ('Celtae,', 'Entity')
CLTK: ('Gallos', 'Entity')
NLTK: ('est', 'O')

Baseline: ('Galli', 'Entity')
CLTK: ('Aquitanis', 'Entity')
NLTK: ('omnis', 'O')

Baseline: ('Hi', 'Entity')
CLTK: ('Belgis', 'Entity')
NLTK: ('divisa', 'O')

Baseline: ('Gallos', 'Entity')
CLTK: ('Matrona', 'Entity')
NLTK: ('in', 'O')

Baseline: ('Aquitanis', 'Entity')
CLTK: ('Sequana', 'Entity')
NLTK: ('partes', 'O')

Baseline: ('Garumna', 'Entity')
CLTK: ('Rhenum', 'Entity')
NLTK: ('tres,', 'O')

Baseline: ('Belgis', 'Entity')
CLTK: ('Gallos', 'Entity')
NLTK: ('quarum', 'O')

Baseline: ('Matrona', 'Entity')
CLTK: ('Germanis', 'Entity')
NLTK: ('unam', 'O')

Baseline: ('Sequana', 'Entity')
CLTK: ('Gallos', 'Entity')
NLTK: ('incolunt', 'O')

Baseline: ('Horum
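
Since `zip` pairs items by position, the rows above do not refer to the same word: the baseline and CLTK lists only contain the tagged tokens, while the NLTK list contains every token. One way to obtain a token-by-token comparison, sketched here as an assumption, is to keep a tag for every token (with `"O"` for non-entities, as in the commented-out `else` branch of the baseline). The names `tag_all_tokens` and `other_all` are hypothetical; the values in `other_all` are copied from the NLTK output above.

```python
def tag_all_tokens(tokens):
    """Baseline variant that keeps every token, labelling non-entities with "O"."""
    return [(token, "Entity" if token.istitle() else "O") for token in tokens]


tokens = "Gallia est omnis divisa in partes tres, quarum unam incolunt Belgae,".split(" ")
baseline_all = tag_all_tokens(tokens)

# Per-token tags of a second tagger, copied from the NLTK output above
other_all = [("Gallia", "LOC"), ("est", "O"), ("omnis", "O"), ("divisa", "O"),
             ("in", "O"), ("partes", "O"), ("tres,", "O"), ("quarum", "O"),
             ("unam", "O"), ("incolunt", "O"), ("Belgae,", "O")]

# Keep the tokens on which the two methods disagree about being an entity at all
disagreements = [
    (token, tag_a, tag_b)
    for (token, tag_a), (_, tag_b) in zip(baseline_all, other_all)
    if (tag_a != "O") != (tag_b != "O")
]
print(disagreements)
```

With both lists covering the same tokens in the same order, each row of the comparison describes a single word.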

# Exercise

Extract the named entities from the English translation of the *De Bello Gallico* book 1.

The CTS URN for this translation is `urn:cts:latinLit:phi0448.phi001.perseus-eng2:1`.

Modify the code above to use the English model of the Stanford tagger instead of the Italian one.

Hint: