# Table of Contents
 <p><div class="lev1 toc-item"><a href="#Description" data-toc-modified-id="Description-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Description</a></div><div class="lev1 toc-item"><a href="#Library-imports" data-toc-modified-id="Library-imports-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>Library imports</a></div><div class="lev1 toc-item"><a href="#Let's-grab-some-text-to-process" data-toc-modified-id="Let's-grab-some-text-to-process-3"><span class="toc-item-num">3&nbsp;&nbsp;</span>Let's grab some text to process</a></div><div class="lev2 toc-item"><a href="#What's-CTS?" data-toc-modified-id="What's-CTS?-31"><span class="toc-item-num">3.1&nbsp;&nbsp;</span>What's CTS?</a></div><div class="lev2 toc-item"><a href="#Querying-a-CTS-API" data-toc-modified-id="Querying-a-CTS-API-32"><span class="toc-item-num">3.2&nbsp;&nbsp;</span>Querying a CTS API</a></div><div class="lev1 toc-item"><a href="#Very-simple-baseline" data-toc-modified-id="Very-simple-baseline-4"><span class="toc-item-num">4&nbsp;&nbsp;</span>Very simple baseline</a></div><div class="lev1 toc-item"><a href="#NER-with-CLTK" data-toc-modified-id="NER-with-CLTK-5"><span class="toc-item-num">5&nbsp;&nbsp;</span>NER with CLTK</a></div><div class="lev1 toc-item"><a href="#NER-with-NLTK" data-toc-modified-id="NER-with-NLTK-6"><span class="toc-item-num">6&nbsp;&nbsp;</span>NER with NLTK</a></div><div class="lev1 toc-item"><a href="#NER-with-NLTK-on-the-translation" data-toc-modified-id="NER-with-NLTK-on-the-translation-7"><span class="toc-item-num">7&nbsp;&nbsp;</span>NER with NLTK on the translation</a></div><div class="lev1 toc-item"><a href="#Let's-create-some-indices" data-toc-modified-id="Let's-create-some-indices-8"><span class="toc-item-num">8&nbsp;&nbsp;</span>Let's create some <em>indices</em></a></div>

# Description

TODO

# Library imports

External modules and libraries can be imported using `import` statements.

Let's import the [Natural Language ToolKit (NLTK)](http://www.nltk.org/), the [Classical Language ToolKit (CLTK)](http://cltk.org/), [MyCapytain](http://mycapytain.readthedocs.io/en/latest/) and some local libraries that are used in this notebook.

In [23]:
import nltk
from nltk.tag import StanfordNERTagger
import cltk
# install with `pip install https://github.com/Capitains/MyCapytain/archive/master.zip`
# installing just with `pip install MyCapytain` won't work
import MyCapytain
import sys
sys.path.append("/opt/nlp/pymodules/")
from idai_journals.nlp import sub_leaves

And more precisely, we are using the following versions:

In [19]:
print(nltk.__version__)

3.2.2


In [20]:
print(cltk.__version__)

0.1.47


In [22]:
print(MyCapytain.__version__)

2.0.0b18


# Let's grab some text to process

To start with, we need some text from which we'll try to extract named entities using various methods and libraries.

There are several ways of doing this, e.g.:
1. copy and paste the text from Perseus or the Latin Library into a text document, and read it into a variable
2. load a text from one of the Latin corpora available via `cltk` (cf. this [blog post](https://disiectamembra.wordpress.com/2016/05/29/cltk-importing-the-latin-library-as-a-corpus/))
3. or load it from Perseus by leveraging its [Canonical Text Services]() API

Let's go for #3 :)

## What's CTS?

CTS URN stands for Canonical Text Services Uniform Resource Name.

You can think of a CTS URN like a social security number for a text (or part of a text).

**TODO**: insert image explaining CTS

- `urn:cts:latinLit:phi0448` (Caesar)
- `urn:cts:latinLit:phi0448.phi001` (Caesar's *De Bello Gallico*)
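- `urn:cts:latinLit:phi0448.phi001.perseus-lat2` (a specific Latin edition of *De Bello Gallico*)
- `urn:cts:latinLit:phi0448.phi001.perseus-lat2:1` (book 1 of that edition)
- `urn:cts:latinLit:phi0448.phi001.perseus-lat2:1.1.1` (book 1, chapter 1, section 1)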
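
The URNs listed above all follow the same colon- and dot-separated layout, which can be taken apart with plain string operations. The sketch below is only an illustration: `CtsUrn` and `parse_cts_urn` are names made up for this notebook (they are not part of MyCapytain), and no check is made that the identifiers actually exist.

```python
from typing import NamedTuple, Optional


class CtsUrn(NamedTuple):
    namespace: str            # e.g. "latinLit"
    textgroup: str            # e.g. "phi0448" (Caesar)
    work: Optional[str]       # e.g. "phi001" (De Bello Gallico)
    version: Optional[str]    # e.g. "perseus-lat2"
    passage: Optional[str]    # e.g. "1.1.1"


def parse_cts_urn(urn: str) -> CtsUrn:
    """Split a CTS URN into its components (illustrative, no validation of IDs)."""
    parts = urn.split(":")
    if len(parts) < 4 or parts[:2] != ["urn", "cts"]:
        raise ValueError("not a CTS URN: " + urn)
    namespace = parts[2]
    # The work component is dot-separated: textgroup[.work[.version]]
    work_parts = parts[3].split(".") + [None, None]
    textgroup, work, version = work_parts[:3]
    # The passage reference, if any, follows the last colon
    passage = parts[4] if len(parts) > 4 else None
    return CtsUrn(namespace, textgroup, work, version, passage)


print(parse_cts_urn("urn:cts:latinLit:phi0448.phi001.perseus-lat2:1.1.1"))
```

Each URN in the list simply fills in one more field than the previous one, from the author down to a single section.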

How do I find out the CTS URN of a given author or text?

The Perseus Catalog is your friend!

## Querying a CTS API

In [59]:
from MyCapytain.resolvers.cts.api import HttpCTSResolver
from MyCapytain.retrievers.cts5 import CTS
from MyCapytain.common.constants import Mimetypes

# We set up a resolver which communicates with an API available in Leipzig
resolver = HttpCTSResolver(CTS("http://cts.dh.uni-leipzig.de/api/cts/"))

# We require some metadata information
textMetadata = resolver.getMetadata("urn:cts:latinLit:phi0448.phi001.perseus-lat2")

# Texts in CTS metadata have one interesting property: their citation scheme.
# Citations are embedded objects that carry information about how a text can be quoted and what depth it has
print([citation.name for citation in textMetadata.citation])

['Book', 'Chapter', 'Section']


In [56]:
# Now, we want to retrieve the first book of the **De Bello Gallico**
passage = resolver.getTextualNode("urn:cts:latinLit:phi0448.phi001.perseus-lat2", subreference="1")
# And we want to have its content exported to plain text
de_bello_gallico_book1 = passage.export(Mimetypes.PLAINTEXT)

The text that we have just fetched by using a programming interface (API) can also be [viewed in the browser](http://cts.dh.uni-leipzig.de/read/latinLit/phi0448/phi001/perseus-lat2/1).

Or even imported as an iframe into this notebook! 

In [60]:
from IPython.display import IFrame
IFrame('http://cts.dh.uni-leipzig.de/read/latinLit/phi0448/phi001/perseus-lat2/1', width=1000, height=350)

Let's see how many words (tokens, more properly) there are in Caesar's *De Bello Gallico* I:

In [51]:
len(de_bello_gallico_book1.split(" "))

8176
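
Note that this count is approximate: `split(" ")` breaks the text on single spaces only, so words separated by a newline stay glued together and punctuation remains attached to tokens. A small sketch on a made-up sample (not the fetched text) shows how the three simplest tokenisation strategies differ:

```python
import re

# Made-up sample with a line break inside it
sample = "Gallia est omnis divisa\nin partes tres, quarum unam"

by_space = sample.split(" ")        # splits on single spaces only
by_whitespace = sample.split()      # splits on any run of whitespace
words = re.findall(r"\w+", sample)  # keeps only runs of word characters

print(len(by_space), by_space)
print(len(by_whitespace), by_whitespace)
print(len(words), words)
```

Which variant is preferable depends on the downstream tool; the rest of this notebook keeps using `split(" ")`.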

In [53]:
#de_bello_gallico_book1

# Very simple baseline

**TODO**: rewrite the code below as a plain `for` loop, then show how to express it as a list comprehension (with a quotation from the NLTK book?)

In [7]:
# Every title-cased token is treated as a named entity, with a few tokens of context
tokens = de_bello_gallico_book1.split(" ")
baseline_named_entities = [
    {"ne": token, "context": " ".join(tokens[n-5:n+5])}
    for n, token in enumerate(tokens)
    if token.istitle()
]
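
The list comprehension above can be unrolled into an ordinary `for` loop. The sketch below is one way of doing so; `title_case_baseline` and its `window` parameter are names introduced here for illustration. Unlike the comprehension, it clamps the start of the context window at zero, since a negative start index counts from the end of the list and yields an empty context for the first few tokens.

```python
def title_case_baseline(text, window=5):
    """Collect title-cased tokens together with a few tokens of context."""
    tokens = text.split(" ")
    entities = []
    for n, token in enumerate(tokens):
        if not token.istitle():
            continue
        # Clamp the start index: a negative start would count from the end
        # of the list and produce an empty context for the first few tokens.
        start = max(0, n - window)
        context = " ".join(tokens[start:n + window])
        entities.append({"ne": token, "context": context})
    return entities


sample = "Gallia est omnis divisa in partes tres, quarum unam incolunt Belgae, aliam Aquitani,"
for entity in title_case_baseline(sample):
    print(entity["ne"], "|", entity["context"])
```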

In [8]:
len(baseline_named_entities)

968

In [9]:
baseline_named_entities[:100]

[{'context': '', 'ne': 'Gallia'},
 {'context': 'partes tres, quarum unam incolunt Belgae, aliam Aquitani, tertiam qui',
  'ne': 'Belgae,'},
 {'context': 'quarum unam incolunt Belgae, aliam Aquitani, tertiam qui ipsorum lingua',
  'ne': 'Aquitani,'},
 {'context': 'Aquitani, tertiam qui ipsorum lingua Celtae, nostra Galli appellantur. Hi',
  'ne': 'Celtae,'},
 {'context': 'qui ipsorum lingua Celtae, nostra Galli appellantur. Hi omnes lingua,',
  'ne': 'Galli'},
 {'context': 'lingua Celtae, nostra Galli appellantur. Hi omnes lingua, institutis, legibus',
  'ne': 'Hi'},
 {'context': 'institutis, legibus inter se differunt. Gallos ab Aquitanis Garumna flumen,',
  'ne': 'Gallos'},
 {'context': 'inter se differunt. Gallos ab Aquitanis Garumna flumen, a Belgis',
  'ne': 'Aquitanis'},
 {'context': 'se differunt. Gallos ab Aquitanis Garumna flumen, a Belgis Matrona',
  'ne': 'Garumna'},
 {'context': 'ab Aquitanis Garumna flumen, a Belgis Matrona et Sequana dividit.',
  'ne': 'Belgis'},
 {'contex

# NER with CLTK

**TODO**: explain briefly the method used by CLTK (dictionary lookup)
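
To give an idea of what a dictionary lookup means in this context, here is a toy sketch. It is not CLTK's implementation: `tag_by_lookup` and the three-word list are made up for illustration. The only thing it borrows is the shape of the output, where recognised tokens come back as `(token, 'Entity')` pairs and all others as one-element tuples.

```python
def tag_by_lookup(text, known_entities):
    """Tag tokens found in `known_entities`; leave the others untagged."""
    tagged = []
    for token in text.split():
        word = token.strip(".,;:!?")  # drop surrounding punctuation
        if word in known_entities:
            tagged.append((word, "Entity"))
        else:
            tagged.append((word,))
    return tagged


toy_gazetteer = {"Gallia", "Belgae", "Aquitani"}  # made-up mini word list
print(tag_by_lookup("Gallia est omnis divisa in partes tres, quarum unam incolunt Belgae,", toy_gazetteer))
```

A lookup of this kind can only find names that are already in the list, and it tags every occurrence regardless of context.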

In [10]:
from cltk.tag.ner import tag_ner

In [11]:
tagged_text = tag_ner('latin', input_text=de_bello_gallico_book1)

In [12]:
# Untagged tokens come back as 1-tuples, recognised entities as (token, 'Entity') pairs
named_entities = [tagged_token for tagged_token in tagged_text if len(tagged_token) > 1]

In [13]:
len(named_entities)

603
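
This figure counts every mention, so frequent names are counted many times, and it also includes words such as *Hi*, *Is* or *Id*, which are pronouns rather than names. A minimal sketch of how one might tally distinct entities and discard such words is given below; the sample tuples and the `not_entities` stop list are made up for illustration.

```python
from collections import Counter

# A few tuples shaped like the tagger output used in this notebook
sample_tagged = [("Gallia", "Entity"), ("est",), ("Belgae", "Entity"),
                 ("Hi", "Entity"), ("Gallia", "Entity")]

# Hypothetical stop list of capitalised function words to discard
not_entities = {"Hi", "Is", "Id", "His", "Eorum"}

counts = Counter(
    tagged[0] for tagged in sample_tagged
    if len(tagged) > 1 and tagged[0] not in not_entities
)
print(counts.most_common(3))
```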

In [14]:
named_entities[:100]

[('Gallia', 'Entity'),
 ('Belgae', 'Entity'),
 ('Aquitani', 'Entity'),
 ('Celtae', 'Entity'),
 ('Galli', 'Entity'),
 ('Hi', 'Entity'),
 ('Gallos', 'Entity'),
 ('Aquitanis', 'Entity'),
 ('Belgis', 'Entity'),
 ('Matrona', 'Entity'),
 ('Sequana', 'Entity'),
 ('Belgae', 'Entity'),
 ('Germanis', 'Entity'),
 ('Rhenum', 'Entity'),
 ('Gallos', 'Entity'),
 ('Germanis', 'Entity'),
 ('Eorum', 'Entity'),
 ('Gallos', 'Entity'),
 ('Rhodano', 'Entity'),
 ('Oceano', 'Entity'),
 ('Belgarum', 'Entity'),
 ('Sequanis', 'Entity'),
 ('Rhenum', 'Entity'),
 ('Belgae', 'Entity'),
 ('Galliae', 'Entity'),
 ('Rheni', 'Entity'),
 ('Aquitania', 'Entity'),
 ('Pyrenaeos', 'Entity'),
 ('Oceani', 'Entity'),
 ('Hispaniam', 'Entity'),
 ('Orgetorix', 'Entity'),
 ('Is', 'Entity'),
 ('Messala', 'Entity'),
 ('Pisone', 'Entity'),
 ('Galliae', 'Entity'),
 ('Id', 'Entity'),
 ('Rheno', 'Entity'),
 ('Germanis', 'Entity'),
 ('Iura', 'Entity'),
 ('Sequanos', 'Entity'),
 ('Lemanno', 'Entity'),
 ('Rhodano', 'Entity'),
 ('His', 'Entit

# NER with NLTK

In [15]:
stanford_model_italian = "/opt/nlp/stanford-tools/stanford-ner-2015-12-09/classifiers/ner-ita-nogpe-noiob_gaz_wikipedia_sloppy.ser.gz"

In [17]:
ner_tagger = StanfordNERTagger(stanford_model_italian)

In [18]:
nltk_tagged_text = ner_tagger.tag(de_bello_gallico_book1.split(" "))

In [19]:
# Keep only the tokens labelled as part of an entity ("O" marks tokens outside any entity)
nltk_named_entities = [(token, ne_tag) for token, ne_tag in nltk_tagged_text if ne_tag != "O"]

In [20]:
len(nltk_named_entities)

175

In [21]:
nltk_named_entities

[('COMMENTARIUS', 'LOC'),
 ('PRIMUS', 'LOC'),
 ('Gallia', 'LOC'),
 ('Galli', 'PER'),
 ('appellantur.', 'PER'),
 ('Hi', 'PER'),
 ('Gallos', 'PER'),
 ('Belgis', 'PER'),
 ('Matrona', 'PER'),
 ('Gallos', 'PER'),
 ('Gallos', 'PER'),
 ('Aquitania', 'LOC'),
 ('M.', 'PER'),
 ('Pisone', 'PER'),
 ('Ubi', 'ORG'),
 ('d.', 'PER'),
 ('V.', 'PER'),
 ('Kal.', 'PER'),
 ('Apr.', 'PER'),
 ('L.', 'PER'),
 ('Pisone,', 'PER'),
 ('A.', 'PER'),
 ('Gabinio', 'PER'),
 ('consulibus.', 'PER'),
 ('Caesari', 'PER'),
 ('Gallia', 'LOC'),
 ('Ubi', 'ORG'),
 ('Ubi', 'ORG'),
 ('Sequanis', 'ORG'),
 ('Caesari', 'ORG'),
 ('renuntiatur', 'ORG'),
 ('Helvetiis', 'ORG'),
 ('Sequanorum', 'ORG'),
 ('et', 'ORG'),
 ('Haeduorum', 'ORG'),
 ('Caesarem', 'ORG'),
 ('certiorem', 'ORG'),
 ('Caesar', 'PER'),
 ('Ubi', 'ORG'),
 ('per', 'ORG'),
 ('exploratores', 'ORG'),
 ('Caesar', 'ORG'),
 ('Caesar', 'PER'),
 ('Arari', 'LOC'),
 ('Divico', 'PER'),
 ('Cassiano', 'PER'),
 ('Romanus', 'PER'),
 ('His', 'PER'),
 ('Caesar', 'PER'),
 ('Divico', 'PER

In [22]:
from nltk.chunk import RegexpParser

In [23]:
italian_chunker = RegexpParser(r'''
PER:
    {<PER><(PER|LOC|MISC|ORG)>*}
LOC:
    {<LOC><(PER|LOC|MISC|ORG)>*}
ORG:
    {<ORG><(PER|LOC|MISC|ORG)>*}
MISC:
    {<MISC><(PER|LOC|MISC|ORG)>*}
''')

In [24]:
trees = italian_chunker.parse(nltk_tagged_text)

In [25]:
trees



ImportError: No module named '_tkinter'

Tree('S', [Tree('LOC', [('COMMENTARIUS', 'LOC'), ('PRIMUS', 'LOC'), ('Gallia', 'LOC')]), ('est', 'O'), ('omnis', 'O'), ('divisa', 'O'), ('in', 'O'), ('partes', 'O'), ('tres,', 'O'), ('quarum', 'O'), ('unam', 'O'), ('incolunt', 'O'), ('Belgae,', 'O'), ('aliam', 'O'), ('Aquitani,', 'O'), ('tertiam', 'O'), ('qui', 'O'), ('ipsorum', 'O'), ('lingua', 'O'), ('Celtae,', 'O'), ('nostra', 'O'), Tree('PER', [('Galli', 'PER'), ('appellantur.', 'PER'), ('Hi', 'PER')]), ('omnes', 'O'), ('lingua,', 'O'), ('institutis,', 'O'), ('legibus', 'O'), ('inter', 'O'), ('se', 'O'), ('differunt.', 'O'), Tree('PER', [('Gallos', 'PER')]), ('ab', 'O'), ('Aquitanis', 'O'), ('Garumna', 'O'), ('flumen,', 'O'), ('a', 'O'), Tree('PER', [('Belgis', 'PER'), ('Matrona', 'PER')]), ('et', 'O'), ('Sequana', 'O'), ('dividit.', 'O'), ('Horum', 'O'), ('omnium', 'O'), ('fortissimi', 'O'), ('sunt', 'O'), ('Belgae,', 'O'), ('propterea', 'O'), ('quod', 'O'), ('a', 'O'), ('cultu', 'O'), ('atque', 'O'), ('humanitate', 'O'), ('provin

In [29]:
sub_leaves(trees, label="ORG")

[[('Ubi', 'ORG')],
 [('Ubi', 'ORG')],
 [('Ubi', 'ORG')],
 [('Sequanis', 'ORG')],
 [('Caesari', 'ORG'), ('renuntiatur', 'ORG'), ('Helvetiis', 'ORG')],
 [('Sequanorum', 'ORG'), ('et', 'ORG'), ('Haeduorum', 'ORG')],
 [('Caesarem', 'ORG'), ('certiorem', 'ORG')],
 [('Ubi', 'ORG'), ('per', 'ORG'), ('exploratores', 'ORG'), ('Caesar', 'ORG')],
 [('Ubi', 'ORG')],
 [('Ipse', 'ORG')],
 [('exploratores', 'ORG'), ('Caesar', 'ORG'), ('cognovit', 'ORG')],
 [('Ipse', 'ORG')],
 [('Nam', 'ORG')],
 [('Diu', 'ORG')],
 [('Ipse', 'ORG')],
 [('C', 'ORG')],
 [('Ipse', 'ORG'), ('autem', 'ORG'), ('Ariovistus', 'ORG')],
 [('Cimbris', 'ORG'), ('et', 'ORG'), ('Teutonis', 'ORG')],
 [('Huic', 'ORG'), ('legioni', 'ORG'), ('Caesar', 'ORG')],
 [('Ubi', 'ORG')],
 [('Ubi', 'ORG')],
 [('Ubi', 'ORG')],
 [('citeriorem', 'ORG'), ('Galliam', 'ORG')]]
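
The `sub_leaves` helper comes from a local module whose source is not shown in this notebook. Judging from the output above, the chunk grammar and the helper together group consecutive entity-tagged tokens into runs, label each run with the tag of its first token, and return the runs carrying a given label. The plain-Python sketch below approximates that behaviour under this assumption; `group_entity_runs` and `runs_with_label` are hypothetical names, and nested chunks are not modelled.

```python
def group_entity_runs(tagged_tokens, outside="O"):
    """Group consecutive entity-tagged tokens; label each run by its first tag."""
    chunks, current = [], None
    for token, tag in tagged_tokens:
        if tag == outside:
            current = None          # an untagged token ends the current run
        elif current is None:
            current = (tag, [(token, tag)])
            chunks.append(current)
        else:
            current[1].append((token, tag))
    return chunks


def runs_with_label(chunks, label):
    """Return the token lists of all runs carrying the given label."""
    return [tokens for chunk_label, tokens in chunks if chunk_label == label]


sample = [("Gallia", "LOC"), ("est", "O"), ("M.", "PER"), ("Pisone", "PER"),
          ("consulibus.", "O"), ("Sequanorum", "ORG"), ("et", "ORG"), ("Haeduorum", "ORG")]
chunks = group_entity_runs(sample)
print(runs_with_label(chunks, "ORG"))
```

Because it works on the list of `(token, tag)` pairs directly, this version needs neither NLTK's tree classes nor a graphical backend.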

# NER with NLTK on the translation

The English translation of book 1 has the CTS URN `urn:cts:latinLit:phi0448.phi001.perseus-eng2:1`.

# Let's create some *indices*