<img align="left" src="https://ithaka-labs.s3.amazonaws.com/static-files/images/tdm/tdmdocs/tapi-logo-small.png" />

This notebook is free for educational reuse under a [Creative Commons CC BY License](https://creativecommons.org/licenses/by/4.0/).

Created by [Firstname Lastname](https://) for the 2022 Text Analysis Pedagogy Institute, with support from the [National Endowment for the Humanities](https://neh.gov), [JSTOR Labs](https://labs.jstor.org/), and [University of Arizona Libraries](https://new.library.arizona.edu/).

For questions/comments/improvements, email author@email.address.<br />
____

# `Multilingual NER` `2`

### Rules-Based Multilingual NER
This is lesson `2` of 3 in the educational series on `multilingual NER`. This notebook is intended `to introduce rules-based NER`. 

**Audience:** `Teachers` / `Learners` / `Researchers`

**Use case:** `Tutorial` / `How-To` / `Reference` / `Explanation` 

**Difficulty:** `Beginner` / `Intermediate` / `Advanced`

`Beginner assumes users are relatively new to Python and Jupyter Notebooks. The user is helped step-by-step with lots of explanatory text.`
`Intermediate assumes users are familiar with Python and have been programming for 6+ months. Code makes up a larger part of the notebook and basic concepts related to Python are not explained.`
`Advanced assumes users are very familiar with Python and have been programming for years, but they may not be familiar with the process being explained.`

**Completion time:** `90 minutes`

**Knowledge Required:** 
```
* Python basics (variables, flow control, functions, lists, dictionaries)
* Object-oriented programming (classes, instances, inheritance)
```

**Knowledge Recommended:**
```
* Basic file operations (open, close, read, write)
```

**Learning Objectives:**
After this lesson, learners will be able to:
```
1. Use spaCy to perform NER
2. Create an EntityRuler
3. Identify the languages in a corpus
4. Understand a bit about unsupervised learning
```
___

# Required Python Libraries
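* spaCy - for NLP
* spacy_langdetect - for language detection
* pandas - for loading and saving tabular data
* sentence_transformers - for multilingual sentence embeddings
* umap-learn - for dimensionality reduction
* bulk - for exploring the embedded texts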

## Install Required Libraries

In [2]:
!pip install spacy
!pip install spacy_langdetect
!pip install bulk
!pip install pandas
!pip install umap-learn
!pip install sentence_transformers
!python -m spacy download en_core_web_sm
!python -m spacy download es_core_news_sm


In [3]:
### Import Libraries ###
import spacy
import pandas as pd
from umap import UMAP
from sentence_transformers import SentenceTransformer

In [4]:
# Load a multilingual sentence-embedding model (fine-tuned for Latin)
model = SentenceTransformer('silencesys/paraphrase-xlm-r-multilingual-v1-fine-tuned-for-latin')


# Introduction to spaCy

The spaCy library (note the capitalization) is a robust machine-learning NLP library developed by Explosion AI, a Berlin-based team of computer scientists and computational linguists. It supports a wide variety of European languages out of the box, with statistical models capable of parsing texts, identifying parts of speech, and extracting entities. spaCy also makes it easy to improve existing models or train custom models from scratch on domain-specific texts.

In this notebook, we will go through the steps for installing spaCy, downloading a pretrained language model, and performing the essential tasks of NLP.

## Sentence Tokenization

A common essential task of NLP is known as tokenization. We looked at tokenization briefly in the last notebook in which we wanted to break a text into individual components. This is one form of tokenization known as word tokenization. There are, however, many other forms, such as sentence tokenization. Sentence tokenization is precisely the same as word tokenization, except instead of breaking a text up into individual word and punctuation components, we break a text up into individual sentences.

If you are familiar with Python, you may be familiar with the built-in string method split(), which splits a text on whitespace (the default) or on a string passed as an argument that defines where to split, i.e. split("."). A common practice (without NLP frameworks) is to split a text into sentences by simply using split(), but this is ill-advised. Let us consider the example below.

In [5]:
text = "Martin J. Thompson is known for his writing skills. He is also good at programming."

In [6]:
#Now, let's try and use the split function to split the text object based on punctuation.
new = text.split(".")
print (new)

['Martin J', ' Thompson is known for his writing skills', ' He is also good at programming', '']


While we successfully split the two sentences, we had the unfortunate result of also splitting at Martin J. The reason for this may be obvious. In English, it is common convention to indicate an abbreviation with the same punctuation mark used to indicate the end of a sentence. This convention dates back to the early Middle Ages, when Irish monks began to introduce punctuation and spacing to better read Latin (a story for another day).

The very thing that makes texts easier to read, however, greatly hinders our ability to easily split sentences. For this reason, another method is needed. This is where sentence tokenization comes into play. In order to see how sentence tokenization differs, let’s begin with our first spaCy usage.
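
To see how far a hand-written rule can get before it breaks down, here is a minimal sketch using only Python's `re` module. It is not how spaCy splits sentences; the helper name `naive_sentence_split` and the rule itself (do not split after a single capital-letter initial) are assumptions made for illustration.

```python
import re

# Split after ., ! or ? when followed by whitespace and a capital letter,
# but not when the period belongs to a single-letter initial such as "J.".
SENTENCE_BOUNDARY = re.compile(r"(?<!\b[A-Z]\.)(?<=[.!?])\s+(?=[A-Z])")


def naive_sentence_split(text):
    """Hypothetical helper: a rule-based splitter for illustration only."""
    return [part.strip() for part in SENTENCE_BOUNDARY.split(text) if part.strip()]


text = "Martin J. Thompson is known for his writing skills. He is also good at programming."
for sentence in naive_sentence_split(text):
    print(sentence)
```

The rule handles this example, but it still breaks on other abbreviations: "Dr. Smith arrived." is split after "Dr.", and a sentence ending in "U.S." is not split at all. Each new exception needs another rule, which is why a dedicated sentence tokenizer is the better tool.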

In [7]:
#First, we import spaCy
import spacy

Next, we need to load an NLP model object. To do this, we use the spacy.load() function. This will take one argument, the model one wishes to load. We will use the small English model.

In [9]:
#standard spaCy English model
#more languages at https://spacy.io/usage/models

nlp = spacy.load("en_core_web_sm") #small model; medium and large models are also available

With the nlp object created, we can use it to parse a text. To do this, we create a doc object. This object will contain a lot of data on the text.

In [10]:
#processed text as an object
doc = nlp(text)

In [11]:
print (doc)

Martin J. Thompson is known for his writing skills. He is also good at programming.


While this looks identical to the "text" string above, it is quite different. To demonstrate this, let us use the sentence tokenizer.

In [12]:
#iterate over sentences
#tokenize at the sentence level
# will treat sentences with a semicolon; and handles line breaks -- works with html

for sent in doc.sents:
    print (sent)

Martin J. Thompson is known for his writing skills.
He is also good at programming.


# spaCy's Built-In NER

Another essential task of NLP, and the chief subject of this series, is named entity recognition (NER). I spoke about NER in the last notebook. Here, I’d like to demonstrate how to perform basic NER via spaCy. Again, we will iterate over the doc object as we did above, but instead of iterating over doc.sents, we will iterate over doc.ents. For our purposes right now, I simply want to print off each entity’s text (the string itself) and its corresponding label (note the _ after label). I will be explaining this process in much greater detail in the next two notebooks.

In [13]:
# ents contains all entity data provided by the pipeline

for ent in doc.ents:
    print (ent.text, ent.label_) #label_ (with underscore) is the string label; label (no underscore) is its integer ID

Martin J. Thompson PERSON


As we can see the small spaCy statistical machine learning model has correctly identified that Martin J. Thompson is, in fact, an entity. What kind of entity? A person. We will explore how it made this determination in notebook Day-03 in which we explore machine learning NLP more closely.

# spaCy's EntityRuler

The Python library spaCy offers a few different methods for performing rules-based NER. One such method is via its EntityRuler.

The EntityRuler is a spaCy factory that allows one to create a set of patterns with corresponding labels. A factory in spaCy is a set of classes and functions preloaded in spaCy that perform set tasks. In the case of the EntityRuler, the factory at hand allows the user to create an EntityRuler, give it a set of instructions, and then use these instructions to find and label entities.

Once the user has created the EntityRuler and given it a set of instructions, the user can then add it to the spaCy pipeline as a new pipe. I have spoken in the past notebooks briefly about pipes, but perhaps it is good to address them in more detail here.

A pipe is a component of a pipeline. A pipeline’s purpose is to take input data, perform some sort of operations on that input data, and then output the results of those operations either as new data or as extracted metadata. In the case of spaCy, there are a few different pipes that perform different tasks: the tokenizer splits the text into individual tokens, the parser analyzes the grammatical structure of the text, and the NER component identifies entities and labels them accordingly. All of this data is stored in the Doc object as we saw in Notebook 01_02 of this series.

It is important to remember that pipelines are sequential. This means that components earlier in a pipeline affect what later components receive. Sometimes this sequence is essential, meaning later pipes depend on earlier pipes. At other times, this sequence is not essential, meaning later pipes can function without earlier pipes. It is important to keep this in mind as you create custom spaCy models (or any pipeline for that matter).
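
The idea of sequential pipes can be sketched in plain Python. This is a toy illustration, not spaCy's implementation: the function names and the dictionary used as a shared record are assumptions made for the example.

```python
# A toy pipeline: each "pipe" is a function that receives a shared record
# and adds its own results to it before passing it on.

def tokenize(record):
    record["tokens"] = record["text"].split()
    return record


def tag_capitalized(record):
    # Depends on the tokenizer having run first.
    record["capitalized"] = [tok for tok in record["tokens"] if tok[:1].isupper()]
    return record


def count_characters(record):
    # Independent of the other pipes: it only needs the raw text.
    record["n_chars"] = len(record["text"])
    return record


def run_pipeline(text, pipes):
    record = {"text": text}
    for pipe in pipes:
        record = pipe(record)
    return record


result = run_pipeline("Treblinka is in Poland", [tokenize, tag_capitalized, count_characters])
print(result["capitalized"])

# Reversing the first two pipes breaks the dependency.
try:
    run_pipeline("Treblinka is in Poland", [tag_capitalized, tokenize])
except KeyError as error:
    print("Missing input from an earlier pipe:", error)
```

Here `tag_capitalized` is an example of a pipe whose position is essential, while `count_characters` could sit anywhere in the sequence.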

In this notebook, we will be looking closely at the EntityRuler as a component of a spaCy model’s pipeline. Off-the-shelf spaCy models come preloaded with an NER model; they do not, however, come with an EntityRuler. In order to incorporate an EntityRuler into a spaCy model, it must be created as a new pipe, given instructions, and then added to the model. Once this is complete, the user can save that new model with the EntityRuler to the disk.

The full documentation of spaCy EntityRuler can be found here: https://spacy.io/api/entityruler .

This notebook will synthesize this documentation for non-specialists and provide some examples of it in action.
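
The create, instruct, and save steps described above can be sketched as follows. This is a minimal sketch under two assumptions: it uses a blank English pipeline (so no model download is needed) and a temporary directory as the save location; in practice you would save to a folder of your choosing.

```python
import tempfile

import spacy

# A blank English pipeline keeps the sketch free of model downloads.
nlp = spacy.blank("en")
ruler = nlp.add_pipe("entity_ruler")
ruler.add_patterns([{"label": "GPE", "pattern": "Treblinka"}])

with tempfile.TemporaryDirectory() as tmp_dir:
    # Write the whole pipeline, including the ruler's patterns, to disk.
    nlp.to_disk(tmp_dir)
    # Reload it from that directory, as a new session would.
    reloaded = spacy.load(tmp_dir)

doc = reloaded("The village of Treblinka is in Poland.")
print(reloaded.pipe_names)
print([(ent.text, ent.label_) for ent in doc.ents])
```

Because the blank pipeline has no statistical NER component, only the pattern supplied to the ruler is found; Poland is not tagged here.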

In [14]:
#Import the requisite library
import spacy

#Build upon the spaCy Small Model
nlp = spacy.load("en_core_web_sm")

#Sample text
text = "The village of Treblinka is in Poland. Treblinka was also an extermination camp."

#Create the Doc object
doc = nlp(text)

#extract entities
for ent in doc.ents:
    print (ent.text, ent.label_)

Poland GPE


Depending on the version of the model you are using, some results may vary.

The output from the code above demonstrates the small spaCy model’s difficulty identifying Treblinka, which is a small village in Poland. As the sample text indicates, it was also an extermination camp during WWII. In the output shown above, the model tagged only Poland and missed Treblinka in both sentences; other versions of the model have tagged the first mention as LOC (location) and missed the second entirely. Either result is imprecise or wrong. I would have accepted ORG for the second sentence, as spaCy’s model does not know how to classify an extermination camp, but what these results demonstrate is the model’s failure to generalize on data. The reason? There are a few, but I suspect the model never encountered the word Treblinka.

This is a common problem in NLP for specific domains. Off-the-shelf models often fail in the domains where we wish to deploy them because they have not been trained on domain-specific texts. We can resolve this, however, either via spaCy’s EntityRuler or via training a new model. As we will see over the next few notebooks, we can use spaCy’s EntityRuler to easily achieve both.

For now, let’s first remedy the issue by giving the model instructions for correctly identifying Treblinka. For simplicity, we will use spaCy’s GPE label. In a later notebook, we will teach a model to correctly identify Treblinka in the latter context as a concentration camp.

In [17]:
#Import the requisite library
import spacy

#Build upon the spaCy Small Model
nlp = spacy.load("en_core_web_sm")

#Sample text
text = "The village of Treblinka is in Poland. Treblinka was also an extermination camp."

#Create the EntityRuler
#can now act as a variable
ruler = nlp.add_pipe("entity_ruler")

#List of Entities and Patterns
#SpaCy can assign labels to these patterns
# a list of dictionaries, each with a "label" for the entity and a "pattern" (the keyword to match)
patterns = [
                {"label": "GPE", "pattern": "Treblinka"}
            ]

ruler.add_patterns(patterns)


doc = nlp(text)

#extract entities
for ent in doc.ents:
    print (ent.text, ent.label_, ent.label)

Treblinka GPE 384
Poland GPE 384
Treblinka GPE 384


If you executed the code above and found that you had the same output, then you did everything correctly. Let's now analyze our pipeline with nlp.analyze_pipes().

In [16]:
#allows you to look at the structure of your pipeline
#the span ruler can be very helpful for annotation
nlp.analyze_pipes()

{'summary': {'tok2vec': {'assigns': ['doc.tensor'],
   'requires': [],
   'scores': [],
   'retokenizes': False},
  'tagger': {'assigns': ['token.tag'],
   'requires': [],
   'scores': ['tag_acc'],
   'retokenizes': False},
  'parser': {'assigns': ['token.dep',
    'token.head',
    'token.is_sent_start',
    'doc.sents'],
   'requires': [],
   'scores': ['dep_uas',
    'dep_las',
    'dep_las_per_type',
    'sents_p',
    'sents_r',
    'sents_f'],
   'retokenizes': False},
  'attribute_ruler': {'assigns': [],
   'requires': [],
   'scores': [],
   'retokenizes': False},
  'lemmatizer': {'assigns': ['token.lemma'],
   'requires': [],
   'scores': ['lemma_acc'],
   'retokenizes': False},
  'ner': {'assigns': ['doc.ents', 'token.ent_iob', 'token.ent_type'],
   'requires': [],
   'scores': ['ents_f', 'ents_p', 'ents_r', 'ents_per_type'],
   'retokenizes': False},
  'entity_ruler': {'assigns': ['doc.ents', 'token.ent_type', 'token.ent_iob'],
   'requires': [],
   'scores': ['ents_f', 'ent

# Detecting Languages in Texts

Often in multilingual NER, we need to first understand our corpus. In order to do this, we need to analyze the whole corpus at once to determine which languages we need to account for. If we are working with modern languages, we have several different approaches for doing this. First, we can use an off-the-shelf model to identify the languages for a given document.

## Language Detection with SpaCy and LangDetect

There are several libraries for doing this in Python, but let's first look at LangDetect, which has a wrapper for spaCy 2 that we can update to spaCy 3 with the code below.

In [18]:
#Source: https://stackoverflow.com/questions/66712753/how-to-use-languagedetector-from-spacy-langdetect-package
import spacy
from spacy.language import Language
from spacy_langdetect import LanguageDetector

@Language.factory("language_detector")
def get_lang_detector(nlp, name):
    return LanguageDetector()

The above code imports the LanguageDetector class from spacy_langdetect and updates the code in the documentation by correctly assigning it as a factory that we can use. Let's now create a model and load it as a pipe.

In [19]:
nlp = spacy.load("en_core_web_sm")
nlp.add_pipe('language_detector', last=True)

<spacy_langdetect.spacy_langdetect.LanguageDetector at 0x7f173abc3220>

Now that we have the language_detector pipe added to a model, we can use it and access the special attribute, "language" that is attached to the Doc container.

In [20]:
text = "This is an English text."
doc = nlp(text)
print(doc._.language)

{'language': 'en', 'score': 0.9999975649857006}


In [21]:
text = "Ceci est un texte français"
doc = nlp(text)
print(doc._.language)

{'language': 'fr', 'score': 0.9999954392816763}


In [22]:
text = "Este é um outro texto sem idioma especificado."
doc = nlp(text)
print(doc._.language)

{'language': 'pt', 'score': 0.9999976737392297}


## Languages that are not well Represented in Machine Learning Models

All of these examples look good, but let's give LangDetect an unfair test, something it never saw before: Modern Irish.

In [23]:
#Gaelic (Modern Irish), an underrepresented language

text = "Seo í an Ghaeilge."
doc = nlp(text)
print(doc._.language)

{'language': 'de', 'score': 0.7142808371494478}


And here we see the limits of something like LangDetect. There are other models available that include less common languages, but we can also use machine learning or dictionaries to identify the specific languages that are not represented, e.g. look for characteristic words in a text (or, in the case of languages like Old Kannada, which are written in their own Unicode script, we can even look for the characters).
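
Both ideas, a dictionary lookup and a character check, can be sketched with the standard library alone. This is a minimal sketch under stated assumptions: the word list `IRISH_MARKERS` is a tiny hand-picked sample rather than a real lexicon, and the two helper functions are hypothetical names, not part of spaCy or LangDetect.

```python
import unicodedata

# Assumption: a tiny, hand-picked list of common Irish function words.
# A real project would use a much larger, curated word list.
IRISH_MARKERS = {"an", "agus", "seo", "tá", "í", "sé", "sí", "na"}


def dictionary_score(text, markers):
    """Share of tokens in `text` that appear in the marker word list."""
    tokens = [tok.strip(".,;:!?").lower() for tok in text.split()]
    tokens = [tok for tok in tokens if tok]
    if not tokens:
        return 0.0
    return sum(tok in markers for tok in tokens) / len(tokens)


def script_share(text, script_name):
    """Share of letters whose Unicode name starts with `script_name`."""
    letters = [ch for ch in text if ch.isalpha()]
    if not letters:
        return 0.0
    hits = sum(unicodedata.name(ch, "").startswith(script_name) for ch in letters)
    return hits / len(letters)


print(dictionary_score("Seo í an Ghaeilge.", IRISH_MARKERS))
print(script_share("ಕನ್ನಡ", "KANNADA"))
```

Short word lists produce false positives (for example, "an" is also an English word), so a score like this is best used with a threshold and alongside other evidence.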

## Corpora where there are Multiple Languages inside Each Document

But LangDetect also cannot reliably detect multiple languages. Its classification is a hard one and it has a singular output: one language with one confidence score. What if our text has multiple languages, such as the example below?

In [24]:
large_text = '''This is a text where the first line is in English.
Maar de tweede regel is in het Nederlands.
Dies ist ein deutscher Text.'''

If we run LangDetect over this text, we get the following output.

In [25]:
#overall document detection

doc = nlp(large_text)
print(doc._.language)

{'language': 'en', 'score': 0.571425575717438}


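This is okay. In the run shown above, English (en) is the highest-ranking language at about 57% confidence; because LangDetect is non-deterministic, the top language and its score can change from run to run. More importantly, this text contains multiple languages. In this scenario, we need to analyze each sentence separately. This is common practice when working with multiple languages in a single document in a corpus.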

To analyze the data correctly, therefore, we need to access the Sentence Container.

In [26]:
#each sentence detection

for sent in doc.sents:
    print(f"Sentence: {sent.text.strip()}")
    print(sent._.language)
    print()

Sentence: This is a text where the first line is in English.
{'language': 'en', 'score': 0.9999973440917085}

Sentence: Maar de tweede regel is in het Nederlands.
{'language': 'nl', 'score': 0.9999936320417699}

Sentence: Dies ist ein deutscher Text.
{'language': 'de', 'score': 0.9999975790839888}



## Corpora with Multiple Languages in the Same Sentence

Of all these problems, the most challenging to solve is dealing with corpora where multiple languages can be found in an individual sentence. In my experience, this usually occurs when non-native speakers of one language need to reference something in their native tongue or when the society that produced the document is so strongly bilingual that speakers and readers are expected to understand both languages equally well. In other instances, I have also seen this occur with quotes.

In [27]:
#hard shift between two languages
multilingual_sent = "Josephus vero de historicorum aetate hunc loquitur in modum : Οἱ μέν τοι τὰς ἱστορίας ἐπιχειρήσαντες συγγράφειν παρ᾽ αὐτοῖς, λέγω δὲ τοὺς περὶ Κάδμόν τε τὸν Μιλήσιον, καὶ τὸν Ἀργεῖον Ἀκουσίλαον, καὶ μετὰ τοῦτον εἴ τινες ἄλλοι λέγονται γένεσθαι, βραχὺ τῆς Περσῶν ἐπὶ τὴν ἑλλάδα στρατείας τῷ χρόνῳ προέλαβον: Qui autem historias apud eos conscribere tentavere, id est, Cadmus Milesius et Acusilaus Argivus, et post hunc quicunque alii fuisse feruntur, paulum Persarum expeditionem praecessere ."

In [28]:
print(multilingual_sent)

Josephus vero de historicorum aetate hunc loquitur in modum : Οἱ μέν τοι τὰς ἱστορίας ἐπιχειρήσαντες συγγράφειν παρ᾽ αὐτοῖς, λέγω δὲ τοὺς περὶ Κάδμόν τε τὸν Μιλήσιον, καὶ τὸν Ἀργεῖον Ἀκουσίλαον, καὶ μετὰ τοῦτον εἴ τινες ἄλλοι λέγονται γένεσθαι, βραχὺ τῆς Περσῶν ἐπὶ τὴν ἑλλάδα στρατείας τῷ χρόνῳ προέλαβον: Qui autem historias apud eos conscribere tentavere, id est, Cadmus Milesius et Acusilaus Argivus, et post hunc quicunque alii fuisse feruntur, paulum Persarum expeditionem praecessere .


Josephus vero de historicorum aetate hunc loquitur in modum : Οἱ μέν τοι τὰς ἱστορίας ἐπιχειρήσαντες συγγράφειν παρ᾽ αὐτοῖς, λέγω δὲ τοὺς περὶ **Κάδμόν** τε τὸν **Μιλήσιον**, καὶ τὸν **Ἀργεῖον Ἀκουσίλαον**, καὶ μετὰ τοῦτον εἴ τινες ἄλλοι λέγονται γένεσθαι, βραχὺ τῆς Περσῶν ἐπὶ τὴν ἑλλάδα στρατείας τῷ χρόνῳ προέλαβον: Qui autem historias apud eos conscribere tentavere, id est, **Cadmus Milesius** et **Acusilaus Argivus**, et post hunc quicunque alii fuisse feruntur, paulum Persarum expeditionem praecessere .
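
When the languages in a sentence use different scripts, as Latin and Greek do here, one simple option is to split the sentence into runs by script before doing anything else. The sketch below uses only the standard library; the helper names `script_of` and `script_runs` are hypothetical, and the approach does not help when both languages share a script (for example, Latin and English).

```python
import unicodedata


def script_of(token):
    """Label a token GREEK or LATIN by its first letter; OTHER if neither."""
    for ch in token:
        if ch.isalpha():
            name = unicodedata.name(ch, "")
            if name.startswith("GREEK"):
                return "GREEK"
            if name.startswith("LATIN"):
                return "LATIN"
            return "OTHER"
    return "OTHER"  # punctuation-only tokens


def script_runs(text):
    """Group consecutive whitespace-separated tokens that share a script."""
    runs = []  # each item is [script, [tokens]]
    for token in text.split():
        script = script_of(token)
        # Punctuation-only tokens stay attached to the run before them.
        if runs and (script == "OTHER" or script == runs[-1][0]):
            runs[-1][1].append(token)
        else:
            runs.append([script, [token]])
    return [(script, " ".join(tokens)) for script, tokens in runs]


sample = "Josephus loquitur in modum : Οἱ μέν τοι τὰς ἱστορίας : Qui autem historias"
for script, chunk in script_runs(sample):
    print(script, "->", chunk)
```

Each run could then be handed to whichever rules or pipeline suit that language.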

In [29]:
df = pd.read_csv("../data/original.csv")
df

Unnamed: 0,text
0,"OMNES latinae linguae Patres, scriptoresque ec..."
1,"Laudandum quidem ingenium, damnanda vero haere..."
2,At cum altera pars et melior in fidei certamin...
3,Hunc merito suum occidentalis Ecclesia sibi vi...
4,Hujusce proinde magni nominis umbra ac patroci...
...,...
17391,"praeses, duos Tertulliani libros, de Oratione ..."
17392,"Dolendum istud eximium opus non, nisi toto hoc..."
17393,"Tantum referre nobis licet, eodem tempore quo ..."
17394,Index analyticus amplissimus tomum tertium abs...


In [30]:
sentences = df.text.tolist()
sentences[9872:9875]

['Josephus vero de historicorum aetate hunc loquitur in modum : Οἱ μέν τοι τὰς ἱστορίας ἐπιχειρήσαντες συγγράφειν παρ᾽ αὐτοῖς, λέγω δὲ τοὺς περὶ Κάδμόν τε τὸν Μιλήσιον, καὶ τὸν Ἀργεῖον Ἀκουσίλαον, καὶ μετὰ τοῦτον εἴ τινες ἄλλοι λέγονται γένεσθαι, βραχὺ τῆς Περσῶν ἐπὶ τὴν ἑλλάδα στρατείας τῷ χρόνῳ προέλαβον: Qui autem historias apud eos conscribere tentavere, id est, Cadmus Milesius et Acusilaus Argivus, et post hunc quicunque alii fuisse feruntur, paulum Persarum expeditionem praecessere .',
 'Nos vero de Graeciae sapientibus et eorum aetate in primo Apparatus nostri tomo disputavimus.',
 'De Moysis porro et aliorum prophetarum tempore egimus non solum in eodem citati Apparatus nostri loco, sed in superiori etiam de Lactantii operibus dissertatione.']

In [31]:
# Calculate embeddings
# takes in individual texts and converts each string into a numerical representation
# (a vector with many dimensions)

X =  model.encode(sentences[9800:10000])

In [32]:
# Reduce the dimensions with UMAP
## dimensionality reduction
# 2 dimensions

umap = UMAP()
X_tfm = umap.fit_transform(X)

In [34]:
# Apply coordinates
new_df = pd.DataFrame(sentences[9800:10000], columns=['text'])
new_df['x'] = X_tfm[:, 0]
new_df['y'] = X_tfm[:, 1]
new_df.to_csv("../data/ready_class.csv", index=False)
# new_df

Unnamed: 0,text,x,y
0,"Verum his illis verbis, adhuc minus dicimus, e...",5.983134,-2.091753
1,ARTICULUS V. Quam luculenter Tertullianus prob...,3.770844,-1.901804
2,Nemo certe diffitebitur librorum Moysis antiqu...,4.913873,-1.405707
3,Quamobrem Tertullianus veterum scriptorum test...,2.925365,-1.349026
4,Primum itaque docet Moysem Inacho fuisse coaevum.,3.632606,-2.047698
...,...,...,...
195,"Verum Tertullianus, qui paucas inter lineas si...",1.960120,0.372385
196,Alia autem omnia illius mysteria et documenta ...,5.237106,-0.015902
197,"Atque ita quidem Christiani fiunt, non autem n...",4.737965,1.201558
198,Visne illud tibi ex ipsomet Tertulliano probari?,3.794235,-0.461073


**Do not run this in Constellate, please! To use it elsewhere, convert this cell to a code cell:**
`!python -m bulk text ../data/ready_class.csv`

In [36]:
#!python -m bulk text ../data/ready_class.csv

About to serve `bulk` over at http://localhost:5006/.
^C

Aborted!


# EntityRuler with Multilingual Corpora

In [37]:
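#this will generate a set of patterns, one for each declined form of a name
#note: -us, -i, -o, -um, -e are the singular endings of Latin second-declension masculine nouns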

def first_decliner(root):
    endings = ["us", "i", "o", "um", "e"]
    patterns = []
    for ending in endings:
        patterns.append({"pattern": root+ending, "label": "PERSON"})
    return patterns
marius = first_decliner("Mari")
marius

[{'pattern': 'Marius', 'label': 'PERSON'},
 {'pattern': 'Marii', 'label': 'PERSON'},
 {'pattern': 'Mario', 'label': 'PERSON'},
 {'pattern': 'Marium', 'label': 'PERSON'},
 {'pattern': 'Marie', 'label': 'PERSON'}]

In [38]:
nlp_latin = spacy.blank("en")
nlp_latin_ruler = nlp_latin.add_pipe("entity_ruler")
nlp_latin_ruler.add_patterns(marius)

In [39]:
text = "Marius was a consul in Rome. Marie is the vocative form."
doc_latin = nlp_latin(text)
for ent in doc_latin.ents:
    print (ent.text, ent.label_)

Marius PERSON
Marie PERSON


In [40]:
#implement regex; specify that you are matching on the token's text
#reference a 1960s way of fuzzy string matching

#always write rules that will be true positives

def latin_roots(root):
    return [{"pattern": [{"TEXT": {"REGEX": "^" + root + r"\w+"}}], "label": "PERSON"}]
marius2 = latin_roots("Mari")
nlp_latin2 = spacy.blank("en")
nlp_latin_ruler2 = nlp_latin2.add_pipe("entity_ruler")
nlp_latin_ruler2.add_patterns(marius2)
text = "Marius was a consul in Rome. Marie is the vocative form. Caesar was a dictator."
doc_latin2 = nlp_latin2(text)
for ent in doc_latin2.ents:
    print (ent.text, ent.label_)

Marius PERSON
Marie PERSON
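

A pattern this broad is worth testing against words it should not match. The sketch below approximates the token-level check with Python's `re` module on a hand-made word list; the candidate words and the stricter variant with an explicit list of endings are assumptions added for illustration, not part of the lesson's rules.

```python
import re

root = "Mari"
# For this root, the same expression as in the cell above:
# the root followed by one or more word characters.
pattern = re.compile("^" + re.escape(root) + r"\w+")

candidates = ["Marius", "Marium", "Marie", "Mari", "Maria", "Maritime", "Caesar"]
for word in candidates:
    print(word, bool(pattern.search(word)))

# A stricter variant (an assumption): allow only a known set of endings.
endings = ["us", "i", "o", "um", "e"]
strict = re.compile("^" + re.escape(root) + "(?:" + "|".join(endings) + ")$")
print([word for word in candidates if strict.search(word)])
```

The open-ended pattern also accepts "Maria" and "Maritime", while the stricter one trades that recall for precision. Which is preferable depends on how much noise the corpus contains.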


# Bringing Everything Together

In [41]:
lang_detector = spacy.blank('en')
lang_detector.add_pipe("sentencizer")
lang_detector.add_pipe('language_detector')

<spacy_langdetect.spacy_langdetect.LanguageDetector at 0x7f1720bf37c0>

In [42]:
multilingual_document = """This is a story about Margaret who speaks Spanish.
'Juan Miguel es mi amigo y tiene veinte años.' Margeret said to her friend Sarah.
"""

In [43]:
english_nlp = spacy.load("en_core_web_sm")
spanish_nlp = spacy.load("es_core_news_sm")

In [44]:
for sent in lang_detector(multilingual_document.strip()).sents:
    print (sent)
    print (sent._.language)
    eng_doc = english_nlp(sent.text.strip())
    for ent in eng_doc.ents:
        print (ent.text, ent.label_)
    print()

This is a story about Margaret who speaks Spanish.
{'language': 'en', 'score': 0.999998003935004}
Margaret PERSON
Spanish LANGUAGE


'Juan Miguel es mi amigo y tiene veinte años.'
{'language': 'es', 'score': 0.999996897054559}
Juan Miguel es mi amigo y PERSON

Margeret said to her friend Sarah.
{'language': 'en', 'score': 0.5919624592921453}
Sarah PERSON



## Switching between Languages with Conditionals

In [45]:
for sent in lang_detector(multilingual_document.strip()).sents:
    print (sent)
    print (sent._.language)
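    #choose the pipeline that matches the detected language
    if sent._.language["language"] == "en":
        nested_doc = english_nlp(sent.text.strip())
    elif sent._.language['language'] == "es":
        nested_doc = spanish_nlp(sent.text.strip())
    else:
        #no pipeline is loaded for this language, so skip NER for this sentence
        print ()
        continue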
    for ent in nested_doc.ents:
        print (ent.text, ent.label_)
    
    print ()

This is a story about Margaret who speaks Spanish.
{'language': 'en', 'score': 0.9999950105339779}
Margaret PERSON
Spanish LANGUAGE


'Juan Miguel es mi amigo y tiene veinte años.'
{'language': 'es', 'score': 0.9999968036610237}
Juan Miguel PER
años. ORG

Margeret said to her friend Sarah.
{'language': 'en', 'score': 0.714278268862856}
Sarah PERSON
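

The detection scores above vary between runs, and the third sentence scores well below the other two. One way to make the switching logic more robust is to fall back to a default pipeline when the detected language is unsupported or the score is low. This is a minimal sketch: the helper `choose_language`, the default of `"en"`, and the `0.8` threshold are all assumptions, and plain dictionaries stand in for the detector's output.

```python
def choose_language(detection, supported, default="en", min_score=0.8):
    """Hypothetical helper: pick which pipeline key to use for a sentence.

    `detection` is a dict shaped like the LangDetect output shown above,
    e.g. {"language": "es", "score": 0.99}.
    """
    language = detection.get("language")
    score = detection.get("score", 0.0)
    if language in supported and score >= min_score:
        return language
    return default


# Plain dictionaries stand in for the detector's output in this sketch.
supported = {"en", "es"}
print(choose_language({"language": "es", "score": 0.9999968036610237}, supported))
print(choose_language({"language": "en", "score": 0.714278268862856}, supported))
print(choose_language({"language": "de", "score": 0.7142808371494478}, supported))
```

In the loop above, the result could index a dictionary such as `{"en": english_nlp, "es": spanish_nlp}` in place of the if/elif chain.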

