This notebook was created by [William Mattingly](https://datascience.si.edu/people/dr-william-mattingly) for the 2022 Text Analysis Pedagogy Institute, with support from the [National Endowment for the Humanities](https://neh.gov), [JSTOR Labs](https://labs.jstor.org/), and [University of Arizona Libraries](https://new.library.arizona.edu/).

This notebook is adapted by Zhuo Chen under [Creative Commons CC BY License](https://creativecommons.org/licenses/by/4.0/).

For questions/comments/improvements, email zhuo.chen@ithaka.org or nathan.kelber@ithaka.org<br />
____

# Multilingual NER 2

This is lesson 2 of 3 in the educational series on multilingual NER. This notebook is focused on rules-based NER. 

**Audience:** Teachers / Learners / Researchers

**Use case:** Tutorial / How-To / Reference / Explanation

**Difficulty:** Beginner / Intermediate / Advanced

**Completion time:** 90 minutes

**Knowledge Required:** 

* [Python Basics](./python-basics-1.ipynb)
* [Python intermediate 4](./python-intermediate-4.ipynb)

**Knowledge Recommended:**

* Basic file operations (open, close, read, write)


**Learning Objectives:**
After this lesson, learners will be able to:

* Understand how to use spaCy to do NER
* Understand how to create an EntityRuler
* Understand how to identify the languages of a corpus
* Understand a bit about unsupervised learning

___

# Install required Python libraries

In [None]:
!pip3 install -U spacy # for NLP
!pip3 install spacy_langdetect # for language detection
!pip3 install bulk
!pip3 install pandas
!pip3 install umap-learn
!pip3 install sentence_transformers
!python3 -m spacy download en_core_web_sm # for English NER
!python3 -m spacy download es_core_news_sm # for Spanish NER
!python3 -m spacy download zh_core_web_sm # for Chinese NER

# Introduction to spaCy

The spaCy library (the lowercase "s" and capital "C" are intentional) is a robust machine learning library for Natural Language Processing. It supports a wide variety of languages with statistical models capable of parsing texts, identifying parts of speech, and extracting entities.

Let's see an example of an NLP task that spaCy can do for us.

## Tokenization
Recall that last time we saw a diagram of the NLP pipeline. A pipeline's purpose is to take input data, perform some operations on it, and then output useful information from the data. A pipeline is made up of pipes. A pipe is an individual component of a pipeline, and different pipes perform different tasks. After we read in the data from a text file, an essential task of NLP is tokenization.

<center><img src='https://ithaka-labs.s3.amazonaws.com/static-files/images/tdm/tdmdocs/NER_NLP_Pipeline.png' width=700></center>

One form of tokenization is **word tokenization**. When we do word tokenization, we break a text up into individual words and punctuation marks. Another form of tokenization is **sentence tokenization**. Sentence tokenization works the same way, except that instead of breaking a text up into individual words and punctuation marks, we break it up into individual sentences.

If you are an English speaker, you may think you do not need spaCy for sentence tokenization, because in English the end of a sentence is indicated by a period `.`. Why not just use the built-in string method `split()`, which allows us to split a text string at each period `.`?

This is a legitimate question, but simply splitting a text string at every period `.` sometimes runs into problems, and spaCy is much smarter.

In [None]:
# String to be split
text = "Martin J. Thompson is known for his writing skills. He is also good at programming."

In [None]:
# Split the string by period
sents = text.split(".")
print(sents)

We had the unfortunate result of splitting at Martin J. The reason is simple: in English, it is a common convention to mark an abbreviation with the same punctuation mark that ends a sentence.

We can use spaCy, however, to do sentence tokenization. spaCy is smart enough not to break at Martin J.

First, let's import the spaCy library. Then, we need to load an NLP model object. To do this, we use the `spacy.load()` function. Here, we load the small English NLP model trained on written web text that includes vocabulary, syntax and entities.

In [None]:
# Load the small English NLP model
import spacy
nlp = spacy.load("en_core_web_sm")

We can use this English NLP model to parse a text and create a Doc object. If you need a quick refresher on what classes and objects are, you can refer to [Python intermediate 4](./python-intermediate-4.ipynb).

In [None]:
# Use the English model to parse the text we created
doc = nlp(text)

There is a lot of data stored in the Doc object. For example, we can iterate over the sentences in the Doc object and print them out.

In [None]:
# Get the sentence tokens in doc
for sent in doc.sents:
    print(sent)
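
Word tokenization, described earlier, can be inspected in the same way, because iterating over a Doc yields its tokens. The following is a minimal sketch; it assumes a blank English pipeline (`spacy.blank("en")`), which contains only the tokenizer, and the names `tok_nlp`, `tok_doc`, and `tokens` are chosen here so that they do not overwrite the variables used in this notebook.

```python
import spacy

# A blank pipeline contains only the tokenizer, so no trained model is needed
tok_nlp = spacy.blank("en")
tok_doc = tok_nlp("Martin J. Thompson is known for his writing skills.")

# Iterating over a Doc yields Token objects; .text gives the string of each token
tokens = [token.text for token in tok_doc]
print(tokens)
```

The final period is split off as its own token, which is why tokenization is more than splitting on whitespace.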

# spaCy's built-in NER

We have seen one example NLP task that spaCy can do for us. Now let's move on to named entity recognition, the NLP task we focus on in this series.

spaCy already comes with a built-in NER component that is ready for us to use.

We will iterate over the doc object as we did above, but instead of iterating over `doc.sents`, we will iterate over `doc.ents`. For our purposes right now, we simply want to get each entity's text (the string itself) and its corresponding label (note the underscore `_` after label).

In [None]:
# Print out the entities in the doc object together with their labels
for ent in doc.ents: # iterate over the entities 
    print(ent.text, ent.label_)

As we can see, the small English model has correctly identified that Martin J. Thompson is an entity and given it the correct label PERSON.

Of course, there are many different kinds of entities. Here is a list of entity labels used by the small English NLP model we loaded.

In [None]:
# List of labels in the small English model for NER
nlp.get_pipe("ner").labels

If you would like to know the meaning of a label, you can use the `explain` function.

In [None]:
# Get what a label means
spacy.explain('NORP')
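
To look up several labels at once, the same `spacy.explain` function can be called in a loop. This is a small sketch; the list `sample_labels` is a hand-picked assumption, not the full label set of the model.

```python
import spacy

# A few labels that appear in this lesson (hand-picked for illustration)
sample_labels = ["PERSON", "GPE", "ORG", "NORP"]

# Build a label -> description dictionary from spaCy's glossary
label_meanings = {label: spacy.explain(label) for label in sample_labels}

for label, meaning in label_meanings.items():
    print(f"{label}: {meaning}")
```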

# spaCy's EntityRuler

Life would be so easy if we could just grab the ready-to-use built-in NER of spaCy and apply it to the large volume of data we have at hand. However, things are not that easy.

In [None]:
# Another sample text string
text = "Aars is a small town in Denmark. The town was founded in the 14th century."

# Create the Doc object
doc = nlp(text)

# Extract entities
for ent in doc.ents:
    print(ent.text, ent.label_)

We see that the built-in NER failed to identify Aars as an entity of the GPE type. If we do want to extract 'Aars' from the text and give it a label of GPE, what can we do? 

## Add EntityRuler as a new pipe

Recall that we talked about the pipes in a pipeline at the beginning of this lesson. In spaCy, different pipes perform different tasks: the tokenizer splits the text into individual tokens, the parser analyzes the syntactic structure of the text, and the NER component identifies entities and labels them accordingly. When we create a Doc object, all of this data is stored in the Doc object.

In [None]:
# Take a look at the current pipes
nlp.analyze_pipes()
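
For a shorter overview than `analyze_pipes()`, the `pipe_names` attribute lists just the component names in the order in which they run. The sketch below assumes a blank English pipeline (called `demo_nlp` here so that `nlp` is left untouched) to show how the list grows when a pipe is added.

```python
import spacy

# A blank pipeline has a tokenizer but no pipes yet
demo_nlp = spacy.blank("en")
print(demo_nlp.pipe_names)

# Add a rule-based sentence splitter as the first pipe
demo_nlp.add_pipe("sentencizer")
print(demo_nlp.pipe_names)
```

The tokenizer is not listed because spaCy treats it as a fixed first step rather than as a pipe.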

The EntityRuler is a spaCy factory that allows one to create a set of patterns with corresponding labels. In order to extract the target entities and label them successfully, we can create an EntityRuler, give it some instructions, and then add it to the spaCy pipeline as a new pipe. 

In [None]:
# Create the EntityRuler
ruler = nlp.add_pipe("entity_ruler")

# List of Entities and Patterns
patterns = [
                {"label": "GPE", "pattern": "Aars"}
            ]

ruler.add_patterns(patterns)

After we add the EntityRuler, we can use the new pipeline to do NER. 

In [None]:
# Use the new model to parse the text and create a new Doc object
doc = nlp(text)

# Iterate over the entities and print them out
for ent in doc.ents: 
    print(ent.text, ent.label_)

In [None]:
# Take a look at the pipes in the new pipeline
nlp.analyze_pipes()
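
A pattern does not have to be a single exact string. As a minimal sketch, the example below mixes a phrase pattern (an exact string, as above) with a token pattern (a list with one dictionary per token). The blank pipeline `pattern_nlp`, the sample sentence, and the "new york" pattern are assumptions made for illustration.

```python
import spacy

# Blank pipeline so that only our rules produce entities
pattern_nlp = spacy.blank("en")
pattern_ruler = pattern_nlp.add_pipe("entity_ruler")

pattern_ruler.add_patterns([
    # Phrase pattern: matches this exact string
    {"label": "GPE", "pattern": "Aars"},
    # Token pattern: two tokens, compared in lowercase (case-insensitive)
    {"label": "GPE", "pattern": [{"LOWER": "new"}, {"LOWER": "york"}]},
])

pattern_doc = pattern_nlp("She moved from Aars to new york last year.")
pattern_entities = [(ent.text, ent.label_) for ent in pattern_doc.ents]
print(pattern_entities)
```

Token patterns are useful when the spelling or capitalization of an entity varies across a corpus.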

## The importance of order

It is important to remember that pipelines are sequential. This means that components earlier in a pipeline affect what later components receive.

In [None]:
# Load a fresh copy of the English model and parse a new text string
text = "Xiong'an is a satellite city of Beijing."
nlp1 = spacy.load("en_core_web_sm")
doc = nlp1(text)
for ent in doc.ents:
    print(ent.text, ent.label_)

Xiong'an is the name of a city. We want to label it as GPE, not ORG.

In [None]:
# Create the EntityRuler
ruler = nlp1.add_pipe("entity_ruler")

# List of Entities and Patterns
patterns = [
                {"label": "GPE", "pattern": "Xiong'an"}
            ]

ruler.add_patterns(patterns)

# Get the entities
doc = nlp1(text)
for ent in doc.ents:
    print(ent.text, ent.label_)

Why is Xiong'an still mislabeled? When we add the EntityRuler as a new pipe, it is added at the end of the pipeline by default, so it runs after spaCy's built-in NER. Entity spans in `doc.ents` cannot overlap, and by default the EntityRuler does not overwrite entities that an earlier component has already set (its `overwrite_ents` setting is `False`). Because the built-in NER has already labeled Xiong'an as ORG, the EntityRuler leaves that label in place. To give the EntityRuler primacy, we have to place it before the built-in NER when we add it. The NER component then keeps the entities the ruler has already set and only labels the remaining tokens.

In [None]:
# Load the model
nlp2 = spacy.load("en_core_web_sm")

# Create the EntityRuler and add it to the model
ruler = nlp2.add_pipe("entity_ruler", before='ner')

# Add the new patterns to the ruler
patterns = [
                {"label": "GPE", "pattern": "Xiong'an"}
            ]

ruler.add_patterns(patterns)

# Use the new model to parse the text
doc = nlp2(text)

# Get the entities
for ent in doc.ents:
    print(ent.text, ent.label_)

In [None]:
# EntityRuler comes before the built in ner in nlp2
nlp2.analyze_pipes()
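
The effect of order can also be reproduced without a trained model. In the sketch below, a first EntityRuler stands in for the built-in NER (an assumption made here so that the result does not depend on a statistical model), and a second EntityRuler is added after it. The helper `label_xiongan` and the component names are hypothetical; `overwrite_ents` is the EntityRuler setting that controls whether existing entities may be replaced.

```python
import spacy

def label_xiongan(overwrite):
    """Return the entities found when a second ruler runs after a first one."""
    order_nlp = spacy.blank("en")

    # First component: stands in for the built-in NER and assigns ORG
    first = order_nlp.add_pipe("entity_ruler", name="stand_in_ner")
    first.add_patterns([{"label": "ORG", "pattern": "Xiong'an"}])

    # Second component: our own rule, added at the end of the pipeline
    second = order_nlp.add_pipe(
        "entity_ruler", name="my_ruler", config={"overwrite_ents": overwrite}
    )
    second.add_patterns([{"label": "GPE", "pattern": "Xiong'an"}])

    order_doc = order_nlp("Xiong'an is a satellite city of Beijing.")
    return [(ent.text, ent.label_) for ent in order_doc.ents]

# Default behaviour: the earlier label is kept
print(label_xiongan(overwrite=False))

# With overwrite_ents=True the later ruler replaces the earlier label
print(label_xiongan(overwrite=True))
```

Placing the ruler before the NER and setting `overwrite_ents` are two ways to give your rules priority; which one fits depends on whether the statistical NER should still label the rest of the text around your entities.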

## Write a regex pattern

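Suppose we have a text written in English, except that the names are written in Latin. Latin names are inflected, so the same name can appear with several different endings, and a single fixed string will not match all of them.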

In [None]:
# English text with Latin names
text = "Marius was a consul in Rome. Marie is the vocative form."

In [None]:
# Write a function that builds one pattern per ending for a Latin name such as Marius
def pattern(root):
    endings = ["us", "i", "o", "um", "e"]
    patterns = []
    for ending in endings:
        patterns.append({"pattern": root+ending, "label": "PERSON"})
    return patterns
marius = pattern("Mari")
marius

In [None]:
# Create an empty English NLP model
nlp_latin = spacy.blank("en")

# Add an EntityRuler
nlp_latin_ruler = nlp_latin.add_pipe("entity_ruler")

# add the pattern for the Latin name Marius to the EntityRuler
nlp_latin_ruler.add_patterns(marius)

In [None]:
# Create a Doc object
doc_latin = nlp_latin(text)

# Iterate over the entities in Doc object and print them out
for ent in doc_latin.ents:
    print(ent.text, ent.label_)

Instead of listing every form as a separate pattern, we could also use a single regular expression (regex) inside a token pattern.

In [None]:
# Write a function which returns the pattern for Latin name Marius
def latin_roots(root):
    return [{"pattern": [{"TEXT": {"REGEX": "^" + root + r"(us|i|o|um|e)$"}}], "label": "PERSON"}]

# Save the pattern to the variable marius2
marius2 = latin_roots("Mari")

# Create a blank English NLP model
nlp_latin2 = spacy.blank("en")

# Add an EntityRuler to the model
nlp_latin_ruler2 = nlp_latin2.add_pipe("entity_ruler")

# Add the pattern for Latin name Marius to the EntityRuler
nlp_latin_ruler2.add_patterns(marius2)

# Text to be parsed
text = "Marius was a consul in Rome. Marie is the vocative form. Caesar was a dictator."

# Create a Doc object using the new model with the regex pattern in EntityRuler
doc_latin2 = nlp_latin2(text)

# Iterate over the entities and print them out
for ent in doc_latin2.ents:
    print(ent.text, ent.label_)
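
To see which strings the regular expression used above accepts, it can be tried out with Python's built-in `re` module. This is a sketch under the assumption that the expression is checked against one token's text at a time; the list `candidates` is made up for illustration.

```python
import re

# Same expression as in the token pattern: root "Mari" plus one of the endings
latin_name = re.compile(r"^Mari(us|i|o|um|e)$")

# Made-up candidate tokens, some of which only share the root
candidates = ["Marius", "Marie", "Mario", "Maria", "Marianne", "Caesar"]

# ^ and $ anchor the expression to the whole token, so "Marianne" is rejected
matched = [word for word in candidates if latin_name.search(word)]
print(matched)
```

Without the `^` and `$` anchors, longer words that merely contain one of the forms would also match.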

# Exercise (to be added)

# Detecting languages in texts

When we work with a multilingual corpus, we will first want to know which languages are used in the corpus. There are different approaches to doing this. In this section, I will introduce the third-party library Lingua for language detection. Currently, 75 languages are supported by Lingua.

## Language detection with Lingua

In [None]:
!pip3 install lingua-language-detector

In [None]:
# import the language detector builder
from lingua import LanguageDetectorBuilder

In [None]:
# build a language detector
detector = LanguageDetectorBuilder.from_all_languages().build()

In [None]:
# Use the detector to detect the language of a string
detector.detect_language_of("This is an English text")

In [None]:
# Use the detector to detect the language of a string
detector.detect_language_of("Este é um outro texto sem idioma especificado")

In [None]:
# Use the detector to detect the language of a string
detector.detect_language_of("这是一句中文")

Sometimes you may already know the range of languages in your corpus. You just want to identify the language for each document. In this case, you could narrow down the language detector to only a few languages. 

In [None]:
# build a language detector
from lingua import Language, LanguageDetectorBuilder
languages = [Language.ENGLISH, Language.FRENCH, Language.GERMAN, Language.SPANISH]
detector = LanguageDetectorBuilder.from_languages(*languages).build()

# Get a confidence value for each of the given languages
detector.compute_language_confidence_values("This is an English text")
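
Once each document can be assigned a language, a corpus-level overview is a matter of counting. The sketch below shows one way to tally languages across a list of documents. The helper `count_languages` and the keyword-based `toy_detect` function are hypothetical stand-ins so that the example runs on its own; `toy_detect` is not how Lingua works and is far less reliable.

```python
from collections import Counter

def count_languages(documents, detect):
    """Tally the label returned by `detect` for every document."""
    return Counter(detect(text) for text in documents)

def toy_detect(text):
    """Hypothetical stand-in detector based on a few common words."""
    words = set(text.lower().split())
    if words & {"the", "is", "this"}:
        return "ENGLISH"
    if words & {"el", "es", "una"}:
        return "SPANISH"
    return "UNKNOWN"

# Made-up miniature corpus
corpus = [
    "This is the first document.",
    "Esta es una frase.",
    "The second English text.",
]

language_counts = count_languages(corpus, toy_detect)
print(language_counts)
```

With the Lingua detector built above, `detector.detect_language_of` could be passed as the `detect` argument in place of `toy_detect`.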

## Multiple languages in the same file

The examples we just went over assume that only one language is used in each document. The language detector we built cannot reliably detect multiple languages, because by default it outputs only one language per text. What if our text has multiple languages, as in the example below?

In [None]:
# a text string with multiple languages 
large_text = '''This is a text where the first line is in English.
Maar de tweede regel is in het Nederlands. 
Dies ist ein deutscher Text.'''

In [None]:
# build a language detector
languages = [Language.ENGLISH, Language.DUTCH, Language.GERMAN]
detector = LanguageDetectorBuilder.from_languages(*languages).build()

If we run the detector over this text, we get the following output.

In [None]:
# Use the detector to decide the language of the text
detector.detect_language_of(large_text)

By default, Lingua returns the most likely language for a given input text. 

In [None]:
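# Get the confidence value for each candidate language
# (assumes a recent Lingua release, where each result has .language and .value;
# older releases return plain (language, value) tuples instead)
confidence_values = detector.compute_language_confidence_values(large_text)
for confidence in confidence_values:
    print(f"{confidence.language.name}: {confidence.value:.2f}")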

But this text has multiple languages. In this example text, each sentence is written in a different language. Therefore, we need to get each sentence string and run the detector over it.

In [None]:
# Create a Doc object 
doc = nlp(large_text)

# Iterate over each sentence and run the detector over it
for sent in doc.sents:
    print(f"Sentence: {sent.text.strip()}")
    print(detector.detect_language_of(sent.text))

# Bring everything together

In [None]:
# A document that has two languages, English and Spanish
multilingual_document = """This is a story about Margaret who speaks Spanish. 
'Juan Miguel es mi amigo y tiene veinte años.' Margaret said to her friend Sarah.
"""

In [None]:
# build a language detector
from lingua import Language, LanguageDetectorBuilder
languages = [Language.ENGLISH, Language.FRENCH, Language.GERMAN, Language.SPANISH]
detector = LanguageDetectorBuilder.from_languages(*languages).build()

In [None]:
# Load the relevant models
english_nlp = spacy.load("en_core_web_sm") # for English
spanish_nlp = spacy.load("es_core_news_sm") # for Spanish

In [None]:
# Create a blank English pipeline that is used only to split sentences
multi_nlp = spacy.blank('en')

# Add sentencizer
multi_nlp.add_pipe('sentencizer')

# Create a Doc object
multi_doc = multi_nlp(multilingual_document.strip())


In [None]:
# Each sentence in doc.sents is a spaCy Span object
type(list(multi_doc.sents)[1])

In [None]:
# Switching between languages with conditionals

for sent in multi_doc.sents:
    # Detect the language once per sentence (the result can be None)
    language = detector.detect_language_of(sent.text)
    if language is None:
        continue
    if language.name == "ENGLISH":
        nested_doc = english_nlp(sent.text.strip())
    elif language.name == "SPANISH":
        nested_doc = spanish_nlp(sent.text.strip())
    else:
        # No model is loaded for this language, so skip the sentence
        continue
    print(sent)
    for ent in nested_doc.ents:
        print(ent.text, ent.label_)
    print()

# Exercise (to be added)