# Language Processing

> **_NOTE:_** The examples may take a little while to run as they download the models each time.

## Language Detection

We'll use [Apache OpenNLP](https://opennlp.apache.org/) to detect the likely language for some fragments of text.

In [1]:
%%classpath add mvn
org.apache.opennlp opennlp-tools 1.9.3

In [2]:
%import opennlp.tools.langdetect.*

In [3]:
base     = 'http://apache.forsale.plus/opennlp/models'
url      = "$base/langdetect/1.8.3/langdetect-183.bin"
model    = new LanguageDetectorModel(new URL(url))
detector = new LanguageDetectorME(model)

['Bienvenido a Madrid', 'Bienvenue à Paris',
 'Добре дошли в София', 'Velkommen til København'].collect {
    t -> detector.predictLanguage(t).lang
}

[spa, fra, bul, dan]

## Sentence Detection

OpenNLP also supports sentence detection. We load the trained sentence detection model for English and use that to process some text. Even though the text has 28 full stops, only 4 of them are associated with the end of a sentence.

In [4]:
import opennlp.tools.sentdetect.*

def text = '''
The most referenced scientific paper of all time is "Protein measurement with the
Folin phenol reagent" by Lowry, O. H., Rosebrough, N. J., Farr, A. L. & Randall,
R. J. and was published in the J. BioChem. in 1951. It describes a method for
measuring the amount of protein (even as small as 0.2 γ, where γ is the specific
weight) in solutions and has been cited over 300,000 times and can be found here:
https://www.jbc.org/content/193/1/265.full.pdf. Dr. Lowry completed
two doctoral degrees under an M.D.-Ph.D. program from the University of Chicago
before moving to Harvard under A. Baird Hastings. He was also the H.O.D of
Pharmacology at Washington University in St. Louis for 29 years.
'''

base     = 'http://opennlp.sourceforge.net/models-1.5'
url      = "$base/en-sent.bin"

// Load the pre-trained English sentence model and split the text
def model = new SentenceModel(new URL(url))
def detector = new SentenceDetectorME(model)
def sentences = detector.sentDetect(text)
// 28 full stops in total, but only 4 detected sentences
assert text.count('.') == 28
assert sentences.size() == 4
sentences.join('\n\n')

The most referenced scientific paper of all time is "Protein measurement with the
Folin phenol reagent" by Lowry, O. H., Rosebrough, N. J., Farr, A. L. & Randall,
R. J. and was published in the J. BioChem. in 1951.

It describes a method for
measuring the amount of protein (even as small as 0.2 γ, where γ is the specific
weight) in solutions and has been cited over 300,000 times and can be found here:
https://www.jbc.org/content/193/1/265.full.pdf.

Dr. Lowry completed
two doctoral degrees under an M.D.-Ph.D. program from the University of Chicago
before moving to Harvard under A. Baird Hastings.

He was also the H.O.D of
Pharmacology at Washington University in St. Louis for 29 years.
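
To see why a trained model helps here, it can be useful to compare against a naive rule. The following is a minimal sketch, not part of the OpenNLP example: it splits on every full stop followed by whitespace. The `sample` string is an abridged, illustrative variant of the text above, and the variable names are our own.

```groovy
// Abridged sample reusing abbreviations from the text above: Dr., A., St.
def sample = 'Dr. Lowry moved to Harvard under A. Baird Hastings. ' +
             'He worked in St. Louis for 29 years.'

// Naive rule: a sentence ends at every full stop followed by whitespace
def fragments = sample.split(/(?<=\.)\s+/).toList()

// Print one fragment per line
fragments.each { println it }
```

The sample contains two sentences, but the naive rule breaks it after `Dr.`, `A.` and `St.` as well, giving five fragments. The sentence detection model avoids this kind of over-splitting because it has learned which full stops typically belong to abbreviations and initials.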

## Entity Detection

When analysing text, we often want to find meaningful entities such as dates, locations, and names of people. The following example again uses OpenNLP, which provides a separate named entity model for each kind of entity. We'll use five English-language models (person, money, date, time, and location), but there are [other models and models for some other languages](http://opennlp.sourceforge.net/models-1.5/).

In [5]:
import opennlp.tools.namefind.*
import opennlp.tools.tokenize.SimpleTokenizer
import opennlp.tools.util.Span

String[] sentences = [
    "A commit by Daniel Sun on December 6, 2020 improved Groovy 4's language integrated query.",
    "A commit by Daniel on Sun. December 6, 2020 improved Groovy 4's language integrated query.",
    'The Groovy in Action book by Dierk Koenig et. al. is a bargain at $50, or indeed any price.',
    'The conference wrapped up yesterday at 5:30 p.m. in Copenhagen, Denmark.',
    'I saw Ms. May Smith waving to June Jones.',
    'The parcel was passed from May to June.'
]

def base     = 'http://opennlp.sourceforge.net/models-1.5'
def modelNames = ['person', 'money', 'date', 'time', 'location']
def finders = modelNames.collect{
    println "Loading $it ..."
    new NameFinderME(new TokenNameFinderModel(new URL("$base/en-ner-${it}.bin")))
}

def tokenizer = SimpleTokenizer.INSTANCE
sentences.each { sentence ->
    String[] tokens = tokenizer.tokenize(sentence)
    Span[] tokenSpans = tokenizer.tokenizePos(sentence)
    def entityText = [:]
    def entityPos = [:]
    finders.indices.each {fi ->
        // could be made smarter by looking at probabilities and overlapping spans
        Span[] spans = finders[fi].find(tokens)
        spans.each{span ->
            def se = span.start..<span.end
            def pos = (tokenSpans[se.from].start)..<(tokenSpans[se.to].end)
            entityPos[span.start] = pos
            entityText[span.start] = "$span.type(${sentence[pos]})"
        }
    }
    entityPos.keySet().toList().reverseEach {
        def pos = entityPos[it]
        def (from, to) = [pos.from, pos.to + 1]
        sentence = sentence[0..<from] + entityText[it] + sentence[to..-1]
    }
    println sentence
}
OutputCell.HIDDEN

Loading person ...
Loading money ...
Loading date ...
Loading time ...
Loading location ...
A commit by person(Daniel Sun) on date(December 6, 2020) improved Groovy 4's language integrated query.
A commit by person(Daniel) on Sun. date(December 6, 2020) improved Groovy 4's language integrated query.
The Groovy in Action book by person(Dierk Koenig) et. al. is a bargain at money($50), or indeed any price.
The conference wrapped up date(yesterday) at time(5:30 p.m.) in location(Copenhagen), location(Denmark).
I saw Ms. person(May Smith) waving to person(June Jones).
The parcel was passed from date(May to June).
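
The comment in the code above notes that the example could be made smarter by looking at probabilities and overlapping spans. One possible approach, sketched below under stated assumptions, is a greedy filter that keeps the most probable candidate and drops anything overlapping it. To stay self-contained, the sketch uses plain maps rather than OpenNLP `Span` objects, and the candidate spans and probabilities are invented purely for illustration.

```groovy
// Hypothetical candidates over token indices (end is exclusive).
// The probabilities are made up; they are not output from the models above.
def candidates = [
    [type: 'person', start: 5, end: 6, prob: 0.40],
    [type: 'date',   start: 5, end: 8, prob: 0.70],
    [type: 'person', start: 7, end: 8, prob: 0.50],
    [type: 'person', start: 0, end: 2, prob: 0.90]
]

// Two half-open ranges overlap when each starts before the other ends
def overlaps = { a, b -> a.start < b.end && b.start < a.end }

// Visit candidates from most to least probable, keeping only those
// that do not overlap a candidate that was already kept
def kept = []
candidates.sort(false) { -it.prob }.each { c ->
    if (!kept.any { overlaps(it, c) }) kept << c
}

// Restore document order for display
kept = kept.sort(false) { it.start }
println kept.collect { "${it.type}(${it.start}, ${it.end})" }
```

With these invented numbers, the two lower-probability `person` candidates are dropped in favour of the overlapping `date` candidate. A greedy filter like this is only one option; it would not by itself fix a case such as `date(May to June)` unless the competing candidates and their probabilities favoured a different reading.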


## Parts of Speech (POS) Detection

Parts of speech (POS) detection tags words as nouns, verbs and other [parts of speech](https://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html).
Some of the common tags are:

| Tag | Meaning |
| --- | --- |
| CC | coordinating conjunction |
| CD | cardinal number |
| DT | determiner |
| IN | preposition or subordinating conjunction |
| JJ | adjective |
| JJR | adjective, comparative |
| NN | noun, singular or mass |
| NNS | noun, plural |
| NNP | proper noun, singular |
| POS | possessive ending |
| PRP | personal pronoun |
| PRP$ | possessive pronoun |
| RB | adverb |
| TO | the word "to" |
| VB | verb, base form |
| VBD | verb, past tense |
| VBZ | verb, third person singular present |

Here, we use OpenNLP's POS detection capabilities to detect the parts of speech for a number of sentences:

In [6]:
import opennlp.tools.postag.*
import opennlp.tools.tokenize.SimpleTokenizer

def base     = 'http://opennlp.sourceforge.net/models-1.5'
def sentences = [
    'Paul has two sisters, Maree and Christine.',
    'His bark was much worse than his bite',
    'Turn on the lights to the master bedroom',
    "Light 'em all up",
    'Make it dark downstairs'
]
def model = new POSModel(new URL("$base/en-pos-maxent.bin"))
def posTagger = new POSTaggerME(model)
def tokenizer = SimpleTokenizer.INSTANCE
sentences.each {
    String[] tokens = tokenizer.tokenize(it)
    String[] tags = posTagger.tag(tokens)
    println tokens.indices.collect{tags[it] == tokens[it] ? tags[it] : "${tags[it]}(${tokens[it]})" }.join(' ')
}
OutputCell.HIDDEN

NNP(Paul) VBZ(has) CD(two) NNS(sisters) , NNP(Maree) CC(and) NNP(Christine) .
PRP$(His) NN(bark) VBD(was) RB(much) JJR(worse) IN(than) PRP$(his) NN(bite)
VB(Turn) IN(on) DT(the) NNS(lights) TO(to) DT(the) NN(master) NN(bedroom)
NN(Light) POS(') NN(em) DT(all) IN(up)
VB(Make) PRP(it) JJ(dark) NN(downstairs)
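
Once tokens are tagged, the tags can drive simple downstream filtering. As a minimal sketch, the following picks the nouns out of the first sentence. To keep it self-contained, the tokens and tags are copied by hand from the first line of output above rather than produced by the tagger, and the variable names are our own.

```groovy
// Tokens and tags copied from the first line of output above
def tokens = ['Paul', 'has', 'two', 'sisters', ',', 'Maree', 'and', 'Christine', '.']
def tags   = ['NNP', 'VBZ', 'CD', 'NNS', ',', 'NNP', 'CC', 'NNP', '.']

// Pair each token with its tag: [[token, tag], ...]
def tagged = [tokens, tags].transpose()

// Noun tags in the table above (NN, NNS, NNP) all start with 'NN'
def nouns       = tagged.findAll { it[1].startsWith('NN') }.collect { it[0] }
def properNouns = tagged.findAll { it[1] == 'NNP' }.collect { it[0] }

println "nouns:        $nouns"
println "proper nouns: $properNouns"
```

The same pairing could be built inside the loop above from the `tokens` and `tags` arrays returned by the tokenizer and tagger.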
