In [None]:
#| hide
#| default_exp pretrained_models

# NLP with pre-trained models: spaCy and Stanford NLP

In [None]:
#| export
import spacy
en = spacy.load('en_core_web_sm')

In [None]:
#| export
text = ("Donald John Trump (born June 14, 1946) is the 45th and former president of "
        "the United States.  Before entering politics, he was a businessman and television personality.")

print(f"Type of text: {type(text)}")
print(f"Length of text: {len(text)}")

Type of text: <class 'str'>
Length of text: 169


By applying the spaCy model that we assigned to the variable `en`, we can generate a processed document, `doc_en`, that has sentences and tokens:

In [None]:
#| export
doc_en = en(text)

In [None]:
#| export
list(doc_en.sents)

[Donald John Trump (born June 14, 1946) is the 45th and former president of the United States.  ,
 Before entering politics, he was a businessman and television personality.]

In [None]:
#| export
print(len(list(doc_en.sents)))

2


In [None]:
#| export
from IPython.display import display_html
import tabulate

tokens = [[t] for t in doc_en]
# display_html renders the table itself and returns None
display_html(tabulate.tabulate(tokens, tablefmt='html'))

0
Donald
John
Trump
(
born
June
14
","
1946
)

spaCy also identifies a number of linguistic features for every token: `lemma_` (the lemma), `pos_` (the universal POS tag), and `tag_` (the more fine-grained, language-specific POS tag):

In [None]:
#| export
features = [[t.orth_, t.lemma_, t.pos_, t.tag_] for t in doc_en]
# display_html renders the table itself and returns None
display_html(tabulate.tabulate(features, tablefmt='html'))

0,1,2,3
Donald,Donald,PROPN,NNP
John,John,PROPN,NNP
Trump,Trump,PROPN,NNP
(,(,PUNCT,-LRB-
born,bear,VERB,VBN
June,June,PROPN,NNP
14,14,NUM,CD
",",",",PUNCT,","
1946,1946,NUM,CD
),),PUNCT,-RRB-
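To make the relation between the coarse `pos_` values and the fine-grained `tag_` values more concrete, the rows of the table above can be grouped with plain Python. This is only a minimal sketch: it hard-codes the ten rows shown above instead of calling spaCy, and the names `rows` and `tags_per_pos` are illustrative, not part of any library.

```python
from collections import defaultdict

# (text, lemma_, pos_, tag_) rows copied from the table above
rows = [
    ("Donald", "Donald", "PROPN", "NNP"),
    ("John", "John", "PROPN", "NNP"),
    ("Trump", "Trump", "PROPN", "NNP"),
    ("(", "(", "PUNCT", "-LRB-"),
    ("born", "bear", "VERB", "VBN"),
    ("June", "June", "PROPN", "NNP"),
    ("14", "14", "NUM", "CD"),
    (",", ",", "PUNCT", ","),
    ("1946", "1946", "NUM", "CD"),
    (")", ")", "PUNCT", "-RRB-"),
]

# Collect the distinct fine-grained tags seen under each universal POS tag
tags_per_pos = defaultdict(set)
for _text, _lemma, pos, tag in rows:
    tags_per_pos[pos].add(tag)

for pos in sorted(tags_per_pos):
    print(pos, sorted(tags_per_pos[pos]))
```

In this fragment a single universal tag such as `PUNCT` covers several fine-grained tags, while `PROPN` maps to just `NNP`.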

spaCy also offers pre-trained models for NER (Named Entity Recognition). The results can be found in the `ent_iob_` and `ent_type_` attributes of each token.

The `ent_type_` attribute tells us what type of entity the token belongs to: 'Donald John Trump' => `PERSON`, 'June 14, 1946' => `DATE`, '45th' => `ORDINAL`, and 'the United States' => `GPE` (geopolitical entity).

The `ent_iob_` attribute gives the position of the token within an entity using the letters I, O and B: `B` means the token begins an entity, `I` means the token is inside an entity (it continues one), and `O` means the token is outside any entity. In other words, the IOB scheme tells you where each entity starts and which tokens belong to it.

In [None]:
#| export
entities = [(t.orth_, t.ent_iob_, t.ent_type_) for t in doc_en]
# display_html renders the table itself and returns None
display_html(tabulate.tabulate(entities, tablefmt='html'))

0,1,2
Donald,B,PERSON
John,I,PERSON
Trump,I,PERSON
(,O,
born,O,
June,B,DATE
14,I,DATE
",",I,DATE
1946,I,DATE
),O,
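The IOB tags carry enough information to rebuild the entity spans by hand. The sketch below shows one way to do that in plain Python, using the ten rows from the table above as hard-coded input. The function name `iob_to_spans` is hypothetical, and joining tokens with a single space is a simplification (it puts a space before the comma).

```python
def iob_to_spans(tagged):
    """Group (text, iob, ent_type) triples into (entity_text, label) pairs."""
    spans = []
    current_tokens, current_label = [], None
    for text, iob, ent_type in tagged:
        if iob == "B":
            # A new entity starts; close the one in progress, if any
            if current_tokens:
                spans.append((" ".join(current_tokens), current_label))
            current_tokens, current_label = [text], ent_type
        elif iob == "I" and current_tokens:
            # The token continues the entity that is currently open
            current_tokens.append(text)
        else:
            # "O" (or an "I" with nothing open): close any open entity
            if current_tokens:
                spans.append((" ".join(current_tokens), current_label))
            current_tokens, current_label = [], None
    if current_tokens:
        spans.append((" ".join(current_tokens), current_label))
    return spans


# (text, ent_iob_, ent_type_) rows copied from the table above
tagged = [
    ("Donald", "B", "PERSON"), ("John", "I", "PERSON"), ("Trump", "I", "PERSON"),
    ("(", "O", ""), ("born", "O", ""),
    ("June", "B", "DATE"), ("14", "I", "DATE"), (",", "I", "DATE"), ("1946", "I", "DATE"),
    (")", "O", ""),
]

print(iob_to_spans(tagged))
# [('Donald John Trump', 'PERSON'), ('June 14 , 1946', 'DATE')]
```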

We can also access the recognized entities directly through the `ents` attribute of the document:

In [None]:
#| export
print([(ent.text, ent.label_) for ent in doc_en.ents])

[('Donald John Trump', 'PERSON'), ('June 14, 1946', 'DATE'), ('45th', 'ORDINAL'), ('the United States', 'GPE')]


On top of all this, the spaCy model also includes a dependency parser that analyzes the grammatical relations between the tokens:

In [None]:
#| export
syntax = [[token.text, token.dep_, token.head.text] for token in doc_en]

We display the results, kept in the variable `syntax`, in the usual way:

In [None]:
# display_html renders the table itself and returns None
display_html(tabulate.tabulate(syntax, tablefmt='html'))

0,1,2
Donald,compound,Trump
John,compound,Trump
Trump,nsubj,is
(,punct,Trump
born,acl,Trump
June,npadvmod,born
14,nummod,June
",",punct,June
1946,nummod,June
),punct,Trump
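Each row of the table above links a token to its head, so the rows can be regrouped to show which tokens depend on a given head. The following is a small sketch in plain Python over the ten rows shown above. The name `children` is illustrative, and keying on the token text only works here because no word occurs twice in this fragment; real code would key on token positions instead.

```python
from collections import defaultdict

# (token, dep_, head) rows copied from the table above
rows = [
    ("Donald", "compound", "Trump"),
    ("John", "compound", "Trump"),
    ("Trump", "nsubj", "is"),
    ("(", "punct", "Trump"),
    ("born", "acl", "Trump"),
    ("June", "npadvmod", "born"),
    ("14", "nummod", "June"),
    (",", "punct", "June"),
    ("1946", "nummod", "June"),
    (")", "punct", "Trump"),
]

# Map every head to the list of (dependent, relation) pairs attached to it
children = defaultdict(list)
for token, dep, head in rows:
    children[head].append((token, dep))

print(children["Trump"])
print(children["June"])
```

Read this way, `Trump` is the subject (`nsubj`) of `is` and collects the two name parts, the brackets and the participle `born`, while `June` collects the rest of the date.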

## Multilingual NLP

As its name `en_core_web_sm` suggests, the spaCy model we loaded is trained on and targeted at the English language.

The spaCy website lists the models that are available for different use cases:

https://spacy.io/usage/models

Models for other languages are available as well. Let's try one out on a Dutch text:

In [None]:
#| export
nl = spacy.load('nl_core_news_sm')
# Note the trailing space: adjacent string literals are joined without a separator
text_nl = ("Mark Rutte is minister-president van Nederland. "
           "Hij is van de VVD en heeft een slecht geheugen.")

In [None]:
#| export
doc_nl = nl(text_nl)

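The Dutch model was trained differently from the English one, so its output differs in a few ways.

The most important difference concerns lemmatization: a Dutch model without a lemmatizer simply returns the `orth_` value for the `lemma_` attribute. The model used for the output below does return lemmas (`is` becomes `zijn`, `Hij` becomes `hij`), so the behaviour depends on the version you have installed.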

NB: whenever numbers turn up in the generated tables where you expected strings, they are the integer IDs that spaCy uses internally for those values. This usually means that a token attribute was written without the trailing underscore, for example `ent_iob` instead of `ent_iob_`.

In [None]:
#| export
info = [(t.lemma_, t.pos_, t.tag_, t.ent_iob_, t.ent_type_) for t in doc_nl]
# display_html renders the table itself and returns None
display_html(tabulate.tabulate(info, tablefmt='html'))

0,1,2,3,4
Mark,PROPN,SPEC|deeleigen,B,PERSON
Rutte,PROPN,SPEC|deeleigen,O,
zijn,AUX,WW|pv|tgw|ev,O,
ministerpresident,NOUN,N|soort|ev|basis|zijd|stan,O,
van,ADP,VZ|init,O,
Nederland,PROPN,N|eigen|ev|basis|onz|stan,B,GPE
.,PUNCT,LET,O,
hij,PRON,VNW|pers|pron|nomin|vol|3|ev|masc,O,
zijn,AUX,WW|pv|tgw|ev,O,
van,ADP,VZ|init,O,
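The Dutch `tag_` values in the table above look quite different from the English ones: they appear to consist of a main category followed by `|`-separated features. Assuming that reading is right (it is an interpretation of the output shown here, not something taken from spaCy's documentation), such a tag can be taken apart with plain Python. The helper name `split_tag` is hypothetical.

```python
def split_tag(tag):
    """Split a tag such as 'WW|pv|tgw|ev' into (main_category, [features])."""
    main, *features = tag.split("|")
    return main, features


# Tag strings copied from the table above
for tag in ["SPEC|deeleigen", "WW|pv|tgw|ev", "LET"]:
    print(split_tag(tag))
# ('SPEC', ['deeleigen'])
# ('WW', ['pv', 'tgw', 'ev'])
# ('LET', [])
```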

If one is working with Dutch texts, the Python library **StanfordNLP**, which is built on top of PyTorch, provides a fully neural pipeline with lemmatization.

Add a lookup-based lemmatizer for Dutch:

https://github.com/explosion/spaCy/blob/master/spacy/lang/de/tokenizer_exceptions.py