In [None]:
#| hide
#| default_exp pretrained_models

In [None]:
#| hide
%matplotlib inline
from nbdev.showdoc import *

In [None]:
#| export
import spacy
from IPython.display import display_html
import tabulate
import stanza
import spacy_stanza

# NLP with pre-trained models: spaCy and Stanford NLP
(follows: )

In [None]:
#| export
en = spacy.load('en_core_web_sm')

In [None]:
#| export
text = ("Donald John Trump (born June 14, 1946) is the 45th and former president of "
        "the United States.  Before entering politics, he was a businessman and television personality.")

print(f"Type of text: {type(text)}")
print(f"Length of text: {len(text)}")

Type of text: <class 'str'>
Length of text: 169


Applying the spaCy model that we assigned to the variable `en` to the text produces a processed document, `doc_en`, which is split into sentences and tokens:

In [None]:
#| export
doc_en = en(text)

In [None]:
#| export
list(doc_en.sents)

[Donald John Trump (born June 14, 1946) is the 45th and former president of the United States.  ,
 Before entering politics, he was a businessman and television personality.]

In [None]:
#| export
print(len(list(doc_en.sents)))

2


In [None]:
#| export
tokens = [[t] for t in doc_en]
# display_html renders the table itself and returns None, so no outer display() is needed
display_html(tabulate.tabulate(tokens, tablefmt='html'))

0
Donald
John
Trump
(
born
June
14
","
1946
)

spaCy also identifies a number of linguistic features for every token: `lemma_` (the base form), `pos_` (the universal POS tag), and `tag_` (the more fine-grained, language-specific POS tag):

In [None]:
#| export
features = [[t.orth_, t.lemma_, t.pos_, t.tag_] for t in doc_en]
display_html(tabulate.tabulate(features, tablefmt='html'))

0,1,2,3
Donald,Donald,PROPN,NNP
John,John,PROPN,NNP
Trump,Trump,PROPN,NNP
(,(,PUNCT,-LRB-
born,bear,VERB,VBN
June,June,PROPN,NNP
14,14,NUM,CD
",",",",PUNCT,","
1946,1946,NUM,CD
),),PUNCT,-RRB-
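
The relation between the coarse `pos_` tags and the fine-grained `tag_` values becomes visible when one is grouped under the other. The following is a minimal sketch that works on rows copied from the table above rather than on a live `doc_en`; the helper name `tags_by_pos` is made up for this illustration.

```python
from collections import defaultdict

# (orth_, lemma_, pos_, tag_) rows copied from the table above
rows = [
    ("Donald", "Donald", "PROPN", "NNP"),
    ("John", "John", "PROPN", "NNP"),
    ("Trump", "Trump", "PROPN", "NNP"),
    ("(", "(", "PUNCT", "-LRB-"),
    ("born", "bear", "VERB", "VBN"),
    ("June", "June", "PROPN", "NNP"),
    ("14", "14", "NUM", "CD"),
    (",", ",", "PUNCT", ","),
    ("1946", "1946", "NUM", "CD"),
    (")", ")", "PUNCT", "-RRB-"),
]


def tags_by_pos(rows):
    """Map each coarse POS tag to the set of fine-grained tags seen with it."""
    grouped = defaultdict(set)
    for _orth, _lemma, pos, tag in rows:
        grouped[pos].add(tag)
    return dict(grouped)


grouped = tags_by_pos(rows)
for pos, tags in sorted(grouped.items()):
    print(pos, sorted(tags))
```

In this sample the single coarse tag `PUNCT` covers three different fine-grained tags, which is the kind of distinction `tag_` adds on top of `pos_`.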

spaCy also offers pre-trained models for NER (Named Entity Recognition). The results can be found on the `ent_iob_` and `ent_type_` attributes.

The `ent_type_` attribute tells us what type of entity the token belongs to: 'Donald John Trump' => PERSON, 'June 14, 1946' => DATE, '45th' => ORDINAL, and 'the United States' => GPE (geopolitical entity).

The `ent_iob_` attribute gives the position of the token within an entity by way of the letters I, O, and B: `B` means that the token begins an entity, `I` means that the token is inside an entity (it continues one), and `O` means that the token is outside any entity. The IOB scheme therefore tells you where each entity starts and which tokens belong to it.
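
To make the IOB scheme concrete, here is a minimal sketch of how entity spans could be rebuilt from per-token tags. It works on plain `(text, iob, label)` triples, so it does not need a loaded model; the function name `iob_to_spans` and the sample data are assumptions made for this illustration, with the sample modelled on the example sentence.

```python
def iob_to_spans(tagged_tokens):
    """Group (text, iob, label) triples into (entity_text, label) pairs."""
    spans = []
    current_tokens, current_label = [], None

    def close_current():
        # Store the entity collected so far, if there is one.
        if current_tokens:
            spans.append((" ".join(current_tokens), current_label))

    for text, iob, label in tagged_tokens:
        if iob == "B" or (iob == "I" and not current_tokens):
            # "B" starts a new entity; a stray "I" is treated the same way.
            close_current()
            current_tokens, current_label = [text], label
        elif iob == "I":
            # "I" continues the entity that is currently open.
            current_tokens.append(text)
        else:
            # "O" ends any open entity.
            close_current()
            current_tokens, current_label = [], None
    close_current()
    return spans


sample = [
    ("Donald", "B", "PERSON"), ("John", "I", "PERSON"), ("Trump", "I", "PERSON"),
    ("(", "O", ""), ("born", "O", ""),
    ("June", "B", "DATE"), ("14", "I", "DATE"), (",", "I", "DATE"), ("1946", "I", "DATE"),
    (")", "O", ""),
]
spans = iob_to_spans(sample)
print(spans)
```

Because the sketch joins tokens with single spaces, the date comes out as `June 14 , 1946`; a spaCy document keeps the original spacing of the text.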

In [None]:
#| export
entities = [(t.orth_, t.ent_iob_, t.ent_type_) for t in doc_en]
display_html(tabulate.tabulate(entities, tablefmt='html'))

0,1,2
Donald,B,PERSON
John,I,PERSON
Trump,I,PERSON
(,O,
born,O,
June,B,DATE
14,I,DATE
",",I,DATE
1946,I,DATE
),O,

We can also access the recognized entities directly through the `ents` attribute of the document:

In [None]:
#| export
print([(ent.text, ent.label_) for ent in doc_en.ents])

[('Donald John Trump', 'PERSON'), ('June 14, 1946', 'DATE'), ('45th', 'ORDINAL'), ('the United States', 'GPE')]


On top of all this, the spaCy model also includes a dependency parser that analyzes the grammatical relations between the tokens:

In [None]:
#| export
syntax = [[token.text, token.dep_, token.head.text] for token in doc_en]

We display the results, kept in the variable `syntax`, in the usual way:

In [None]:
display_html(tabulate.tabulate(syntax, tablefmt='html'))

0,1,2
Donald,compound,Trump
John,compound,Trump
Trump,nsubj,is
(,punct,Trump
born,acl,Trump
June,npadvmod,born
14,nummod,June
",",punct,June
1946,nummod,June
),punct,Trump
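
Each row of the table above links a token to its head, so the rows together describe a tree. As a rough sketch, the rows can be inverted into a mapping from each head to its dependents. The rows are copied from the output above and the helper name `children_by_head` is hypothetical. Keying on the token text is a simplification that breaks down when a word occurs more than once; on a live document one would rather rely on spaCy's own `token.children`.

```python
from collections import defaultdict

# (text, dep_, head.text) rows copied from the table above
rows = [
    ("Donald", "compound", "Trump"),
    ("John", "compound", "Trump"),
    ("Trump", "nsubj", "is"),
    ("(", "punct", "Trump"),
    ("born", "acl", "Trump"),
    ("June", "npadvmod", "born"),
    ("14", "nummod", "June"),
    (",", "punct", "June"),
    ("1946", "nummod", "June"),
    (")", "punct", "Trump"),
]


def children_by_head(rows):
    """Map each head word to its (dependent, relation) pairs, in text order."""
    children = defaultdict(list)
    for text, dep, head in rows:
        children[head].append((text, dep))
    return dict(children)


children = children_by_head(rows)
for head, dependents in children.items():
    print(head, "->", dependents)
```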

## Multilingual NLP

As the name of the model we loaded (`en_core_web_sm`) suggests, this model is trained on and targeted at the English language.

The spaCy website helps you select models for different use cases:

https://spacy.io/usage/models

Models for other languages are available as well. Let's try one out on a Dutch text:

In [None]:
#| export
nl = spacy.load('nl_core_news_sm')
text_nl = ("Mark Rutte is minister-president van Nederland. "
           "Hij is van de VVD en heeft een slecht geheugen.")

In [None]:
#| export
doc_nl = nl(text_nl)

Because the Dutch model was trained on different data and with a different tag set, its output differs from that of the English model.

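The most important difference noted originally is that the Dutch models do not offer lemmatization, in which case the `lemma_` attribute just returns the `orth_` attribute. With the model version used here, the first column of the output below (`lemma_`) does contain lemmas, for example `zijn` for 'is'.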

NB. whenever numbers turn up in the generated tables where strings are expected, they are spaCy's internal integer IDs for those values. This usually means that we specified a token attribute without the trailing underscore, for example `ent_iob` instead of `ent_iob_`.

In [None]:
#| export
info = [(t.lemma_, t.pos_, t.tag_, t.ent_iob_, t.ent_type_) for t in doc_nl]
display_html(tabulate.tabulate(info, tablefmt='html'))

0,1,2,3,4
Mark,PROPN,SPEC|deeleigen,B,PERSON
Rutte,PROPN,SPEC|deeleigen,O,
zijn,AUX,WW|pv|tgw|ev,O,
ministerpresident,NOUN,N|soort|ev|basis|zijd|stan,O,
van,ADP,VZ|init,O,
Nederland,PROPN,N|eigen|ev|basis|onz|stan,B,GPE
.,PUNCT,LET,O,
hij,PRON,VNW|pers|pron|nomin|vol|3|ev|masc,O,
zijn,AUX,WW|pv|tgw|ev,O,
van,ADP,VZ|init,O,

If one is working with Dutch texts, the Python library **stanza** is the one to use (the Telematika notebook uses the stanfordnlp library, but that library is no longer recommended).



In [None]:
# we ran stanza.download('nl') in the terminal beforehand
nl_nlp = stanza.Pipeline('nl', use_gpu=False)

Downloading https://raw.githubusercontent.com/stanfordnlp/stanza-resources/main/resources_1.4.0.json:   0%|   …

2022-08-29 16:39:16 INFO: Loading these models for language: nl (Dutch):
| Processor | Package |
-----------------------
| tokenize  | alpino  |
| pos       | alpino  |
| lemma     | alpino  |
| depparse  | alpino  |
| ner       | conll02 |

2022-08-29 16:39:16 INFO: Use device: cpu
2022-08-29 16:39:16 INFO: Loading: tokenize
2022-08-29 16:39:16 INFO: Loading: pos
2022-08-29 16:39:16 INFO: Loading: lemma
2022-08-29 16:39:16 INFO: Loading: depparse
2022-08-29 16:39:17 INFO: Loading: ner
2022-08-29 16:39:17 INFO: Done loading processors!


In [None]:
doc_nl_stanza = nl_nlp(text_nl)

Getting this to work with the GPU is still to do, but so far, so good. Via the model we now have access to the text and lemma of each word, and also to the attributes `upos`, `xpos`, `head` (the governor), and `deprel` (the dependency relation).

In [None]:
stanza_info = []
for sentence in doc_nl_stanza.sentences:
  for word in sentence.words:
    stanza_info.append((len(stanza_info)+1, word.text, word.lemma, word.pos, word.upos, word.xpos, word.deprel))

In [None]:
display_html(tabulate.tabulate(stanza_info, tablefmt='html'))

0,1,2,3,4,5,6
1,Mark,Mark,PROPN,PROPN,SPEC|deeleigen,nsubj
2,Rutte,Rutte,PROPN,PROPN,SPEC|deeleigen,flat
3,is,zijn,AUX,AUX,WW|pv|tgw|ev,cop
4,minister-president,minister_president,NOUN,NOUN,N|soort|ev|basis|zijd|stan,root
5,van,van,ADP,ADP,VZ|init,case
6,Nederland,Nederland,PROPN,PROPN,N|eigen|ev|basis|onz|stan,nmod
7,.,.,PUNCT,PUNCT,LET,punct
8,Hij,hij,PRON,PRON,VNW|pers|pron|nomin|vol|3|ev|masc,nsubj
9,is,zijn,AUX,AUX,WW|pv|tgw|ev,root
10,van,van,ADP,ADP,VZ|init,case


### Combining spaCy and Stanza

Thanks to the spacy-stanza wrapper we can combine the two libraries in a single pipeline. First we install the `spacy-stanza` package with pip.

In [None]:
nlp_spacy_stanza = spacy_stanza.load_pipeline('nl', use_gpu=False)

Downloading https://raw.githubusercontent.com/stanfordnlp/stanza-resources/main/resources_1.4.0.json:   0%|   …

2022-08-29 21:00:33 INFO: Loading these models for language: nl (Dutch):
| Processor | Package |
-----------------------
| tokenize  | alpino  |
| pos       | alpino  |
| lemma     | alpino  |
| depparse  | alpino  |
| ner       | conll02 |

2022-08-29 21:00:33 INFO: Use device: cpu
2022-08-29 21:00:33 INFO: Loading: tokenize
2022-08-29 21:00:33 INFO: Loading: pos
2022-08-29 21:00:33 INFO: Loading: lemma
2022-08-29 21:00:33 INFO: Loading: depparse
2022-08-29 21:00:33 INFO: Loading: ner
2022-08-29 21:00:34 INFO: Done loading processors!


In [None]:
doc_nlp_spacy_stanza = nlp_spacy_stanza("Mark Rutte is minister-president van Nederland. "
                                        "Hij is van de VVD en heeft een slecht actief geheugen.")
for token in doc_nlp_spacy_stanza:
  print(token.text, token.lemma_, token.pos_, token.dep_, token.ent_type_)
print(doc_nlp_spacy_stanza.ents)

Mark Mark PROPN nsubj PER
Rutte Rutte PROPN flat PER
is zijn AUX cop 
minister-president minister_president NOUN root 
van van ADP case 
Nederland Nederland PROPN nmod LOC
. . PUNCT punct 
Hij hij PRON nsubj 
is zijn AUX root 
van van ADP case 
de de DET det 
VVD VVD PROPN obl ORG
en en CCONJ cc 
heeft hebben VERB conj 
een een DET det 
slecht slecht ADJ advmod 
actief actief ADJ amod 
geheugen geheug NOUN obj 
. . PUNCT punct 
(Mark Rutte, Nederland, VVD)
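
The Stanza-backed pipeline labels entities as `PER`, `LOC`, and `ORG`, whereas the spaCy Dutch model earlier in this notebook used `PERSON` and `GPE`. To compare the two outputs token by token, the label sets first have to be brought in line. The sketch below does this for the first sentence, with labels copied from the two outputs. The mapping `SPACY_TO_CONLL` and the function `disagreements` are assumptions made for this illustration, not an official correspondence between the label sets.

```python
# Hypothetical mapping from the spaCy-style labels to the CoNLL-style labels;
# an assumption for illustration only.
SPACY_TO_CONLL = {"PERSON": "PER", "GPE": "LOC", "ORG": "ORG"}

# Tokens and entity types of the first sentence, copied from the outputs above
tokens = ["Mark", "Rutte", "is", "minister-president", "van", "Nederland", "."]
spacy_labels = ["PERSON", "", "", "", "", "GPE", ""]
stanza_labels = ["PER", "PER", "", "", "", "LOC", ""]


def disagreements(tokens, labels_a, labels_b, mapping):
    """Return (token, label_a, label_b) for tokens whose normalized labels differ."""
    diffs = []
    for token, label_a, label_b in zip(tokens, labels_a, labels_b):
        normalized = mapping.get(label_a, label_a)  # unknown labels pass through
        if normalized != label_b:
            diffs.append((token, normalized, label_b))
    return diffs


diffs = disagreements(tokens, spacy_labels, stanza_labels, SPACY_TO_CONLL)
print(diffs)
```

For this sentence the only difference is 'Rutte', which the spaCy model left untagged while the Stanza pipeline marked it as part of the person name.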


In [None]:
#| hide
import nbdev; nbdev.nbdev_export()