# NLP with pretrained models - spaCy and StanfordNLP

In [1]:
import spacy

en = spacy.load("en")

In [2]:
text = ("Donald John Trump (born June 14, 1946) is the 45th and current president of "
        "the United States.  Before entering politics, he was a businessman and television personality.")
print(text)

Donald John Trump (born June 14, 1946) is the 45th and current president of the United States.  Before entering politics, he was a businessman and television personality.


In [3]:
doc_en = en(text)

First, spaCy splits your document into sentences, and the sentences into tokens.

In [4]:
list(doc_en.sents)

[Donald John Trump (born June 14, 1946) is the 45th and current president of the United States.  ,
 Before entering politics, he was a businessman and television personality.]

In [5]:
from IPython.display import HTML, display
import tabulate

tokens = [[token] for token in doc_en]
display(HTML(tabulate.tabulate(tokens, tablefmt='html')))

0
Donald
John
Trump
(
born
June
14
","
1946
)


spaCy also identifies a number of linguistic features for every token. The most basic of these are the lemma and two types of part-of-speech tags: the `pos_` attribute contains the [Universal POS tags](https://universaldependencies.org/u/pos/) from the [Universal Dependencies](https://universaldependencies.org/) project, while the `tag_` attribute contains more fine-grained, language-specific part-of-speech tags.

In [6]:
features = [[t.orth_, t.lemma_, t.pos_, t.tag_] for t in doc_en]
display(HTML(tabulate.tabulate(features, tablefmt='html')))

0,1,2,3
Donald,donald,PROPN,NNP
John,john,PROPN,NNP
Trump,trump,PROPN,NNP
(,(,PUNCT,-LRB-
born,bear,VERB,VBN
June,june,PROPN,NNP
14,14,NUM,CD
",",",",PUNCT,","
1946,1946,NUM,CD
),),PUNCT,-RRB-


Next, spaCy offers pre-trained models for named entity recognition. Their results can be found on the `ent_iob_` and `ent_type_` attributes. The `ent_type_` attribute tells us what type of entity the token refers to. In the English models, these entity types follow the [OntoNotes standard](https://catalog.ldc.upenn.edu/docs/LDC2013T19/OntoNotes-Release-5.0.pdf). In our example, we see that `Donald John Trump` refers to a person, `June 14, 1946` to a date, `45th` to an ordinal number, and `the United States` to a geo-political entity (GPE).

The letters on the `ent_iob_` attribute give the position of the token in the entity. `O` means the token is outside of an entity, `B` means the token is at the beginning of an entity, and `I` means it is inside an entity (at any position except for the beginning). In this way, we can tell apart several entities of the same type that immediately follow each other. Together these letters form the so-called `BIO` tagging scheme. There are other tagging schemes, such as `BILUO`, which also has letters for the last position and single (unique) tokens in an entity, but the BIO scheme gives you all the information you need.  

In [7]:
entities = [(t.orth_, t.ent_iob_, t.ent_type_) for t in doc_en]
display(HTML(tabulate.tabulate(entities, tablefmt='html')))

0,1,2
Donald,B,PERSON
John,I,PERSON
Trump,I,PERSON
(,O,
born,O,
June,B,DATE
14,I,DATE
",",I,DATE
1946,I,DATE
),O,
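
To make the BIO scheme concrete, here is a minimal sketch in plain Python that rebuilds entity spans from the `(text, ent_iob_, ent_type_)` triples in the table above. The function name `group_bio` and the hard-coded `rows` list are illustrative assumptions, not part of spaCy; the tokens are joined with single spaces, so the original whitespace is not restored.

```python
def group_bio(tagged_tokens):
    """Group (text, iob, entity_type) triples into (entity_text, entity_type) spans."""
    spans = []
    current_tokens, current_type = [], None

    def flush():
        # Store the entity collected so far, if there is one.
        if current_tokens:
            spans.append((" ".join(current_tokens), current_type))

    for text, iob, ent_type in tagged_tokens:
        if iob == "B" or (iob == "I" and ent_type != current_type):
            # "B" always opens a new entity; a stray "I" is treated like "B".
            flush()
            current_tokens, current_type = [text], ent_type
        elif iob == "I":
            # Continue the entity that is currently open.
            current_tokens.append(text)
        else:
            # "O": the token is outside any entity, so close the open one.
            flush()
            current_tokens, current_type = [], None
    flush()
    return spans


# The first ten rows of the table above, copied by hand.
rows = [
    ("Donald", "B", "PERSON"), ("John", "I", "PERSON"), ("Trump", "I", "PERSON"),
    ("(", "O", ""), ("born", "O", ""),
    ("June", "B", "DATE"), ("14", "I", "DATE"), (",", "I", "DATE"), ("1946", "I", "DATE"),
    (")", "O", ""),
]

print(group_bio(rows))
```

Because every entity starts with `B`, two entities of the same type that directly follow each other still come out as two separate spans.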


You can also access the entities directly on the `ents` attribute of the document: 

In [8]:
print([(ent.text, ent.label_) for ent in doc_en.ents])

[('Donald John Trump', 'PERSON'), ('June 14, 1946', 'DATE'), ('45th', 'ORDINAL'), ('the United States', 'GPE')]


Finally, spaCy also contains a dependency parser, which analyzes the grammatical relations between the tokens. 

In [9]:
syntax = [[token.text, token.dep_, token.head.text] for token in doc_en]
display(HTML(tabulate.tabulate(syntax, tablefmt='html')))

0,1,2
Donald,compound,Trump
John,compound,Trump
Trump,nsubj,is
(,punct,Trump
born,acl,Trump
June,npadvmod,born
14,nummod,June
",",punct,June
1946,nummod,June
),punct,Trump


## Multilingual NLP

spaCy has models not only for English, but also for many other languages. Here's an example of a Dutch sentence, which means "Charles Michel is the prime minister of Belgium".

In [10]:
nl = spacy.load("nl")
text_nl = "Charles Michel is de eerste minister van België."
doc_nl = nl(text_nl)

The tokens in the Dutch document have the same attributes as those in the English one. Take care, however, because the functionality of the models can differ across languages. Here are three main differences between the English and the Dutch model: 

- The Dutch model does not offer lemmatization: the `lemma_` attribute is identical to the `orth_` attribute.
- The Dutch model has very different fine-grained part-of-speech tags on the `tag_` attribute.
- The Dutch model has different entity types (`PER`, `LOC` and `ORG`) than the English one.

This is a result of the training corpora that were used to build the models, whose annotation guidelines may be very different.

In [11]:
info = [(t.orth_, t.lemma_, t.pos_, t.tag_, t.ent_iob_, t.ent_type_) for t in doc_nl]
display(HTML(tabulate.tabulate(info, tablefmt='html')))

0,1,2,3,4,5
Charles,Charles,NOUN,N_N|eigen|ev|neut_eigen|ev|neut___,B,PER
Michel,Michel,PROPN,PROPN___,I,PER
is,is,VERB,V|hulpofkopp|ott|3|ev__Aspect=Imp|Mood=Ind|Number=Sing|Person=3|Tense=Pres|VerbForm=Fin,O,
de,de,DET,Art|bep|zijdofmv|neut__Definite=Def|PronType=Art,O,
eerste,eerste,NUM,Num|rang|bep|attr|onverv__Definite=Def|NumType=Ord,O,
minister,minister,NOUN,N|soort|ev|neut__Number=Sing,O,
van,van,ADP,Prep|voor__AdpType=Prep,O,
België,België,NOUN,N|eigen|ev|neut__Number=Sing,B,LOC
.,.,PUNCT,Punc|punt__PunctType=Peri,O,
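
The Dutch `tag_` values in the table above look like two parts glued together with a double underscore: a fine-grained tag with `|`-separated parts, followed by `Feature=Value` pairs. The sketch below splits such a string in plain Python. The layout is inferred from the output shown here, not from a documented spaCy format, and the helper name `split_tag` is made up for illustration.

```python
def split_tag(tag):
    """Split a tag such as 'N|soort|ev|neut__Number=Sing' into (parts, features)."""
    # Everything before the first "__" is treated as the fine-grained tag,
    # everything after it as morphological features (an assumption).
    fine, _, morph = tag.partition("__")
    features = dict(
        item.split("=", 1) for item in morph.split("|") if "=" in item
    )
    return fine.split("|"), features


print(split_tag("N|soort|ev|neut__Number=Sing"))
print(split_tag("Art|bep|zijdofmv|neut__Definite=Def|PronType=Art"))
print(split_tag("PROPN___"))  # no features present
```

Splitting the tag this way makes it easier to compare, for example, the `Number` feature across tokens without matching on the full tag string.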


## StanfordNLP

Another library whose functionality overlaps with that of spaCy is StanfordNLP. [StanfordNLP](https://stanfordnlp.github.io/stanfordnlp/), not to be confused with Stanford's Java [CoreNLP](https://stanfordnlp.github.io/CoreNLP/) library, is a [Python library](https://github.com/stanfordnlp/stanfordnlp) built on top of PyTorch that offers a fully neural pipeline with tokenization (including multi-word units), lemmatization, part-of-speech tagging (including morphological features) and dependency parsing. These components were built and trained for the [CoNLL-2018 shared task](https://nlp.stanford.edu/pubs/qi2018universal.pdf). It does not offer named entity recognition, but the quality of its dependency parsing is state of the art. On top of that, it also offers a Python interface to CoreNLP.

Its API is very similar to that of spaCy:

In [12]:
import stanfordnlp

# Uncomment on the first run to download the Dutch models.
# stanfordnlp.download('nl')
nl_stanford = stanfordnlp.Pipeline(lang="nl")

Use device: cpu
---
Loading: tokenize
With settings: 
{'model_path': '/Users/yvespeirsman/stanfordnlp_resources/nl_alpino_models/nl_alpino_tokenizer.pt', 'lang': 'nl', 'shorthand': 'nl_alpino', 'mode': 'predict'}
---
Loading: pos
With settings: 
{'model_path': '/Users/yvespeirsman/stanfordnlp_resources/nl_alpino_models/nl_alpino_tagger.pt', 'pretrain_path': '/Users/yvespeirsman/stanfordnlp_resources/nl_alpino_models/nl_alpino.pretrain.pt', 'lang': 'nl', 'shorthand': 'nl_alpino', 'mode': 'predict'}
---
Loading: lemma
With settings: 
{'model_path': '/Users/yvespeirsman/stanfordnlp_resources/nl_alpino_models/nl_alpino_lemmatizer.pt', 'lang': 'nl', 'shorthand': 'nl_alpino', 'mode': 'predict'}
Building an attentional Seq2Seq model...
Using a Bi-LSTM encoder
Using soft attention for LSTM.
Finetune all embeddings.
[Running seq2seq lemmatizer with edit classifier]
---
Loading: depparse
With settings: 
{'model_path': '/Users/yvespeirsman/stanfordnlp_resources/nl_alpino_models/nl_alpino_parser.p

In [13]:
doc_nl_stanford = nl_stanford(text_nl)

The `text` and `lemma` properties speak for themselves. The `upos` attribute contains the Universal POS tags we also find on spaCy's `pos_` attribute; the `xpos` attribute corresponds to spaCy's `tag_` attribute and contains the fine-grained tags with morphological properties. The `governor` attribute contains the (1-based) index of the head of each token, with 0 marking the root; `dependency_relation` contains the grammatical relation between the two.

In [14]:
stanford_info = []
for sentence in doc_nl_stanford.sentences:
    for token in sentence.tokens:
        for word in token.words:
            stanford_info.append((len(stanford_info)+1, word.text, word.lemma, word.upos, word.xpos, word.dependency_relation, word.governor))

In [15]:
display(HTML(tabulate.tabulate(stanford_info, tablefmt='html')))

0,1,2,3,4,5,6
1,Charles,Charles,PROPN,SPEC|deeleigen,nsubj,6
2,Michel,Michel,PROPN,SPEC|deeleigen,flat:name,1
3,is,zijn,AUX,WW|pv|tgw|ev,cop,6
4,de,de,DET,LID|bep|stan|rest,det,6
5,eerste,één,ADJ,TW|rang|prenom|stan,amod,6
6,minister,minister,NOUN,N|soort|ev|basis|zijd|stan,root,0
7,van,van,ADP,VZ|init,case,8
8,België,België,PROPN,N|eigen|ev|basis|onz|stan,nmod,6
9,.,.,PUNCT,LET,punct,6
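
Since `governor` is a 1-based index rather than a token, it can help to resolve it to the head's text, which gives triples like the ones we printed for spaCy's parser. The sketch below does this in plain Python on rows copied by hand from the table above; `resolve_heads` and the `"ROOT"` placeholder are hypothetical names chosen for this example.

```python
# (index, text, dependency_relation, governor) copied from the table above.
rows = [
    (1, "Charles", "nsubj", 6),
    (2, "Michel", "flat:name", 1),
    (3, "is", "cop", 6),
    (4, "de", "det", 6),
    (5, "eerste", "amod", 6),
    (6, "minister", "root", 0),
    (7, "van", "case", 8),
    (8, "België", "nmod", 6),
    (9, ".", "punct", 6),
]


def resolve_heads(rows):
    """Replace each 1-based governor index with the text of the head word."""
    by_index = {index: text for index, text, _, _ in rows}
    return [
        # Governor 0 means the word is the root and has no head.
        (text, relation, "ROOT" if governor == 0 else by_index[governor])
        for _, text, relation, governor in rows
    ]


for triple in resolve_heads(rows):
    print(triple)
```

Note that the indices restart at 1 in every sentence, so for longer documents the lookup should be built per sentence.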


## Combining spaCy and StanfordNLP

If you can't choose between spaCy and StanfordNLP, you can also combine the two. Yes, sometimes you can have the best of both worlds. Thanks to the `spacy_stanfordnlp` wrapper, we can plug a StanfordNLP model into a spaCy pipeline and get its annotations in a spaCy document. Suddenly we have Dutch lemmatization and state-of-the-art part-of-speech tagging and dependency parsing.

In [16]:
from spacy_stanfordnlp import StanfordNLPLanguage

nl_combined = StanfordNLPLanguage(nl_stanford)

doc_nl_combined = nl_combined(text_nl)

info = [(t.orth_, t.lemma_, t.pos_, t.tag_) for t in doc_nl_combined]
display(HTML(tabulate.tabulate(info, tablefmt='html')))

0,1,2,3
Charles,Charles,PROPN,SPEC|deeleigen
Michel,Michel,PROPN,SPEC|deeleigen
is,zijn,AUX,WW|pv|tgw|ev
de,de,DET,LID|bep|stan|rest
eerste,één,ADJ,TW|rang|prenom|stan
minister,minister,NOUN,N|soort|ev|basis|zijd|stan
van,van,ADP,VZ|init
België,België,PROPN,N|eigen|ev|basis|onz|stan
.,.,PUNCT,LET


Additionally, we can mix and match the strengths of the two libraries, for example by extending our pipeline with spaCy's named entity recognizer.

In [17]:
nl_combined = StanfordNLPLanguage(nl_stanford)
nl_ner = nl.get_pipe("ner")
nl_combined.add_pipe(nl_ner)
nl_combined.vocab.strings.add("PER")

doc_nl_combined = nl_combined(text_nl)

info = [(t.orth_, t.lemma_, t.pos_, t.tag_, t.ent_iob_, t.ent_type_) for t in doc_nl_combined]
display(HTML(tabulate.tabulate(info, tablefmt='html')))

0,1,2,3,4,5
Charles,Charles,PROPN,SPEC|deeleigen,B,PER
Michel,Michel,PROPN,SPEC|deeleigen,I,PER
is,zijn,AUX,WW|pv|tgw|ev,O,
de,de,DET,LID|bep|stan|rest,O,
eerste,één,ADJ,TW|rang|prenom|stan,O,
minister,minister,NOUN,N|soort|ev|basis|zijd|stan,O,
van,van,ADP,VZ|init,O,
België,België,PROPN,N|eigen|ev|basis|onz|stan,B,LOC
.,.,PUNCT,LET,O,
