<a href="https://colab.research.google.com/github/howard-haowen/NLP-demos/blob/main/NSYSU/W01-use-pretrained-models.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# NLP with pretrained models - spaCy and StanfordNLP

In [3]:
!pip install -U pip setuptools wheel
!pip install -U spacy
!python -m spacy download en_core_web_md
!python -m spacy download zh_core_web_md

Collecting en-core-web-md==3.2.0
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_md-3.2.0/en_core_web_md-3.2.0-py3-none-any.whl (45.7 MB)
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_md')
Collecting zh-core-web-md==3.2.0
  Downloading https://github.com/explosion/spacy-models/releases/download/zh_core_web_md-3.2.0/zh_core_web_md-3.2.0-py3-none-any.whl (79.0 MB)
✔ Download and installation successful
You can now load the package via spacy.load('zh_core_web_md')


![](https://spacy.io/pipeline-fde48da9b43661abcdf62ab70a546d71.svg)

In [4]:
import spacy
from IPython.display import HTML, display
import tabulate

In [5]:
spacy.info()

{'location': '/usr/local/lib/python3.7/dist-packages/spacy',
 'pipelines': {'en_core_web_md': '3.2.0', 'zh_core_web_md': '3.2.0'},
 'platform': 'Linux-5.4.144+-x86_64-with-Ubuntu-18.04-bionic',
 'python_version': '3.7.13',
 'spacy_version': '3.2.4'}

- Play with [this Web App](https://share.streamlit.io/howard-haowen/spacy-streamlit/app.py) that I created, which is powered by spaCy.

## English NLP

In [6]:
en = spacy.load("en_core_web_md")

In [7]:
text = ("Donald John Trump (born June 14, 1946) is the 45th and current president of "
        "the United States.  Before entering politics, he was a businessman and television personality.")
print(text)

Donald John Trump (born June 14, 1946) is the 45th and current president of the United States.  Before entering politics, he was a businessman and television personality.


In [8]:
doc_en = en(text)

First, spaCy splits your document into sentences, and the sentences into tokens.

In [9]:
list(doc_en.sents)

[Donald John Trump (born June 14, 1946) is the 45th and current president of the United States.,
  Before entering politics, he was a businessman and television personality.]

In [51]:
tokens = [token.text for token in doc_en]
print(tokens)

['Donald', 'John', 'Trump', '(', 'born', 'June', '14', ',', '1946', ')', 'is', 'the', '45th', 'and', 'current', 'president', 'of', 'the', 'United', 'States', '.', ' ', 'Before', 'entering', 'politics', ',', 'he', 'was', 'a', 'businessman', 'and', 'television', 'personality', '.']


spaCy also identifies a number of linguistic features for every token. The most basic of these are the lemma and two types of part-of-speech tags: the `pos_` attribute contains the [Universal POS tags](https://universaldependencies.org/u/pos/) from [Universal Dependencies](https://universaldependencies.org/), while the `tag_` attribute contains more fine-grained, language-specific part-of-speech tags.

In [11]:
features = [[t.orth_, t.lemma_, t.pos_, t.tag_] for t in doc_en]
display(HTML(tabulate.tabulate(features, tablefmt='html')))

0,1,2,3
Donald,Donald,PROPN,NNP
John,John,PROPN,NNP
Trump,Trump,PROPN,NNP
(,(,PUNCT,-LRB-
born,bear,VERB,VBN
June,June,PROPN,NNP
14,14,NUM,CD
",",",",PUNCT,","
1946,1946,NUM,CD
),),PUNCT,-RRB-


Next, spaCy offers pre-trained models for named entity recognition. Their results can be found on the `ent_iob_` and `ent_type_` attributes. The `ent_type_` attribute tells us what type of entity the token refers to. In the English models, these entity types follow the [OntoNotes standard](https://catalog.ldc.upenn.edu/docs/LDC2013T19/OntoNotes-Release-5.0.pdf). In our example, we see that `Donald John Trump` refers to a person, `June 14, 1946` to a date, `45th` to an ordinal number, and `the United States` to a geo-political entity (GPE).

The letters on the `ent_iob_` attribute give the position of the token in the entity. `O` means the token is outside of an entity, `B` means the token is at the beginning of an entity, and `I` means it is inside an entity (at any position except for the beginning). In this way, we can tell apart several entities of the same type that immediately follow each other. Together these letters form the so-called `BIO` tagging scheme. There are other tagging schemes, such as `BILUO`, which also has letters for the last position and single (unique) tokens in an entity, but the BIO scheme gives you all the information you need.  

In [12]:
entities = [(t.orth_, t.ent_iob_, t.ent_type_) for t in doc_en]
display(HTML(tabulate.tabulate(entities, tablefmt='html')))

0,1,2
Donald,B,PERSON
John,I,PERSON
Trump,I,PERSON
(,O,
born,O,
June,B,DATE
14,I,DATE
",",I,DATE
1946,I,DATE
),O,
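
To make the BIO scheme concrete, here is a minimal sketch in plain Python that groups the `(text, ent_iob_, ent_type_)` rows shown above back into entity spans. The helper name `bio_to_spans` is hypothetical and not part of spaCy; the rows are copied from the table.

```python
def bio_to_spans(tagged):
    """Group (text, iob, label) triples into (tokens, label) entity spans."""
    spans = []
    current_tokens, current_label = [], None
    for text, iob, label in tagged:
        if iob == "B" or (iob == "I" and current_label is None):
            # "B" opens a new entity; a stray "I" is treated like "B".
            if current_tokens:
                spans.append((current_tokens, current_label))
            current_tokens, current_label = [text], label
        elif iob == "I":
            # "I" continues the entity that is currently open.
            current_tokens.append(text)
        else:
            # "O" closes any open entity.
            if current_tokens:
                spans.append((current_tokens, current_label))
            current_tokens, current_label = [], None
    if current_tokens:
        spans.append((current_tokens, current_label))
    return spans


# Rows copied from the table above.
rows = [
    ("Donald", "B", "PERSON"), ("John", "I", "PERSON"), ("Trump", "I", "PERSON"),
    ("(", "O", ""), ("born", "O", ""),
    ("June", "B", "DATE"), ("14", "I", "DATE"), (",", "I", "DATE"), ("1946", "I", "DATE"),
    (")", "O", ""),
]
spans = bio_to_spans(rows)
print(spans)
```

Because every entity starts with `B`, two adjacent entities of the same type would still come out as two separate spans.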


You can also access the entities directly on the `ents` attribute of the document: 

In [13]:
print([(ent.text, ent.label_) for ent in doc_en.ents])

[('Donald John Trump', 'PERSON'), ('June 14, 1946', 'DATE'), ('45th', 'ORDINAL'), ('the United States', 'GPE')]


spaCy also contains a dependency parser, which analyzes the grammatical relations between the tokens. 

In [14]:
syntax = [[token.text, token.dep_, token.head.text ] for token in doc_en]
display(HTML(tabulate.tabulate(syntax, tablefmt='html')))

0,1,2
Donald,compound,Trump
John,compound,Trump
Trump,nsubj,is
(,punct,Trump
born,acl,Trump
June,npadvmod,born
14,nummod,June
",",punct,June
1946,nummod,June
),punct,Trump
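
Each row above points from a token to its head, so the rows together describe a tree. The sketch below rebuilds part of that tree from the rows in plain Python. Keying tokens by their text is a simplification that only works here because no word is repeated in these rows; in spaCy itself, `token.children` and `token.subtree` give the same information on real token objects.

```python
from collections import defaultdict

# (text, dep, head text) rows copied from the table above.
rows = [
    ("Donald", "compound", "Trump"), ("John", "compound", "Trump"),
    ("Trump", "nsubj", "is"), ("(", "punct", "Trump"),
    ("born", "acl", "Trump"), ("June", "npadvmod", "born"),
    ("14", "nummod", "June"), (",", "punct", "June"),
    ("1946", "nummod", "June"), (")", "punct", "Trump"),
]

# Invert the head pointers: map each head to its direct dependents.
children = defaultdict(list)
for text, dep, head in rows:
    children[head].append((text, dep))


def subtree(word):
    """Return the word followed by all of its descendants, depth-first."""
    result = [word]
    for child, _dep in children.get(word, []):
        result.extend(subtree(child))
    return result


print(children["June"])
print(subtree("born"))
```

The subtree of `born` collects the whole parenthetical date, which is how a phrase can be extracted from a dependency parse.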


Finally, the English spaCy model contains a morphological parser.

In [42]:
features = [[t.text, t.morph] for t in doc_en]
display(HTML(tabulate.tabulate(features, tablefmt='html')))

0,1
Donald,Number=Sing
John,Number=Sing
Trump,Number=Sing
(,PunctSide=Ini|PunctType=Brck
born,Aspect=Perf|Tense=Past|VerbForm=Part
June,Number=Sing
14,NumType=Card
",",PunctType=Comm
1946,NumType=Card
),PunctSide=Fin|PunctType=Brck
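
The morphological analysis is printed in the Universal Dependencies `Feature=Value|Feature=Value` format. As a small illustration of that format, the following sketch parses such a string into a dictionary in plain Python. The helper `parse_morph` is hypothetical; on a real token, `token.morph.to_dict()` should give the equivalent result.

```python
def parse_morph(morph):
    """Turn 'Aspect=Perf|Tense=Past' into {'Aspect': 'Perf', 'Tense': 'Past'}."""
    if not morph:
        # Tokens without morphological features have an empty string.
        return {}
    return dict(feature.split("=", 1) for feature in morph.split("|"))


# Feature string copied from the row for "born" above.
feats = parse_morph("Aspect=Perf|Tense=Past|VerbForm=Part")
print(feats)
print(feats.get("Tense"))
```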


## Multilingual NLP

spaCy has models not only for English but also for many other languages.

In [15]:
zh = spacy.load("zh_core_web_md")

In [16]:
text_zh = "中山大學人文暨科技跨領域學士學位學程助理教授宋世祥表示，2021年聖誕節假期期間，師生舉辦「街頭玩童～鹽埕兒童街區遊戲日」成果展。活動中可看見學生運用贊助單位瑞儀教育基金會致贈的廢棄木棧板，製作了6具兒童創意遊具，一方面展示學習成果，也希望藉此呼籲高雄民眾重視兒童的遊戲權。"
doc_zh = zh(text_zh)

The tokens in the Chinese document have the same attributes as those in the English one. Take care, however, because the functionality of the models can differ across languages. Here are three main differences between the English and the Chinese model: 

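- The Chinese model does not offer lemmatization: its pipeline has no lemmatizer component, so `lemma_` carries no lemma information. (The `text` and `orth_` attributes are identical for every token in any language.)
- The Chinese model has very different fine-grained part-of-speech tags on the `tag_` attribute.
- The entity label set can differ from the English one, depending on the training corpus; check the labels the model actually produces (the output below shows `ORG`, for example).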

This is a result of the training corpora that were used to build the models, whose annotation guidelines may be very different.

In [17]:
list(doc_zh.sents)

[中山大學人文暨科技跨領域學士學位學程助理教授宋世祥表示，2021年聖誕節假期期間，師生舉辦「街頭玩童～鹽埕兒童街區遊戲日」成果展。,
 活動中可看見學生運用贊助單位瑞儀教育基金會致贈的廢棄木棧板，製作了6具兒童創意遊具，一方面展示學習成果，也希望藉此呼籲高雄民眾重視兒童的遊戲權。]

In [18]:
tok_text = [tok.text for tok in doc_zh]
tok_orth = [tok.orth_ for tok in doc_zh]
print(tok_text)
print(tok_orth)

['中山', '大學', '人文', '暨', '科技', '跨領域', '學士', '學位', '學程', '助理', '教授', '宋世祥', '表示', '，', '2021年', '聖誕', '節', '假期', '期間', '，', '師生', '舉辦', '「', '街頭', '玩童', '～鹽', '埕兒', '童街', '區遊', '戲日', '」', '成果', '展', '。', '活動', '中', '可', '看見', '學生', '運用', '贊助', '單位', '瑞儀', '教育', '基金會', '致贈', '的', '廢棄木', '棧板', '，', '製作', '了', '6', '具', '兒童', '創意', '遊具', '，', '一方面', '展示', '學習', '成果', '，', '也', '希望', '藉此', '呼籲', '高雄', '民眾', '重視', '兒童', '的', '遊戲權', '。']
['中山', '大學', '人文', '暨', '科技', '跨領域', '學士', '學位', '學程', '助理', '教授', '宋世祥', '表示', '，', '2021年', '聖誕', '節', '假期', '期間', '，', '師生', '舉辦', '「', '街頭', '玩童', '～鹽', '埕兒', '童街', '區遊', '戲日', '」', '成果', '展', '。', '活動', '中', '可', '看見', '學生', '運用', '贊助', '單位', '瑞儀', '教育', '基金會', '致贈', '的', '廢棄木', '棧板', '，', '製作', '了', '6', '具', '兒童', '創意', '遊具', '，', '一方面', '展示', '學習', '成果', '，', '也', '希望', '藉此', '呼籲', '高雄', '民眾', '重視', '兒童', '的', '遊戲權', '。']


In [19]:
info = [(t.text, t.pos_, t.tag_, t.ent_iob_, t.ent_type_) for t in doc_zh]
display(HTML(tabulate.tabulate(info, tablefmt='html')))

0,1,2,3,4
中山,PROPN,NR,B,ORG
大學,NOUN,NN,I,ORG
人文,NOUN,NN,I,ORG
暨,CCONJ,CC,I,ORG
科技,NOUN,NN,I,ORG
跨領域,ADJ,JJ,I,ORG
學士,NOUN,NN,O,
學位,NOUN,NN,O,
學程,NOUN,NN,O,
助理,NOUN,NN,O,


In [20]:
for tok in list(doc_zh.sents)[1]:
    print(f"{tok.text} >>> {tok.pos_}")

活動 >>> NOUN
中 >>> PART
可 >>> VERB
看見 >>> VERB
學生 >>> NOUN
運用 >>> VERB
贊助 >>> NOUN
單位 >>> NOUN
瑞儀 >>> PROPN
教育 >>> NOUN
基金會 >>> NOUN
致贈 >>> NOUN
的 >>> PART
廢棄木 >>> NOUN
棧板 >>> NOUN
， >>> PUNCT
製作 >>> VERB
了 >>> PART
6 >>> NUM
具 >>> NUM
兒童 >>> NOUN
創意 >>> ADJ
遊具 >>> NOUN
， >>> PUNCT
一方面 >>> ADV
展示 >>> VERB
學習 >>> NOUN
成果 >>> NOUN
， >>> PUNCT
也 >>> ADV
希望 >>> VERB
藉此 >>> ADV
呼籲 >>> VERB
高雄 >>> PROPN
民眾 >>> NOUN
重視 >>> VERB
兒童 >>> NOUN
的 >>> PART
遊戲權 >>> PROPN
。 >>> PUNCT


In [21]:
for tok in list(doc_zh.sents)[1]:
    print(f"{tok.text} >>> {tok.tag_} | {spacy.explain(tok.tag_)}")

活動 >>> NN | noun, singular or mass
中 >>> LC | localizer
可 >>> VV | other verb
看見 >>> VV | other verb
學生 >>> NN | noun, singular or mass
運用 >>> VV | other verb
贊助 >>> NN | noun, singular or mass
單位 >>> NN | noun, singular or mass
瑞儀 >>> NR | proper noun
教育 >>> NN | noun, singular or mass
基金會 >>> NN | noun, singular or mass
致贈 >>> NN | noun, singular or mass
的 >>> DEG | associative 的
廢棄木 >>> NN | noun, singular or mass
棧板 >>> NN | noun, singular or mass
， >>> PU | punctuation
製作 >>> VV | other verb
了 >>> AS | aspect marker
6 >>> CD | cardinal number
具 >>> M | measure word
兒童 >>> NN | noun, singular or mass
創意 >>> JJ | adjective (English), other noun-modifier (Chinese)
遊具 >>> NN | noun, singular or mass
， >>> PU | punctuation
一方面 >>> AD | adverb
展示 >>> VV | other verb
學習 >>> NN | noun, singular or mass
成果 >>> NN | noun, singular or mass
， >>> PU | punctuation
也 >>> AD | adverb
希望 >>> VV | other verb
藉此 >>> AD | adverb
呼籲 >>> VV | other verb
高雄 >>> NR | proper noun
民眾 >>> NN | noun, singul

## Visualization

In [22]:
from spacy import displacy

In [23]:
displacy.render(doc_zh, style='ent',jupyter=True, options={'distance':130})

In [24]:
text = "我想要三份2號餐"
doc = zh(text)
displacy.render(doc, style='dep',jupyter=True, options={'distance':130})

## DataFrame + spaCy = dframcy

In [25]:
!pip install dframcy

In [26]:
from dframcy import DframCy

In [27]:
nlp = spacy.load('zh_core_web_md')
dframcy = DframCy(nlp)
doc = dframcy.nlp(text_zh)
annotation_dataframe = dframcy.to_dataframe(doc)
annotation_dataframe

Unnamed: 0,token_text,token_start,token_end,token_pos_,token_tag_,token_dep_,token_head,token_ent_type_
0,中山,0,2,PROPN,NR,compound:nn,大學,ORG
1,大學,2,4,NOUN,NN,nmod:assmod,科技,ORG
2,人文,4,6,NOUN,NN,conj,科技,ORG
3,暨,6,7,CCONJ,CC,cc,科技,ORG
4,科技,7,9,NOUN,NN,compound:nn,跨領域,ORG
...,...,...,...,...,...,...,...,...
69,重視,128,130,VERB,VV,ccomp,呼籲,
70,兒童,130,132,NOUN,NN,dobj,重視,
71,的,132,133,PART,DEC,mark,重視,
72,遊戲權,133,136,PROPN,NR,dobj,呼籲,


In [28]:
nn_filt = annotation_dataframe['token_tag_'] == 'NN'
dobj_filt = annotation_dataframe['token_dep_'] == 'dobj'
annotation_dataframe[(nn_filt) & dobj_filt]

Unnamed: 0,token_text,token_start,token_end,token_pos_,token_tag_,token_dep_,token_head,token_ent_type_
29,戲日,58,60,NOUN,NN,dobj,舉辦,
32,展,63,64,NOUN,NN,dobj,舉辦,
48,棧板,92,94,NOUN,NN,dobj,運用,
56,遊具,104,106,NOUN,NN,dobj,製作,
61,成果,114,116,NOUN,NN,dobj,展示,
70,兒童,130,132,NOUN,NN,dobj,重視,


## Vectors

In [29]:
doc = zh("教授")
tok = doc[0]
tok.vector

array([ 2.2328  , -1.1713  , -3.3528  , -1.1691  , -0.26724 ,  4.4476  ,
       -0.66089 ,  2.6248  , -1.5367  , -2.8449  , -4.0233  ,  1.5727  ,
        1.978   ,  2.7964  ,  1.003   ,  0.29978 ,  0.056525,  3.7048  ,
        2.0446  ,  2.2452  , -5.7184  ,  0.77814 , -1.8383  , -0.017231,
       -1.91    , -6.4355  , -4.6737  , -0.13519 ,  0.66087 , -1.6718  ,
        3.5934  ,  2.3382  , -4.5406  ,  1.6124  , -2.2361  , -6.0387  ,
       -3.4078  ,  1.1304  ,  0.80933 ,  1.9734  ,  2.3314  , -0.9882  ,
       -1.1947  ,  2.2628  , -1.3687  , -6.4278  ,  0.15906 ,  0.047335,
       -2.8157  , -1.6407  ,  2.4385  , -0.84336 ,  3.081   ,  5.9188  ,
       -1.3019  ,  1.2971  ,  7.2325  ,  2.9722  , -0.45552 ,  1.5148  ,
       -1.1193  ,  3.8739  ,  1.482   , -2.4657  ,  1.4627  , -3.562   ,
       -2.1737  , -1.4306  ,  3.4363  , -1.2796  , -1.4106  ,  2.2146  ,
        2.9325  , -2.5172  ,  2.7192  , -0.84556 , -2.5362  ,  2.2079  ,
       -3.2217  , -2.2081  ,  4.6204  ,  0.98445 , 

In [30]:
tok.vector.shape

(300,)

In [31]:
word_1 = nlp.vocab["高興"]
word_2 = nlp.vocab["高雄"]
word_3 = nlp.vocab["開心"]
word_1_word_2 = word_1.similarity(word_2)
word_1_word_3 = word_1.similarity(word_3)
# similarity() returns a similarity score (higher means more similar), not a distance
print(f"Similarity between word 1 and 2: {word_1_word_2}")
print(f"Similarity between word 1 and 3: {word_1_word_3}")

Similarity between word 1 and 2: 0.27085748314857483
Similarity between word 1 and 3: 0.8141297101974487


- Cosine similarity 

![](https://datascience-enthusiast.com/figures/cosine_sim.png)

- Formula for calculating cosine similarity between two vectors

![](https://miro.medium.com/max/1400/1*LfW66-WsYkFqWc4XYJbEJg.png)
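
The formula can be written out in a few lines of plain Python: the dot product of the two vectors divided by the product of their lengths. This is only a sketch with made-up toy vectors; the scores returned by `similarity` above are computed by spaCy on the 300-dimensional word vectors.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        # Convention chosen for this sketch: a zero vector is similar to nothing.
        return 0.0
    return dot / (norm_u * norm_v)


# Toy vectors (not real word vectors).
same_direction = cosine_similarity([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])
opposite = cosine_similarity([1.0, 1.0], [-1.0, -1.0])
print(same_direction, orthogonal, opposite)
```

Vectors pointing the same way score close to 1, unrelated (orthogonal) vectors score 0, and opposite vectors score close to -1, regardless of their lengths.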

Finally, the Chinese spaCy model **does NOT** contain a morphological parser; for that we'll turn to StanfordNLP.

In [52]:
features = [[t.text, t.morph] for t in doc_zh]
display(HTML(tabulate.tabulate(features, tablefmt='html')))

0,1
中山,
大學,
人文,
暨,
科技,
跨領域,
學士,
學位,
學程,
助理,


## StanfordNLP

Another library whose functionality overlaps with that of spaCy is StanfordNLP. [StanfordNLP](https://stanfordnlp.github.io/stanfordnlp/), not to be confused with Stanford's Java [CoreNLP](https://stanfordnlp.github.io/CoreNLP/) library, is a [Python library](https://github.com/stanfordnlp/stanfordnlp) built on top of PyTorch that offers a fully neural pipeline with tokenization (including multi-word units), lemmatization, part-of-speech tagging (including morphological features) and dependency parsing. These components were built and trained for the [CoNLL-2018 shared task](https://nlp.stanford.edu/pubs/qi2018universal.pdf). There are no named entities, but the quality of the dependency parsing is state of the art. On top of that, it also offers a Python interface to CoreNLP. 

**`stanfordnlp` has been renamed to `stanza`.**

In [32]:
!pip install spacy-stanza

In [33]:
import stanza
import spacy_stanza

In [34]:
stanza.download("zh-hant")

Downloading https://raw.githubusercontent.com/stanfordnlp/stanza-resources/main/resources_1.3.0.json:   0%|   …

2022-04-07 08:51:43 INFO: Downloading default packages for language: zh-hant (Traditional_Chinese)...
2022-04-07 08:51:43 INFO: File exists: /root/stanza_resources/zh-hant/default.zip.
2022-04-07 08:51:48 INFO: Finished downloading models and saved to /root/stanza_resources.


In [35]:
nlp = spacy_stanza.load_pipeline("zh")

2022-04-07 08:51:48 INFO: "zh" is an alias for "zh-hans"
2022-04-07 08:51:48 INFO: Loading these models for language: zh-hans (Simplified_Chinese):
| Processor | Package   |
-------------------------
| tokenize  | gsdsimp   |
| pos       | gsdsimp   |
| lemma     | gsdsimp   |
| depparse  | gsdsimp   |
| sentiment | ren       |
| ner       | ontonotes |

2022-04-07 08:51:48 INFO: Use device: cpu
2022-04-07 08:51:48 INFO: Loading: tokenize
2022-04-07 08:51:48 INFO: Loading: pos
2022-04-07 08:51:50 INFO: Loading: lemma
2022-04-07 08:51:50 INFO: Loading: depparse
2022-04-07 08:51:51 INFO: Loading: sentiment
2022-04-07 08:51:52 INFO: Loading: ner
2022-04-07 08:51:53 INFO: Done loading processors!


In [50]:
text_zh = "中山大學人文暨科技跨領域學士學位學程助理教授宋世祥表示，2021年聖誕節假期期間，師生舉辦「街頭玩童～鹽埕兒童街區遊戲日」成果展。活動中可看見學生運用贊助單位瑞儀教育基金會致贈的廢棄木棧板，製作了6具兒童創意遊具，一方面展示學習成果，也希望藉此呼籲高雄民眾重視兒童的遊戲權。"
doc = nlp(text_zh)

Different models often produce different tokenization results, which in turn affect POS, TAG, DEP, and NER.

- Here are the results from StanfordNLP.

In [37]:
features = [[t.text, t.pos_, t.tag_, t.dep_, t.ent_type_, t.head.text] for t in doc]
display(HTML(tabulate.tabulate(features, tablefmt='html')))

0,1,2,3,4,5
中山,PROPN,NNP,nmod,ORG,人文
大學,PROPN,NNP,nmod,ORG,人文
人文,NOUN,NN,nmod,ORG,教授
暨,CCONJ,CC,cc,ORG,領域
科技,NOUN,NN,nsubj,ORG,跨
跨,VERB,VV,conj,ORG,人文
領域,NOUN,NN,conj,ORG,人文
學士,NOUN,NN,nmod,ORG,教授
學,NUM,CD,nummod,ORG,位
位,NOUN,NNB,clf,ORG,教授


- Here are the results from spaCy.

In [53]:
features = [[t.text, t.pos_, t.tag_, t.dep_, t.ent_type_, t.head.text] for t in doc_zh]
display(HTML(tabulate.tabulate(features, tablefmt='html')))

0,1,2,3,4,5
中山,PROPN,NR,compound:nn,ORG,大學
大學,NOUN,NN,nmod:assmod,ORG,科技
人文,NOUN,NN,conj,ORG,科技
暨,CCONJ,CC,cc,ORG,科技
科技,NOUN,NN,compound:nn,ORG,跨領域
跨領域,ADJ,JJ,amod,ORG,教授
學士,NOUN,NN,compound:nn,,教授
學位,NOUN,NN,compound:nn,,學程
學程,NOUN,NN,compound:nn,,教授
助理,NOUN,NN,compound:nn,,教授
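
One way to see exactly where the two tokenizations diverge is to compare the character offsets at which tokens end. The sketch below does this in plain Python for the first 16 characters of the text, with the token lists copied by hand from the two tables above; the helper `boundaries` is hypothetical.

```python
def boundaries(tokens):
    """Return the set of character offsets at which a token ends."""
    ends, position = set(), 0
    for token in tokens:
        position += len(token)
        ends.add(position)
    return ends


# Tokens covering the same 16 characters, copied from the tables above.
spacy_tokens = ["中山", "大學", "人文", "暨", "科技", "跨領域", "學士", "學位"]
stanza_tokens = ["中山", "大學", "人文", "暨", "科技", "跨", "領域", "學士", "學", "位"]

# Both lists must spell out the same underlying text.
assert "".join(spacy_tokens) == "".join(stanza_tokens)

only_stanza = sorted(boundaries(stanza_tokens) - boundaries(spacy_tokens))
only_spacy = sorted(boundaries(spacy_tokens) - boundaries(stanza_tokens))
print(only_stanza)
print(only_spacy)
```

Here the StanfordNLP pipeline makes two extra cuts (inside 跨領域 and inside 學位), while spaCy makes no cut that StanfordNLP lacks, which is why the POS tags and dependency heads for those words cannot line up between the two tables.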


Unlike spaCy, StanfordNLP has a morphological parser. Refer to [this page](https://universaldependencies.org/treebanks/zh_gsd/index.html) for details of the supported features.  

In [60]:
text = "他們吃飽了就睡，睡飽了就吃，其他的事情都不想做"
doc = nlp(text)
features = [[t.text, t.morph] for t in doc]
display(HTML(tabulate.tabulate(features, tablefmt='html')))

0,1
他,Person=3
們,
吃飽,
了,Aspect=Perf
就,
睡,
，,
睡飽,
了,Aspect=Perf
就,
