# NLP: Applications

<table align="left">
  <td>
    <a href="https://colab.research.google.com/github/phonchi/ModularPython/blob/master/NLP-use-pretrained-models.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>
  </td>
  <td style="text-align: center">
    <a href="https://github.com/phonchi/ModularPython/blob/master/NLP-use-pretrained-models.ipynb">
      <img src="https://cloud.google.com/ml-engine/images/github-logo-32px.png" alt="GitHub logo"><br> View on GitHub
    </a>
  </td>
</table>

This notebook is adapted by [Haowen Jiang](https://howard-haowen.rohan.tw/) from [this one](https://github.com/nlptown/nlp-notebooks/blob/master/NLP%20with%20pretrained%20models%20-%20spaCy%20and%20StanfordNLP.ipynb) included in the [nlptown/nlp-notebooks](https://github.com/nlptown/nlp-notebooks) repo. It is meant for the 2022 [NLP Workshop at NSYSU](https://howard-haowen.rohan.tw/NLP-demos/nsysu_workshop).

In [None]:
from datetime import date

today = date.today()
print("Last updated:", today)

Last updated: 2024-06-11


# NLP with pretrained models - spaCy and StanfordNLP

In [None]:
# @title spaCy Installation { display-mode: "form" }

INSTALL = True # @param {type:"boolean"}

if INSTALL:
    !pip install -U pip setuptools wheel -qq
    !pip install -U spacy -qq
    !python -m spacy download en_core_web_md # downloads the medium-sized English language model
    !python -m spacy download zh_core_web_md # downloads the medium-sized Chinese language model

![](https://spacy.io/pipeline-fde48da9b43661abcdf62ab70a546d71.svg)

In [None]:
import spacy
import pandas as pd

In [None]:
spacy.info()

{'spacy_version': '3.7.5',
 'location': '/usr/local/lib/python3.10/dist-packages/spacy',
 'platform': 'Linux-6.1.85+-x86_64-with-glibc2.35',
 'python_version': '3.10.12',
 'pipelines': {'zh_core_web_md': '3.7.0',
  'en_core_web_md': '3.7.1',
  'en_core_web_sm': '3.7.1'}}

- To get you started, play with [this Web App](https://share.streamlit.io/howard-haowen/spacy-streamlit/app.py) that I created, which is powered by spaCy.

## English NLP

In [None]:
en = spacy.load("en_core_web_md") # Loading the spaCy Model which includes vocabulary, syntax models, and entities.
df_metadata = pd.DataFrame([en.meta])
df_metadata.T

Unnamed: 0,0
lang,en
name,core_web_md
version,3.7.1
description,English pipeline optimized for CPU. Components...
author,Explosion
email,contact@explosion.ai
url,https://explosion.ai
license,MIT
spacy_version,">=3.7.2,<3.8.0"
spacy_git_version,bd2c17e20


In [None]:
text = ("Donald John Trump (born June 14, 1946) is the 45th and previous president of "
     "the United States.  Before entering politics, he was a businessman and television personality.")
print(text)

Donald John Trump (born June 14, 1946) is the 45th and previous president of the United States.  Before entering politics, he was a businessman and television personality.


Here, the text about Donald Trump is processed by the spaCy model, creating a document object `doc_en` that contains all the information about the text's structure and content.

In [None]:
doc_en = en(text)
type(doc_en.sents)

generator

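spaCy first splits the document into sentences, and each sentence into tokens.

- `doc_en.sents` is a generator: it yields one sentence at a time, so wrap it in `list()` or call `next()` on it to inspect the sentences.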


In [None]:
list(next(doc_en.sents))

[Donald,
 John,
 Trump,
 (,
 born,
 June,
 14,
 ,,
 1946,
 ),
 is,
 the,
 45th,
 and,
 previous,
 president,
 of,
 the,
 United,
 States,
 .,
  ]

In [None]:
tokens = [token.text for token in doc_en]
print(tokens)

['Donald', 'John', 'Trump', '(', 'born', 'June', '14', ',', '1946', ')', 'is', 'the', '45th', 'and', 'previous', 'president', 'of', 'the', 'United', 'States', '.', ' ', 'Before', 'entering', 'politics', ',', 'he', 'was', 'a', 'businessman', 'and', 'television', 'personality', '.']


In addition, spaCy identifies a variety of linguistic features for each token. Among the foundational features are the lemma and two types of parts-of-speech (POS) tags. The `pos_` attribute encompasses the [Universal POS tags](https://universaldependencies.org/u/pos/) derived from the [Universal Dependencies](https://universaldependencies.org/) framework, which provide a consistent categorization of word types across languages. On the other hand, the `tag_` attribute offers more detailed, language-specific POS tags that capture finer grammatical distinctions.

In [None]:
# orthographic representation, lemma, coarse-grained part-of-speech (pos_), and fine-grained part-of-speech (tag_).
features = [
    {'Text': token.orth_, 'Lemma': token.lemma_, 'POS': token.pos_, 'Detailed POS': token.tag_}
    for token in doc_en
]

df_features = pd.DataFrame(features)
df_features

Unnamed: 0,Text,Lemma,POS,Detailed POS
0,Donald,Donald,PROPN,NNP
1,John,John,PROPN,NNP
2,Trump,Trump,PROPN,NNP
3,(,(,PUNCT,-LRB-
4,born,bear,VERB,VBN
5,June,June,PROPN,NNP
6,14,14,NUM,CD
7,",",",",PUNCT,","
8,1946,1946,NUM,CD
9,),),PUNCT,-RRB-
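
To make the relation between the two tag sets concrete, here is a minimal sketch that groups the fine-grained tags from the table above under their coarse Universal POS tag. It uses plain Python on values copied from the table, so it runs without spaCy; the names `rows` and `fine_by_coarse` are illustrative only.

```python
from collections import defaultdict

# (text, coarse POS, fine-grained tag) triples copied from the table above.
rows = [
    ("Donald", "PROPN", "NNP"),
    ("John", "PROPN", "NNP"),
    ("Trump", "PROPN", "NNP"),
    ("(", "PUNCT", "-LRB-"),
    ("born", "VERB", "VBN"),
    ("June", "PROPN", "NNP"),
    ("14", "NUM", "CD"),
    (",", "PUNCT", ","),
    ("1946", "NUM", "CD"),
    (")", "PUNCT", "-RRB-"),
]

# Collect every fine-grained tag seen under each coarse tag.
fine_by_coarse = defaultdict(set)
for _text, coarse, fine in rows:
    fine_by_coarse[coarse].add(fine)

for coarse in sorted(fine_by_coarse):
    print(coarse, sorted(fine_by_coarse[coarse]))
```

In this small sample a single coarse tag (`PUNCT`) already covers three different fine-grained tags, which is the kind of extra detail `tag_` carries.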


Next, spaCy includes pre-trained models for named entity recognition (NER). The outcomes of these models are reflected in the `ent_iob_` and `ent_type_` attributes. The `ent_type_` attribute specifies the category of entity identified by the model, such as a person, date, ordinal number, or geopolitical entity (GPE). For instance, in English models adhering to the [OntoNotes standard](https://catalog.ldc.upenn.edu/docs/LDC2013T19/OntoNotes-Release-5.0.pdf), "Donald John Trump" is recognized as a person, "June 14, 1946" as a date, "45th" as an ordinal number, and "the United States" as a GPE.

The `ent_iob_` attribute indicates the token's position within an entity: `O` for outside any entity, `B` for the beginning of an entity, and `I` for inside an entity (but not at the beginning). This notation is part of the `BIO` tagging scheme, which helps differentiate between consecutive entities of the same type.

> Other schemes like `BILUO` include additional designations for the last token of an entity and for unique, standalone entity tokens, providing detailed positional information within entity sequences.

In [None]:
# Extracting named entity information from each token in the document
entities = [
    {'Text': token.orth_, 'IOB Tag': token.ent_iob_, 'Entity Type': token.ent_type_}
    for token in doc_en  # Iterate over each token
]

df_entities = pd.DataFrame(entities)
df_entities

Unnamed: 0,Text,IOB Tag,Entity Type
0,Donald,B,PERSON
1,John,I,PERSON
2,Trump,I,PERSON
3,(,O,
4,born,O,
5,June,B,DATE
6,14,I,DATE
7,",",I,DATE
8,1946,I,DATE
9,),O,
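
The per-token tags above can be turned back into entity spans with a few lines of plain Python. The sketch below is not how spaCy does it internally; `group_entities` and `iob_to_biluo` are hypothetical helpers written for this notebook, and the input triples are copied from the table above.

```python
def group_entities(tagged):
    """Collect (tokens, label) spans from (text, iob, label) triples."""
    spans, current = [], None
    for text, iob, label in tagged:
        if iob == "B":                      # a new entity starts here
            current = ([text], label)
            spans.append(current)
        elif iob == "I" and current is not None:
            current[0].append(text)         # continue the open entity
        else:                               # "O" (or a stray "I") closes it
            current = None
    return spans


def iob_to_biluo(tags):
    """Rewrite a list of IOB tags as BILUO tags."""
    out = []
    for i, tag in enumerate(tags):
        next_is_inside = i + 1 < len(tags) and tags[i + 1] == "I"
        if tag == "B":
            out.append("B" if next_is_inside else "U")  # U = single-token entity
        elif tag == "I":
            out.append("I" if next_is_inside else "L")  # L = last token of entity
        else:
            out.append("O")
    return out


tagged = [
    ("Donald", "B", "PERSON"), ("John", "I", "PERSON"), ("Trump", "I", "PERSON"),
    ("(", "O", ""), ("born", "O", ""),
    ("June", "B", "DATE"), ("14", "I", "DATE"), (",", "I", "DATE"), ("1946", "I", "DATE"),
    (")", "O", ""),
]

print(group_entities(tagged))
# [(['Donald', 'John', 'Trump'], 'PERSON'), (['June', '14', ',', '1946'], 'DATE')]
print(iob_to_biluo([iob for _, iob, _ in tagged]))
# ['B', 'I', 'L', 'O', 'O', 'B', 'I', 'I', 'L', 'O']
```

The `B` tag is what lets two adjacent entities of the same type be told apart: each `B` opens a new span even if the previous token was already inside an entity.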


You can also access the entities directly on the `ents` attribute of the document:

In [None]:
print([(ent.text, ent.label_) for ent in doc_en.ents])

[('Donald John Trump', 'PERSON'), ('June 14, 1946', 'DATE'), ('45th', 'ORDINAL'), ('the United States', 'GPE')]


spaCy also contains a dependency parser, which analyzes the grammatical relations between the tokens.

In [None]:
# Extracting syntax or dependency parsing information from each token
syntax = [
    {'Token': token.text, 'Dependency': token.dep_, 'Head': token.head.text}
    for token in doc_en  # Iterate over each token in the document
]

df_syntax = pd.DataFrame(syntax)
df_syntax

Unnamed: 0,Token,Dependency,Head
0,Donald,compound,Trump
1,John,compound,Trump
2,Trump,nsubj,is
3,(,punct,Trump
4,born,acl,Trump
5,June,npadvmod,born
6,14,nummod,June
7,",",punct,June
8,1946,nummod,June
9,),punct,Trump
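
Because every token points to exactly one head, the parse forms a tree that you can walk from any token up to the root. The sketch below does this in plain Python using the token and head pairs from the table above. Two simplifications are assumed: tokens are keyed by their text (safe here only because each text is unique in this excerpt), and "is" is taken to be the root, which spaCy represents as a token that is its own head.

```python
# Token -> head pairs copied from the table above.
# The last entry is an assumption: the root of a spaCy parse is its own head.
head_of = {
    "Donald": "Trump",
    "John": "Trump",
    "Trump": "is",
    "(": "Trump",
    "born": "Trump",
    "June": "born",
    "14": "June",
    ",": "June",
    "1946": "June",
    ")": "Trump",
    "is": "is",
}


def path_to_root(token):
    """Follow head links from `token` until reaching the self-headed root."""
    path = [token]
    while head_of[path[-1]] != path[-1]:
        path.append(head_of[path[-1]])
    return path


print(path_to_root("14"))  # ['14', 'June', 'born', 'Trump', 'is']
```

On a real `Doc`, the same walk is better done with token objects (for example via `token.head`), since token texts are generally not unique.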


Finally, the English spaCy model contains a morphological parser.

In [None]:
# Extracting morphological features from each token in the document
features = [
    {'Token': token.text, 'Morphological Features': token.morph}
    for token in doc_en  # Iterate over each token
]

df_features = pd.DataFrame(features)
df_features

Unnamed: 0,Token,Morphological Features
0,Donald,(Number=Sing)
1,John,(Number=Sing)
2,Trump,(Number=Sing)
3,(,"(PunctSide=Ini, PunctType=Brck)"
4,born,"(Aspect=Perf, Tense=Past, VerbForm=Part)"
5,June,(Number=Sing)
6,14,(NumType=Card)
7,",",(PunctType=Comm)
8,1946,(NumType=Card)
9,),"(PunctSide=Fin, PunctType=Brck)"
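
Morphological features are key/value pairs. In the Universal Dependencies text format they are written as a single string such as `Aspect=Perf|Tense=Past|VerbForm=Part`, which is also what `str(token.morph)` yields in spaCy v3. The sketch below parses that string form by hand; `parse_feats` is a hypothetical helper for illustration (spaCy itself offers `token.morph.to_dict()` for the same purpose).

```python
def parse_feats(feats):
    """Parse a UD-style FEATS string into a dict of feature -> value."""
    if not feats or feats == "_":   # "_" marks "no features" in CoNLL-U files
        return {}
    return dict(pair.split("=", 1) for pair in feats.split("|"))


# Features of "born" from the table above, written in FEATS notation.
born = parse_feats("Aspect=Perf|Tense=Past|VerbForm=Part")
print(born)  # {'Aspect': 'Perf', 'Tense': 'Past', 'VerbForm': 'Part'}
print(born.get("Tense"))  # Past
```

Having the features as a dict makes filters such as "all past-tense verb forms" straightforward to write.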


## Multilingual NLP

spaCy provides pretrained pipelines not only for English but also for many other languages.

In [None]:
zh = spacy.load("zh_core_web_md")
df_metadata = pd.DataFrame([zh.meta])  # inspect the Chinese pipeline's metadata (`zh.meta`, not `en.meta`)
df_metadata.T

Unnamed: 0,0
lang,zh
name,core_web_md
version,3.7.0
description,Chinese pipeline optimized for CPU. Components...
author,Explosion
email,contact@explosion.ai
url,https://explosion.ai
license,MIT


In [None]:
text_zh = "中山大學人文暨科技跨領域學士學位學程助理教授宋世祥表示，2021年聖誕節假期期間，師生舉辦「街頭玩童～鹽埕兒童街區遊戲日」成果展。活動中可看見學生運用贊助單位瑞儀教育基金會致贈的廢棄木棧板，製作了6具兒童創意遊具，一方面展示學習成果，也希望藉此呼籲高雄民眾重視兒童的遊戲權。"
doc_zh = zh(text_zh)

The tokens in the Chinese document expose the same attributes as those in the English document. However, what the models actually fill in can vary significantly between languages. One key difference is lemmatization:

- **No lemmatizer in the Chinese model**: Unlike the English pipeline, the Chinese pipeline has no lemmatizer component, so `lemma_` does not give you a separate base form for each token. The `text` and `orth_` attributes are unaffected: in every language both return the token exactly as it appears in the document, which is why the two lists printed below are identical.

Keep this in mind when performing text processing tasks, as it affects the depth of linguistic analysis available for each language.

In [None]:
list(doc_zh.sents)

[中山大學人文暨科技跨領域學士學位學程助理教授宋世祥表示，2021年聖誕節假期期間，師生舉辦「街頭玩童～鹽埕兒童街區遊戲日」成果展。,
 活動中可看見學生運用贊助單位瑞儀教育基金會致贈的廢棄木棧板，製作了6具兒童創意遊具，一方面展示學習成果，也希望藉此呼籲高雄民眾重視兒童的遊戲權。]

In [None]:
tok_text = [tok.text for tok in doc_zh]
tok_orth = [tok.orth_ for tok in doc_zh]
print(tok_text)
print(tok_orth)

['中山', '大學', '人文', '暨', '科技', '跨領域', '學士', '學位', '學程', '助理', '教授', '宋世祥', '表示', '，', '2021年', '聖誕', '節', '假期', '期間', '，', '師生', '舉辦', '「', '街頭', '玩童', '～鹽', '埕兒', '童街', '區遊', '戲日', '」', '成果', '展', '。', '活動', '中', '可', '看見', '學生', '運用', '贊助', '單位', '瑞儀', '教育', '基金會', '致贈', '的', '廢棄木', '棧板', '，', '製作', '了', '6', '具', '兒童', '創意', '遊具', '，', '一方面', '展示', '學習', '成果', '，', '也', '希望', '藉此', '呼籲', '高雄', '民眾', '重視', '兒童', '的', '遊戲權', '。']
['中山', '大學', '人文', '暨', '科技', '跨領域', '學士', '學位', '學程', '助理', '教授', '宋世祥', '表示', '，', '2021年', '聖誕', '節', '假期', '期間', '，', '師生', '舉辦', '「', '街頭', '玩童', '～鹽', '埕兒', '童街', '區遊', '戲日', '」', '成果', '展', '。', '活動', '中', '可', '看見', '學生', '運用', '贊助', '單位', '瑞儀', '教育', '基金會', '致贈', '的', '廢棄木', '棧板', '，', '製作', '了', '6', '具', '兒童', '創意', '遊具', '，', '一方面', '展示', '學習', '成果', '，', '也', '希望', '藉此', '呼籲', '高雄', '民眾', '重視', '兒童', '的', '遊戲權', '。']


In [None]:
for tok in list(doc_zh.sents)[1]: # The second sentence
    print(f"{tok.text} >>> {tok.pos_}")

活動 >>> NOUN
中 >>> PART
可 >>> VERB
看見 >>> VERB
學生 >>> NOUN
運用 >>> VERB
贊助 >>> VERB
單位 >>> NOUN
瑞儀 >>> PROPN
教育 >>> NOUN
基金會 >>> NOUN
致贈 >>> VERB
的 >>> PART
廢棄木 >>> NOUN
棧板 >>> NOUN
， >>> PUNCT
製作 >>> VERB
了 >>> PART
6 >>> NUM
具 >>> NUM
兒童 >>> NOUN
創意 >>> ADJ
遊具 >>> NOUN
， >>> PUNCT
一方面 >>> ADV
展示 >>> VERB
學習 >>> ADJ
成果 >>> NOUN
， >>> PUNCT
也 >>> ADV
希望 >>> VERB
藉此 >>> ADV
呼籲 >>> VERB
高雄 >>> PROPN
民眾 >>> NOUN
重視 >>> VERB
兒童 >>> NOUN
的 >>> PART
遊戲權 >>> NOUN
。 >>> PUNCT


- The Chinese model uses a very different set of fine-grained part-of-speech tags on the `tag_` attribute.

In [None]:
# Printing each token's text, detailed POS tag, and an explanation of the tag.
for tok in list(doc_zh.sents)[1]:
    print(f"{tok.text} >>> {tok.tag_} | {spacy.explain(tok.tag_)}")

活動 >>> NN | noun, singular or mass
中 >>> LC | localizer
可 >>> VV | other verb
看見 >>> VV | other verb
學生 >>> NN | noun, singular or mass
運用 >>> VV | other verb
贊助 >>> VV | other verb
單位 >>> NN | noun, singular or mass
瑞儀 >>> NR | proper noun
教育 >>> NN | noun, singular or mass
基金會 >>> NN | noun, singular or mass
致贈 >>> VV | other verb
的 >>> DEC | 的 in a relative clause
廢棄木 >>> NN | noun, singular or mass
棧板 >>> NN | noun, singular or mass
， >>> PU | punctuation
製作 >>> VV | other verb
了 >>> AS | aspect marker
6 >>> CD | cardinal number
具 >>> M | measure word
兒童 >>> NN | noun, singular or mass
創意 >>> JJ | adjective (English), other noun-modifier (Chinese)
遊具 >>> NN | noun, singular or mass
， >>> PU | punctuation
一方面 >>> AD | adverb
展示 >>> VV | other verb
學習 >>> JJ | adjective (English), other noun-modifier (Chinese)
成果 >>> NN | noun, singular or mass
， >>> PU | punctuation
也 >>> AD | adverb
希望 >>> VV | other verb
藉此 >>> AD | adverb
呼籲 >>> VV | other verb
高雄 >>> NR | proper noun
民眾 >>> NN |

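- The Chinese model uses the same entity label set as the English one (for example `PERSON`, `ORG`, `GPE`, `DATE`, and `CARDINAL`), because both were trained on OntoNotes 5. spaCy models for some other languages use a smaller label set such as `PER`, `LOC`, `ORG`, and `MISC`.

The label set is a result of the training corpora that were used to build the models, whose annotation guidelines may be very different.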

In [None]:
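import tabulate
from IPython.display import HTML, display

# One row per token: text, coarse POS, fine-grained tag, IOB tag, entity type.
info = [(t.text, t.pos_, t.tag_, t.ent_iob_, t.ent_type_) for t in doc_zh]
display(HTML(tabulate.tabulate(info, tablefmt='html')))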

0,1,2,3,4
中山,PROPN,NR,B,ORG
大學,NOUN,NN,I,ORG
人文,NOUN,NN,I,ORG
暨,CCONJ,CC,I,ORG
科技,NOUN,NN,I,ORG
跨領域,ADJ,JJ,I,ORG
學士,NOUN,NN,O,
學位,NOUN,NN,O,
學程,NOUN,NN,O,
助理,NOUN,NN,O,


## Visualization

In [None]:
from spacy import displacy

In [None]:
displacy.render(doc_zh, style='ent', jupyter=True)  # `distance` only affects the 'dep' style, so it is omitted here

In [None]:
text = "我想要三份2號餐"
doc = zh(text)
displacy.render(doc, style='dep',jupyter=True, options={'distance':130})

## DataFrame + spaCy = dframcy

In [None]:
# @title dframcy Installation { display-mode: "form" }

INSTALL = True # @param {type:"boolean"}

if INSTALL:
    !pip install dframcy -qq


In [None]:
from dframcy import DframCy

In [None]:
nlp = spacy.load('zh_core_web_md')
# Initialize DframCy with the spaCy NLP model to integrate with pandas DataFrame.
dframcy = DframCy(nlp)
# Process the Chinese text using the NLP model to create a spaCy document.
doc = dframcy.nlp(text_zh)
# Convert the NLP document annotations to a pandas DataFrame for easier analysis.
annotation_dataframe = dframcy.to_dataframe(doc)
annotation_dataframe

Unnamed: 0,token_text,token_start,token_end,token_pos_,token_tag_,token_dep_,token_head,token_ent_type_
0,中山,0,2,PROPN,NR,compound:nn,大學,ORG
1,大學,2,4,NOUN,NN,nmod:assmod,跨領域,ORG
2,人文,4,6,NOUN,NN,conj,跨領域,ORG
3,暨,6,7,CCONJ,CC,cc,跨領域,ORG
4,科技,7,9,NOUN,NN,compound:nn,跨領域,ORG
...,...,...,...,...,...,...,...,...
69,重視,128,130,VERB,VV,ccomp,呼籲,
70,兒童,130,132,NOUN,NN,nmod:assmod,遊戲權,
71,的,132,133,PART,DEG,case,兒童,
72,遊戲權,133,136,NOUN,NN,dobj,重視,


Once annotations are stored as a DataFrame object, filtering can be easily done by leveraging the power of `pandas` syntax.

In [None]:
# Create a filter for rows where the part-of-speech tag is 'NN' (noun).
nn_filt = annotation_dataframe['token_tag_'] == 'NN'
# Create a filter for rows where the dependency label is 'dobj' (direct object).
dobj_filt = annotation_dataframe['token_dep_'] == 'dobj'
# Get rows where the token is a noun and serves as a direct object.
annotation_dataframe[(nn_filt) & dobj_filt]

Unnamed: 0,token_text,token_start,token_end,token_pos_,token_tag_,token_dep_,token_head,token_ent_type_
32,展,63,64,NOUN,NN,dobj,舉辦,
48,棧板,92,94,NOUN,NN,dobj,看見,
56,遊具,104,106,NOUN,NN,dobj,製作,
61,成果,114,116,NOUN,NN,dobj,展示,
72,遊戲權,133,136,NOUN,NN,dobj,重視,


## Vectors

In [None]:
doc = zh("教授")
tok = doc[0]
tok.vector

array([ 2.2328  , -1.1713  , -3.3528  , -1.1691  , -0.26724 ,  4.4476  ,
       -0.66089 ,  2.6248  , -1.5367  , -2.8449  , -4.0233  ,  1.5727  ,
        1.978   ,  2.7964  ,  1.003   ,  0.29978 ,  0.056525,  3.7048  ,
        2.0446  ,  2.2452  , -5.7184  ,  0.77814 , -1.8383  , -0.017231,
       -1.91    , -6.4355  , -4.6737  , -0.13519 ,  0.66087 , -1.6718  ,
        3.5934  ,  2.3382  , -4.5406  ,  1.6124  , -2.2361  , -6.0387  ,
       -3.4078  ,  1.1304  ,  0.80933 ,  1.9734  ,  2.3314  , -0.9882  ,
       -1.1947  ,  2.2628  , -1.3687  , -6.4278  ,  0.15906 ,  0.047335,
       -2.8157  , -1.6407  ,  2.4385  , -0.84336 ,  3.081   ,  5.9188  ,
       -1.3019  ,  1.2971  ,  7.2325  ,  2.9722  , -0.45552 ,  1.5148  ,
       -1.1193  ,  3.8739  ,  1.482   , -2.4657  ,  1.4627  , -3.562   ,
       -2.1737  , -1.4306  ,  3.4363  , -1.2796  , -1.4106  ,  2.2146  ,
        2.9325  , -2.5172  ,  2.7192  , -0.84556 , -2.5362  ,  2.2079  ,
       -3.2217  , -2.2081  ,  4.6204  ,  0.98445 , 

In [None]:
tok.vector.shape

(300,)

In [None]:
word_1 = nlp.vocab["高興"]
word_2 = nlp.vocab["高雄"]
word_3 = nlp.vocab["開心"]
# `similarity` returns a cosine similarity (higher means more similar), not a distance.
word_1_word_2 = word_1.similarity(word_2)
word_1_word_3 = word_1.similarity(word_3)
print(f"Similarity between Word 1 and 2: {word_1_word_2}")
print(f"Similarity between Word 1 and 3: {word_1_word_3}")

Similarity between Word 1 and 2: 0.27085748314857483
Similarity between Word 1 and 3: 0.8141297101974487


- Cosine similarity

![](https://datascience-enthusiast.com/figures/cosine_sim.png)

- Formula for calculating cosine similarity between two vectors

![](https://miro.medium.com/max/1400/1*LfW66-WsYkFqWc4XYJbEJg.png)
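
In words, the cosine similarity of two vectors is their dot product divided by the product of their lengths. It is 1 when the vectors point in the same direction, 0 when they are orthogonal, and -1 when they point in opposite directions. The following is a minimal pure-Python sketch of that formula, not spaCy's own implementation; returning `0.0` for a zero vector is a convention chosen for this example.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # assumption for this sketch: undefined case mapped to 0
    return dot / (norm_a * norm_b)


print(cosine_similarity([1.0, 0.0], [1.0, 0.0]))    # same direction
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))    # orthogonal
print(cosine_similarity([1.0, 2.0], [-1.0, -2.0]))  # opposite direction
```

Because only the angle matters, scaling a vector does not change the result, which is why cosine similarity is a common choice for comparing word vectors of different magnitudes.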

## StanfordNLP

Another library that shares some functionality with spaCy is StanfordNLP. [StanfordNLP](https://stanfordnlp.github.io/stanfordnlp/), distinct from Stanford’s Java-based [CoreNLP](https://stanfordnlp.github.io/CoreNLP/) library, is a [Python library](https://github.com/stanfordnlp/stanfordnlp) developed on the PyTorch framework. It provides a fully neural NLP pipeline, which includes advanced features such as tokenization (capable of recognizing multi-word units), lemmatization, part-of-speech tagging (incorporating morphological features), and state-of-the-art dependency parsing. These components were specifically designed and trained for the [CoNLL-2018 shared task](https://nlp.stanford.edu/pubs/qi2018universal.pdf). While it does not include named entity recognition, StanfordNLP excels in dependency parsing and additionally offers a Python interface to CoreNLP, facilitating integration into Python projects.


> **`stanfordnlp` has been renamed to `stanza`.**

In [None]:
# @title stanza Installation { display-mode: "form" }

INSTALL = True # @param {type:"boolean"}

if INSTALL:
    !pip install stanza -qq


In [None]:
import stanza

In [None]:
stanza.download("zh-hant") # Download the traditional Chinese model for Stanza.

Downloading https://raw.githubusercontent.com/stanfordnlp/stanza-resources/main/resources_1.8.0.json:   0%|   …

INFO:stanza:Downloaded file to /root/stanza_resources/resources.json
INFO:stanza:Downloading default packages for language: zh-hant (Traditional_Chinese) ...


Downloading https://huggingface.co/stanfordnlp/stanza-zh-hant/resolve/v1.8.0/models/default.zip:   0%|        …

INFO:stanza:Downloaded file to /root/stanza_resources/zh-hant/default.zip
INFO:stanza:Finished downloading models and saved to /root/stanza_resources


In [None]:
stf_nlp = stanza.Pipeline('zh-hant') # Initialize the Stanza pipeline for traditional Chinese to handle various NLP tasks.

INFO:stanza:Checking for updates to resources.json in case models have been updated.  Note: this behavior can be turned off with download_method=None or download_method=DownloadMethod.REUSE_RESOURCES


Downloading https://raw.githubusercontent.com/stanfordnlp/stanza-resources/main/resources_1.8.0.json:   0%|   …

INFO:stanza:Downloaded file to /root/stanza_resources/resources.json
INFO:stanza:Loading these models for language: zh-hant (Traditional_Chinese):
| Processor | Package      |
----------------------------
| tokenize  | gsd          |
| pos       | gsd_nocharlm |
| lemma     | gsd_nocharlm |
| depparse  | gsd_nocharlm |

INFO:stanza:Using device: cpu
INFO:stanza:Loading: tokenize
INFO:stanza:Loading: pos
INFO:stanza:Loading: lemma
INFO:stanza:Loading: depparse
INFO:stanza:Done loading processors!


In [None]:
text_zh = "中山大學人文暨科技跨領域學士學位學程助理教授宋世祥表示，2021年聖誕節假期期間，師生舉辦「街頭玩童～鹽埕兒童街區遊戲日」成果展。活動中可看見學生運用贊助單位瑞儀教育基金會致贈的廢棄木棧板，製作了6具兒童創意遊具，一方面展示學習成果，也希望藉此呼籲高雄民眾重視兒童的遊戲權。"
# Process the text with the Stanza pipeline to extract linguistic information.
doc = stf_nlp(text_zh)
type(doc)

Different models often produce different tokenization results, which in turn affects POS tagging and dependency parsing.

- Here are the results based on Stanza (formerly StanfordNLP).

In [None]:
words_data = []
for i, sent in enumerate(doc.sentences):
    for word in sent.words:
        # Prepare and append a dictionary with details about each word to the list.
        words_data.append({
            'Sentence Number': i + 1,
            'Text': word.text,
            'Lemma': word.lemma,
            'POS': word.pos,
            'Head Index': word.head,  # 1-based position of the head within the sentence; 0 marks the root
            'Dependency Relation': word.deprel
        })

df_words = pd.DataFrame(words_data)
df_words

Unnamed: 0,Sentence Number,Text,Lemma,POS,Head Index,Dependency Relation
0,1,中山,中山,PROPN,12,nmod
1,1,大學,大學,NOUN,12,nmod
2,1,人文,人文,NOUN,12,nmod
3,1,暨,暨,CCONJ,5,cc
4,1,科技,科技,NOUN,3,conj
...,...,...,...,...,...,...
78,2,兒童,兒童,NOUN,39,obj
79,2,的,的,PART,39,mark:rel
80,2,遊戲,遊戲,NOUN,43,compound
81,2,權,權,PART,36,obj
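
One way to see where two tokenizers disagree is to compare the character offsets at which their tokens end. The sketch below does this for the last phrase of the text, using the spaCy tokens printed earlier (`遊戲權` as one token) and the Stanza tokens in the table above (`遊戲` and `權`). `boundaries` is a hypothetical helper written for this comparison.

```python
def boundaries(tokens):
    """Return the set of character offsets at which a token ends."""
    ends, pos = set(), 0
    for tok in tokens:
        pos += len(tok)
        ends.add(pos)
    return ends


# Tokens for the same substring "兒童的遊戲權", copied from the two outputs.
spacy_tokens = ["兒童", "的", "遊戲權"]
stanza_tokens = ["兒童", "的", "遊戲", "權"]

# Both tokenizations must cover exactly the same text to be comparable.
assert "".join(spacy_tokens) == "".join(stanza_tokens)

only_stanza = boundaries(stanza_tokens) - boundaries(spacy_tokens)
only_spacy = boundaries(spacy_tokens) - boundaries(stanza_tokens)
print(sorted(only_stanza))  # [5]
print(sorted(only_spacy))   # []
```

The single extra boundary at offset 5 is the split between `遊戲` and `權`. Every such difference changes the units that the tagger and parser operate on, so the downstream labels cannot be compared token by token.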


# Assignment


## Analyze English

- Input: any English news article of your choice
- Output:
    - A list of unique lemmas of all verbs in lower case
    - A list of unique tuples of (NER text, NER label)




In [None]:
# Change this to any other article of your choice.

en_input = """
Taipei, April 7 (CNA) Health and Welfare Minister Chen Shih-chung (陳時中) said Thursday that COVID-19 contact tracing has been partially suspended in Taiwan and a new disease control model is being put in place, amid a rise in domestic cases.

The immediate suspension of contract tracing applies only to travelers who test positive for COVID-19 in Taiwan, either on arrival at the airport or during mandatory quarantine, Chen said.

That decision was made in a bid to free up resources to monitor the growing number of domestic COVID-19 cases, he said at a press briefing, after he reported 531 new cases -- 382 domestically transmitted and 149 imported.

Chen said contact tracing on new imported cases will only be done if any of them are believed to be linked to COVID-19 clusters at quarantine hotels or quarantine centers in Taiwan.

Prior to Thursday, Taiwan had been reporting its contact tracing information on imported COVID-19 cases via the World Health Organization's International Health Regulations (IHR) mechanism, he said.

Regarding the recent daily rise in domestic infections, Chen said the current goal is to bring the situation under control, even though it is impossible to achieve zero new domestic cases at this time.

Despite the recent spike, the daily number of domestic COVID-19 cases in Taiwan is still low compared to many other countries, he said, citing as an example the 534 new cases per 100,000 population reported in South Korea on Tuesday.

Once people in Taiwan stick together and do their part to prevent the spread of the virus, the situation will be manageable, Chen said.

Based on the trajectory of COVID-19 Omicron outbreaks observed in many other countries around the world, he said, it is likely that the infections in Taiwan will peak in a month or two.

"We do not expect the outbreak to stop growing now, but we hope it will rise slowly, so that Taiwan's medical capacity will not be overloaded," Chen said.

Meanwhile, earlier in the day, the Cabinet announced that Taiwan was adopting a new model for the control of COVID-19 infections.

Under the "new Taiwan model," the country has let go of its goal to achieve zero COVID-19 cases, but this does not mean allowing the pandemic go unmanaged, Cabinet spokesman Lo Ping-cheng (羅秉成) said, citing Premier Su Tseng-chang (蘇貞昌).

In a meeting earlier with Ministry of Health and Welfare (MOHW) officials, Premier Su said that as Taiwan moves towards a new stage of epidemic prevention, he hopes that the central and local governments will work together to gradually open up the country, in the interests of its people and economy, according to Lo.

In a report presented to the Cabinet on Thursday, the MOHW said Taiwan will continue to actively manage the COVID-19 situation, while steadily opening up its borders, in consideration of national economic factors and the people's livelihood.
"""

In [None]:
# Start by turning a text into a spaCy Doc object
en_doc = en(en_input)

In [None]:
#===Write your code below and save the output as `verbs`.===#

verbs = set(token.lemma_.lower() for token in en_doc if token.pos_ == "VERB")
verbs
# verbs =

{'accord',
 'achieve',
 'adopt',
 'allow',
 'announce',
 'apply',
 'base',
 'believe',
 'bring',
 'cite',
 'compare',
 'continue',
 'do',
 'expect',
 'free',
 'go',
 'grow',
 'hope',
 'import',
 'let',
 'link',
 'make',
 'manage',
 'mean',
 'monitor',
 'move',
 'observe',
 'open',
 'overload',
 'peak',
 'present',
 'prevent',
 'put',
 'regard',
 'report',
 'rise',
 'say',
 'stick',
 'stop',
 'suspend',
 'test',
 'trace',
 'transmit',
 'work'}

In [None]:
#===Write your code below and save the output as `en_ents`.===#
en_ents = set((ent.text, ent.label_) for ent in en_doc.ents)
en_ents
# en_ents =

[('Taipei', 'GPE'),
 ('April 7', 'DATE'),
 ('CNA) Health and', 'ORG'),
 ('Welfare', 'ORG'),
 ('Chen Shih-chung', 'PERSON'),
 ('Thursday', 'DATE'),
 ('COVID-19', 'PERSON'),
 ('Taiwan', 'GPE'),
 ('Chen', 'PERSON'),
 ('COVID-19', 'PRODUCT'),
 ('531', 'CARDINAL'),
 ('382', 'CARDINAL'),
 ('149', 'CARDINAL'),
 ("the World Health Organization's", 'ORG'),
 ('daily', 'DATE'),
 ('zero', 'CARDINAL'),
 ('534', 'CARDINAL'),
 ('100,000', 'CARDINAL'),
 ('South Korea', 'GPE'),
 ('Tuesday', 'DATE'),
 ('COVID-19 Omicron', 'PERSON'),
 ('a month', 'DATE'),
 ('two', 'CARDINAL'),
 ('earlier in the day', 'DATE'),
 ('Cabinet', 'ORG'),
 ('Lo Ping-cheng', 'PERSON'),
 ('Su Tseng-chang', 'PERSON'),
 ('Ministry of Health and Welfare', 'ORG'),
 ('Su', 'PERSON'),
 ('Lo', 'ORG'),
 ('MOHW', 'ORG')]

## Analyze Chinese

- Input 1: any Chinese news article from Taiwan media of your choice
- Output 1:
    - A list of unique tokens, excluding punctuation
    - A list of unique tuples of (NER text, NER label)

In [None]:
# Change this to any other article of your choice.

zh_input = """
本土個案昨增三八二例再創新高，確診案例遍及十九縣市，境外增一四九例，單日破五百案例。

中央疫情指揮中心指揮官陳時中表示，個案數還會往上升，預估一至兩個月內達最高峰，疫情將持續到六月底且還不會到尾聲；他也首度鬆口「清零不可能」，未來將走向與病毒共存，下周擬試辦「輕症在家」隔離，同時也將調整停課標準。

擬以居家快篩取代停課
國內確診個案上升，全國累積十四縣市共一三九所校園停課，各縣市對畢業旅行、戶外教學是否取消標準不一，教師團體認為政府應該明確表態，否則會造成校園恐慌或影響學生受教權。陳時中表示，未來必然走向與病毒共存，下周將與教育部檢討停課標準，縮小匡列範圍，並在合理範圍以居家快篩來取代停課。

疫情一至兩個月達高峰
指揮中心比照韓國、紐西蘭及香港疫情發展，推估未來一至兩個月確診案例將飆至最高峰，陳時中表示，目前Omicron確診數仍算低，但規模難以預估，未來單日恐超過一千五百例，屆時致死率、個案數飆高或疫情高峰下不來，社會將承擔不起，因此仍須積極因應，朝「緩坡上升」方向努力。

輕症在家指引至今沒譜
總統蔡英文日前宣布防疫以「減災」為目標避免醫療量能超載，「輕症在家」隔離為其中配套，單日確診數若達一千五百人將啟動。指揮中心下周將擬定「輕症在家照護指引」，如設醫療遠距平台、送藥、戶政及警政系統聯繫、關懷中心運作等，若未遵守隔離規定將有罰則，並因應地方疫情升溫和醫療量能吃緊，將從新北試辦。

立委賴惠員、蔣萬安昨於衛環委員會質詢時，質疑各地方早已喊話希望指揮中心相關指引快出來，但至今連社區溝通、徵求試辦的地方政府意願統統沒有譜。

台北市長柯文哲昨晚也在臉書表示「北市防疫旅館量能告急」，因每天確診人數不斷攀升，近期有近萬名移工入境，幾乎把北市的防疫旅館量能占滿。北市已發出徵召令，徵用加強版防疫專責旅館，讓輕症、低危險確診者入住。

總統府、監察院傳確診者
本土疫情多點爆發，公務機關包括總統府、監察院、台北市議會都傳出有確診者，北中南八大行業確診人數也驟增，未來場所是否新增禁令，指揮中心將再討論。

境外移入昨增一四九例個案，有七十八例為航班落地採檢陽性，越南有十七例居冠。指揮中心表示，目前落地採檢陽性率約落在四％到五％間，即日起越南航空、越捷航空及越竹航空等三航空公司班機，增加「搭機前六小時內抗原快篩報告」才可入境。
"""

In [None]:
# Start by turning a text into a spaCy Doc object
zh_doc = zh(zh_input)

In [None]:
#===Write your code below and save the output as `zh_toks`.===#
zh_toks = set(token.text.lower() for token in zh_doc if not token.is_punct)
zh_toks
# zh_toks =

['輕症',
 '紐西蘭',
 '發出',
 '罰則',
 '校園',
 '一',
 '啟動',
 '再',
 '也',
 '教師',
 '與',
 '新高',
 '照護',
 '單日',
 '越南',
 '表態',
 '其中',
 '尾聲',
 '質疑',
 '規定',
 '擬定',
 '最高峰',
 '療量',
 '政府',
 '應該',
 '\n\n',
 '造成',
 '高峰',
 '傳出',
 '吃緊',
 '飆高',
 '徵求',
 '可',
 '所',
 '爆發',
 '六月',
 '未',
 '把',
 '表示',
 '下',
 '連社',
 '攀升',
 '遠距',
 '避免',
 '昨增',
 '社會',
 '間',
 '一三九',
 '居冠',
 '旅行',
 '可能',
 '於衛',
 '小',
 '個案',
 '越竹',
 '但',
 '累積',
 '防疫',
 '早已',
 '機關',
 '增加',
 '加強版',
 '喊話',
 '抗原',
 '十七',
 '例遍',
 '疫情',
 '統沒',
 '是否',
 '因',
 '昨晚',
 '地方',
 '報告',
 '為目',
 '確診',
 '搭機',
 '緩坡',
 '檢討',
 '推估',
 '走向',
 '譜',
 '送藥',
 '診數',
 '匡列',
 '例為',
 '屆時',
 '指揮官',
 '共存',
 '學生',
 '否則',
 '及',
 '陳時',
 '詢時',
 '仍',
 '到',
 '擬以',
 '篩來',
 '認為',
 '還',
 '平台',
 '將與',
 '區溝通',
 '沒譜',
 '至',
 '合理',
 '點',
 '如',
 '數還',
 '臉書',
 '移工',
 '都',
 '萬名',
 '遵守',
 '必然',
 '居家',
 '一四九例',
 '會到',
 '三',
 '來將',
 '快篩',
 '並',
 '出來',
 '隔離',
 '越捷',
 '賴惠員',
 '意願',
 '新增',
 '即',
 '恐慌',
 '前',
 '柯文哲',
 '運作',
 '度',
 '為',
 '低',
 '試辦',
 '將調',
 '航空',
 '關懷',
 '周擬',
 '台北市',
 '他',
 '超過',
 '日起',
 '香港',
 '
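
Note that the result above still contains whitespace tokens such as `'\n\n'`, because a line break is not punctuation. If you want to drop those as well, you need a second condition. The sketch below shows the idea in plain Python on a handful of tokens taken from the text; `is_punct_or_space` is a hypothetical helper that relies on Unicode character categories rather than on spaCy's own flags (on real tokens, spaCy's `token.is_space` can be used alongside `token.is_punct`).

```python
import unicodedata


def is_punct_or_space(token):
    """True if every character is whitespace or Unicode punctuation."""
    return all(
        ch.isspace() or unicodedata.category(ch).startswith("P")
        for ch in token
    )


# A few tokens from the article, including punctuation and a line-break token.
tokens = ["輕症", "\n\n", "校園", "，", "「", "疫情", "。"]
kept = [t for t in tokens if not is_punct_or_space(t)]
print(kept)  # ['輕症', '校園', '疫情']
```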

In [None]:
#===Write your code below and save the output as `zh_ents`.===#
zh_ents = set((ent.text, ent.label_) for ent in zh_doc.ents)
zh_ents
# zh_ents =

{('一三九', 'DATE'),
 ('一千五百', 'CARDINAL'),
 ('一四九例', 'CARDINAL'),
 ('一至兩個月', 'DATE'),
 ('七十八', 'CARDINAL'),
 ('三', 'CARDINAL'),
 ('三八二', 'CARDINAL'),
 ('下周', 'DATE'),
 ('中央疫情指揮中心', 'ORG'),
 ('五百', 'CARDINAL'),
 ('五％', 'CARDINAL'),
 ('六月', 'DATE'),
 ('北市', 'GPE'),
 ('十七', 'CARDINAL'),
 ('十四縣', 'CARDINAL'),
 ('台北市', 'GPE'),
 ('周將', 'DATE'),
 ('周擬', 'PERSON'),
 ('四％', 'FAC'),
 ('將調', 'PERSON'),
 ('將飆', 'PERSON'),
 ('底且', 'PERSON'),
 ('徵用加強版防疫專責', 'ORG'),
 ('教育部', 'ORG'),
 ('新北', 'GPE'),
 ('日前', 'DATE'),
 ('昨晚', 'TIME'),
 ('監察院', 'ORG'),
 ('紐西蘭', 'GPE'),
 ('蔡英文', 'PERSON'),
 ('規模難', 'ORG'),
 ('警政系統聯繫', 'ORG'),
 ('越南', 'GPE'),
 ('醫療量', 'ORG'),
 ('陽性率', 'PERSON'),
 ('韓國', 'GPE'),
 ('預估一至兩個月', 'DATE'),
 ('香港', 'GPE'),
 ('驟增', 'PERSON')}

- Input 2: Simplified version of Input 1 (Use `opencc` to do the conversion.)
- Output 2:
    - A list of unique tokens, excluding punctuation
    - A list of unique tuples of (NER text, NER label)

In [None]:
# @title opencc Installation { display-mode: "form" }

INSTALL = True # @param {type:"boolean"}

if INSTALL:
    !pip install opencc -qq


In [None]:
import opencc

In [None]:
converter = opencc.OpenCC('t2s.json')
sim_zh_input = converter.convert(zh_input)
sim_zh_input

'\n本土个案昨增三八二例再创新高，确诊案例遍及十九县市，境外增一四九例，单日破五百案例。\n\n中央疫情指挥中心指挥官陈时中表示，个案数还会往上升，预估一至两个月内达最高峰，疫情将持续到六月底且还不会到尾声；他也首度松口「清零不可能」，未来将走向与病毒共存，下周拟试办「轻症在家」隔离，同时也将调整停课标准。\n\n拟以居家快筛取代停课\n国内确诊个案上升，全国累积十四县市共一三九所校园停课，各县市对毕业旅行、户外教学是否取消标准不一，教师团体认为政府应该明确表态，否则会造成校园恐慌或影响学生受教权。陈时中表示，未来必然走向与病毒共存，下周将与教育部检讨停课标准，缩小匡列范围，并在合理范围以居家快筛来取代停课。\n\n疫情一至两个月达高峰\n指挥中心比照韩国、纽西兰及香港疫情发展，推估未来一至两个月确诊案例将飙至最高峰，陈时中表示，目前Omicron确诊数仍算低，但规模难以预估，未来单日恐超过一千五百例，届时致死率、个案数飙高或疫情高峰下不来，社会将承担不起，因此仍须积极因应，朝「缓坡上升」方向努力。\n\n轻症在家指引至今没谱\n总统蔡英文日前宣布防疫以「减灾」为目标避免医疗量能超载，「轻症在家」隔离为其中配套，单日确诊数若达一千五百人将启动。指挥中心下周将拟定「轻症在家照护指引」，如设医疗远距平台、送药、户政及警政系统联系、关怀中心运作等，若未遵守隔离规定将有罚则，并因应地方疫情升温和医疗量能吃紧，将从新北试办。\n\n立委赖惠员、蒋万安昨于卫环委员会质询时，质疑各地方早已喊话希望指挥中心相关指引快出来，但至今连社区沟通、征求试办的地方政府意愿统统没有谱。\n\n台北市长柯文哲昨晚也在脸书表示「北市防疫旅馆量能告急」，因每天确诊人数不断攀升，近期有近万名移工入境，几乎把北市的防疫旅馆量能占满。北市已发出征召令，征用加强版防疫专责旅馆，让轻症、低危险确诊者入住。\n\n总统府、监察院传确诊者\n本土疫情多点爆发，公务机关包括总统府、监察院、台北市议会都传出有确诊者，北中南八大行业确诊人数也骤增，未来场所是否新增禁令，指挥中心将再讨论。\n\n境外移入昨增一四九例个案，有七十八例为航班落地采检阳性，越南有十七例居冠。指挥中心表示，目前落地采检阳性率约落在四％到五％间，即日起越南航空、越捷航空及越竹航空等三航空公司班机，增加「搭机前六小时内抗原快筛报告」才可入境。\n

In [None]:
# Start by turning a text into a spaCy Doc object
sim_zh_doc = zh(sim_zh_input)

In [None]:
#===Write your code below and save the output as `sim_zh_toks`.===#
sim_zh_toks = set(token.text.lower() for token in sim_zh_doc if not token.is_punct)
sim_zh_toks
# sim_zh_toks =

{'\n',
 '\n\n',
 'omicron确',
 '一',
 '一三九',
 '一千五百',
 '一千五百例',
 '一四九',
 '一四九例',
 '七十八',
 '万',
 '三',
 '三八二',
 '上升',
 '下',
 '下周',
 '不',
 '不一',
 '不断',
 '与',
 '专责',
 '两',
 '个',
 '个案',
 '中',
 '中南',
 '中央',
 '中心',
 '为',
 '也',
 '五百',
 '五％',
 '人',
 '人数',
 '仍',
 '从',
 '他',
 '以',
 '会',
 '传出',
 '传确',
 '但',
 '低',
 '例',
 '例为',
 '停课',
 '入住',
 '入境',
 '全国',
 '八大',
 '公务',
 '公司',
 '六',
 '六月',
 '共',
 '共存',
 '关怀',
 '其中',
 '内',
 '内达',
 '再',
 '减灾',
 '几乎',
 '出来',
 '创新',
 '到',
 '前',
 '加强版',
 '努力',
 '包括',
 '北',
 '北市',
 '匡列',
 '医疗',
 '医疗量',
 '十七',
 '十九',
 '十四',
 '升温',
 '单',
 '单日',
 '占满',
 '卫环',
 '危险',
 '即',
 '县',
 '及',
 '发出',
 '发展',
 '取代',
 '取消',
 '受教权',
 '可',
 '可能',
 '台北',
 '台北市',
 '吃紧',
 '各',
 '各县',
 '合理',
 '同时',
 '名',
 '否则',
 '启动',
 '告急',
 '周拟',
 '和',
 '喊话',
 '四％',
 '因',
 '因应',
 '因此',
 '团体',
 '国内',
 '在',
 '在家',
 '地方',
 '地方政府',
 '场所',
 '境外',
 '增',
 '增加',
 '多',
 '如',
 '委员会',
 '学生',
 '宣布',
 '对',
 '将',
 '小时',
 '尾声',
 '居冠',
 '居家',
 '届时',
 '已',
 '市',
 '市长',
 '希望',
 '平台',
 '并',
 '应该',
 '底且',
 '度',
 '影响',
 '往',
 '征召令'

Evaluate whether `zh_toks` is equal to `sim_zh_toks`.

In [None]:
zh_toks == sim_zh_toks

False
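
A bare `False` says little, since the two sets are written in different scripts and would differ even if the tokenizer had split both texts identically. Set operations give a more informative picture. The sketch below uses three place names that appear in the outputs above as toy stand-ins for the real sets; the character map `T2S` is a hypothetical two-entry table that only covers this example (OpenCC does this job for real text), and `jaccard` is an illustrative helper.

```python
# Toy stand-ins for `zh_toks` and `sim_zh_toks`.
trad = {"紐西蘭", "香港", "越南"}
simp = {"纽西兰", "香港", "越南"}

# Hypothetical Traditional -> Simplified map covering only this example.
T2S = str.maketrans({"紐": "纽", "蘭": "兰"})


def jaccard(a, b):
    """Share of items the two sets have in common (1.0 means identical)."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


print(trad == simp)            # False: the scripts differ
print(trad ^ simp)             # items found in only one of the two sets
converted = {tok.translate(T2S) for tok in trad}
print(jaccard(trad, simp))       # 0.5
print(jaccard(converted, simp))  # 1.0
```

Applied to the real sets, converting `zh_toks` to Simplified characters first isolates the differences that come from tokenization rather than from the script itself.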

In [None]:
#===Write your code below and save the output as `sim_zh_ents`.===#

sim_zh_ents = set((ent.text, ent.label_) for ent in sim_zh_doc.ents)
sim_zh_ents
# sim_zh_ents =

{('一三九', 'DATE'),
 ('一千五百人', 'CARDINAL'),
 ('一四九例', 'CARDINAL'),
 ('一至两个月', 'DATE'),
 ('七十八', 'CARDINAL'),
 ('三', 'CARDINAL'),
 ('三八二', 'CARDINAL'),
 ('下周', 'DATE'),
 ('中南', 'GPE'),
 ('中央疫情指挥中心', 'ORG'),
 ('五百', 'CARDINAL'),
 ('五％', 'PERCENT'),
 ('六小时', 'TIME'),
 ('六月', 'DATE'),
 ('内达', 'GPE'),
 ('北市', 'GPE'),
 ('十七', 'CARDINAL'),
 ('十九', 'CARDINAL'),
 ('十四县', 'CARDINAL'),
 ('台北', 'GPE'),
 ('台北市议会', 'ORG'),
 ('周拟', 'PERSON'),
 ('四％', 'FAC'),
 ('底且', 'PERSON'),
 ('总统府', 'ORG'),
 ('教育部', 'ORG'),
 ('新北', 'GPE'),
 ('日前', 'DATE'),
 ('日确诊数', 'ORG'),
 ('昨晚', 'TIME'),
 ('未来一至两个月', 'DATE'),
 ('柯文哲', 'PERSON'),
 ('监察院', 'ORG'),
 ('确诊者', 'FAC'),
 ('约落', 'PERSON'),
 ('纽西兰', 'GPE'),
 ('蒋万安昨于卫环委员会', 'ORG'),
 ('蔡英文', 'PERSON'),
 ('赖惠员', 'PERSON'),
 ('越南', 'GPE'),
 ('近万', 'CARDINAL'),
 ('陈时', 'PERSON'),
 ('陈时中', 'PERSON'),
 ('韩国', 'GPE'),
 ('香港', 'GPE')}

Evaluate whether `zh_ents` is equal to `sim_zh_ents`.

In [None]:
zh_ents == sim_zh_ents

False

## Reference

1. https://ckip.iis.sinica.edu.tw/
2. https://github.com/APCLab/jieba-tw
3. https://corenlp.run/
4. https://github.com/Embedding/Chinese-Word-Vectors
5. https://github.com/stanfordnlp/GloVe
6. https://radimrehurek.com/gensim/
7. https://github.com/sloria/textblob

