<a href="https://colab.research.google.com/github/howard-haowen/NLP-demos/blob/main/NSYSU/W01-use-pretrained-models.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

This notebook is adapted by [Haowen Jiang](https://howard-haowen.rohan.tw/) from [this one](https://github.com/nlptown/nlp-notebooks/blob/master/NLP%20with%20pretrained%20models%20-%20spaCy%20and%20StanfordNLP.ipynb) included in the [nlptown/nlp-notebooks](https://github.com/nlptown/nlp-notebooks) repo. It is meant for the 2022 [NLP Workshop at NSYSU](https://howard-haowen.rohan.tw/NLP-demos/nsysu_workshop).

In [None]:
from datetime import date

today = date.today()
print("Last updated:", today)

Last updated: 2022-04-13


# NLP with pretrained models - spaCy and StanfordNLP

In [None]:
!pip install -U pip setuptools wheel
!pip install -U spacy
!python -m spacy download en_core_web_md
!python -m spacy download zh_core_web_md

![](https://spacy.io/pipeline-fde48da9b43661abcdf62ab70a546d71.svg)

In [None]:
import spacy
from IPython.display import HTML, display
import tabulate

In [None]:
spacy.info()

{'location': '/usr/local/lib/python3.7/dist-packages/spacy',
 'pipelines': {},
 'platform': 'Linux-5.4.144+-x86_64-with-Ubuntu-18.04-bionic',
 'python_version': '3.7.13',
 'spacy_version': '3.2.4'}

- To get you started, play with [this Web App](https://share.streamlit.io/howard-haowen/spacy-streamlit/app.py) that I created, which is powered by spaCy.

## English NLP

In [None]:
en = spacy.load("en_core_web_md")

In [None]:
text = ("Donald John Trump (born June 14, 1946) is the 45th and current president of "
        "the United States.  Before entering politics, he was a businessman and television personality.")
print(text)

Donald John Trump (born June 14, 1946) is the 45th and current president of the United States.  Before entering politics, he was a businessman and television personality.


In [None]:
doc_en = en(text)

First, spaCy splits your document into sentences, and the sentences into tokens.

- `doc.sents` is a generator, so wrap it in `list()` to see the sentences.
- You can confirm this with `type(doc.sents)`.


In [None]:
list(doc_en.sents)

[Donald John Trump (born June 14, 1946) is the 45th and current president of the United States.,
  Before entering politics, he was a businessman and television personality.]

In [None]:
tokens = [token.text for token in doc_en]
print(tokens)

['Donald', 'John', 'Trump', '(', 'born', 'June', '14', ',', '1946', ')', 'is', 'the', '45th', 'and', 'current', 'president', 'of', 'the', 'United', 'States', '.', ' ', 'Before', 'entering', 'politics', ',', 'he', 'was', 'a', 'businessman', 'and', 'television', 'personality', '.']


spaCy also identifies a number of linguistic features for every token. The most basic of these are the lemma and two types of part-of-speech tags: the `pos_` attribute contains the [Universal POS tags](https://universaldependencies.org/u/pos/) from the [Universal Dependencies](https://universaldependencies.org/) project, while the `tag_` attribute contains more fine-grained, language-specific part-of-speech tags.

In [None]:
features = [[t.orth_, t.lemma_, t.pos_, t.tag_] for t in doc_en]
display(HTML(tabulate.tabulate(features, tablefmt='html')))

0,1,2,3
Donald,Donald,PROPN,NNP
John,John,PROPN,NNP
Trump,Trump,PROPN,NNP
(,(,PUNCT,-LRB-
born,bear,VERB,VBN
June,June,PROPN,NNP
14,14,NUM,CD
",",",",PUNCT,","
1946,1946,NUM,CD
),),PUNCT,-RRB-


spaCy also offers pre-trained models for named entity recognition. Their results can be found on the `ent_iob_` and `ent_type_` attributes. The `ent_type_` attribute tells us what type of entity the token refers to. In the English models, these entity types follow the [OntoNotes standard](https://catalog.ldc.upenn.edu/docs/LDC2013T19/OntoNotes-Release-5.0.pdf). In our example, we see that `Donald John Trump` refers to a person, `June 14, 1946` to a date, `45th` to an ordinal number, and `the United States` to a geo-political entity (GPE).

The letters in the `ent_iob_` attribute give the position of the token in the entity. `O` means the token is outside of an entity, `B` means the token is at the beginning of an entity, and `I` means it is inside an entity (at any position except for the beginning). In this way, we can tell apart several entities of the same type that immediately follow each other. Together these letters form the so-called `BIO` tagging scheme. There are other tagging schemes, such as `BILUO`, which also has letters for the last position and single (unique) tokens in an entity, but the BIO scheme gives you all the information you need.

In [None]:
entities = [(t.orth_, t.ent_iob_, t.ent_type_) for t in doc_en]
display(HTML(tabulate.tabulate(entities, tablefmt='html')))

0,1,2
Donald,B,PERSON
John,I,PERSON
Trump,I,PERSON
(,O,
born,O,
June,B,DATE
14,I,DATE
",",I,DATE
1946,I,DATE
),O,
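
The B/I/O letters in the table above are enough to rebuild whole entities. The following is a minimal sketch in plain Python, not spaCy's own implementation: `bio_to_spans` is a hypothetical helper name, and the `rows` list simply repeats the ten rows shown above.

```python
def bio_to_spans(rows):
    """Group (token, iob, label) rows into (tokens, label) entities."""
    spans = []
    current_tokens, current_label = [], None
    for token, iob, label in rows:
        if iob == "B":
            # A new entity starts; close any entity that is still open.
            if current_tokens:
                spans.append((current_tokens, current_label))
            current_tokens, current_label = [token], label
        elif iob == "I" and current_tokens:
            # Continue the entity that is currently open.
            current_tokens.append(token)
        else:
            # "O" (or a stray "I" with nothing open) ends the open entity.
            if current_tokens:
                spans.append((current_tokens, current_label))
            current_tokens, current_label = [], None
    if current_tokens:
        spans.append((current_tokens, current_label))
    return spans


# Rows copied from the table above (hypothetical variable name).
rows = [
    ("Donald", "B", "PERSON"), ("John", "I", "PERSON"), ("Trump", "I", "PERSON"),
    ("(", "O", ""), ("born", "O", ""),
    ("June", "B", "DATE"), ("14", "I", "DATE"), (",", "I", "DATE"), ("1946", "I", "DATE"),
    (")", "O", ""),
]
entity_spans = bio_to_spans(rows)
print(entity_spans)
```

Because every entity opens with its own `B`, two adjacent entities of the same type are kept apart even when no `O` token separates them.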


You can also access the entities directly on the `ents` attribute of the document: 

In [None]:
print([(ent.text, ent.label_) for ent in doc_en.ents])

[('Donald John Trump', 'PERSON'), ('June 14, 1946', 'DATE'), ('45th', 'ORDINAL'), ('the United States', 'GPE')]


spaCy also contains a dependency parser, which analyzes the grammatical relations between the tokens. 

In [None]:
syntax = [[token.text, token.dep_, token.head.text ] for token in doc_en]
display(HTML(tabulate.tabulate(syntax, tablefmt='html')))

0,1,2
Donald,compound,Trump
John,compound,Trump
Trump,nsubj,is
(,punct,Trump
born,acl,Trump
June,npadvmod,born
14,nummod,June
",",punct,June
1946,nummod,June
),punct,Trump
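
Each row above links a token to its head, so the rows can be regrouped to list the dependents of every head. This is a small sketch in plain Python that reuses the ten rows shown above. Keying on the token text is a simplifying assumption that breaks as soon as a word occurs twice in a sentence. With a real `Doc` you would more likely key on `token.i` or iterate over `token.children`.

```python
from collections import defaultdict

# (token, dependency label, head) rows copied from the table above.
rows = [
    ("Donald", "compound", "Trump"),
    ("John", "compound", "Trump"),
    ("Trump", "nsubj", "is"),
    ("(", "punct", "Trump"),
    ("born", "acl", "Trump"),
    ("June", "npadvmod", "born"),
    ("14", "nummod", "June"),
    (",", "punct", "June"),
    ("1946", "nummod", "June"),
    (")", "punct", "Trump"),
]

# Map each head to the list of its direct dependents.
children = defaultdict(list)
for text, dep, head in rows:
    children[head].append((text, dep))

for head, dependents in children.items():
    print(head, "<-", dependents)
```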


Finally, the English spaCy model also provides morphological features for every token, available on the `morph` attribute.

In [None]:
features = [[t.text, t.morph] for t in doc_en]
display(HTML(tabulate.tabulate(features, tablefmt='html')))

0,1
Donald,Number=Sing
John,Number=Sing
Trump,Number=Sing
(,PunctSide=Ini|PunctType=Brck
born,Aspect=Perf|Tense=Past|VerbForm=Part
June,Number=Sing
14,NumType=Card
",",PunctType=Comm
1946,NumType=Card
),PunctSide=Fin|PunctType=Brck
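
The feature strings above use the `Feature=Value|Feature=Value` notation. As a rough sketch, they can be split into a dictionary with plain Python. `parse_feats` is a hypothetical helper, and the input is the string shown for `born`. On a real token, `token.morph.to_dict()` should give a comparable result without any manual parsing.

```python
def parse_feats(feats):
    """Turn 'Aspect=Perf|Tense=Past' into {'Aspect': 'Perf', 'Tense': 'Past'}."""
    if not feats:
        # Tokens without morphological features yield an empty string.
        return {}
    return dict(pair.split("=", 1) for pair in feats.split("|"))


born = parse_feats("Aspect=Perf|Tense=Past|VerbForm=Part")
print(born)
```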


## Multilingual NLP

spaCy has models not only for English but also for many other languages.

In [None]:
zh = spacy.load("zh_core_web_md")

In [None]:
text_zh = "中山大學人文暨科技跨領域學士學位學程助理教授宋世祥表示，2021年聖誕節假期期間，師生舉辦「街頭玩童～鹽埕兒童街區遊戲日」成果展。活動中可看見學生運用贊助單位瑞儀教育基金會致贈的廢棄木棧板，製作了6具兒童創意遊具，一方面展示學習成果，也希望藉此呼籲高雄民眾重視兒童的遊戲權。"
doc_zh = zh(text_zh)

The tokens in the Chinese document have the same attributes as those in the English one. Take care, however, because the functionality of the models can differ across languages. Here are three main differences between the English and the Chinese model: 

- The Chinese model does not offer lemmatization. The cell below prints the `text` and `orth_` attributes, which both hold the verbatim token text and are therefore identical.

In [None]:
list(doc_zh.sents)

[中山大學人文暨科技跨領域學士學位學程助理教授宋世祥表示，2021年聖誕節假期期間，師生舉辦「街頭玩童～鹽埕兒童街區遊戲日」成果展。,
 活動中可看見學生運用贊助單位瑞儀教育基金會致贈的廢棄木棧板，製作了6具兒童創意遊具，一方面展示學習成果，也希望藉此呼籲高雄民眾重視兒童的遊戲權。]

In [None]:
tok_text = [tok.text for tok in doc_zh]
tok_orth = [tok.orth_ for tok in doc_zh]
print(tok_text)
print(tok_orth)

['中山', '大學', '人文', '暨', '科技', '跨領域', '學士', '學位', '學程', '助理', '教授', '宋世祥', '表示', '，', '2021年', '聖誕', '節', '假期', '期間', '，', '師生', '舉辦', '「', '街頭', '玩童', '～鹽', '埕兒', '童街', '區遊', '戲日', '」', '成果', '展', '。', '活動', '中', '可', '看見', '學生', '運用', '贊助', '單位', '瑞儀', '教育', '基金會', '致贈', '的', '廢棄木', '棧板', '，', '製作', '了', '6', '具', '兒童', '創意', '遊具', '，', '一方面', '展示', '學習', '成果', '，', '也', '希望', '藉此', '呼籲', '高雄', '民眾', '重視', '兒童', '的', '遊戲權', '。']
['中山', '大學', '人文', '暨', '科技', '跨領域', '學士', '學位', '學程', '助理', '教授', '宋世祥', '表示', '，', '2021年', '聖誕', '節', '假期', '期間', '，', '師生', '舉辦', '「', '街頭', '玩童', '～鹽', '埕兒', '童街', '區遊', '戲日', '」', '成果', '展', '。', '活動', '中', '可', '看見', '學生', '運用', '贊助', '單位', '瑞儀', '教育', '基金會', '致贈', '的', '廢棄木', '棧板', '，', '製作', '了', '6', '具', '兒童', '創意', '遊具', '，', '一方面', '展示', '學習', '成果', '，', '也', '希望', '藉此', '呼籲', '高雄', '民眾', '重視', '兒童', '的', '遊戲權', '。']


- The Chinese model has very different fine-grained part-of-speech tags on the `tag_` attribute.

In [None]:
for tok in list(doc_zh.sents)[1]:
    print(f"{tok.text} >>> {tok.pos_}")

活動 >>> NOUN
中 >>> PART
可 >>> VERB
看見 >>> VERB
學生 >>> NOUN
運用 >>> VERB
贊助 >>> NOUN
單位 >>> NOUN
瑞儀 >>> PROPN
教育 >>> NOUN
基金會 >>> NOUN
致贈 >>> NOUN
的 >>> PART
廢棄木 >>> NOUN
棧板 >>> NOUN
， >>> PUNCT
製作 >>> VERB
了 >>> PART
6 >>> NUM
具 >>> NUM
兒童 >>> NOUN
創意 >>> ADJ
遊具 >>> NOUN
， >>> PUNCT
一方面 >>> ADV
展示 >>> VERB
學習 >>> NOUN
成果 >>> NOUN
， >>> PUNCT
也 >>> ADV
希望 >>> VERB
藉此 >>> ADV
呼籲 >>> VERB
高雄 >>> PROPN
民眾 >>> NOUN
重視 >>> VERB
兒童 >>> NOUN
的 >>> PART
遊戲權 >>> PROPN
。 >>> PUNCT


In [None]:
for tok in list(doc_zh.sents)[1]:
    print(f"{tok.text} >>> {tok.tag_} | {spacy.explain(tok.tag_)}")

活動 >>> NN | noun, singular or mass
中 >>> LC | localizer
可 >>> VV | other verb
看見 >>> VV | other verb
學生 >>> NN | noun, singular or mass
運用 >>> VV | other verb
贊助 >>> NN | noun, singular or mass
單位 >>> NN | noun, singular or mass
瑞儀 >>> NR | proper noun
教育 >>> NN | noun, singular or mass
基金會 >>> NN | noun, singular or mass
致贈 >>> NN | noun, singular or mass
的 >>> DEG | associative 的
廢棄木 >>> NN | noun, singular or mass
棧板 >>> NN | noun, singular or mass
， >>> PU | punctuation
製作 >>> VV | other verb
了 >>> AS | aspect marker
6 >>> CD | cardinal number
具 >>> M | measure word
兒童 >>> NN | noun, singular or mass
創意 >>> JJ | adjective (English), other noun-modifier (Chinese)
遊具 >>> NN | noun, singular or mass
， >>> PU | punctuation
一方面 >>> AD | adverb
展示 >>> VV | other verb
學習 >>> NN | noun, singular or mass
成果 >>> NN | noun, singular or mass
， >>> PU | punctuation
也 >>> AD | adverb
希望 >>> VV | other verb
藉此 >>> AD | adverb
呼籲 >>> VV | other verb
高雄 >>> NR | proper noun
民眾 >>> NN | noun, singul

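- Entity types can also differ across languages: some spaCy models use a smaller label set (such as PER, LOC and ORG) than the English one. The Chinese model used here, however, produces labels such as `ORG`, `PERSON` and `GPE`, as the outputs in this notebook show.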

This is a result of the training corpora that were used to build the models, whose annotation guidelines may be very different.

In [None]:
info = [(t.text, t.pos_, t.tag_, t.ent_iob_, t.ent_type_) for t in doc_zh]
display(HTML(tabulate.tabulate(info, tablefmt='html')))

0,1,2,3,4
中山,PROPN,NR,B,ORG
大學,NOUN,NN,I,ORG
人文,NOUN,NN,I,ORG
暨,CCONJ,CC,I,ORG
科技,NOUN,NN,I,ORG
跨領域,ADJ,JJ,I,ORG
學士,NOUN,NN,O,
學位,NOUN,NN,O,
學程,NOUN,NN,O,
助理,NOUN,NN,O,


## Visualization

In [None]:
from spacy import displacy

In [None]:
displacy.render(doc_zh, style='ent', jupyter=True, options={'distance': 130})

In [None]:
text = "我想要三份2號餐"
doc = zh(text)
displacy.render(doc, style='dep', jupyter=True, options={'distance': 130})

## DataFrame + spaCy = dframcy

In [None]:
!pip install dframcy

In [None]:
from dframcy import DframCy

In [None]:
nlp = spacy.load('zh_core_web_md')
dframcy = DframCy(nlp)
doc = dframcy.nlp(text_zh)
annotation_dataframe = dframcy.to_dataframe(doc)
annotation_dataframe

Unnamed: 0,token_text,token_start,token_end,token_pos_,token_tag_,token_dep_,token_head,token_ent_type_
0,中山,0,2,PROPN,NR,compound:nn,大學,ORG
1,大學,2,4,NOUN,NN,nmod:assmod,科技,ORG
2,人文,4,6,NOUN,NN,conj,科技,ORG
3,暨,6,7,CCONJ,CC,cc,科技,ORG
4,科技,7,9,NOUN,NN,compound:nn,跨領域,ORG
...,...,...,...,...,...,...,...,...
69,重視,128,130,VERB,VV,ccomp,呼籲,
70,兒童,130,132,NOUN,NN,dobj,重視,
71,的,132,133,PART,DEC,mark,重視,
72,遊戲權,133,136,PROPN,NR,dobj,呼籲,


Once the annotations are stored in a DataFrame, they can be filtered with ordinary `pandas` boolean indexing. The cell below keeps only the tokens that are tagged `NN` and serve as direct objects (`dobj`).

In [None]:
nn_filt = annotation_dataframe['token_tag_'] == 'NN'
dobj_filt = annotation_dataframe['token_dep_'] == 'dobj'
annotation_dataframe[nn_filt & dobj_filt]

Unnamed: 0,token_text,token_start,token_end,token_pos_,token_tag_,token_dep_,token_head,token_ent_type_
29,戲日,58,60,NOUN,NN,dobj,舉辦,
32,展,63,64,NOUN,NN,dobj,舉辦,
48,棧板,92,94,NOUN,NN,dobj,運用,
56,遊具,104,106,NOUN,NN,dobj,製作,
61,成果,114,116,NOUN,NN,dobj,展示,
70,兒童,130,132,NOUN,NN,dobj,重視,


## Vectors

In [None]:
doc = zh("教授")
tok = doc[0]
tok.vector

array([ 2.2328  , -1.1713  , -3.3528  , -1.1691  , -0.26724 ,  4.4476  ,
       -0.66089 ,  2.6248  , -1.5367  , -2.8449  , -4.0233  ,  1.5727  ,
        1.978   ,  2.7964  ,  1.003   ,  0.29978 ,  0.056525,  3.7048  ,
        2.0446  ,  2.2452  , -5.7184  ,  0.77814 , -1.8383  , -0.017231,
       -1.91    , -6.4355  , -4.6737  , -0.13519 ,  0.66087 , -1.6718  ,
        3.5934  ,  2.3382  , -4.5406  ,  1.6124  , -2.2361  , -6.0387  ,
       -3.4078  ,  1.1304  ,  0.80933 ,  1.9734  ,  2.3314  , -0.9882  ,
       -1.1947  ,  2.2628  , -1.3687  , -6.4278  ,  0.15906 ,  0.047335,
       -2.8157  , -1.6407  ,  2.4385  , -0.84336 ,  3.081   ,  5.9188  ,
       -1.3019  ,  1.2971  ,  7.2325  ,  2.9722  , -0.45552 ,  1.5148  ,
       -1.1193  ,  3.8739  ,  1.482   , -2.4657  ,  1.4627  , -3.562   ,
       -2.1737  , -1.4306  ,  3.4363  , -1.2796  , -1.4106  ,  2.2146  ,
        2.9325  , -2.5172  ,  2.7192  , -0.84556 , -2.5362  ,  2.2079  ,
       -3.2217  , -2.2081  ,  4.6204  ,  0.98445 , 

In [None]:
tok.vector.shape

(300,)

In [None]:
word_1 = nlp.vocab["高興"]
word_2 = nlp.vocab["高雄"]
word_3 = nlp.vocab["開心"]
word_1_word_2 = word_1.similarity(word_2)
word_1_word_3 = word_1.similarity(word_3)
print(f"Similarity between word 1 and word 2: {word_1_word_2}")
print(f"Similarity between word 1 and word 3: {word_1_word_3}")

Similarity between word 1 and word 2: 0.27085748314857483
Similarity between word 1 and word 3: 0.8141297101974487


- Cosine similarity 

![](https://datascience-enthusiast.com/figures/cosine_sim.png)

- Formula for calculating cosine similarity between two vectors

![](https://miro.medium.com/max/1400/1*LfW66-WsYkFqWc4XYJbEJg.png)
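
The formula pictured above divides the dot product of two vectors by the product of their lengths. Here is a minimal sketch in plain Python. The function name and the toy vectors are made up for illustration, and the code makes no claim about how spaCy computes `similarity` internally.

```python
import math


def cosine_similarity(u, v):
    """Dot product of u and v divided by the product of their norms."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Toy vectors (hypothetical): same direction, orthogonal, opposite direction.
same_direction = cosine_similarity([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])
opposite = cosine_similarity([1.0, 2.0], [-1.0, -2.0])
print(same_direction, orthogonal, opposite)
```

The score depends only on the angle between the vectors, not on their lengths, which is why a vector and its scaled copy count as maximally similar.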

## StanfordNLP

Another library whose functionality overlaps with that of spaCy is StanfordNLP. [StanfordNLP](https://stanfordnlp.github.io/stanfordnlp/), not to be confused with Stanford's Java [CoreNLP](https://stanfordnlp.github.io/CoreNLP/) library, is a [Python library](https://github.com/stanfordnlp/stanfordnlp) built on top of PyTorch that offers a fully neural pipeline with tokenization (including multi-word units), lemmatization, part-of-speech tagging (including morphological features) and dependency parsing. These components were built and trained for the [CoNLL-2018 shared task](https://nlp.stanford.edu/pubs/qi2018universal.pdf). There are no named entities, but the quality of the dependency parsing is state of the art. On top of that, it also offers a Python interface to CoreNLP. 

**`stanfordnlp` has been renamed to `stanza`.**

In [None]:
!pip install stanza

In [None]:
import stanza

In [None]:
stanza.download("zh-hant")

Downloading https://raw.githubusercontent.com/stanfordnlp/stanza-resources/main/resources_1.3.0.json:   0%|   …

2022-04-08 07:17:38 INFO: Downloading default packages for language: zh-hant (Traditional_Chinese)...
2022-04-08 07:17:38 INFO: File exists: /root/stanza_resources/zh-hant/default.zip.
2022-04-08 07:17:41 INFO: Finished downloading models and saved to /root/stanza_resources.


In [None]:
stf_nlp = stanza.Pipeline('zh-hant')

2022-04-08 07:17:42 INFO: Loading these models for language: zh-hant (Traditional_Chinese):
| Processor | Package |
-----------------------
| tokenize  | gsd     |
| pos       | gsd     |
| lemma     | gsd     |
| depparse  | gsd     |

2022-04-08 07:17:42 INFO: Use device: cpu
2022-04-08 07:17:42 INFO: Loading: tokenize
2022-04-08 07:17:42 INFO: Loading: pos
2022-04-08 07:17:42 INFO: Loading: lemma
2022-04-08 07:17:42 INFO: Loading: depparse
2022-04-08 07:17:43 INFO: Done loading processors!


In [None]:
text_zh = "中山大學人文暨科技跨領域學士學位學程助理教授宋世祥表示，2021年聖誕節假期期間，師生舉辦「街頭玩童～鹽埕兒童街區遊戲日」成果展。活動中可看見學生運用贊助單位瑞儀教育基金會致贈的廢棄木棧板，製作了6具兒童創意遊具，一方面展示學習成果，也希望藉此呼籲高雄民眾重視兒童的遊戲權。"
doc = stf_nlp(text_zh)
type(doc)

stanza.models.common.doc.Document

Different models often produce different tokenization results, which in turn affect POS tagging and dependency parsing.

- Here are the results from Stanza (formerly StanfordNLP).

In [None]:
for i, sent in enumerate(doc.sentences):
    print("[Sentence {}]".format(i+1))
    for word in sent.words:
        print("{:12s}\t{:12s}\t{:6s}\t{:d}\t{:12s}".format(
              word.text, word.lemma, word.pos, word.head, word.deprel))
    print("")

[Sentence 1]
中山          	中山          	PROPN 	3	nmod        
大學          	大學          	NOUN  	3	nmod        
人文          	人文          	NOUN  	12	nmod        
暨           	暨           	CCONJ 	6	cc          
科技          	科技          	NOUN  	6	compound    
跨           	跨           	PART  	3	conj        
領域          	領域          	NOUN  	12	nmod        
學士          	學士          	NOUN  	12	nmod        
學位          	學位          	NOUN  	12	nmod        
學程          	學程          	NOUN  	12	nmod        
助理          	助理          	NOUN  	12	nmod        
教授          	教授          	NOUN  	15	nsubj       
宋           	宋           	PROPN 	12	appos       
世祥          	世祥          	PROPN 	13	flat:name   
表示          	表示          	VERB  	0	root        
，           	，           	PUNCT 	15	punct       
2021        	2021        	NUM   	18	nummod      
年           	年           	NOUN  	22	clf         
聖誕          	聖誕          	PROPN 	20	compound    
節           	節           	PART  	22	nmod        
假期          	
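
In the output above, the fourth column is the head of each word, given as a 1-based position within the sentence, with `0` marking the root. The sketch below resolves those numbers back to words in plain Python. It reuses the first 16 rows of the output, up to the first comma, so every head index it needs falls within the list. `head_text` is a hypothetical helper name.

```python
# (text, head index, deprel) for the first 16 words of Sentence 1 above.
rows = [
    ("中山", 3, "nmod"), ("大學", 3, "nmod"), ("人文", 12, "nmod"),
    ("暨", 6, "cc"), ("科技", 6, "compound"), ("跨", 3, "conj"),
    ("領域", 12, "nmod"), ("學士", 12, "nmod"), ("學位", 12, "nmod"),
    ("學程", 12, "nmod"), ("助理", 12, "nmod"), ("教授", 15, "nsubj"),
    ("宋", 12, "appos"), ("世祥", 13, "flat:name"), ("表示", 0, "root"),
    ("，", 15, "punct"),
]


def head_text(rows, head):
    """Return the word at 1-based position `head`, or 'ROOT' for 0."""
    return "ROOT" if head == 0 else rows[head - 1][0]


resolved = [(text, deprel, head_text(rows, head)) for text, head, deprel in rows]
for text, deprel, head in resolved:
    print(f"{text}\t{deprel}\t{head}")
```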

# Assignment


## Analyze English

- Input: any English news article of your choice
- Output:
    - A list of unique lemmas of all verbs in lower case
    - A list of unique tuples of (NER text, NER label)




In [None]:
# Change this to any other article of your choice.

en_input = """
Taipei, April 7 (CNA) Health and Welfare Minister Chen Shih-chung (陳時中) said Thursday that COVID-19 contact tracing has been partially suspended in Taiwan and a new disease control model is being put in place, amid a rise in domestic cases.

The immediate suspension of contract tracing applies only to travelers who test positive for COVID-19 in Taiwan, either on arrival at the airport or during mandatory quarantine, Chen said.

That decision was made in a bid to free up resources to monitor the growing number of domestic COVID-19 cases, he said at a press briefing, after he reported 531 new cases -- 382 domestically transmitted and 149 imported.

Chen said contact tracing on new imported cases will only be done if any of them are believed to be linked to COVID-19 clusters at quarantine hotels or quarantine centers in Taiwan.

Prior to Thursday, Taiwan had been reporting its contact tracing information on imported COVID-19 cases via the World Health Organization's International Health Regulations (IHR) mechanism, he said.

Regarding the recent daily rise in domestic infections, Chen said the current goal is to bring the situation under control, even though it is impossible to achieve zero new domestic cases at this time.

Despite the recent spike, the daily number of domestic COVID-19 cases in Taiwan is still low compared to many other countries, he said, citing as an example the 534 new cases per 100,000 population reported in South Korea on Tuesday.

Once people in Taiwan stick together and do their part to prevent the spread of the virus, the situation will be manageable, Chen said.

Based on the trajectory of COVID-19 Omicron outbreaks observed in many other countries around the world, he said, it is likely that the infections in Taiwan will peak in a month or two.

"We do not expect the outbreak to stop growing now, but we hope it will rise slowly, so that Taiwan's medical capacity will not be overloaded," Chen said.

Meanwhile, earlier in the day, the Cabinet announced that Taiwan was adopting a new model for the control of COVID-19 infections.

Under the "new Taiwan model," the country has let go of its goal to achieve zero COVID-19 cases, but this does not mean allowing the pandemic go unmanaged, Cabinet spokesman Lo Ping-cheng (羅秉成) said, citing Premier Su Tseng-chang (蘇貞昌).

In a meeting earlier with Ministry of Health and Welfare (MOHW) officials, Premier Su said that as Taiwan moves towards a new stage of epidemic prevention, he hopes that the central and local governments will work together to gradually open up the country, in the interests of its people and economy, according to Lo.

In a report presented to the Cabinet on Thursday, the MOHW said Taiwan will continue to actively manage the COVID-19 situation, while steadily opening up its borders, in consideration of national economic factors and the people's livelihood.
"""

In [None]:
# Start by turning a text into a spaCy Doc object
# en_doc = en(en_input)

In [None]:
#===Write your code below and save the output as `verbs`.===#


# verbs = 

['say',
 'do',
 'accord',
 'bring',
 'monitor',
 'import',
 'free',
 'achieve',
 'manage',
 'apply',
 'suspend',
 'move',
 'put',
 'go',
 'regard',
 'let',
 'observe',
 'stop',
 'trace',
 'allow',
 'make',
 'work',
 'peak',
 'report',
 'mean',
 'test',
 'adopt',
 'stick',
 'compare',
 'announce',
 'grow',
 'prevent',
 'present',
 'base',
 'continue',
 'hope',
 'open',
 'transmit',
 'link',
 'believe',
 'overload',
 'expect',
 'cite',
 'rise']

In [None]:
#===Write your code below and save the output as `en_ents`.===#


# en_ents =

[('Taipei', 'GPE'),
 ('April 7', 'DATE'),
 ('CNA) Health and', 'ORG'),
 ('Welfare', 'ORG'),
 ('Chen Shih-chung', 'PERSON'),
 ('Thursday', 'DATE'),
 ('COVID-19', 'PERSON'),
 ('Taiwan', 'GPE'),
 ('Chen', 'PERSON'),
 ('COVID-19', 'PRODUCT'),
 ('531', 'CARDINAL'),
 ('382', 'CARDINAL'),
 ('149', 'CARDINAL'),
 ("the World Health Organization's", 'ORG'),
 ('daily', 'DATE'),
 ('zero', 'CARDINAL'),
 ('534', 'CARDINAL'),
 ('100,000', 'CARDINAL'),
 ('South Korea', 'GPE'),
 ('Tuesday', 'DATE'),
 ('COVID-19 Omicron', 'PERSON'),
 ('a month', 'DATE'),
 ('two', 'CARDINAL'),
 ('earlier in the day', 'DATE'),
 ('Cabinet', 'ORG'),
 ('Lo Ping-cheng', 'PERSON'),
 ('Su Tseng-chang', 'PERSON'),
 ('Ministry of Health and Welfare', 'ORG'),
 ('Su', 'PERSON'),
 ('Lo', 'ORG'),
 ('MOHW', 'ORG')]

## Analyze Chinese

- Input 1: any Chinese news article from Taiwan media of your choice 
- Output 1:
    - A list of unique tokens, excluding punctuation
    - A list of unique tuples of (NER text, NER label)

In [None]:
# Change this to any other article of your choice.

zh_input = """
本土個案昨增三八二例再創新高，確診案例遍及十九縣市，境外增一四九例，單日破五百案例。

中央疫情指揮中心指揮官陳時中表示，個案數還會往上升，預估一至兩個月內達最高峰，疫情將持續到六月底且還不會到尾聲；他也首度鬆口「清零不可能」，未來將走向與病毒共存，下周擬試辦「輕症在家」隔離，同時也將調整停課標準。

擬以居家快篩取代停課
國內確診個案上升，全國累積十四縣市共一三九所校園停課，各縣市對畢業旅行、戶外教學是否取消標準不一，教師團體認為政府應該明確表態，否則會造成校園恐慌或影響學生受教權。陳時中表示，未來必然走向與病毒共存，下周將與教育部檢討停課標準，縮小匡列範圍，並在合理範圍以居家快篩來取代停課。

疫情一至兩個月達高峰
指揮中心比照韓國、紐西蘭及香港疫情發展，推估未來一至兩個月確診案例將飆至最高峰，陳時中表示，目前Omicron確診數仍算低，但規模難以預估，未來單日恐超過一千五百例，屆時致死率、個案數飆高或疫情高峰下不來，社會將承擔不起，因此仍須積極因應，朝「緩坡上升」方向努力。

輕症在家指引至今沒譜
總統蔡英文日前宣布防疫以「減災」為目標避免醫療量能超載，「輕症在家」隔離為其中配套，單日確診數若達一千五百人將啟動。指揮中心下周將擬定「輕症在家照護指引」，如設醫療遠距平台、送藥、戶政及警政系統聯繫、關懷中心運作等，若未遵守隔離規定將有罰則，並因應地方疫情升溫和醫療量能吃緊，將從新北試辦。

立委賴惠員、蔣萬安昨於衛環委員會質詢時，質疑各地方早已喊話希望指揮中心相關指引快出來，但至今連社區溝通、徵求試辦的地方政府意願統統沒有譜。

台北市長柯文哲昨晚也在臉書表示「北市防疫旅館量能告急」，因每天確診人數不斷攀升，近期有近萬名移工入境，幾乎把北市的防疫旅館量能占滿。北市已發出徵召令，徵用加強版防疫專責旅館，讓輕症、低危險確診者入住。

總統府、監察院傳確診者
本土疫情多點爆發，公務機關包括總統府、監察院、台北市議會都傳出有確診者，北中南八大行業確診人數也驟增，未來場所是否新增禁令，指揮中心將再討論。

境外移入昨增一四九例個案，有七十八例為航班落地採檢陽性，越南有十七例居冠。指揮中心表示，目前落地採檢陽性率約落在四％到五％間，即日起越南航空、越捷航空及越竹航空等三航空公司班機，增加「搭機前六小時內抗原快篩報告」才可入境。
"""

In [None]:
# Start by turning a text into a spaCy Doc object
# zh_doc = zh(zh_input)

In [None]:
#===Write your code below and save the output as `zh_toks`.===#

# zh_toks = 

['輕症',
 '紐西蘭',
 '發出',
 '罰則',
 '校園',
 '一',
 '啟動',
 '再',
 '也',
 '教師',
 '與',
 '新高',
 '照護',
 '單日',
 '越南',
 '表態',
 '其中',
 '尾聲',
 '質疑',
 '規定',
 '擬定',
 '最高峰',
 '療量',
 '政府',
 '應該',
 '\n\n',
 '造成',
 '高峰',
 '傳出',
 '吃緊',
 '飆高',
 '徵求',
 '可',
 '所',
 '爆發',
 '六月',
 '未',
 '把',
 '表示',
 '下',
 '連社',
 '攀升',
 '遠距',
 '避免',
 '昨增',
 '社會',
 '間',
 '一三九',
 '居冠',
 '旅行',
 '可能',
 '於衛',
 '小',
 '個案',
 '越竹',
 '但',
 '累積',
 '防疫',
 '早已',
 '機關',
 '增加',
 '加強版',
 '喊話',
 '抗原',
 '十七',
 '例遍',
 '疫情',
 '統沒',
 '是否',
 '因',
 '昨晚',
 '地方',
 '報告',
 '為目',
 '確診',
 '搭機',
 '緩坡',
 '檢討',
 '推估',
 '走向',
 '譜',
 '送藥',
 '診數',
 '匡列',
 '例為',
 '屆時',
 '指揮官',
 '共存',
 '學生',
 '否則',
 '及',
 '陳時',
 '詢時',
 '仍',
 '到',
 '擬以',
 '篩來',
 '認為',
 '還',
 '平台',
 '將與',
 '區溝通',
 '沒譜',
 '至',
 '合理',
 '點',
 '如',
 '數還',
 '臉書',
 '移工',
 '都',
 '萬名',
 '遵守',
 '必然',
 '居家',
 '一四九例',
 '會到',
 '三',
 '來將',
 '快篩',
 '並',
 '出來',
 '隔離',
 '越捷',
 '賴惠員',
 '意願',
 '新增',
 '即',
 '恐慌',
 '前',
 '柯文哲',
 '運作',
 '度',
 '為',
 '低',
 '試辦',
 '將調',
 '航空',
 '關懷',
 '周擬',
 '台北市',
 '他',
 '超過',
 '日起',
 '香港',
 '

In [None]:
#===Write your code below and save the output as `zh_ents`.===#


# zh_ents =

[('三八二', 'CARDINAL'),
 ('日破', 'GPE'),
 ('五百', 'CARDINAL'),
 ('一至兩個月', 'DATE'),
 ('六月', 'DATE'),
 ('底且', 'PERSON'),
 ('周擬', 'PERSON'),
 ('將調', 'PERSON'),
 ('十四縣', 'CARDINAL'),
 ('一三九所校園停課', 'FAC'),
 ('下周', 'DATE'),
 ('教育部', 'ORG'),
 ('韓國', 'GPE'),
 ('紐西蘭', 'GPE'),
 ('香港', 'GPE'),
 ('沒譜', 'ORG'),
 ('蔡英文', 'PERSON'),
 ('日前', 'DATE'),
 ('一千五百人', 'CARDINAL'),
 ('關懷中心', 'ORG'),
 ('新北', 'GPE'),
 ('台北市', 'GPE'),
 ('柯文哲', 'PERSON'),
 ('昨晚', 'TIME'),
 ('北市防疫旅館量', 'EVENT'),
 ('近萬名', 'CARDINAL'),
 ('北市', 'GPE'),
 ('徵召令', 'PERSON'),
 ('徵用加強版防疫專責', 'ORG'),
 ('監察院', 'ORG'),
 ('中南', 'LOC'),
 ('驟增', 'PERSON'),
 ('一四九', 'CARDINAL'),
 ('七十八', 'CARDINAL'),
 ('越南', 'GPE'),
 ('十七', 'CARDINAL'),
 ('四％到', 'DATE'),
 ('五％', 'CARDINAL'),
 ('越南航空、越捷航空', 'ORG'),
 ('越竹航空', 'ORG')]

- Input 2: Simplified version of Input 1 (Use `opencc` to do the conversion.)
- Output 2:
    - A list of unique tokens, excluding punctuation
    - A list of unique tuples of (NER text, NER label)

In [None]:
!pip install opencc

Collecting opencc
  Downloading OpenCC-1.1.3-cp37-cp37m-manylinux1_x86_64.whl (766 kB)
Installing collected packages: opencc
Successfully installed opencc-1.1.3

In [None]:
import opencc

In [None]:
converter = opencc.OpenCC('t2s.json')
sim_zh_input = converter.convert(zh_input)
sim_zh_input

'\n本土个案昨增三八二例再创新高，确诊案例遍及十九县市，境外增一四九例，单日破五百案例。\n\n中央疫情指挥中心指挥官陈时中表示，个案数还会往上升，预估一至两个月内达最高峰，疫情将持续到六月底且还不会到尾声；他也首度松口「清零不可能」，未来将走向与病毒共存，下周拟试办「轻症在家」隔离，同时也将调整停课标准。\n\n拟以居家快筛取代停课\n国内确诊个案上升，全国累积十四县市共一三九所校园停课，各县市对毕业旅行、户外教学是否取消标准不一，教师团体认为政府应该明确表态，否则会造成校园恐慌或影响学生受教权。陈时中表示，未来必然走向与病毒共存，下周将与教育部检讨停课标准，缩小匡列范围，并在合理范围以居家快筛来取代停课。\n\n疫情一至两个月达高峰\n指挥中心比照韩国、纽西兰及香港疫情发展，推估未来一至两个月确诊案例将飙至最高峰，陈时中表示，目前Omicron确诊数仍算低，但规模难以预估，未来单日恐超过一千五百例，届时致死率、个案数飙高或疫情高峰下不来，社会将承担不起，因此仍须积极因应，朝「缓坡上升」方向努力。\n\n轻症在家指引至今没谱\n总统蔡英文日前宣布防疫以「减灾」为目标避免医疗量能超载，「轻症在家」隔离为其中配套，单日确诊数若达一千五百人将启动。指挥中心下周将拟定「轻症在家照护指引」，如设医疗远距平台、送药、户政及警政系统联系、关怀中心运作等，若未遵守隔离规定将有罚则，并因应地方疫情升温和医疗量能吃紧，将从新北试办。\n\n立委赖惠员、蒋万安昨于卫环委员会质询时，质疑各地方早已喊话希望指挥中心相关指引快出来，但至今连社区沟通、征求试办的地方政府意愿统统没有谱。\n\n台北市长柯文哲昨晚也在脸书表示「北市防疫旅馆量能告急」，因每天确诊人数不断攀升，近期有近万名移工入境，几乎把北市的防疫旅馆量能占满。北市已发出征召令，征用加强版防疫专责旅馆，让轻症、低危险确诊者入住。\n\n总统府、监察院传确诊者\n本土疫情多点爆发，公务机关包括总统府、监察院、台北市议会都传出有确诊者，北中南八大行业确诊人数也骤增，未来场所是否新增禁令，指挥中心将再讨论。\n\n境外移入昨增一四九例个案，有七十八例为航班落地采检阳性，越南有十七例居冠。指挥中心表示，目前落地采检阳性率约落在四％到五％间，即日起越南航空、越捷航空及越竹航空等三航空公司班机，增加「搭机前六小时内抗原快筛报告」才可入境。\n

In [None]:
# Start by turning a text into a spaCy Doc object
# sim_zh_doc = zh(sim_zh_input)

In [None]:
#===Write your code below and save the output as `sim_zh_toks`.===#


# sim_zh_toks = 

['对',
 '诊者',
 '议会',
 '一',
 '再',
 '也',
 '超载',
 '日确',
 '场所',
 '否则',
 '越南',
 '会',
 '台北',
 '其中',
 '脸书',
 '最高峰',
 '联系',
 '政府',
 '\n\n',
 '造成',
 '高峰',
 '社区',
 '缓坡',
 '例为',
 '可',
 '所',
 '六月',
 '未',
 '应该',
 '把',
 '表示',
 '须',
 '关怀',
 '下',
 '攀升',
 '避免',
 '昨增',
 '高',
 '一三九',
 '教学',
 '骤增',
 '居冠',
 '旅行',
 '月',
 '可能',
 '预估',
 '统统',
 '越竹',
 '与',
 '但',
 '届时',
 '防疫',
 '早已',
 '难以',
 '增加',
 '将',
 '轻症',
 '抗原',
 '万',
 '来',
 '承担',
 '公务',
 '十七',
 '危险',
 '疫情',
 '是否',
 '因',
 '持续',
 '旅馆',
 '传出',
 '昨晚',
 '地方',
 'Omicron确',
 '讨论',
 '推估',
 '指挥官',
 '送药',
 '运作',
 '超过',
 '出来',
 '走向',
 '征召令',
 '蒋万安',
 '停课',
 '市长',
 '匡列',
 '相关',
 '同时',
 '班机',
 '共存',
 '创新',
 '飙高',
 '及',
 '周拟',
 '检讨',
 '仍',
 '到',
 '陈时',
 '还',
 '平台',
 '设',
 '至',
 '系统',
 '名',
 '合理',
 '如',
 '内',
 '移工',
 '都',
 '遵守',
 '必然',
 '让',
 '几乎',
 '确诊',
 '积极',
 '居家',
 '一四九例',
 '赖惠员',
 '目标',
 '纽西兰',
 '三',
 '升温',
 '点',
 '达',
 '越捷',
 '个',
 '新增',
 '昨于',
 '爆发',
 '即',
 '恐慌',
 '前',
 '认为',
 '报告',
 '柯文哲',
 '时',
 '度',
 '飙至',
 '标准',
 '低',
 '规模',
 '航空',
 '学生',
 '台北市',
 '他',
 '日起',

Evaluate whether `zh_toks` is equal to `sim_zh_toks`.

In [None]:
# zh_toks == sim_zh_toks

False
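
A plain `==` between the two token lists is bound to be `False`, because the characters themselves differ between the two scripts. One way to compare the segmentation rather than the characters is to compare token boundaries. The sketch below assumes that the Traditional and Simplified texts have the same number of characters and that the tokens cover the text without gaps. `token_spans` and the two short token lists are hypothetical examples, not output from the models.

```python
def token_spans(tokens):
    """Return the set of (start, end) character offsets of consecutive tokens."""
    spans, start = set(), 0
    for tok in tokens:
        end = start + len(tok)
        spans.add((start, end))
        start = end
    return spans


# Hypothetical segmentations of the same five-character phrase.
trad_tokens = ["陳時", "中", "表示"]
simp_tokens = ["陈时中", "表示"]

trad_spans = token_spans(trad_tokens)
simp_spans = token_spans(simp_tokens)

shared = trad_spans & simp_spans     # boundaries both segmentations agree on
only_trad = trad_spans - simp_spans  # tokens found only in the Traditional run
only_simp = simp_spans - trad_spans  # tokens found only in the Simplified run
print(sorted(shared), sorted(only_trad), sorted(only_simp))
```

With real `Doc` objects, the offsets could instead be taken from each token as `(tok.idx, tok.idx + len(tok))`, which avoids the no-gaps assumption.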

In [None]:
#===Write your code below and save the output as `sim_zh_ents`.===#


# sim_zh_ents =

[('三八二例', 'CARDINAL'),
 ('十九', 'CARDINAL'),
 ('五百', 'CARDINAL'),
 ('陈时中', 'PERSON'),
 ('一至两个月', 'DATE'),
 ('内达', 'GPE'),
 ('六月', 'DATE'),
 ('底且', 'PERSON'),
 ('周拟', 'PERSON'),
 ('十四县', 'CARDINAL'),
 ('陈时', 'PERSON'),
 ('下周', 'DATE'),
 ('教育部', 'ORG'),
 ('韩国', 'GPE'),
 ('纽西兰', 'GPE'),
 ('香港', 'GPE'),
 ('一千五百例', 'CARDINAL'),
 ('蔡英文', 'PERSON'),
 ('一千五百人', 'CARDINAL'),
 ('新北', 'GPE'),
 ('赖惠员', 'PERSON'),
 ('蒋万安昨于卫环委员会', 'ORG'),
 ('台北', 'GPE'),
 ('柯文哲', 'PERSON'),
 ('昨晚', 'TIME'),
 ('北市', 'GPE'),
 ('旅馆量', 'GPE'),
 ('近万', 'CARDINAL'),
 ('监察院', 'ORG'),
 ('总统府', 'ORG'),
 ('台北市议会', 'ORG'),
 ('确诊者', 'FAC'),
 ('中南', 'LOC'),
 ('一四九', 'CARDINAL'),
 ('七十八', 'CARDINAL'),
 ('越南', 'GPE'),
 ('十七', 'CARDINAL'),
 ('约落', 'PERSON'),
 ('四％到', 'DATE'),
 ('五％', 'CARDINAL'),
 ('越南航空、越捷航空', 'ORG'),
 ('越竹航空', 'ORG'),
 ('六小时', 'TIME')]

Evaluate whether `zh_ents` is equal to `sim_zh_ents`.

In [None]:
# zh_ents == sim_zh_ents

False