# NLP: Applications

<table align="left">
  <td>
    <a href="https://colab.research.google.com/github/phonchi/ModularPython/blob/master/NLP-use-pretrained-models.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>
  </td>
  <td style="text-align: center">
    <a href="https://github.com/phonchi/ModularPython/blob/master/NLP-use-pretrained-models.ipynb">
      <img src="https://cloud.google.com/ml-engine/images/github-logo-32px.png" alt="GitHub logo"><br> View on GitHub
    </a>
  </td>
</table>

This notebook is adapted by [Haowen Jiang](https://howard-haowen.rohan.tw/) from [this one](https://github.com/nlptown/nlp-notebooks/blob/master/NLP%20with%20pretrained%20models%20-%20spaCy%20and%20StanfordNLP.ipynb) included in the [nlptown/nlp-notebooks](https://github.com/nlptown/nlp-notebooks) repo. It is meant for the 2022 [NLP Workshop at NSYSU](https://howard-haowen.rohan.tw/NLP-demos/nsysu_workshop).

In [None]:
from datetime import date

today = date.today()
print("Last updated:", today)

Last updated: 2024-06-11


# 📘 NLP with pretrained models - spaCy

In [3]:
# @title spaCy Installation { display-mode: "form" }

INSTALL = True # @param {type:"boolean"}

if INSTALL:
    #!pip install -U pip setuptools wheel -qq
    #!pip install -U spacy -qq
    !python -m spacy download en_core_web_md -qq # downloads the medium-sized English language model
    !python -m spacy download zh_core_web_md -qq # downloads the medium-sized Chinese language model

Collecting en-core-web-md==3.7.1
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_md-3.7.1/en_core_web_md-3.7.1-py3-none-any.whl (42.8 MB)
Installing collected packages: en-core-web-md
Successfully installed en-core-web-md-3.7.1
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_md')
⚠ Restart to reload dependencies
If you are in a Jupyter or Colab notebook, you may need to restart Python in
order to load all the package's dependencies. You can do this by selecting the
'Restart kernel' or 'Restart runtime' option.
Collecting zh-core-web-md==3.7.0
  Downloading https://github.com/explosion/spacy-models/releases/download/zh_core_web_md-3.7.0/zh_core_web_md-3.7.0-py3-none-any.whl (

![](https://spacy.io/pipeline-fde48da9b43661abcdf62ab70a546d71.svg)

In [1]:
import spacy
import pandas as pd

In [2]:
spacy.info()

{'spacy_version': '3.7.5',
 'location': '/usr/local/lib/python3.10/dist-packages/spacy',
 'platform': 'Linux-6.1.85+-x86_64-with-glibc2.35',
 'python_version': '3.10.12',
 'pipelines': {'en_core_web_md': '3.7.1',
  'zh_core_web_md': '3.7.0',
  'en_core_web_sm': '3.7.1'}}

- To get you started, play with [this Web App](https://share.streamlit.io/howard-haowen/spacy-streamlit/app.py) that I created, which is powered by spaCy.

## English NLP

In [3]:
en = spacy.load("en_core_web_md") # Loading the spaCy Model which includes vocabulary, syntax models, and entities.
df_metadata = pd.DataFrame([en.meta])
df_metadata.T

Unnamed: 0,0
lang,en
name,core_web_md
version,3.7.1
description,English pipeline optimized for CPU. Components...
author,Explosion
email,contact@explosion.ai
url,https://explosion.ai
license,MIT
spacy_version,">=3.7.2,<3.8.0"
spacy_git_version,bd2c17e20


In [4]:
text = ("Donald John Trump (born June 14, 1946) is the 45th and previous president of "
     "the United States.  Before entering politics, he was a businessman and television personality.")
print(text)

Donald John Trump (born June 14, 1946) is the 45th and previous president of the United States.  Before entering politics, he was a businessman and television personality.


Here, the spaCy pipeline processes the text about Donald Trump and returns a `Doc` object, `doc_en`. A `Doc` is a sequence of `Token` objects, and it holds all the information spaCy has extracted about the text's structure and content.

In [5]:
doc_en = en(text)

In [6]:
tokens = [token.text for token in doc_en]
print(tokens)

['Donald', 'John', 'Trump', '(', 'born', 'June', '14', ',', '1946', ')', 'is', 'the', '45th', 'and', 'previous', 'president', 'of', 'the', 'United', 'States', '.', ' ', 'Before', 'entering', 'politics', ',', 'he', 'was', 'a', 'businessman', 'and', 'television', 'personality', '.']


spaCy also splits the document into sentences. The `.sents` property of the `Doc` object yields them one by one.

In [8]:
sentences = list(doc_en.sents)
len(sentences), sentences

(2,
 [Donald John Trump (born June 14, 1946) is the 45th and previous president of the United States.  ,
  Before entering politics, he was a businessman and television personality.])

### Part-of-Speech tagging

In addition, spaCy identifies a variety of linguistic features for each token. Among the foundational features are the lemma and two types of parts-of-speech (POS) tags. The `pos_` attribute encompasses the [Universal POS tags](https://universaldependencies.org/u/pos/) derived from the [Universal Dependencies](https://universaldependencies.org/) framework, which provide a consistent categorization of word types across languages. On the other hand, the `tag_` attribute offers more detailed, language-specific POS tags that capture finer grammatical distinctions.

> Part of speech or POS is a grammatical role that explains how a particular word is used in a sentence.

In [9]:
# orthographic representation, lemma, coarse-grained part-of-speech (pos_), and fine-grained part-of-speech (tag_).
features = [
    {'Text': token.orth_, 'Lemma': token.lemma_, 'POS': token.pos_, 'Detailed POS': token.tag_, 'Explain': spacy.explain(token.tag_)}
    for token in doc_en
]

df_features = pd.DataFrame(features)
df_features

Unnamed: 0,Text,Lemma,POS,Detailed POS,Explain
0,Donald,Donald,PROPN,NNP,"noun, proper singular"
1,John,John,PROPN,NNP,"noun, proper singular"
2,Trump,Trump,PROPN,NNP,"noun, proper singular"
3,(,(,PUNCT,-LRB-,left round bracket
4,born,bear,VERB,VBN,"verb, past participle"
5,June,June,PROPN,NNP,"noun, proper singular"
6,14,14,NUM,CD,cardinal number
7,",",",",PUNCT,",","punctuation mark, comma"
8,1946,1946,NUM,CD,cardinal number
9,),),PUNCT,-RRB-,right round bracket
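
To make the relationship between the two tag sets concrete, the sketch below groups the fine-grained tags from the table above under their coarse-grained counterparts. The rows are copied by hand from that output, and `rows` and `fine_by_coarse` are illustrative names rather than anything provided by spaCy.

```python
from collections import defaultdict

# (text, coarse POS, fine-grained tag), copied from the first ten rows above
rows = [
    ("Donald", "PROPN", "NNP"), ("John", "PROPN", "NNP"), ("Trump", "PROPN", "NNP"),
    ("(", "PUNCT", "-LRB-"), ("born", "VERB", "VBN"), ("June", "PROPN", "NNP"),
    ("14", "NUM", "CD"), (",", "PUNCT", ","), ("1946", "NUM", "CD"), (")", "PUNCT", "-RRB-"),
]

# Collect every fine-grained tag observed under each coarse-grained tag.
fine_by_coarse = defaultdict(set)
for _text, coarse, fine in rows:
    fine_by_coarse[coarse].add(fine)

for coarse in sorted(fine_by_coarse):
    print(coarse, sorted(fine_by_coarse[coarse]))
```

Even in this small sample, the single coarse tag `PUNCT` covers three different fine-grained tags, which is the kind of extra detail `tag_` is meant to preserve.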


### Named-Entity Recognition

Next, spaCy includes pre-trained models for named entity recognition (NER). Their predictions are exposed through the `ent_iob_` and `ent_type_` attributes. The `ent_type_` attribute gives the category of entity identified by the model, such as a person, date, ordinal number, or geopolitical entity (GPE). For instance, in English models that follow the [OntoNotes standard](https://catalog.ldc.upenn.edu/docs/LDC2013T19/OntoNotes-Release-5.0.pdf), "Donald John Trump" is recognized as a person, "June 14, 1946" as a date, "45th" as an ordinal number, and "the United States" as a GPE.

The `ent_iob_` attribute (inside-outside-beginning (IOB) tagging) indicates the token's position within an entity: `O` for outside any entity, `B` for the beginning of an entity, and `I` for inside an entity (but not at the beginning). This notation is part of the `BIO` tagging scheme, which helps differentiate between consecutive entities of the same type.

> Other schemes like `BILUO` include additional designations for the last token of an entity and for unique, standalone entity tokens, providing detailed positional information within entity sequences.
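
As an illustration of how the two schemes relate, here is a minimal pure-Python sketch that rewrites IOB tags as BILUO tags. The function name `iob_to_biluo` and the `(iob, entity_type)` input format are assumptions made for this example; it is not meant to reproduce spaCy's internal implementation.

```python
def iob_to_biluo(tags):
    """Convert (iob, entity_type) pairs into BILUO-style string tags."""
    out = []
    for i, (iob, ent_type) in enumerate(tags):
        if iob == "O":
            out.append("O")
            continue
        # The entity ends here unless the next token continues it with "I".
        next_iob = tags[i + 1][0] if i + 1 < len(tags) else "O"
        ends_here = next_iob != "I"
        if iob == "B":
            prefix = "U" if ends_here else "B"   # U = single-token (unit) entity
        else:  # iob == "I"
            prefix = "L" if ends_here else "I"   # L = last token of an entity
        out.append(f"{prefix}-{ent_type}")
    return out


# A three-token PERSON entity, a gap, and a one-token ORDINAL entity
tags = [("B", "PERSON"), ("I", "PERSON"), ("I", "PERSON"),
        ("O", ""), ("B", "ORDINAL"), ("O", "")]
biluo = iob_to_biluo(tags)
print(biluo)
```

The only information BILUO adds is whether a token is the last one of its entity, which is why the conversion needs nothing more than a look at the following tag.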

In [14]:
# Extracting named entity information from each token in the document
entities = [
    {'Text': token.orth_, 'IOB Tag': token.ent_iob_, 'Entity Type': token.ent_type_, 'Explain': spacy.explain(token.ent_type_)}  # attributes ending in an underscore return strings
    for token in doc_en  # Iterate over each token
]

df_entities = pd.DataFrame(entities)
df_entities



Unnamed: 0,Text,IOB Tag,Entity Type,Explain
0,Donald,B,PERSON,"People, including fictional"
1,John,I,PERSON,"People, including fictional"
2,Trump,I,PERSON,"People, including fictional"
3,(,O,,
4,born,O,,
5,June,B,DATE,Absolute or relative dates or periods
6,14,I,DATE,Absolute or relative dates or periods
7,",",I,DATE,Absolute or relative dates or periods
8,1946,I,DATE,Absolute or relative dates or periods
9,),O,,


You can also access the entities directly on the `ents` attribute of the document:

In [11]:
print([(ent.text, ent.label_) for ent in doc_en.ents])

[('Donald John Trump', 'PERSON'), ('June 14, 1946', 'DATE'), ('45th', 'ORDINAL'), ('the United States', 'GPE')]
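
The entity spans above carry the same information as the token-level IOB tags. The sketch below shows one way to merge tagged tokens back into spans in plain Python. The helper `spans_from_iob` is hypothetical, and the trailing-whitespace values are filled in by hand from the original sentence (in spaCy they would correspond to `token.whitespace_`).

```python
def spans_from_iob(tokens):
    """Merge (text, trailing_ws, iob, ent_type) tuples into (entity_text, label) pairs."""
    spans = []
    pieces, label = [], None  # text pieces of the entity currently being built

    def flush():
        nonlocal pieces, label
        if pieces:
            # Drop the whitespace that followed the entity's last token.
            spans.append(("".join(pieces).rstrip(), label))
        pieces, label = [], None

    for text, ws, iob, ent_type in tokens:
        if iob == "B":                 # a new entity starts here
            flush()
            pieces, label = [text + ws], ent_type
        elif iob == "I" and pieces:    # the current entity continues
            pieces.append(text + ws)
        else:                          # outside any entity
            flush()
    flush()
    return spans


# First ten tokens of the example sentence, with hand-written whitespace
tokens = [
    ("Donald", " ", "B", "PERSON"), ("John", " ", "I", "PERSON"), ("Trump", " ", "I", "PERSON"),
    ("(", "", "O", ""), ("born", " ", "O", ""),
    ("June", " ", "B", "DATE"), ("14", "", "I", "DATE"), (",", " ", "I", "DATE"),
    ("1946", "", "I", "DATE"), (")", " ", "O", ""),
]
spans = spans_from_iob(tokens)
print(spans)
```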


### Dependency Parsing

spaCy also contains a dependency parser, which analyzes the grammatical relations between the tokens.

> Dependency parsing extracts the dependency graph of a sentence to represent its grammatical structure. Each dependency links a head word to one of its dependents. The head of the whole sentence has no head of its own and is called the root; it is usually the main verb. Every other word is attached, directly or indirectly, to the root. The dependencies can be drawn as a directed graph in which words are the nodes and grammatical relations are the edges.

In [15]:
# Extracting syntax or dependency parsing information from each token
syntax = [
    {'Token': token.text, 'Dependency': token.dep_, 'Head': token.head.text}
    for token in doc_en  # Iterate over each token in the document
]

df_syntax = pd.DataFrame(syntax)
df_syntax

Unnamed: 0,Token,Dependency,Head
0,Donald,compound,Trump
1,John,compound,Trump
2,Trump,nsubj,is
3,(,punct,Trump
4,born,acl,Trump
5,June,npadvmod,born
6,14,nummod,June
7,",",punct,June
8,1946,nummod,June
9,),punct,Trump
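
The head/dependent pairs in the table above can be read as edges of a directed graph. The following sketch builds that graph from the ten rows shown and prints it as an indented tree. It is only an illustration: the rows are copied by hand, and nodes are keyed by token text, which works here because every string in this fragment is unique (real code would key on `token.i` instead).

```python
from collections import defaultdict

# (token, dependency label, head), copied from the table above
edges = [
    ("Donald", "compound", "Trump"), ("John", "compound", "Trump"),
    ("Trump", "nsubj", "is"), ("(", "punct", "Trump"), ("born", "acl", "Trump"),
    ("June", "npadvmod", "born"), ("14", "nummod", "June"),
    (",", "punct", "June"), ("1946", "nummod", "June"), (")", "punct", "Trump"),
]

# Map each head to its list of (dependent, label) pairs.
children = defaultdict(list)
for token, dep, head in edges:
    children[head].append((token, dep))


def show(node, depth=0):
    """Print every edge in the subtree under `node`, indented by depth."""
    for token, dep in children.get(node, []):
        print("  " * depth + f"{node} --{dep}--> {token}")
        show(token, depth + 1)


show("is")  # "is" is the head of "Trump" and the top node of this fragment
```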


Finally, the English spaCy model contains a morphological parser.

In [16]:
# Extracting morphological features from each token in the document
features = [
    {'Token': token.text, 'Morphological Features': token.morph}
    for token in doc_en  # Iterate over each token
]

df_features = pd.DataFrame(features)
df_features

Unnamed: 0,Token,Morphological Features
0,Donald,(Number=Sing)
1,John,(Number=Sing)
2,Trump,(Number=Sing)
3,(,"(PunctSide=Ini, PunctType=Brck)"
4,born,"(Aspect=Perf, Tense=Past, VerbForm=Part)"
5,June,(Number=Sing)
6,14,(NumType=Card)
7,",",(PunctType=Comm)
8,1946,(NumType=Card)
9,),"(PunctSide=Fin, PunctType=Brck)"


## Multilingual NLP

spaCy provides models not only for English but also for many other languages.

In [17]:
zh = spacy.load("zh_core_web_md")
df_metadata = pd.DataFrame([zh.meta])  # inspect the Chinese pipeline's metadata (the original cell passed `en.meta`)
df_metadata.T


In [18]:
text_zh = "中山大學人文暨科技跨領域學士學位學程助理教授宋世祥表示，2021年聖誕節假期期間，師生舉辦「街頭玩童～鹽埕兒童街區遊戲日」成果展。活動中可看見學生運用贊助單位瑞儀教育基金會致贈的廢棄木棧板，製作了6具兒童創意遊具，一方面展示學習成果，也希望藉此呼籲高雄民眾重視兒童的遊戲權。"
doc_zh = zh(text_zh)

Tokens in the Chinese document expose the same attributes as those in the English document. However, the functionality of the pipelines can vary significantly between languages. One key difference concerns lemmatization:

- **No lemmatizer in the Chinese pipeline**: Unlike the English pipeline, the Chinese pipeline does not include a lemmatizer, so it offers no base form distinct from the surface form. What remains available is the surface form itself, through the `text` attribute or the equivalent `orth_` attribute, which always hold the same verbatim string.

This distinction is important to consider when performing text processing tasks, as it affects the depth of linguistic analysis available for each language.

In [19]:
list(doc_zh.sents)

[中山大學人文暨科技跨領域學士學位學程助理教授宋世祥表示，2021年聖誕節假期期間，師生舉辦「街頭玩童～鹽埕兒童街區遊戲日」成果展。,
 活動中可看見學生運用贊助單位瑞儀教育基金會致贈的廢棄木棧板，製作了6具兒童創意遊具，一方面展示學習成果，也希望藉此呼籲高雄民眾重視兒童的遊戲權。]

In [20]:
tok_text = [tok.text for tok in doc_zh]
tok_orth = [tok.orth_ for tok in doc_zh]
print(tok_text)
print(tok_orth)

['中山', '大學', '人文', '暨', '科技', '跨領域', '學士', '學位', '學程', '助理', '教授', '宋世祥', '表示', '，', '2021年', '聖誕', '節', '假期', '期間', '，', '師生', '舉辦', '「', '街頭', '玩童', '～鹽', '埕兒', '童街', '區遊', '戲日', '」', '成果', '展', '。', '活動', '中', '可', '看見', '學生', '運用', '贊助', '單位', '瑞儀', '教育', '基金會', '致贈', '的', '廢棄木', '棧板', '，', '製作', '了', '6', '具', '兒童', '創意', '遊具', '，', '一方面', '展示', '學習', '成果', '，', '也', '希望', '藉此', '呼籲', '高雄', '民眾', '重視', '兒童', '的', '遊戲權', '。']
['中山', '大學', '人文', '暨', '科技', '跨領域', '學士', '學位', '學程', '助理', '教授', '宋世祥', '表示', '，', '2021年', '聖誕', '節', '假期', '期間', '，', '師生', '舉辦', '「', '街頭', '玩童', '～鹽', '埕兒', '童街', '區遊', '戲日', '」', '

In [21]:
for tok in list(doc_zh.sents)[1]: # The second sentence
    print(f"{tok.text} >>> {tok.pos_}")

活動 >>> NOUN
中 >>> PART
可 >>> VERB
看見 >>> VERB
學生 >>> NOUN
運用 >>> VERB
贊助 >>> VERB
單位 >>> NOUN
瑞儀 >>> PROPN
教育 >>> NOUN
基金會 >>> NOUN
致贈 >>> VERB
的 >>> PART
廢棄木 >>> NOUN
棧板 >>> NOUN
， >>> PUNCT
製作 >>> VERB
了 >>> PART
6 >>> NUM
具 >>> NUM
兒童 >>> NOUN
創意 >>> ADJ
遊具 >>> NOUN
， >>> PUNCT
一方面 >>> ADV
展示 >>> VERB
學習 >>> ADJ
成果 >>> NOUN
， >>> PUNCT
也 >>> ADV
希望 >>> VERB
藉此 >>> ADV
呼籲 >>> VERB
高雄 >>> PROPN
民眾 >>> NOUN
重視 >>> VERB
兒童 >>> NOUN
的 >>> PART
遊戲權 >>> NOUN
。 >>> PUNCT


- The Chinese model uses a very different set of fine-grained part-of-speech tags on the `tag_` attribute.

In [22]:
# Printing each token's text, detailed POS tag, and an explanation of the tag.
for tok in list(doc_zh.sents)[1]:
    print(f"{tok.text} >>> {tok.tag_} | {spacy.explain(tok.tag_)}")

活動 >>> NN | noun, singular or mass
中 >>> LC | localizer
可 >>> VV | other verb
看見 >>> VV | other verb
學生 >>> NN | noun, singular or mass
運用 >>> VV | other verb
贊助 >>> VV | other verb
單位 >>> NN | noun, singular or mass
瑞儀 >>> NR | proper noun
教育 >>> NN | noun, singular or mass
基金會 >>> NN | noun, singular or mass
致贈 >>> VV | other verb
的 >>> DEC | 的 in a relative clause
廢棄木 >>> NN | noun, singular or mass
棧板 >>> NN | noun, singular or mass
， >>> PU | punctuation
製作 >>> VV | other verb
了 >>> AS | aspect marker
6 >>> CD | cardinal number
具 >>> M | measure word
兒童 >>> NN | noun, singular or mass
創意 >>> JJ | adjective (English), other noun-modifier (Chinese)
遊具 >>> NN | noun, singular or mass
， >>> PU | punctuation
一方面 >>> AD | adverb
展示 >>> VV | other verb
學習 >>> JJ | adjective (English), other noun-modifier (Chinese)
成果 >>> NN | noun, singular or mass
， >>> PU | punctuation
也 >>> AD | a

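- The set of entity types depends on the model. Some spaCy pipelines use a small label set such as PER, LOC and ORG, while OntoNotes-based pipelines such as the English one use finer labels like PERSON and GPE. The labels of a loaded pipeline can be listed with `zh.get_pipe("ner").labels`.

This is a result of the training corpora that were used to build the models, whose annotation guidelines may be very different.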

In [23]:
info = [(t.text, t.pos_, t.tag_, t.ent_iob_, t.ent_type_) for t in doc_zh]
df_info = pd.DataFrame(info, columns=['Text', 'POS', 'Tag', 'IOB Tag', 'Entity Type'])
df_info

Unnamed: 0,Text,POS,Tag,IOB Tag,Entity Type
0,中山,PROPN,NR,B,ORG
1,大學,NOUN,NN,I,ORG
2,人文,NOUN,NN,I,ORG
3,暨,CCONJ,CC,I,ORG
4,科技,NOUN,NN,I,ORG
...,...,...,...,...,...
69,重視,VERB,VV,O,
70,兒童,NOUN,NN,O,
71,的,PART,DEG,O,
72,遊戲權,NOUN,NN,O,


## Visualization

In [24]:
from spacy import displacy

In [25]:
displacy.render(doc_zh, style='ent',jupyter=True, options={'distance':130})

In [26]:
text = "我想要三份2號餐"
doc = zh(text)
displacy.render(doc, style='dep',jupyter=True, options={'distance':130})

## DataFrame + spaCy = dframcy

In [28]:
# @title dframcy Installation { display-mode: "form" }

INSTALL = True # @param {type:"boolean"}

if INSTALL:
    !pip install dframcy -qq


In [29]:
from dframcy import DframCy

In [30]:
nlp = spacy.load('zh_core_web_md')
# Initialize DframCy with the spaCy NLP model to integrate with pandas DataFrame.
dframcy = DframCy(nlp)
# Process the Chinese text using the NLP model to create a spaCy document.
doc = dframcy.nlp(text_zh)
# Convert the NLP document annotations to a pandas DataFrame for easier analysis.
annotation_dataframe = dframcy.to_dataframe(doc)
annotation_dataframe

Unnamed: 0,token_text,token_start,token_end,token_pos_,token_tag_,token_dep_,token_head,token_ent_type_
0,中山,0,2,PROPN,NR,compound:nn,大學,ORG
1,大學,2,4,NOUN,NN,nmod:assmod,跨領域,ORG
2,人文,4,6,NOUN,NN,conj,跨領域,ORG
3,暨,6,7,CCONJ,CC,cc,跨領域,ORG
4,科技,7,9,NOUN,NN,compound:nn,跨領域,ORG
...,...,...,...,...,...,...,...,...
69,重視,128,130,VERB,VV,ccomp,呼籲,
70,兒童,130,132,NOUN,NN,nmod:assmod,遊戲權,
71,的,132,133,PART,DEG,case,兒童,
72,遊戲權,133,136,NOUN,NN,dobj,重視,


Once annotations are stored as a DataFrame object, filtering can be easily done by leveraging the power of `pandas` syntax.

In [31]:
# Create a filter for rows where the part-of-speech tag is 'NN' (noun).
nn_filt = annotation_dataframe['token_tag_'] == 'NN'
# Create a filter for rows where the dependency label is 'dobj' (direct object).
dobj_filt = annotation_dataframe['token_dep_'] == 'dobj'
# Get rows where the token is a noun and serves as a direct object.
annotation_dataframe[(nn_filt) & dobj_filt]

Unnamed: 0,token_text,token_start,token_end,token_pos_,token_tag_,token_dep_,token_head,token_ent_type_
32,展,63,64,NOUN,NN,dobj,舉辦,
48,棧板,92,94,NOUN,NN,dobj,看見,
56,遊具,104,106,NOUN,NN,dobj,製作,
61,成果,114,116,NOUN,NN,dobj,展示,
72,遊戲權,133,136,NOUN,NN,dobj,重視,


## Vectors

In [32]:
doc = zh("教授")
tok = doc[0]
tok.vector

array([ 2.2328  , -1.1713  , -3.3528  , -1.1691  , -0.26724 ,  4.4476  ,
       -0.66089 ,  2.6248  , -1.5367  , -2.8449  , -4.0233  ,  1.5727  ,
        1.978   ,  2.7964  ,  1.003   ,  0.29978 ,  0.056525,  3.7048  ,
        2.0446  ,  2.2452  , -5.7184  ,  0.77814 , -1.8383  , -0.017231,
       -1.91    , -6.4355  , -4.6737  , -0.13519 ,  0.66087 , -1.6718  ,
        3.5934  ,  2.3382  , -4.5406  ,  1.6124  , -2.2361  , -6.0387  ,
       -3.4078  ,  1.1304  ,  0.80933 ,  1.9734  ,  2.3314  , -0.9882  ,
       -1.1947  ,  2.2628  , -1.3687  , -6.4278  ,  0.15906 ,  0.047335,
       -2.8157  , -1.6407  ,  2.4385  , -0.84336 ,  3.081   ,  5.9188  ,
       -1.3019  ,  1.2971  ,  7.2325  ,  2.9722  , -0.45552 ,  1.5148  ,
       -1.1193  ,  3.8739  ,  1.482   , -2.4657  ,  1.4627  , -3.562   ,
       -2.1737  , -1.4306  ,  3.4363  , -1.2796  , -1.4106  ,  2.2146  ,
        2.9325  , -2.5172  ,  2.7192  , -0.84556 , -2.5362  ,  2.2079  ,
       -3.2217  , -2.2081  ,  4.6204  ,  0.98445 , 

In [33]:
tok.vector.shape

(300,)

In [34]:
word_1 = nlp.vocab["高興"]
word_2 = nlp.vocab["高雄"]
word_3 = nlp.vocab["開心"]
# `similarity` returns a similarity score (higher means more alike), not a distance.
word_1_word_2 = word_1.similarity(word_2)
word_1_word_3 = word_1.similarity(word_3)
print(f"Similarity between word 1 and 2: {word_1_word_2}")
print(f"Similarity between word 1 and 3: {word_1_word_3}")

Similarity between word 1 and 2: 0.27085748314857483
Similarity between word 1 and 3: 0.8141297101974487


- Cosine similarity

![](https://datascience-enthusiast.com/figures/cosine_sim.png)

- Formula for calculating cosine similarity between two vectors

![](https://miro.medium.com/max/1400/1*LfW66-WsYkFqWc4XYJbEJg.png)
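
Written out, the cosine similarity of two vectors $A$ and $B$ is $\cos(\theta) = \frac{A \cdot B}{\lVert A \rVert \, \lVert B \rVert}$: the dot product divided by the product of the two vector lengths. The sketch below computes it in plain Python on toy two-dimensional vectors. The function name `cosine_similarity` is an illustrative choice, and the code assumes neither vector is all zeros (otherwise the denominator is zero).

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


same = cosine_similarity([1.0, 0.0], [2.0, 0.0])        # same direction
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 3.0])  # perpendicular
opposite = cosine_similarity([1.0, 2.0], [-1.0, -2.0])  # opposite direction
print(same, orthogonal, opposite)
```

Because the score depends only on direction, vectors of different lengths that point the same way still get the maximum score, which is why the measure suits word vectors whose magnitudes vary.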

## 🔍 Supplementary: StanfordNLP

Another library that shares some functionality with spaCy is StanfordNLP. [StanfordNLP](https://stanfordnlp.github.io/stanfordnlp/), distinct from Stanford's Java-based [CoreNLP](https://stanfordnlp.github.io/CoreNLP/) library, is a [Python library](https://github.com/stanfordnlp/stanfordnlp) built on PyTorch. It provides a fully neural NLP pipeline covering tokenization (capable of recognizing multi-word units), lemmatization, part-of-speech tagging (incorporating morphological features), and dependency parsing. These components were designed and trained for the [CoNLL-2018 shared task](https://nlp.stanford.edu/pubs/qi2018universal.pdf). StanfordNLP does not include named entity recognition, but it performs strongly on dependency parsing and also offers a Python interface to CoreNLP, which makes it easier to integrate into Python projects.


> **`stanfordnlp` has been renamed to `stanza`.**

In [None]:
# @title stanza Installation { display-mode: "form" }

INSTALL = True # @param {type:"boolean"}

if INSTALL:
    !pip install stanza -qq



In [None]:
import stanza

In [None]:
stanza.download("zh-hant") # Download the traditional Chinese model for Stanza.

Downloading https://raw.githubusercontent.com/stanfordnlp/stanza-resources/main/resources_1.8.0.json:   0%|   …

INFO:stanza:Downloaded file to /root/stanza_resources/resources.json
INFO:stanza:Downloading default packages for language: zh-hant (Traditional_Chinese) ...


Downloading https://huggingface.co/stanfordnlp/stanza-zh-hant/resolve/v1.8.0/models/default.zip:   0%|        …

INFO:stanza:Downloaded file to /root/stanza_resources/zh-hant/default.zip
INFO:stanza:Finished downloading models and saved to /root/stanza_resources


In [None]:
stf_nlp = stanza.Pipeline('zh-hant') # Initialize the Stanza pipeline for traditional Chinese to handle various NLP tasks.

INFO:stanza:Checking for updates to resources.json in case models have been updated.  Note: this behavior can be turned off with download_method=None or download_method=DownloadMethod.REUSE_RESOURCES


Downloading https://raw.githubusercontent.com/stanfordnlp/stanza-resources/main/resources_1.8.0.json:   0%|   …

INFO:stanza:Downloaded file to /root/stanza_resources/resources.json
INFO:stanza:Loading these models for language: zh-hant (Traditional_Chinese):
| Processor | Package      |
----------------------------
| tokenize  | gsd          |
| pos       | gsd_nocharlm |
| lemma     | gsd_nocharlm |
| depparse  | gsd_nocharlm |

INFO:stanza:Using device: cpu
INFO:stanza:Loading: tokenize
INFO:stanza:Loading: pos
INFO:stanza:Loading: lemma
INFO:stanza:Loading: depparse
INFO:stanza:Done loading processors!


In [None]:
text_zh = "中山大學人文暨科技跨領域學士學位學程助理教授宋世祥表示，2021年聖誕節假期期間，師生舉辦「街頭玩童～鹽埕兒童街區遊戲日」成果展。活動中可看見學生運用贊助單位瑞儀教育基金會致贈的廢棄木棧板，製作了6具兒童創意遊具，一方面展示學習成果，也希望藉此呼籲高雄民眾重視兒童的遊戲權。"
# Process the text with the Stanza pipeline to extract linguistic information.
doc = stf_nlp(text_zh)
type(doc)

Different models often produce different tokenization results, which in turn affects POS tagging and dependency parsing.

- Here are the results from Stanza (formerly StanfordNLP).

In [None]:
words_data = []
for i, sent in enumerate(doc.sentences):
    for word in sent.words:
        # Prepare and append a dictionary with details about each word to the list.
        words_data.append({
            'Sentence Number': i + 1,
            'Text': word.text,
            'Lemma': word.lemma,
            'POS': word.pos,
            'Head Index': word.head,
            'Dependency Relation': word.deprel
        })

df_words = pd.DataFrame(words_data)
df_words

Unnamed: 0,Sentence Number,Text,Lemma,POS,Head Index,Dependency Relation
0,1,中山,中山,PROPN,12,nmod
1,1,大學,大學,NOUN,12,nmod
2,1,人文,人文,NOUN,12,nmod
3,1,暨,暨,CCONJ,5,cc
4,1,科技,科技,NOUN,3,conj
...,...,...,...,...,...
78,2,兒童,兒童,NOUN,39,obj
79,2,的,的,PART,39,mark:rel
80,2,遊戲,遊戲,NOUN,43,compound
81,2,權,權,PART,36,obj
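
One way to pinpoint where two tokenizers disagree is to compare the character spans of their tokens. The sketch below does this for the final phrase of the example text, using the last few tokens each pipeline reported in the outputs in this notebook. The helper `token_spans` is a hypothetical name, and the token lists are typed in by hand rather than read from the `Doc` objects.

```python
def token_spans(tokens):
    """Return the set of (start, end) character offsets, one per token."""
    spans, pos = set(), 0
    for tok in tokens:
        spans.add((pos, pos + len(tok)))
        pos += len(tok)
    return spans


text = "兒童的遊戲權"
spacy_tokens = ["兒童", "的", "遊戲權"]         # tail of the spaCy output
stanza_tokens = ["兒童", "的", "遊戲", "權"]    # tail of the Stanza output

# Spans that appear in one segmentation but not in the other
only_spacy = sorted(token_spans(spacy_tokens) - token_spans(stanza_tokens))
only_stanza = sorted(token_spans(stanza_tokens) - token_spans(spacy_tokens))

print("spaCy only: ", [text[s:e] for s, e in only_spacy])
print("Stanza only:", [text[s:e] for s, e in only_stanza])
```

Here spaCy keeps 遊戲權 as a single noun while Stanza splits it in two, so any POS tag or dependency arc attached to those tokens cannot be compared one-to-one across the two pipelines.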


## Assignment


### Analyze English

- Input: any English news article of your choice
- Output:
    - A list of unique lemmas of all verbs in lower case
    - A list of unique tuples of (NER text, NER label)




In [None]:
# Change this to any other article of your choice.

en_input = """
Taipei, April 7 (CNA) Health and Welfare Minister Chen Shih-chung (陳時中) said Thursday that COVID-19 contact tracing has been partially suspended in Taiwan and a new disease control model is being put in place, amid a rise in domestic cases.

The immediate suspension of contract tracing applies only to travelers who test positive for COVID-19 in Taiwan, either on arrival at the airport or during mandatory quarantine, Chen said.

That decision was made in a bid to free up resources to monitor the growing number of domestic COVID-19 cases, he said at a press briefing, after he reported 531 new cases -- 382 domestically transmitted and 149 imported.

Chen said contact tracing on new imported cases will only be done if any of them are believed to be linked to COVID-19 clusters at quarantine hotels or quarantine centers in Taiwan.

Prior to Thursday, Taiwan had been reporting its contact tracing information on imported COVID-19 cases via the World Health Organization's International Health Regulations (IHR) mechanism, he said.

Regarding the recent daily rise in domestic infections, Chen said the current goal is to bring the situation under control, even though it is impossible to achieve zero new domestic cases at this time.

Despite the recent spike, the daily number of domestic COVID-19 cases in Taiwan is still low compared to many other countries, he said, citing as an example the 534 new cases per 100,000 population reported in South Korea on Tuesday.

Once people in Taiwan stick together and do their part to prevent the spread of the virus, the situation will be manageable, Chen said.

Based on the trajectory of COVID-19 Omicron outbreaks observed in many other countries around the world, he said, it is likely that the infections in Taiwan will peak in a month or two.

"We do not expect the outbreak to stop growing now, but we hope it will rise slowly, so that Taiwan's medical capacity will not be overloaded," Chen said.

Meanwhile, earlier in the day, the Cabinet announced that Taiwan was adopting a new model for the control of COVID-19 infections.

Under the "new Taiwan model," the country has let go of its goal to achieve zero COVID-19 cases, but this does not mean allowing the pandemic go unmanaged, Cabinet spokesman Lo Ping-cheng (羅秉成) said, citing Premier Su Tseng-chang (蘇貞昌).

In a meeting earlier with Ministry of Health and Welfare (MOHW) officials, Premier Su said that as Taiwan moves towards a new stage of epidemic prevention, he hopes that the central and local governments will work together to gradually open up the country, in the interests of its people and economy, according to Lo.

In a report presented to the Cabinet on Thursday, the MOHW said Taiwan will continue to actively manage the COVID-19 situation, while steadily opening up its borders, in consideration of national economic factors and the people's livelihood.
"""

In [None]:
# Start by turning a text into a spaCy Doc object
en_doc = en(en_input)

In [None]:
#===Write your code below and save the output as `verbs`.===#

verbs = set(token.lemma_.lower() for token in en_doc if token.pos_ == "VERB")
verbs
# verbs =

{'accord',
 'achieve',
 'adopt',
 'allow',
 'announce',
 'apply',
 'base',
 'believe',
 'bring',
 'cite',
 'compare',
 'continue',
 'do',
 'expect',
 'free',
 'go',
 'grow',
 'hope',
 'import',
 'let',
 'link',
 'make',
 'manage',
 'mean',
 'monitor',
 'move',
 'observe',
 'open',
 'overload',
 'peak',
 'present',
 'prevent',
 'put',
 'regard',
 'report',
 'rise',
 'say',
 'stick',
 'stop',
 'suspend',
 'test',
 'trace',
 'transmit',
 'work'}

In [None]:
#===Write your code below and save the output as `en_ents`.===#
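# dict.fromkeys drops duplicates while keeping first-occurrence order, so the result is a list of unique tuples.
en_ents = list(dict.fromkeys((ent.text, ent.label_) for ent in en_doc.ents))
en_ents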

[('Taipei', 'GPE'),
 ('April 7', 'DATE'),
 ('CNA) Health and', 'ORG'),
 ('Welfare', 'ORG'),
 ('Chen Shih-chung', 'PERSON'),
 ('Thursday', 'DATE'),
 ('COVID-19', 'PERSON'),
 ('Taiwan', 'GPE'),
 ('Chen', 'PERSON'),
 ('COVID-19', 'PRODUCT'),
 ('531', 'CARDINAL'),
 ('382', 'CARDINAL'),
 ('149', 'CARDINAL'),
 ("the World Health Organization's", 'ORG'),
 ('daily', 'DATE'),
 ('zero', 'CARDINAL'),
 ('534', 'CARDINAL'),
 ('100,000', 'CARDINAL'),
 ('South Korea', 'GPE'),
 ('Tuesday', 'DATE'),
 ('COVID-19 Omicron', 'PERSON'),
 ('a month', 'DATE'),
 ('two', 'CARDINAL'),
 ('earlier in the day', 'DATE'),
 ('Cabinet', 'ORG'),
 ('Lo Ping-cheng', 'PERSON'),
 ('Su Tseng-chang', 'PERSON'),
 ('Ministry of Health and Welfare', 'ORG'),
 ('Su', 'PERSON'),
 ('Lo', 'ORG'),
 ('MOHW', 'ORG')]

### Analyze Chinese

- Input 1: any Chinese news article from Taiwan media of your choice
- Output 1:
    - A list of unique tokens, excluding punctuation
    - A list of unique tuples of (NER text, NER label)

In [None]:
# Change this to any other article of your choice.

zh_input = """
Êú¨ÂúüÂÄãÊ°àÊò®Â¢û‰∏âÂÖ´‰∫å‰æãÂÜçÂâµÊñ∞È´òÔºåÁ¢∫Ë®∫Ê°à‰æãÈÅçÂèäÂçÅ‰πùÁ∏£Â∏ÇÔºåÂ¢ÉÂ§ñÂ¢û‰∏ÄÂõõ‰πù‰æãÔºåÂñÆÊó•Á†¥‰∫îÁôæÊ°à‰æã„ÄÇ

‰∏≠Â§ÆÁñ´ÊÉÖÊåáÊèÆ‰∏≠ÂøÉÊåáÊèÆÂÆòÈô≥ÊôÇ‰∏≠Ë°®Á§∫ÔºåÂÄãÊ°àÊï∏ÈÇÑÊúÉÂæÄ‰∏äÂçáÔºåÈ†ê‰º∞‰∏ÄËá≥ÂÖ©ÂÄãÊúàÂÖßÈÅîÊúÄÈ´òÂ≥∞ÔºåÁñ´ÊÉÖÂ∞áÊåÅÁ∫åÂà∞ÂÖ≠ÊúàÂ∫ï‰∏îÈÇÑ‰∏çÊúÉÂà∞Â∞æËÅ≤Ôºõ‰ªñ‰πüÈ¶ñÂ∫¶È¨ÜÂè£„ÄåÊ∏ÖÈõ∂‰∏çÂèØËÉΩ„ÄçÔºåÊú™‰æÜÂ∞áËµ∞ÂêëËàáÁóÖÊØíÂÖ±Â≠òÔºå‰∏ãÂë®Êì¨Ë©¶Ëæ¶„ÄåËºïÁóáÂú®ÂÆ∂„ÄçÈöîÈõ¢ÔºåÂêåÊôÇ‰πüÂ∞áË™øÊï¥ÂÅúË™≤Ê®ôÊ∫ñ„ÄÇ

Êì¨‰ª•Â±ÖÂÆ∂Âø´ÁØ©Âèñ‰ª£ÂÅúË™≤
ÂúãÂÖßÁ¢∫Ë®∫ÂÄãÊ°à‰∏äÂçáÔºåÂÖ®ÂúãÁ¥ØÁ©çÂçÅÂõõÁ∏£Â∏ÇÂÖ±‰∏Ä‰∏â‰πùÊâÄÊ†°ÂúíÂÅúË™≤ÔºåÂêÑÁ∏£Â∏ÇÂ∞çÁï¢Ê•≠ÊóÖË°å„ÄÅÊà∂Â§ñÊïôÂ≠∏ÊòØÂê¶ÂèñÊ∂àÊ®ôÊ∫ñ‰∏ç‰∏ÄÔºåÊïôÂ∏´ÂúòÈ´îË™çÁÇ∫ÊîøÂ∫úÊáâË©≤ÊòéÁ¢∫Ë°®ÊÖãÔºåÂê¶ÂâáÊúÉÈÄ†ÊàêÊ†°ÂúíÊÅêÊÖåÊàñÂΩ±ÈüøÂ≠∏ÁîüÂèóÊïôÊ¨ä„ÄÇÈô≥ÊôÇ‰∏≠Ë°®Á§∫ÔºåÊú™‰æÜÂøÖÁÑ∂Ëµ∞ÂêëËàáÁóÖÊØíÂÖ±Â≠òÔºå‰∏ãÂë®Â∞áËàáÊïôËÇ≤ÈÉ®Ê™¢Ë®éÂÅúË™≤Ê®ôÊ∫ñÔºåÁ∏ÆÂ∞èÂå°ÂàóÁØÑÂúçÔºå‰∏¶Âú®ÂêàÁêÜÁØÑÂúç‰ª•Â±ÖÂÆ∂Âø´ÁØ©‰æÜÂèñ‰ª£ÂÅúË™≤„ÄÇ

Áñ´ÊÉÖ‰∏ÄËá≥ÂÖ©ÂÄãÊúàÈÅîÈ´òÂ≥∞
ÊåáÊèÆ‰∏≠ÂøÉÊØîÁÖßÈüìÂúã„ÄÅÁ¥êË•øËò≠ÂèäÈ¶ôÊ∏ØÁñ´ÊÉÖÁôºÂ±ïÔºåÊé®‰º∞Êú™‰æÜ‰∏ÄËá≥ÂÖ©ÂÄãÊúàÁ¢∫Ë®∫Ê°à‰æãÂ∞áÈ£ÜËá≥ÊúÄÈ´òÂ≥∞ÔºåÈô≥ÊôÇ‰∏≠Ë°®Á§∫ÔºåÁõÆÂâçOmicronÁ¢∫Ë®∫Êï∏‰ªçÁÆó‰ΩéÔºå‰ΩÜË¶èÊ®°Èõ£‰ª•È†ê‰º∞ÔºåÊú™‰æÜÂñÆÊó•ÊÅêË∂ÖÈÅé‰∏ÄÂçÉ‰∫îÁôæ‰æãÔºåÂ±ÜÊôÇËá¥Ê≠ªÁéá„ÄÅÂÄãÊ°àÊï∏È£ÜÈ´òÊàñÁñ´ÊÉÖÈ´òÂ≥∞‰∏ã‰∏ç‰æÜÔºåÁ§æÊúÉÂ∞áÊâøÊìî‰∏çËµ∑ÔºåÂõ†Ê≠§‰ªçÈ†àÁ©çÊ•µÂõ†ÊáâÔºåÊúù„ÄåÁ∑©Âù°‰∏äÂçá„ÄçÊñπÂêëÂä™Âäõ„ÄÇ

ËºïÁóáÂú®ÂÆ∂ÊåáÂºïËá≥‰ªäÊ≤íË≠ú
Á∏ΩÁµ±Ëî°Ëã±ÊñáÊó•ÂâçÂÆ£Â∏ÉÈò≤Áñ´‰ª•„ÄåÊ∏õÁÅΩ„ÄçÁÇ∫ÁõÆÊ®ôÈÅøÂÖçÈÜ´ÁôÇÈáèËÉΩË∂ÖËºâÔºå„ÄåËºïÁóáÂú®ÂÆ∂„ÄçÈöîÈõ¢ÁÇ∫ÂÖ∂‰∏≠ÈÖçÂ•óÔºåÂñÆÊó•Á¢∫Ë®∫Êï∏Ëã•ÈÅî‰∏ÄÂçÉ‰∫îÁôæ‰∫∫Â∞áÂïüÂãï„ÄÇÊåáÊèÆ‰∏≠ÂøÉ‰∏ãÂë®Â∞áÊì¨ÂÆö„ÄåËºïÁóáÂú®ÂÆ∂ÁÖßË≠∑ÊåáÂºï„ÄçÔºåÂ¶ÇË®≠ÈÜ´ÁôÇÈÅ†Ë∑ùÂπ≥Âè∞„ÄÅÈÄÅËó•„ÄÅÊà∂ÊîøÂèäË≠¶ÊîøÁ≥ªÁµ±ËÅØÁπ´„ÄÅÈóúÊá∑‰∏≠ÂøÉÈÅã‰ΩúÁ≠âÔºåËã•Êú™ÈÅµÂÆàÈöîÈõ¢Ë¶èÂÆöÂ∞áÊúâÁΩ∞ÂâáÔºå‰∏¶Âõ†ÊáâÂú∞ÊñπÁñ´ÊÉÖÂçáÊ∫´ÂíåÈÜ´ÁôÇÈáèËÉΩÂêÉÁ∑äÔºåÂ∞áÂæûÊñ∞ÂåóË©¶Ëæ¶„ÄÇ

Á´ãÂßîË≥¥ÊÉ†Âì°„ÄÅËî£Ëê¨ÂÆâÊò®ÊñºË°õÁí∞ÂßîÂì°ÊúÉË≥™Ë©¢ÊôÇÔºåË≥™ÁñëÂêÑÂú∞ÊñπÊó©Â∑≤ÂñäË©±Â∏åÊúõÊåáÊèÆ‰∏≠ÂøÉÁõ∏ÈóúÊåáÂºïÂø´Âá∫‰æÜÔºå‰ΩÜËá≥‰ªäÈÄ£Á§æÂçÄÊ∫ùÈÄö„ÄÅÂæµÊ±ÇË©¶Ëæ¶ÁöÑÂú∞ÊñπÊîøÂ∫úÊÑèÈ°òÁµ±Áµ±Ê≤íÊúâË≠ú„ÄÇ

Âè∞ÂåóÂ∏ÇÈï∑ÊüØÊñáÂì≤Êò®Êôö‰πüÂú®ËáâÊõ∏Ë°®Á§∫„ÄåÂåóÂ∏ÇÈò≤Áñ´ÊóÖÈ§®ÈáèËÉΩÂëäÊÄ•„ÄçÔºåÂõ†ÊØèÂ§©Á¢∫Ë®∫‰∫∫Êï∏‰∏çÊñ∑ÊîÄÂçáÔºåËøëÊúüÊúâËøëËê¨ÂêçÁßªÂ∑•ÂÖ•Â¢ÉÔºåÂπæ‰πéÊääÂåóÂ∏ÇÁöÑÈò≤Áñ´ÊóÖÈ§®ÈáèËÉΩÂç†Êªø„ÄÇÂåóÂ∏ÇÂ∑≤ÁôºÂá∫ÂæµÂè¨‰ª§ÔºåÂæµÁî®Âä†Âº∑ÁâàÈò≤Áñ´Â∞àË≤¨ÊóÖÈ§®ÔºåËÆìËºïÁóá„ÄÅ‰ΩéÂç±Èö™Á¢∫Ë®∫ËÄÖÂÖ•‰Ωè„ÄÇ

Á∏ΩÁµ±Â∫ú„ÄÅÁõ£ÂØüÈô¢ÂÇ≥Á¢∫Ë®∫ËÄÖ
Êú¨ÂúüÁñ´ÊÉÖÂ§öÈªûÁàÜÁôºÔºåÂÖ¨ÂãôÊ©üÈóúÂåÖÊã¨Á∏ΩÁµ±Â∫ú„ÄÅÁõ£ÂØüÈô¢„ÄÅÂè∞ÂåóÂ∏ÇË≠∞ÊúÉÈÉΩÂÇ≥Âá∫ÊúâÁ¢∫Ë®∫ËÄÖÔºåÂåó‰∏≠ÂçóÂÖ´Â§ßË°åÊ•≠Á¢∫Ë®∫‰∫∫Êï∏‰πüÈ©üÂ¢ûÔºåÊú™‰æÜÂ†¥ÊâÄÊòØÂê¶Êñ∞Â¢ûÁ¶Å‰ª§ÔºåÊåáÊèÆ‰∏≠ÂøÉÂ∞áÂÜçË®éË´ñ„ÄÇ

Â¢ÉÂ§ñÁßªÂÖ•Êò®Â¢û‰∏ÄÂõõ‰πù‰æãÂÄãÊ°àÔºåÊúâ‰∏ÉÂçÅÂÖ´‰æãÁÇ∫Ëà™Áè≠ËêΩÂú∞Êé°Ê™¢ÈôΩÊÄßÔºåË∂äÂçóÊúâÂçÅ‰∏É‰æãÂ±ÖÂÜ†„ÄÇÊåáÊèÆ‰∏≠ÂøÉË°®Á§∫ÔºåÁõÆÂâçËêΩÂú∞Êé°Ê™¢ÈôΩÊÄßÁéáÁ¥ÑËêΩÂú®ÂõõÔºÖÂà∞‰∫îÔºÖÈñìÔºåÂç≥Êó•Ëµ∑Ë∂äÂçóËà™Á©∫„ÄÅË∂äÊç∑Ëà™Á©∫ÂèäË∂äÁ´πËà™Á©∫Á≠â‰∏âËà™Á©∫ÂÖ¨Âè∏Áè≠Ê©üÔºåÂ¢ûÂä†„ÄåÊê≠Ê©üÂâçÂÖ≠Â∞èÊôÇÂÖßÊäóÂéüÂø´ÁØ©Â†±Âëä„ÄçÊâçÂèØÂÖ•Â¢É„ÄÇ
"""

In [None]:
# Start by turning a text into a spaCy Doc object
zh_doc = zh(zh_input)

In [None]:
#===Write your code below and save the output as `zh_toks`.===#
zh_toks = set(token.text.lower() for token in zh_doc if not token.is_punct)
zh_toks
# zh_toks =

['ËºïÁóá',
 'Á¥êË•øËò≠',
 'ÁôºÂá∫',
 'ÁΩ∞Ââá',
 'Ê†°Âúí',
 '‰∏Ä',
 'ÂïüÂãï',
 'ÂÜç',
 '‰πü',
 'ÊïôÂ∏´',
 'Ëàá',
 'Êñ∞È´ò',
 'ÁÖßË≠∑',
 'ÂñÆÊó•',
 'Ë∂äÂçó',
 'Ë°®ÊÖã',
 'ÂÖ∂‰∏≠',
 'Â∞æËÅ≤',
 'Ë≥™Áñë',
 'Ë¶èÂÆö',
 'Êì¨ÂÆö',
 'ÊúÄÈ´òÂ≥∞',
 'ÁôÇÈáè',
 'ÊîøÂ∫ú',
 'ÊáâË©≤',
 '\n\n',
 'ÈÄ†Êàê',
 'È´òÂ≥∞',
 'ÂÇ≥Âá∫',
 'ÂêÉÁ∑ä',
 'È£ÜÈ´ò',
 'ÂæµÊ±Ç',
 'ÂèØ',
 'ÊâÄ',
 'ÁàÜÁôº',
 'ÂÖ≠Êúà',
 'Êú™',
 'Êää',
 'Ë°®Á§∫',
 '‰∏ã',
 'ÈÄ£Á§æ',
 'ÊîÄÂçá',
 'ÈÅ†Ë∑ù',
 'ÈÅøÂÖç',
 'Êò®Â¢û',
 'Á§æÊúÉ',
 'Èñì',
 '‰∏Ä‰∏â‰πù',
 'Â±ÖÂÜ†',
 'ÊóÖË°å',
 'ÂèØËÉΩ',
 'ÊñºË°õ',
 'Â∞è',
 'ÂÄãÊ°à',
 'Ë∂äÁ´π',
 '‰ΩÜ',
 'Á¥ØÁ©ç',
 'Èò≤Áñ´',
 'Êó©Â∑≤',
 'Ê©üÈóú',
 'Â¢ûÂä†',
 'Âä†Âº∑Áâà',
 'ÂñäË©±',
 'ÊäóÂéü',
 'ÂçÅ‰∏É',
 '‰æãÈÅç',
 'Áñ´ÊÉÖ',
 'Áµ±Ê≤í',
 'ÊòØÂê¶',
 'Âõ†',
 'Êò®Êôö',
 'Âú∞Êñπ',
 'Â†±Âëä',
 'ÁÇ∫ÁõÆ',
 'Á¢∫Ë®∫',
 'Êê≠Ê©ü',
 'Á∑©Âù°',
 'Ê™¢Ë®é',
 'Êé®‰º∞',
 'Ëµ∞Âêë',
 'Ë≠ú',
 'ÈÄÅËó•',
 'Ë®∫Êï∏',
 'Âå°Âàó',
 '‰æãÁÇ∫',
 'Â±ÜÊôÇ',
 'ÊåáÊèÆÂÆò',
 'ÂÖ±Â≠ò',
 'Â≠∏Áîü',
 'Âê¶Ââá',
 'Âèä',
 'Èô≥ÊôÇ',
 'Ë©¢ÊôÇ',
 '‰ªç',
 

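The token set above still contains whitespace-only strings such as `'\n\n'`, because filtering on `token.is_punct` does not remove whitespace tokens. If those are unwanted, they can be dropped afterwards. The sketch below uses plain Python on a hypothetical stand-in set (`sample_toks`); `drop_whitespace` is an illustrative helper, not a spaCy function.

```python
def drop_whitespace(tokens):
    """Return the tokens that are not made up solely of whitespace."""
    return {tok for tok in tokens if not tok.isspace()}

# Hypothetical stand-in for a token set like `zh_toks`.
sample_toks = {"taiwan", "\n\n", "\n", "covid-19", " "}
clean_toks = drop_whitespace(sample_toks)
print(sorted(clean_toks))
```

On the spaCy side, adding `and not token.is_space` to the filter in the comprehension should have the same effect at tokenization time.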
In [None]:
#===Write your code below and save the output as `zh_ents`.===#
zh_ents = set((ent.text, ent.label_) for ent in zh_doc.ents)
zh_ents
# zh_ents =

{('‰∏Ä‰∏â‰πù', 'DATE'),
 ('‰∏ÄÂçÉ‰∫îÁôæ', 'CARDINAL'),
 ('‰∏ÄÂõõ‰πù‰æã', 'CARDINAL'),
 ('‰∏ÄËá≥ÂÖ©ÂÄãÊúà', 'DATE'),
 ('‰∏ÉÂçÅÂÖ´', 'CARDINAL'),
 ('‰∏â', 'CARDINAL'),
 ('‰∏âÂÖ´‰∫å', 'CARDINAL'),
 ('‰∏ãÂë®', 'DATE'),
 ('‰∏≠Â§ÆÁñ´ÊÉÖÊåáÊèÆ‰∏≠ÂøÉ', 'ORG'),
 ('‰∫îÁôæ', 'CARDINAL'),
 ('‰∫îÔºÖ', 'CARDINAL'),
 ('ÂÖ≠Êúà', 'DATE'),
 ('ÂåóÂ∏Ç', 'GPE'),
 ('ÂçÅ‰∏É', 'CARDINAL'),
 ('ÂçÅÂõõÁ∏£', 'CARDINAL'),
 ('Âè∞ÂåóÂ∏Ç', 'GPE'),
 ('Âë®Â∞á', 'DATE'),
 ('Âë®Êì¨', 'PERSON'),
 ('ÂõõÔºÖ', 'FAC'),
 ('Â∞áË™ø', 'PERSON'),
 ('Â∞áÈ£Ü', 'PERSON'),
 ('Â∫ï‰∏î', 'PERSON'),
 ('ÂæµÁî®Âä†Âº∑ÁâàÈò≤Áñ´Â∞àË≤¨', 'ORG'),
 ('ÊïôËÇ≤ÈÉ®', 'ORG'),
 ('Êñ∞Âåó', 'GPE'),
 ('Êó•Ââç', 'DATE'),
 ('Êò®Êôö', 'TIME'),
 ('Áõ£ÂØüÈô¢', 'ORG'),
 ('Á¥êË•øËò≠', 'GPE'),
 ('Ëî°Ëã±Êñá', 'PERSON'),
 ('Ë¶èÊ®°Èõ£', 'ORG'),
 ('Ë≠¶ÊîøÁ≥ªÁµ±ËÅØÁπ´', 'ORG'),
 ('Ë∂äÂçó', 'GPE'),
 ('ÈÜ´ÁôÇÈáè', 'ORG'),
 ('ÈôΩÊÄßÁéá', 'PERSON'),
 ('ÈüìÂúã', 'GPE'),
 ('È†ê‰º∞‰∏ÄËá≥ÂÖ©ÂÄãÊúà', 'DATE'),
 ('È¶ôÊ∏Ø', 'GPE'),
 ('È©üÂ¢û', 'PERSON')}

- Input 2: a Simplified Chinese version of Input 1 (use `opencc` to do the conversion)
- Output 2:
    - A set of unique tokens, excluding punctuation
    - A set of unique (NER text, NER label) tuples

In [None]:
# @title opencc Installation { display-mode: "form" }

INSTALL = True # @param {type:"boolean"}

if INSTALL:
    !pip install opencc -qq


In [None]:
import opencc

In [None]:
converter = opencc.OpenCC('t2s.json')
sim_zh_input = converter.convert(zh_input)
sim_zh_input

'\nÊú¨Âúü‰∏™Ê°àÊò®Â¢û‰∏âÂÖ´‰∫å‰æãÂÜçÂàõÊñ∞È´òÔºåÁ°ÆËØäÊ°à‰æãÈÅçÂèäÂçÅ‰πùÂéøÂ∏ÇÔºåÂ¢ÉÂ§ñÂ¢û‰∏ÄÂõõ‰πù‰æãÔºåÂçïÊó•Á†¥‰∫îÁôæÊ°à‰æã„ÄÇ\n\n‰∏≠Â§ÆÁñ´ÊÉÖÊåáÊå•‰∏≠ÂøÉÊåáÊå•ÂÆòÈôàÊó∂‰∏≠Ë°®Á§∫Ôºå‰∏™Ê°àÊï∞Ëøò‰ºöÂæÄ‰∏äÂçáÔºåÈ¢Ñ‰º∞‰∏ÄËá≥‰∏§‰∏™ÊúàÂÜÖËææÊúÄÈ´òÂ≥∞ÔºåÁñ´ÊÉÖÂ∞ÜÊåÅÁª≠Âà∞ÂÖ≠ÊúàÂ∫ï‰∏îËøò‰∏ç‰ºöÂà∞Â∞æÂ£∞Ôºõ‰ªñ‰πüÈ¶ñÂ∫¶ÊùæÂè£„ÄåÊ∏ÖÈõ∂‰∏çÂèØËÉΩ„ÄçÔºåÊú™Êù•Â∞ÜËµ∞Âêë‰∏éÁóÖÊØíÂÖ±Â≠òÔºå‰∏ãÂë®ÊãüËØïÂäû„ÄåËΩªÁóáÂú®ÂÆ∂„ÄçÈöîÁ¶ªÔºåÂêåÊó∂‰πüÂ∞ÜË∞ÉÊï¥ÂÅúËØæÊ†áÂáÜ„ÄÇ\n\nÊãü‰ª•Â±ÖÂÆ∂Âø´Á≠õÂèñ‰ª£ÂÅúËØæ\nÂõΩÂÜÖÁ°ÆËØä‰∏™Ê°à‰∏äÂçáÔºåÂÖ®ÂõΩÁ¥ØÁßØÂçÅÂõõÂéøÂ∏ÇÂÖ±‰∏Ä‰∏â‰πùÊâÄÊ†°Âõ≠ÂÅúËØæÔºåÂêÑÂéøÂ∏ÇÂØπÊØï‰∏öÊóÖË°å„ÄÅÊà∑Â§ñÊïôÂ≠¶ÊòØÂê¶ÂèñÊ∂àÊ†áÂáÜ‰∏ç‰∏ÄÔºåÊïôÂ∏àÂõ¢‰ΩìËÆ§‰∏∫ÊîøÂ∫úÂ∫îËØ•ÊòéÁ°ÆË°®ÊÄÅÔºåÂê¶Âàô‰ºöÈÄ†ÊàêÊ†°Âõ≠ÊÅêÊÖåÊàñÂΩ±ÂìçÂ≠¶ÁîüÂèóÊïôÊùÉ„ÄÇÈôàÊó∂‰∏≠Ë°®Á§∫ÔºåÊú™Êù•ÂøÖÁÑ∂Ëµ∞Âêë‰∏éÁóÖÊØíÂÖ±Â≠òÔºå‰∏ãÂë®Â∞Ü‰∏éÊïôËÇ≤ÈÉ®Ê£ÄËÆ®ÂÅúËØæÊ†áÂáÜÔºåÁº©Â∞èÂå°ÂàóËåÉÂõ¥ÔºåÂπ∂Âú®ÂêàÁêÜËåÉÂõ¥‰ª•Â±ÖÂÆ∂Âø´Á≠õÊù•Âèñ‰ª£ÂÅúËØæ„ÄÇ\n\nÁñ´ÊÉÖ‰∏ÄËá≥‰∏§‰∏™ÊúàËææÈ´òÂ≥∞\nÊåáÊå•‰∏≠ÂøÉÊØîÁÖßÈü©ÂõΩ„ÄÅÁ∫ΩË•øÂÖ∞ÂèäÈ¶ôÊ∏ØÁñ´ÊÉÖÂèëÂ±ïÔºåÊé®

In [None]:
# Start by turning a text into a spaCy Doc object
sim_zh_doc = zh(sim_zh_input)

In [None]:
#===Write your code below and save the output as `sim_zh_toks`.===#
sim_zh_toks = set(token.text.lower() for token in sim_zh_doc if not token.is_punct)
sim_zh_toks
# sim_zh_toks =

{'\n',
 '\n\n',
 'omicronÁ°Æ',
 '‰∏Ä',
 '‰∏Ä‰∏â‰πù',
 '‰∏ÄÂçÉ‰∫îÁôæ',
 '‰∏ÄÂçÉ‰∫îÁôæ‰æã',
 '‰∏ÄÂõõ‰πù',
 '‰∏ÄÂõõ‰πù‰æã',
 '‰∏ÉÂçÅÂÖ´',
 '‰∏á',
 '‰∏â',
 '‰∏âÂÖ´‰∫å',
 '‰∏äÂçá',
 '‰∏ã',
 '‰∏ãÂë®',
 '‰∏ç',
 '‰∏ç‰∏Ä',
 '‰∏çÊñ≠',
 '‰∏é',
 '‰∏ìË¥£',
 '‰∏§',
 '‰∏™',
 '‰∏™Ê°à',
 '‰∏≠',
 '‰∏≠Âçó',
 '‰∏≠Â§Æ',
 '‰∏≠ÂøÉ',
 '‰∏∫',
 '‰πü',
 '‰∫îÁôæ',
 '‰∫îÔºÖ',
 '‰∫∫',
 '‰∫∫Êï∞',
 '‰ªç',
 '‰ªé',
 '‰ªñ',
 '‰ª•',
 '‰ºö',
 '‰º†Âá∫',
 '‰º†Á°Æ',
 '‰ΩÜ',
 '‰Ωé',
 '‰æã',
 '‰æã‰∏∫',
 'ÂÅúËØæ',
 'ÂÖ•‰Ωè',
 'ÂÖ•Â¢É',
 'ÂÖ®ÂõΩ',
 'ÂÖ´Â§ß',
 'ÂÖ¨Âä°',
 'ÂÖ¨Âè∏',
 'ÂÖ≠',
 'ÂÖ≠Êúà',
 'ÂÖ±',
 'ÂÖ±Â≠ò',
 'ÂÖ≥ÊÄÄ',
 'ÂÖ∂‰∏≠',
 'ÂÜÖ',
 'ÂÜÖËææ',
 'ÂÜç',
 'ÂáèÁÅæ',
 'Âá†‰πé',
 'Âá∫Êù•',
 'ÂàõÊñ∞',
 'Âà∞',
 'Ââç',
 'Âä†Âº∫Áâà',
 'Âä™Âäõ',
 'ÂåÖÊã¨',
 'Âåó',
 'ÂåóÂ∏Ç',
 'Âå°Âàó',
 'ÂåªÁñó',
 'ÂåªÁñóÈáè',
 'ÂçÅ‰∏É',
 'ÂçÅ‰πù',
 'ÂçÅÂõõ',
 'ÂçáÊ∏©',
 'Âçï',
 'ÂçïÊó•',
 'Âç†Êª°',
 'Âç´ÁéØ',
 'Âç±Èô©',
 'Âç≥',
 'Âéø',
 'Âèä',
 'ÂèëÂá∫',
 'ÂèëÂ±ï',
 'Âèñ‰ª£',
 'ÂèñÊ∂à',
 'ÂèóÊïôÊùÉ',
 'ÂèØ',
 'ÂèØËÉΩ',
 'Âè∞Âåó',
 'Âè∞ÂåóÂ∏Ç

Evaluate whether `zh_toks` is equal to `sim_zh_toks`.

In [None]:
zh_toks == sim_zh_toks

False
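
The comparison is `False` for two different reasons: the two sets are written in different scripts, so most strings differ character by character, and the model may also segment the two texts differently. To separate these effects, one option is to convert the traditional tokens to simplified first and only then diff the sets. The sketch below assumes a conversion callable is passed in; `compare_token_sets` and the tiny `toy_map` are hypothetical and only stand in for a real converter.

```python
def compare_token_sets(trad_toks, simp_toks, to_simplified):
    """Convert traditional tokens to simplified, then diff the two sets."""
    converted = {to_simplified(tok) for tok in trad_toks}
    return {
        "shared": converted & simp_toks,
        "only_traditional": converted - simp_toks,
        "only_simplified": simp_toks - converted,
    }

# Toy data: a hypothetical character map standing in for an OpenCC converter.
toy_map = str.maketrans({"個": "个", "確": "确", "診": "诊", "數": "数"})
trad = {"個案", "確診", "確", "診數"}
simp = {"个案", "确诊", "确诊数"}

result = compare_token_sets(trad, simp, lambda tok: tok.translate(toy_map))
print(result["only_traditional"])
print(result["only_simplified"])
```

In the notebook, the call would presumably be `compare_token_sets(zh_toks, sim_zh_toks, converter.convert)`. Whatever remains in the two `only_*` sets then reflects segmentation differences rather than script differences.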

In [None]:
#===Write your code below and save the output as `sim_zh_ents`.===#

sim_zh_ents = set((ent.text, ent.label_) for ent in sim_zh_doc.ents)
sim_zh_ents
# sim_zh_ents =

{('‰∏Ä‰∏â‰πù', 'DATE'),
 ('‰∏ÄÂçÉ‰∫îÁôæ‰∫∫', 'CARDINAL'),
 ('‰∏ÄÂõõ‰πù‰æã', 'CARDINAL'),
 ('‰∏ÄËá≥‰∏§‰∏™Êúà', 'DATE'),
 ('‰∏ÉÂçÅÂÖ´', 'CARDINAL'),
 ('‰∏â', 'CARDINAL'),
 ('‰∏âÂÖ´‰∫å', 'CARDINAL'),
 ('‰∏ãÂë®', 'DATE'),
 ('‰∏≠Âçó', 'GPE'),
 ('‰∏≠Â§ÆÁñ´ÊÉÖÊåáÊå•‰∏≠ÂøÉ', 'ORG'),
 ('‰∫îÁôæ', 'CARDINAL'),
 ('‰∫îÔºÖ', 'PERCENT'),
 ('ÂÖ≠Â∞èÊó∂', 'TIME'),
 ('ÂÖ≠Êúà', 'DATE'),
 ('ÂÜÖËææ', 'GPE'),
 ('ÂåóÂ∏Ç', 'GPE'),
 ('ÂçÅ‰∏É', 'CARDINAL'),
 ('ÂçÅ‰πù', 'CARDINAL'),
 ('ÂçÅÂõõÂéø', 'CARDINAL'),
 ('Âè∞Âåó', 'GPE'),
 ('Âè∞ÂåóÂ∏ÇËÆÆ‰ºö', 'ORG'),
 ('Âë®Êãü', 'PERSON'),
 ('ÂõõÔºÖ', 'FAC'),
 ('Â∫ï‰∏î', 'PERSON'),
 ('ÊÄªÁªüÂ∫ú', 'ORG'),
 ('ÊïôËÇ≤ÈÉ®', 'ORG'),
 ('Êñ∞Âåó', 'GPE'),
 ('Êó•Ââç', 'DATE'),
 ('Êó•Á°ÆËØäÊï∞', 'ORG'),
 ('Êò®Êôö', 'TIME'),
 ('Êú™Êù•‰∏ÄËá≥‰∏§‰∏™Êúà', 'DATE'),
 ('ÊüØÊñáÂì≤', 'PERSON'),
 ('ÁõëÂØüÈô¢', 'ORG'),
 ('Á°ÆËØäËÄÖ', 'FAC'),
 ('Á∫¶ËêΩ', 'PERSON'),
 ('Á∫ΩË•øÂÖ∞', 'GPE'),
 ('Ëíã‰∏áÂÆâÊò®‰∫éÂç´ÁéØÂßîÂëò‰ºö', 'ORG'),
 ('Ëî°Ëã±Êñá', 'PERSON'),
 ('ËµñÊÉ†Âëò', 'PERSON'),
 ('Ë∂äÂçó', '

Evaluate whether `zh_ents` is equal to `sim_zh_ents`.

In [None]:
zh_ents == sim_zh_ents

False
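
Beyond a plain equality check, it can be informative to see where the two runs disagree on the label of the same span; in the outputs above, for example, one percentage expression is `CARDINAL` in one run and `PERCENT` in the other. The sketch below is an illustration on toy pairs, with `label_disagreements` as a hypothetical helper. It assumes both sets already use the same script, so the traditional entity texts would need converting first.

```python
def label_disagreements(ents_a, ents_b):
    """Return {text: (labels_a, labels_b)} for texts whose label sets differ."""
    def group(ents):
        grouped = {}
        for text, label in ents:
            grouped.setdefault(text, set()).add(label)
        return grouped

    a, b = group(ents_a), group(ents_b)
    return {
        text: (sorted(a[text]), sorted(b[text]))
        for text in a.keys() & b.keys()
        if a[text] != b[text]
    }

# Toy (text, label) pairs, not the full notebook output.
ents_a = {("五％", "CARDINAL"), ("越南", "GPE"), ("六月", "DATE")}
ents_b = {("五％", "PERCENT"), ("越南", "GPE"), ("台北", "GPE")}

print(label_disagreements(ents_a, ents_b))
```

Entities found in only one of the two sets are ignored here; they can be listed separately with ordinary set difference.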

## 📚 Reference

1. https://ckip.iis.sinica.edu.tw/
2. https://github.com/APCLab/jieba-tw
3. https://corenlp.run/
4. https://github.com/Embedding/Chinese-Word-Vectors
5. https://github.com/stanfordnlp/GloVe
6. https://radimrehurek.com/gensim/
7. https://github.com/sloria/textblob

