# NLP: Applications

<table align="left">
  <td>
    <a href="https://colab.research.google.com/github/phonchi/ModularPython/blob/master/NLP-use-pretrained-models.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>
  </td>
  <td style="text-align: center">
    <a href="https://github.com/phonchi/ModularPython/blob/master/NLP-use-pretrained-models.ipynb">
      <img src="https://cloud.google.com/ml-engine/images/github-logo-32px.png" alt="GitHub logo"><br> View on GitHub
    </a>
  </td>
</table>

This notebook is adapted by [Haowen Jiang](https://howard-haowen.rohan.tw/) from [this one](https://github.com/nlptown/nlp-notebooks/blob/master/NLP%20with%20pretrained%20models%20-%20spaCy%20and%20StanfordNLP.ipynb) included in the [nlptown/nlp-notebooks](https://github.com/nlptown/nlp-notebooks) repo. It is meant for the 2022 [NLP Workshop at NSYSU](https://howard-haowen.github.io/NLP-demos/nsysu_workshop).

In [None]:
from datetime import date

today = date.today()
print("Last updated:", today)

# 📘 NLP with pretrained models - spaCy

In [None]:
# @title spaCy Installation { display-mode: "form" }

INSTALL = True # @param {type:"boolean"}

if INSTALL:
    #!pip install -U pip setuptools wheel -qq
    #!pip install -U spacy -qq
    !python -m spacy download en_core_web_md -qq # downloads the medium-sized English language model
    !python -m spacy download zh_core_web_md -qq # downloads the medium-sized Chinese language model

![](https://spacy.io/images/pipeline.svg)

In [None]:
import pandas as pd
import numpy as np
import spacy

In [None]:
spacy.info()

- To get you started, play with [this Web App](https://share.streamlit.io/howard-haowen/spacy-streamlit/app.py) that I created, which is powered by spaCy.

## English NLP

In [None]:
en = spacy.load("en_core_web_md") # Loading the spaCy Model which includes vocabulary, syntax models, and entities.
df_metadata = pd.DataFrame([en.meta])
df_metadata.T

In [None]:
text = ("Donald John Trump (born June 14, 1946) is the 45th and previous president of "
     "the United States.  Before entering politics, he was a businessman and television personality.")
print(text)

Here, the spaCy model processes the text about Donald Trump and returns a `Doc` object, `doc_en`. A `Doc` is a sequence of `Token` objects, each representing one lexical token, and it holds all the information about the text's structure and content.

In [None]:
doc_en = en(text)

In [None]:
tokens = [token.text for token in doc_en]
print(tokens)

spaCy also splits your document into sentences. The `.sents` property of the `Doc` object yields them one at a time.

In [None]:
sentences = list(doc_en.sents)
len(sentences), sentences

### Part-of-Speech tagging

In addition, spaCy identifies a variety of linguistic features for each token. Among the foundational features are the lemma and two types of parts-of-speech (POS) tags. The `pos_` attribute encompasses the [Universal POS tags](https://universaldependencies.org/u/pos/) derived from the [Universal Dependencies](https://universaldependencies.org/) framework, which provide a consistent categorization of word types across languages. On the other hand, the `tag_` attribute offers more detailed, language-specific POS tags that capture finer grammatical distinctions.

> Part of speech or POS is a grammatical role that explains how a particular word is used in a sentence.
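
The relationship between the two tag sets can be pictured as a many-to-one mapping: several fine-grained tags collapse into a single universal tag. The sketch below uses a small hand-written subset of Penn Treebank-style tags; the table and the helper names `coarse_tag` and `group_by_coarse` are assumptions made for illustration, not spaCy's internal mapping.

```python
# Hand-written, simplified subset of a fine-to-coarse tag mapping (illustrative only).
FINE_TO_COARSE = {
    "NN": "NOUN",    # noun, singular or mass
    "NNS": "NOUN",   # noun, plural
    "NNP": "PROPN",  # proper noun, singular
    "VBD": "VERB",   # verb, past tense
    "VBG": "VERB",   # verb, gerund or present participle
    "JJ": "ADJ",     # adjective
    "JJR": "ADJ",    # adjective, comparative
}


def coarse_tag(fine_tag):
    """Return the universal POS tag for a fine-grained tag, or 'X' if unknown."""
    return FINE_TO_COARSE.get(fine_tag, "X")


def group_by_coarse(mapping):
    """Invert the mapping: universal tag -> sorted list of fine-grained tags."""
    groups = {}
    for fine, coarse in mapping.items():
        groups.setdefault(coarse, []).append(fine)
    return {coarse: sorted(fines) for coarse, fines in groups.items()}


print(coarse_tag("NNS"))                       # NOUN
print(group_by_coarse(FINE_TO_COARSE)["ADJ"])  # ['JJ', 'JJR']
```

In the notebook itself, the same comparison can be made directly on a token by looking at `token.tag_` next to `token.pos_`.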

In [None]:
# orthographic representation, lemma, coarse-grained part-of-speech (pos_), and fine-grained part-of-speech (tag_).
features = [
    {'Text': token.orth_, 'Lemma': token.lemma_, 'POS': token.pos_, 'Detailed POS': token.tag_, 'Explain': spacy.explain(token.tag_)}
    for token in doc_en
]

df_features = pd.DataFrame(features)
df_features

### Named-Entity Recognition

Next, spaCy includes pre-trained models for named entity recognition (NER). The outcomes of these models are reflected in the `ent_iob_` and `ent_type_` attributes. The `ent_type_` attribute specifies the category of entity identified by the model, such as a person, date, ordinal number, or geopolitical entity (GPE). For instance, in English models adhering to the [OntoNotes standard](https://catalog.ldc.upenn.edu/docs/LDC2013T19/OntoNotes-Release-5.0.pdf), "Donald John Trump" is recognized as a person, "June 14, 1946" as a date, "45th" as an ordinal number, and "the United States" as a GPE.

The `ent_iob_` attribute (inside-outside-beginning (IOB) tagging) indicates the token's position within an entity: `O` for outside any entity, `B` for the beginning of an entity, and `I` for inside an entity (but not at the beginning). This notation is part of the `BIO` tagging scheme, which helps differentiate between consecutive entities of the same type.
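
To make the role of the `B` tag concrete, here is a minimal sketch that groups per-token tags back into entity spans. The function name `bio_to_spans` and the tag lists are hand-written for illustration (they are not actual model output), and tokens are simply joined with spaces.

```python
def bio_to_spans(tokens, iob_tags, ent_types):
    """Group per-token IOB tags into (entity text, entity type) spans."""
    spans = []
    current_tokens, current_type = [], None
    for token, iob, ent_type in zip(tokens, iob_tags, ent_types):
        if iob == "B":
            # A new entity starts; close any entity that is still open.
            if current_tokens:
                spans.append((" ".join(current_tokens), current_type))
            current_tokens, current_type = [token], ent_type
        elif iob == "I" and current_tokens:
            # Continue the entity that is currently open.
            current_tokens.append(token)
        else:
            # "O" (or a stray "I" with nothing open): close any open entity.
            if current_tokens:
                spans.append((" ".join(current_tokens), current_type))
            current_tokens, current_type = [], None
    if current_tokens:
        spans.append((" ".join(current_tokens), current_type))
    return spans


# Hand-written tags for illustration.
tokens = ["Donald", "John", "Trump", "was", "born", "June", "14"]
iob = ["B", "I", "I", "O", "O", "B", "I"]
types = ["PERSON", "PERSON", "PERSON", "", "", "DATE", "DATE"]
print(bio_to_spans(tokens, iob, types))
# [('Donald John Trump', 'PERSON'), ('June 14', 'DATE')]

# Two adjacent entities of the same type stay separate because each starts with "B".
print(bio_to_spans(["Taipei", "Kaohsiung"], ["B", "B"], ["GPE", "GPE"]))
# [('Taipei', 'GPE'), ('Kaohsiung', 'GPE')]
```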

> Other schemes like `BILUO` include additional designations for the last token of an entity and for unique, standalone entity tokens, providing detailed positional information within entity sequences.
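
As a rough sketch of how the two schemes relate, the hypothetical helper below rewrites a well-formed sequence of `B`/`I`/`O` tags into `BILUO` tags by looking one tag ahead. It ignores entity types, so it assumes every `I` continues the entity opened by the nearest preceding `B`.

```python
def bio_to_biluo(iob_tags):
    """Convert a sequence of B/I/O tags to B/I/L/U/O tags."""
    biluo = []
    for i, tag in enumerate(iob_tags):
        # Treat the position after the last tag as "O".
        next_tag = iob_tags[i + 1] if i + 1 < len(iob_tags) else "O"
        entity_continues = next_tag == "I"
        if tag == "B":
            # A "B" with no following "I" is a single-token (unit) entity.
            biluo.append("B" if entity_continues else "U")
        elif tag == "I":
            # An "I" with no following "I" is the last token of its entity.
            biluo.append("I" if entity_continues else "L")
        else:
            biluo.append("O")
    return biluo


print(bio_to_biluo(["B", "I", "I", "O", "B", "O"]))
# ['B', 'I', 'L', 'O', 'U', 'O']
```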

In [None]:
# Extracting named entity information from each token in the document
entities = [
    {'Text': token.orth_, 'IOB Tag': token.ent_iob_, 'Entity Type': token.ent_type_, 'Explain': spacy.explain(token.ent_type_)}  # attributes ending in _ return the string value
    for token in doc_en  # Iterate over each token
]

df_entities = pd.DataFrame(entities)
df_entities

You can also access the entities directly on the `ents` attribute of the document:

In [None]:
print([(ent.text, ent.label_) for ent in doc_en.ents])

### Dependency Parsing

spaCy also contains a dependency parser, which analyzes the grammatical relations between the tokens.

> Dependency parsing is the process of extracting the dependency graph of a sentence to represent its grammatical structure. It defines the dependency relationship between headwords and their dependents. The head of the sentence depends on no other word and is called the root; it is usually the main verb. Every other word is linked to a headword. The dependencies can be mapped in a directed graph in which words are the nodes and grammatical relationships are the edges.
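
The graph view can be sketched with plain Python data structures. The parse of "She ate apples" below is written by hand for illustration (it is not model output), and the helpers `find_root` and `children` are hypothetical names. The root is stored as its own head, which mirrors how spaCy sets `token.head` for the root token.

```python
# Each entry is (token, index of its head, dependency label).
# The root points to itself.
parse = [
    ("She", 1, "nsubj"),
    ("ate", 1, "ROOT"),
    ("apples", 1, "dobj"),
]


def find_root(parse):
    """Return the index of the token that is its own head."""
    for i, (_, head, _) in enumerate(parse):
        if head == i:
            return i
    raise ValueError("no root found")


def children(parse, head_index):
    """Return (token, label) pairs for the direct dependents of a head."""
    return [
        (token, label)
        for i, (token, head, label) in enumerate(parse)
        if head == head_index and i != head_index
    ]


root = find_root(parse)
print(parse[root][0])         # ate
print(children(parse, root))  # [('She', 'nsubj'), ('apples', 'dobj')]
```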

In [None]:
# Extracting syntax or dependency parsing information from each token
syntax = [
    {'Token': token.text, 'Dependency': token.dep_, 'Head': token.head.text, 'Explain': spacy.explain(token.dep_)}
    for token in doc_en  # Iterate over each token in the document
]

df_syntax = pd.DataFrame(syntax)
df_syntax

Finally, the English spaCy model contains a morphological parser.

In [None]:
# Extracting morphological features from each token in the document
features = [
    {'Token': token.text, 'Morphological Features': token.morph}
    for token in doc_en  # Iterate over each token
]

df_features = pd.DataFrame(features)
df_features

## Multilingual NLP

spaCy has models not only for English but also for many other languages.

In [None]:
zh = spacy.load("zh_core_web_md")
df_metadata = pd.DataFrame([zh.meta])  # show the metadata of the Chinese model just loaded
df_metadata.T

In [None]:
text_zh = "中山大學人文暨科技跨領域學士學位學程助理教授宋世祥表示，2021年聖誕節假期期間，師生舉辦「街頭玩童～鹽埕兒童街區遊戲日」成果展。活動中可看見學生運用贊助單位瑞儀教育基金會致贈的廢棄木棧板，製作了6具兒童創意遊具，一方面展示學習成果，也希望藉此呼籲高雄民眾重視兒童的遊戲權。"
doc_zh = zh(text_zh)

The tokens in the Chinese document share the same attribute structure as those in the English document in spaCy. However, the functionalities of the models can vary significantly between languages. One key difference to note is in the handling of lemmatization:

- **Lack of Lemmatization in Chinese Model**: Unlike the English model, the Chinese model does not provide lemmatization.

This distinction is important to consider when performing text processing tasks, as it affects the depth of linguistic analysis available for each language.

In [None]:
list(doc_zh.sents)

In [None]:
tok_text = [tok.text for tok in doc_zh]
tok_orth = [tok.orth_ for tok in doc_zh]
print(tok_text)
print(tok_orth)

In [None]:
for tok in list(doc_zh.sents)[1]: # The second sentence
    print(f"{tok.text} >>> {tok.pos_}")

- The Chinese model uses a very different set of fine-grained part-of-speech tags on the `tag_` attribute.

In [None]:
# Printing each token's text, detailed POS tag, and an explanation of the tag.
for tok in list(doc_zh.sents)[1]:
    print(f"{tok.text} >>> {tok.tag_} | {spacy.explain(tok.tag_)}")

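- The entity types a model predicts can differ from those of the English model (some models only distinguish PER, LOC and ORG, for example). The labels the Chinese model actually uses can be listed with `zh.get_pipe("ner").labels`.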

This is a result of the training corpora that were used to build the models, whose annotation guidelines may be very different.

In [None]:
info = [(t.text, t.pos_, t.tag_, t.ent_iob_, t.ent_type_) for t in doc_zh]
df_info = pd.DataFrame(info, columns=['Text', 'POS', 'Tag', 'IOB Tag', 'Entity Type'])
df_info

## Visualization

In [None]:
from spacy import displacy

In [None]:
displacy.render(doc_zh, style='ent',jupyter=True, options={'distance':130})

In [None]:
text = "我想要三份2號餐"
doc = zh(text)
displacy.render(doc, style='dep',jupyter=True, options={'distance':130})

## DataFrame + spaCy = dframcy

In [None]:
# @title dframcy Installation { display-mode: "form" }

INSTALL = True # @param {type:"boolean"}

if INSTALL:
    !pip install dframcy -qq

In [None]:
from dframcy import DframCy

In [None]:
nlp = spacy.load('zh_core_web_md')
# Initialize DframCy with the spaCy NLP model to integrate with pandas DataFrame.
dframcy = DframCy(nlp)
# Process the Chinese text using the NLP model to create a spaCy document.
doc = dframcy.nlp(text_zh)
# Convert the NLP document annotations to a pandas DataFrame for easier analysis.
annotation_dataframe = dframcy.to_dataframe(doc)
annotation_dataframe

Once annotations are stored as a DataFrame object, filtering can be easily done by leveraging the power of `pandas` syntax.

In [None]:
# Create a filter for rows where the part-of-speech tag is 'NN' (noun).
nn_filt = annotation_dataframe['token_tag_'] == 'NN'
# Create a filter for rows where the dependency label is 'dobj' (direct object).
dobj_filt = annotation_dataframe['token_dep_'] == 'dobj'
# Get rows where the token is a noun and serves as a direct object.
annotation_dataframe[(nn_filt) & dobj_filt]

## Vectors

In [None]:
doc = zh("教授")
tok = doc[0]
tok.vector

In [None]:
tok.vector.shape

In [None]:
import numpy as np

# Function to calculate cosine similarity
def cosine_similarity(vec1, vec2):
    # Ensure the vectors are not only zeros
    if np.all(vec1 == 0) or np.all(vec2 == 0):
        return 0.0
    return np.dot(vec1, vec2) / (np.linalg.norm(vec1) * np.linalg.norm(vec2))


# Define the words and retrieve vectors
word_1 = zh("高興").vector
word_2 = zh("高雄").vector
word_3 = zh("開心").vector

# Calculate similarities
word_1_word_2_similarity = cosine_similarity(word_1, word_2)
word_1_word_3_similarity = cosine_similarity(word_1, word_3)

# Print the results
print(f"Similarity between '高興' and '高雄': {word_1_word_2_similarity}")
print(f"Similarity between '高興' and '開心': {word_1_word_3_similarity}")

- Cosine similarity

![](https://zhangruochi.com/Operations-on-word-vectors-Debiasing/2019/03/28/images/cosine_sim.png)

- Formula for calculating cosine similarity between two vectors

![](https://miro.medium.com/max/1400/1*LfW66-WsYkFqWc4XYJbEJg.png)
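
To make the geometry concrete, here is a small worked example on toy 2-D vectors using only the standard library. The vectors and the function name `cosine` are made up for illustration; real word vectors have many more dimensions.

```python
import math


def cosine(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        # The angle is undefined for a zero vector; return 0.0 by convention.
        return 0.0
    return dot / (norm_a * norm_b)


same_direction = cosine([1.0, 2.0], [2.0, 4.0])  # parallel vectors
orthogonal = cosine([1.0, 0.0], [0.0, 1.0])      # perpendicular vectors
opposite = cosine([1.0, 2.0], [-1.0, -2.0])      # vectors pointing in opposite directions

print(round(same_direction, 6), round(orthogonal, 6), round(opposite, 6))
```

Vectors pointing in the same direction score close to 1, perpendicular vectors score 0, and opposite vectors score close to -1, regardless of their lengths.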

## 🔍 Supplementary: StanfordNLP

Another library that shares some functionality with spaCy is StanfordNLP. [StanfordNLP](https://stanfordnlp.github.io/stanfordnlp/), distinct from Stanford’s Java-based [CoreNLP](https://stanfordnlp.github.io/CoreNLP/) library, is a [Python library](https://github.com/stanfordnlp/stanfordnlp) developed on the PyTorch framework. It provides a fully neural NLP pipeline, which includes advanced features such as tokenization (capable of recognizing multi-word units), lemmatization, part-of-speech tagging (incorporating morphological features), and state-of-the-art dependency parsing. These components were specifically designed and trained for the [CoNLL-2018 shared task](https://nlp.stanford.edu/pubs/qi2018universal.pdf). While it does not include named entity recognition, StanfordNLP excels in dependency parsing and additionally offers a Python interface to CoreNLP, facilitating integration into Python projects.


> **`stanfordnlp` has been renamed to `stanza`.**

In [None]:
# @title stanza Installation { display-mode: "form" }

INSTALL = True # @param {type:"boolean"}

if INSTALL:
    !pip install stanza -qq

In [None]:
import stanza

In [None]:
stanza.download("zh-hant") # Download the traditional Chinese model for Stanza.

In [None]:
stf_nlp = stanza.Pipeline('zh-hant') # Initialize the Stanza pipeline for traditional Chinese to handle various NLP tasks.

In [None]:
text_zh = "中山大學人文暨科技跨領域學士學位學程助理教授宋世祥表示，2021年聖誕節假期期間，師生舉辦「街頭玩童～鹽埕兒童街區遊戲日」成果展。活動中可看見學生運用贊助單位瑞儀教育基金會致贈的廢棄木棧板，製作了6具兒童創意遊具，一方面展示學習成果，也希望藉此呼籲高雄民眾重視兒童的遊戲權。"
# Process the text with the Stanza pipeline to extract linguistic information.
doc = stf_nlp(text_zh)
type(doc)

Different models often produce different tokenization results, which in turn affects POS tagging and dependency parsing.

- Here are the results from Stanza (formerly StanfordNLP).

In [None]:
words_data = []
for i, sent in enumerate(doc.sentences):
    for word in sent.words:
        # Prepare and append a dictionary with details about each word to the list.
        words_data.append({
            'Sentence Number': i + 1,
            'Text': word.text,
            'Lemma': word.lemma,
            'POS': word.pos,
            'Head Index': word.head,
            'Dependency Relation': word.deprel
        })

df_words = pd.DataFrame(words_data)
df_words
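
One simple way to quantify how much two tokenizations of the same text differ is to compare where their tokens end. The sketch below is an assumption-laden illustration: the two segmentations are hypothetical (not real output from spaCy or Stanza), and `boundaries` and `boundary_agreement` are made-up helper names.

```python
def boundaries(tokens):
    """Return the set of character offsets at which a token ends."""
    offsets, position = set(), 0
    for token in tokens:
        position += len(token)
        offsets.add(position)
    return offsets


def boundary_agreement(tokens_a, tokens_b):
    """Jaccard overlap between the token-boundary sets of two segmentations."""
    if "".join(tokens_a) != "".join(tokens_b):
        raise ValueError("segmentations must cover the same text")
    a, b = boundaries(tokens_a), boundaries(tokens_b)
    if not (a | b):
        # Two empty segmentations agree trivially.
        return 1.0
    return len(a & b) / len(a | b)


# Two hypothetical segmentations of the same phrase.
seg_a = ["中山大學", "助理", "教授"]
seg_b = ["中山", "大學", "助理教授"]
print(boundary_agreement(seg_a, seg_b))  # 0.5
```

With real pipelines, the token lists could come from `[tok.text for tok in doc_zh]` for spaCy and from `word.text` for Stanza, provided both cover exactly the same characters.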

## 🔍 Supplementary: Assignment


### Analyze English

- Input: any English news article of your choice
- Output:
    - A list of unique lemmas of all verbs in lower case
    - A list of unique tuples of (NER text, NER label)




In [None]:
# Change this to any other article of your choice.

en_input = """
Taipei, April 7 (CNA) Health and Welfare Minister Chen Shih-chung (陳時中) said Thursday that COVID-19 contact tracing has been partially suspended in Taiwan and a new disease control model is being put in place, amid a rise in domestic cases.

The immediate suspension of contract tracing applies only to travelers who test positive for COVID-19 in Taiwan, either on arrival at the airport or during mandatory quarantine, Chen said.

That decision was made in a bid to free up resources to monitor the growing number of domestic COVID-19 cases, he said at a press briefing, after he reported 531 new cases -- 382 domestically transmitted and 149 imported.

Chen said contact tracing on new imported cases will only be done if any of them are believed to be linked to COVID-19 clusters at quarantine hotels or quarantine centers in Taiwan.

Prior to Thursday, Taiwan had been reporting its contact tracing information on imported COVID-19 cases via the World Health Organization's International Health Regulations (IHR) mechanism, he said.

Regarding the recent daily rise in domestic infections, Chen said the current goal is to bring the situation under control, even though it is impossible to achieve zero new domestic cases at this time.

Despite the recent spike, the daily number of domestic COVID-19 cases in Taiwan is still low compared to many other countries, he said, citing as an example the 534 new cases per 100,000 population reported in South Korea on Tuesday.

Once people in Taiwan stick together and do their part to prevent the spread of the virus, the situation will be manageable, Chen said.

Based on the trajectory of COVID-19 Omicron outbreaks observed in many other countries around the world, he said, it is likely that the infections in Taiwan will peak in a month or two.

"We do not expect the outbreak to stop growing now, but we hope it will rise slowly, so that Taiwan's medical capacity will not be overloaded," Chen said.

Meanwhile, earlier in the day, the Cabinet announced that Taiwan was adopting a new model for the control of COVID-19 infections.

Under the "new Taiwan model," the country has let go of its goal to achieve zero COVID-19 cases, but this does not mean allowing the pandemic go unmanaged, Cabinet spokesman Lo Ping-cheng (羅秉成) said, citing Premier Su Tseng-chang (蘇貞昌).

In a meeting earlier with Ministry of Health and Welfare (MOHW) officials, Premier Su said that as Taiwan moves towards a new stage of epidemic prevention, he hopes that the central and local governments will work together to gradually open up the country, in the interests of its people and economy, according to Lo.

In a report presented to the Cabinet on Thursday, the MOHW said Taiwan will continue to actively manage the COVID-19 situation, while steadily opening up its borders, in consideration of national economic factors and the people's livelihood.
"""

In [None]:
# Start by turning a text into a spaCy Doc object
en_doc = en(en_input)

In [None]:
#===Write your code below and save the output as `verbs`.===#


# verbs =

In [None]:
#===Write your code below and save the output as `en_ents`.===#

# en_ents =

### Analyze Chinese

- Input 1: any Chinese news article from Taiwan media of your choice
- Output 1:
    - A list of unique tokens, excluding punctuation
    - A list of unique tuples of (NER text, NER label)

In [None]:
# Change this to any other article of your choice.

zh_input = """
本土個案昨增三八二例再創新高，確診案例遍及十九縣市，境外增一四九例，單日破五百案例。

中央疫情指揮中心指揮官陳時中表示，個案數還會往上升，預估一至兩個月內達最高峰，疫情將持續到六月底且還不會到尾聲；他也首度鬆口「清零不可能」，未來將走向與病毒共存，下周擬試辦「輕症在家」隔離，同時也將調整停課標準。

擬以居家快篩取代停課
國內確診個案上升，全國累積十四縣市共一三九所校園停課，各縣市對畢業旅行、戶外教學是否取消標準不一，教師團體認為政府應該明確表態，否則會造成校園恐慌或影響學生受教權。陳時中表示，未來必然走向與病毒共存，下周將與教育部檢討停課標準，縮小匡列範圍，並在合理範圍以居家快篩來取代停課。

疫情一至兩個月達高峰
指揮中心比照韓國、紐西蘭及香港疫情發展，推估未來一至兩個月確診案例將飆至最高峰，陳時中表示，目前Omicron確診數仍算低，但規模難以預估，未來單日恐超過一千五百例，屆時致死率、個案數飆高或疫情高峰下不來，社會將承擔不起，因此仍須積極因應，朝「緩坡上升」方向努力。

輕症在家指引至今沒譜
總統蔡英文日前宣布防疫以「減災」為目標避免醫療量能超載，「輕症在家」隔離為其中配套，單日確診數若達一千五百人將啟動。指揮中心下周將擬定「輕症在家照護指引」，如設醫療遠距平台、送藥、戶政及警政系統聯繫、關懷中心運作等，若未遵守隔離規定將有罰則，並因應地方疫情升溫和醫療量能吃緊，將從新北試辦。

立委賴惠員、蔣萬安昨於衛環委員會質詢時，質疑各地方早已喊話希望指揮中心相關指引快出來，但至今連社區溝通、徵求試辦的地方政府意願統統沒有譜。

台北市長柯文哲昨晚也在臉書表示「北市防疫旅館量能告急」，因每天確診人數不斷攀升，近期有近萬名移工入境，幾乎把北市的防疫旅館量能占滿。北市已發出徵召令，徵用加強版防疫專責旅館，讓輕症、低危險確診者入住。

總統府、監察院傳確診者
本土疫情多點爆發，公務機關包括總統府、監察院、台北市議會都傳出有確診者，北中南八大行業確診人數也驟增，未來場所是否新增禁令，指揮中心將再討論。

境外移入昨增一四九例個案，有七十八例為航班落地採檢陽性，越南有十七例居冠。指揮中心表示，目前落地採檢陽性率約落在四％到五％間，即日起越南航空、越捷航空及越竹航空等三航空公司班機，增加「搭機前六小時內抗原快篩報告」才可入境。
"""

In [None]:
# Start by turning a text into a spaCy Doc object
zh_doc = zh(zh_input)

In [None]:
#===Write your code below and save the output as `zh_toks`.===#

# zh_toks =

In [None]:
#===Write your code below and save the output as `zh_ents`.===#

# zh_ents =

- Input 2: Simplified version of Input 1 (Use `opencc` to do the conversion.)
- Output 2:
    - A list of unique tokens, excluding punctuation
    - A list of unique tuples of (NER text, NER label)

In [None]:
# @title opencc Installation { display-mode: "form" }

INSTALL = True # @param {type:"boolean"}

if INSTALL:
    !pip install opencc -qq

In [None]:
import opencc

In [None]:
converter = opencc.OpenCC('t2s.json')
sim_zh_input = converter.convert(zh_input)
sim_zh_input

In [None]:
# Start by turning a text into a spaCy Doc object
sim_zh_doc = zh(sim_zh_input)

In [None]:
#===Write your code below and save the output as `sim_zh_toks`.===#

# sim_zh_toks =

Evaluate whether `zh_toks` is equal to `sim_zh_toks`.

In [None]:
zh_toks == sim_zh_toks

In [None]:
#===Write your code below and save the output as `sim_zh_ents`.===#

sim_zh_ents = set((ent.text, ent.label_) for ent in sim_zh_doc.ents)
sim_zh_ents

Evaluate whether `zh_ents` is equal to `sim_zh_ents`.

In [None]:
zh_ents == sim_zh_ents

## 📚 Reference

1. https://ckip.iis.sinica.edu.tw/
2. https://github.com/APCLab/jieba-tw
3. https://corenlp.run/
4. https://github.com/Embedding/Chinese-Word-Vectors
5. https://github.com/stanfordnlp/GloVe
6. https://radimrehurek.com/gensim/
7. https://github.com/sloria/textblob

