## ASSIGNMENT 9 - NAMED ENTITY RECOGNITION

spaCy is a Python library for natural language processing (NLP), known for its speed on large volumes of text. It provides tokenization, part-of-speech tagging, named entity recognition, and dependency parsing, which makes it a popular choice for NLP tasks.

## LOADING SPACY MODELS FOR CHINESE AND ENGLISH

In [1]:
# Load spaCy models for processing text in Chinese and English.

import spacy
nlp_zh = spacy.load("zh_core_web_lg")  # Chinese pipeline
nlp = spacy.load("en_core_web_lg")  # English pipeline; every later cell uses this one

### TOKENIZE TEXT

In [63]:
# Tokenize the given text and print each token.

text = "NER could recognize the term Netflix in a document and classify it as a company"
doc = nlp(text)
for token in doc:
    print(token, end=" | ")

NER | could | recognize | the | term | Netflix | in | a | document | and | classify | it | as | a | company | 

### GENERATE DATAFRAME FOR TOKEN VISUALIZATION

In [64]:
# Generate a dataframe for visualizing spaCy tokens with options to include or exclude punctuation.

import pandas as pd

def display_nlp(doc, include_punct=False):
    """Generate a DataFrame for visualizing spaCy tokens."""
    rows = []
    for i, t in enumerate(doc):
        if not t.is_punct or include_punct:
            row = {'token': i,  'text': t.text, 'lemma_': t.lemma_, 
                   'is_stop': t.is_stop, 'is_alpha': t.is_alpha,
                   'pos_': t.pos_, 'dep_': t.dep_, 
                   'ent_type_': t.ent_type_, 'ent_iob_': t.ent_iob_}
            rows.append(row)
    
    df = pd.DataFrame(rows).set_index('token')
    df.index.name = None
    return df
display_nlp(doc)

,text,lemma_,is_stop,is_alpha,pos_,dep_,ent_type_,ent_iob_
0,NER,NER,False,True,PROPN,nsubj,ORG,B
1,could,could,True,True,AUX,aux,,O
2,recognize,recognize,False,True,VERB,ROOT,,O
3,the,the,True,True,DET,det,,O
4,term,term,False,True,NOUN,dobj,,O
5,Netflix,Netflix,False,True,PROPN,appos,ORG,B
6,in,in,True,True,ADP,prep,,O
7,a,a,True,True,DET,det,,O
8,document,document,False,True,NOUN,pobj,,O
9,and,and,True,True,CCONJ,cc,,O
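
The `ent_iob_` column follows the IOB scheme: `B` marks the first token of an entity, `I` a token inside one, and `O` a token outside any entity. The following is a minimal sketch, using plain tuples rather than spaCy tokens, of how such tags can be grouped back into entity spans. The helper name `iob_to_spans` is hypothetical and not part of spaCy.

```python
def iob_to_spans(tagged):
    """Group (text, iob, label) triples into (entity_text, label) spans."""
    spans = []
    current_tokens, current_label = [], None
    for text, iob, label in tagged:
        if iob == "B":
            # A new entity starts; flush any entity still in progress.
            if current_tokens:
                spans.append((" ".join(current_tokens), current_label))
            current_tokens, current_label = [text], label
        elif iob == "I" and current_tokens:
            # Continue the entity that is currently open.
            current_tokens.append(text)
        else:
            # "O" (or a stray "I") closes any entity in progress.
            if current_tokens:
                spans.append((" ".join(current_tokens), current_label))
            current_tokens, current_label = [], None
    if current_tokens:
        spans.append((" ".join(current_tokens), current_label))
    return spans

# First six rows of the table above, as (text, ent_iob_, ent_type_).
rows = [
    ("NER", "B", "ORG"), ("could", "O", ""), ("recognize", "O", ""),
    ("the", "O", ""), ("term", "O", ""), ("Netflix", "B", "ORG"),
]
print(iob_to_spans(rows))
```

A multi-token entity would appear as a `B` row followed by one or more `I` rows, which the function joins into a single span.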


### FILTER OUT STOP WORDS AND PUNCTUATION

Stop words are common words such as "and", "the", and "is" that are often filtered out so that analysis focuses on content-bearing words. Punctuation marks (commas, periods, exclamation points) organize written language but are usually removed or handled separately in text processing.

In [65]:
# Extract non-stop words and non-punctuation tokens from the given text.

text = "NER utilizes natural language processing (NLP) to tag entities based on predefined parameters"
doc = nlp(text)

non_stop = [t for t in doc if not t.is_stop and not t.is_punct]
print(non_stop)

[NER, utilizes, natural, language, processing, NLP, tag, entities, based, predefined, parameters]
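
For intuition, the same filtering can be sketched without spaCy. The stop-word set below is a tiny hand-picked assumption, not spaCy's list, and the regex tokenizer is far cruder than spaCy's.

```python
import re

# Hypothetical, deliberately tiny stop list; spaCy's English list is much larger.
STOP_WORDS = {"to", "on", "the", "and", "is", "a"}

def content_words(text):
    """Return alphabetic tokens not in STOP_WORDS; the regex skips punctuation."""
    tokens = re.findall(r"[A-Za-z]+", text)
    return [t for t in tokens if t.lower() not in STOP_WORDS]

text = "NER utilizes natural language processing (NLP) to tag entities based on predefined parameters"
print(content_words(text))
```

On this sentence the sketch keeps the same eleven words as the spaCy cell above, but that agreement should not be expected on arbitrary text.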


### EXTRACT NOUNS FROM TEXT

In [66]:
# Extract nouns and proper nouns from the given text.

text = "NER utilizes natural language processing (NLP) to tag entities based on predefined parameters"
doc = nlp(text)

nouns = [t for t in doc if t.pos_ in ['NOUN', 'PROPN']]
print(nouns)

[NER, language, processing, NLP, entities, parameters]


### IDENTIFY ENTITIES IN TEXT

In [67]:
# Print identified entities along with their labels.

text = "NER could recognize the term Netflix in a document and classify it as a company"
doc = nlp(text)

for ent in doc.ents:
    print(f"({ent.text}, {ent.label_})", end=" ")

(NER, ORG) (Netflix, ORG) 

### IDENTIFY ENTITIES IN A SECOND TEXT

Identifying entities in text means recognizing and categorizing specific pieces of information such as names of people, organizations, locations, dates, and numerical expressions. Named entity recognition (NER) extracts and classifies these entities automatically, which supports deeper semantic analysis and information retrieval.

In [68]:
# Print identified entities along with their labels.

text = "NER can be applied to invoices to automate the identification of account IDs, shipping and billing addresses, and invoice amounts." 
doc = nlp(text)

for ent in doc.ents:
    print(f"({ent.text}, {ent.label_})", end=" ")

(NER, ORG) 
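
The sentence above only mentions invoice fields; it contains no actual values. Fields with a fixed shape, such as account IDs, are often captured with rules rather than a statistical model. The following is a minimal sketch using regular expressions. The field formats (`ACC-` followed by five digits, dollar amounts) and the sample invoice line are invented for illustration.

```python
import re

# Hypothetical formats: account IDs like "ACC-12345", amounts like "$1,250.00".
PATTERNS = {
    "ACCOUNT_ID": re.compile(r"\bACC-\d{5}\b"),
    "AMOUNT": re.compile(r"\$\d{1,3}(?:,\d{3})*(?:\.\d{2})?"),
}

def find_invoice_fields(text):
    """Return (matched_text, label) pairs ordered by position in the text."""
    hits = []
    for label, pattern in PATTERNS.items():
        for m in pattern.finditer(text):
            hits.append((m.start(), m.group(), label))
    return [(matched, label) for _, matched, label in sorted(hits)]

sample = "Invoice for account ACC-00417: total due $1,250.00."
print(find_invoice_fields(sample))
```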

### VISUALIZE ENTITIES IN TEXT

In [69]:
# Render a visualization of the identified entities in the text.

from spacy import displacy

displacy.render(doc, style='ent', jupyter=True)

## 1. CONVERT URL TO TEXT AND COUNT ENTITIES

In [27]:
# Convert the content of a given URL into text and count the identified entities.

from bs4 import BeautifulSoup
import requests
import re
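def url_to_string(url):
    """Download a page and return its visible text as a single string."""
    res = requests.get(url)
    html = res.text
    soup = BeautifulSoup(html, 'html.parser')
    # Drop elements whose text is not part of the visible page body.
    for element in soup(["script", "style", 'aside']):
        element.extract()
    # Join the pieces separated by newlines or tabs with single spaces.
    return " ".join(re.split(r'[\n\t]+', soup.get_text()))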
ny_bb = url_to_string('https://www.mingpao.com')
# `nlp` is the English pipeline, so entity spans on this Chinese-language page are unreliable.
article = nlp(ny_bb)
len(article.ents)

87
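
`url_to_string` splits only on newlines and tabs, so runs of ordinary spaces left by the page layout survive into the text passed to `nlp`. The following is a minimal sketch of a stricter cleanup step, assuming that collapsing all whitespace is acceptable for the analysis. `normalize_whitespace` is a hypothetical helper that is not part of the notebook, and the sample string is made up.

```python
import re

def normalize_whitespace(text):
    """Collapse every run of whitespace (spaces, tabs, newlines, U+3000) to one space."""
    return re.sub(r"\s+", " ", text).strip()

raw = "Asia Pacific      Latin America\n\t  Europe\u3000Africa "
print(normalize_whitespace(raw))
```

Applying such a step to the string returned by `url_to_string` before calling `nlp` would be one way to keep layout spacing out of tokens and entity spans.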

### VISUALIZE ENTITIES IN TEXT

In [10]:
# Render a visualization of the identified entities in the extracted article text.

displacy.render(article, style='ent', jupyter=True)

### COUNT ENTITY LABELS

In [11]:
# Count the occurrence of each entity label in the extracted article text.

from collections import Counter

labels = [x.label_ for x in article.ents]
Counter(labels)

Counter({'ORG': 26,
         'PERSON': 12,
         'CARDINAL': 4,
         'GPE': 4,
         'QUANTITY': 1,
         'EVENT': 1,
         'TIME': 1,
         'PRODUCT': 1})
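
Raw counts are easier to compare across pages once they are converted to shares. The following is a small sketch using the counts printed above; the helper name `label_shares` is an assumption.

```python
from collections import Counter

def label_shares(counts):
    """Map each label to its percentage of all entities, rounded to one decimal."""
    total = sum(counts.values())
    return {label: round(100 * n / total, 1) for label, n in counts.most_common()}

counts = Counter({'ORG': 26, 'PERSON': 12, 'CARDINAL': 4, 'GPE': 4,
                  'QUANTITY': 1, 'EVENT': 1, 'TIME': 1, 'PRODUCT': 1})
shares = label_shares(counts)
print(shares['ORG'], shares['PERSON'])
```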

### COUNT MOST COMMON ENTITIES

In [12]:
# Count the 5 most common entities in the extracted article text.

items = [x.text for x in article.ents]
Counter(items).most_common(5)

[('新股遞表半年內上市比例跌至12', 3),
 ('2024年03月25日星期一 \u3000\u3000\u3000 ', 1),
 ('熱門搜尋', 1),
 ('香港怎麼辦系列 日月掠影 麥明詩婚禮 鼻敏感藥 彩色渠蓋 特色配電箱 【', 1),
 ('圖輯】維港兩岸各有活動\u3000復活節帽子巡遊北角至灣仔海濱舉行\u3000尖沙嘴海旁辦基層墟市', 1)]

### PRINT SPECIFIC SENTENCE

In [13]:
# Print the 21st sentence from the extracted article text.

sentences = [x for x in article.sents]
print(sentences[20])

TopGear Lotus Eletre電動蓮花 MingWatch HK 東方表行銅鑼灣Fashion Walk全新形象店開幕 全球國際品牌正式登陸新店


### VISUALIZE ENTITIES IN SPECIFIC SENTENCE

In [14]:
# Render a visualization of the identified entities in the 21st sentence of the extracted article text.

displacy.render(nlp(str(sentences[20])), jupyter=True, style='ent')

### EXTRACT WORDS WITH PARTS OF SPEECH AND LEMMAS

In [15]:
# Extract words along with their parts of speech and lemmas from the 21st sentence of the extracted article text, excluding stop words and punctuation.

[(t.orth_, t.pos_, t.lemma_)
 for t in nlp(str(sentences[20]))
 if not t.is_stop and t.pos_ != 'PUNCT']

[('TopGear', 'PROPN', 'TopGear'),
 ('Lotus', 'PROPN', 'Lotus'),
 ('Eletre電動蓮花', 'PROPN', 'Eletre電動蓮花'),
 ('MingWatch', 'PROPN', 'MingWatch'),
 ('HK', 'PROPN', 'HK'),
 ('東方表行銅鑼灣Fashion', 'NUM', '東方表行銅鑼灣fashion'),
 ('Walk全新形象店開幕', 'PROPN', 'Walk全新形象店開幕'),
 ('全球國際品牌正式登陸新店', 'NOUN', '全球國際品牌正式登陸新店')]

### VISUALIZE DEPENDENCY PARSING

Dependency parsing is a technique in natural language processing (NLP) that analyzes the grammatical structure of a sentence by determining the relationships between words, represented as directed edges between tokens in a dependency tree. It helps uncover the syntactic dependencies and hierarchical structure within sentences, facilitating tasks like semantic analysis, question answering, and machine translation.

In [16]:
# Render a visualization of the dependency parsing for the 21st sentence of the extracted article text with adjusted distance between words.

displacy.render(nlp(str(sentences[20])), style='dep', jupyter = True, options = {'distance': 120})
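
A dependency parse can be stored as one head index per token, with the root pointing at itself (spaCy's `token.head` follows the same convention for the root). The sketch below walks from a token up to the root and lists a token's direct children. The `dep_` labels come from the token table near the top of this notebook; the head indices are filled in by hand as an assumption for illustration and were not printed by the notebook.

```python
# First five tokens of "NER could recognize the term ...".
tokens = ["NER", "could", "recognize", "the", "term"]
deps = ["nsubj", "aux", "ROOT", "det", "dobj"]
heads = [2, 2, 2, 4, 2]  # assumed head index per token; the root points at itself

def path_to_root(i):
    """Return the token texts visited when following head links from token i to the root."""
    path = [tokens[i]]
    while heads[i] != i:
        i = heads[i]
        path.append(tokens[i])
    return path

def children(i):
    """Return (text, dep) pairs for the tokens whose head is token i."""
    return [(tokens[j], deps[j]) for j in range(len(tokens)) if heads[j] == i and j != i]

print(path_to_root(3))
print(children(2))
```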

## 2. CONVERT URL TO TEXT AND COUNT ENTITIES

In [33]:
# Convert the content of a given URL into text and count the identified entities

ny_bb = url_to_string('https://apnews.com/article/israel-hamas-war-news-03-24-2024-24019f74683075740bf4dd8f7a42dfd8')
article = nlp(ny_bb)
len(article.ents)

244

### VISUALIZE ENTITIES IN TEXT

In [34]:
# Render a visualization of the identified entities in the extracted article text.

displacy.render(article, style='ent', jupyter=True)

### COUNT ENTITY LABELS

In [35]:
# Count the occurrence of each entity label in the extracted article text.

from collections import Counter

labels = [x.label_ for x in article.ents]
Counter(labels)

Counter({'GPE': 65,
         'ORG': 46,
         'DATE': 32,
         'NORP': 30,
         'CARDINAL': 23,
         'PERSON': 19,
         'LOC': 10,
         'WORK_OF_ART': 8,
         'EVENT': 5,
         'LAW': 2,
         'QUANTITY': 2,
         'PERCENT': 1,
         'TIME': 1})

### COUNT MOST COMMON ENTITIES

In [36]:
# Count the 25 most common entities in the extracted article text.

items = [x.text for x in article.ents]
Counter(items).most_common(25)

[('Gaza', 18),
 ('Israeli', 14),
 ('Israel', 12),
 ('Hamas', 9),
 ('AP', 8),
 ('Palestinians', 7),
 ('The Associated Press', 5),
 ('Rafah', 4),
 ('Sunday', 4),
 ('Shifa', 4),
 ('Russia', 3),
 ('China', 3),
 ('Gaza Strip', 3),
 ('Shifa Hospital', 3),
 ('November', 3),
 ('Hezbollah', 3),
 ('Asia Pacific', 2),
 ('Latin America      ', 2),
 ('Europe', 2),
 ('Africa', 2),
 ('Middle East', 2),
 ('Australia', 2),
 ('U.S. Election 2024', 2),
 ('Joe Biden', 2),
 ('Election 2024', 2)]
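
Some entries in the list above carry layout whitespace (for example `'Latin America      '`), so the same name could be counted under several spellings. The following is a minimal sketch that normalizes entity text before counting. The input list is a made-up sample, not the notebook's `article.ents`, and `count_entities` is a hypothetical helper.

```python
from collections import Counter

def count_entities(texts):
    """Count entity strings after collapsing internal whitespace and trimming the ends."""
    cleaned = (" ".join(t.split()) for t in texts)
    # Skip strings that were whitespace only.
    return Counter(c for c in cleaned if c)

sample = ["Latin America      ", "Latin America", "Gaza", "Gaza ", "  "]
print(count_entities(sample).most_common())
```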

### PRINT SPECIFIC SENTENCE

In [38]:
# Print the 4th sentence from the extracted article text.

sentences = [x for x in article.sents]
print(sentences[3])

Top 25 Poll      Entertainment Movie reviews      Book reviews      Celebrity      Television      Music      Business Inflation      Personal finance      Financial Markets      Business Highlights      Financial wellness      Science Fact Check Oddities Newsletters Video Health Photography  Climate Personal Finance Tech Artificial Intelligence      Social Media      Lifestyle Religion AP Buyline Personal Finance Press Releases                   Search Query                Submit Search             Show Search          World Israel-Hamas War      Russia-Ukraine War      Global elections      Asia Pacific      Latin America      Europe      Africa      Middle East      China      Australia      U.S. Election 2024 Politics Joe Biden      Election 2024      Congress      Sports March Madness      MLB      NBA      NHL      NFL      Soccer      Golf      Tennis      AP


### VISUALIZE ENTITIES IN SPECIFIC SENTENCE

In [45]:
# Render a visualization of the identified entities in the 4th sentence of the extracted article text.

displacy.render(nlp(str(sentences[3])), jupyter=True, style='ent')

### EXTRACT WORDS WITH PARTS OF SPEECH AND LEMMAS

In [46]:
# Extract words along with their parts of speech and lemmas from the 4th sentence of the extracted article text, excluding stop words and punctuation.

[(t.orth_, t.pos_, t.lemma_)
 for t in nlp(str(sentences[3]))
 if not t.is_stop and t.pos_ != 'PUNCT']

[('25', 'NUM', '25'),
 ('Poll', 'PROPN', 'Poll'),
 ('     ', 'SPACE', '     '),
 ('Entertainment', 'PROPN', 'Entertainment'),
 ('Movie', 'PROPN', 'Movie'),
 ('reviews', 'VERB', 'review'),
 ('     ', 'SPACE', '     '),
 ('Book', 'PROPN', 'Book'),
 ('reviews', 'NOUN', 'review'),
 ('     ', 'SPACE', '     '),
 ('Celebrity', 'PROPN', 'Celebrity'),
 ('     ', 'SPACE', '     '),
 ('Television', 'PROPN', 'Television'),
 ('     ', 'SPACE', '     '),
 ('Music', 'PROPN', 'Music'),
 ('     ', 'SPACE', '     '),
 ('Business', 'PROPN', 'Business'),
 ('Inflation', 'PROPN', 'Inflation'),
 ('     ', 'SPACE', '     '),
 ('Personal', 'ADJ', 'personal'),
 ('finance', 'NOUN', 'finance'),
 ('     ', 'SPACE', '     '),
 ('Financial', 'PROPN', 'Financial'),
 ('Markets', 'PROPN', 'Markets'),
 ('     ', 'SPACE', '     '),
 ('Business', 'PROPN', 'Business'),
 ('Highlights', 'PROPN', 'Highlights'),
 ('     ', 'SPACE', '     '),
 ('Financial', 'PROPN', 'Financial'),
 ('wellness', 'NOUN', 'wellness'),
 ('     ', '

### VISUALIZE DEPENDENCY PARSING

In [62]:
# Render a visualization of the dependency parsing for the 2nd sentence (index 1) of the extracted article text with adjusted distance between words.

displacy.render(nlp(str(sentences[1])), style='dep', jupyter = True, options = {'distance': 120})

## 3. CONVERT URL TO TEXT AND COUNT ENTITIES

In [50]:
# Convert the content of a given URL into text and count the identified entities.

ny_bb = url_to_string('https://www.livescience.com/technology/artificial-intelligence')
article = nlp(ny_bb)
len(article.ents)

100

### VISUALIZE ENTITIES IN TEXT

In [51]:
# Render a visualization of the identified entities in the extracted article text.

displacy.render(article, style='ent', jupyter=True)

### COUNT ENTITY LABELS

In [52]:
# Count the occurrence of each entity label in the extracted article text.

from collections import Counter

labels = [x.label_ for x in article.ents]
Counter(labels)

Counter({'ORG': 26,
         'PERSON': 16,
         'CARDINAL': 16,
         'DATE': 13,
         'PRODUCT': 11,
         'NORP': 4,
         'TIME': 3,
         'GPE': 3,
         'FAC': 2,
         'PERCENT': 2,
         'LOC': 2,
         'ORDINAL': 2})

### COUNT MOST COMMON ENTITIES

In [53]:
# Count the 25 most common entities in the extracted article text.

items = [x.text for x in article.ents]
Counter(items).most_common(25)

[('AI', 13),
 ('Keumars Afifi-Sabet', 7),
 ('OpenAI', 3),
 ('February 24', 2),
 ('Meta', 2),
 ('Mark Zuckerberg', 2),
 ('19', 2),
 ('January 24', 2),
 ('18', 2),
 ('Search Live Science   Subscribe RSS           Space Health Planet Earth Animals Archaeology Physics & Math Human Behavior Technology Chemistry More Science',
  1),
 ('TrendingPrincess of Wales', 1),
 ('Webb Space', 1),
 ('TelescopeApril 8', 1),
 ('Artificial Intelligence', 1),
 ('Google', 1),
 ('2,000-year-old', 1),
 ('Artificial Intelligence  Scientists', 1),
 ('Roland Moore-Coyler', 1),
 ('22', 1),
 ('March 24', 1),
 ('Artificial Intelligence Researchers', 1),
 ('20 March 24', 1),
 ('2027', 1),
 ('6 March 24', 1),
 ('three to eight years', 1)]

### PRINT SPECIFIC SENTENCE

In [54]:
# Print the 26th sentence from the extracted article text.

sentences = [x for x in article.sents]
print(sentences[25])

MOST READMOST SHARED1One of our favorite Garmin watches is now half-price at Walmart — and it's an ideal running companion2Mass grave of plague victims may be largest ever found in Europe, archaeologists say3India's evolutionary past tied to huge migration 50,000 years ago and to now-extinct human relatives41,900-year-old coins from Jewish revolt against the Romans discovered in the Judaen desert5Dying SpaceX rocket creates glowing, galaxy-like spiral in the middle of the Northern Lights1'Potentially hazardous' asteroid Bennu contains the building blocks of life and minerals unseen on Earth, scientists reveal in 1st comprehensive analysis2Speck of light spotted by Hubble is one of the most enormous galaxies in the early universe, James Webb telescope reveals38-hour intermittent fasting tied to 90% higher risk of cardiovascular death, early data hint4James Webb telescope confirms there is something seriously wrong with our understanding of the universe5Beluga whales appear to change the

### VISUALIZE ENTITIES IN SPECIFIC SENTENCE

In [55]:
# Render a visualization of the identified entities in the 26th sentence of the extracted article text.

displacy.render(nlp(str(sentences[25])), jupyter=True, style='ent')

### EXTRACT WORDS WITH PARTS OF SPEECH AND LEMMAS

In [56]:
# Extract words along with their parts of speech and lemmas from the 26th sentence of the extracted article text, excluding stop words and punctuation.

[(t.orth_, t.pos_, t.lemma_)
 for t in nlp(str(sentences[25]))
 if not t.is_stop and t.pos_ != 'PUNCT']

[('READMOST', 'PROPN', 'READMOST'),
 ('SHARED1One', 'PROPN', 'SHARED1One'),
 ('favorite', 'ADJ', 'favorite'),
 ('Garmin', 'PROPN', 'Garmin'),
 ('watches', 'NOUN', 'watch'),
 ('half', 'ADJ', 'half'),
 ('price', 'NOUN', 'price'),
 ('Walmart', 'PROPN', 'Walmart'),
 ('ideal', 'ADJ', 'ideal'),
 ('running', 'VERB', 'run'),
 ('companion2Mass', 'PROPN', 'companion2Mass'),
 ('grave', 'NOUN', 'grave'),
 ('plague', 'NOUN', 'plague'),
 ('victims', 'NOUN', 'victim'),
 ('largest', 'AUX', 'largest'),
 ('found', 'VERB', 'find'),
 ('Europe', 'PROPN', 'Europe'),
 ('archaeologists', 'NOUN', 'archaeologist'),
 ('say3India', 'PROPN', 'say3India'),
 ('evolutionary', 'ADJ', 'evolutionary'),
 ('past', 'NOUN', 'past'),
 ('tied', 'VERB', 'tie'),
 ('huge', 'ADJ', 'huge'),
 ('migration', 'NOUN', 'migration'),
 ('50,000', 'NUM', '50,000'),
 ('years', 'NOUN', 'year'),
 ('ago', 'ADV', 'ago'),
 ('extinct', 'ADJ', 'extinct'),
 ('human', 'ADJ', 'human'),
 ('relatives41,900', 'PROPN', 'relatives41,900'),
 ('year', 'NOUN

### VISUALIZE DEPENDENCY PARSING

In [60]:
# Render a visualization of the dependency parsing for the 3rd sentence (index 2) of the extracted article text with adjusted distance between words.

displacy.render(nlp(str(sentences[2])), style='dep', jupyter = True, options = {'distance': 120})