In [3]:
import pandas as pd
import spacy 
from spacy import displacy

### SpaCy

#### Loading SpaCy Models:  
- en_core_web_sm: English multi-task CNN trained on OntoNotes. Size – 11 MB
- en_core_web_md: English multi-task CNN trained on OntoNotes, with GloVe vectors trained on Common Crawl. Size – 91 MB
- en_core_web_lg: English multi-task CNN trained on OntoNotes, with GloVe vectors trained on Common Crawl. Size – 789 MB
- en_core_web_trf: English transformer pipeline (roberta-base). Components: transformer, tagger, parser, ner, attribute_ruler, lemmatizer. Size – 438 MB
- **en_core_web_trf** is available in SpaCy 3.X only

In [4]:
# !python -m spacy download en_core_web_sm
# !python -m spacy download en_core_web_md

In [5]:
# Load SpaCy model
nlp = spacy.load("en_core_web_sm")
# nlp = spacy.load("en_core_web_md")
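
`spacy.load` raises `OSError` when the requested package has not been downloaded. The sketch below shows one way to handle that case. The helper name `load_pipeline` and the fallback to a blank English pipeline are assumptions made for illustration, not part of the notebook.

```python
import spacy


def load_pipeline(name="en_core_web_sm"):
    """Return the trained pipeline if installed, else a blank English pipeline."""
    try:
        return spacy.load(name)
    except OSError:
        # Raised by spacy.load when the model package is not installed
        print(f"Model '{name}' not found. Install it with: python -m spacy download {name}")
        return spacy.blank("en")


nlp_demo = load_pipeline("en_core_web_sm")
print(nlp_demo.pipe_names)
```

A blank pipeline only tokenizes: it has no tagger, parser, or NER component, so the annotations used in the cells below stay empty until a trained model is installed.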

#### SpaCy for POS tagging - token level
- The SpaCy pipeline automatically splits the text into tokens and analyzes each token's part of speech (POS)

In [6]:
text = '''In the latest trading session, Caterpillar (CAT) closed at $113.22, marking a +1.52% move from the previous day. 
This change outpaced the S&P 500's 0.58% gain on the day. At the same time, the Dow added 0.14%, and the tech-heavy Nasdaq gained 1.66%.
Heading into today, shares of the construction equipment company had gained 11.4% over the past month, outpacing the Industrial Products 
sector's loss of 2.53% and the S&P 500's gain of 3.15% in that time.'''

In [7]:
# token level analysis
import pandas as pd

doc = nlp(text)

data_tokens = {
    'Words': [token.text for token in doc],
    'Lemma': [token.lemma_ for token in doc],
    'POS_Tags': [token.tag_ for token in doc],
    'POS': [token.pos_ for token in doc],
    'Dependency': [token.dep_ for token in doc],
    'Shape': [token.shape_ for token in doc],
    'Alpha': [token.is_alpha for token in doc],
    'Numeric': [token.like_num for token in doc],
    'Stopword': [token.is_stop for token in doc]
}

df_tokens = pd.DataFrame(data_tokens)
df_tokens

Unnamed: 0,Words,Lemma,POS_Tags,POS,Dependency,Shape,Alpha,Numeric,Stopword
0,In,in,IN,ADP,prep,Xx,True,False,True
1,the,the,DT,DET,det,xxx,True,False,True
2,latest,late,JJS,ADJ,amod,xxxx,True,False,False
3,trading,trading,NN,NOUN,compound,xxxx,True,False,False
4,session,session,NN,NOUN,pobj,xxxx,True,False,False
...,...,...,...,...,...,...,...,...,...
100,%,%,NN,NOUN,pobj,%,False,False,False
101,in,in,IN,ADP,prep,xx,True,False,True
102,that,that,DT,DET,det,xxxx,True,False,True
103,time,time,NN,NOUN,pobj,xxxx,True,False,False
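
In the table, `POS_Tags` (`token.tag_`) holds the fine-grained tag and `POS` (`token.pos_`) holds the coarse universal tag. `spacy.explain` looks up a short description for either kind of label, and for dependency labels too. A small sketch using labels that appear in the table:

```python
import spacy

# Labels copied from the POS_Tags, POS and Dependency columns above
labels_to_explain = ["JJS", "NN", "ADP", "DET", "amod", "pobj"]

explanations = {label: spacy.explain(label) for label in labels_to_explain}
for label, description in explanations.items():
    print(f"{label:>5}: {description}")
```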


In [8]:
for token in doc:
    # Only inspect tokens that resemble a number
    if token.like_num:
        # Up to two tokens of context on each side; slicing keeps the window inside the document
        prior = doc[max(token.i - 2, 0):token.i]
        following = doc[token.i + 1:token.i + 3]
        # A "%" within the next two tokens marks the number as a percentage
        if any(t.text == "%" for t in following):
            print("Percentage:::", *prior, token.text, *following)
        else:
            print("Numbers:::", *prior, token.text, *following)

Numbers::: at $ 113.22 , marking
Percentage::: marking a +1.52 % move
Numbers::: the S&P 500 's 0.58
Percentage::: 500 's 0.58 % gain
Percentage::: Dow added 0.14 % ,
Percentage::: Nasdaq gained 1.66 % .
Percentage::: had gained 11.4 % over
Percentage::: loss of 2.53 % and
Numbers::: the S&P 500 's gain
Percentage::: gain of 3.15 % in
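
The same kind of lookup can also be written as token patterns with spaCy's rule-based `Matcher`, which is a different technique from the loop above. This is a minimal sketch: the rule names `PERCENT_RULE` and `DOLLAR_RULE` and the sample sentence are made up for illustration, and a blank English pipeline is used because `LIKE_NUM` and `ORTH` do not need a trained model.

```python
import spacy
from spacy.matcher import Matcher

nlp_rules = spacy.blank("en")  # tokenizer only
matcher = Matcher(nlp_rules.vocab)

# A number-like token followed by "%", and "$" followed by a number-like token
matcher.add("PERCENT_RULE", [[{"LIKE_NUM": True}, {"ORTH": "%"}]])
matcher.add("DOLLAR_RULE", [[{"ORTH": "$"}, {"LIKE_NUM": True}]])

sample = ("Caterpillar closed at $113.22, while the Dow added 0.14%, "
          "and the Nasdaq gained 1.66% today.")
doc_rules = nlp_rules(sample)

# Each match is (match_id, start_token, end_token); sort by position in the text
found = sorted(
    (start, nlp_rules.vocab.strings[match_id], doc_rules[start:end].text)
    for match_id, start, end in matcher(doc_rules)
)
for _, rule, span_text in found:
    print(rule, span_text)
```

Patterns keep the matching logic declarative, which may be easier to extend than index arithmetic when more number formats need to be covered.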


### NER with SpaCy

The entity recognizer identifies non-overlapping labelled spans of tokens. The transition-based algorithm used encodes certain assumptions that are effective for “traditional” named entity recognition tasks, but may not be a good fit for every span identification problem. Specifically, the loss function optimizes for whole entity accuracy, so if your inter-annotator agreement on boundary tokens is low, the component will likely perform poorly on your problem.

Predictions will be saved to Doc.ents as a tuple. Each label will also be reflected to each underlying token, where it is saved in the Token.ent_type and Token.ent_iob fields. Note that by definition each token can only have one label.

When setting Doc.ents to create training data, all the spans must be valid and non-overlapping, or an error will be thrown.

| LOCATION        | VALUE                                              | TYPE         |
|-----------------|----------------------------------------------------|--------------|
| `Doc.ents`      | The annotated spans.                               | `Tuple[Span]`|
| `Token.ent_iob` | An enum encoding of the IOB part of the named entity tag. | `int`    |
| `Token.ent_iob_`| The IOB part of the named entity tag.              | `str`        |
| `Token.ent_type`| The label part of the named entity tag (hash).     | `int`        |
| `Token.ent_type_`| The label part of the named entity tag.           | `str`        |


More: https://spacy.io/api/entityrecognizer
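
To make the relationship between `Doc.ents` and the token-level fields concrete, the sketch below sets entities by hand on a blank pipeline and reads them back per token. The sentence, the variable names, and the hand-picked token offsets are illustrative assumptions. No trained NER model is involved.

```python
import spacy
from spacy.tokens import Span

nlp_blank = spacy.blank("en")  # tokenizer only, no trained NER
doc_demo = nlp_blank("Larry De Maria works at William Blair")

# Token offsets: Larry=0, De=1, Maria=2, works=3, at=4, William=5, Blair=6
person = Span(doc_demo, 0, 3, label="PERSON")
org = Span(doc_demo, 5, 7, label="ORG")
doc_demo.ents = [person, org]

# Each entity label is mirrored onto its tokens as an IOB code plus a type
for token in doc_demo:
    print(f"{token.text:<8} iob={token.ent_iob_!r:<4} type={token.ent_type_!r}")

# Overlapping spans (both cover the token "Maria") are rejected
doc_overlap = nlp_blank("Larry De Maria works at William Blair")
overlap_rejected = False
try:
    doc_overlap.ents = [
        Span(doc_overlap, 0, 3, label="PERSON"),
        Span(doc_overlap, 2, 4, label="ORG"),
    ]
except ValueError:
    overlap_rejected = True
print("Overlap rejected:", overlap_rejected)
```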


In [9]:
# Checking active pipeline components
nlp.pipe_names

['tok2vec', 'tagger', 'parser', 'attribute_ruler', 'lemmatizer', 'ner']

In [10]:
text = '''Surging Chinese demand and an improving U.S. economy have lifted sales of Caterpillar's signature yellow mining and construction machines. 
Now, with the pace of growth quickening in Latin America and Europe, the company is projecting higher earnings for 2018 than analysts estimated.  
The outlook from Caterpillar, considered an economic bellwether, comes as industries from manufacturing to services report increased sales and orders 
that have fueled record equity prices and buoyed investor expectations for this year. This week, the International Monetary Fund raised its estimate 
for 2018 global growth to the fastest in seven years.  Caterpillar's results showed strength across the board in nearly every industry for the first time, 
which indicated coordinated and synchronized macroeconomic growth, Larry De Maria, an analyst at William Blair & Co., said in an interview. 
It's a good harbinger for overall economic activity.'''

In [11]:
doc = nlp(text)

data_ents = {
    'Entities': [ent.text for ent in doc.ents],
    'Labels': [ent.label_ for ent in doc.ents],
    'Position_Start': [ent.start_char for ent in doc.ents],
    'Position_End': [ent.end_char for ent in doc.ents]
}

df_ents = pd.DataFrame(data_ents)

# Display the first 20 entries of the DataFrame
df_ents.head(20)

Unnamed: 0,Entities,Labels,Position_Start,Position_End
0,Chinese,NORP,8,15
1,U.S.,GPE,40,44
2,Caterpillar,ORG,74,85
3,Latin America,LOC,183,196
4,Europe,LOC,201,207
5,2018,DATE,255,259
6,Caterpillar,ORG,304,315
7,this year,DATE,513,522
8,This week,DATE,524,533
9,the International Monetary Fund,ORG,535,566


In [12]:
# examples of entity explanation
spacy.explain("FAC")

'Buildings, airports, highways, bridges, etc.'

In [13]:
spacy.explain("NORP")

'Nationalities or religious or political groups'

In [14]:
spacy.explain("GPE")

'Countries, cities, states'

In [15]:
# Organization Entity
df_ents[df_ents.Labels == 'ORG']

Unnamed: 0,Entities,Labels,Position_Start,Position_End
2,Caterpillar,ORG,74,85
6,Caterpillar,ORG,304,315
9,the International Monetary Fund,ORG,535,566
12,Caterpillar,ORG,643,654
15,William Blair & Co.,ORG,841,860


In [16]:
# Person Entity
df_ents[df_ents.Labels == 'PERSON']

Unnamed: 0,Entities,Labels,Position_Start,Position_End
14,Larry De Maria,PERSON,811,825


In [17]:
# Geopolitical Entity
df_ents[df_ents.Labels == 'GPE']

Unnamed: 0,Entities,Labels,Position_Start,Position_End
1,U.S.,GPE,40,44
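
Rather than filtering one label at a time, the entity table can be summarized in a single step. A small sketch, assuming a DataFrame shaped like `df_ents`; here it is rebuilt from the ten rows displayed earlier so the example runs on its own, and the name `df_demo` is made up.

```python
import pandas as pd

# The ten rows shown in the earlier df_ents.head(20) output
df_demo = pd.DataFrame({
    "Entities": ["Chinese", "U.S.", "Caterpillar", "Latin America", "Europe",
                 "2018", "Caterpillar", "this year", "This week",
                 "the International Monetary Fund"],
    "Labels": ["NORP", "GPE", "ORG", "LOC", "LOC",
               "DATE", "ORG", "DATE", "DATE", "ORG"],
})

# How many entities of each type were found
label_counts = df_demo["Labels"].value_counts()
print(label_counts)

# Select several labels at once instead of one filter per label
orgs_and_places = df_demo[df_demo["Labels"].isin(["ORG", "GPE", "LOC"])]
print(orgs_and_places)
```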


In [18]:
# view all the Entity labels
nlp.get_pipe("ner").labels

('CARDINAL',
 'DATE',
 'EVENT',
 'FAC',
 'GPE',
 'LANGUAGE',
 'LAW',
 'LOC',
 'MONEY',
 'NORP',
 'ORDINAL',
 'ORG',
 'PERCENT',
 'PERSON',
 'PRODUCT',
 'QUANTITY',
 'TIME',
 'WORK_OF_ART')
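
The label list and `spacy.explain` can be combined into a small glossary. The sketch below hard-codes the labels printed above so that it runs without a loaded model; with a model loaded, `nlp.get_pipe("ner").labels` could be used in their place.

```python
import spacy

# Labels copied from the output above
ner_labels = ("CARDINAL", "DATE", "EVENT", "FAC", "GPE", "LANGUAGE", "LAW",
              "LOC", "MONEY", "NORP", "ORDINAL", "ORG", "PERCENT", "PERSON",
              "PRODUCT", "QUANTITY", "TIME", "WORK_OF_ART")

# Map every label to its built-in description
glossary = {label: spacy.explain(label) for label in ner_labels}
for label, description in glossary.items():
    print(f"{label:<12} {description}")
```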

In [19]:
# Another example
text = '''The NBC Tower is an office tower on the Near North Side of Chicago, Illinois located at 454 North Columbus Drive in downtown 
Chicago's Magnificent Mile area. Completed in 1989, the 37-story building reaches a height of 627 feet'''

In [21]:
# switch to the medium model (en_core_web_md)
nlp = spacy.load("en_core_web_md")

In [22]:
doc = nlp(text)

data_ents = {
    'Entities': [ent.text for ent in doc.ents],
    'Labels': [ent.label_ for ent in doc.ents],
    'Position_Start': [ent.start_char for ent in doc.ents],
    'Position_End': [ent.end_char for ent in doc.ents]
}

df_ents = pd.DataFrame(data_ents)

df_ents

Unnamed: 0,Entities,Labels,Position_Start,Position_End
0,The NBC Tower,ORG,0,13
1,the Near North Side,LOC,36,55
2,Chicago,GPE,59,66
3,Illinois,GPE,68,76
4,454,CARDINAL,88,91
5,North Columbus Drive,FAC,92,112
6,Chicago,GPE,126,133
7,Magnificent Mile,LOC,136,152
8,1989,DATE,172,176
9,37,CARDINAL,182,184


In [23]:
# visualize with SpaCy
displacy.render(doc, jupyter = True, style = "ent")
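
Outside a notebook, `displacy.render` can return the markup as a string when `jupyter=False` is passed. The sketch below also uses displaCy's manual mode (`manual=True`), which takes a plain dict with character offsets instead of a `Doc`, so no model is needed. The helper `char_span` and the shortened sample text are assumptions for illustration.

```python
from spacy import displacy

sample = "The NBC Tower is an office tower on the Near North Side of Chicago, Illinois"


def char_span(text, phrase, label):
    """Build a displaCy entity dict from the first occurrence of phrase."""
    start = text.find(phrase)
    return {"start": start, "end": start + len(phrase), "label": label}


manual_doc = {
    "text": sample,
    # Entities must be listed in order of appearance
    "ents": [char_span(sample, "Chicago", "GPE"),
             char_span(sample, "Illinois", "GPE")],
    "title": None,
}

# page=True wraps the markup in a complete HTML page
html = displacy.render(manual_doc, style="ent", manual=True, jupyter=False, page=True)
print(html[:200])
```

The returned string can be written to an `.html` file and opened in a browser.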

In [24]:
with open("./data/ioi.txt", "r") as f:
    ioi_text = f.read()

In [25]:
ioi_text[:100]

'Quantum Capital Partners\n100 Investment Blvd, Suite 3000\nNew York, NY 10020\nJuly 16, 2024\n\nCONFIDENT'

In [26]:
doc = nlp(ioi_text)

data_ents = {
    'Entities': [ent.text for ent in doc.ents],
    'Labels': [ent.label_ for ent in doc.ents],
    'Position_Start': [ent.start_char for ent in doc.ents],
    'Position_End': [ent.end_char for ent in doc.ents]
}

df_ents = pd.DataFrame(data_ents)

# Display the first 20 entries of the DataFrame
df_ents.head(20)

Unnamed: 0,Entities,Labels,Position_Start,Position_End
0,Quantum Capital Partners,ORG,0,24
1,100,CARDINAL,25,28
2,New York,GPE,57,65
3,NY,GPE,67,69
4,"July 16, 2024",DATE,76,89
5,CONFIDENTIAL\n\nBoard of Directors,ORG,91,123
6,Zephyr Technologies Inc.,ORG,124,148
7,400,CARDINAL,149,152
8,San Jose,GPE,168,176
9,CA,GPE,178,180


### NER with NLTK

In [32]:
import nltk
# nltk.download('popular', halt_on_error=False)
# nltk.download('all', halt_on_error=False)

In [28]:
import os
import requests
import nltk.corpus  
from nltk.text import Text
import pandas as pd
import re
import sys

In [29]:
text = '''Surging Chinese demand and an improving U.S. economy have lifted sales of Caterpillar's signature yellow mining and construction machines. 
Now, with the pace of growth quickening in Latin America and Europe, the company is projecting higher earnings for 2018 than analysts estimated.  
The outlook from Caterpillar, considered an economic bellwether, comes as industries from manufacturing to services report increased sales and orders 
that have fueled record equity prices and buoyed investor expectations for this year. This week, the International Monetary Fund raised its estimate 
for 2018 global growth to the fastest in seven years.  Caterpillar's results showed strength across the board in nearly every industry for the first time, 
which indicated coordinated and synchronized macroeconomic growth, Larry De Maria, an analyst at William Blair & Co., said in an interview. 
It's a good harbinger for overall economic activity.'''

In [30]:
# nltk.ne_chunk returns an nltk.Tree: named entities are labelled subtrees, all other tokens are plain (word, tag) tuples.
# We have to traverse the tree to collect the entities.

entities = []
labels = []
for chunk in nltk.ne_chunk(nltk.pos_tag(nltk.word_tokenize(text))): # default binary=False keeps typed labels; binary=True would tag every entity as NE
    if hasattr(chunk, 'label'):
        entities.append(' '.join(c[0] for c in chunk)) # join multi-token entities with spaces
        labels.append(chunk.label())

#entities_labels = list(zip(entities, labels))
entities_labels = list(set(zip(entities, labels))) #unique entities
entities_labels

[('Latin America', 'GPE'),
 ('William Blair', 'PERSON'),
 ('Larry De Maria', 'PERSON'),
 ('U.S.', 'GPE'),
 ('Chinese', 'GPE'),
 ('International Monetary Fund', 'ORGANIZATION'),
 ('Europe', 'GPE'),
 ('Caterpillar', 'GPE'),
 ('Caterpillar', 'PERSON')]
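
The traversal is easier to follow on a small tree. The sketch below builds an `nltk.Tree` by hand as a stand-in for the output of `nltk.ne_chunk`, so it needs no downloaded NLTK data. The sentence and POS tags are made up for illustration; the two entity labels are taken from the output above.

```python
from nltk.tree import Tree

# Hand-built stand-in for an ne_chunk result: entity subtrees mixed with (word, tag) tuples
chunked = Tree("S", [
    Tree("PERSON", [("Larry", "NNP"), ("De", "NNP"), ("Maria", "NNP")]),
    ("cited", "VBD"),
    ("the", "DT"),
    Tree("ORGANIZATION", [("International", "NNP"), ("Monetary", "NNP"), ("Fund", "NNP")]),
])

# Only subtrees carry an entity label; plain tuples are ordinary tokens
pairs = [
    (" ".join(word for word, tag in node.leaves()), node.label())
    for node in chunked
    if isinstance(node, Tree)
]
print(pairs)
```

Checking `isinstance(node, Tree)` plays the same role as `hasattr(chunk, 'label')` in the cell above.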

In [31]:
# Alternative NER, separating by sentences first, then by tokens
entities = []
labels = []

for sent in nltk.sent_tokenize(text):
    for chunk in nltk.ne_chunk(nltk.pos_tag(nltk.word_tokenize(sent)), binary = False):
        if hasattr(chunk, 'label'):
            entities.append(' '.join(c[0] for c in chunk)) # join multi-token entities with spaces
            labels.append(chunk.label())

#entities_labels = list(zip(entities, labels))
entities_labels = list(set(zip(entities, labels))) #unique entities
entities_labels

[('Latin America', 'GPE'),
 ('William Blair', 'PERSON'),
 ('Larry De Maria', 'PERSON'),
 ('U.S.', 'GPE'),
 ('Chinese', 'GPE'),
 ('International Monetary Fund', 'ORGANIZATION'),
 ('Europe', 'GPE'),
 ('Caterpillar', 'GPE')]
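
`set` removes duplicates but does not keep the order in which entities appear in the text. A small sketch of an order-preserving alternative that also flags entities tagged with more than one label, as happened to Caterpillar in the whole-text run; the short `pairs` list is illustrative input.

```python
# (entity, label) pairs in order of appearance, with a repeat and a conflicting label
pairs = [
    ("Chinese", "GPE"),
    ("Caterpillar", "GPE"),
    ("Caterpillar", "PERSON"),
    ("Europe", "GPE"),
    ("Caterpillar", "GPE"),
]

# dict.fromkeys drops duplicates and keeps first-seen order
unique_pairs = list(dict.fromkeys(pairs))
print(unique_pairs)

# Group labels per entity to spot inconsistent tagging
labels_by_entity = {}
for entity, label in unique_pairs:
    labels_by_entity.setdefault(entity, []).append(label)

conflicts = {e: labels for e, labels in labels_by_entity.items() if len(labels) > 1}
print(conflicts)
```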