
In [46]:
import spacy

In [47]:
nlp = spacy.load("en_core_web_sm")

In [48]:
nlp.pipe_names

['tok2vec', 'tagger', 'parser', 'attribute_ruler', 'lemmatizer', 'ner']

In [49]:
nlp.pipe_labels['ner']

['CARDINAL',
 'DATE',
 'EVENT',
 'FAC',
 'GPE',
 'LANGUAGE',
 'LAW',
 'LOC',
 'MONEY',
 'NORP',
 'ORDINAL',
 'ORG',
 'PERCENT',
 'PERSON',
 'PRODUCT',
 'QUANTITY',
 'TIME',
 'WORK_OF_ART']

In [50]:
text = 'Bob felt like crap after eating Taco Bell on Tuesday.'

In [51]:
doc = nlp(text)

In [52]:
for ent in doc.ents:
    print(ent.text, '|', ent.label_, '|', spacy.explain(ent.label_))

Bob | PERSON | People, including fictional
Taco Bell | ORG | Companies, agencies, institutions, etc.
Tuesday | DATE | Absolute or relative dates or periods


In [53]:
from spacy import displacy

displacy.render(doc, style = 'ent')

In [54]:
text1 = 'What is Nikola Tesla founded the car company Tesla?'

In [55]:
doc1 = nlp(text1)

In [56]:
for ent in doc1.ents:
    print(ent.text, '|', ent.label_, '|', spacy.explain(ent.label_))

Tesla | ORG | Companies, agencies, institutions, etc.


In [57]:
displacy.render(doc1, style = 'ent')

In [58]:
for token in doc1:
    print(token.text, '|', token.pos_)

What | PRON
is | AUX
Nikola | PROPN
Tesla | PROPN
founded | VERB
the | DET
car | NOUN
company | NOUN
Tesla | PROPN
? | PUNCT


In [59]:
from transformers import AutoModelForTokenClassification, AutoTokenizer, pipeline

tokenizer = AutoTokenizer.from_pretrained("dslim/bert-base-NER")
model = AutoModelForTokenClassification.from_pretrained("dslim/bert-base-NER")

Some weights of the model checkpoint at dslim/bert-base-NER were not used when initializing BertForTokenClassification: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight']
- This IS expected if you are initializing BertForTokenClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForTokenClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


In [60]:
# With no aggregation strategy, the pipeline returns one result per word piece.
nlp1 = pipeline("ner", model=model, tokenizer=tokenizer)

# Reuse the sentence that spaCy only partly tagged above.
ner_results = nlp1(text1)
print(ner_results)

[{'entity': 'B-PER', 'score': 0.9991315, 'index': 3, 'word': 'Nikola', 'start': 8, 'end': 14}, {'entity': 'I-PER', 'score': 0.9959331, 'index': 4, 'word': 'Te', 'start': 15, 'end': 17}, {'entity': 'I-PER', 'score': 0.9948672, 'index': 5, 'word': '##sla', 'start': 17, 'end': 20}, {'entity': 'B-ORG', 'score': 0.9907079, 'index': 10, 'word': 'Te', 'start': 45, 'end': 47}, {'entity': 'I-ORG', 'score': 0.97715366, 'index': 11, 'word': '##sla', 'start': 47, 'end': 50}]


In [61]:
for result in ner_results:
    print(result['word'], '|', result)

Nikola | {'entity': 'B-PER', 'score': 0.9991315, 'index': 3, 'word': 'Nikola', 'start': 8, 'end': 14}
Te | {'entity': 'I-PER', 'score': 0.9959331, 'index': 4, 'word': 'Te', 'start': 15, 'end': 17}
##sla | {'entity': 'I-PER', 'score': 0.9948672, 'index': 5, 'word': '##sla', 'start': 17, 'end': 20}
Te | {'entity': 'B-ORG', 'score': 0.9907079, 'index': 10, 'word': 'Te', 'start': 45, 'end': 47}
##sla | {'entity': 'I-ORG', 'score': 0.97715366, 'index': 11, 'word': '##sla', 'start': 47, 'end': 50}
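
The output above is per word piece ("Te", "##sla"), not per entity. The transformers pipeline accepts `aggregation_strategy="simple"` to group pieces for you; the sketch below does a similar grouping by hand so the logic is visible. It is a minimal illustration, not the library's implementation: `merge_wordpieces` is a hypothetical helper, it assumes the B-/I- tags and character offsets shown above, and it reports the lowest piece score as the entity score.

```python
def merge_wordpieces(text, token_results):
    """Group B-/I- tagged word pieces into whole entities using character offsets."""
    entities = []
    for tok in token_results:
        prefix, _, label = tok["entity"].partition("-")
        # A piece continues the previous entity if it is tagged I-, has the same
        # label, and starts at most one character (a single space) after it.
        continues = (
            entities
            and prefix == "I"
            and entities[-1]["label"] == label
            and tok["start"] - entities[-1]["end"] <= 1
        )
        if continues:
            entities[-1]["end"] = tok["end"]
            entities[-1]["scores"].append(tok["score"])
        else:
            entities.append({"label": label, "start": tok["start"],
                             "end": tok["end"], "scores": [tok["score"]]})
    # Slice the original text so "##" markers never appear in the result.
    return [
        {"text": text[e["start"]:e["end"]], "label": e["label"], "score": min(e["scores"])}
        for e in entities
    ]


sentence = "What is Nikola Tesla founded the car company Tesla?"
# Token-level results copied from the pipeline output above.
token_results = [
    {"entity": "B-PER", "score": 0.9991315, "index": 3, "word": "Nikola", "start": 8, "end": 14},
    {"entity": "I-PER", "score": 0.9959331, "index": 4, "word": "Te", "start": 15, "end": 17},
    {"entity": "I-PER", "score": 0.9948672, "index": 5, "word": "##sla", "start": 17, "end": 20},
    {"entity": "B-ORG", "score": 0.9907079, "index": 10, "word": "Te", "start": 45, "end": 47},
    {"entity": "I-ORG", "score": 0.97715366, "index": 11, "word": "##sla", "start": 47, "end": 50},
]

merged = merge_wordpieces(sentence, token_results)
for entity in merged:
    print(entity["text"], "|", entity["label"])
```

Slicing the sentence by `start` and `end` avoids having to strip the `##` continuation markers from the word pieces.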


In [62]:
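from spacy.tokens import Span

# Tokens 2 and 3 are "Nikola" and "Tesla"; the end index (4) is exclusive.
s1 = Span(doc1, 2, 4, label="PERSON")

# Append the manual span to the entities the model already found.
doc1.ents = list(doc1.ents) + [s1]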

In [63]:
for ent in doc1.ents:
    print(ent.text, '|', ent.label_, '|', spacy.explain(ent.label_))

displacy.render(doc1, style = 'ent')

Nikola Tesla | PERSON | People, including fictional
Tesla | ORG | Companies, agencies, institutions, etc.
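
Adding the span works here because tokens 2 to 3 do not overlap the existing `Tesla` entity at token 8; spaCy raises an error if the spans assigned to `doc.ents` overlap. spaCy ships `spacy.util.filter_spans` for resolving such conflicts. The sketch below shows the same longest-span-first idea in plain Python on `(start, end, label)` tuples. `filter_overlapping` is a hypothetical helper, and the `(3, 4, "ORG")` span is an invented conflict for illustration, not something the model produced.

```python
def filter_overlapping(spans):
    """Keep the longest spans first; drop any span sharing a token with a kept one."""
    # Sort by length (longest first), then by start index (earliest first).
    ordered = sorted(spans, key=lambda s: (s[1] - s[0], -s[0]), reverse=True)
    kept, seen_tokens = [], set()
    for start, end, label in ordered:
        if not any(i in seen_tokens for i in range(start, end)):
            kept.append((start, end, label))
            seen_tokens.update(range(start, end))
    return sorted(kept)


candidate_spans = [
    (8, 9, "ORG"),     # "Tesla" (the company), found by the model
    (2, 4, "PERSON"),  # "Nikola Tesla", added by hand
    (3, 4, "ORG"),     # hypothetical conflicting span over the first "Tesla"
]

resolved = filter_overlapping(candidate_spans)
print(resolved)
```

The end index is exclusive, matching `Span(doc1, 2, 4, ...)`.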


In [64]:
text2 = """Joe want to know the famous foods in each city. So, he opened Google and search for this question. Google showed that
in Boston it is clam chowder, in Memphis it is barbeque, in Seattle it is hot dogs, in LA it is chicken fingers, in Chicago it is pizza,
in San Francisco it is fried rice and so on for all other citites."""

doc2 = nlp(text2)

In [70]:
# GPE covers countries, cities, and states; every GPE in this text happens to be a city.
cities = [ent.text for ent in doc2.ents if ent.label_ == 'GPE']

print(cities)
print(len(cities))

['Boston', 'Memphis', 'Seattle', 'LA', 'Chicago', 'San Francisco']
6
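
The count above is the number of GPE mentions, not the number of distinct places: a city named twice would be counted twice. A minimal sketch with `collections.Counter` separates the two. It starts from the list printed above and appends one hypothetical repeated mention, which is not in the original text, to show the difference.

```python
from collections import Counter

# GPE mentions copied from the output above.
gpe_mentions = ['Boston', 'Memphis', 'Seattle', 'LA', 'Chicago', 'San Francisco']
gpe_mentions.append('Boston')  # hypothetical repeated mention, for illustration only

mention_counts = Counter(gpe_mentions)

print(len(gpe_mentions))    # total mentions
print(len(mention_counts))  # distinct places
print(mention_counts['Boston'])
```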


In [74]:
text3 = """Joe was born on 24 April 1973, Albert was born on November 5, 1988, Mike was born on 4.13.48,
and finally Ricky was born on December 3 1974."""

doc3 = nlp(text3)

In [75]:
dates = [ent.text for ent in doc3.ents if ent.label_ == 'DATE']

print(dates)

['24 April 1973', 'November 5, 1988', '4.13.48', 'December 3 1974']
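
spaCy returns the dates as raw strings in four different layouts. If actual date objects are needed, one option is to try a short list of `strptime` formats in turn. The sketch below is an illustration under stated assumptions: `parse_date` and `DATE_FORMATS` are hypothetical names, `4.13.48` is read as month.day.year, and a two-digit year that would land in the future is moved back a century (Python's `%y` maps `48` to 2048).

```python
from datetime import date, datetime

# Formats chosen to match the four strings extracted above.
DATE_FORMATS = ["%d %B %Y", "%B %d, %Y", "%B %d %Y", "%m.%d.%y"]


def parse_date(raw, today=None):
    """Return a date for the first matching format, or None if nothing matches."""
    today = today or date.today()
    for fmt in DATE_FORMATS:
        try:
            parsed = datetime.strptime(raw, fmt).date()
        except ValueError:
            continue
        # Assumption: a birth date cannot be in the future, so shift
        # two-digit years such as "48" from 2048 back to 1948.
        if "%y" in fmt and parsed > today:
            parsed = parsed.replace(year=parsed.year - 100)
        return parsed
    return None


date_strings = ['24 April 1973', 'November 5, 1988', '4.13.48', 'December 3 1974']
parsed_dates = [parse_date(s) for s in date_strings]

for raw, parsed in zip(date_strings, parsed_dates):
    print(raw, '|', parsed)
```

Strings that match none of the formats come back as `None` rather than raising, so unexpected layouts are easy to spot.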
