<a href="https://colab.research.google.com/github/hassaanhameed786/NLP/blob/main/Named_Entity_Recognition_with_NLTK_and_SpaCy.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Named entity recognition (NER)is probably the first step towards information extraction that seeks to locate and classify named entities in text into pre-defined categories such as the names of persons, organizations, locations, expressions of times, quantities, monetary values, percentages, etc. NER is used in many fields in Natural Language Processing (NLP), and it can help answering many real-world questions, such as:
Which companies were mentioned in the news article?
Were specified products mentioned in complaints or reviews?
Does the tweet contain the name of a person? Does the tweet contain this person’s location?

In [2]:
import nltk
nltk.download('punkt')
from nltk.tokenize import word_tokenize
from nltk.tag import pos_tag

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


In [3]:
# random sentence from the goolgle news 
ex = 'An 11-year-old who survived the Texas school shooting is scared the gunman is still is out to get her, her parents say'

# # Data Preprocessing

In [4]:
# apply word tokenization and part-of-speech tagging to the sentence
def preprocess(sent):
    sent = nltk.word_tokenize(sent)
    sent = nltk.pos_tag(sent)
    return sent

In [5]:
nltk.download('averaged_perceptron_tagger')

[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger.zip.


True

In [6]:
sent = preprocess(ex)
sent

[('An', 'DT'),
 ('11-year-old', 'JJ'),
 ('who', 'WP'),
 ('survived', 'VBD'),
 ('the', 'DT'),
 ('Texas', 'NNP'),
 ('school', 'NN'),
 ('shooting', 'NN'),
 ('is', 'VBZ'),
 ('scared', 'VBN'),
 ('the', 'DT'),
 ('gunman', 'NN'),
 ('is', 'VBZ'),
 ('still', 'RB'),
 ('is', 'VBZ'),
 ('out', 'RP'),
 ('to', 'TO'),
 ('get', 'VB'),
 ('her', 'PRP'),
 (',', ','),
 ('her', 'PRP$'),
 ('parents', 'NNS'),
 ('say', 'VBP')]

## implement noun phrase chunking

In [7]:
# Our chunk pattern consists of one rule, that a noun phrase, NP
# it should be formed whenever the chunker finds an optional determiner, DT
#  followed by any number of adjectives, JJ, and then a noun, NN.

pattern = 'NP: {<DT>?<JJ>*<NN>}'

## Chunking

In [9]:
cp = nltk.RegexpParser(pattern)
cs = cp.parse(sent)
print(cs) ## output can be read as a tree 

(S
  An/DT
  11-year-old/JJ
  who/WP
  survived/VBD
  the/DT
  Texas/NNP
  (NP school/NN)
  (NP shooting/NN)
  is/VBZ
  scared/VBN
  (NP the/DT gunman/NN)
  is/VBZ
  still/RB
  is/VBZ
  out/RP
  to/TO
  get/VB
  her/PRP
  ,/,
  her/PRP$
  parents/NNS
  say/VBP)


IOB tags have become the standard way to represent chunk structures in files, and we will also be using this format.

In [13]:
from nltk.chunk import conlltags2tree, tree2conlltags
from pprint import pprint
iob_tagged = tree2conlltags(cs)
pprint(iob_tagged)
ne_tree

[('An', 'DT', 'O'),
 ('11-year-old', 'JJ', 'O'),
 ('who', 'WP', 'O'),
 ('survived', 'VBD', 'O'),
 ('the', 'DT', 'O'),
 ('Texas', 'NNP', 'O'),
 ('school', 'NN', 'B-NP'),
 ('shooting', 'NN', 'B-NP'),
 ('is', 'VBZ', 'O'),
 ('scared', 'VBN', 'O'),
 ('the', 'DT', 'B-NP'),
 ('gunman', 'NN', 'I-NP'),
 ('is', 'VBZ', 'O'),
 ('still', 'RB', 'O'),
 ('is', 'VBZ', 'O'),
 ('out', 'RP', 'O'),
 ('to', 'TO', 'O'),
 ('get', 'VB', 'O'),
 ('her', 'PRP', 'O'),
 (',', ',', 'O'),
 ('her', 'PRP$', 'O'),
 ('parents', 'NNS', 'O'),
 ('say', 'VBP', 'O')]


NameError: ignored

In [18]:
 nltk.download('maxent_ne_chunker')
 nltk.download('words')

[nltk_data] Downloading package maxent_ne_chunker to
[nltk_data]     /root/nltk_data...
[nltk_data]   Package maxent_ne_chunker is already up-to-date!
[nltk_data] Downloading package words to /root/nltk_data...
[nltk_data]   Unzipping corpora/words.zip.


True

In [19]:
from nltk import ne_chunk
ne_tree = ne_chunk(pos_tag(word_tokenize(ex)))
print(ne_tree)

(S
  An/DT
  11-year-old/JJ
  who/WP
  survived/VBD
  the/DT
  (GPE Texas/NNP)
  school/NN
  shooting/NN
  is/VBZ
  scared/VBN
  the/DT
  gunman/NN
  is/VBZ
  still/RB
  is/VBZ
  out/RP
  to/TO
  get/VB
  her/PRP
  ,/,
  her/PRP$
  parents/NNS
  say/VBP)


In [20]:
# using spacy help this
import spacy
from spacy import displacy
from collections import Counter
import en_core_web_sm
nlp = en_core_web_sm.load()

**one of the nice things about Spacy is that we only need to apply nlp once, the entire background pipeline will return the objects.**

In [21]:
doc = nlp('An 11-year-old who survived the Texas school shooting is scared the gunman is still is out to get her, her parents say')

In [22]:
pprint([(X.text, X.label_) for X in doc.ents])

[('Texas', 'GPE')]


# Token
During the above example, we were working on entity level, in the following example, we are demonstrating token-level entity annotation using the BILUO tagging scheme to describe the entity boundaries.

In [23]:
pprint([(X, X.ent_iob_, X.ent_type_) for X in doc])

[(An, 'O', ''),
 (11-year, 'O', ''),
 (-, 'O', ''),
 (old, 'O', ''),
 (who, 'O', ''),
 (survived, 'O', ''),
 (the, 'O', ''),
 (Texas, 'B', 'GPE'),
 (school, 'O', ''),
 (shooting, 'O', ''),
 (is, 'O', ''),
 (scared, 'O', ''),
 (the, 'O', ''),
 (gunman, 'O', ''),
 (is, 'O', ''),
 (still, 'O', ''),
 (is, 'O', ''),
 (out, 'O', ''),
 (to, 'O', ''),
 (get, 'O', ''),
 (her, 'O', ''),
 (,, 'O', ''),
 (her, 'O', ''),
 (parents, 'O', ''),
 (say, 'O', '')]


"B" means the token begins an entity, "I" means it is inside an entity, "O" means it is outside an entity, and "" means no entity tag is set.

--------------------------------------------------------------------------------



Extracting named entity from an article 

Now let’s get serious with SpaCy and extracting named entities from article https://www.msn.com/en-us/news/world/russian-advances-in-ukraine-s-east-mark-a-tipping-point/ar-AAXUdDc?ocid=winp1taskbar&cvid=0a0edfc299cc4fa1995385eee0310321

In [61]:
from bs4 import BeautifulSoup
import requests
import re

def url_to_string(url):
    res = requests.get(url)
    html = res.text
    soup = BeautifulSoup(html, 'html5lib')
    for script in soup(["script", "style", 'aside']):
        script.extract()
        return " ".join(re.split(r'[\n\t]+', soup.get_text()))



ny_bb = url_to_string('https://propakistani.pk/2022/05/31/fbr-posts-officers-at-directorate-general-of-dnfbps-under-fatf-requirement/')
article = nlp(ny_bb)
len(article.ents)

41

In [62]:
labels = [x.label_ for x in article.ents]
Counter(labels)

Counter({'CARDINAL': 11,
         'DATE': 1,
         'MONEY': 8,
         'ORDINAL': 1,
         'ORG': 3,
         'PERCENT': 6,
         'PERSON': 5,
         'PRODUCT': 1,
         'WORK_OF_ART': 5})

In [63]:
items = [x.text for x in article.ents]
Counter(items).most_common(3)

[('0.2', 4), ('#cf-bubbles', 4), ('#cf', 2)]

In [73]:
sentences = [x for x in article.sents]
print(sentences[20])

:nth-child(2)


In [74]:
displacy.render(nlp(str(sentences[20])), jupyter=True, style='ent')

In [75]:
displacy.render(nlp(str(sentences[20])), style='dep', jupyter = True, options = {'distance': 120})

In [76]:
[(x.orth_,x.pos_, x.lemma_) for x in [y 
                                      for y
                                      in nlp(str(sentences[20])) 
                                      if not y.is_stop and y.pos_ != 'PUNCT']]

[('nth', 'PROPN', 'nth'), ('child(2', 'NOUN', 'child(2')]

In [77]:
dict([(str(x), x.label_) for x in nlp(str(sentences[20])).ents])

{'nth-child(2': 'ORG'}

In [78]:
print([(x, x.ent_iob_, x.ent_type_) for x in sentences[20]])

[(:, 'O', ''), (nth, 'O', ''), (-, 'O', ''), (child(2, 'O', ''), (), 'O', '')]
