
**Parts of Speech basics**

Most words are rare, and it's common for words that look completely different to mean almost the same thing.
The same words in a different order can mean something completely different.

Even splitting text into useful word-like units can be difficult in many languages. While it's possible to solve some problems starting from only the raw characters, it's usually better to use linguistic knowledge to add useful information.
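To make the tokenization point concrete, here is a small sketch in plain Python with no spaCy involved. The regular expression is a toy assumption for illustration, not the rule set spaCy uses. It only shows that splitting on whitespace leaves the possessive and the final period glued to their words.

```python
import re

text = "The quick brown fox jumped over the lazy dog's back."

# Naive approach: split on whitespace only.
naive_tokens = text.split()

# Slightly better: peel off the possessive 's and any punctuation.
# This pattern is a toy assumption, not spaCy's tokenizer.
pattern = re.compile(r"'s|\w+|[^\w\s]")
regex_tokens = pattern.findall(text)

print(naive_tokens[-2:])   # ["dog's", 'back.']
print(regex_tokens[-4:])   # ['dog', "'s", 'back', '.']
```

Even this small pattern would need many more special cases (abbreviations, hyphens, URLs) for real text, which is why a library tokenizer is usually the better starting point.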

In [2]:
import spacy

In [3]:
nlp = spacy.load("en_core_web_sm")

In [4]:
doc = nlp(u"The quick brown fox jumped over the lazy dog's back")

In [5]:
print(doc.text)

The quick brown fox jumped over the lazy dog's back


In [6]:
print(doc[4].tag_)

VBD


In [7]:
print(doc[4].pos_)

VERB


In [8]:
for token in doc:
  print(f"{token.text:{10}} {token.tag_:{10}}{token.pos_:{10}} {spacy.explain(token.tag_)}")

The        DT        DET        determiner
quick      JJ        ADJ        adjective
brown      JJ        ADJ        adjective
fox        NN        NOUN       noun, singular or mass
jumped     VBD       VERB       verb, past tense
over       IN        ADP        conjunction, subordinating or preposition
the        DT        DET        determiner
lazy       JJ        ADJ        adjective
dog        NN        NOUN       noun, singular or mass
's         POS       PART       possessive ending
back       NN        NOUN       noun, singular or mass


spaCy can distinguish between past and present tense forms of a verb, using the surrounding context where the spelling alone is ambiguous (as with "read").

In [9]:
doc1 = nlp(u"I read a book on NLP.")

In [10]:
word =  doc1[1]

token = word

token.text

'read'

In [11]:
print(f"{token.text:{10}} {token.tag_:{10}}{token.pos_:{10}} {spacy.explain(token.tag_)}")

read       VBD       VERB       verb, past tense


In [12]:
doc2 = nlp(u"I am reading a book on NLP")

In [13]:
word = doc2[2]

In [14]:
token = word
print(f"{token.text:{10}} {token.tag_:{10}}{token.pos_:{10}} {spacy.explain(token.tag_)}")

reading    VBG       VERB       verb, gerund or present participle


**Counting parts of speech in a Doc**

In [15]:
POS_counts = doc.count_by(spacy.attrs.POS)

In [16]:
POS_counts

{84: 3, 85: 1, 90: 2, 92: 3, 94: 1, 100: 1}

In [17]:
for k,v in sorted(POS_counts.items()):
  print(f"{k}. {doc.vocab[k].text:{5}} {v}")

84. ADJ   3
85. ADP   1
90. DET   2
92. NOUN  3
94. PART  1
100. VERB  1
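If the integer keys are not needed, the same tally can be built from the string labels directly. The sketch below is plain Python: the `tagged` list is copied by hand from the tagged sentence printed earlier and stands in for `(token.text, token.pos_)` pairs, so it runs without a loaded model. With a real `Doc`, `Counter(token.pos_ for token in doc)` should give the same counts.

```python
from collections import Counter

# (text, coarse POS) pairs copied from the tagged sentence above.
tagged = [
    ("The", "DET"), ("quick", "ADJ"), ("brown", "ADJ"), ("fox", "NOUN"),
    ("jumped", "VERB"), ("over", "ADP"), ("the", "DET"), ("lazy", "ADJ"),
    ("dog", "NOUN"), ("'s", "PART"), ("back", "NOUN"),
]

# Count how often each coarse POS label occurs.
pos_counts = Counter(pos for _, pos in tagged)

for pos, count in sorted(pos_counts.items()):
    print(f"{pos:{5}} {count}")
```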


In [18]:
TAG_counts = doc.count_by(spacy.attrs.TAG)


In [19]:
for k,v in sorted(TAG_counts.items()):
  print(f"{k:{24}}. {doc.vocab[k].text:{10}} {v}")

                      74. POS        1
     1292078113972184607. IN         1
    10554686591937588953. JJ         3
    15267657372422890137. DT         2
    15308085513773655218. NN         3
    17109001835818727656. VBD        1


**Why did the ID numbers get so big?**

 In spaCy, certain text values are hardcoded into Doc.vocab and take up the first several hundred ID numbers. Strings like ‘NOUN’ and ‘VERB’ are used frequently by internal operations. Others, like fine-grained tags, are assigned hash values as needed.
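The two kinds of ID can be compared directly through the string store. This is a minimal sketch that assumes spaCy is installed; it uses a blank English pipeline so no trained model is needed. Exact ID values may differ between spaCy versions, so the code prints them rather than relying on specific numbers.

```python
import spacy

nlp = spacy.blank("en")        # tokenizer only, no trained components
strings = nlp.vocab.strings    # the StringStore shared by the pipeline

# Built-in symbols such as coarse POS names map to small integers.
noun_id = strings["NOUN"]

# Other strings are hashed to 64-bit integers. add() also stores the
# string, so the hash can be mapped back to text later.
vbd_id = strings.add("VBD")

print(noun_id, strings[noun_id])
print(vbd_id, strings[vbd_id])
```

The lookup works in both directions: a string gives its ID, and an ID gives back the string, provided the string has been seen by the store.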

**Visualizing parts of speech**

In [20]:
from spacy import displacy

In [21]:
displacy.render(doc, style='dep', jupyter = True, options={"distance":90})

**Named Entity Recognition**

Named entity recognition (NER) seeks to locate named entity mentions in unstructured text and classify them into predefined categories such as person names, organizations, locations, medical codes, time expressions, quantities, monetary values, and percentages.

Our goal is to take raw text such as the first line below and label its entities as in the second:


*   Jim bought 300 shares of Acme Corp. in 2006.
*   [Jim]_person bought 300 shares of [Acme Corp.]_organization in [2006]_time.
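The bracketed form can be produced from character offsets plus labels, which is also how spaCy exposes entities (`ent.start_char`, `ent.end_char`, `ent.label_`). The sketch below is plain Python with hand-written spans. The helper name `annotate` and the label names are assumptions for illustration, not model output.

```python
def annotate(text, spans):
    """Wrap each (start, end, label) character span as [text]_label."""
    pieces = []
    cursor = 0
    for start, end, label in sorted(spans):
        pieces.append(text[cursor:start])               # text before the entity
        pieces.append(f"[{text[start:end]}]_{label}")   # the entity itself
        cursor = end
    pieces.append(text[cursor:])                        # trailing text
    return "".join(pieces)


text = "Jim bought 300 shares of Acme Corp. in 2006."
# Hand-written spans; a trained NER model would predict these.
spans = [(0, 3, "person"), (25, 35, "organization"), (39, 43, "time")]

annotated = annotate(text, spans)
print(annotated)
```

The helper assumes the spans do not overlap, which matches the constraint spaCy places on `doc.ents`.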


In [22]:
def show_ents(doc):
  if doc.ents:
    for ent in doc.ents:
      print(ent.text +'  -  ' + ent.label_ + '  -  ' + str(spacy.explain(ent.label_)))
  else:
    print("No entities found")

In [23]:
doc = nlp(u"Hi, how are you?")

In [24]:
show_ents(doc)

No entities found


In [25]:
doc = nlp(u"may I got to Washington D.C. next may to see the Washington Monument")

In [26]:
show_ents(doc)

Washington D.C.  -  GPE  -  Countries, cities, states
next may  -  DATE  -  Absolute or relative dates or periods
Washington  -  GPE  -  Countries, cities, states


In [27]:
doc = nlp(u"Can I please have 500 dollars of Microsoft stock?")

In [28]:
show_ents(doc)

500 dollars  -  MONEY  -  Monetary values, including unit
Microsoft  -  ORG  -  Companies, agencies, institutions, etc.


In [29]:
doc = nlp(u"Tesla to build a U.K. factory for 500 million dolars")

In [30]:
show_ents(doc)

U.K.  -  GPE  -  Countries, cities, states
500 million  -  CARDINAL  -  Numerals that do not fall under another type


As we can see, Tesla is not recognized as a company.

In [31]:
from spacy.tokens import Span

In [32]:
ORG = doc.vocab.strings[u"ORG"]

In [33]:
ORG

383

In [34]:
new_ent = Span(doc, 0,1, label=ORG)

In [35]:
doc.ents = list(doc.ents) + [new_ent]

In [36]:
show_ents(doc)

Tesla  -  ORG  -  Companies, agencies, institutions, etc.
U.K.  -  GPE  -  Countries, cities, states
500 million  -  CARDINAL  -  Numerals that do not fall under another type


Because we added Tesla as an organization, it now shows up as Tesla - ORG.

So far we have added a single term as a named entity.

But what if we have several terms to add?

For example, if we are working with a vacuum company, we might want to tag both "vacuum cleaner" and "vacuum-cleaner" as PRODUCT entities.


In [37]:
doc = nlp(u"Our company created a brand new vaccum cleaner."
          U"This new vaccum-cleaner is the best in show.")

In [38]:
show_ents(doc)

No entities found


In [40]:
from spacy.matcher import PhraseMatcher

In [41]:
matcher = PhraseMatcher(nlp.vocab)

In [42]:
phrase_list = ['vaccum-cleaner', 'vaccum cleaner']

In [43]:
phrase_patterns = [nlp(text) for text in phrase_list] 

In [44]:
# Pass the patterns as a list; the older add(key, None, *patterns) form was removed in spaCy v3.
matcher.add('newproduct', phrase_patterns)

In [45]:
found_matches = matcher(doc)

In [46]:
found_matches

[(2689272359382549672, 6, 8), (2689272359382549672, 11, 14)]
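Each match is a `(match_id, start, end)` tuple: `match_id` is the hash of the rule name, and `start` and `end` are token indices, with `end` exclusive. Below is a minimal sketch of decoding them. It assumes spaCy 2.2.2 or later (for the list form of `matcher.add`) and uses a blank English pipeline so that no trained model is required. The spelling "vaccum" is kept from the example so the output lines up, and a space is added between the two sentences.

```python
import spacy
from spacy.matcher import PhraseMatcher

nlp = spacy.blank("en")  # tokenizer only; enough for phrase matching
matcher = PhraseMatcher(nlp.vocab)
patterns = [nlp.make_doc(text) for text in ["vaccum-cleaner", "vaccum cleaner"]]
matcher.add("newproduct", patterns)

doc = nlp("Our company created a brand new vaccum cleaner. "
          "This new vaccum-cleaner is the best in show.")

decoded = []
for match_id, start, end in matcher(doc):
    rule_name = nlp.vocab.strings[match_id]  # hash -> rule name
    span = doc[start:end]                    # end index is exclusive
    decoded.append((rule_name, start, end, span.text))
    print(rule_name, start, end, span.text)
```

The hyphenated form spans three tokens because the tokenizer splits the hyphen into its own token, which is why its match covers indices 11 to 14.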

In [47]:
from spacy.tokens import Span

In [48]:
PROD = doc.vocab.strings[u"PRODUCT"]

In [49]:
new_ents = [Span(doc, match[1], match[2], label = PROD) for match in found_matches]

In [53]:
doc.ents = list(doc.ents)  + new_ents

In [54]:
show_ents(doc)

vaccum cleaner  -  PRODUCT  -  Objects, vehicles, foods, etc. (not services)
vaccum-cleaner  -  PRODUCT  -  Objects, vehicles, foods, etc. (not services)


In [55]:
doc = nlp(u"Originially I paid $29.95 for this car toy, but noe it is marked down to $20.")

In [58]:
len([ent for ent in doc.ents if ent.label_ == 'MONEY'])

2

**Visualizing NER**

In [59]:
from spacy import displacy

In [73]:
doc = nlp(u"Over the last quarter Apple sold nearly 20 thousand iPods for a profit of $6 million. "
          u"By contrast, Sony only sold 8 thousand Walkman music players.")

In [74]:
# The "distance" option only applies to style='dep', so it is dropped here.
displacy.render(doc, style='ent', jupyter=True)

In [75]:
# Render the entities one sentence at a time.
for sent in doc.sents:
  displacy.render(sent, style='ent', jupyter=True)