## Entity Recognition

TODO: Incorporate word sense disambiguation

In [1]:
import spacy
import nltk
from nltk.tokenize import word_tokenize
from nltk.tag import pos_tag
from nltk.chunk import ne_chunk

In [2]:
#https://www.tripadvisor.com/ExperiencesInsights/travelers/top-travel-trends-2018
sentence = 'As a tour operator or attraction owner, staying on top of trends in the rapidly changing travel landscape is critical to helping you make important decisions about your business, including how to better satisfy customers and where to invest your resources. To help you, we’ve taken a look at our own booking data and traveler surveys, combined with research from reputable industry resources, to provide you insight into this year’s top travel trends—plus key takeaways for your business.'

### NLTK POS Tag Meaning

Check out this website to understand the meaning of the NLTK POS tags that we extract below:
https://pythonprogramming.net/natural-language-toolkit-nltk-part-speech-tagging/
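
For quick reference, the tags that show up in the tagger output in this notebook can be decoded with a small lookup table. This is only a sketch: `PENN_TAGS` and `describe` are hypothetical names introduced here, and the table covers just a subset of the Penn Treebank tag set. If the NLTK `tagsets` resource has been downloaded, `nltk.help.upenn_tagset('NN')` prints the full description of a tag instead.

```python
# Partial lookup table of Penn Treebank POS tags (a subset, not the full tag set).
PENN_TAGS = {
    "CC": "coordinating conjunction",
    "DT": "determiner",
    "IN": "preposition or subordinating conjunction",
    "JJ": "adjective",
    "NN": "noun, singular or mass",
    "NNS": "noun, plural",
    "PRP": "personal pronoun",
    "PRP$": "possessive pronoun",
    "RB": "adverb",
    "RBR": "adverb, comparative",
    "TO": "to",
    "VB": "verb, base form",
    "VBG": "verb, gerund or present participle",
    "VBN": "verb, past participle",
    "VBP": "verb, non-3rd person singular present",
    "VBZ": "verb, 3rd person singular present",
    "WRB": "wh-adverb",
}


def describe(tagged_tokens):
    """Attach a human-readable meaning to each (word, tag) pair."""
    return [
        (word, tag, PENN_TAGS.get(tag, "(not in table)"))
        for word, tag in tagged_tokens
    ]


# A few pairs copied from the tagger output in this notebook
for row in describe([("As", "IN"), ("a", "DT"), ("tour", "NN"), ("trends", "NNS")]):
    print(row)
```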

In [3]:
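def preprocess(sent):
    # Split the raw text into word and punctuation tokens
    sent = nltk.word_tokenize(sent)
    # Attach a Penn Treebank part-of-speech tag to each token
    sent = nltk.pos_tag(sent)
    return sent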

In [4]:
sent = preprocess(sentence)
sent

[('As', 'IN'),
 ('a', 'DT'),
 ('tour', 'NN'),
 ('operator', 'NN'),
 ('or', 'CC'),
 ('attraction', 'NN'),
 ('owner', 'NN'),
 (',', ','),
 ('staying', 'VBG'),
 ('on', 'IN'),
 ('top', 'NN'),
 ('of', 'IN'),
 ('trends', 'NNS'),
 ('in', 'IN'),
 ('the', 'DT'),
 ('rapidly', 'RB'),
 ('changing', 'VBG'),
 ('travel', 'NN'),
 ('landscape', 'NN'),
 ('is', 'VBZ'),
 ('critical', 'JJ'),
 ('to', 'TO'),
 ('helping', 'VBG'),
 ('you', 'PRP'),
 ('make', 'VBP'),
 ('important', 'JJ'),
 ('decisions', 'NNS'),
 ('about', 'IN'),
 ('your', 'PRP$'),
 ('business', 'NN'),
 (',', ','),
 ('including', 'VBG'),
 ('how', 'WRB'),
 ('to', 'TO'),
 ('better', 'RBR'),
 ('satisfy', 'NN'),
 ('customers', 'NNS'),
 ('and', 'CC'),
 ('where', 'WRB'),
 ('to', 'TO'),
 ('invest', 'VB'),
 ('your', 'PRP$'),
 ('resources', 'NNS'),
 ('.', '.'),
 ('To', 'TO'),
 ('help', 'VB'),
 ('you', 'PRP'),
 (',', ','),
 ('we', 'PRP'),
 ('’', 'VBP'),
 ('ve', 'RB'),
 ('taken', 'VBN'),
 ('a', 'DT'),
 ('look', 'NN'),
 ('at', 'IN'),
 ('our', 'PRP$'),
 ('own

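### Noun Phrase Chunking

Now we will define a chunk pattern for noun phrases: an optional determiner (`DT`), followed by any number of adjectives (`JJ`), followed by a singular noun (`NN`). Because the pattern ends in `<NN>` only, plural nouns (`NNS`) are left outside the chunks.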

In [5]:
pattern = 'NP: {<DT>?<JJ>*<NN>}'

In [6]:
cp = nltk.RegexpParser(pattern)
cs = cp.parse(sent)
print(cs)

(S
  As/IN
  (NP a/DT tour/NN)
  (NP operator/NN)
  or/CC
  (NP attraction/NN)
  (NP owner/NN)
  ,/,
  staying/VBG
  on/IN
  (NP top/NN)
  of/IN
  trends/NNS
  in/IN
  the/DT
  rapidly/RB
  changing/VBG
  (NP travel/NN)
  (NP landscape/NN)
  is/VBZ
  critical/JJ
  to/TO
  helping/VBG
  you/PRP
  make/VBP
  important/JJ
  decisions/NNS
  about/IN
  your/PRP$
  (NP business/NN)
  ,/,
  including/VBG
  how/WRB
  to/TO
  better/RBR
  (NP satisfy/NN)
  customers/NNS
  and/CC
  where/WRB
  to/TO
  invest/VB
  your/PRP$
  resources/NNS
  ./.
  To/TO
  help/VB
  you/PRP
  ,/,
  we/PRP
  ’/VBP
  ve/RB
  taken/VBN
  (NP a/DT look/NN)
  at/IN
  our/PRP$
  (NP own/JJ booking/NN)
  data/NNS
  and/CC
  (NP traveler/NN)
  surveys/NNS
  ,/,
  combined/VBN
  with/IN
  (NP research/NN)
  from/IN
  (NP reputable/JJ industry/NN)
  resources/NNS
  ,/,
  to/TO
  provide/VB
  you/PRP
  insight/RB
  into/IN
  (NP this/DT year/NN)
  ’/VBZ
  (NP s/JJ top/JJ travel/NN)
  trends—plus/IN
  key/JJ
  takeaways/NNS
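
The way this grammar groups tokens can be reproduced without NLTK. The sketch below is a simplified stand-in for `RegexpParser`, not its actual implementation: `chunk_noun_phrases` is a hypothetical helper that scans left to right and applies the rule `<DT>?<JJ>*<NN>` greedily. It shows why `the/DT rapidly/RB changing/VBG` stays outside any chunk, and why `travel` and `landscape` end up as two separate chunks.

```python
def chunk_noun_phrases(tagged):
    """Return the token groups matched by the rule <DT>?<JJ>*<NN>.

    `tagged` is a list of (word, tag) pairs. Matching is greedy and
    non-overlapping, scanning from left to right.
    """
    chunks = []
    i = 0
    while i < len(tagged):
        j = i
        # Optional determiner
        if tagged[j][1] == "DT":
            j += 1
        # Zero or more adjectives
        while j < len(tagged) and tagged[j][1] == "JJ":
            j += 1
        # Mandatory singular noun closes the chunk
        if j < len(tagged) and tagged[j][1] == "NN":
            chunks.append(tagged[i:j + 1])
            i = j + 1
        else:
            i += 1  # no chunk starts here; move on
    return chunks


# Tagged tokens copied from the tagger output in this notebook
sample = [
    ("the", "DT"), ("rapidly", "RB"), ("changing", "VBG"),
    ("travel", "NN"), ("landscape", "NN"),
    ("own", "JJ"), ("booking", "NN"), ("data", "NNS"),
]
for chunk in chunk_noun_phrases(sample):
    print(chunk)
```

Allowing several nouns in a row (for example `<NN>+`) would merge `travel landscape` into one chunk; whether that is desirable depends on the downstream task.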
 

### Visualize
We can draw the chunk tree to see how the tokens are grouped into noun phrases

In [7]:
# Draw the chunk tree (opens a separate Tkinter window)
NPChunker = nltk.RegexpParser(pattern)
result = NPChunker.parse(sent)
result.draw()

### Find Out IOB Tags
IOB is a tagging format used in chunking tasks

I- prefix before a tag indicates that the token is inside a chunk

O tag indicates that the token belongs to no chunk

B- prefix before a tag indicates that the token is the beginning of a chunk
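
A small sketch makes the scheme concrete. It mirrors the kind of output `tree2conlltags` produces for a flat chunk tree, but works on plain Python lists: `to_iob` is a hypothetical helper, and a nested list stands in for an `NP` subtree.

```python
def to_iob(items, label="NP"):
    """Flatten chunked tokens into (word, pos, iob) triples.

    `items` mixes bare (word, pos) tuples (tokens outside any chunk)
    with lists of (word, pos) tuples (one list per chunk).
    """
    triples = []
    for item in items:
        if isinstance(item, list):  # a chunk
            for position, (word, pos) in enumerate(item):
                # First token of a chunk gets B-, the rest get I-
                prefix = "B-" if position == 0 else "I-"
                triples.append((word, pos, prefix + label))
        else:  # a token outside every chunk
            word, pos = item
            triples.append((word, pos, "O"))
    return triples


# Chunks copied from the start of the parse shown earlier
chunked = [
    ("As", "IN"),
    [("a", "DT"), ("tour", "NN")],
    [("operator", "NN")],
    ("or", "CC"),
]
for triple in to_iob(chunked):
    print(triple)
```

Note that `operator` is tagged `B-NP` rather than `I-NP`: it starts its own chunk, so the `B-` prefix is what keeps two adjacent chunks apart.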

In [8]:
from nltk.chunk import conlltags2tree, tree2conlltags
from pprint import pprint

iob_tagged = tree2conlltags(cs)
pprint(iob_tagged)

[('As', 'IN', 'O'),
 ('a', 'DT', 'B-NP'),
 ('tour', 'NN', 'I-NP'),
 ('operator', 'NN', 'B-NP'),
 ('or', 'CC', 'O'),
 ('attraction', 'NN', 'B-NP'),
 ('owner', 'NN', 'B-NP'),
 (',', ',', 'O'),
 ('staying', 'VBG', 'O'),
 ('on', 'IN', 'O'),
 ('top', 'NN', 'B-NP'),
 ('of', 'IN', 'O'),
 ('trends', 'NNS', 'O'),
 ('in', 'IN', 'O'),
 ('the', 'DT', 'O'),
 ('rapidly', 'RB', 'O'),
 ('changing', 'VBG', 'O'),
 ('travel', 'NN', 'B-NP'),
 ('landscape', 'NN', 'B-NP'),
 ('is', 'VBZ', 'O'),
 ('critical', 'JJ', 'O'),
 ('to', 'TO', 'O'),
 ('helping', 'VBG', 'O'),
 ('you', 'PRP', 'O'),
 ('make', 'VBP', 'O'),
 ('important', 'JJ', 'O'),
 ('decisions', 'NNS', 'O'),
 ('about', 'IN', 'O'),
 ('your', 'PRP$', 'O'),
 ('business', 'NN', 'B-NP'),
 (',', ',', 'O'),
 ('including', 'VBG', 'O'),
 ('how', 'WRB', 'O'),
 ('to', 'TO', 'O'),
 ('better', 'RBR', 'O'),
 ('satisfy', 'NN', 'B-NP'),
 ('customers', 'NNS', 'O'),
 ('and', 'CC', 'O'),
 ('where', 'WRB', 'O'),
 ('to', 'TO', 'O'),
 ('invest', 'VB', 'O'),
 ('your', 'PRP$',

Each tuple in the above output contains 3 items: the token, its POS tag, and its chunk tag (i.e. IOB tag)

### Using spaCy for Named Entity Recognition

We are going to use spaCy for NER on our text. spaCy's model is trained on the [OntoNotes 5 corpus](https://catalog.ldc.upenn.edu/ldc2013t19). spaCy also has an NER model trained on Wikipedia, but that model does less granular classification. The model trained on the OntoNotes 5 corpus supports the entity types below

![caption](files/Spacy-Onenote5-Entity.png)

More details are available here: https://spacy.io/api/annotation#section-named-entities

In [42]:
from spacy import displacy
from collections import Counter
import en_core_web_sm
nlp = en_core_web_sm.load()

In [43]:
doc = nlp(sentence)
pprint([(X.text, X.label_) for X in doc.ents])

[('year', 'DATE')]


In [44]:
pprint([(X, X.ent_iob_, X.ent_type_) for X in doc])

[(As, 'O', ''),
 (a, 'O', ''),
 (tour, 'O', ''),
 (operator, 'O', ''),
 (or, 'O', ''),
 (attraction, 'O', ''),
 (owner, 'O', ''),
 (,, 'O', ''),
 (staying, 'O', ''),
 (on, 'O', ''),
 (top, 'O', ''),
 (of, 'O', ''),
 (trends, 'O', ''),
 (in, 'O', ''),
 (the, 'O', ''),
 (rapidly, 'O', ''),
 (changing, 'O', ''),
 (travel, 'O', ''),
 (landscape, 'O', ''),
 (is, 'O', ''),
 (critical, 'O', ''),
 (to, 'O', ''),
 (helping, 'O', ''),
 (you, 'O', ''),
 (make, 'O', ''),
 (important, 'O', ''),
 (decisions, 'O', ''),
 (about, 'O', ''),
 (your, 'O', ''),
 (business, 'O', ''),
 (,, 'O', ''),
 (including, 'O', ''),
 (how, 'O', ''),
 (to, 'O', ''),
 (better, 'O', ''),
 (satisfy, 'O', ''),
 (customers, 'O', ''),
 (and, 'O', ''),
 (where, 'O', ''),
 (to, 'O', ''),
 (invest, 'O', ''),
 (your, 'O', ''),
 (resources, 'O', ''),
 (., 'O', ''),
 (To, 'O', ''),
 (help, 'O', ''),
 (you, 'O', ''),
 (,, 'O', ''),
 (we, 'O', ''),
 (’ve, 'O', ''),
 (taken, 'O', ''),
 (a, 'O', ''),
 (look, 'O', ''),
 (at, 'O', ''),
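
The token-level view and the span-level view (`doc.ents`) carry the same information. As a rough illustration, the sketch below rebuilds entity spans from `(text, ent_iob_, ent_type_)` triples in plain Python. `spans_from_iob` is a hypothetical helper, and the `San Francisco` tokens are made-up sample data rather than output from this notebook.

```python
def spans_from_iob(triples):
    """Group (text, iob, ent_type) triples into (entity_text, ent_type) spans."""
    spans = []
    words, ent_type = [], None

    def flush():
        # Close the entity currently being built, if any
        if words:
            spans.append((" ".join(words), ent_type))

    for text, iob, token_type in triples:
        if iob == "B" or (iob == "I" and not words):
            # A new entity starts (a stray "I" is treated as a start)
            flush()
            words, ent_type = [text], token_type
        elif iob == "I":
            words.append(text)
        else:  # "O" or unset
            flush()
            words, ent_type = [], None
    flush()
    return spans


triples = [
    ("this", "O", ""),
    ("year", "B", "DATE"),       # matches the single entity found above
    ("in", "O", ""),
    ("San", "B", "GPE"),         # hypothetical tokens for illustration
    ("Francisco", "I", "GPE"),
]
print(spans_from_iob(triples))
```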
 

In [45]:
from bs4 import BeautifulSoup
import requests
import re

In [46]:
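# Fetch a page and return its visible text; the tags removed below
# determine how/what you extract from an article
def url_to_string(url):
    res = requests.get(url)
    res.raise_for_status()  # fail early on errors such as 403 Forbidden
    html = res.text
    soup = BeautifulSoup(html, 'html5lib')
    # Drop non-content elements before extracting the text
    for script in soup(["script", "style", 'aside']):
        script.extract()
    return " ".join(re.split(r'[\n\t]+', soup.get_text()))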

In [47]:
article = url_to_string('https://www.tripadvisor.com/ExperiencesInsights/travelers/top-travel-trends-2018')
#this site returned 403 forbidden https://weekendsherpa.com/stories/hike-san-franciscos-grand-walk-sutra-baths-to-crissy-field/
#https://www.tripadvisor.com/ExperiencesInsights/
article = nlp(article)
len(article.ents)

237

In [48]:
labels = [x.label_ for x in article.ents]
Counter(labels)

Counter({'CARDINAL': 26,
         'DATE': 59,
         'EVENT': 2,
         'FAC': 1,
         'GPE': 27,
         'MONEY': 6,
         'NORP': 5,
         'ORDINAL': 5,
         'ORG': 46,
         'PERCENT': 16,
         'PERSON': 23,
         'PRODUCT': 12,
         'QUANTITY': 3,
         'TIME': 2,
         'WORK_OF_ART': 4})

In [49]:
items = [x.text for x in article.ents]
Counter(items).most_common(3)

[('TripAdvisor', 13), ('\xa0', 11), ('815px', 7)]
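
Two of the three most common "entities" above are a non-breaking space (`\xa0`) and a CSS size (`815px`), which suggests that page residue is leaking into the text. One possible remedy, sketched here under the assumption that such strings are unwanted, is to normalize whitespace before calling `nlp` and to filter obviously non-linguistic entity strings afterwards. `normalize_whitespace` and `is_noise` are hypothetical helpers, the `\d+px` rule is a heuristic tuned to this page only, and the sample list is made up for illustration.

```python
import re
from collections import Counter


def normalize_whitespace(text):
    """Collapse every run of whitespace, non-breaking spaces included, into one space."""
    return re.sub(r"\s+", " ", text).strip()


def is_noise(entity_text):
    """Heuristic: blank strings and CSS pixel sizes are not real entities."""
    stripped = entity_text.strip()
    return not stripped or re.fullmatch(r"\d+px", stripped) is not None


# Made-up entity strings standing in for [x.text for x in article.ents]
items = ["TripAdvisor", "\xa0", "815px", "TripAdvisor", "2018"]
clean = [text for text in items if not is_noise(text)]

print(Counter(clean).most_common(2))
print(normalize_whitespace("top\xa0travel \n\t trends"))
```

Cleaning the text before it reaches the model is usually preferable to filtering afterwards, since stray markup can also distort sentence boundaries and the labels of neighbouring tokens.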

In [50]:
sentences = [x for x in article.sents]
print(sentences[20])

> More and more travelers are turning to online and mobile channels to research, plan, and book their travel.


In [51]:
displacy.render(nlp(str(sentences[20])), jupyter=True, style='ent')


In [52]:
displacy.render(nlp(str(sentences[20])), style='dep', jupyter = True, options = {'distance': 120})

In [53]:
[(tok.orth_, tok.pos_, tok.lemma_)
 for tok in nlp(str(sentences[20]))
 if not tok.is_stop and tok.pos_ != 'PUNCT']

[('>', 'X', '>'),
 ('More', 'ADJ', 'more'),
 ('travelers', 'NOUN', 'traveler'),
 ('turning', 'VERB', 'turn'),
 ('online', 'ADV', 'online'),
 ('mobile', 'ADJ', 'mobile'),
 ('channels', 'NOUN', 'channel'),
 ('research', 'VERB', 'research'),
 ('plan', 'NOUN', 'plan'),
 ('book', 'VERB', 'book'),
 ('travel', 'NOUN', 'travel')]

In [54]:
dict([(str(x), x.label_) for x in nlp(str(sentences[20])).ents])

{}

In [55]:
displacy.render(article, jupyter=True, style='ent')

## Reference

1. NLTK - Extracting Information from Text https://www.nltk.org/book/ch07.html
2. Wikipedia IOB https://en.wikipedia.org/wiki/Inside%E2%80%93outside%E2%80%93beginning_(tagging)
3. Sample project by Susan Li https://towardsdatascience.com/named-entity-recognition-with-nltk-and-spacy-8c4a7d88e7da