## Named Entity Recognition

Named entities are real-world objects that have a name, for example a person, an organization, a product, or a country. In this project we use spaCy to extract named entities from TripAdvisor web pages.

TODO: Incorporate word sense disambiguation

In [1]:
#https://www.tripadvisor.com/ExperiencesInsights/travelers/top-travel-trends-2018
sentence = 'As a tour operator or attraction owner, staying on top of trends in the rapidly changing travel landscape is critical to helping you make important decisions about your business, including how to better satisfy customers and where to invest your resources. To help you, we’ve taken a look at our own booking data and traveler surveys, combined with research from reputable industry resources, to provide you insight into this year’s top travel trends—plus key takeaways for your business.'

### Installation

python -m spacy download en_core_web_lg

We are going to use spaCy for NER in our text. spaCy's model is trained on the [OntoNotes 5 corpus](https://catalog.ldc.upenn.edu/ldc2013t19). spaCy also has an NER model trained on Wikipedia, but that model uses a less granular classification. The model trained on the OntoNotes 5 corpus supports the entity types below.

### Entity types

![caption](files/Spacy-Onenote5-Entity.png)

More details are in the spaCy annotation docs: https://spacy.io/api/annotation#section-named-entities

In [2]:
import spacy
from spacy import displacy
from collections import Counter
from spacy.lang.en import English
import en_core_web_lg

spaCy provides language models (https://spacy.io/usage/models) that we can use for entity recognition. For our purpose, we will use the `en_core_web_lg` model. In general, the `lg` models are expected to perform better because they incorporate a larger dataset. The model provides `vocabulary`, `syntax`, `entities` and `vectors`.

In [4]:
#load the spaCy model
nlp = en_core_web_lg.load()

#generate the document object
doc = nlp(sentence)

#now let's see which entities we have in the input sentence
print([(ent.text, ent.label_) for ent in doc.ents])

[('year', 'DATE')]


### BILUO tagging scheme

There are different tagging schemes in Natural Language Processing tasks. `IOB` is one of the common ones and is used in `NLTK`.

`I-` prefix before a tag indicates that the tag is inside a chunk

`O` tag indicates that a token belongs to no chunk

`B-` prefix before a tag indicates that the tag is the beginning of a chunk
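
To make the three IOB prefixes concrete, here is a minimal sketch that groups IOB-tagged tokens back into chunks. The function name `iob_to_spans` and the hand-tagged sample sentence are assumptions made for illustration; they are not part of NLTK or spaCy and are not model output.

```python
from typing import List, Optional, Tuple


def iob_to_spans(tokens: List[str], tags: List[str]) -> List[Tuple[str, str]]:
    """Group IOB-tagged tokens into (chunk text, chunk type) pairs."""
    spans: List[Tuple[str, str]] = []
    current_tokens: List[str] = []
    current_type: Optional[str] = None

    for token, tag in zip(tokens, tags):
        # "B-ORG" -> ("B", "-", "ORG"); "O" -> ("O", "", "")
        prefix, _, chunk_type = tag.partition("-")
        if prefix == "B" or (prefix == "I" and chunk_type != current_type):
            # A B- tag (or a stray I- tag of a different type) opens a new chunk
            if current_tokens:
                spans.append((" ".join(current_tokens), current_type))
            current_tokens, current_type = [token], chunk_type
        elif prefix == "I":
            # An I- tag of the same type continues the open chunk
            current_tokens.append(token)
        else:
            # An O tag closes any open chunk
            if current_tokens:
                spans.append((" ".join(current_tokens), current_type))
            current_tokens, current_type = [], None

    if current_tokens:
        spans.append((" ".join(current_tokens), current_type))
    return spans


# Hypothetical hand-tagged sentence, used only to illustrate the scheme
tokens = ["TripAdvisor", "opened", "an", "office", "in", "New", "York", "City"]
tags = ["B-ORG", "O", "O", "O", "O", "B-GPE", "I-GPE", "I-GPE"]
print(iob_to_spans(tokens, tags))
# [('TripAdvisor', 'ORG'), ('New York City', 'GPE')]
```

Treating a stray `I-` tag as the start of a chunk is a design choice in this sketch; stricter parsers may reject such input instead.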


#### Why BILUO, not IOB
Instead of `IOB`, spaCy uses the `BILUO` tagging scheme. As mentioned in the spaCy docs: "There are several coding schemes for encoding entity annotations as token tags. These coding schemes are equally expressive, but not necessarily equally learnable. Ratinov and Roth showed that the minimal Begin, In, Out scheme was more difficult to learn than the BILUO scheme that we use, which explicitly marks boundary tokens."

![caption](files/BILUO.png)
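
The difference between the two schemes is easiest to see by converting one into the other. The sketch below is plain Python and does not call spaCy; the helper name `iob_to_biluo` is an assumption for illustration, and it expects well-formed IOB input. In BILUO, `L-` marks the last token of a multi-token entity and `U-` marks a single-token entity, so both boundaries are explicit.

```python
from typing import List


def iob_to_biluo(tags: List[str]) -> List[str]:
    """Convert well-formed IOB tags into BILUO tags."""
    biluo: List[str] = []
    for i, tag in enumerate(tags):
        if tag == "O":
            biluo.append(tag)
            continue
        prefix, _, label = tag.partition("-")
        # Look ahead: does the same entity continue on the next token?
        next_tag = tags[i + 1] if i + 1 < len(tags) else "O"
        continues = next_tag == f"I-{label}"
        if prefix == "B":
            # Beginning of a longer entity, or a single-token (Unit) entity
            biluo.append(f"B-{label}" if continues else f"U-{label}")
        else:
            # Inside the entity, or its Last token
            biluo.append(f"I-{label}" if continues else f"L-{label}")
    return biluo


# Hypothetical tag sequence, not produced by any model
iob = ["B-ORG", "O", "B-GPE", "I-GPE", "I-GPE"]
print(iob_to_biluo(iob))
# ['U-ORG', 'O', 'B-GPE', 'I-GPE', 'L-GPE']
```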


In [6]:
print([(X, X.ent_iob_, X.ent_type_) for X in doc])

[(As, 'O', ''), (a, 'O', ''), (tour, 'O', ''), (operator, 'O', ''), (or, 'O', ''), (attraction, 'O', ''), (owner, 'O', ''), (,, 'O', ''), (staying, 'O', ''), (on, 'O', ''), (top, 'O', ''), (of, 'O', ''), (trends, 'O', ''), (in, 'O', ''), (the, 'O', ''), (rapidly, 'O', ''), (changing, 'O', ''), (travel, 'O', ''), (landscape, 'O', ''), (is, 'O', ''), (critical, 'O', ''), (to, 'O', ''), (helping, 'O', ''), (you, 'O', ''), (make, 'O', ''), (important, 'O', ''), (decisions, 'O', ''), (about, 'O', ''), (your, 'O', ''), (business, 'O', ''), (,, 'O', ''), (including, 'O', ''), (how, 'O', ''), (to, 'O', ''), (better, 'O', ''), (satisfy, 'O', ''), (customers, 'O', ''), (and, 'O', ''), (where, 'O', ''), (to, 'O', ''), (invest, 'O', ''), (your, 'O', ''), (resources, 'O', ''), (., 'O', ''), (To, 'O', ''), (help, 'O', ''), (you, 'O', ''), (,, 'O', ''), (we, 'O', ''), (’ve, 'O', ''), (taken, 'O', ''), (a, 'O', ''), (look, 'O', ''), (at, 'O', ''), (our, 'O', ''), (own, 'O', ''), (booking, 'O', ''), (d

In [7]:
from bs4 import BeautifulSoup
import requests
import re

In [8]:
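#fetch a page and return its visible text; the tags removed below determine what you extract from an article
def url_to_string(url):
    res = requests.get(url)
    #fail early on HTTP errors such as 403 Forbidden
    res.raise_for_status()
    html = res.text
    soup = BeautifulSoup(html, 'html5lib')
    #drop elements whose content is not article text
    for element in soup(["script", "style", 'aside']):
        element.extract()
    #collapse runs of newlines and tabs into single spaces
    return " ".join(re.split(r'[\n\t]+', soup.get_text()))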

In [9]:
article = url_to_string('https://www.tripadvisor.com/ExperiencesInsights/travelers/top-travel-trends-2018')
#this site returned 403 forbidden https://weekendsherpa.com/stories/hike-san-franciscos-grand-walk-sutra-baths-to-crissy-field/
#https://www.tripadvisor.com/ExperiencesInsights/
article = nlp(article)
len(article.ents)

233

In [10]:
#show the spaCy entity labels
labels = [x.label_ for x in article.ents]
Counter(labels)

Counter({'ORG': 67, 'DATE': 56, 'CARDINAL': 20, 'GPE': 20, 'PRODUCT': 16, 'PERCENT': 16, 'PERSON': 11, 'WORK_OF_ART': 7, 'MONEY': 5, 'ORDINAL': 3, 'FAC': 3, 'EVENT': 2, 'TIME': 2, 'LOC': 2, 'QUANTITY': 2, 'NORP': 1})

In [11]:
items = [x.text for x in article.ents]
Counter(items).most_common(3)

[('TripAdvisor', 13), ('300w', 9), ('srcset="https://mk0attractionsiexgba.kinstacdn.com', 8)]

In [12]:
#print the (text, label) tuples
pairs = [(x.text, x.label_) for x in article.ents]
print(pairs)

[('the Year - TripAdvisor Experiences', 'EVENT'), ('Page   ', 'PERSON'), ('the Year–', 'EVENT'), ('May 8', 'DATE'), ('year', 'DATE'), ('1', 'CARDINAL'), ('Tours', 'GPE'), ('uploads/2017/11/camels-1024x683.jpg"', 'WORK_OF_ART'), ('1024w', 'DATE'), ('300w', 'PRODUCT'), ('1080w', 'DATE'), ('817px)', 'ORG'), ('$135 billion', 'MONEY'), ('third', 'ORDINAL'), ('$174 billion', 'MONEY'), ('2020', 'DATE'), ('$765 million', 'MONEY'), ('more than 200', 'CARDINAL'), ('2005', 'DATE'), ('2018', 'DATE'), ('68%', 'PERCENT'), ('59%', 'PERCENT'), ('2017', 'DATE'), ('Millennials', 'ORG'), ('79%', 'PERCENT'), ('TripBarometer', 'PRODUCT'), ('2017', 'DATE'), ('TripAdvisor', 'ORG'), ('Tours', 'GPE'), ('the years ahead', 'DATE'), ('srcset="https://mk0attractionsiexgba.kinstacdn.com', 'ORG'), ('1024w', 'DATE'), ('300w', 'PRODUCT'), ('https://mk0attractionsiexgba.kinstacdn.com/wp-content/uploads/2018/01/booking-online-768x514.jpg', 'ORG'), ('https://mk0attractionsiexgba.kinstacdn.com/wp-content/uploads/2018/01/b

Using `en_core_web_lg` improved the performance: it labels `Facebook` as `ORG`.

In [13]:
sentences = [x for x in article.sents]
print(sentences[20])

To best position yourself for success, focus on expanding your customer reach and differentiating yourself from competitors in your market.


In [14]:
displacy.render(nlp(str(sentences[20])), jupyter=True, style='ent')


In [15]:
displacy.render(nlp(str(sentences[20])), style='dep', jupyter = True, options = {'distance': 120})

In [16]:
[(tok.orth_, tok.pos_, tok.lemma_)
 for tok in nlp(str(sentences[20]))
 if not tok.is_stop and tok.pos_ != 'PUNCT']

[('To', 'PART', 'to'), ('best', 'ADJ', 'good'), ('position', 'VERB', 'position'), ('yourself', 'PRON', '-PRON-'), ('for', 'ADP', 'for'), ('success', 'NOUN', 'success'), ('focus', 'VERB', 'focus'), ('on', 'ADP', 'on'), ('expanding', 'VERB', 'expand'), ('your', 'ADJ', '-PRON-'), ('customer', 'NOUN', 'customer'), ('reach', 'VERB', 'reach'), ('and', 'CCONJ', 'and'), ('differentiating', 'VERB', 'differentiate'), ('yourself', 'PRON', '-PRON-'), ('from', 'ADP', 'from'), ('competitors', 'NOUN', 'competitor'), ('in', 'ADP', 'in'), ('your', 'ADJ', '-PRON-'), ('market', 'NOUN', 'market')]

In [17]:
dict([(str(x), x.label_) for x in nlp(str(sentences[20])).ents])

{}

In [18]:
displacy.render(article, jupyter=True, style='ent')

As seen above, entity recognition fails in some cases. For example, it labels `Page` as `PERSON`, probably because `Page` starts with a capital `P`. The word comes from a `Page` tab on the web page, which was picked up as text during data collection. So we need more data cleaning.
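
One possible first cleaning step is to drop entities that are clearly HTML residue, such as the `srcset=` URLs and image widths (`300w`, `1024w`) seen in the output above. The sketch below is only a heuristic: the function name `drop_markup_entities` and the regular expression are assumptions chosen to match the noise in this particular page, not a general solution.

```python
import re
from typing import List, Tuple

# Assumed heuristics: URLs, srcset attributes, image file names,
# and width tokens such as "300w" or "817px)" are treated as markup residue
_MARKUP_PATTERN = re.compile(
    r'https?://|srcset=|\.jpg|^\d+w$|^\d+px\)?$',
    re.IGNORECASE,
)


def drop_markup_entities(pairs: List[Tuple[str, str]]) -> List[Tuple[str, str]]:
    """Remove (text, label) pairs whose text looks like HTML residue."""
    cleaned = []
    for text, label in pairs:
        stripped = text.strip()
        if not stripped or _MARKUP_PATTERN.search(stripped):
            continue
        cleaned.append((stripped, label))
    return cleaned


# A few (text, label) pairs copied from the entity output above
sample = [
    ('TripAdvisor', 'ORG'),
    ('300w', 'PRODUCT'),
    ('1024w', 'DATE'),
    ('srcset="https://mk0attractionsiexgba.kinstacdn.com', 'ORG'),
    ('817px)', 'ORG'),
    ('$135 billion', 'MONEY'),
    ('Page   ', 'PERSON'),
]
print(drop_markup_entities(sample))
# [('TripAdvisor', 'ORG'), ('$135 billion', 'MONEY'), ('Page', 'PERSON')]
```

Note that `Page` survives this filter. Navigation text like that has to be removed from the HTML before it reaches the model, for example by stripping more tags in `url_to_string`.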

## References

1. spaCy Named Entity Recognition https://spacy.io/usage/linguistic-features
2. NLTK - Extracting Information from Text https://www.nltk.org/book/ch07.html
3. Wikipedia IOB https://en.wikipedia.org/wiki/Inside%E2%80%93outside%E2%80%93beginning_(tagging)
4. Sample project by Susan Li https://towardsdatascience.com/named-entity-recognition-with-nltk-and-spacy-8c4a7d88e7da