# NLP - Session 6 - SpaCy Introduction for NLP | Linguistic Annotation | Emoji, Phone, Email, HashTag Extraction and Matching

## Using Linguistic Annotations
Let’s say you’re analyzing user comments and you want to find out what people are saying about Facebook. You want to start off by finding adjectives following “Facebook is” or “Facebook was”. This is obviously a very rudimentary solution, but it’ll be fast, and a great way to get an idea for what’s in your data. Your pattern could look like this:

`[{"LOWER": "facebook"}, {"LEMMA": "be"}, {"POS": "ADV", "OP": "*"}, {"POS": "ADJ"}]`

This translates to a token whose lowercase form matches “facebook” (like Facebook, facebook or FACEBOOK), followed by a token with the lemma “be” (for example, is, was, or ’s), followed by zero or more adverbs (the `*` operator), followed by an adjective.

The full list of annotation attributes is documented at:

https://spacy.io/api/annotation

In [1]:
import spacy
from spacy.matcher import Matcher
from spacy.tokens import Span
from spacy import displacy

In [2]:
nlp = spacy.load('en_core_web_sm')

In [3]:
matcher = Matcher(nlp.vocab)

In [6]:
matched_sents = []

In [5]:
pattern = [{"LOWER": "facebook"}, {"LEMMA": "be"}, {"POS": "ADV", "OP": "*"}, {"POS": "ADJ"}]

In [7]:
def callback_method_fb(matcher, doc, i, matches):
    matched_id, start, end = matches[i]
    span = doc[start:end]
    sent = span.sent
    
    match_ents = [{
        "start": span.start_char - sent.start_char,
        "end": span.end_char - sent.start_char,
        "label": "MATCH"
    }]
    
    matched_sents.append({"text": sent.text, "ents": match_ents})

In [8]:
matcher.add("fb", callback_method_fb, pattern)
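The positional call `matcher.add("fb", callback_method_fb, pattern)` follows the spaCy 2.x style. As a hedged sketch, assuming spaCy 3.x (where `add` takes a key, a list of patterns, and the callback as the `on_match` keyword), the same kind of registration could look like this. The blank pipeline, the `collect_match` callback, and the simplified `LOWER`-only pattern are illustrative assumptions so the example runs without a trained model; they are not part of the original notebook.

```python
import spacy
from spacy.matcher import Matcher

# Assumption: a blank English pipeline (tokenizer only), so no POS tags or lemmas.
nlp = spacy.blank("en")
matcher = Matcher(nlp.vocab)

collected = []  # filled by the callback below


def collect_match(matcher, doc, i, matches):
    """Hypothetical callback: store the text of the i-th match."""
    match_id, start, end = matches[i]
    collected.append(doc[start:end].text)


# LOWER-only pattern, so the sketch does not depend on a tagger or lemmatizer.
pattern = [{"LOWER": "facebook"}, {"LOWER": {"IN": ["is", "was"]}}]

# Key, list of patterns, callback passed as a keyword argument.
matcher.add("fb", [pattern], on_match=collect_match)

doc = nlp("Facebook is evil. FACEBOOK was cool.")
matches = matcher(doc)
print(collected)
```

With a trained pipeline such as `en_core_web_sm`, the original `LEMMA` and `POS` pattern can be passed in the same way.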

In [9]:
doc = nlp("I'd say that Facebook is evil. – Facebook is pretty cool, right?")

In [10]:
matches = matcher(doc)

In [11]:
matches

[(8017838677478259815, 4, 7), (8017838677478259815, 9, 13)]

In [12]:
matched_sents

[{'text': "I'd say that Facebook is evil.",
  'ents': [{'start': 13, 'end': 29, 'label': 'MATCH'}]},
 {'text': '– Facebook is pretty cool, right?',
  'ents': [{'start': 2, 'end': 25, 'label': 'MATCH'}]}]

In [13]:
displacy.render(matched_sents, style="ent", manual=True)

## Phone Number
Phone numbers can have many different formats and matching them is often tricky. During tokenization, spaCy will leave sequences of numbers intact and only split on whitespace and punctuation. This means that your match pattern will have to look out for number sequences of a certain length, surrounded by specific punctuation – depending on the national conventions.

Suppose you want to match numbers written like `(123) 4567 8901` or `(123) 4567-8901`:

`[{"ORTH": "("}, {"SHAPE": "ddd"}, {"ORTH": ")"}, {"SHAPE": "dddd"}, {"ORTH": "-", "OP": "?"}, {"SHAPE": "dddd"}]`

This pattern looks for an opening parenthesis, a three-digit number, a closing parenthesis, a four-digit number, an optional dash, and finally another four-digit number.

In [14]:
pattern = [{"ORTH": "("}, {"SHAPE": "ddd"}, {"ORTH": ")"}, {"SHAPE": "dddd"}, {"ORTH": "-", "OP": "?"}, {"SHAPE": "dddd"}]

In [16]:
matcher = Matcher(nlp.vocab)
matcher.add("PhoneNumber", None, pattern)

In [17]:
doc = nlp("Call me at (123) 4560-7890")

In [18]:
print([t.text for t in doc])

['Call', 'me', 'at', '(', '123', ')', '4560', '-', '7890']


In [19]:
matches = matcher(doc)
matches

[(7978097794922043545, 3, 9)]

In [20]:
for match_id, start, end in matches:
    span = doc[start:end]
    print(span.text)

(123) 4560-7890
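The example above only exercises the dashed form. The following sketch runs the same pattern against the space-separated form and against a number whose area code has the wrong length. It assumes a blank English pipeline (`spacy.blank("en")`), which is enough here because `ORTH` and `SHAPE` are lexical attributes, and it uses the list-of-patterns form of `Matcher.add`; the sample texts are made up for illustration.

```python
import spacy
from spacy.matcher import Matcher

# Assumption: tokenizer-only pipeline; ORTH and SHAPE need no trained model.
nlp = spacy.blank("en")
matcher = Matcher(nlp.vocab)

pattern = [
    {"ORTH": "("}, {"SHAPE": "ddd"}, {"ORTH": ")"},
    {"SHAPE": "dddd"}, {"ORTH": "-", "OP": "?"}, {"SHAPE": "dddd"},
]
matcher.add("PhoneNumber", [pattern])

# Illustrative inputs: space-separated, dashed, and a two-digit area code.
texts = [
    "Call me at (123) 4567 8901",
    "Call me at (123) 4567-8901",
    "Call me at (12) 4567 8901",
]

found = []
for text in texts:
    doc = nlp(text)
    found.append([doc[start:end].text for _, start, end in matcher(doc)])

print(found)
```

The third text yields no match because the token `12` has the shape `dd`, not `ddd`.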


## Email Address Matching
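The pattern below applies a regular expression to the token text: one or more characters from the set `a-z`, `A-Z`, `0-9`, `-`, `_`, `.`, then an `@`, then again one or more characters from the same set. This is a deliberately loose check, not a full email validator.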

In [21]:
pattern = [{"TEXT": {"REGEX": "[a-zA-Z0-9-_.]+@[a-zA-Z0-9-_.]+"}}]

In [22]:
matcher = Matcher(nlp.vocab)

In [23]:
matcher.add("Email", None, pattern)

In [24]:
text = "Email me at email2me@kgptalkie.com and talk.me@kgptalkie.com"

In [25]:
doc = nlp(text)

In [26]:
matches = matcher(doc)
matches

[(11010771136823990775, 3, 4), (11010771136823990775, 5, 6)]

In [27]:
for match_id, start, end in matches:
    span = doc[start:end]
    print(span.text)

email2me@kgptalkie.com
talk.me@kgptalkie.com
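To see what this regular expression accepts on its own, here is a small sketch using only Python's `re` module on whitespace-separated words. It is a stand-in for illustration, not spaCy's tokenizer or matcher, and the names `EMAIL_RE`, `emails`, and `loose_ok` are hypothetical.

```python
import re

# Same character classes as the matcher pattern above.
EMAIL_RE = re.compile(r"[a-zA-Z0-9-_.]+@[a-zA-Z0-9-_.]+")

text = "Email me at email2me@kgptalkie.com and talk.me@kgptalkie.com"

# Keep only the words that match the expression from start to end.
emails = [word for word in text.split() if EMAIL_RE.fullmatch(word)]
print(emails)

# The expression is loose: it does not require a dot or a top-level domain.
loose_ok = bool(EMAIL_RE.fullmatch("a@b"))
# It does require at least one character on each side of the "@".
missing_domain = bool(EMAIL_RE.fullmatch("user@"))
print(loose_ok, missing_domain)
```

If stricter validation matters, the expression would need to be tightened, for example by requiring a dot in the domain part.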


## Hashtags and emoji on social media
Social media posts, especially tweets, can be difficult to work with. They’re very short and often contain various emoji and hashtags. By only looking at the plain text, you’ll lose a lot of valuable semantic information.

Let’s say you’ve extracted a large sample of social media posts on a specific topic, for example posts mentioning a brand name or product. As the first step of your data exploration, you want to filter out posts containing certain emoji and use them to assign a general sentiment score, based on whether the expressed emotion is positive or negative, e.g. 😀 or 😞. You also want to find, merge and label hashtags like #MondayMotivation, to be able to ignore or analyze them later.

By default, spaCy’s tokenizer will split emoji into separate tokens. This means that you can create a pattern for one or more emoji tokens. Valid hashtags usually consist of a #, plus a sequence of ASCII characters with no whitespace, making them easy to match as well.

We have made a list of positive and negative emojis.

In [28]:
pos_emoji = ["😀", "😃", "😂", "🤣", "😊", "😍"]  # Positive emoji
neg_emoji = ["😞", "😠", "😩", "😢", "😭", "😒"]  # Negative emoji
pos_emoji, neg_emoji

(['😀', '😃', '😂', '🤣', '😊', '😍'], ['😞', '😠', '😩', '😢', '😭', '😒'])

In [29]:
# Build one single-token pattern per emoji
pos_patterns = [[{"ORTH": emoji}] for emoji in pos_emoji]
neg_patterns = [[{"ORTH": emoji}] for emoji in neg_emoji]
pos_patterns, neg_patterns

([[{'ORTH': '😀'}],
  [{'ORTH': '😃'}],
  [{'ORTH': '😂'}],
  [{'ORTH': '🤣'}],
  [{'ORTH': '😊'}],
  [{'ORTH': '😍'}]],
 [[{'ORTH': '😞'}],
  [{'ORTH': '😠'}],
  [{'ORTH': '😩'}],
  [{'ORTH': '😢'}],
  [{'ORTH': '😭'}],
  [{'ORTH': '😒'}]])

We will write a callback, `label_sentiment()`, which the matcher calls for every match. It adds 0.1 to `doc.sentiment` when the matched emoji is positive and subtracts 0.1 when it is negative.

In [30]:
def label_sentiment(matcher, doc, i, matches):
    match_id, start, end = matches[i]
    if doc.vocab.strings[match_id] == 'HAPPY':
        doc.sentiment += 0.1
    elif doc.vocab.strings[match_id] == 'SAD':
        doc.sentiment -= 0.1

In [31]:
matcher = Matcher(nlp.vocab)

In [32]:
matcher.add("HAPPY", label_sentiment, *pos_patterns)
matcher.add("SAD", label_sentiment, *neg_patterns)

Alongside the HAPPY and SAD patterns, we add a HASHTAG pattern to the same matcher to extract hashtags. It matches a `#` token immediately followed by an ASCII token.

In [36]:
matcher.add("HASHTAG", None, [{"TEXT": "#"}, {"IS_ASCII": True}])

In [40]:
doc = nlp("Hello world 😀 #DATASCIENCE 😢")

In [41]:
matches = matcher(doc)

In [42]:
for match_id, start, end in matches:
    string_id = doc.vocab.strings[match_id]  # Look up string ID
    span = doc[start:end]
    print(string_id, span.text)

HAPPY 😀
HASHTAG #DATASCIENCE
SAD 😢
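The hashtag match above still spans two tokens, `#` and the word after it. The introduction to this section mentions merging hashtags, and one way to do that is with `Doc.retokenize`. The sketch below assumes a blank English pipeline and the list-of-patterns form of `Matcher.add`; the sample sentence is made up for illustration.

```python
import spacy
from spacy.matcher import Matcher

# Assumption: tokenizer-only pipeline; ORTH and IS_ASCII need no trained model.
nlp = spacy.blank("en")
matcher = Matcher(nlp.vocab)
matcher.add("HASHTAG", [[{"ORTH": "#"}, {"IS_ASCII": True}]])

doc = nlp("Hello world #DATASCIENCE and #MondayMotivation")

# Collect the spans first, then merge them all in one retokenize block.
hashtag_spans = [doc[start:end] for _, start, end in matcher(doc)]
with doc.retokenize() as retokenizer:
    for span in hashtag_spans:
        retokenizer.merge(span)

tokens = [token.text for token in doc]
print(tokens)
```

After merging, each hashtag is a single token, which makes it easier to filter or count later.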


## Efficient phrase matching
If you need to match large terminology lists, you can also use the PhraseMatcher and create Doc objects instead of token patterns, which is much more efficient overall. The Doc patterns can contain single or multiple tokens.

Here we extract the names listed in `terms` from a document. Each term is converted to a `Doc` with `nlp.make_doc()` and used as a pattern.

In [43]:
from spacy.matcher import PhraseMatcher

In [44]:
matcher = PhraseMatcher(nlp.vocab)

In [54]:
terms = ['DONALD TRUMP', 'ANGELA MERKEL', 'WASHINGTON D.C.']

In [55]:
pattern = [nlp.make_doc(text) for text in terms]
pattern

[DONALD TRUMP, ANGELA MERKEL, WASHINGTON D.C.]

In [56]:
matcher.add("term", None, *pattern)

In [57]:
doc = nlp("German Chancellor ANGELA MERKEL and US President DONALD TRUMP "
          "converse in the Oval Office inside the White House in WASHINGTON D.C.")
doc

German Chancellor ANGELA MERKEL and US President DONALD TRUMP converse in the Oval Office inside the White House in WASHINGTON D.C.

In [58]:
matches = matcher(doc)

In [59]:
for match_id, start, end in matches:
    string_id = doc.vocab.strings[match_id]  # Look up string ID
    span = doc[start:end]
    print(string_id, span.text)

term ANGELA MERKEL
term DONALD TRUMP
term WASHINGTON D.C.
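By default the `PhraseMatcher` compares the exact token text, so the upper-case terms above only match upper-case mentions. A hedged sketch of case-insensitive matching follows, using the `attr="LOWER"` argument of `PhraseMatcher`. The blank pipeline, the shortened term list, and the sample sentence are assumptions for illustration.

```python
import spacy
from spacy.matcher import PhraseMatcher

# Assumption: tokenizer-only pipeline; LOWER is a lexical attribute.
nlp = spacy.blank("en")

# Compare the lowercase form of each token instead of the exact text.
matcher = PhraseMatcher(nlp.vocab, attr="LOWER")

terms = ["Donald Trump", "Angela Merkel"]
matcher.add("term", [nlp.make_doc(term) for term in terms])

doc = nlp("angela merkel met DONALD TRUMP yesterday")

# Sort by start index so the result is in document order.
found = sorted((start, doc[start:end].text) for _, start, end in matcher(doc))
print(found)
```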


## Custom Rule Based Entity Recognition
The EntityRuler is a pipeline component that lets you add named entities based on pattern dictionaries. It makes it easy to combine rule-based and statistical named entity recognition.

### Entity Patterns
Entity patterns are dictionaries with two keys: “label”, specifying the label to assign to the entity if the pattern is matched, and “pattern”, the match pattern. The entity ruler accepts two types of patterns:

 - Phrase Pattern `{"label": "ORG", "pattern": "Apple"}`
 - Token Pattern `{"label": "GPE", "pattern": [{"LOWER": "san"}, {"LOWER": "francisco"}]}`

### Using the entity ruler
The EntityRuler is a pipeline component that’s typically added via `nlp.add_pipe`. When the nlp object is called on a text, it will find matches in the doc and add them as entities to the doc.ents, using the specified pattern label as the entity label.

https://spacy.io/api/annotation#named-entities

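We import `EntityRuler` from `spacy.pipeline` and load a fresh model with `spacy.load()`. The patterns are meant to label KGP Talkie as ORG and San Francisco as GPE. Note that `nlp.add_pipe(ruler)` appends the ruler to the end of the pipeline, after the statistical `ner` component, and by default the ruler does not overwrite entities that already exist. That is why the output below still shows KGP Talkie as PERSON, which is the model's own prediction; adding the ruler with `before="ner"` would let the ORG pattern take precedence.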

In [60]:
from spacy.pipeline import EntityRuler

In [61]:
nlp = spacy.load("en_core_web_sm")

In [62]:
ruler = EntityRuler(nlp)

In [63]:
patterns = [{"label": "ORG", "pattern": "KGP Talkie"},
            {"label": "GPE", "pattern": [{"LOWER": "san"}, {"LOWER": "francisco"}]}]
patterns

[{'label': 'ORG', 'pattern': 'KGP Talkie'},
 {'label': 'GPE', 'pattern': [{'LOWER': 'san'}, {'LOWER': 'francisco'}]}]

In [64]:
ruler.add_patterns(patterns)

In [65]:
nlp.add_pipe(ruler)

In [66]:
doc = nlp("KGP Talkie is opening its first big office in San Francisco.")
doc

KGP Talkie is opening its first big office in San Francisco.

In [67]:
for ent in doc.ents:
    print(ent.text, ent.label_)

KGP Talkie PERSON
first ORDINAL
San Francisco GPE
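To check what the patterns do on their own, the sketch below runs them in a blank pipeline with no statistical `ner` component. It assumes spaCy 3.x, where the ruler is added by its string name `"entity_ruler"` and `nlp.add_pipe` returns the component; this differs from the 2.x-style calls used in the cells above.

```python
import spacy

# Assumption: spaCy 3.x and a blank pipeline, so only the rules assign entities.
nlp = spacy.blank("en")
ruler = nlp.add_pipe("entity_ruler")

ruler.add_patterns([
    {"label": "ORG", "pattern": "KGP Talkie"},
    {"label": "GPE", "pattern": [{"LOWER": "san"}, {"LOWER": "francisco"}]},
])

doc = nlp("KGP Talkie is opening its first big office in San Francisco.")
ents = [(ent.text, ent.label_) for ent in doc.ents]
print(ents)
```

With a trained pipeline, passing `before="ner"` to `nlp.add_pipe` places the ruler ahead of the statistical recognizer.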


Compared to using only regular expressions on raw text, spaCy’s rule-based matcher engines and components not only let you find the words and phrases you’re looking for – they also give you access to the tokens within the document and their relationships. This means you can easily access and analyze the surrounding tokens, merge spans into single tokens or add entries to the named entities.