# Additional NLP Concepts

## Tokenization

In [3]:
text = '''
The United States of America (U.S.A. or USA), is located in North America.
It consists of 50 states, five major unincorporated territories, 326 Indian reservations, 
a federal district, and some minor possessions.[g] 
At 3.8 million square miles (9.8 million square kilometers), 
it is the world's third- or fourth-largest country by total area.[c] 
With a population of more than 331 million people, 
it is the third most populous country in the world. 
The national capital is Washington, D.C., and the most populous city is New York City.
'''.replace('\n', '').strip()

print(text)

The United States of America (U.S.A. or USA), is located in North America.It consists of 50 states, five major unincorporated territories, 326 Indian reservations, a federal district, and some minor possessions.[g] At 3.8 million square miles (9.8 million square kilometers), it is the world's third- or fourth-largest country by total area.[c] With a population of more than 331 million people, it is the third most populous country in the world. The national capital is Washington, D.C., and the most populous city is New York City.


## Split on spaces

In [4]:
for token in text.split(' ')[:20]:
    print(f'"{token}"', end=', ')

"The", "United", "States", "of", "America", "(U.S.A.", "or", "USA),", "is", "located", "in", "North", "America.It", "consists", "of", "50", "states,", "five", "major", "unincorporated", 
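One detail of `str.split(' ')` is worth keeping in mind: it splits at every individual space, so consecutive spaces produce empty tokens and other whitespace (such as newlines) is left inside tokens. Calling `str.split()` with no argument splits on runs of any whitespace instead. A minimal sketch, using a made-up sample string that is not part of the text above:

```python
# Toy string (an assumption for illustration) with a double space and a newline.
sample = "New  York\nCity"

on_single_space = sample.split(' ')   # splits at each individual space
on_whitespace = sample.split()        # splits on runs of any whitespace

print(on_single_space)  # ['New', '', 'York\nCity']
print(on_whitespace)    # ['New', 'York', 'City']
```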

## Split on non-alpha-numeric characters

`\W` matches any character that is not a word character.
With the ASCII flag it is equivalent to `[^a-zA-Z0-9_]`; without it, for `str` patterns, non-ASCII letters and digits also count as word characters.

In [7]:
import re 

for token in re.split(r'\W+', text)[:20]:
    print(f'"{token}"', end=', ')

"The", "United", "States", "of", "America", "U", "S", "A", "or", "USA", "is", "located", "in", "North", "America", "It", "consists", "of", "50", "states", 
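Two side effects of splitting on `\W+` are easy to miss: `re.split` returns an empty string when the text starts or ends with a non-word character, and the ASCII flag changes how accented letters are treated. The sketch below illustrates both on a made-up sample string (an assumption, not taken from the text above):

```python
import re

# Toy string; "\u00e9" is the accented letter in "café".
sample = "(U.S.A. or USA), caf\u00e9"

# re.split keeps whatever lies between separators, including an empty
# string when the text starts with a non-word character.
split_tokens = re.split(r'\W+', sample)
print(split_tokens)   # ['', 'U', 'S', 'A', 'or', 'USA', 'café']

# re.findall(r'\w+') collects the word runs directly, so no empty strings.
found_tokens = re.findall(r'\w+', sample)
print(found_tokens)   # ['U', 'S', 'A', 'or', 'USA', 'café']

# With re.ASCII the accented letter is no longer a word character.
ascii_tokens = re.findall(r'\w+', sample, flags=re.ASCII)
print(ascii_tokens)   # ['U', 'S', 'A', 'or', 'USA', 'caf']
```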

## Language-aware splitting

In [9]:
import spacy

nlp = spacy.blank("en")

for token in nlp(text)[:20]:
    print(f'"{token}"', end=', ')

"The", "United", "States", "of", "America", "(", "U.S.A.", "or", "USA", ")", ",", "is", "located", "in", "North", "America", ".", "It", "consists", "of", 

In [12]:
import spacy

nlp = spacy.blank("en")

for token in nlp("Let's go to N.Y.!"):
    print(f'"{token}"', end=', ')

"Let", "'s", "go", "to", "N.Y.", "!", 

In [15]:
import spacy

nlp = spacy.blank("en")

for token in nlp("I'm gonna visit New York City at 6:00 A.M. :-)"):
    print(f'"{token}"', end=', ')

"I", "'m", "gon", "na", "visit", "New", "York", "City", "at", "6:00", "A.M.", ":-)", 

In [27]:
nlp = spacy.blank("en")

doc = nlp("I'm gonna visit New York City at 6:00 A.M. :-)")
    
with doc.retokenize() as retokenizer:
    # Slide a 3-token window over the doc; len(doc) - 2 includes the final window.
    for i in range(len(doc) - 2):
        if doc[i:i+3].text == 'New York City':
            retokenizer.merge(doc[i:i+3], attrs={"LEMMA": "new york city"})
            
print("After:", [token.text for token in doc])

After: ['I', "'m", 'gon', 'na', 'visit', 'New York City', 'at', '6:00', 'A.M.', ':-)']


#### Question

- Can we automate the discovery of phrases like "New York City" without setting manual rules?
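One common way to approach this question is to score adjacent word pairs by how much more often they occur together than chance would predict, for example with pointwise mutual information (PMI). The sketch below is only an illustration: the toy corpus is made up, and `score_bigrams` and its `min_count` parameter are hypothetical names, not part of spaCy.

```python
import math
from collections import Counter

def score_bigrams(sentences, min_count=2):
    """Return {(w1, w2): PMI} for adjacent pairs seen at least min_count times."""
    unigrams, bigrams = Counter(), Counter()
    for tokens in sentences:
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))
    n_uni = sum(unigrams.values())
    n_bi = sum(bigrams.values())
    scores = {}
    for (w1, w2), count in bigrams.items():
        if count < min_count:
            continue  # rare pairs get unreliable, inflated PMI scores
        p_pair = count / n_bi
        p_w1 = unigrams[w1] / n_uni
        p_w2 = unigrams[w2] / n_uni
        scores[(w1, w2)] = math.log2(p_pair / (p_w1 * p_w2))
    return scores

# Made-up toy corpus, already split into tokens.
toy_corpus = [
    "i will visit new york next week".split(),
    "new york is a big city".split(),
    "she will see a big museum".split(),
    "a big city needs a big park".split(),
]

scores = score_bigrams(toy_corpus)
best = max(scores, key=scores.get)
print(best)  # ('new', 'york')
```

On a corpus this small the scores are not meaningful beyond the ranking; in practice the count threshold and the score cutoff both need tuning, and merging the discovered pairs and rescoring is one way to reach three-word phrases such as "New York City".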

## Lemmatization

In [29]:
import spacy

nlp = spacy.load('en_core_web_lg')

for token in nlp("I'm gonna visit New York City at 6:00 A.M. :-)"):
    print(f'"{token.lemma_}"', end=', ')

"I", "be", "going", "to", "visit", "New", "York", "City", "at", "6:00", "A.M.", ":-)", 

In [32]:
import spacy

nlp = spacy.load('en_core_web_lg')

for token in nlp("He is reading the books she read before"):
    print(f'"{token.lemma_}"', end=', ')

"he", "be", "read", "the", "book", "she", "read", "before", 

## Part of Speech Tags and Stop Words

In [43]:
import pandas as pd
import spacy

nlp = spacy.load("en_core_web_lg")
doc = nlp("Apple is looking at buying U.K. startup for $1 billion")

pd.DataFrame(
    [
        {
            'text': token.text, 
            'lemma': token.lemma_,
            'part_of_speech': token.pos_,
            'is_stop_word': token.is_stop,
            'is_alpha': token.is_alpha,
        } 
        for token in doc
    ]
).set_index('text')

| text | lemma | part_of_speech | is_stop_word | is_alpha |
|---|---|---|---|---|
| Apple | Apple | PROPN | False | True |
| is | be | AUX | True | True |
| looking | look | VERB | False | True |
| at | at | ADP | True | True |
| buying | buy | VERB | False | True |
| U.K. | U.K. | PROPN | False | False |
| startup | startup | NOUN | False | True |
| for | for | ADP | True | True |
| \$ | \$ | SYM | False | False |
| 1 | 1 | NUM | False | False |


### Dependency Parsing

In [66]:
import spacy
from spacy import displacy

nlp = spacy.load("en_core_web_lg")
doc = nlp("Apple is looking at buying U.K. startup for $1 billion")

# https://spacy.io/api/top-level#displacy_options
displacy.render(doc, style="dep", options={"distance": 85})

## Exercise 

- Can you classify the documents by `polarity` (or `deceptive`) using only their `part of speech` tags instead of their actual text?
- Do some `part of speech` tags correlate more strongly with a particular `polarity` (or with being `deceptive`)?

## Named Entity Recognition (NER)

In [46]:
import pandas as pd
import spacy

nlp = spacy.load("en_core_web_lg")
doc = nlp("Apple is looking at buying U.K. startup for $1 billion")

pd.DataFrame(
    [
        {
            'text': ent.text, 
            'label': ent.label_,
            'start_char': ent.start_char,
            'end_char': ent.end_char,
        } 
        for ent in doc.ents
    ]
).set_index('text')

| text | label | start_char | end_char |
|---|---|---|---|
| Apple | ORG | 0 | 5 |
| U.K. | GPE | 27 | 31 |
| $1 billion | MONEY | 44 | 54 |
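The `start_char` and `end_char` values are character offsets into the original string, with the end offset exclusive, so they can be used directly as Python slice bounds. A small check in plain Python, with the offsets copied by hand from the output above:

```python
text = "Apple is looking at buying U.K. startup for $1 billion"

# (label, start_char, end_char) copied from the output above.
spans = [("ORG", 0, 5), ("GPE", 27, 31), ("MONEY", 44, 54)]

# Slicing with [start:end] recovers each entity's text.
recovered = [text[start:end] for _, start, end in spans]
print(recovered)  # ['Apple', 'U.K.', '$1 billion']
```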


### Custom NER

In [47]:
import pandas as pd
import spacy

nlp = spacy.load("en_core_web_lg")
doc = nlp("I was listening to The Who on Spotify")

pd.DataFrame(
    [
        {
            'text': ent.text, 
            'label': ent.label_,
            'start_char': ent.start_char,
            'end_char': ent.end_char,
        } 
        for ent in doc.ents
    ]
).set_index('text')

| text | label | start_char | end_char |
|---|---|---|---|
| Spotify | ORG | 30 | 37 |


In [50]:
import pandas as pd
import spacy

nlp = spacy.load("en_core_web_lg")

# Custom Rule for The Who
ruler = nlp.add_pipe("entity_ruler")
patterns = [{"label": "BAND", "pattern": [{"TEXT": "The"}, {"TEXT": "Who"}]}]
ruler.add_patterns(patterns)

doc = nlp("I was listening to The Who on Spotify")

pd.DataFrame(
    [
        {
            'text': ent.text, 
            'label': ent.label_,
            'start_char': ent.start_char,
            'end_char': ent.end_char,
        } 
        for ent in doc.ents
    ]
).set_index('text')


| text | label | start_char | end_char |
|---|---|---|---|
| The Who | BAND | 19 | 26 |
| Spotify | ORG | 30 | 37 |


### Notes

- Statistical vs Rule-based Entity Recognition
- Entity Linking ([spaCy docs](https://spacy.io/usage/linguistic-features#entity-linking))
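
To make the first note more concrete: a rule-based recognizer only finds what its rules list, but its behavior is fully predictable, whereas a statistical model can generalize to unseen names at the cost of occasional misses (as with "The Who" above). The sketch below shows the rule-based side in plain Python. The lookup table `GAZETTEER` and the helper `find_entities` are hypothetical names for illustration and are not how spaCy implements its entity ruler.

```python
import re

# Hypothetical lookup table of known entity names and their labels.
GAZETTEER = {"The Who": "BAND", "Spotify": "ORG"}

def find_entities(text):
    """Return (text, label, start_char, end_char) for each exact gazetteer match."""
    found = []
    for name, label in GAZETTEER.items():
        pattern = r'\b' + re.escape(name) + r'\b'
        for match in re.finditer(pattern, text):
            found.append((match.group(), label, match.start(), match.end()))
    # Sort by start offset so results follow the order of the text.
    return sorted(found, key=lambda item: item[2])

entities = find_entities("I was listening to The Who on Spotify")
print(entities)
# [('The Who', 'BAND', 19, 26), ('Spotify', 'ORG', 30, 37)]

# A name missing from the table is simply not found.
print(find_entities("I was listening to Queen"))  # []
```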