## Chapter 2: The text processing pipeline

### Tokenization
The first step an NLP application performs on a text is tokenization: splitting the text into tokens such as words and punctuation marks.

In [4]:
import spacy
nlp = spacy.load('en_core_web_sm')
doc = nlp(u'I am flying to Frisco')
print([w.text for w in doc])

['I', 'am', 'flying', 'to', 'Frisco']
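To get a feel for what tokenization involves, here is a minimal sketch that uses only a regular expression. It is not how spaCy tokenizes (spaCy applies language-specific rules and exceptions); the pattern and the function name `simple_tokenize` are assumptions made for illustration.

```python
import re

# Toy pattern: a run of word characters, or a single non-space symbol.
TOKEN_PATTERN = re.compile(r"\w+|[^\w\s]")

def simple_tokenize(text):
    """Split text into word and punctuation tokens (toy approximation)."""
    return TOKEN_PATTERN.findall(text)

print(simple_tokenize("I have flown to LA. Now I am flying to Frisco"))
```

A pattern this simple breaks on contractions, URLs, and abbreviations, which is why a real tokenizer carries many extra rules.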


### Lemmatization
Lemmatization is the process of reducing word forms to their lemma. The lemma is the base form of a word, the form you would find in a dictionary.

In [5]:
import spacy
nlp = spacy.load('en_core_web_sm')
# Adjacent string literals are joined, so no stray whitespace token appears.
doc = nlp('this product integrates both libraries for downloading '
          'and applying patches')
for token in doc:
    print(token.text, token.lemma_)

this this
product product
integrates integrate
both both
libraries library
for for
downloading download
and and
applying apply
patches patch
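Conceptually, the simplest kind of lemmatizer is a lookup table from word form to lemma. The sketch below hard-codes the pairs from the output above; the table and the function name `lookup_lemmas` are illustrative assumptions, and spaCy's own lemmatizer is considerably more elaborate.

```python
# Word-form -> lemma pairs copied from the output above; a real lemmatizer
# uses far larger tables and part-of-speech information.
LEMMA_TABLE = {
    "integrates": "integrate",
    "libraries": "library",
    "downloading": "download",
    "applying": "apply",
    "patches": "patch",
}

def lookup_lemmas(tokens):
    """Return the lemma for each token, falling back to the token itself."""
    return [LEMMA_TABLE.get(token.lower(), token) for token in tokens]

tokens = "this product integrates both libraries".split()
print(lookup_lemmas(tokens))
```

Words missing from the table pass through unchanged, which matches the output above for words like this and both.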


### Using lemmatization for meaning recognition
Reducing words to their lemmas makes it easier to extract the necessary information from input. We can also add a rule to the pipeline's attribute ruler so that a nickname such as Frisco is assigned the lemma San Francisco.

In [8]:
import spacy
nlp = spacy.load('en_core_web_sm')
doc = nlp(u'I am flying to Frisco')
print([w.text for w in doc])
nlp.get_pipe("attribute_ruler").add([[{"TEXT": "Frisco"}]], {"LEMMA": "San Francisco"})
print([w.lemma_ for w in nlp(u'I am flying to Frisco')])

['I', 'am', 'flying', 'to', 'Frisco']
['I', 'be', 'fly', 'to', 'San Francisco']


### Using parts of speech to find relevant verbs

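We can filter tokens by their fine-grained part-of-speech tag to find verbs in the base form (VB) and the present participle form (VBG), which is used in the present progressive.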

In [1]:
import spacy
nlp = spacy.load('en_core_web_sm')
doc = nlp(u'I have flown to LA. Now I am flying to Frisco')
print([w.text for w in doc if w.tag_ == 'VBG' or w.tag_ == 'VB'])


['flying']


We can also find the proper nouns in a sentence.

In [2]:
print([w.text for w in doc if w.pos_ == 'PROPN'])

['LA', 'Frisco']


We can access the syntactic dependency labels of the tokens in a sentence.

In [3]:
import spacy
nlp = spacy.load('en_core_web_sm')
doc = nlp(u'I have flown to LA. Now I am flying to Frisco')
for token in doc:
    print(token.text, token.pos_, token.dep_)

I PRON nsubj
have AUX aux
flown VERB ROOT
to ADP prep
LA PROPN pobj
. PUNCT punct
Now ADV advmod
I PRON nsubj
am AUX aux
flying VERB ROOT
to ADP prep
Frisco PROPN pobj


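What this output doesn't show is how the words in a sentence relate to each other. Each such relation between a head word and a dependent word is called a dependency arc. To look at the dependency arcs in the discourse, we can use the loop below, which prints each token's head, its dependency label, and the token itself.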

In [4]:
for token in doc:
    print(token.head.text, token.dep_, token.text)

flown nsubj I
flown aux have
flown ROOT flown
flown prep to
to pobj LA
flown punct .
flying advmod Now
flying nsubj I
flying aux am
flying ROOT flying
flying prep to
to pobj Frisco
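To see how arcs like these form a tree that can be walked, the sketch below hard-codes the first sentence's arcs from the output above and follows the path root, prep, pobj. It is plain Python for illustration rather than spaCy's API; the names `ARCS`, `build_children`, and `prepositional_objects` are assumptions.

```python
from collections import defaultdict

# (head, dependency label, child) triples for the first sentence,
# copied from the output above.
ARCS = [
    ("flown", "nsubj", "I"),
    ("flown", "aux", "have"),
    ("flown", "ROOT", "flown"),
    ("flown", "prep", "to"),
    ("to", "pobj", "LA"),
    ("flown", "punct", "."),
]

def build_children(arcs):
    """Map each head word to its (label, child) pairs, skipping the ROOT self-loop."""
    children = defaultdict(list)
    for head, label, child in arcs:
        if label != "ROOT":
            children[head].append((label, child))
    return children

def prepositional_objects(arcs):
    """Follow root -> prep -> pobj and return (root, preposition, object) triples."""
    root = next(child for head, label, child in arcs if label == "ROOT")
    children = build_children(arcs)
    found = []
    for label, prep in children[root]:
        if label == "prep":
            for sub_label, obj in children[prep]:
                if sub_label == "pobj":
                    found.append((root, prep, obj))
    return found

print(prepositional_objects(ARCS))
```

Keying the tree by word text only works here because every word in this sentence is unique; code working on real parses would key by token position instead.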


We might be interested in tokens marked with ROOT and pobj as those are key to intent recognition in a sentence.

In [10]:
import spacy
nlp = spacy.load('en_core_web_sm')
doc = nlp(u'I have flown to LA. Now I am flying to Frisco')
for sent in doc.sents:
    print([w.text for w in sent if w.dep_ == 'ROOT' or w.dep_ == 'pobj'])

['flown', 'LA']
['flying', 'Frisco']


### Try This
We can combine the examples from the preceding sections into a single script that identifies the speaker's intent to fly to San Francisco.

In [12]:
import spacy
nlp = spacy.load('en_core_web_sm')
# Register the lemma rule before processing the text so it applies to this doc.
nlp.get_pipe("attribute_ruler").add([[{"TEXT": "Frisco"}]], {"LEMMA": "San Francisco"})
doc = nlp(u'I have flown to LA. Now I am flying to Frisco')
for sent in doc.sents:
    print([w.lemma_ for w in sent if (w.dep_ == 'ROOT' and w.tag_ == 'VBG') or w.dep_ == 'pobj'])

['LA']
['fly', 'San Francisco']


### Named Entity Recognition
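We can find named entities in a sentence using each token's ent_type attribute (an integer ID, 0 when the token is not part of an entity) and ent_type_ (the corresponding string label).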

In [13]:
import spacy
nlp = spacy.load('en_core_web_sm')
doc = nlp(u'I have flown to LA. Now I am flying to Frisco')
for token in doc:
    if token.ent_type != 0:
        print(token.text, token.ent_type_)

LA GPE
Frisco ORG


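GPE means the entity is a geopolitical entity, such as a country, city, or state. ORG means an organization, so in this run the small model has mislabeled Frisco.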
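Token-level labels become awkward when an entity spans several tokens, such as San Francisco. The sketch below merges adjacent tokens that share a label; the tagged pairs are hypothetical and the grouping rule is a simplification (spaCy tracks entity boundaries with token.ent_iob_ and exposes whole entities as doc.ents).

```python
from itertools import groupby

# Hypothetical (token text, entity label) pairs; an empty label means "no entity".
TAGGED = [
    ("I", ""), ("am", ""), ("flying", ""), ("to", ""),
    ("San", "GPE"), ("Francisco", "GPE"), ("from", ""), ("LA", "GPE"),
]

def group_entities(tagged):
    """Merge runs of adjacent tokens that share a non-empty label."""
    entities = []
    for label, run in groupby(tagged, key=lambda pair: pair[1]):
        if label:
            entities.append((" ".join(text for text, _ in run), label))
    return entities

print(group_entities(TAGGED))
```

One limitation of this rule is that two distinct entities of the same type standing next to each other would be merged into one.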