# Intro to NLP

In [1]:
import spacy

nlp = spacy.load("en_core_web_sm") # English language model

In [2]:
doc = nlp("The social media firm's communications team tweeted: 'Now that everyone is asking… yes, we've been working on an edit feature since last year!'")

In [4]:
# Tokenizing

for token in doc:
    print(token)

The
social
media
firm
's
communications
team
tweeted
:
'
Now
that
everyone
is
asking
…
yes
,
we
've
been
working
on
an
edit
feature
since
last
year
!
'


## Text preprocessing

A few types of preprocessing can improve how we model text. The first is `lemmatizing`. The `lemma` of a word is its base form. For example, **"walk"** is the lemma of the word **"walking"**.

It's also common to remove stopwords. Stopwords are words that occur frequently in the language and don't contain much information. English stopwords include "the", "is", "and", "but", "not".

In [8]:
# Header row; the string has no placeholders, so no f-string or .format() is needed
print("Token \t\t\t\tLemma \t\t\t\tStopword")
print("-"*80)

for token in doc:
    print(f"{token.text}\t\t\t\t{token.lemma_}\t\t\t\t{token.is_stop}")

Token 				Lemma 				Stopword
--------------------------------------------------------------------------------
The				the				True
social				social				False
media				medium				False
firm				firm				False
's				's				True
communications				communication				False
team				team				False
tweeted				tweet				False
:				:				False
'				'				False
Now				now				True
that				that				True
everyone				everyone				True
is				be				True
asking				ask				False
…				…				False
yes				yes				False
,				,				False
we				we				True
've				've				True
been				be				True
working				work				False
on				on				True
an				an				True
edit				edit				False
feature				feature				False
since				since				True
last				last				True
year				year				False
!				!				False
'				'				False


Language data has a lot of noise mixed in with informative content. Removing stopwords can help a predictive model focus on relevant words, and lemmatizing helps in a similar way by combining multiple forms of the same word into one base form. Neither step is guaranteed to help, though: both discard information, and a model can perform worse after them. Treat this preprocessing as part of the `hyperparameter optimization process`.
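One way to treat these choices as tunable settings is to wrap them in a small helper with on/off flags. The sketch below is illustrative only: `preprocess`, its parameters, and the `demo_` names are hypothetical, and it uses `spacy.blank("en")` (tokenizer and stopword list only, no lemmatizer) so that it runs without downloading a model. With a full pipeline such as `en_core_web_sm`, `lemmatize=True` would return real lemmas.

```python
import spacy


def preprocess(doc, lemmatize=True, drop_stopwords=True):
    """Return lowercase token strings from a spaCy Doc (hypothetical helper)."""
    tokens = []
    for token in doc:
        # Punctuation and whitespace tokens are always skipped
        if token.is_punct or token.is_space:
            continue
        if drop_stopwords and token.is_stop:
            continue
        # lemma_ is empty when the pipeline has no lemmatizer, so fall back to the raw text
        text = token.lemma_ if (lemmatize and token.lemma_) else token.text
        tokens.append(text.lower())
    return tokens


# Assumption: a blank English pipeline is enough to demonstrate the flags
demo_nlp = spacy.blank("en")
demo_doc = demo_nlp("We have been working on an edit feature since last year!")

without_stop = preprocess(demo_doc, lemmatize=False, drop_stopwords=True)
with_stop = preprocess(demo_doc, lemmatize=False, drop_stopwords=False)

print(without_stop)
print(with_stop)
```

Each flag combination can then be evaluated like any other hyperparameter setting, keeping whichever one scores best on validation data.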

## Pattern Matching

In [9]:
from spacy.matcher import PhraseMatcher

In [12]:
matcher = PhraseMatcher(nlp.vocab, attr="LOWER")

The matcher is created using the vocabulary of the model. Setting **attr="LOWER"** matches phrases on the lowercased form of each token, which makes the matching `case-insensitive`.
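To see what the `attr` setting changes, the sketch below compares a `LOWER` matcher with one that uses the default exact-text attribute. The `demo_` names and the sample sentence are made up for illustration, and a blank English pipeline is assumed so that no model download is needed.

```python
import spacy
from spacy.matcher import PhraseMatcher

# Assumption: a tokenizer-only pipeline is sufficient for phrase matching
demo_nlp = spacy.blank("en")

# Case-insensitive matcher: compares the lowercased form of each token
lower_matcher = PhraseMatcher(demo_nlp.vocab, attr="LOWER")
lower_matcher.add("Companies", [demo_nlp.make_doc("Amazon")])

# Default matcher: compares the exact token text
exact_matcher = PhraseMatcher(demo_nlp.vocab)
exact_matcher.add("Companies", [demo_nlp.make_doc("Amazon")])

demo_doc = demo_nlp.make_doc("AMAZON, amazon and Amazon")

lower_found = [demo_doc[start:end].text for _, start, end in lower_matcher(demo_doc)]
exact_found = [demo_doc[start:end].text for _, start, end in exact_matcher(demo_doc)]

print("LOWER:", lower_found)
print("exact:", exact_found)
```

The `LOWER` matcher picks up all three spellings, while the exact matcher only finds the one whose casing is identical to the pattern.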

In [14]:
text = "The rockets will be made by Arianespace, which was founded by Amazon owner Jeff Bezos and United Launch Alliance. Like Elon Musk's Starlink, users will connect to the internet via a terminal"

In [15]:
terms = ["Starlink","Arianespace","Amazon","United Launch Alliance","Elon Musk","Jeff Bezos"]

In [17]:
# Convert each term into a Doc so the matcher can use it as a pattern
patterns = [nlp(term) for term in terms]
print(patterns)

[Starlink, Arianespace, Amazon, United Launch Alliance, Elon Musk, Jeff Bezos]


In [18]:
matcher.add("TerminologyList",patterns)

In [22]:
# Note the trailing space: adjacent string literals are concatenated with nothing in between
text_doc = nlp("Unlike Elon Musk's Falcon rockets, the rockets used for Project Kuiper's are still in development. "
               "Amazon says Project Kuiper aims to provide high-speed broadband to customers.")

matches = matcher(text_doc)
print(matches)

[(3766102292120407359, 1, 3), (3766102292120407359, 19, 20)]


In [25]:
match_id_1, start_1, end_1 = matches[0]
match_id_2, start_2, end_2 = matches[1]


print(nlp.vocab.strings[match_id_1], text_doc[start_1:end_1])
print(nlp.vocab.strings[match_id_2], text_doc[start_2:end_2])

TerminologyList Elon Musk
TerminologyList Amazon
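Unpacking each match by hand only works when the number of matches is known in advance. A loop over the match tuples handles any number of them. The sketch below is a minimal, self-contained version of that idea; the `demo_` names and the shortened sentence are illustrative assumptions, and a blank English pipeline stands in for the full model.

```python
import spacy
from spacy.matcher import PhraseMatcher

# Assumption: tokenizer-only pipeline, since phrase matching needs no trained components
demo_nlp = spacy.blank("en")
demo_matcher = PhraseMatcher(demo_nlp.vocab, attr="LOWER")

demo_terms = ["Amazon", "Elon Musk"]
demo_matcher.add("TerminologyList", [demo_nlp.make_doc(term) for term in demo_terms])

demo_doc = demo_nlp.make_doc("Unlike Elon Musk's Falcon rockets, Amazon's rockets are still in development.")

results = []
for match_id, start, end in demo_matcher(demo_doc):
    label = demo_nlp.vocab.strings[match_id]  # hash -> the name passed to matcher.add
    span = demo_doc[start:end]                # token slice covering the matched phrase
    results.append((label, span.text))
    print(label, span.text)
```

Each tuple holds the hashed rule name plus the start and end token indices, so the same loop works whether the matcher returns zero, two, or two hundred matches.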
