# Kaggle Course - Introduction to Natural Language Processing

spaCy is a leading NLP library and has become one of the most popular frameworks in the field. It has very rich [documentation](https://spacy.io/usage), including an interactive four-lesson [tutorial](https://course.spacy.io/en/) where you can learn all about spaCy.

Installation instructions are available in the documentation. However, if you work in Jupyter Notebook inside a conda environment like me, run the commands below in Anaconda Prompt (Run as administrator) to start your spaCy journey.

****Step1 : Install spaCy****<br/>
conda install -c conda-forge spacy

****Step2 : Install Language Model****<br/>Choose one of these two language models<br/><br/>
python -m spacy download en  #default English model (~50MB)<br/><br/>
python -m spacy download en_core_web_md # larger English model (1GB)

spaCy is used primarily for the following purposes:
- Basic text processing and pattern matching
- Building machine learning models with text
- Representing text with word embeddings that numerically capture the meaning of words or documents

In [13]:
import spacy

spaCy relies on ****models**** that are language-specific and come in different sizes. Here is how we load the English spaCy model.

In [14]:
nlp = spacy.load('en')

Once the model is loaded, we can process text by calling it on a string, as in the example below.

In [15]:
doc = nlp("Tea is healthy and calming, don't you think?")

Once we have this doc object, we can do a lot with it, such as:
- Tokenizing
- Lemmatizing
- Removing stop words
- Pattern matching

****Tokenizing****
<br/>Tokenizing splits the text into tokens and returns a document object containing them. A token is a unit of text in the document, such as an individual word or punctuation mark.

In [17]:
for token in doc:
    print(token)

Tea
is
healthy
and
calming
,
do
n't
you
think
?
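
As a rough illustration of what tokenizing involves, here is a minimal plain-Python sketch that reproduces the token list above with a regular expression. The `TOKEN_RE` pattern and the `tokenize` helper are hypothetical names invented for this example. spaCy's real tokenizer is far more thorough, so treat this only as an approximation of the idea.

```python
import re

# Hypothetical pattern, not spaCy's rules: split off the "n't" contraction,
# then match runs of word characters or single punctuation marks.
TOKEN_RE = re.compile(r"n't|\w+(?=n't)|\w+|[^\w\s]")


def tokenize(text):
    """Return the tokens of `text` as a list of strings."""
    return TOKEN_RE.findall(text)


tokens = tokenize("Tea is healthy and calming, don't you think?")
print(tokens)
```

Note how the comma and question mark become tokens of their own and "don't" is split into "do" and "n't", just as in the spaCy output.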


****Lemmatizing****
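<br/>Lemmatizing is the process of reducing a word to its base form (its lemma). For example, lemmatizing the word "running" gives us "run". We can perform this on tokens. In the output below, -PRON- is the placeholder lemma that spaCy 2.x assigns to pronouns such as "you".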

In [19]:
for token in doc:
    print(token.lemma_)

tea
be
healthy
and
calm
,
do
not
-PRON-
think
?
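
To make the idea concrete, here is a minimal sketch of lemmatizing by table lookup, assuming a tiny hand-written table that covers only the words in this example. `LEMMA_TABLE` and `lemmatize` are hypothetical names, and this is not how spaCy is implemented internally. For instance, the sketch leaves "you" unchanged instead of returning -PRON-.

```python
# Hypothetical lookup table covering only the words in this example.
LEMMA_TABLE = {"is": "be", "calming": "calm", "n't": "not", "running": "run"}


def lemmatize(token):
    """Return the base form of `token`, falling back to its lowercase form."""
    lowered = token.lower()
    return LEMMA_TABLE.get(lowered, lowered)


tokens = ["Tea", "is", "healthy", "and", "calming", ",", "do", "n't", "you", "think", "?"]
lemmas = [lemmatize(token) for token in tokens]
print(lemmas)
```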


****Removing Stop Words****
<br/>Stop words are words that occur very frequently and don't add much information, such as "the", "is", "but", and "and". Removing them is a common preprocessing step. We can check whether a token is a stop word with its is_stop attribute.

In [22]:
for token in doc:
    print(token.is_stop)

False
True
False
True
False
False
True
True
True
False
False


We can summarize the three steps like this:

In [23]:
print("Token \t\tLemma \t\tStopword")
print("-"*40)
for token in doc:
    print(f"{token.text}\t\t{token.lemma_}\t\t{token.is_stop}")

Token 		Lemma 		Stopword
----------------------------------------
Tea		tea		False
is		be		True
healthy		healthy		False
and		and		True
calming		calm		False
,		,		False
do		do		True
n't		not		True
you		-PRON-		True
think		think		False
?		?		False
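
The three columns above are typically combined into a single cleaning step. The sketch below does that in plain Python, using the rows of the table as hard-coded data so that it runs without a model. The names `rows` and `cleaned` are made up for this example.

```python
# (token, lemma, is_stop) rows copied from the summary table above.
rows = [
    ("Tea", "tea", False), ("is", "be", True), ("healthy", "healthy", False),
    ("and", "and", True), ("calming", "calm", False), (",", ",", False),
    ("do", "do", True), ("n't", "not", True), ("you", "-PRON-", True),
    ("think", "think", False), ("?", "?", False),
]

# Keep the lemma of every token that is neither a stop word nor punctuation.
cleaned = [lemma for token, lemma, is_stop in rows
           if not is_stop and token.isalpha()]
print(cleaned)
```

With a real doc object, the same filter could read token.is_stop and token.is_punct directly instead of using a hard-coded table.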


#### Pattern Matching
Pattern matching is the process of matching tokens or phrases within a document. Compared to using regular expressions on raw text, spaCy’s rule-based matcher engines and components not only let you find the words and phrases you’re looking for – they also give you access to the tokens within the document and their relationships. This means you can easily access and analyze the surrounding tokens, merge spans into single tokens or add entries to the named entities in doc.ents.

To match individual tokens, you create a **Matcher**. When you want to match a list of terms, it's easier and more efficient to use **PhraseMatcher**. For example, if you want to find where different smartphone models show up in some text, you can create patterns for the model names of interest. First you create the <b>PhraseMatcher</b> itself.

In [26]:
from spacy.matcher import PhraseMatcher
matcher = PhraseMatcher(nlp.vocab, attr='LOWER')

The matcher is created using the vocabulary of your model. Here we're using the small English model loaded earlier, which we named <b>nlp</b>. Setting attr='LOWER' matches the phrases on lowercased text, which provides case-insensitive matching. Keep in mind for later that the object <b>matcher</b> is a <b>PhraseMatcher</b>.

Next you create a list of terms to match in the text. The phrase matcher needs the patterns as document objects. The easiest way to get these is with a list comprehension that runs each term through your model (in our case, <b>nlp</b>).

Then we add a match rule to the phrase matcher. A match rule consists of an ID key, an on_match callback, and one or more patterns. In our case the ID key is "TerminologyList", the callback is None because we don't need one, and the patterns are the smartphone model names.

In [27]:
terms = ['Galaxy Note', 'iPhone 11', 'iPhone XS', 'Google Pixel']
patterns = [nlp(text) for text in terms]
matcher.add("TerminologyList", None, *patterns)

Then we create a document from the text we want to search and use the phrase matcher to find where the terms occur. In our case we search for the smartphone models in the text below.

In [28]:

text_doc = nlp("Glowing review overall, and some really interesting side-by-side "
               "photography tests pitting the iPhone 11 Pro against the "
               "Galaxy Note 10 Plus and last year’s iPhone XS and Google Pixel 3.") 
matches = matcher(text_doc)
print(matches)

[(3766102292120407359, 17, 19), (3766102292120407359, 22, 24), (3766102292120407359, 30, 32), (3766102292120407359, 33, 35)]


Each match is a tuple of the match ID and the start and end positions of the phrase. The positions are token indices, and the end index is exclusive. For example, the first match here has ID 3766102292120407359, start position 17, and end position 19.

We can then recover the rule name and the matched text by looking up the match ID in the vocabulary and slicing the document from start to end, as shown below.

In [29]:
match_id, start, end = matches[0]
print(nlp.vocab.strings[match_id], text_doc[start:end])

TerminologyList iPhone 11
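
To show the mechanics behind these (ID, start, end) tuples, here is a minimal plain-Python sketch of case-insensitive phrase matching over a token list. It is not spaCy's implementation: `find_phrases` is a hypothetical helper, the sample sentence is made up for this example, tokens are produced by a simple whitespace split, and the rule name stands in for the numeric match ID.

```python
def find_phrases(tokens, terms, label="TerminologyList"):
    """Return (label, start, end) for each term found in `tokens`.

    Matching is case-insensitive, and `end` is exclusive, so
    tokens[start:end] is the matched phrase.
    """
    lowered = [token.lower() for token in tokens]
    patterns = [term.lower().split() for term in terms]
    matches = []
    for start in range(len(lowered)):
        for pattern in patterns:
            end = start + len(pattern)
            if lowered[start:end] == pattern:
                matches.append((label, start, end))
    return matches


# Made-up sample sentence, split on whitespace for simplicity.
tokens = "I compared the iPhone 11 Pro with the Galaxy Note 10 and the google pixel 3".split()
terms = ['Galaxy Note', 'iPhone 11', 'iPhone XS', 'Google Pixel']

matches = find_phrases(tokens, terms)
for label, start, end in matches:
    print(label, start, end, " ".join(tokens[start:end]))
```

Because both the text and the terms are lowercased before comparison, "google pixel" still matches the term 'Google Pixel', which mirrors the effect of attr='LOWER'.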
