###### spaCy is one of the leading libraries for Natural Language Processing in Python. It comes with several pretrained models.
###### For example, the code below loads the English model with spacy.load('en').

In [3]:
import spacy

In [15]:
nlp = spacy.load('en')

###### Process the text below with the loaded model

In [43]:
doc = nlp("Oh, ma boy ain't got time to deal with you.")

###### You can do many things with this Doc object

# Tokenizing

###### Tokenizing splits the given text into its individual units (tokens), such as words and punctuation marks
###### It returns a document object that contains the tokens

Iterating over the document yields its tokens.

In [22]:
for token in doc:
    print(token)

Oh
,
ma
boy
ai
n't
got
time
to
deal
with
you
.


You can see that it also separates ain't into two tokens (ai and n't), and that punctuation marks become tokens of their own.
Each token carries additional information about itself, such as token.lemma_, token.is_stop, etc.

# Text Processing

### Stop words

token.is_stop returns True or False, depending on whether the token is a stop word.
Stop words are words that occur frequently in any text, whether it is a sentence, a paragraph, or something longer.
Examples of stop words are is, the, but, not, how, or what. Stop words usually don't carry much information.
Removing stop words is a common practice in text processing. Whether to remove them can also be treated as a hyperparameter to tune.
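
To make the idea concrete, here is a minimal sketch of stop-word filtering in plain Python. The `STOP_WORDS` set and the `remove_stop_words` helper are hypothetical and hand-picked for illustration. spaCy ships its own, much larger list and exposes it through token.is_stop.

```python
# Hypothetical, hand-picked stop-word set for illustration only;
# spaCy's built-in list is much larger.
STOP_WORDS = {"is", "the", "but", "not", "how", "or", "what", "of", "their"}


def remove_stop_words(tokens):
    """Return the tokens that are not in STOP_WORDS (case-insensitive)."""
    return [tok for tok in tokens if tok.lower() not in STOP_WORDS]


tokens = ["The", "job", "of", "feet", "is", "walking"]
content_words = remove_stop_words(tokens)
print(content_words)
```

With a spaCy Doc, the same filtering can be written as `[token for token in doc if not token.is_stop]`.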

### Lemmatization / Lemmatizing / Lemma

Lemmatization converts a word into its base form. For example, the lemma of 'walking' is 'walk' and the lemma of 'eating' is 'eat'.
Lemmatizing words is also a common practice in text processing. Like stop-word removal, whether to apply it can be treated as a hyperparameter to tune.
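
As a rough illustration of what "converting to a base form" means, the sketch below uses a tiny lookup table. `LEMMA_TABLE` and `lemmatize` are hypothetical names invented for this example. This is not how spaCy implements lemmatization; real lemmatizers rely on far larger tables and rules.

```python
# Hypothetical lookup table for illustration only.
LEMMA_TABLE = {
    "walking": "walk",
    "eating": "eat",
    "dancing": "dance",
    "is": "be",
}


def lemmatize(word):
    """Return the base form from LEMMA_TABLE, or the lowercased word if unknown."""
    key = word.lower()
    return LEMMA_TABLE.get(key, key)


lemmas = [lemmatize(w) for w in ["Walking", "is", "fun"]]
print(lemmas)
```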

In [59]:
doc = nlp("The job of feets is walking, but their hobby is dancing.")

In [60]:
print('Token\t\t\t\tLemma\t\t\t\tStopWords')
print('_'*100)
for token in doc:
    print("{t}\t\t\t\t{l}\t\t\t\t{s} ".format(t=token, l=token.lemma_, s=token.is_stop)) 
    # \t inserts a tab character, used here to line up the columns

Token				Lemma				StopWords
____________________________________________________________________________________________________
The				the				True 
job				job				False 
of				of				True 
feets				feet				False 
is				be				True 
walking				walk				False 
,				,				False 
but				but				True 
their				-PRON-				True 
hobby				hobby				False 
is				be				True 
dancing				dance				False 
.				.				False 


## Pattern Matching

Pattern matching is another common task in NLP (Natural Language Processing). To match patterns of tokens, you create a Matcher. To match a list of terms, it is easier and more efficient to use a PhraseMatcher.
For example, to find where certain smartphones are mentioned in a text, you create patterns for the model names you are interested in.

In [61]:
from spacy.matcher import PhraseMatcher

In [87]:
matcher = PhraseMatcher(nlp.vocab, attr='LOWER')

Here, the matcher is created using the vocabulary of the model that was loaded earlier. attr='LOWER' makes the matching case-insensitive.

Now create a list of the terms (phone names) you want to find; here it is stored in the terms variable. Then convert them into Doc objects, which the PhraseMatcher requires.


In [88]:
terms = ['Galaxy Note', 'iPhone 11', 'iPhone XS', 'Google Pixel']
patterns = [nlp(text) for text in terms]

In [89]:
patterns

[Galaxy Note, iPhone 11, iPhone XS, Google Pixel]

In [90]:
matcher.add("UsmanTerminologyList", None, *patterns)

Then create a Doc from the text to search and use the PhraseMatcher to find where the terms occur in it.

In [91]:
text_doc = nlp("Glowing review overall, and some really interesting side-by-side "
               "photography tests pitting the iPhone 11 Pro against the "
               "Galaxy Note 10 Plus and last year’s iPhone XS and Google Pixel 3.")

In [92]:
matches = matcher(text_doc)

The result is a list of tuples, each holding the match ID and the start and end token positions of the matched phrase.

In [93]:
matches

[(8556725160488194001, 17, 19),
 (8556725160488194001, 22, 24),
 (8556725160488194001, 30, 32),
 (8556725160488194001, 33, 35)]

In [94]:
match_id, start, end = matches[1]

In [95]:
print(nlp.vocab.strings[match_id], text_doc[start:end])

UsmanTerminologyList Galaxy Note
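
To show what the start and end positions mean, here is a library-free sketch of case-insensitive phrase matching over a token list. `find_phrases` is a hypothetical helper and the text is split naively on whitespace. spaCy's PhraseMatcher uses its own tokenizer and a more efficient algorithm, so treat this only as an illustration of the idea.

```python
def find_phrases(tokens, phrases):
    """Return (phrase, start, end) for every case-insensitive occurrence.

    start and end are token indices, with end exclusive, mirroring the
    (match_id, start, end) layout of the matches above.
    """
    lowered = [tok.lower() for tok in tokens]
    found = []
    for phrase in phrases:
        words = phrase.lower().split()
        # Slide a window of len(words) tokens across the text.
        for start in range(len(lowered) - len(words) + 1):
            end = start + len(words)
            if lowered[start:end] == words:
                found.append((phrase, start, end))
    # Order results by where they appear in the text.
    return sorted(found, key=lambda match: match[1])


tokens = "pitting the iphone 11 Pro against the Galaxy Note 10 Plus".split()
terms = ["Galaxy Note", "iPhone 11"]
results = find_phrases(tokens, terms)
for phrase, start, end in results:
    print(phrase, start, end, " ".join(tokens[start:end]))
```

Slicing the tokens with `tokens[start:end]` recovers the matched words, just as `text_doc[start:end]` does for a spaCy Doc.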
