<H1>Importing spaCy</H1>
<H4>NLP (Natural Language Processing) is what spaCy does</H4>
<H4>Ultimately, it turns human language into something machines can process</H4>
<H4>Linguistics is the Science of Language</h4>


<H1>Loading a spaCy Model (Language), e.g. 'en_core_web_lg'</H1>

In [138]:
import spacy
from spacy.matcher import Matcher, PhraseMatcher
from spacy import displacy

nlp = spacy.load('en_core_web_lg')


<H1>Tokenization</H1>
<h2>and derived properties of words</h2>
<h2>The computer understands each word as we would</h2>

In [None]:
def word_tokenization(text):
    # Run the full spaCy pipeline and return the resulting Doc
    return nlp(text)

doc = word_tokenization('It costs £1 million pounds.')
print("Index: ", [token.i for token in doc])
# token.orth is the integer hash of the text; token.text is the string itself
print("Orth: ", [token.orth for token in doc])
print("Text: ", [token.text for token in doc])
print("is_alpha: ", [token.is_alpha for token in doc])
print("is_punct: ", [token.is_punct for token in doc])
print("like_num: ", [token.like_num for token in doc])

<H1>spaCy Labels Explained</H1>

Charlie said to me, "Wow, spaCy can really explain its labels? But how?"


In [None]:
def spacy_explain(label):
	print(spacy.explain(label))


spacy_explain('ADJ')
spacy_explain('nsubj')
spacy_explain('GPE')

<H1>Basic Linguistic Features</H1>

<h3>POS --> Part of speech</h3>
<h3>Dependency label --> The syntactic relation connecting a child token to its head</h3>
<h3>Head --> The token that the current token attaches to in the parse tree</h3>
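The child-to-head relation can be pictured as a list of head indices, one per token. The sketch below is plain Python, not spaCy output: the `heads` and `deps` lists are a hand-written parse chosen for illustration, and a real model may attach some tokens differently. It follows the convention spaCy also uses, where the root token is its own head.

```python
# Hand-written parse for illustration only; a real model's output may differ.
tokens = ["I", "had", "a", "great", "time", "at", "the", "beach", "."]
# heads[i] is the index of the token that token i attaches to.
# The root ("had") points at itself.
heads = [1, 1, 4, 4, 1, 1, 7, 5, 1]
deps = ["nsubj", "ROOT", "det", "amod", "dobj", "prep", "det", "pobj", "punct"]


def children_of(head_index):
    """Return the texts of all tokens whose head is the token at head_index."""
    return [
        tokens[i]
        for i, head in enumerate(heads)
        if head == head_index and i != head_index
    ]


# The root is the only token that is its own head.
root_index = next(i for i, head in enumerate(heads) if i == head)

for i, text in enumerate(tokens):
    print(text, "--", deps[i], "-->", tokens[heads[i]])

print("Root:", tokens[root_index])
print("Children of the root:", children_of(root_index))
```

Reading the tree from heads to children like this is the same direction the arrows take in the displaCy dependency rendering.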

In [None]:
doc = word_tokenization('I had a great time at the beach.') # 'At the beach, I had a great time', 'I had, at the beach, a great time'

for token in doc:
    print('POS', token.text, '-->', token.pos_)
    print('Dependency', token.text, '-->', token.dep_)
    print('Head', token.text, '-->', token.head.text)


displacy.render(doc, style="dep")

In [None]:
spacy_explain('prep')
spacy_explain('dobj')
spacy_explain('pobj')

<H1>Predicting Named Entities</H1>
<H3>spaCy's pre-trained entities</H3>

In [None]:
doc = word_tokenization('"Apple is looking at buying U.K. startup for £1 billion," Charlie said yesterday.')

for ent in doc.ents:
    print(ent.text, '-->', ent.label_)


displacy.render(doc, style="ent")

In [None]:
spacy_explain('ORG')
spacy_explain('GPE')

<H1>LEMMA attribute</H1>
<H2>The goal of both stemming and lemmatization is to reduce inflectional forms to a common base form</H2>
<H2>Lemmatization --> Finds the dictionary form (lemma) of a word: [ am, are, is ] --> be</H2>
<H2>Stemming --> Strips affixes by rule, in this case the suffix: [ 'winning' ] --> [ 'winn' ] (not a real word)</H2>
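The difference can be sketched in a few lines of plain Python. Neither function below is spaCy's implementation: `LEMMA_TABLE`, `toy_lemmatize` and `toy_stem` are hypothetical names, the lookup table is hand-written, and the suffix list is deliberately crude so that it reproduces the 'winning' to 'winn' case.

```python
# Toy illustration only: neither function is spaCy's implementation.

# Hypothetical lookup table mapping inflected forms to their dictionary form.
LEMMA_TABLE = {"am": "be", "are": "be", "is": "be", "winning": "win"}


def toy_lemmatize(word):
    """Look the word up; fall back to the lowercased word when it is unknown."""
    lowered = word.lower()
    return LEMMA_TABLE.get(lowered, lowered)


def toy_stem(word, suffixes=("ing", "ed", "s")):
    """Strip the first matching suffix without checking the result is a word."""
    lowered = word.lower()
    for suffix in suffixes:
        # Keep at least three characters so short words are left alone.
        if lowered.endswith(suffix) and len(lowered) - len(suffix) >= 3:
            return lowered[: -len(suffix)]
    return lowered


for word in ["am", "are", "is", "winning"]:
    print(word, "| lemma:", toy_lemmatize(word), "| stem:", toy_stem(word))
```

The stemmer never consults a vocabulary, which is why it can return a string like 'winn', while the lemmatizer can only return forms it knows about.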

In [None]:
def lemmatization():
    doc = word_tokenization('am are is')
    print("Text -->", [token.text for token in doc])
    print("Lemma -->", [token.lemma_ for token in doc])

lemmatization()

<H1>spaCy Matcher Class</H1>
<H2>Matcher --> Uses spaCy's linguistic features and lexical properties to find words and phrases in a Doc</H2>

In [119]:
matcher = Matcher(nlp.vocab)

<H1>Creating Patterns for the Matcher</H1>

In [120]:
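def sportsPatterns(sport):
    # A number, an optional sport name, then "world cup" and a punctuation mark
    return [
        {'IS_DIGIT': True},
        {'LOWER': sport, 'OP': '?'},
        {'LOWER': 'world'},
        {'LOWER': 'cup'},
        {'IS_PUNCT': True},
    ]


def emotionPatterns(emotion):
    # Any verb whose lemma is the given emotion, e.g. "loved" for "love"
    return [
        {'LEMMA': emotion, 'POS': 'VERB'},
    ]


def gadgetPatterns(gadget, extensionName):
    # The gadget name, optionally followed by an extension such as a model name
    return [
        {'LOWER': gadget},
        {'LOWER': extensionName, 'OP': '?'},
    ]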
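To see what the optional entry (`'OP': '?'`) does, here is a toy matcher in plain Python. It is not spaCy's `Matcher` and makes no claim about how spaCy works internally: `token_matches`, `match_at` and `find_matches` are hypothetical helpers, tokens are plain strings, and only `LOWER`, `IS_DIGIT`, `IS_PUNCT` and `'OP': '?'` are handled.

```python
import string

# Toy sketch of an optional ('OP': '?') pattern entry. NOT spaCy's Matcher.


def token_matches(token, spec):
    """Check one token string against one pattern dictionary."""
    if "LOWER" in spec and token.lower() != spec["LOWER"]:
        return False
    if "IS_DIGIT" in spec and token.isdigit() != spec["IS_DIGIT"]:
        return False
    if "IS_PUNCT" in spec:
        is_punct = all(ch in string.punctuation for ch in token)
        if is_punct != spec["IS_PUNCT"]:
            return False
    return True


def match_at(tokens, start, pattern):
    """Return every end index where the pattern can finish when begun at start."""
    if not pattern:
        return {start}
    spec, rest = pattern[0], pattern[1:]
    ends = set()
    if spec.get("OP") == "?":
        # Optional entry: try skipping it entirely.
        ends |= match_at(tokens, start, rest)
    if start < len(tokens) and token_matches(tokens[start], spec):
        # The entry consumes one token; continue with the rest of the pattern.
        ends |= match_at(tokens, start + 1, rest)
    return ends


def find_matches(tokens, pattern):
    """Collect non-empty (start, end) pairs for every place the pattern matches."""
    return sorted(
        (start, end)
        for start in range(len(tokens))
        for end in match_at(tokens, start, pattern)
        if end > start
    )


pattern = [
    {"IS_DIGIT": True},
    {"LOWER": "fifa", "OP": "?"},
    {"LOWER": "world"},
    {"LOWER": "cup"},
    {"IS_PUNCT": True},
]
tokens_with = ["2018", "FIFA", "world", "cup", ":"]
tokens_without = ["2019", "world", "cup", "!"]

print(find_matches(tokens_with, pattern))
print(find_matches(tokens_without, pattern))
```

Both token lists match because the sport name may be present or absent; every other entry in the pattern is mandatory.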

<H1>Defining a Few Patterns per Category</H1>

In [121]:
fifa = sportsPatterns(sport='fifa')
rugby = sportsPatterns(sport='rugby')

love = emotionPatterns(emotion='love')
hate = emotionPatterns(emotion='hate')

phone = gadgetPatterns(gadget='iphone', extensionName='x')
computer = gadgetPatterns(gadget='mac', extensionName='pro')

<H1>Adding Patterns to the Matcher and Setting Up a Way to Show the Matches</H1>

In [122]:
def add_matchers(matcher):
    # Each call registers a list of patterns under one match ID
    matcher.add("World_Cups", [fifa, rugby])
    matcher.add("Emotion", [love, hate])
    matcher.add("Gadgets", [phone, computer])


add_matchers(matcher)

def showMatcher(doc):
    matches = matcher(doc)
    for match_id, start, end in matches:
        string_id = nlp.vocab.strings[match_id]  # Get the string form of the match ID
        span = doc[start:end]  # The matched span
        print(
            f"""
match_id: {match_id},
string_id: {string_id},
start: {start},
end: {end},
TEXT: {span.text}
        """)

<H1>Showing Matchers In Action</H1>

In [None]:
def showRepresentationOfMatchers():
	doc = word_tokenization('Upcoming Mac Pro, has leaked the release date')
	doc2 = word_tokenization('2018 FIFA world cup: France won!')
	doc3 = word_tokenization('I loved dogs now I love cats more')
	doc4 = word_tokenization('I hate tomatoes')
	showMatcher(doc)
	showMatcher(doc2)
	showMatcher(doc3)
	showMatcher(doc4)


showRepresentationOfMatchers()

<H1>Efficient Phrase Matcher</H1>

In [None]:
# Use a separate name so the rule-based `matcher` used by showMatcher is not overwritten
phrase_matcher = PhraseMatcher(nlp.vocab)
pattern = word_tokenization('Golden Retriever')
pattern2 = word_tokenization('golden retriever')
pattern3 = word_tokenization('lion')
pattern4 = word_tokenization('Tiger')
phrase_matcher.add('DOG', [pattern, pattern2])
phrase_matcher.add('CAT', [pattern3, pattern4])

doc = nlp("Me and my Golden Retriever, saw a lion at the Zoo")


for match_id, start, end in phrase_matcher(doc):
    string_id = nlp.vocab.strings[match_id]
    span = doc[start:end]
    print('Matched phrase: ', span.text)
    print('Matcher: ', string_id)

<H1>Basic Similarity and Vectors</H1>
<H2>Similarity ---> By default, spaCy calculates cosine similarity</H2>
<H2>Vectors/Word embeddings ---> Multi-dimensional numeric representations of words, which spaCy compares to measure similarity</H2>
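Cosine similarity, the default named above, is the dot product of two vectors divided by the product of their lengths. The sketch below computes it in plain Python on made-up three-dimensional vectors; the numbers are illustrative assumptions, not real embeddings, and returning 0.0 for a zero vector is a choice made for this sketch.

```python
import math


def cosine_similarity(a, b):
    """Return the cosine of the angle between two equally sized vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        # A zero vector has no direction; this sketch reports 0.0 in that case.
        return 0.0
    return dot / (norm_a * norm_b)


# Made-up vectors for illustration only.
pizza = [1.0, 2.0, 0.0]
pasta = [2.0, 4.0, 0.0]  # points the same way as pizza
soap = [0.0, 0.0, 3.0]   # at a right angle to pizza

print("pizza vs pasta:", round(cosine_similarity(pizza, pasta), 2))
print("pizza vs soap:", round(cosine_similarity(pizza, soap), 2))
```

Only the direction of the vectors matters: `pasta` is twice as long as `pizza` but still scores as fully similar.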

In [125]:
# Two documents
doc1 = word_tokenization('I like fast food')
doc2 = word_tokenization('I love pizza')
print(f"{round(doc1.similarity(doc2) * 100, 2)}%")

# Two tokens
doc = word_tokenization('I like pizza and pasta')
print(f"{round(doc[2].similarity(doc[4]) * 100, 2)}%")

# Document and token
doc3 = word_tokenization('I love pizza')
token = word_tokenization('soap')[0]
print(f"{round(doc3.similarity(token) * 100, 2)}%")

# Span and document
span = word_tokenization('I like pizza and pasta')[2:5]
document = word_tokenization("McDonald's sells burgers")

print(f"{round(span.similarity(document) * 100, 2)}%")

# Vector length and the first 20 components of each document vector
print(f"Length for doc1 vectors is {len(doc1.vector)}")
print(f"Length for doc2 vectors is {len(doc2.vector)}")
print(f"Vectors for doc1 {doc1.vector[0:20]}")
print(f"Vectors for doc2 {doc2.vector[0:20]}")
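By default, a spaCy `Doc` vector is the average of its token vectors, which is why a whole document has the same vector length as a single token. A minimal plain-Python sketch of that averaging follows; `average_vector` is a hypothetical helper and the four-dimensional vectors are made up, real model vectors being much longer.

```python
def average_vector(token_vectors):
    """Return the element-wise mean of a list of equally sized vectors."""
    count = len(token_vectors)
    return [sum(values) / count for values in zip(*token_vectors)]


# Made-up token vectors for illustration only.
toy_vectors = {
    "I": [0.0, 1.0, 0.0, 2.0],
    "love": [1.0, 3.0, 0.0, 0.0],
    "pizza": [2.0, 2.0, 3.0, 1.0],
}

doc_vector = average_vector(list(toy_vectors.values()))
print("Document vector:", doc_vector)
print("Same length as a token vector:", len(doc_vector) == len(toy_vectors["pizza"]))
```

Averaging discards word order, so two documents with the same words in a different order end up with the same vector under this scheme.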