<a href="https://colab.research.google.com/github/haddjyb2k/-APPLIED-DATA-SECIENCE-CAPSTONE/blob/main/Text_Analysis_with_Spacy_to_Master_NLP_techniques.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>


## Text Analysis with spaCy to Master NLP Techniques

In [1]:
!pip install -U pip setuptools wheel
!pip install -U spacy

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/

In [2]:
import spacy

In [3]:
import en_core_web_sm

In [4]:
nlp = spacy.load('en_core_web_sm')

In [5]:
doc = nlp(u'Microsoft is trying to buy France based startup at $7 Million')
for token in doc:
  print(token.text)

Microsoft
is
trying
to
buy
France
based
startup
at
$
7
Million


In [6]:
for token in doc:
  print(token, token.pos_)

Microsoft PROPN
is AUX
trying VERB
to PART
buy VERB
France PROPN
based VERB
startup NOUN
at ADP
$ SYM
7 NUM
Million NUM


#### Tokenization
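
The token list printed earlier shows that `$7` is split into the two tokens `$` and `7`, while ordinary words stay whole. As a rough illustration only, the sketch below reproduces that split with a regular expression. It is not spaCy's tokenizer, which also applies language-specific rules and exceptions, and `simple_tokenize` is a hypothetical helper name introduced here.

```python
import re

def simple_tokenize(text):
    """Split text into runs of word characters and single symbol characters."""
    # \w+ grabs whole words and numbers; [^\w\s] grabs one symbol at a time.
    return re.findall(r"\w+|[^\w\s]", text)

tokens = simple_tokenize("Microsoft is trying to buy France based startup at $7 Million")
for token in tokens:
    print(token)
```

For this particular sentence the regular expression yields the same twelve tokens as the spaCy output above; on text with contractions, URLs, or abbreviations the two would be expected to differ.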

### Chunking

In [7]:
# No extra installation is needed for chunking: `noun_chunks` is part of the spaCy
# pipeline loaded above as `nlp`. The unrelated PyPI package named `nlp` is not required.

In [8]:
doc4 = nlp(u'tesla is a automobile based endorsed with high tech work for implimenting the electric cars')
for chunks in doc4.noun_chunks:
  print(chunks)

tesla
a automobile
high tech work
the electric cars
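
spaCy derives `noun_chunks` from its dependency parse. To give a feel for what a chunk is without that machinery, here is a much cruder sketch that groups consecutive tokens whose coarse POS tag is nominal. The (token, tag) pairs are copied from the POS output of the Microsoft sentence earlier in this notebook; `NOMINAL_TAGS` and `naive_chunks` are hypothetical names, and this is not how spaCy computes its chunks.

```python
# (token, coarse POS) pairs copied from the earlier `token.pos_` output.
tagged = [
    ("Microsoft", "PROPN"), ("is", "AUX"), ("trying", "VERB"), ("to", "PART"),
    ("buy", "VERB"), ("France", "PROPN"), ("based", "VERB"), ("startup", "NOUN"),
    ("at", "ADP"), ("$", "SYM"), ("7", "NUM"), ("Million", "NUM"),
]

# Assumption: treat these tags as "can be part of a noun phrase".
NOMINAL_TAGS = {"DET", "ADJ", "NOUN", "PROPN"}

def naive_chunks(tagged_tokens):
    """Group maximal runs of nominal tokens into space-joined chunks."""
    chunks, current = [], []
    for text, pos in tagged_tokens:
        if pos in NOMINAL_TAGS:
            current.append(text)
        elif current:
            # A non-nominal token closes the chunk collected so far.
            chunks.append(" ".join(current))
            current = []
    if current:
        chunks.append(" ".join(current))
    return chunks

print(naive_chunks(tagged))
```

A tag-run heuristic like this cannot tell where one noun phrase ends and the next begins when two sit side by side, which is one reason a parser-based approach is preferable in practice.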


### How to Visualize Tokenized Data

Visualize the syntactic dependencies between the tokens of a document.

In [9]:
from spacy import displacy
doc = nlp(u'Tesla to build solar electric startup in gujrat for $70 million')
displacy.render(doc, style='dep', jupyter=True, options = {'distance':100})

### Phrase Matching

In [11]:
from spacy.matcher import Matcher
matcher = Matcher(nlp.vocab)
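# Pattern 1: the single token "solarpower" (case-insensitive)
pattern1 = [{'LOWER': 'solarpower'}]
# Pattern 2: the two tokens "solar" and "power"
pattern2 = [{'LOWER': 'solar'}, {'LOWER': 'power'}]
# Pattern 3: "solar", any punctuation token, then "power" (e.g. "solar-power")
pattern3 = [{'LOWER': 'solar'}, {'IS_PUNCT': True}, {'LOWER': 'power'}]
# spaCy v3 signature: add(key, list_of_patterns)
matcher.add('SolarPower', [pattern1, pattern2, pattern3])
# spaCy v2 signature, kept for reference:
# matcher.add('SolarPower', None, pattern1, pattern2, pattern3)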
doc = nlp(u'The Solar Power industry continues to grow as demand for solarpower increases. solar-power operated products are popularity')
found_matches = matcher(doc)
print(found_matches)

[(8656102463236116519, 1, 3), (8656102463236116519, 10, 11), (8656102463236116519, 13, 16)]


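The cell above imports spaCy's `Matcher`, initializes it with the pipeline vocabulary, and registers three patterns under the single name `SolarPower`. Printing the matches gives one tuple per match: the match ID (a hash of the pattern name), followed by the start and end token positions of the matched span, where the end position is exclusive. Now let's print each match together with the name its ID resolves to and the matched text.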

In [12]:
for match_id, start, end in found_matches:
  string_id = nlp.vocab.strings[match_id]
  span = doc[start:end]
  print(match_id, string_id, start, end, span.text) 

8656102463236116519 SolarPower 1 3 Solar Power
8656102463236116519 SolarPower 10 11 solarpower
8656102463236116519 SolarPower 13 16 solar-power
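
To make the idea of token patterns more concrete, the following stdlib-only sketch slides each pattern over a token list and checks one token per pattern entry. It handles only the `LOWER` and `IS_PUNCT` keys used above, relies on a simple regex tokenizer rather than spaCy's, and is an illustration of the concept, not spaCy's implementation. The helper names `tokenize`, `token_matches`, and `find_matches` are hypothetical.

```python
import re
import string

def tokenize(text):
    """Rough stand-in for a tokenizer: words, or single symbol characters."""
    return re.findall(r"\w+|[^\w\s]", text)

def token_matches(token, spec):
    """Check one token against one pattern entry (LOWER and IS_PUNCT only)."""
    if "LOWER" in spec and token.lower() != spec["LOWER"]:
        return False
    if "IS_PUNCT" in spec:
        is_punct = all(ch in string.punctuation for ch in token)
        if is_punct != spec["IS_PUNCT"]:
            return False
    return True

def find_matches(tokens, patterns):
    """Return (start, end) token spans, end exclusive, where any pattern fits."""
    matches = []
    for start in range(len(tokens)):
        for pattern in patterns:
            window = tokens[start:start + len(pattern)]
            if len(window) == len(pattern) and all(
                token_matches(tok, spec) for tok, spec in zip(window, pattern)
            ):
                matches.append((start, start + len(pattern)))
    return matches

patterns = [
    [{"LOWER": "solarpower"}],
    [{"LOWER": "solar"}, {"LOWER": "power"}],
    [{"LOWER": "solar"}, {"IS_PUNCT": True}, {"LOWER": "power"}],
]
text = ("The Solar Power industry continues to grow as demand for solarpower "
        "increases. solar-power operated products are popularity")
tokens = tokenize(text)
matches = find_matches(tokens, patterns)
for start, end in matches:
    print(start, end, tokens[start:end])
```

On this sentence the sketch finds the same three spans as the notebook output, (1, 3), (10, 11), and (13, 16), which shows that the start and end values are token indices rather than character offsets.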


## Part-of-Speech (POS) Tagging

In English grammar, a part of speech describes the function of a word and how it is used in a sentence. Common parts of speech include nouns, pronouns, verbs, adjectives, and adverbs.

POS tagging is the automatic assignment of a POS tag to every token in a document. spaCy exposes two levels of tags. Coarse-grained tags (`token.pos_`) cover the broad categories such as noun, verb, and adjective. Fine-grained tags (`token.tag_`) carry additional detail, such as plural noun, past or present tense, or superlative adjective.

Let's try POS tagging in practice: we define a document, then print each token's coarse-grained tag, its fine-grained tag, and a description of the fine-grained tag.

In [13]:
doc = nlp(u'The quick brown fox, snatch the piece of cube from mouth of black crow')
for token in doc:
  print(f"{token.text:{10}} {token.pos_:{10}} {token.tag_:{10}} {spacy.explain(token.tag_)}")

The        DET        DT         determiner
quick      ADJ        JJ         adjective (English), other noun-modifier (Chinese)
brown      ADJ        JJ         adjective (English), other noun-modifier (Chinese)
fox        NOUN       NN         noun, singular or mass
,          PUNCT      ,          punctuation mark, comma
snatch     VERB       VB         verb, base form
the        DET        DT         determiner
piece      NOUN       NN         noun, singular or mass
of         ADP        IN         conjunction, subordinating or preposition
cube       NOUN       NN         noun, singular or mass
from       ADP        IN         conjunction, subordinating or preposition
mouth      NOUN       NN         noun, singular or mass
of         ADP        IN         conjunction, subordinating or preposition
black      ADJ        JJ         adjective (English), other noun-modifier (Chinese)
crow       NOUN       NN         noun, singular or mass


spaCy assigns a POS tag to every token, and `spacy.explain` returns a readable description of each tag. We can also count how many tokens carry each POS tag: `count_by` returns a dictionary that maps each POS ID to its count, and the vocabulary lets us look up which POS a given ID stands for.

In [14]:
pos_counts = doc.count_by(spacy.attrs.POS)
print(pos_counts)
print(doc.vocab[92].text) #check which POS

{90: 2, 84: 3, 92: 5, 97: 1, 100: 1, 85: 3}
NOUN
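
The integer keys in that dictionary are hard to read. As a cross-check that needs no spaCy at all, the sketch below counts the coarse tags by name, using the (token, tag) pairs copied from the tagging output above. The variable names `tagged` and `pos_counts_by_name` are introduced here for illustration.

```python
from collections import Counter

# (token, coarse POS) pairs copied from the tagging output above.
tagged = [
    ("The", "DET"), ("quick", "ADJ"), ("brown", "ADJ"), ("fox", "NOUN"),
    (",", "PUNCT"), ("snatch", "VERB"), ("the", "DET"), ("piece", "NOUN"),
    ("of", "ADP"), ("cube", "NOUN"), ("from", "ADP"), ("mouth", "NOUN"),
    ("of", "ADP"), ("black", "ADJ"), ("crow", "NOUN"),
]

# Count how often each coarse tag occurs, most frequent first.
pos_counts_by_name = Counter(pos for _, pos in tagged)
for pos, count in pos_counts_by_name.most_common():
    print(f"{pos:{10}} {count}")
```

The five `NOUN` tokens agree with the entry `92: 5` in the notebook output, since ID 92 resolves to `NOUN`. Inside the notebook, the same `doc.vocab[key].text` lookup could presumably be applied to every key of `pos_counts` to get the names directly.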


### Visualizing Part of Speech 

In [15]:
options = {'distance': 110, 'compact': True, 'color': '#F20835', 'bg': '#ADD8E6', 'font': 'arial'}
displacy.render(doc, style='dep', jupyter=True, options=options)


### Named Entity Recognition(NER)

Entities are words or groups of words that refer to real-world things of a particular kind, such as a country, state, organization, or person. spaCy can identify the entities in a text, label each one, and explain what each label means. Let's try this out.


In [16]:
doc3 = nlp(u"Ambani good to go at Gujrat to start a agro based industry in jio Mart for $70 million")
for entity in doc3.ents:
  print(entity)
  print(entity.label_)
  print(spacy.explain(entity.label_))
  print()  # blank line between entities

Ambani
ORG
Companies, agencies, institutions, etc.

Gujrat
ORG
Companies, agencies, institutions, etc.

jio Mart
ORG
Companies, agencies, institutions, etc.

$70 million
MONEY
Monetary values, including unit

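
Note that the small model labels "Ambani" (a person) and "Gujrat" (a place) as `ORG` here, so entity labels are worth checking rather than trusting blindly. One simple way to review them is to group the entity texts by label. The sketch below does this with plain Python on the (text, label) pairs copied from the output above; `entities` and `by_label` are names introduced for illustration.

```python
from collections import defaultdict

# (entity text, label) pairs copied from the NER output above.
entities = [
    ("Ambani", "ORG"),
    ("Gujrat", "ORG"),
    ("jio Mart", "ORG"),
    ("$70 million", "MONEY"),
]

# Collect every entity text under its label, keeping document order.
by_label = defaultdict(list)
for text, label in entities:
    by_label[label].append(text)

for label, texts in by_label.items():
    print(f"{label}: {texts}")
```

In the notebook, the same grouping could be built from `doc3.ents` by appending `entity.text` under `entity.label_`.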


### Visualizing NER

In [23]:
doc1 = nlp(u'Over the last quarter Amazon has raised its profit from 20 thousand deliveries for a profit of $7 Million. '
           u'By contrast JBL only sold 10 thousand Walkman Product Bluetooth speakers.')

In [24]:
# The earlier `options` dict holds dependency-style settings, which do not apply to style='ent', so it is not passed here.
displacy.render(doc1, style='ent', jupyter=True)
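
Settings such as `distance`, `compact`, `bg`, and `font` belong to the dependency visualizer. For `style='ent'`, displaCy's documented options are `ents` (which labels to show) and `colors` (a label-to-color mapping). A minimal sketch of such a dictionary follows; the name `ent_options`, the choice of labels, and the reuse of the two hex colors from the earlier cell are all assumptions made for illustration.

```python
# Options for the entity visualizer: restrict the display to two labels
# and give each of them a custom highlight color.
ent_options = {
    "ents": ["ORG", "MONEY"],
    "colors": {"ORG": "#ADD8E6", "MONEY": "#F20835"},
}

# In the notebook this would be passed as:
# displacy.render(doc1, style='ent', jupyter=True, options=ent_options)
print(ent_options)
```

Labels left out of `ents` are simply not highlighted, which can make a busy paragraph easier to read.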