<a href="https://colab.research.google.com/github/mralamdari/NLP-Some-Libraries-Practice/blob/main/NLP_Some_Libraries_Practice.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

#[spaCy](https://spacy.io/api)

In [1]:
import spacy

###Loading a Model

In [2]:
nlp = spacy.load('en_core_web_sm')

###Apply the model to a document (doc object)

In [3]:
doc = nlp('The First My spacy code, to learn this awsome library ')

In [4]:
doc

The First My spacy code, to learn this awsome library 

In [5]:
for token in doc:
  print(token)
  #print(token.text)

The
First
My
spacy
code
,
to
learn
this
awsome
library


In [6]:
# pos ===> Part of Speech (token.pos is the integer ID, token.pos_ the readable label)
for token in doc:
  print(token.text, token.pos, token.pos_)

The 90 DET
First 96 PROPN
My 90 DET
spacy 92 NOUN
code 92 NOUN
, 97 PUNCT
to 94 PART
learn 100 VERB
this 90 DET
awsome 84 ADJ
library 92 NOUN


In [7]:
# dep ===> Syntactic dependency
for token in doc:
  print(token.text, token.dep_)

The det
First amod
My poss
spacy compound
code ROOT
, punct
to aux
learn relcl
this det
awsome compound
library dobj


In [8]:
nlp.pipeline

[('tagger', <spacy.pipeline.pipes.Tagger at 0x7f25d7222650>),
 ('parser', <spacy.pipeline.pipes.DependencyParser at 0x7f25d6b30fa0>),
 ('ner', <spacy.pipeline.pipes.EntityRecognizer at 0x7f25d6b30de0>)]

In [9]:
nlp.pipe_names

['tagger', 'parser', 'ner']
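
The pipeline shown above is a list of `(name, component)` pairs that are applied to the document in order, after tokenization. The following is only a rough pure-Python sketch of that idea, not spaCy's implementation: the dict-based document and the `toy_tagger`, `toy_ner`, and `run_pipeline` names are hypothetical.

```python
# Toy pipeline: each component receives the document and returns it, possibly annotated.

def toy_tagger(doc):
    # Hypothetical rule: capitalised tokens are tagged PROPN, everything else X.
    doc["tags"] = ["PROPN" if tok[:1].isupper() else "X" for tok in doc["tokens"]]
    return doc


def toy_ner(doc):
    # In this toy, entity detection reads the tags, so it has to run after the tagger.
    doc["ents"] = [tok for tok, tag in zip(doc["tokens"], doc["tags"]) if tag == "PROPN"]
    return doc


pipeline = [("tagger", toy_tagger), ("ner", toy_ner)]


def run_pipeline(text):
    doc = {"tokens": text.split()}  # tokenization happens before the pipeline runs
    for _name, component in pipeline:
        doc = component(doc)
    return doc


pipe_names = [name for name, _ in pipeline]
result = run_pipeline("yesterday IBM bought an apple")
print(pipe_names)
print(result["ents"])
```

Keeping the components in a plain ordered list is what makes it possible to inspect them by name, as `nlp.pipe_names` does.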

In [10]:
doc2 = nlp(u"This is a ?? sentence    to practice. tokenization, It's Awsome")

In [11]:
for token in doc2:
  print(token.text, token.pos_, token.dep_)

This DET nsubj
is AUX ROOT
a DET attr
? PUNCT punct
? PUNCT punct
sentence NOUN npadvmod
    SPACE 
to PART aux
practice VERB ROOT
. PUNCT punct
tokenization NOUN npadvmod
, PUNCT punct
It PRON nsubj
's AUX ROOT
Awsome PROPN attr


In [12]:
for token in doc2:
  print(token.text, token.tag_)

This DT
is VBZ
a DT
? .
? .
sentence NN
    _SP
to TO
practice VB
. .
tokenization NN
, ,
It PRP
's VBZ
Awsome NNP


In [13]:
for token in doc2:
  print(token.text, token.lemma_)

This this
is be
a a
? ?
? ?
sentence sentence
       
to to
practice practice
. .
tokenization tokenization
, ,
It -PRON-
's be
Awsome Awsome


In [14]:
for token in doc2:
  print(token.text, token.shape_)

This Xxxx
is xx
a x
? ?
? ?
sentence xxxx
       
to xx
practice xxxx
. .
tokenization xxxx
, ,
It Xx
's 'x
Awsome Xxxxx
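
The shapes printed above follow a simple rule: uppercase letters become `X`, lowercase letters become `x`, digits become `d`, other characters are kept, and a run of the same shape character is cut off after four. A minimal pure-Python sketch of that rule, assuming the hypothetical helper name `word_shape` (this is not spaCy's own code):

```python
def word_shape(text):
    """Approximate the token.shape_ rule in plain Python."""
    shape = []
    last = ""
    run = 0
    for ch in text:
        if ch.isalpha():
            symbol = "X" if ch.isupper() else "x"
        elif ch.isdigit():
            symbol = "d"
        else:
            symbol = ch  # punctuation and other characters are kept unchanged
        if symbol == last:
            run += 1
        else:
            run = 1
            last = symbol
        if run <= 4:  # drop the fifth and later repeats of the same symbol
            shape.append(symbol)
    return "".join(shape)


for word in ["This", "sentence", "It", "'s", "Awsome"]:
    print(word, word_shape(word))
```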


In [15]:
for token in doc2:
  print(type(token))

<class 'spacy.tokens.token.Token'>
<class 'spacy.tokens.token.Token'>
<class 'spacy.tokens.token.Token'>
<class 'spacy.tokens.token.Token'>
<class 'spacy.tokens.token.Token'>
<class 'spacy.tokens.token.Token'>
<class 'spacy.tokens.token.Token'>
<class 'spacy.tokens.token.Token'>
<class 'spacy.tokens.token.Token'>
<class 'spacy.tokens.token.Token'>
<class 'spacy.tokens.token.Token'>
<class 'spacy.tokens.token.Token'>
<class 'spacy.tokens.token.Token'>
<class 'spacy.tokens.token.Token'>
<class 'spacy.tokens.token.Token'>


###Tokenization

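#####**Prefix**: characters at the beginning of a token, e.g. `$ ( "`
#####**Suffix**: characters at the end of a token, e.g. `km ) , . ! "`
#####**Infix**: characters in between, e.g. `- -- / ...`
#####**Exception**: special-case rule that splits a string into several tokens or prevents it from being split, e.g. `let's`, `U.S.`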
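
The four rule types listed above can be sketched as a loop that first splits on whitespace, then repeatedly checks for an exception, peels off a prefix, or peels off a suffix, and finally splits what is left on infixes. This is a heavily simplified illustration under stated assumptions: the rule tables `EXCEPTIONS`, `PREFIX`, `SUFFIX`, and `INFIX` are hypothetical and far smaller than spaCy's real ones.

```python
import re

# Hypothetical, tiny rule tables for illustration only.
EXCEPTIONS = {"L.A.": ["L.A."], "We're": ["We", "'re"]}
PREFIX = re.compile(r"""^["'($]""")
SUFFIX = re.compile(r"""["'),.!?]$""")
INFIX = re.compile(r"(?<=[A-Za-z])(-+)(?=[A-Za-z])")


def tokenize_chunk(chunk):
    """Split one whitespace-delimited chunk into tokens."""
    prefixes, middle, suffixes = [], [], []
    while chunk:
        if chunk in EXCEPTIONS:  # special cases win over punctuation rules
            middle = list(EXCEPTIONS[chunk])
            break
        if PREFIX.match(chunk):
            prefixes.append(chunk[0])
            chunk = chunk[1:]
        elif SUFFIX.search(chunk):
            suffixes.insert(0, chunk[-1])
            chunk = chunk[:-1]
        else:
            # The capturing group keeps the infix itself as a token.
            middle = [part for part in INFIX.split(chunk) if part]
            break
    return prefixes + middle + suffixes


def tokenize(text):
    tokens = []
    for chunk in text.split():
        tokens.extend(tokenize_chunk(chunk))
    return tokens


print(tokenize("'We're moving to L.A.!, are you coming?"))
```

Checking exceptions inside the loop, rather than once up front, is what lets `L.A.!,` shed its trailing punctuation and still be recognised as `L.A.`.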


In [16]:
my_str="'We\'re moving to L.A.!, are you coming?"

In [17]:
doc = nlp(my_str)

In [18]:
for token in doc:
  print(token.text)

'
We
're
moving
to
L.A.
!
,
are
you
coming
?


In [19]:
doc2 = nlp(u'Apple helped me create this http://popo@gmail.com gmail!, In L.A. with my freind in Japan so, I can only give you $153.61')

In [20]:
for token in doc2:
  print(token.text)

Apple
helped
me
create
this
http://popo@gmail.com
gmail
!
,
In
L.A.
with
my
freind
in
Japan
so
,
I
can
only
give
you
$
153.61


In [21]:
for entity in doc2.ents:
  print(entity, entity.label_)
  print(str(spacy.explain(entity.label_)))
  print('\n')

Apple ORG
Companies, agencies, institutions, etc.


L.A. GPE
Countries, cities, states


Japan GPE
Countries, cities, states


153.61 MONEY
Monetary values, including unit




In [22]:
doc3 = nlp(u'a beautiful butterfly can go beyonds eyes so esi, be carefull. yesterday IBM bought an apple for $5.5 millions')

In [23]:
for chunk in doc3.noun_chunks:
  print(chunk)

a beautiful butterfly
beyonds eyes
IBM
an apple
$5.5 millions


In [24]:
spacy.displacy.render(doc3, style='dep', jupyter=True, options={'distance':110})
# spacy.displacy.serve(doc3, style='dep', jupyter=False, options={'distance':110})

In [25]:
spacy.displacy.render(doc3, style='ent', jupyter=True)
# spacy.displacy.serve(doc3, style='ent', jupyter=False)

###Stemming

######A crude method for cataloging related words

In [26]:
import nltk
from nltk.stem.porter import PorterStemmer

In [27]:
p_stemmer = PorterStemmer()

In [28]:
words = ['runner', 'run', 'ran', 'runs', 'easy', 'easier', 'easiest']

In [29]:
for word in words:
  print(f'{word} ----> {p_stemmer.stem(word)}')

runner ----> runner
run ----> run
ran ----> ran
runs ----> run
easy ----> easi
easier ----> easier
easiest ----> easiest


In [30]:
from nltk.stem.snowball import SnowballStemmer

In [31]:
s_stemmer = SnowballStemmer(language='english')

In [32]:
for word in words:
  print(f'{word} ----> {s_stemmer.stem(word)}')

runner ----> runner
run ----> run
ran ----> ran
runs ----> run
easy ----> easi
easier ----> easier
easiest ----> easiest
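
Outputs such as `easy ----> easi` make more sense once stemming is seen as blind suffix rewriting with no dictionary behind it. The toy function below is a sketch of that idea only; `toy_stem` and its four rules are assumptions for illustration and are not the Porter or Snowball algorithm.

```python
VOWELS = set("aeiou")


def toy_stem(word):
    """Apply a handful of suffix rules in order; no dictionary lookup is involved."""
    word = word.lower()
    if word.endswith("sses"):
        word = word[:-2]            # caresses -> caress
    elif word.endswith("ies"):
        word = word[:-2]            # ponies -> poni
    elif word.endswith("s") and not word.endswith("ss"):
        word = word[:-1]            # runs -> run
    # A final y becomes i when the rest of the word contains a vowel.
    if word.endswith("y") and any(ch in VOWELS for ch in word[:-1]):
        word = word[:-1] + "i"      # easy -> easi
    return word


for word in ["runner", "run", "ran", "runs", "easy"]:
    print(f"{word} ----> {toy_stem(word)}")
```

Because no rule relates `ran` to `run`, irregular forms stay untouched, which is the gap lemmatization is meant to close.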


###Lemmatization
######A more informative way of reducing words to their root form

In [33]:
doc1 = nlp(u'I am a runner and loved runners, so i think running is awsome, so run when you can')

In [34]:
for token in doc1:
  print(token.text, token.pos_, token.lemma, token.lemma_)

I PRON 561228191312463089 -PRON-
am AUX 10382539506755952630 be
a DET 11901859001352538922 a
runner NOUN 12640964157389618806 runner
and CCONJ 2283656566040971221 and
loved VERB 3702023516439754181 love
runners NOUN 12640964157389618806 runner
, PUNCT 2593208677638477497 ,
so CCONJ 9781598966686434415 so
i PRON 5097672513440128799 i
think VERB 16875814820671380748 think
running VERB 12767647472892411841 run
is AUX 10382539506755952630 be
awsome ADJ 3521391281120521496 awsome
, PUNCT 2593208677638477497 ,
so ADV 9781598966686434415 so
run VERB 12767647472892411841 run
when ADV 15807309897752499399 when
you PRON 561228191312463089 -PRON-
can VERB 6635067063807956629 can
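
The large integers above are hash IDs of the lemma strings: `token.lemma` is the ID and `token.lemma_` is the text it stands for, which is why `runner` and `runners` share one number. A minimal sketch of such a two-way store follows; `ToyStringStore` is hypothetical and uses `hashlib` for convenience, so its IDs will not equal the ones spaCy prints.

```python
import hashlib


class ToyStringStore:
    """Map strings to stable 64-bit integer IDs and back."""

    def __init__(self):
        self._by_hash = {}

    @staticmethod
    def hash_of(text):
        # Assumption: any stable 64-bit hash works for the illustration.
        digest = hashlib.sha256(text.encode("utf8")).digest()
        return int.from_bytes(digest[:8], "big")

    def add(self, text):
        key = self.hash_of(text)
        self._by_hash[key] = text
        return key

    def __getitem__(self, key):
        # Integer in, string out; string in, integer out.
        if isinstance(key, str):
            return self.hash_of(key)
        return self._by_hash[key]


store = ToyStringStore()
run_id = store.add("run")
print(run_id, store[run_id])
```

Storing the integer on each token and the string only once is what lets many tokens share a lemma cheaply.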


In [35]:
for token in doc1:
  print(token.text, token.lemma_)

I -PRON-
am be
a a
runner runner
and and
loved love
runners runner
, ,
so so
i i
think think
running run
is be
awsome awsome
, ,
so so
run run
when when
you -PRON-
can can
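
One reason lemmatization is more informative than stemming is that it can take the part of speech into account. The sketch below illustrates this with a lookup table keyed on the word and its tag; `LEMMA_TABLE` and `toy_lemma` are hypothetical, and spaCy's lemmatizer is considerably more elaborate than a single dictionary.

```python
# Hypothetical lookup table keyed on (lowercased word, coarse POS tag).
LEMMA_TABLE = {
    ("running", "VERB"): "run",
    ("running", "NOUN"): "running",
    ("runners", "NOUN"): "runner",
    ("loved", "VERB"): "love",
    ("am", "AUX"): "be",
    ("is", "AUX"): "be",
}


def toy_lemma(word, pos):
    """Return the table entry, or the lowercased word when nothing is known."""
    return LEMMA_TABLE.get((word.lower(), pos), word.lower())


for word, pos in [("running", "VERB"), ("running", "NOUN"), ("runners", "NOUN"), ("awsome", "ADJ")]:
    print(f"{word:{12}} {pos:{6}} {toy_lemma(word, pos)}")
```

An unknown word such as the misspelled `awsome` simply falls through unchanged, which matches what the output above shows for it.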


In [36]:
def show_lemmas(doc):
  for token in doc:
    print(f'{token.text:{12}} {token.pos_:{10}} {token.lemma:<{22}} {token.lemma_}')

In [37]:
show_lemmas(doc1)

I            PRON       561228191312463089     -PRON-
am           AUX        10382539506755952630   be
a            DET        11901859001352538922   a
runner       NOUN       12640964157389618806   runner
and          CCONJ      2283656566040971221    and
loved        VERB       3702023516439754181    love
runners      NOUN       12640964157389618806   runner
,            PUNCT      2593208677638477497    ,
so           CCONJ      9781598966686434415    so
i            PRON       5097672513440128799    i
think        VERB       16875814820671380748   think
running      VERB       12767647472892411841   run
is           AUX        10382539506755952630   be
awsome       ADJ        3521391281120521496    awsome
,            PUNCT      2593208677638477497    ,
so           ADV        9781598966686434415    so
run          VERB       12767647472892411841   run
when         ADV        15807309897752499399   when
you          PRON       561228191312463089     -PRON-
can          VERB       6635067063807956629    can

###Stop words

In [38]:
print(nlp.Defaults.stop_words)

{'may', 'get', 'n’t', 'third', 'please', 'six', 'her', 'they', 'have', 'everything', 'thereupon', 'again', 'because', 'been', 'using', 'whoever', 'indeed', 'hence', 'since', 'me', 'in', 'all', 'take', 'was', 'doing', 'moreover', 'whence', 'mostly', 'first', 'from', '‘ll', 'whereupon', 'several', 'our', 'the', 'where', "'ve", 'whenever', 'amongst', 'down', 'upon', 'see', 'less', 'keep', 'of', 'so', 'though', 'must', 'due', 'around', 'anyway', 'during', 'former', 'anyhow', 'together', 'if', 'ca', 'whither', 'except', 'other', 'behind', 'he', 'too', 'regarding', 'already', 'it', 'myself', 'below', 'your', 'fifteen', 'beforehand', 'their', 'some', 'three', 'nowhere', 'everyone', 're', 'various', 'really', 'under', 'why', 'off', 'you', 'then', 'however', 'next', 'him', 'not', 'while', 'how', 'am', 'became', 'forty', 'thereby', 'nor', 'whole', 'against', 'make', 'for', 'himself', 'never', 'made', 'whether', 'fifty', 'per', '’ve', 'even', 'sometimes', 'i', 'sometime', 'but', 'towards', 'neith

In [39]:
len(nlp.Defaults.stop_words)

326

In [40]:
nlp.vocab['is'].is_stop

True

######Add a Stop Word

In [41]:
s_word = 'btw'
nlp.Defaults.stop_words.add(s_word)
nlp.vocab[s_word].is_stop = True

In [42]:
nlp.vocab['btw'].is_stop

True

######Remove a stop word

In [43]:
s_word = 'beyond'
nlp.Defaults.stop_words.remove(s_word)
nlp.vocab[s_word].is_stop = False

In [44]:
nlp.vocab['beyond'].is_stop

False
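
Adding and removing stop words only pays off once the set is used to filter tokens. A small sketch of that step, assuming a hypothetical six-word `STOP_WORDS` set instead of spaCy's full list of 326:

```python
# Hypothetical miniature stop-word set for illustration.
STOP_WORDS = {"is", "a", "the", "so", "to", "beyond"}


def remove_stop_words(tokens, stop_words):
    """Keep only the tokens whose lowercase form is not a stop word."""
    return [tok for tok in tokens if tok.lower() not in stop_words]


custom = set(STOP_WORDS)   # copy, so the default set stays untouched
custom.add("btw")          # treat "btw" as a stop word
custom.discard("beyond")   # stop treating "beyond" as one

tokens = ["btw", "the", "view", "beyond", "the", "hill", "is", "amazing"]
filtered = remove_stop_words(tokens, custom)
print(filtered)
```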

###Matching and Vocabulary

In [45]:
matcher = spacy.matcher.Matcher(nlp.vocab)

In [46]:
#solarpower
pattern1 = [{'LOWER': 'solarpower'}]
# solar-power
pattern2 = [{'LOWER': 'solar'}, {'IS_PUNCT': True}, {'LOWER': 'power'}]
# solar power
pattern3 = [{'LOWER': 'solar'}, {'LOWER': 'power'}]

In [47]:
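# spaCy v2 signature: add(key, on_match, *patterns); None means no callback.
# spaCy v3 expects: matcher.add('SolarPower', [pattern1, pattern2, pattern3])
matcher.add('SolarPower', None, pattern1, pattern2, pattern3)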

In [48]:
doc = nlp(u'The Solar Power industry uses solarpower, so Solar-Power is amazing')

In [49]:
matches = matcher(doc)

In [50]:
matches

[(8656102463236116519, 1, 3),
 (8656102463236116519, 5, 6),
 (8656102463236116519, 8, 11)]

In [51]:
for match_id, start, end in matches:
  # match_id is the hash of the rule name; look it up to get 'SolarPower' back
  print(match_id, nlp.vocab.strings[match_id], doc[start: end].text)

8656102463236116519 SolarPower Solar Power
8656102463236116519 SolarPower solarpower
8656102463236116519 SolarPower Solar-Power


In [52]:
matcher.remove('SolarPower')

In [53]:
pattern1 = [{'LOWER': 'solarpower'}]
pattern2 = [{'LOWER': 'solar'}, {'IS_PUNCT': True, 'OP': '*'}, {'LOWER': 'power'}]

In [54]:
matcher.add('SolarPower', None, pattern1, pattern2)

In [55]:
doc2 = nlp(u'Solar----Power is Solar POWER')

In [56]:
matcher(doc2)

[(8656102463236116519, 2, 4)]

In [57]:
doc2 = nlp(u'Solar---Power is Solar POWER')

In [58]:
matcher(doc2)

[(8656102463236116519, 0, 3), (8656102463236116519, 4, 6)]
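
The patterns above are lists of per-token conditions, where `'OP': '*'` lets a condition match zero or more tokens. To make that mechanism concrete, here is a rough pure-Python sketch that supports only `LOWER`, `IS_PUNCT`, and `OP: '*'`. The helper names are hypothetical, the tokens are supplied by hand, and it keeps just the longest match per start position, which is a simplification of what spaCy's `Matcher` does.

```python
import string


def token_matches(token, spec):
    """Check one token against one pattern entry."""
    if "LOWER" in spec and token.lower() != spec["LOWER"]:
        return False
    if "IS_PUNCT" in spec:
        is_punct = bool(token) and all(ch in string.punctuation for ch in token)
        if is_punct != spec["IS_PUNCT"]:
            return False
    return True


def match_at(tokens, start, pattern):
    """Return the end index of the longest match beginning at start, or None."""
    if not pattern:
        return start
    spec, rest = pattern[0], pattern[1:]
    here = start < len(tokens) and token_matches(tokens[start], spec)
    if spec.get("OP") == "*":
        if here:  # try to consume one more token with the same entry
            end = match_at(tokens, start + 1, pattern)
            if end is not None:
                return end
        return match_at(tokens, start, rest)  # zero further repetitions
    return match_at(tokens, start + 1, rest) if here else None


def find_matches(tokens, patterns):
    spans = []
    for start in range(len(tokens)):
        ends = [match_at(tokens, start, pattern) for pattern in patterns]
        ends = [end for end in ends if end is not None and end > start]
        if ends:
            spans.append((start, max(ends)))
    return spans


patterns = [
    [{"LOWER": "solarpower"}],
    [{"LOWER": "solar"}, {"IS_PUNCT": True, "OP": "*"}, {"LOWER": "power"}],
]
tokens = ["The", "Solar", "Power", "industry", "uses", "solarpower", ",",
          "so", "Solar", "-", "Power", "is", "amazing"]
print(find_matches(tokens, patterns))
```

With the optional punctuation entry, the separate "solar power" pattern is no longer needed, since zero punctuation tokens is also a valid match.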

In [59]:
p_matcher = spacy.matcher.PhraseMatcher(nlp.vocab)

In [60]:
phrase_list = ['voodoo economics', 'supply-side economics', 'trikle-down economics']

In [61]:
phrase_patterns = [nlp(phrase) for phrase in phrase_list]

In [62]:
type(phrase_patterns[0])

spacy.tokens.doc.Doc

In [63]:
p_matcher.add('EcobMatcher', None, *phrase_patterns)

In [64]:
temp = nlp('hey voodoo economics, got to supply-side economics, say trikle-down economics')
p_matches = p_matcher(temp)

In [65]:
p_matches

[(8838797697251217482, 1, 3),
 (8838797697251217482, 6, 10),
 (8838797697251217482, 12, 16)]

In [66]:
for match_id, start, end in p_matches:
  # match_id is the hash of the rule name; look it up to get 'EcobMatcher' back
  print(match_id, nlp.vocab.strings[match_id], temp[start: end].text)

8838797697251217482 EcobMatcher voodoo economics
8838797697251217482 EcobMatcher supply-side economics
8838797697251217482 EcobMatcher trikle-down economics
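
Phrase matching, as used above, amounts to comparing token sequences rather than per-token conditions. The sketch below shows that comparison in plain Python under simplifying assumptions: `simple_tokens` is a hypothetical regex tokenizer that splits off every punctuation mark, and the scan is a naive sliding window, not the `PhraseMatcher` implementation.

```python
import re


def simple_tokens(text):
    """Words and single punctuation marks, a stand-in for a real tokenizer."""
    return re.findall(r"\w+|[^\w\s]", text)


def phrase_matches(tokens, phrases):
    """Return (start, end) spans where the tokens equal one of the phrases."""
    token_phrases = [simple_tokens(phrase) for phrase in phrases]
    spans = []
    for start in range(len(tokens)):
        for phrase in token_phrases:
            end = start + len(phrase)
            if tokens[start:end] == phrase:
                spans.append((start, end))
    return spans


phrase_list = ["voodoo economics", "supply-side economics", "trikle-down economics"]
text = "hey voodoo economics, got to supply-side economics, say trikle-down economics"
tokens = simple_tokens(text)
spans = phrase_matches(tokens, phrase_list)
for start, end in spans:
    print(start, end, tokens[start:end])
```

The phrases have to be tokenized the same way as the text, which is why the notebook builds them with `nlp(phrase)` before adding them.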


In [72]:
doc12 = nlp(u'a runner is a person that runs, especially in a specified way')

In [76]:
for token in doc12:
  print(token.text)

a
runner
is
a
person
that
runs
,
especially
in
a
specified
way


In [75]:
for token in doc12:
  print(token.text, token.pos_)

a DET
runner NOUN
is AUX
a DET
person NOUN
that DET
runs VERB
, PUNCT
especially ADV
in ADP
a DET
specified VERB
way NOUN


In [77]:
for token in doc12:
  print(token.text, token.lemma, token.lemma_)

a 11901859001352538922 a
runner 12640964157389618806 runner
is 10382539506755952630 be
a 11901859001352538922 a
person 14800503047316267216 person
that 4380130941430378203 that
runs 12767647472892411841 run
, 2593208677638477497 ,
especially 13751905263548122051 especially
in 3002984154512732771 in
a 11901859001352538922 a
specified 5918416916626768037 specify
way 6878210874361030284 way


In [78]:
for token in doc12:
  print(token.text, token.lemma_)

a a
runner runner
is be
a a
person person
that that
runs run
, ,
especially especially
in in
a a
specified specify
way way


In [79]:
pattern12 = [{'LEMMA': 'runner'}]  # pattern keys must be token attributes (e.g. LEMMA, LOWER), not arbitrary words

In [86]:
list(nlp.Defaults.stop_words)[:20]

['may',
 'get',
 'n’t',
 'third',
 'please',
 'six',
 'her',
 'they',
 'have',
 'everything',
 'thereupon',
 'again',
 'because',
 'been',
 'using',
 'whoever',
 'indeed',
 'hence',
 'since',
 'me']

In [91]:
for word in list(nlp.Defaults.stop_words):
  if word.startswith('f'):
    print(word)

first
from
former
fifteen
forty
for
fifty
front
few
full
five
four
formerly
further


In [92]:
for word in list(nlp.Defaults.stop_words):
  if word.startswith('z'):
    print(word)

In [96]:
first_word_by_letter = {}
for word in nlp.Defaults.stop_words:
  starting_letter = word[0]
  # keep only the first stop word seen for each starting letter
  if starting_letter not in first_word_by_letter:
    first_word_by_letter[starting_letter] = word

In [97]:
first_word_by_letter

{"'": "'ve",
 'a': 'again',
 'b': 'because',
 'c': 'ca',
 'd': 'doing',
 'e': 'everything',
 'f': 'first',
 'g': 'get',
 'h': 'her',
 'i': 'indeed',
 'j': 'just',
 'k': 'keep',
 'l': 'less',
 'm': 'may',
 'n': 'n’t',
 'o': 'our',
 'p': 'please',
 'q': 'quite',
 'r': 'regarding',
 's': 'six',
 't': 'third',
 'u': 'using',
 'v': 'various',
 'w': 'whoever',
 'y': 'your',
 '‘': '‘ll',
 '’': '’ve'}
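
The dictionary above keeps one example word per starting letter. If the goal is to count how many stop words begin with each letter, `collections.Counter` is a natural fit. This sketch uses a short hand-copied sample of the stop words printed earlier (`sample_stop_words` is an assumption standing in for `nlp.Defaults.stop_words`), so the numbers describe the sample only.

```python
from collections import Counter

# Sample copied from the stop-word output earlier in the notebook.
sample_stop_words = [
    "may", "get", "third", "please", "six", "her", "they", "have",
    "everything", "thereupon", "again", "because", "been", "using",
    "whoever", "indeed", "hence", "since", "me",
]

# Count the words by their first character.
letter_counts = Counter(word[0] for word in sample_stop_words)

for letter, count in sorted(letter_counts.items()):
    print(letter, count)
```

A letter that starts no stop word, such as `z`, is simply absent, and `letter_counts['z']` evaluates to 0 instead of raising an error.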