# spaCy

In [1]:
!pip install spacy

Collecting spacy
  Downloading spacy-3.2.3-cp39-cp39-win_amd64.whl (11.3 MB)
Collecting pathy>=0.3.5
  Downloading pathy-0.6.1-py3-none-any.whl (42 kB)
Collecting thinc<8.1.0,>=8.0.12
  Downloading thinc-8.0.14-cp39-cp39-win_amd64.whl (1.0 MB)
Collecting spacy-loggers<2.0.0,>=1.0.0
  Downloading spacy_loggers-1.0.1-py3-none-any.whl (7.0 kB)
Collecting spacy-legacy<3.1.0,>=3.0.8
  Downloading spacy_legacy-3.0.9-py2.py3-none-any.whl (20 kB)
Collecting preshed<3.1.0,>=3.0.2
  Downloading preshed-3.0.6-cp39-cp39-win_amd64.whl (112 kB)
Collecting langcodes<4.0.0,>=3.2.0
  Downloading langcodes-3.3.0-py3-none-any.whl (181 kB)
Collecting srsly<3.0.0,>=2.4.1
  Downloading srsly-2.4.2-cp39-cp39-win_amd64.whl (451 kB)
Collecting blis<0.8.0,>=0.4.0
  Downloading blis-0.7.6-cp39-cp39-win_amd64.whl (6.6 MB)
Collecting wasabi<1.1.0,>=0.8.1
  Downloading wasabi-0.9.0-py3-none-any.whl (25 kB)
Collecting catalogue<2.1.0,>=2.0.6
  Using cached catalogue-2.0.6-py3-none-any.whl (17 kB)
Collecting pydantic

In [2]:
!python -m spacy download en_core_web_sm

Collecting en-core-web-sm==3.2.0
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.2.0/en_core_web_sm-3.2.0-py3-none-any.whl (13.9 MB)
Installing collected packages: en-core-web-sm
Successfully installed en-core-web-sm-3.2.0
[+] Download and installation successful
You can now load the package via spacy.load('en_core_web_sm')


In [3]:
import spacy
nlp = spacy.load('en_core_web_sm')
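If the `en_core_web_sm` package is missing, `spacy.load` raises an `OSError`. The sketch below wraps the call in a hypothetical helper, `load_pipeline` (not part of spaCy), that falls back to a blank English pipeline. The fallback is an assumption made for illustration: a blank pipeline only tokenizes, so tagging, lemmatization and parser-based sentence splitting will not be available from it.

```python
import spacy


def load_pipeline(name="en_core_web_sm"):
    """Hypothetical helper: load a trained pipeline, or fall back to a blank one."""
    try:
        return spacy.load(name)
    except OSError:
        # Raised by spacy.load when the named package is not installed.
        # The blank pipeline has a tokenizer but no trained components.
        return spacy.blank("en")


nlp_demo = load_pipeline()
demo_doc = nlp_demo("Joe Biden is the Democratic Party nominee.")
print([token.text for token in demo_doc])
```

The names `nlp_demo` and `demo_doc` are chosen so that the notebook's own `nlp` and `doc` are left untouched.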

In [20]:
text = """ The Republican president is being challenged by Democratic 
Party nominee Joe Biden, who is best known as Barack Obama’s vice-president but has been in 
US politics since the 1970s.As election day approaches, pollingcompanies will be trying to gauge 
the mood of the nation by asking voters which candidate they prefer."""

In [21]:
doc = nlp(text)

In [22]:
type(doc)

spacy.tokens.doc.Doc

In [23]:
print(doc)

 The Republican president is being challenged by Democratic 
Party nominee Joe Biden, who is best known as Barack Obama’s vice-president but has been in 
US politics since the 1970s.As election day approaches, pollingcompanies will be trying to gauge 
the mood of the nation by asking voters which candidate they prefer.


In [8]:
# sentence tokenization
list(doc.sents)

[ The Republican president is being challenged by Democratic 
 Party nominee Joe Biden, who is best known as Barack Obama’s vice-president but has been in 
 US politics since the 1970s.,
 As election day approaches, pollingcompanies will be trying to gauge 
 the mood of the nation by asking voters which candidate they prefer.]
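`doc.sents` above relies on the trained pipeline loaded earlier. As a minimal sketch of an alternative, assuming only punctuation-based splitting is needed, spaCy's rule-based `sentencizer` component can be added to a blank pipeline. The sample sentence and the name `nlp_rules` are made up for illustration.

```python
import spacy

# Blank English pipeline: tokenizer only, no trained model required.
nlp_rules = spacy.blank("en")
# Rule-based sentence splitter that breaks on sentence-final punctuation.
nlp_rules.add_pipe("sentencizer")

sample = "Polling companies ask voters. They publish the results."
sentences = [sent.text for sent in nlp_rules(sample).sents]
print(sentences)
```

Each item in `doc.sents` is a `Span`, so `.text` is used here to get plain strings.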

In [9]:
# word tokenization

for token in doc:
    print(token)

 
The
Republican
president
is
being
challenged
by
Democratic


Party
nominee
Joe
Biden
,
who
is
best
known
as
Barack
Obama
’s
vice
-
president
but
has
been
in


US
politics
since
the
1970s
.
As
election
day
approaches
,
pollingcompanies
will
be
trying
to
gauge


the
mood
of
the
nation
by
asking
voters
which
candidate
they
prefer
.


In [10]:
# Stopwords

stopwords = spacy.lang.en.stop_words.STOP_WORDS

In [11]:
print(stopwords)

{'eight', 'herein', 'all', 'via', 'just', 'will', 'to', 'still', 'only', '’d', 'herself', '‘d', 'whereas', 'nine', 'yourselves', 'one', 'yours', 'really', 'nobody', 'now', 'too', "n't", 'may', 'elsewhere', 'might', 'upon', 'sometime', 'each', 'amongst', 'almost', 'thru', 'per', 'does', 'twenty', '‘m', 'how', 'amount', 'side', 'thereupon', 'also', 'otherwise', 'once', 'them', "'m", 'hence', 'within', 'whereby', '‘s', 'would', 'last', 'is', 'your', 'than', 'she', 'fifty', 'three', 'others', 'their', 'whose', 'serious', 'back', 'hundred', 'everyone', 'not', "'s", 'but', 'he', 'thereafter', 'there', 'should', 'often', 'do', 'yet', 'moreover', 'anyway', 'latterly', 'former', 'same', '‘re', 'several', 'whole', 'latter', 'becomes', 'four', '’ve', 'i', 'why', 'for', 'hereby', 'rather', 'they', 'be', 'less', 'whereupon', 'across', 'among', 'could', 'though', 'neither', 'keep', 'this', 'many', 'n’t', 'itself', 'any', 'which', 'because', 'whenever', 'eleven', 'under', 'however', 'see', 'below', '

In [12]:
# remove the stopwords from the text below

text = """ The Republican president is being challenged by Democratic Party 
nominee Joe Biden, who is best known as Barack Obama’s vice-president but has been in US 
politics since the 1970s.As election day approaches, pollingcompanies will be trying to gauge 
the mood of the nation by asking voters which candidate they prefer.""" 

In [37]:
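# Note: `token` is a spaCy Token, not a str, so `token not in stopwords` is always True
# and nothing is filtered out here; compare `token.text.lower()` against the set instead.
result = (" ").join([str(token) for token in doc if token not in stopwords])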
print(result)      

  The Republican president is being challenged by Democratic 
 Party nominee Joe Biden , who is best known as Barack Obama ’s vice - president but has been in 
 US politics since the 1970s . As election day approaches , pollingcompanies will be trying to gauge 
 the mood of the nation by asking voters which candidate they prefer .
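The output above still contains every stop word because `STOP_WORDS` is a set of strings while the loop yields `Token` objects, so the membership test never matches. A minimal sketch of the string-based comparison follows; it uses a blank pipeline and a shortened sample sentence (both assumptions for illustration) so that it runs without the trained model.

```python
import spacy
from spacy.lang.en.stop_words import STOP_WORDS

# Blank pipeline: tokenizer only, enough for this comparison.
nlp_blank = spacy.blank("en")
sample_doc = nlp_blank("The president is being challenged")

# Compare the lowercased token text, not the Token object, against the set.
kept = [token.text for token in sample_doc if token.text.lower() not in STOP_WORDS]
print(" ".join(kept))
```

Lowercasing matters because the set holds lowercase entries, so "The" would otherwise slip through.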


In [38]:
# spaCy tokens expose .is_stop, which flags stop words directly
result = (" ").join([str(token) for token in doc if not token.is_stop])
print(result)   

  Republican president challenged Democratic 
 Party nominee Joe Biden , best known Barack Obama vice - president 
 politics 1970s . election day approaches , pollingcompanies trying gauge 
 mood nation asking voters candidate prefer .


In [39]:
# spaCy tokens also expose .is_punct, which flags punctuation
result = (" ").join([str(token) for token in doc if not token.is_punct])
print(result) 

  The Republican president is being challenged by Democratic 
 Party nominee Joe Biden who is best known as Barack Obama ’s vice president but has been in 
 US politics since the 1970s As election day approaches pollingcompanies will be trying to gauge 
 the mood of the nation by asking voters which candidate they prefer


In [40]:
# write a function that removes stopwords and punctuation from a Doc and returns the cleaned text

def cleaning(data):
    # iterate over the Doc passed in as `data`, not the global `doc`
    return " ".join(str(token) for token in data if not token.is_punct and not token.is_stop)

In [41]:
print(cleaning(doc))  

  Republican president challenged Democratic 
 Party nominee Joe Biden best known Barack Obama vice president 
 politics 1970s election day approaches pollingcompanies trying gauge 
 mood nation asking voters candidate prefer
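The cleaned output above still carries line breaks, because the newlines in the input are kept as whitespace tokens. One possible variant, sketched here under the assumption that raw text rather than a `Doc` is passed in, also drops tokens flagged by `is_space`. The name `clean_text` and the blank pipeline are illustrative choices, not part of the notebook.

```python
import spacy

# Blank pipeline: is_stop, is_punct and is_space are lexical flags,
# so no trained model is needed for this filter.
nlp_blank = spacy.blank("en")


def clean_text(text, nlp=nlp_blank):
    """Return `text` without stop words, punctuation or whitespace tokens."""
    doc = nlp(text)
    kept = [
        token.text
        for token in doc
        if not (token.is_stop or token.is_punct or token.is_space)
    ]
    return " ".join(kept)


print(clean_text("The Republican president is being challenged by\n Joe Biden."))
```

Passing the pipeline as a parameter means the same function can be called with the trained `nlp` object from the notebook instead of the blank one.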


In [42]:
# lemmatization

for token in doc:  
    print(token,'----->',token.lemma_)

  ----->  
The -----> the
Republican -----> republican
president -----> president
is -----> be
being -----> be
challenged -----> challenge
by -----> by
Democratic -----> democratic

 -----> 

Party -----> Party
nominee -----> nominee
Joe -----> Joe
Biden -----> Biden
, -----> ,
who -----> who
is -----> be
best -----> well
known -----> know
as -----> as
Barack -----> Barack
Obama -----> Obama
’s -----> ’s
vice -----> vice
- -----> -
president -----> president
but -----> but
has -----> have
been -----> be
in -----> in

 -----> 

US -----> US
politics -----> politic
since -----> since
the -----> the
1970s -----> 1970
. -----> .
As -----> as
election -----> election
day -----> day
approaches -----> approach
, -----> ,
pollingcompanies -----> pollingcompanie
will -----> will
be -----> be
trying -----> try
to -----> to
gauge -----> gauge

 -----> 

the -----> the
mood -----> mood
of -----> of
the -----> the
nation -----> nation
by -----> by
asking -----> ask
voters -----> voter
which -----> 

In [43]:
# part-of-speech tagging: fine-grained tag (tag_), coarse tag (pos_), and the tag's description

for token in doc:  
    print(token,'----->',token.tag_ ,'---->' , token.pos_ , '-->', spacy.explain(token.tag_))

  -----> _SP ----> SPACE --> whitespace
The -----> DT ----> DET --> determiner
Republican -----> JJ ----> ADJ --> adjective (English), other noun-modifier (Chinese)
president -----> NN ----> NOUN --> noun, singular or mass
is -----> VBZ ----> AUX --> verb, 3rd person singular present
being -----> VBG ----> AUX --> verb, gerund or present participle
challenged -----> VBN ----> VERB --> verb, past participle
by -----> IN ----> ADP --> conjunction, subordinating or preposition
Democratic -----> JJ ----> ADJ --> adjective (English), other noun-modifier (Chinese)

 -----> _SP ----> SPACE --> whitespace
Party -----> NNP ----> PROPN --> noun, proper singular
nominee -----> NN ----> NOUN --> noun, singular or mass
Joe -----> NNP ----> PROPN --> noun, proper singular
Biden -----> NNP ----> PROPN --> noun, proper singular
, -----> , ----> PUNCT --> punctuation mark, comma
who -----> WP ----> PRON --> wh-pronoun, personal
is -----> VBZ ----> AUX --> verb, 3rd person singular present
best -----> R