Tokenization, POS Tagging, Parsing and NER

In [5]:
pip install spacy

Collecting spacy
  Downloading spacy-3.8.7-cp312-cp312-win_amd64.whl.metadata (28 kB)
Collecting spacy-legacy<3.1.0,>=3.0.11 (from spacy)
  Using cached spacy_legacy-3.0.12-py2.py3-none-any.whl.metadata (2.8 kB)
Collecting spacy-loggers<2.0.0,>=1.0.0 (from spacy)
  Using cached spacy_loggers-1.0.5-py3-none-any.whl.metadata (23 kB)
Collecting murmurhash<1.1.0,>=0.28.0 (from spacy)
  Downloading murmurhash-1.0.13-cp312-cp312-win_amd64.whl.metadata (2.2 kB)
Collecting cymem<2.1.0,>=2.0.2 (from spacy)
  Downloading cymem-2.0.11-cp312-cp312-win_amd64.whl.metadata (8.8 kB)
Collecting preshed<3.1.0,>=3.0.2 (from spacy)
  Downloading preshed-3.0.10-cp312-cp312-win_amd64.whl.metadata (2.5 kB)
Collecting thinc<8.4.0,>=8.3.4 (from spacy)
  Downloading thinc-8.3.6-cp312-cp312-win_amd64.whl.metadata (15 kB)
Collecting wasabi<1.2.0,>=0.9.1 (from spacy)
  Using cached wasabi-1.1.3-py3-none-any.whl.metadata (28 kB)
Collecting srsly<3.0.0,>=2.4.3 (from spacy)
  Downloading srsly-2.5.1-cp312-cp312-win_a

ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
contourpy 1.2.0 requires numpy<2.0,>=1.20, but you have numpy 2.3.2 which is incompatible.
scipy 1.13.1 requires numpy<2.3,>=1.22.4, but you have numpy 2.3.2 which is incompatible.
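
The conflict report above names three packages. Before deciding whether to upgrade scipy and contourpy or to pin numpy, it can help to confirm what is actually installed. The following is a minimal sketch using only the standard library; the helper name installed_versions is made up for this example and is not part of pip or spaCy.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_versions(names):
    """Return {package name: installed version string, or None if missing}."""
    found = {}
    for name in names:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


# Packages named in the pip conflict report above.
report = installed_versions(["numpy", "scipy", "contourpy"])
for name, ver in report.items():
    print(f"{name}: {ver if ver is not None else 'not installed'}")
```

The printed versions depend on the local environment, so no expected output is shown here.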


In [6]:
import spacy 
import spacy.cli

Run the download below only once per environment; afterwards, spacy.load is enough.

In [7]:
spacy.cli.download("en_core_web_sm")

✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_sm')
⚠ Restart to reload dependencies
If you are in a Jupyter or Colab notebook, you may need to restart Python in
order to load all the package's dependencies. You can do this by selecting the
'Restart kernel' or 'Restart runtime' option.


In [8]:
nlp = spacy.load("en_core_web_sm")

In [17]:
doc = nlp("Apple is looking at buying U.K. startup for $1 billion.")

In [18]:
# Tokenization: each token with its part-of-speech tag and dependency label
print("Tokens generated from the sample:")
for token in doc:
    print(f"{token.text} - {token.pos_} - {token.dep_}")

# Named Entity Recognition: entity spans and their labels
print("\nExtracting Named Entities NERs from sample:")
for ent in doc.ents:
    print(f"{ent.text} -> {ent.label_}")

# Dependency parsing: token --> syntactic head --> relation to that head
print("\nDependency Parsing from sample:")
for token in doc:
    print(f"{token.text} --> {token.head.text} --> {token.dep_}")

# Lemmatization: the dictionary base form of each token
print("\nLemmatization from sample:")
for token in doc:
    print(f"{token.text} --> {token.lemma_}")



Tokens generated from the sample:
Apple - PROPN - nsubj
is - AUX - aux
looking - VERB - ROOT
at - ADP - prep
buying - VERB - pcomp
U.K. - PROPN - nsubj
startup - VERB - ccomp
for - ADP - prep
$ - SYM - quantmod
1 - NUM - compound
billion - NUM - pobj
. - PUNCT - punct

Extracting Named Entities NERs from sample:
Apple -> ORG
U.K. -> GPE
$1 billion -> MONEY

Dependency Parsing from sample:
Apple --> looking --> nsubj
is --> looking --> aux
looking --> looking --> ROOT
at --> looking --> prep
buying --> at --> pcomp
U.K. --> startup --> nsubj
startup --> buying --> ccomp
for --> startup --> prep
$ --> billion --> quantmod
1 --> billion --> compound
billion --> for --> pobj
. --> looking --> punct

Lemmatization from sample:
Apple --> Apple
is --> be
looking --> look
at --> at
buying --> buy
U.K. --> U.K.
startup --> startup
for --> for
$ --> $
1 --> 1
billion --> billion
. --> .
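
Each row of the dependency output above links a token to its head, and the ROOT token is its own head. That means the parse can be read as a tree by following head links upward. The sketch below copies the (token, head) pairs from the output above into a plain dictionary; keying by token text works here only because no word repeats in this sentence (with a real Doc you would follow token.head or index by token.i instead). The function name path_to_root is made up for this example.

```python
# (token -> head) pairs copied from the dependency output above.
heads = {
    "Apple": "looking", "is": "looking", "looking": "looking", "at": "looking",
    "buying": "at", "U.K.": "startup", "startup": "buying", "for": "startup",
    "$": "billion", "1": "billion", "billion": "for", ".": "looking",
}


def path_to_root(token, heads):
    """Follow head links from a token until reaching the root (its own head)."""
    path = [token]
    while heads[path[-1]] != path[-1]:
        path.append(heads[path[-1]])
    return path


print(" -> ".join(path_to_root("billion", heads)))
# billion -> for -> startup -> buying -> at -> looking
```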


en_core_web_md Version

In [40]:
spacy.cli.download("en_core_web_md")

✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_md')
⚠ Restart to reload dependencies
If you are in a Jupyter or Colab notebook, you may need to restart Python in
order to load all the package's dependencies. You can do this by selecting the
'Restart kernel' or 'Restart runtime' option.


In [41]:
nlp_md = spacy.load("en_core_web_md")
print("Model 'en_core_web_md' loaded successfully!")

In [44]:
doc1 = nlp_md("Apple is launching its new iphone model in New York this week.")
doc2 = nlp_md("Iphone 8 is the product of Apple.")

print("\nSimilarity between two documents:")
print(doc1.similarity(doc2))




Similarity between two documents:
0.8661232590675354
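
For a model with word vectors such as en_core_web_md, Doc.similarity by default compares the two documents' averaged word vectors using cosine similarity. The sketch below reproduces that idea in plain Python. The 3-dimensional vectors are hypothetical toy values, not vectors from the model, and the helper names mean_vector and cosine are made up for this example.

```python
import math


def mean_vector(vectors):
    """Average a list of equal-length vectors component by component."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def cosine(a, b):
    """Cosine similarity of two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical word vectors: two "documents" with two tokens each.
doc_a = [[1.0, 0.0, 1.0], [0.0, 1.0, 1.0]]
doc_b = [[1.0, 1.0, 0.0], [1.0, 1.0, 2.0]]

print(round(cosine(mean_vector(doc_a), mean_vector(doc_b)), 4))
# 0.9428
```

Because the score is computed on averages, word order is ignored, and two documents sharing many common words can score highly even when their meanings differ.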


Custom Pipeline Component using spaCy

In [54]:
import spacy
from spacy.language import Language
from spacy.tokens import Doc

nlp = spacy.load("en_core_web_md")

# Register the custom Doc attribute only once; registering it twice raises an error.
if not Doc.has_extension("has_exclamation"):
    Doc.set_extension("has_exclamation", default=False)

# Component 1: flag documents that contain an exclamation mark.
@Language.component("exclamation_flagger")
def exclamation_flagger(doc):
    doc._.has_exclamation = "!" in doc.text
    return doc

nlp.add_pipe("exclamation_flagger", last=True)

doc = nlp("Wow! This is Amazing!!!")
print("\nHas Exclamation: ", doc._.has_exclamation)

# Component 2: print every token tagged as a verb, then pass the doc on unchanged.
@Language.component("custom_logic")
def custom_logic(doc):
    verbs = [token.text for token in doc if token.pos_ == "VERB"]
    print("\nVerbs in the sentence: ", verbs)
    return doc

nlp.add_pipe("custom_logic", last=True)

text = "Tesla is building a giga factory in Berlin!"
doc = nlp(text)

print([token.text for token in doc])
print([(ent.text, ent.label_) for ent in doc.ents])



Has Exclamation:  True

Verbs in the sentence:  ['building']
['Tesla', 'is', 'building', 'a', 'giga', 'factory', 'in', 'Berlin', '!']
[('Tesla', 'ORG'), ('Berlin', 'GPE')]
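
A component like the ones above needs nothing from a trained model as long as it only reads doc.text, so it can be tried out on a blank pipeline, which loads quickly. The sketch below assumes spaCy v3 is installed; the extension name has_question and the component name question_flagger are hypothetical and chosen only for this example.

```python
import spacy
from spacy.language import Language
from spacy.tokens import Doc

# Hypothetical extension for this sketch; guarded so the cell can be re-run.
if not Doc.has_extension("has_question"):
    Doc.set_extension("has_question", default=False)


@Language.component("question_flagger")
def question_flagger(doc):
    # Reads only the raw text, so no tagger or parser is required.
    doc._.has_question = "?" in doc.text
    return doc


# A blank English pipeline has a tokenizer but no trained components.
nlp_blank = spacy.blank("en")
nlp_blank.add_pipe("question_flagger")

print(nlp_blank.pipe_names)

doc_q = nlp_blank("Is this working?")
print(doc_q._.has_question)
```

Printing nlp.pipe_names is also a quick way to confirm where a custom component landed relative to the tagger, parser and NER in a full model. A component such as custom_logic, which reads token.pos_, has to run after the tagger.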
