# **Working with SpaCy**



In [1]:
# installing SpaCy and its language models
!python -m spacy download en_core_web_sm
!python -m spacy download xx_ent_wiki_sm
#!python -m spacy download en_core_web_md

Collecting en-core-web-sm==3.6.0
  Downloading https:

In [2]:
import spacy

## **SpaCy Pipelines**

SpaCy ships with a variety of pretrained pipelines.
It is a common convention to assign the loaded pipeline object to the variable `nlp`.

In [3]:
nlp = spacy.load("en_core_web_sm")

#call the variable to examine what this object looks like
nlp

<spacy.lang.en.English at 0x789283501f30>

Here, we have loaded the small (12 MB) language model for English. The `nlp` object holds this language model and is responsible for the tasks the model was trained to perform. You can take a look at the [SpaCy documentation](https://spacy.io/usage/spacy-101#pipelines) for a visualization of what this standard pipeline looks like. The tasks that the small English model can perform depend on the components that have been put in the pipeline.

In [4]:
#check pipeline components
nlp.pipeline

[('tok2vec', <spacy.pipeline.tok2vec.Tok2Vec at 0x7892831a4e20>),
 ('tagger', <spacy.pipeline.tagger.Tagger at 0x7892831a4d60>),
 ('parser', <spacy.pipeline.dep_parser.DependencyParser at 0x7892834e37d0>),
 ('attribute_ruler',
  <spacy.pipeline.attributeruler.AttributeRuler at 0x789283433640>),
 ('lemmatizer',
  <spacy.lang.en.lemmatizer.EnglishLemmatizer at 0x78928345a700>),
 ('ner', <spacy.pipeline.ner.EntityRecognizer at 0x7892834e33e0>)]

This should show you a list of tuples with two items each. The first item of each tuple is the name of a component in the pipeline. Examples of components are tagger, parser, lemmatizer, and ner. The second item is the actual SpaCy component object that performs the task.

We can also design our own pipelines in SpaCy. Suppose we don't need to parse the text or do named entity recognition. We can disable these components in our pipeline, which can help speed up the pipeline when we are feeding in a lot of text.

In [5]:
nlp = spacy.load("en_core_web_sm", disable=["parser", "ner"])

# What does this object look like?
nlp.pipeline

[('tok2vec', <spacy.pipeline.tok2vec.Tok2Vec at 0x7892830e2ce0>),
 ('tagger', <spacy.pipeline.tagger.Tagger at 0x7892830e2ec0>),
 ('attribute_ruler',
  <spacy.pipeline.attributeruler.AttributeRuler at 0x789282311480>),
 ('lemmatizer',
  <spacy.lang.en.lemmatizer.EnglishLemmatizer at 0x7892820da340>)]

Let's say we want to design a pipeline from scratch. You can get started with a "blank" pipeline as shown below.

In [6]:
nlp = spacy.blank("en")

#what does this object look like?
nlp.pipeline

[]

As we can see, the blank pipeline has no components. Note, however, that it still has a tokenizer by default. The tokenizer is not listed in `nlp.pipeline`, but it is required to turn any text passed to the `nlp` object into a `Doc`.

In [30]:
nlp = spacy.blank("en")

if nlp.tokenizer is None:
    print("The blank pipeline does not have a tokenizer.")
else:
    print("The blank pipeline has a tokenizer.")

The blank pipeline has a tokenizer.


We can pass text to a spaCy pipeline by calling the `nlp` object directly with the text as a string, as shown below.

In [31]:
text = "This is some text for preprocessing."
doc = nlp(text)

# with a blank pipeline, nlp.make_doc(text) gives the same result,
# since make_doc only runs the tokenizer
# doc = nlp.make_doc(text)

doc

This is some text for preprocessing.

However, at this point, our pipeline is empty so we cannot perform any preprocessing steps besides tokenization. Try adding a couple of components.

In [11]:
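nlp = spacy.blank("en")
# note: a tagger added this way is untrained and must be initialized or trained before it can tag text
nlp.add_pipe("tagger")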

<spacy.pipeline.tagger.Tagger at 0x78926daa4160>
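
A tagger is a trainable component, so on a blank pipeline it cannot annotate text until it has been trained. As a contrast, here is a small sketch with the rule-based `sentencizer`, which works on a blank pipeline right away. The example text is made up for illustration.

```python
import spacy

nlp = spacy.blank("en")
example = "This is one sentence. This is another."

# before adding anything, the Doc carries no sentence boundaries
had_sents_before = nlp(example).has_annotation("SENT_START")

# the sentencizer splits on punctuation and needs no training
nlp.add_pipe("sentencizer")
doc = nlp(example)
sentences = [sent.text for sent in doc.sents]

print(nlp.pipe_names)
print(had_sents_before, doc.has_annotation("SENT_START"))
print(sentences)
```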

In [35]:
text = "William Shakespeare was born in Stratford-upon-Avon, England, in April 1564. He is widely regarded as one of the greatest playwrights and poets in the English language. Shakespeare's works, including Romeo and Juliet, Hamlet, and Macbeth, are celebrated for their profound insight into the human condition."
nlp_en = spacy.blank("en")
nlp_xx = spacy.load("xx_ent_wiki_sm")
nlp_en.add_pipe("ner", name="ner_xx", source=nlp_xx)
doc = nlp_en(text)
print(doc.ents)

(William Shakespeare, Stratford-upon-Avon, England, English language, Shakespeare, Romeo and Juliet, Hamlet, Macbeth)



A list of texts can be fed to a pipeline using the `pipe()` method, which is more efficient than calling `nlp` on each text when processing a large number of documents. Its `batch_size` argument controls how many texts are processed per batch, and `n_process` sets the number of worker processes.

In [12]:
nlp = spacy.load("en_core_web_sm")
texts = ["This is some text.", "Another text here.", "And one more text."]

for doc in nlp.pipe(texts):
    print([token.text for token in doc])

['This', 'is', 'some', 'text', '.']
['Another', 'text', 'here', '.']
['And', 'one', 'more', 'text', '.']
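
As a small sketch of these options, the example below passes `batch_size` explicitly and uses `as_tuples=True` so that each text travels with some metadata. The `id` values are made up for illustration, and a blank pipeline is used so no model download is needed.

```python
import spacy

nlp = spacy.blank("en")

# (text, context) pairs; the context is passed through untouched
data = [
    ("This is some text.", {"id": 1}),
    ("Another text here.", {"id": 2}),
    ("And one more text.", {"id": 3}),
]

token_counts = {}
# batch_size sets how many texts are buffered per batch;
# n_process (default 1) would set the number of worker processes
for doc, context in nlp.pipe(data, as_tuples=True, batch_size=2):
    token_counts[context["id"]] = len(doc)

print(token_counts)
```

Keeping the context next to each `Doc` avoids having to re-align results with their source records afterwards.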


## Tokenization

As you can see from the cell above, when you call `nlp()` on a text, the tokenizer is automatically applied as the first step (even with a blank pipeline), and the resulting `Doc` object contains a sequence of `Token` objects:

In [2]:
nlp = spacy.load("en_core_web_sm")
text = "This is some example text to tokenize. It contains two sentences."

#pass the text to the nlp object to perform tokenization
doc = nlp(text)

#iterate over the tokens in the doc object
for token in doc:
    print(token.text)

This
is
some
example
text
to
tokenize
.
It
contains
two
sentences
.


In [3]:
for sent in doc.sents:
    print(sent)

This is some example text to tokenize.
It contains two sentences.


SpaCy uses what is referred to as "non-destructive tokenization": the original text can always be recovered from the `Doc`, because each token keeps its position and trailing whitespace, and sentences are spans of token indices into the same document.
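
One way to check this property is to rebuild the text from the tokens. The sketch below uses a blank English pipeline (an assumption made to keep it self-contained) and relies on `token.idx`, the character offset of a token, and `token.text_with_ws`, the token text plus any trailing whitespace.

```python
import spacy

nlp = spacy.blank("en")
text = "This is some example text to tokenize. It contains two sentences."
doc = nlp(text)

# token.i is the token index, token.idx the character offset in the text
for token in doc[:4]:
    print(token.i, token.idx, repr(token.text), repr(token.whitespace_))

# joining the tokens with their trailing whitespace restores the input
rebuilt = "".join(token.text_with_ws for token in doc)
print(rebuilt == text)
print(doc.text == text)
```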

In [42]:
# sent.start and sent.end are token indices into the doc
for sent in doc.sents:
    print(">", sent.start, sent.end)

> 0 8
> 8 13


In [15]:
text = [
    '"Can you see the snow-capped mountains?" asked Martha. "I can\'t," replied Xavier.',
    "Get me those T.P.S. reports A.S.A.P., Mr. O'Donohue!",
    "After today, I'll never call you a ne'er-do-well again.",
    "What's the frequency, Kenneth?"
]

#tokenization using NLTK
import nltk
nltk.download('punkt')
tokens_nltk = nltk.word_tokenize(text[2])
print("NLTK Tokenization:", tokens_nltk)

#tokenization using spaCy
doc = nlp(text[2])
tokens_spacy = [token.text for token in doc]
print("spaCy Tokenization:", tokens_spacy)

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


NLTK Tokenization: ['After', 'today', ',', 'I', "'ll", 'never', 'call', 'you', 'a', "ne'er-do-well", 'again', '.']
spaCy Tokenization: ['After', 'today', ',', 'I', "'ll", 'never', 'call', 'you', 'a', "ne'er", '-', 'do', '-', 'well', 'again', '.']
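
The two tokenizers disagree on `ne'er-do-well`: NLTK keeps it whole, while spaCy splits it at the hyphens. If the single-token behavior is preferred, one option is a tokenizer special case. This is only a sketch on a blank English pipeline, not a recommendation for every hyphenated word.

```python
import spacy
from spacy.symbols import ORTH

nlp = spacy.blank("en")
sentence = "After today, I'll never call you a ne'er-do-well again."

tokens_before = [token.text for token in nlp(sentence)]

# register the exact string as a special case that maps to one token
nlp.tokenizer.add_special_case("ne'er-do-well", [{ORTH: "ne'er-do-well"}])
tokens_after = [token.text for token in nlp(sentence)]

print(tokens_before)
print(tokens_after)
```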


What is the difference in the number of stop words offered by NLTK and SpaCy? What are the words that are different between these two lists of words?

In [16]:
stop_words = nlp.Defaults.stop_words
stop_words

{"'d",
 "'ll",
 "'m",
 "'re",
 "'s",
 "'ve",
 'a',
 'about',
 'above',
 'across',
 'after',
 'afterwards',
 'again',
 'against',
 'all',
 'almost',
 'alone',
 'along',
 'already',
 'also',
 'although',
 'always',
 'am',
 'among',
 'amongst',
 'amount',
 'an',
 'and',
 'another',
 'any',
 'anyhow',
 'anyone',
 'anything',
 'anyway',
 'anywhere',
 'are',
 'around',
 'as',
 'at',
 'back',
 'be',
 'became',
 'because',
 'become',
 'becomes',
 'becoming',
 'been',
 'before',
 'beforehand',
 'behind',
 'being',
 'below',
 'beside',
 'besides',
 'between',
 'beyond',
 'both',
 'bottom',
 'but',
 'by',
 'ca',
 'call',
 'can',
 'cannot',
 'could',
 'did',
 'do',
 'does',
 'doing',
 'done',
 'down',
 'due',
 'during',
 'each',
 'eight',
 'either',
 'eleven',
 'else',
 'elsewhere',
 'empty',
 'enough',
 'even',
 'ever',
 'every',
 'everyone',
 'everything',
 'everywhere',
 'except',
 'few',
 'fifteen',
 'fifty',
 'first',
 'five',
 'for',
 'former',
 'formerly',
 'forty',
 'four',
 'from',
 'fron

In [6]:
import nltk
from nltk.corpus import stopwords
import spacy

nltk.download('stopwords')

nltk_stopwords = set(stopwords.words('english'))
spacy_stopwords = set(nlp.Defaults.stop_words)

common_stopwords = nltk_stopwords.intersection(spacy_stopwords)

nltk_specific_stopwords = nltk_stopwords.difference(spacy_stopwords)
spacy_specific_stopwords = spacy_stopwords.difference(nltk_stopwords)

#print(len(nltk_stopwords))
#print(len(spacy_stopwords))
#print("Common stopwords:", common_stopwords)
print("NLTK specific stopwords:", nltk_specific_stopwords)
print("SpaCy specific stopwords:", spacy_specific_stopwords)


NLTK specific stopwords: {'won', "weren't", "doesn't", 't', 'theirs', 'o', "aren't", "couldn't", "don't", 'needn', 'wouldn', 'ain', 'mustn', "it's", 'having', 'hadn', "should've", 'ma', "she's", 'isn', 'shouldn', 've', 'y', 'hasn', "isn't", 'weren', 'aren', 'd', "mightn't", 'll', "won't", "you're", "you'd", "you'll", 'mightn', 'couldn', 'doesn', 'm', "hadn't", 'shan', "haven't", "mustn't", "that'll", "wasn't", "needn't", "hasn't", 'wasn', "didn't", 'don', "you've", 'didn', "wouldn't", "shan't", 'haven', "shouldn't", 's'}
SpaCy specific stopwords: {'show', 'throughout', '’ll', "'ve", 'whenever', 'ten', 'beforehand', 'much', 'name', 'third', 'however', '‘ve', 'otherwise', 'get', 'thus', 'call', 'along', 'side', 'make', 'among', '‘re', 'cannot', 'anyway', 'herein', 'becomes', 'every', 'eleven', 'really', 'amongst', 'might', 'within', 'whatever', 'besides', '‘d', 'another', 'becoming', 'many', 'therein', 'whence', 'whereupon', '’ve', 'already', 'fifty', 'made', 'else', 'top', 'almost', 'se

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
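
Beyond comparing the lists, the stop word list is usually consumed through the `is_stop` flag on each token. The sketch below filters stop words and punctuation from a sentence, then marks one extra word as a stop word. The choice of `preprocessing` as a custom stop word is purely illustrative.

```python
import spacy

nlp = spacy.blank("en")
doc = nlp("This is some text for preprocessing.")

# keep only tokens that are neither stop words nor punctuation
content_tokens = [t.text for t in doc if not t.is_stop and not t.is_punct]
print(content_tokens)

# flag an additional word as a stop word in this pipeline's vocabulary
nlp.vocab["preprocessing"].is_stop = True
doc = nlp("This is some text for preprocessing.")
filtered_again = [t.text for t in doc if not t.is_stop and not t.is_punct]
print(filtered_again)
```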


## Lemmatization

The small English model includes a lemmatizer in its pipeline, so it is easy to extract lemmas (the root or reduced form of a word) after creating the `Doc` object.

In [8]:
nlp = spacy.load("en_core_web_sm")
#nlp = spacy.blank("en")
doc = nlp("playing played play player")

for token in doc:
    print(token.text, token.lemma_)

playing play
played play
play play
player player


## POS Tagging and Chunking

In [6]:
text = """The quick brown fox jumped over the lazy dog"""
text = nlp(text)
for w in text:
    print(w, w.pos_)

The DET
quick ADJ
brown ADJ
fox NOUN
jumped VERB
over ADP
the DET
lazy ADJ
dog NOUN


In [7]:
for noun in text.noun_chunks:
    print(noun.text)

The quick brown fox
the lazy dog


In [9]:
for chunk in text.noun_chunks:
    print(f"{chunk.text} | {chunk.root.text} | {chunk.root.dep_} | {chunk.root.head.text}")

The quick brown fox | fox | nsubj | jumped
the lazy dog | dog | pobj | over
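
Labels such as `ADP` or `pobj` in the output above are abbreviations. As a quick sketch, `spacy.explain` can look up a short description for them; it needs no loaded model and returns `None` for labels it does not know.

```python
import spacy

# a few of the part-of-speech and dependency labels seen above
labels = ["DET", "ADJ", "ADP", "nsubj", "pobj"]

descriptions = {label: spacy.explain(label) for label in labels}
for label, description in descriptions.items():
    print(f"{label:6} {description}")
```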


## Named Entity Recognition

Named Entity Recognition (NER) aims to identify and classify named entities in text into predefined categories such as person names, organizations, locations, etc.

In [27]:
# Define the text to be processed
text = "Apple CEO announced today that the company will be investing $1 billion in a new research and development facility in Cupertino, California, the home of Apple's headquarters."
doc = nlp(text)
for entity in doc.ents:
    print(entity.text + ' - ' + entity.label_ + ' - ' + str(spacy.explain(entity.label_)))

Apple - ORG - Companies, agencies, institutions, etc.
today - DATE - Absolute or relative dates or periods
$1 billion - MONEY - Monetary values, including unit
Cupertino - GPE - Countries, cities, states
California - GPE - Countries, cities, states
Apple - ORG - Companies, agencies, institutions, etc.


SpaCy also includes the displaCy visualizer, which can be useful for looking through the output of NER, dependency parsing, etc.

In [28]:
from spacy import displacy

# Pass the text to the nlp object to process it
doc = nlp(text)

# Use the displacy.render method to visualize the named entities
displacy.render(doc, style='ent', jupyter=True)

## Custom Components

In [4]:
import spacy
from spacy.tokens import Doc

def get_upper(doc):
    #convert all text in the document to uppercase
    return [token.text.upper() for token in doc]

#add the custom property to the Doc class
Doc.set_extension("upper", getter=get_upper, force=True)

nlp = spacy.load("en_core_web_sm")


doc = nlp("This is a custom SpaCy extension to make every token uppercase.")
uppercase_text = doc._.upper  # custom attributes are accessed through the doc._ namespace
print(uppercase_text)
print(doc.text)

['THIS', 'IS', 'A', 'CUSTOM', 'SPACY', 'EXTENSION', 'TO', 'MAKE', 'EVERY', 'TOKEN', 'UPPERCASE', '.']
This is a custom SpaCy extension to make every token uppercase.
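
The cell above adds a custom extension attribute. A custom pipeline component is a related idea: a function that receives a `Doc`, modifies it, and returns it. The following is a rough sketch using `Language.component` on a blank pipeline; the component name `token_counter` and the attribute `n_tokens` are made up for this example.

```python
import spacy
from spacy.language import Language
from spacy.tokens import Doc

# extension attribute that the component will fill in
Doc.set_extension("n_tokens", default=0, force=True)

@Language.component("token_counter")
def token_counter(doc):
    # store the number of tokens on the doc and pass it along
    doc._.n_tokens = len(doc)
    return doc

nlp = spacy.blank("en")
nlp.add_pipe("token_counter", last=True)

doc = nlp("This is a custom component.")
print(nlp.pipe_names)
print(doc._.n_tokens)
```

Because the component is registered by name, it shows up in `nlp.pipe_names` like any built-in component.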
