# Natural Language Processing | Preprocessing

spaCy upgrade and package installation.

**IMPORTANT**
If you're running this in the cloud rather than on a local Jupyter server, the notebook will time out after a period of inactivity. If that happens and you don't reconnect in time, you will need to upgrade spaCy again and reinstall the required model packages.

Refer to this link on how to run Colab notebooks locally on your machine to avoid this issue:
**https://research.google.com/colaboratory/local-runtimes.html**

In [None]:
!pip install -U "spacy==3.*"




In [2]:
!python -m spacy info


spaCy version    3.7.5                         
Location         /usr/local/lib/python3.10/dist-packages/spacy
Platform         Linux-6.1.85+-x86_64-with-glibc2.35
Python version   3.10.12                       
Pipelines        en_core_web_sm (3.7.1)        



In [8]:
import spacy

In [9]:
!python -m spacy download en_core_web_sm

Collecting en-core-web-sm==3.7.1
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.7.1/en_core_web_sm-3.7.1-py3-none-any.whl (12.8 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 12.8/12.8 MB 50.8 MB/s eta 0:00:00
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_sm')
⚠ Restart to reload dependencies
If you are in a Jupyter or Colab notebook, you may need to restart Python in
order to load all the package's dependencies. You can do this by selecting the
'Restart kernel' or 'Restart runtime' option.


After importing spaCy, the next step is to load a statistical model suitable for our project. spaCy offers models for many languages; they handle tokenization, part-of-speech tagging, named entity recognition, and more.

Here, we're loading the en_core_web_sm model, which is the smallest English model spaCy offers and a good starting point for NLP tasks:
https://spacy.io/models/en#en_core_web_sm

In [11]:
nlp = spacy.load('en_core_web_sm')

In [12]:
type(nlp)
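
Because a cloud runtime can lose its installed packages after a timeout, it may help to guard the load step. The following is a minimal sketch rather than part of the original notebook: the helper name load_pipeline is hypothetical, and falling back to a blank English pipeline (tokenizer only) is an assumption about what a reasonable fallback looks like.

```python
import spacy
from spacy.language import Language


def load_pipeline(name: str = "en_core_web_sm") -> Language:
    """Load a trained pipeline, or fall back to a blank English one.

    load_pipeline is a hypothetical helper, not a spaCy function.
    """
    try:
        return spacy.load(name)
    except OSError:
        # spacy.load raises OSError when the named package cannot be found.
        # The blank pipeline still tokenizes, but has no tagger, parser, or NER.
        return spacy.blank("en")


pipeline = load_pipeline()
print(type(pipeline))
print(pipeline.pipe_names)  # empty list if the fallback was used
```

If the fallback is used, steps that rely on trained components (for example, parser-based sentence splitting) will not work until the model is reinstalled.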

# **Tokenization**

Tokenization with **spaCy**

We pass the text we want to process to nlp, which returns a Doc container object holding the tokenized text and a number of annotations for each token. You can learn more about the Doc object here:
https://spacy.io/api/doc

In [13]:
# Sample sentence.
s = "He didn't want to pay $20 for this book."
doc = nlp(s)

In [14]:
print(doc)

He didn't want to pay $20 for this book.


In [15]:
# We can view an individual token by indexing into the Doc object.
print(doc[0])

He


In [16]:
# A Doc object is a container of other objects, namely Token and Span objects.
print(type(doc[0]))

<class 'spacy.tokens.token.Token'>


In [17]:
# Slicing a Doc object returns a Span object.
print(doc[0:3])
print(type(doc[0:3]))

He didn't
<class 'spacy.tokens.span.Span'>


In [18]:
# Access each token's index within the Doc (t.i counts tokens, not characters).
print([(t.text, t.i) for t in doc])

[('He', 0), ('did', 1), ("n't", 2), ('want', 3), ('to', 4), ('pay', 5), ('$', 6), ('20', 7), ('for', 8), ('this', 9), ('book', 10), ('.', 11)]


We can iterate over this Doc object and view the tokens.

In [19]:
print([t.text for t in doc])

['He', 'did', "n't", 'want', 'to', 'pay', '$', '20', 'for', 'this', 'book', '.']


Note how:

* "didn't" is separated into "did" and "n't".
* The currency symbol and the amount are separated.
* The period at the end of the sentence is its own token.
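
The splits noted above come from the tokenizer's rules (special cases, prefixes, and suffixes). As a hedged illustration, the sketch below uses the tokenizer's explain method to show which kind of rule produced each token. It assumes a blank English pipeline (named blank_nlp here for illustration), which carries the same rule-based tokenizer without any trained components.

```python
import spacy

# Assumption: a blank English pipeline, which contains only the tokenizer.
blank_nlp = spacy.blank("en")
s = "He didn't want to pay $20 for this book."

# explain() returns (rule_label, token_text) pairs for debugging tokenization.
explained = blank_nlp.tokenizer.explain(s)
for rule, text in explained:
    print(f"{rule:<12}{text}")

# The token texts reported by explain() match normal tokenization.
explained_texts = [text for _, text in explained]
token_texts = [t.text for t in blank_nlp(s)]
print(explained_texts == token_texts)
```

The rule labels are meant for debugging, so their exact wording may differ between spaCy versions.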







Equivalently, the same list of token texts can be built with an explicit loop:

In [20]:
texts = []
for t in doc:
    texts.append(t.text)
print(texts)

['He', 'did', "n't", 'want', 'to', 'pay', '$', '20', 'for', 'this', 'book', '.']


The Doc object can be indexed and sliced like a regular list. The Doc object contains Token and Span objects, which offer different views into the text.
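
To make the "views into the text" idea concrete, here is a small sketch of the positions a Span records. It assumes a blank English pipeline (the names blank_nlp and demo_doc are illustrative), since indexing and slicing need only the tokenizer.

```python
import spacy

# Assumption: tokenizer-only pipeline; no trained model is needed for slicing.
blank_nlp = spacy.blank("en")
demo_doc = blank_nlp("He didn't want to pay $20 for this book.")

last_token = demo_doc[-1]  # negative indices work as they do for a list
span = demo_doc[0:3]       # a Span covering tokens 0, 1, and 2

print(len(demo_doc), last_token.text)
print(span.text)
print(span.start, span.end)            # token offsets into the Doc
print(span.start_char, span.end_char)  # character offsets into doc.text
```

A Span stores token and character offsets into its parent Doc, so slicing does not copy the underlying text.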


spaCy's tokenization is non-destructive, which means the original input can be reconstructed from the tokens.


In [26]:
# You can view the original input like so:
print(doc.text)

He didn't want to pay $20 for this book.
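
The reconstruction works because each token remembers its trailing whitespace and its character offset. The sketch below rebuilds the input from the tokens; it assumes a blank English pipeline (illustrative names blank_nlp and demo_doc), since only the tokenizer is involved.

```python
import spacy

# Assumption: tokenizer-only pipeline, sufficient to show non-destructive tokenization.
blank_nlp = spacy.blank("en")
s = "He didn't want to pay $20 for this book."
demo_doc = blank_nlp(s)

# text_with_ws is the token text plus any whitespace that followed it.
rebuilt = "".join(token.text_with_ws for token in demo_doc)
print(rebuilt == s)

# token.idx is the character offset of the token in the original string.
for token in demo_doc[:4]:
    print(token.i, token.idx, repr(token.text), repr(token.whitespace_))
```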


We can also tokenize multiple sentences and access each sentence individually using the Doc object's sents property.

In [32]:
s = """Either the well was very deep, or she fell very slowly, for she
had plenty of time as she went down to look about her and to wonder what
was going to happen next. First, she tried to look down and make out what
she was coming to, but it was too dark to see anything; then she looked at
the sides of the well, and noticed that they were filled with cupboards and
book-shelves; here and there she saw maps and pictures hung upon pegs."""

doc = nlp(s)

# doc.sents yields one Span per sentence; this passage should produce two.
for sent in doc.sents:
    print(sent.text)


Either the well was very deep, or she fell very slowly, for she 
had plenty of time as she went down to look about her and to wonder what 
was going to happen next.
First, she tried to look down and make out what 
she was coming to, but it was too dark to see anything; then she looked at 
the sides of the well, and noticed that they were filled with cupboards and 
book-shelves; here and there she saw maps and pictures hung upon pegs.
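
Sentence boundaries have to be set by some pipeline component before doc.sents can be used; in en_core_web_sm this comes from the trained pipeline. As a hedged alternative sketch, spaCy's rule-based sentencizer component can split on punctuation without a trained model. The pipeline name sent_nlp and the sample text are illustrative assumptions.

```python
import spacy

# Assumption: a blank pipeline plus the built-in rule-based "sentencizer".
sent_nlp = spacy.blank("en")
sent_nlp.add_pipe("sentencizer")

text = "The well was very deep. She fell very slowly."
sentences = [sent.text for sent in sent_nlp(text).sents]
print(sentences)
```

The sentencizer is fast but relies only on punctuation, so it can differ from the trained pipeline on text with abbreviations or unusual punctuation.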


In [33]:
s = "Either the well was very deep, or she fell very slowly, for she had plenty of time as she went down."
doc = nlp(s)

# Split into sentences
for sent in doc.sents:
    print(sent)

Either the well was very deep, or she fell very slowly, for she had plenty of time as she went down.


In [34]:
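s = "He didn't want to pay $20 for this book."
doc = nlp(s)

# Print each currency symbol together with the token that follows it.
for token in doc:
    # Guard against a currency symbol that is the last token in the Doc.
    if token.is_currency and token.i + 1 < len(doc):
        print(f"{token.text}{doc[token.i + 1].text}")  # Output: $20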


$20
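
The same idea can be wrapped in a function that also checks that the following token looks like a number. This is a sketch under stated assumptions: extract_amounts is a hypothetical helper, the sample text is made up, and a blank English pipeline is used because is_currency and like_num are lexical attributes that need no trained model.

```python
import spacy

# Assumption: tokenizer-only pipeline; is_currency and like_num work without a model.
blank_nlp = spacy.blank("en")


def extract_amounts(doc):
    """Return symbol + number strings such as "$20" (hypothetical helper)."""
    amounts = []
    for token in doc:
        has_next = token.i + 1 < len(doc)
        # Keep the pair only when a number-like token follows the symbol.
        if token.is_currency and has_next and doc[token.i + 1].like_num:
            amounts.append(token.text + doc[token.i + 1].text)
    return amounts


amounts = extract_amounts(blank_nlp("Prices: $20, €5 and a lone $"))
print(amounts)
```

A currency symbol with no number after it, including one at the very end of the text, is skipped rather than raising an IndexError.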
