# Text preprocessing using spaCy

In spaCy, each language has its own trained model, which must be downloaded before use. To process text in a language, we load its model. The API is organized as a pipeline of processing components ("pipes"); by default, all of a model's pipes are loaded.
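
As a minimal sketch of the download-then-load workflow described above, the snippet below checks whether a model package is installed before loading it. It assumes spaCy v3, where `spacy.util.is_package` is available; the names `MODEL_NAME` and `model_installed` are illustrative only.

```python
import spacy
from spacy.util import is_package

# Illustrative choice; any model name listed on https://spacy.io/models works
MODEL_NAME = "en_core_web_sm"

# True if the model is installed as a Python package
model_installed = is_package(MODEL_NAME)

if model_installed:
    nlp_check = spacy.load(MODEL_NAME)
    print(nlp_check.pipe_names)
else:
    print(f"Model missing; run: python -m spacy download {MODEL_NAME}")
```

This only reports the missing model instead of downloading it, so the cell has no side effects.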

In [1]:
# Skip this cell if the model is already installed on your system.
# This small model is used just for testing; for more accurate models, refer to
# https://spacy.io/models
!python -m spacy download en_core_web_sm

In [2]:
import spacy

# Load the trained model
# For more languages: https://spacy.io/models
nlp = spacy.load("en_core_web_sm")

# List the names of the active pipes (tasks)
nlp.pipe_names

['tok2vec', 'tagger', 'parser', 'attribute_ruler', 'lemmatizer', 'ner']

In [5]:
# Pipes can be disabled; here we disable all of them
nlp.select_pipes(disable=['tok2vec', 'tagger', 'parser', 'attribute_ruler', 'lemmatizer', 'ner'])
nlp.pipe_names

[]
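
The call above disables the pipes until they are enabled again. As a hedged sketch, `select_pipes` can also be used as a context manager so that pipes are disabled only temporarily. The example uses a blank English pipeline (`spacy.blank("en")`, tokenizer only) so it does not need a downloaded model; `nlp_demo` and the `names_*` variables are illustrative names.

```python
import spacy

# Blank English pipeline (tokenizer only), so no model download is needed
nlp_demo = spacy.blank("en")
nlp_demo.add_pipe("sentencizer")

names_before = list(nlp_demo.pipe_names)

# Inside the block the pipe is disabled; it is restored on exit
with nlp_demo.select_pipes(disable=["sentencizer"]):
    names_inside = list(nlp_demo.pipe_names)

names_after = list(nlp_demo.pipe_names)

print(names_before, names_inside, names_after)
```

Note that `pipe_names` lists only the active pipes, which is why it is empty inside the `with` block.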

## I. Sentence tokenization

Given a text, split it into its sentences.

In [10]:
text = 'This is a text written by Mr. Aries. It uses U.S. english to illustrate sentence tokenization.'

# Pipes can be added even after the others were disabled;
# the rule-based sentencizer sets sentence boundaries without needing the parser
if 'sentencizer' not in nlp.pipe_names:
    nlp.add_pipe('sentencizer')

# Process the text using all enabled pipes
doc = nlp(text)

# -------- Method 1 ---------
# sents_list = []
# for sent in doc.sents:
#     sents_list.append(sent.text)

# -------- Method 2 ---------
# sents_list = [sent.text for sent in doc.sents]

# -------- Method 3 ---------
# Note: this keeps the Span objects, whereas methods 1 and 2 keep only their text
sents_list = list(doc.sents)

sents_list

[This is a text written by Mr. Aries.,
 It uses U.S. english to illustrate sentence tokenization.]
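
Since `doc.sents` yields `Span` objects rather than strings, each sentence also carries its position in the original text. The following is a small sketch of that, again on a blank English pipeline with only the `sentencizer` added; the variable names are illustrative.

```python
import spacy

nlp_demo = spacy.blank("en")
nlp_demo.add_pipe("sentencizer")

demo_text = ("This is a text written by Mr. Aries. "
             "It uses U.S. english to illustrate sentence tokenization.")
doc_demo = nlp_demo(demo_text)

for sent in doc_demo.sents:
    # start_char/end_char are character offsets into the original text
    print(sent.start_char, sent.end_char, repr(sent.text))

sent_texts = [sent.text for sent in doc_demo.sents]
```

The abbreviations "Mr." and "U.S." are kept as single tokens by the tokenizer, so their periods do not trigger a sentence split.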

## II. Word tokenization

Tokenization is executed automatically when calling **nlp(text)**.
This is because tokens are the basic units on which the other tasks operate.

In [13]:
# -------- Method 1 ---------
# tokens = []
# for word in doc:
#     tokens.append(word.text)

# -------- Method 2 ---------
# tokens = [word.text for word in doc]

# -------- Method 3 ---------
# Note: this keeps the Token objects, whereas methods 1 and 2 keep only their text
tokens = list(doc)

tokens

[This,
 is,
 a,
 text,
 written,
 by,
 Mr.,
 Aries,
 .,
 It,
 uses,
 U.S.,
 english,
 to,
 illustrate,
 sentence,
 tokenization,
 .]
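
Because the tokens are `Token` objects, they expose more than their text. As an illustrative sketch (blank English pipeline, hypothetical variable names), the attributes `i`, `idx` and `is_punct` give the token index, its character offset, and whether it is punctuation:

```python
import spacy

nlp_demo = spacy.blank("en")
doc_demo = nlp_demo("It uses U.S. english to illustrate sentence tokenization.")

for token in doc_demo:
    # i: index of the token in the doc, idx: character offset in the text
    print(token.i, token.idx, token.text, token.is_punct)

# Keep only the tokens that are not punctuation
words_only = [token.text for token in doc_demo if not token.is_punct]
print(words_only)
```

These attributes are lexical, so they are available even when all model pipes are disabled.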

## III. Stop-word filtering

Each token has a boolean attribute **is_stop**, which indicates whether the token is a stop word.

In [19]:
filtered_tokens = []
for word in doc:
    # Keep only the tokens that are not stop words
    if not word.is_stop:
        filtered_tokens.append(word.text)

filtered_tokens

['text',
 'written',
 'Mr.',
 'Aries',
 '.',
 'uses',
 'U.S.',
 'english',
 'illustrate',
 'sentence',
 'tokenization',
 '.']
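
The output above still contains punctuation. A possible variant, sketched here under the assumption that punctuation should be dropped as well, combines **is_stop** with **is_punct** in a list comprehension. It also shows the language's stop-word set through `nlp.Defaults.stop_words`; `nlp_demo` and the other names are illustrative.

```python
import spacy

nlp_demo = spacy.blank("en")
doc_demo = nlp_demo("This is a text written by Mr. Aries.")

# Drop both stop words and punctuation
content_words = [t.text for t in doc_demo
                 if not t.is_stop and not t.is_punct]
print(content_words)

# The stop-word set of the language; the check is case-insensitive on tokens
stop_words = nlp_demo.Defaults.stop_words
print(len(stop_words), "this" in stop_words)
```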

## IV. Lemmatization

For each word, the attribute **lemma_** holds its lemma (base form).

In [20]:
text = 'This is a text written by Mr. Aries. It uses U.S. english to illustrate sentence tokenization.'


# Re-enable the pipes needed for lemmatization
nlp.enable_pipe('tagger')
nlp.enable_pipe('tok2vec')  # the tagger relies on the shared tok2vec component
nlp.enable_pipe('attribute_ruler')
nlp.enable_pipe('lemmatizer')  # the lemmatizer needs tagger + attribute_ruler, OR a morphologizer

print(nlp.pipe_names)

doc = nlp(text)

lemmas_list = []
for word in doc:
    lemmas_list.append((word.text, word.lemma_))

lemmas_list

['tok2vec', 'tagger', 'attribute_ruler', 'lemmatizer', 'sentencizer']


[('This', 'this'),
 ('is', 'be'),
 ('a', 'a'),
 ('text', 'text'),
 ('written', 'write'),
 ('by', 'by'),
 ('Mr.', 'Mr.'),
 ('Aries', 'Aries'),
 ('.', '.'),
 ('It', 'it'),
 ('uses', 'use'),
 ('U.S.', 'U.S.'),
 ('english', 'english'),
 ('to', 'to'),
 ('illustrate', 'illustrate'),
 ('sentence', 'sentence'),
 ('tokenization', 'tokenization'),
 ('.', '.')]
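
Instead of disabling pipes and re-enabling them one by one, the unneeded pipes can be left out when loading the model. The sketch below assumes spaCy v3 and its `exclude` argument to `spacy.load`, and it combines lemmatization with the stop-word filtering from section III. It falls back gracefully when `en_core_web_sm` is not installed; `nlp_lemma` and `lemmas` are illustrative names.

```python
import spacy

try:
    # exclude= skips loading these components entirely
    nlp_lemma = spacy.load("en_core_web_sm", exclude=["parser", "ner"])
except OSError:
    # Raised when the model is not installed; see the download cell at the top
    nlp_lemma = None

lemmas = []
if nlp_lemma is not None:
    doc_lemma = nlp_lemma("This is a text written by Mr. Aries.")
    # Lowercased lemmas of the tokens that are neither stop words nor punctuation
    lemmas = [t.lemma_.lower() for t in doc_lemma
              if not t.is_stop and not t.is_punct]

print(lemmas)
```

Unlike disabled pipes, excluded pipes are not loaded at all, so they cannot be re-enabled later on the same `nlp` object.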