# **1. Tokenization**

**In SpaCy, tokenization is the first step of the processing pipeline: calling `nlp(text)` returns a `Doc`, and the tokens can be read with `[token.text for token in doc]`.**

In [None]:
import spacy

# Load the SpaCy English model
nlp = spacy.load('en_core_web_sm')

# Process the text
text = "This is an example sentence. Tokenize this sentence using SpaCy."
doc = nlp(text)

# Extract tokens
spacy_tokens = [token.text for token in doc]
print(spacy_tokens)


['This', 'is', 'an', 'example', 'sentence', '.', 'Tokenize', 'this', 'sentence', 'using', 'SpaCy', '.']


**We load SpaCy’s model, pass the text to `nlp()` to tokenize it, extract the tokens using `[token.text for token in doc]`, and print them.**

In [None]:
# Load the SpaCy English model
nlp = spacy.load('en_core_web_sm')

# Input text
text = "Natural language processing is a field of artificial intelligence that deals with the interaction between computers and human (natural) language."

# Tokenize the text
doc = nlp(text)
tokens = [token.text for token in doc]

print("Tokens:", tokens)


Tokens: ['Natural', 'language', 'processing', 'is', 'a', 'field', 'of', 'artificial', 'intelligence', 'that', 'deals', 'with', 'the', 'interaction', 'between', 'computers', 'and', 'human', '(', 'natural', ')', 'language', '.']
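
If only the tokens are needed, a tokenizer-only pipeline may be enough. The sketch below assumes `spacy.blank("en")` suits your use case: it builds an English pipeline with the rule-based tokenizer but no trained components, so it needs no model download. The name `nlp_blank` is chosen here to avoid overwriting the notebook's `nlp`. The example also reads `token.i` (token index) and `token.idx` (character offset), which help map tokens back to the original string.

```python
import spacy

# Blank English pipeline: tokenizer only, no tagger, parser, or NER
nlp_blank = spacy.blank("en")

doc = nlp_blank("Tokenize this sentence using SpaCy.")

# (token index, character offset in the original text, token text)
token_info = [(token.i, token.idx, token.text) for token in doc]
print(token_info)
```

Attributes that depend on trained components, such as `token.pos_` or `doc.ents`, stay empty in a blank pipeline, so this shortcut only fits steps that work on the token text itself.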


# **2. Lowercasing**

Lowercasing converts all text to lowercase so that later steps treat words such as "Natural" and "natural" as the same token. To lowercase SpaCy tokens, call Python's built-in `str.lower()` method on each token's `text` attribute:


In [None]:
# Input text
text = "Natural language processing is a field of artificial intelligence that deals with the interaction between computers and human (natural) language."

# Tokenize the text
doc = nlp(text)

# Lowercase the tokens
lowercased_tokens = [token.text.lower() for token in doc]

print("Lowercased tokens:", lowercased_tokens)


Lowercased tokens: ['natural', 'language', 'processing', 'is', 'a', 'field', 'of', 'artificial', 'intelligence', 'that', 'deals', 'with', 'the', 'interaction', 'between', 'computers', 'and', 'human', '(', 'natural', ')', 'language', '.']
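
SpaCy tokens also carry a precomputed lowercase form in the `lower_` attribute, which gives the same result as `token.text.lower()`. A minimal sketch, assuming a tokenizer-only pipeline (`nlp_blank` is a name chosen for this example) is sufficient:

```python
import spacy

# Tokenizer-only pipeline; lower_ is a lexical attribute and needs no trained model
nlp_blank = spacy.blank("en")
doc = nlp_blank("Natural Language Processing (NLP)")

# token.lower_ holds the lowercase form of each token's text
lowercased = [token.lower_ for token in doc]
print(lowercased)
```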


# **3. Remove Punctuation**
Removing punctuation marks simplifies the text and makes it easier to process. To drop punctuation with SpaCy, check each token's `is_punct` attribute and keep only the tokens for which it is `False`:

In [None]:
# Input text
text = "Natural language processing is a field of artificial intelligence that deals with the interaction between computers and human (natural) language."

# Tokenize the text
doc = nlp(text)

# Remove punctuation
filtered_tokens = [token.text for token in doc if not token.is_punct]

print("Tokens without punctuation:", filtered_tokens)


Tokens without punctuation: ['Natural', 'language', 'processing', 'is', 'a', 'field', 'of', 'artificial', 'intelligence', 'that', 'deals', 'with', 'the', 'interaction', 'between', 'computers', 'and', 'human', 'natural', 'language']


# **4. Remove Stop Words**

Stop words are common words that add little meaning on their own, such as “a,” “an,” and “the”; removing them is a common step in text processing. To remove them with SpaCy, filter on each token's `is_stop` attribute:

**NOTE: In SpaCy, you don't need to download stop words separately as you do in NLTK. The stop-word list ships with SpaCy's language data and is available as soon as the pipeline is loaded, so the `nltk.download('stopwords')` line can be removed altogether.**

In [None]:
# Input text
text = "Natural language processing is a field of artificial intelligence that deals with the interaction between computers and human (natural) language."

# Tokenize the text
doc = nlp(text)

# Remove stop words
filtered_tokens = [token.text for token in doc if not token.is_stop]

print("Tokens without stop words:", filtered_tokens)


Tokens without stop words: ['Natural', 'language', 'processing', 'field', 'artificial', 'intelligence', 'deals', 'interaction', 'computers', 'human', '(', 'natural', ')', 'language', '.']
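
The punctuation and stop-word filters can be combined in a single pass. The sketch below assumes a tokenizer-only pipeline (`nlp_blank` is a name chosen for this example), which works because `is_stop` and `is_punct` are lexical attributes. It also checks the stop-word set directly via `Defaults.stop_words`.

```python
import spacy

nlp_blank = spacy.blank("en")

# The English stop-word set is part of the language defaults
print("the" in nlp_blank.Defaults.stop_words)

doc = nlp_blank("Natural language processing is a field of artificial intelligence.")

# Keep tokens that are neither stop words nor punctuation
content_tokens = [
    token.text for token in doc if not token.is_stop and not token.is_punct
]
print(content_tokens)
```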


# **5. Remove extra whitespace**

SpaCy doesn't have a dedicated function for collapsing whitespace in raw text, so this step is handled with plain Python string methods before the text is passed to `nlp()`:

In [None]:
# Input text with extra white space
text = "  Natural   language processing   is   a field   of artificial intelligence   that deals with the interaction between computers and human   (natural)   language.   "

# Remove leading and trailing white space
text = text.strip()

# Replace multiple consecutive white space characters with a single space
text = " ".join(text.split())

print("Cleaned text:", text)


Cleaned text: Natural language processing is a field of artificial intelligence that deals with the interaction between computers and human (natural) language.
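
An equivalent approach uses a regular expression. This is a sketch of one alternative, not a recommendation over the version above: `\s+` matches any run of whitespace, including tabs and newlines, and the final comparison confirms that both approaches agree on this input.

```python
import re

text = "  Natural   language\tprocessing \n is   fun.  "

# Collapse every run of whitespace to one space, then trim both ends
cleaned = re.sub(r"\s+", " ", text).strip()
print(repr(cleaned))

# Same result as the split/join idiom
print(cleaned == " ".join(text.split()))
```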


# **6. Remove URLs**


URLs can be removed from the raw text with a regular expression before tokenization, as shown here. SpaCy can also flag URLs after tokenization through the `like_url` token attribute, which is useful if you are already working with tokens.

In [None]:
import re

# Input text with URLs
text = "Natural language processing is a field of artificial intelligence that deals with the interaction between computers and human (natural) language. Check out this article for more information: https://en.wikipedia.org/wiki/Natural_language_processing"

# Define a regular expression pattern to match URLs
pattern = r"(http|ftp|https)://([\w_-]+(?:(?:\.[\w_-]+)+))([\w.,@?^=%&:/~+#-]*[\w@?^=%&/~+#-])?"

# Replace URLs with an empty string
cleaned_text = re.sub(pattern, "", text).strip()

# Print the cleaned text without URLs
print("Text without URLs:", cleaned_text)


Text without URLs: Natural language processing is a field of artificial intelligence that deals with the interaction between computers and human (natural) language. Check out this article for more information:
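
For token-level filtering, SpaCy's `like_url` attribute can stand in for the regular expression. A minimal sketch, assuming a tokenizer-only pipeline (`nlp_blank` and the sample sentence are chosen for this example); `text_with_ws` is used to rebuild the string with its original spacing:

```python
import spacy

nlp_blank = spacy.blank("en")
doc = nlp_blank(
    "Check out this article: "
    "https://en.wikipedia.org/wiki/Natural_language_processing for more."
)

# Drop tokens that look like URLs and rebuild the text from what remains
cleaned = "".join(
    token.text_with_ws for token in doc if not token.like_url
).strip()
print("Text without URLs:", cleaned)
```

`like_url` is a heuristic on the token text, so it is worth spot-checking it against the kinds of links that appear in your own data.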


# **7. Remove HTML code**

SpaCy doesn't have built-in functionality for stripping HTML, so the tags are removed with a regular expression before the text is passed to `nlp()`. Note that the simple pattern below only deletes tags; it does not decode entities such as `&amp;` and can misfire on malformed markup:




In [None]:
import re

# Input text with HTML code
text = "Natural language processing is a field of artificial intelligence that deals with the interaction between computers and human (natural) language. <b>This is an example of bold text.</b>"

# Define a regular expression pattern to match HTML tags
pattern = r"<[^>]+>"

# Replace HTML tags with an empty string
cleaned_text = re.sub(pattern, "", text).strip()

# Optionally process the cleaned text using SpaCy (if further processing is needed)
doc = nlp(cleaned_text)

# Print the cleaned text without HTML
print("Text without HTML code:", cleaned_text)


Text without HTML code: Natural language processing is a field of artificial intelligence that deals with the interaction between computers and human (natural) language. This is an example of bold text.
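
When the input contains HTML entities or messier markup, Python's standard-library `html.parser` is one alternative to the regular expression. The sketch below is an illustration under that assumption; `TextExtractor` and `strip_html` are hypothetical helper names, not part of SpaCy.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects text content and ignores tags."""

    def __init__(self):
        super().__init__()  # convert_charrefs=True decodes entities like &amp;
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def strip_html(html):
    parser = TextExtractor()
    parser.feed(html)
    parser.close()  # flush any buffered text
    return "".join(parser.parts)


print(strip_html("<p>Tom &amp; Jerry: <b>bold</b> text.</p>"))
```

The returned string can then be passed to `nlp()` in the same way as the regex-cleaned text.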


# **8. Lemmatization**

NOTE: SpaCy does not provide a stemmer; it offers lemmatization instead, which is generally more accurate. Lemmatization returns the base or dictionary form of a word, whereas stemming simply strips affixes and often produces non-words. Here's how to get each token's lemma in SpaCy:


In [None]:
# Input text
text = "Natural language processing is a field of artificial intelligence that deals with the interaction between computers and human (natural) language."

# Tokenize and process the text
doc = nlp(text)

# Lemmatize each token
lemmatized_tokens = [token.lemma_ for token in doc]

print("Lemmatized tokens:", lemmatized_tokens)


Lemmatized tokens: ['natural', 'language', 'processing', 'be', 'a', 'field', 'of', 'artificial', 'intelligence', 'that', 'deal', 'with', 'the', 'interaction', 'between', 'computer', 'and', 'human', '(', 'natural', ')', 'language', '.']
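
To make the contrast with stemming concrete, here is a deliberately naive suffix-stripping function. `naive_stem` is a hypothetical helper written only for this illustration; it is not part of SpaCy or NLTK, and real stemmers such as the Porter stemmer apply far more careful rules.

```python
def naive_stem(word):
    """Toy stemmer: strip one common suffix (illustration only)."""
    for suffix in ("ing", "es", "s"):
        # Only strip if at least three characters remain
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


for word in ["deals", "computers", "studies", "is"]:
    print(word, "->", naive_stem(word))
```

The toy stemmer turns "studies" into the non-word "studi" and leaves "is" unchanged, whereas the lemmatizer output above maps "is" to "be".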


# **9. Part-of-speech tagging**

In SpaCy, there’s no need for separate downloads for POS tagging. The language model (e.g., `en_core_web_sm`) includes built-in features like tokenization, POS tagging, and lemmatization.

After loading the model (`nlp = spacy.load('en_core_web_sm')`), you can immediately perform POS tagging. The text is processed using `doc = nlp(text)` to create a `Doc` object, and you can access each token's POS tag with `token.pos_`.

In [None]:
# Input text
text = "Natural language processing is a field of artificial intelligence that deals with the interaction between computers and human (natural) language."

# Tokenize and process the text using SpaCy
doc = nlp(text)

# Tag each token with its POS tag
tagged_tokens = [(token.text, token.pos_) for token in doc]

print("Tagged tokens:", tagged_tokens)


Tagged tokens: [('Natural', 'ADJ'), ('language', 'NOUN'), ('processing', 'NOUN'), ('is', 'AUX'), ('a', 'DET'), ('field', 'NOUN'), ('of', 'ADP'), ('artificial', 'ADJ'), ('intelligence', 'NOUN'), ('that', 'PRON'), ('deals', 'VERB'), ('with', 'ADP'), ('the', 'DET'), ('interaction', 'NOUN'), ('between', 'ADP'), ('computers', 'NOUN'), ('and', 'CCONJ'), ('human', 'ADJ'), ('(', 'PUNCT'), ('natural', 'ADJ'), (')', 'PUNCT'), ('language', 'NOUN'), ('.', 'PUNCT')]
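
The tags above are coarse-grained Universal POS labels. If an abbreviation is unfamiliar, `spacy.explain()` returns a short description of it; the call works without loading a model. A small sketch:

```python
import spacy

# Look up human-readable descriptions for some of the tags printed above
for tag in ["ADJ", "AUX", "ADP", "CCONJ"]:
    print(tag, "=", spacy.explain(tag))
```

Each token also has a `tag_` attribute holding the fine-grained tag, which can be read the same way as `pos_` when more detail is needed.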


# **10. Named Entity Recognition**

In SpaCy, no separate downloads are needed for Named Entity Recognition (NER) since the language model (e.g., `en_core_web_sm`) includes NER functionality.

Load the model with `nlp = spacy.load('en_core_web_sm')`, then process text using `doc = nlp(text)`. Named entities are available through `doc.ents`, a tuple of `Span` objects; each span exposes its text as `ent.text` and its label (such as "PERSON" or "ORG") as `ent.label_`.

In [None]:
# Input text
text = "Natural language processing is a field of artificial intelligence that deals with the interaction between computers and human (natural) language. John Smith works at Google in New York."

# Process the text using SpaCy
doc = nlp(text)

# Extract named entities
named_entities = [(ent.text, ent.label_) for ent in doc.ents]

print("Named entities:", named_entities)


Named entities: [('John Smith', 'PERSON'), ('Google', 'ORG'), ('New York', 'GPE')]
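
Each entity in `doc.ents` is a `Span`, so it also carries character offsets into the original text. The sketch below shows that structure without a trained model: it assumes a blank pipeline (`nlp_blank`, a name chosen for this example) and sets the entity spans by hand, so the labels are illustrative and are not model predictions.

```python
import spacy
from spacy.tokens import Span

# Blank pipeline has no NER component: the entities below are set manually
nlp_blank = spacy.blank("en")
doc = nlp_blank("John Smith works at Google in New York.")

# Hand-labelled spans given as (start token, end token exclusive)
doc.ents = [
    Span(doc, 0, 2, label="PERSON"),
    Span(doc, 4, 5, label="ORG"),
    Span(doc, 6, 8, label="GPE"),
]

# Each entity exposes its text, label, and character offsets
for ent in doc.ents:
    print(ent.text, ent.label_, ent.start_char, ent.end_char)
```

With a trained pipeline such as `en_core_web_sm`, the same attributes are available on the entities that the model predicts.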
