# Natural Language Processing (NLP) Using Tokenization, Stopword Removal, and Lemmatization

## Tokenization

Tokenization is the process of splitting text into individual units called tokens, such as words, numbers, and punctuation marks.

- It is usually the first step in text cleaning.
- Tokenization lets you work with each word as a separate object.

> From ChatGPT:
> In `spaCy`, tokenization is automatically handled when we create a `Doc` object from the text.

In [16]:
import spacy

In [17]:
# Load the spaCy English model
nlp = spacy.load("en_core_web_sm")

# Sample text
text = "Tokenization is the first step in text preprocessing."

# Process the text using spaCy
doc = nlp(text)

# Tokenize the text
tokens = [token.text for token in doc]
print("Tokens:", tokens)

Tokens: ['Tokenization', 'is', 'the', 'first', 'step', 'in', 'text', 'preprocessing', '.']
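To see what the tokenizer does beyond splitting on spaces, the sketch below compares spaCy's tokens with a plain `str.split()`. It assumes a blank English pipeline (`spacy.blank("en")`), which contains only the tokenizer and needs no model download; the names `nlp_blank`, `sample`, and `sample_tokens` are made up for this illustration.

```python
import spacy

# A blank English pipeline: tokenizer only, no trained components.
# (Assumption: used here only so the example runs without a model download.)
nlp_blank = spacy.blank("en")

sample = "Don't panic: it's only a test!"

# spaCy separates punctuation and splits contractions into parts.
sample_tokens = [token.text for token in nlp_blank(sample)]
print("spaCy tokens:    ", sample_tokens)

# A naive whitespace split keeps punctuation glued to the words.
print("Whitespace split:", sample.split())
```

The whitespace split leaves strings such as `test!` intact, while the spaCy tokenizer gives punctuation and contraction parts their own tokens.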


## Stopword Removal

Stopwords are very common words, such as "a", "an", "the", "of", and "are", that carry little meaning on their own and are usually dropped before vectorizing text.

- In spaCy, each token has an `is_stop` attribute, a boolean that reports whether the token appears in the language's built-in stopword list.
- If `is_stop` is `True`, the token is a stopword; if it is `False`, it is not.
- This makes it easy to filter out stopwords in Python with a for loop or a list comprehension.

> From ChatGPT:

In [18]:
# Removing stopwords from the tokenized text
tokens_without_stopwords = [token.text for token in doc if not token.is_stop]
print("Tokens without stopwords:", tokens_without_stopwords)

Tokens without stopwords: ['Tokenization', 'step', 'text', 'preprocessing', '.']
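To check why a particular token was dropped, it can help to look at the stopword list itself. The sketch below is a minimal illustration, assuming a blank English pipeline so that no model download is needed; `nlp_blank`, `stop_words`, `doc_stop`, and `flags` are names invented for this example.

```python
import spacy

# Blank English pipeline (assumption: enough here, since is_stop is a
# lexical attribute and does not need a trained model).
nlp_blank = spacy.blank("en")

# The language defaults expose the built-in stopword list as a set of strings.
stop_words = nlp_blank.Defaults.stop_words
print("Number of stopwords:", len(stop_words))
print("A few of them:", sorted(stop_words)[:10])

# Show the is_stop flag for every token in the sample sentence.
doc_stop = nlp_blank("Tokenization is the first step in text preprocessing.")
flags = [(token.text, token.is_stop) for token in doc_stop]
for token_text, is_stop in flags:
    print(f"{token_text:<15}{is_stop}")
```

Printing the flags next to the tokens makes it clear which words the filter removes and which it keeps.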


## Lemmatization

Lemmatization is the process of reducing a word to its dictionary base form, called its lemma. For example, "running", "ran", and "runs" would all be converted into "run".

- Lemmatization normalizes text so that different forms of the same word are treated as one, which helps in analysis and vectorization.
- In spaCy, the lemma of a token is available through its `lemma_` attribute.

> From ChatGPT:

In [19]:
# Lemmatizing the tokens
lemmas = [token.lemma_ for token in doc]
print("Lemmas:", lemmas)

Lemmas: ['tokenization', 'be', 'the', 'first', 'step', 'in', 'text', 'preprocessing', '.']


Another example from ChatGPT combines all three steps into a single function that preprocesses text:

In [20]:
def clean_text(text):
    doc = nlp(text)
    # Tokenization, Stopword Removal, and Lemmatization
    cleaned_tokens = [token.lemma_ for token in doc if not token.is_stop and not token.is_punct]
    return cleaned_tokens

# Example
text = "Lemmatization and stopword removal are essential in cleaning text."
cleaned_text = clean_text(text)
print("Cleaned Text:", cleaned_text)

Cleaned Text: ['lemmatization', 'stopword', 'removal', 'essential', 'clean', 'text']


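You may be asking why some of these words, like 'removal', aren't changed to something like 'remove'.
This is because lemmatization only strips inflection and takes part of speech into account. 'removal' is a noun in its own right, not an inflected form of the verb 'remove', so its lemma stays 'removal'. By contrast, 'cleaning' was treated as a verb form in the example sentence and reduced to 'clean'; if it had been tagged as a noun, it would most likely have been left unchanged.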