

# NLP Tokenization & Preprocessing

This notebook provides a comprehensive introduction to text tokenization and preprocessing in NLP using NLTK and spaCy. It is designed for learners who want to understand both the theory and practical implementation of preparing text for analysis or machine learning.
    


## Tokenization
Tokenization is the process of splitting text into smaller parts called **tokens** — usually words or sentences.

**Example:**
```
"I love NLP!" → ["I", "love", "NLP", "!"]
```

**Why It's Important:**
- Computers can't directly understand text; they need structured input.  
- Tokenization helps us break down text for analysis (e.g., counting words, building vocabularies).  
- It’s the *first step* in almost every NLP pipeline.

**Two types of tokenizer:**
- Sentence Tokenizer: splits a piece of text into sentences. Each resulting token is a complete sentence, not an individual word.
- Word Tokenizer: splits a sentence (or text) into individual words or tokens. Punctuation often becomes a separate token.

| Feature | Sentence Tokenizer                                              | Word Tokenizer                                                                |
| ------- | --------------------------------------------------------------- | ----------------------------------------------------------------------------- |
| Output  | Sentences                                                       | Words / tokens                                                                |
| Level   | Coarse                                                          | Fine-grained                                                                  |
| Purpose | Document-level analysis, sentence classification, summarization | Word-level analysis, feature extraction, embeddings                           |
| Example | `"Hello world! I am NLP."` → `["Hello world!", "I am NLP."]`    | `"Hello world! I am NLP."` → `['Hello', 'world', '!', 'I', 'am', 'NLP', '.']` |
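
To make the two levels concrete, here is a minimal sketch that uses only Python's `re` module. The function names `naive_sent_tokenize` and `naive_word_tokenize` are made up for this illustration, and the regular expressions are a deliberate simplification, not how NLTK or spaCy tokenize.

```python
import re

def naive_sent_tokenize(text):
    # Split on whitespace that follows ., ! or ?
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

def naive_word_tokenize(text):
    # A token is a run of word characters or a single punctuation character
    return re.findall(r"\w+|[^\w\s]", text)

text = "Hello world! I am NLP."
sentences = naive_sent_tokenize(text)
words = naive_word_tokenize(text)

print(sentences)
print(words)
```

This reproduces the example in the table, but patterns this simple break on abbreviations, URLs, and contractions, which is the gap the library tokenizers fill.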

    


### Using NLTK
- `sent_tokenize()` → splits text into sentences with the pre-trained Punkt model, which uses cues such as punctuation, abbreviations, and capitalization.
- `word_tokenize()` → first splits the text into sentences, then splits each sentence into words with rule-based patterns that handle punctuation and contractions.
- `nltk.download('punkt')` → downloads the data NLTK needs to identify sentence boundaries. Recent NLTK releases look for the `punkt_tab` resource instead, so the cell below downloads both.


Both `sent_tokenize()` and `word_tokenize()` depend on the pre-trained **Punkt model**:
- `sent_tokenize()` uses it directly to decide where sentences start and end, and
- `word_tokenize()` calls the sentence splitter before it breaks each sentence into words.

So before you can use either tokenizer, the Punkt data must be available, hence the download.
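
The sketch below hints at why sentence splitting needs more than a "split after a period" rule. It is a toy, not Punkt: the abbreviation list and the helper name `split_sentences` are assumptions made for this example, whereas Punkt learns its abbreviations from data.

```python
import re

# Hypothetical, hand-written abbreviation list (Punkt learns these from a corpus)
ABBREVIATIONS = {"dr.", "mr.", "mrs.", "ms.", "prof."}

def split_sentences(text, abbreviations=ABBREVIATIONS):
    # First pass: split after every ., ! or ? that is followed by whitespace
    pieces = re.split(r"(?<=[.!?])\s+", text.strip())
    sentences, carry = [], ""
    for piece in pieces:
        candidate = f"{carry} {piece}".strip()
        if not candidate:
            continue
        # If the chunk ends in a known abbreviation, glue it to the next piece
        if candidate.split()[-1].lower() in abbreviations:
            carry = candidate
        else:
            sentences.append(candidate)
            carry = ""
    if carry:
        sentences.append(carry)
    return sentences

text = "Dr. Smith arrived. He sat down."
naive = re.split(r"(?<=[.!?])\s+", text)
fixed = split_sentences(text)

print(naive)   # the period after "Dr" is treated as a sentence end
print(fixed)
```

A fixed list like this cannot cover every abbreviation, which is the motivation for a trained sentence-boundary model.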
    

In [30]:
!pip install nltk



In [31]:
# Import NLTK
import nltk
from nltk.tokenize import word_tokenize, sent_tokenize

# Download required tokenizer data
nltk.download('punkt')
nltk.download('punkt_tab')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!


True

In [32]:
# Sample text
text = "Hello world! I'm learning Natural Language Processing using NLTK."

# Sentence Tokenization
sentences = sent_tokenize(text)
print("Sentence Tokenization:")
print(sentences)

# Word Tokenization
words = word_tokenize(text)
print("\nWord Tokenization:")
print(words)

Sentence Tokenization:
['Hello world!', "I'm learning Natural Language Processing using NLTK."]

Word Tokenization:
['Hello', 'world', '!', 'I', "'m", 'learning', 'Natural', 'Language', 'Processing', 'using', 'NLTK', '.']


### Using spaCy

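- `spacy.load("en_core_web_sm")` loads the small English pipeline. If it is not installed, `python -m spacy download en_core_web_sm` fetches it.
- `nlp(text)` runs the text through spaCy’s pipeline, returning a `Doc` object.
- `doc.sents` yields the sentences as `Span` objects; in this pipeline the parser sets the sentence boundaries.
- `[token.text for token in doc]` extracts the text of each token (word, punctuation, etc.).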

In [33]:
!pip install spacy



In [34]:
import spacy

In [35]:
# Load small English model
nlp = spacy.load("en_core_web_sm")

# Sample text
text = "Hello world! I'm learning Natural Language Processing using spaCy."

# Process the text
doc = nlp(text)

# Sentence Tokenization
print("Sentence Tokenization:")
for sent in doc.sents:
    print(sent.text)

# Word Tokenization
words = [token.text for token in doc]
print("\nWord Tokenization:")
print(words)


Sentence Tokenization:
Hello world!
I'm learning Natural Language Processing using spaCy.

Word Tokenization:
['Hello', 'world', '!', 'I', "'m", 'learning', 'Natural', 'Language', 'Processing', 'using', 'spaCy', '.']



## NLTK vs spaCy

| Feature | **NLTK** | **spaCy** |
|----------|-----------|-----------|
| **Focus** | Academic / teaching | Production / industrial |
| **Speed** | Slower | Faster (Cython backend) |
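| **Ease of Use** | Simple standalone functions | One pipeline call returning rich `Doc` and `Token` objects |
| **Sentence Splitting** | Punkt model (punctuation and abbreviation cues) | Context-aware (uses the dependency parse) |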
| **Integration** | Works with NLTK modules | Works with full NLP pipeline |

✅ Use **NLTK** for learning and research.  
✅ Use **spaCy** for larger, production-ready NLP applications.
    

In [36]:
# 🧪 Practice Exercise

text = "Dr. Smith's email is dr.smith@example.com! Let's meet at 10:30 a.m., okay?"

# NLTK Tokenization
nltk_tokens = word_tokenize(text)

# spaCy Tokenization
doc = nlp(text)
spacy_tokens = [token.text for token in doc]

print("NLTK Tokens:\n", nltk_tokens)
print("\nspaCy Tokens:\n", spacy_tokens)


NLTK Tokens:
 ['Dr.', 'Smith', "'s", 'email', 'is', 'dr.smith', '@', 'example.com', '!', 'Let', "'s", 'meet', 'at', '10:30', 'a.m.', ',', 'okay', '?']

spaCy Tokens:
 ['Dr.', 'Smith', "'s", 'email', 'is', 'dr.smith@example.com', '!', 'Let', "'s", 'meet', 'at', '10:30', 'a.m.', ',', 'okay', '?']
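
One way to spot where the two tokenizers disagree is to compare the token lists directly. The sketch below hard-codes the two outputs shown above so that it runs without either library; the names `only_nltk` and `only_spacy` are illustrative.

```python
# Token lists copied from the exercise output above
nltk_tokens = ['Dr.', 'Smith', "'s", 'email', 'is', 'dr.smith', '@', 'example.com', '!',
               'Let', "'s", 'meet', 'at', '10:30', 'a.m.', ',', 'okay', '?']
spacy_tokens = ['Dr.', 'Smith', "'s", 'email', 'is', 'dr.smith@example.com', '!',
                'Let', "'s", 'meet', 'at', '10:30', 'a.m.', ',', 'okay', '?']

# Tokens that appear in one list but not in the other
only_nltk = [t for t in nltk_tokens if t not in spacy_tokens]
only_spacy = [t for t in spacy_tokens if t not in nltk_tokens]

print("Only in NLTK: ", only_nltk)
print("Only in spaCy:", only_spacy)
```

The only difference in this run is the email address: NLTK splits it into three tokens, while spaCy keeps it as one.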


## Text Preprocessing After Tokenization

After tokenization, our text is split into words — but these words are still messy and inconsistent.
For example:
```
"Cats", "cat", "CAT", "cats."
```

All these mean the same thing, but they look different to a computer.
Text preprocessing helps by cleaning and normalizing the text, so the model can focus on meaning instead of surface differences.

Common steps include:

| Step                         | Purpose                                                                   |
| ---------------------------- | ------------------------------------------------------------------------- |
| **Lowercasing**              | Makes the text consistent (so “Cat” and “cat” are treated the same)       |
| **Removing punctuation**     | Removes irrelevant characters that don’t carry meaning                    |
| **Removing stopwords**       | Filters out common words (“the”, “is”, “and”) that add little information |
| **Stemming / Lemmatization** | Reduces words to their base form (so “running”, “runs” → “run”)           |


Quick Guidelines:

| Step                   | Needed?     | Comment                                                        |
| ---------------------- | ----------- | -------------------------------------------------------------- |
| Lowercasing            | Usually yes | Ensures consistency, especially for Bag of Words/TF-IDF        |
| Punctuation removal    | Depends     | Safe for classification/search; avoid for syntax/NER/sentiment |
| Stopword removal       | Often       | Reduces noise for frequency-based features                     |
| Stemming/Lemmatization | Optional    | Helps normalize words for simpler models                       |
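
As a rough illustration of how these choices can be switched on and off per task, the sketch below wraps the steps in one function. Everything in it is an assumption made for the example: the `preprocess` name, the regex tokenizer, and the tiny stop word list stand in for the NLTK and spaCy tools used in the rest of the notebook.

```python
import re
import string

# Hypothetical mini stop word list, just for this sketch
TINY_STOP_WORDS = {"the", "is", "and", "a", "to"}
PUNCT_CHARS = set(string.punctuation)

def preprocess(text, lowercase=True, remove_punct=True, remove_stop_words=True):
    """Tokenize with a simple regex, then apply the selected cleaning steps."""
    tokens = re.findall(r"\w+|[^\w\s]", text)
    if lowercase:
        tokens = [t.lower() for t in tokens]
    if remove_punct:
        tokens = [t for t in tokens if t not in PUNCT_CHARS]
    if remove_stop_words:
        tokens = [t for t in tokens if t.lower() not in TINY_STOP_WORDS]
    return tokens

text = "The cat is on the mat, and the dog is happy!"

# Frequency-based features: apply every step
bag_of_words_tokens = preprocess(text)
# Syntax-sensitive tasks: keep case, punctuation, and stop words
parser_tokens = preprocess(text, lowercase=False, remove_punct=False, remove_stop_words=False)

print(bag_of_words_tokens)
print(parser_tokens)
```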




## Lowercasing and Punctuation Removal

### Lowercasing

- Lowercasing matters because count-based methods (like Bag of Words, TF-IDF, or simple frequency counts) compare tokens as exact strings, so `"Cat" ≠ "cat"` unless you normalize the text.
- Exceptions:
  - Transformer models such as BERT come in cased and uncased variants, and the uncased variants lowercase the input in their own tokenizer. In those cases, you might not need to lowercase manually.

- **✅ Rule of thumb**:
  - For traditional NLP pipelines, lowercase by default. For modern transformers, check whether the model is case-sensitive (`bert-base-cased` vs `bert-base-uncased`).
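
A quick way to see the effect on frequency counts, using only the standard library (the token list is made up for the example):

```python
from collections import Counter

tokens = ["Cat", "cat", "CAT", "dog"]

# Without normalization, each spelling is counted as a separate word
raw_counts = Counter(tokens)
# After lowercasing, the three spellings collapse into one entry
folded_counts = Counter(t.lower() for t in tokens)

print(raw_counts)
print(folded_counts)
```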

### Punctuation Removal

- For tasks like word frequency, Bag of Words, or TF-IDF, punctuation doesn’t carry meaning and may just create noisy tokens (`"word."` ≠ `"word"`).
- Exceptions:
  - Some NLP tasks rely on punctuation, e.g., sentiment analysis (`"I love it!"` vs `"I love it."`) or syntactic parsing.
  - Libraries like spaCy already separate punctuation into its own token, so `"Hello!"` becomes `["Hello", "!"]`. This makes it easier to either keep or remove punctuation depending on the task.
- **✅ Rule of thumb**:
  - If your task is text classification, search, or topic modeling, it’s safe to remove punctuation.
  - If your task is syntax analysis, named entity recognition, or sentiment analysis, consider keeping punctuation.
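
When only some punctuation matters, a middle ground is to drop punctuation tokens except for a small allow-list. The helper below is a sketch under that assumption; `strip_punct` and its `keep` parameter are hypothetical names, not part of NLTK or spaCy.

```python
import string

def strip_punct(tokens, keep=frozenset()):
    """Drop tokens made only of punctuation, except characters listed in `keep`."""
    removable = set(string.punctuation) - set(keep)
    return [t for t in tokens if not all(ch in removable for ch in t)]

tokens = ["i", "love", "it", "!", "really", ".", "n't"]

no_punct = strip_punct(tokens)                # remove all punctuation tokens
keep_exclaim = strip_punct(tokens, keep="!")  # keep "!" as a sentiment cue

print(no_punct)
print(keep_exclaim)
```

Tokens such as `"n't"` survive in both cases because they are not made up entirely of punctuation characters.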

In [37]:
import string

text = "Cats are running faster than the dogs were. They aren't stopping anytime soon!"
words = word_tokenize(text)

# Lowercasing
words_lower = [w.lower() for w in words]

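# Remove tokens made up entirely of punctuation characters.
# A set is used because `w not in string.punctuation` is a substring test,
# which would let multi-character tokens such as "..." slip through.
punct_chars = set(string.punctuation)
words_no_punct = [w for w in words_lower if not all(ch in punct_chars for ch in w)]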

print("After Lowercasing & Removing Punctuation:")
print(words_no_punct)

After Lowercasing & Removing Punctuation:
['cats', 'are', 'running', 'faster', 'than', 'the', 'dogs', 'were', 'they', 'are', "n't", 'stopping', 'anytime', 'soon']


## Stop Word Removal

Stopwords are very common words in a language that often carry little meaning on their own.

Examples:
```
"the", "is", "and", "in", "to"
```

Why remove stopwords?
- Reduces noise in Bag of Words or TF-IDF models.
- Reduces dimensionality in text features.
- Can improve speed and accuracy for tasks like text classification or topic modeling, because the model sees fewer uninformative features.
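
The dimensionality point can be seen on a toy corpus. The three documents and the stop word set below are invented for this sketch; with real data the reduction depends on the corpus and the list used.

```python
docs = [
    "the cat is on the mat",
    "the dog is in the yard",
    "a cat and a dog",
]
# Hypothetical mini stop word list for this example
STOP_WORDS = {"the", "is", "on", "in", "a", "and"}

# Vocabulary = set of distinct words; each one would be a feature column
vocab_full = sorted({word for doc in docs for word in doc.split()})
vocab_filtered = [word for word in vocab_full if word not in STOP_WORDS]

print(len(vocab_full), vocab_full)
print(len(vocab_filtered), vocab_filtered)
```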

When NOT to remove stopwords:
- When context matters, as in sentiment analysis or language modeling: removing "not" turns "not good" into "good".
- When using transformer models, which are trained on full sentences and learn how much weight to give each word.

✅ Rule of thumb:
- Remove stopwords for classical NLP pipelines;
- keep them for deep learning or context-sensitive tasks.
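
The negation problem is easy to reproduce. In this sketch the stop word set is a small hand-picked sample and `NEGATIONS` is an assumed allow-list; the same idea applies to a full list such as NLTK's, which includes "not".

```python
# Small illustrative stop word sample and an assumed set of negations to protect
STOP_WORDS = {"this", "is", "not", "the", "a"}
NEGATIONS = {"not", "no", "nor"}

def remove_stop_words(tokens, stop_words):
    return [t for t in tokens if t not in stop_words]

tokens = "this movie is not good".split()

plain = remove_stop_words(tokens, STOP_WORDS)
negation_aware = remove_stop_words(tokens, STOP_WORDS - NEGATIONS)

print(plain)           # the negation is gone, so the meaning flips
print(negation_aware)  # "not" is preserved
```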

#### Using NLTK

In [38]:
# Download stopwords
from nltk.corpus import stopwords
nltk.download('stopwords')

# Display all stopwords
stopwords.words('english')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


['a',
 'about',
 'above',
 'after',
 'again',
 'against',
 'ain',
 'all',
 'am',
 'an',
 'and',
 'any',
 'are',
 'aren',
 "aren't",
 'as',
 'at',
 'be',
 'because',
 'been',
 'before',
 'being',
 'below',
 'between',
 'both',
 'but',
 'by',
 'can',
 'couldn',
 "couldn't",
 'd',
 'did',
 'didn',
 "didn't",
 'do',
 'does',
 'doesn',
 "doesn't",
 'doing',
 'don',
 "don't",
 'down',
 'during',
 'each',
 'few',
 'for',
 'from',
 'further',
 'had',
 'hadn',
 "hadn't",
 'has',
 'hasn',
 "hasn't",
 'have',
 'haven',
 "haven't",
 'having',
 'he',
 "he'd",
 "he'll",
 'her',
 'here',
 'hers',
 'herself',
 "he's",
 'him',
 'himself',
 'his',
 'how',
 'i',
 "i'd",
 'if',
 "i'll",
 "i'm",
 'in',
 'into',
 'is',
 'isn',
 "isn't",
 'it',
 "it'd",
 "it'll",
 "it's",
 'its',
 'itself',
 "i've",
 'just',
 'll',
 'm',
 'ma',
 'me',
 'mightn',
 "mightn't",
 'more',
 'most',
 'mustn',
 "mustn't",
 'my',
 'myself',
 'needn',
 "needn't",
 'no',
 'nor',
 'not',
 'now',
 'o',
 'of',
 'off',
 'on',
 'once',
 'on

In [39]:
text = "This is a simple example to demonstrate stopword removal using NLTK."

# Tokenize
words = word_tokenize(text)

# Lowercase (optional but recommended)
words_lower = [w.lower() for w in words]

# Remove stopwords
stop_words = set(stopwords.words('english'))
filtered_words = [w for w in words_lower if w not in stop_words]

print("Original words:")
print(words_lower)
print("\nAfter stopword removal (NLTK):")
print(filtered_words)

Original words:
['this', 'is', 'a', 'simple', 'example', 'to', 'demonstrate', 'stopword', 'removal', 'using', 'nltk', '.']

After stopword removal (NLTK):
['simple', 'example', 'demonstrate', 'stopword', 'removal', 'using', 'nltk', '.']


#### Using spaCy

spaCy has built-in stop word support. Unlike NLTK, it needs no extra download: a default stop word list ships with each supported language. Each token also exposes a lowercased form through `token.lower_`.

In [40]:
# Load English model
nlp = spacy.load("en_core_web_sm")

In [41]:
# Print the number of default stop words
len(nlp.Defaults.stop_words)

326

In [42]:
# Check whether a word is a stop word through its vocab entry (lexeme)
print(nlp.vocab['myself'].is_stop)      # check "myself" if stopword or not

print(nlp.vocab['mystery'].is_stop)     # check "mystery" if stopword or not

True
False


In [43]:
# Make a new stopword

# Add the word to the set of stop words. Use lowercase!
nlp.Defaults.stop_words.add('mystery')
# Set the stop_word tag on the lexeme
nlp.vocab['mystery'].is_stop = True

print("New Stopword length: ", len(nlp.Defaults.stop_words))
print("Mystery a stopword? ", nlp.vocab['mystery'].is_stop)

New Stopword length:  327
Mystery a stopword?  True


In [44]:
# Remove a stopword

# Remove the word from the set of stop words
nlp.Defaults.stop_words.remove('mystery')
# Remove the stop_word tag from the lexeme
nlp.vocab['mystery'].is_stop = False

print("New Stopword length: ", len(nlp.Defaults.stop_words))
print("Mystery a stopword? ", nlp.vocab['mystery'].is_stop)

New Stopword length:  326
Mystery a stopword?  False


In [45]:
text = "This is a simple example to demonstrate stopword removal using spaCy."

# Process text
doc = nlp(text)

# Remove stopwords and punctuation
filtered_words = [token.text for token in doc if not token.is_stop and not token.is_punct]

print("After stopword removal (spaCy):")
print(filtered_words)

After stopword removal (spaCy):
['simple', 'example', 'demonstrate', 'stopword', 'removal', 'spaCy']


In [46]:
nlp.pipe_names

['tok2vec', 'tagger', 'parser', 'attribute_ruler', 'lemmatizer', 'ner']

## Stemming and Lemmatization

Both stemming and lemmatization reduce words to their root forms — but they do it differently.

| Feature        | **Stemming**                                    | **Lemmatization**                         |
| -------------- | ----------------------------------------------- | ----------------------------------------- |
| **Definition** | Cuts off word endings using simple rules        | Converts word to its dictionary base form |
| **Example**    | “running” → “run”, “studies” → “studi”          | “running” → “run”, “studies” → “study”    |
| **Accuracy**   | Faster but less accurate (can create non-words) | Slower but linguistically correct         |
| **Library**    | `nltk.PorterStemmer`                            | `spacy` or `WordNetLemmatizer`            |
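
The difference between the two approaches can be sketched in a few lines of plain Python. Both functions are toys written for this comparison: the suffix rules are not the Porter algorithm, and the lookup table is a stand-in for a real dictionary such as WordNet.

```python
# Hypothetical suffix rules, tried in order (NOT the Porter stemmer's rules)
SUFFIXES = ("ing", "ies", "es", "s")

def toy_stem(word):
    """Chop off the first matching suffix if at least 3 characters remain."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

# Hypothetical mini dictionary mapping inflected forms to base forms
LEMMA_TABLE = {"running": "run", "runs": "run", "studies": "study", "studying": "study"}

def toy_lemmatize(word):
    """Look the word up; fall back to the word itself when it is unknown."""
    return LEMMA_TABLE.get(word, word)

words = ["running", "runs", "runner", "studies", "studying"]
stems = [toy_stem(w) for w in words]
lemmas = [toy_lemmatize(w) for w in words]

print("Stems: ", stems)
print("Lemmas:", lemmas)
```

The rule-based version is cheap but produces non-words and inconsistent results ("runn" vs "run"), while the lookup version returns valid words but only for entries its dictionary knows.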


#### Using NLTK

- NLTK can do both stemming and lemmatization, but lemmatization requires the WordNet resource.

In [47]:
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

words = ["running", "runs", "runner", "studies", "studying"]
stemmed_words = [stemmer.stem(word) for word in words]

print("Original words:", words)
print("Stemmed words (NLTK):", stemmed_words)

Original words: ['running', 'runs', 'runner', 'studies', 'studying']
Stemmed words (NLTK): ['run', 'run', 'runner', 'studi', 'studi']


In [48]:
from nltk.stem import WordNetLemmatizer
nltk.download('wordnet')
nltk.download('omw-1.4')

lemmatizer = WordNetLemmatizer()

words = ["running", "runs", "runner", "studies", "studying"]
# pos='v' treats each word as a verb; without it, lemmatize() assumes a noun
lemmatized_words = [lemmatizer.lemmatize(word, pos='v') for word in words]

print("Original words:", words)
print("Lemmatized words (NLTK):", lemmatized_words)

Original words: ['running', 'runs', 'runner', 'studies', 'studying']
Lemmatized words (NLTK): ['run', 'run', 'runner', 'study', 'study']


[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package omw-1.4 to /root/nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!


#### Using spaCy

- spaCy does not provide stemming, only lemmatization.

In [49]:
nlp = spacy.load("en_core_web_sm")

text = "running runs runner studies studying"
doc = nlp(text)

lemmatized_words = [token.lemma_ for token in doc]
print("Original text:", text)
print("Lemmatized words (spaCy):", lemmatized_words)

Original text: running runs runner studies studying
Lemmatized words (spaCy): ['run', 'run', 'runner', 'study', 'study']
