This notebook covers the foundational preprocessing steps common to most NLP pipelines.
We demonstrate how to clean and normalize raw text using both `nltk` and `spaCy`.

In [4]:
!pip install nltk spacy


In [6]:
!python -m spacy download en_core_web_sm

Collecting en-core-web-sm==3.8.0
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.8.0/en_core_web_sm-3.8.0-py3-none-any.whl (12.8 MB)
Installing collected packages: en-core-web-sm
Successfully installed en-core-web-sm-3.8.0
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_sm')


In [24]:
pip uninstall nltk -y && pip install nltk

Found existing installation: nltk 3.9.1
Uninstalling nltk-3.9.1:
  Successfully uninstalled nltk-3.9.1
Collecting nltk
  Using cached nltk-3.9.1-py3-none-any.whl.metadata (2.9 kB)
Using cached nltk-3.9.1-py3-none-any.whl (1.5 MB)
Installing collected packages: nltk
Successfully installed nltk-3.9.1
Note: you may need to restart the kernel to use updated packages.


In [9]:
import nltk
import spacy
from nltk.corpus import stopwords
from nltk.tokenize import TreebankWordTokenizer
from nltk.stem import PorterStemmer
from spacy.lang.en import English

In [8]:
nltk.download('punkt')
nltk.download('stopwords')

[nltk_data] Downloading package punkt to /home/funavry/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package stopwords to
[nltk_data]     /home/funavry/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


True

In [12]:
nlp = spacy.load("en_core_web_sm")

In [6]:
sample_text = "I recently bought the new smartphone, and I'm impressed! The camera is amazing, battery lasts all day."


#### Key Steps:
- Tokenization using `nltk` and `spaCy`
- Stopword removal
- Stemming using `PorterStemmer` (NLTK)
- Lemmatization using `spaCy`

##### Tokenization
What: Breaking text into smaller units (tokens), such as words, punctuation marks, or sentences.

Example:
Input: "ChatGPT is awesome!"
Tokens: ['ChatGPT', 'is', 'awesome', '!']
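
To make the idea concrete without any library, here is a minimal regex-based sketch. It is not how `nltk` or `spaCy` tokenize; `TOKEN_PATTERN` and `simple_tokenize` are hypothetical names used only for illustration.

```python
import re

# A token is either a run of word characters (optionally followed by an
# apostrophe and more word characters) or a single non-space symbol.
TOKEN_PATTERN = re.compile(r"\w+(?:'\w+)?|[^\w\s]")

def simple_tokenize(text):
    """Split text into word and punctuation tokens (illustrative only)."""
    return TOKEN_PATTERN.findall(text)

print(simple_tokenize("ChatGPT is awesome!"))
# ['ChatGPT', 'is', 'awesome', '!']
```

Note that this sketch keeps a contraction such as "I'm" as one token, whereas the library tokenizers used in this notebook split it into "I" and "'m". Handling such cases consistently is one reason to prefer a library tokenizer.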

In [10]:
tokenizer = TreebankWordTokenizer()
tokens_nltk = tokenizer.tokenize(sample_text)
print("NLTK Tokens:", tokens_nltk)

NLTK Tokens: ['I', 'recently', 'bought', 'the', 'new', 'smartphone', ',', 'and', 'I', "'m", 'impressed', '!', 'The', 'camera', 'is', 'amazing', ',', 'battery', 'lasts', 'all', 'day', '.']


In [13]:
doc = nlp(sample_text)
tokens_spacy = [token.text for token in doc]
print("spaCy Tokens:", tokens_spacy)

spaCy Tokens: ['I', 'recently', 'bought', 'the', 'new', 'smartphone', ',', 'and', 'I', "'m", 'impressed', '!', 'The', 'camera', 'is', 'amazing', ',', 'battery', 'lasts', 'all', 'day', '.']


##### Stopword Removal

What: Removing very frequent words that carry little content on their own (e.g., "is", "the", "and").

Example:
Input: ['ChatGPT', 'is', 'awesome']
Output after stopword removal: ['ChatGPT', 'awesome']
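
The filtering step itself is a simple membership test. The sketch below uses a tiny hand-picked stopword set (`TOY_STOP_WORDS`, an assumption for illustration; NLTK's English list is much longer) to show why tokens are lowercased before the lookup.

```python
# Tiny hand-picked list for illustration only, not NLTK's stopword list.
TOY_STOP_WORDS = {"a", "an", "and", "is", "the", "i", "all"}

def remove_stopwords(tokens, stop_words=TOY_STOP_WORDS):
    """Keep tokens whose lowercase form is not in stop_words."""
    return [tok for tok in tokens if tok.lower() not in stop_words]

print(remove_stopwords(["ChatGPT", "is", "awesome"]))
# ['ChatGPT', 'awesome']
print(remove_stopwords(["The", "camera", "is", "amazing"]))
# ['camera', 'amazing']
```

Without the `.lower()` call, the capitalized "The" would survive the filter, because the set only contains lowercase entries.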



In [14]:
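stop_words = set(stopwords.words('english'))
# Lowercase before the lookup so capitalized stopwords ("I", "The") are caught;
# isalpha() also drops punctuation and the clitic "'m".
filtered_tokens = [w for w in tokens_nltk if w.lower() not in stop_words and w.isalpha()]
print("Filtered Tokens:", filtered_tokens)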

Filtered Tokens: ['recently', 'bought', 'new', 'smartphone', 'impressed', 'camera', 'amazing', 'battery', 'lasts', 'day']


##### Stemming
What: Reducing words to a base form by stripping suffixes with simple rules; the result may not be a real word.

Example:
["running", "runs", "runner"] → ["run", "run", "runner"]

Stemming can be aggressive, and it does not handle irregular forms:

"universities" → "univers" (over-trimmed, not a real word)

"better" → "better" (unchanged; the link to "good" is missed)
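
As a rough illustration of rule-based suffix stripping, the sketch below applies a handful of made-up rules. It is deliberately much cruder than the Porter algorithm and is not a substitute for `PorterStemmer`; `SUFFIX_RULES` and `toy_stem` are hypothetical names.

```python
# (suffix, replacement) pairs, tried in order; the first match wins.
SUFFIX_RULES = [("ies", "i"), ("ing", ""), ("ed", ""), ("s", "")]

def toy_stem(word):
    """Strip one suffix using fixed rules (illustrative, not Porter)."""
    word = word.lower()
    for suffix, replacement in SUFFIX_RULES:
        # Only strip when at least three characters of stem remain.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: len(word) - len(suffix)] + replacement
    return word

for w in ["runs", "running", "universities", "better"]:
    print(w, "->", toy_stem(w))
```

Even these few rules show the trade-off: "runs" becomes "run", but "running" becomes "runn" and "universities" becomes "universiti". A full stemmer such as Porter adds many more rules and conditions to reduce this kind of error, yet still produces non-words.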

In [15]:
stemmer = PorterStemmer()
stemmed = [stemmer.stem(word) for word in filtered_tokens]
print("Stemmed Words:", stemmed)

Stemmed Words: ['recent', 'bought', 'new', 'smartphon', 'impress', 'camera', 'amaz', 'batteri', 'last', 'day']


##### Lemmatization
What: Converting words to their dictionary base form (lemma) using vocabulary and part-of-speech information; the result is a real word.

Example:

["running", "runs", "better"] → ["run", "run", "good"]

(Here "better" maps to "good" when it is tagged as an adjective, because it is the comparative form of "good".)
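
Conceptually, a lemmatizer looks up a word together with its part of speech. The toy table below sketches that idea; it is not how `spaCy` is implemented, and `LEMMA_TABLE` and `toy_lemmatize` are hypothetical names with hand-written entries.

```python
# Hand-written (word, part-of-speech) -> lemma entries for illustration.
LEMMA_TABLE = {
    ("running", "VERB"): "run",
    ("runs", "VERB"): "run",
    ("bought", "VERB"): "buy",
    ("better", "ADJ"): "good",
    ("better", "ADV"): "well",
}

def toy_lemmatize(word, pos):
    """Return the lemma for (word, pos), or the lowercased word if unknown."""
    key = (word.lower(), pos)
    return LEMMA_TABLE.get(key, word.lower())

print(toy_lemmatize("better", "ADJ"))   # good
print(toy_lemmatize("better", "ADV"))   # well
print(toy_lemmatize("camera", "NOUN"))  # camera
```

The same surface form can map to different lemmas depending on its part of speech, which is why lemmatization needs a tagged sentence rather than an isolated word.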



In [16]:
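# Reuse the NLTK stopword set on spaCy tokens; token.lemma_ depends on the
# part-of-speech tag spaCy assigns to each token in context.
lemmatized = [token.lemma_ for token in doc if token.text.lower() not in stop_words and token.is_alpha]
print("Lemmatized Words:", lemmatized)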

Lemmatized Words: ['recently', 'buy', 'new', 'smartphone', 'impressed', 'camera', 'amazing', 'battery', 'last', 'day']
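
To see where the two normalizations disagree, the sketch below lines up the outputs printed earlier in this notebook. The three lists are copied from those outputs; the variable names and the comparison itself are additions for illustration.

```python
# Outputs copied from the cells above (filtered tokens, stems, lemmas).
filtered = ['recently', 'bought', 'new', 'smartphone', 'impressed',
            'camera', 'amazing', 'battery', 'lasts', 'day']
stems = ['recent', 'bought', 'new', 'smartphon', 'impress',
         'camera', 'amaz', 'batteri', 'last', 'day']
lemmas = ['recently', 'buy', 'new', 'smartphone', 'impressed',
          'camera', 'amazing', 'battery', 'last', 'day']

# Collect the tokens for which stemming and lemmatization disagree.
differing = [tok for tok, s, l in zip(filtered, stems, lemmas) if s != l]

for tok, s, l in zip(filtered, stems, lemmas):
    flag = "  <- differs" if s != l else ""
    print(f"{tok:<12} stem={s:<10} lemma={l}{flag}")

print("Tokens where stem and lemma differ:", differing)
```

For this sentence, six of the ten tokens differ. The stems include non-words such as "smartphon" and "batteri", while the lemmas stay readable and resolve the irregular form "bought" to "buy".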
