# 🧹 NLP Text Cleaning Pipeline (Visualization)

```mermaid
flowchart TD
    A[📂 Read Raw Text] --> B[✂️ Tokenization<br/>Sentence + Word]
    B --> C[🔽 Lowercasing]
    C --> D[🚫 Stopword Removal]
    D --> E[✂️ Punctuation Removal]
    E --> F[🌱 Stemming]
    E --> G[🌳 Lemmatization]
    F --> H[🔗 N-gram Generation<br/>Unigram, Bigram, Trigram]
    G --> H
    H --> I[📊 Feature Extraction<br/>BoW, TF-IDF, Embeddings]
```

## 🔹 1. Read Raw Text & Lowercasing  

📂 **Load the text file** → bring data into memory.  
✂️ **Split** into sentences and words for structured processing.  
🔽 **Convert everything to lowercase** → ensures consistency (*Apple = apple*).  

➡️ This is the foundation step: get your raw input into a clean, uniform format before applying deeper NLP transformations.


In [1]:
lines = []

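# read-only mode is enough here; "r+" would also open the file for writing
with open("../data/sample.txt", "r") as f:
    for line in f:
        # lowercase, trim surrounding whitespace, and drop periods
        lines.append(line.lower().strip().replace(".", ""))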

In [2]:
print(lines)

['in july 2023, google announced a $2 billion investment in bengaluru to expand its cloud data centers', 'sundar pichai emphasized that india is one of the fastest-growing digital markets', 'meanwhile, microsoft signed a partnership with infosys to accelerate ai adoption across asia', 'apple released the iphone 15 in september 2023, and tim cook visited mumbai for the launch', 'according to the times of india, the reserve bank of india may introduce new regulations for digital payments by early 2024', '']
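The trailing `''` in the output above comes from a blank line at the end of the file. A minimal sketch of a loader that skips such lines follows; `load_lines` is a hypothetical helper, and the UTF-8 encoding is an assumption about the file. Unlike the cell above, it leaves periods in place so that punctuation can be handled in its own step.

```python
from pathlib import Path


def load_lines(path, encoding="utf-8"):
    """Return lowercased, stripped, non-empty lines from a text file."""
    cleaned = []
    with Path(path).open("r", encoding=encoding) as f:
        for raw in f:
            line = raw.strip().lower()
            if line:  # skip blank lines instead of keeping ''
                cleaned.append(line)
    return cleaned
```

Calling `load_lines("../data/sample.txt")` would then return only the non-empty sentences.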


## 🔹 2. Tokenization  
- 📝 **Manual Tokenization** → `split()`

In [3]:
cleaned = [word for line in lines for word in line.split()]

In [4]:
print(cleaned)

['in', 'july', '2023,', 'google', 'announced', 'a', '$2', 'billion', 'investment', 'in', 'bengaluru', 'to', 'expand', 'its', 'cloud', 'data', 'centers', 'sundar', 'pichai', 'emphasized', 'that', 'india', 'is', 'one', 'of', 'the', 'fastest-growing', 'digital', 'markets', 'meanwhile,', 'microsoft', 'signed', 'a', 'partnership', 'with', 'infosys', 'to', 'accelerate', 'ai', 'adoption', 'across', 'asia', 'apple', 'released', 'the', 'iphone', '15', 'in', 'september', '2023,', 'and', 'tim', 'cook', 'visited', 'mumbai', 'for', 'the', 'launch', 'according', 'to', 'the', 'times', 'of', 'india,', 'the', 'reserve', 'bank', 'of', 'india', 'may', 'introduce', 'new', 'regulations', 'for', 'digital', 'payments', 'by', 'early', '2024']
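Note that `split()` leaves punctuation attached to tokens (`'2023,'`, `'meanwhile,'`, `'india,'`). One possible alternative, sketched here with the standard `re` module, extracts word-like runs instead. The pattern and the name `simple_tokenize` are illustrative assumptions, not part of the original notebook.

```python
import re

# Optional leading "$", then word characters, allowing internal hyphens
TOKEN_PATTERN = re.compile(r"\$?\w+(?:-\w+)*")


def simple_tokenize(text):
    """Return word-like tokens, dropping surrounding punctuation."""
    return TOKEN_PATTERN.findall(text)


print(simple_tokenize("in july 2023, google announced a $2 billion investment"))
print(simple_tokenize("india is one of the fastest-growing digital markets"))
```

The hyphen group keeps `fastest-growing` as one token, and the optional `$` keeps `$2` intact.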


## 🔹 3. Stopword Removal  
🚫 Remove high-frequency, low-information words (*the, is, at, on, and…*).  
➡️ Keeps the focus on **meaningful tokens**.

In [5]:
!pip install nltk


In [6]:
import nltk
nltk.download("stopwords")
from nltk.corpus import stopwords

[nltk_data] Downloading package stopwords to
[nltk_data]     /home/priyesh/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [7]:
stopwords_list = stopwords.words("english")

In [8]:
print(stopwords_list)

['a', 'about', 'above', 'after', 'again', 'against', 'ain', 'all', 'am', 'an', 'and', 'any', 'are', 'aren', "aren't", 'as', 'at', 'be', 'because', 'been', 'before', 'being', 'below', 'between', 'both', 'but', 'by', 'can', 'couldn', "couldn't", 'd', 'did', 'didn', "didn't", 'do', 'does', 'doesn', "doesn't", 'doing', 'don', "don't", 'down', 'during', 'each', 'few', 'for', 'from', 'further', 'had', 'hadn', "hadn't", 'has', 'hasn', "hasn't", 'have', 'haven', "haven't", 'having', 'he', "he'd", "he'll", 'her', 'here', 'hers', 'herself', "he's", 'him', 'himself', 'his', 'how', 'i', "i'd", 'if', "i'll", "i'm", 'in', 'into', 'is', 'isn', "isn't", 'it', "it'd", "it'll", "it's", 'its', 'itself', "i've", 'just', 'll', 'm', 'ma', 'me', 'mightn', "mightn't", 'more', 'most', 'mustn', "mustn't", 'my', 'myself', 'needn', "needn't", 'no', 'nor', 'not', 'now', 'o', 'of', 'off', 'on', 'once', 'only', 'or', 'other', 'our', 'ours', 'ourselves', 'out', 'over', 'own', 're', 's', 'same', 'shan', "shan't", 'she

In [9]:
cleaned_stopwords = [word for word in cleaned if word not in stopwords_list]

In [10]:
print(cleaned_stopwords)

['july', '2023,', 'google', 'announced', '$2', 'billion', 'investment', 'bengaluru', 'expand', 'cloud', 'data', 'centers', 'sundar', 'pichai', 'emphasized', 'india', 'one', 'fastest-growing', 'digital', 'markets', 'meanwhile,', 'microsoft', 'signed', 'partnership', 'infosys', 'accelerate', 'ai', 'adoption', 'across', 'asia', 'apple', 'released', 'iphone', '15', 'september', '2023,', 'tim', 'cook', 'visited', 'mumbai', 'launch', 'according', 'times', 'india,', 'reserve', 'bank', 'india', 'may', 'introduce', 'new', 'regulations', 'digital', 'payments', 'early', '2024']


In [11]:
print(cleaned)

['in', 'july', '2023,', 'google', 'announced', 'a', '$2', 'billion', 'investment', 'in', 'bengaluru', 'to', 'expand', 'its', 'cloud', 'data', 'centers', 'sundar', 'pichai', 'emphasized', 'that', 'india', 'is', 'one', 'of', 'the', 'fastest-growing', 'digital', 'markets', 'meanwhile,', 'microsoft', 'signed', 'a', 'partnership', 'with', 'infosys', 'to', 'accelerate', 'ai', 'adoption', 'across', 'asia', 'apple', 'released', 'the', 'iphone', '15', 'in', 'september', '2023,', 'and', 'tim', 'cook', 'visited', 'mumbai', 'for', 'the', 'launch', 'according', 'to', 'the', 'times', 'of', 'india,', 'the', 'reserve', 'bank', 'of', 'india', 'may', 'introduce', 'new', 'regulations', 'for', 'digital', 'payments', 'by', 'early', '2024']


In [12]:
len(cleaned_stopwords)

55
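Membership tests against a Python list scan the whole list, so a `set` is the more usual container for stopwords. The sketch below also strips leading and trailing punctuation before the lookup, so a token such as `'the,'` would still be recognized. `SAMPLE_STOPWORDS` is a small hand-picked stand-in for `stopwords.words("english")`, used only to keep the example self-contained.

```python
import string

# Hypothetical stand-in for stopwords.words("english")
SAMPLE_STOPWORDS = {"a", "in", "to", "its", "the", "of", "is", "and", "for"}


def remove_stopwords(tokens, stopwords=SAMPLE_STOPWORDS):
    """Drop stopwords and punctuation-only tokens; keep everything else as-is."""
    kept = []
    for token in tokens:
        core = token.strip(string.punctuation)  # 'the,' -> 'the'
        if core and core not in stopwords:
            kept.append(token)
    return kept


tokens = "in july 2023, google announced a $2 billion investment".split()
print(remove_stopwords(tokens))
```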

In [13]:
customer_reviews = ['sam was a great help to me in the store', 
                    'the cashier was very rude to me, I think her name was eleanor', 
                    'amazing work from sadeen!', 
                    'sarah was able to help me find the items i needed quickly', 
                    'lucy is such a great addition to the team', 
                    'great service from sara she found me what i wanted'
                   ]

## 🔹 4. Punctuation Removal  
✂️ Remove punctuation marks (`.,!?;:"()[]…`).  
➡️ Reduces noise for models like BoW, TF-IDF, or embeddings. 

In [14]:
import re

In [15]:
# remove punctuation from a list of sentences (keeps letters, digits, underscores, whitespace)
punc_cleaned = [re.sub(r"[^\w\s]", "", review) for review in customer_reviews]

In [16]:
punc_cleaned

['sam was a great help to me in the store',
 'the cashier was very rude to me I think her name was eleanor',
 'amazing work from sadeen',
 'sarah was able to help me find the items i needed quickly',
 'lucy is such a great addition to the team',
 'great service from sara she found me what i wanted']

In [17]:
customer_reviews

['sam was a great help to me in the store',
 'the cashier was very rude to me, I think her name was eleanor',
 'amazing work from sadeen!',
 'sarah was able to help me find the items i needed quickly',
 'lucy is such a great addition to the team',
 'great service from sara she found me what i wanted']
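For comparison, here is a sketch of two common ways to strip punctuation using only the standard library. The regex used above removes anything that is not a word character or whitespace, so it also removes non-ASCII punctuation but keeps underscores. `str.translate` with `string.punctuation` covers ASCII punctuation only, underscore included. The function names are illustrative.

```python
import re
import string

# Translation table that deletes every ASCII punctuation character
ASCII_PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def strip_punct_regex(text):
    """Remove every character that is neither a word character nor whitespace."""
    return re.sub(r"[^\w\s]", "", text)


def strip_punct_translate(text):
    """Remove ASCII punctuation only (including the underscore)."""
    return text.translate(ASCII_PUNCT_TABLE)


for sample in ["rude to me, I think!", "snake_case", "I\u2019m here"]:
    print(repr(strip_punct_regex(sample)), repr(strip_punct_translate(sample)))
```

Which variant fits depends on whether the text contains typographic punctuation or identifiers with underscores.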

In [18]:
# remove punctuation from a list of tokens (here: the stopword list)
stopwords_list_punc_cleaned = [re.sub(r"[^\w\s]", "", word) for word in stopwords_list]

In [19]:
print(stopwords_list_punc_cleaned)

['a', 'about', 'above', 'after', 'again', 'against', 'ain', 'all', 'am', 'an', 'and', 'any', 'are', 'aren', 'arent', 'as', 'at', 'be', 'because', 'been', 'before', 'being', 'below', 'between', 'both', 'but', 'by', 'can', 'couldn', 'couldnt', 'd', 'did', 'didn', 'didnt', 'do', 'does', 'doesn', 'doesnt', 'doing', 'don', 'dont', 'down', 'during', 'each', 'few', 'for', 'from', 'further', 'had', 'hadn', 'hadnt', 'has', 'hasn', 'hasnt', 'have', 'haven', 'havent', 'having', 'he', 'hed', 'hell', 'her', 'here', 'hers', 'herself', 'hes', 'him', 'himself', 'his', 'how', 'i', 'id', 'if', 'ill', 'im', 'in', 'into', 'is', 'isn', 'isnt', 'it', 'itd', 'itll', 'its', 'its', 'itself', 'ive', 'just', 'll', 'm', 'ma', 'me', 'mightn', 'mightnt', 'more', 'most', 'mustn', 'mustnt', 'my', 'myself', 'needn', 'neednt', 'no', 'nor', 'not', 'now', 'o', 'of', 'off', 'on', 'once', 'only', 'or', 'other', 'our', 'ours', 'ourselves', 'out', 'over', 'own', 're', 's', 'same', 'shan', 'shant', 'she', 'shed', 'shell', 'sh

In [20]:
print(stopwords_list)

['a', 'about', 'above', 'after', 'again', 'against', 'ain', 'all', 'am', 'an', 'and', 'any', 'are', 'aren', "aren't", 'as', 'at', 'be', 'because', 'been', 'before', 'being', 'below', 'between', 'both', 'but', 'by', 'can', 'couldn', "couldn't", 'd', 'did', 'didn', "didn't", 'do', 'does', 'doesn', "doesn't", 'doing', 'don', "don't", 'down', 'during', 'each', 'few', 'for', 'from', 'further', 'had', 'hadn', "hadn't", 'has', 'hasn', "hasn't", 'have', 'haven', "haven't", 'having', 'he', "he'd", "he'll", 'her', 'here', 'hers', 'herself', "he's", 'him', 'himself', 'his', 'how', 'i', "i'd", 'if', "i'll", "i'm", 'in', 'into', 'is', 'isn', "isn't", 'it', "it'd", "it'll", "it's", 'its', 'itself', "i've", 'just', 'll', 'm', 'ma', 'me', 'mightn', "mightn't", 'more', 'most', 'mustn', "mustn't", 'my', 'myself', 'needn', "needn't", 'no', 'nor', 'not', 'now', 'o', 'of', 'off', 'on', 'once', 'only', 'or', 'other', 'our', 'ours', 'ourselves', 'out', 'over', 'own', 're', 's', 'same', 'shan', "shan't", 'she
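Stripping every apostrophe makes some contractions collide with ordinary words or with each other: in the cleaned list above, `he'll` becomes `hell`, `i'll` becomes `ill`, and `it's` becomes a second `its`. A possible workaround, sketched below as an assumption rather than the notebook's method, removes punctuation but keeps apostrophes that sit between two word characters.

```python
import re

# Remove punctuation, but keep apostrophes that sit between two word characters
PUNCT_EXCEPT_INNER_APOSTROPHE = re.compile(r"[^\w\s']|(?<!\w)'|'(?!\w)")


def strip_punct_keep_contractions(text):
    """Strip punctuation while leaving contractions such as "he'll" intact."""
    return PUNCT_EXCEPT_INNER_APOSTROPHE.sub("", text)


for sample in ["he'll", "it's", "'quoted'", "rude to me, I think!"]:
    print(repr(sample), "->", repr(strip_punct_keep_contractions(sample)))
```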

## 🔹 5. Tokenization with NLTK  
- 📝 **Sentence Tokenization** → `sent_tokenize()`  
- 🔤 **Word Tokenization** → `word_tokenize()`  

➡️ Converts raw text into sentences and words (tokens). 

In [21]:
# tokenization
nltk.download("punkt")
nltk.download("punkt_tab")

[nltk_data] Downloading package punkt to /home/priyesh/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package punkt_tab to
[nltk_data]     /home/priyesh/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!


True

In [22]:
from nltk.tokenize import word_tokenize, sent_tokenize

para = """In September 2023, Priyesh moved to Bengaluru to join an AI research lab. 
He said, "I’m excited to explore natural language processing, deep learning, and data engineering!"
During weekends, he often visits bookstores, drinks coffee at Church Street, and writes blogs about technology.
"""

In [23]:
# sentence tokenizer
tokenize_sentence = sent_tokenize(para)

In [24]:
tokenize_sentence

['In September 2023, Priyesh moved to Bengaluru to join an AI research lab.',
 'He said, "I’m excited to explore natural language processing, deep learning, and data engineering!"',
 'During weekends, he often visits bookstores, drinks coffee at Church Street, and writes blogs about technology.']

In [25]:
# word tokenizer
tokenize_words = word_tokenize(para)

In [26]:
print(tokenize_words)

['In', 'September', '2023', ',', 'Priyesh', 'moved', 'to', 'Bengaluru', 'to', 'join', 'an', 'AI', 'research', 'lab', '.', 'He', 'said', ',', '``', 'I', '’', 'm', 'excited', 'to', 'explore', 'natural', 'language', 'processing', ',', 'deep', 'learning', ',', 'and', 'data', 'engineering', '!', "''", 'During', 'weekends', ',', 'he', 'often', 'visits', 'bookstores', ',', 'drinks', 'coffee', 'at', 'Church', 'Street', ',', 'and', 'writes', 'blogs', 'about', 'technology', '.']
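In the output above, the curly apostrophe in `I’m` is split into three tokens (`'I'`, `'’'`, `'m'`). One way to avoid that is to normalize typographic quotes before tokenizing, sketched below under the assumption that mapping them to ASCII is acceptable for the task. `normalize_quotes` is a hypothetical helper.

```python
# Map common typographic quotes to their ASCII equivalents
QUOTE_TABLE = str.maketrans({
    "\u2018": "'",  # left single quotation mark
    "\u2019": "'",  # right single quotation mark / apostrophe
    "\u201c": '"',  # left double quotation mark
    "\u201d": '"',  # right double quotation mark
})


def normalize_quotes(text):
    """Replace curly quotes and apostrophes with straight ASCII ones."""
    return text.translate(QUOTE_TABLE)


print(normalize_quotes("He said, \u201cI\u2019m excited!\u201d"))
```

With an ASCII apostrophe, `word_tokenize` typically splits the contraction into `I` and `'m` rather than three separate pieces.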


## 🔹 6. Normalization  
- 🌱 **Stemming** → crude root form (*studies → studi*).  
- 🌳 **Lemmatization** → dictionary form (*running → run*).  

➡️ Shrinks vocabulary & improves generalization.
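To make the contrast concrete, here is a deliberately tiny sketch: a toy suffix-stripper stands in for stemming and a hand-written lookup table stands in for lemmatization. Neither is the Porter algorithm or WordNet; the rules, the dictionary entries, and the function names are all made up for illustration.

```python
# Toy rules: (suffix to remove, replacement), tried in order
SUFFIX_RULES = [("ies", "i"), ("ing", ""), ("ed", ""), ("s", "")]

# Toy dictionary mapping inflected forms to their dictionary form
TOY_LEMMAS = {"studies": "study", "running": "run", "better": "good"}


def toy_stem(word):
    """Chop the first matching suffix, keeping at least three characters."""
    word = word.lower()
    for suffix, replacement in SUFFIX_RULES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: len(word) - len(suffix)] + replacement
    return word


def toy_lemmatize(word):
    """Look the word up in the dictionary; fall back to the lowercased word."""
    return TOY_LEMMAS.get(word.lower(), word.lower())


for w in ["studies", "running", "better"]:
    print(w, "| stem:", toy_stem(w), "| lemma:", toy_lemmatize(w))
```

The stemmer produces non-words such as `studi` and `runn` because it only manipulates suffixes, while the lookup returns real words but only for forms it knows.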

In [27]:
from nltk.stem import PorterStemmer

In [28]:
pps = PorterStemmer()

In [29]:
# stem() expects a single token; given the whole paragraph it only lowercases the text
pps.stem(para)

'in september 2023, priyesh moved to bengaluru to join an ai research lab. \nhe said, "i’m excited to explore natural language processing, deep learning, and data engineering!"\nduring weekends, he often visits bookstores, drinks coffee at church street, and writes blogs about technology.\n'

In [30]:
# stemming, applied here to the punctuation-stripped stopword list
tokenize_stemmed = [pps.stem(word) for word in stopwords_list_punc_cleaned]

In [31]:
print(tokenize_stemmed)

['a', 'about', 'abov', 'after', 'again', 'against', 'ain', 'all', 'am', 'an', 'and', 'ani', 'are', 'aren', 'arent', 'as', 'at', 'be', 'becaus', 'been', 'befor', 'be', 'below', 'between', 'both', 'but', 'by', 'can', 'couldn', 'couldnt', 'd', 'did', 'didn', 'didnt', 'do', 'doe', 'doesn', 'doesnt', 'do', 'don', 'dont', 'down', 'dure', 'each', 'few', 'for', 'from', 'further', 'had', 'hadn', 'hadnt', 'ha', 'hasn', 'hasnt', 'have', 'haven', 'havent', 'have', 'he', 'hed', 'hell', 'her', 'here', 'her', 'herself', 'he', 'him', 'himself', 'hi', 'how', 'i', 'id', 'if', 'ill', 'im', 'in', 'into', 'is', 'isn', 'isnt', 'it', 'itd', 'itll', 'it', 'it', 'itself', 'ive', 'just', 'll', 'm', 'ma', 'me', 'mightn', 'mightnt', 'more', 'most', 'mustn', 'mustnt', 'my', 'myself', 'needn', 'neednt', 'no', 'nor', 'not', 'now', 'o', 'of', 'off', 'on', 'onc', 'onli', 'or', 'other', 'our', 'our', 'ourselv', 'out', 'over', 'own', 're', 's', 'same', 'shan', 'shant', 'she', 'shed', 'shell', 'she', 'should', 'shouldn',

In [32]:
tokens_connect = ['connecting', 'connected', 'connectivity', 'connect', 'connects']
for token in tokens_connect:
    print(token, "→", pps.stem(token))

connecting → connect
connected → connect
connectivity → connect
connect → connect
connects → connect


In [33]:
tokens_learn = ['learned', 'learning', 'learn', 'learns', 'learner', 'learners']
for token in tokens_learn:
    print(token, "→", pps.stem(token))

learned → learn
learning → learn
learn → learn
learns → learn
learner → learner
learners → learner


In [34]:
tokens_misc = ['likes', 'better', 'worse']
for token in tokens_misc:
    print(token, "→", pps.stem(token))

likes → like
better → better
worse → wors


In [35]:
nltk.download('wordnet')
nltk.download('omw-1.4')

[nltk_data] Downloading package wordnet to /home/priyesh/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package omw-1.4 to /home/priyesh/nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!


True

In [36]:
from nltk.stem import WordNetLemmatizer

lem = WordNetLemmatizer()

In [37]:
# lemmatization (lemmatize() defaults to pos="n", so verb forms are left unchanged below)
tokens_connect = ['connecting', 'connected', 'connectivity', 'connect', 'connects']
for token in tokens_connect:
    print(token, "→", lem.lemmatize(token))

connecting → connecting
connected → connected
connectivity → connectivity
connect → connect
connects → connects


In [38]:
tokens_learn = ['learned', 'learning', 'learn', 'learns', 'learner', 'learners']
for token in tokens_learn:
    print(token, "→", lem.lemmatize(token))

learned → learned
learning → learning
learn → learn
learns → learns
learner → learner
learners → learner


In [39]:
tokens_misc = ['likes', 'better', 'worse']
for token in tokens_misc:
    print(token, "→", lem.lemmatize(token))

likes → like
better → better
worse → worse
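`WordNetLemmatizer.lemmatize` treats its input as a noun unless a `pos` argument is given, which is why `connecting`, `learned`, and `worse` come back unchanged above while the plural forms `likes` and `learners` are reduced. A common companion is a small helper that converts Penn Treebank tags (the kind produced by a part-of-speech tagger) into WordNet's single-letter codes. The sketch below is one such mapping; `to_wordnet_pos` is a hypothetical name.

```python
def to_wordnet_pos(treebank_tag):
    """Map a Penn Treebank tag (e.g. 'VBG') to a WordNet POS letter."""
    if treebank_tag.startswith("J"):
        return "a"  # adjective
    if treebank_tag.startswith("V"):
        return "v"  # verb
    if treebank_tag.startswith("R"):
        return "r"  # adverb
    return "n"  # everything else falls back to noun, lemmatize()'s own default


for tag in ["VBG", "JJR", "RB", "NNS"]:
    print(tag, "->", to_wordnet_pos(tag))
```

The returned letter can be passed as `lem.lemmatize(token, pos=...)`; with `pos="v"`, for instance, `connecting` should reduce to `connect`.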


In [40]:
print(tokenize_stemmed)

['a', 'about', 'abov', 'after', 'again', 'against', 'ain', 'all', 'am', 'an', 'and', 'ani', 'are', 'aren', 'arent', 'as', 'at', 'be', 'becaus', 'been', 'befor', 'be', 'below', 'between', 'both', 'but', 'by', 'can', 'couldn', 'couldnt', 'd', 'did', 'didn', 'didnt', 'do', 'doe', 'doesn', 'doesnt', 'do', 'don', 'dont', 'down', 'dure', 'each', 'few', 'for', 'from', 'further', 'had', 'hadn', 'hadnt', 'ha', 'hasn', 'hasnt', 'have', 'haven', 'havent', 'have', 'he', 'hed', 'hell', 'her', 'here', 'her', 'herself', 'he', 'him', 'himself', 'hi', 'how', 'i', 'id', 'if', 'ill', 'im', 'in', 'into', 'is', 'isn', 'isnt', 'it', 'itd', 'itll', 'it', 'it', 'itself', 'ive', 'just', 'll', 'm', 'ma', 'me', 'mightn', 'mightnt', 'more', 'most', 'mustn', 'mustnt', 'my', 'myself', 'needn', 'neednt', 'no', 'nor', 'not', 'now', 'o', 'of', 'off', 'on', 'onc', 'onli', 'or', 'other', 'our', 'our', 'ourselv', 'out', 'over', 'own', 're', 's', 'same', 'shan', 'shant', 'she', 'shed', 'shell', 'she', 'should', 'shouldn',

## 🔹 7. N-gram Generation  
- **Unigram** → single words (*“data”*).  
- **Bigram** → word pairs (*“data science”*).  
- **Trigram** → three-word phrases (*“new york city”*).  

➡️ Captures short-range context useful in tasks like sentiment analysis (*“not good”*).  
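The sliding-window idea behind n-grams can be sketched in a few lines of plain Python. `make_ngrams` is an illustrative helper, not an NLTK function, and the example sentence is made up.

```python
def make_ngrams(tokens, n):
    """Return all contiguous n-token windows as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


tokens = "the service was not good".split()
print(make_ngrams(tokens, 1))  # unigrams
print(make_ngrams(tokens, 2))  # bigrams, including ('not', 'good')
print(make_ngrams(tokens, 3))  # trigrams
```

A unigram model sees `not` and `good` as unrelated tokens, while the bigram `('not', 'good')` keeps the negation attached to the word it modifies.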

In [41]:
# n-grams, computed below over tokenize_stemmed (the stemmed stopword list)
from nltk.util import ngrams
from collections import Counter

In [42]:
unigrams = list(ngrams(tokenize_stemmed, 1))
unigrams_counter = Counter(unigrams)
unigrams_counter.most_common(5)

[(('it',), 3), (('your',), 3), (('be',), 2), (('do',), 2), (('have',), 2)]

In [43]:
bigrams = list(ngrams(tokenize_stemmed, 2))
bigrams_counter = Counter(bigrams)
bigrams_counter.most_common(5)

[(('your', 'your'), 2),
 (('a', 'about'), 1),
 (('about', 'abov'), 1),
 (('abov', 'after'), 1),
 (('after', 'again'), 1)]

In [44]:
trigrams = list(ngrams(tokenize_stemmed, 3))
trigrams_counter = Counter(trigrams)
trigrams_counter.most_common(5)

[(('a', 'about', 'abov'), 1),
 (('about', 'abov', 'after'), 1),
 (('abov', 'after', 'again'), 1),
 (('after', 'again', 'against'), 1),
 (('again', 'against', 'ain'), 1)]
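The counts above are computed over the stemmed stopword list, which is alphabetical, so almost every n-gram occurs exactly once. As a rough illustration of how counts behave on running text, the sketch below counts bigrams within each of the punctuation-cleaned reviews from step 4, one review at a time so that no bigram spans two reviews. Lowercasing the reviews first is an added assumption.

```python
from collections import Counter

reviews = [
    "sam was a great help to me in the store",
    "the cashier was very rude to me I think her name was eleanor",
    "amazing work from sadeen",
    "sarah was able to help me find the items i needed quickly",
    "lucy is such a great addition to the team",
    "great service from sara she found me what i wanted",
]

bigram_counts = Counter()
for review in reviews:
    tokens = review.lower().split()
    # Pair each token with its successor inside the same review
    bigram_counts.update(zip(tokens, tokens[1:]))

# Bigrams that occur more than once across all reviews
repeated = sorted(pair for pair, count in bigram_counts.items() if count > 1)
print(repeated)
```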