## Text Preprocessing and Linguistic Basics

#### Aim: To understand how raw text is converted into clean, structured data using NLP preprocessing techniques

#### Techniques
- Tokenization
- Stopword Removal
- Punctuation Removal
- Stemming
- Lemmatization
- Morphology 
- Frequency Analysis 


#### Theory
Raw text is noisy. Machines cannot directly handle punctuation, different forms of the same word, or very common filler words. Preprocessing normalizes the text so that it is suitable for analysis.

#### Step 1: Input Text (Raw Text)

In [38]:
text= """
In 2026, text preprocessing has evolved to prioritize context-aware noise reduction, ensuring that LLMs interpret subtle linguistic nuances accurately. Modern practitioners now integrate ethical debiasing directly into the cleaning phase to mitigate algorithmic prejudice before model training begins. By mastering these refined cleaning pipelines, students bridge the gap between raw, messy human dialogue and sophisticated machine intelligence.
"""

print("Raw Text:\n", text)

Raw Text:
 
In 2026, text preprocessing has evolved to prioritize context-aware noise reduction, ensuring that LLMs interpret subtle linguistic nuances accurately. Modern practitioners now integrate ethical debiasing directly into the cleaning phase to mitigate algorithmic prejudice before model training begins. By mastering these refined cleaning pipelines, students bridge the gap between raw, messy human dialogue and sophisticated machine intelligence.



#### Step 2: Text Cleaning

##### Cleaning means removing unnecessary symbols and normalizing text.
Why?
1. "Language" and "language" should be treated as the same word.
2. Punctuation such as .,!? carries little meaning in many NLP tasks.

In [39]:
import re # Regular expression library used to search and modify text

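# Intended goal: remove everything except letters and spaces.
# What the pattern below actually matches:
# [^a-zA-Z] is one character that is NOT a letter, and the space after the
# closing bracket is a literal space outside the character class.
# So only a non-letter that is immediately followed by a space is matched,
# and re.sub deletes both characters. This is why ", " and ". " disappear
# (fusing neighboring words, e.g. "2026text"), while digits, the hyphen in
# "context-aware", and the final "." survive in the output.

clean_text = re.sub(r'[^a-zA-Z] ', '', text)
# Signature: re.sub(pattern, repl, string, count=0, flags=0)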

# Convert all text to lowercase 
# This avoids treating "NLP" and "nlp" as different words 

clean_text = clean_text.lower()

print("Cleaned Text:\n", clean_text)

Cleaned Text:
 
in 2026text preprocessing has evolved to prioritize context-aware noise reductionensuring that llms interpret subtle linguistic nuances accuratelymodern practitioners now integrate ethical debiasing directly into the cleaning phase to mitigate algorithmic prejudice before model training beginsby mastering these refined cleaning pipelinesstudents bridge the gap between rawmessy human dialogue and sophisticated machine intelligence.
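
The fused tokens in the output above (for example `2026text` and `reductionensuring`) appear because the pattern deletes a punctuation mark together with the space that follows it. Below is a minimal sketch of one alternative, assuming the goal is to keep only letters separated by single spaces. The helper name `clean_keep_letters` and the sample sentence are illustrative and not part of the notebook.

```python
import re

def clean_keep_letters(raw: str) -> str:
    """Lowercase the text and keep only ASCII letters separated by single spaces."""
    # Replace every run of non-letter characters with ONE space, so that
    # removing ", " or ". " never glues two neighboring words together.
    spaced = re.sub(r"[^a-zA-Z]+", " ", raw)
    # Trim leading/trailing spaces and normalize case.
    return spaced.strip().lower()

sample = "In 2026, text preprocessing has evolved. Modern NLP is context-aware!"
print(clean_keep_letters(sample))
```

This variant also drops digits and splits `context-aware` into two words; whether that is desirable depends on the task.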



#### Step 3: Tokenization


##### Tokenization breaks text into smaller pieces called tokens.
Tokens can be sentences or words.

Most NLP pipelines do not operate on whole paragraphs directly.

They operate on tokens.

In [40]:
import nltk
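# nltk.download('punkt')      # Punkt tokenizer models (older NLTK releases)
# nltk.download('punkt_tab')  # Punkt data in the format newer NLTK releases expect

from nltk.tokenize import sent_tokenize, word_tokenize

# Sentence tokenization -> splits the paragraph into sentences using punctuation rules
sentences = sent_tokenize(text)

# Word tokenization -> splits the cleaned text into word and punctuation tokens
words = word_tokenize(clean_text)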

print("\nSentence Tokens:")
for s in sentences:
    print("-",s)

print("\nWord Tokens:")
print(words)


Sentence Tokens:
- 
In 2026, text preprocessing has evolved to prioritize context-aware noise reduction, ensuring that LLMs interpret subtle linguistic nuances accurately.
- Modern practitioners now integrate ethical debiasing directly into the cleaning phase to mitigate algorithmic prejudice before model training begins.
- By mastering these refined cleaning pipelines, students bridge the gap between raw, messy human dialogue and sophisticated machine intelligence.

Word Tokens:
['in', '2026text', 'preprocessing', 'has', 'evolved', 'to', 'prioritize', 'context-aware', 'noise', 'reductionensuring', 'that', 'llms', 'interpret', 'subtle', 'linguistic', 'nuances', 'accuratelymodern', 'practitioners', 'now', 'integrate', 'ethical', 'debiasing', 'directly', 'into', 'the', 'cleaning', 'phase', 'to', 'mitigate', 'algorithmic', 'prejudice', 'before', 'model', 'training', 'beginsby', 'mastering', 'these', 'refined', 'cleaning', 'pipelinesstudents', 'bridge', 'the', 'gap', 'between', 'rawmessy'
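
To make the idea of a token concrete without depending on NLTK, here is a rough sketch of rule-based splitting with regular expressions. It is a simplification and not how `sent_tokenize` or `word_tokenize` work internally. The function names `simple_sentences` and `simple_words` are made up for this illustration.

```python
import re

def simple_sentences(paragraph: str) -> list:
    """Naive rule: split after '.', '!' or '?' when followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", paragraph.strip())
    return [p for p in parts if p]

def simple_words(sentence: str) -> list:
    """Return words (keeping inner hyphens/apostrophes) and single punctuation marks."""
    return re.findall(r"\w+(?:[-']\w+)*|[^\w\s]", sentence)

sample = "Raw text is noisy. Preprocessing makes it context-aware, clean and usable!"
for sentence in simple_sentences(sample):
    print(simple_words(sentence))
```

A rule this simple breaks on abbreviations such as "e.g." or "Dr.", which is one reason trained tokenizers are normally preferred.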

#### Step 4: Stopword Removal
Stopwords are very common words such as: is, am, are, the, in, and, to, of.

These words appear frequently but carry little meaning on their own.

In [41]:
# nltk.download('stopwords')
from nltk.corpus import stopwords

stopWords = set(stopwords.words('english'))

# Remove stopwords from the tokens.
# For each word, check whether its lowercase form is in the stopWords set;
# if it is not, keep the word.

filteredWords = []  # Tokens that survive the filter
for w in words:
    if w.lower() not in stopWords:
        filteredWords.append(w)

print("\nAfter Stopword Removal")
print(filteredWords)


After Stopword Removal
['2026text', 'preprocessing', 'evolved', 'prioritize', 'context-aware', 'noise', 'reductionensuring', 'llms', 'interpret', 'subtle', 'linguistic', 'nuances', 'accuratelymodern', 'practitioners', 'integrate', 'ethical', 'debiasing', 'directly', 'cleaning', 'phase', 'mitigate', 'algorithmic', 'prejudice', 'model', 'training', 'beginsby', 'mastering', 'refined', 'cleaning', 'pipelinesstudents', 'bridge', 'gap', 'rawmessy', 'human', 'dialogue', 'sophisticated', 'machine', 'intelligence', '.']
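
The same filtering step can be written as a list comprehension. The sketch below is self-contained and uses a tiny hand-picked stopword set taken from the examples in this step. That set and the function name `remove_stopwords` are assumptions for illustration, and NLTK's English list is much longer.

```python
# A tiny hand-picked stopword set, assumed for illustration only.
STOPWORDS = {"is", "am", "are", "the", "in", "and", "to", "of"}

def remove_stopwords(tokens, stopwords=STOPWORDS):
    """Keep tokens whose lowercase form is not in the stopword set."""
    # Membership tests on a set take constant time on average,
    # which is why the stopwords are stored in a set rather than a list.
    return [t for t in tokens if t.lower() not in stopwords]

tokens = ["The", "gap", "between", "raw", "text", "and", "machine", "intelligence"]
print(remove_stopwords(tokens))
```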


#### Step 5: Stemming
Stemming cuts word endings to approximate a root form.

It uses simple rules, not a dictionary.

Examples:
1. playing -> play
2. studies -> studi (not a valid English word, but acceptable for machines)

In [42]:
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

stemmed_words = []
for w in filteredWords:
    root = stemmer.stem(w)   # Removes suffix based on algorithm rules
    stemmed_words.append(root)

print("\nStemmed Words:")
print(stemmed_words)



Stemmed Words:
['2026text', 'preprocess', 'evolv', 'priorit', 'context-awar', 'nois', 'reductionensur', 'llm', 'interpret', 'subtl', 'linguist', 'nuanc', 'accuratelymodern', 'practition', 'integr', 'ethic', 'debias', 'directli', 'clean', 'phase', 'mitig', 'algorithm', 'prejudic', 'model', 'train', 'beginsbi', 'master', 'refin', 'clean', 'pipelinesstud', 'bridg', 'gap', 'rawmessi', 'human', 'dialogu', 'sophist', 'machin', 'intellig', '.']
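
To show what "simple rules, not a dictionary" means in practice, here is a toy suffix-stripping stemmer. It is not the Porter algorithm, which has many more rules and conditions. The rule list and the name `toy_stem` are made up for this example.

```python
# Ordered (suffix, replacement) rules; the first matching rule wins.
# This rule list is a teaching example, far smaller than Porter's.
RULES = [("ies", "i"), ("ing", ""), ("ed", ""), ("s", "")]

def toy_stem(word: str) -> str:
    """Strip one suffix if the remaining stem keeps at least 3 characters."""
    for suffix, replacement in RULES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: len(word) - len(suffix)] + replacement
    return word

for w in ["playing", "studies", "cleaning", "refined", "pipelines", "gap"]:
    print(w, "->", toy_stem(w))
```

Because no dictionary is consulted, results such as `studi` or `refin` are not guaranteed to be real words.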


#### Step 6: Lemmatization
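Lemmatization converts a word to its dictionary form (lemma).

It considers grammar and meaning, so it works best when the part of speech is known.

Examples:
1. playing -> play (as a verb)
2. better -> good (as an adjective)

It is more accurate but slower than stemming.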

In [43]:
# nltk.download('wordnet')
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()

lemmatizedWords =[]

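for w in filteredWords:
    # With no pos argument, lemmatize() treats every word as a noun,
    # so verb forms such as 'evolved' are left unchanged.
    lemma = lemmatizer.lemmatize(w)
    lemmatizedWords.append(lemma)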

print("\nLemmatized Words:")
print(lemmatizedWords)


Lemmatized Words:
['2026text', 'preprocessing', 'evolved', 'prioritize', 'context-aware', 'noise', 'reductionensuring', 'llm', 'interpret', 'subtle', 'linguistic', 'nuance', 'accuratelymodern', 'practitioner', 'integrate', 'ethical', 'debiasing', 'directly', 'cleaning', 'phase', 'mitigate', 'algorithmic', 'prejudice', 'model', 'training', 'beginsby', 'mastering', 'refined', 'cleaning', 'pipelinesstudents', 'bridge', 'gap', 'rawmessy', 'human', 'dialogue', 'sophisticated', 'machine', 'intelligence', '.']
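
In the output above, plural nouns such as `nuances` are reduced while `evolved` is left untouched. The sketch below illustrates why the part of speech matters, using a hypothetical mini-lexicon. `LEMMAS` and `toy_lemmatize` are invented for this example, and a real lemmatizer relies on a far larger dictionary plus morphological rules.

```python
# Hypothetical mini-lexicon: (surface form, part of speech) -> lemma.
LEMMAS = {
    ("playing", "v"): "play",
    ("evolved", "v"): "evolve",
    ("better", "a"): "good",
    ("nuances", "n"): "nuance",
    ("practitioners", "n"): "practitioner",
}

def toy_lemmatize(word: str, pos: str = "n") -> str:
    """Look up (word, pos); fall back to the word itself when unknown."""
    return LEMMAS.get((word.lower(), pos), word)

print(toy_lemmatize("nuances"))        # noun lookup succeeds
print(toy_lemmatize("evolved"))        # looked up as a noun: left unchanged
print(toy_lemmatize("evolved", "v"))   # looked up as a verb: reduced to its lemma
print(toy_lemmatize("better", "a"))    # adjective lookup
```

In a full pipeline the `pos` value would typically come from a part-of-speech tagger rather than being typed by hand.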


#### Step 7: Morphology
Morphology is the study of word structure.

A word can be made up of: prefix + root + suffix

Example: replaying -> re + play + ing

In [44]:
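def morphologicalAnalysis(word):
    # Simplified demonstration: fixed-length slices stand in for real affixes
    return {
        "word": word,
        "prefix": word[:2],   # First 2 letters (not necessarily a true prefix)
        "root": word,         # In real NLP, the root is found by a stemmer/lemmatizer
        "suffix": word[-3:]   # Last 3 letters (not necessarily a true suffix)
    }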

print("Morphological Analysis")
for w in filteredWords:
    print(morphologicalAnalysis(w))


Morphological Analysis
{'word': '2026text', 'prefix': '20', 'root': '2026text', 'suffix': 'ext'}
{'word': 'preprocessing', 'prefix': 'pr', 'root': 'preprocessing', 'suffix': 'ing'}
{'word': 'evolved', 'prefix': 'ev', 'root': 'evolved', 'suffix': 'ved'}
{'word': 'prioritize', 'prefix': 'pr', 'root': 'prioritize', 'suffix': 'ize'}
{'word': 'context-aware', 'prefix': 'co', 'root': 'context-aware', 'suffix': 'are'}
{'word': 'noise', 'prefix': 'no', 'root': 'noise', 'suffix': 'ise'}
{'word': 'reductionensuring', 'prefix': 're', 'root': 'reductionensuring', 'suffix': 'ing'}
{'word': 'llms', 'prefix': 'll', 'root': 'llms', 'suffix': 'lms'}
{'word': 'interpret', 'prefix': 'in', 'root': 'interpret', 'suffix': 'ret'}
{'word': 'subtle', 'prefix': 'su', 'root': 'subtle', 'suffix': 'tle'}
{'word': 'linguistic', 'prefix': 'li', 'root': 'linguistic', 'suffix': 'tic'}
{'word': 'nuances', 'prefix': 'nu', 'root': 'nuances', 'suffix': 'ces'}
{'word': 'accuratelymodern', 'prefix': 'ac', 'root': 'accuratel
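
The fixed-length slices above label `pr` as the prefix of `prioritize`, which is not a real affix. A slightly closer sketch, assuming small hand-written affix lists, peels off a prefix or suffix only when it matches a known one. `PREFIXES`, `SUFFIXES` and `split_affixes` are illustrative names, and the lists are far from complete.

```python
# Small illustrative affix lists, assumed for this example only.
PREFIXES = ("re", "un", "pre", "de")
SUFFIXES = ("ing", "ed", "s")

def split_affixes(word: str) -> dict:
    """Peel at most one known prefix and one known suffix off a word."""
    prefix = suffix = ""
    root = word
    for p in PREFIXES:
        # Require at least 3 characters to remain so short words stay intact.
        if root.startswith(p) and len(root) - len(p) >= 3:
            prefix, root = p, root[len(p):]
            break
    for s in SUFFIXES:
        if root.endswith(s) and len(root) - len(s) >= 3:
            suffix, root = s, root[: len(root) - len(s)]
            break
    return {"word": word, "prefix": prefix, "root": root, "suffix": suffix}

for w in ["replaying", "debiasing", "human"]:
    print(split_affixes(w))
```

String matching alone cannot tell a real prefix from a coincidental one: `refined` would be split as re + fin + ed, which is why practical morphological analyzers also consult a lexicon.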

#### Step 8: Frequency Analysis
Frequency analysis counts how often each word appears.

Used in:
1. Search engines 
2. TF-IDF
3. Topic detection 



In [45]:
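from collections import Counter

# Count only purely alphanumeric tokens; isalnum() is False for '.' and for
# hyphenated tokens such as 'context-aware', so those are skipped.
frequency = Counter(w for w in lemmatizedWords if w.isalnum())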

print("\nWord Frequency")

for word, count in frequency.items():
    print(word,": ", count)


Word Frequency
2026text :  1
preprocessing :  1
evolved :  1
prioritize :  1
noise :  1
reductionensuring :  1
llm :  1
interpret :  1
subtle :  1
linguistic :  1
nuance :  1
accuratelymodern :  1
practitioner :  1
integrate :  1
ethical :  1
debiasing :  1
directly :  1
cleaning :  2
phase :  1
mitigate :  1
algorithmic :  1
prejudice :  1
model :  1
training :  1
beginsby :  1
mastering :  1
refined :  1
pipelinesstudents :  1
bridge :  1
gap :  1
rawmessy :  1
human :  1
dialogue :  1
sophisticated :  1
machine :  1
intelligence :  1
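
With a short text almost every word occurs once, so the ranking is not very informative. The sketch below uses a small made-up token list (an assumption for illustration) to show how `Counter.most_common` ranks tokens and how raw counts can be turned into relative frequencies.

```python
from collections import Counter

# Made-up tokens, chosen so that some words repeat.
tokens = ["cleaning", "phase", "model", "cleaning", "pipeline", "model", "cleaning"]

counts = Counter(tokens)
total = sum(counts.values())

# most_common(n) returns the n most frequent (token, count) pairs, highest first.
for token, count in counts.most_common(3):
    # Relative frequency = count of this token / total number of tokens.
    print(f"{token:<10} {count}  {count / total:.2f}")
```

Relative frequencies like these are the term-frequency part of TF-IDF.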
