### Why Text Data is Challenging for Machines

- **Ambiguity & Polysemy**: Same word, multiple meanings (e.g., *bank*)

- **Synonyms & Paraphrasing**: Different words/phrases can mean the same thing

- **Context Dependency**: Meaning changes based on surrounding text

- **Language Variations**: Slang, misspellings, abbreviations, informal usage

- **Sentiment & Tone**: Sarcasm and irony are hard to detect

- **Structure & Syntax**: Complex or broken grammar in real-world text

- **Data Volume & Noise**: Large datasets with irrelevant or noisy content

- **Lack of Numerical Form**: Text must be converted into numbers for ML models


### Why Text Cleaning is Crucial
1. **Reduces Noise**: Removes punctuation and symbols that add little semantic value
2. **Avoids Artificial Differences**: Treats words like `apple` and `apple.` as the same token
3. **Improves Frequency Counts**: Prevents numbers and symbols from inflating vocabulary size
4. **Boosts Model Performance**: Helps models focus on meaningful patterns
5. **Lowers Computational Cost**: Smaller, cleaner data → faster training and less memory usage

In [2]:
import string
import re  # regular expressions

def clean_text(text):
    text = text.lower()  # Convert to lowercase
    text = ''.join(char for char in text if char not in string.punctuation)  # Remove punctuation
    text = ''.join(char for char in text if not char.isdigit())  # Remove digits
    text = ' '.join(text.split())  # Collapse the extra whitespace left behind by the removals
    return text

sample_text = "This is a sample sentence, with punctuation! It costs $19.99 and has 2 items. @Awesome!"
cleaned_sample = clean_text(sample_text)
print(cleaned_sample)

this is a sample sentence with punctuation it costs and has items awesome
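
The same cleaning steps can also be written with regular expressions. The sketch below is one possible variant, not the version used in the cell above: `clean_text_regex` is a hypothetical name, and the pattern keeps only ASCII letters and whitespace, so it is stricter than `clean_text` (it would also drop accented letters and underscores).

```python
import re

def clean_text_regex(text):
    """Lowercase, keep only ASCII letters and whitespace, collapse spaces."""
    text = text.lower()
    text = re.sub(r"[^a-z\s]", "", text)      # drop punctuation, digits, and symbols
    text = re.sub(r"\s+", " ", text).strip()  # collapse runs of whitespace
    return text

sample = "This is a sample sentence, with punctuation! It costs $19.99 and has 2 items. @Awesome!"
print(clean_text_regex(sample))
# this is a sample sentence with punctuation it costs and has items awesome

# 'apple' and 'Apple.' now map to the same token
print(clean_text_regex("Apple.") == clean_text_regex("apple"))  # True
```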


### Breaking Down Text: The Power of Tokenization

Tokenization is the process of splitting text into smaller units called **tokens**. Tokens can be **words, punctuation marks, or sub-word units**.

**Types of Tokenization**
1. **Word Tokenization**
   - Splits text into individual words
   - Uses spaces and punctuation as delimiters
   - Example:  
     `"NLP is fascinating!"` → `["NLP", "is", "fascinating", "!"]`

2. **Sentence Tokenization**
   - Splits text into individual sentences
   - Uses `.`, `?`, `!` as candidate sentence boundaries (abbreviations such as `Dr.` need special handling)
   - Each sentence is treated as a separate unit
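
To make the two delimiter rules concrete, here is a minimal regex-based sketch. It is only an illustration of the idea; `simple_word_tokenize` and `simple_sent_tokenize` are hypothetical helper names, and trained tokenizers such as NLTK's Punkt are designed to handle cases (for example abbreviations) that these naive rules get wrong.

```python
import re

def simple_word_tokenize(text):
    # A token is a run of word characters, or a single punctuation mark
    return re.findall(r"\w+|[^\w\s]", text)

def simple_sent_tokenize(text):
    # Split on whitespace that directly follows '.', '?' or '!'
    return [s for s in re.split(r"(?<=[.?!])\s+", text.strip()) if s]

print(simple_word_tokenize("NLP is fascinating!"))
# ['NLP', 'is', 'fascinating', '!']

print(simple_sent_tokenize("Dr. Smith arrived. Was he late? No!"))
# ['Dr.', 'Smith arrived.', 'Was he late?', 'No!']
```

The second output shows the weakness of a purely punctuation-based rule: the period in `Dr.` is treated as a sentence boundary.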


#### Why Tokenization is Important
- **Enables Feature Extraction**: Converts text into units usable for BoW, TF-IDF, etc.
- **Reduces Complexity**: Breaks text into manageable pieces
- **Supports Linguistic Analysis**: First step for POS tagging, NER, parsing
- **Standardizes Input**: Ensures consistent text processing
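
As a rough illustration of the feature-extraction point, the sketch below turns already-tokenized documents into Bag-of-Words count vectors using only the standard library. The two toy documents and the helper name `bow_vector` are assumptions made for this example.

```python
from collections import Counter

# Two toy documents, already lowercased and tokenized
docs = [
    ["nlp", "is", "fun"],
    ["nlp", "is", "hard", "but", "fun"],
]

# The vocabulary is the sorted set of all distinct tokens
vocab = sorted({tok for doc in docs for tok in doc})

def bow_vector(tokens, vocab):
    """Count how often each vocabulary word occurs in one document."""
    counts = Counter(tokens)
    return [counts[word] for word in vocab]

vectors = [bow_vector(doc, vocab) for doc in docs]
print(vocab)    # ['but', 'fun', 'hard', 'is', 'nlp']
print(vectors)  # [[0, 1, 0, 1, 1], [1, 1, 1, 1, 1]]
```

Each document becomes a fixed-length list of numbers, one position per vocabulary word, which is the form most classical ML models expect.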


In [6]:
import nltk  # provides tokenization functions
nltk.download('punkt')
nltk.download('punkt_tab')

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\ASUS\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package punkt_tab to
[nltk_data]     C:\Users\ASUS\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping tokenizers\punkt_tab.zip.


True

In [8]:
from nltk.tokenize import word_tokenize
# word tokenize
text_to_tokenize = "Tokenization is a fundamental step in NLP!"
word_tokens = word_tokenize(text_to_tokenize)
print(word_tokens) 

['Tokenization', 'is', 'a', 'fundamental', 'step', 'in', 'NLP', '!']


In [9]:
from nltk.tokenize import sent_tokenize
# sentence tokenize
paragraph = "Natural Language Processing is fascinating. It allows computers to understand human language. This is a complex but rewarding field!"
sentence_tokens = sent_tokenize(paragraph)
print(sentence_tokens)

['Natural Language Processing is fascinating.', 'It allows computers to understand human language.', 'This is a complex but rewarding field!']


### Filtering the Noise: The Role of Stop Words

Some words, such as "the," "a," "is," "in," "and," and "of," appear in almost every text but carry little semantic meaning on their own. Because they are so frequent, they can dominate the feature space of a model and obscure more informative words. These common words are known as **stop words**.

#### Why Remove Stop Words?
1. **Reduces Dimensionality**
   - Decreases vocabulary size
   - Faster training and lower memory usage
2. **Improves Model Performance**
   - Removes less informative words
   - Helps models focus on meaningful terms (e.g., *good*, *bad*)
3. **Enhances Interpretability**
   - Highlights important words in analysis and visualizations


In [10]:
import nltk
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\ASUS\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping corpora\stopwords.zip.


True

In [11]:
from nltk.corpus import stopwords
# Get the set of English stop words
stop_words = set(stopwords.words('english'))
sentence = "This is a sample sentence to demonstrate stop word removal."
word_tokens = word_tokenize(sentence)
# Remove stop words
filtered_sentence = [w for w in word_tokens if w.lower() not in stop_words]

print("Original tokens:", word_tokens)
print("Filtered tokens:", filtered_sentence)

Original tokens: ['This', 'is', 'a', 'sample', 'sentence', 'to', 'demonstrate', 'stop', 'word', 'removal', '.']
Filtered tokens: ['sample', 'sentence', 'demonstrate', 'stop', 'word', 'removal', '.']
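
Note that the period survives the filter above, because punctuation is not a stop word. The sketch below shows one way to drop non-alphabetic tokens as well, and to whitelist words you want to keep. It uses a small hand-picked stop list (`SMALL_STOP_WORDS`) and a hypothetical helper `filter_tokens` purely for illustration; stop lists often contain negations such as *not*, which can matter for tasks like sentiment analysis.

```python
# Illustrative, hand-picked stop list; real lists are much longer
SMALL_STOP_WORDS = {"this", "is", "a", "to", "the", "not"}

def filter_tokens(tokens, stop_words=SMALL_STOP_WORDS, keep=frozenset()):
    """Drop stop words and non-alphabetic tokens; words in `keep` are never dropped."""
    return [
        t for t in tokens
        if t.isalpha() and (t.lower() in keep or t.lower() not in stop_words)
    ]

tokens = ['This', 'is', 'a', 'sample', 'sentence', 'to', 'demonstrate',
          'stop', 'word', 'removal', '.']
print(filter_tokens(tokens))
# ['sample', 'sentence', 'demonstrate', 'stop', 'word', 'removal']

review = ["The", "movie", "is", "not", "good"]
print(filter_tokens(review))                # ['movie', 'good']
print(filter_tokens(review, keep={"not"}))  # ['movie', 'not', 'good']
```

Removing *not* turns "not good" into "good", so it is worth checking the stop list against the task before applying it blindly.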


### Reducing Words to Their Roots: Stemming vs. Lemmatization

In natural language, the same concept can appear in multiple word forms. Words like *run, running, ran,* and *runs* all convey the same core meaning, but machine learning models treat them as separate tokens unless normalization is applied. This increases vocabulary size and weakens the importance of the underlying concept. To address this, **stemming** and **lemmatization** are used to reduce words to their base or root form.

A major challenge is **morphological variation**. Inflected forms such as *play, playing, played* and derived forms such as *computer, computing, computed* are semantically related but look different to a model. Without normalization, a token-based model treats them as unrelated features. Stemming and lemmatization map such variations to a common base form, helping models learn more effectively.

1. **Stemming** is a heuristic process that removes prefixes or suffixes from words to obtain a stem. It focuses on speed rather than linguistic accuracy and often produces stems that are not valid English words. For example, *studies* may become *studi*. Stemming is rule-based, fast, and computationally inexpensive, but it can be inaccurate due to over-stemming or under-stemming. Common stemming algorithms include the Porter Stemmer, Snowball Stemmer, and Lancaster Stemmer, with Lancaster being the most aggressive.

2. **Lemmatization**, on the other hand, is a more advanced technique that uses vocabulary and morphological analysis to return the dictionary form of a word, known as the lemma. Unlike stemming, lemmatization produces valid English words and can consider the part of speech to determine the correct base form. For example, the word *better* is lemmatized to *good* when it is tagged as an adjective; without the correct part of speech, it may be left unchanged. Although lemmatization is slower and requires lexical resources, it is more accurate and linguistically meaningful.

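#### Example:
Consider the word "better."

- **Stemming**: Depending on the algorithm, leaves it as "better" (as the Porter stemmer does) or may cut it to a non-word such as "bett."
- **Lemmatization**: Given the adjective part of speech, identifies it as the comparative form of "good" and returns "good." Without a part-of-speech hint, NLTK's WordNet lemmatizer treats the word as a noun and returns "better" unchanged.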
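
To show what "heuristic" and "rule-based" mean in practice, here is a deliberately tiny suffix-stripping sketch. It is a toy and not the Porter algorithm: the rule table `RULES`, the function `toy_stem`, and the `min_stem` threshold are all assumptions made for this illustration.

```python
# Hypothetical rule table: (suffix, replacement), tried in order
RULES = [("ies", "i"), ("ing", ""), ("ed", ""), ("ly", ""), ("s", "")]

def toy_stem(word, min_stem=3):
    """Apply the first matching suffix rule that leaves at least `min_stem` characters."""
    word = word.lower()
    for suffix, replacement in RULES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem:
            return word[: len(word) - len(suffix)] + replacement
    return word

for w in ["studies", "playing", "played", "plays", "running", "news", "better"]:
    print(w, "->", toy_stem(w))
# studies -> studi
# playing -> play
# played -> play
# plays -> play
# running -> runn
# news -> new
# better -> better
```

The output shows both failure modes: *running* becomes *runn* rather than *run* (under-stemming relative to *runs*), while *news* is conflated with *new* (over-stemming). Real stemmers add many more rules and conditions to reduce such errors, but they remain pattern-based rather than dictionary-based.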


In [12]:
import nltk
nltk.download('wordnet')
nltk.download('omw-1.4')

[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\ASUS\AppData\Roaming\nltk_data...
[nltk_data] Downloading package omw-1.4 to
[nltk_data]     C:\Users\ASUS\AppData\Roaming\nltk_data...


True

In [13]:
from nltk.stem import PorterStemmer
#Stemming
porter = PorterStemmer()
words_to_stem = ["running", "runs", "ran", "easily", "fairly", "studies", "studying", "computation", "computational", "computer", "better"]
stemmed_words = [porter.stem(word) for word in words_to_stem]

print("Original Words:", words_to_stem)
print("Stemmed Words:", stemmed_words)

Original Words: ['running', 'runs', 'ran', 'easily', 'fairly', 'studies', 'studying', 'computation', 'computational', 'computer', 'better']
Stemmed Words: ['run', 'run', 'ran', 'easili', 'fairli', 'studi', 'studi', 'comput', 'comput', 'comput', 'better']


In [14]:
from nltk.stem import WordNetLemmatizer
# Lemmatization
lemmatizer = WordNetLemmatizer()
lemmas_no_pos = [lemmatizer.lemmatize(word) for word in words_to_stem]
print("Lemmas (no POS):", lemmas_no_pos)

# Note: NLTK's lemmatizer uses specific POS tags like 'a' for adjective, 'r' for adverb, 'v' for verb, 'n' for noun.
lemma_better_adj = lemmatizer.lemmatize('better', pos='a')
print(f"Lemma for 'better' as adjective: {lemma_better_adj}")

# Example for 'studies' as a verb
lemma_studies_verb = lemmatizer.lemmatize('studies', pos='v')
print(f"Lemma for 'studies' as verb: {lemma_studies_verb}")

Lemmas (no POS): ['running', 'run', 'ran', 'easily', 'fairly', 'study', 'studying', 'computation', 'computational', 'computer', 'better']
Lemma for 'better' as adjective: good
Lemma for 'studies' as verb: study
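
Because the lemmatizer defaults to treating every word as a noun, results improve when each token is lemmatized with its actual part of speech. A common approach is to map Penn Treebank tags (the tag set returned by taggers such as `nltk.pos_tag`) to the single-letter tags the WordNet lemmatizer accepts. The sketch below shows only that mapping; `penn_to_wordnet` is a hypothetical helper name, and the tags in `tagged` are written by hand rather than produced by a tagger.

```python
def penn_to_wordnet(tag):
    """Map a Penn Treebank POS tag to a WordNet POS letter ('a', 'v', 'r' or 'n')."""
    if tag.startswith("J"):
        return "a"  # adjective (JJ, JJR, JJS)
    if tag.startswith("V"):
        return "v"  # verb (VB, VBD, VBG, VBN, VBP, VBZ)
    if tag.startswith("R"):
        return "r"  # adverb (RB, RBR, RBS)
    return "n"      # everything else falls back to noun, the lemmatizer's default

# Hand-written (word, tag) pairs for illustration
tagged = [("studies", "VBZ"), ("better", "JJR"), ("computers", "NNS"), ("easily", "RB")]
for word, tag in tagged:
    print(word, tag, "->", penn_to_wordnet(tag))
# studies VBZ -> v
# better JJR -> a
# computers NNS -> n
# easily RB -> r
```

With the `lemmatizer` object from the cell above, each pair could then be lemmatized as `lemmatizer.lemmatize(word, pos=penn_to_wordnet(tag))`.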
