## Text Normalization

Text normalization is the process of transforming text into a canonical (standard) form to reduce variations and inconsistencies. Users express the same concept in many different ways: for example, `love` can appear as `luv`, `loveee`, `LOVE`, `loving`, or `lovely`. To a machine learning model analyzing sentiment, these variants should ideally be treated as the same word; otherwise the model wastes capacity learning separately that each one is a positive term.

By normalizing text, we reduce vocabulary size, improve model generalization, and help algorithms focus on semantic meaning rather than superficial differences.

### Why Text Normalization Matters

**Without Normalization:**
- `"I love this!"` and `"I LOVE THIS!!"` treated as completely different
- `"running"`, `"runs"`, `"ran"` counted as separate words
- Vocabulary explodes (for example, 100,000+ unique tokens on a large corpus)
- Models struggle to generalize patterns

**With Normalization:**
- Variations mapped to common forms
- Vocabulary shrinks substantially (for example, to 20,000-30,000 tokens)
- Better pattern recognition and accuracy
- Faster training and inference
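
The effect on vocabulary size can be checked directly on a toy corpus. The sketch below is only an illustration: the three sentences and the helper names `raw_tokens` and `normalized_tokens` are assumptions made up for this example, and the normalization is limited to lowercasing and dropping punctuation.

```python
import re

# Tiny made-up corpus for illustration
corpus = [
    "I love this!",
    "I LOVE THIS!!",
    "Love it. Really love it...",
]

def raw_tokens(text):
    # Whitespace split only: case and punctuation stay attached to tokens
    return text.split()

def normalized_tokens(text):
    # Lowercase, then keep runs of letters, digits, and apostrophes
    return re.findall(r"[a-z0-9']+", text.lower())

raw_vocab = {tok for doc in corpus for tok in raw_tokens(doc)}
norm_vocab = {tok for doc in corpus for tok in normalized_tokens(doc)}

print(len(raw_vocab), sorted(raw_vocab))
print(len(norm_vocab), sorted(norm_vocab))
# 9 unique raw tokens versus 5 after normalization
```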

### Common Normalization Techniques

#### 1. **Case Normalization (Lowercasing)**
Convert all text to lowercase to eliminate case variations.
```python
text = "I LOVE This Product!!!"
normalized = text.lower()
# Output: "i love this product!!!"
```

**When to use:** Almost always, unless case carries meaning (e.g., "US" vs "us")
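
When case does carry meaning for a handful of tokens, one option is to lowercase everything except an allow-list. The sketch below assumes a hypothetical `PROTECTED` set and a simple whitespace tokenizer; a token with punctuation attached (such as `"US,"`) would not match and would still be lowercased.

```python
# Hypothetical allow-list of tokens whose casing carries meaning
PROTECTED = {"US", "UK", "IT"}

def lowercase_except(text, protected=PROTECTED):
    # Lowercase every whitespace-separated token unless it is protected
    return " ".join(
        tok if tok in protected else tok.lower()
        for tok in text.split()
    )

print(lowercase_except("The US Team Helped us WIN"))
# Output: "the US team helped us win"
```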

#### 2. **Removing Punctuation & Special Characters**
Strip out non-alphanumeric characters that don't contribute to meaning.
```python
import string
text = "Wow!!! Amazing product... 5/5 stars!!!"
normalized = text.translate(str.maketrans('', '', string.punctuation))
# Output: "Wow Amazing product 55 stars"
```

**Trade-off:** May lose information (e.g., `"don't"` → `"dont"`)
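
One way to soften this trade-off is to keep apostrophes that sit inside a word while removing all other punctuation. This is a minimal sketch, not a complete solution: the function name is hypothetical, and it only handles the straight apostrophe `'`, not typographic quotes.

```python
import re

def strip_punctuation_keep_apostrophes(text):
    # Drop every character that is not a word character, whitespace, or apostrophe
    text = re.sub(r"[^\w\s']", "", text)
    # Remove apostrophes that are not between two word characters
    text = re.sub(r"(?<!\w)'|'(?!\w)", "", text)
    return text

print(strip_punctuation_keep_apostrophes("Wow!!! I don't like the 'new' design..."))
# Output: "Wow I don't like the new design"
```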

#### 3. **Removing Extra Whitespace**
Collapse runs of whitespace (spaces, tabs, newlines) into single spaces and trim the ends.
```python
import re
text = "Great    product   here"
normalized = re.sub(r'\s+', ' ', text).strip()
# Output: "Great product here"
```

#### 4. **Expanding Contractions**
Convert shortened forms to full words.
```python
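contractions = {
    "don't": "do not",
    "can't": "cannot",
    "it's": "it is"
}
text = "I don't think it's bad"
# Replace each contraction with its expanded form
normalized = text
for short, full in contractions.items():
    normalized = normalized.replace(short, full)
# Output: "I do not think it is bad"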
```

#### 5. **Removing Stop Words**
Filter out common words that carry little meaning (`the`, `is`, `at`, etc.).
```python
# Requires the stop word list to be downloaded once: nltk.download('stopwords')
from nltk.corpus import stopwords
stop_words = set(stopwords.words('english'))
words = "this is a great product".split()
filtered = [w for w in words if w not in stop_words]
# Output: ['great', 'product']
```

**Caution:** Can remove important context in some tasks (e.g., `"not good"` → `"good"`)
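
A common workaround is to subtract negation words from the stop list before filtering. The sketch below uses a small hand-written `STOP_WORDS` set and a hypothetical `NEGATIONS` set purely for illustration; NLTK's English list is much longer.

```python
# Small illustrative stop list (an assumption; NLTK's English list is much longer)
STOP_WORDS = {"this", "is", "a", "the", "not", "no", "and", "it"}
# Negation terms to keep even though they appear in the stop list
NEGATIONS = {"not", "no", "never"}

def remove_stop_words(tokens, stop_words=STOP_WORDS, keep=NEGATIONS):
    # Filter against the stop list minus the protected negation terms
    effective = stop_words - keep
    return [t for t in tokens if t.lower() not in effective]

print(remove_stop_words("this is not a good product".split()))
# Output: ['not', 'good', 'product']
```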

#### 6. **Stemming**
Reduce words to their root form by chopping off suffixes (crude but fast).
```python
from nltk.stem import PorterStemmer
stemmer = PorterStemmer()

words = ["loving", "loves", "loved", "lovely", "loveable"]
stems = [stemmer.stem(w) for w in words]
# Output: ['love', 'love', 'love', 'love', 'loveabl']
```

**Pros:** Fast, reduces vocabulary significantly  
**Cons:** Can create non-words (`"studies"` → `"studi"`)

#### 7. **Lemmatization**
Reduce words to their dictionary base form using linguistic rules (more accurate).
```python
from nltk.stem import WordNetLemmatizer
lemmatizer = WordNetLemmatizer()

words = ["running", "runs", "ran", "better", "best"]
# pos='v' treats every word as a verb, so "better" and "best" are left unchanged
lemmas = [lemmatizer.lemmatize(w, pos='v') for w in words]
# Output: ['run', 'run', 'run', 'better', 'best']
```

**Pros:** Produces real words, linguistically accurate  
**Cons:** Slower than stemming, requires POS tagging for best results

#### 8. **Handling Repetitions**
Normalize elongated words by capping runs of repeated characters (e.g., `"yessss"` → `"yess"`).
```python
import re
def normalize_repetition(text):
    # Reduce 3+ repeated characters to 2
    return re.sub(r'(.)\1{2,}', r'\1\1', text)

text = "I loooove this soooo much!!!"
normalized = normalize_repetition(text)
# Output: "I loove this soo much!!"
```
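
The pattern above also shortens repeated punctuation. If only letters should be squeezed, a variant such as the following could be used. It is a sketch: the name `squeeze_repeats` and the `max_run` parameter are assumptions introduced here, and only ASCII letters are handled.

```python
import re

def squeeze_repeats(text, max_run=2):
    # Match a letter followed by at least max_run more copies of itself
    pattern = re.compile(r"([A-Za-z])\1{%d,}" % max_run)
    # Replace the whole run with max_run copies of the letter
    return pattern.sub(lambda m: m.group(1) * max_run, text)

print(squeeze_repeats("I loooove this soooo much!!!"))
# Output: "I loove this soo much!!!"
print(squeeze_repeats("yessss", max_run=1))
# Output: "yes"
```

Setting `max_run=1` recovers `"yes"` from `"yessss"`, but it also damages legitimate double letters (`"good"` becomes `"god"`), which is why capping at two is the usual compromise.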

#### 9. **Number Normalization**
Replace numbers with tokens or remove them entirely.
```python
import re
text = "I bought 3 items for $25.99"
# Match integers and decimals as a single number
normalized = re.sub(r'\d+(?:\.\d+)?', '<NUM>', text)
# Output: "I bought <NUM> items for $<NUM>"
```

#### 10. **Accent Removal (Diacritics)**
Normalize accented characters to ASCII.
```python
import unicodedata
text = "café naïve résumé"
normalized = unicodedata.normalize('NFKD', text).encode('ascii', 'ignore').decode('utf-8')
# Output: "cafe naive resume"
```
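
Encoding to ASCII with `'ignore'` also discards every character that has no ASCII equivalent, which wipes out non-Latin text. An alternative sketch, assuming the goal is only to drop combining marks, filters on `unicodedata.combining` instead; the function name `strip_accents` is made up for this example.

```python
import unicodedata

def strip_accents(text):
    # Decompose characters into base letters plus combining marks
    decomposed = unicodedata.normalize("NFD", text)
    # Drop the combining marks, keep everything else (including non-Latin letters)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # Recompose whatever remains
    return unicodedata.normalize("NFC", stripped)

print(strip_accents("café naïve résumé"))
# Output: "cafe naive resume"
print(strip_accents("Привет, café"))
# Output: "Привет, cafe"
```

This is still lossy: combining marks are removed in every script, so it should be applied only where diacritics are known not to matter.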

### When to Apply Different Techniques

| Technique | Sentiment Analysis | Topic Modeling | Search | NER | Translation |
|-----------|-------------------|----------------|--------|-----|-------------|
| Lowercasing | ✅ Yes | ✅ Yes | ✅ Yes | ❌ No* | ✅ Yes |
| Remove Punctuation | ⚠️ Careful** | ✅ Yes | ⚠️ Careful | ❌ No | ✅ Yes |
| Stemming | ✅ Yes | ✅ Yes | ✅ Yes | ❌ No | ❌ No |
| Lemmatization | ✅ Yes | ✅ Yes | ✅ Yes | ⚠️ Careful | ⚠️ Careful |
| Stop Words | ⚠️ Careful*** | ✅ Yes | ✅ Yes | ❌ No | ❌ No |

\* Capitalization matters for named entities  
\*\* Can lose negation: `"don't"` → `"dont"`  
\*\*\* `"not good"` becomes `"good"` if "not" is removed
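
One way to act on the table is to keep a separate list of steps per task. The sketch below is an assumption-laden illustration: the `PIPELINES` dictionary, its task names, and the three helper functions are hypothetical and cover only a few of the techniques above.

```python
import re
import string

def lowercase(text):
    return text.lower()

def strip_punctuation(text):
    return text.translate(str.maketrans("", "", string.punctuation))

def collapse_whitespace(text):
    return re.sub(r"\s+", " ", text).strip()

# Hypothetical per-task step lists, loosely following the table above
PIPELINES = {
    "topic_modeling": [lowercase, strip_punctuation, collapse_whitespace],
    "ner": [collapse_whitespace],  # keep case and punctuation
}

def normalize(text, task):
    # Apply each configured step in order
    for step in PIPELINES[task]:
        text = step(text)
    return text

sample = "Apple  opened a store in the U.S.!"
print(normalize(sample, "topic_modeling"))
# Output: "apple opened a store in the us"
print(normalize(sample, "ner"))
# Output: "Apple opened a store in the U.S.!"
```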

### Trade-offs to Consider

**Over-normalization Risks:**
- Loss of sentiment: `"NOT good"` → `"good"`
- Loss of entities: `"US"` (country) → `"us"` (pronoun)
- Loss of context: `"Apple"` (company) vs `"apple"` (fruit)

**Under-normalization Risks:**
- Vocabulary explosion
- Poor generalization
- Increased computational cost

**Best Practice:** Start conservative, normalize incrementally, and evaluate impact on your specific task through experimentation.

In this notebook, we discuss these methods in more detail and explore how to implement them effectively for different NLP tasks.

----------

In [13]:
from nltk.stem.porter import PorterStemmer
from nltk.stem import WordNetLemmatizer
from nltk.stem.snowball import SnowballStemmer

> ### Stemming

In [6]:
# PorterStemmer
stemmer = PorterStemmer()

words = ['caresses', 'flies', 'dies', 'mules', 'denied',
         'died', 'agreed', 'owned', 'humbled', 'sized',
         'meeting', 'stating', 'siezing', 'itemization',
         'sensational', 'traditional', 'reference', 'colonizer',
         'plotted']

singles = [stemmer.stem(word) for word in words]

print(', '.join(singles))

caress, fli, die, mule, deni, die, agre, own, humbl, size, meet, state, siez, item, sensat, tradit, refer, colon, plot


> ### The Snowball stemmer supports several languages

In [7]:
print(", ".join(SnowballStemmer.languages))

arabic, danish, dutch, english, finnish, french, german, hungarian, italian, norwegian, porter, portuguese, romanian, russian, spanish, swedish


In [None]:
ar_stemmer = SnowballStemmer("arabic")

# Note: the whole phrase is passed to stem() as a single string
ar_stemmer.stem("الجو سماؤه صافية")

'جو سماوه صاف'

-----

> ### Lemmatization

In [15]:
# WordNetLemmatizer
lemmatizer = WordNetLemmatizer()

print("rocks :", lemmatizer.lemmatize("rocks"))
print("corpora :", lemmatizer.lemmatize("corpora"))
print("corpuses :", lemmatizer.lemmatize("corpuses"))

# pos="a" tells the lemmatizer to treat the word as an adjective
print("better :", lemmatizer.lemmatize("better", pos="a"))
print("best :", lemmatizer.lemmatize("best", pos="a"))

rocks : rock
corpora : corpus
corpuses : corpus
better : good
best : best


-----------

> ### Lemmatization vs Stemming

The key difference is that stemming sometimes mangles a word into a non-word (e.g., `flies` → `fli`), whereas lemmatization returns a valid dictionary form that preserves the meaning (e.g., `better` → `good`).
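
The contrast can be made concrete with a deliberately simplified toy. Neither function below reproduces the Porter algorithm or WordNet: `SUFFIXES` and `LEMMAS` are hypothetical, hand-written stand-ins that only show suffix chopping versus dictionary lookup.

```python
# Toy illustration only: neither function reproduces Porter or WordNet behavior
SUFFIXES = ("ies", "ing", "ed", "es", "s")  # hypothetical suffix list

# Hypothetical hand-written lemma dictionary
LEMMAS = {"studies": "study", "ran": "run", "better": "good", "flies": "fly"}

def toy_stem(word):
    # Strip the first matching suffix, with no check that the result is a word
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def toy_lemmatize(word):
    # Look the word up; fall back to the word itself when it is unknown
    return LEMMAS.get(word, word)

for w in ["studies", "flies", "ran", "better"]:
    print(w, "->", toy_stem(w), "|", toy_lemmatize(w))
# studies -> stud | study
# flies -> fli | fly
# ran -> ran | run
# better -> better | good
```

The stemmer is cheap and needs no dictionary, but it cannot relate `ran` to `run`; the lookup can, at the cost of needing linguistic data.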

---------------

> ## `Great Job`

--------