# Objectives

- Understand how text is processed and analysed

# NLP (Natural Language Processing)

NLP is the analysis and generation of text (natural language).

- **Analysis:** extract meaning, classify, and translate
- **Generation:** create text, summarize, and chat



### Text Processing

**tokens:** the basic text units a language model processes: sometimes words, sometimes parts of words. A single token is not always meaningful on its own; meaning is formed across a sequence of tokens.


**reference:**

[nltk](https://www.nltk.org)

In [None]:
! pip install nltk scikit-learn pandas

In [None]:
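# TEXT PREPROCESSING: Tokenization
import nltk
from nltk.tokenize import word_tokenize

nltk.download('punkt', quiet=True)
nltk.download('punkt_tab', quiet=True)  # tokenizer data used by newer NLTK releases


def demo_tokenization():
    """Compare naive whitespace splitting with NLTK's word_tokenize."""
    text = "Don't split this! Dr. Smith's email is test@example.com. Cost: $49.99"

    print("=" * 60)
    print("NAIVE SPLIT vs. word_tokenize")
    print("=" * 60)
    print("split():        ", text.split())
    print("word_tokenize():", word_tokenize(text))

    # Edge cases
    print("=" * 60)
    print("EDGE CASES")
    print("=" * 60)

    edge_cases = [
        "It's can't won't",
        "U.S.A. vs USA",
        "covid-19",
        "test@email.com",
        "I'm feeling 😊 today!",
    ]
    for case in edge_cases:
        print(f"{case!r:28} -> {word_tokenize(case)}")


demo_tokenization()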

A better tokenizer handles contractions, separates punctuation, and keeps meaningful units together.
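To make these three properties concrete, here is a minimal regex-based sketch. The pattern and the `simple_tokenize` name are illustrative assumptions, not NLTK's implementation: it keeps e-mail addresses, prices, contractions, and hyphenated terms whole and emits other punctuation as separate tokens.

```python
import re

# Hypothetical pattern for illustration; alternatives are tried left to right.
TOKEN_PATTERN = re.compile(
    r"""
    [\w.+-]+@[\w-]+(?:\.[\w-]+)+   # e-mail addresses
    | \$?\d+(?:\.\d+)?             # numbers and prices such as $49.99
    | \w+(?:[-']\w+)*              # words, contractions, hyphenated terms
    | [^\w\s]                      # any other single punctuation mark
    """,
    re.VERBOSE,
)


def simple_tokenize(text):
    """Return the tokens matched by TOKEN_PATTERN, in order of appearance."""
    return TOKEN_PATTERN.findall(text)


print(simple_tokenize("Don't split this! Cost: $49.99"))
print(simple_tokenize("Email test@example.com about covid-19."))
```

Keeping contractions whole is a design choice made here. NLTK's `word_tokenize`, for example, splits "Don't" into "Do" and "n't" instead.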

----

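### Normalization

Sometimes tokens have different surface forms but convey the same underlying meaning, so we normalize them to a common form.

**example:** run, runs, ran, running -> run


This is done via:

- **Stemming:** strips word endings with heuristic rules; fast, but the result may not be a real word.
- **Lemmatization:** maps each word to its dictionary form (lemma); slower, and works best with a part-of-speech tag.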

In [24]:
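# TEXT PREPROCESSING: Normalization
import nltk
from nltk.stem import PorterStemmer, WordNetLemmatizer

nltk.download('wordnet', quiet=True)  # lexicon used by the lemmatizer

def demo_normalization():
    """Compare stemming and lemmatization"""
    stemmer = PorterStemmer()
    lemmatizer = WordNetLemmatizer()
    for word in ["run", "runs", "ran", "running", "caring", "studies"]:
        # pos='v' tells the lemmatizer to treat each word as a verb
        lemma = lemmatizer.lemmatize(word, pos='v')
        print(f"{word:10} stem: {stemmer.stem(word):10} lemma: {lemma}")

demo_normalization()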

**STEMMING (Porter Stemmer):**

- Fast - just chops off endings
- Good for search (retrieval)
- Creates non-words: 'caring' → 'care' (good), 'studies' → 'studi' (bad)
    
**LEMMATIZATION:**
- Real words - uses dictionary
- Better for analysis
- Slower
- Needs part-of-speech tag
    

**WHEN TO USE:**
- Search engines → Stemming (speed)
- Sentiment analysis → Lemmatization (accuracy)
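The difference between "chopping off endings" and "using a dictionary" can be sketched without any library. The suffix rules and the tiny lemma dictionary below are made-up assumptions for illustration; they are far simpler than the Porter stemmer or a WordNet-based lemmatizer.

```python
# Toy suffix rules for illustration only; real stemmers use many more rules.
SUFFIX_RULES = [("ies", "i"), ("ing", ""), ("s", "")]

# Hypothetical mini-dictionary standing in for a real lexicon.
LEMMA_DICT = {"studies": "study", "ran": "run", "running": "run", "runs": "run"}


def toy_stem(word):
    """Strip the first matching suffix; no check that the result is a word."""
    for suffix, replacement in SUFFIX_RULES:
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)] + replacement
    return word


def toy_lemmatize(word):
    """Look the word up; fall back to the word itself if it is unknown."""
    return LEMMA_DICT.get(word, word)


for w in ["studies", "running", "runs", "ran"]:
    print(f"{w:8} stem={toy_stem(w):8} lemma={toy_lemmatize(w)}")
```

The toy stemmer leaves "ran" untouched because no suffix rule applies, while the lookup maps it to "run". The lookup, in turn, only knows the words in its dictionary.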

----
    

### N-grams & Language Modeling

So far we have processed text; now let's predict what comes next in a sentence.

Use cases:

- autocomplete

An n-gram model is based on conditional probability: given the previous words, it estimates what word is likely to occur next.

- n = 1 (unigram): no context
- n = 2 (bigram): previous 1 word
- n = 3 (trigram): previous 2 words
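As a small illustration of what these n-grams look like, the sketch below slides a window of size n over a token list and then estimates one conditional probability from counts. The helper name `extract_ngrams` and the example sentence are assumptions made for this sketch.

```python
from collections import Counter


def extract_ngrams(tokens, n):
    """Return every run of n consecutive tokens as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


tokens = "to be or not to be".split()
for n in (1, 2, 3):
    print(n, extract_ngrams(tokens, n))

# Count-based estimate: P(be | to) = count("to be") / count("to")
bigram_counts = Counter(extract_ngrams(tokens, 2))
unigram_counts = Counter(tokens)
p_be_given_to = bigram_counts[("to", "be")] / unigram_counts["to"]
print(p_be_given_to)
```

In this sentence "to" is always followed by "be", so the estimate is 1.0.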



In [None]:
# N-GRAMS: Modeling sequences of words
# Predicting the next word based on previous words

from collections import defaultdict, Counter
import random
import nltk
from nltk.tokenize import word_tokenize

nltk.download('punkt', quiet=True)
nltk.download('punkt_tab', quiet=True)  # tokenizer data required by newer NLTK releases

class NGramModel:
    """
    Simple n-gram language model
    Predicts next word based on previous n-1 words
    """
    
    def __init__(self, n=2):
        """
        n=1: unigram (no context)
        n=2: bigram (previous 1 word)
        n=3: trigram (previous 2 words)
        """
        self.n = n
        self.ngrams = defaultdict(Counter)
        self.context_counts = Counter()
    
    def train(self, text):
        """Learn n-gram probabilities from text"""
        tokens = word_tokenize(text.lower())
        
        # Add start/end markers
        tokens = ['<START>'] * (self.n - 1) + tokens + ['<END>']
        
        # Count n-grams
        for i in range(len(tokens) - self.n + 1):
            # Context: first n-1 words
            context = tuple(tokens[i:i + self.n - 1])
            # Next word
            next_word = tokens[i + self.n - 1]
            
            self.ngrams[context][next_word] += 1
            self.context_counts[context] += 1
    
    def probability(self, context, word):
        """P(word | context)"""
        context = tuple(context)
        if context not in self.ngrams:
            return 0.0
        
        count = self.ngrams[context][word]
        total = self.context_counts[context]
        return count / total
    
    def generate(self, max_words=20):
        """Generate text using the model"""
        context = ['<START>'] * (self.n - 1)
        result = []
        
        for _ in range(max_words):
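            # Context: the last n-1 words (an empty tuple for unigrams;
            # a negative slice would return the whole list when n == 1)
            context_tuple = tuple(context[len(context) - (self.n - 1):])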
            
            if context_tuple not in self.ngrams:
                break
            
            # Choose next word based on probabilities
            next_words = self.ngrams[context_tuple]
            next_word = random.choices(
                list(next_words.keys()),
                weights=list(next_words.values())
            )[0]
            
            if next_word == '<END>':
                break
            
            result.append(next_word)
            context.append(next_word)
        
        return ' '.join(result)


def demo_ngrams():
    """Show n-grams in action"""
    
    # Training data: Shakespeare quotes
    corpus = """
    To be or not to be, that is the question.
    All the world's a stage, and all the men and women merely players.
    To be or not to be, to thine own self be true.
    The course of true love never did run smooth.
    All that glitters is not gold.
    What's in a name? A rose by any other name would smell as sweet.
    """
    
    print("="*60)
    print("N-GRAM LANGUAGE MODELS")
    print("="*60)
    print(f"\nTraining corpus:\n{corpus[:200]}...\n")
    
    # Train different n-gram models
    for n in [2, 3]:
        print(f"\n{'='*60}")
        print(f"{n}-GRAM MODEL (context = {n-1} words)")
        print("="*60)
        
        model = NGramModel(n=n)
        model.train(corpus)
        
        # Show some probabilities
        if n == 2:
            context = ['to']
            print(f"\nProbabilities after '{context[0]}':")
            for word, count in model.ngrams[tuple(context)].most_common(5):
                prob = model.probability(context, word)
                print(f"  P({word} | {context[0]}) = {prob:.3f}")
        
        # Generate text
        print("\nGenerated text:")
        for i in range(3):
            print(f"  {i+1}. {model.generate()}")


demo_ngrams()

**Observations:**

- Bigrams capture local patterns
- Trigrams are more coherent but need more data
- Output is still often nonsensical: the model has no real understanding
- Can only use patterns seen in training (data sparsity problem)
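The data sparsity problem can be made visible by checking how many n-grams of a new sentence were ever seen in training. This is a minimal sketch on two made-up toy sentences; `seen_fraction` is a hypothetical helper, not part of the model above.

```python
def extract_ngrams(tokens, n):
    """Return every run of n consecutive tokens as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def seen_fraction(train_tokens, test_tokens, n):
    """Share of the test n-grams that also occur in the training tokens."""
    seen = set(extract_ngrams(train_tokens, n))
    test_ngrams = extract_ngrams(test_tokens, n)
    return sum(gram in seen for gram in test_ngrams) / len(test_ngrams)


train = "to be or not to be that is the question".split()
test = "to be is not the question".split()
for n in (1, 2, 3):
    print(n, seen_fraction(train, test, n))
```

Every word of the test sentence was seen in training (1.0), but only 2 of its 5 bigrams (0.4) and none of its trigrams (0.0) were. The longer the context, the more training data is needed to cover it.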


### Activity

Experiment with your own n-gram model:

[n-gram](./n-gram-experiment.py)


----

N-grams predict the next word. But what if we want to classify an entire document, like 'is this review positive or negative?' We need a different approach.


### Bag of Words & Classification

From sequences to documents: one-hot encode each token and sum the vectors across the document. The result is one count per vocabulary word.
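The "one-hot, then sum" idea can be written out directly. The sketch below uses plain Python lists and two made-up documents; `one_hot` and `bag_of_words` are illustrative names, and a real pipeline would typically use a library vectorizer instead.

```python
def one_hot(token, vocab):
    """Vector with a 1 at the token's vocabulary position and 0 elsewhere."""
    return [1 if word == token else 0 for word in vocab]


def bag_of_words(tokens, vocab):
    """Sum the one-hot vectors of all tokens, giving per-word counts."""
    vector = [0] * len(vocab)
    for token in tokens:
        vector = [a + b for a, b in zip(vector, one_hot(token, vocab))]
    return vector


docs = ["the dog chased the cat", "the cat slept"]
vocab = sorted({word for doc in docs for word in doc.split()})
print(vocab)
for doc in docs:
    print(bag_of_words(doc.split(), vocab), doc)
```

Both documents end up as vectors of the same length (the vocabulary size), and "the" gets a count of 2 in the first document because it occurs twice.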


In [26]:
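# BAG OF WORDS: Representing documents as word counts
# Ignore order, just count occurrences
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

def demo_bow():
    """Visualize bag of words representation"""
    docs = ["dog bites man", "man bites dog", "not good", "good"]
    vectorizer = CountVectorizer()
    counts = vectorizer.fit_transform(docs).toarray()
    # One row per document, one column per vocabulary word
    print(pd.DataFrame(counts, columns=vectorizer.get_feature_names_out(), index=docs))

demo_bow()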

#### Observations

- Each document is now a vector of word counts
- Same length (vocabulary size)
- Can now use machine learning
    
- Lost word order: "dog bites man" = "man bites dog"
- Lost syntax: "not good" looks like "good"
- Every word is a separate dimension

**1. TEXT PREPROCESSING**

- Tokenization: Breaking text into tokens (words or parts of words)
- Normalization: Stemming/lemmatization
- These are fundamental to all NLP
    
**2. N-GRAMS**

- Model probability of word sequences
- Predict next word
- Can't generalize to unseen sequences
- Data sparsity problem
    
**3. BAG OF WORDS + CLASSIFICATION**
- Represent documents as vectors
- Train classifiers (logistic regression)
- Each word is separate dimension
- Synonyms are invisible
- Vocabulary explosion (50,000+ dimensions)

---


The current representation does not treat 'car' and 'automobile' as the same thing:

    'car'        → [0, 0, 1, 0, 0, ...]  (50,000 dimensions)
    'automobile' → [0, 1, 0, 0, 0, ...]  (completely different!)

So sparse vectors are not enough to identify synonyms.
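One way to quantify "completely different" is cosine similarity, a common measure for comparing vectors (choosing it here is an assumption of this sketch, as are the helper names). The indices 2 and 1 mirror the positions shown above.

```python
import math


def one_hot(index, size):
    """Vector of the given size with a single 1.0 at position index."""
    vector = [0.0] * size
    vector[index] = 1.0
    return vector


def cosine(u, v):
    """Cosine similarity: dot product divided by the product of the norms."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


VOCAB_SIZE = 50_000
car = one_hot(2, VOCAB_SIZE)
automobile = one_hot(1, VOCAB_SIZE)

print(cosine(car, automobile))
print(cosine(car, car))
```

Any two distinct one-hot vectors have similarity 0.0, so 'car' is exactly as far from 'automobile' as it is from any unrelated word. Only a word compared with itself scores 1.0.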

