# Objectives

- Understand how text is processed and analysed

# NLP (Natural Language Processing)

NLP is the analysis and generation of text (language).

- **analysis:** extract meaning, classify, and translate

- **generation:** create text, summarize, and chat



### Text Processing

**tokens:** the basic units of text that a language model processes. A token is sometimes a whole word and sometimes part of a word. Individual tokens are not always meaningful on their own; meaning is formed across a sequence of tokens.


**reference:**

[nltk](https://www.nltk.org)

In [2]:
! pip install nltk scikit-learn pandas


In [5]:
# TEXT PREPROCESSING: Tokenization


def demo_tokenization():
    
    text = "Don't split this! Dr. Smith's email is test@example.com. Cost: $49.99"


    print("Original Text:")
    print(text)

    # Naive approach: split by whitespace
    print("\nNaive Tokenization (split by whitespace):")
    print(text.split())

    # NLTK word_tokenize

    import nltk
    nltk.download('punkt')
    from nltk.tokenize import word_tokenize

    print("\nNLTK word_tokenize:")
    print(word_tokenize(text))


    
    # Edge cases
    print("="*60)
    print("EDGE CASES")


    edge_cases = [
        "It's can't won't",
        "U.S.A. vs USA",
        "covid-19",
        "test@email.com",
        "I'm feeling 😊 today!",
    ]

    for case in edge_cases:
        print(f"\nOriginal: {case}")
        print("Tokens:", word_tokenize(case))
    

demo_tokenization()

Original Text:
Don't split this! Dr. Smith's email is test@example.com. Cost: $49.99

Naive Tokenization (split by whitespace):
["Don't", 'split', 'this!', 'Dr.', "Smith's", 'email', 'is', 'test@example.com.', 'Cost:', '$49.99']

NLTK word_tokenize:
['Do', "n't", 'split', 'this', '!', 'Dr.', 'Smith', "'s", 'email', 'is', 'test', '@', 'example.com', '.', 'Cost', ':', '$', '49.99']
EDGE CASES

Original: It's can't won't
Tokens: ['It', "'s", 'ca', "n't", 'wo', "n't"]

Original: U.S.A. vs USA
Tokens: ['U.S.A.', 'vs', 'USA']

Original: covid-19
Tokens: ['covid-19']

Original: test@email.com
Tokens: ['test', '@', 'email.com']

Original: I'm feeling 😊 today!
Tokens: ['I', "'m", 'feeling', '😊', 'today', '!']


[nltk_data] Downloading package punkt to
[nltk_data]     /Users/krishnagopikaurlaganti/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


A better tokenizer handles contractions, separates punctuation, and keeps meaningful units together.
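As a rough illustration of that idea, the sketch below uses a single regular expression from the standard library to split punctuation into separate tokens while keeping contractions and decimal numbers together. The function name `simple_tokenize` and the pattern are assumptions made for this example. This is not how NLTK's `word_tokenize` works, and unlike NLTK it keeps `Don't` as one token instead of splitting it into `Do` + `n't`.

```python
import re

# Hypothetical pattern for illustration only:
#   \d+(?:\.\d+)?   numbers, optionally with a decimal part (49.99)
#   \w+(?:'\w+)?    words, optionally with an apostrophe part (Don't, Smith's)
#   [^\w\s]         any single punctuation character
TOKEN_PATTERN = re.compile(r"\d+(?:\.\d+)?|\w+(?:'\w+)?|[^\w\s]")


def simple_tokenize(text):
    """Return a list of word, number, and punctuation tokens."""
    return TOKEN_PATTERN.findall(text)


print(simple_tokenize("Don't split this! Cost: $49.99"))
```

A real tokenizer needs many more rules (abbreviations such as `Dr.`, e-mail addresses, emoji), which is why a library is usually the better choice.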

----

### Normalization

Sometimes tokens have different surface forms but convey the same underlying meaning, so we normalize them to a common form.

**example:** run, runs, ran, running -> run


this is done via:

- **Stemming**
- **Lemmatization**

In [6]:
#TEXT PREPROCESSING: Normalization



def demo_normalization():
    """Compare stemming and lemmatization"""
    import nltk
    from nltk.stem import PorterStemmer, WordNetLemmatizer

    nltk.download('wordnet', quiet=True)
    nltk.download('omw-1.4', quiet=True)

    stemmer = PorterStemmer()
    lemmatizer = WordNetLemmatizer()

    print("Stemming vs Lemmatization")

    words = [
        'running', 'runs', 'ran',
        'better', 'best', 'good',
        'caring', 'cares', 'cared',
        'studies', 'studying', 'studied'
    ]

    print(f"{'Word':<15}{'Stemmed':<15}{'Lemmatized':<15}")

    for word in words:
        stemmed = stemmer.stem(word)
        lemma_v = lemmatizer.lemmatize(word, pos='v')  # specify verb for better results
        lemma_n = lemmatizer.lemmatize(word, pos='n')  # specify noun for better results

        lemma = lemma_v if lemma_v != word else lemma_n

        print(f"{word:<15}{stemmed:<15}{lemma:<15}")




    

demo_normalization()

Stemming vs Lemmatization
Word           Stemmed        Lemmatized     
running        run            run            
runs           run            run            
ran            ran            run            
better         better         better         
best           best           best           
good           good           good           
caring         care           care           
cares          care           care           
cared          care           care           
studies        studi          study          
studying       studi          study          
studied        studi          study          


**STEMMING (Porter Stemmer):**

- Fast - just chops off endings
- Good for search (retrieval)
- Creates non-words: 'caring' → 'care' (good), 'studies' → 'studi' (bad)
    
**LEMMATIZATION:**
- Real words - uses dictionary
- Better for analysis
- Slower
- Needs part-of-speech tag
    

**WHEN TO USE:**
- Search engines → Stemming (speed)
- Sentiment analysis → Lemmatization (accuracy)
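To make the contrast concrete, here is a toy sketch of both ideas using only the standard library. `toy_stem`, `toy_lemmatize`, and the `TOY_LEMMAS` mini-dictionary are hypothetical names invented for this example; they are far cruder than the Porter stemmer or the WordNet lemmatizer and only illustrate "chop off endings" versus "look up in a dictionary".

```python
def toy_stem(word):
    """Strip the first matching suffix; no dictionary is involved."""
    for suffix in ("ing", "es", "ed", "s"):
        # Keep at least three characters so short words are left alone
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


# Hypothetical mini-dictionary standing in for a real lexicon such as WordNet
TOY_LEMMAS = {
    "runs": "run", "ran": "run", "running": "run",
    "studies": "study", "studied": "study",
}


def toy_lemmatize(word):
    """Look the word up; fall back to the word itself if it is unknown."""
    return TOY_LEMMAS.get(word, word)


for w in ["runs", "ran", "studies"]:
    print(f"{w:<10}{toy_stem(w):<10}{toy_lemmatize(w):<10}")
```

The suffix rule cannot relate `ran` to `run` and turns `studies` into the non-word `studi`, while the lookup handles both but only for words that are in its dictionary.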

----
    

### N-grams & Language Modeling

So far we have processed text. Now let's predict what comes next in a sentence.

Use cases:

- autocomplete

An n-gram model is based on conditional probability: given the previous words, it estimates what word is likely to occur next.

- n=1 (unigram)
- n=2 (bigram)
- n = 3 (trigram) 
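For a bigram model, the conditional probability can be estimated directly from counts: P(word | prev) = count(prev, word) / count(prev). The sketch below shows that estimate on a tiny example. The helper names `bigram_counts` and `bigram_probability` and the whitespace tokenization are assumptions made for illustration.

```python
from collections import Counter, defaultdict


def bigram_counts(tokens):
    """Map each word to a Counter of the words that follow it."""
    counts = defaultdict(Counter)
    for prev, nxt in zip(tokens, tokens[1:]):
        counts[prev][nxt] += 1
    return counts


def bigram_probability(counts, prev, word):
    """Maximum-likelihood estimate of P(word | prev) from raw counts."""
    total = sum(counts[prev].values()) if prev in counts else 0
    return counts[prev][word] / total if total else 0.0


tokens = "to be or not to be to thine own self be true".split()
counts = bigram_counts(tokens)

# "to" occurs three times as a context: followed by "be" twice and "thine" once
print(bigram_probability(counts, "to", "be"))
# A bigram that never appeared in the data gets probability 0.0
print(bigram_probability(counts, "to", "sleep"))
```

The zero for an unseen pair is the count-based view of why n-gram models struggle with sequences that were not in the training data.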



In [7]:
# N-GRAMS: Modeling sequences of words
# Predicting the next word based on previous words

from collections import defaultdict, Counter
import random
import nltk
from nltk.tokenize import word_tokenize

nltk.download('punkt', quiet=True)

class NGramModel:
    """
    Simple n-gram language model
    Predicts next word based on previous n-1 words
    """
    
    def __init__(self, n=2):
        """
        n=1: unigram (no context)
        n=2: bigram (previous 1 word)
        n=3: trigram (previous 2 words)
        """
        self.n = n
        self.ngrams = defaultdict(Counter)
        self.context_counts = Counter()
    
    def train(self, text):
        """Learn n-gram probabilities from text"""
        tokens = word_tokenize(text.lower())
        
        # Add start/end markers
        tokens = ['<START>'] * (self.n - 1) + tokens + ['<END>']
        
        # Count n-grams
        for i in range(len(tokens) - self.n + 1):
            # Context: first n-1 words
            context = tuple(tokens[i:i + self.n - 1])
            # Next word
            next_word = tokens[i + self.n - 1]
            
            self.ngrams[context][next_word] += 1
            self.context_counts[context] += 1
    
    def probability(self, context, word):
        """P(word | context)"""
        context = tuple(context)
        if context not in self.ngrams:
            return 0.0
        
        count = self.ngrams[context][word]
        total = self.context_counts[context]
        return count / total
    
    def generate(self, max_words=20):
        """Generate text using the model"""
        context = ['<START>'] * (self.n - 1)
        result = []
        
        for _ in range(max_words):
            # Get possible next words
            context_tuple = tuple(context[-(self.n-1):])
            
            if context_tuple not in self.ngrams:
                break
            
            # Choose next word based on probabilities
            next_words = self.ngrams[context_tuple]
            next_word = random.choices(
                list(next_words.keys()),
                weights=list(next_words.values())
            )[0]
            
            if next_word == '<END>':
                break
            
            result.append(next_word)
            context.append(next_word)
        
        return ' '.join(result)


def demo_ngrams():
    """Show n-grams in action"""
    
    # Training data: Shakespeare quotes
    corpus = """
    To be or not to be, that is the question.
    All the world's a stage, and all the men and women merely players.
    To be or not to be, to thine own self be true.
    The course of true love never did run smooth.
    All that glitters is not gold.
    What's in a name? A rose by any other name would smell as sweet.
    """
    
    print("="*60)
    print("N-GRAM LANGUAGE MODELS")
    print("="*60)
    print(f"\nTraining corpus:\n{corpus[:200]}...\n")
    
    # Train different n-gram models
    for n in [2, 3]:
        print(f"\n{'='*60}")
        print(f"{n}-GRAM MODEL (context = {n-1} words)")
        print("="*60)
        
        model = NGramModel(n=n)
        model.train(corpus)
        
        # Show some probabilities
        if n == 2:
            context = ['to']
            print(f"\nProbabilities after '{context[0]}':")
            for word, count in model.ngrams[tuple(context)].most_common(5):
                prob = model.probability(context, word)
                print(f"  P({word} | {context[0]}) = {prob:.3f}")
        
        # Generate text
        print(f"\nGenerated text:")
        for i in range(3):
            print(f"  {i+1}. {model.generate()}")
demo_ngrams()

N-GRAM LANGUAGE MODELS

Training corpus:

    To be or not to be, that is the question.
    All the world's a stage, and all the men and women merely players.
    To be or not to be, to thine own self be true.
    The course of true love nev...


2-GRAM MODEL (context = 1 words)

Probabilities after 'to':
  P(be | to) = 0.800
  P(thine | to) = 0.200

Generated text:
  1. to thine own self be true . all that is not to be , to be , and all the
  2. to be true . all the question . the question . what 's a rose by any other name ?
  3. to be , that is not to be true love never did run smooth .

3-GRAM MODEL (context = 2 words)

Generated text:
  1. to be or not to be , that is the question . all that glitters is not gold . what
  2. to be or not to be or not to be , that is the question . all the world 's
  3. to be , to thine own self be true . the course of true love never did run smooth .


**Observations:**

- Bigrams capture local patterns
- Trigrams are more coherent but need more data
- Output is still often nonsensical; the model has no real understanding
- Can only use patterns seen in training (data sparsity problem)


### Activity

Experiment with your own n-gram model:

[n-gram](./n-gram-experiment.py)


----

N-grams predict the next word. But what if we want to classify an entire document, like 'is this review positive or negative?' We need a different approach.


#### Bag of Words & Classification

From sequences to documents:

One-hot encode each token and sum the vectors across the document.
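The "one-hot, then sum" recipe can be sketched in a few lines of plain Python. The function names `one_hot` and `bag_of_words` and the three-word vocabulary are hypothetical, chosen only for this example; `CountVectorizer` in scikit-learn produces the same kind of count vector with its own tokenization rules.

```python
def one_hot(word, vocab):
    """Vector with a 1 at the word's vocabulary position and 0 elsewhere."""
    return [1 if word == v else 0 for v in vocab]


def bag_of_words(tokens, vocab):
    """Sum the one-hot vectors of all tokens in a document."""
    vector = [0] * len(vocab)
    for token in tokens:
        for i, value in enumerate(one_hot(token, vocab)):
            vector[i] += value
    return vector


vocab = ["bites", "dog", "man"]  # hypothetical toy vocabulary
print(bag_of_words("dog bites man".split(), vocab))
print(bag_of_words("man bites dog".split(), vocab))
```

Both sentences map to the same vector, because summing discards word order.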


In [9]:
# BAG OF WORDS: Representing documents as word counts
# Ignore order, just count presence

def demo_bow():
    """Visualize bag of words representation"""

    from sklearn.feature_extraction.text import CountVectorizer
    
    documents = [
        "I love this movie.",
        "I hate this movie.",
        "this is a great movie, I love it!",
        "this is a terrible movie, I hate it!"
    ]

    vectorizer = CountVectorizer()
    X = vectorizer.fit_transform(documents)

    vocab = vectorizer.get_feature_names_out()
    print(f"Vocabulary: len(vocab) = {len(vocab)}")
    print(sorted(vocab))

    print("Document Vectors (BoW)")

    import pandas as pd
    df = pd.DataFrame(X.toarray(), columns=vocab)
    print(df)

    
    
demo_bow()

Vocabulary: len(vocab) = 8
['great', 'hate', 'is', 'it', 'love', 'movie', 'terrible', 'this']
Document Vectors (BoW)
   great  hate  is  it  love  movie  terrible  this
0      0     0   0   0     1      1         0     1
1      0     1   0   0     0      1         0     1
2      1     0   1   1     1      1         0     1
3      0     1   1   1     0      1         1     1


In [10]:
"""
SENTIMENT CLASSIFICATION using Bag of Words
Classify movie reviews as positive/negative
"""

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, classification_report

def demo_classifier():
    """Build and evaluate a sentiment classifier"""
    
    # Sample data (in reality, you'd use thousands of reviews)
    reviews = [
        # Positive
        "This movie was excellent and amazing",
        "I loved every minute of this film",
        "Great acting and wonderful story",
        "Fantastic movie, highly recommend",
        "Best film I've seen this year",
        "Brilliant and entertaining",
        "Absolutely loved it, great cast",
        "Wonderful cinematography and plot",
        # Negative
        "This movie was terrible and boring",
        "I hated every minute of this film",
        "Bad acting and awful story",
        "Worst movie, don't watch",
        "Terrible film I've seen this year",
        "Horrible and disappointing",
        "Absolutely hated it, bad cast",
        "Terrible cinematography and plot",
    ]
    
    labels = [1, 1, 1, 1, 1, 1, 1, 1,  # positive
              0, 0, 0, 0, 0, 0, 0, 0]  # negative
    
    print("="*60)
    print("SENTIMENT CLASSIFICATION")
    print("="*60)
    
    # Convert to BoW
    vectorizer = CountVectorizer()
    X = vectorizer.fit_transform(reviews)
    
    print(f"\nVocabulary size: {len(vectorizer.get_feature_names_out())}")
    print(f"Number of reviews: {len(reviews)}")
    print(f"Feature matrix shape: {X.shape}")
    
    # Train classifier
    X_train, X_test, y_train, y_test = train_test_split(
        X, labels, test_size=0.3, random_state=42
    )
    
    classifier = LogisticRegression()
    classifier.fit(X_train, y_train)
    
    # Evaluate
    y_pred = classifier.predict(X_test)
    accuracy = accuracy_score(y_test, y_pred)
    
    print(f"\nAccuracy: {accuracy:.2%}")
    
    # Show most important features
    feature_names = vectorizer.get_feature_names_out()
    coefficients = classifier.coef_[0]
    
    print("\n" + "="*60)
    print("MOST IMPORTANT WORDS")
    print("="*60)
    
    # Top positive words
    top_positive_idx = coefficients.argsort()[-5:][::-1]
    print("\nPositive indicators:")
    for idx in top_positive_idx:
        print(f"  {feature_names[idx]:15} → {coefficients[idx]:+.3f}")
    
    # Top negative words
    top_negative_idx = coefficients.argsort()[:5]
    print("\nNegative indicators:")
    for idx in top_negative_idx:
        print(f"  {feature_names[idx]:15} → {coefficients[idx]:+.3f}")
    
    # Test on new examples
    print("\n" + "="*60)
    print("TESTING ON NEW EXAMPLES")
    print("="*60)
    
    test_examples = [
        "This film was great",
        "Absolutely terrible",
        "I enjoyed it",
    ]
    
    X_new = vectorizer.transform(test_examples)
    predictions = classifier.predict(X_new)
    probabilities = classifier.predict_proba(X_new)
    
    for text, pred, prob in zip(test_examples, predictions, probabilities):
        sentiment = "POSITIVE" if pred == 1 else "NEGATIVE"
        confidence = prob[pred]
        print(f"\n'{text}'")
        print(f"  → {sentiment} ({confidence:.1%} confident)")
    
    # NOW THE PROBLEM!
    print("\n" + "="*60)
    print("⚠️  THE SYNONYM PROBLEM")
    print("="*60)
    
    problem_examples = [
        ("This movie was great", "seen 'great'"),
        ("This film was excellent", "never saw 'excellent'!"),
        ("I loved this movie", "seen 'loved'"),
        ("I adored this film", "never saw 'adored'!"),
    ]
    
    print("\nWatch what happens with synonyms:")
    X_problem = vectorizer.transform([ex[0] for ex in problem_examples])
    predictions = classifier.predict(X_problem)
    probabilities = classifier.predict_proba(X_problem)
    
    for (text, note), pred, prob in zip(problem_examples, predictions, probabilities):
        sentiment = "POSITIVE" if pred == 1 else "NEGATIVE"
        confidence = prob[pred]
        print(f"\n'{text}' ({note})")
        print(f"  → {sentiment} ({confidence:.1%})")
    
    print("\n" + "="*60)
    print("KEY INSIGHT")
    print("="*60)
    print("""
    The model treats 'great', 'excellent', 'wonderful' as
    completely separate dimensions!
    """)

demo_classifier()

SENTIMENT CLASSIFICATION

Vocabulary size: 39
Number of reviews: 16
Feature matrix shape: (16, 39)

Accuracy: 20.00%

MOST IMPORTANT WORDS

Positive indicators:
  wonderful       → +0.639
  great           → +0.493
  best            → +0.413
  recommend       → +0.311
  fantastic       → +0.311

Negative indicators:
  terrible        → -0.771
  awful           → -0.307
  bad             → -0.307
  worst           → -0.276
  watch           → -0.276

TESTING ON NEW EXAMPLES

'This film was great'
  → NEGATIVE (51.6% confident)

'Absolutely terrible'
  → NEGATIVE (64.0% confident)

'I enjoyed it'
  → POSITIVE (54.9% confident)

⚠️  THE SYNONYM PROBLEM

Watch what happens with synonyms:

'This movie was great' (seen 'great')
  → NEGATIVE (52.0%)

'This film was excellent' (never saw 'excellent'!)
  → NEGATIVE (63.5%)

'I loved this movie' (seen 'loved')
  → NEGATIVE (55.0%)

'I adored this film' (never saw 'adored'!)
  → NEGATIVE (59.7%)

KEY INSIGHT


#### Observations

- Each document is now a vector of word counts
- Same length (vocabulary size)
- Can now use machine learning
    
- Lost word order: "dog bites man" = "man bites dog"
- Lost syntax: "not good" looks like "good"
- Every word is a separate dimension

**1. TEXT PREPROCESSING**

- Tokenization: Breaking text into units
- Normalization: Stemming/lemmatization
- These are fundamental to all NLP
    
**2. N-GRAMS**

- Model probability of word sequences
- Predict next word
- Can't generalize to unseen sequences
- Data sparsity problem
    
**3. BAG OF WORDS + CLASSIFICATION**
- Represent documents as vectors
- Train classifiers (logistic regression)
- Each word is separate dimension
- Synonyms are invisible
- Vocabulary explosion (50,000+ dimensions)



The current representation does not treat 'car' and 'automobile' as the same thing:

'car'        → [0, 0, 1, 0, 0, ...]  (50,000 dimensions)
'automobile' → [0, 1, 0, 0, 0, ...]  (completely different!)
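A quick way to see why this matters: the cosine similarity between two different one-hot vectors is always zero, so the representation carries no notion of "close in meaning". The sketch below uses a hypothetical three-word vocabulary and a hand-written `cosine` helper (both assumptions for this example) instead of the 50,000-dimensional case.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


vocab = ["apple", "automobile", "car"]  # hypothetical toy vocabulary
car = [1 if w == "car" else 0 for w in vocab]
automobile = [1 if w == "automobile" else 0 for w in vocab]

print(cosine(car, automobile))  # different one-hot vectors never overlap
print(cosine(car, car))         # a vector compared with itself
```

Any pair of distinct words scores the same, whether they are synonyms or unrelated.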

----



#### Distributional Hypothesis

Words that appear in similar contexts tend to have similar meanings.

*"You shall know a word by the company it keeps" - Firth, 1957*

In [11]:
# THE DISTRIBUTIONAL HYPOTHESIS
from collections import defaultdict, Counter
from nltk.tokenize import word_tokenize
import nltk

nltk.download('punkt', quiet=True)

def demo_distributional_hypothesis():
    """Show how context reveals meaning"""
    
    # A classic example: an unfamiliar word ('ongchoi') seen in familiar contexts
    corpus = """
    Ongchoi is delicious sauteed with garlic.
    Ongchoi is superb over rice.
    Ongchoi leaves with salty sauces are great.
    Spinach sauteed with garlic over rice is delicious.
    Chard stems and leaves are delicious.
    Collard greens and other salty leafy greens are healthy.
    """
    
    print("="*60)
    print("THE DISTRIBUTIONAL HYPOTHESIS")
    print("="*60)
    
    print("\nScenario: You've never heard of 'ongchoi'")
    print("But you see these sentences...\n")
    
    # Show ongchoi contexts
    for line in corpus.strip().split('\n')[:3]:
        if 'ongchoi' in line.lower():
            print(f"  • {line.strip()}")
    
    print("\nWhat words appear near 'ongchoi'?")
    
    # Count context words
    def get_context_words(corpus, target_word, window=3):
        """Get words that appear near target word"""
        context_words = Counter()
        
        for sentence in corpus.lower().split('.'):
            tokens = word_tokenize(sentence)
            
            for i, token in enumerate(tokens):
                if token == target_word:
                    # Get window around target
                    start = max(0, i - window)
                    end = min(len(tokens), i + window + 1)
                    
                    for j in range(start, end):
                        if j != i:  # Skip the target itself
                            context_words[tokens[j]] += 1
        
        return context_words
    
    ongchoi_context = get_context_words(corpus, 'ongchoi')
    
    print("\nWords near 'ongchoi':")
    for word, count in ongchoi_context.most_common(10):
        print(f"  {word:15} → {count} times")
    
    # Compare to similar words
    print("\n" + "="*60)
    print("COMPARE TO KNOWN WORDS")
    print("="*60)
    
    spinach_context = get_context_words(corpus, 'spinach')
    chard_context = get_context_words(corpus, 'chard')
    
    print("\nWords near 'spinach':")
    for word, count in spinach_context.most_common(5):
        print(f"  {word:15} → {count} times")
    
    print("\nWords near 'chard':")
    for word, count in chard_context.most_common(5):
        print(f"  {word:15} → {count} times")
    
    print("\n" + "="*60)
    print("CONCLUSION")
    print("="*60)
    print("""
    ongchoi, spinach, and chard all appear with:
      • delicious, sauteed, garlic, rice, leaves
    
    Even without knowing what 'ongchoi' is, we can infer:
      → It's probably a leafy green vegetable!
    
    This is the DISTRIBUTIONAL HYPOTHESIS:
      Words with similar contexts have similar meanings.
    
    (It's actually Ipomoea aquatica, also called water spinach!)
    """)

demo_distributional_hypothesis()

THE DISTRIBUTIONAL HYPOTHESIS

Scenario: You've never heard of 'ongchoi'
But you see these sentences...

  • Ongchoi is delicious sauteed with garlic.
  • Ongchoi is superb over rice.
  • Ongchoi leaves with salty sauces are great.

What words appear near 'ongchoi'?

Words near 'ongchoi':
  is              → 2 times
  delicious       → 1 times
  sauteed         → 1 times
  superb          → 1 times
  over            → 1 times
  leaves          → 1 times
  with            → 1 times
  salty           → 1 times

COMPARE TO KNOWN WORDS

Words near 'spinach':
  sauteed         → 1 times
  with            → 1 times
  garlic          → 1 times

Words near 'chard':
  stems           → 1 times
  and             → 1 times
  leaves          → 1 times

CONCLUSION

    ongchoi, spinach, and chard all appear with:
      • delicious, sauteed, garlic, rice, leaves

    Even without knowing what 'ongchoi' is, we can infer:
      → It's probably a leafy green vegeta

#### Co-occurrence vectors

A co-occurrence vector represents a word based on how frequently it appears together with other words in a fixed context window within a corpus. Each dimension corresponds to a word in the vocabulary, and the value in that dimension is the count (or weighted count) of how often the target word appears near that word.

In [None]:

# WORD VECTORS: Representing words by their context
# Building co-occurrence matrices


import numpy as np
from collections import defaultdict, Counter
from nltk.tokenize import word_tokenize
import nltk

nltk.download('punkt', quiet=True)

def demo_cooccurrence_matrix():
    """Build a simple word-word co-occurrence matrix"""
    
    corpus = """
    I love this movie. This movie is great.
    I hate that film. That film is terrible.
    Great movie, I love it.
    Terrible film, I hate it.
    """ * 3  # Repeat for more data
    
    print("="*60)
    print("CO-OCCURRENCE MATRIX")
    print("="*60)
    
    # Tokenize
    all_tokens = word_tokenize(corpus.lower())
    
    # Build vocabulary (top words)
    word_counts = Counter(all_tokens)
    vocab = [word for word, count in word_counts.most_common(15) 
             if word.isalpha()]  # Skip punctuation
    
    print(f"\nVocabulary: {vocab}\n")
    
    # Build co-occurrence matrix
    window_size = 2
    cooccur = defaultdict(Counter)
    
    for i, word in enumerate(all_tokens):
        if word in vocab:
            # Look at window around word
            start = max(0, i - window_size)
            end = min(len(all_tokens), i + window_size + 1)
            
            for j in range(start, end):
                if i != j and all_tokens[j] in vocab:
                    cooccur[word][all_tokens[j]] += 1
    
    # Display as matrix
    import pandas as pd
    
    matrix = np.zeros((len(vocab), len(vocab)))
    for i, word1 in enumerate(vocab):
        for j, word2 in enumerate(vocab):
            matrix[i, j] = cooccur[word1][word2]
    
    df = pd.DataFrame(matrix, index=vocab, columns=vocab)
    print("Co-occurrence counts (window=2):")
    print(df)
    
    # Show vectors for specific words
    print("\n" + "="*60)
    print("WORD VECTORS")
    print("="*60)
    
    for word in ['love', 'hate', 'movie', 'film']:
        if word in vocab:
            vector = df.loc[word]
            print(f"\nVector for '{word}':")
            print(vector[vector > 0].to_dict())
    
    # Compute similarity
    print("\n" + "="*60)
    print("COSINE SIMILARITY")
    print("="*60)
    
    def cosine_similarity(vec1, vec2):
        """Compute cosine similarity between two vectors"""
        dot = np.dot(vec1, vec2)
        norm1 = np.linalg.norm(vec1)
        norm2 = np.linalg.norm(vec2)
        
        if norm1 == 0 or norm2 == 0:
            return 0
        return dot / (norm1 * norm2)
    
    pairs = [
        ('love', 'hate'),
        ('movie', 'film'),
        ('great', 'terrible'),
    ]
    
    print("\nSimilarity scores:")
    for word1, word2 in pairs:
        if word1 in vocab and word2 in vocab:
            vec1 = df.loc[word1].values
            vec2 = df.loc[word2].values
            sim = cosine_similarity(vec1, vec2)
            print(f"  {word1:10} ↔ {word2:10}: {sim:.3f}")
        

demo_cooccurrence_matrix()


### OBSERVATIONS

- 'movie' and 'film' have high similarity (synonyms!)
- Each word is now a vector, not just a one-hot encoding
- Still SPARSE (mostly zeros)
- Still high dimensional (vocabulary size)


# Dense static embeddings

Static embeddings assign exactly one fixed vector to each word, regardless of where or how it is used.


**Word2Vec**

- Predictive model
- Learns embeddings by predicting surrounding words
- Local context focused

**Skip-gram**
Predicts the context words from a given word.


**CBOW**
Predicts a word from its context (Continuous Bag of Words).
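The difference between the two objectives is easiest to see in the training examples each one builds from a sentence. The sketch below only constructs those (input, target) pairs; it does not train anything and is not how gensim is implemented internally. The helper names `context_window`, `skipgram_pairs`, and `cbow_pairs` are hypothetical.

```python
def context_window(tokens, i, window):
    """Tokens within `window` positions of index i, excluding tokens[i]."""
    left = tokens[max(0, i - window):i]
    right = tokens[i + 1:i + 1 + window]
    return left + right


def skipgram_pairs(tokens, window=2):
    """(center, context) pairs: the center word is the input."""
    return [(tokens[i], c)
            for i in range(len(tokens))
            for c in context_window(tokens, i, window)]


def cbow_pairs(tokens, window=2):
    """(context_words, center) pairs: the context is the input."""
    return [(context_window(tokens, i, window), tokens[i])
            for i in range(len(tokens))]


sentence = ["i", "love", "this", "movie"]
print(skipgram_pairs(sentence, window=1))
print(cbow_pairs(sentence, window=1))
```

Skip-gram yields one example per (word, neighbour) pair, while CBOW yields one example per position with all neighbours grouped together as the input.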


In [None]:
! pip3 install gensim

In [None]:
from gensim.models import Word2Vec

documents = [
    ["i", "love", "this", "movie"],
    ["this", "movie", "is", "great"],
    ["i", "hate", "that", "film"],
    ["that", "film", "is", "terrible"],
    ["great", "movie", "i", "love", "it"],
    ["terrible", "film", "i", "hate", "it"],

    # Diverse movie contexts
    ["movie", "director", "actor", "screenplay"],
    ["film", "cinematography", "editing", "soundtrack"],
    ["watch", "movie", "theater"],
    ["movie", "boring", "slow"],
    ["movie", "exciting", "thrilling"],
    ["film", "award", "festival"],
    ["movie", "story", "plot"],
    ["film", "critic", "review"],

    # Rare words
    ["cinematography"],
    ["blockbuster"],
    ["arthouse", "film"],
    ["independent", "movie"]
]

model = Word2Vec(
    sentences=documents, 
    vector_size=50, 
    window=2, 
    min_count=1, 
    sg=1,  # Skip-gram model
    epochs=100,
    seed=42
    )

vector = model.wv['movie']
print(vector.shape)
print("Vector for 'movie':", vector)
print("\nSimilar words to 'movie':", model.wv.most_similar('movie', topn=5))

cbow_model = Word2Vec(
    sentences=documents, 
    vector_size=50, 
    window=2, 
    min_count=1, 
    sg=0,   # CBOW model
    epochs=100,
    seed=99
)

print("\nCBOW vector shape:", cbow_model.wv['movie'].shape)
print("CBOW similar words to 'movie':")
print(cbow_model.wv.most_similar('movie', topn=5))


In [None]:
import numpy as np

sg_vec = model.wv['movie']
cbow_vec = cbow_model.wv['movie']

# Cosine similarity between Skip-gram and CBOW vectors
cos_sim = np.dot(sg_vec, cbow_vec) / (
    np.linalg.norm(sg_vec) * np.linalg.norm(cbow_vec)
)

print("\nCosine similarity between Skip-gram and CBOW vectors for 'movie':")
print(f"{cos_sim:.3f}")


**GloVe**

- Count-based + matrix factorization
- Uses global co-occurrence statistics

Unlike sparse co-occurrence vectors, dense embeddings can capture semantic similarity even when two words rarely share exact context words.

In [None]:
import gensim.downloader as api

embeddings = api.load("glove-twitter-25")

print(len(embeddings))            # vocabulary size
print(embeddings.vector_size)     # embedding dimension
print(embeddings["computer"][:5])  # first 5 values of the dense vector


In [None]:
print(embeddings.similarity("movie", "film"))
print(embeddings.similarity("movie", "pizza"))

In [None]:
embeddings.most_similar(
    positive=["king", "woman"],
    negative=["man"],
    topn=4
)


Let's consider the word "bank" in two sentences:

1. I sat by the river bank
2. I deposited money at the bank

- Static embeddings assign "bank" the same vector in both sentences
- Polysemy therefore leads to a loss of context
- Sentence meaning is not the same as word meaning, and static embeddings operate at the word level

----

#### Contextual Embeddings

- Unlike static embeddings (Word2Vec, GloVe), contextual embeddings change depending on the sentence.

Example:

- “I went to the bank to deposit money” → bank vector reflects financial meaning.
- “The river overflowed the bank” → bank vector reflects riverbank meaning.
- Captures polysemy (words with multiple meanings) naturally.


**Dimensionality:**

- Determined by the model architecture.
- Example: all-MiniLM-L6-v2 outputs 384-dimensional vectors for each sentence.
- It is fixed by the final hidden layer size of the transformer.


**reference:**

- [Embedding Leaderboard](https://huggingface.co/spaces/mteb/leaderboard)
- [sentence transformers](https://sbert.net)



In [None]:
! pip3 install sentence-transformers tf-keras

In [None]:
from sentence_transformers import SentenceTransformer, util

# Load pre-trained model
model = SentenceTransformer('all-MiniLM-L6-v2')

# Example sentences
sentences = [
    "I love playing football.",
    "Soccer is my favorite sport.",
    "I enjoy reading books."
]

# Generate embeddings
embeddings = model.encode(sentences)

# Show dimensionality
print("Embedding shape:", embeddings[0].shape)  # e.g., (384,)

# Compute cosine similarity between sentences
similarity_0_1 = util.cos_sim(embeddings[0], embeddings[1])
similarity_0_2 = util.cos_sim(embeddings[0], embeddings[2])

print(f"Similarity between sentence 0 and 1: {similarity_0_1.item():.3f}")
print(f"Similarity between sentence 0 and 2: {similarity_0_2.item():.3f}")
