# Introduction to Natural Language Processing (NLP)

Natural Language Processing, or NLP, is a field at the intersection of computer science, artificial intelligence, and linguistics. It develops algorithms and systems that enable computers to analyze, understand, and generate human language, and to derive useful meaning from it.

NLP helps resolve ambiguity in language and adds useful numeric structure to the data for many downstream applications, such as speech recognition or text analytics.

![nlp](../assets/nlp.png)

## Challenges in NLP

### Part-of-Speech Tagging

![pos](../assets/pos.jpeg)

Part-of-speech (POS) tagging is the process of assigning a part of speech to each word in a text, such as noun, verb, adjective, etc. This is challenging because:

- **Ambiguity**: A word can have multiple parts of speech based on the context. For example, "book" can be a noun ("I read a book") or a verb ("Book a table").
- **Contextual Use**: Words may be used in a figurative sense, which can confuse POS taggers.
- **New Words**: New words, slang, and jargon keep emerging, and POS taggers need regular updates to handle them.

### Text Segmentation

Text segmentation involves dividing text into meaningful units, such as sentences or topics. Challenges include:

![senseg](../assets/sentence_segmentation.jpeg)

- **Sentence Boundary Detection**: Punctuation marks like periods can be used for abbreviations, decimals, etc., and not always to end sentences.
- **Tokenization**: Different languages and scripts have different tokenization rules, and some don't use whitespace.
- **Topic Segmentation**: Identifying topic shifts in a text requires understanding of the content, which is a non-trivial task.
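
To make the sentence-boundary problem concrete, here is a minimal sketch that compares a naive split on end punctuation with a slightly more careful rule. The abbreviation list and both function names are assumptions made for this illustration; practical segmenters such as NLTK's Punkt learn these cases from data instead of using a fixed list.

```python
import re

# Hypothetical, deliberately tiny abbreviation list (an assumption for this sketch).
ABBREVIATIONS = {"dr.", "mr.", "mrs.", "e.g.", "i.e.", "etc."}

def naive_split(text):
    """Split at whitespace that follows '.', '!' or '?'."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

def careful_split(text):
    """Like naive_split, but do not break after a known abbreviation."""
    sentences, current = [], []
    for token in text.split():
        current.append(token)
        ends_sentence = token[-1] in ".!?" and token.lower() not in ABBREVIATIONS
        if ends_sentence:
            sentences.append(" ".join(current))
            current = []
    if current:  # keep any trailing text without end punctuation
        sentences.append(" ".join(current))
    return sentences

text = "Dr. Smith paid 3.50 dollars. He left early."
print(naive_split(text))    # breaks after "Dr."
print(careful_split(text))  # keeps "Dr. Smith ..." together
```

The decimal point in `3.50` causes no trouble for either function because it is not followed by whitespace; the abbreviation is what separates the two results.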

### Word Sense Disambiguation

Word sense disambiguation is the task of determining which sense of a word is active in a given context. Challenges include:

- **Polysemy**: Many words have multiple meanings, and identifying the correct one is difficult without deep understanding.
- **Limited Context**: Sometimes the surrounding text is not enough to determine the word sense.
- **Lack of Resources**: For less-resourced languages, there might not be enough data to train disambiguation systems.

> - Many plants and animals live in the rainforest.
> - The manufacturing plant produced widgets.
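
One classic way to approach the two "plant" sentences above is a simplified Lesk-style overlap: pick the sense whose dictionary gloss shares the most words with the sentence. The sketch below assumes a toy sense inventory whose glosses were written for this example (they are not taken from WordNet), so it only illustrates the idea.

```python
# Toy sense inventory for "plant"; the glosses are assumptions written for this sketch.
SENSES = {
    "plant (organism)": "a living organism such as a tree or flower that grows in soil "
                        "in a forest or rainforest with animals",
    "plant (factory)": "a building or site where manufacturing takes place and goods "
                       "such as widgets are produced",
}

STOPWORDS = {"a", "an", "the", "and", "or", "in", "of", "as", "such", "that", "are", "with", "where"}

def content_words(text):
    """Lowercase, strip simple punctuation, and drop stopwords."""
    words = (w.strip(".,").lower() for w in text.split())
    return {w for w in words if w and w not in STOPWORDS}

def simplified_lesk(sentence, senses=SENSES):
    """Return the sense whose gloss shares the most content words with the sentence."""
    context = content_words(sentence)
    return max(senses, key=lambda s: len(context & content_words(senses[s])))

print(simplified_lesk("Many plants and animals live in the rainforest."))
print(simplified_lesk("The manufacturing plant produced widgets."))
```

With so little context the method depends entirely on the wording of the glosses, which mirrors the "Limited Context" and "Lack of Resources" challenges listed above.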

### Syntax Disambiguation

Syntax disambiguation is the task of choosing the intended grammatical structure when the words of a sentence can be combined in more than one way. Challenges here include:

- **Structural Ambiguity**: Sentences can often be parsed in multiple ways ("I saw the man with the telescope").
- **Complex Constructions**: Some languages have free word order or allow for nested clauses, making parsing difficult.
- **Idiomatic Expressions**: Phrases that don't follow standard syntax rules can confuse parsers.

> - Annie hit a man with an umbrella.
> - I shot an elephant in my pyjamas.
> - The tourist saw the woman with a telescope.

### Imperfect or Irregular Input

Language is often messy and unpredictable, leading to challenges such as:

- **Typos and Spelling Errors**: Mistakes in writing can lead to misinterpretation by NLP systems.
- **Non-standard Language**: Use of slang, abbreviations, and non-standard grammar can be problematic.
- **Multilingual Text**: Text containing multiple languages can complicate processing.
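
As a small illustration of handling typos, the sketch below maps a misspelled word to its closest entry in a vocabulary using the standard library's `difflib.get_close_matches`. The five-word vocabulary, the `correct` helper, and the `0.8` cutoff are assumptions chosen for the example; a real spell checker would use a full lexicon and word frequencies.

```python
import difflib

# Tiny illustrative vocabulary (an assumption; a real system would use a full lexicon).
VOCABULARY = ["receive", "language", "processing", "tomorrow", "definitely"]

def correct(word, vocabulary=VOCABULARY):
    """Return the closest vocabulary entry, or the word unchanged if nothing is close."""
    matches = difflib.get_close_matches(word.lower(), vocabulary, n=1, cutoff=0.8)
    return matches[0] if matches else word

print([correct(w) for w in ["recieve", "langauge", "tommorow", "xyz"]])
```

Words with no sufficiently similar vocabulary entry (such as `xyz`) are passed through unchanged rather than being forced onto a wrong correction.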

## Applications of NLP

- **Text Classification**: Assigning categories or labels to text, such as spam detection in email services.
- **Machine Translation**: Translating text from one language to another, like Google Translate.
- **Sentiment Analysis**: Identifying the sentiment of text, used in social media monitoring and market research.
- **Chatbots and Virtual Assistants**: Powering conversational agents like Siri, Alexa, and customer service bots.
- **Information Extraction**: Extracting structured information from unstructured text, such as named entity recognition.
- **Summarization**: Generating a shortened version of a text, retaining its most important information.
- **Speech Recognition**: Translating spoken language into text, used in voice user interfaces.
- **Question Answering**: Building systems that automatically answer questions posed by humans in natural language, such as ChatGPT.

## Brief History of NLP

The history of NLP generally starts in the 1950s, although work can be found from earlier periods. In 1950, Alan Turing published an article titled "Computing Machinery and Intelligence", which proposed what is now called the Turing test as a criterion of intelligence. Passing the test requires a machine to understand and generate natural language, which remains a central goal of NLP.

### Milestones in the History of NLP

- **1950s**: The era of symbolic NLP, rule-based systems that tried to encode human knowledge and grammar rules into computers.
- **1960s**: Development of the first chatbot, ELIZA, and further work on machine translation.
- **1970s-1980s**: The rise of computational linguistics and the development of more sophisticated models for handling syntax and semantics.
- **1990s**: Introduction of statistical NLP, leveraging large amounts of data and statistical methods to process language.
- **2000s**: The emergence of machine learning in NLP, with systems beginning to learn from data rather than relying on hand-coded rules.
- **2010s-Present**: The rise of deep learning has revolutionized NLP, leading to the development of models like BERT and GPT that can handle complex language tasks with unprecedented accuracy.

# Why Learn Older NLP Techniques? (LLM-Generated Notes)

<details>

<summary>

**The fundamental trade-off:** accuracy versus resources.

Transformers (BERT, GPT, etc.) can in principle perform nearly **all** the tasks that older NLP techniques do, generally with **better accuracy**, but at a cost. Each item below reads "older methods vs transformers"; the figures throughout this section are rough, illustrative orders of magnitude rather than measured benchmarks:

- **Computing resources** (CPU vs GPU)
- **Speed** (milliseconds vs seconds)
- **Memory** (MBs vs GBs)
- **Cost** (pennies vs dollars)
- **Energy** (watts vs kilowatts)

</summary>

## The Trade-off Table

| Task | Old Method | Transformer | Accuracy Gain | Speed Loss |
|------|------------|-------------|---------------|------------|
| Language detection | FastText: 1ms | BERT: 50ms | Minimal (both ~99%) | 50x slower |
| Sentiment analysis | Word2Vec+Classifier: 5ms | BERT: 100ms | +15% accuracy | 20x slower |
| Text classification | FastText: 2ms | BERT: 80ms | +10-15% | 40x slower |
| Named Entity Recognition | CRF: 10ms | BERT: 150ms | +20% accuracy | 15x slower |
| Search (1M docs) | TF-IDF: 20ms | BERT: 30 minutes | Much better relevance | 90,000x slower |

## Why the Big Speed Difference?

**FastText:**
```
Input: "I love dogs"
→ Lookup 3 embeddings in table: 0.1ms
→ Average them: 0.01ms
→ Classifier: 0.1ms
Total: ~0.2ms (on CPU)
```

**BERT:**
```
Input: "I love dogs"
→ Tokenization: 1ms
→ Embedding lookup: 1ms
→ 12 transformer layers with attention: 95ms (on GPU!)
→ Classifier: 3ms
Total: ~100ms (needs GPU)
```

## The Precision Difference

**Example: Sentiment Analysis**

**FastText approach:**
```
"This movie is not bad"
→ Average: (vec("this") + vec("movie") + vec("is") + vec("not") + vec("bad")) / 5
→ Heavy influence from vec("bad") (negative)
→ Prediction: Negative ❌ (WRONG!)
Confidence: 65%
```

**BERT approach:**
```
"This movie is not bad"
→ Attention mechanism understands "not" modifies "bad"
→ Contextualized representation captures negation
→ Prediction: Positive ✓ (CORRECT!)
Confidence: 92%
```
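
The weakness of averaging in the FastText example can be shown directly: an average of word vectors is the same for any ordering of the words, so it cannot tell which word "not" modifies. The sketch below uses made-up 2-dimensional vectors (an assumption; real embeddings are learned and have hundreds of dimensions). Note that fastText can optionally add word n-gram features, which partly mitigates this.

```python
import math

# Hypothetical 2-dimensional "embeddings" chosen only for illustration.
TOY_VECTORS = {
    "this": (0.0, 0.0),
    "movie": (0.25, 0.0),
    "is": (0.0, 0.0),
    "not": (-0.25, 0.25),
    "bad": (-1.0, 0.5),
}

def average_vector(sentence):
    """Average the word vectors of a sentence; word order plays no role."""
    vectors = [TOY_VECTORS[w] for w in sentence.lower().split()]
    return tuple(sum(dim) / len(vectors) for dim in zip(*vectors))

original = average_vector("this movie is not bad")
shuffled = average_vector("bad movie this is not")

print(original)
print(all(math.isclose(a, b) for a, b in zip(original, shuffled)))  # same representation
```

Because both orderings collapse to the same vector, any classifier on top of the average sees identical input for them.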

**Illustrative benchmark results:**
- FastText: 82% accuracy
- BERT: 94% accuracy
- **+12 percentage points** (but 50x slower)

## Resource Requirements

**Running 1 Million Classifications:**

**FastText:**
- Hardware: Basic CPU server ($50/month)
- Time: ~2 minutes
- Energy: ~0.1 kWh
- Cost: ~$0.01

**BERT:**
- Hardware: GPU server ($500/month) or cloud GPU
- Time: ~28 hours (or parallelize on many GPUs)
- Energy: ~50 kWh
- Cost: ~$20-100

**That's a 2,000-10,000x cost difference!**

## When the Precision Gain Matters

### **Critical Applications (Use Transformer)**

**Medical diagnosis from text:**
- "Patient does not show signs of infection" vs "Patient shows signs of infection"
- FastText might miss "not" → dangerous!
- BERT's precision worth the cost

**Legal document analysis:**
- "The contract is not binding" vs "The contract is binding"
- 18% error rate vs 6% error rate (82% vs 94% accuracy) = significant legal risk

**Financial fraud detection:**
- Missing fraud costs millions
- Extra precision worth extra compute

### **Non-Critical Applications (Use FastText)**

**Spam filtering:**
- 82% accuracy catches most spam
- Users can report misses
- Speed matters more (process millions instantly)

**Content categorization:**
- "Is this Sports or Politics?"
- FastText good enough for broad categories
- Small errors acceptable

**Language detection:**
- FastText already 99%+ accurate
- BERT barely improves but 50x slower

## Real Production Decision

**Company scenario:** Process 10M customer emails/day

**Option 1: All BERT**
```
Cost: $500/day in GPU compute
Accuracy: 94%
Latency: 100ms per email
Infrastructure: Complex GPU cluster
```

**Option 2: All FastText**
```
Cost: $10/day in CPU compute
Accuracy: 82%
Latency: 2ms per email
Infrastructure: Simple CPU servers
```

**Option 3: Hybrid (Smart!)**
```
Stage 1 - FastText (filter obvious cases): 80% of emails
→ Clear spam: instant classification
→ Cost: $8/day

Stage 2 - BERT (complex cases): 20% of emails
→ Ambiguous emails get deep analysis
→ Cost: $100/day

Total: $108/day, 91% accuracy, 20ms average latency
```

**Best of both worlds!**
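
The hybrid option can be sketched as a confidence-based cascade: the cheap model answers when it is confident, and only the remaining inputs go to the expensive model. The two `toy_*` classifiers and the `0.9` threshold below are hypothetical stand-ins, not real FastText or BERT calls; the arithmetic at the end simply re-derives the scenario's own cost and latency figures.

```python
def cascade(text, fast_model, slow_model, threshold=0.9):
    """Use the cheap model when it is confident; otherwise fall back to the expensive one."""
    label, confidence = fast_model(text)
    if confidence >= threshold:
        return label, "fast"
    return slow_model(text)[0], "slow"

# Hypothetical stand-ins for a FastText-style and a BERT-style classifier.
def toy_fast_model(text):
    return ("spam", 0.99) if "free money" in text.lower() else ("not spam", 0.55)

def toy_slow_model(text):
    return ("spam", 0.97) if "prize" in text.lower() else ("not spam", 0.95)

print(cascade("FREE MONEY now!!!", toy_fast_model, toy_slow_model))
print(cascade("You may have won a prize", toy_fast_model, toy_slow_model))

# Blended figures using the scenario's numbers (80% fast path, 20% slow path).
fast_share, slow_share = 0.8, 0.2
daily_cost = fast_share * 10 + slow_share * 500   # dollars per day
avg_latency = fast_share * 2 + slow_share * 100   # milliseconds per email
print(daily_cost, avg_latency)
```

The blended cost comes to about $108 per day and the average latency to roughly 22 ms, in line with the rounded numbers quoted for Option 3.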

## The "Overkill" Principle

Using BERT for everything is like:

- **Taking an airplane to go 2 blocks** (language detection)
- **Using a supercomputer to add 2+2** (simple keyword matching)
- **MRI scan for a paper cut** (obvious spam detection)

The tool is more powerful, but **wasteful** for simple tasks.

## Technical Capabilities Summary

**Can transformers do everything older NLP does?**

- ✅ Text classification - YES (better)
- ✅ Sentiment analysis - YES (much better)
- ✅ NER - YES (much better)
- ✅ Language detection - YES (marginally better)
- ✅ Search - YES (better relevance, but impractically slow at scale)
- ✅ Clustering - YES (better semantic clusters)
- ✅ Keyword extraction - YES (more context-aware)
- ✅ Similarity - YES (better semantic understanding)

**But should you use transformers for everything?**

❌ Usually NO - because of cost, speed, complexity

## The Future Trend

**What's happening now:**

1. **Model optimization:** Distilled BERT, quantization, pruning
   - Make transformers smaller and faster
   - "Can we get BERT quality at FastText speed?"

2. **Specialized hardware:** TPUs, custom AI chips
   - Make transformers cheaper to run

3. **Hybrid architectures:** Best of both worlds
   - Fast filtering + deep understanding

4. **Edge deployment:** TinyBERT, MobileBERT
   - Run simplified transformers on phones

**Eventually:** As hardware improves and models get optimized, transformers might become fast/cheap enough to replace older techniques entirely. But we're not there yet!

</details>

## Summary

- ✅ Transformers can handle most of these tasks with higher accuracy
- ✅ But they cost significantly more in compute, latency, and money
- ✅ Production systems balance accuracy against resources
- ✅ Older techniques remain valuable in the right context

This is engineering pragmatism: use the right tool for the job, not always the most powerful one.


- https://claude.ai/share/36ceef3e-030b-40f6-8abc-8eac0ff1e64b

# Text Preprocessing

Text preprocessing is a critical step in NLP. It involves preparing and cleaning text data for further analysis and modeling. The goal is to simplify the text and remove any noise that might distract the machine learning algorithms from understanding the core content.

Raw text data is often messy and unstructured, with various issues:

- Irrelevant characters and symbols
- Inconsistent formatting
- Typos and spelling errors
- Diverse languages and slang
- Stopwords (commonly used words that may not be useful in analysis)

## Tokenization

Tokenization is the process of breaking down text into smaller units, called *tokens*. Tokens can be words, numbers, or punctuation marks. It's the first step in turning unstructured text into a form that can be analyzed.

### White-space Tokenization

This is the simplest form of tokenization. It splits the text by white spaces, including spaces, tabs, and new line characters.

In [1]:
def whitespace_tokenizer(text):
    return text.split()

# Example usage:
text = "Natural language processing is fun."
tokens = whitespace_tokenizer(text)
print(tokens)

['Natural', 'language', 'processing', 'is', 'fun.']


### Punctuation-based Tokenization

This method splits on punctuation as well as white space. The regular expression below keeps only runs of word characters, so punctuation marks are discarded rather than returned as separate tokens.

In [2]:
import re

def punctuation_tokenizer(text):
    return re.findall(r'\b\w+\b', text)

# Example usage:
text = "Natural language processing is fun!"
tokens = punctuation_tokenizer(text)
print(tokens)

['Natural', 'language', 'processing', 'is', 'fun']
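
If punctuation marks should be kept as tokens of their own, the pattern can be extended with a second alternative. This is a minimal sketch; the function name `punctuation_aware_tokenizer` is an assumption for this example.

```python
import re

def punctuation_aware_tokenizer(text):
    """Return words and individual punctuation marks as separate tokens."""
    # \w+ matches a run of word characters; [^\w\s] matches one punctuation mark.
    return re.findall(r"\w+|[^\w\s]", text)

print(punctuation_aware_tokenizer("Natural language processing is fun!"))
print(punctuation_aware_tokenizer("Don't panic."))
```

The second call shows a limitation: the contraction is split into `Don`, `'`, and `t`, which is one reason library tokenizers use more elaborate rules.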


### Using NLP Libraries for Tokenization

Libraries like `NLTK` and `spaCy` provide robust tokenization functions that handle edge cases and are more sophisticated than the simple white-space or punctuation-based methods.

In [3]:
import nltk
#nltk.download('punkt')
nltk.download('punkt_tab')
from nltk.tokenize import word_tokenize

[nltk_data] Downloading package punkt_tab to /Users/ds0/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!


- https://www.nltk.org/api/nltk.tokenize.punkt.html

In [4]:
text = "Natural language processing is fun!"
tokens = word_tokenize(text)
print(tokens)

['Natural', 'language', 'processing', 'is', 'fun', '!']


In [5]:
import spacy

# Download the spaCy model
!python -m spacy download en_core_web_sm

Collecting en-core-web-sm==3.8.0
  Using cached https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.8.0/en_core_web_sm-3.8.0-py3-none-any.whl (12.8 MB)
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_sm')


- https://spacy.io/models/en

In [6]:
# Load the spaCy model
nlp = spacy.load("en_core_web_sm")

In [7]:
text = "Natural language processing is fun!"
doc = nlp(text)

tokens = [token.text for token in doc]
print(tokens)

['Natural', 'language', 'processing', 'is', 'fun', '!']


- https://chatgpt.com/share/69774296-028c-8000-b5b4-8c6117c1ee17
- https://gemini.google.com/share/322d8b970a8e

# Text Normalization

Text normalization involves transforming text into a more uniform format to improve the performance of text analysis algorithms. Two common text normalization techniques are *stemming* and *lemmatization*.

## Stemming

Stemming is a process of reducing words to their word stem, base, or root form—generally a written word form. The idea is to remove affixes (prefixes and suffixes) from words to get to the core meaning of the word.

Stemming algorithms work by cutting off the end or the beginning of the word, taking into account a list of common prefixes and suffixes that can be found in an inflected word. This process is quite crude and a stem need not be identical to the morphological root of the word; it is usually sufficient that related words map to the same stem, even if this stem is not in itself a valid root.
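
The "cutting off the end" idea can be illustrated with a deliberately crude suffix stripper. The suffix list and the minimum stem length below are assumptions for this sketch; it is far simpler than the Porter or Snowball rule sets and is not a substitute for them.

```python
# A toy suffix list (an assumption); suffixes are tried in this order.
SUFFIXES = ("ing", "ly", "es", "ed", "s")

def toy_stem(word, min_stem_length=3):
    """Strip the first matching suffix, keeping at least min_stem_length characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem_length:
            return word[: -len(suffix)]
    return word

print([toy_stem(w) for w in ["jumping", "jumps", "jumped", "quickly", "studies", "bus"]])
```

Related forms of "jump" collapse to the same stem, while "studies" becomes "studi", which is not a real word. That is acceptable for stemming as long as related words map to the same stem.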

### Porter Stemmer

The Porter Stemming Algorithm is one of the oldest and most commonly used algorithms. It's designed for the English language and has a series of rules to determine the stripping of suffixes.

### Snowball Stemmer

The Snowball Stemmer, also known as the English Stemmer or Porter2 Stemmer, is a slightly improved version of the Porter stemmer and is part of a larger framework called Snowball. It offers stemmers for several languages besides English.

### Advantages and Disadvantages of Stemming

**Advantages:**
- Simple to implement and fast to run.
- Reduces the size of the vocabulary the model is exposed to.
- Often improves the performance of text classification models.

**Disadvantages:**
- Can produce stems that are not actual words.
- Sometimes too aggressive, cutting off too much of the word and changing the meaning.
- Does not consider the context of the word, which can lead to inaccuracies.

In [8]:
from nltk.stem import PorterStemmer
from nltk.stem.snowball import SnowballStemmer

In [9]:
# Initialize stemmers
porter = PorterStemmer()
snowball = SnowballStemmer(language='english')

- https://www.nltk.org/api/nltk.stem.porter.html

- https://www.nltk.org/api/nltk.stem.SnowballStemmer.html

In [10]:
words = ['run', 'runner', 'running', 'ran', 'runs', 'easily', 'fairly', 'better', 'mice', 'feet']

porter_stems = [porter.stem(word) for word in words]
print(f"Porter Stemmer: {porter_stems}")

snowball_stems = [snowball.stem(word) for word in words]
print(f"Snowball Stemmer: {snowball_stems}")

Porter Stemmer: ['run', 'runner', 'run', 'ran', 'run', 'easili', 'fairli', 'better', 'mice', 'feet']
Snowball Stemmer: ['run', 'runner', 'run', 'ran', 'run', 'easili', 'fair', 'better', 'mice', 'feet']


In [11]:
# from chatgpt
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
words = ["running", "runs", "studies", "studying"]

[stemmer.stem(w) for w in words]


['run', 'run', 'studi', 'studi']

In [12]:
# my test
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
words = ["running", "runs", "ran", "studies", "studying"]

[stemmer.stem(w) for w in words]

['run', 'run', 'ran', 'studi', 'studi']

## Lemmatization

Lemmatization is the process of grouping together the inflected forms of a word so they can be analyzed as a single item, identified by the word's lemma, or dictionary form.

While stemming often involves rule-based chopping of ends of words, lemmatization involves a linguistic approach to reduce a word to its base or root form. Lemmatization uses vocabulary and morphological analysis, often with the aid of part-of-speech tagging, to return the base or dictionary form of a word, known as the lemma.

### The Role of Part-of-Speech Tagging in Lemmatization

Part-of-speech (POS) tagging is crucial in lemmatization because many words have different lemmas based on their part of speech in a sentence. For example, the word "saw" can be a verb or a noun, and the lemma would differ accordingly ("see" for the verb, "saw" for the noun).

`nltk` or `spacy` contain pre-trained models for POS tagging and lemmatization.
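
Why the part of speech matters can be shown with a miniature lookup table keyed by word and POS. The `LEMMAS` dictionary and `toy_lemmatize` function are hypothetical and exist only for this sketch; real lemmatizers rely on large lexicons such as WordNet together with morphological rules.

```python
# Hypothetical miniature lemma dictionary keyed by (word, part of speech).
LEMMAS = {
    ("saw", "VERB"): "see",
    ("saw", "NOUN"): "saw",
    ("feet", "NOUN"): "foot",
}

def toy_lemmatize(word, pos):
    """Look up the lemma for a word with a given POS tag; fall back to the word itself."""
    return LEMMAS.get((word.lower(), pos), word.lower())

print(toy_lemmatize("saw", "VERB"))   # the verb, as in "I saw a bird"
print(toy_lemmatize("saw", "NOUN"))   # the tool, as in "a rusty saw"
print(toy_lemmatize("feet", "NOUN"))
```

The same surface form yields different lemmas depending on the tag, which is why a POS tagger usually runs before the lemmatizer.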

### Advantages and Disadvantages of Lemmatization

**Advantages:**
- Produces lemmas, which are actual words, improving interpretability.
- More accurate than stemming because it draws on a vocabulary and part-of-speech information rather than suffix rules alone.

**Disadvantages:**
- More computationally expensive than stemming.
- Requires additional information (POS tags).
- May not improve performance significantly more than stemming for some applications.

In [13]:
from nltk.stem import WordNetLemmatizer
from nltk.corpus import wordnet
# nltk.download('averaged_perceptron_tagger')
nltk.download('averaged_perceptron_tagger_eng')
nltk.download('wordnet')

[nltk_data] Downloading package averaged_perceptron_tagger_eng to
[nltk_data]     /Users/ds0/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger_eng is already up-to-
[nltk_data]       date!
[nltk_data] Downloading package wordnet to /Users/ds0/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

- https://www.nltk.org/api/nltk.stem.WordNetLemmatizer.html

In [14]:
# Initialize lemmatizer
lemmatizer = WordNetLemmatizer()

In [15]:
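# Map a Penn Treebank POS tag to the WordNet POS constant expected by lemmatize().
# Note: each word is tagged in isolation, so no sentence context is available.
def get_wordnet_pos(word):
    """Return the WordNet POS constant for a single word (defaults to noun)."""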
    tag = nltk.pos_tag([word])[0][1][0].upper()
    tag_dict = {"J": wordnet.ADJ,
                "N": wordnet.NOUN,
                "V": wordnet.VERB,
                "R": wordnet.ADV}

    return tag_dict.get(tag, wordnet.NOUN)

In [16]:
# Lemmatize words with POS tags
lemmatized_words = [lemmatizer.lemmatize(w, get_wordnet_pos(w)) for w in words]
print("Lemmatized Words:", lemmatized_words)

Lemmatized Words: ['run', 'run', 'ran', 'study', 'study']


In [17]:
# Let's use spacy on another example to demonstrate pos and lemmatization
sentence = "The striped bats are hanging on their feet for better sleep."

In [18]:
doc = nlp(sentence)

In [19]:
# POS tagging and lemmatization
print(f"{'Text':{8}} {'Lemma':{8}} {'POS':{6}} {'Tag':{6}} {'Explanation'}")
print()
for token in doc:
    print(f"{token.text:{8}} {token.lemma_:{8}} {token.pos_:{6}} {token.tag_:{6}} {spacy.explain(token.tag_)}")

Text     Lemma    POS    Tag    Explanation

The      the      DET    DT     determiner
striped  striped  ADJ    JJ     adjective (English), other noun-modifier (Chinese)
bats     bat      NOUN   NNS    noun, plural
are      be       AUX    VBP    verb, non-3rd person singular present
hanging  hang     VERB   VBG    verb, gerund or present participle
on       on       ADP    IN     conjunction, subordinating or preposition
their    their    PRON   PRP$   pronoun, possessive
feet     foot     NOUN   NNS    noun, plural
for      for      ADP    IN     conjunction, subordinating or preposition
better   well     ADJ    JJR    adjective, comparative
sleep    sleep    NOUN   NN     noun, singular or mass
.        .        PUNCT  .      punctuation mark, sentence closer


In [20]:
# Output the lemmatized form of each word
lemmatized_sentence = " ".join([token.lemma_ for token in doc])
print(lemmatized_sentence)

the striped bat be hang on their foot for well sleep .


In [21]:
# from chatgpt
import spacy
nlp = spacy.load("en_core_web_sm")

doc = nlp("The children were running faster")
[(token.text, token.lemma_) for token in doc]


[('The', 'the'),
 ('children', 'child'),
 ('were', 'be'),
 ('running', 'run'),
 ('faster', 'fast')]

- https://chatgpt.com/share/69774296-028c-8000-b5b4-8c6117c1ee17

## End of Part 1

# Language Modeling

Language modeling is a critical task that deals with predicting the probability of a sequence of words. It is used in various applications such as speech recognition, machine translation, and text generation.

## Formula

A language model is a probabilistic model that assigns a probability to a sequence of words, effectively capturing the likelihood that the sequence will occur in a language. In mathematical terms, given a sequence of words $ w_1, w_2, \ldots, w_n $, the language model estimates the probability:

$$ P(w_1, w_2, \ldots, w_n) $$

This probability can be decomposed using the chain rule of probability as:

$$ P(w_1, w_2, \ldots, w_n) = P(w_1) \cdot P(w_2 | w_1) \cdot \ldots \cdot P(w_n | w_1, w_2, \ldots, w_{n-1}) $$

## N-gram Models

An n-gram is a contiguous sequence of $ n $ items from a given sample of text or speech. The items can be phonemes, syllables, letters, words, or base pairs according to the application; in language modeling they are typically words. An n-gram model approximates the probability of each word by conditioning only on the $ n-1 $ previous words rather than on the full history. This simplification is known as the Markov assumption.

### Unigram Models

A unigram model is the simplest form of a statistical language model. It assumes that the probability of a word is independent of the words before it.

$$ P(w_1, w_2, \ldots, w_n) = P(w_1) \cdot P(w_2) \cdot \ldots \cdot P(w_n) $$

### Bigram Models

A bigram model, also known as a 2-gram model, assumes that the probability of a word depends only on the immediately preceding word.

$$ P(w_n | w_1, w_2, \ldots, w_{n-1}) \approx P(w_n | w_{n-1}) = \frac{Count(w_{n-1}, w_n)}{Count(w_{n-1})} $$
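
The count ratio in this formula can be computed directly. The sketch below assumes a small whitespace-tokenized corpus with punctuation already split off by hand, and the helper name `bigram_prob` is chosen for this example.

```python
from collections import Counter

# Toy corpus, pre-tokenized by hand (an assumption for this sketch).
tokens = "I am Sam . Sam I am . I do not like green eggs and ham .".split()

unigram_counts = Counter(tokens)
bigram_counts = Counter(zip(tokens, tokens[1:]))

def bigram_prob(prev_word, word):
    """Maximum-likelihood estimate: Count(prev_word, word) / Count(prev_word)."""
    if unigram_counts[prev_word] == 0:
        return 0.0
    return bigram_counts[(prev_word, word)] / unigram_counts[prev_word]

print(bigram_prob("I", "am"))    # 2 of the 3 occurrences of "I" are followed by "am"
print(bigram_prob("am", "Sam"))  # 1 of the 2 occurrences of "am" is followed by "Sam"
print(bigram_prob("I", "ham"))   # unseen bigram: probability 0
```

The zero for the unseen bigram is exactly the problem that smoothing addresses.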

### Trigram Models and Higher-Order Models

Trigram models extend this to consider the two preceding words, and higher-order models consider more history. However, as the history increases, these models become more complex and require more data to estimate the probabilities accurately.

### Challenges

**Sparsity**: As $ n $ increases, the likelihood of encountering unseen n-grams (those not present in the training corpus) increases, leading to sparsity.

**Curse of Dimensionality**: The number of possible n-grams increases exponentially with $ n $, which leads to a combinatorial explosion in the number of parameters to be estimated.

### Smoothing Techniques

Smoothing techniques are used to handle the issue of zero probabilities for unseen n-grams. Common techniques include:

- **Add-One (Laplace) Smoothing**: Adding one to all the n-gram counts.
- **Add-k Smoothing**: Adding a small constant $ k $ to the counts.
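- **Backoff and Interpolation**: Backoff falls back to a lower-order n-gram estimate when the higher-order n-gram has a zero count; interpolation always mixes the estimates of several orders using weights that sum to 1.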
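
The techniques above can be sketched for a bigram model as follows. The toy corpus, the function names, and the interpolation weight `lam=0.7` are assumptions for illustration; in practice the weight is tuned on held-out data.

```python
from collections import Counter

# Toy corpus, pre-tokenized by hand (an assumption for this sketch).
tokens = "I am Sam . Sam I am . I do not like green eggs and ham .".split()
vocab = set(tokens)
unigram_counts = Counter(tokens)
bigram_counts = Counter(zip(tokens, tokens[1:]))

def add_k_prob(prev_word, word, k=1.0):
    """Add-k estimate of P(word | prev_word); k=1 gives add-one (Laplace) smoothing."""
    return (bigram_counts[(prev_word, word)] + k) / (unigram_counts[prev_word] + k * len(vocab))

def interpolated_prob(prev_word, word, lam=0.7):
    """Linear interpolation of the bigram and unigram maximum-likelihood estimates."""
    prev_count = unigram_counts[prev_word]
    bigram = bigram_counts[(prev_word, word)] / prev_count if prev_count else 0.0
    unigram = unigram_counts[word] / len(tokens)
    return lam * bigram + (1 - lam) * unigram

print(add_k_prob("I", "am"))          # seen bigram
print(add_k_prob("I", "ham"))         # unseen bigram now gets a non-zero probability
print(interpolated_prob("I", "ham"))  # the unigram term keeps this above zero
```

With add-one smoothing the probabilities over the vocabulary still sum to 1 for a given previous word, but mass is shifted from seen to unseen bigrams, which is why smaller values of `k` are often preferred.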

In [22]:
from nltk import bigrams
from collections import Counter, defaultdict

https://www.geeksforgeeks.org/nlp/generate-bigrams-with-nltk/

In [23]:
# Sample corpus
corpus = "I am Sam. Sam I am. I do not like green eggs and ham."

In [24]:
# Tokenize the corpus
tokens = nltk.word_tokenize(corpus)

In [25]:
# Calculate bigram frequencies
bigram_freqs = Counter(bigrams(tokens))

In [26]:
# Calculate total number of bigrams
total_bigrams = sum(bigram_freqs.values())

In [27]:
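# Relative frequency of each bigram among all bigrams in the corpus.
# Note: this is a joint estimate, not the conditional P(w_n | w_{n-1}), which divides by Count(w_{n-1}).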
bigram_probs = {bigram: freq / total_bigrams for bigram, freq in bigram_freqs.items()}

In [28]:
# Display bigram probabilities
for bigram, prob in bigram_probs.items():
    print(f"Probability of {bigram}: {prob}")

Probability of ('I', 'am'): 0.125
Probability of ('am', 'Sam'): 0.0625
Probability of ('Sam', '.'): 0.0625
Probability of ('.', 'Sam'): 0.0625
Probability of ('Sam', 'I'): 0.0625
Probability of ('am', '.'): 0.0625
Probability of ('.', 'I'): 0.0625
Probability of ('I', 'do'): 0.0625
Probability of ('do', 'not'): 0.0625
Probability of ('not', 'like'): 0.0625
Probability of ('like', 'green'): 0.0625
Probability of ('green', 'eggs'): 0.0625
Probability of ('eggs', 'and'): 0.0625
Probability of ('and', 'ham'): 0.0625
Probability of ('ham', '.'): 0.0625


# Vector Space Model

The Vector Space Model (VSM) is a mathematical model used to represent text documents as vectors of identifiers, such as index terms. It is used in information retrieval and text mining to measure the similarity between documents. In VSM, each dimension corresponds to a separate term, and the value in each dimension represents the significance of the term in the document.

## Term-Document Matrix

In VSM, a Term-Document Matrix is a mathematical representation of a text corpus. It describes the frequency of terms that occur in the collection of documents. In a Term-Document Matrix, rows correspond to terms in the corpus while columns correspond to documents. Each entry in this matrix denotes the frequency or the weight of a term in a document.

Here's a simple example of a Term-Document Matrix for five documents:

![tf](../assets/tf.png)

Such matrices are often sparse since not all words appear in all documents.
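
A term-document matrix of raw counts can be built with the standard library alone. The three short documents and their names (`d1` to `d3`) below are assumptions made for this sketch.

```python
from collections import Counter

# Three tiny example documents (assumed for illustration).
docs = {
    "d1": "the sky is blue",
    "d2": "the sun is bright",
    "d3": "the sun in the sky is bright",
}

counts = {name: Counter(text.split()) for name, text in docs.items()}
vocabulary = sorted(set().union(*counts.values()))

# Rows are terms, columns are documents, entries are raw term frequencies.
matrix = [[counts[name][term] for name in docs] for term in vocabulary]

print("term".ljust(8), *docs)
for term, row in zip(vocabulary, matrix):
    print(term.ljust(8), *row)
```

Even in this tiny example several entries are zero; with a realistic vocabulary the proportion of zeros grows quickly, which is why sparse matrix formats are used in practice.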

## TF-IDF

Term Frequency-Inverse Document Frequency (TF-IDF) is a weight often used in information retrieval and text mining. This weight is a statistical measure used to evaluate the importance of a word to a document in a collection or corpus. The importance increases proportionally to the number of times a word appears in the document but is offset by the frequency of the word in the corpus.

The TF-IDF value is calculated as follows:

- **Term Frequency (TF)**, which measures how frequently a term occurs in a document. It is calculated as the number of times a term `t` appears in a document `d`, divided by the total number of terms in the document.

$$ TF(t, d) = \frac{\text{Number of times term } t \text{ appears in document } d}{\text{Total number of terms in document } d} $$

- **Inverse Document Frequency (IDF)**, which measures how informative a term is across the entire corpus: terms that appear in few documents get a high weight, and terms that appear in most documents get a low one. It is calculated as the logarithm of the number of documents in the corpus divided by the number of documents where the term `t` appears.

$$ IDF(t, D) = \log \left( \frac{\text{Total number of documents in corpus } D}{\text{Number of documents with term } t} \right) $$

- **TF-IDF**, the product of TF and IDF:

$$ TF\text{-}IDF(t, d, D) = TF(t, d) \times IDF(t, D) $$
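
The three formulas translate almost line by line into code. This sketch assumes pre-tokenized toy documents and the natural logarithm (the formula above does not fix a base); the function names are chosen for this example.

```python
import math
from collections import Counter

# Pre-tokenized toy corpus (an assumption for this sketch).
docs = [
    "the sky is blue".split(),
    "the sun is bright".split(),
    "the sun in the sky is bright".split(),
]

def tf(term, doc):
    """Occurrences of the term divided by the number of terms in the document."""
    return Counter(doc)[term] / len(doc)

def idf(term, corpus):
    """Log of (number of documents / number of documents containing the term)."""
    containing = sum(1 for doc in corpus if term in doc)
    return math.log(len(corpus) / containing) if containing else 0.0

def tf_idf(term, doc, corpus):
    """Product of term frequency and inverse document frequency."""
    return tf(term, doc) * idf(term, corpus)

print(tf_idf("blue", docs[0], docs))  # appears in one document only: high weight
print(tf_idf("the", docs[0], docs))   # appears in every document: weight 0
```

A term that occurs in every document gets an IDF of log(1) = 0, so it contributes nothing to the document's vector no matter how often it appears.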

## Cosine Similarity

Cosine similarity is a measure used to determine how similar two documents are irrespective of their size. Mathematically, it measures the cosine of the angle between two vectors projected in a multi-dimensional space. This metric is, therefore, a judgment of orientation and not magnitude: two vectors with the same orientation have a cosine similarity of 1, two vectors at 90° have a similarity of 0, and two vectors diametrically opposed have a similarity of -1, independent of their magnitude.

Because TF-IDF weights are never negative, document vectors lie in the positive orthant, so the cosine similarity of two documents is bounded in [0, 1]: 0 means the documents share no terms, and 1 means their term weights point in exactly the same direction (for example, identical content).

The formula for calculating the cosine similarity between two vectors $ A $ and $ B $ is:

$$ \text{Cosine Similarity}(A, B) = \frac{A \cdot B}{\|A\| \|B\|} = \frac{\sum_{i=1}^{n} A_i \times B_i}{\sqrt{\sum_{i=1}^{n} A_i^2} \times \sqrt{\sum_{i=1}^{n} B_i^2}} $$
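
A direct implementation of this formula, as a minimal sketch: the function name `cosine_similarity_vec` is an assumption (chosen to avoid clashing with scikit-learn's `cosine_similarity`), and returning `0.0` for an all-zero vector is a convention adopted here, since the formula is undefined in that case.

```python
import math

def cosine_similarity_vec(a, b):
    """Cosine of the angle between two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

print(cosine_similarity_vec([1, 2, 0], [2, 4, 0]))  # same direction, different length
print(cosine_similarity_vec([1, 0, 0], [0, 1, 0]))  # orthogonal
print(cosine_similarity_vec([1, 0], [-1, 0]))       # opposite direction
```

The first pair differs only in magnitude and scores (up to floating-point rounding) 1, which illustrates that the measure ignores document length.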

## Limitations

- Assumes term independence, which is not always the case.
- Does not capture the semantic relationship between words.
- High-dimensional and sparse vectors due to the size of the vocabulary.

In [29]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html

In [30]:
# Sample documents
documents = [
    'The sky is blue',
    'The sun is bright',
    'The sun in the sky is bright',
    'We can see the shining sun, the bright sun'
]

In [31]:
# Initialize a TF-IDF Vectorizer
vectorizer = TfidfVectorizer()

In [32]:
# Vectorize the documents
tfidf_matrix = vectorizer.fit_transform(documents)

In [33]:
# Display the TF-IDF matrix
print(tfidf_matrix.toarray())

[[0.65919112 0.         0.         0.         0.42075315 0.
  0.         0.51971385 0.         0.34399327 0.        ]
 [0.         0.52210862 0.         0.         0.52210862 0.
  0.         0.         0.52210862 0.42685801 0.        ]
 [0.         0.3218464  0.         0.50423458 0.3218464  0.
  0.         0.39754433 0.3218464  0.52626104 0.        ]
 [0.         0.23910199 0.37459947 0.         0.         0.37459947
  0.37459947 0.         0.47820398 0.39096309 0.37459947]]


In [None]:
# Display the TF-IDF matrix with feature names as a pandas DataFrame
import pandas as pd

feature_names = vectorizer.get_feature_names_out()
print("Feature Names:", feature_names)
df = pd.DataFrame(tfidf_matrix.toarray(), columns=feature_names)
print(df)

Feature Names: ['blue' 'bright' 'can' 'in' 'is' 'see' 'shining' 'sky' 'sun' 'the' 'we']
       blue    bright       can        in        is       see   shining  \
0  0.659191  0.000000  0.000000  0.000000  0.420753  0.000000  0.000000   
1  0.000000  0.522109  0.000000  0.000000  0.522109  0.000000  0.000000   
2  0.000000  0.321846  0.000000  0.504235  0.321846  0.000000  0.000000   
3  0.000000  0.239102  0.374599  0.000000  0.000000  0.374599  0.374599   

        sky       sun       the        we  
0  0.519714  0.000000  0.343993  0.000000  
1  0.000000  0.522109  0.426858  0.000000  
2  0.397544  0.321846  0.526261  0.000000  
3  0.000000  0.478204  0.390963  0.374599  
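
The values printed above do not match the textbook TF-IDF formula because, with its default settings, `TfidfVectorizer` uses raw term counts, a smoothed IDF of ln((1 + n) / (1 + df)) + 1, and then scales each row to unit Euclidean length. The sketch below re-derives the first row by hand under those assumptions; the helper names are chosen for this example and are not part of scikit-learn.

```python
import math
import re
from collections import Counter

documents = [
    "The sky is blue",
    "The sun is bright",
    "The sun in the sky is bright",
    "We can see the shining sun, the bright sun",
]

# Lowercase and keep tokens of two or more word characters.
tokenized = [re.findall(r"\b\w\w+\b", doc.lower()) for doc in documents]
n_docs = len(tokenized)
doc_freq = Counter(term for doc in tokenized for term in set(doc))

def smoothed_idf(term):
    """ln((1 + n) / (1 + df)) + 1, the smoothed IDF variant."""
    return math.log((1 + n_docs) / (1 + doc_freq[term])) + 1

def tfidf_row(doc):
    """Raw count times smoothed IDF, then scaled to unit Euclidean length."""
    weights = {term: count * smoothed_idf(term) for term, count in Counter(doc).items()}
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {term: w / norm for term, w in weights.items()}

row0 = tfidf_row(tokenized[0])
print({term: round(w, 6) for term, w in sorted(row0.items())})
```

For the first document this gives approximately 0.659 for `blue`, 0.421 for `is`, 0.520 for `sky`, and 0.344 for `the`, matching row 0 of the matrix. Note that `the` keeps a non-zero weight because of the `+ 1` in the smoothed IDF.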


In [36]:
documents

['The sky is blue',
 'The sun is bright',
 'The sun in the sky is bright',
 'We can see the shining sun, the bright sun']

In [37]:
# Calculate cosine similarity between the 2nd document and every document (including itself)
cosine_similarities = cosine_similarity(tfidf_matrix[1], tfidf_matrix)
print(cosine_similarities)

[[0.36651513 1.         0.72875508 0.54139736]]


In [38]:
# Let's initialize another TF-IDF Vectorizer, with stop words removal
vectorizer = TfidfVectorizer(stop_words='english')

In [39]:
# Vectorize the documents
tfidf_matrix = vectorizer.fit_transform(documents)

In [40]:
# Display the TF-IDF matrix
print(tfidf_matrix.toarray())

[[0.78528828 0.         0.         0.6191303  0.        ]
 [0.         0.70710678 0.         0.         0.70710678]
 [0.         0.53256952 0.         0.65782931 0.53256952]
 [0.         0.36626037 0.57381765 0.         0.73252075]]


In [41]:
# Calculate cosine similarity between the 2nd document and every document (including itself)
cosine_similarities = cosine_similarity(tfidf_matrix[1], tfidf_matrix)
print(cosine_similarities)

[[0.         1.         0.75316704 0.77695558]]


- https://chatgpt.com/share/69774296-028c-8000-b5b4-8c6117c1ee17
- https://claude.ai/share/65fc32ae-314b-4e60-80bc-5f3426b8c7f4

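References:

- Manning, Raghavan, and Schütze, *Introduction to Information Retrieval*: https://nlp.stanford.edu/IR-book/
- Jurafsky and Martin, *Speech and Language Processing* (3rd ed. draft): https://web.stanford.edu/~jurafsky/slp3/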