In [2]:
import re
import math
import numpy as np

# TF-IDF (Term Frequency – Inverse Document Frequency)

Machine learning models cannot directly work with raw text. In many NLP and information
retrieval problems, we first need to convert documents into numerical vectors.

A simple idea is to count word occurrences (Bag of Words). The limitation is that very common
terms across the corpus (e.g. *is*, *the*, *and*) may dominate the representation, even though
they do not help distinguish documents.

TF-IDF is a classical weighting scheme that balances:

- **Term Frequency (TF):** how relevant a term is inside a document (local importance)
- **Inverse Document Frequency (IDF):** how informative a term is across the corpus (global rarity)

As a result, TF-IDF highlights words that characterize a document and down-weights words that
appear in many documents.

TF-IDF is widely used in:
- document search and ranking
- keyword extraction
- document clustering and similarity

## 1. Why do we need TF-IDF?

When working with text data, machine learning models require numerical representations.
A straightforward approach is to count how many times each word appears in a document
(Bag of Words representation).

However, this approach presents an important limitation:  
words that appear very frequently across the entire corpus tend to dominate the representation,
even if they do not provide useful information to distinguish documents.

For example, terms such as *is*, *and* or *the* may appear in almost every document.
Although they have high frequency, they carry little semantic value for identifying
the topic or meaning of a document.

TF-IDF (Term Frequency – Inverse Document Frequency) addresses this issue by balancing
two complementary ideas:

- **Local importance:** how relevant a word is within a specific document.
- **Global importance:** how rare or informative that word is across the whole corpus.

By combining these two aspects, TF-IDF assigns:
- high weights to words that are frequent in a document but rare in the corpus,
- low weights to words that appear in many documents.

As a result, TF-IDF produces representations that better capture what makes a document
distinct from others, which is essential for tasks such as:
- document search and ranking,
- similarity-based retrieval,
- clustering and topic analysis.

TF-IDF remains a fundamental technique in natural language processing because it is
simple, interpretable, and often effective in practice.


## 2. Corpus and basic preprocessing concepts

A **corpus** is a collection of documents. TF-IDF is defined with respect to the corpus because
IDF depends on how many documents contain each term.

A **document** can be any piece of text: an email, a webpage, a paragraph, etc.
In practice, we often apply preprocessing to reduce noise and ensure consistent counting.



## 3. Text preprocessing

Preprocessing aims to represent similar textual forms in a consistent way.

Typical steps:
- **Lowercasing**: makes "Learning" and "learning" identical.
- **Removing punctuation / non-alphabetic characters**: reduces noise.
- **Tokenization**: splits text into individual terms (tokens).

The choices here depend on the task: aggressive cleaning may remove useful information.
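
As a minimal sketch of the steps above, the following function lowercases the text, replaces every character that is not an ASCII letter or whitespace with a space, and splits on whitespace. The name `tokenize` and the ASCII-only rule are assumptions made for illustration; accented characters and digits would be dropped under this rule.

```python
import re

def tokenize(text):
    """Lowercase, replace non-letter characters with spaces, split on whitespace."""
    text = text.lower()
    # Keep only ASCII letters and whitespace; everything else becomes a space.
    text = re.sub(r"[^a-z\s]", " ", text)
    return text.split()

print(tokenize("Machine Learning is fun, isn't it?"))
# ['machine', 'learning', 'is', 'fun', 'isn', 't', 'it']
```

Note how "isn't" is split into `isn` and `t`: this is one example of aggressive cleaning discarding information that a task might need.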



## 4. Vocabulary and vector representation

To represent documents as vectors, we define a **vocabulary**: the set of all unique terms in the corpus.

Each term is assigned an index so that:
- every document can be represented as a fixed-length vector
- the same term always corresponds to the same vector position

`token2idx` maps each term to its vector index, and `idx2token` provides the inverse mapping.
This is the standard way to build a Bag-of-Words / TF-IDF representation.
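
One possible way to build these two mappings is sketched below. The helper name `build_vocab` is hypothetical, and sorting the vocabulary is an assumption made only so that indices are reproducible between runs.

```python
def build_vocab(tokenized_docs):
    """Map each unique term to an index (sorted for reproducibility) and back."""
    vocab = sorted({t for doc in tokenized_docs for t in doc})
    token2idx = {t: i for i, t in enumerate(vocab)}
    idx2token = {i: t for t, i in token2idx.items()}
    return token2idx, idx2token

docs = [["the", "cat", "sat"], ["the", "dog", "barked"]]
token2idx, idx2token = build_vocab(docs)
print(token2idx)
# {'barked': 0, 'cat': 1, 'dog': 2, 'sat': 3, 'the': 4}
```

Every document in this toy corpus would then be a vector of length 5, with position 4 always referring to "the".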



## 5. Term Frequency (TF)

**TF** measures how often a term occurs within a document.

If a word appears many times in a document, it is likely related to that document’s topic.
However, longer documents naturally contain more occurrences of every term, so counts are commonly normalized by document length:

\[
TF(t, d) = \frac{\text{count}(t \in d)}{|d|}
\]

where \(|d|\) is the number of tokens in the document.
This produces comparable values across documents of different lengths.



In [11]:
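def tf_vector(tokens, token2idx):
    """Return the length-normalized term-frequency vector of one document."""
    vec = np.zeros(len(token2idx), dtype=float)
    for t in tokens:
        # Tokens outside the vocabulary are skipped but still count toward |d|.
        if t in token2idx:
            vec[token2idx[t]] += 1.0
    denom = max(len(tokens), 1)  # guard against empty documents
    return vec / denom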

##Generative AI tools were used to help refine wording and structure.


## 6. Document Frequency (DF) and Inverse Document Frequency (IDF)

TF alone cannot distinguish between informative terms and terms that appear everywhere.

- **DF(t)** counts how many documents contain a term:
\[
DF(t) = |\{d \in D : t \in d\}|
\]

- **IDF(t)** down-weights terms that appear in many documents and up-weights rarer terms:

\[
IDF(t) = \log\left(\frac{1 + N}{1 + DF(t)}\right) + 1
\]

where \(N\) is the number of documents.

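The `1` added to both the numerator and the denominator avoids division by zero for terms with \(DF(t) = 0\) and keeps the scale stable for small corpora. The final `+1` outside the logarithm ensures that a term appearing in every document still receives a non-zero weight.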
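
The formula can be sketched as a function of a DF vector. The name `idf_vector` is hypothetical, and the natural logarithm (`np.log`) is an assumption, since the formula does not fix a base.

```python
import numpy as np

def idf_vector(df, n_docs):
    """Smoothed IDF: log((1 + N) / (1 + DF(t))) + 1, computed per term."""
    df = np.asarray(df, dtype=float)
    return np.log((1.0 + n_docs) / (1.0 + df)) + 1.0

# Toy document frequencies for three terms in a corpus of N = 3 documents:
# one term in every document, one in a single document, one in none.
idf = idf_vector([3, 1, 0], n_docs=3)
print(idf)
```

The term found in all three documents gets exactly 1.0, since log(4/4) = 0, while the rarer terms get progressively larger weights.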



In [12]:
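def document_frequency(tokenized_docs, token2idx):
    """Count, for each vocabulary term, the documents that contain it."""
    df = np.zeros(len(token2idx), dtype=float)
    for doc in tokenized_docs:
        # set(doc) ensures each term is counted at most once per document.
        for t in set(doc):
            if t in token2idx:
                df[token2idx[t]] += 1.0
    return df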



## 7. TF-IDF

TF-IDF combines **local importance** (TF) with **global importance** (IDF).

\[
\text{TF-IDF}(t, d) = TF(t, d) \cdot IDF(t)
\]

- A word gets a **high TF-IDF score** if:
  - it appears frequently in a document
  - and appears in few documents overall

- A word gets a **low TF-IDF score** if:
  - it appears in almost every document (low discriminative power)

TF-IDF is one of the most widely used techniques for classical text representation.
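
To make the combination concrete, here is a compact, self-contained sketch that builds a TF-IDF matrix for a toy corpus using the TF and IDF definitions from the previous sections. The function name `tfidf_matrix` and the toy documents are assumptions for illustration, and the natural logarithm is assumed for IDF.

```python
import numpy as np

def tfidf_matrix(tokenized_docs):
    """Return (matrix, token2idx) with one TF-IDF row per document."""
    vocab = sorted({t for doc in tokenized_docs for t in doc})
    token2idx = {t: i for i, t in enumerate(vocab)}
    n_docs = len(tokenized_docs)

    # TF: term counts normalized by document length.
    tf = np.zeros((n_docs, len(vocab)), dtype=float)
    for row, doc in enumerate(tokenized_docs):
        for t in doc:
            tf[row, token2idx[t]] += 1.0
        tf[row] /= max(len(doc), 1)

    # DF: number of documents in which each term appears at least once.
    df = (tf > 0).sum(axis=0)
    idf = np.log((1.0 + n_docs) / (1.0 + df)) + 1.0
    return tf * idf, token2idx  # idf broadcasts across the rows

docs = [["the", "cat", "sat"], ["the", "dog", "sat"], ["the", "cat", "ran"]]
X, token2idx = tfidf_matrix(docs)
print(X[0, token2idx["the"]], X[0, token2idx["cat"]])
```

In the first document "the" and "cat" have the same TF (1/3), but "the" occurs in all three documents and "cat" in only two, so "cat" ends up with the higher TF-IDF weight.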


## 8. TF-IDF vectors and document similarity

Once each document is represented as a TF-IDF vector, we can compare documents mathematically.

A common choice is **cosine similarity**, which measures the angle between two vectors:

\[
\cos(\theta) = \frac{x \cdot y}{\|x\|\|y\|}
\]

Cosine similarity is often preferred for text because it depends only on direction (the pattern of
term weights) and not on vector magnitude, which is influenced by document length. This is useful for:
- ranking documents for a search query
- finding similar documents
- clustering documents by topic
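
A minimal sketch of the cosine formula with NumPy follows. Returning 0.0 when either vector is all zeros is a convention assumed here, since the cosine is undefined in that case.

```python
import numpy as np

def cosine_similarity(x, y):
    """Cosine of the angle between x and y; 0.0 if either vector is all zeros."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    denom = np.linalg.norm(x) * np.linalg.norm(y)
    if denom == 0.0:
        return 0.0  # assumed convention for zero vectors
    return float(np.dot(x, y) / denom)

a = np.array([1.0, 2.0, 0.0])
print(cosine_similarity(a, 3 * a))                      # same direction, close to 1
print(cosine_similarity(a, np.array([0.0, 0.0, 5.0])))  # no shared terms, 0
```

Scaling a vector does not change its similarity to another vector, which is why two documents with the same term pattern but different lengths are treated as alike.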



### Contribution
Name: Álvaro Aguilar Dávila  
Course: Machine Learning  
Degree: Computer Engineering + Business Administration (ADE)  
University: CUNEF  
Year: 2026
