
#### 🔡 Text Encoding - The First Language AI Learns
##### Before a machine can think, answer, or create, it must first understand.
##### And to understand text, we must speak the language of computers: numbers.
##### That’s what Text Encoding is all about: transforming words, phrases, or entire documents into numerical representations that machine learning models can process.

Before any model can learn from text, we need to convert words into numbers. That’s where text encoding begins, but it does not start from a single sentence.

It starts from the corpus, a large collection of text, such as:

Product reviews | Chat logs | Support tickets | Articles, etc.

Let’s break it down: Corpus ➝ Documents ➝ Sentences ➝ Words ➝ Tokens ➝ Vocabulary

- Corpus: Entire dataset (e.g., thousands of reviews)
- Document: Each review (collection of sentences)
- Sentence: Basic unit of text
- Words: Words in each sentence
- Tokens: Standardized words (after lowercasing, removing punctuation, etc.)
- Vocabulary: Set of unique tokens across the entire corpus
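
The hierarchy above can be sketched in a few lines of plain Python. This is only an illustration: the three-review corpus, the `tokenize` helper, and the naive sentence splitter are assumptions made for the example, not part of any library.

```python
import re

# Hypothetical three-document corpus (illustrative data only).
corpus = [
    "Great phone. Battery lasts long!",
    "Battery died fast. Not great.",
    "Great screen, great price.",
]

def tokenize(text):
    """Lowercase the text and keep only runs of letters/digits as tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())

# Corpus -> documents -> sentences -> tokens
for doc_id, document in enumerate(corpus):
    # Naive sentence split on '.', '!' or '?' (an assumption; real splitters are smarter).
    sentences = [s for s in re.split(r"[.!?]+", document) if s.strip()]
    print(doc_id, [tokenize(s) for s in sentences])

# Vocabulary: unique tokens across the entire corpus, sorted for stable indices.
vocabulary = sorted({token for document in corpus for token in tokenize(document)})
print(vocabulary)
```

Note that "Great" and "great" collapse into a single token after lowercasing, which is why the vocabulary is smaller than the raw word count.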



### 1. One-Hot Encoding (OHE): Foundation of Text Vectorization
One-Hot Encoding creates a binary vector for each word. The vector has one position for every unique word in the vocabulary.
For a given word, the position assigned to that word is marked 1 and all other positions are 0. Think of it as a dictionary index where each word is assigned a position, and only the active word gets “highlighted.”
The cells below apply the same idea to the categorical columns of a small table, using scikit-learn's OneHotEncoder.

In [6]:
## Import the libraries
import pandas as pd

In [1]:
from sklearn.preprocessing import OneHotEncoder

In [14]:
data = {'Employee id': [10, 20, 15, 25, 30],
        'Gender': ['M', 'F', 'F', 'M', 'F'],
        'Remarks': ['Good', 'Nice', 'Good', 'Great', 'Nice'],
        }
df = pd.DataFrame(data)
print(f"Employee data : \n{df}")

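# Select the string-valued (categorical) columns
categorical_columns = df.select_dtypes(include=['object']).columns.tolist()
# sparse_output=False returns a dense NumPy array (requires scikit-learn >= 1.2)
encoder = OneHotEncoder(sparse_output=False)

# Learn the categories of each column and encode them in one step
one_hot_encoded = encoder.fit_transform(df[categorical_columns])

# Wrap the result in a DataFrame with readable column names such as Gender_F
one_hot_df = pd.DataFrame(one_hot_encoded, columns=encoder.get_feature_names_out(categorical_columns))

# Append the encoded columns, then drop the original categorical ones
df_encoded = pd.concat([df, one_hot_df], axis=1)

df_encoded = df_encoded.drop(categorical_columns, axis=1)
print(f"Encoded Employee data : \n{df_encoded}")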

Employee data : 
   Employee id Gender Remarks
0           10      M    Good
1           20      F    Nice
2           15      F    Good
3           25      M   Great
4           30      F    Nice
Encoded Employee data : 
   Employee id  Gender_F  Gender_M  Remarks_Good  Remarks_Great  Remarks_Nice
0           10       0.0       1.0           1.0            0.0           0.0
1           20       1.0       0.0           0.0            0.0           1.0
2           15       1.0       0.0           1.0            0.0           0.0
3           25       0.0       1.0           0.0            1.0           0.0
4           30       1.0       0.0           0.0            0.0           1.0
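
The table above encodes whole categorical columns. Applied to text, the same idea gives each word its own vector over the vocabulary. The following is a minimal sketch in plain Python; the sentence and the `one_hot` helper are illustrative assumptions, not library functions.

```python
# Hypothetical sentence, chosen only for illustration.
sentence = "people watch youtube"
tokens = sentence.split()

# Sorted vocabulary so every word has a fixed position.
vocabulary = sorted(set(tokens))
word_to_index = {word: i for i, word in enumerate(vocabulary)}

def one_hot(word):
    """Return a vector of zeros with a single 1 at the word's vocabulary index."""
    vector = [0] * len(vocabulary)
    vector[word_to_index[word]] = 1
    return vector

for token in tokens:
    print(token, one_hot(token))
```

Each vector is as long as the vocabulary, so this representation grows quickly with corpus size and carries no notion of similarity between words.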


### 2. What Is BoW (Bag of Words)?
BoW is a text encoding technique that:
- Converts sentences/documents into numeric form.
- Ignores word order, but counts how many times each word occurs.
- Builds a vocabulary (set of unique words) from all documents.

Creates a document-term matrix where:
- Rows = documents (sentences).
- Columns = words in vocabulary.
- Values = word frequency in that document.
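
Before turning to scikit-learn, the document-term matrix can be built by hand to see what each cell means. This is a sketch using only the standard library and the four short sentences used in the notebook cells of this section; it assumes whitespace tokenization, which is simpler than what CountVectorizer does.

```python
from collections import Counter

documents = [
    "people watch youtube",
    "youtube watch youtube",
    "people write comments",
    "youtube write comments",
]

# Vocabulary: unique words across all documents, sorted so column order is stable.
vocabulary = sorted({word for doc in documents for word in doc.split()})

# One row per document, one column per vocabulary word, values = word counts.
document_term_matrix = []
for doc in documents:
    counts = Counter(doc.split())
    document_term_matrix.append([counts[word] for word in vocabulary])

print(vocabulary)
for row in document_term_matrix:
    print(row)
```

The second row contains a 2 in the "youtube" column because that word occurs twice in the second sentence; the order of the words within the sentence is lost.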

In [23]:
from sklearn.feature_extraction.text import CountVectorizer

In [22]:
## Let's create a DataFrame to understand the workflow
data = pd.DataFrame({"text":["people watch youtube","youtube watch youtube","people write comments","youtube write comments"], "output":[1,1,0,0]})
data.head()

                     text  output
0    people watch youtube       1
1   youtube watch youtube       1
2   people write comments       0
3  youtube write comments       0


In [18]:
## Document-term matrix (Bag of Words)

In [38]:
## Initialize BoW
Bow = CountVectorizer()

## Fit and transform the data
document_matrix = Bow.fit_transform(data["text"])

## View Vocabulary
print("Vocabulary:", Bow.vocabulary_)

## CountVectorizer assigns column indices in alphabetical order of the words,
## so each row below reads: comments, people, watch, write, youtube
print("Sentence 0:", document_matrix[0].toarray())
print("Sentence 1:", document_matrix[1].toarray())
print("Sentence 2:", document_matrix[2].toarray())
print("Sentence 3:", document_matrix[3].toarray())


Vocabulary: {'people': 1, 'watch': 2, 'youtube': 4, 'write': 3, 'comments': 0}
Sentence 0: [[0 1 1 0 1]]
Sentence 1: [[0 0 1 0 2]]
Sentence 2: [[1 1 0 1 0]]
Sentence 3: [[1 0 0 1 1]]


📘 What Are Bigrams?

A bigram is a sequence of 2 consecutive words in a sentence.
Instead of treating single words (like "youtube" or "watch") as features (like in BoW), bigrams treat pairs of words (like "youtube watch") as features.

Bigrams help capture word relationships and meaning that single-word features (unigrams) often miss.
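
Extracting bigrams from a sentence amounts to pairing every word with the word that follows it. A minimal sketch, assuming whitespace tokenization (the `bigrams` helper is a hypothetical name for this example):

```python
def bigrams(sentence):
    """Return the list of consecutive word pairs in a whitespace-tokenized sentence."""
    words = sentence.split()
    # zip(words, words[1:]) pairs each word with its successor.
    return [" ".join(pair) for pair in zip(words, words[1:])]

print(bigrams("people watch youtube"))
print(bigrams("youtube watch youtube"))
```

The two sentences share the words "watch" and "youtube", but only the second one produces the bigram "youtube watch", which is the kind of ordering information single-word features cannot express.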

In [41]:
from sklearn.feature_extraction.text import CountVectorizer

# Initialize the CountVectorizer with ngram_range=(2,2) to capture ONLY bigrams
bigram = CountVectorizer(ngram_range=(2, 2))

# Fit and transform the text data
bigramvoc = bigram.fit_transform(data["text"])

# View the vocabulary
bigram.vocabulary_


{'people watch': 0,
 'watch youtube': 2,
 'youtube watch': 4,
 'people write': 1,
 'write comments': 3,
 'youtube write': 5}

In Row 0 ("people watch youtube"):
- "people watch" and "watch youtube" appear once.
- Others = 0.

In Row 1 ("youtube watch youtube"):
- "youtube watch" and "watch youtube" appear once.

In Row 2 ("people write comments"):
- "people write" and "write comments" appear once.

In Row 3 ("youtube write comments"):
- "youtube write" and "write comments" appear once.

Each bigram is assigned a unique column index. The indices follow the alphabetical order of the bigrams; the dictionary above is simply printed in the order the bigrams were first encountered.
The vocabulary is the set of unique bigrams present across all documents.
Bigrams capture context and phrase-level meaning.
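
The row descriptions above can be checked by counting bigrams directly. The sketch below uses only the standard library and assumes whitespace tokenization; the `bigrams` helper is a hypothetical name, and the alphabetical column order is chosen to mirror the indices in the vocabulary printed earlier.

```python
from collections import Counter

documents = [
    "people watch youtube",
    "youtube watch youtube",
    "people write comments",
    "youtube write comments",
]

def bigrams(sentence):
    """Return the consecutive word pairs of a whitespace-tokenized sentence."""
    words = sentence.split()
    return [" ".join(pair) for pair in zip(words, words[1:])]

# Columns: every unique bigram in the corpus, in alphabetical order.
features = sorted({b for doc in documents for b in bigrams(doc)})

# Rows: bigram counts per document.
bigram_rows = []
for doc in documents:
    counts = Counter(bigrams(doc))
    bigram_rows.append([counts[f] for f in features])

print(features)
for row_index, row in enumerate(bigram_rows):
    print(row_index, row)
```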


📘 What Are Trigrams?

A trigram is a sequence of 3 consecutive words in a sentence.
This captures even richer context than bigrams, which is especially useful when the meaning is shaped by full phrases, not just pairs of words.

Trigrams are especially powerful in tasks like text generation, summarization, and predictive modeling where full phrases matter.
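
Bigrams and trigrams are both special cases of n-grams, so one sliding-window function covers them. A minimal sketch, assuming whitespace tokenization; the `ngrams` helper and the five-word sentence are illustrative only.

```python
def ngrams(sentence, n):
    """Return every run of n consecutive words in a whitespace-tokenized sentence."""
    words = sentence.split()
    # A sentence with k words has k - n + 1 windows of length n (none if k < n).
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]

print(ngrams("people watch youtube videos daily", 3))
print(ngrams("people watch youtube", 3))
```

A three-word sentence yields exactly one trigram, which is why each sentence in this section's small dataset contributes a single trigram feature.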

In [42]:
from sklearn.feature_extraction.text import CountVectorizer

# Initialize CountVectorizer with ngram_range=(3,3) for trigrams
trigram = CountVectorizer(ngram_range=(3, 3))

# Fit and transform the text data
trigram_matrix = trigram.fit_transform(data["text"])

# View the vocabulary
trigram.vocabulary_


{'people watch youtube': 0,
 'youtube watch youtube': 2,
 'people write comments': 1,
 'youtube write comments': 3}

In "people watch youtube":
- The trigram "people watch youtube" appears once.

In "youtube watch youtube":
- The trigram "youtube watch youtube" appears once.
- Others = 0 as they don't appear in those rows.

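Unigrams and bigrams can also be combined in a single vocabulary:

In [None]:
# ngram_range=(1, 2) keeps both single words and word pairs as features
mix = CountVectorizer(ngram_range=(1,2))

In [None]:
mixvoc = mix.fit_transform(data["text"])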

In [None]:
mix.vocabulary_

{'people': 1,
 'watch': 4,
 'youtube': 8,
 'people watch': 2,
 'watch youtube': 5,
 'youtube watch': 9,
 'write': 6,
 'comments': 0,
 'people write': 3,
 'write comments': 7,
 'youtube write': 10}
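
The combined vocabulary above can be reproduced by collecting every n-gram for each length in the range and numbering the features alphabetically. This is a sketch under the assumption of whitespace tokenization; `ngrams` and `ngram_vocabulary` are hypothetical helper names, not scikit-learn functions.

```python
documents = [
    "people watch youtube",
    "youtube watch youtube",
    "people write comments",
    "youtube write comments",
]

def ngrams(sentence, n):
    """Return every run of n consecutive words in a whitespace-tokenized sentence."""
    words = sentence.split()
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]

def ngram_vocabulary(docs, min_n, max_n):
    """Map each unique n-gram (min_n <= n <= max_n) to an alphabetical index."""
    features = set()
    for doc in docs:
        for n in range(min_n, max_n + 1):
            features.update(ngrams(doc, n))
    return {feature: index for index, feature in enumerate(sorted(features))}

vocab = ngram_vocabulary(documents, 1, 2)
for feature, index in vocab.items():
    print(index, feature)
```

With 5 unigrams and 6 bigrams the feature space already has 11 columns for four tiny sentences, which hints at how quickly wider n-gram ranges inflate the matrix.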

📘 What Is TF-IDF?

- TF-IDF stands for Term Frequency–Inverse Document Frequency.
- It’s a statistical method used to measure how important a word is in a document relative to a collection of documents (corpus).
- It goes beyond just counting how often a word appears (like Bag of Words): it also considers how rare or common that word is across all documents.

Why Use TF-IDF Instead of CountVectorizer?
- CountVectorizer only counts word occurrences, so common words like "the", "is", "watch" get high weight even if they’re not meaningful.

TF-IDF balances this by:
- Giving high scores to words that appear frequently in one document but rarely in others
- Giving low scores to common words across all documents

This makes it well suited to tasks like:
- Search engines
- Keyword extraction
- Text classification

🔍 TF-IDF Formula

1. TF (Term Frequency): how often a word appears in a document

   TF(w, d) = (Number of times word w appears in document d) / (Total words in document d)

2. IDF (Inverse Document Frequency): how unique the word is across all documents

   IDF(w) = log_e(Total number of documents / Number of documents containing w)

3. TF-IDF: the product of the two

   TF-IDF(w, d) = TF(w, d) × IDF(w)
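
To make the formula concrete, here is a small worked computation in plain Python on the four sentences used in this section. It is a sketch of the textbook definition given above, assuming whitespace tokenization; the helper names `tf`, `idf`, and `tf_idf` are chosen for this example.

```python
import math

documents = [
    "people watch youtube",
    "youtube watch youtube",
    "people write comments",
    "youtube write comments",
]
tokenized = [doc.split() for doc in documents]

def tf(word, tokens):
    """Term frequency: occurrences of the word divided by the document length."""
    return tokens.count(word) / len(tokens)

def idf(word):
    """Inverse document frequency: natural log of (total docs / docs containing the word)."""
    containing = sum(1 for tokens in tokenized if word in tokens)
    return math.log(len(tokenized) / containing)

def tf_idf(word, tokens):
    return tf(word, tokens) * idf(word)

first = tokenized[0]  # "people watch youtube"
for word in first:
    print(word, round(tf_idf(word, first), 4))
```

All three words have the same term frequency in the first sentence, but "youtube" appears in three of the four documents, so its IDF and therefore its final score are lower than those of "people" and "watch".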

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer

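In [None]:
# Defaults (smooth_idf=True, norm="l2") use idf = ln((1 + n) / (1 + df)) + 1 and
# scale each row to unit length, so the values differ from the plain formula above
Tfidf = TfidfVectorizer()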

In [None]:
Tfidf.fit_transform(data["text"]).toarray()

array([[0.        , 0.61366674, 0.61366674, 0.        , 0.49681612],
       [0.        , 0.        , 0.52546357, 0.        , 0.8508161 ],
       [0.57735027, 0.57735027, 0.        , 0.57735027, 0.        ],
       [0.61366674, 0.        , 0.        , 0.61366674, 0.49681612]])

In [None]:
Tfidf.get_feature_names_out()

array(['comments', 'people', 'watch', 'write', 'youtube'], dtype=object)

In [None]:
Tfidf.idf_

array([1.51082562, 1.51082562, 1.51082562, 1.51082562, 1.22314355])
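
The numbers printed above do not follow the plain TF × IDF formula exactly, because TfidfVectorizer by default uses raw counts as TF, a smoothed IDF of ln((1 + n) / (1 + df)) + 1, and then divides each row by its Euclidean length. The sketch below reproduces those defaults with the standard library only, assuming whitespace tokenization; `smoothed_idf` and `tfidf_row` are hypothetical helper names.

```python
import math
from collections import Counter

documents = [
    "people watch youtube",
    "youtube watch youtube",
    "people write comments",
    "youtube write comments",
]
tokenized = [doc.split() for doc in documents]
vocabulary = sorted({word for tokens in tokenized for word in tokens})
n_docs = len(tokenized)

def smoothed_idf(word):
    """Smoothed IDF: ln((1 + n) / (1 + df)) + 1, where df = documents containing the word."""
    df = sum(1 for tokens in tokenized if word in tokens)
    return math.log((1 + n_docs) / (1 + df)) + 1

idf = {word: smoothed_idf(word) for word in vocabulary}

def tfidf_row(tokens):
    """Raw count times smoothed IDF, then L2-normalize the row to unit length."""
    counts = Counter(tokens)
    raw = [counts[word] * idf[word] for word in vocabulary]
    norm = math.sqrt(sum(value * value for value in raw))
    return [value / norm for value in raw]

print([round(idf[word], 8) for word in vocabulary])
for tokens in tokenized:
    print([round(value, 8) for value in tfidf_row(tokens)])
```

Because of the added 1, a word that occurs in every document still keeps a weight of 1 instead of dropping to 0, and the row normalization keeps long and short documents on a comparable scale.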