**Text Representation Methods (Bag of Words and TF-IDF)**

**What is Bag of Words (BoW)?** <br>
BoW is one of the most basic text processing techniques for representing text numerically. It turns a text into a numerical vector based on how often each word occurs in it, ignoring word order.

**How Does It Work?**
1. A vocabulary is built from the unique words found across all documents (texts).
2. Each document is represented as a vector with one entry per vocabulary word, holding how many times that word occurs in the document (0 if it is absent).
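
The two steps above can be sketched by hand in plain Python, without scikit-learn. This is a minimal illustration only: it assumes lowercasing and whitespace splitting as the tokenizer, which is simpler than what `CountVectorizer` does, and the names `tokenized`, `vocabulary`, and `bow_vectors` are introduced here for the example.

```python
from collections import Counter

texts = ["Cat black black", "Dog black white"]

# Assumed tokenizer: lowercase, then split on whitespace.
tokenized = [text.lower().split() for text in texts]

# Step 1: build a sorted vocabulary of the unique words in all documents.
vocabulary = sorted({word for doc in tokenized for word in doc})

# Step 2: for each document, count how often each vocabulary word occurs.
bow_vectors = []
for doc in tokenized:
    counts = Counter(doc)  # missing words count as 0
    bow_vectors.append([counts[word] for word in vocabulary])

print("Vocabulary:", vocabulary)
print("Vectors:", bow_vectors)
```

For these two short texts, the hand-built vocabulary and counts should agree with the `CountVectorizer` result in the cells that follow.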

In [None]:
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

In [None]:
texts = ["Cat black black", "Dog black white"]

In [None]:
# BoW representation: learn the vocabulary and count words per document
vectorizer = CountVectorizer()
x_bow = vectorizer.fit_transform(texts)
print("Bag of Words Representation:")
print(x_bow.toarray())
print("Vocabulary:", vectorizer.get_feature_names_out())

Bag of Words Representation:
[[2 1 0 0]
 [1 0 1 1]]
Vocabulary: ['black' 'cat' 'dog' 'white']


**What is TF-IDF (Term Frequency - Inverse Document Frequency)?** <br>
TF-IDF converts text to numbers in a way similar to BoW, but produces a more informative representation by down-weighting words that appear in many documents and therefore do little to distinguish one document from another.

**How Does It Work?**
1. TF (Term Frequency): How often a word occurs in a given text.
2. IDF (Inverse Document Frequency): How rare a word is across the whole collection. The more documents a word appears in, the lower its weight.
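
To make the two components concrete, here is a hand computation in plain Python. It is a sketch under stated assumptions: it uses the smoothed IDF `ln((1 + n) / (1 + df)) + 1` followed by L2 normalization of each document vector, which is how `TfidfVectorizer` behaves with its default settings, and a simple lowercase-and-split tokenizer. The variable names are introduced here for illustration.

```python
import math
from collections import Counter

texts = ["Cat black black", "Dog black white"]

# Assumed tokenizer: lowercase, then split on whitespace.
tokenized = [text.lower().split() for text in texts]
vocabulary = sorted({word for doc in tokenized for word in doc})
n_docs = len(tokenized)

# Document frequency: in how many documents does each word appear?
doc_freq = {w: sum(1 for doc in tokenized if w in doc) for w in vocabulary}

# Smoothed IDF (assumption: matches TfidfVectorizer's defaults).
idf = {w: math.log((1 + n_docs) / (1 + doc_freq[w])) + 1 for w in vocabulary}

tfidf_vectors = []
for doc in tokenized:
    counts = Counter(doc)
    # TF * IDF for every vocabulary word (TF is the raw count).
    raw = [counts[w] * idf[w] for w in vocabulary]
    # L2-normalize so that each document vector has unit length.
    norm = math.sqrt(sum(v * v for v in raw))
    tfidf_vectors.append([round(v / norm, 4) for v in raw])

print("Vocabulary:", vocabulary)
print("Vectors:", tfidf_vectors)
```

Note that "black" appears in both documents, so its IDF is the minimum value of 1, while "cat", "dog", and "white" each appear in only one document and receive a higher IDF of about 1.405.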

In [None]:
# TF-IDF representation: learn the vocabulary and IDF weights, then transform
tfidf_vectorizer = TfidfVectorizer()
x_tfidf = tfidf_vectorizer.fit_transform(texts)
print("\nTF-IDF Representation:")
print(x_tfidf.toarray())
print("Vocabulary:", tfidf_vectorizer.get_feature_names_out())


TF-IDF Representation:
[[0.81818021 0.57496187 0.         0.        ]
 [0.44943642 0.         0.6316672  0.6316672 ]]
Vocabulary: ['black' 'cat' 'dog' 'white']


Source: <br>
* https://www.kaggle.com/code/vipulgandhi/bag-of-words-model-for-beginners