## Text Encoding

Bag of Words (BoW):

The Bag of Words (BoW) model is one of the simplest methods of text encoding. Here's how it works:

- Vocabulary Creation:
    A vocabulary is created by listing all the unique words in the text corpus.
    Commonly, preprocessing steps like removing punctuation, converting to lowercase, and stemming/lemmatization are applied to standardize the text and reduce the vocabulary size.

- Text Vectorization:
    Each document/text is represented as a vector in a multi-dimensional space, where each dimension corresponds to a term (word) in the vocabulary.
    The value in each dimension is the frequency of that term in the document.

For example, consider two documents:

    Doc1: "I love programming."
    Doc2: "Programming is fun."

After lowercasing and removing punctuation, the vocabulary will be:

    ['i', 'love', 'programming', 'is', 'fun']

The BoW representations will be:

    BoW(Doc1) = [1, 1, 1, 0, 0]
    BoW(Doc2) = [0, 0, 1, 1, 1]
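
The two steps above can be sketched in plain Python. This is only an illustration, not how scikit-learn implements it: the `tokenize` and `bow_vector` helpers are hypothetical names introduced here, and the vocabulary is kept in order of first appearance so that it lines up with the example.

```python
import re
from collections import Counter

docs = ["I love programming.", "Programming is fun."]

def tokenize(text):
    # Lowercase the text and keep runs of letters only, which drops punctuation.
    return re.findall(r"[a-z]+", text.lower())

# Vocabulary creation: unique tokens, in order of first appearance.
vocabulary = []
for doc in docs:
    for token in tokenize(doc):
        if token not in vocabulary:
            vocabulary.append(token)

def bow_vector(text):
    # Text vectorization: count how often each vocabulary term occurs.
    counts = Counter(tokenize(text))
    return [counts[term] for term in vocabulary]

vectors = [bow_vector(doc) for doc in docs]
print(vocabulary)  # ['i', 'love', 'programming', 'is', 'fun']
print(vectors)     # [[1, 1, 1, 0, 0], [0, 0, 1, 1, 1]]
```

Because the tokens are lowercased, "Programming" and "programming" map to the same dimension.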

In [1]:
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
import pandas as pd

# Toy text data
documents = [
    'I love programming.',
    'Python is a versatile language.',
    'Data science is an interesting field.',
    'Machine learning is a subset of data science.'
]

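# -------------------
# Bag of Words
# -------------------
# CountVectorizer lowercases the text and, with its default token pattern,
# drops single-character tokens such as "I" and "a".
vectorizer_bow = CountVectorizer()
X_bow = vectorizer_bow.fit_transform(documents)
df_bow = pd.DataFrame(X_bow.toarray(), columns=vectorizer_bow.get_feature_names_out())
df_bow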



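|   | an | data | field | interesting | is | language | learning | love | machine | of | programming | python | science | subset | versatile |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
| 1 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 |
| 2 | 1 | 1 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
| 3 | 0 | 1 | 0 | 0 | 1 | 0 | 1 | 0 | 1 | 1 | 0 | 0 | 1 | 1 | 0 |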


Term Frequency-Inverse Document Frequency (TF-IDF):

TF-IDF is slightly more complex. It accounts not only for how often a term occurs in a single document (Term Frequency, TF), but also for how rare that term is across the entire corpus (Inverse Document Frequency, IDF).

**Term Frequency (TF)**
    As in BoW, TF is the number of times a term occurs in a document.
    
    TF(t, d) = Number of times term t appears in document d

**Inverse Document Frequency (IDF)**
    IDF measures how rare a term is across the corpus: the fewer documents contain a term, the higher its IDF.
    
    IDF(t) = log(Total number of documents / Number of documents containing term t)

**TF-IDF Score**
    The TF-IDF score of a term in a document is the product of its TF and IDF.
    
    TF-IDF(t, d) = TF(t, d) × IDF(t)

The TF-IDF score is high for terms that are common in a particular document but rare across other documents in the corpus, thereby capturing terms that are potentially more informative.
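
As a rough illustration of the unsmoothed formula above, the sketch below scores the two example documents from the BoW section in plain Python. The pre-tokenized `docs` list and the `tfidf` helper are assumptions made for this example. The natural logarithm is also an assumption, since the formula does not fix the base.

```python
import math
from collections import Counter

# Pre-tokenized versions of "I love programming." and "Programming is fun."
docs = [
    ["i", "love", "programming"],
    ["programming", "is", "fun"],
]

n_docs = len(docs)

# Document frequency: the number of documents that contain each term.
doc_freq = Counter(term for doc in docs for term in set(doc))

def tfidf(doc):
    # TF is the raw count; IDF is log(N / df) with no smoothing.
    tf = Counter(doc)
    return {term: tf[term] * math.log(n_docs / doc_freq[term]) for term in tf}

scores = [tfidf(doc) for doc in docs]
print(scores[0])
```

Under this plain formula, "programming" occurs in every document, so its IDF is log(2/2) = 0 and its score is zero in both documents. Terms unique to one document, such as "love", score log(2). Smoothed variants of IDF, such as the one scikit-learn uses, avoid zeroing out such terms entirely.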

The numerical vectors obtained through BoW and TF-IDF can then be used as features for machine learning models.

In [2]:
df_bow.values

array([[0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0],
       [0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1, 0, 0, 1],
       [1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0],
       [0, 1, 0, 0, 1, 0, 1, 0, 1, 1, 0, 0, 1, 1, 0]])

In [3]:
(df_bow > 0).sum(axis=0)

an             1
data           2
field          1
interesting    1
is             3
language       1
learning       1
love           1
machine        1
of             1
programming    1
python         1
science        2
subset         1
versatile      1
dtype: int64

In [4]:
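import numpy as np

# Term Frequency (TF): raw counts from the BoW matrix
tf = df_bow.values

# Document Frequency (DF): number of documents containing each term
doc_freq = (df_bow > 0).sum(axis=0).to_numpy()

# Inverse Document Frequency (IDF), following sklearn's smoothed formula:
#   idf(t) = ln((1 + n) / (1 + df(t))) + 1
# The 1s inside the log prevent division by zero; the trailing + 1 keeps
# terms that occur in every document from being zeroed out.
idf = np.log((len(documents) + 1) / (doc_freq + 1)) + 1

# TF-IDF: scale each term's column by its IDF
tfidf_manual = pd.DataFrame(tf * idf, columns=df_bow.columns)
tfidf_manual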


|   | an | data | field | interesting | is | language | learning | love | machine | of | programming | python | science | subset | versatile |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.916291 | 0.0 | 0.0 | 1.916291 | 0.0 | 0.0 | 0.0 | 0.0 |
| 1 | 0.0 | 0.0 | 0.0 | 0.0 | 1.223144 | 1.916291 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.916291 | 0.0 | 0.0 | 1.916291 |
| 2 | 1.916291 | 1.510826 | 1.916291 | 1.916291 | 1.223144 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.510826 | 0.0 | 0.0 |
| 3 | 0.0 | 1.510826 | 0.0 | 0.0 | 1.223144 | 0.0 | 1.916291 | 0.0 | 1.916291 | 1.916291 | 0.0 | 0.0 | 1.510826 | 1.916291 | 0.0 |


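Scikit-learn's `TfidfVectorizer` performs both steps in a single class: it builds the count matrix as `CountVectorizer` does and then applies the TF-IDF weighting. With `norm=None`, its output matches the manual computation above.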

In [5]:
vectorizer_tfidf = TfidfVectorizer(norm=None)  # Disable L2 normalization for comparison
X_tfidf = vectorizer_tfidf.fit_transform(documents)
df_tfidf = pd.DataFrame(X_tfidf.toarray(), columns=vectorizer_tfidf.get_feature_names_out())
df_tfidf


|   | an | data | field | interesting | is | language | learning | love | machine | of | programming | python | science | subset | versatile |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.916291 | 0.0 | 0.0 | 1.916291 | 0.0 | 0.0 | 0.0 | 0.0 |
| 1 | 0.0 | 0.0 | 0.0 | 0.0 | 1.223144 | 1.916291 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.916291 | 0.0 | 0.0 | 1.916291 |
| 2 | 1.916291 | 1.510826 | 1.916291 | 1.916291 | 1.223144 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.510826 | 0.0 | 0.0 |
| 3 | 0.0 | 1.510826 | 0.0 | 0.0 | 1.223144 | 0.0 | 1.916291 | 0.0 | 1.916291 | 1.916291 | 0.0 | 0.0 | 1.510826 | 1.916291 | 0.0 |
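
Rather than comparing the manual and scikit-learn tables by eye, the agreement can be checked numerically. The following self-contained sketch recomputes both on the toy documents. The variable names `manual` and `sklearn_scores` are chosen for this example only.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

documents = [
    'I love programming.',
    'Python is a versatile language.',
    'Data science is an interesting field.',
    'Machine learning is a subset of data science.'
]

# Manual computation: raw counts scaled by the smoothed IDF.
counts = CountVectorizer().fit_transform(documents).toarray()
n_docs = counts.shape[0]
doc_freq = (counts > 0).sum(axis=0)
idf = np.log((1 + n_docs) / (1 + doc_freq)) + 1
manual = counts * idf

# scikit-learn computation with normalization disabled.
sklearn_scores = TfidfVectorizer(norm=None).fit_transform(documents).toarray()

print(np.allclose(manual, sklearn_scores))  # True
```

Both vectorizers sort their vocabulary alphabetically, so the columns line up without any reordering.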


## Training models on the IMDB dataset

We use the IMDB movie reviews dataset for a binary classification task: labeling each review as positive or negative. We encode the text with both BoW and TF-IDF, train a separate Logistic Regression model on each encoding, and evaluate both with a classification report. The report gives precision, recall, and F1-score for each class (positive and negative reviews), so the two encoding schemes can be compared directly.

Experimenting with different text encoding techniques is a crucial step in handling text classification problems, as the choice of encoding can significantly impact the model's performance.
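
One convenient way to swap encodings is to wrap the vectorizer and the classifier in a scikit-learn pipeline and loop over the candidates. The sketch below does this on a tiny made-up corpus so that it runs in seconds. The review texts and labels are hypothetical and are not taken from IMDB.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Hypothetical mini-corpus: 1 = positive, 0 = negative.
train_texts = [
    "a wonderful and moving film",
    "great acting and a great story",
    "terrible plot and awful acting",
    "a boring and awful film",
]
train_labels = [1, 1, 0, 0]
test_texts = ["a great and wonderful story", "an awful and boring plot"]

encoders = [("bow", CountVectorizer()), ("tfidf", TfidfVectorizer())]

predictions = {}
for name, vectorizer in encoders:
    # The pipeline fits the vectorizer on the training texts only and
    # reuses that fitted vocabulary when transforming the test texts.
    model = make_pipeline(vectorizer, LogisticRegression(max_iter=1000))
    model.fit(train_texts, train_labels)
    predictions[name] = model.predict(test_texts).tolist()

print(predictions)
```

Keeping the vectorizer inside the pipeline guarantees that the test data never influences the vocabulary or the IDF weights.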

In [6]:
!pip install datasets


In [7]:
from datasets import load_dataset
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report

# Load the IMDB dataset using Hugging Face's datasets library
dataset = load_dataset("imdb")

# The data is split into train, test, and unsupervised (which we won't use here)
X_train, y_train = dataset['train']['text'], dataset['train']['label']
X_test, y_test = dataset['test']['text'], dataset['test']['label']


In [8]:
(len(X_train), len(X_test))

(25000, 25000)

In [9]:
# Encoding and Model Training: Bag of Words
vectorizer_bow = CountVectorizer()
X_train_bow = vectorizer_bow.fit_transform(X_train)
X_test_bow = vectorizer_bow.transform(X_test)
model_bow = LogisticRegression(max_iter=1000)
model_bow.fit(X_train_bow, y_train)
y_pred_bow = model_bow.predict(X_test_bow)
print(f'Classification Report (Bag of Words):\n{classification_report(y_test, y_pred_bow)}')

Classification Report (Bag of Words):
              precision    recall  f1-score   support

           0       0.86      0.87      0.87     12500
           1       0.87      0.86      0.87     12500

    accuracy                           0.87     25000
   macro avg       0.87      0.87      0.87     25000
weighted avg       0.87      0.87      0.87     25000



In [10]:
X_train_bow.shape

(25000, 74849)

In [11]:
# Encoding and Model Training: TF-IDF
vectorizer_tfidf = TfidfVectorizer()
X_train_tfidf = vectorizer_tfidf.fit_transform(X_train)
X_test_tfidf = vectorizer_tfidf.transform(X_test)
model_tfidf = LogisticRegression(max_iter=1000)
model_tfidf.fit(X_train_tfidf, y_train)
y_pred_tfidf = model_tfidf.predict(X_test_tfidf)
print(f'Classification Report (TF-IDF):\n{classification_report(y_test, y_pred_tfidf)}')


Classification Report (TF-IDF):
              precision    recall  f1-score   support

           0       0.88      0.88      0.88     12500
           1       0.88      0.88      0.88     12500

    accuracy                           0.88     25000
   macro avg       0.88      0.88      0.88     25000
weighted avg       0.88      0.88      0.88     25000
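
The IMDB TF-IDF cell differs from the earlier toy cell in one respect. It calls `TfidfVectorizer()` with its defaults, which apply L2 normalization to every document vector, whereas the toy cell passed `norm=None`. A minimal sketch of what that normalization does, reusing the toy documents:

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

documents = [
    'I love programming.',
    'Python is a versatile language.',
    'Data science is an interesting field.',
    'Machine learning is a subset of data science.'
]

raw = TfidfVectorizer(norm=None).fit_transform(documents).toarray()
normalized = TfidfVectorizer().fit_transform(documents).toarray()

# Dividing each raw row by its Euclidean length reproduces the default output.
row_norms = np.linalg.norm(raw, axis=1, keepdims=True)
print(np.allclose(raw / row_norms, normalized))  # True

# Every normalized document vector has unit length.
print(np.linalg.norm(normalized, axis=1))
```

Normalization makes document vectors comparable regardless of review length.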

