<a href="https://colab.research.google.com/github/mpuigcor/hello-world/blob/master/Testing_BERT_model_for_text_classification.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [2]:
import tensorflow_datasets as tfds

# Download and load the AG News dataset
# as_supervised=True loads the dataset as (text, label) tuples
# with_info=True gets the dataset information
(ds_train, ds_test), ds_info = tfds.load(
    'ag_news_subset',
    split=['train', 'test'],
    shuffle_files=True,
    as_supervised=True,
    with_info=True,
)




Downloading and preparing dataset Unknown size (download: Unknown size, generated: Unknown size, total: Unknown size) to /root/tensorflow_datasets/ag_news_subset/1.0.0...



Dataset ag_news_subset downloaded and prepared to /root/tensorflow_datasets/ag_news_subset/1.0.0. Subsequent calls will reuse this data.


In [4]:
import pandas as pd

# Take a sample from the training dataset
sample_size = 10000  # You can adjust the sample size as needed
sample_data = list(ds_train.take(sample_size).as_numpy_iterator())

# Convert the sample data to a pandas DataFrame
# The dataset yields (text, label) tuples, where text is bytes and label is an integer
sample_df = pd.DataFrame(sample_data, columns=['text', 'label'])

# Decode the text from bytes to string
sample_df['text'] = sample_df['text'].apply(lambda x: x.decode('utf-8'))

# Display the first few rows of the sample DataFrame
display(sample_df.head())

# Print the shape of the sample DataFrame
print("\nSample DataFrame shape:", sample_df.shape)

Unnamed: 0,text,label
0,AMD #39;s new dual-core Opteron chip is design...,3
1,Reuters - Major League Baseball\Monday announc...,1
2,President Bush #39;s quot;revenue-neutral quo...,2
3,Britain will run out of leading scientists unl...,3
4,"London, England (Sports Network) - England mid...",1



Sample DataFrame shape: (10000, 2)


# Simple Feature Extractor using BERT Model
---

# Installs
---

# Loading the Dataset
---

In [5]:


# Print the first few rows of the dataset
sample_df.head()

Unnamed: 0,text,label
0,AMD #39;s new dual-core Opteron chip is design...,3
1,Reuters - Major League Baseball\Monday announc...,1
2,President Bush #39;s quot;revenue-neutral quo...,2
3,Britain will run out of leading scientists unl...,3
4,"London, England (Sports Network) - England mid...",1


In [6]:
sample_df.tail()

Unnamed: 0,text,label
9995,NEW YORK (Reuters) - U.S. Treasury prices fel...,2
9996,Grid architecture aims to protect corporate ne...,3
9997,WASHINGTON (Reuters) - The U.S. Securities an...,2
9998,The Securities and Exchange Commission ordered...,2
9999,AFP - Hundreds of people marked the 100 days i...,0


In [7]:
sample_df.shape

(10000, 2)

# EDA
---


In [8]:
sample_df.isna().any()

Unnamed: 0,0
text,False
label,False


In [9]:
sample_df.dtypes

Unnamed: 0,0
text,object
label,int64


# Generating Features with Transformer
---

In [10]:
from transformers import BertModel, BertTokenizer
import torch

In [13]:
# Load the pre-trained BERT model and tokenizer
model = BertModel.from_pretrained('bert-base-uncased', output_hidden_states=True)
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.



In [11]:
# Define a function to extract one feature vector per text
def extract_features_CLS(text):
    # Tokenize the text
    input_ids = torch.tensor([tokenizer.encode(text, add_special_tokens=True)])
    # Get the hidden states for each token
    with torch.no_grad():
        outputs = model(input_ids)
        hidden_states = outputs[2]
    # Get the hidden states for the CLS token from the last 4 layers
    cls_token_vecs = []
    for layer in range(-4, 0):
        cls_token_vecs.append(hidden_states[layer][0][0]) # [0][0] gets the CLS token's hidden state
    # Concatenate the features from the last 4 layers
    return torch.cat(cls_token_vecs)

This function, extract_features_CLS, is used to extract features from text using a pre-trained transformer model. Here's a detailed description of what the function does:

1. The input to the function is a string of text (in this case, the text of a news item).

2. The "tokenizer.encode" method is used to tokenize the text. Tokenization involves breaking down the text into smaller units (called tokens) that the model can understand. In this case, the tokenizer used is from the Hugging Face Transformers library.

3. The resulting list of tokens is then converted to a PyTorch tensor using the "torch.tensor" method.

4. The PyTorch tensor is then passed through the pre-trained transformer model using the "model" variable. This variable contains the pre-trained model loaded using the Hugging Face Transformers library.

5. The "outputs" variable contains the outputs from the model, which include the hidden states for each token.

6. The "hidden_states" variable contains a tuple of 13 tensors: the output of the embedding layer followed by the output of each of the 12 transformer layers. Each tensor holds one hidden state per token.

7. The "cls_token_vecs" variable is a list of four tensors, one per layer, each holding the hidden state of the [CLS] token (position 0) in one of the last 4 layers of the transformer.

8. Each of these vectors has 768 elements, the hidden size of bert-base-uncased. No averaging over tokens or layers takes place.

9. Finally, the four vectors are concatenated into a single PyTorch tensor of 4 x 768 = 3072 elements using the "torch.cat" method, and this tensor is returned as the output of the function.

So, this function extracts useful features from text using a pre-trained transformer model. These features can then be used in a variety of downstream tasks such as sentiment analysis, text classification, named entity recognition, and more.
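The indexing in steps 6 to 9 can be illustrated without loading BERT. The following is only a sketch on made-up data: the nested lists stand in for the "hidden_states" tuple, the hidden size is 3 instead of 768, and the helper name concat_cls_last4 is hypothetical.

```python
# Toy stand-in for BERT's hidden_states: 13 "layers" (embeddings + 12 encoder
# layers), each shaped [batch=1][tokens][hidden]. HIDDEN is 3 here, not 768.
NUM_LAYERS, NUM_TOKENS, HIDDEN = 13, 5, 3

hidden_states = [
    [[[float(layer * 100 + token * 10 + h) for h in range(HIDDEN)]
      for token in range(NUM_TOKENS)]]
    for layer in range(NUM_LAYERS)
]

def concat_cls_last4(hidden_states):
    """Concatenate the first-token ([CLS]) vector of the last four layers."""
    features = []
    for layer in range(-4, 0):
        cls_vec = hidden_states[layer][0][0]  # batch item 0, token 0
        features.extend(cls_vec)
    return features

toy_features = concat_cls_last4(hidden_states)
print(len(toy_features))  # 4 * HIDDEN
```

With the real model the same indexing gives 4 x 768 = 3072 values per text, regardless of how many tokens the text has.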

# Feature Extraction
---

In [14]:
# Extract features for each text
features = []
for i in range(len(sample_df)):
    # Use the extract_features_CLS function to get a single feature vector per text
    features.append(extract_features_CLS(sample_df.iloc[i]["text"]))

# Stack the feature vectors into one 2D tensor and convert to a numpy array
features = torch.stack(features).numpy()

In [None]:
features

array([[ 0.1668    , -0.81575555, -0.16389446, ..., -0.36543572,
        -0.21953402,  0.4504192 ],
       [-0.16256328, -0.7151536 , -0.39481637, ..., -0.00268058,
         0.69796187,  0.13372932],
       [ 0.1150583 , -0.65717053, -0.2132323 , ..., -0.32905242,
         0.43736273,  0.3176901 ],
       ...,
       [-0.06504016, -1.0908614 , -0.6053405 , ..., -0.37211984,
         0.2043658 ,  0.2496659 ],
       [ 0.25928223, -0.6746095 , -0.2460485 , ..., -0.15313435,
         0.50937295,  0.57639265],
       [-0.0303058 , -0.6471238 , -0.47653496, ..., -0.23611926,
         0.80487597,  0.22889309]], dtype=float32)

In [15]:
features.shape

(10000, 3072)

This code uses the previously defined "extract_features_CLS" function to extract features for each text in the dataset stored in the Pandas DataFrame "sample_df".

The code iterates through each row in the DataFrame and passes the news text to the "extract_features_CLS" function to obtain the feature vector for that row. The resulting feature vectors are appended to a list "features".

After all texts have been processed, the code stacks the feature vectors into a single 2D tensor with "torch.stack" and converts it into a numpy array with the tensor's "numpy" method.

The resulting numpy array "features" can then be used as input to train a machine learning model or perform other analyses.
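The loop above runs one forward pass per text. One possible speed-up, not used in this notebook, is to feed several texts to the model per pass; that would also require padding and an attention mask on the tokenizer side, which is an assumption beyond the original code. The grouping step on its own might look like this sketch, where iter_batches is a hypothetical helper:

```python
def iter_batches(items, batch_size):
    """Yield consecutive slices of `items` with at most `batch_size` elements."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]

texts = ["first text", "second text", "third text", "fourth text", "fifth text"]
for batch in iter_batches(texts, batch_size=2):
    print(len(batch), batch)
```

The last batch may be shorter than the others, so any downstream code should not assume a fixed batch length.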

In [16]:
labels = sample_df['label'].values
labels

array([3, 1, 2, ..., 2, 2, 0])

In [17]:
print(features.shape)
print(labels.shape)

(10000, 3072)
(10000,)


The output numpy array features contains the extracted features for each text in the "text" column of the dataframe. The number of features per text does not depend on the number of tokens in the text; it is fixed by the "hidden_size" of the transformer model and the number of layers used. In this case, since we've used the pre-trained bert-base-uncased model, the hidden_size is 768, and concatenating the [CLS] vectors from the last 4 layers gives 4 x 768 = 3072 features per text.

The BERT model uses WordPiece tokenization, which splits words into subwords and then maps each subword to an embedding. Therefore, each input text is divided into multiple subword tokens, and the model produces one hidden state per token. Only the hidden state of the [CLS] token is kept here, so each input text yields exactly one feature vector, giving 10000 rows in total for 10000 input texts.
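To make the subword splitting concrete, here is a minimal sketch of greedy longest-match-first WordPiece splitting for a single word. The vocabulary is a made-up toy set, not the real bert-base-uncased vocabulary, and the real tokenizer also performs lowercasing and punctuation splitting that are left out here.

```python
def wordpiece(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first split of one word into subword pieces."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        match = None
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces are prefixed
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return [unk]  # no piece fits: the whole word becomes unknown
        pieces.append(match)
        start = end
    return pieces

toy_vocab = {"un", "##aff", "##able", "play", "##ing", "news"}
print(wordpiece("unaffable", toy_vocab))
print(wordpiece("playing", toy_vocab))
print(wordpiece("xyz", toy_vocab))
```

However many pieces a text is split into, the feature extractor in this notebook keeps only the [CLS] position, so the feature count per text stays fixed.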

In [18]:

import numpy as np
# concatenate the feature array with the label array horizontally
dataset = np.hstack((features, labels.reshape((-1, 1))))

# dataset is a 2D numpy array of size 10000x(4*768+1) = 10000x3073

In [19]:
dataset.shape

(10000, 3073)

The above code takes two numpy arrays as input:

1. "features" array which is a 2D numpy array of size 10000x3072 representing the features extracted for 10000 rows of text.
2. "labels" array which is a 1D numpy array of size 10000 representing the target class labels for each row of text.

The "labels" array is reshaped to a column of size 10000x1 using the reshape() function. It is then concatenated horizontally with the "features" array using the hstack() function to create a 2D numpy array named "dataset" of size 10000x(768x4+1). Because a numpy array has a single dtype, the integer labels are stored as floats in "dataset". This concatenated array will be used as the input dataset for machine learning modeling.

The "features" array is a 2D numpy array of size 10000x3072. This means it has 10000 rows and 3072 columns. Each row represents the features extracted from a single text.

In the "extract_features_CLS" function, we took the hidden state of the [CLS] token from each of the last 4 layers and concatenated them to get a representation for the text. Each hidden state is a vector of size 768. Since we are taking the last 4 layers, we end up with a vector of size 768 x 4 = 3072 for the text.

No averaging is applied, so the full 3072-element vector is what we end up with in the features array.

No reshaping of the features array is needed. We concatenate it horizontally with the label array of size 10000x1 to get a 2D numpy array of size 10000x(3072+1) = 10000x3073.
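The shape and dtype effects of the horizontal concatenation can be checked on a small stand-in. This is a sketch with toy sizes (5 rows, 8 features) and made-up label values; the names toy_features, toy_labels and toy_dataset are not from the notebook.

```python
import numpy as np

# Toy stand-ins: 5 texts with 8 features each instead of 10000 x 3072.
toy_features = np.zeros((5, 8), dtype=np.float32)
toy_labels = np.array([3, 1, 2, 3, 1], dtype=np.int64)

# reshape((-1, 1)) turns the 1D label vector into a single column
toy_dataset = np.hstack((toy_features, toy_labels.reshape((-1, 1))))

print(toy_dataset.shape)  # one extra column for the label
print(toy_dataset.dtype)  # float32 and int64 are promoted to a common float type

# Splitting the label column off again; the cast restores integer labels
X, y = toy_dataset[:, :-1], toy_dataset[:, -1].astype(np.int64)
```

The dtype promotion is why the labels later show up as 0.0, 1.0, 2.0 and 3.0 in the predictions and the classification report. Passing the features and labels to train_test_split as two separate arrays would be one way to avoid it.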

# BERT + Logistic regression Modelling
---

In [20]:
# Split the data into training and testing sets
from sklearn.model_selection import train_test_split
train_data, test_data = train_test_split(dataset, test_size=0.2, random_state=42)

# Convert the training and testing sets back into separate feature and label arrays
X_train, y_train = train_data[:, :-1], train_data[:, -1]
X_test, y_test = test_data[:, :-1], test_data[:, -1]

In [21]:
X_train.shape

(8000, 3072)

In [22]:
X_train.dtype

dtype('float64')

In [23]:
from sklearn.linear_model import LogisticRegression

# Train a logistic regression classifier on the training set
clf = LogisticRegression(max_iter = 1000)
clf.fit(X_train, y_train)

In [24]:
clf.coef_.shape

(4, 3072)

In [25]:
# Evaluate the classifier on the testing set
score = clf.score(X_test, y_test)
print("Accuracy:", score)

Accuracy: 0.882


In [26]:
# Predict the labels of the testing set
y_pred = clf.predict(X_test)

In [27]:
y_pred[1:5]

array([2., 0., 1., 1.])

In [28]:
from sklearn.metrics import confusion_matrix, classification_report

In [29]:
# Generate the confusion matrix and classification report
cm = confusion_matrix(y_test, y_pred)
cr = classification_report(y_test, y_pred)

In [30]:
print("Confusion Matrix:\n", cm)

Confusion Matrix:
 [[484   8  21  22]
 [ 12 483   4   7]
 [ 25   4 396  54]
 [ 17   4  58 401]]


In [31]:
print("\nClassification Report:\n", cr)


Classification Report:
               precision    recall  f1-score   support

         0.0       0.90      0.90      0.90       535
         1.0       0.97      0.95      0.96       506
         2.0       0.83      0.83      0.83       479
         3.0       0.83      0.84      0.83       480

    accuracy                           0.88      2000
   macro avg       0.88      0.88      0.88      2000
weighted avg       0.88      0.88      0.88      2000
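
The per-class figures in the report can be reproduced from the confusion matrix printed above. The sketch below copies that matrix by hand and assumes scikit-learn's convention that rows are true classes and columns are predicted classes.

```python
# Confusion matrix printed above: rows are true classes, columns are predictions.
cm = [
    [484, 8, 21, 22],
    [12, 483, 4, 7],
    [25, 4, 396, 54],
    [17, 4, 58, 401],
]

n_classes = len(cm)
total = sum(sum(row) for row in cm)
accuracy = sum(cm[i][i] for i in range(n_classes)) / total

precisions, recalls = [], []
for k in range(n_classes):
    predicted_k = sum(cm[i][k] for i in range(n_classes))  # column sum
    actual_k = sum(cm[k])                                   # row sum (support)
    precisions.append(cm[k][k] / predicted_k)
    recalls.append(cm[k][k] / actual_k)
    print(f"class {k}: precision={precisions[k]:.2f} "
          f"recall={recalls[k]:.2f} support={actual_k}")

print(f"accuracy={accuracy:.3f}")
```

Classes 2 and 3 are mostly confused with each other, which is what pulls their precision and recall below those of classes 0 and 1.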



---

This is my trial notebook for utilizing a transformer model for feature extraction. Specifically, I have used the BERT model to extract features from the text column of the AG News Classification dataset, which I have restricted to a sample of 10000 rows. Of these, 8000 rows are used for training and 2000 for testing. Notably, the text column consists of news text, while the label column comprises four distinct labels.

In this notebook,

1. Firstly, I have loaded the dataset and subsequently preprocessed it for use with the transformer model.
2. Next, I have employed the BERT transformer model to extract features from the dataset.
3. These extracted features have then been utilized for machine learning modelling, with the Logistic Regression classification model being employed in this regard.
4. The overall accuracy of the Logistic Regression model on the test set was observed to be 88.2%.
5. Other evaluation measures such as the Confusion Matrix and Classification Report were used to understand the performance of the ML model used.

## Using BERT + NN model

In [32]:
import tensorflow as tf

In [95]:
def build_model():
    input_features = tf.keras.Input(shape=(3072,), name="X_train")
    lay = tf.keras.layers.Dropout(0.2)(input_features)
    lay = tf.keras.layers.Dense(10, activation='relu')(lay)
    lay = tf.keras.layers.Dropout(0.2)(lay)
    out = tf.keras.layers.Dense(4, activation='softmax')(lay)

    model = tf.keras.models.Model(inputs=input_features, outputs=out)
    model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=0.001),
                  loss='categorical_crossentropy',
                  metrics=['accuracy'])

    return model


In [98]:
model = build_model()
model.summary()
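
The output of model.summary() is not shown here, but the number of trainable parameters can be worked out by hand from the layer sizes in build_model. A small sketch, where dense_params is a hypothetical helper:

```python
def dense_params(n_inputs, n_units):
    """Weights plus biases of a fully connected layer."""
    return n_inputs * n_units + n_units

hidden = dense_params(3072, 10)   # Dense(10, activation='relu')
output = dense_params(10, 4)      # Dense(4, activation='softmax')
total_params = hidden + output    # Dropout layers add no parameters

print(hidden, output, total_params)
```

Almost all of the parameters sit in the first Dense layer, because it connects every one of the 3072 input features to each hidden unit.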

In [35]:
from sklearn import preprocessing

label = preprocessing.LabelEncoder()
labels_train = label.fit_transform(y_train)
labels_train = tf.keras.utils.to_categorical(labels_train)
print(labels_train[:5])

[[0. 1. 0. 0.]
 [0. 0. 1. 0.]
 [1. 0. 0. 0.]
 [0. 0. 0. 1.]
 [1. 0. 0. 0.]]


In [36]:
y_train[1:10]

array([2., 0., 3., 0., 3., 0., 2., 3., 1.])

In [37]:
labels_train[1:10]

array([[0., 0., 1., 0.],
       [1., 0., 0., 0.],
       [0., 0., 0., 1.],
       [1., 0., 0., 0.],
       [0., 0., 0., 1.],
       [1., 0., 0., 0.],
       [0., 0., 1., 0.],
       [0., 0., 0., 1.],
       [0., 1., 0., 0.]])

In [38]:
labels_train.dtype

dtype('float64')
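
What to_categorical does to the labels can be sketched in plain Python. The helper one_hot is hypothetical, and the five label values are read off the y_train and labels_train outputs above.

```python
def one_hot(class_ids, num_classes):
    """Return one row per label with a 1.0 in the column of its class."""
    rows = []
    for class_id in class_ids:
        row = [0.0] * num_classes
        row[int(class_id)] = 1.0  # labels arrive as floats such as 2.0
        rows.append(row)
    return rows

# First five training labels, read off the outputs above
first_labels = [1.0, 2.0, 0.0, 3.0, 0.0]
for row in one_hot(first_labels, num_classes=4):
    print(row)
```

This one-hot form is what the 'categorical_crossentropy' loss in build_model expects as its target.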

In [99]:
train_sh = model.fit(
    X_train, labels_train,
    validation_split=0.2,
    epochs=10,
    #callbacks=[checkpoint, earlystopping],
    batch_size=32,
    verbose=1
)

Epoch 1/10
200/200 ━━━━━━━━━━━━━━━━━━━━ 3s 9ms/step - accuracy: 0.4760 - loss: 1.1434 - val_accuracy: 0.8331 - val_loss: 0.5525
Epoch 2/10
200/200 ━━━━━━━━━━━━━━━━━━━━ 1s 3ms/step - accuracy: 0.7226 - loss: 0.6765 - val_accuracy: 0.8687 - val_loss: 0.4544
Epoch 3/10
200/200 ━━━━━━━━━━━━━━━━━━━━ 1s 3ms/step - accuracy: 0.7558 - loss: 0.6056 - val_accuracy: 0.8706 - val_loss: 0.4128
Epoch 4/10
200/200 ━━━━━━━━━━━━━━━━━━━━ 1s 3ms/step - accuracy: 0.7665 - loss: 0.5618 - val_accuracy: 0.8769 - val_loss: 0.3938
Epoch 5/10
200/200 ━━━━━━━━━━━━━━━━━━━━ 1s 3ms/step - accuracy: 0.7742 - loss: 0.5486 - val_accuracy: 0.8781 - val_loss: 0.3769
Epoch 6/10
200/200 ━━━━━━━━━━━━━━━━━━━━ 1s 3ms/step - accuracy: 0.7774 - loss: 0.5482 - val_accuracy: 0.8825 - val_loss: 0.3598
Epoch 7/10
200/200

In [100]:
# Predict the probabilities for the test set
y_pred_probs = model.predict(X_test)

# Get the predicted class labels by taking the argmax of the probabilities
y_pred_nn = np.argmax(y_pred_probs, axis=1)

# Display the first few predicted labels
print("Predicted labels (Neural Network):", y_pred_nn[:10])

63/63 ━━━━━━━━━━━━━━━━━━━━ 0s 1ms/step
Predicted labels (Neural Network): [3 2 0 1 1 2 1 3 3 3]
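
The step counts in the progress lines (200 per training epoch, 63 for prediction) follow from the array sizes and the batch size. A short sketch of the arithmetic, assuming predict uses its default batch size of 32:

```python
import math

n_train_rows = 8000        # rows in X_train
validation_split = 0.2     # fraction held out by model.fit
batch_size = 32

n_fit_rows = int(n_train_rows * (1 - validation_split))
steps_per_epoch = math.ceil(n_fit_rows / batch_size)

n_test_rows = 2000
predict_steps = math.ceil(n_test_rows / batch_size)

print(n_fit_rows, steps_per_epoch, predict_steps)
```

The network is therefore fitted on 6400 rows, fewer than the 8000 rows seen by the Logistic Regression model, which is worth keeping in mind when comparing the two accuracies.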


In [101]:
# Generate the confusion matrix and classification report
cm_nn = confusion_matrix(y_test, y_pred_nn)
cr_nn = classification_report(y_test, y_pred_nn)

In [102]:
print("Confusion Matrix:\n", cm_nn)

Confusion Matrix:
 [[475  12  24  24]
 [  6 492   2   6]
 [ 20   6 382  71]
 [ 20   4  36 420]]


In [103]:
# Calculate the accuracy from the confusion matrix
accuracy_nn = np.trace(cm_nn) / np.sum(cm_nn)

print("Accuracy from Confusion Matrix (Neural Network):", accuracy_nn)

Accuracy from Confusion Matrix (Neural Network): 0.8845
