<a href="https://colab.research.google.com/github/mpuigcor/hello-world/blob/master/Simple_Feature_Extractor_%7C_BERT_Model.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [5]:
import tensorflow_datasets as tfds

# Download and load the AG News dataset
# as_supervised=True loads the dataset as (text, label) tuples
# with_info=True gets the dataset information
(ds_train, ds_test), ds_info = tfds.load(
    'ag_news_subset',
    split=['train', 'test'],
    shuffle_files=True,
    as_supervised=True,
    with_info=True,
)




Downloading and preparing dataset Unknown size (download: Unknown size, generated: Unknown size, total: Unknown size) to /root/tensorflow_datasets/ag_news_subset/1.0.0...



Dataset ag_news_subset downloaded and prepared to /root/tensorflow_datasets/ag_news_subset/1.0.0. Subsequent calls will reuse this data.


In [6]:
import pandas as pd

# Take a sample from the training dataset
sample_size = 1000  # You can adjust the sample size as needed
sample_data = list(ds_train.take(sample_size).as_numpy_iterator())

# Convert the sample data to a pandas DataFrame
# The dataset yields (text, label) tuples, where text is bytes and label is an integer
sample_df = pd.DataFrame(sample_data, columns=['text', 'label'])

# Decode the text from bytes to string
sample_df['text'] = sample_df['text'].apply(lambda x: x.decode('utf-8'))

# Display the first few rows of the sample DataFrame
display(sample_df.head())

# Print the shape of the sample DataFrame
print("\nSample DataFrame shape:", sample_df.shape)

| | text | label |
|---|---|---|
| 0 | AMD #39;s new dual-core Opteron chip is design... | 3 |
| 1 | Reuters - Major League Baseball\Monday announc... | 1 |
| 2 | President Bush #39;s quot;revenue-neutral quo... | 2 |
| 3 | Britain will run out of leading scientists unl... | 3 |
| 4 | London, England (Sports Network) - England mid... | 1 |



Sample DataFrame shape: (1000, 2)


# Simple Feature Extractor using BERT Model
---

![](https://factored.ai/wp-content/uploads/2021/09/image4.png)

[Img Source](https://www.google.com/url?sa=i&url=https%3A%2F%2Ffactored.ai%2Ftransformer-based-language-models%2F&psig=AOvVaw283KJbL5Izt7Ej4lPz2Ozu&ust=1680451002823000&source=images&cd=vfe&ved=0CBAQjRxqFwoTCLCupv6Fif4CFQAAAAAdAAAAABAE)

# Installs
---

# Loading the Dataset
---

In [7]:


# Print the first few rows of the dataset
sample_df.head()

| | text | label |
|---|---|---|
| 0 | AMD #39;s new dual-core Opteron chip is design... | 3 |
| 1 | Reuters - Major League Baseball\Monday announc... | 1 |
| 2 | President Bush #39;s quot;revenue-neutral quo... | 2 |
| 3 | Britain will run out of leading scientists unl... | 3 |
| 4 | London, England (Sports Network) - England mid... | 1 |


In [8]:
sample_df.tail()

| | text | label |
|---|---|---|
| 995 | Boston- Maybe this really is the Red Sox #39;s... | 1 |
| 996 | Head teachers in England could gain powers to ... | 0 |
| 997 | At least seven people are confirmed dead as Ty... | 0 |
| 998 | Reuters - The last Atlas 2 rocket was\launched... | 3 |
| 999 | AP - China's police ministry on Sunday handed ... | 0 |


In [9]:
sample_df.shape

(1000, 2)

# EDA
---


In [10]:
sample_df.isna().any()

| | 0 |
|---|---|
| text | False |
| label | False |


In [11]:
sample_df.dtypes

| | 0 |
|---|---|
| text | object |
| label | int64 |


# Generating Features with Transformer
---

In [12]:
from transformers import BertModel, BertTokenizer
import torch

In [13]:
# Load the pre-trained BERT model and tokenizer
model = BertModel.from_pretrained('bert-base-uncased', output_hidden_states=True)
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')


In [14]:
# Define a function to extract features for a single text
def extract_features(text):
    # Tokenize the text and add a batch dimension: shape [1, seq_len]
    input_ids = torch.tensor([tokenizer.encode(text, add_special_tokens=True)])
    # Run the model without tracking gradients and collect the hidden states
    with torch.no_grad():
        outputs = model(input_ids)
        hidden_states = outputs[2]  # tuple: embedding output plus one tensor per layer
    # Collect the token vectors of the last 4 layers, each of shape [seq_len, 768]
    token_vecs = []
    for layer in range(-4, 0):
        token_vecs.append(hidden_states[layer][0])
    # Average over the tokens of each layer, giving one 768-dim vector per layer
    features = []
    for layer_vecs in token_vecs:
        features.append(torch.mean(layer_vecs, dim=0))
    # Return the per-layer means stacked into a tensor of shape [4, 768]
    return torch.stack(features)

In [19]:
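# Walk through the function step by step on one example: row 999, the last row of the sample
i = 999
input_ids = torch.tensor([tokenizer.encode(sample_df.iloc[i]["text"], add_special_tokens=True)])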

In [20]:
input_ids

tensor([[  101,  9706,  1011,  2859,  1005,  1055,  2610,  3757,  2006,  4465,
          4375,  2041, 19054,  1997,  2039,  2000,  1001,  4029,  1025, 11212,
          2000,  2111,  2040,  2988, 26932,  4773,  4573,  1999,  1037,  3049,
          2000, 11359,  2041,  3784, 15488,  4904,  1010,  1996,  2231,  2056,
          1012,   102]])

In [21]:
with torch.no_grad():
    outputs = model(input_ids)
    hidden_states = outputs[2]

In [25]:
hidden_states[-1].shape

torch.Size([1, 42, 768])

In [29]:
hidden_states[-1][0].shape

torch.Size([42, 768])

In [31]:
torch.mean(hidden_states[-1][0], dim=0).shape


torch.Size([768])

The function "extract_features" extracts features from text using a pre-trained transformer model. Here is a detailed description of what it does:

1. The input to the function is a string of text (in this case, one news text from the dataset).

2. The "tokenizer.encode" method tokenizes the text. Tokenization breaks the text into smaller units (called tokens) and maps each one to an integer ID that the model understands. With "add_special_tokens=True", the [CLS] and [SEP] markers are added as well (IDs 101 and 102 in the output above). The tokenizer comes from the Hugging Face Transformers library.

3. The resulting list of token IDs is wrapped in a list and converted to a PyTorch tensor of shape [1, seq_len] using the "torch.tensor" method.

4. The PyTorch tensor is then passed through the pre-trained transformer model stored in the "model" variable, which was loaded using the Hugging Face Transformers library.

5. The "outputs" variable contains the outputs from the model, which include the hidden states for each token.

6. The "hidden_states" variable contains a tuple of tensors: one for the embedding output and one for each layer of the transformer (13 in total for bert-base-uncased), each of shape [1, seq_len, 768].

7. The "token_vecs" variable is a list of 4 tensors, one for each of the last 4 layers. Each tensor holds the hidden states of all tokens in that layer and has shape [seq_len, 768].

8. The "features" variable is a list of 4 tensors, where each tensor is the mean over all tokens of one of the last 4 layers and has shape [768].

9. Finally, the "features" are stacked into a single PyTorch tensor of shape [4, 768] using the "torch.stack" method, and this tensor is returned as the output of the function.

So, this function extracts useful features from text using a pre-trained transformer model. These features can then be used in a variety of downstream tasks such as sentiment analysis, text classification, named entity recognition, and more.
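
To make the shapes in these steps concrete, here is a minimal sketch that uses random NumPy arrays as stand-ins for BERT's hidden states. The 13 hidden-state tensors and the hidden size of 768 match bert-base-uncased; the helper name `pool_last_four_layers` and the random values are illustrative assumptions, not the notebook's actual outputs.

```python
import numpy as np

def pool_last_four_layers(hidden_states):
    """Average over tokens in each of the last 4 layers and stack the results."""
    pooled = []
    for layer in range(-4, 0):
        token_vecs = hidden_states[layer][0]    # drop the batch dimension: [seq_len, hidden]
        pooled.append(token_vecs.mean(axis=0))  # mean over tokens: [hidden]
    return np.stack(pooled)                     # [4, hidden]

# Random stand-ins for the model output (assumption: no real model is run here)
rng = np.random.default_rng(0)
seq_len, hidden_size, num_states = 42, 768, 13
fake_hidden_states = tuple(
    rng.normal(size=(1, seq_len, hidden_size)) for _ in range(num_states)
)

pooled = pool_last_four_layers(fake_hidden_states)
print(pooled.shape)  # (4, 768)
```

The last row of `pooled` corresponds to the `torch.mean(hidden_states[-1][0], dim=0)` call shown in the cells above.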

# Feature Extraction
---

In [16]:
# Extract features for each text in the sample
features = []
for i in range(len(sample_df)):
    features.append(extract_features(sample_df.iloc[i]["text"]))
# Concatenate the per-text [4, 768] tensors along the first axis and convert to a numpy array
features = torch.cat(features).numpy()

In [17]:
features

array([[ 0.01166427,  0.31639254,  0.08030723, ..., -0.3445894 ,
         0.14850494,  0.359372  ],
       [-0.23773313,  0.06683473,  0.3281017 , ..., -0.2812955 ,
         0.12976453,  0.22252765],
       [-0.09940622,  0.21775264,  0.45618796, ..., -0.30170876,
        -0.1138118 ,  0.29964206],
       ...,
       [-0.18117583, -0.17706801,  0.1537665 , ..., -0.34712768,
         0.06753789, -0.27153915],
       [-0.10814629, -0.07839631,  0.16113684, ..., -0.34167272,
         0.17220178, -0.25063083],
       [-0.00263924,  0.03040321,  0.18756595, ..., -0.10317597,
         0.33503714, -0.02757266]], dtype=float32)

In [18]:
features.shape

(4000, 768)

This code uses the previously defined "extract_features" function to extract features for each text in the pandas DataFrame "sample_df".

The code iterates through each row in the DataFrame and passes the news text to the "extract_features" function to obtain the features for that row. The resulting feature tensors, each of shape 4x768, are appended to a list "features".

After all texts have been processed, the code concatenates the feature tensors with PyTorch's "torch.cat" and converts the result into a numpy array with the tensor's "numpy" method.

The resulting numpy array "features" can then be used as input to train a machine learning model or perform other analyses.
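
The effect of the concatenation on the array shape can be sketched with NumPy alone. The random blocks below stand in for the 4x768 outputs of "extract_features"; the variable names and the choice of three texts are assumptions for illustration. `np.concatenate` along axis 0 mirrors what `torch.cat` does by default, while `np.stack` is shown as an alternative that keeps one entry per text.

```python
import numpy as np

rng = np.random.default_rng(0)
num_texts, num_layers, hidden_size = 3, 4, 768

# One [4, 768] block per text, standing in for the output of extract_features
blocks = [rng.normal(size=(num_layers, hidden_size)) for _ in range(num_texts)]

# Concatenating along axis 0 places the blocks on top of each other
concatenated = np.concatenate(blocks, axis=0)
print(concatenated.shape)  # (12, 768)

# Stacking instead adds a new leading axis, so each text keeps its own entry
stacked = np.stack(blocks, axis=0)
print(stacked.shape)  # (3, 4, 768)
```

With concatenation, rows 4 to 7 of the result belong to the second text, which is why the row count ends up four times the number of texts.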

In [None]:
import numpy as np

In [None]:
# Save the features as a .npy file
np.save("ag_news_features.npy", features)

In [33]:
labels = sample_df['label'].values
labels

array([3, 1, 2, 3, 1, 0, 3, 0, 0, 1, 0, 3, 3, 0, 2, 1, 0, 3, 3, 1, 1, 2,
       2, 1, 3, 1, 2, 2, 0, 1, 0, 1, 3, 2, 0, 1, 0, 3, 0, 3, 0, 3, 0, 0,
       3, 0, 2, 1, 1, 3, 3, 0, 3, 3, 0, 0, 0, 1, 2, 3, 0, 1, 3, 0, 2, 1,
       0, 3, 3, 3, 3, 0, 2, 2, 3, 2, 0, 2, 3, 3, 1, 1, 3, 1, 0, 3, 0, 0,
       0, 3, 0, 3, 3, 3, 0, 1, 2, 2, 2, 1, 0, 1, 3, 2, 3, 2, 1, 1, 0, 3,
       3, 1, 2, 3, 0, 2, 3, 0, 2, 0, 2, 1, 3, 0, 0, 3, 2, 2, 1, 1, 1, 2,
       1, 2, 1, 3, 2, 2, 0, 2, 1, 1, 1, 1, 2, 1, 2, 3, 0, 2, 1, 3, 0, 3,
       2, 1, 0, 0, 0, 0, 1, 2, 2, 2, 0, 0, 2, 0, 3, 2, 2, 0, 0, 3, 3, 0,
       0, 3, 0, 0, 1, 0, 3, 1, 0, 1, 2, 3, 3, 3, 0, 0, 3, 0, 0, 0, 0, 2,
       1, 2, 2, 2, 0, 0, 2, 1, 2, 2, 0, 0, 0, 0, 0, 3, 1, 3, 2, 1, 3, 2,
       2, 2, 0, 2, 3, 1, 2, 2, 0, 0, 1, 0, 3, 0, 3, 3, 0, 2, 3, 3, 1, 0,
       3, 0, 1, 3, 0, 2, 1, 1, 3, 3, 1, 2, 2, 2, 0, 2, 3, 0, 1, 1, 1, 0,
       3, 2, 1, 2, 1, 0, 3, 1, 3, 0, 2, 1, 3, 3, 3, 0, 3, 1, 3, 3, 1, 3,
       3, 0, 1, 2, 0, 2, 1, 2, 3, 1, 1, 0, 3, 3, 0,

In [34]:
print(features.shape)
print(labels.shape)

(4000, 768)
(1000,)


The output numpy array features contains the extracted features for each text in the "text" column of the DataFrame. The number of columns is the "hidden_size" of the transformer model used for feature extraction; since we've used the pre-trained bert-base-uncased model, the hidden_size is 768. The number of rows per text is fixed at 4 and does not depend on the number of tokens in the text.

The BERT model uses WordPiece tokenization, which splits words into subwords and maps each subword to an embedding, so each text yields one hidden-state vector per token in every layer. The "extract_features" function averages these token vectors within each of the last 4 layers, which leaves one 768-dimensional vector per layer. Each text therefore contributes 4 rows (one per layer), giving 4000 rows in total for 1000 input texts.
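
A quick way to check where the 4 rows per text come from is to pool simulated hidden states for texts of different lengths. The sketch below uses random arrays instead of a real model, and the helper `pooled_rows` is a hypothetical name introduced only for this illustration. Because the mean is taken over the token axis within each of the last 4 layers, the number of rows stays at 4 regardless of how many tokens a text has.

```python
import numpy as np

def pooled_rows(seq_len, hidden_size=768, num_states=13, seed=0):
    """Simulate hidden states for one text and average over tokens per layer."""
    rng = np.random.default_rng(seed)
    hidden_states = [
        rng.normal(size=(1, seq_len, hidden_size)) for _ in range(num_states)
    ]
    # One mean vector for each of the last 4 layers
    return np.stack([hidden_states[layer][0].mean(axis=0) for layer in range(-4, 0)])

short_text = pooled_rows(seq_len=8)
long_text = pooled_rows(seq_len=120)

print(short_text.shape, long_text.shape)  # (4, 768) (4, 768)
print(1000 * short_text.shape[0])         # 4000
```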

In [36]:
# features is a 2D numpy array of size 4000x768 (4 rows per text)
# labels is a 1D numpy array of size 1000
# reshape the feature array to size 1000x(768*4)
import numpy as np
features_reshaped = features.reshape((1000, -1))

# concatenate the feature array with the label array horizontally
# (the integer labels are stored as floats in the combined array)
dataset = np.hstack((features_reshaped, labels.reshape((-1, 1))))

# dataset is a 2D numpy array of size 1000x(768*4+1)

In [37]:
features_reshaped.shape

(1000, 3072)

The above code takes two numpy arrays as input:

1. "features" array which is a 2D numpy array of size 4000x768 representing the features extracted for 1000 rows of text.
2. "labels" array which is a 1D numpy array of size 1000 representing the target class labels for each row of text.

The "features" array is reshaped to a size of 1000x(768x4) using the reshape() function. The reshaped array is then concatenated horizontally with the "labels" array using the hstack() function to create a 2D numpy array named "dataset" of size 1000x(768x4+1). This concatenated array "dataset" will be used as the input dataset for machine learning modeling.
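
The reshape and concatenation can be traced on a toy array. In this sketch the sizes are shrunk (3 texts, 4 rows per text, 2 columns) and every row is filled with its own row index so the regrouping is easy to follow; these values are assumptions chosen for readability, not data from the notebook.

```python
import numpy as np

num_texts, rows_per_text, hidden_size = 3, 4, 2

# Toy features: row r is filled with the value r
row_ids = np.arange(num_texts * rows_per_text, dtype=np.float32)
features = np.repeat(row_ids[:, None], hidden_size, axis=1)  # shape (12, 2)
labels = np.array([3, 1, 2])

# Each group of 4 consecutive rows is laid out side by side in one row
features_reshaped = features.reshape((num_texts, -1))
print(features_reshaped[0])  # [0. 0. 1. 1. 2. 2. 3. 3.]

# Append the labels as a final column
dataset = np.hstack((features_reshaped, labels.reshape((-1, 1))))
print(dataset.shape)  # (3, 9)
print(dataset.dtype)  # float64
```

Because a NumPy array has a single dtype, the integer labels are stored as floats after `hstack`, which is why the classes appear as 0.0, 1.0, 2.0 and 3.0 in the classification report.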

In [38]:
dataset

array([[ 0.01166427,  0.31639254,  0.08030723, ..., -0.4404065 ,
         0.11313729,  3.        ],
       [-0.12989713, -0.29392609, -0.04320088, ...,  0.03608233,
        -0.05498757,  1.        ],
       [ 0.13276501, -0.03820173,  0.30393651, ...,  0.07398565,
         0.31864855,  2.        ],
       ...,
       [-0.26375505,  0.02432426,  0.28421736, ...,  0.15579741,
        -0.03092136,  0.        ],
       [-0.19679573, -0.29919502,  0.57228309, ...,  0.21926799,
         0.19579075,  3.        ],
       [-0.08812498, -0.01065889,  0.15644448, ...,  0.33503714,
        -0.02757266,  0.        ]])

In [39]:
dataset.shape

(1000, 3073)

The original "features" array is a 2D numpy array of size 4000x768. This means it has 4000 rows and 768 columns. Each group of 4 consecutive rows belongs to a single text, with one row for each of the last 4 hidden layers.

In the "extract_features" function, we took the last 4 hidden states, each of which holds a vector of size 768 for every token in the text. We then averaged over the tokens within each layer, which gives one vector of size 768 per layer, so 4 vectors per text.

Reshaping the features array to size 1000x3072 places the 4 vectors of each text side by side in a single row (768 x 4 = 3072), so there is one row for each of the 1000 texts. We then concatenate the reshaped features array horizontally with the label array of size 1000x1 to get a 2D numpy array of size 1000x(3072+1) = 1000x3073.
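
The column layout of the 3072-wide rows can be checked with a small sketch. It uses random data for 5 hypothetical texts rather than the notebook's features. The averaging across layers at the end is an alternative shown only for comparison; the notebook itself keeps all 3072 columns.

```python
import numpy as np

rng = np.random.default_rng(0)
num_texts, num_layers, hidden_size = 5, 4, 768

# Stand-in for the (4 * num_texts, 768) features array
features = rng.normal(size=(num_texts * num_layers, hidden_size)).astype(np.float32)
features_reshaped = features.reshape((num_texts, -1))  # (5, 3072)

# Columns 0..767 hold the fourth-to-last layer, columns 2304..3071 the last layer
first_text_last_layer = features_reshaped[0, 3 * hidden_size:]
print(np.array_equal(first_text_last_layer, features[3]))  # True

# Alternative (not used in the notebook): average the four layer vectors per text
per_layer = features_reshaped.reshape((num_texts, num_layers, hidden_size))
compact = per_layer.mean(axis=1)
print(compact.shape)  # (5, 768)
```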

# ML Modelling
---

In [41]:
# Split the data into training and testing sets
from sklearn.model_selection import train_test_split
train_data, test_data = train_test_split(dataset, test_size=0.2, random_state=42)

# Convert the training and testing sets back into separate feature and label arrays
X_train, y_train = train_data[:, :-1], train_data[:, -1]
X_test, y_test = test_data[:, :-1], test_data[:, -1]

In [42]:
from sklearn.linear_model import LogisticRegression

# Train a logistic regression classifier on the training set
clf = LogisticRegression(max_iter = 1000)
clf.fit(X_train, y_train)

In [43]:
# Evaluate the classifier on the testing set
score = clf.score(X_test, y_test)
print("Accuracy:", score)

Accuracy: 0.89


In [44]:
# Predict the labels of the testing set
y_pred = clf.predict(X_test)

In [45]:
from sklearn.metrics import confusion_matrix, classification_report

In [46]:
# Generate the confusion matrix and classification report
cm = confusion_matrix(y_test, y_pred)
cr = classification_report(y_test, y_pred)

In [47]:
print("Confusion Matrix:\n", cm)

Confusion Matrix:
 [[52  0  2  3]
 [ 2 46  0  0]
 [ 2  1 39  6]
 [ 1  0  5 41]]


In [48]:
print("\nClassification Report:\n", cr)


Classification Report:
               precision    recall  f1-score   support

         0.0       0.91      0.91      0.91        57
         1.0       0.98      0.96      0.97        48
         2.0       0.85      0.81      0.83        48
         3.0       0.82      0.87      0.85        47

    accuracy                           0.89       200
   macro avg       0.89      0.89      0.89       200
weighted avg       0.89      0.89      0.89       200
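
As a cross-check, the accuracy, recall and precision in the report can be recomputed directly from the confusion matrix printed above. This is a small NumPy sketch with the matrix values copied by hand; rows are the true classes and columns the predicted classes, following scikit-learn's convention.

```python
import numpy as np

# Confusion matrix copied from the output above (rows: true class, columns: predicted class)
cm = np.array([
    [52, 0, 2, 3],
    [2, 46, 0, 0],
    [2, 1, 39, 6],
    [1, 0, 5, 41],
])

accuracy = np.trace(cm) / cm.sum()        # correct predictions / all predictions
recall = np.diag(cm) / cm.sum(axis=1)     # per class: correct / true count
precision = np.diag(cm) / cm.sum(axis=0)  # per class: correct / predicted count

print(round(float(accuracy), 2))  # 0.89
print(np.round(recall, 2))
print(np.round(precision, 2))
```

The diagonal sums to 178 out of 200 test examples, which gives the reported accuracy of 0.89.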



---

This is my trial notebook for using a transformer model for feature extraction. Specifically, I used the BERT model to extract features from the text column of the AG News dataset, restricted to a sample of 1000 rows (800 for training and 200 for testing). The text column consists of news texts, while the label column holds four distinct labels.

In this notebook,

1. I loaded the dataset and converted a 1000-row sample to a pandas DataFrame.
2. I used the BERT transformer model to extract features from the sample.
3. I used the extracted features to train a Logistic Regression classification model.
4. The overall accuracy of the model on the test set was 89%.
5. The Confusion Matrix and Classification Report were used to understand how well the ML model performs on each class.

---
# Thank You!