# HausaSentiLex Fine-Tuning

Fine-tuning a pretrained model is a form of transfer learning: a model is trained on a small dataset while reusing the knowledge stored in a model that was trained on a large dataset for a different task. In this article, we show how to use a small Hausa-language lexicon dataset to build a sentiment prediction model on top of a pretrained transformer model.

Let's get started!

# Step 1: Install And Import Python Libraries

In step 1, we will install and import the Python libraries.

Firstly, let's install `transformers` and `datasets`.

In [None]:
# Install libraries
!pip install transformers datasets

After installing the Python packages, we will import the libraries.
* `pandas` and `numpy` are imported for data processing.
* `train_test_split` is imported from `sklearn` to split the dataset.
* `tensorflow` and `transformers` are imported for modeling.
* `Dataset` is imported for the Hugging Face dataset format.
* The `accuracy_score` is imported for model performance evaluation.

In [2]:
# Data processing
import pandas as pd
import numpy as np

# Train test split
from sklearn.model_selection import train_test_split

# Modeling
import tensorflow as tf
from tensorflow.keras.optimizers import Adam
from transformers import AutoTokenizer, TFAutoModelForSequenceClassification

# Hugging Face Dataset
from datasets import Dataset

# Import accuracy_score to check performance
from sklearn.metrics import accuracy_score

# Step 2: Download And Read Data

Those who are using Google Colab for this analysis need to mount Google Drive to read the dataset. You can ignore the code below if you are not using Google Colab.
* `drive.mount` mounts Google Drive so that the Colab notebook can access the data stored there.
* `os.chdir` changes the working directory. I set it to the Google Drive folder where the lexicon dataset is saved.
* `!pwd` is used to print the current working directory.

Please check out [Google Colab Tutorial for Beginners](https://medium.com/towards-artificial-intelligence/google-colab-tutorial-for-beginners-834595494d44) for details about using Google Colab for data science projects.

In [3]:
# Mount Google Drive
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [4]:
# Change directory
import os
os.chdir("/content/drive/MyDrive/HausaSentiLex/")

# Print out the current directory
!pwd

/content/drive/MyDrive/HausaSentiLex


Now let's read the data into a `pandas` dataframe and see what the dataset looks like.

The dataset has two columns. One column contains the lexicon text and the other column contains its sentiment label.

In [5]:
# Read in data
hausa_recourse_data = pd.read_csv('HausaSentiLexTrainDataset.txt', sep=',', names=['text', 'label'])

# Take a look at the data
hausa_recourse_data.head()

            text  label
0     A ajizance      1
1       A amince      1
2  A arfafe haka      0
3  A arfafe haka      0
4         A arha      1


`.info` helps us to get information about the dataset.

From the output, we can see that this dataset has 13,392 records and no missing data. The `text` column is the `object` type and the `label` column is the `int64` type.

In [6]:
# Get the dataset information
hausa_recourse_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 13392 entries, 0 to 13391
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   text    13392 non-null  object
 1   label   13392 non-null  int64 
dtypes: int64(1), object(1)
memory usage: 209.4+ KB


The label value of 0 represents negative lexicons and the label value of 1 represents positive lexicons. The dataset has 6,178 positive lexicons and 7,214 negative lexicons, which is about 46% positive and 54% negative. The classes are close to balanced, so we can use accuracy as the metric to evaluate the model performance.

In [7]:
# Check the label distribution
hausa_recourse_data['label'].value_counts()

0    7214
1    6178
Name: label, dtype: int64
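
The class counts above can be turned into proportions and a majority-class baseline, which gives a reference point for the accuracy reported later. This is a minimal sketch that uses only the counts printed above; the variable names are illustrative.

```python
# Class counts copied from the value_counts() output above
counts = {0: 7214, 1: 6178}

total = sum(counts.values())
proportions = {label: n / total for label, n in counts.items()}

# Accuracy of a trivial model that always predicts the most frequent class
majority_label = max(counts, key=counts.get)
baseline_accuracy = counts[majority_label] / total

print(f"Total lexicons: {total}")
for label, share in proportions.items():
    print(f"Label {label}: {share:.1%}")
print(f"Majority-class baseline accuracy: {baseline_accuracy:.3f}")
```

A fine-tuned model is only useful if it clearly beats this baseline.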

# Step 3: Train Test Split

In step 3, we will split the dataset and have 80% as the training dataset and 20% as the testing dataset.

In [8]:
# Train test split
X_train, X_test, y_train, y_test = train_test_split(hausa_recourse_data['text'],
                                                    hausa_recourse_data['label'],
                                                    test_size = 0.20,
                                                    random_state = 42)

# Check the number of records in training and testing dataset.
print(f'The training dataset has {len(X_train)} records.')
print(f'The testing dataset has {len(X_test)} records.')

The training dataset has 10713 records.
The testing dataset has 2679 records.


After the train test split, there are 10,713 lexicons in the training dataset and 2,679 lexicons in the testing dataset.
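
The split sizes can be checked by hand. The sketch below assumes that, for a fractional `test_size`, scikit-learn rounds the test count up and gives the remainder to the training set; that rounding rule is an assumption about the library, not something shown in this notebook.

```python
import math

n_samples = 13392   # number of records reported by .info()
test_size = 0.20    # same value as in the split above

# Assumption: the test count is rounded up, the rest goes to training
n_test = math.ceil(test_size * n_samples)
n_train = n_samples - n_test

print(f"train: {n_train}, test: {n_test}")
```

Under that assumption the numbers match the 10,713 and 2,679 records printed above.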

# Step 4: Tokenize Text

In step 4, we will tokenize the review text using a tokenizer.

A tokenizer converts text into numbers to use as the input of the NLP (Natural Language Processing) models. Each number represents a token, which can be a word, part of a word, punctuation, or special tokens.
* `AutoTokenizer.from_pretrained("bert-base-cased")` downloads vocabulary from the pretrained `bert-base-cased` model.
* `return_tensors="np"` indicates that the return format is numpy array. Besides `np`, `return_tensors` can take the value of `tf` or `pt`, where `tf` returns Tensorflow `tf.constant` object and `pt` returns PyTorch `torch.tensor` object. If not set, it returns a list of python integers.
* `padding` means adding zeros to shorter reviews in the dataset. The `padding` argument controls how `padding` is implemented.  
 * `padding=True` is the same as `padding='longest'`. It checks the longest sequence in the batch and pads zeros to that length. There is no padding if only one text document is provided.
 * `padding='max_length'` pads to `max_length` if it is specified, otherwise, it pads to the maximum acceptable input length for the model.
 * `padding=False` is the same as `padding='do_not_pad'`. It is the default, indicating that no padding is applied, so it can output a batch with sequences of different lengths.
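
To make the `padding='longest'` behaviour concrete, here is a small pure-Python sketch of the idea. It is not the tokenizer's actual implementation: `pad_to_longest` is a hypothetical helper and the token IDs are made up for illustration. Alongside the padded IDs it builds an attention mask, which the real tokenizer also returns so that the model can ignore padded positions.

```python
def pad_to_longest(batch, pad_id=0):
    """Pad every sequence in `batch` to the length of the longest one."""
    longest = max(len(seq) for seq in batch)
    input_ids, attention_mask = [], []
    for seq in batch:
        n_pad = longest - len(seq)
        input_ids.append(seq + [pad_id] * n_pad)
        # 1 marks a real token, 0 marks a padding position
        attention_mask.append([1] * len(seq) + [0] * n_pad)
    return input_ids, attention_mask

# Hypothetical token IDs, chosen only to illustrate the shapes
batch = [[101, 7, 8, 9, 102], [101, 7, 102]]
ids, mask = pad_to_longest(batch)
print(ids)
print(mask)
```
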

The labels for the reviews are converted to one-dimensional numpy arrays.

In [9]:
# Tokenizer from a pretrained model
tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")

# Tokenize the reviews
tokenized_data_train = tokenizer(X_train.to_list(), return_tensors="np", padding=True)
tokenized_data_test = tokenizer(X_test.to_list(), return_tensors="np", padding=True)

# Labels are one-dimensional numpy or tensorflow array of integers
labels_train = np.array(y_train)
labels_test = np.array(y_test)

# Tokenized ids
print(tokenized_data_train["input_ids"][0])

[  101 14973 12477  3820  1161  5358 11078  3624   102     0     0     0
     0     0     0     0     0     0     0     0     0     0]


After printing out the tokenized IDs for the first review, we can see that the tokenized words are converted into integers, and the sentence is padded with zeros at the end of the review.

There are two special tokens in the token IDs, `101` at the beginning of the sentence and `102` at the end of the sentence. The BERT tokenizer uses `101` to encode the special token [CLS] and `102` to encode the special token [SEP], but the other models may use other special tokens.
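
The printed IDs can be split into special tokens, content tokens, and padding with a few lines of Python. This sketch uses the ID values copied from the output above; the constant names are illustrative.

```python
# Token IDs copied from the printed output above (9 IDs followed by 13 zeros)
input_ids = [101, 14973, 12477, 3820, 1161, 5358, 11078, 3624, 102] + [0] * 13

# Special-token IDs for the BERT tokenizer used in this notebook
CLS_ID, SEP_ID, PAD_ID = 101, 102, 0

content_ids = [t for t in input_ids if t not in (CLS_ID, SEP_ID, PAD_ID)]
n_padding = input_ids.count(PAD_ID)

print(f"sequence length: {len(input_ids)}")
print(f"content tokens:  {len(content_ids)}")
print(f"padding tokens:  {n_padding}")
```

Seven content tokens for a single short lexicon entry suggests that the tokenizer is breaking Hausa words into several sub-word pieces, which is plausible for a vocabulary built mostly from English text.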

# Step 5: Compile Transfer Learning Model for Sentiment Analysis

In step 5, we will build a customized transfer learning model for sentiment analysis in Hausa Language.

* `TFAutoModelForSequenceClassification` loads the BERT model with a sequence classification head on top.
* The method `from_pretrained()` loads the weights from the pretrained model into the new model, so the weights in the new model are not randomly initialized. Note that the new weights for the new sequence classification head are going to be randomly initialized.
* `bert-base-cased` is the name of the pretrained model. We can change it to a different model based on the nature of the project.
* `num_labels` indicates the number of classes. Our dataset has two classes, positive and negative, so `num_labels=2`.

In [10]:
# Load model
model = TFAutoModelForSequenceClassification.from_pretrained("bert-base-cased", num_labels=2)

All PyTorch model weights were used when initializing TFBertForSequenceClassification.

Some weights or buffers of the TF 2.0 model TFBertForSequenceClassification were not initialized from the PyTorch model and are newly initialized: ['classifier.weight', 'classifier.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


After loading the pretrained model, we will compile the model.
* `SparseCategoricalCrossentropy` is used as the loss function, but the Hugging Face documentation mentioned that Hugging Face models automatically choose a loss that is appropriate for their task and model architecture if the loss is not explicitly specified.
* `from_logits=True` informs the loss function that the output values are logits before applying softmax, so the values do not represent probabilities.
* We are using Adam as the optimizer and the number `5e-6` is the learning rate. A smaller learning rate corresponds to a more stable weights value update and a slower training process.
* `accuracy` is used as the metric because the dataset is close to balanced.
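
To show what a sparse categorical cross-entropy on logits computes, here is a NumPy sketch of the formula. It is an illustration, not the Keras implementation; the function name and the example logits are hypothetical.

```python
import numpy as np

def sparse_categorical_crossentropy_from_logits(logits, labels):
    """Mean cross-entropy for integer labels, computed from raw logits."""
    logits = np.asarray(logits, dtype=np.float64)
    labels = np.asarray(labels)
    # Log-softmax, subtracting the row maximum for numerical stability
    shifted = logits - logits.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    # Take the log-probability of the true class in every row
    return -log_probs[np.arange(len(labels)), labels].mean()

# Hypothetical logits for two examples
logits = [[2.0, -1.0], [0.5, 0.5]]
confident_right = sparse_categorical_crossentropy_from_logits(logits[:1], [0])
confident_wrong = sparse_categorical_crossentropy_from_logits(logits[:1], [1])
undecided = sparse_categorical_crossentropy_from_logits(logits[1:], [0])
print(confident_right, confident_wrong, undecided)
```

The loss is small when the largest logit belongs to the true class, large when it belongs to the wrong class, and equal to `log(2)` when the two logits are tied.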

In [11]:
# Loss
loss = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)

# Compile model
model.compile(optimizer=Adam(5e-6), loss=loss, metrics=['accuracy'])

# Step 6: Train hausaBERT for Sentiment Analysis

In step 6, we will fit the model using Tensorflow Keras.

When fitting the model, we convert the tokenized dataset into a dictionary for Keras. `batch_size=4` means that four reviews are processed for each weights and bias update. `epochs=5` means that the model fitting process will go through the training dataset 5 times.

The per-epoch metrics are not shown in the output below, but the test-set accuracy computed in step 7 is about 90 percent after these 5 epochs.
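
The batch size and epoch count determine how many weight updates the run performs. This is a rough calculation using the training-set size from step 3 and the values passed to `model.fit` in this step; it assumes that the final, smaller batch is kept rather than dropped.

```python
import math

n_train = 10713   # training records from step 3
batch_size = 4    # value passed to model.fit
epochs = 5        # value passed to model.fit

# Assumption: the last partial batch is kept, so the step count is rounded up
steps_per_epoch = math.ceil(n_train / batch_size)
total_updates = steps_per_epoch * epochs
last_batch_size = n_train - (steps_per_epoch - 1) * batch_size

print(f"steps per epoch: {steps_per_epoch}")
print(f"total updates:   {total_updates}")
print(f"last batch size: {last_batch_size}")
```

A larger batch size would reduce the number of updates per epoch at the cost of more memory per step.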

In [12]:
# Fit the model
model.fit(dict(tokenized_data_train),
          labels_train,
          validation_data=(dict(tokenized_data_test), labels_test),
          batch_size=4,
          epochs=5)

Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


<keras.src.callbacks.History at 0x7c34043c87f0>

All the weights will be updated by default for the transfer learning model.

* If we would like to keep the pretrained model weights as is and only update the weights and bias of the output layer, we can use `model.layers[0].trainable = False` to freeze the weights of the BERT model.
* If we would like to keep the weights of some layers and update others, we can use `model.bert.encoder.layer[i].trainable = False` to freeze the weights of the corresponding layers.
* In general, if the dataset for the transfer learning model is large, it is suggested to update all weights, and if the dataset for the transfer learning model is small, it is suggested to freeze the pretrained model weights. But we can always compare the model performance by adding the tunable pretrained model layers one by one.
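
To get a feel for what freezing changes, the parameter counts can be worked out by hand. The sketch below assumes the published `bert-base-cased` configuration (vocabulary size 28,996, hidden size 768, 12 layers, feed-forward size 3,072, 512 positions, 2 token types). These values do not appear in this notebook, and the helper names are illustrative.

```python
# Assumed bert-base-cased configuration values (not taken from this notebook)
vocab_size, hidden, n_layers = 28996, 768, 12
intermediate, max_positions, type_vocab, num_labels = 3072, 512, 2, 2

def dense(n_in, n_out):
    """Parameters of a fully connected layer: weights plus biases."""
    return n_in * n_out + n_out

layer_norm = 2 * hidden  # one scale and one shift value per hidden unit

embeddings = (vocab_size + max_positions + type_vocab) * hidden + layer_norm
encoder_layer = (
    4 * dense(hidden, hidden)      # query, key, value, attention output
    + layer_norm
    + dense(hidden, intermediate)  # feed-forward expansion
    + dense(intermediate, hidden)  # feed-forward projection
    + layer_norm
)
pooler = dense(hidden, hidden)

bert_params = embeddings + n_layers * encoder_layer + pooler
head_params = dense(hidden, num_labels)

print(f"BERT body:           {bert_params:,}")
print(f"classification head: {head_params:,}")
```

Under these assumptions, freezing the whole BERT body leaves only the 1,538 head parameters trainable, out of roughly 108 million.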

# Step 7: Sentiment Analysis Transfer Learning Model Prediction and Evaluation

In step 7, we will talk about model prediction and evaluation for sentiment analysis transfer learning.

Passing the tokenized text to the `.predict` method, we get the predictions for the customized transfer learning sentiment model. `logits` is the last layer of the neural network before softmax is applied.

We can see that the prediction has two columns. The first column is the predicted logit for label 0 and the second column is the predicted logit for label 1. Logit values are unnormalized scores, so they do not sum up to 1.

In [13]:
# Predictions
y_test_predict = model.predict(dict(tokenized_data_test))['logits']

# First 5 predictions
y_test_predict[:5]



array([[-3.523766 ,  3.4165444],
       [-3.778314 ,  3.6424186],
       [ 4.018092 , -3.851594 ],
       [-3.8848238,  3.7118843],
       [ 3.263181 , -3.5542526]], dtype=float32)

To get the predicted probabilities, we need to apply softmax on the predicted logit values.

After applying softmax, we can see that the predicted probability for each review sums up to 1.

In [14]:
# Predicted probabilities
y_test_probabilities = tf.nn.softmax(y_test_predict)

# First 5 predicted probabilities
y_test_probabilities[:5]

<tf.Tensor: shape=(5, 2), dtype=float32, numpy=
array([[9.6703286e-04, 9.9903297e-01],
       [5.9835223e-04, 9.9940169e-01],
       [9.9961793e-01, 3.8200829e-04],
       [5.0184951e-04, 9.9949813e-01],
       [9.9890661e-01, 1.0933299e-03]], dtype=float32)>
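
For readers who want to see what softmax does, here is a NumPy sketch applied to the first two logit rows printed earlier. The `softmax` helper is an illustrative re-implementation, not the TensorFlow function used above.

```python
import numpy as np

def softmax(logits):
    """Row-wise softmax with the maximum subtracted for numerical stability."""
    logits = np.asarray(logits, dtype=np.float64)
    shifted = logits - logits.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)

# First two rows of logits copied from the prediction output in this step
logits = np.array([[-3.523766, 3.4165444],
                   [-3.778314, 3.6424186]])

probabilities = softmax(logits)
print(probabilities)
print(probabilities.sum(axis=1))
```

Each row of the result sums to 1, and the values agree with the TensorFlow output up to floating-point precision.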

To get the predicted labels, `argmax` is used to return the index of the maximum probability for each review, which corresponds to the labels of zeros and ones.

In [15]:
# Predicted label
y_test_class_preds = np.argmax(y_test_probabilities, axis=1)

# First 5 predicted labels
y_test_class_preds[:5]

array([1, 1, 0, 1, 0])
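
Because softmax preserves the ordering of the values in each row, taking `argmax` of the logits gives the same labels as taking `argmax` of the probabilities. The sketch below checks this on the five logit rows printed earlier in this step; the variable names are illustrative.

```python
import numpy as np

# Logits copied from the prediction output in this step
logits = np.array([[-3.523766, 3.4165444],
                   [-3.778314, 3.6424186],
                   [4.018092, -3.851594],
                   [-3.8848238, 3.7118843],
                   [3.263181, -3.5542526]])

# Softmax, written out with NumPy
exp = np.exp(logits - logits.max(axis=1, keepdims=True))
probabilities = exp / exp.sum(axis=1, keepdims=True)

labels_from_logits = logits.argmax(axis=1)
labels_from_probs = probabilities.argmax(axis=1)

print(labels_from_logits)
print(np.array_equal(labels_from_logits, labels_from_probs))
```

If only the predicted labels are needed, the softmax step can therefore be skipped.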

`accuracy_score` is used to evaluate the model performance. We can see that the customized sentiment analysis model with transfer learning gives us above 90% accuracy, meaning that the predictions are correct above 90% of the time.

In [16]:
# Accuracy
accuracy_score(y_test, y_test_class_preds)

0.9044419559537141
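
The accuracy value can be translated back into a count of correct and incorrect predictions. This is a small sketch using the test-set size from step 3 and the score printed above; the variable names are illustrative.

```python
n_test = 2679                   # testing records from step 3
accuracy = 0.9044419559537141   # score printed above

# Accuracy is the fraction of correct predictions, so invert it
n_correct = round(accuracy * n_test)
n_wrong = n_test - n_correct

print(f"correct: {n_correct}, wrong: {n_wrong}")
print(f"recomputed accuracy: {n_correct / n_test:.4f}")
```
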

# Step 8: Save and Load Model

In step 8, we will talk about how to save the model and reload it for prediction.

`tokenizer.save_pretrained` saves the tokenizer information to the drive and `model.save_pretrained` saves the model to the drive.

In [17]:
# Save tokenizer
tokenizer.save_pretrained('./hausa_sentiment_transfer_learning_tensorflow/')

# Save model
model.save_pretrained('./hausa_sentiment_transfer_learning_tensorflow/')

We can load the saved tokenizer later using `AutoTokenizer.from_pretrained()` and load the saved model using `TFAutoModelForSequenceClassification.from_pretrained()`.

In [18]:
# Load tokenizer
tokenizer = AutoTokenizer.from_pretrained("./hausa_sentiment_transfer_learning_tensorflow/")

# Load model
loaded_model = TFAutoModelForSequenceClassification.from_pretrained('./hausa_sentiment_transfer_learning_tensorflow/')

Some layers from the model checkpoint at ./hausa_sentiment_transfer_learning_tensorflow/ were not used when initializing TFBertForSequenceClassification: ['dropout_37']
- This IS expected if you are initializing TFBertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing TFBertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
All the layers of TFBertForSequenceClassification were initialized from the model checkpoint at ./hausa_sentiment_transfer_learning_tensorflow/.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFBertForSequenceClassification for predictions without further training.


To verify that the customized transfer learning model is loaded correctly, the loaded model is used to make predictions on the testing dataset. We can see that the first five predictions are identical to those of the fine-tuned model, which indicates that the model is loaded correctly.

In [19]:
# Predict logit using the loaded model
y_test_predict = loaded_model.predict(dict(tokenized_data_test))['logits']

# Take a look at the first 5 predictions
y_test_predict[:5]



array([[-3.523766 ,  3.4165444],
       [-3.778314 ,  3.6424186],
       [ 4.018092 , -3.851594 ],
       [-3.8848238,  3.7118843],
       [ 3.263181 , -3.5542526]], dtype=float32)
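
Comparing printed rows by eye works for five predictions, but the check can also be done programmatically. The sketch below uses the first two rows printed before saving and after reloading. In the notebook itself, the two prediction arrays would need to be kept under different names, since `y_test_predict` is overwritten here; `before_saving` and `after_reloading` are hypothetical names.

```python
import numpy as np

# First two rows copied from the outputs before saving and after reloading
before_saving = np.array([[-3.523766, 3.4165444],
                          [-3.778314, 3.6424186]], dtype=np.float32)
after_reloading = np.array([[-3.523766, 3.4165444],
                            [-3.778314, 3.6424186]], dtype=np.float32)

# Exact match, and a tolerance-based match that is robust to tiny float noise
exact_match = np.array_equal(before_saving, after_reloading)
close_match = np.allclose(before_saving, after_reloading, atol=1e-6)

print(exact_match, close_match)
```

The tolerance-based comparison is the safer choice when the two predictions are produced on different hardware.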

# Step 9: Sentiment Model Using Transfer Learning on Large Dataset

In step 9, we will talk about how to handle large datasets with Hugging Face transfer learning.

The training process can be very slow for large datasets because of the size of the tokenized array and the padding tokens. But we can load the data as `tf.data.Dataset` to make the process faster.
* Firstly, the pandas dataframe needs to be converted to the Hugging Face arrow dataset using `Dataset.from_pandas()`.
* Then a tokenizer needs to be initiated.
* After that, the tokenizer is applied to the Hugging Face arrow dataset.
* The pretrained model is loaded using `TFAutoModelForSequenceClassification.from_pretrained()`.
* Finally, the dataset is loaded using `prepare_tf_dataset()`.

In [20]:
# Convert pandas dataframe to Hugging Face arrow dataset
lrl_dataset = Dataset.from_pandas(hausa_recourse_data)

# Tokenizer
tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")

# Function to tokenize data
def tokenize_dataset(data):
    return tokenizer(data["text"])

# Tokenize the dataset
dataset = lrl_dataset.map(tokenize_dataset)

# Load model
model = TFAutoModelForSequenceClassification.from_pretrained("bert-base-cased")

# TF dataset
tf_dataset = model.prepare_tf_dataset(dataset=dataset,
                                      batch_size=16,
                                      shuffle=True,
                                      tokenizer=tokenizer)

All PyTorch model weights were used when initializing TFBertForSequenceClassification.

Some weights or buffers of the TF 2.0 model TFBertForSequenceClassification were not initialized from the PyTorch model and are newly initialized: ['classifier.weight', 'classifier.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
You're using a BertTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


`prepare_tf_dataset()` is a method to wrap Hugging Face `Dataset` as a `tf.data.Dataset` with collation and batching.
* `dataset` takes in a Hugging Face `Dataset` that is to be wrapped as a `tf.data.Dataset`.
* `batch_size=16` means that in each batch 16 records will be processed. The default `batch_size` is 8.
* `shuffle=True` indicates that the samples from the dataset will be returned in random order. The default value is `True`. The Hugging Face documentation mentioned that it is usually set to `True` for training datasets and `False` for validation or testing datasets.
* `tokenizer` is a `PreTrainedTokenizer` for padding samples.

After the dataset is converted to a `tf.data.Dataset`, the model is compiled and fit on the dataset.

Because the Hugging Face datasets are stored on disk by default, they will not increase memory usage. The batches can be streamed from the dataset and the paddings can be added to each batch, which saves time and memory compared to padding the entire dataset.
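
The saving from per-batch padding can be illustrated with a short calculation. The token counts below are hypothetical and chosen so that one example is much longer than the rest; the batch size of 4 is also just for illustration.

```python
# Hypothetical token counts per example, for illustration only
lengths = [4, 5, 6, 5, 22, 4, 6, 5]
batch_size = 4

# Padding the whole dataset: every example is padded to the global maximum
global_total = len(lengths) * max(lengths)

# Padding per batch: each batch is padded only to its own longest example
batches = [lengths[i:i + batch_size] for i in range(0, len(lengths), batch_size)]
per_batch_total = sum(len(batch) * max(batch) for batch in batches)

print(f"tokens with global padding:    {global_total}")
print(f"tokens with per-batch padding: {per_batch_total}")
```

Only the batch that contains the long example pays for its length, so the model processes fewer padding tokens overall.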

In [22]:
# Loss
loss = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)

# Compile model
model.compile(optimizer=Adam(5e-6), loss=loss, metrics=['accuracy'])

# Fit the model
model.fit(tf_dataset,
          epochs=5)

Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


<keras.src.callbacks.History at 0x7c33bb5c24d0>

# Step 10: Share Pretrained Models

In the steps below, we’ll take a look at the easiest ways to share pretrained models to the 🤗 Hub.

In [23]:
from huggingface_hub import notebook_login
notebook_login()

In [24]:
#hub_model_id = "HausaSentiLex"
model.push_to_hub("HausaSentiLex")

In [25]:
tokenizer.push_to_hub("HausaSentiLex")

CommitInfo(commit_url='https://huggingface.co/mangaphd/HausaSentiLex/commit/c2ee96688bab48651e411c04a6ec12c1d008dd01', commit_message='Upload tokenizer', commit_description='', oid='c2ee96688bab48651e411c04a6ec12c1d008dd01', pr_url=None, pr_revision=None, pr_num=None)