# Fine-tuning a Sequence Classification Model Exam

In this exam, you will preprocess a dataset and fine-tune a model for sequence classification. Complete each step carefully according to the instructions provided.

### Model and Dataset Information

For this task, you will be working with the following:

- **Model Checkpoint**: Use the pre-trained model checkpoint `aubmindlab/bert-base-arabertv02` for both the model and tokenizer.
- **Dataset**: You will be using the `CUTD/sanad_df` dataset. Make sure to load and preprocess the dataset correctly for training and evaluation.

**Note:**
- Any additional steps or methods you include that improve or enhance the results will be rewarded with bonus points if they are justified.
- The steps outlined here are suggestions. You are free to implement alternative methods or approaches to achieve the task, as long as you explain the reasoning and the process at the bottom of the notebook.
- You can use either TensorFlow or PyTorch with Hugging Face Transformers for this task.
- The number of data samples you choose to work with is flexible. However, if you select so few samples that training is very short, this could affect the evaluation of your work.

## Step 1: Load the Dataset

In [1]:
!pip install datasets

Collecting datasets
  Downloading datasets-3.0.0-py3-none-any.whl.metadata (19 kB)
Collecting pyarrow>=15.0.0 (from datasets)
  Downloading pyarrow-17.0.0-cp310-cp310-manylinux_2_28_x86_64.whl.metadata (3.3 kB)
Collecting dill<0.3.9,>=0.3.0 (from datasets)
  Downloading dill-0.3.8-py3-none-any.whl.metadata (10 kB)
Collecting xxhash (from datasets)
  Downloading xxhash-3.5.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (12 kB)
Collecting multiprocess (from datasets)
  Downloading multiprocess-0.70.16-py310-none-any.whl.metadata (7.2 kB)
Downloading datasets-3.0.0-py3-none-any.whl (474 kB)
Downloading dill-0.3.8-py3-none-any.whl (116 kB)
Downloading pyarrow-17.0.0-cp310-cp310-manylinux_2_28_x86_64.whl (39.9 MB)

Load the dataset and split it into training and test sets. Use 20% of the data for testing.
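The 80/20 split requested here could be sketched as follows. This is a minimal illustration on toy data, assuming scikit-learn's `train_test_split` (which the notebook imports); the `texts` and `labels` lists are hypothetical stand-ins for `data["text"]` and `data["label"]`. The `datasets` library also offers `Dataset.train_test_split(test_size=0.2)`, which may be the more direct route on a `Dataset` object.

```python
from sklearn.model_selection import train_test_split

# Toy stand-ins for data["text"] and data["label"] (hypothetical values).
texts = [f"document {i}" for i in range(10)]
labels = [0, 1] * 5

# Hold out 20% for testing; stratify keeps class proportions similar in both
# parts, and random_state makes the shuffle reproducible.
train_texts, test_texts, train_labels, test_labels = train_test_split(
    texts, labels, test_size=0.2, random_state=42, stratify=labels,
)
print(len(train_texts), len(test_texts))
```

Stratifying is a design choice rather than a requirement: it mainly helps when some categories are rare.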

In [2]:
import datasets
import transformers as t
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
import numpy as np

In [12]:
data=datasets.load_dataset("CUTD/sanad_df",split="train[0:7000]")


In [14]:
data[0]

{'Unnamed: 0': 0,
 'text': 'الشارقة - محمد ولد محمد سالمعرضت مساء أمس الأول على خشبة مسرح قصر الثقافة في الشارقة المسرحية السعودية "بعيداً عن السيطرة" لفرقة مسرح الطائف، من تأليف فهد ردة الحارثي، وإخراج سامي صالح الزهراني، وذلك في رابعة ليالي الدورة الأولى من مهرجان الشارقة للمسرح الخليجي .تبدأ المسرحية بثلاثة أشخاص يجلسون في قاعة مكتبة، ينهمك كل منهم في القراءة بشغف، ثم يبدأون في الحوار لنكتشف أنهم كانوا يقرأون روايات لأستاذهم الكاتب المبدع الذي مات وترك روايات فريدة، رسم فيها شخصيات غاية في الدقة، ويتحدثون عن ضرورة تكريم أستاذهم، ويتفقون على طريقة خاصة للتكريم وهي إخراج شخصياته من رواياتها لتعيش في الواقع، وينتقون شخصيات مركزية، أولها الحلاق الذي كان طيباً، حافظاً لأسرار أهل الحي، وكان الجميع يحبه، وحين لا يكون الشخص لديه ما يدفعه مقابل الحلاقة فإنه لا يطالبه بشيء، ثم يستخرجون حفار القبور الذي كان يقبر الجميع، ويردد دائماً أن الدنيا فانية، وأن البقاء لله وحده، ثم يستخرجون الشاب المحب الذي ظل سنوات طويلة يحمل وردة وينتظر حبيبته التي رحلت عنه ولم تعد إليه حتى مات .يأخذ طلاب الأستاذ تلك

In [15]:
data=data.remove_columns(["Unnamed: 0"])

Initialize a tokenizer for the model.

In [16]:
tok=t.AutoTokenizer.from_pretrained("aubmindlab/bert-base-arabertv02")



In [6]:
data

Dataset({
    features: ['text', 'label'],
    num_rows: 700
})

In [17]:
l=LabelEncoder()
l.fit(data["label"])

In [18]:
data_tok = tok(data["text"], return_tensors="tf", padding="max_length",max_length=128,truncation=True)
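To make the effect of `padding="max_length"`, `max_length=128`, and `truncation=True` concrete, here is a simplified pure-Python sketch of what happens to each sequence of token ids. The helper `pad_or_truncate`, the pad id `0`, and the example ids are all hypothetical; the real tokenizer also handles special tokens and uses its own padding id.

```python
def pad_or_truncate(token_ids, max_length, pad_id=0):
    """Return (input_ids, attention_mask), both exactly max_length long."""
    kept = token_ids[:max_length]            # truncation: drop the overflow
    n_pad = max_length - len(kept)           # padding needed to reach max_length
    input_ids = kept + [pad_id] * n_pad
    attention_mask = [1] * len(kept) + [0] * n_pad  # 0 marks padding positions
    return input_ids, attention_mask

# A short sequence is padded; a long one is cut (made-up ids, max_length=6).
short_ids, short_mask = pad_or_truncate([101, 7, 8, 102], max_length=6)
long_ids, long_mask = pad_or_truncate(list(range(1, 11)), max_length=6)
print(short_ids, short_mask)
print(long_ids, long_mask)
```

With `max_length=128`, any article longer than 128 tokens loses everything after the cutoff, which is worth keeping in mind for long news texts.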

In [19]:
label=l.transform(data["label"])
label=np.array(label)

In [10]:
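# num_labels is not set, so the head defaults to 2 outputs; pass num_labels=len(l.classes_) if the dataset has more classes.
model=t.TFAutoModelForSequenceClassification.from_pretrained("aubmindlab/bert-base-arabertv02")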

model.safetensors:   0%|          | 0.00/543M [00:00<?, ?B/s]

Some weights of the PyTorch model were not used when initializing the TF 2.0 model TFBertForSequenceClassification: ['bert.embeddings.position_ids']
- This IS expected if you are initializing TFBertForSequenceClassification from a PyTorch model trained on another task or with another architecture (e.g. initializing a TFBertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing TFBertForSequenceClassification from a PyTorch model that you expect to be exactly identical (e.g. initializing a TFBertForSequenceClassification model from a BertForSequenceClassification model).
Some weights or buffers of the TF 2.0 model TFBertForSequenceClassification were not initialized from the PyTorch model and are newly initialized: ['classifier.weight', 'classifier.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [20]:
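# No loss is passed, so the Hugging Face model uses its built-in loss; fit() defaults to 1 epoch.
# "Adam" uses Keras's default learning rate (1e-3); BERT fine-tuning usually uses about 2e-5 to 5e-5.
model.compile(optimizer="Adam")
model.fit(data_tok,label)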



<tf_keras.src.callbacks.History at 0x7b267fb23490>

Convert the categorical labels into numerical format using a label encoder if needed.
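As a small illustration of this conversion, the sketch below runs scikit-learn's `LabelEncoder` on toy labels. The category names are hypothetical and are not taken from the dataset.

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical string labels standing in for data["label"].
raw_labels = ["Sports", "Culture", "Tech", "Sports"]

encoder = LabelEncoder()
encoded = encoder.fit_transform(raw_labels)   # classes are sorted alphabetically

print(list(encoder.classes_))                 # index -> category name
print(encoded.tolist())                       # integer ids for training
print(encoder.inverse_transform([2]).tolist())  # map an id back to its name

# The number of distinct classes is what a classification head needs to match.
num_labels = len(encoder.classes_)
```

A value like `num_labels` can be passed to `from_pretrained(..., num_labels=num_labels)` so that the classifier has one output per category.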

## Step 11: Inference

Once the model is trained, perform inference on a sample text to evaluate the model's prediction capabilities. Use the tokenizer to process the text, and then feed it into the model to get the predicted label.

In [21]:
data_test=datasets.load_dataset("CUTD/sanad_df",split="train[7000:7001]")

In [22]:
true_label=l.transform(data_test["label"])
true_label

array([1])

In [23]:
data_test_tok = tok(data_test["text"], return_tensors="tf", padding="max_length",max_length=128,truncation=True)

In [25]:
pred=model.predict(data_test_tok)



In [30]:
pred

TFSequenceClassifierOutput(loss=None, logits=array([[ 1.8468602, -1.7547098]], dtype=float32), hidden_states=None, attentions=None)
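
The output above contains raw logits rather than a label. One way to turn them into a prediction is sketched below with NumPy: a softmax gives class probabilities and an argmax gives the predicted class index. The logits array is copied from the output above; everything else is illustrative.

```python
import numpy as np

# Logits copied from the prediction output above (one sample, two classes).
logits = np.array([[1.8468602, -1.7547098]], dtype=np.float32)

# Numerically stable softmax over the class axis.
shifted = logits - logits.max(axis=-1, keepdims=True)
probs = np.exp(shifted) / np.exp(shifted).sum(axis=-1, keepdims=True)

# The predicted class is the index with the highest probability.
pred_ids = probs.argmax(axis=-1)
print(pred_ids, probs)
```

In the notebook, `l.inverse_transform(pred_ids)` would map the index back to a category name. For this sample the sketch yields index 0, whereas the recorded true label is 1.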