Installing and importing the necessary libraries.

In [None]:
!pip install transformers[torch]

In [None]:
!pip install accelerate -U

In [None]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report
from transformers import BertForSequenceClassification, Trainer, TrainingArguments, AutoTokenizer, AutoModelForSequenceClassification, AutoModelForCausalLM
from transformers import pipeline
import torch
import numpy as np



1.   Import the necessary libraries:
    *   `import pandas as pd`
2.   Set the URL of the CSV file:
    * The URL must point to the raw file content (raw.githubusercontent.com), not to the GitHub page that renders it.
    * https://raw.githubusercontent.com/hamidds/PSND/main/PSN-fa.csv
3.  Read the CSV file:
    * Use the `pd.read_csv()` function to read the CSV file from the raw URL:
    * `data = pd.read_csv(url)`
4.  Display the dataframe:
    * Use `data.head()` to display the first few rows of the dataframe and confirm it has been loaded correctly.
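
The loading steps above can be tried offline with the sketch below. It is a minimal illustration that assumes a small in-memory CSV (`csv_text`, invented here) in place of the real URL; the two rows are made-up placeholders with the same column layout, not rows from the dataset.

```python
import io

import pandas as pd

# Hypothetical stand-in for the remote file: same columns, made-up rows.
csv_text = (
    ",Norm,Environment,Context,Label\n"
    "0,example norm A,example place A,example context A,Normal\n"
    "1,example norm B,example place B,example context B,Taboo\n"
)

# pd.read_csv accepts a URL, a local path, or any file-like object.
data = pd.read_csv(io.StringIO(csv_text))

# The header of the first column is empty, so pandas names it "Unnamed: 0".
print(data.columns.tolist())
print(data.head())
```

If the saved index column is not needed as data, passing `index_col=0` to `pd.read_csv()` uses it as the dataframe index instead.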





In [None]:
# URL of the CSV file in the GitHub repository
url = "https://raw.githubusercontent.com/hamidds/PSND/main/PSN-fa.csv"

# Read the CSV file directly from the URL
data = pd.read_csv(url)

# Display the first few rows of the dataframe
data.head()

Unnamed: 0,Norm,Environment,Context,Label
0,دوری از رفتارهای خشن و پرخاشگرانه,ورزشگاه,در یک رویداد یا مسابقه ورزشی,Normal
1,اجتناب از نگاه کردن به افراد نامحرم,خیابان شهر,عابران در حال رفت و آمد در خیابان,Normal
2,صرف نهار در ماه رمضان در اماکن عمومی,رستوران,یک مسلمان در حال صرف نهار علنی در رستوران در م...,Taboo
3,احسان و نیکوکاری,مراکز خیریه,افراد کمک های نقدی و غیرنقدی اهدا می کنند,Expected
4,تحصیل دختران و پسران در دانشگاه‌های جداگانه,دانشگاه,دانشجویان دختر و پسر در حال تحصیل در دانشگاه‌ه...,Normal


1. **Assert Statements**:
   ```python
   assert 'Norm' in data.columns
   assert 'Environment' in data.columns
   assert 'Context' in data.columns
   assert 'Label' in data.columns
   ```
   These lines of code are used to ensure that the dataset contains the required columns. Each `assert` statement checks if a specific column name is present in the dataframe `data`. If any of these conditions fail (i.e., the column is missing), it will raise an `AssertionError`.

2. **Combine Relevant Columns into a Single Text Column**:
   ```python
   data['text'] = 'Environment: ' + data['Environment'] + ' Norm: ' + data['Norm'] + ' Context: ' + data['Context']
   ```
   This line of code creates a new column named `'text'` in the dataframe `data`. It concatenates the values from three columns (`'Environment'`, `'Norm'`, and `'Context'`) along with some descriptive text. This concatenation is done to create a single text column that can be used as input for models like BERT.

3. **Encode Labels**:
   ```python
   label_mapping = {'Expected': 0, 'Normal': 1, 'Taboo': 2}
   data['label'] = data['Label'].map(label_mapping)
   ```
   Here, a dictionary `label_mapping` is defined to map textual labels to numerical values. Then, the `map()` function is used to apply this mapping to the `'Label'` column in the dataframe `data` and create a new column `'label'` containing the numerical representations of the labels.

4. **Split the Data**:
   ```python
   train_texts, test_texts, train_labels, test_labels = train_test_split(data['text'],
                                                                         data['label'],
                                                                         test_size=0.2,
                                                                         random_state=42)
   ```
   This line of code splits the data into training and testing sets. It uses the `train_test_split()` function from scikit-learn, which splits arrays or matrices into random train and test subsets. Here:
   - `data['text']` contains the input text data.
   - `data['label']` contains the corresponding labels.
   - `test_size=0.2` specifies that 20% of the data will be used for testing, and the rest for training.
   - `random_state=42` fixes the seed of the random shuffle, so the same split is produced every time the code runs. A different value yields a different split.
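
One detail of the label-encoding step is worth a check: `Series.map()` with a dictionary returns `NaN` for any value that is not a key, so a misspelled or unexpected label would pass through silently. The sketch below shows one possible guard, assuming a helper named `encode_labels` (hypothetical, not part of the notebook) and a made-up toy series.

```python
import pandas as pd

label_mapping = {'Expected': 0, 'Normal': 1, 'Taboo': 2}


def encode_labels(labels: pd.Series, mapping: dict) -> pd.Series:
    """Map text labels to integer ids and fail loudly on unknown values."""
    encoded = labels.map(mapping)
    # Values missing from the mapping come back as NaN.
    unmapped = labels[encoded.isna()].unique().tolist()
    if unmapped:
        raise ValueError(f"Unmapped label values: {unmapped}")
    return encoded.astype(int)


# Toy labels, invented for the example.
toy_labels = pd.Series(['Normal', 'Taboo', 'Expected', 'Normal'])
encoded = encode_labels(toy_labels, label_mapping)
print(encoded.tolist())
```

Calling the helper on a series that contains, say, `'taboo'` in lowercase raises a `ValueError` instead of producing a `NaN` label.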

In [None]:
# Ensure the dataset columns are as described
assert 'Norm' in data.columns
assert 'Environment' in data.columns
assert 'Context' in data.columns
assert 'Label' in data.columns

# Combine relevant columns into a single text column for BERT with column names
data['text'] = 'Environment: ' + data['Environment'] + ' Norm: ' + data['Norm'] + ' Context: ' + data['Context']

# Encode labels
label_mapping = {'Expected': 0, 'Normal': 1, 'Taboo': 2}
data['label'] = data['Label'].map(label_mapping)

# Split the data
train_texts, test_texts, train_labels, test_labels = train_test_split(data['text'],
                                                                      data['label'],
                                                                      test_size=0.2,
                                                                      random_state=42)

1. **Load Tokenizer**:
   ```python
   model_name = 'bert-base-multilingual-cased'
   tokenizer = AutoTokenizer.from_pretrained(model_name)
   ```
   Here, we are loading a tokenizer for the BERT model named `'bert-base-multilingual-cased'`. The `AutoTokenizer.from_pretrained()` method loads the pre-trained tokenizer associated with the specified BERT model.

2. **Tokenize Data**:
   ```python
   def tokenize_function(texts):
       return tokenizer(texts, padding=True, truncation=True, max_length=512)

   train_encodings = tokenize_function(train_texts.tolist())
   test_encodings = tokenize_function(test_texts.tolist())
   ```
   This code defines a function `tokenize_function()` that takes a list of texts and tokenizes them with the tokenizer loaded in the previous step. `padding=True` pads every sequence to the length of the longest one in the list, and `truncation=True` with `max_length=512` cuts any sequence longer than 512 tokens, the maximum input length of this BERT model. The function is then applied to the training and testing texts separately.

3. **Convert to Torch Dataset**:
   ```python
   class Dataset(torch.utils.data.Dataset):
       def __init__(self, encodings, labels):
           self.encodings = encodings
           self.labels = labels

       def __getitem__(self, idx):
           item = {key: torch.tensor(val[idx]) for key, val in self.encodings.items()}
           item['labels'] = torch.tensor(self.labels[idx])
           return item

       def __len__(self):
           return len(self.labels)

   train_dataset = Dataset(train_encodings, train_labels.tolist())
   test_dataset = Dataset(test_encodings, test_labels.tolist())
   ```
   This code defines a custom dataset class `Dataset` that inherits from `torch.utils.data.Dataset`. It stores the tokenized encodings and their labels and serves them in the form PyTorch expects. The `__getitem__` method returns a dictionary of tensors for the sample at index `idx`: the tokenized inputs plus a `labels` entry. The `__len__` method returns the total number of samples. Finally, the training and testing datasets are created from the tokenized encodings and the label lists.
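
To see what padding and truncation do to a batch, here is a simplified pure-Python sketch. It is not the tokenizer's implementation: `pad_and_truncate` is a hypothetical helper, the token ids are made up, and truncation here simply cuts the tail, whereas a real tokenizer also keeps its special end-of-sequence token.

```python
def pad_and_truncate(sequences, max_length, pad_id=0):
    """Cut sequences to max_length, then pad all of them to the longest one."""
    truncated = [seq[:max_length] for seq in sequences]
    longest = max(len(seq) for seq in truncated)

    input_ids, attention_mask = [], []
    for seq in truncated:
        n_pad = longest - len(seq)
        input_ids.append(seq + [pad_id] * n_pad)
        # 1 marks a real token, 0 marks padding the model should ignore.
        attention_mask.append([1] * len(seq) + [0] * n_pad)
    return {'input_ids': input_ids, 'attention_mask': attention_mask}


# Made-up token ids of different lengths; max_length=5 keeps the example small.
batch = pad_and_truncate(
    [[101, 7, 8, 102], [101, 9, 102], [101, 1, 2, 3, 4, 5, 102]],
    max_length=5,
)
for ids, mask in zip(batch['input_ids'], batch['attention_mask']):
    print(ids, mask)
```

Every row ends up with the same length, which is what lets the custom `Dataset` class index into the encodings and build tensors of a fixed shape.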

In [None]:
# Load tokenizer
model_name = 'bert-base-multilingual-cased'
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Tokenize data
def tokenize_function(texts):
    return tokenizer(texts, padding=True, truncation=True, max_length=512)


train_encodings = tokenize_function(train_texts.tolist())
test_encodings = tokenize_function(test_texts.tolist())


# Convert to torch dataset
class Dataset(torch.utils.data.Dataset):
    def __init__(self, encodings, labels):
        self.encodings = encodings
        self.labels = labels

    def __getitem__(self, idx):
        item = {key: torch.tensor(val[idx]) for key, val in self.encodings.items()}
        item['labels'] = torch.tensor(self.labels[idx])
        return item

    def __len__(self):
        return len(self.labels)


train_dataset = Dataset(train_encodings, train_labels.tolist())
test_dataset = Dataset(test_encodings, test_labels.tolist())

```python
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=3)
```

Here's what each part does:

1. **`AutoModelForSequenceClassification.from_pretrained()`**: This is a method provided by the Hugging Face `transformers` library. It allows us to load a pre-trained model for sequence classification. The model architecture (e.g., BERT) is determined by the `model_name` parameter.

2. **`model_name`**: This variable contains the name of the pre-trained model to load. In this case, it's `'bert-base-multilingual-cased'`, which refers to a BERT model trained on multiple languages with case sensitivity.

3. **`num_labels=3`**: This parameter specifies the number of labels/classes in the classification task. In this case, it's set to `3`, indicating that the model will be trained for a classification task with three classes.

So, this line of code loads a pre-trained BERT model for sequence classification with the specified model architecture (`'bert-base-multilingual-cased'`) and configures it for a classification task with three classes.
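
To make the effect of `num_labels=3` concrete, the sketch below mimics a classification head in NumPy: a single linear layer that maps a 768-dimensional sentence representation (the hidden size of BERT-base) to three logits. This is a simplified illustration rather than the library's implementation; the names `W`, `b`, and `pooled`, the random inputs, and the initialization scale are assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

hidden_size = 768   # hidden size of BERT-base
num_labels = 3      # Expected, Normal, Taboo

# A freshly initialized head: random weights and a zero bias (assumed scale).
W = rng.normal(loc=0.0, scale=0.02, size=(hidden_size, num_labels))
b = np.zeros(num_labels)

# Stand-in for the encoder output of a batch of 4 texts (random values).
pooled = rng.normal(size=(4, hidden_size))

# One logit per class for every text in the batch.
logits = pooled @ W + b
print(logits.shape)
```

Because the head starts from random weights, its logits carry no information about the task until the model has been fine-tuned.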

In [None]:
# Load model
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=3)

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-multilingual-cased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


This code sets up the training configuration, initializes a `Trainer` object, trains the model, and evaluates its performance. Let's go through each part:

1. **Training Arguments**:
   ```python
   training_args = TrainingArguments(
       output_dir='./results',
       num_train_epochs=18,  # Number of passes over the training set
       per_device_train_batch_size=16,  # Training batch size per device
       per_device_eval_batch_size=16,  # Evaluation batch size per device
       warmup_steps=200,  # Learning-rate warmup steps
       weight_decay=0.01,
       logging_dir='./logs',
   )
   ```
   This block defines the training arguments using the `TrainingArguments` class from the `transformers` library. It specifies various parameters related to training:
   - `output_dir`: Directory where model checkpoints and evaluation results will be saved.
   - `num_train_epochs`: Number of training epochs. Here, it's set to 18 epochs.
   - `per_device_train_batch_size`: Training batch size per device (GPU, TPU core, or CPU). It's set to 16.
   - `per_device_eval_batch_size`: Evaluation batch size per device. Also set to 16.
   - `warmup_steps`: Number of initial steps over which the learning rate ramps up to its configured value. Here, it's set to 200.
   - `weight_decay`: Weight decay for regularization.
   - `logging_dir`: Directory for storing logs during training.

2. **Trainer Initialization**:
   ```python
   trainer = Trainer(
       model=model,
       args=training_args,
       train_dataset=train_dataset,
       eval_dataset=test_dataset,
   )
   ```
   This creates a `Trainer` object responsible for managing the training process. It takes the following arguments:
   - `model`: The pre-trained model for fine-tuning.
   - `args`: The training arguments defined earlier.
   - `train_dataset`: The training dataset.
   - `eval_dataset`: The evaluation dataset.

3. **Training**:
   ```python
   trainer.train()
   ```
   This method call initiates the training process. The `Trainer` object will iterate through the training dataset for the specified number of epochs, updating the model's parameters based on the defined training objectives.

4. **Evaluation**:
   ```python
   trainer.evaluate()
   ```
   After training, this line evaluates the model on the evaluation dataset. Because no `compute_metrics` function is passed to the `Trainer`, the returned dictionary contains only the evaluation loss and runtime statistics. Precision, recall, and F1-score are computed separately with `classification_report`.
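
`Trainer` accepts an optional `compute_metrics` callable, which the setup above does not pass. A minimal sketch of such a function follows, assuming the evaluation output can be unpacked into `(logits, labels)`. The metric names `accuracy` and `macro_f1` are choices made here, and the toy inputs are made up.

```python
import numpy as np


def compute_metrics(eval_pred):
    """Return accuracy and macro-averaged F1 from (logits, labels)."""
    logits, labels = eval_pred
    logits = np.asarray(logits)
    labels = np.asarray(labels)
    preds = np.argmax(logits, axis=1)

    accuracy = float((preds == labels).mean())

    f1_per_class = []
    for c in range(logits.shape[1]):
        tp = np.sum((preds == c) & (labels == c))
        fp = np.sum((preds == c) & (labels != c))
        fn = np.sum((preds != c) & (labels == c))
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class never occurs.
        f1_per_class.append(2 * tp / denom if denom else 0.0)

    return {'accuracy': accuracy, 'macro_f1': float(np.mean(f1_per_class))}


# Toy check with made-up logits for four samples and three classes.
toy_logits = [[2, 0, 0], [0, 2, 0], [0, 0, 2], [2, 0, 0]]
toy_labels = [0, 1, 2, 1]
metrics = compute_metrics((toy_logits, toy_labels))
print(metrics)

# To use it during training: Trainer(..., compute_metrics=compute_metrics)
```

With this argument set, the dictionary returned by `trainer.evaluate()` would include the extra metrics alongside the loss.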

In [None]:
# Training arguments
training_args = TrainingArguments(
    output_dir='./results',
    num_train_epochs=18,  # Number of passes over the training set
    per_device_train_batch_size=16,  # Training batch size per device
    per_device_eval_batch_size=16,  # Evaluation batch size per device
    warmup_steps=200,  # Learning-rate warmup steps
    weight_decay=0.01,
    logging_dir='./logs',
)

# Trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=test_dataset,
)

# Train the model
trainer.train()

# Evaluate the model with fine-tuning
trainer.evaluate()
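
Whether `warmup_steps=200` is a sensible value depends on how many optimizer steps the whole run takes. The arithmetic can be sketched as below, assuming a single device and no gradient accumulation. The training-set size `n_train_examples = 1000` is a hypothetical placeholder, not the real size of this dataset.

```python
import math


def total_training_steps(n_train_examples, batch_size, num_epochs):
    """Optimizer steps for a run on one device without gradient accumulation."""
    steps_per_epoch = math.ceil(n_train_examples / batch_size)
    return steps_per_epoch * num_epochs


n_train_examples = 1000  # hypothetical; replace with len(train_dataset)
total_steps = total_training_steps(n_train_examples, batch_size=16, num_epochs=18)
warmup_fraction = 200 / total_steps

print(f"total steps: {total_steps}, warmup fraction: {warmup_fraction:.2%}")
```

If the warmup fraction comes out very large for the real dataset size, the learning rate spends most of the run ramping up, which is a reason to revisit the setting.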

This block of code builds a baseline by making predictions with the pre-trained BERT model before any fine-tuning. Let's break it down:

```python
# Without fine-tuning
model_without_fine_tuning = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=3)
```
Here, we load the same pre-trained BERT checkpoint again, this time without fine-tuning it. The model architecture is determined by `model_name`, and `num_labels=3` configures a three-class classification head. Because that head is newly initialized with random weights, this model serves as a baseline whose predictions are expected to be close to random.

```python
classifier_without_fine_tuning = pipeline('text-classification', model=model_without_fine_tuning, tokenizer=tokenizer,
                                          return_all_scores=True)
```
This line creates a text classification pipeline around the untuned model, using the `pipeline()` function from the `transformers` library. The `return_all_scores=True` argument makes the pipeline return a score for every label rather than only the top one. In recent `transformers` releases this argument is deprecated in favor of `top_k=None`.

```python
reverse_label_mapping = {0: 'Expected', 1: 'Normal', 2: 'Taboo'}
pipeline_label_mapping = {f'LABEL_{i}': label for i, label in reverse_label_mapping.items()}
```
These lines define mappings between numerical labels and textual labels. `reverse_label_mapping` maps numerical labels to textual labels, while `pipeline_label_mapping` maps pipeline labels to textual labels.

```python
# Predict without fine-tuning
preds_without_fine_tuning = classifier_without_fine_tuning(test_texts.tolist())
```
This line makes predictions on the test data using the classifier without fine-tuning. It predicts labels for each input text in the test set.

```python
pred_labels_without_fine_tuning = [
    label_mapping[pipeline_label_mapping[max(scores, key=lambda x: x['score'])['label']]] for scores in preds_without_fine_tuning
]
```
Here, each prediction is reduced to a single integer class id. For every input, the entry with the highest score is selected, its pipeline label (for example `LABEL_2`) is mapped to the textual label (`Taboo`), and `label_mapping` converts that back to the numerical id (`2`) so it can be compared with `test_labels`.
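
The label conversion can be traced without loading a model. The sketch below assumes `mock_preds`, a hand-written stand-in with the same shape as the pipeline output when all scores are returned (one list of `{'label', 'score'}` dictionaries per input). The scores are made up.

```python
label_mapping = {'Expected': 0, 'Normal': 1, 'Taboo': 2}
reverse_label_mapping = {0: 'Expected', 1: 'Normal', 2: 'Taboo'}
pipeline_label_mapping = {f'LABEL_{i}': label for i, label in reverse_label_mapping.items()}

# Hand-written stand-in for the pipeline output of two inputs (made-up scores).
mock_preds = [
    [{'label': 'LABEL_0', 'score': 0.2},
     {'label': 'LABEL_1', 'score': 0.5},
     {'label': 'LABEL_2', 'score': 0.3}],
    [{'label': 'LABEL_0', 'score': 0.1},
     {'label': 'LABEL_1', 'score': 0.2},
     {'label': 'LABEL_2', 'score': 0.7}],
]

# Pick the highest-scoring entry, then go LABEL_i -> text label -> integer id.
pred_ids = [
    label_mapping[pipeline_label_mapping[max(scores, key=lambda x: x['score'])['label']]]
    for scores in mock_preds
]

# Integer ids are what classification_report compares against test_labels;
# the text labels are only needed for display.
pred_names = [reverse_label_mapping[i] for i in pred_ids]
print(pred_ids, pred_names)
```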

```python
print(f"Evaluation without fine-tuning on model {model_name}:")
print(classification_report(test_labels, pred_labels_without_fine_tuning, target_names=reverse_label_mapping.values()))
```
Finally, it prints the evaluation results without fine-tuning. It computes and displays classification metrics such as precision, recall, and F1-score using the `classification_report` function from scikit-learn. The target names are specified to provide meaningful labels in the classification report.

In [None]:
# Without fine-tuning
model_without_fine_tuning = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=3)
classifier_without_fine_tuning = pipeline('text-classification', model=model_without_fine_tuning, tokenizer=tokenizer,
                                          return_all_scores=True)

reverse_label_mapping = {0: 'Expected', 1: 'Normal', 2: 'Taboo'}
pipeline_label_mapping = {f'LABEL_{i}': label for i, label in reverse_label_mapping.items()}

# Predict without fine-tuning
preds_without_fine_tuning = classifier_without_fine_tuning(test_texts.tolist())
pred_labels_without_fine_tuning = [
    label_mapping[pipeline_label_mapping[max(scores, key=lambda x: x['score'])['label']]] for scores in preds_without_fine_tuning
]

print(f"Evaluation without fine-tuning on model {model_name}:")
print(classification_report(test_labels, pred_labels_without_fine_tuning, target_names=reverse_label_mapping.values()))

This code block evaluates the fine-tuned BERT model on the test set. Let's dissect it:

```python
# Evaluation
print(f"Evaluation with fine-tuning on model {model_name}:")
preds = trainer.predict(test_dataset)
pred_labels = np.argmax(preds.predictions, axis=1)
print(classification_report(test_labels, pred_labels, target_names=reverse_label_mapping.values()))
```

Here's what each part does:

1. **Print Evaluation Header**:
   ```python
   print(f"Evaluation with fine-tuning on model {model_name}:")
   ```
   This line simply prints a header indicating that the evaluation is being performed with fine-tuning on the specified BERT model.

2. **Make Predictions**:
   ```python
   preds = trainer.predict(test_dataset)
   ```
   This line uses the `predict()` method of the `Trainer` object to run the fine-tuned model on the test dataset (`test_dataset`). It returns a `PredictionOutput` object whose `predictions` field holds the raw logits, one row per sample and one column per class.

3. **Extract Predicted Labels**:
   ```python
   pred_labels = np.argmax(preds.predictions, axis=1)
   ```
   Here, `np.argmax()` is used to find the index of the maximum value along the specified axis (`axis=1`). This effectively gives us the predicted class labels for each sample in the test dataset.

4. **Print Classification Report**:
   ```python
   print(classification_report(test_labels, pred_labels, target_names=reverse_label_mapping.values()))
   ```
   Finally, this line prints the classification report, which includes metrics such as precision, recall, and F1-score, comparing the predicted labels (`pred_labels`) with the true labels (`test_labels`). The `target_names` argument is provided to give meaningful labels in the classification report.
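
The `np.argmax` step can be illustrated on a small array. The sketch below assumes made-up logits for two samples and also converts them to probabilities with a softmax, which the notebook does not do but which is handy for inspecting how confident a prediction is.

```python
import numpy as np

# Made-up logits: one row per sample, one column per class.
logits = np.array([
    [2.0, 0.5, -1.0],
    [0.1, 0.2, 3.0],
])

# Index of the largest logit in each row is the predicted class id.
pred_labels = np.argmax(logits, axis=1)

# Softmax turns logits into probabilities; subtracting the row maximum
# first keeps the exponentials numerically stable.
shifted = logits - logits.max(axis=1, keepdims=True)
probs = np.exp(shifted) / np.exp(shifted).sum(axis=1, keepdims=True)

print(pred_labels)
print(probs.round(3))
```

Softmax preserves the ordering within each row, so taking `argmax` of the probabilities gives the same class ids as taking it of the logits.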

In [None]:
# Evaluation
print(f"Evaluation with fine-tuning on model {model_name}:")
preds = trainer.predict(test_dataset)
pred_labels = np.argmax(preds.predictions, axis=1)
print(classification_report(test_labels, pred_labels, target_names=reverse_label_mapping.values()))