<a href="https://colab.research.google.com/github/lov435/SOEmotions/blob/main/BERT_on_SO.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

### 1. Install required dependencies

In [2]:
!pip install torch transformers pandas scikit-learn



### 2. Load and Inspect Dataset

In [3]:
import pandas as pd

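# Google Drive sharing link for the dataset CSV
url = 'https://drive.google.com/file/d/1Suw8zSMSDQrtFtjFNnCXIipnkYIFDjvR/view?usp=sharing'
# The file id is the second-to-last path segment; rewrite it into a direct-download URL
file_id = url.split('/')[-2]
dwn_url = 'https://drive.google.com/uc?id=' + file_id
df = pd.read_csv(dwn_url)
# Drop rows that have no preprocessed comment text
df.dropna(subset=['CommentTextProc'], inplace=True)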

# Inspect the structure of the DataFrame
print(df.head())


                                     CommentTextProc  \
0                                               fmod   
1  group concat regexp enough reason use sql sql ...   
2                              <CALL> fix anoth call   
3                           confus what wrong code ?   
4  <CALL> anoth way but boost doc show specifi va...   

                                         OrigComment  Score  RefersTo  \
0                                  $x = fmod($x, 1);     81     False   
1  Group_concat and REGEXP are more than enough r...      3     False   
2    @David: I fixed it and added another call... :)      0      True   
3  @Jamie - that is another way, but the boost do...      2      True   
4  @Jamie - that is another way, but the boost do...      2      True   

   SameAuthor          Label        Group  admiration  amusement   anger  ...  \
0       False       Solution     Addition    0.000300     0.0003  0.0004  ...   
1       False  Clarification     Addition    0.000500     0.00
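
The link rewriting used in the load cell can be wrapped in a small helper. This is a minimal sketch: the function name `drive_download_url` is hypothetical, and it assumes the sharing link has the `/file/d/<id>/view` shape used in this notebook.

```python
def drive_download_url(share_url: str) -> str:
    """Turn a Drive sharing link (.../file/d/<id>/view?...) into a direct-download URL."""
    parts = share_url.split('/')
    # The file id is the path segment that follows 'd'.
    file_id = parts[parts.index('d') + 1]
    return 'https://drive.google.com/uc?id=' + file_id


share_url = 'https://drive.google.com/file/d/1Suw8zSMSDQrtFtjFNnCXIipnkYIFDjvR/view?usp=sharing'
print(drive_download_url(share_url))
```

Looking up the segment after `d` rather than counting from the end keeps the helper working if the link has no trailing `/view` part.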

### 3. Preprocess the data

In [4]:
from transformers import BertTokenizer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

# Tokenize and encode comments
encoded_data = tokenizer(
    df['CommentTextProc'].tolist(),
    padding=True,
    truncation=True,
    max_length=128,
    return_tensors='pt'
)

# Encode the string labels in df['Group'] as integer class ids
label_encoder = LabelEncoder()
df['NumericLabels'] = label_encoder.fit_transform(df['Group'])

# Split the dataset into training and validation sets
X_train, X_val, y_train, y_val = train_test_split(
    encoded_data['input_ids'],
    df['NumericLabels'],
    test_size=0.2,
    random_state=42
)
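
To read predictions later, it helps to know which integer each group name was mapped to. The sketch below shows how `LabelEncoder` assigns ids on a toy list; apart from `Addition`, the group names are hypothetical and stand in for whatever `df['Group']` actually contains.

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical group names; only 'Addition' appears in the preview of the data.
groups = ['Addition', 'Question', 'Addition', 'Feedback']

encoder = LabelEncoder()
ids = encoder.fit_transform(groups)

# classes_ is sorted, so a class id is simply an index into this array.
id_to_name = dict(enumerate(encoder.classes_))
print(id_to_name)
print(encoder.inverse_transform(ids))
```

In the notebook, the same lookup would use `label_encoder.classes_` and `label_encoder.inverse_transform(...)`.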

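
The split in the preprocessing cell keeps only `input_ids`, so the attention mask returned by the tokenizer is lost. One possible alternative, sketched here on toy tensors, is to split row indices once and reuse them for every tensor. The `stratify` argument is an addition not used in the original cell, and all variable names below are hypothetical.

```python
import numpy as np
import torch
from sklearn.model_selection import train_test_split

# Toy stand-ins for encoded_data['input_ids'], encoded_data['attention_mask'] and the labels.
input_ids = torch.arange(40).reshape(10, 4)
attention_mask = torch.ones(10, 4, dtype=torch.long)
labels = np.array([0, 1] * 5)

# Split row indices once, stratified by label, then reuse them for every tensor.
train_idx, val_idx = train_test_split(
    np.arange(len(labels)),
    test_size=0.2,
    random_state=42,
    stratify=labels,
)

train_rows = torch.as_tensor(train_idx)
val_rows = torch.as_tensor(val_idx)
X_train_ids, X_val_ids = input_ids[train_rows], input_ids[val_rows]
X_train_mask, X_val_mask = attention_mask[train_rows], attention_mask[val_rows]
y_train_toy, y_val_toy = labels[train_idx], labels[val_idx]
print(tuple(X_train_ids.shape), tuple(X_val_mask.shape))
```

Stratifying keeps the class proportions similar in both splits, which matters when some groups are rare.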

### 4. Create Dataset and DataLoader


In [5]:
import torch

class StackOverflowDataset(torch.utils.data.Dataset):
    def __init__(self, input_ids, labels):
        self.input_ids = input_ids
        self.labels = labels

    def __len__(self):
        return len(self.input_ids)

    def __getitem__(self, idx):
        return {
            'input_ids': self.input_ids[idx],
            'labels': torch.tensor(self.labels.iloc[idx], dtype=torch.long)
        }

# Create datasets and dataloaders
train_dataset = StackOverflowDataset(X_train, y_train)
val_dataset = StackOverflowDataset(X_val, y_val)

train_dataloader = torch.utils.data.DataLoader(train_dataset, batch_size=32, shuffle=True)
val_dataloader = torch.utils.data.DataLoader(val_dataset, batch_size=32, shuffle=False)
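
The dataset above yields only `input_ids` and `labels`. A variant that also carries the attention mask could look like the sketch below. The class name `MaskedCommentDataset` and the toy tensors are hypothetical; in the notebook the labels would be passed as an array, for example `y_train.to_numpy()`.

```python
import torch

class MaskedCommentDataset(torch.utils.data.Dataset):
    """Hypothetical variant of StackOverflowDataset that also returns the attention mask."""

    def __init__(self, input_ids, attention_mask, labels):
        self.input_ids = input_ids
        self.attention_mask = attention_mask
        # Convert once up front so __getitem__ is a plain tensor lookup.
        self.labels = torch.as_tensor(labels, dtype=torch.long)

    def __len__(self):
        return len(self.input_ids)

    def __getitem__(self, idx):
        return {
            'input_ids': self.input_ids[idx],
            'attention_mask': self.attention_mask[idx],
            'labels': self.labels[idx],
        }

# Toy tensors standing in for the tokenizer output (the first row ends in one padding position).
toy_ids = torch.tensor([[101, 7592, 102, 0], [101, 2088, 999, 102]])
toy_mask = torch.tensor([[1, 1, 1, 0], [1, 1, 1, 1]])
toy_dataset = MaskedCommentDataset(toy_ids, toy_mask, [1, 0])

loader = torch.utils.data.DataLoader(toy_dataset, batch_size=2, shuffle=False)
batch = next(iter(loader))
print({key: tuple(value.shape) for key, value in batch.items()})
```

Because each item is a dict of tensors, the default collate function stacks every key into a batch, and `model(**inputs)` in the training loop would receive the mask without further changes.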

### 5. Load Pretrained BERT Model

In [6]:
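from transformers import BertForSequenceClassification
# transformers.AdamW is deprecated; use the PyTorch implementation instead
from torch.optim import AdamW

num_classes = len(df['Group'].unique())
model = BertForSequenceClassification.from_pretrained('bert-base-uncased', num_labels=num_classes)
# weight_decay=0.0 matches the default of the deprecated transformers.AdamW
optimizer = AdamW(model.parameters(), lr=2e-5, weight_decay=0.0)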


Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.weight', 'classifier.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


### 6. Train the Model

In [7]:
import torch
from sklearn.metrics import accuracy_score, classification_report

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model.to(device)

num_epochs = 3  # Adjust as needed

# Training loop
for epoch in range(num_epochs):
    model.train()  # enable dropout for the training pass

    for batch in train_dataloader:
        inputs = {key: value.to(device) for key, value in batch.items()}
        outputs = model(**inputs)
        loss = outputs.loss
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()

    # Evaluation on validation set after each epoch
    model.eval()
    with torch.no_grad():
        true_labels = []
        predicted_labels = []

        for batch in val_dataloader:
            inputs = {key: value.to(device) for key, value in batch.items()}
            outputs = model(**inputs)

            # Get predicted labels
            _, preds = torch.max(outputs.logits, 1)

            true_labels.extend(inputs['labels'].cpu().numpy())
            predicted_labels.extend(preds.cpu().numpy())

    # Calculate and print accuracy for the epoch
    accuracy = accuracy_score(true_labels, predicted_labels)
    print(f'Epoch {epoch + 1}/{num_epochs}, Accuracy: {accuracy:.4f}')

# Final validation accuracy: the label lists are rebuilt every epoch, so this equals the last epoch's value
overall_accuracy = accuracy_score(true_labels, predicted_labels)
print(f'\nOverall Accuracy: {overall_accuracy:.4f}')


We strongly recommend passing in an `attention_mask` since your input_ids may be padded. See https://huggingface.co/docs/transformers/troubleshooting#incorrect-output-when-padding-tokens-arent-masked.
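
The warning appears because each batch contains only `input_ids` and `labels`, so the model also attends to padding tokens. One way to address it inside the loop is to derive the mask from the padding id, as in this sketch. The helper name `with_attention_mask` is hypothetical, and the default of 0 assumes the padding id of `bert-base-uncased`; in the notebook it would be safer to pass `tokenizer.pad_token_id`.

```python
import torch

def with_attention_mask(batch, pad_token_id=0):
    """Return a copy of batch with an attention mask that is 0 on padding positions."""
    mask = (batch['input_ids'] != pad_token_id).long()
    return {**batch, 'attention_mask': mask}

toy_batch = {
    'input_ids': torch.tensor([[101, 2023, 102, 0, 0], [101, 2003, 2307, 999, 102]]),
    'labels': torch.tensor([0, 1]),
}
patched = with_attention_mask(toy_batch)
print(patched['attention_mask'])
```

Applying this to each batch before `model(**inputs)` would likely change the reported accuracies, since the numbers in this notebook were produced without a mask.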


Epoch 1/3, Accuracy: 0.4137
Epoch 2/3, Accuracy: 0.5685
Epoch 3/3, Accuracy: 0.6068

Overall Accuracy: 0.6068
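
Accuracy alone hides how each group performs. The training cell imports `classification_report` without using it; the sketch below shows what it reports on toy predictions. The label lists and all class names except `Addition` are hypothetical; in the notebook the inputs would be `true_labels`, `predicted_labels` and `label_encoder.classes_`.

```python
from sklearn.metrics import classification_report

# Toy stand-ins for true_labels / predicted_labels from the validation loop.
toy_true = [0, 0, 1, 2]
toy_pred = [0, 1, 1, 2]
# Hypothetical class names; in the notebook these would come from label_encoder.classes_.
toy_names = ['Addition', 'Feedback', 'Question']

report = classification_report(
    toy_true,
    toy_pred,
    target_names=toy_names,
    zero_division=0,   # report 0 instead of warning when a class is never predicted
    output_dict=True,  # return a nested dict rather than a formatted string
)
print(report['accuracy'])
print(report['Addition'])
```

Dropping `output_dict=True` returns a formatted table with per-class precision, recall and F1 that can be printed directly.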
