## Using the IMDb dataset for active learning and sentiment analysis

In [39]:
import pandas as pd
from datasets import load_dataset
from sklearn.model_selection import train_test_split
from transformers import BertTokenizer, BertForSequenceClassification, Trainer, TrainingArguments
from torch.utils.data import DataLoader, Dataset
import torch
import random

# Load the IMDb review dataset (labels: 0 = negative, 1 = positive); preprocessing happens in later cells
dataset = load_dataset("imdb")

In [46]:
# dataset.set_format(type='pandas')

# df = dataset['train'][:]
# df.head()

Unnamed: 0,text,label
0,"I rented I AM CURIOUS-YELLOW from my video store because of all the controversy that surrounded it when it was first released in 1967. I also heard that at first it was seized by U.S. customs if it ever tried to enter this country, therefore being a fan of films considered ""controversial"" I really had to see this for myself.<br /><br />The plot is centered around a young Swedish drama student named Lena who wants to learn everything she can about life. In particular she wants to focus her attentions to making some sort of documentary on what the average Swede thought about certain political issues such as the Vietnam War and race issues in the United States. In between asking politicians and ordinary denizens of Stockholm about their opinions on politics, she has sex with her drama teacher, classmates, and married men.<br /><br />What kills me about I AM CURIOUS-YELLOW is that 40 years ago, this was considered pornographic. Really, the sex and nudity scenes are few and far between, even then it's not shot like some cheaply made porno. While my countrymen mind find it shocking, in reality sex and nudity are a major staple in Swedish cinema. Even Ingmar Bergman, arguably their answer to good old boy John Ford, had sex scenes in his films.<br /><br />I do commend the filmmakers for the fact that any sex shown in the film is shown for artistic purposes rather than just to shock people and make money to be shown in pornographic theaters in America. I AM CURIOUS-YELLOW is a good film for anyone wanting to study the meat and potatoes (no pun intended) of Swedish cinema. But really, this film doesn't have much of a plot.",0
1,"""I Am Curious: Yellow"" is a risible and pretentious steaming pile. It doesn't matter what one's political views are because this film can hardly be taken seriously on any level. As for the claim that frontal male nudity is an automatic NC-17, that isn't true. I've seen R-rated films with male nudity. Granted, they only offer some fleeting views, but where are the R-rated films with gaping vulvas and flapping labia? Nowhere, because they don't exist. The same goes for those crappy cable shows: schlongs swinging in the breeze but not a clitoris in sight. And those pretentious indie movies like The Brown Bunny, in which we're treated to the site of Vincent Gallo's throbbing johnson, but not a trace of pink visible on Chloe Sevigny. Before crying (or implying) ""double-standard"" in matters of nudity, the mentally obtuse should take into account one unavoidably obvious anatomical difference between men and women: there are no genitals on display when actresses appears nude, and the same cannot be said for a man. In fact, you generally won't see female genitals in an American film in anything short of porn or explicit erotica. This alleged double-standard is less a double standard than an admittedly depressing ability to come to terms culturally with the insides of women's bodies.",0
2,"If only to avoid making this type of film in the future. This film is interesting as an experiment but tells no cogent story.<br /><br />One might feel virtuous for sitting thru it because it touches on so many IMPORTANT issues but it does so without any discernable motive. The viewer comes away with no new perspectives (unless one comes up with one while one's mind wanders, as it will invariably do during this pointless film).<br /><br />One might better spend one's time staring out a window at a tree growing.<br /><br />",0
3,"This film was probably inspired by Godard's Masculin, féminin and I urge you to see that film instead.<br /><br />The film has two strong elements and those are, (1) the realistic acting (2) the impressive, undeservedly good, photo. Apart from that, what strikes me most is the endless stream of silliness. Lena Nyman has to be most annoying actress in the world. She acts so stupid and with all the nudity in this film,...it's unattractive. Comparing to Godard's film, intellectuality has been replaced with stupidity. Without going too far on this subject, I would say that follows from the difference in ideals between the French and the Swedish society.<br /><br />A movie of its time, and place. 2/10.",0
4,"Oh, brother...after hearing about this ridiculous film for umpteen years all I can think of is that old Peggy Lee song..<br /><br />""Is that all there is??"" ...I was just an early teen when this smoked fish hit the U.S. I was too young to get in the theater (although I did manage to sneak into ""Goodbye Columbus""). Then a screening at a local film museum beckoned - Finally I could see this film, except now I was as old as my parents were when they schlepped to see it!!<br /><br />The ONLY reason this film was not condemned to the anonymous sands of time was because of the obscenity case sparked by its U.S. release. MILLIONS of people flocked to this stinker, thinking they were going to see a sex film...Instead, they got lots of closeups of gnarly, repulsive Swedes, on-street interviews in bland shopping malls, asinie political pretension...and feeble who-cares simulated sex scenes with saggy, pale actors.<br /><br />Cultural icon, holy grail, historic artifact..whatever this thing was, shred it, burn it, then stuff the ashes in a lead box!<br /><br />Elite esthetes still scrape to find value in its boring pseudo revolutionary political spewings..But if it weren't for the censorship scandal, it would have been ignored, then forgotten.<br /><br />Instead, the ""I Am Blank, Blank"" rhythymed title was repeated endlessly for years as a titilation for porno films (I am Curious, Lavender - for gay films, I Am Curious, Black - for blaxploitation films, etc..) and every ten years or so the thing rises from the dead, to be viewed by a new generation of suckers who want to see that ""naughty sex film"" that ""revolutionized the film industry""...<br /><br />Yeesh, avoid like the plague..Or if you MUST see it - rent the video and fast forward to the ""dirty"" parts, just to get it over with.<br /><br />",0


In [17]:


# Convert the training split to a DataFrame for easier manipulation
full_df = dataset['train'].to_pandas()

# Split it 80/20 into training and test sets
# (random_state is an added assumption so that reruns produce the same split)
train_df, test_df = train_test_split(full_df, test_size=0.2, random_state=42)

# Initialize the tokenizer
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

def preprocess_data(texts):
    # Truncate or pad every review to exactly 128 tokens so all encodings share one shape
    return tokenizer(texts, padding='max_length', truncation=True, max_length=128)

# Preprocess the training data
train_texts = list(train_df['text'])
train_labels = list(train_df['label'])
train_encodings = preprocess_data(train_texts)

# Preprocess the test data
test_texts = list(test_df['text'])
test_labels = list(test_df['label'])
test_encodings = preprocess_data(test_texts)

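# Create a custom dataset class
class IMDbDataset(Dataset):
    """Wraps tokenizer encodings and labels as a PyTorch dataset."""
    def __init__(self, encodings, labels):
        self.encodings = encodings
        self.labels = labels

    def __getitem__(self, idx):
        # as_tensor accepts Python lists and existing tensors alike; torch.tensor warns
        # about copy-construction when the value is already a tensor, as it is after
        # ActiveLearningDataset.get_labeled_data()
        item = {key: torch.as_tensor(val[idx]) for key, val in self.encodings.items()}
        item['labels'] = torch.as_tensor(self.labels[idx])
        return item

    def __len__(self):
        return len(self.labels)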

train_dataset = IMDbDataset(train_encodings, train_labels)
test_dataset = IMDbDataset(test_encodings, test_labels)




In [27]:
class ActiveLearningDataset(Dataset):
    def __init__(self, dataset):
        self.dataset = dataset
        self.labeled_indices = set()
        self.unlabeled_indices = set(range(len(dataset)))
    
    def add_labeled_data(self, indices):
        self.labeled_indices.update(indices)
        self.unlabeled_indices.difference_update(indices)
    
    def get_labeled_data(self):
        encodings = {key: [] for key in self.dataset.encodings.keys()}
        labels = []
        for i in self.labeled_indices:
            for key in encodings.keys():
                encodings[key].append(self.dataset.encodings[key][i])
            labels.append(self.dataset.labels[i])
        encodings = {key: torch.tensor(val) for key, val in encodings.items()}
        labels = torch.tensor(labels)
        return encodings, labels
    
    def get_unlabeled_data(self):
        return [self.dataset[i] for i in self.unlabeled_indices]
    
    def __len__(self):
        return len(self.dataset)
    
    def __getitem__(self, idx):
        return self.dataset[idx]

# Initialize active learning dataset
al_dataset = ActiveLearningDataset(train_dataset)
initial_indices = random.sample(range(len(train_dataset)), k=int(0.01 * len(train_dataset)))  # Initial 1%
al_dataset.add_labeled_data(initial_indices)
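
The initialization above draws the seed set with the global `random` state, so each run labels a different 1%. A minimal sketch of the same bookkeeping with a fixed seed follows; `split_pool` and its `seed` parameter are hypothetical names introduced for illustration and are not part of the notebook.

```python
import random

def split_pool(pool_size, initial_fraction, seed=0):
    """Return (labeled, unlabeled) index sets for a pool of `pool_size` items.

    Hypothetical helper: mirrors what ActiveLearningDataset tracks internally,
    but uses a local, seeded generator so the split is reproducible.
    """
    rng = random.Random(seed)  # local generator; the global random state is untouched
    k = int(initial_fraction * pool_size)
    labeled = set(rng.sample(range(pool_size), k))
    unlabeled = set(range(pool_size)) - labeled
    return labeled, unlabeled

labeled, unlabeled = split_pool(pool_size=20_000, initial_fraction=0.01, seed=0)
print(len(labeled), len(unlabeled))  # 200 19800
```

The two sets stay disjoint and together cover the whole pool, which is the invariant `add_labeled_data` has to preserve on every query round.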


In [30]:
# Pick the best available device: CUDA, then Apple MPS, then CPU
if torch.cuda.is_available():
    device = torch.device('cuda')
elif torch.backends.mps.is_available():
    device = torch.device('mps')
else:
    device = torch.device('cpu')

# One set of arguments shared by every training run and by the evaluation-only
# Trainer in the active learning loop. Trainer picks CUDA or MPS by itself when
# available, so the deprecated use_mps_device flag is not needed.
training_args = TrainingArguments(
    output_dir='./results',
    num_train_epochs=3,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=16,
    logging_dir='./logs',
    logging_steps=10,
    evaluation_strategy="steps",  # newer transformers releases name this eval_strategy
    save_steps=1000,
    eval_steps=1000,
    save_total_limit=2,
)

def train_model(encodings, labels):
    """Fine-tune a fresh BERT classifier on the currently labeled subset."""
    model = BertForSequenceClassification.from_pretrained('bert-base-uncased', num_labels=2)
    model.to(device)
    trainer = Trainer(
        model=model,
        args=training_args,
        train_dataset=IMDbDataset(encodings, labels),
        eval_dataset=test_dataset,
    )
    trainer.train()
    return model

encodings, labels = al_dataset.get_labeled_data()
model = train_model(encodings, labels)



Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


  0%|          | 0/450 [00:00<?, ?it/s]

  item = {key: torch.tensor(val[idx]) for key, val in self.encodings.items()}
  item['labels'] = torch.tensor(self.labels[idx])


{'loss': 0.752, 'grad_norm': 16.12034034729004, 'learning_rate': 4.888888888888889e-05, 'epoch': 0.07}
{'loss': 0.7101, 'grad_norm': 5.029007434844971, 'learning_rate': 4.7777777777777784e-05, 'epoch': 0.13}
{'loss': 0.7157, 'grad_norm': 3.109116792678833, 'learning_rate': 4.666666666666667e-05, 'epoch': 0.2}
{'loss': 0.677, 'grad_norm': 8.308666229248047, 'learning_rate': 4.555555555555556e-05, 'epoch': 0.27}
{'loss': 0.6773, 'grad_norm': 6.424309730529785, 'learning_rate': 4.4444444444444447e-05, 'epoch': 0.33}
{'loss': 0.7009, 'grad_norm': 6.058610916137695, 'learning_rate': 4.3333333333333334e-05, 'epoch': 0.4}
{'loss': 0.7876, 'grad_norm': 6.534720420837402, 'learning_rate': 4.222222222222222e-05, 'epoch': 0.47}
{'loss': 0.631, 'grad_norm': 4.2630181312561035, 'learning_rate': 4.111111111111111e-05, 'epoch': 0.53}
{'loss': 0.6762, 'grad_norm': 5.0149312019348145, 'learning_rate': 4e-05, 'epoch': 0.6}
{'loss': 0.5243, 'grad_norm': 6.3960280418396, 'learning_rate': 3.888888888888889

In [33]:
def uncertainty_sampling(model, dataset, n):
    """Least-confidence sampling: return the n unlabeled indices the model is least sure about."""
    model.eval()
    # Run inference on whichever device the Trainer left the model on (CUDA, MPS, or CPU)
    device = model.device

    # Snapshot the pool order once so positions in `uncertainties` map back to dataset indices
    unlabeled_indices = list(dataset.unlabeled_indices)
    unlabeled_data = [dataset[i] for i in unlabeled_indices]
    unlabeled_dataloader = DataLoader(unlabeled_data, batch_size=16)

    # Score each unlabeled example as 1 - (probability of the predicted class)
    uncertainties = []
    for batch in unlabeled_dataloader:
        input_ids = batch['input_ids'].to(device)
        attention_mask = batch['attention_mask'].to(device)
        with torch.no_grad():
            outputs = model(input_ids=input_ids, attention_mask=attention_mask)
        probs = torch.nn.functional.softmax(outputs.logits, dim=-1)
        uncertainty = 1 - torch.max(probs, dim=-1)[0]
        uncertainties.extend(uncertainty.cpu().tolist())

    # Select the positions with the highest uncertainty and translate them to dataset indices
    uncertain_positions = sorted(range(len(uncertainties)), key=lambda i: uncertainties[i], reverse=True)[:n]
    return [unlabeled_indices[pos] for pos in uncertain_positions]

query_size = int(0.05 * len(train_dataset))  # Query size of 5%
for iteration in range(3):
    print(f"Active Learning Iteration {iteration + 1}")
    
    # Query new data points
    new_indices = uncertainty_sampling(model, al_dataset, query_size)
    al_dataset.add_labeled_data(new_indices)
    
    # Retrain the model with the new data
    encodings, labels = al_dataset.get_labeled_data()
    model = train_model(encodings, labels)
    
    # Evaluate model performance
    trainer = Trainer(model=model, args=training_args, train_dataset=IMDbDataset(encodings, labels), eval_dataset=test_dataset)
    eval_results = trainer.evaluate(eval_dataset=test_dataset)
    print(f"Iteration {iteration + 1} - Evaluation results: {eval_results}")

Active Learning Iteration 1


Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


  0%|          | 0/1950 [00:00<?, ?it/s]


{'loss': 0.7336, 'grad_norm': 2.20583176612854, 'learning_rate': 4.9743589743589746e-05, 'epoch': 0.02}
{'loss': 0.6911, 'grad_norm': 6.5328450202941895, 'learning_rate': 4.948717948717949e-05, 'epoch': 0.03}
{'loss': 0.7133, 'grad_norm': 7.767288684844971, 'learning_rate': 4.923076923076924e-05, 'epoch': 0.05}
{'loss': 0.7327, 'grad_norm': 2.948617458343506, 'learning_rate': 4.8974358974358975e-05, 'epoch': 0.06}
{'loss': 0.7077, 'grad_norm': 6.156192302703857, 'learning_rate': 4.871794871794872e-05, 'epoch': 0.08}
{'loss': 0.705, 'grad_norm': 5.826912879943848, 'learning_rate': 4.846153846153846e-05, 'epoch': 0.09}
{'loss': 0.7068, 'grad_norm': 3.4297738075256348, 'learning_rate': 4.8205128205128205e-05, 'epoch': 0.11}
{'loss': 0.712, 'grad_norm': 4.114220142364502, 'learning_rate': 4.7948717948717955e-05, 'epoch': 0.12}
{'loss': 0.7127, 'grad_norm': 3.059746265411377, 'learning_rate': 4.76923076923077e-05, 'epoch': 0.14}
{'loss': 0.6649, 'grad_norm': 2.4968035221099854, 'learning_ra

  0%|          | 0/313 [00:00<?, ?it/s]

{'eval_loss': 0.3764749765396118, 'eval_runtime': 35.618, 'eval_samples_per_second': 140.379, 'eval_steps_per_second': 8.788, 'epoch': 1.54}



{'loss': 0.45, 'grad_norm': 18.842626571655273, 'learning_rate': 2.4102564102564103e-05, 'epoch': 1.55}
{'loss': 0.5138, 'grad_norm': 5.38333797454834, 'learning_rate': 2.384615384615385e-05, 'epoch': 1.57}
{'loss': 0.3821, 'grad_norm': 11.29217529296875, 'learning_rate': 2.358974358974359e-05, 'epoch': 1.58}
{'loss': 0.4421, 'grad_norm': 17.70659637451172, 'learning_rate': 2.3333333333333336e-05, 'epoch': 1.6}
{'loss': 0.5022, 'grad_norm': 9.272436141967773, 'learning_rate': 2.307692307692308e-05, 'epoch': 1.62}
{'loss': 0.4096, 'grad_norm': 6.689398288726807, 'learning_rate': 2.2820512820512822e-05, 'epoch': 1.63}
{'loss': 0.4006, 'grad_norm': 10.59769344329834, 'learning_rate': 2.2564102564102566e-05, 'epoch': 1.65}
{'loss': 0.3286, 'grad_norm': 6.232532024383545, 'learning_rate': 2.230769230769231e-05, 'epoch': 1.66}
{'loss': 0.4453, 'grad_norm': 7.275266170501709, 'learning_rate': 2.2051282051282052e-05, 'epoch': 1.68}
{'loss': 0.5591, 'grad_norm': 21.213706970214844, 'learning_ra

  0%|          | 0/313 [00:00<?, ?it/s]

Iteration 1 - Evaluation results: {'eval_loss': 0.6415385603904724, 'eval_runtime': 35.6758, 'eval_samples_per_second': 140.151, 'eval_steps_per_second': 8.773}
Active Learning Iteration 2


Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


  0%|          | 0/2325 [00:00<?, ?it/s]

{'loss': 0.7171, 'grad_norm': 2.6401193141937256, 'learning_rate': 4.978494623655914e-05, 'epoch': 0.01}
{'loss': 0.684, 'grad_norm': 9.490621566772461, 'learning_rate': 4.956989247311828e-05, 'epoch': 0.03}
{'loss': 0.6832, 'grad_norm': 4.288467884063721, 'learning_rate': 4.935483870967742e-05, 'epoch': 0.04}
{'loss': 0.7354, 'grad_norm': 4.881336212158203, 'learning_rate': 4.913978494623656e-05, 'epoch': 0.05}
{'loss': 0.7023, 'grad_norm': 1.9548544883728027, 'learning_rate': 4.89247311827957e-05, 'epoch': 0.06}
{'loss': 0.6962, 'grad_norm': 7.410027980804443, 'learning_rate': 4.870967741935484e-05, 'epoch': 0.08}
{'loss': 0.6939, 'grad_norm': 2.2617318630218506, 'learning_rate': 4.849462365591398e-05, 'epoch': 0.09}
{'loss': 0.7019, 'grad_norm': 3.7606518268585205, 'learning_rate': 4.827956989247312e-05, 'epoch': 0.1}
{'loss': 0.697, 'grad_norm': 1.636579155921936, 'learning_rate': 4.806451612903226e-05, 'epoch': 0.12}
{'loss': 0.6911, 'grad_norm': 2.6445469856262207, 'learning_rate

  0%|          | 0/313 [00:00<?, ?it/s]

{'eval_loss': 0.6932776570320129, 'eval_runtime': 35.6282, 'eval_samples_per_second': 140.338, 'eval_steps_per_second': 8.785, 'epoch': 1.29}



{'loss': 0.6958, 'grad_norm': 2.9537603855133057, 'learning_rate': 2.827956989247312e-05, 'epoch': 1.3}
{'loss': 0.7383, 'grad_norm': 3.7798638343811035, 'learning_rate': 2.806451612903226e-05, 'epoch': 1.32}
{'loss': 0.7032, 'grad_norm': 4.438270092010498, 'learning_rate': 2.78494623655914e-05, 'epoch': 1.33}
{'loss': 0.7031, 'grad_norm': 8.80803394317627, 'learning_rate': 2.763440860215054e-05, 'epoch': 1.34}
{'loss': 0.6641, 'grad_norm': 9.16932487487793, 'learning_rate': 2.7419354838709678e-05, 'epoch': 1.35}
{'loss': 0.7249, 'grad_norm': 10.090676307678223, 'learning_rate': 2.7204301075268817e-05, 'epoch': 1.37}
{'loss': 0.7353, 'grad_norm': 2.8601279258728027, 'learning_rate': 2.698924731182796e-05, 'epoch': 1.38}
{'loss': 0.7103, 'grad_norm': 5.303834915161133, 'learning_rate': 2.67741935483871e-05, 'epoch': 1.39}
{'loss': 0.6882, 'grad_norm': 2.8927197456359863, 'learning_rate': 2.6559139784946236e-05, 'epoch': 1.41}
{'loss': 0.7122, 'grad_norm': 2.51985239982605, 'learning_rat

  0%|          | 0/313 [00:00<?, ?it/s]

{'eval_loss': 0.6944551467895508, 'eval_runtime': 35.798, 'eval_samples_per_second': 139.672, 'eval_steps_per_second': 8.743, 'epoch': 2.58}



{'loss': 0.7131, 'grad_norm': 4.264950752258301, 'learning_rate': 6.774193548387098e-06, 'epoch': 2.59}
{'loss': 0.6806, 'grad_norm': 3.182002305984497, 'learning_rate': 6.559139784946237e-06, 'epoch': 2.61}
{'loss': 0.697, 'grad_norm': 2.828944444656372, 'learning_rate': 6.344086021505377e-06, 'epoch': 2.62}
{'loss': 0.6776, 'grad_norm': 2.835517406463623, 'learning_rate': 6.129032258064516e-06, 'epoch': 2.63}
{'loss': 0.6863, 'grad_norm': 2.2537055015563965, 'learning_rate': 5.9139784946236566e-06, 'epoch': 2.65}
{'loss': 0.6792, 'grad_norm': 4.144224643707275, 'learning_rate': 5.698924731182796e-06, 'epoch': 2.66}
{'loss': 0.6737, 'grad_norm': 2.2333359718322754, 'learning_rate': 5.483870967741936e-06, 'epoch': 2.67}
{'loss': 0.7032, 'grad_norm': 2.9229109287261963, 'learning_rate': 5.268817204301076e-06, 'epoch': 2.68}
{'loss': 0.6992, 'grad_norm': 10.9837646484375, 'learning_rate': 5.0537634408602155e-06, 'epoch': 2.7}
{'loss': 0.7046, 'grad_norm': 2.756762981414795, 'learning_rat

  0%|          | 0/313 [00:00<?, ?it/s]

Iteration 2 - Evaluation results: {'eval_loss': 0.6974355578422546, 'eval_runtime': 35.615, 'eval_samples_per_second': 140.39, 'eval_steps_per_second': 8.788}
Active Learning Iteration 3


Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


  0%|          | 0/2700 [00:00<?, ?it/s]

{'loss': 0.7212, 'grad_norm': 4.79241943359375, 'learning_rate': 4.981481481481482e-05, 'epoch': 0.01}
{'loss': 0.7462, 'grad_norm': 6.525022029876709, 'learning_rate': 4.962962962962963e-05, 'epoch': 0.02}
{'loss': 0.6896, 'grad_norm': 3.980599880218506, 'learning_rate': 4.9444444444444446e-05, 'epoch': 0.03}
{'loss': 0.6918, 'grad_norm': 2.5722737312316895, 'learning_rate': 4.925925925925926e-05, 'epoch': 0.04}
{'loss': 0.6763, 'grad_norm': 2.7988781929016113, 'learning_rate': 4.9074074074074075e-05, 'epoch': 0.06}
{'loss': 0.752, 'grad_norm': 5.648867607116699, 'learning_rate': 4.888888888888889e-05, 'epoch': 0.07}
{'loss': 0.6973, 'grad_norm': 1.7348955869674683, 'learning_rate': 4.8703703703703704e-05, 'epoch': 0.08}
{'loss': 0.7033, 'grad_norm': 9.676098823547363, 'learning_rate': 4.851851851851852e-05, 'epoch': 0.09}
{'loss': 0.6958, 'grad_norm': 14.60083293914795, 'learning_rate': 4.8333333333333334e-05, 'epoch': 0.1}
{'loss': 0.7089, 'grad_norm': 5.6891303062438965, 'learning_

  0%|          | 0/313 [00:00<?, ?it/s]

{'eval_loss': 0.6978444457054138, 'eval_runtime': 35.5365, 'eval_samples_per_second': 140.7, 'eval_steps_per_second': 8.808, 'epoch': 1.11}



{'loss': 0.7107, 'grad_norm': 3.902167797088623, 'learning_rate': 3.1296296296296295e-05, 'epoch': 1.12}
{'loss': 0.6811, 'grad_norm': 9.351788520812988, 'learning_rate': 3.111111111111111e-05, 'epoch': 1.13}
{'loss': 0.7409, 'grad_norm': 5.6595892906188965, 'learning_rate': 3.0925925925925924e-05, 'epoch': 1.14}
{'loss': 0.6868, 'grad_norm': 1.8289995193481445, 'learning_rate': 3.074074074074074e-05, 'epoch': 1.16}
{'loss': 0.719, 'grad_norm': 4.517435550689697, 'learning_rate': 3.055555555555556e-05, 'epoch': 1.17}
{'loss': 0.7018, 'grad_norm': 9.512813568115234, 'learning_rate': 3.037037037037037e-05, 'epoch': 1.18}
{'loss': 0.7334, 'grad_norm': 5.499905586242676, 'learning_rate': 3.018518518518519e-05, 'epoch': 1.19}
{'loss': 0.6985, 'grad_norm': 4.4730682373046875, 'learning_rate': 3e-05, 'epoch': 1.2}
{'loss': 0.7111, 'grad_norm': 2.2514681816101074, 'learning_rate': 2.981481481481482e-05, 'epoch': 1.21}
{'loss': 0.6889, 'grad_norm': 9.093832015991211, 'learning_rate': 2.96296296

  0%|          | 0/313 [00:00<?, ?it/s]

{'eval_loss': 0.5048370361328125, 'eval_runtime': 38.2447, 'eval_samples_per_second': 130.737, 'eval_steps_per_second': 8.184, 'epoch': 2.22}



{'loss': 0.6429, 'grad_norm': 6.034804344177246, 'learning_rate': 1.2777777777777777e-05, 'epoch': 2.23}
{'loss': 0.5309, 'grad_norm': 5.302908420562744, 'learning_rate': 1.2592592592592592e-05, 'epoch': 2.24}
{'loss': 0.6923, 'grad_norm': 10.327276229858398, 'learning_rate': 1.2407407407407408e-05, 'epoch': 2.26}
{'loss': 0.8524, 'grad_norm': 17.71708869934082, 'learning_rate': 1.2222222222222222e-05, 'epoch': 2.27}
{'loss': 0.6313, 'grad_norm': 5.244909286499023, 'learning_rate': 1.2037037037037037e-05, 'epoch': 2.28}
{'loss': 0.6057, 'grad_norm': 16.332304000854492, 'learning_rate': 1.1851851851851853e-05, 'epoch': 2.29}
{'loss': 0.5964, 'grad_norm': 11.331141471862793, 'learning_rate': 1.1666666666666668e-05, 'epoch': 2.3}
{'loss': 0.5842, 'grad_norm': 11.49596118927002, 'learning_rate': 1.1481481481481482e-05, 'epoch': 2.31}
{'loss': 0.5841, 'grad_norm': 8.027848243713379, 'learning_rate': 1.1296296296296297e-05, 'epoch': 2.32}
{'loss': 0.5454, 'grad_norm': 16.432218551635742, 'le

  0%|          | 0/313 [00:00<?, ?it/s]

Iteration 3 - Evaluation results: {'eval_loss': 0.33840757608413696, 'eval_runtime': 38.0562, 'eval_samples_per_second': 131.385, 'eval_steps_per_second': 8.225}
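
The `uncertainty_sampling` cell scores each unlabeled review as one minus the probability of its predicted class (least confidence) and queries the highest-scoring ones. The sketch below reproduces that ranking on toy logits in pure Python so it can be checked by hand; the helper names and the toy values are made up for illustration, and `entropy` is shown only as a common alternative score that the notebook does not use.

```python
import math

def softmax(logits):
    """Numerically stable softmax over one row of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def least_confidence(probs):
    """1 - p(predicted class): 0 when certain, 0.5 at most for two classes."""
    return 1.0 - max(probs)

def entropy(probs):
    """Shannon entropy in nats; an alternative uncertainty score."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def top_n_uncertain(all_logits, n, score=least_confidence):
    """Return the positions of the n rows with the highest uncertainty score."""
    scores = [score(softmax(row)) for row in all_logits]
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return order[:n]

# Toy two-class logits: rows 1 and 3 are close to a coin flip, rows 0 and 2 are confident
toy_logits = [[4.0, -4.0], [0.1, -0.1], [-2.0, 2.5], [0.0, 0.0]]
print(top_n_uncertain(toy_logits, 2))  # [3, 1]
```

With only two classes, least confidence and entropy are both monotone in the top-class probability, so they produce the same ranking; the choice of score starts to matter once there are three or more classes.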


In [38]:
# Return the labeled dataset as a DataFrame.
# Look up the original reviews by index instead of decoding input_ids: decoding would
# give lowercased text truncated to 128 tokens, with [CLS]/[SEP]/[PAD] markers included.
labeled_indices = sorted(al_dataset.labeled_indices)
final_texts = [train_texts[i] for i in labeled_indices]
final_labels = [train_labels[i] for i in labeled_indices]

# Map labels to 'negative' and 'positive'
label_mapping = {0: 'negative', 1: 'positive'}
final_labels_mapped = [label_mapping[label] for label in final_labels]

labeled_df = pd.DataFrame({'text': final_texts, 'label': final_labels_mapped})
labeled_df.to_csv('sentiment.csv', index=False)
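
The evaluation dictionaries printed by the active learning loop contain only `eval_loss` and timing figures, because no metric function was given to the `Trainer`. A minimal sketch of an accuracy metric follows, written in pure Python so it runs without extra dependencies. It assumes the argument unpacks into `(logits, labels)`, which is how `Trainer` is generally expected to call a `compute_metrics` function; verify that against the installed transformers version.

```python
def compute_metrics(eval_pred):
    """Accuracy from (logits, labels); each row of logits holds per-class scores."""
    logits, labels = eval_pred
    # argmax per row, written so it works for nested lists and array-like rows alike
    predictions = [max(range(len(row)), key=lambda c: row[c]) for row in logits]
    correct = sum(int(p == y) for p, y in zip(predictions, labels))
    return {"accuracy": correct / len(predictions)}

# Toy check: predictions are [0, 1, 0, 1] against labels [0, 1, 1, 1]
toy_logits = [[2.0, -1.0], [0.3, 0.9], [1.5, 1.0], [-0.2, 0.4]]
toy_labels = [0, 1, 1, 1]
print(compute_metrics((toy_logits, toy_labels)))  # {'accuracy': 0.75}
```

Passing it as `Trainer(..., compute_metrics=compute_metrics)` should add an accuracy entry next to `eval_loss`, which makes the three iterations easier to compare than loss alone.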

