#### Question 1: Sentiment Analysis with Transformers
Dataset Problem: Use the IMDB movie reviews dataset to perform sentiment analysis with a Transformer model. Load the dataset from the TensorFlow Datasets library and solve the problem.

Because Transformer models are large and complex, use them through a library such as Hugging Face's Transformers. Feel free to experiment with more than one Transformer model, compare the results, and briefly explain which model performs best and why.


In [2]:
from transformers import AutoTokenizer, AutoModelForSequenceClassification, TrainingArguments, Trainer
from datasets import load_dataset
import numpy as np
import evaluate
import warnings
warnings.filterwarnings('ignore')

In [3]:
from transformers.utils import is_accelerate_available
import torch, transformers, accelerate

print("PyTorch:", torch.__version__)
print("Transformers:", transformers.__version__)
print("Accelerate:", accelerate.__version__)
print("is_accelerate_available():", is_accelerate_available())


PyTorch: 2.9.0+cpu
Transformers: 4.57.1
Accelerate: 1.11.0
is_accelerate_available(): True


In [6]:
dataset = load_dataset("imdb")
dataset

DatasetDict({
    train: Dataset({
        features: ['text', 'label'],
        num_rows: 25000
    })
    test: Dataset({
        features: ['text', 'label'],
        num_rows: 25000
    })
    unsupervised: Dataset({
        features: ['text', 'label'],
        num_rows: 50000
    })
})

In [7]:
models_to_test = ["distilbert-base-uncased", "bert-base-uncased"]

In [8]:
results = {}

for model_name in models_to_test:
    print(f"\n--- Training {model_name} ---")

    tokenizer = AutoTokenizer.from_pretrained(model_name)

    def tokenize_function(example):
        return tokenizer(example["text"], truncation=True, padding="max_length", max_length=256)

    tokenized_datasets = dataset.map(tokenize_function, batched=True)
    tokenized_datasets = tokenized_datasets.remove_columns(["text"])
    tokenized_datasets = tokenized_datasets.rename_column("label", "labels")
    tokenized_datasets.set_format("torch")

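    # Subsample to keep single-epoch CPU training manageable:
    # 4,000 training reviews and 1,000 test reviews, shuffled with a fixed seed.
    small_train_dataset = tokenized_datasets["train"].shuffle(seed=42).select(range(4000))
    small_eval_dataset = tokenized_datasets["test"].shuffle(seed=42).select(range(1000))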

    model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)

    accuracy = evaluate.load("accuracy")

    def compute_metrics(eval_pred):
        logits, labels = eval_pred
        predictions = np.argmax(logits, axis=-1)
        return accuracy.compute(predictions=predictions, references=labels)

    training_args = TrainingArguments(
        output_dir=f"./results_{model_name}",
        per_device_train_batch_size=8,
        per_device_eval_batch_size=8,
        num_train_epochs=1,
        learning_rate=2e-5,
        weight_decay=0.01,
        logging_dir='./logs',
        do_train=True,
        do_eval=True,
        save_steps=10000,
        save_total_limit=1,
        use_cpu=True,  # replaces the deprecated no_cuda=True; forces training on CPU
    )

    trainer = Trainer(
        model=model,
        args=training_args,
        train_dataset=small_train_dataset,
        eval_dataset=small_eval_dataset,
        processing_class=tokenizer,  # newer name for the deprecated tokenizer= argument
        compute_metrics=compute_metrics,
    )

    trainer.train()
    eval_result = trainer.evaluate()
    results[model_name] = eval_result

print("\n=== Final Results ===")
for model_name, metrics in results.items():
    print(f"{model_name}: {metrics}")
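
The `compute_metrics` function above turns each row of logits into a class prediction with `argmax` and then scores those predictions against the labels. A small pure-Python sketch of the same computation may make this concrete. The logits are made up for illustration and `accuracy_from_logits` is a hypothetical helper, not part of the notebook or of the `evaluate` library.

```python
def accuracy_from_logits(logits, labels):
    """Return the fraction of rows whose highest-scoring class equals the label."""
    # argmax per row: index of the largest logit
    predictions = [max(range(len(row)), key=row.__getitem__) for row in logits]
    correct = sum(int(p == y) for p, y in zip(predictions, labels))
    return correct / len(labels)


# Hypothetical logits for four reviews; column 0 = negative, column 1 = positive.
example_logits = [[2.1, -1.3], [-0.4, 0.9], [0.2, 0.1], [-1.5, 2.2]]
example_labels = [0, 1, 1, 1]

# The third review is misclassified, so three of four predictions are correct.
print(accuracy_from_logits(example_logits, example_labels))
```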


--- Training distilbert-base-uncased ---


Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Step,Training Loss
500,0.3837



--- Training bert-base-uncased ---


Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Step,Training Loss
500,0.3792



=== Final Results ===
distilbert-base-uncased: {'eval_loss': 0.3158005475997925, 'eval_accuracy': 0.872, 'eval_runtime': 134.4118, 'eval_samples_per_second': 7.44, 'eval_steps_per_second': 0.93, 'epoch': 1.0}
bert-base-uncased: {'eval_loss': 0.28238561749458313, 'eval_accuracy': 0.889, 'eval_runtime': 273.3746, 'eval_samples_per_second': 3.658, 'eval_steps_per_second': 0.457, 'epoch': 1.0}
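
The question also asks for a comparison of the two models. As a minimal sketch, the snippet below copies the evaluation figures printed above into a plain dictionary and derives the accuracy gap and the relative evaluation time. The variable names are illustrative and not part of the original notebook.

```python
# Evaluation figures copied from the "Final Results" output above.
reported = {
    "distilbert-base-uncased": {"eval_accuracy": 0.872, "eval_runtime": 134.4118},
    "bert-base-uncased": {"eval_accuracy": 0.889, "eval_runtime": 273.3746},
}

# Model with the highest evaluation accuracy.
best_model = max(reported, key=lambda name: reported[name]["eval_accuracy"])

accuracy_gap = (reported["bert-base-uncased"]["eval_accuracy"]
                - reported["distilbert-base-uncased"]["eval_accuracy"])
runtime_ratio = (reported["bert-base-uncased"]["eval_runtime"]
                 / reported["distilbert-base-uncased"]["eval_runtime"])

print(f"Best accuracy: {best_model}")
print(f"Accuracy gap: {accuracy_gap:.3f}")
print(f"BERT / DistilBERT evaluation time: {runtime_ratio:.2f}x")
```

On this 1,000-review evaluation subset, `bert-base-uncased` is 1.7 percentage points more accurate but takes roughly twice as long to evaluate on CPU. A plausible reason (an interpretation, not something measured here) is capacity: BERT-base has 12 Transformer layers, whereas DistilBERT is a 6-layer model distilled from it. With a single epoch on 4,000 training examples and a gap this small, the difference should be read as indicative and not conclusive.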


#### Question 2: Text Generation with Transformers

Dataset Problem: Using a pre-trained GPT model (any version) from Hugging Face's Transformers, generate a short story based on a given prompt. An example prompt is given below.

Prompt = "In a distant future, humanity has discovered"


In [9]:
from transformers import pipeline, set_seed

In [10]:
generator = pipeline("text-generation", model="gpt2")
set_seed(42)

Device set to use cpu


In [11]:
prompt = "In a distant future, humanity has discovered"
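# Note: in this Transformers version the pipeline's default max_new_tokens=256 takes
# precedence over max_length=100 (a warning is logged); pass max_new_tokens=100 to cap
# the continuation at 100 tokens.
generated_text = generator(prompt, max_length=100, num_return_sequences=1, temperature=0.8)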
print(generated_text[0]['generated_text'])
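
The `temperature=0.8` argument rescales the model's next-token logits before sampling (when sampling is enabled): values below 1 sharpen the distribution toward the most likely tokens, and values above 1 flatten it. The following is a minimal sketch of that rescaling with made-up logits for three candidate tokens. It is not the library's implementation, and `softmax_with_temperature` is a hypothetical helper.

```python
import math


def softmax_with_temperature(logits, temperature=1.0):
    """Convert logits to probabilities after dividing them by the temperature."""
    scaled = [value / temperature for value in logits]
    peak = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(value - peak) for value in scaled]
    total = sum(exps)
    return [value / total for value in exps]


# Hypothetical next-token logits for three candidate tokens.
logits = [2.0, 1.0, 0.5]

# Lower temperatures put more probability mass on the first (highest-logit) token.
for temperature in (1.0, 0.8, 0.5):
    probs = softmax_with_temperature(logits, temperature)
    print(temperature, [round(p, 3) for p in probs])
```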

Truncation was not explicitly activated but `max_length` is provided a specific value, please use `truncation=True` to explicitly truncate examples to max length. Defaulting to 'longest_first' truncation strategy. If you encode pairs of sequences (GLUE-style) with the tokenizer you can select this strategy more precisely by providing a specific strategy to `truncation`.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
Both `max_new_tokens` (=256) and `max_length`(=100) seem to have been set. `max_new_tokens` will take precedence. Please refer to the documentation for more information. (https://huggingface.co/docs/transformers/main/en/main_classes/text_generation)


In a distant future, humanity has discovered that there is a way to stop the threat that our ancestors created. In order to do so, it will require a great deal of courage.

The power of our intelligence. Without it, there would be no civilization, no civilization, and no culture.

The human race would die. And because of the human race, it would be impossible to stop.

You have heard the story of the human race, but what about the next generation of the human race?

I have heard about it, but how is that possible?

The next generation is going to have many choices. Are they going to be able to choose between a life of hard work or a life of peace, or maybe a life of freedom and independence?

If you look at what other people have done, they have done to destroy the human race.

We have gone from a place of freedom in which we have to work, to freedom in which we have to save the world.

It was this freedom that saved us from extinction.

I had that freedom when I was a kid and when I w
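
The story above runs well past 100 tokens because of how the two length limits interact. In Hugging Face generation, `max_length` caps the prompt plus the continuation, while `max_new_tokens` caps the continuation only, and the warning above states that `max_new_tokens` takes precedence when both are set. The sketch below illustrates that rule under stated assumptions: `new_token_budget` is a hypothetical helper, not a library function, and the prompt length of 8 tokens is an assumed value for illustration.

```python
def new_token_budget(prompt_length, max_length=None, max_new_tokens=None):
    """Number of tokens generation may add under the two kinds of limit.

    max_length caps prompt plus continuation; max_new_tokens caps the
    continuation only and wins when both are given.
    """
    if max_new_tokens is not None:
        return max_new_tokens
    if max_length is not None:
        return max(max_length - prompt_length, 0)
    raise ValueError("set max_length or max_new_tokens")


prompt_length = 8  # assumed token count for the prompt (illustrative only)

# Only max_length set: the prompt counts against the limit.
print(new_token_budget(prompt_length, max_length=100))
# Both set, as in the warning: max_new_tokens decides.
print(new_token_budget(prompt_length, max_length=100, max_new_tokens=256))
# Explicit cap on the continuation.
print(new_token_budget(prompt_length, max_new_tokens=100))
```

Passing `max_new_tokens=100` to the generator, instead of `max_length=100`, is therefore the more direct way to request a continuation of at most 100 tokens.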

In [None]:
for model_name in ["gpt2", "gpt2-medium", "EleutherAI/gpt-neo-1.3B"]:
    print(f"\n--- {model_name} ---")
    gen = pipeline("text-generation", model=model_name)
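    # Note: the pipeline's default max_new_tokens=256 overrides max_length=100 here
    # (a warning is logged); pass max_new_tokens=100 to cap the continuation instead.
    output = gen(prompt, max_length=100, temperature=0.8)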
    print(output[0]["generated_text"])


--- gpt2 ---


Device set to use cpu


In a distant future, humanity has discovered the only way to stop the universe's growing pains. The human race has been waging a war against the cosmic forces of the present, and has begun to study the energies of the future. The technology of space travel has become one of the top technologies of the present, and mankind is poised to make the ultimate decision. Humanity is in danger, and they are prepared to take drastic measures to stop the growing pains of the universe. In an effort to save the galaxy, humanity has been given the ability to utilize the force of the cosmic forces of the future as a weapon to wipe out the forces of the past. With the help of the universe, mankind has taken control of the galaxy, and is on the map to conquer the entire galaxy. But to do this, they must first conquer the universe itself. Humanity has become a galactic government, and is now the most important country in the galaxy. To destroy the cosmic forces of the future, humanity must destroy the hu

--- gpt2-medium ---

Device set to use cpu


In a distant future, humanity has discovered an alternative to the "good" that has been on the minds of most. The "bad" is no longer good and we no longer have to live in fear of being devoured. Humanity is now able to use the resources and technology available to them. It doesn't cost money or work; instead, people just share them. They do not need to get a job or buy a house. They don't need to buy food or clothes. They don't even have to go to war to survive. You don't need to wear a suit and tie when you are on the road, or use a cell phone. You don't even have to be a child anymore.

"The story of The Good is No Longer Good" is a tale of opportunity, growth, and evolution. Humans are no longer ruled by fear, by conformity and conformity's children. We are free to evolve with the times and become what we are capable of. The tale will be told through the lives of some of our heroes who have made the world a better place. Please join me as we embark on a journey into a world where hu

config.json: 0.00B [00:00, ?B/s]

model.safetensors:   0%|          | 0.00/5.31G [00:00<?, ?B/s]