<a href="https://colab.research.google.com/github/pmadhyastha/INM434/blob/main/Generative_AI_a_focus_on_NLP_technologies_of_the_present.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
!pip install -q -U transformers datasets accelerate bitsandbytes

In [None]:
!pip install peft trl

### We will now see the ease of using pre-trained models for text generation tasks with just a few lines of code using the Hugging Face Transformers library.

Remember to switch to a GPU runtime.

In [None]:
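from transformers import pipeline

# Load a small pre-trained GPT-2 variant behind the text-generation pipeline
model_name = "distilgpt2"
generator = pipeline('text-generation', model=model_name)

prompt = "I'd like to write a poem about the beauty of nature."
# Returning several different sequences requires sampling (or beam search);
# max_length counts the prompt tokens plus the generated tokens
result = generator(prompt, max_length=50, num_return_sequences=3, do_sample=True)

# Each result is a dict whose 'generated_text' holds the prompt and its continuation
for output in result:
    print(output['generated_text'])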


### What does the model generate?

### In the following code, we will fine-tune GPT-2 as a classifier for sentiment analysis. We will begin with a standard sentiment dataset: the IMDB movie reviews dataset.

In [None]:
from datasets import load_dataset
from transformers import GPT2Tokenizer

# Load a sentiment analysis dataset
dataset = load_dataset("imdb")

# GPT-2 has no padding token by default, so reuse the end-of-sequence token
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token

def tokenize_function(examples):
    # Pad or truncate every review to the tokenizer's maximum length
    return tokenizer(examples["text"], padding="max_length", truncation=True)

# Tokenize all splits in batches
tokenized_datasets = dataset.map(tokenize_function, batched=True)

### In the code above, we load the IMDB sentiment analysis dataset.

### We then load the GPT-2 tokenizer and apply it to the text in the dataset. The resulting `tokenized_datasets` variable holds the tokenized text.
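
To make the effect of `padding="max_length"` and `truncation=True` concrete, here is a toy sketch in plain Python. The helper `pad_or_truncate`, the token ids, and the pad id `0` are all made up for illustration; this is not the tokenizer's actual implementation, only the idea behind it, assuming right-side padding and truncation.

```python
def pad_or_truncate(token_ids, max_length, pad_id):
    """Toy illustration of fixed-length padding and truncation (hypothetical helper)."""
    kept = token_ids[:max_length]                    # truncation: keep the first max_length ids
    n_pad = max_length - len(kept)
    input_ids = kept + [pad_id] * n_pad              # right padding up to max_length
    attention_mask = [1] * len(kept) + [0] * n_pad   # 1 = real token, 0 = padding
    return {"input_ids": input_ids, "attention_mask": attention_mask}

short = pad_or_truncate([11, 12, 13], max_length=5, pad_id=0)
long = pad_or_truncate([11, 12, 13, 14, 15, 16, 17], max_length=5, pad_id=0)
print(short)  # {'input_ids': [11, 12, 13, 0, 0], 'attention_mask': [1, 1, 1, 0, 0]}
print(long)   # {'input_ids': [11, 12, 13, 14, 15], 'attention_mask': [1, 1, 1, 1, 1]}
```

Every example ends up the same length, which is what lets the examples be stacked into a batch. The cost is that short reviews carry many padding positions.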

### Now we will set up a training pipeline for the task, using the pre-trained GPT-2 language model and the Hugging Face Transformers library.

In [None]:
from transformers import GPT2ForSequenceClassification, Trainer, TrainingArguments

# Pre-trained GPT-2 body with a new, randomly initialised 2-way classification head
model = GPT2ForSequenceClassification.from_pretrained("gpt2", num_labels=2)

training_args = TrainingArguments(
    output_dir="./outputs",
    learning_rate=2e-5,
    per_device_train_batch_size=8,
    num_train_epochs=3,
    weight_decay=0.01,
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["test"],
)

trainer.train()

### Does this crash? Can you reason about why that might be? Can you switch the model to `distilgpt2`, as we did before?

Consider using: `from transformers import AutoModelForSequenceClassification, AutoTokenizer, TrainingArguments, Trainer`
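
One possible cause worth checking (memory limits are another) is the padding token. Setting `tokenizer.pad_token` changes the tokenizer only; the model's own config still has no `pad_token_id`. The sketch below reproduces that situation on a tiny, randomly initialised GPT-2, so it needs no download. The config sizes, the pad id `0`, and the name `toy_model` are illustrative assumptions, not the real `gpt2` settings.

```python
import torch
from transformers import GPT2Config, GPT2ForSequenceClassification

# Tiny random model so the sketch runs quickly; sizes are illustrative assumptions
config = GPT2Config(vocab_size=100, n_positions=32, n_embd=16,
                    n_layer=1, n_head=2, num_labels=2)
toy_model = GPT2ForSequenceClassification(config)
toy_model.eval()

# A batch of two sequences, right-padded with a hypothetical pad id of 0
batch = torch.tensor([[5, 6, 7, 0], [8, 9, 0, 0]])

# 1) No pad token in the model config: batches larger than one are rejected
try:
    with torch.no_grad():
        toy_model(input_ids=batch)
    error_raised = False
except (ValueError, AssertionError) as err:
    error_raised = True
    print("Error:", err)

# 2) Tell the model which id is padding; the same batch now goes through
toy_model.config.pad_token_id = 0
with torch.no_grad():
    logits = toy_model(input_ids=batch).logits
print(logits.shape)
```

In the notebook's setting, the analogous line would be `model.config.pad_token_id = tokenizer.pad_token_id`, placed before the `Trainer` is built.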

In [None]:
# This is the code for evaluating the model that we have trained!

trainer.evaluate()


###  Are you able to evaluate the model? What do you see?  
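
Without a metrics function, `trainer.evaluate()` reports the evaluation loss and timing statistics but no accuracy. A minimal sketch of an accuracy metric is given below. The helper `compute_metrics` and the made-up logits are assumptions for illustration and are not part of the original notebook.

```python
import numpy as np

def compute_metrics(eval_pred):
    """Accuracy from a (logits, labels) pair; a hypothetical helper for illustration."""
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)   # predicted class per example
    return {"accuracy": float((predictions == labels).mean())}

# Quick check on made-up logits: predictions are [1, 0, 1, 0], so 3 of 4 match
fake_logits = np.array([[0.1, 0.9], [2.0, -1.0], [0.3, 0.4], [1.5, 0.2]])
fake_labels = np.array([1, 0, 0, 0])
metrics = compute_metrics((fake_logits, fake_labels))
print(metrics)  # {'accuracy': 0.75}
```

To use it, pass `compute_metrics=compute_metrics` when constructing the `Trainer`; the accuracy should then appear alongside the loss in the dictionary returned by `trainer.evaluate()`.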

### In the following code, we will perform inference with the model.

In [5]:
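import torch

def predict(sentence):
    # Tokenize, then move the inputs to the model's device (the GPU after training)
    inputs = tokenizer(sentence, return_tensors="pt").to(model.device)
    model.eval()
    with torch.no_grad():  # no gradients are needed at inference time
        outputs = model(**inputs)
    prediction = outputs.logits.argmax(-1).item()
    return "positive" if prediction == 1 else "negative"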

sentence = "Today's class was interesting but a tad complex!"
print(predict(sentence))

positive
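
The `argmax` in `predict` returns only the winning label and hides how close the decision was. A small sketch of turning two logits into class probabilities with a softmax follows. The logit values are made up, and the `softmax` helper is written out by hand for illustration rather than taken from the notebook.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]   # subtract the max to avoid overflow
    total = sum(exps)
    return [e / total for e in exps]

example_logits = [-0.3, 1.1]   # made-up values, ordered as [negative, positive]
probs = softmax(example_logits)
label = "positive" if probs[1] > probs[0] else "negative"
print(label, [round(p, 3) for p in probs])
```

With the real model, the same idea would be applied to `outputs.logits`, for example via `torch.softmax(outputs.logits, dim=-1)`.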
