# Assignment 2: Worked Example

**Generative AI: Use Case**

In this assignment, our primary goal is to provide personalized suggestions based on a person's mood, using generative AI and prompt engineering. Our dataset pairs a set of issues with corresponding solutions. For instance, if a person reports feeling anxious, the model suggests remedies such as talking with a therapist or taking a peaceful walk outdoors. We will fine-tune the model on this dataset so that it offers meaningful, contextually relevant recommendations that promote mental well-being and a positive user experience.

**Installing Required Libraries**

In [None]:
!pip install --upgrade pip
!pip install --disable-pip-version-check \
    torch==1.13.1 \
    torchdata==0.5.1 --quiet


!pip install \
    transformers==4.27.2 \
    datasets==2.11.0 \
    evaluate==0.4.0 \
    rouge_score==0.1.2 \
    loralib==0.1.1 \
    peft==0.3.0 --quiet

**Importing the Libraries**

In [None]:
import torch
import time
import evaluate
import pandas as pd
import numpy as np
from transformers import GPT2LMHeadModel, GPT2Tokenizer, GPT2Config
from transformers import TextDataset, DataCollatorForLanguageModeling
from transformers import Trainer, TrainingArguments

**About the model**

GPT-2 XL is a large language model created by OpenAI. The "XL" marks it as the largest model in the GPT-2 family, with about 1.5 billion parameters (the learned weights of the network). It is pre-trained on a large collection of web text to predict the next token, so it can be used to write or complete sentences, answer questions, and do other language-related tasks. It can also be fine-tuned for specific jobs. Because of its size, loading and fine-tuning it requires a lot of memory and compute.
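
To make the "1.5 billion" figure concrete, here is a rough sketch that counts the parameters of a GPT-2 style network from its published configuration (48 layers and a hidden size of 1600 for GPT-2 XL). The helper `gpt2_param_count` is our own illustration, not part of `transformers`, and the memory figure covers only the weights in float32; training needs noticeably more for gradients and optimizer state.

```python
def gpt2_param_count(n_layer, d_model, vocab_size=50257, n_positions=1024):
    """Count parameters of a GPT-2 style decoder with tied input/output embeddings."""
    # Token embeddings plus learned position embeddings
    embeddings = vocab_size * d_model + n_positions * d_model
    # Per block: attention (3d^2 + 3d and d^2 + d), MLP (4d^2 + 4d and 4d^2 + d),
    # and two layer norms (2d each), which sums to 12d^2 + 13d
    per_block = 12 * d_model**2 + 13 * d_model
    # Final layer norm after the last block
    final_layer_norm = 2 * d_model
    return embeddings + n_layer * per_block + final_layer_norm


xl_params = gpt2_param_count(n_layer=48, d_model=1600)
fp32_gib = xl_params * 4 / 1024**3  # 4 bytes per float32 weight
print(f"{xl_params:,} parameters, about {fp32_gib:.1f} GiB in float32")
```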

In [None]:
model_name = "gpt2-xl"

tokenizer = GPT2Tokenizer.from_pretrained(model_name)
model = GPT2LMHeadModel.from_pretrained(model_name)

**Dataset**

We picked a dataset that has two parts: "Condition" and "Solution." In the first part, we list the problems or challenges people might face, and in the second part, we suggest ways to overcome those issues. For instance, if someone is feeling really down (that's the condition), the solution could be talking to friends, spending time with family, being around people, or speaking to a therapist. It's like a guide offering ideas to help deal with different situations people might find themselves in.
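
As a quick illustration of this two-column layout, the sketch below builds a tiny DataFrame with the same column names. The rows are hypothetical examples based on the description above, not actual entries from `dataset.csv`.

```python
import pandas as pd

# Hypothetical rows for illustration only; the real dataset.csv may differ.
example_df = pd.DataFrame(
    {
        "Condition": ["Feeling anxious", "Feeling really down"],
        "Solution": [
            "Talk to a therapist or take a peaceful walk outdoors.",
            "Talk to friends, spend time with family, or speak to a therapist.",
        ],
    }
)

# Each row pairs one condition with one suggested solution
for condition, solution in zip(example_df["Condition"], example_df["Solution"]):
    print(f"{condition} -> {solution}")
```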

**Loading the Dataset**

In [None]:
# Load your CSV dataset
df = pd.read_csv('dataset.csv')

# Save the dataset in a text file
df.to_csv('dataset.txt', sep='\t', index=False, header=False)

In [None]:
index = 0  # Row of the dataset to use for the zero-shot example

sentence_prefix = df["Condition"][index]
solution = df["Solution"][index]

prompt = f"""
Answer the following question.

{sentence_prefix}

Solution:
"""

inputs = tokenizer(prompt, return_tensors='pt')
output = tokenizer.decode(
    model.generate(
        inputs["input_ids"], 
        max_new_tokens=200,
    )[0], 
    skip_special_tokens=True
)

dash_line = '-' * 99
print(dash_line)
print(f'INPUT PROMPT:\n{prompt}')
print(dash_line)
print(f'BASELINE HUMAN SOLUTION:\n{solution}\n')
print(dash_line)
print(f'MODEL GENERATION - ZERO SHOT:\n{output}')
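
If we want to try the same zero-shot prompt on several conditions, the template can be wrapped in a small function. This is only a sketch; `build_prompt` is a hypothetical helper name that we introduce here, and it reproduces the prompt layout used in the cell above.

```python
def build_prompt(condition: str) -> str:
    """Return the zero-shot prompt for a single condition."""
    return f"\nAnswer the following question.\n\n{condition}\n\nSolution:\n"


# Example: build prompts for a few conditions (hypothetical values)
for condition in ["Stress", "Anxiety"]:
    print(build_prompt(condition))
```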

# Performing Fine-Tuning

**Preprocessing the Dataset**

Imagine you're trying to cook a delicious meal, but the ingredients you have are messy, incomplete, and inconsistent. It would be quite challenging to create a flavorful dish in that situation. Similarly, in machine learning, data preprocessing is like cleaning and preparing the ingredients before cooking. It involves handling missing values, removing noise, and ensuring consistency in the data. This crucial step ensures the data is of high quality, suitable for analysis, and leads to more accurate and reliable results.
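
The cleaning steps described above could look roughly like the sketch below. It assumes the "Condition" and "Solution" columns described earlier; the helper `clean_pairs` and the sample rows are made up for illustration and are not part of the original pipeline.

```python
import pandas as pd


def clean_pairs(df: pd.DataFrame) -> pd.DataFrame:
    """Drop incomplete rows, trim whitespace, and remove duplicate pairs."""
    # Handle missing values: remove rows where either column is missing
    cleaned = df.dropna(subset=["Condition", "Solution"]).copy()
    # Ensure consistency: trim stray whitespace around the text
    for col in ["Condition", "Solution"]:
        cleaned[col] = cleaned[col].astype(str).str.strip()
    # Remove rows that are empty after trimming
    cleaned = cleaned[(cleaned["Condition"] != "") & (cleaned["Solution"] != "")]
    # Remove noise: drop exact duplicate pairs
    return cleaned.drop_duplicates().reset_index(drop=True)


# Hypothetical messy input
raw = pd.DataFrame(
    {
        "Condition": [" Stress ", "Stress", None, "Anxiety"],
        "Solution": ["Take a walk", "Take a walk", "Rest", "  Talk to a therapist"],
    }
)
cleaned = clean_pairs(raw)
print(cleaned)
```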

In [None]:
model_name = "gpt2-xl"

tokenizer = GPT2Tokenizer.from_pretrained(model_name)
model = GPT2LMHeadModel.from_pretrained(model_name)

In [None]:
# Tokenize the dataset
def tokenize_dataset(file_path):
    return TextDataset(
        tokenizer=tokenizer,
        file_path=file_path,
        block_size=128,  # Adjust the block size according to your dataset and available memory
    )

train_dataset = tokenize_dataset('dataset.txt')

# Create data collator
data_collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer,
    mlm=False  # We are not doing masked language modeling here
)
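
To give a feel for what `block_size` does, here is a simplified sketch in plain Python. As far as we understand, `TextDataset` tokenizes the whole file into one long sequence and cuts it into fixed-size blocks, dropping a trailing block that is too short. The function `split_into_blocks` is our own illustration of that idea, not the library's code, and it ignores special tokens.

```python
def split_into_blocks(token_ids, block_size):
    """Cut a token sequence into consecutive blocks of exactly block_size tokens."""
    return [
        token_ids[i : i + block_size]
        # Stop early so that an incomplete final block is dropped
        for i in range(0, len(token_ids) - block_size + 1, block_size)
    ]


token_ids = list(range(10))  # Stand-in for real token IDs
blocks = split_into_blocks(token_ids, block_size=4)
print(blocks)
```

Because every block has the same length, the data collator can stack them into a batch without padding.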

**Setting up training arguments**

In [None]:
# Training arguments
training_args = TrainingArguments(
    output_dir="./chatbot_model",
    overwrite_output_dir=True,
    num_train_epochs=3,
    per_device_train_batch_size=4,
    save_steps=10_000,
    save_total_limit=2,
)
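
A quick way to sanity-check these settings is to estimate how many optimization steps the run will take. The sketch below assumes a single device and no gradient accumulation; the figure of 1,000 training blocks is a made-up example, not the size of our dataset.

```python
import math


def total_training_steps(num_examples, batch_size, epochs):
    """Estimate optimizer steps for single-device training without gradient accumulation."""
    steps_per_epoch = math.ceil(num_examples / batch_size)
    return steps_per_epoch * epochs


# Hypothetical dataset of 1,000 blocks with the settings above
steps = total_training_steps(num_examples=1000, batch_size=4, epochs=3)
print(steps)
```

With numbers like these, the total stays well below `save_steps=10_000`, so no intermediate checkpoint would be written during training and the model is only saved by the explicit `save_pretrained` call.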

**Initialize Trainer and fine-tune the model**

**What is fine tuning?**

In machine learning, fine-tuning is a way to make a pre-trained model work better for a specific task. Instead of training from scratch, we continue training the model on a smaller, task-specific dataset so that it adapts to the new job. This is especially useful when you don't have enough data to train a model from scratch and want it to perform well on a specific task.

In [None]:
# Trainer
trainer = Trainer(
    model=model,
    args=training_args,
    data_collator=data_collator,
    train_dataset=train_dataset,
)

# Start fine-tuning
trainer.train()

# Save the fine-tuned model
model.save_pretrained("./chatbot_model")
tokenizer.save_pretrained("./chatbot_model")

**Loading the fine-tuned model to generate results**

In [None]:
# Load the fine-tuned model and tokenizer saved above
fine_tuned_model = GPT2LMHeadModel.from_pretrained("./chatbot_model")
fine_tuned_tokenizer = GPT2Tokenizer.from_pretrained("./chatbot_model")

**Tokenizing the input text**

In [None]:
# Input text
input_text = "Stress"

# Tokenize input text with the tokenizer saved alongside the fine-tuned model
input_ids = fine_tuned_tokenizer.encode(input_text, return_tensors="pt")

**Generating the output for the given input text from the fine-tuned model**

In [None]:
# Maximum total length (prompt plus generated tokens); adjust as needed
max_length = 100

# Generate output from the model
output = fine_tuned_model.generate(input_ids, max_length=max_length, num_return_sequences=1)

# Decode and print the generated text
response = fine_tuned_tokenizer.decode(output[0], skip_special_tokens=True)
print(response)
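
Note that the decoded text from a GPT-2 model starts with the input text itself, followed by the continuation. If only the suggested solution should be shown, the prompt can be cut off first. The helper `strip_prompt` below is a hypothetical sketch of that step, and the sample strings are made up rather than real model output.

```python
def strip_prompt(generated: str, prompt: str) -> str:
    """Remove the leading prompt from generated text, if it is present."""
    if generated.startswith(prompt):
        return generated[len(prompt):].strip()
    # Fall back to the full text when the prompt is not an exact prefix
    return generated.strip()


# Made-up example of decoded output for the input "Stress"
print(strip_prompt("Stress Take a short walk outdoors.", "Stress"))
```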

# Conclusion

We fine-tuned the GPT-2 XL model on our condition-solution dataset and, for comparison, also ran zero-shot inference with the pre-trained model on the same dataset.