# Fine-tuning a model for interview question answering

This notebook is an experiment in fine-tuning a language model to answer interview questions. It builds a question-and-answer dataset from several public sources, augments it, and fine-tunes a model on the result.

## Requirements:
This code was written on a local machine with an Nvidia GTX 1660 Ti (6 GB) GPU, but you can also run it on Google Colab for free.

 - Anaconda
 - Nvidia CUDA Toolkit 11.1
 - Jupyter Notebook

## Links to the sources:

Sources for interview questions and answers:
- [https://github.com/sudheerj/angular-interview-questions](https://github.com/sudheerj/angular-interview-questions)
- [https://github.com/sudheerj/javascript-interview-questions](https://github.com/sudheerj/javascript-interview-questions)
- [https://github.com/sudheerj/reactjs-interview-questions](https://github.com/sudheerj/reactjs-interview-questions)
- [https://github.com/aershov24/full-stack-interview-questions](https://github.com/aershov24/full-stack-interview-questions)

## Model for paraphrasing:
We also use the following model for paraphrasing:
- [https://huggingface.co/google/flan-t5-small](https://huggingface.co/google/flan-t5-small)

## Model for fine-tuning:
And the following models for fine-tuning:
- [https://huggingface.co/databricks/dolly-v2-3b](https://huggingface.co/databricks/dolly-v2-3b)
- [https://huggingface.co/google/flan-t5-small](https://huggingface.co/google/flan-t5-small)

In [None]:
# Dataset preparation
!pip install datasets markdown beautifulsoup4
# GPU support
!pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu117
!pip install nvidia-ml-py3
# Data processing
!pip install numpy pandas nltk evaluate
# Training software
!pip install transformers accelerate
# Training visualization
!pip install tensorboard

## Create question and answer dataset from interviewing questions and answers

This code creates a dataset from the sources above. It parses the markdown files and saves the extracted pairs in JSON format.
In the sudheerj repositories each question is an `h3` heading, and in the aershov24 repository an `h4` heading. The answer is the content between one question heading and the next.
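
To make the heading-to-answer pairing concrete, here is a minimal standard-library sketch. It is a simplified stand-in, not the notebook's method: it scans raw markdown lines instead of converting to HTML and walking siblings with BeautifulSoup, and `split_questions` and the sample text are hypothetical.

```python
def split_questions(md_text, level=3):
    """Pair each heading of the given level with the text up to the next one."""
    marker = "#" * level + " "
    pairs = []
    question, answer_lines = None, []
    # A trailing sentinel heading flushes the last question/answer pair
    for line in md_text.splitlines() + [marker]:
        if line.startswith(marker):
            if question is not None:
                pairs.append({
                    "question": question,
                    "answer": "\n".join(answer_lines).strip(),
                })
            question, answer_lines = line[len(marker):].strip(), []
        elif question is not None:
            # Everything after a question heading belongs to its answer
            answer_lines.append(line)
    return pairs


# Hypothetical sample in the style of the source repositories
sample = (
    "# JavaScript Interview Questions\n\n"
    "### What is a closure?\n\n"
    "A function bundled with its lexical scope.\n\n"
    "### What is hoisting?\n\n"
    "Declarations are moved to the top of their scope.\n"
)

pairs = split_questions(sample)
for pair in pairs:
    print(pair)
```

Text that appears before the first question heading (such as the document title) is skipped, which mirrors how the notebook only collects siblings that follow a question heading.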

In [None]:
import os

import markdown
import pandas as pd
from bs4 import BeautifulSoup

sudheerj_paths = [
    os.path.join('..', 'data', 'interview', 'sudheerj', 'angular-interview-questions.md'),
    os.path.join('..', 'data', 'interview', 'sudheerj', 'javascript-interview-questions.md'),
    os.path.join('..', 'data', 'interview', 'sudheerj', 'reactjs-interview-questions.md'),
]

aershov24_paths = [
    os.path.join('..', 'data', 'interview', 'aershov24', 'full-stack-interview-questions.md')
]

questions_answers_path = os.path.join('..', 'datasets', 'interview', 'interview_questions.json')
augmented_questions_answers_path = os.path.join('..', 'datasets', 'interview', 'interview_questions_augmented.json')


# Extract questions and answers from markdown files
def parse_files(md_files, question_selector):
    data = pd.DataFrame()
    for md_file in md_files:
        with open(md_file, "r", encoding="utf-8") as file:
            md_content = file.read()
            html_content = markdown.markdown(md_content)
            soup = BeautifulSoup(html_content, "html.parser")

            questions = soup.select(question_selector)

            for question in questions:
                answer_elements = []
                sibling = question.find_next_sibling()

                while sibling and sibling.name != question_selector:
                    answer_elements.append(str(sibling))
                    sibling = sibling.find_next_sibling()

                # Name the parser explicitly to avoid bs4's "no parser specified" warning
                answer = BeautifulSoup(''.join(answer_elements).strip(), "html.parser")

                data = pd.concat([data, pd.DataFrame({
                    'question': [question.text.strip()],
                    'answer': [answer.text.strip()]
                })], ignore_index=True)
    return data


sudheerj_df = parse_files(sudheerj_paths, "h3")
aershov24_df = parse_files(aershov24_paths, "h4")

combine_df = pd.concat([sudheerj_df, aershov24_df], ignore_index=True)
combine_df.to_json(questions_answers_path, orient='records')
combine_df.tail()

## Augment dataset with paraphrasing

Augmentation is the process of creating new data from existing data. Here, new questions and answers are created by replacing a few words with WordNet synonyms.
It is done in the following steps:
1. Take the previously created dataset
2. Go through each question and answer and replace some words with synonyms
3. Save it

Yep, it is that simple. But it is enough to grow the dataset to up to three times its original size (the original rows plus two augmented copies, minus duplicates).
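
The effect on dataset size can be sketched without WordNet or pandas. Everything below is an assumption made for illustration: `TOY_SYNONYMS` is a hypothetical hand-written table standing in for WordNet, `swap_words` is a hypothetical helper, and the rows are invented.

```python
import random

# Hypothetical toy synonym table standing in for WordNet
TOY_SYNONYMS = {
    "use": ["utilize", "employ"],
    "make": ["create", "build"],
}


def swap_words(sentence, rng, max_replacements=2):
    """Replace up to `max_replacements` known words with a random synonym."""
    words, replaced = [], 0
    for word in sentence.split():
        options = TOY_SYNONYMS.get(word.lower())
        if options and replaced < max_replacements and rng.random() < 0.5:
            word = rng.choice(options)
            replaced += 1
        words.append(word)
    return " ".join(words)


rng = random.Random(0)  # seeded so repeated runs give the same rows
rows = [
    ("How do you make a component?", "You use a decorator."),
    ("Why is it fast?", "Because we use caching."),
]

# Original rows plus two independently augmented copies
augmented = list(rows)
for _ in range(2):
    augmented += [(swap_words(q, rng), swap_words(a, rng)) for q, a in rows]

# Drop exact duplicates while keeping the first occurrence of each row
unique_rows = list(dict.fromkeys(augmented))
print(f"Original: {len(rows)}, augmented: {len(unique_rows)}")
```

Because a row is only new when at least one word was actually replaced, the final size lands somewhere between the original size and three times that size.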

In [None]:
from tqdm.auto import tqdm
import random
import pandas as pd
import nltk
from nltk.corpus import wordnet

nltk.download('wordnet')

combine_df = pd.read_json(questions_answers_path, orient='records')


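def get_synonyms(word):
    # Collect every WordNet lemma for the word, excluding the word itself
    synonyms = set()
    for syn in wordnet.synsets(word):
        for lemma in syn.lemmas():
            # Multi-word lemmas use underscores ("look_at"); turn them into spaces
            name = lemma.name().replace('_', ' ')
            if name.lower() != word.lower():
                synonyms.add(name)
    return list(synonyms)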


def replace_with_synonym(sentence, max_replacements=2):
    words = sentence.split()
    replacements = 0
    new_words = []

    for word in words:
        if replacements < max_replacements and random.random() < 0.5:
            synonyms = get_synonyms(word)
            if synonyms:
                word = random.choice(synonyms)
                replacements += 1
        new_words.append(word)

    return ' '.join(new_words)


def generate_augmented_row(row):
    question = row['question']
    answer = row['answer']

    augmented_question = replace_with_synonym(question)
    augmented_answer = replace_with_synonym(answer)
    # 25%: augment only the question, 25%: only the answer, 50%: both
    rand = random.random()
    if rand < 0.25:
        return augmented_question, answer
    elif rand < 0.5:
        return question, augmented_answer
    else:
        return augmented_question, augmented_answer


def generate_augmented_rows(df):
    for _, row in tqdm(df.iterrows(), total=len(df), desc='Augmenting'):
        yield generate_augmented_row(row)


# Combine the original rows with two independently augmented copies
augmented_df = pd.concat(
    [
        combine_df,
        pd.DataFrame(generate_augmented_rows(df=combine_df), columns=['question', 'answer']),
        pd.DataFrame(generate_augmented_rows(df=combine_df), columns=['question', 'answer']),
    ],
    ignore_index=True
).dropna().drop_duplicates(subset=['question', 'answer'], keep='first', ignore_index=True)

augmented_df.to_json(augmented_questions_answers_path, orient='records')

print(f'Original dataset size: {len(combine_df)}')
print(f'Augmented dataset size: {len(augmented_df)}')
augmented_df.tail()

## Train a T5 model to generate answers from questions

This code is used to train a T5 model to generate answers from questions. The model is trained on the augmented dataset above.
Steps to train the model:
1. Split the dataset into train and validation sets
2. Create a T5 model and tokenizer
3. Tokenize the questions and answers
4. Create a `Seq2SeqTrainer` to train the model
5. Train the model for 4 epochs
6. Save the model

Training runs on a single GPU and took 44 minutes on a GTX 1660 Ti.
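
Steps 1 and 3 can be illustrated without any ML libraries. The sketch below is a simplified stand-in: `split_rows` and `format_pair` are hypothetical helpers, the rows are invented, and the exact rounding that `datasets` applies to percent slices may differ from the integer cut used here.

```python
def split_rows(rows, train_percent=90):
    """Return (train, validation): the first 90% and the remaining rows."""
    cut = len(rows) * train_percent // 100
    return rows[:cut], rows[cut:]


def format_pair(row):
    """Add the task prefixes that the notebook prepends before tokenizing."""
    return {
        "input_text": "question: " + row["question"],
        "target_text": "answer: " + row["answer"],
    }


# Hypothetical rows in the same shape as the JSON dataset
rows = [{"question": f"Q{i}?", "answer": f"A{i}."} for i in range(10)]

train_rows, val_rows = split_rows(rows)
train_examples = [format_pair(row) for row in train_rows]

print(len(train_rows), len(val_rows))
print(train_examples[0])
```

The same `question: ` prefix has to be used again at inference time, and the `answer: ` prefix is stripped from the generated text.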

In [None]:

import pandas as pd
from transformers import T5ForConditionalGeneration, T5Tokenizer, T5Config, Seq2SeqTrainingArguments, \
    DataCollatorForSeq2Seq, Seq2SeqTrainer
from accelerate import Accelerator
from datasets import load_dataset
import numpy as np
import evaluate

# I set 4 epochs because i can
num_train_epochs = 4

accelerator = Accelerator()

# Splitting the dataset into train and validation sets
[train_ds, test_df] = load_dataset('json',
                                   data_files=augmented_questions_answers_path,
                                   split=['train[:90%]', 'train[-10%:]']
                                   )
# Creating the tokenizer and model
model_name = "google/flan-t5-small"
config = T5Config.from_pretrained(model_name)

# Creating the model and tokenizer
model = T5ForConditionalGeneration.from_pretrained(model_name, config=config)
tokenizer = T5Tokenizer.from_pretrained(model_name)

# Putting the model on the GPU to be FAAAAAST
model, tokenizer = accelerator.prepare(model, tokenizer)

# Tokenizing the questions and answers
def preprocess_data(batch):
    input_texts = ["question: " + example for example in batch["question"]]
    target_texts = ["answer: " + example for example in batch["answer"]]
    input_tokenized = tokenizer(input_texts, truncation=True, max_length=512, padding="max_length", return_tensors="np")
    target_tokenized = tokenizer(target_texts, truncation=True, max_length=512, padding="max_length",
                                 return_tensors="np")
    # Mask padding in the labels with -100 so the loss ignores it
    labels = np.where(target_tokenized.input_ids == tokenizer.pad_token_id, -100, target_tokenized.input_ids)
    return {"input_ids": input_tokenized.input_ids, "attention_mask": input_tokenized.attention_mask,
            "labels": labels}


tokenized_train_dataset = train_ds.map(preprocess_data, batched=True)
tokenized_test_dataset = test_df.map(preprocess_data, batched=True)

# Data Collator - it is used to pad the data to the same length
data_collator = DataCollatorForSeq2Seq(tokenizer, model=model)

# Creating the trainer
training_args = Seq2SeqTrainingArguments(
    output_dir="output",
    num_train_epochs=num_train_epochs,
    per_device_train_batch_size=1,
    per_device_eval_batch_size=1,
    evaluation_strategy="epoch",
    save_strategy="epoch",
    logging_dir="logs",
    learning_rate=5e-5,
    weight_decay=0.01,
    load_best_model_at_end=True,
    report_to=["tensorboard"],
    optim='adamw_torch',
    fp16=True,
)

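# Computing the metrics since it is hype thing to do
# Token-level accuracy; the "accuracy" metric needs scikit-learn (bundled with Anaconda)
accuracy_metric = evaluate.load("accuracy")

def compute_metrics(eval_preds):
    logits, labels = eval_preds
    if isinstance(logits, tuple):  # T5 also returns encoder states; keep the logits
        logits = logits[0]
    predictions = np.argmax(logits, axis=-1)
    mask = (labels != tokenizer.pad_token_id) & (labels != -100)  # score real tokens only
    return accuracy_metric.compute(predictions=predictions[mask], references=labels[mask])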

# Creating trainer and train the model! Rock!!!
trainer = Seq2SeqTrainer(
    model=model,
    tokenizer=tokenizer,
    args=training_args,
    train_dataset=tokenized_train_dataset,
    eval_dataset=tokenized_test_dataset,
    data_collator=data_collator,
    compute_metrics=compute_metrics,
)

trainer.train()

# Saving the model for later use
trainer.save_model("output/model")
tokenizer.save_pretrained("output/model")

# Displaying fancy digits
trainer.evaluate()

## Inference

Let's compare the source model with the fine-tuned model.

In [None]:
from transformers import T5ForConditionalGeneration, T5Tokenizer, T5Config
from accelerate import Accelerator

accelerator = Accelerator()

# Creating the tokenizer and model
model_name = "google/flan-t5-small"
config = T5Config.from_pretrained(model_name)

# Creating the model and tokenizer
model = T5ForConditionalGeneration.from_pretrained(model_name, config=config)
tokenizer = T5Tokenizer.from_pretrained(model_name)

# Putting the model on the GPU to be FAAAAAST
model, tokenizer = accelerator.prepare(model, tokenizer)

# Inference on the source and fine-tuned models
def generate_answer(question, model, tokenizer, max_length=128):
    model.eval()
    input_text = "question: " + question
    input_tokens = tokenizer.encode(input_text, return_tensors="pt", max_length=512, truncation=True)
    output_tokens = model.generate(input_tokens, max_length=max_length, repetition_penalty=2.5, length_penalty=1.0,
                                   early_stopping=True, num_beams=4, num_return_sequences=4)
    answer = tokenizer.decode(output_tokens[0], skip_special_tokens=True)
    answer = answer.replace("answer: ", "")
    return answer

# Put whatever question you want
question = "What is the framework?"

# Loading the source model
source_model = T5ForConditionalGeneration.from_pretrained("google/flan-t5-small")
source_tokenizer = T5Tokenizer.from_pretrained("google/flan-t5-small")

# Generating the answer from the source model
source_answer = generate_answer(question, source_model, source_tokenizer)

# Loading the fine-tuned model
tuned_model = T5ForConditionalGeneration.from_pretrained("output/model")
tuned_tokenizer = T5Tokenizer.from_pretrained("output/model")

# Generating the answer from the fine-tuned model
tuned_answer = generate_answer(question, tuned_model, tuned_tokenizer)

# Behold the difference between the source and fine-tuned models!!!
print(f"Question: {question}")  # Question: What is the framework?
print(f"Source answer: {source_answer}")  # a framework
print(f"Tuned answer: {tuned_answer}")  # The framework is a set of tools that can be used to build and maintain the application. These tools are used to build applications, such as JavaScript, HTML, CSS, etc. In this framework, you can create your own custom code for your application. For example, let's take a look at the main features of the framework,javascript const template = []; console.log(message)  console.log(message);

## Conclusion

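After fine-tuning on the augmented dataset, the model gives longer, more detailed answers than the source model: in the example above, "a framework" becomes a multi-sentence explanation. However, the model is still far from perfect, and the tuned answer trails off into repeated code fragments.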

To improve the model, you can try to:
1. Increase the number of epochs
2. Increase the number of training examples
3. Use a larger model
4. Use a different optimizer
5. Use a different learning rate
6. Use a different scheduler
7. Use a different data augmentation technique
8. Use a larger dataset
