# Fine-tuning a model for interview question answering

This notebook is an experiment in fine-tuning a model to answer interview questions. It collects interview questions and answers from several sources and uses them as fine-tuning data.

## Requirements:
This code was written on a local machine with an Nvidia GTX 1660 Ti (6 GB) GPU, but it can also be run for free on Google Colab.

 - Anaconda
 - Nvidia CUDA Toolkit 11.1
 - Jupyter Notebook
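
Because the notebook can run either on a local GPU or on Colab, it can help to confirm which device PyTorch sees before running the heavier cells. The following is a minimal sketch, assuming PyTorch may or may not be installed yet; `describe_device` is a hypothetical helper and not part of the notebook:

```python
def describe_device():
    """Return a short description of the device PyTorch would use."""
    try:
        import torch  # imported lazily so the check also works before installation
    except ImportError:
        return "PyTorch is not installed"
    if torch.cuda.is_available():
        # Index 0 is the first visible CUDA device
        return f"CUDA GPU: {torch.cuda.get_device_name(0)}"
    return "CPU only"


print(describe_device())
```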

## Links to the sources:

Sources for interview questions and answers:
- [https://github.com/sudheerj/angular-interview-questions](https://github.com/sudheerj/angular-interview-questions)
- [https://github.com/sudheerj/javascript-interview-questions](https://github.com/sudheerj/javascript-interview-questions)
- [https://github.com/sudheerj/reactjs-interview-questions](https://github.com/sudheerj/reactjs-interview-questions)
- [https://github.com/aershov24/full-stack-interview-questions](https://github.com/aershov24/full-stack-interview-questions)

## Model for paraphrasing:
The following model is used for paraphrasing:
- [https://huggingface.co/google/flan-t5-small](https://huggingface.co/google/flan-t5-small)

## Models for fine-tuning:
The following models are used for fine-tuning:
- [https://huggingface.co/databricks/dolly-v2-3b](https://huggingface.co/databricks/dolly-v2-3b)
- [https://huggingface.co/google/flan-t5-small](https://huggingface.co/google/flan-t5-small)

In [1]:
# Install dependencies, including the CUDA 11.7 build of PyTorch
!pip install datasets markdown beautifulsoup4
!pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu117
!pip install transformers pandas accelerate nvidia-ml-py3



ERROR: Directory '//' is not installable. Neither 'setup.py' nor 'pyproject.toml' found.






In [106]:
import os
import markdown
import pandas as pd
from bs4 import BeautifulSoup
from datasets import Dataset

sudheerj_paths = [
    os.path.join('..', 'data', 'interview', 'sudheerj', 'angular-interview-questions.md'),
    os.path.join('..', 'data', 'interview', 'sudheerj', 'javascript-interview-questions.md'),
    os.path.join('..', 'data', 'interview', 'sudheerj', 'reactjs-interview-questions.md'),
]
aershov24_paths = [
    os.path.join('..', 'data', 'interview', 'aershov24', 'full-stack-interview-questions.md')
]

def parse_files(md_files, question_selector):
    data = pd.DataFrame()
    for md_file in md_files:
        with open(md_file, "r", encoding="utf-8") as file:
            md_content = file.read()
            html_content = markdown.markdown(md_content)
            soup = BeautifulSoup(html_content, "html.parser")

            questions = soup.select(question_selector)

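            for question in questions:
                # Collect every sibling element up to the next question heading.
                # This compares tag names, so question_selector must be a bare tag name such as "h3".
                answer_elements = []
                sibling = question.find_next_sibling()

                while sibling and sibling.name != question_selector:
                    answer_elements.append(str(sibling))
                    sibling = sibling.find_next_sibling()

                # Name the parser explicitly to avoid BeautifulSoup's "no parser specified" warning
                answer = BeautifulSoup(''.join(answer_elements).strip(), "html.parser")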

                data = pd.concat([data, pd.DataFrame({
                    'question': [question.text.strip()],
                    'answer': [answer.text.strip()]
                })])
    return data

sudheerj_df = parse_files(sudheerj_paths, "h3")
aershov24_df = parse_files(aershov24_paths, "h4")

combine_df = pd.concat([sudheerj_df, aershov24_df])

print(f"Total questions: {len(combine_df)}")

combine_df.to_csv(os.path.join('..', 'datasets', 'interview', 'interview_questions.csv'), index=False)
combine_df.tail()

Total questions: 1601


Unnamed: 0,question,answer
0,"Given variables a and b, switch their values s...","py\na, b = b, a"
0,How do you list the functions in a module?,Use the dir() method to list the functions in ...
0,What are descriptors?,Descriptors were introduced to Python way back...
0,What is React?,React is an open-source JavaScript library cre...
0,How would you write an inline style in React?,For example: ```html\n```\n\n\n#### What is JE...
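
The parsing cell above pairs each question heading with everything that follows it, up to the next heading of the same level. The following is a minimal standard-library sketch of that idea, working on raw Markdown rather than rendered HTML, so it is not the notebook's implementation; `split_questions` and the sample text are hypothetical:

```python
import re


def split_questions(md_text, level):
    """Pair each heading of the given level with the text up to the next such heading."""
    # Matches e.g. "### Question" for level=3, but not "#### Question"
    heading = re.compile("^" + "#" * level + r"[ \t]+(.+?)[ \t]*$", re.MULTILINE)
    matches = list(heading.finditer(md_text))
    pairs = []
    for current, following in zip(matches, matches[1:] + [None]):
        # The answer runs to the next heading of the same level, or to the end of the text
        end = following.start() if following else len(md_text)
        answer = md_text[current.end():end].strip()
        pairs.append({"question": current.group(1), "answer": answer})
    return pairs


sample = """### What is React?
React is a JavaScript library for building user interfaces.

### What is JSX?
JSX is a syntax extension for JavaScript.
"""

pairs = split_questions(sample, level=3)
print(len(pairs))
print(pairs[0]["question"])
```

As in the notebook's cell, deeper headings that appear inside an answer are kept as part of that answer's text.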


In [None]:
from transformers.pipelines.pt_utils import KeyDataset
from tqdm.auto import tqdm
from datasets import Dataset
import pandas as pd
from transformers import pipeline
from accelerate import Accelerator

accelerator = Accelerator()

num_paraphrases = 2
batch_size = 5

combined_ds = Dataset.from_pandas(combine_df)

paraphrase_pipeline = pipeline(
    "text2text-generation",
    model="google/flan-t5-small",
    tokenizer="google/flan-t5-small",
    framework="pt",
    num_beams=num_paraphrases,
    device=accelerator.device  # pass the torch.device itself; its .index can be None
)

# Helper function to preprocess input for the pipeline:
# prefixes every text in the batch with the paraphrasing instruction
def preprocess_input(batch, input_key):
    return {input_key: [f"paraphrase: {text}" for text in batch[input_key]]}

paraphrased_questions_df = pd.DataFrame(columns=["question"])
paraphrased_answers_df = pd.DataFrame(columns=["answer"])

preprocessed_batch_question = combined_ds.map(preprocess_input, fn_kwargs={"input_key": "question"}, batched=True, batch_size=batch_size)
preprocessed_batch_answer = combined_ds.map(preprocess_input, fn_kwargs={"input_key": "answer"}, batched=True, batch_size=batch_size)

# Each pipeline result is a list of `num_paraphrases` dicts for one input example.
# Collect the texts in plain lists and build each DataFrame once, instead of concatenating per row.
question_paraphrases = []
for results in tqdm(paraphrase_pipeline(KeyDataset(preprocessed_batch_question, "question"), num_return_sequences=num_paraphrases), desc="Paraphrasing questions", total=len(preprocessed_batch_question)):
    question_paraphrases.extend(result['generated_text'] for result in results)
paraphrased_questions_df = pd.DataFrame({'question': question_paraphrases})

answer_paraphrases = []
for results in tqdm(paraphrase_pipeline(KeyDataset(preprocessed_batch_answer, "answer"), num_return_sequences=num_paraphrases), desc="Paraphrasing answers", total=len(preprocessed_batch_answer)):
    answer_paraphrases.extend(result['generated_text'] for result in results)
paraphrased_answers_df = pd.DataFrame({'answer': answer_paraphrases})

paraphrased_qa_df = pd.concat([paraphrased_questions_df, paraphrased_answers_df], axis=1, join="inner")
paraphrased_qa_combined = pd.concat([paraphrased_qa_df, combine_df], axis=0, join="inner")
paraphrased_qa_df.to_csv(os.path.join('..', 'datasets', 'interview', 'paraphrased_qa.csv'), index=False)
paraphrased_qa_combined.to_csv(os.path.join('..', 'datasets', 'interview', 'paraphrased_qa_combined.csv'), index=False)

print(f"Paraphrased questions: {len(paraphrased_questions_df)}")
print(f"The original questions: {len(combine_df)}")
print(f"Together: {len(paraphrased_qa_combined)}")

Map:   0%|          | 0/1601 [00:00<?, ? examples/s]

Map:   0%|          | 0/1601 [00:00<?, ? examples/s]

Paraphrasing questions:   0%|          | 0/1601 [00:00<?, ?it/s]
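
The cell above generates `num_paraphrases` variants of every question and every answer, pairs them by row position, and then appends the original rows. The following is a minimal sketch of that bookkeeping with no model involved; `fake_paraphrase` is a hypothetical stand-in for the flan-t5-small pipeline, and `augment` is an illustrative name, not part of the notebook:

```python
def fake_paraphrase(text, n):
    """Stand-in for the model: returns n labelled variants of the input."""
    return [f"{text} (variant {i + 1})" for i in range(n)]


def augment(pairs, num_paraphrases):
    """Paraphrase questions and answers separately, then pair them by position."""
    questions = [p for q, _ in pairs for p in fake_paraphrase(q, num_paraphrases)]
    answers = [p for _, a in pairs for p in fake_paraphrase(a, num_paraphrases)]
    paraphrased = list(zip(questions, answers))
    # The combined set is the paraphrased pairs followed by the originals
    return paraphrased, paraphrased + list(pairs)


originals = [
    ("What is React?", "A UI library."),
    ("What is JSX?", "A syntax extension."),
]
paraphrased, combined = augment(originals, num_paraphrases=2)
print(len(paraphrased), len(combined))  # 4 6
```

By the same arithmetic, 1601 original rows with `num_paraphrases = 2` would be expected to give 3202 paraphrased rows and 4803 combined rows, provided the pipeline returns exactly two sequences for every input.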