### Question Synthesis Engine

This notebook fine-tunes a Llama 3 8B model to generate high-quality practice questions for students. Because questions that look high-quality to me may not actually be so, I built a simple Streamlit application (in the same folder) that runs a blind test, mixing randomly chosen synthetic questions with actual NSMQ questions. Ten people took the test, and they frequently marked the synthetic questions as actual NSMQ questions.

---

First, we check the CUDA compute capability of the GPU in the environment and install the dependencies that are compatible with it, which prevents version conflicts.

In [None]:
%%capture
import torch
major_version, minor_version = torch.cuda.get_device_capability()
!pip install "unsloth[colab-new] @ git+https://github.com/unslothai/unsloth.git"
if major_version >= 8:
    !pip install --no-deps packaging ninja einops flash-attn xformers trl peft accelerate bitsandbytes
else:
    !pip install --no-deps xformers trl peft accelerate bitsandbytes
pass

Next, we load the model. Unsloth provides a range of pre-quantized 4-bit models (listed below); the one we use is Llama 3 8B with 4-bit quantization.


In [None]:
from unsloth import FastLanguageModel
import torch
max_seq_length = 2048
dtype = None
load_in_4bit = True # Use 4-bit quantization to reduce memory usage. Can be False.

fourbit_models = [
    "unsloth/mistral-7b-bnb-4bit",
    "unsloth/mistral-7b-instruct-v0.2-bnb-4bit",
    "unsloth/llama-2-7b-bnb-4bit",
    "unsloth/gemma-7b-bnb-4bit",
    "unsloth/gemma-7b-it-bnb-4bit",
    "unsloth/gemma-2b-bnb-4bit",
    "unsloth/gemma-2b-it-bnb-4bit",
    "unsloth/llama-3-8b-bnb-4bit",
    "unsloth/llama-3-70b-Instruct-bnb-4bit",
]

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "unsloth/llama-3-8b-bnb-4bit",
    max_seq_length = max_seq_length,
    dtype = dtype,
    load_in_4bit = load_in_4bit,
)

🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.


==((====))==  Unsloth: Fast Llama patching release 2024.6
   \\   /|    GPU: Tesla T4. Max memory: 14.748 GB. Platform = Linux.
O^O/ \_/ \    Pytorch: 2.3.0+cu121. CUDA = 7.5. CUDA Toolkit = 12.1.
\        /    Bfloat16 = FALSE. Xformers = 0.0.26.post1. FA = False.
 "-____-"     Free Apache license: http://github.com/unslothai/unsloth


Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


---

Next, we integrate LoRA adapters into our model, which allows us to efficiently update just a fraction of the model's parameters, enhancing training speed and reducing computational load.

In [None]:
model = FastLanguageModel.get_peft_model(
    model,
    r = 16,
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
                      "gate_proj", "up_proj", "down_proj",],
    lora_alpha = 16,
    lora_dropout = 0,
    bias = "none",
    use_gradient_checkpointing = "unsloth",
    random_state = 3407,
    use_rslora = False,
    loftq_config = None,
)

Unsloth 2024.6 patched 32 layers with 32 QKV layers, 32 O layers and 32 MLP layers.
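
As a sanity check on the adapter size, the number of trainable LoRA parameters can be worked out by hand: each adapted linear layer with input width `d_in` and output width `d_out` gains `r * (d_in + d_out)` weights. The sketch below assumes the published Llama 3 8B dimensions (hidden size 4096, key/value projection width 1024, MLP width 14336, 32 layers). Those numbers are not printed in this notebook, so treat them as assumptions; `lora_param_count` is a name made up for this example.

```python
# Trainable LoRA parameters: r * (d_in + d_out) per adapted linear layer.
# The dimensions below are assumed from the public Llama 3 8B config.
R = 16
HIDDEN, KV_DIM, MLP_DIM, NUM_LAYERS = 4096, 1024, 14336, 32

# (d_in, d_out) for every module listed in target_modules
module_shapes = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, MLP_DIM),
    "up_proj": (HIDDEN, MLP_DIM),
    "down_proj": (MLP_DIM, HIDDEN),
}

def lora_param_count(shapes, r, num_layers):
    """Sum the LoRA A and B matrix sizes over all adapted modules and layers."""
    per_layer = sum(r * (d_in + d_out) for d_in, d_out in shapes.values())
    return per_layer * num_layers

total = lora_param_count(module_shapes, R, NUM_LAYERS)
print(f"{total:,}")  # 41,943,040
```

Under these assumptions the result is 41,943,040, the same figure the training banner reports as the number of trainable parameters.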


<a name="Data"></a>
### Data Prep
We use a combined CSV file containing all the questions that were provided to us for the hackathon; I wrote a script to merge the questions from the various years into one file. However, that dataset only contains questions from the Fundamental Questions round, so to simulate questions from the other rounds I wrote a script to fetch the transcripts of about 25 NSMQ videos and extracted the questions from them.

---
Then, we define a system prompt that formats tasks into instructions, inputs, and responses, and apply it to a dataset to prepare our inputs and outputs for the model, with an EOS token to signal completion.

### Creating the Combined Dataset

Given the Google Drive folder of NSMQ questions, we first combine all the questions into a single CSV file so that the model can be trained on them. I also scraped transcripts of past NSMQ contests to obtain questions from the other rounds, which helps the model generate questions for those rounds.

In [None]:
import pandas as pd
from google.colab import drive
import gspread
import os

# Mounting Google Drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
# Path to the root folder containing the NSMQ QUESTIONS directory
root_folder = '/content/drive/MyDrive/Brilla AI Resources'

In [None]:
def read_excel_files(root_folder):
    data_frames = []

    if not os.path.exists(root_folder):
        print(f"Root folder not found: {root_folder}")
        return data_frames

    print(f"Root folder: {root_folder}")

    for year_folder in os.listdir(root_folder):
        year_folder_path = os.path.join(root_folder, year_folder)
        print(f"Year folder: {year_folder_path}")

        if os.path.isdir(year_folder_path):
            for file in os.listdir(year_folder_path):
                file_path = os.path.join(year_folder_path, file)
                print(f"Found file: {file_path}")

                if file.endswith('.xlsx') or file.endswith('.xls'):
                    print(f"Reading {file_path}")
                    df = pd.read_excel(file_path)
                    data_frames.append(df)

    return data_frames

# Read all Excel files and combine them into a single DataFrame
data_frames = read_excel_files(root_folder)

if data_frames:
    combined_df = pd.concat(data_frames, ignore_index=True)
    # Save the combined DataFrame to a CSV file
    output_path = '/content/drive/My Drive/combined_data.csv'
    combined_df.to_csv(output_path, index=False)
    print(f"Data has been combined and saved to {output_path}")
else:
    print("No data frames to combine.")

Root folder: /content/drive/MyDrive/Brilla AI Resources
Year folder: /content/drive/MyDrive/Brilla AI Resources/2020
Found file: /content/drive/MyDrive/Brilla AI Resources/2020/2020 NSMQ contest 13.xlsx
Reading /content/drive/MyDrive/Brilla AI Resources/2020/2020 NSMQ contest 13.xlsx
Found file: /content/drive/MyDrive/Brilla AI Resources/2020/2020 NSMQ contest 14.xlsx
Reading /content/drive/MyDrive/Brilla AI Resources/2020/2020 NSMQ contest 14.xlsx
Found file: /content/drive/MyDrive/Brilla AI Resources/2020/2020 NSMQ contest 23.xlsx
Reading /content/drive/MyDrive/Brilla AI Resources/2020/2020 NSMQ contest 23.xlsx
Found file: /content/drive/MyDrive/Brilla AI Resources/2020/2020 NSMQ contest 38.xlsx
Reading /content/drive/MyDrive/Brilla AI Resources/2020/2020 NSMQ contest 38.xlsx
Found file: /content/drive/MyDrive/Brilla AI Resources/2020/2020 NSMQ contest 2.xlsx
Reading /content/drive/MyDrive/Brilla AI Resources/2020/2020 NSMQ contest 2.xlsx
Found file: /content/drive/MyDrive/Brilla AI 

In [None]:
!pip install pytube youtube-transcript-api

Collecting pytube
  Downloading pytube-15.0.0-py3-none-any.whl (57 kB)
Installing collected packages: pytube
Successfully installed pytube-15.0.0


In [None]:
from pytube import Playlist
from youtube_transcript_api import YouTubeTranscriptApi

# Function to get the transcript of a single video
def get_transcript(video_id):
    try:
        transcript_list = YouTubeTranscriptApi.get_transcript(video_id)
        transcript_text = ''
        for line in transcript_list:
            transcript_text += line['text'] + ' '
        return transcript_text
    except Exception as e:
        print(f"An error occurred for video {video_id}: {str(e)}")
        return None

# Main function to iterate through the playlist and write transcripts to a file
def get_transcripts_from_playlist(playlist_url, output_file):
    playlist = Playlist(playlist_url)
    video_ids = [video.video_id for video in playlist.videos]

    with open(output_file, 'w', encoding='utf-8') as f:
        for video_id in video_ids:
            print(f"Fetching transcript for video ID: {video_id}")
            transcript = get_transcript(video_id)
            if transcript:
                f.write(transcript + '\n\n')
    print(f"Transcripts have been written to {output_file}")

# Example usage
if __name__ == "__main__":
    playlist_url = 'https://youtube.com/playlist?list=PLLbHBxYIftwvrWScPQQKitFVT11-MBUlK&si=f0I9PiaGDvvyeY1y'
    output_file = 'transcripts.txt'

    get_transcripts_from_playlist(playlist_url, output_file)

Fetching transcript for video ID: 3N4tI9aOTVw
Fetching transcript for video ID: imuvkEb9rmQ
Fetching transcript for video ID: 9mYYhigVxh8
Fetching transcript for video ID: UJaocPqJz58
Fetching transcript for video ID: 0aWI4GWxMJk
Fetching transcript for video ID: lOr3h3qOr4U
Fetching transcript for video ID: 19zsntIBkMo
Fetching transcript for video ID: 1Q9B6Uqgw2U
Fetching transcript for video ID: G53NMVV59FE
Fetching transcript for video ID: fE-qVk4HDHA
Fetching transcript for video ID: K1cBKW2gEZs
Fetching transcript for video ID: V3dbA4rGrcw
Fetching transcript for video ID: 2lxLwkGStn0
Fetching transcript for video ID: 18ntn6Z1Ea8
Fetching transcript for video ID: OjEIgrBwV7c
Fetching transcript for video ID: lUfW7RZ9aKA
Fetching transcript for video ID: l30tHg9tmLA
Fetching transcript for video ID: 01tSRqMUd3I
Fetching transcript for video ID: WPuhIZfPO0A
Fetching transcript for video ID: fIiAyetZnqI
Fetching transcript for video ID: uH3AP4g9rys
Fetching transcript for video ID: 

---

From here, a lot of manual work went into getting the questions out of the transcripts. The transcripts are imperfect, so the questions had to be extracted by hand and reformatted into a cleaner form. The combined, annotated, and cleaned dataset, named "Question Synth Dataset", is in the same folder as this file.
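
After this much manual editing, it can help to confirm that the cleaned file still has every column the prompt-formatting step reads. The following is a minimal sketch using only the standard library. The column names are the ones the formatting function in this notebook reads; `missing_columns` and the sample header are hypothetical.

```python
import csv
import io

# Columns the prompt-formatting step expects in "Question Synth Dataset.csv"
REQUIRED_COLUMNS = [
    "has_preamble", "preamble_text", "question",
    "answer", "subject", "question_type",
]

def missing_columns(csv_file):
    """Return the required columns that are absent from the CSV header."""
    header = next(csv.reader(csv_file), [])
    return [col for col in REQUIRED_COLUMNS if col not in header]

# Hypothetical header with one column dropped during manual cleaning
sample = io.StringIO("has_preamble,preamble_text,question,answer,subject\n")
print(missing_columns(sample))  # ['question_type']
```

For the real file, the same function could be called on `open(csv_file_path, newline='', encoding='utf-8')`.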

In [None]:
from datasets import load_dataset
from datasets import Dataset

question_synth_prompt = """You're a highly respected college instructor at one of the most renowned institutions in the world.
You're tasked with setting questions for the renowned National Science and Maths Quiz.
The National Science and Maths Quiz is a competition where first year college students compete by solving questions from various fields of STEM (principally Chemistry, Physics, Mathematics and Biology)
As you're a highly respected college instructor in the STEM field, your questions should be as top quality and thought provoking as possible.
If your questions are not as high-quality, someone will die. If your question doesn't obey the input, someone will die.
Based on the input, put in the MAXIMUM effort to generate high-standard, high-quality questions at a level that a final year college student who is highly specialized in the specified field can answer.
The questions must not be so easy for even a third year student in the field. The standard should be that high.
Take good note of the input values and produce desired outputs based on the input values.

### Instruction:
{}

### Input:
{}

### Response:
{}"""

csv_file_path = "/content/drive/MyDrive/afried/Question Synth Dataset.csv"  # Update this with your actual file path
df = pd.read_csv(csv_file_path)
dataset = Dataset.from_pandas(df)

EOS_TOKEN = tokenizer.eos_token

def formatting_prompts_func(examples):
    texts = []
    for has_preamble, preamble_text, question, answer, subject, question_type in zip(
        examples["has_preamble"], examples["preamble_text"], examples["question"],
        examples["answer"], examples["subject"], examples["question_type"]
    ):
        # Construct the instruction and input for the new question
        instruction = f"Create a high-quality college level question and answer based on the given subject and question type. Please take a note of the question type. It is TOO TOO IMPORTANT."
        input_text = f"Example:\nHas Preamble: {has_preamble}\nSubject: {subject}\nQuestion Type: {question_type}"
        output_text = f"Preamble Text: {preamble_text}\nQuestion: {question}\nAnswer: {answer}"
        text = question_synth_prompt.format(instruction, input_text, output_text) + EOS_TOKEN
        texts.append(text)
    return { "text" : texts, }

# Apply the formatting function to the dataset
dataset = dataset.map(formatting_prompts_func, batched=True)

Map:   0%|          | 0/15852 [00:00<?, ? examples/s]
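
To make the formatting step concrete, here is a small sketch of what one training example looks like after templating. It uses a shortened template, a made-up row, and a placeholder EOS string (`<EOS>`) in place of `tokenizer.eos_token`; all three are illustrative assumptions, as is the `format_example` helper.

```python
# Shortened stand-in for question_synth_prompt (the system preamble is omitted)
TEMPLATE = """### Instruction:
{}

### Input:
{}

### Response:
{}"""

EOS = "<EOS>"  # placeholder; the notebook uses tokenizer.eos_token

def format_example(row, instruction):
    """Build one training string from a dataset row (a dict of column values)."""
    input_text = (
        f"Example:\nHas Preamble: {row['has_preamble']}\n"
        f"Subject: {row['subject']}\nQuestion Type: {row['question_type']}"
    )
    output_text = (
        f"Preamble Text: {row['preamble_text']}\n"
        f"Question: {row['question']}\nAnswer: {row['answer']}"
    )
    return TEMPLATE.format(instruction, input_text, output_text) + EOS

# Made-up row for illustration only
row = {
    "has_preamble": "No", "preamble_text": "None",
    "question": "What is the SI unit of force?", "answer": "Newton",
    "subject": "Physics", "question_type": "1",
}
text = format_example(row, "Create a question and answer for the given subject.")
print(text)
```

The trailing EOS marker is what teaches the model to stop after the answer instead of continuing to generate.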

<a name="Train"></a>
### Train the model
At this stage, we're configuring our model's training setup, where we define things like batch size and learning rate, to teach our model effectively with the data we have prepared.

In [None]:
from trl import SFTTrainer
from transformers import TrainingArguments

trainer = SFTTrainer(
    model = model,
    tokenizer = tokenizer,
    train_dataset = dataset,
    dataset_text_field = "text",
    max_seq_length = max_seq_length,
    dataset_num_proc = 2,
    packing = False,
    args = TrainingArguments(
        per_device_train_batch_size = 2,
        gradient_accumulation_steps = 4,
        warmup_steps = 5,
        max_steps = 150, # increase this to make the model learn "better"
        # num_train_epochs=1,
        learning_rate = 2e-4,
        fp16 = not torch.cuda.is_bf16_supported(),
        bf16 = torch.cuda.is_bf16_supported(),
        logging_steps = 1,
        optim = "adamw_8bit",
        weight_decay = 0.01,
        lr_scheduler_type = "linear",
        seed = 3407,
        output_dir = "outputs",
    ),
)

Map (num_proc=2):   0%|          | 0/15852 [00:00<?, ? examples/s]

max_steps is given, it will override any value given in num_train_epochs
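
With these settings, it is worth checking how much of the dataset the run actually sees. The arithmetic below uses only values from this notebook (batch size 2, 4 gradient accumulation steps, 150 steps, 15,852 examples, one GPU); the variable names are just for illustration.

```python
# How much data does a 150-step run cover on a single GPU?
per_device_batch = 2
grad_accum_steps = 4
max_steps = 150
num_examples = 15_852

effective_batch = per_device_batch * grad_accum_steps   # examples per optimizer step
examples_seen = effective_batch * max_steps             # examples consumed in the run
fraction_of_epoch = examples_seen / num_examples

print(effective_batch, examples_seen, f"{fraction_of_epoch:.1%}")  # 8 1200 7.6%
```

So the run covers roughly 7.6% of one epoch; a full pass over the data would need about 1,982 steps at this batch size.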


In [None]:
#@title Show current memory stats
gpu_stats = torch.cuda.get_device_properties(0)
start_gpu_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
max_memory = round(gpu_stats.total_memory / 1024 / 1024 / 1024, 3)
print(f"GPU = {gpu_stats.name}. Max memory = {max_memory} GB.")
print(f"{start_gpu_memory} GB of memory reserved.")

GPU = Tesla T4. Max memory = 14.748 GB.
11.033 GB of memory reserved.


In [None]:
trainer_stats = trainer.train()

==((====))==  Unsloth - 2x faster free finetuning | Num GPUs = 1
   \\   /|    Num examples = 15,852 | Num Epochs = 1
O^O/ \_/ \    Batch size per device = 2 | Gradient Accumulation steps = 4
\        /    Total batch size = 8 | Total steps = 150
 "-____-"     Number of trainable parameters = 41,943,040


| Step | Training Loss |
|------|---------------|
| 1 | 2.5153 |
| 2 | 2.5525 |
| 3 | 2.4114 |
| 4 | 2.3894 |
| 5 | 2.3137 |
| 6 | 2.1997 |
| 7 | 1.9336 |
| 8 | 1.5438 |
| 9 | 1.3148 |
| 10 | 1.1441 |


In [None]:
#@title Show final memory and time stats
used_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
used_memory_for_lora = round(used_memory - start_gpu_memory, 3)
used_percentage = round(used_memory         /max_memory*100, 3)
lora_percentage = round(used_memory_for_lora/max_memory*100, 3)
print(f"{trainer_stats.metrics['train_runtime']} seconds used for training.")
print(f"{round(trainer_stats.metrics['train_runtime']/60, 2)} minutes used for training.")
print(f"Peak reserved memory = {used_memory} GB.")
print(f"Peak reserved memory for training = {used_memory_for_lora} GB.")
print(f"Peak reserved memory % of max memory = {used_percentage} %.")
print(f"Peak reserved memory for training % of max memory = {lora_percentage} %.")

1502.46 seconds used for training.
25.04 minutes used for training.
Peak reserved memory = 11.033 GB.
Peak reserved memory for training = 0.0 GB.
Peak reserved memory % of max memory = 74.81 %.
Peak reserved memory for training % of max memory = 0.0 %.


<a name="Inference"></a>
### Inference
Let's run the model and save the generated questions, which end up in a SQLite file. A `TextStreamer` streams the output token by token, so you can watch the generation instead of waiting for it to finish. The goal was to give students enough high-quality practice questions without having to hit any endpoint or use the copyrighted questions.

We **aimed** for **15,600 questions** in total, covering the categories below. Due to time constraints, we generated about 10,600.

1. Fundamental Questions (all areas: Biology, Chemistry, Mathematics, and Physics)
2. Riddles
3. True or False

In [None]:
# @title Text Extraction
import re

def extract_info(text):
    # Split the text into blocks based on the "### Input" delimiter
    blocks = text.split("### Input:")

    results = []

    for block in blocks[1:]:  # Skip the first empty block
        # Extract information from each block
        match = re.search(r'Subject: (.*?)\nQuestion Type:', block, re.DOTALL)
        if match:
            subject = match.group(1).strip()
        else:
            subject = ""

        match = re.search(r'Has Preamble: (.*?)\nSubject: (.*?)\nPreamble Text: (.*?)\nQuestion: (.*?)\nAnswer: (.*?)\n', block, re.DOTALL)
        if match:
            has_preamble, _, preamble, question, answer = match.groups()
            answer = answer.strip()
            # Cut the answer off where the next sample's "<|begin_of_text|>" marker starts
            marker_pos = answer.find("<|begin_of_text|>")
            if marker_pos > 0:
                answer = answer[:marker_pos]

            results.append({
                'has preamble': has_preamble.strip(),
                'subject' : subject,
                'preamble': preamble.strip(),
                'question': question.strip(),
                'answer': answer
            })

    return results


# results = extract_info(test_text)

# for result in results:
#   print(result)
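
To illustrate what the extraction step is looking for, the sketch below parses a single hypothetical generation with a simplified pattern. `parse_response` and the sample text are illustrative and are not the `extract_info` function above; they only assume the `Preamble Text:` / `Question:` / `Answer:` labels used in the training outputs.

```python
import re

# Simplified pattern for the "### Response:" part of one generation
RESPONSE_PATTERN = re.compile(
    r"Preamble Text:\s*(?P<preamble>.*?)\n"
    r"Question:\s*(?P<question>.*?)\n"
    r"Answer:\s*(?P<answer>.*)",
    re.DOTALL,
)

def parse_response(generated_text):
    """Return preamble, question and answer from one generation, or None."""
    # Keep only the text after the last response marker
    response = generated_text.split("### Response:")[-1]
    match = RESPONSE_PATTERN.search(response)
    if match is None:
        return None
    return {key: value.strip() for key, value in match.groupdict().items()}

# Hypothetical model output
sample = (
    "### Input:\nHas Preamble: No\nSubject: Chemistry Solubility\nQuestion Type: 1\n\n"
    "### Response:\nPreamble Text: None\n"
    "Question: Which salt is less soluble in water, AgCl or NaCl?\n"
    "Answer: AgCl"
)
print(parse_response(sample))
```

Returning `None` for text that does not match makes malformed generations easy to skip rather than writing partial rows.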

In [None]:
# @title CSV Operations
import csv

def write_to_csv(preamble, question, answer, subject, path="/content/drive/MyDrive/afried/generatedQuestions.csv"):
    # Row layout: preamble, question, answer, subject, topic (no header row is written)

    # Split e.g. "Chemistry Solubility" into the main subject and the topic
    subject_parts = subject.split()
    main_subject = subject_parts[0] if len(subject_parts) > 0 else ""
    topic = " ".join(subject_parts[1:]) if len(subject_parts) > 1 else ""

    new_row = [preamble, question, answer, main_subject, topic]

    # Append the row to the CSV file
    with open(path, mode='a', newline='', encoding='utf-8') as file:
        csv_writer = csv.writer(file)
        csv_writer.writerow(new_row)

# Example usage
# write_to_csv(preamble='preamble', question='question', answer='answer', subject='subject topic')


In [None]:
# @title Run from here
from transformers import TextStreamer

# Each tuple is (has_preamble, subject, question_type). The subject changes as we go through
# the topics of each subject; naming a topic gave far better results than just "Chemistry" or "Physics".
prompts = [("No", "Chemistry Solubility", "1") for _ in range(150)]

FastLanguageModel.for_inference(model)  # switch Unsloth to inference mode once

def gen_10questions(has_preamble, subject, question_type):
    prompt_input = f"Has Preamble: {has_preamble}\nSubject: {subject}\nQuestion Type: {question_type}"
    instruction = f""" Please generate high quality real-world application questions in {subject} for the National Science and Math Quiz. Do your UTMOST best to make the questions as complex as possible.
                The questions can be application questions that have themes in any field.
                The questions must be very complex to the highest standard of university level students.
                Also, provide an answer to each question. It is very important to provide an answer to every question!
                But do not provide any explanations for the answers whatsoever. Unless the question is an explanation question.
                And explanation questions must be real-life application questions
                """
    inputs = tokenizer(
        [question_synth_prompt.format(instruction, prompt_input, "")],  # leave the response blank for generation
        return_tensors="pt",
    ).to("cuda")

    text_streamer = TextStreamer(tokenizer)
    outputs = model.generate(**inputs, streamer=text_streamer, max_new_tokens=2048)

    # Decode only this call's output. The trailing newline lets the "Answer: ...\n"
    # pattern in extract_info match when generation stops right after the answer.
    generated_text = tokenizer.batch_decode(outputs, skip_special_tokens=True)[0] + "\n"
    results = extract_info(generated_text)

    if results:
        # Write every extracted question, not only the last one
        for result in results:
            write_to_csv(preamble=result['preamble'], question=result['question'],
                         answer=result['answer'], subject=result['subject'])
        print("written to csv...")
    else:
        print("No results to process")

for has_preamble, subject, question_type in prompts:
    gen_10questions(has_preamble, subject, question_type)


In [None]:
# @title Clean up csv

import pandas as pd
import re

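# Read the CSV file into a DataFrame.
# Note: this expects generatedQuestions.csv in the working directory,
# with a header row that includes a 'preamble' column.
df = pd.read_csv('generatedQuestions.csv')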

# Function to extract the subject from the preamble
def extract_subject(preamble):
    if isinstance(preamble, str):
        match = re.search(r'Subject:\s*(\w+)', preamble)
        if match:
            return match.group(1)
    return None

# Function to clean the preamble column
def clean_preamble(preamble):
    if isinstance(preamble, str):
        if preamble.startswith("None"):
            return "None"
        else:
            match = re.search(r'(None)? Subject:.*', preamble)
            if match:
                preamble_cleaned = preamble.split(" Subject:")[0].strip()
                return preamble_cleaned if preamble_cleaned else "None"
    return preamble

# Convert the preamble column to strings
df['preamble'] = df['preamble'].astype(str)

# Apply the function to extract the subject
df['subject'] = df['preamble'].apply(extract_subject)

# Apply the function to clean the preamble
df['preamble'] = df['preamble'].apply(clean_preamble)

# Save the updated DataFrame to a new CSV file
df.to_csv('combinedgenquestions.csv', index=False)

print("Operation completed successfully ...")


In [None]:
# @title Convert csv to db file

import csv
import sqlite3

# Establish a connection to a new or existing SQLite database
connection = sqlite3.connect('data.db')
cursor = connection.cursor()

# Create a new table to store the data
cursor.execute('''
CREATE TABLE IF NOT EXISTS questions (
    has_preamble TEXT,
    preamble_text TEXT,
    question TEXT,
    answer TEXT,
    subject TEXT,
    question_type TEXT,
    form TEXT,
    difficulty TEXT,
    subject_topic TEXT
)
''')

# Open the CSV file and insert each row into the table
with open('combinedgenquestions.csv', newline='', encoding='utf-8') as csvfile:
    reader = csv.DictReader(csvfile)
    for track, row in enumerate(reader, start=1):
        cursor.execute('''
        INSERT INTO questions (has_preamble, preamble_text, question, answer, subject,
                               question_type, form, difficulty, subject_topic)
        VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?)''',
        (row['has_preamble'], row['preamble_text'], row['question'], row['answer'],
         row['subject'], row['question_type'], row['form'], row['difficulty'], row['subject_topic']))
        print(track)  # running count of inserted rows

# Commit the changes to the database
connection.commit()
# Close the connection
connection.close()
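
Once the database exists, an application can serve practice questions offline with a plain SQL query. The sketch below uses an in-memory database with a reduced version of the `questions` table (only three of its columns) and made-up rows; `random_question` is a hypothetical helper, not part of the project code.

```python
import sqlite3

def random_question(connection, subject):
    """Return one random (question, answer) pair for a subject, or None."""
    cursor = connection.execute(
        "SELECT question, answer FROM questions "
        "WHERE subject = ? ORDER BY RANDOM() LIMIT 1",
        (subject,),
    )
    return cursor.fetchone()

# In-memory database with made-up rows, for illustration only
connection = sqlite3.connect(":memory:")
connection.execute("CREATE TABLE questions (question TEXT, answer TEXT, subject TEXT)")
connection.executemany(
    "INSERT INTO questions VALUES (?, ?, ?)",
    [
        ("What is the SI unit of force?", "Newton", "Physics"),
        ("What is the chemical symbol for sodium?", "Na", "Chemistry"),
    ],
)

print(random_question(connection, "Physics"))
connection.close()
```

Passing the subject as a bound parameter (`?`) rather than formatting it into the SQL string keeps the query safe if the subject ever comes from user input.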
