# Elice Interview Mini-project
### **Interviewee**: Vu Linh Le (Andrew)
### **Position**: AI Engineer

## Problem Statement
The candidate is expected to develop a system that can automatically generate high-quality
multiple-choice quizzes from textbooks or PDF documents. The quizzes should be relevant to
the content of the given text and provide high-quality questions with multiple-choice answers.
Define what "good quality" is for this project and outline your strategy to enhance model
performance for better question quality.

## Preliminary: Text extraction from PDF
Assume that we will only work with PDF documents. In this step, the code will extract the text from the PDF documents. The text will be used as the input for the next steps.

**Library used**: PyPDF2.
I included a sample PDF file for testing purposes. The text is about my favorite Vietnamese noodle soup, Bun Bo Hue.

In [15]:
# Install the required packages
%conda env create -f environment.yml

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)


Retrieving notices: ...working... done

CondaValueError: prefix already exists: /opt/homebrew/Caskroom/miniconda/base/envs/elice


Note: you may need to restart the kernel to use updated packages.


In [16]:
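import PyPDF2
import ipywidgets as widgets
import json
from IPython.display import display  # used by MCQ.display_mcq; implicit in notebooks, explicit here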


def extract_text_from_pdf(pdf_path):
    """Return the concatenated text of every page in the PDF, or None on failure."""
    try:
        with open(pdf_path, "rb") as file:
            reader = PyPDF2.PdfReader(file)
            # Fall back to "" so a page with no extractable text cannot break the join
            return "".join(page.extract_text() or "" for page in reader.pages)
    except Exception as e:
        print(f"Error: {e}")
        return None

***Note***: Even though the code extracts some irrelevant text, cleaning it up is out of scope for this notebook for the sake of time and simplicity.
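If a light cleanup pass were wanted, it might look like the sketch below. `clean_extracted_text` is a hypothetical helper, not part of the notebook; it only normalizes whitespace and re-joins hard-wrapped lines, and it does not repair words that the extractor split in the middle.

```python
import re


def clean_extracted_text(text: str) -> str:
    """Normalize whitespace in text extracted from a PDF (hypothetical helper)."""
    # Drop empty lines and trim the padding left behind by the PDF layout
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    # Join hard-wrapped lines into one paragraph and collapse repeated spaces
    return re.sub(r"\s+", " ", " ".join(lines))


raw = "Bun Bo Hue  \nBun Bo Hue is a flavorful and aromatic \n\nVietnamese noodle soup."
print(clean_extracted_text(raw))
```

Joining every line into one paragraph is a simplification that suits a single-paragraph sample; a multi-section textbook would need paragraph boundaries preserved.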

## Solutions

### Solution 1: Using solely OpenAI's GPT-3.5 API
The purpose of this solution is to demonstrate the use of OpenAI's GPT-3.5 API to generate multiple-choice questions from a given text. Skills shown in this solution include:
- Understanding how to use third-party APIs to quickly prototype a solution
- Prompt engineering to get the best results from the API

In [17]:
# Define the class for the MCQ
class MCQ:
    def __init__(self, question, choices, answer, explanation):
        self.question = question
        self.choices = choices
        self.answer = answer
        self.explanation = explanation

    @staticmethod
    def from_json(response):
        """Build an MCQ from a chat completion whose message content is a JSON object."""
        json_str = response.choices[0].message.content
        try:
            # Parse inside the try block so malformed JSON is reported here as well
            json_obj = json.loads(json_str)
            question = json_obj["question"]
            choices = json_obj["choices"]
            answer = json_obj["answer"]
            explanation = json_obj["explanation"]
        except Exception as e:
            print("Error parsing JSON")
            print(f"Error: {e}")
            return None
        return MCQ(question, choices, answer, explanation)

    def display_mcq(self, output=None):
        CHOICE_LETTERS = ["A", "B", "C", "D"]

        # Create a widget for displaying the question
        question_widget = widgets.HTML(value=f"<h3>{self.question}</h3>")

        # Show each choice as its own HTML widget, stacked vertically
        choice_widgets = [widgets.HTML(value=choice) for choice in self.choices]
        vbox = widgets.VBox(choice_widgets)
        options = widgets.Select(options=CHOICE_LETTERS, description="Your answer", disabled=False)
        answer_button = widgets.Button(description="Answer")

        # Create a widget for displaying feedback (defined before the handler that updates it)
        feedback_widget = widgets.HTML(value="")

        # Function to handle the submission of the answer
        def on_answer_button_clicked(_):
            # Get the selected answer
            selected_answer = options.value
            # Check if the selected answer is correct
            if selected_answer == self.answer:
                feedback_widget.value = "<h3 style='color: green;'>Correct!</h3>"
            else:
                feedback_widget.value = f"<h3 style='color: red;'>Incorrect. The correct answer is {self.answer}</h3>"
            feedback_widget.value += f"<p>{self.explanation}</p>"

        # Register the event handler for the button
        answer_button.on_click(on_answer_button_clicked)

        # Display the widgets
        if output:
            with output:
                display(question_widget, vbox, options, answer_button, feedback_widget)
        else:
            display(question_widget, vbox, options, answer_button, feedback_widget)


In [18]:
# Set up the API usage; details on setting up the API key are in the solution PDF file.
import constants
from openai import OpenAI

In [19]:
client = OpenAI(api_key=constants.OPENAI_API_KEY)

In [20]:
SYSTEM_PROMPT = """
You are an English teacher creating a multiple-choice question based on a given passage. Vary the difficulty of the questions from very easy to very difficult. Provide the question, four choices labeled A, B, C, and D, and indicate the correct answer. Separate the elements in the response by a new line.
You return the response following strictly the format provided in the example below.
You give no narration.

Example of the user's input:
Passage: {passage}|Difficulty: {difficulty}

Example of the response:
{"question": "question text", "choices": ["choice A", "choice B",
"choice C", "choice D"], "answer": "A, B, C, or D",
"explanation": "explanation text"}
"""

def get_mcq(client: OpenAI, document: str, difficulty: widgets.Dropdown) -> MCQ:
    """
    Generates one multiple-choice question based on the given document and difficulty level.

    Args:
        client (OpenAI): The OpenAI client object.
        document (str): The document to generate the question from.
        difficulty (widgets.Dropdown): The dropdown widget whose current `.value`
            (0 to 4) is sent to the model as the difficulty level.

    Returns:
        MCQ: The generated multiple-choice question, or None if the API call
            or the parsing of its response fails.
    """
    try:
        response = client.chat.completions.create(
            model="gpt-3.5-turbo",
            messages=[
                {"role": "system", "content": SYSTEM_PROMPT},
                {"role": "user", "content": f"Passage: {document}|Difficulty: {difficulty.value}"},
            ]
        )
        mcq = MCQ.from_json(response)
        if mcq:
            return mcq
        else:
            print("Error generating MCQ, please try again!")
            return None
    except Exception as e:
        print("Error fetching response from the API")
        print(f"Error: {e}")
        return None
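
The system prompt asks for JSON, but nothing guarantees that the model complies. As a minimal sketch of a stricter check that could run before building the `MCQ`, the hypothetical helper `validate_mcq_payload` below (not part of the notebook) rejects responses that are not valid JSON, lack a field, do not have exactly four choices, or name an answer outside A to D. It assumes the answer is returned as a single letter, as the prompt requests.

```python
import json

REQUIRED_KEYS = ("question", "choices", "answer", "explanation")


def validate_mcq_payload(json_str: str):
    """Return the parsed dict if it looks like a usable MCQ, else None (hypothetical helper)."""
    try:
        payload = json.loads(json_str)
    except json.JSONDecodeError:
        return None
    if not isinstance(payload, dict):
        return None
    # Every field consumed by MCQ.from_json must be present
    if any(key not in payload for key in REQUIRED_KEYS):
        return None
    choices = payload["choices"]
    if not isinstance(choices, list) or len(choices) != 4:
        return None
    # Assumption: the answer is one of the four letters requested in the prompt
    if str(payload["answer"]).strip() not in ("A", "B", "C", "D"):
        return None
    return payload


good = '{"question": "Q?", "choices": ["a", "b", "c", "d"], "answer": "B", "explanation": "E"}'
print(validate_mcq_payload(good) is not None)
print(validate_mcq_payload("Sure! Here is your question:"))
```

Returning `None` mirrors how `MCQ.from_json` signals failure, so a caller could retry the request in either case.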

#### Change the path of the PDF file to test the code

In [21]:
# Read the PDF with PyPDF2 and store all of its text in the variable `document`
pdf_path = "sample_doc.pdf"

document = extract_text_from_pdf(pdf_path)

if document:
    # Drop blank lines; skipped when extraction failed and returned None
    document = "\n".join(line for line in document.splitlines() if line.strip())
    print(document)

Sample document  
Bun Bo Hue  
Bun Bo Hue is a flavorful and aromatic Vietnamese noodle soup that originates from the city of 
Hue in central Vietnam. Renowned for its bold and spicy profile, this soup features a robust 
broth made with a combination of beef and pork bones, lemongrass, shrimp paste, and a 
medley of aromatic spices. The dish is typically served with round rice noodles, sliced beef, pork, 
and sometimes cubes of congealed pig's blood. What sets Bun Bo Hue apart is its complex and 
rich flavor profile, characterized by the harmonious blend of lemongrass, chili, and fermented 
shrimp paste. Topped with fresh herbs, lime wedges, and crunchy bean sprouts, this dish offers 
a delightful interplay of textures and tastes, making it a bel oved and distinctive part of 
Vietnamese cuisine. Whether enjoyed in the bustling streets of Vietnam or in Vietnamese 
restaurants around the world, Bun Bo Hue remains a culinary delight for those seeking a hearty 
and spicy noodle soup experi

In [22]:
difficulty = widgets.Dropdown(
    options=[("Very Easy", 0), ("Easy", 1), ("Medium", 2), ("Hard", 3), ("Very Hard", 4)],
    value=2,
    description="Difficulty:"
)
gen_question_button = widgets.Button(description="Generate MCQ")
output = widgets.Output()

def on_gen_question_button_clicked(_):
    print("Generating MCQ...")
    mcq = get_mcq(client, document, difficulty)
    if mcq:
        mcq.display_mcq()
    else:
        print("Failed to generate MCQ, please try again!")

gen_question_button.on_click(on_gen_question_button_clicked)

display(difficulty, gen_question_button)

Dropdown(description='Difficulty:', index=2, options=(('Very Easy', 0), ('Easy', 1), ('Medium', 2), ('Hard', 3…

Button(description='Generate MCQ', style=ButtonStyle())

### Solution 2: Fine-tuning T5ForConditionalGeneration models on the SQuAD and RACE datasets
In this solution, I will:
- Use Hugging Face's Transformers library to fine-tune a T5ForConditionalGeneration model on the SQuAD dataset so that it generates a question from a given answer and context.
- Use a second T5ForConditionalGeneration model to generate distractors from the correct answer, question, and context.

In [23]:
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

# Load the model and tokenizer from the Hugging Face model hub (my checkpoint, please check the test.py file for validation)
tokenizer_qg = AutoTokenizer.from_pretrained("levulinh/t5_question_generation_squad")
model_qg = AutoModelForSeq2SeqLM.from_pretrained("levulinh/t5_question_generation_squad")
tokenizer_dis = AutoTokenizer.from_pretrained("levulinh/t5_distraction_mctest")
model_dis = AutoModelForSeq2SeqLM.from_pretrained("levulinh/t5_distraction_mctest")

In [24]:
import torch

device = torch.device("cuda") if torch.cuda.is_available() else torch.device("cpu")
model_qg.eval()
model_qg.to(device)
model_dis.eval()
model_dis.to(device)


def get_prediction_gq(context, answer):
    inputs = tokenizer_qg(
        f"{answer} <sep> {context}", max_length=256, padding="max_length", truncation=True, add_special_tokens=True
    )

    input_ids = torch.tensor(inputs["input_ids"], dtype=torch.long).unsqueeze(0).to(device)
    attention_mask = torch.tensor(inputs["attention_mask"], dtype=torch.long).unsqueeze(0).to(device)

    outputs = model_qg.generate(input_ids=input_ids, attention_mask=attention_mask, max_new_tokens=64)

    prediction = tokenizer_qg.decode(
        outputs.flatten(),
        skip_special_tokens=True,
    )
    return prediction


def get_prediction_dis(context, answer, question):
    inputs = tokenizer_dis(
        f"{answer} <sep> {question} {context}",
        max_length=256,
        padding="max_length",
        truncation=True,
        add_special_tokens=True,
    )

    input_ids = torch.tensor(inputs["input_ids"], dtype=torch.long).unsqueeze(0).to(device)
    attention_mask = torch.tensor(inputs["attention_mask"], dtype=torch.long).unsqueeze(0).to(device)

    outputs = model_dis.generate(input_ids=input_ids, attention_mask=attention_mask, max_new_tokens=80)

    prediction = tokenizer_dis.decode(
        outputs.flatten(),
        skip_special_tokens=True,
    )
    return prediction


def parse_qa(pred_string: str):
    """Split an "answer <sep> question" prediction into (answer, question)."""
    if len(pred_split := pred_string.split("<sep>")) != 2:
        return None, None
    ans, ques = (part.strip() for part in pred_split)
    if not ques:
        # An empty question has no last character to inspect; treat it as a failure
        return None, None
    if ques[-1] in "!@#$%^&*()_+{}[]|\\:;\"'<>,./":
        # Replace the special character with a question mark
        ques = ques[:-1] + "?"
    elif ques[-1] != "?":
        # Add a question mark to the end
        ques += "?"
    return ans, ques

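def parse_dis(pred_string):
    """Split a prediction into its three "<sep>"-separated distractors."""
    if len(pred_split := pred_string.split("<sep>")) != 3:
        return None, None, None
    # Build a new list: reassigning the loop variable would leave the items unstripped
    return [dis.strip() for dis in pred_split]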

In [25]:
import random

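def gen_mcq_t5(context, answer):
    ans, ques = parse_qa(get_prediction_gq(context, answer))
    if ans is None or ques is None:
        print("Failed to parse the generated question, please try again!")
        return

    distractions = parse_dis(get_prediction_dis(context, ans, ques))
    if distractions is None or any(dis is None for dis in distractions):
        print("Failed to parse the generated distractors, please try again!")
        return

    # Insert the correct answer at a random position among the distractors
    choices_text = list(distractions)
    correct_answer_position = random.randrange(4)
    correct_answer_letter = "ABCD"[correct_answer_position]
    choices_text.insert(correct_answer_position, ans)

    # Prefix each choice with its letter
    choices = [f"{letter}. {choice}" for letter, choice in zip("ABCD", choices_text)]

    mcq = MCQ(
        question=ques,
        choices=choices,
        answer=correct_answer_letter,
        explanation="No explanation provided.",
    )

    mcq.display_mcq()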

In [26]:
context = document
answer = "[MASK]"
gen_mcq_t5(context, answer)

HTML(value='<h3>What is Bun Bo Hue?</h3>')

VBox(children=(HTML(value='A. spicy broth'), HTML(value='B.  lemongrass'), HTML(value='C.  a medley of aromati…

Select(description='Your answer', options=('A', 'B', 'C', 'D'), value='A')

Button(description='Answer', style=ButtonStyle())

HTML(value='')

In [27]:
context = """John bought a new puppy. He named the new puppy Spike. Spike was a good dog and minded
John. John took Spike to the pond behind his house. Spike loved playing in the water. John
would throw the frisbee to Spike. He would also throw a bone to Spike. Spike loved
running. Jessica came to the pond to visit John. Jessica and Tom always played with
John. Jessica was John's best friend. They both loved Spike and Spike loved them. Jessica
brought lunch to the pond. She also brought colas to the pond. They ate and Spike sat by
them being a good dog. When they were done eating they packed their lunch up. They put
Spike on his leash and they went home."""
answer = "[MASK]"

gen_mcq_t5(context, answer)

HTML(value='<h3>Who did John buy a new puppy?</h3>')

VBox(children=(HTML(value='A. Tom'), HTML(value='B.  Jessica'), HTML(value='C. Spike'), HTML(value='D.  Tom'))…

Select(description='Your answer', options=('A', 'B', 'C', 'D'), value='A')

Button(description='Answer', style=ButtonStyle())

HTML(value='')
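
In the output above, the choice "Tom" appears twice (A and D), so the generated question has fewer than four distinct options. A simple check along the following lines could flag such cases before a question is shown. `has_distinct_choices` is a hypothetical helper, and it assumes each choice uses the `"<letter>. <text>"` format built in `gen_mcq_t5`.

```python
def has_distinct_choices(choices):
    """Return True if no two choices share the same text once the letter prefix is removed."""
    # "D.  Tom" -> "tom": drop the letter prefix, surrounding spaces, and case
    texts = [choice.split(".", 1)[-1].strip().lower() for choice in choices]
    return len(set(texts)) == len(texts)


print(has_distinct_choices(["A. Tom", "B.  Jessica", "C. Spike", "D.  Tom"]))  # False
```

A failed check could trigger a regeneration of the distractors; how many retries are reasonable is left open here.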

## Extra: Keyphrase extraction

In [28]:
import pke

def extract_keyphrases(text):
    # Create a SingleRank keyphrase extraction instance
    extractor = pke.unsupervised.SingleRank()

    # Load the content of the document
    extractor.load_document(input=text, language='en', normalization=None)

    # Extract keyphrases
    extractor.candidate_selection()
    extractor.candidate_weighting()

    # Get the keyphrases with their scores
    keyphrases = extractor.get_n_best(n=10)  # You can adjust the number of keyphrases to retrieve

    return keyphrases

# Example usage
paragraph = document

keyphrases = extract_keyphrases(paragraph)

print("Extracted Keyphrases:")
for keyphrase in keyphrases:
    print(keyphrase)


Extracted Keyphrases:
('bun bo hue', 0.1325006805495492)
('aromatic vietnamese noodle soup', 0.09353026359527972)
('spicy noodle soup experience', 0.07062902794695111)
('hue', 0.052207124020506056)
('shrimp paste', 0.04913901472855818)
('crunchy bean sprouts', 0.04644398660725723)
('rich flavor profile', 0.04604739220694151)
('vietnamese cuisine', 0.04058910213613991)
('spicy profile', 0.040419626633916704)
('round rice noodles', 0.03674633066834697)
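
The notebook does not say how these keyphrases are meant to be used. One possibility, assumed here, is to pass them as the `answer` argument of `gen_mcq_t5` instead of `"[MASK]"`. The sketch below picks candidates for that purpose; `select_answer_candidates`, its `min_score` threshold, and its `max_candidates` limit are all hypothetical, and the sample scores are rounded from the output above.

```python
def select_answer_candidates(keyphrases, min_score=0.04, max_candidates=5):
    """Pick high-scoring keyphrases, skipping ones that overlap an already chosen phrase."""
    selected = []
    for phrase, score in sorted(keyphrases, key=lambda item: item[1], reverse=True):
        if score < min_score:
            break  # remaining phrases score even lower
        # Skip near-duplicates such as "hue" when "bun bo hue" is already selected
        if any(phrase in chosen or chosen in phrase for chosen in selected):
            continue
        selected.append(phrase)
        if len(selected) == max_candidates:
            break
    return selected


sample = [
    ("bun bo hue", 0.1325),
    ("aromatic vietnamese noodle soup", 0.0935),
    ("hue", 0.0522),
    ("shrimp paste", 0.0491),
    ("round rice noodles", 0.0367),
]
print(select_answer_candidates(sample))
# ['bun bo hue', 'aromatic vietnamese noodle soup', 'shrimp paste']
```

Whether the fine-tuned question-generation checkpoint handles such answers well would need to be verified, since the examples in this notebook only use `"[MASK]"`.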
