# Let's just ask Phi-3.5 Mini what the misconception is...

1: Take each test question and build a DataFrame of train questions that "support" it:
* Questions with the same "construct"
* Questions with the same "subject"
* Random questions if needed

2: For each question / answer pair in that DataFrame, generate a message sequence

3: Feed those ~15 message sequences into Phi-3.5 Mini. As the final message, give it the question / answer pair we want a prediction for.

4: Collect the misconception predictions from Phi!

5: Use BGE to generate embeddings for all misconceptions in misconception_mapping.csv

6: Use BGE to generate embeddings for the LLM-predicted misconceptions

7: Use cosine similarity to identify the 25 misconceptions in misconception_mapping.csv closest to each LLM prediction

### Credit to BGE / cosine similarity code here:
### https://www.kaggle.com/code/pingfan/baseline-bge-cos-sim/
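
Steps 5 to 7 boil down to a nearest-neighbour lookup. Here is a minimal sketch of that ranking step only, assuming the embeddings are already available as 2-D NumPy arrays. The helper name `top_k_by_cosine` and the random toy vectors are illustrative assumptions, not the code from the credited notebook, and real BGE embeddings would replace the toy data.

```python
import numpy as np

def top_k_by_cosine(pred_emb, misc_emb, k=25):
    """Hypothetical helper: for each prediction, return the indices of the
    k most similar misconception embeddings (most similar first)."""
    # L2-normalise rows so a dot product equals cosine similarity
    # (assumes no all-zero rows)
    pred = pred_emb / np.linalg.norm(pred_emb, axis=1, keepdims=True)
    misc = misc_emb / np.linalg.norm(misc_emb, axis=1, keepdims=True)
    sims = pred @ misc.T  # shape: (n_predictions, n_misconceptions)
    k = min(k, sims.shape[1])
    # Sort each row by descending similarity and keep the first k columns
    return np.argsort(-sims, axis=1)[:, :k]

# Toy stand-ins for real embeddings (random vectors, for illustration only)
rng = np.random.default_rng(0)
misc_emb = rng.normal(size=(100, 16))
# Each "prediction" is a slightly noisy copy of a known misconception row
pred_emb = misc_emb[[3, 42]] + 0.01 * rng.normal(size=(2, 16))

top25 = top_k_by_cosine(pred_emb, misc_emb, k=25)
print(top25.shape)
print(top25[:, 0])
```

If the rows of `misc_emb` follow the row order of misconception_mapping.csv, the returned indices can be read directly as MisconceptionId values, since the ids in that file match its row index.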

# Borrow some whl files to run Phi with internet off

In [1]:
!pip install -q -U transformers --no-index --find-links /kaggle/input/hf-libraries/transformers

# Usual imports / misc.

In [2]:
import sys 
import torch
import random
import numpy as np
import pandas as pd
import gc
import time
from tqdm import tqdm

from pprint import pprint
from IPython.display import display

from sklearn.metrics.pairwise import cosine_similarity

from transformers import pipeline, AutoTokenizer, AutoModelForCausalLM, AutoModel

if (not torch.cuda.is_available()): print("Sorry - GPU required!")
    
import logging
logging.getLogger('transformers').setLevel(logging.ERROR)

pd.set_option('display.max_colwidth', None)
pd.set_option('display.max_rows', None)
pd.set_option('display.width', None)

# A few configuration things...

In [3]:
# min / max example questions for prompt generation
# for each question all answers with non-NAN misconceptions will be used
min_example_questions = 5
max_example_questions = 8

#example question messages are limited to this many words
#helps ensure we don't run out of GPU RAM
#the actual prompt for the target question is not included in the word count
max_words_for_examples = 1500

#maximum new tokens Phi will generate for responses
max_new_tokens = 55

# Load up Phi Mini!
* The cell below tries to free GPU memory before loading, but if you re-run it you might still run out of GPU memory...

In [4]:
# Delete existing objects if they exist, then release cached GPU memory
# (the cache can only be freed once nothing references the old tensors)
for obj in ['model', 'pipe', 'tokenizer']:
    if obj in globals():
        del globals()[obj]
gc.collect()
if torch.cuda.is_available():
    torch.cuda.empty_cache()

# Model configuration
model_name = '/kaggle/input/phi-3.5-mini-instruct/pytorch/default/1'

# Load tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Load model
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True
)

# Create pipeline
pipe = pipeline("text-generation", model=model, tokenizer=tokenizer, trust_remote_code=True, max_new_tokens=max_new_tokens)


# Test out Phi

In [5]:
messages = [
    {"role": "user", "content": "Tell me about your math skills."},
]

pipe(messages)

[{'generated_text': [{'role': 'user',
    'content': 'Tell me about your math skills.'},
   {'role': 'assistant',
    'content': " I am Phi, a Microsoft language model, and I don't possess math skills or abilities in the traditional sense. However, I can assist with a wide range of mathematical problems, concepts, and questions.\n\nHere's how I can help with math"}]}]

# Look at Misconception Map

In [6]:
misc_map_df = pd.read_csv("/kaggle/input/eedi-mining-misconceptions-in-mathematics/misconception_mapping.csv")
misc_map_df.head(10)

Unnamed: 0,MisconceptionId,MisconceptionName
0,0,Does not know that angles in a triangle sum to 180 degrees
1,1,Uses dividing fractions method for multiplying fractions
2,2,Believes there are 100 degrees in a full turn
3,3,"Thinks a quadratic without a non variable term, can not be factorised"
4,4,Believes addition of terms and powers of terms are equivalent e.g. a + c = a^c
5,5,"When measuring a reflex angle, gives the acute or obtuse angle that sums to 360 instead"
6,6,Can identify the multiplier used to form an equivalent fraction but does not apply to the numerator
7,7,Believes gradient = change in y
8,8,Student thinks that any two angles along a straight line are equal
9,9,Thinks there are 180 degrees in a full turn


# Look at Train

In [7]:
train_df = pd.read_csv("/kaggle/input/eedi-mining-misconceptions-in-mathematics/train.csv")
train_df.head(10)

Unnamed: 0,QuestionId,ConstructId,ConstructName,SubjectId,SubjectName,CorrectAnswer,QuestionText,AnswerAText,AnswerBText,AnswerCText,AnswerDText,MisconceptionAId,MisconceptionBId,MisconceptionCId,MisconceptionDId
0,0,856,Use the order of operations to carry out calculations involving powers,33,BIDMAS,A,\[\n3 \times 2+4-5\n\]\nWhere do the brackets need to go to make the answer equal \( 13 \) ?,\( 3 \times(2+4)-5 \),\( 3 \times 2+(4-5) \),\( 3 \times(2+4-5) \),Does not need brackets,,,,1672.0
1,1,1612,Simplify an algebraic fraction by factorising the numerator,1077,Simplifying Algebraic Fractions,D,"Simplify the following, if possible: \( \frac{m^{2}+2 m-3}{m-3} \)",\( m+1 \),\( m+2 \),\( m-1 \),Does not simplify,2142.0,143.0,2142.0,
2,2,2774,Calculate the range from a list of data,339,Range and Interquartile Range from a List of Data,B,"Tom and Katie are discussing the \( 5 \) plants with these heights:\n\( 24 \mathrm{~cm}, 17 \mathrm{~cm}, 42 \mathrm{~cm}, 26 \mathrm{~cm}, 13 \mathrm{~cm} \)\nTom says if all the plants were cut in half, the range wouldn't change.\nKatie says if all the plants grew by \( 3 \mathrm{~cm} \) each, the range wouldn't change.\nWho do you agree with?",Only\nTom,Only\nKatie,Both Tom and Katie,Neither is correct,1287.0,,1287.0,1073.0
3,3,2377,Recall and use the intersecting diagonals properties of a rectangle,88,Properties of Quadrilaterals,C,The angles highlighted on this rectangle with different length sides can never be... ![A rectangle with the diagonals drawn in. The angle on the right hand side at the centre is highlighted in red and the angle at the bottom at the centre is highlighted in yellow.](),acute,obtuse,\( 90^{\circ} \),Not enough information,1180.0,1180.0,,1180.0
4,4,3387,Substitute positive integer values into formulae involving powers or roots,67,Substitution into Formula,A,The equation \( f=3 r^{2}+3 \) is used to find values in the table below. What is the value covered by the star? \begin{tabular}{|c|c|c|c|c|}\n\hline\( r \) & \( 1 \) & \( 2 \) & \( 3 \) & \( 4 \) \\\n\hline\( f \) & \( 6 \) & \( 15 \) & \( \color{gold}\bigstar \) & \\\n\hline\n\end{tabular},\( 30 \),\( 27 \),\( 51 \),\( 24 \),,,,1818.0
5,5,2052,Identify a unit of area,75,Area of Simple Shapes,D,"James has answered a question on the area of a trapezium and got an answer of \( 54 \).\n\nBehind the star he has written the units that he used.\n\n\(\n54 \, \bigstar \n\)\n\nWhich of the following units could be correct?",\( m \),\( \mathrm{cm} \),\( \mathrm{km}^{3} \),\( \mathrm{mm}^{2} \),686.0,686.0,686.0,
6,6,376,Convert two digit integer percentages to fractions,238,Converting between Fractions and Percentages,B,Convert this percentage to a fraction\n\( 62 \% \),\( \frac{62}{10} \),\( \frac{31}{50} \),\( \frac{6}{2} \),None of these,329.0,,847.0,329.0
7,7,314,Divide decimals by 10,224,Multiplying and Dividing with Decimals,A,\( 43.2 \div 10= \),\( 4.32 \),\( 0.432 \),\( 33.2 \),\( 43.02 \),,2123.0,2273.0,2133.0
8,8,435,Subtract proper fractions with different denominators which do not share a common factor,230,Adding and Subtracting Fractions,A,\(\n\frac{4}{5}-\frac{1}{3}=\frac{\bigstar}{15}\n\)\nWhat should replace the star?,\( 7 \),\( 5 \),\( 17 \),\( 3 \),,907.0,1514.0,907.0
9,9,1321,Identify horizontal translations in the form f(x) = for non-trigonometric functions,164,Transformations of functions in the form f(x),C,What transformation maps the graph of\n\(y=f(x)\n\)\nto the graph of\n\(\ny=f(x-3)\n\),Translation by vector\n\(\n\left[\begin{array}{l}\n0 \\\n3\n\end{array}\right]\n\),Translation by vector\n\(\n\left[\begin{array}{c}\n0 \\\n-3\n\end{array}\right]\n\),Translation by vector\n\(\n\left[\begin{array}{l}\n3 \\\n0\n\end{array}\right]\n\),Translation by vector\n\(\n\left[\begin{array}{r}\n-3 \\\n0\n\end{array}\right]\n\),1889.0,1234.0,,1312.0


# Look at Test

In [8]:
test_df = pd.read_csv("/kaggle/input/eedi-mining-misconceptions-in-mathematics/test.csv")
test_df.head(10)

Unnamed: 0,QuestionId,ConstructId,ConstructName,SubjectId,SubjectName,CorrectAnswer,QuestionText,AnswerAText,AnswerBText,AnswerCText,AnswerDText
0,1869,856,Use the order of operations to carry out calculations involving powers,33,BIDMAS,A,\[\n3 \times 2+4-5\n\]\nWhere do the brackets need to go to make the answer equal \( 13 \) ?,\( 3 \times(2+4)-5 \),\( 3 \times 2+(4-5) \),\( 3 \times(2+4-5) \),Does not need brackets
1,1870,1612,Simplify an algebraic fraction by factorising the numerator,1077,Simplifying Algebraic Fractions,D,"Simplify the following, if possible: \( \frac{m^{2}+2 m-3}{m-3} \)",\( m+1 \),\( m+2 \),\( m-1 \),Does not simplify
2,1871,2774,Calculate the range from a list of data,339,Range and Interquartile Range from a List of Data,B,"Tom and Katie are discussing the \( 5 \) plants with these heights:\n\( 24 \mathrm{~cm}, 17 \mathrm{~cm}, 42 \mathrm{~cm}, 26 \mathrm{~cm}, 13 \mathrm{~cm} \)\nTom says if all the plants were cut in half, the range wouldn't change.\nKatie says if all the plants grew by \( 3 \mathrm{~cm} \) each, the range wouldn't change.\nWho do you agree with?",Only\nTom,Only\nKatie,Both Tom and Katie,Neither is correct


# Generate a DF optimized for use as LLM prompting text
* Selects the requested number of example questions (each question will have multiple answers...)
* Prioritizes questions with the requested construct
* Secondarily, prioritizes questions with the requested subject
* If more questions are still needed, fills up with other train questions (the code takes the first remaining rows with head(), so the pick is deterministic rather than truly random)

In [9]:
import numpy as np

def generate_filtered_df(df, construct_id=None, subject_id=None, min_rows=5, max_rows=10, skip_question_id=None, verbose=False, random_seed=42):
    # Set the random seed for numpy and pandas
    np.random.seed(random_seed)
    
    result_df = pd.DataFrame()
    construct_count = 0
    subject_count = 0
    random_count = 0
    
    # Apply initial filter to skip the specified QuestionId
    if skip_question_id is not None:
        df = df[df['QuestionId'] != skip_question_id]
        if verbose: print(f"Skipping QuestionId {skip_question_id}")
    
    # Step 1: Filter by ConstructId if provided
    if construct_id is not None:
        construct_df = df[df['ConstructId'] == construct_id]
        result_df = pd.concat([result_df, construct_df])
        construct_count = len(result_df)
        if verbose: print(f"Matched ConstructId {construct_id}: {construct_count} rows")
    
    # Step 2: If we don't have enough rows, add rows with the specified SubjectId
    if len(result_df) < min_rows and subject_id is not None:
        subject_df = df[(df['SubjectId'] == subject_id) & ~df.index.isin(result_df.index)]
        rows_to_add = min(len(subject_df), min_rows - len(result_df))
        result_df = pd.concat([result_df, subject_df.head(rows_to_add)])  # Use head() instead of sample()
        subject_count = len(result_df) - construct_count
        if verbose: print(f"Added rows from SubjectId {subject_id}: {subject_count} rows")
    
    # Step 3: If we still don't have enough rows, top up with the first remaining rows (deterministic stand-in for a random pick)
    if len(result_df) < min_rows:
        remaining_df = df[~df.index.isin(result_df.index)]
        rows_to_add = min(len(remaining_df), min_rows - len(result_df))
        result_df = pd.concat([result_df, remaining_df.head(rows_to_add)])  # Use head() instead of sample()
        random_count = len(result_df) - (construct_count + subject_count)
        if verbose: print(f"Added random rows to meet minimum: {random_count} rows")
    
    # Step 4: If we have more than max_rows, use the first max_rows
    if len(result_df) > max_rows:
        result_df = result_df.head(max_rows)
        if verbose: print(f"Reduced to maximum: {max_rows} rows")
    
    if verbose: 
        print(f"\nFinal DataFrame composition:")
        print(f"ConstructId matches: {construct_count}")
        print(f"SubjectId matches: {subject_count}")
        print(f"Random additions: {random_count}")
        print(f"Total rows: {len(result_df)}")
    
    return result_df.reset_index(drop=True)

# Usage
filtered_df = generate_filtered_df(train_df, construct_id=2052, subject_id=75, min_rows=min_example_questions, max_rows=max_example_questions, verbose=True, random_seed=42)
filtered_df.head(10)

Matched ConstructId 2052: 2 rows
Added rows from SubjectId 75: 3 rows

Final DataFrame composition:
ConstructId matches: 2
SubjectId matches: 3
Random additions: 0
Total rows: 5


Unnamed: 0,QuestionId,ConstructId,ConstructName,SubjectId,SubjectName,CorrectAnswer,QuestionText,AnswerAText,AnswerBText,AnswerCText,AnswerDText,MisconceptionAId,MisconceptionBId,MisconceptionCId,MisconceptionDId
0,5,2052,Identify a unit of area,75,Area of Simple Shapes,D,"James has answered a question on the area of a trapezium and got an answer of \( 54 \).\n\nBehind the star he has written the units that he used.\n\n\(\n54 \, \bigstar \n\)\n\nWhich of the following units could be correct?",\( m \),\( \mathrm{cm} \),\( \mathrm{km}^{3} \),\( \mathrm{mm}^{2} \),686.0,686.0,686.0,
1,1132,2052,Identify a unit of area,196,Area Units,B,Tom and Katie are discussing units of area\nTom says centimetres is a unit of area\nKatie says \( \mathrm{mm}^{2} \) is a unit of area\nWho is correct?,Only\nTom,Only Katie,Both Tom and Katie,Neither is correct,686.0,,686.0,686.0
2,15,2578,Calculate the area of an equilateral triangle where the dimensions are given in the same units,75,Area of Simple Shapes,D,"How would you calculate the area of this triangle? ![A triangle, base 12m. All three sides are equal.]()",\( \frac{6 \times 12}{2} \),\( \frac{12 \times 12}{2} \),\( \frac{12 \times 12 \times 12}{2} \),None of these,,590.0,71.0,
3,78,2582,"Given the area of a parallelogram, calculate a missing dimension",75,Area of Simple Shapes,A,The area of this parallelogram is \( 24 \mathrm{~cm}^{2} \)\n\nWhat measurement should replace the star? ![Parallelogram with base length 6cm and the perpendicular height has a star symbol.](),\( 4 \mathrm{~cm} \),\( 2 \mathrm{~cm} \),\( 8 \mathrm{~cm} \),Not enough information,,272.0,1835.0,2397.0
4,103,2566,Calculate the area of a parallelogram where the dimensions are given in the same units,75,Area of Simple Shapes,C,"What is the area of the parallelogram? ![A parallelogram with the length labelled 10cm, the slanted height labelled 4cm, and the perpendicular height (marked with a right angle) labelled 3cm]()",\( 120 \mathrm{~cm}^{2} \),\( 40 \mathrm{~cm}^{2} \),\( 30 \mathrm{~cm}^{2} \),\( 15 \mathrm{~cm}^{2} \),867.0,669.0,,695.0


# Generate array of training message segments for a given DF
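## For each question / answer pair (with non-NaN misconception) in the DF, each segment includes:
* Question (combines ConstructName and QuestionText)
* Correct Answer
* Incorrect Answer (the one whose misconception we want explained)
* Misconception for the given answer

Also - the total size of the returned data is measured in bytes and printed when verbose=True. Nothing is truncated here; the limit that actually keeps memory usage in check is the word budget applied later, when the prompt is built.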

In [10]:
def get_train_messages_for_df(filtered_train_df, skip_nan_misconceptions=True, answers=['A', 'B', 'C', 'D'], verbose = False):
    messages = []
    current_size = 0
    
    for _, row in filtered_train_df.iterrows():
        for answer_choice in answers:
            if answer_choice == row['CorrectAnswer']:
                continue
            
            misconception_id = row[f'Misconception{answer_choice}Id']
            
            if pd.isna(misconception_id) and skip_nan_misconceptions:
                continue
            
            if not pd.isna(misconception_id):
                new_message = [
                    f"{row['ConstructName']}: {row['QuestionText']}",
                    row[f'Answer{row["CorrectAnswer"]}Text'],
                    row[f'Answer{answer_choice}Text'],
                    misc_map_df.loc[int(misconception_id), 'MisconceptionName']
                ]
                
                # Calculate size of new message
                new_message_size = sum(sys.getsizeof(item) for item in new_message)
                                
                messages.append(new_message)
                current_size += new_message_size
            
    # Print size of returned data
    if verbose: print(f"Size of returned data: {current_size} bytes")
    
    return messages

examples_sequences = get_train_messages_for_df(filtered_df, verbose = True)
examples_sequences

Size of returned data: 7523 bytes


[['Identify a unit of area: James has answered a question on the area of a trapezium and got an answer of \\( 54 \\).\n\nBehind the star he has written the units that he used.\n\n\\(\n54 \\, \\bigstar \n\\)\n\nWhich of the following units could be correct?',
  '\\( \\mathrm{mm}^{2} \\)',
  '\\( m \\)',
  'Does not know units of area should be squared'],
 ['Identify a unit of area: James has answered a question on the area of a trapezium and got an answer of \\( 54 \\).\n\nBehind the star he has written the units that he used.\n\n\\(\n54 \\, \\bigstar \n\\)\n\nWhich of the following units could be correct?',
  '\\( \\mathrm{mm}^{2} \\)',
  '\\( \\mathrm{cm} \\)',
  'Does not know units of area should be squared'],
 ['Identify a unit of area: James has answered a question on the area of a trapezium and got an answer of \\( 54 \\).\n\nBehind the star he has written the units that he used.\n\n\\(\n54 \\, \\bigstar \n\\)\n\nWhich of the following units could be correct?',
  '\\( \\mathrm{mm

# Define prompt components
* Tell the LLM what we are looking for
* Prefixes added to messages

In [11]:
#original text prefix
question_prefix = "Question:"

#LLM "response"
llm_correct_response_for_rewrite = "Provide me with the correct answer for a baseline."
llm_incorrect_response_for_rewrite = "Now - provide the incorrect answer and I will anaylze the difference to infer the misconception."

#modified text prefix
incorrect_answer_prefix = "Incorrect Answer:"
correct_answer_prefix = "Correct Answer:"

#every example response starts with this prefix; the model tends to copy it, which helps keep things relevant
response_start = "Misconception for incorrect answer: "

# Cleans LLM output

In [12]:
def clean_response(my_string, response_start):
    # Trim leading spaces first
    my_string = my_string.lstrip()
    
    # Remove response_start if present
    if my_string.startswith(response_start):
        my_string = my_string[len(response_start):]
    
    # Find indices of first period and first linefeed
    period_index = my_string.find('.')
    linefeed_index = my_string.find('\n')
    
    # Determine where to truncate
    truncate_index = len(my_string)  # Default to end of string
    if period_index != -1:
        truncate_index = period_index
    if linefeed_index != -1 and linefeed_index < truncate_index:
        truncate_index = linefeed_index
    
    # Truncate the string
    my_string = my_string[:truncate_index]
    
    return my_string.strip()
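
One thing to watch: `clean_response` cuts at the first period, so a prediction that mentions a decimal such as 0.1 gets truncated mid-number. Below is a hedged sketch of an alternative that only cuts at a period followed by whitespace or the end of the text. The name `clean_response_keep_decimals` is hypothetical and this is not what the notebook runs. It would still cut after abbreviations like "e.g.".

```python
import re

def clean_response_keep_decimals(text, response_start):
    """Hypothetical variant of clean_response: keep the first line and cut at
    a sentence-ending period, but keep periods inside numbers such as 0.5."""
    text = text.lstrip()
    # Compare against the prefix without its trailing space
    prefix = response_start.strip()
    if text.startswith(prefix):
        text = text[len(prefix):]
    # Keep only the first line of what remains
    text = text.lstrip().split("\n", 1)[0]
    # Cut at the first period that is followed by whitespace or end of string
    match = re.search(r"\.(?=\s|$)", text)
    if match:
        text = text[:match.start()]
    return text.strip()

example = "Misconception for incorrect answer:  Multiplies by 0.1 instead of 10. Extra text"
print(clean_response_keep_decimals(example, "Misconception for incorrect answer: "))
```

Whether this helps depends on how often Phi's predictions contain decimals, so it is worth checking against a sample of real outputs before swapping it in.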

# Detection logic
### Takes input of question, answer and example message segments
### Actual messages are generated for LLM
### Output is predicted misconception!

In [13]:
def get_prompt(question, correct_answer, incorrect_answer, examples_sequences, max_word_count = max_words_for_examples, verbose=False):
    messages = []
    current_word_count = 0

    def calculate_word_count(text):
        return len(text.split())

    # Iterate through example sequences, checking the total word count
    for examp_question, examp_correct_answer, examp_incorrect_answer, examp_misconception in examples_sequences:
        # Construct the combined sequence for this example
        example_messages = [
            {"role": "user", "content": f"{question_prefix} {examp_question}"},
            {"role": "assistant", "content": llm_correct_response_for_rewrite},
            {"role": "user", "content": f"{correct_answer_prefix} {examp_correct_answer}"},
            {"role": "assistant", "content": llm_incorrect_response_for_rewrite},
            {"role": "user", "content": f"{incorrect_answer_prefix} {examp_incorrect_answer}"},
            {"role": "assistant", "content": f"{response_start} {examp_misconception}"}
        ]

        # Calculate the total word count for this sequence
        example_total_words = sum(calculate_word_count(msg["content"]) for msg in example_messages)

        # Check if adding this sequence would exceed the max word count
        if current_word_count + example_total_words > max_word_count:
            break  # Stop if we would exceed the limit

        # Append the entire sequence to messages and update the word count
        messages.extend(example_messages)
        current_word_count += example_total_words

    # Always add the actual prompt
    actual_prompt_messages = [
        {"role": "user", "content": f"{question_prefix} {question}"},
        {"role": "assistant", "content": llm_correct_response_for_rewrite},
        {"role": "user", "content": f"{correct_answer_prefix} {correct_answer}"},
        {"role": "assistant", "content": llm_incorrect_response_for_rewrite},
        {"role": "user", "content": f"{incorrect_answer_prefix} {incorrect_answer}"}
    ]

    # Append the actual prompt messages
    messages.extend(actual_prompt_messages)

    if verbose:
        print("Prompt:")
        for message in messages:
            display(message)

    decoded = pipe(messages)

    return decoded

# Hard coded test of everything...
### Using a train question so we can compare the predicted misconception with the actual one.

In [14]:
sample_question_index = 35
question_letter_to_test = "C"

question = train_df.iloc[sample_question_index]
correct_question_letter = question["CorrectAnswer"]
if correct_question_letter == question_letter_to_test:
    print("WARNING: Tested letter is for a correct answer!")

question_id = question["QuestionId"]
subject_id = question["SubjectId"]
construct_id = question["ConstructId"]
question_text = f"{question['ConstructName']}: \n {question['QuestionText']}\n"
correct_answer_text = question[f"Answer{correct_question_letter}Text"]
incorrect_answer_text = question[f"Answer{question_letter_to_test}Text"]
misconception_id = question[f"Misconception{question_letter_to_test}Id"]
misconception_text = misc_map_df['MisconceptionName'].values[int(misconception_id)]

filtered_df = generate_filtered_df(train_df, construct_id=construct_id, subject_id=subject_id, min_rows=5, max_rows=7, skip_question_id = question_id)
examples_sequences = get_train_messages_for_df(filtered_df)

response = get_prompt(question_text, correct_answer_text, incorrect_answer_text, examples_sequences, verbose = True)

just_response = clean_response(response[0]['generated_text'][-1]['content'], response_start)

print("\n-----SUMMARY----")

print("Question: ", question_text)
print("Correct Answer: ", correct_answer_text)
print("Wrong Answer: ", incorrect_answer_text)

print()

print ("Predicted Misconception:\n", just_response, "\n")
print ("Actual Misconception:\n", misconception_text)

Prompt:


{'role': 'user',
 'content': 'Question: Convert decimals less than 1 with 3 or more decimal place to percentages: How do you write \\( 0.909 \\) as a percentage?'}

{'role': 'assistant',
 'content': 'Provide me with the correct answer for a baseline.'}

{'role': 'user', 'content': 'Correct Answer: \\( 90.9 \\% \\)'}

{'role': 'assistant',
 'content': 'Now - provide the incorrect answer and I will anaylze the difference to infer the misconception.'}

{'role': 'user', 'content': 'Incorrect Answer: \\( 99 \\% \\)'}

{'role': 'assistant',
 'content': 'Misconception for incorrect answer:  Does not understand the value of zeros as placeholders'}

{'role': 'user',
 'content': 'Question: Convert decimals less than 1 with 3 or more decimal place to percentages: How do you write \\( 0.909 \\) as a percentage?'}

{'role': 'assistant',
 'content': 'Provide me with the correct answer for a baseline.'}

{'role': 'user', 'content': 'Correct Answer: \\( 90.9 \\% \\)'}

{'role': 'assistant',
 'content': 'Now - provide the incorrect answer and I will anaylze the difference to infer the misconception.'}

{'role': 'user', 'content': 'Incorrect Answer: \\( 909 \\% \\)'}

{'role': 'assistant',
 'content': 'Misconception for incorrect answer:  Thinks they just remove the decimal point when converting a decimal to a percentage'}

{'role': 'user',
 'content': 'Question: Convert decimals less than 1 with 2 decimal place to percentages: \\( 0.01+57 \\%= \\)'}

{'role': 'assistant',
 'content': 'Provide me with the correct answer for a baseline.'}

{'role': 'user', 'content': 'Correct Answer: \\( 58 \\% \\)'}

{'role': 'assistant',
 'content': 'Now - provide the incorrect answer and I will anaylze the difference to infer the misconception.'}

{'role': 'user', 'content': 'Incorrect Answer: \\( 57.1 \\% \\)'}

{'role': 'assistant',
 'content': 'Misconception for incorrect answer:  Multiplies by 10 instead of 100 when converting decimal to percentage'}

{'role': 'user',
 'content': 'Question: Convert decimals less than 1 with 2 decimal place to percentages: \\( 0.01+57 \\%= \\)'}

{'role': 'assistant',
 'content': 'Provide me with the correct answer for a baseline.'}

{'role': 'user', 'content': 'Correct Answer: \\( 58 \\% \\)'}

{'role': 'assistant',
 'content': 'Now - provide the incorrect answer and I will anaylze the difference to infer the misconception.'}

{'role': 'user', 'content': 'Incorrect Answer: \\( 67 \\% \\)'}

{'role': 'assistant',
 'content': 'Misconception for incorrect answer:  Multiplies by 1000 instead of 100 when converting a decimal to a percentage'}

{'role': 'user',
 'content': 'Question: Convert decimals less than 1 with 2 decimal place to percentages: Write \\( 0.15 \\) as a percentage.'}

{'role': 'assistant',
 'content': 'Provide me with the correct answer for a baseline.'}

{'role': 'user', 'content': 'Correct Answer: \\( 15 \\% \\)'}

{'role': 'assistant',
 'content': 'Now - provide the incorrect answer and I will anaylze the difference to infer the misconception.'}

{'role': 'user', 'content': 'Incorrect Answer: \\( 0.15 \\% \\)'}

{'role': 'assistant',
 'content': 'Misconception for incorrect answer:  Thinks you need to just add a % sign to a decimal to convert to a percentage'}

{'role': 'user',
 'content': 'Question: Convert decimals less than 1 with 2 decimal place to percentages: Write \\( 0.15 \\) as a percentage.'}

{'role': 'assistant',
 'content': 'Provide me with the correct answer for a baseline.'}

{'role': 'user', 'content': 'Correct Answer: \\( 15 \\% \\)'}

{'role': 'assistant',
 'content': 'Now - provide the incorrect answer and I will anaylze the difference to infer the misconception.'}

{'role': 'user', 'content': 'Incorrect Answer: \\( 1.5 \\% \\)'}

{'role': 'assistant',
 'content': 'Misconception for incorrect answer:  Multiplies by 10 instead of 100 when converting decimal to percentage'}

{'role': 'user',
 'content': 'Question: Convert decimals greater than 1 to percentages: How do you write \\( 4.6 \\) as a percentage?'}

{'role': 'assistant',
 'content': 'Provide me with the correct answer for a baseline.'}

{'role': 'user', 'content': 'Correct Answer: \\( 460 \\% \\)'}

{'role': 'assistant',
 'content': 'Now - provide the incorrect answer and I will anaylze the difference to infer the misconception.'}

{'role': 'user', 'content': 'Incorrect Answer: \\( 46 \\% \\)'}

{'role': 'assistant',
 'content': 'Misconception for incorrect answer:  Multiplies by 10 instead of 100 when converting decimal to percentage'}

{'role': 'user',
 'content': 'Question: Convert decimals greater than 1 to percentages: How do you write \\( 4.6 \\) as a percentage?'}

{'role': 'assistant',
 'content': 'Provide me with the correct answer for a baseline.'}

{'role': 'user', 'content': 'Correct Answer: \\( 460 \\% \\)'}

{'role': 'assistant',
 'content': 'Now - provide the incorrect answer and I will anaylze the difference to infer the misconception.'}

{'role': 'user', 'content': 'Incorrect Answer: \\( 0.046 \\% \\)'}

{'role': 'assistant',
 'content': 'Misconception for incorrect answer:  Divides instead of multiplies when converting a decimal to a percentage'}

{'role': 'user',
 'content': 'Question: Convert decimals greater than 1 to percentages: How do you write \\( 4.6 \\) as a percentage?'}

{'role': 'assistant',
 'content': 'Provide me with the correct answer for a baseline.'}

{'role': 'user', 'content': 'Correct Answer: \\( 460 \\% \\)'}

{'role': 'assistant',
 'content': 'Now - provide the incorrect answer and I will anaylze the difference to infer the misconception.'}

{'role': 'user', 'content': 'Incorrect Answer: \\( 4.6 \\% \\)'}

{'role': 'assistant',
 'content': 'Misconception for incorrect answer:  Thinks you need to just add a % sign to a decimal to convert to a percentage'}

{'role': 'user',
 'content': 'Question: Convert decimals less than 1 with 2 decimal place to percentages: Write \\( 0.07 \\) as a percentage.'}

{'role': 'assistant',
 'content': 'Provide me with the correct answer for a baseline.'}

{'role': 'user', 'content': 'Correct Answer: \\( 7 \\% \\)'}

{'role': 'assistant',
 'content': 'Now - provide the incorrect answer and I will anaylze the difference to infer the misconception.'}

{'role': 'user', 'content': 'Incorrect Answer: \\( 70 \\% \\)'}

{'role': 'assistant',
 'content': 'Misconception for incorrect answer:  Does not understand the value of zeros as placeholders'}

{'role': 'user',
 'content': 'Question: Convert decimals less than 1 with 2 decimal place to percentages: Write \\( 0.07 \\) as a percentage.'}

{'role': 'assistant',
 'content': 'Provide me with the correct answer for a baseline.'}

{'role': 'user', 'content': 'Correct Answer: \\( 7 \\% \\)'}

{'role': 'assistant',
 'content': 'Now - provide the incorrect answer and I will anaylze the difference to infer the misconception.'}

{'role': 'user', 'content': 'Incorrect Answer: \\( 0.07 \\% \\)'}

{'role': 'assistant',
 'content': 'Misconception for incorrect answer:  Thinks you need to just add a % sign to a decimal to convert to a percentage'}

{'role': 'user',
 'content': 'Question: Convert decimals less than 1 with 1 decimal place to percentages: \n Convert this decimal to a percentage\n\\[\n0.6\n\\]\n'}

{'role': 'assistant',
 'content': 'Provide me with the correct answer for a baseline.'}

{'role': 'user', 'content': 'Correct Answer: \\( 60 \\% \\)'}

{'role': 'assistant',
 'content': 'Now - provide the incorrect answer and I will anaylze the difference to infer the misconception.'}

{'role': 'user', 'content': 'Incorrect Answer: \\( 0.6 \\% \\)'}


-----SUMMARY----
Question:  Convert decimals less than 1 with 1 decimal place to percentages: 
 Convert this decimal to a percentage
\[
0.6
\]

Correct Answer:  \( 60 \% \)
Wrong Answer:  \( 0.6 \% \)

Predicted Misconception:
 Adds a % sign without multiplying by 100 to convert a decimal to a percentage 

Actual Misconception:
 Thinks you need to just add a % sign to a decimal to convert to a percentage


# Generate Misconceptions for all Test
* Cycle through all non-correct answers in Test
* We are generating novel text - not trying to match with actual provided options

In [15]:
def process_test_questions():
    results = []
    start_time = time.time()
    total_items = 0
    
    for question_index in range(len(test_df)):
        question = test_df.iloc[question_index]
        question_id = question["QuestionId"]
        subject_id = question["SubjectId"]
        construct_id = question["ConstructId"]
        correct_answer = question["CorrectAnswer"]
        question_text = f"{question['ConstructName']}: \n {question['QuestionText']}\n"
        
        for answer_choice in ['A', 'B', 'C', 'D']:
            # Skip the correct answer: only incorrect answers get a misconception prediction
            if answer_choice != correct_answer:
                incorrect_answer_text = question[f"Answer{answer_choice}Text"]
                correct_answer_text = question[f"Answer{correct_answer}Text"]

                # Pick supporting train questions (same construct / subject), skipping this question's id
                filtered_df = generate_filtered_df(train_df, construct_id=construct_id, subject_id=subject_id, min_rows=min_example_questions, max_rows=max_example_questions, skip_question_id=question_id)
                examples_sequences = get_train_messages_for_df(filtered_df)
                response = get_prompt(question_text, correct_answer_text, incorrect_answer_text, examples_sequences)
                # The last message in the generated conversation is the model's reply
                just_response = clean_response(response[0]['generated_text'][-1]['content'], response_start)
                
                results.append({
                    'QuestionId_Answer': f"{question_id}_{answer_choice}",
                    'MiscPredText': just_response
                })
                
                total_items += 1
                print(".", end="", flush=True)
    
    end_time = time.time()
    total_time = end_time - start_time
    avg_time_per_item = total_time / total_items if total_items > 0 else 0
    
    print(f"\nTotal execution time: {total_time:.2f} seconds")
    print(f"Total items processed: {total_items}")
    print(f"Average time per item: {avg_time_per_item:.2f} seconds")
    print(f"Time for 1000 questions * 3 incorrect answers (3000 items): {(avg_time_per_item * 3000) / 3600} hours")

    return pd.DataFrame(results)

# Example usage:
predicted_misc = process_test_questions()
predicted_misc

.........
Total execution time: 63.92 seconds
Total items processed: 9
Average time per item: 7.10 seconds
Time for 1000 questions * 3 incorrect answers (3000 items): 5.918576540770354 hours
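
The last line of the output extrapolates the measured per-item time to the full workload. A minimal sketch of that arithmetic, using the rounded figures printed above (the function name `estimate_hours` is hypothetical, not part of the notebook):

```python
def estimate_hours(elapsed_seconds, items_processed, target_items):
    """Extrapolate a measured per-item time to a larger workload, in hours."""
    if items_processed <= 0:
        raise ValueError("items_processed must be positive")
    seconds_per_item = elapsed_seconds / items_processed
    return seconds_per_item * target_items / 3600


# Rounded figures from the run above: 63.92 s for 9 items, scaled to 3000 items
hours = estimate_hours(63.92, 9, 3000)
print(f"{hours:.2f} hours")  # 5.92 hours
```

The estimate rests on only 9 items, so it should be read as a rough guide rather than a precise runtime.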


| | QuestionId_Answer | MiscPredText |
|---|---|---|
| 0 | 1869_B | Applies multiplication before addition within the brackets, but does not recognize that the brackets should encompass the entire addition and subtraction operation |
| 1 | 1869_C | Places the subtraction inside the parentheses, which changes the order of operations |
| 2 | 1869_D | Confuses the order of operations, believes addition comes before multiplication without considering the need for brackets to alter the order of operations |
| 3 | 1870_A | Incorrectly factors the numerator as if it were a simple quadratic, not recognizing that the quadratic can be factored into a product of binomials |
| 4 | 1870_B | Incorrectly factors the numerator as if it were a simple quadratic, not recognizing that the quadratic can be factored into a product of binomials |
| 5 | 1870_C | Incorrectly factors the numerator or incorrectly applies the division to the terms in the numerator |
| 6 | 1871_A | Believes if you change all values by the same proportion (either cutting in half or adding a constant amount), the range of the data set would not change |
| 7 | 1871_C | Believes if you changed all values by the same proportion the range would not change |
| 8 | 1871_D | Believes if you added the same value to all numbers in the dataset the range will change |


# Next: Use cosine similarity to map predicted misconceptions back to actual options in Misconception Map
#### Credit to BGE / cosine similarity code here:
#### https://www.kaggle.com/code/pingfan/baseline-bge-cos-sim/

In [16]:
device = "cuda:0"

bge_tokenizer = AutoTokenizer.from_pretrained('/kaggle/input/bge-large-en-v1.5/transformers/default/1/bge-large-en-v1.5')
bge_model = AutoModel.from_pretrained('/kaggle/input/bge-large-en-v1.5/transformers/default/1/bge-large-en-v1.5')
bge_model.eval()
bge_model.to(device)
print("Done!")

Done!


# Get embeddings for all possible misconceptions in misconception_mapping.csv
* This takes only about 10 seconds on GPU for the 2587 possible misconceptions (324 batches of 8, the last one holding 3)
* This suggests that mapping the ~3000 question / answer pairs will also take an acceptable amount of time

In [17]:
start_time = time.time()

MisconceptionName = list(misc_map_df['MisconceptionName'].values)
per_gpu_batch_size = 8

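def prepare_inputs(text, tokenizer, device):
    # Tokenize a batch of strings, padding to the longest one in the batch
    tokenizer_outputs = tokenizer(
        text,
        padding        = True,
        return_tensors = 'pt',
        max_length     = 512,   # BGE is BERT-based and accepts at most 512 tokens
        truncation     = True
    )
    # Move the tensors the model needs onto the target device
    result = {
        'input_ids': tokenizer_outputs.input_ids.to(device),
        'attention_mask': tokenizer_outputs.attention_mask.to(device),
    }
    return result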

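all_ctx_vector = []
for mini_batch in tqdm(range(0, len(MisconceptionName), per_gpu_batch_size)):
    mini_context = MisconceptionName[mini_batch:mini_batch + per_gpu_batch_size]
    encoded_input = prepare_inputs(mini_context, bge_tokenizer, device)
    # Inference only: no_grad avoids building the autograd graph and saves GPU memory
    with torch.no_grad():
        # [0] is the last hidden state; [:, 0] takes the first ([CLS]) token as the sentence embedding
        sentence_embeddings = bge_model(**encoded_input)[0][:, 0]
    # L2-normalize so that cosine similarity reduces to a dot product
    sentence_embeddings = torch.nn.functional.normalize(sentence_embeddings, p=2, dim=1)
    all_ctx_vector.append(sentence_embeddings.detach().cpu().numpy())

all_ctx_vector = np.concatenate(all_ctx_vector, axis=0)
# Note: this prints the shape of the last mini-batch only; all_ctx_vector.shape gives the full matrix
print("Sentence embeddings:", sentence_embeddings.shape)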

# Stop the timer
end_time = time.time()

# Calculate and print the execution time
execution_time = end_time - start_time
print(f"Execution time: {execution_time:.2f} seconds")

100%|██████████| 324/324 [00:10<00:00, 30.82it/s]

Sentence embeddings: torch.Size([3, 1024])
Execution time: 10.52 seconds
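
The progress bar and the printed shape above follow from how the list is sliced into mini-batches. A small sketch of that arithmetic, assuming the mapping holds 2587 misconceptions (a count inferred here from the 324 batches and the final batch of 3, not printed by the notebook); `batch_sizes` is a hypothetical helper:

```python
def batch_sizes(n_items, batch_size):
    """Sizes of the slices produced by range(0, n_items, batch_size)."""
    return [min(batch_size, n_items - start) for start in range(0, n_items, batch_size)]


sizes = batch_sizes(2587, 8)
print(len(sizes), sizes[-1])  # 324 3
```

The final batch of 3 is what `sentence_embeddings.shape` reports, since that variable still refers to the last mini-batch after the loop ends.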





# Get embeddings for all of our LLM predicted misconceptions

In [18]:
test_texts = list(predicted_misc['MiscPredText'].values)
all_text_vector = []
per_gpu_batch_size = 8

for mini_batch in tqdm(range(0, len(test_texts), per_gpu_batch_size)):
    mini_context = test_texts[mini_batch:mini_batch + per_gpu_batch_size]
    encoded_input = prepare_inputs(mini_context, bge_tokenizer, device)
    # Inference only, so gradients are not needed
    with torch.no_grad():
        sentence_embeddings = bge_model(**encoded_input)[0][:, 0]
    sentence_embeddings = torch.nn.functional.normalize(sentence_embeddings, p=2, dim=1)

    all_text_vector.append(sentence_embeddings.detach().cpu().numpy())

all_text_vector = np.concatenate(all_text_vector, axis=0)
print(all_text_vector.shape)

100%|██████████| 2/2 [00:00<00:00, 29.12it/s]

(9, 1024)





# Get cosine similarities between LLM predictions and misconception mappings

In [19]:
test_cos_sim_arr = cosine_similarity(all_text_vector, all_ctx_vector)
test_sorted_indices = np.argsort(-test_cos_sim_arr, axis=1)

test_sorted_indices[:, :25]

array([[ 328,  871, 2143, 2181,  488, 1941, 2131, 1672, 2051, 1862, 1361,
         911, 1432, 1971, 1679, 1516, 2306,  220, 1666, 2586,  373, 1345,
        2085, 1119,  376],
       [1054, 1971, 1941,  234, 1514, 2586, 1963, 2273,  467, 2001, 2530,
        1190, 1887, 2181,  228, 1274,  969, 2518,  997,  172, 1875, 1241,
        1345, 1119, 1916],
       [1672,   15, 1941,  158, 2586, 1119,  516,  520, 2518, 2488,  373,
        1999, 1971, 1316,  638, 1205,  674, 2143, 1890, 1856, 2306,   74,
        1549, 1432,  481],
       [1755,  403, 1808, 1670,  891, 1610, 1044,  167,   78, 1432,  102,
        2247, 2173,  265,  340,  838, 2291, 1628,   58,  754,  346, 1976,
         347,    3,  744],
       [1755,  403, 1808, 1670,  891, 1610, 1044,  167,   78, 1432,  102,
        2247, 2173,  265,  340,  838, 2291, 1628,   58,  754,  346, 1976,
         347,    3,  744],
       [ 403,  891, 1928,  652, 1755,  561, 2458, 1119,   58,  363,  538,
          78,  609, 1181, 2272, 2547, 1624,  975, 1
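
Because both sets of embeddings were L2-normalized, the cosine similarity computed above is just a dot product, and sorting the negated scores ranks candidates from most to least similar. A dependency-free toy sketch of that ranking on 2-D vectors (the helper names and the vectors are made up for illustration; the notebook itself uses `cosine_similarity` and `np.argsort`):

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (a zero vector is returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return list(vec) if norm == 0 else [x / norm for x in vec]


def top_k_by_cosine(query, candidates, k):
    """Return the indices of the k candidates most similar to the query."""
    q = l2_normalize(query)
    scored = []
    for idx, cand in enumerate(candidates):
        c = l2_normalize(cand)
        # For unit vectors the dot product equals the cosine similarity
        scored.append((sum(a * b for a, b in zip(q, c)), idx))
    # Highest similarity first; ties broken by the lower index
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [idx for _, idx in scored[:k]]


toy_candidates = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
print(top_k_by_cosine([2.0, 0.1], toy_candidates, k=2))  # [0, 2]
```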

# Look at an example LLM prediction vs. what we are submitting:

In [20]:
example_question = 2
print("LLM predicted misconception:\n", predicted_misc['MiscPredText'].values[example_question] )
print("\nClosest cosine similarities from misconception map:")
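# test_sorted_indices holds row positions in all_ctx_vector; using them as MisconceptionId
# assumes misc_map_df is ordered by MisconceptionId starting at 0
for misconception_id in test_sorted_indices[example_question, :25]:
    print(misc_map_df.loc[misc_map_df['MisconceptionId'] == misconception_id]['MisconceptionName'].values)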

LLM predicted misconception:
 Confuses the order of operations, believes addition comes before multiplication without considering the need for brackets to alter the order of operations

Closest cosine similarities from misconception map:
['Confuses the order of operations, believes addition comes before multiplication ']
['Confuses the order of operations, believes addition comes before division']
['Confuses the order of operations, believes subtraction comes before multiplication ']
['Believes addition comes before indices, in orders of operation']
['Misunderstands order of operations in algebraic expressions']
['Does not interpret the correct order of operations in a fraction when there is an addition on the numerator']
['Thinks the inverse of multiplication is addition']
['Thinks the inverse of addition is multiplication']
['May have made a calculation error using the order of operations']
['Answers order of operations questions with brackets as if the brackets are not there']
['Bel

# Generate submission / submit!

In [21]:
results = []

# Iterate through each row of predicted_misc and corresponding sorted indices
for (_, row), indices in zip(predicted_misc.iterrows(), test_sorted_indices):
    # Get the QuestionId_Answer
    question_id_answer = row['QuestionId_Answer']
    
    # Get the top 25 misconception indices and join them as a space-separated string
    top_25_indices = ' '.join(map(str, indices[:25]))
    
    # Append the result to our list
    results.append({
        'QuestionId_Answer': question_id_answer,
        'MisconceptionId': top_25_indices
    })

# Create the submission dataframe
submission_df = pd.DataFrame(results)

submission_df.to_csv("submission.csv", index=False)

submission_df.head(10)

| | QuestionId_Answer | MisconceptionId |
|---|---|---|
| 0 | 1869_B | 328 871 2143 2181 488 1941 2131 1672 2051 1862 1361 911 1432 1971 1679 1516 2306 220 1666 2586 373 1345 2085 1119 376 |
| 1 | 1869_C | 1054 1971 1941 234 1514 2586 1963 2273 467 2001 2530 1190 1887 2181 228 1274 969 2518 997 172 1875 1241 1345 1119 1916 |
| 2 | 1869_D | 1672 15 1941 158 2586 1119 516 520 2518 2488 373 1999 1971 1316 638 1205 674 2143 1890 1856 2306 74 1549 1432 481 |
| 3 | 1870_A | 1755 403 1808 1670 891 1610 1044 167 78 1432 102 2247 2173 265 340 838 2291 1628 58 754 346 1976 347 3 744 |
| 4 | 1870_B | 1755 403 1808 1670 891 1610 1044 167 78 1432 102 2247 2173 265 340 838 2291 1628 58 754 346 1976 347 3 744 |
| 5 | 1870_C | 403 891 1928 652 1755 561 2458 1119 58 363 538 78 609 1181 2272 2547 1624 975 1479 1081 1804 1134 289 623 2579 |
| 6 | 1871_A | 1287 1073 129 2575 787 676 365 946 1923 324 1297 1105 837 985 181 2250 1150 1958 2275 1853 153 164 2151 475 1627 |
| 7 | 1871_C | 1287 1073 365 2151 676 2250 129 787 2532 804 2439 164 475 322 324 1923 1105 1297 1237 1958 2575 2548 1349 2323 2552 |
| 8 | 1871_D | 1073 1287 365 1923 2151 1105 1797 2552 1157 1342 2330 1677 1817 1415 1441 1474 322 1182 475 912 2548 711 1198 1079 2412 |
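
Before uploading, it can help to confirm that every row carries exactly 25 distinct ids. A minimal sketch with two hypothetical helpers (`format_prediction`, `is_valid_prediction`); the 25-id requirement is taken from the cell above, not from an official format specification:

```python
def format_prediction(ranked_ids, k=25):
    """Join the first k ranked ids into the space-separated submission string."""
    return " ".join(str(i) for i in ranked_ids[:k])


def is_valid_prediction(prediction, k=25):
    """Check that a prediction holds exactly k distinct non-negative integer ids."""
    ids = prediction.split()
    return len(ids) == k and len(set(ids)) == k and all(s.isdigit() for s in ids)


# Made-up ranking of 40 ids; only the first 25 are kept
row = format_prediction(range(100, 140))
print(is_valid_prediction(row))  # True
```

Applied to the notebook's data, the same check could be run over `submission_df['MisconceptionId']` before writing the CSV.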
