# Student Simulation Experiment for Question Difficulty Estimation

This notebook simulates student responses and estimates question difficulty using data from Zapien – an edtech platform that applies Item Response Theory (IRT) for adaptive math learning.

### Data Source
- Student–question interactions
- Topic-specific IRT ability scores

### Experiment Steps
1. **Create Student Dataset**: Select 50 students with broad topic coverage and extract their topic-specific IRT ability scores.
2. **Select Questions**: Identify 10 questions with high interaction counts and stable difficulty scores.
3. **Simulate Student Responses**: Use LLMs (via Gemini) to simulate how these students would answer the questions (including ability scores in the prompt).
4. **Extended Experiment**: Add ability scores from foundational topics by identifying prerequisites via LLM.
5. **Analysis**: Calculate difficulty scores from simulated answers and compare with the actual ones.
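
The notebook does not state at this point how difficulty is computed from the simulated answers in step 5. As a minimal sketch, assuming a one-parameter (Rasch) model in which P(correct) = 1 / (1 + exp(-(theta - b))) and student abilities are held fixed, the difficulty b of a single question could be estimated by maximum likelihood. The function name `estimate_rasch_difficulty` and the toy data are hypothetical, and the platform's actual IRT model may differ.

```python
import math


def estimate_rasch_difficulty(abilities, responses, n_iter=50, bound=4.0):
    """Maximum-likelihood Rasch difficulty for one item, abilities held fixed.

    Assumes P(correct) = 1 / (1 + exp(-(theta - b))). This model choice is an
    illustration, not necessarily what the platform uses.
    """
    b = 0.0
    for _ in range(n_iter):
        probs = [1.0 / (1.0 + math.exp(-(theta - b))) for theta in abilities]
        # Derivative of the log-likelihood with respect to b is sum(p - x)
        gradient = sum(p - x for p, x in zip(probs, responses))
        curvature = sum(p * (1.0 - p) for p in probs)
        if curvature < 1e-12:
            break
        step = gradient / curvature
        # Clamp so that all-correct or all-wrong items do not diverge
        b = max(-bound, min(bound, b + step))
        if abs(step) < 1e-10:
            break
    return b


# Toy data: 1 = correct, 0 = incorrect (made up for illustration)
abilities = [-2.0, -1.0, 1.0, 2.0]
responses = [0, 0, 1, 1]
difficulty = estimate_rasch_difficulty(abilities, responses)
print(f"Estimated difficulty: {difficulty:.2f}")
```

An estimate obtained this way can be compared directly with the `difficulty` column only if both are on the same scale, which is an assumption here.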


## 01. Setup and Library Imports

In this section we import the required libraries, load environment variables, and set up the LLM client libraries (Gemini and DeepSeek) and utility functions.

In [1]:
import os
import json
import re
from dotenv import load_dotenv

import pandas as pd
import matplotlib.pyplot as plt

# Import Google Generative AI (Gemini) vendor
import google.generativeai as genai
from google.generativeai.types import generation_types
# Import the OpenAI client, which is used to call DeepSeek's OpenAI-compatible API
from openai import OpenAI


# Import utility functions (e.g., extract_options, get_image_base64, display_rows)
from utils import *

# Load environment variables
load_dotenv()

print("Setup complete. Environment variables loaded and libraries imported.")

Setup complete. Environment variables loaded and libraries imported.


## 02. Load Raw Data

Load the master dataset containing student–question interactions and IRT ability scores.

In [2]:
# Load the master dataset with all statistics
df_original = pd.read_csv('../data/new/master.csv')
print(f"Loaded master dataset with {len(df_original):,} rows.")

Loaded master dataset with 280,979 rows.


### Dataset Overview

Let’s examine some basic information about the dataset: row and column counts, missing values per column, and column data types.

In [3]:
print("\nDataset Overview:")
print(f"Total rows: {len(df_original):,}")
print(f"Total columns: {len(df_original.columns)}")

print("\nMissing Values:")
missing = df_original.isnull().sum()
print(missing[missing > 0])

print("\nKey Column Types:")
print("- Numeric columns:", df_original.select_dtypes(include=['int64', 'float64']).columns.tolist())
print("- Categorical columns:", df_original.select_dtypes(include=['object']).columns.tolist())
print("- Boolean columns:", df_original.select_dtypes(include=['bool']).columns.tolist())


Dataset Overview:
Total rows: 280,979
Total columns: 25

Missing Values:
grade_id       45260
grade_name     45260
school_id      45260
school_name    45260
user_level      8967
solution           1
hint              79
dtype: int64

Key Column Types:
- Numeric columns: ['answer_id', 'user_id', 'grade_id', 'school_id', 'user_level', 'question_id', 'difficulty', 'topic_id', 'subject_id', 'axis_id', 'guide_id', 'template_id']
- Categorical columns: ['created_at', 'grade_name', 'school_name', 'options', 'question_title', 'correct_option', 'solution', 'hint', 'topic_name', 'subject_name', 'axis_name', 'student_answer']
- Boolean columns: ['is_correct']


## 03. Create Student Dataset

We now build a dataset that aggregates each student's topic-specific information.

**Steps:**
- Sort the data by `answer_id` (chronological order).
- Group by `user_id` and `topic_id` to aggregate statistics (e.g., count of questions, latest ability level).
- Clean up column names for clarity.

In [4]:
# Sort the dataset chronologically by answer_id
df_original = df_original.sort_values('answer_id')

# Group by user_id and topic_id; aggregate statistics
user_topic_stats = df_original.groupby(['user_id', 'topic_id']).agg({
    'answer_id': ['max', 'count'],  # 'max' for the latest answer; 'count' for questions attempted
    'user_level': 'last',            # last non-null value after sorting
    'topic_name': 'first',           # constant per topic
    'subject_id': 'first',
    'subject_name': 'first',
    'axis_id': 'first',
    'axis_name': 'first'
}).reset_index()

# Flatten the multi-level columns
user_topic_stats.columns = [
    'user_id', 'topic_id', 'last_answer_id', 'num_questions', 'user_level',
    'topic_name', 'subject_id', 'subject_name', 'axis_id', 'axis_name'
]

print("Student-topic statistics created.")

Student-topic statistics created.
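
As an alternative to flattening the multi-level columns by hand, pandas named aggregation assigns the output column names inside `agg` itself. The sketch below uses a toy DataFrame with made-up values; only the column names are taken from the notebook.

```python
import pandas as pd

# Toy interactions; the values are made up for illustration
toy = pd.DataFrame({
    "answer_id": [1, 2, 3, 4],
    "user_id": [10, 10, 10, 20],
    "topic_id": [452, 452, 453, 452],
    "user_level": [0.1, 0.4, -1.2, 2.0],
})

toy_stats = (
    toy.sort_values("answer_id")
    .groupby(["user_id", "topic_id"])
    .agg(
        last_answer_id=("answer_id", "max"),    # latest answer in the pair
        num_questions=("answer_id", "count"),   # questions attempted
        user_level=("user_level", "last"),      # most recent ability level
    )
    .reset_index()
)
print(toy_stats)
```

With this form the column list cannot drift out of sync with the aggregation dictionary.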


In [5]:
# Filter student-topic pairs: only keep rows with >5 questions and a valid user_level
print("\nBefore filtering:")
print(f"Total rows: {len(user_topic_stats)}")
print(f"Unique users: {len(user_topic_stats['user_id'].unique())}")

filtered_stats = user_topic_stats[(user_topic_stats['num_questions'] > 5) & (user_topic_stats['user_level'].notna())]

print("\nAfter filtering (only pairs with >5 questions and valid user_level):")
print(f"Total rows: {len(filtered_stats)} | Unique users: {len(filtered_stats['user_id'].unique())}")


Before filtering:
Total rows: 39468
Unique users: 1893

After filtering (only pairs with >5 questions and valid user_level):
Total rows: 15620 | Unique users: 1605


### Explore a Random User's Topic Levels

Display the topic levels for a randomly selected user from the filtered dataset.

In [6]:
# Pick a random user and display their topic information
example_user = filtered_stats['user_id'].sample(n=1).iloc[0]
user_topics = filtered_stats[filtered_stats['user_id'] == example_user]

print(f"\nUser {example_user}'s topic levels:")
print("--------------------------------")
for _, row in user_topics.iterrows():
    print(f"Topic: {row['topic_name']}")
    print(f"Level: {row['user_level']:.2f}")
    print(f"Questions attempted: {row['num_questions']}")
    print("--------------------------------")


User 16387's topic levels:
--------------------------------
Topic: Suma y resta de números racionales
Level: 1.43
Questions attempted: 10
--------------------------------
Topic: Multiplicación y división de números racionales
Level: 1.62
Questions attempted: 10
--------------------------------
Topic: Transformación de decimal a fracción
Level: 0.89
Questions attempted: 10
--------------------------------
Topic: Fracciones como porcentajes
Level: -2.18
Questions attempted: 10
--------------------------------
Topic: Ecuaciones con coeficientes racionales
Level: 2.25
Questions attempted: 40
--------------------------------
Topic: Función exponencial
Level: 1.81
Questions attempted: 10
--------------------------------
Topic: Resolución de ecuaciones de segundo grado
Level: -0.96
Questions attempted: 13
--------------------------------
Topic: Teorema de pitágoras
Level: 1.27
Questions attempted: 10
--------------------------------
Topic: Medidas de tendencia central con datos agrupados
Lev

### Identify Top 50 Users in a Specific Axis

For axis 24, count how many topics each user has and select the top 50 users.

In [7]:
# For axis 24, count topics per user and select top 50
axis_24_topics = filtered_stats[filtered_stats['axis_id'] == 24].groupby('user_id')['topic_id'].count().reset_index()
axis_24_topics.columns = ['user_id', 'num_topics']
axis_24_topics = axis_24_topics.sort_values('num_topics', ascending=False)

top_50_users = axis_24_topics.head(50)
print("\nTop 50 users in Axis 24 based on number of topics:")
print(top_50_users)


Top 50 users in Axis 24 based on number of topics:
      user_id  num_topics
863     51959          43
973     52992          31
892     52253          26
864     51962          25
883     52172          25
914     52584          24
905     52423          23
495     36903          23
940     52912          23
947     52922          22
878     52164          22
918     52645          21
880     52167          20
972     52991          19
708     50494          18
924     52733          18
965     52979          17
409     23076          16
406     23068          16
975     52995          15
718     50504          15
872     52036          15
944     52917          14
894     52295          14
987     53024          14
867     51986          13
229     16540          13
701     50487          13
354     22853          13
408     23075          13
705     50491          12
899     52381          12
228     16539          12
951     52933          12
204     16509          12
177     1647

### Build User–Topic Matrix

For the top 50 users (axis 24), create a pivot table that maps each user to their topic level. This matrix will later be updated with simulated topic levels.

In [8]:
# Filter detailed data for the top 50 users in axis 24
top_50_data = filtered_stats[(filtered_stats['axis_id'] == 24) & (filtered_stats['user_id'].isin(top_50_users['user_id']))]

# Create a pivot table: rows are users, columns are topics, values are user_level
top_50_users_matrix = top_50_data.pivot(index='user_id', columns='topic_id', values='user_level')

print("\nUser-Topic Level Matrix for Top 50 Users created.")

# Save the matrix for later use
top_50_users_matrix.to_csv('user_topic_levels.csv')
print("User-topic level matrix saved to 'user_topic_levels.csv'.")


User-Topic Level Matrix for Top 50 Users created.
User-topic level matrix saved to 'user_topic_levels.csv'.


## 04. Define LLM Functions

In this section we define helper functions for interacting with the Gemini model and for creating student profiles and prompts.

### LLM Response Function

This function sends a prompt to the Gemini model and returns its response text.

In [2]:
def get_gemini_response(prompt: str, max_retries: int = 3) -> str:
    """
    Get a response from the Gemini model with retry logic.
    """
    import time
    
    for attempt in range(max_retries):
        try:
            # Configure the API key using an environment variable
            genai.configure(api_key=os.getenv('GOOGLE_API_KEY'))
            
            # Initialize the Gemini model
            model = genai.GenerativeModel(model_name="gemini-2.0-flash-lite-preview-02-05")
            
            # Start a chat session
            chat_session = model.start_chat(history=[])
            
            try:
                # Try to get response
                response = chat_session.send_message(prompt)
                return response.text
            except generation_types.StopCandidateException:
                # If we get StopCandidateException, just try again
                print("Received StopCandidateException, retrying...")
                continue
                
        except Exception as e:
            if "429" in str(e) and attempt < max_retries - 1:
                wait_time = 30  # seconds
                print(f"\nQuota exceeded. Waiting {wait_time} seconds before retry {attempt + 1}/{max_retries}...")
                time.sleep(wait_time)
                continue
            else:
                raise
    
    # All attempts were used up without a successful response
    raise RuntimeError("Maximum retries reached without successful response")
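
The function above waits a fixed 30 seconds after a quota error. A common variation is exponential backoff, where the wait doubles on every retry. The sketch below is a generic wrapper, not part of the notebook: `call_with_backoff`, its parameters, and the `flaky` stub are hypothetical names, and the check for "429" in the error message mirrors the one used above.

```python
import time


def call_with_backoff(func, max_retries=3, base_delay=1.0, sleep=time.sleep):
    """Call func(); on a rate-limit error, wait base_delay * 2**attempt and retry."""
    for attempt in range(max_retries):
        try:
            return func()
        except Exception as exc:
            is_rate_limit = "429" in str(exc)
            # Re-raise anything that is not a rate limit, or the final failure
            if not is_rate_limit or attempt == max_retries - 1:
                raise
            sleep(base_delay * 2 ** attempt)


# Stub that fails twice with a rate-limit message, then succeeds
calls = {"count": 0}


def flaky():
    calls["count"] += 1
    if calls["count"] < 3:
        raise RuntimeError("429 quota exceeded")
    return "ok"


waits = []  # record the requested delays instead of actually sleeping
result = call_with_backoff(flaky, sleep=waits.append)
print(result, waits)
```

Passing `sleep` as a parameter keeps the wrapper testable without real delays.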

As a backup, we also use the DeepSeek R1 model.

In [21]:
def get_deepseek_response(prompt: str) -> str:
    """
    Get a response from the DeepSeek model.
    
    Args:
        prompt (str): The prompt to send to the model.
    
    Returns:
        str: The model's response text.
    """
    
    # Initialize the OpenAI client with DeepSeek configuration
    client = OpenAI(api_key=os.getenv('DEEPSEEK_API_KEY'), base_url="https://api.deepseek.com")
    
    # Create the message payload
    messages = [{"role": "user", "content": prompt}]
    
    # Send request to DeepSeek model
    response = client.chat.completions.create(
        model="deepseek-reasoner",
        messages=messages
    )
    
    # Return the response content
    return response.choices[0].message.content

print("LLM response function defined.")

LLM response function defined.
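
One way to wire the backup model in is a small wrapper that tries a primary function and falls back to a second one if the first raises. This is a sketch under assumptions: `get_response_with_fallback` is a hypothetical helper, and the stubs below stand in for `get_gemini_response` and `get_deepseek_response` so the example runs without API keys.

```python
def get_response_with_fallback(prompt, primary, backup):
    """Try primary(prompt); if it raises, return backup(prompt) instead."""
    try:
        return primary(prompt)
    except Exception as exc:
        print(f"Primary model failed ({exc}); falling back to backup model.")
        return backup(prompt)


# Stubs standing in for the real model functions
def failing_primary(prompt):
    raise RuntimeError("primary unavailable")


def echo_backup(prompt):
    return f"backup answer to: {prompt}"


reply = get_response_with_fallback("2 + 2?", failing_primary, echo_backup)
print(reply)
```

In the notebook, the call would presumably look like `get_response_with_fallback(prompt, get_gemini_response, get_deepseek_response)`.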


### Student Profile and Prompt Functions

These functions create a JSON profile of a student's topic levels (for a given axis) and then build a prompt that instructs the LLM to fill in missing values.

In [10]:
def create_student_profile(user_id: int, df: pd.DataFrame, axis_id: int = 24) -> dict:
    """
    Creates a JSON profile of a student's topic levels for a specific axis.
    
    Args:
        user_id (int): The student's ID
        df (pd.DataFrame): DataFrame containing the original data
        axis_id (int): The axis ID to filter topics (default is 24)
    
    Returns:
        dict: The student's topic profile with topics and corresponding levels
    """
    # Get all topics for the specified axis
    all_topics = df[df['axis_id'] == axis_id][['topic_id', 'topic_name']].drop_duplicates()
    
    # Get the student's latest level for each topic
    user_levels = df[(df['user_id'] == user_id) & (df['axis_id'] == axis_id)].groupby('topic_id')['user_level'].last()
    
    profile = {}
    for _, row in all_topics.iterrows():
        topic_id = str(row['topic_id'])  # For JSON compatibility
        level = user_levels.get(row['topic_id'])  # None if the student has no data for this topic
        # Treat NaN like a missing level so it is not passed to the LLM as nan
        level = round(float(level), 2) if pd.notna(level) else None
        profile[topic_id] = {
            "name": row['topic_name'],
            "level": level
        }
    
    return profile


def create_llm_prompt(profile: dict) -> str:
    """
    Creates a prompt for the LLM to fill in missing topic levels in a student's profile.
    
    Args:
        profile (dict): The student's topic profile
    
    Returns:
        str: The formatted prompt with the profile embedded
    """
    prompt = """You are an expert in student assessment and mathematics education. Given a student's known topic proficiency levels, fill in the missing levels (marked as null) based on topic relationships and mathematical progression.

Rules:
- Levels range from -3 (lowest) to 3 (highest).
- Use topic names to understand prerequisites and relationships.
- Consider that mastery in foundational topics indicates higher ability in related topics.
- Missing values should be logically consistent with known values.

Here is the student's current profile:

{profile}

Return ONLY a JSON with the same structure but with all null values filled in. Maintain the exact same format. """
    
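    # Serialize as JSON so missing levels appear as null, matching the instructions in the prompt
    return prompt.format(profile=json.dumps(profile, ensure_ascii=False, indent=2))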

print("Student profile and prompt functions defined.")

Student profile and prompt functions defined.
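
The prompt tells the model that missing levels are marked as null, which is how JSON represents Python's `None`. The short example below contrasts the Python representation of a profile with its JSON serialization. The topic IDs are hypothetical; the topic names are taken from the dataset output shown earlier.

```python
import json

# Hypothetical topic IDs; one known level and one missing level
example_profile = {
    "452": {"name": "Función exponencial", "level": 1.81},
    "453": {"name": "Teorema de pitágoras", "level": None},
}

as_repr = str(example_profile)
as_json = json.dumps(example_profile, ensure_ascii=False, indent=2)

print(as_repr)  # Python repr: single quotes, None
print(as_json)  # JSON: double quotes, null
```

Only the JSON form round-trips through `json.loads`, which is what the update step uses to parse the model's reply.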


## 05. Update Student Profiles Using the LLM

For each of the top 50 users, we create a student profile, generate an LLM prompt, get the response from Gemini, and update the corresponding topic levels. Finally, we merge in student age data.

In [6]:
# Copy the top 50 user matrix to store updated topic levels
updated_users = top_50_users_matrix.copy()

for user_id in updated_users.index:
    print(f"\nProcessing user {user_id}...")
    
    # Create the student's profile and generate the LLM prompt
    profile = create_student_profile(user_id, df_original, axis_id=24)
    prompt = create_llm_prompt(profile)
    
    # Get LLM response
    response = get_gemini_response(prompt)
    
    # Clean the response from markdown code fences
    cleaned_response = response.strip()
    if cleaned_response.startswith('```json'):
        cleaned_response = cleaned_response[7:]
    if cleaned_response.endswith('```'):
        cleaned_response = cleaned_response[:-3]
    
    try:
        updated_profile = json.loads(cleaned_response)
        
        # Update the corresponding topics for the user
        for topic_id, topic_data in updated_profile.items():
            topic_id_int = int(topic_id)
            if topic_id_int in updated_users.columns:
                updated_users.loc[user_id, topic_id_int] = topic_data['level']
        print(f"Successfully updated user {user_id}")
    except json.JSONDecodeError as e:
        print(f"Error parsing response for user {user_id}: {e}")
        continue

# Save the updated user profiles
updated_users.to_csv('updated_user_profiles.csv')
print("Updated user profiles saved to 'updated_user_profiles.csv'.")

NameError: name 'top_50_users_matrix' is not defined
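
The fence-stripping step in the loop above only handles replies that open with a fence tagged `json`. A slightly more tolerant parser can be sketched with a regular expression that removes any leading fence tag and trailing fence. The helper names `parse_profile_response` and `clamp_level` are hypothetical, and clamping to the range -3 to 3 is an assumption based on the range stated in the prompt.

```python
import json
import re


def parse_profile_response(text):
    """Strip optional Markdown code fences and parse the remaining JSON.

    Returns None when the text is not valid JSON.
    """
    cleaned = re.sub(r"^`{3}[a-zA-Z]*\s*|\s*`{3}$", "", text.strip())
    try:
        return json.loads(cleaned)
    except json.JSONDecodeError:
        return None


def clamp_level(level, low=-3.0, high=3.0):
    """Keep a predicted level inside the range stated in the prompt."""
    return max(low, min(high, float(level)))


fence = "`" * 3  # built programmatically to keep this example readable
raw = fence + 'json\n{"452": {"name": "x", "level": 3.7}}\n' + fence

parsed = parse_profile_response(raw)
level = clamp_level(parsed["452"]["level"])
print(parsed, level)
```

Returning `None` instead of raising lets the calling loop skip a user and continue, as the original `except` branch does.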

In [12]:
# Load student age information and merge with updated user profiles
grade_age = pd.read_csv('../data/original/grade_age.csv')

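# Keep one grade per user: the most recent non-missing grade_id (df_original is sorted by answer_id)
user_grades = df_original.dropna(subset=['grade_id'])[['user_id', 'grade_id']].drop_duplicates('user_id', keep='last')
user_ages = user_grades.merge(grade_age, on='grade_id', how='left')[['user_id', 'student_age']]

# Align on user_id; users without grade data get NaN
updated_users['student_age'] = user_ages.set_index('user_id')['student_age']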

print("\nUpdated user profiles with student age:")
print(updated_users.head())
print(f"\nUsers with age data: {updated_users['student_age'].notna().sum()}")

updated_users.to_csv('updated_user_profiles.csv')
print("Saved updated users with age data to 'updated_user_profiles.csv'.")

## 06. Create Question Dataset

Extract question metadata and response counts from the raw data. Here we focus on questions from topic 452.

In [13]:
# Sort data by answer_id (chronological order)
df_sorted = df_original.sort_values('answer_id')

# Group by question_id to extract metadata and count responses
questions_metadata = df_sorted.groupby('question_id').agg({
    'difficulty': 'last',
    'topic_id': 'first',
    'topic_name': 'first',
    'subject_id': 'first',
    'subject_name': 'first',
    'axis_id': 'first',
    'axis_name': 'first',
    'question_title': 'first',
    'options': 'first',
    'correct_option': 'first',
    'solution': 'first',
    'student_answer': 'first',
    'answer_id': 'count'
}).reset_index()

# Rename count column
questions_metadata = questions_metadata.rename(columns={'answer_id': 'response_count'})

# Filter for questions in topic 452 and sort by response count
questions_metadata = questions_metadata[questions_metadata['topic_id'] == 452].sort_values('response_count', ascending=False)

print("Questions metadata for topic 452 extracted.")
print(f"Shape: {questions_metadata.shape}")

In [14]:
# Select the top 30 questions and create a histogram of difficulty values
top_30_questions = questions_metadata.head(30)

plt.figure(figsize=(10, 6))
plt.hist(top_30_questions['difficulty'], bins=15, color='#64748B', edgecolor='#0F172A')
plt.xlabel('Difficulty')
plt.ylabel('Count')
plt.title('Distribution of Difficulty for Top 30 Most Answered Questions')
plt.grid(True, alpha=0.3)
plt.show()

In [15]:
# Save the top 30 questions for further processing
top_30_questions.to_csv('top_30_questions.csv', index=False)
print("Top 30 questions saved to 'top_30_questions.csv'.")

Top 30 questions saved to 'top_30_questions.csv'.


### Extract and Map Options

The questions store all options in a single column, in various formats. We use the utility function `extract_options` to split them into separate columns, then check whether the provided `correct_option` and `student_answer` match any of the extracted options and map them to letter codes (a–e).

In [16]:
# Load the top 30 questions CSV
top_30_questions = pd.read_csv('top_30_questions.csv')

# Extract options into separate columns (option_a through option_e)
options_extracted = top_30_questions['options'].apply(extract_options).tolist()
options_df = pd.DataFrame(options_extracted, index=top_30_questions.index, 
                           columns=['option_a', 'option_b', 'option_c', 'option_d', 'option_e'])
top_30_questions = pd.concat([top_30_questions, options_df], axis=1)

print("Options extracted into separate columns.")

In [17]:
# Check if correct_option and student_answer match any of the extracted option columns
option_cols = ['option_a', 'option_b', 'option_c', 'option_d', 'option_e']

# Compare each option column with the answer column row by row, then check for any match
correct_matches_mask = top_30_questions[option_cols].eq(top_30_questions['correct_option'], axis=0).any(axis=1)
student_matches_mask = top_30_questions[option_cols].eq(top_30_questions['student_answer'], axis=0).any(axis=1)

top_30_questions['correct_option_matches'] = correct_matches_mask
top_30_questions['student_answer_matches'] = student_matches_mask

print(f"Number of rows where correct_option doesn't match any option: { (~correct_matches_mask).sum() }")
print(f"Number of rows where student_answer doesn't match any option: { (~student_matches_mask).sum() }")

In [18]:
# Map correct_option and student_answer to the letter of the option they match
option_letters = ['a', 'b', 'c', 'd', 'e']
for source_col, letter_col in [('correct_option', 'correct_option_letter'),
                               ('student_answer', 'student_answer_letter')]:
    top_30_questions[letter_col] = None
    for letter in option_letters:
        matches = top_30_questions[source_col] == top_30_questions[f'option_{letter}']
        top_30_questions.loc[matches, letter_col] = letter

print("Mapped answer letters for correct_option and student_answer.")
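
The same mapping can be expressed for a single question as a small function, which is easier to test in isolation. This is a sketch, assuming the options within a question are distinct; `answer_to_letter` is a hypothetical helper, not one of the functions in `utils`.

```python
def answer_to_letter(options, answer):
    """Return the letter (a-e) of the option equal to `answer`, or None."""
    for letter, option in zip("abcde", options):
        if option == answer:
            return letter
    return None


# Toy options made up for illustration
options = ["1/2", "3/4", "0.5", "2", "5"]
print(answer_to_letter(options, "0.5"))
print(answer_to_letter(options, "7"))
```

An answer that matches no option returns `None`, which corresponds to the rows counted by the mismatch checks above.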

In [19]:
# Save the final top 30 questions to CSV
top_30_questions.to_csv('top_30_questions_final.csv', index=False)
print("Final top 30 questions saved to 'top_30_questions_final.csv'.")

## 07. First Experiment: Simulate Student Response Based on Topic Level

In this experiment we simulate a student answering a math question based solely on their topic level. The process involves:

1. **Creating a Question Prompt Template**
2. **Defining a Helper Function to Aggregate Multiple LLM Responses**
3. **Running the Experiment**: Iterating through questions and users, simulating responses, and recording results.

### Step 1: Create the Question Prompt Template

This function builds a prompt that instructs the LLM to simulate a student’s response to a math question based on their ability level.

In [3]:
def create_question_prompt(question_data: dict, student_level: float) -> str:
    """
    Creates a prompt for the LLM to simulate a student's response to a math question.
    
    Args:
        question_data (dict): Dictionary containing question metadata (topic, subject, options, etc.)
        student_level (float): The student's ability level for the topic
    
    Returns:
        str: The formatted prompt
    """
    prompt = f"""### Begin Role ###

You are simulating a student taking a mathematics assessment. Your role is to authentically embody a student with specific academic traits, knowledge gaps, and behavioral patterns. You should think, reason, and respond in a way that realistically reflects:

- Your assigned academic abilities and limitations

- Your emotional state during assessments

- Your typical problem-solving approach

- Your common misconceptions and mistakes

- Your level of confidence in different topics

- Your past experiences with similar problems

- Your attention span and focus level

- Your test-taking strategies and habits

Remember that you are not an AI, but rather a student with human characteristics, including occasional self-doubt, varying motivation levels, and the potential to make mistakes that align with your profile.

### End Role ###

### Begin Question Context ###
As a student, you are facing a question with these characteristics:

Topic: {question_data['topic_name']}
Subject: {question_data['subject_name']}
Axis: {question_data['axis_name']}
Question: {question_data['question_title']}

Options:
a) {question_data['option_a']}
b) {question_data['option_b']}
c) {question_data['option_c']}
d) {question_data['option_d']}
e) {question_data['option_e']}
### End Question Context ###

### Begin Student Academic Profile ###
You are a {int(question_data['student_age']) if not pd.isna(question_data['student_age']) else 15}-year-old student. This is your academic profile and current knowledge state:

The topic of the question is "{question_data['topic_name']}"

Your mathematical ability in this specific topic is {student_level:.2f}. This is a psychometric measure that goes from -3 (low ability) to 3 (high ability). Use this rubric to understand the skill level for this student on this topic:

| **Theta Range**   | **Verbal Description** |
|-------------------|------------------------|
| -3 to -2.5        | **Very Low Ability**: The student is struggling significantly with the topic. They may lack foundational skills and often rely on guessing or external help to attempt answers. This level indicates a need for intensive support and practice with foundational concepts. |
| -2.5 to -1.5      | **Low Ability**: The student shows a basic understanding but frequently makes errors. They grasp some concepts but struggle to apply them consistently. They would benefit from targeted instruction and guided practice to build up their skills. |
| -1.5 to -0.5      | **Below Average Ability**: The student has a rudimentary understanding and can sometimes apply concepts, though with some inconsistencies. They may succeed in simpler problems but require more support for complex ones. |
| -0.5 to 0.5       | **Average Ability**: The student has a foundational understanding of the topic and can generally solve straightforward problems. They may struggle with complex problems but show potential with additional practice and guidance. |
| 0.5 to 1.5        | **Above Average Ability**: The student has a good grasp of the topic and can solve most problems correctly. They may occasionally make errors in more advanced questions but are generally proficient and can work through challenges with some effort. |
| 1.5 to 2.5        | **High Ability**: The student demonstrates strong skills in the topic, solving both simple and complex problems with minimal errors. They can work independently, make inferences, and apply knowledge in varied contexts. |
| 2.5 to 3          | **Very High Ability**: The student has mastered the topic, showing exceptional skill and understanding. They solve problems accurately, efficiently, and can approach the topic creatively. They are ready for advanced concepts and require minimal support. |

Keep this profile in mind as you approach this question - your responses should authentically reflect your academic level, knowledge state, and typical problem-solving patterns.
### End Student Academic Profile ###

### Begin Instructions ###
Read the question.

Think through the problem as a student at your level would:

- Consider what parts you understand

- Note what might confuse you

- Think about similar problems you might have seen before

Select your answer based on:

- Your current skill level

- Your historical performance in this topic

- Typical mistakes someone at your level might make
### End Instructions ###

### Begin Response Format ###
You should respond in this way:
Thinking: Here goes your thinking process. You should write a few sentences describing your initial thoughts about the problem, how you would approach it, and any potential pitfalls you might encounter. Consider the options. Are there any options that could confuse you away from the correct answer? Consider the difficulty of the question for a student of your age and skill level, would you get it correctly? Finally, you should mention your final choice and some rationale around why you chose it.
Answer: Respond with the letter of your chosen option in the format [[letter]]. It is IMPORTANT that you use the two square brackets so that I can easily extract your answer.
For example, if you choose option a, respond:
Answer: [[a]]
### End Response Format ###

### Begin Important Reminder ###
Remember, you are simulating a student with your academic profile. Your response should reflect this ability level in both your thinking process and answer choice. Don't necessarily write the correct answer, but try to simulate what a student with your profile would do. What errors would you make? What misconceptions would you have? What options would you consider? What options would confuse you away from the correct answer?
### End Important Reminder ###"""
    return prompt
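
The rubric in the prompt maps a theta value to one of seven verbal bands. For inspecting which band a given student falls into, the mapping can be sketched with `bisect`. The rubric's ranges share their endpoints, so this sketch assumes a boundary value belongs to the upper band; `describe_theta` is a hypothetical helper and is not used by the prompt itself.

```python
from bisect import bisect_right

# Boundaries between consecutive bands, taken from the rubric in the prompt
BAND_EDGES = [-2.5, -1.5, -0.5, 0.5, 1.5, 2.5]
BAND_LABELS = [
    "Very Low Ability",
    "Low Ability",
    "Below Average Ability",
    "Average Ability",
    "Above Average Ability",
    "High Ability",
    "Very High Ability",
]


def describe_theta(theta):
    """Map an ability score to its rubric band (edge values go to the upper band)."""
    return BAND_LABELS[bisect_right(BAND_EDGES, theta)]


for theta in (-2.18, 0.0, 2.25):
    print(f"{theta:+.2f} -> {describe_theta(theta)}")
```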


### Step 2: Helper Function to Determine the Majority Answer

This function takes a list of responses and returns the most common one. In case of a tie, it returns the tied answer that appears first in the list.

In [4]:
def get_majority_answer(responses: list) -> str:
    """
    Returns the most common answer from a list of responses.
    In case of a tie, returns the tied answer that appears first in the list.
    """
    from collections import Counter
    counts = Counter(responses)
    return counts.most_common(1)[0][0]

print("Majority answer function defined.")

Majority answer function defined.
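
The tie-breaking behavior comes from `Counter.most_common`, which orders elements with equal counts by first appearance (Python 3.7 and later). The sketch below demonstrates this and adds a hypothetical variant, `majority_ignoring_invalid`, that leaves out responses labeled `'invalid'` so that unparseable replies cannot win the vote. Whether invalid replies should be excluded is a design choice this example assumes, not something the notebook prescribes.

```python
from collections import Counter


def majority_ignoring_invalid(responses, invalid_label="invalid"):
    """Majority vote over valid responses only; falls back to invalid_label."""
    valid = [r for r in responses if r != invalid_label]
    if not valid:
        return invalid_label
    return Counter(valid).most_common(1)[0][0]


# Three-way tie: the first answer seen wins
tie_winner = Counter(["b", "a", "c"]).most_common(1)[0][0]
print(tie_winner)

# Two invalid replies and one valid reply: the valid reply wins
print(majority_ignoring_invalid(["invalid", "invalid", "a"]))
```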


### Step 3: Run the Experiment

This function iterates over each question (from the top 30) and for each top user, it:

- Constructs the question prompt using the student’s level
- Simulates three LLM responses
- Uses a majority vote to determine the final answer
- Records and saves the results for analysis
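
The answer-extraction step in the list above can be sketched in isolation. The function below applies the `[[letter]]` convention requested in the prompt and returns `'invalid'` when no marker is found. `extract_answer_letter` is a hypothetical name; the pattern is the same regular expression the experiment loop uses.

```python
import re

ANSWER_PATTERN = re.compile(r"\[\[([a-e])\]\]")


def extract_answer_letter(llm_response):
    """Return the letter inside the first [[letter]] marker, or 'invalid'."""
    match = ANSWER_PATTERN.search(llm_response.strip().lower())
    return match.group(1) if match else "invalid"


# Made-up replies for illustration
good_reply = "Thinking: I am not sure about the exponent rule.\nAnswer: [[C]]"
bad_reply = "Answer: c"

print(extract_answer_letter(good_reply))
print(extract_answer_letter(bad_reply))
```

Because the search returns the first marker, a reply that quotes the format example before giving its real answer would be misread; taking the last match instead is one possible safeguard.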

In [5]:
def run_experiment():
    # Load questions and updated user profiles
    print("Loading data files...")
    questions = pd.read_csv('top_30_questions_final.csv')
    users = pd.read_csv('updated_user_profiles.csv', index_col='user_id')
    print(f"Loaded {len(questions)} questions and {len(users)} users")
    
    total_questions = len(questions)
    total_users = len(users)
    total_iterations = total_questions * total_users
    current_iteration = 0
    
    # Check if partial results exist
    if os.path.exists('experiment_results.csv'):
        print("Loading existing results...")
        results_df = pd.read_csv('experiment_results.csv')
        # Skip questions that already have saved results
        completed_questions = results_df['question_id'].unique()
        questions = questions[~questions['question_id'].isin(completed_questions)]
        print(f"Resuming from question {len(completed_questions) + 1}")
    else:
        results_df = pd.DataFrame()
    
    print("\nStarting experiment iterations...")
    
    for q_idx, question in questions.iterrows():
        question_results = []
        print(f"\nProcessing question {q_idx + 1}/{total_questions} (ID: {question['question_id']})")
        print("Question:")
        print(question['question_title'])
        print("Options:")
        print(f"a) {question['option_a']}")
        print(f"b) {question['option_b']}")
        print(f"c) {question['option_c']}")
        print(f"d) {question['option_d']}")
        print(f"e) {question['option_e']}")
        
        for user_idx, user_id in enumerate(users.index):
            current_iteration += 1
            print(f"\nIteration {current_iteration}/{total_iterations}")
            print(f"Processing user {user_idx + 1}/{total_users} (ID: {user_id})")
            
            student_level = users.loc[user_id, '452']
            print(f"Student level: {student_level}")
            
            question_data = {
                'topic_name': question['topic_name'],
                'subject_name': question['subject_name'],
                'axis_name': question['axis_name'],
                'question_title': question['question_title'],
                'option_a': question['option_a'],
                'option_b': question['option_b'],
                'option_c': question['option_c'],
                'option_d': question['option_d'],
                'option_e': question['option_e'],
                'student_age': users.loc[user_id, 'student_age']
            }
            
            responses = []
            print("Getting LLM responses...")
            for attempt in range(3):
                print(f"Attempt {attempt + 1}/3")
                prompt = create_question_prompt(question_data, student_level)
                print("Prompt sent to LLM:")
                print(prompt)
                
                llm_response = get_gemini_response(prompt).strip().lower()
                print("Full LLM response:")
                print(llm_response)
                
                # Extract answer in the format [[letter]]
                answer_match = re.search(r'\[\[([a-e])\]\]', llm_response)
                if answer_match:
                    response_letter = answer_match.group(1)
                    print(f"Valid response received: {response_letter}")
                else:
                    response_letter = 'invalid'
                    print("Invalid response format received")
                responses.append(response_letter)
            
            final_answer = get_majority_answer(responses)
            print(f"Final answer (majority vote): {final_answer}")
            
            question_results.append({
                'question_id': question['question_id'],
                'user_id': user_id,
                'student_level': student_level,
                'llm_answer': final_answer,
                'correct_answer': question['correct_option_letter'],
                'response_1': responses[0],
                'response_2': responses[1],
                'response_3': responses[2]
            })
            
            progress = (current_iteration / total_iterations) * 100
            print(f"Overall progress: {progress:.1f}%")
        
        # Save results after each question is completed by all users
        question_df = pd.DataFrame(question_results)
        results_df = pd.concat([results_df, question_df], ignore_index=True)
        results_df.to_csv('experiment_results.csv', index=False)
        print(f"Results saved for question {question['question_id']}")
    
    print("\nExperiment completed.")
    return results_df

print("Experiment function defined.")

Experiment function defined.
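The answer-parsing step inside `run_experiment` can be pulled out and checked without calling the LLM. This is a minimal sketch that reuses the same `[[letter]]` pattern; the helper name `extract_answer` is hypothetical and not part of the notebook.

```python
import re

# Same pattern as in run_experiment: a single letter a-e wrapped in double brackets
ANSWER_PATTERN = re.compile(r"\[\[([a-e])\]\]")

def extract_answer(llm_response):
    """Return the first [[letter]] answer in lowercase, or 'invalid' if none is found.

    Hypothetical helper; mirrors the strip/lower/search logic used in the loop.
    """
    match = ANSWER_PATTERN.search(llm_response.strip().lower())
    return match.group(1) if match else "invalid"

parsed_ok = extract_answer("After substituting x=2 and y=1, I choose [[E]]")
parsed_bad = extract_answer("I am not sure which option is right.")
```

Because `re.search` returns the first match, a response that contains several bracketed letters is scored by the first one.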


In [6]:
# Run the experiment and calculate accuracy
results_df = run_experiment()

# Share of simulated answers that match the correct option
# ('invalid' answers never match, so they count as incorrect)
accuracy = (results_df['llm_answer'] == results_df['correct_answer']).mean()
print(f"\nAccuracy: {accuracy:.2%}")

Loading data files...
Loaded 30 questions and 50 users
Loading existing results...
Resuming from question 4

Starting experiment iterations...

Processing question 4/30 (ID: 17333)
Question:
¿Cuál es el valor de k si la solución del siguiente sistema de ecuaciones es `(2,1)`?<br>`3kx + y = 31`<br>`2x-ky = -1`<br><br>
Options:
a) `-2`
b) `0`
c) `1`
d) `4`
e) `5`

Iteration 1/1500
Processing user 1/50 (ID: 9753)
Student level: -1.93
Getting LLM responses...
Attempt 1/3
Prompt sent to LLM:
### Begin Role ###

You are simulating a student taking a mathematics assessment. Your role is to authentically embody a student with specific academic traits, knowledge gaps, and behavioral patterns. You should think, reason, and respond in a way that realistically reflects:

- Your assigned academic abilities and limitations

- Your emotional state during assessments

- Your typical problem-solving approach

- Your common misconceptions and mistakes

- Your level of confidence in different topics

- Y

KeyboardInterrupt: 
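The overall accuracy computed in cell 6 can also be broken down per question, which is closer to what a difficulty estimate needs. The sketch below uses toy rows with made-up values in the same column layout as `experiment_results.csv`; the names `toy_results` and `is_correct` are illustrative only.

```python
import pandas as pd

# Toy rows (made-up values) following the experiment_results.csv column layout
toy_results = pd.DataFrame({
    "question_id": [101, 101, 102, 102],
    "llm_answer": ["a", "a", "c", "invalid"],
    "correct_answer": ["a", "a", "c", "c"],
})

# A simulated answer is correct only when it matches the correct option exactly
toy_results["is_correct"] = toy_results["llm_answer"] == toy_results["correct_answer"]

# Overall share of correct answers, as in cell 6
overall_accuracy = toy_results["is_correct"].mean()

# Share of correct answers per question
per_question_accuracy = toy_results.groupby("question_id")["is_correct"].mean()
print(per_question_accuracy)
```

A lower per-question share of correct simulated answers suggests a harder question, which is the quantity the later analysis compares against the actual difficulty scores.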