# Question Embedding Updater
**Description:**  
This Jupyter notebook streamlines updating the training data for question generation. Given a folder of questions and their associated files, it extracts the relevant content, generates embeddings, and compiles the data into a CSV file. That CSV serves as the updated training dataset for question generation.


## Part 1: Upload Folder and Extract Content
Set `path` in the cell below to the folder that contains all the question data.

In [106]:
import os
import pandas as pd
from typing import Dict

def extract_questions_to_dataframe(directory_path: str) -> pd.DataFrame:
    """
    Extracts questions and their associated files from a directory structure and 
    organizes the data into a Pandas DataFrame with one row per question.
    
    Args:
        directory_path (str): The path to the root directory containing question folders.
        
    Returns:
        pd.DataFrame: A DataFrame indexed by question title, with one column per file name.
    """
    data = {}  # Dictionary to store data for each question title
    
    # Traverse the directory structure
    for current_path, subdirectories, filenames in os.walk(directory_path):
        # Check if the current directory is a leaf directory (no subdirectories)
        if not subdirectories:
            question_title = os.path.basename(current_path)  # Use directory name as question title
            
            # Store file contents for the current question
            file_contents = {}
            for filename in filenames:
                file_path = os.path.join(current_path, filename)
                try:
                    with open(file_path, 'r', encoding='utf-8') as file:
                        file_contents[filename] = file.read()
                except Exception as e:
                    print(f"Error reading file {file_path}: {e}")
            
            # Add the file contents to the data dictionary under the question title
            data[question_title] = file_contents
    
    # Create a DataFrame with one row per question title and one column per file name
    df = pd.DataFrame.from_dict(data, orient='index')
    return df


path = r"C:\Users\lberm\Downloads\ME175-20241228T184344Z-001"
df = extract_questions_to_dataframe(path)
df.rename_axis("question_title", axis='index', inplace=True)
df.head()


question_title,info.json,question.html,server.js,solution.html
ddSwitchToSL,"{\n ""title"": ""DoubleDecliningBalanceDepreci...",<pl-question-panel>\n <p>Create a table sho...,const math = require('mathjs'); \n\nconst gene...,<pl-solution-panel>\n \n <pl-hint lev...
doubleDecline,"{\n ""title"": ""DoubleDecliningBalanceDepreci...",<pl-question-panel>\n <p>Create a table sho...,const math = require('mathjs');\n//const { bve...,"<pl-solution-panel>\n <pl-hint level=""1"" da..."
MARCSDeduction5year,"{\n ""title"": ""MacrsDepreciationCalculation""...",<pl-question-panel>\n <p>Using the MACRS de...,const math = require('mathjs');\n\nconst MACRS...,"<pl-solution-panel>\n <pl-hint level=""1"" da..."
MARCSDeduction7year,"{\n ""title"": ""MacrsDepreciationCalculation""...",<pl-question-panel>\n <p>Using the MACRS de...,const math = require('mathjs');\n\nconst MACRS...,"<pl-solution-panel>\n <pl-hint level=""1"" da..."
straightLineMethod,"{\n ""title"": ""MillingMachineDepreciation"",\...",<pl-question-panel>\n <p>A milling machine ...,const math = require('mathjs');\n\nconst gener...,"<pl-solution-panel>\n <pl-hint level=""1"" da..."
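
The extraction above treats every directory that has no subdirectories as one question and uses the folder name as its title. A minimal sketch of that leaf-directory rule, run against a throwaway tree (the folder and file names here are made up for illustration, and `find_leaf_directories` is a hypothetical helper, not part of the notebook):

```python
import os
import tempfile


def find_leaf_directories(root: str) -> dict:
    """Map each leaf directory name under root to its sorted file names."""
    leaves = {}
    for current_path, subdirectories, filenames in os.walk(root):
        if not subdirectories:  # a leaf directory is treated as one question
            leaves[os.path.basename(current_path)] = sorted(filenames)
    return leaves


# Build a small hypothetical tree: topic/questionA and topic/questionB.
with tempfile.TemporaryDirectory() as root:
    for name in ("questionA", "questionB"):
        folder = os.path.join(root, "topic", name)
        os.makedirs(folder)
        for filename in ("question.html", "server.js"):
            with open(os.path.join(folder, filename), "w", encoding="utf-8") as f:
                f.write("placeholder")
    leaves = find_leaf_directories(root)

print(leaves)
```

Because the key is only the folder's base name, two question folders with the same name under different parents would overwrite each other, so it may be worth checking that folder names are unique before running the extraction.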


## Part 2: Creating a batch file 

In [107]:
from dotenv import load_dotenv
from openai import OpenAI
load_dotenv()
client = OpenAI(api_key = os.environ.get('OPENAI_API_KEY'))

In [108]:
question_prompt = """
    The following is an HTML document containing a question. You are tasked with extracting the question 
    and formatting it so that it looks like it comes from a textbook or official material. The output should:
    1. Maintain the Original Structure: Ensure the extracted question retains its original structure, including any subsections, numbered parts, or bullet points.
    2. Enhance Formatting for Readability: Format the question to appear polished and professional, akin to a textbook or exam guide. Use clear sectioning, appropriate indentation, and consistent styling.
    3. Ensure Completeness: Verify that each question has all necessary components (e.g., instructions, given data, diagrams, or conditions) to be fully understood without additional context.
    4. Preserve Essential Elements: Retain mathematical symbols, figures, or any special formatting provided in the HTML. Delimit any LaTeX using $$ for block-level math and $ for inline math.
    
    html:{html}
"""
# question_prompt.format(html = df.iloc[0]["question.html"])

def clean_question(html:str):
    question_prompt = r"""
    The following is an HTML document containing a question. You are tasked with extracting the question 
    and formatting it so that it looks like it comes from a textbook or official material. The output should:
    1. Maintain the Original Structure: Ensure the extracted question retains its original structure, including any subsections, numbered parts, or bullet points.
    2. Enhance Formatting for Readability: Format the question to appear polished and professional, akin to a textbook or exam guide. Use clear sectioning, appropriate indentation, and consistent styling.
    3. Ensure Completeness: Verify that each question has all necessary components (e.g., instructions, given data, diagrams, or conditions) to be fully understood without additional context.
    4. Preserve Essential Elements: Retain mathematical symbols, figures, or any special formatting provided in the HTML. Delimit any LaTeX using $$ for block-level math and $ for inline math.
    5. Replace Placeholders: The HTML will contain placeholders in the format {{params.value}}. Replace these placeholders with appropriate values.
    Only return the formatted question.
    """
    
    completion = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {"role": "developer", "content": "You are a helpful assistant."},
        {"role": "user", "content": question_prompt+ f'\n html:{html}'}])
    
    return completion.choices[0].message.content

random_index = df.sample(n=3).index
for question_id in random_index:  # avoid shadowing the built-in id()
    original_question = df.loc[question_id]['question.html']
    cleaned_question = clean_question(original_question)
    msg = f"""
    \nQuestion{question_id}
    Original Question: {original_question}\n
    Cleaned Question {cleaned_question}\n
    """
    print(msg)
    


    
QuestionMARCSDeduction5year
    Original Question: <pl-question-panel>
    <p>Using the MACRS deduction method for a large tape-drive system purchased for {{params.cost}} USD classified as {{params.propertyClass}}-year property, calculate the depreciation expense and the book value of the system at the end of each year for its {{params.usefulLife}}-year useful life.</p>
</pl-question-panel>

<pl-number-input answers-name="bookValueEndYear1" comparison="sigfig" digits="3" label="Book Value at End of Year 1 "></pl-number-input>
<pl-number-input answers-name="bookValueEndYear2" comparison="sigfig" digits="3" label="Book Value at End of Year 2 "></pl-number-input>
<pl-number-input answers-name="bookValueEndYear3" comparison="sigfig" digits="3" label="Book Value at End of Year 3 "></pl-number-input>
<pl-number-input answers-name="bookValueEndYear4" comparison="sigfig" digits="3" label="Book Value at End of Year 4 "></pl-number-input>

<pl-number-input answers-name="bookValueEndYearend

In [150]:
import json
from pydantic import BaseModel,Field
import os 


class Question(BaseModel):
    question: str=Field(...,description="The extracted question")

def create_request(content, custom_id):
    """
    Create a JSON request body for the chat completion API.

    Args:
        content (str): The content of the user message.
        custom_id (str): A custom identifier for the request.

    Returns:
        str: A JSON-formatted string for the request body.
    """
    request = {
        "custom_id": str(custom_id),
        "method": "POST",
        "url": "/v1/chat/completions",
        "body": {
            "model": "gpt-4o-mini",
            "messages": [
                {
                    "role": "system",
                    "content": "You are a helpful assistant."
                },
                {
                    "role": "user",
                    "content": content
                }
            ],
            "response_format": {
                "type": "json_schema",
                "json_schema": {
                    "name": "response",
                    "schema": Question.schema()
                }
            }
        }
    }
    return json.dumps(request, indent=4)
def save_as_jsonl(data, filename):
    """
    Save a list of JSON strings as a JSONL file.

    Args:
        data (list): A list of JSON strings.
        filename (str): The name of the file to save the data.

    Returns:
        None
    """
    with open(filename, 'w') as jsonl_file:
        for item in data:
            # Parse the string to a JSON object, then dump it back as a single line
            json_obj = json.loads(item)
            jsonl_file.write(json.dumps(json_obj) + '\n')

question_prompt = r"""
The following is an HTML document containing a question. You are tasked with extracting the question and formatting it such that it retains the same structure, removing any unnecessary HTML, and cleaning it up to look polished and professional.

The output should:
1. **Maintain the Original Structure**: Ensure the extracted question retains its original structure, including any subsections, numbered parts, or bullet points.
2. **Enhance Formatting for Readability**: Format the question to appear polished and professional, akin to a textbook or official material. Use clear sectioning, appropriate indentation, and consistent styling.
3. **Ensure Completeness**: Verify that each question has all necessary components (e.g., instructions, given data, diagrams, or conditions) to be fully understood without additional context.
4. **Preserve Essential Elements**: Retain mathematical symbols, figures, or any special formatting provided in the HTML. Delimit any LaTeX using `$$` for block-level math and `$` for inline math.
5. **Replace Placeholders**: The HTML may contain placeholders in the format `{{params.value}}`. Replace these placeholders with appropriate values based on the context. Ensure that all of them are replaced.

Example Input 1:
<p>Calculate the force exerted by an object using Newton's Second Law, where mass = {{params.mass}} kg and acceleration = {{params.acceleration}} m/s².</p>

Example Output 1:
Calculate the force exerted by an object using Newton's Second Law, where:
- Mass = 5 kg
- Acceleration = 3 m/s².

---

Example Input 2:
<p>A rectangular prism has a length of {{params.length}} cm, a width of {{params.width}} cm, and a height of {{params.height}} cm. Calculate its volume.</p>

Example Output 2:
A rectangular prism has a length of 10 cm, a width of 4 cm, and a height of 5 cm. Calculate its volume.

---

Example Input 3:
<p>What is the result of the following integral? $$\int_{{{params.lower_limit}}}^{{{params.upper_limit}}} x^2 \, dx$$</p>

Example Output 3:
What is the result of the following integral? 
$$\int_{0}^{3} x^2 \, dx$$

---

Ensure the output question retains the same structure, but remove unnecessary HTML and clean up the text to make it polished and professional. Only return the formatted question.
"""

    
request = []
for question_title, row in df.iterrows():
    question_html = row.get('question.html', '')
    content = question_prompt + f'\nhtml:{question_html}\n'
    request.append(create_request(content, question_title))
# The JSONL file is written to the current working directory
save_as_jsonl(request, '175batch.jsonl')
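
Results are later matched back to questions by `custom_id`, so two requests sharing an id would overwrite each other when the output is parsed. A minimal sketch of a pre-upload check, assuming each request is a JSON string shaped like the ones built above (`check_batch_requests` is a hypothetical helper name):

```python
import json
from collections import Counter


def check_batch_requests(requests: list) -> list:
    """Return a list of problems found in JSON request strings (empty if none)."""
    problems = []
    ids = []
    for position, raw in enumerate(requests):
        try:
            parsed = json.loads(raw)
        except json.JSONDecodeError as error:
            problems.append(f"request {position}: invalid JSON ({error})")
            continue
        for key in ("custom_id", "method", "url", "body"):
            if key not in parsed:
                problems.append(f"request {position}: missing '{key}'")
        if "custom_id" in parsed:
            ids.append(parsed["custom_id"])
    # Report every custom_id that is used by more than one request.
    for custom_id, count in Counter(ids).items():
        if count > 1:
            problems.append(f"custom_id '{custom_id}' appears {count} times")
    return problems


# Two made-up requests that accidentally share the same custom_id.
example = [
    json.dumps({"custom_id": "q1", "method": "POST", "url": "/v1/chat/completions", "body": {}}),
    json.dumps({"custom_id": "q1", "method": "POST", "url": "/v1/chat/completions", "body": {}}),
]
problems = check_batch_requests(example)
print(problems)
```

In the notebook this could be called as `check_batch_requests(request)` just before `save_as_jsonl`.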

In [151]:
from openai import OpenAI
import os
client = OpenAI()

print(os.getcwd())
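# Open the file in a context manager so the handle is closed after the upload
with open(os.path.join(os.getcwd(), "175batch.jsonl"), "rb") as batch_file:
    batch_input_file = client.files.create(
        file=batch_file,
        purpose="batch"
    )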

print(batch_input_file)
batch_input_file_id = batch_input_file.id
client.batches.create(
    input_file_id=batch_input_file_id,
    endpoint="/v1/chat/completions",
    completion_window="24h",
    metadata={
        "description": "update"
    }
)

c:\Users\lberm\OneDrive\Documents\Github\mechedu1.0\src\notebooks
FileObject(id='file-L2iTJhJs9ffYTUpwgxMsiy', bytes=124454, created_at=1735584207, filename='175batch.jsonl', object='file', purpose='batch', status='processed', status_details=None)


Batch(id='batch_6772e9d09e888190b908158aa8bce19b', completion_window='24h', created_at=1735584208, endpoint='/v1/chat/completions', input_file_id='file-L2iTJhJs9ffYTUpwgxMsiy', object='batch', status='validating', cancelled_at=None, cancelling_at=None, completed_at=None, error_file_id=None, errors=None, expired_at=None, expires_at=1735670608, failed_at=None, finalizing_at=None, in_progress_at=None, metadata={'description': 'update'}, output_file_id=None, request_counts=BatchRequestCounts(completed=0, failed=0, total=0))

In [165]:
batch = client.batches.retrieve("batch_6772e9d09e888190b908158aa8bce19b")
print(batch)

Batch(id='batch_6772e9d09e888190b908158aa8bce19b', completion_window='24h', created_at=1735584208, endpoint='/v1/chat/completions', input_file_id='file-L2iTJhJs9ffYTUpwgxMsiy', object='batch', status='completed', cancelled_at=None, cancelling_at=None, completed_at=1735585883, error_file_id=None, errors=None, expired_at=None, expires_at=1735670608, failed_at=None, finalizing_at=1735585878, in_progress_at=1735584209, metadata={'description': 'update'}, output_file_id='file-RjyXfu2NfEpZ78hWpu3WSt', request_counts=BatchRequestCounts(completed=33, failed=0, total=33))
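
The batch moves from `validating` to `completed` some time after submission, so the retrieve cell has to be re-run by hand until the status changes. A minimal polling sketch follows; the set of terminal statuses is my reading of the Batch API and should be treated as an assumption, and `wait_for_batch` is a hypothetical helper. The retrieve function is passed in so the example runs without network access.

```python
import time
from types import SimpleNamespace

# Assumed terminal statuses; anything else is treated as still running.
TERMINAL_STATUSES = {"completed", "failed", "expired", "cancelled"}


def wait_for_batch(retrieve, batch_id, poll_seconds=60.0, max_polls=100, sleep=time.sleep):
    """Poll retrieve(batch_id) until the batch reaches a terminal status."""
    for _ in range(max_polls):
        batch = retrieve(batch_id)
        if batch.status in TERMINAL_STATUSES:
            return batch
        sleep(poll_seconds)
    raise TimeoutError(f"Batch {batch_id} still running after {max_polls} polls")


# Stand-in for client.batches.retrieve that walks through three statuses.
statuses = iter(["validating", "in_progress", "completed"])


def fake_retrieve(batch_id):
    return SimpleNamespace(id=batch_id, status=next(statuses))


result = wait_for_batch(fake_retrieve, "batch_demo", poll_seconds=0, sleep=lambda s: None)
print(result.status)
```

With the real client this would look like `batch = wait_for_batch(client.batches.retrieve, batch_id)`, after which `batch.output_file_id` can be read as in the output above.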


## Part 3: Merging Data

In [166]:
file_response = client.files.content("file-RjyXfu2NfEpZ78hWpu3WSt")
data = {}
for line in file_response.text.split('\n'):
    line = line.strip()
    if line:  
        try:
            content = json.loads(line)
            question_id = content["custom_id"]
            completion = json.loads(content["response"]["body"]["choices"][0]["message"]["content"]).get('question')
            data[question_id] = completion
        except json.JSONDecodeError as e:
            print(f"Skipping invalid JSON line: {line} | Error: {e}")
df_batch = pd.DataFrame.from_dict(data, orient='index')
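
The loop above assumes every line carries a successful response; here all 33 requests completed, but a failed request would raise on the nested lookup. A minimal sketch that separates successes from failures, assuming each output line has the `custom_id` and `response` fields used above plus an `error` field for failed requests (that last field and the helper name `split_batch_output` are assumptions):

```python
import json


def split_batch_output(text: str):
    """Separate successful chat-completion results from failed requests."""
    results, failures = {}, {}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        record = json.loads(line)
        custom_id = record.get("custom_id")
        response = record.get("response") or {}
        if record.get("error") or response.get("status_code") != 200:
            failures[custom_id] = record.get("error") or response.get("status_code")
            continue
        message = response["body"]["choices"][0]["message"]["content"]
        try:
            parsed = json.loads(message)
        except json.JSONDecodeError:
            failures[custom_id] = "message content was not valid JSON"
            continue
        results[custom_id] = parsed.get("question") if isinstance(parsed, dict) else None
    return results, failures


# Two made-up output lines: one success and one failure.
sample = "\n".join([
    json.dumps({
        "custom_id": "q1",
        "error": None,
        "response": {
            "status_code": 200,
            "body": {"choices": [{"message": {"content": json.dumps({"question": "What is 2 + 2?"})}}]},
        },
    }),
    json.dumps({"custom_id": "q2", "error": {"message": "hypothetical failure"}, "response": None}),
])
results, failures = split_batch_output(sample)
print(results)
print(failures)
```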

In [167]:
df_batch.head()

Unnamed: 0,0
ddSwitchToSL,Create a table showing the depreciation and bo...
doubleDecline,Create a table showing the depreciation and bo...
MARCSDeduction5year,Using the MACRS deduction method for a large t...
MARCSDeduction7year,Using the MACRS deduction method for a large t...
straightLineMethod,"A milling machine is purchased for 15,000 USD ..."


In [168]:
new_df = pd.merge(df, df_batch, left_index=True, right_index=True)
new_df.head()

Unnamed: 0,info.json,question.html,server.js,solution.html,0
ddSwitchToSL,"{\n ""title"": ""DoubleDecliningBalanceDepreci...",<pl-question-panel>\n <p>Create a table sho...,const math = require('mathjs'); \n\nconst gene...,<pl-solution-panel>\n \n <pl-hint lev...,Create a table showing the depreciation and bo...
doubleDecline,"{\n ""title"": ""DoubleDecliningBalanceDepreci...",<pl-question-panel>\n <p>Create a table sho...,const math = require('mathjs');\n//const { bve...,"<pl-solution-panel>\n <pl-hint level=""1"" da...",Create a table showing the depreciation and bo...
MARCSDeduction5year,"{\n ""title"": ""MacrsDepreciationCalculation""...",<pl-question-panel>\n <p>Using the MACRS de...,const math = require('mathjs');\n\nconst MACRS...,"<pl-solution-panel>\n <pl-hint level=""1"" da...",Using the MACRS deduction method for a large t...
MARCSDeduction7year,"{\n ""title"": ""MacrsDepreciationCalculation""...",<pl-question-panel>\n <p>Using the MACRS de...,const math = require('mathjs');\n\nconst MACRS...,"<pl-solution-panel>\n <pl-hint level=""1"" da...",Using the MACRS deduction method for a large t...
straightLineMethod,"{\n ""title"": ""MillingMachineDepreciation"",\...",<pl-question-panel>\n <p>A milling machine ...,const math = require('mathjs');\n\nconst gener...,"<pl-solution-panel>\n <pl-hint level=""1"" da...","A milling machine is purchased for 15,000 USD ..."


### Creating embeddings

In [174]:
new_df.rename(columns={0: 'question'}, inplace=True)
new_df.iloc[2].question

'Using the MACRS deduction method for a large tape-drive system purchased for 50,000 USD classified as 5-year property, calculate the depreciation expense and the book value of the system at the end of each year for its 7-year useful life.\n\nCalculate the following:\n- Book Value at End of Year 1 \n- Book Value at End of Year 2 \n- Book Value at End of Year 3 \n- Book Value at End of Year 4 \n- Book Value at End of Year 5 \n- Book Value at End of Year 6 \n- Book Value at End of Year 7.'
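
The prompt asks the model to replace every `{{params...}}` placeholder, but nothing verifies that it did, and one cleaned question shown later in this notebook still begins with `{{params.`. A minimal sketch of a check that could run before embedding; the regular expression, the helper name, and the example parameter name `rate` are all assumptions for illustration:

```python
import re

# Matches template placeholders such as {{params.cost}} or {{ params.cost }}.
PLACEHOLDER = re.compile(r"\{\{\s*params\.[^{}]*\}\}")


def find_unfilled_placeholders(questions: dict) -> dict:
    """Return {title: [placeholders]} for questions that still contain placeholders."""
    leftovers = {}
    for title, text in questions.items():
        if not isinstance(text, str):  # skip missing values such as NaN
            continue
        found = PLACEHOLDER.findall(text)
        if found:
            leftovers[title] = found
    return leftovers


# Made-up cleaned questions: the second still contains a placeholder.
example = {
    "a": "Deposit 500 USD at 3% per year.",
    "b": "You purchase a home with a {{params.rate}}% loan.",
}
leftovers = find_unfilled_placeholders(example)
print(leftovers)
```

In the notebook this could be applied as `find_unfilled_placeholders(new_df['question'].to_dict())`, and any flagged questions re-submitted or fixed by hand.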

In [175]:
def create_request_batch(content, custom_id):
    """
    Create a JSON request body for the embeddings API.

    Args:
        content (str): The text to embed.
        custom_id (str): A custom identifier for the request.

    Returns:
        str: A JSON-formatted string for the request body.
    """
    request = {
        "custom_id": str(custom_id),
        "method": "POST",
        "url": "/v1/embeddings",
        "body": {
            "model": "text-embedding-3-small",
            "input": content
            }
        }
    return json.dumps(request, indent=4)


request = []
for question_title, row in new_df.iterrows():
    question = row.get('question', '')
    print(question)
    request.append(create_request_batch(question, question_title))
print(request)
# The JSONL file is written to the current working directory
save_as_jsonl(request, '175batch_embedding.jsonl')

Create a table showing the depreciation and book value for each year (i.e., from EOY 0 through EOY 10) for a milling machine purchased for 20000 USD with a salvage value of 2000 USD and a useful life of 10 years using double declining balance depreciation and switching to straight-line depreciation.

What is the book value at EOY 1, EOY 5, and EOY 10?

- Book Value at EOY 1 = 
- Book Value at EOY 5 = 
- Book Value at EOY 10 = 
Create a table showing the depreciation and book value for each year (i.e., from EOY 0 through EOY 10) for a milling machine purchased for 50000 USD with a salvage value of 5000 USD and a useful life of 10 years using double declining balance depreciation.

What is the book value at EOY 1, EOY 5, and EOY 10?

- Book Value at EOY 1 = 45000
- Book Value at EOY 5 = 25000
- Book Value at EOY 10 = 5000
Using the MACRS deduction method for a large tape-drive system purchased for 50,000 USD classified as 5-year property, calculate the depreciation expense and the book v

In [177]:
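# Open the file in a context manager so the handle is closed after the upload
with open(os.path.join(os.getcwd(), "175batch_embedding.jsonl"), "rb") as batch_file:
    batch_input_file = client.files.create(
        file=batch_file,
        purpose="batch"
    )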
print(batch_input_file)
batch_input_file_id = batch_input_file.id
client.batches.create(
    input_file_id=batch_input_file_id,
    endpoint="/v1/embeddings",
    completion_window="24h",
    metadata={
        "description": "update"
    }
)

FileObject(id='file-XV1v4URA1APA3ThLnPzE3j', bytes=19221, created_at=1735586471, filename='175batch_embedding.jsonl', object='file', purpose='batch', status='processed', status_details=None)


Batch(id='batch_6772f2a8e16c81909a0ce6281f5e570d', completion_window='24h', created_at=1735586473, endpoint='/v1/embeddings', input_file_id='file-XV1v4URA1APA3ThLnPzE3j', object='batch', status='validating', cancelled_at=None, cancelling_at=None, completed_at=None, error_file_id=None, errors=None, expired_at=None, expires_at=1735672873, failed_at=None, finalizing_at=None, in_progress_at=None, metadata={'description': 'update'}, output_file_id=None, request_counts=BatchRequestCounts(completed=0, failed=0, total=0))

In [178]:
batch = client.batches.retrieve("batch_6772f2a8e16c81909a0ce6281f5e570d")
print(batch)

Batch(id='batch_6772f2a8e16c81909a0ce6281f5e570d', completion_window='24h', created_at=1735586473, endpoint='/v1/embeddings', input_file_id='file-XV1v4URA1APA3ThLnPzE3j', object='batch', status='completed', cancelled_at=None, cancelling_at=None, completed_at=1735586488, error_file_id=None, errors=None, expired_at=None, expires_at=1735672873, failed_at=None, finalizing_at=1735586483, in_progress_at=1735586474, metadata={'description': 'update'}, output_file_id='file-RDNyEUmi2dr5a7T8fiepxD', request_counts=BatchRequestCounts(completed=33, failed=0, total=33))


In [182]:
file_response = client.files.content("file-RDNyEUmi2dr5a7T8fiepxD")
file_response.text

'{"id": "batch_req_6772f2b364108190b9c8fd32a4d78794", "custom_id": "ddSwitchToSL", "response": {"status_code": 200, "request_id": "be3bbc62a68b3e7daede8cf7ce602fb6", "body": {"object": "list", "data": [{"object": "embedding", "index": 0, "embedding": [0.008822514, 0.06810361, 0.059388254, 0.0036820003, 0.016382966, 0.0030197166, 0.0067329705, 0.02826538, 0.03393274, -0.026884258, 0.082152955, -0.02728907, -0.05300652, -0.046053283, 0.04498172, 0.017394995, -0.04136223, 0.002664018, 0.019764334, 0.027170008, 0.017037809, 0.028503506, -0.009292809, -0.022324173, 0.020562053, -3.7648788e-05, 0.03981442, -0.011810976, 0.01656156, -0.031027624, -0.014180315, -0.0094654495, -0.018573713, 0.02631276, -0.015918624, -0.021276426, 0.022455143, 0.0005633133, -0.0074592503, -0.0050899116, 0.0066496274, -0.041719414, 0.030765688, 0.053768516, 0.020454897, 0.022467049, 0.008030749, -0.079438336, 0.036266364, 0.010263166, 0.017549777, -0.014430346, 0.037861798, 8.8878114e-05, -0.009649996, 0.01562096

In [206]:
data = {}
for line in file_response.text.split('\n'):
    line = line.strip()
    if line:  
        try:
            content = json.loads(line)
            question_id = content["custom_id"]
            embedding = content["response"]["body"]['data'][0]['embedding']
            data[question_id] = str(embedding)  # Stored as a string; must be parsed back into a list before use
            # print(embedding)
        except json.JSONDecodeError as e:
            print(f"Skipping invalid JSON line: {line} | Error: {e}")
df_batch_embedding = pd.DataFrame.from_dict(data, orient='index')
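
Since each embedding is stored as a string, it has to be parsed back into a list of floats before any numeric use. A minimal sketch of the round trip, assuming the string was produced by `str()` on a list of finite floats (`parse_embedding` is a hypothetical helper; `ast.literal_eval` would work equally well):

```python
import json


def parse_embedding(cell: str) -> list:
    """Convert a stringified list such as '[0.1, -0.2]' back into a list of floats."""
    return [float(value) for value in json.loads(cell)]


# A few values taken from the batch output above.
original = [0.008822514, -3.7648788e-05, 0.06810361]
stored = str(original)          # what ends up in the DataFrame / CSV
restored = parse_embedding(stored)
print(restored)
```

After loading the CSV, something like `df['embeddings-3-small'].apply(parse_embedding)` would restore the column.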

In [207]:
df_batch_embedding.head()

Unnamed: 0,0
ddSwitchToSL,"[0.008822514, 0.06810361, 0.059388254, 0.00368..."
doubleDecline,"[0.017977716, 0.06909924, 0.05670903, 0.004378..."
MARCSDeduction5year,"[0.0074357996, 0.046291497, 0.0006788812, -0.0..."
MARCSDeduction7year,"[0.018601537, 0.04095613, 0.004502403, -0.0155..."
straightLineMethod,"[0.0019877013, 0.0844819, 0.026290337, 0.01974..."


In [227]:
final_df = pd.merge(new_df, df_batch_embedding, left_index=True, right_index=True)
final_df.rename(columns={0: 'embeddings-3-small'}, inplace=True)
final_df.head()

Unnamed: 0,info.json,question.html,server.js,solution.html,question,embeddings-3-small
ddSwitchToSL,"{\n ""title"": ""DoubleDecliningBalanceDepreci...",<pl-question-panel>\n <p>Create a table sho...,const math = require('mathjs'); \n\nconst gene...,<pl-solution-panel>\n \n <pl-hint lev...,Create a table showing the depreciation and bo...,"[0.008822514, 0.06810361, 0.059388254, 0.00368..."
doubleDecline,"{\n ""title"": ""DoubleDecliningBalanceDepreci...",<pl-question-panel>\n <p>Create a table sho...,const math = require('mathjs');\n//const { bve...,"<pl-solution-panel>\n <pl-hint level=""1"" da...",Create a table showing the depreciation and bo...,"[0.017977716, 0.06909924, 0.05670903, 0.004378..."
MARCSDeduction5year,"{\n ""title"": ""MacrsDepreciationCalculation""...",<pl-question-panel>\n <p>Using the MACRS de...,const math = require('mathjs');\n\nconst MACRS...,"<pl-solution-panel>\n <pl-hint level=""1"" da...",Using the MACRS deduction method for a large t...,"[0.0074357996, 0.046291497, 0.0006788812, -0.0..."
MARCSDeduction7year,"{\n ""title"": ""MacrsDepreciationCalculation""...",<pl-question-panel>\n <p>Using the MACRS de...,const math = require('mathjs');\n\nconst MACRS...,"<pl-solution-panel>\n <pl-hint level=""1"" da...",Using the MACRS deduction method for a large t...,"[0.018601537, 0.04095613, 0.004502403, -0.0155..."
straightLineMethod,"{\n ""title"": ""MillingMachineDepreciation"",\...",<pl-question-panel>\n <p>A milling machine ...,const math = require('mathjs');\n\nconst gener...,"<pl-solution-panel>\n <pl-hint level=""1"" da...","A milling machine is purchased for 15,000 USD ...","[0.0019877013, 0.0844819, 0.026290337, 0.01974..."


## Part 4: Final Combine with the Existing Dataset

In [260]:
old_path = os.path.abspath(
    os.path.join(os.getcwd(), '..', '..', 'src', 'data', 'question_embeddings_2024_9_11.csv')
)
old_df = pd.read_csv(old_path)
print(old_df.columns)
old_df = old_df.drop(columns =['Unnamed: 0.2','Unnamed: 0.1'])
old_df.head()
old_df.tail()

Index(['Unnamed: 0.2', 'Unnamed: 0.1', 'Question Title', 'Unnamed: 0',
       'question.html', 'server.js', 'solution.html', 'server.py',
       'properties.js', 'info1.json', 'server_trap.js', 'server1.py',
       'server2.py', 'test1.py', 'server3.py', '.DS_Store', 'question',
       'question_embedding', 'uuid', 'title', 'stem', 'topic', 'tags',
       'prereqs', 'isAdaptive', 'createdBy', 'qType', 'nSteps', 'updatedBy',
       'difficulty', 'codelang', 'resources', 'stepType', 'dificulty',
       'embeddings-3-small'],
      dtype='object')


Unnamed: 0.1,Question Title,Unnamed: 0,question.html,server.js,solution.html,server.py,properties.js,info1.json,server_trap.js,server1.py,...,createdBy,qType,nSteps,updatedBy,difficulty,codelang,resources,stepType,dificulty,embeddings-3-small
282,,,"<div class=""card my-2"">\r\n <div class=""car...",,,import prairielearn as pl\r\nimport sympy\r\n\...,,,,,...,,,,,,,,,,"[0.036735910922288895, -0.005829245317727327, ..."
283,,,<pl-question-panel>\r\n <p>\r\n Ques...,,,import prairielearn as pl\r\n\r\n\r\ndef gener...,,,,,...,,,,,,,,,,"[0.012048509903252125, 0.027021152898669243, 0..."
284,,,\r\n<pl-question-panel>\r\n\r\n<p>The element ...,,,import numpy as np\r\nimport prairielearn as p...,,,,,...,,,,,,,,,,"[0.024972645565867424, 0.013344255276024342, 0..."
285,,,"<pl-card header=""Header"" title=""Title"" subtitl...",,,import numpy as np\r\nimport prairielearn as p...,,,,,...,,,,,,,,,,"[-0.015040713362395763, 0.010234958492219448, ..."
286,,,,,,,,,,,...,,,,,,,,,,"[0.01346839964389801, -0.0219414085149765, 0.0..."


In [254]:
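# The merge below joins on a 'Question Title' column, so move the title out of the index
# if that has not been done yet (the guard keeps the cell safe to re-run).
if 'Question Title' not in final_df.columns:
    final_df = final_df.reset_index()
    final_df.rename(columns={final_df.columns[0]: 'Question Title'}, inplace=True)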
print(final_df.shape)
print(old_df.shape)

(33, 7)
(287, 33)


In [262]:
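# Index-aligned merge, kept for inspection only: it pairs rows by position
# (row 0 with row 0), so the paired questions are unrelated. It is not used for the final CSV.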
df_new_data = pd.merge(old_df, final_df, left_index=True, right_index=True)

# Define overlap columns
overlap_columns = ['question.html', 'server.js', 'solution.html', 'question', 'embeddings-3-small', 'Question Title']

# Merge on overlap columns
merged_df = pd.merge(old_df, final_df, on=overlap_columns, how='outer', indicator=True)

# Handle duplicates in 'Question Title'
if 'Question Title' in merged_df.columns:
    merged_df['Question Title'] = merged_df['Question Title'].astype(str)  # Ensure values are strings
    merged_df['Question Title'] = merged_df.groupby('Question Title').cumcount().apply(
        lambda x: f"{x}" if x > 0 else ""
    ).radd(merged_df['Question Title'])

# Sort by title, then preview the resulting DataFrame
merged_df.sort_values(by='Question Title', ascending=True, inplace=True)
merged_df.head()
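
The duplicate handling above appends a running count to every repeated title. A small sketch of the same `groupby(...).cumcount()` idea on made-up rows, written with plain string concatenation instead of `.radd` for readability:

```python
import pandas as pd

# Made-up titles with one title repeated three times.
frame = pd.DataFrame({"Question Title": ["Frame1", "Frame3", "Frame1", "Frame1"]})

# 0 for the first occurrence of a title, 1 for the second, and so on.
occurrence = frame.groupby("Question Title").cumcount()
suffix = occurrence.apply(lambda n: f"{n}" if n > 0 else "")
frame["Question Title"] = frame["Question Title"] + suffix

deduplicated = frame["Question Title"].tolist()
print(deduplicated)  # ['Frame1', 'Frame3', 'Frame11', 'Frame12']
```

Note that a bare numeric suffix can collide with a real title (a second `Frame1` becomes `Frame11`); adding a separator such as `_` would be one way to avoid that, if the downstream code does not depend on the current naming.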




In [240]:
df_new_data

Unnamed: 0.1,Question Title_x,Unnamed: 0,question.html_x,server.js_x,solution.html_x,server.py,properties.js,info1.json,server_trap.js,server1.py,...,stepType,dificulty,embeddings-3-small_x,Question Title_y,info.json,question.html_y,server.js_y,solution.html_y,question_y,embeddings-3-small_y
0,3dMoment1,0.0,<pl-question-panel>\r\n <pl-figure file-nam...,const math = require('mathjs');\r\n// const ma...,<pl-solution-panel>\r\n <pl-figure file-nam...,,,,,,...,,,"[0.021389668807387352, 0.0019783112220466137, ...",ddSwitchToSL,"{\n ""title"": ""DoubleDecliningBalanceDepreci...",<pl-question-panel>\n <p>Create a table sho...,const math = require('mathjs'); \n\nconst gene...,<pl-solution-panel>\n \n <pl-hint lev...,Create a table showing the depreciation and bo...,"[0.008822514, 0.06810361, 0.059388254, 0.00368..."
1,3dMoment2,1.0,<pl-question-panel>\r\n <pl-figure file-nam...,const math = require('mathjs');\r\n// const ma...,<pl-solution-panel>\r\n <pl-figure file-nam...,,,,,,...,,,"[-0.0008305445080623031, 0.021679654717445374,...",doubleDecline,"{\n ""title"": ""DoubleDecliningBalanceDepreci...",<pl-question-panel>\n <p>Create a table sho...,const math = require('mathjs');\n//const { bve...,"<pl-solution-panel>\n <pl-hint level=""1"" da...",Create a table showing the depreciation and bo...,"[0.017977716, 0.06909924, 0.05670903, 0.004378..."
2,3dMoment3,2.0,<pl-question-panel>\r\n <pl-figure file-nam...,const math = require('mathjs');\r\n// const ma...,<pl-solution-panel>\r\n <pl-figure file-nam...,,,,,,...,,,"[0.01982959173619747, 0.02235875464975834, 0.0...",MARCSDeduction5year,"{\n ""title"": ""MacrsDepreciationCalculation""...",<pl-question-panel>\n <p>Using the MACRS de...,const math = require('mathjs');\n\nconst MACRS...,"<pl-solution-panel>\n <pl-hint level=""1"" da...",Using the MACRS deduction method for a large t...,"[0.0074357996, 0.046291497, 0.0006788812, -0.0..."
3,3dMoment4,3.0,<pl-question-panel>\r\n <pl-figure file-nam...,const math = require('mathjs');\r\n// const ma...,<pl-solution-panel>\r\n <pl-figure file-nam...,,,,,,...,,,"[0.011207695119082928, 0.018756229430437088, 0...",MARCSDeduction7year,"{\n ""title"": ""MacrsDepreciationCalculation""...",<pl-question-panel>\n <p>Using the MACRS de...,const math = require('mathjs');\n\nconst MACRS...,"<pl-solution-panel>\n <pl-hint level=""1"" da...",Using the MACRS deduction method for a large t...,"[0.018601537, 0.04095613, 0.004502403, -0.0155..."
4,Equilibrium1,4.0,"<pl-question-panel>\r\n<pl-figure file-name=""3...",const math = require('mathjs');\r\n// const ma...,<pl-solution-panel>\r\n <pl-figure file-nam...,,,,,,...,,,"[0.015392015688121319, 0.010134758427739143, 0...",straightLineMethod,"{\n ""title"": ""MillingMachineDepreciation"",\...",<pl-question-panel>\n <p>A milling machine ...,const math = require('mathjs');\n\nconst gener...,"<pl-solution-panel>\n <pl-hint level=""1"" da...","A milling machine is purchased for 15,000 USD ...","[0.0019877013, 0.0844819, 0.026290337, 0.01974..."
5,Equilibrium2,5.0,<pl-question-panel>\r\n <pl-figure file-nam...,const math = require('mathjs');\r\n// const ma...,<pl-solution-panel>\r\n <pl-figure file-nam...,,,,,,...,,,"[0.04372689500451088, -0.053088847547769547, 0...",AmountOfPayments,"{\n ""title"": ""MortgagePaymentCalculation"",\...",<pl-question-panel>\n <p> You purchase a home...,const math = require('mathjs');\n\nconst gener...,"<pl-solution-panel>\n <pl-hint level=""1"" da...","You purchase a home with a 3%/year fixed-rate,...","[0.020555738, 0.046258952, 0.07994531, 0.06559..."
6,Equilibrium3,6.0,<pl-question-panel>\r\n <pl-figure file-nam...,const math = require('mathjs');\r\n// const ma...,<pl-solution-panel>\r\n <pl-figure file-nam...,,,,,,...,,,"[0.06704465299844742, -0.04970955848693848, -0...",CarPurchase,"{\n ""title"": ""CarLoanPaymentCalculation"",\n...",<pl-question-panel>\n <p> You are considering...,const math = require('mathjs');\n\nconst gener...,"<pl-solution-panel>\n <pl-hint level=""1"" da...",You are considering purchasing a new car from ...,"[0.007772759, -0.020161236, -0.030871892, 0.03..."
7,Equilibrium4,7.0,<pl-question-panel>\r\n <pl-figure file-nam...,const math = require('mathjs');\r\n// const ma...,<pl-solution-panel>\r\n <pl-figure file-nam...,,,,,,...,,,"[0.03697505220770836, -0.021439723670482635, 0...",FutureSavings,"{\n ""title"": ""InvestmentGrowthCalculation"",...",<pl-question-panel>\n <p> If you deposit {{pa...,const math = require('mathjs');\n\nconst gener...,"<pl-solution-panel>\n <pl-hint level=""1"" da...",If you deposit 500 USD every other month into ...,"[0.033072267, 0.00800969, -0.008649761, 0.0593..."
8,Frame1,8.0,<pl-question-panel>\r\n <pl-figure file-nam...,const math = require('mathjs');\r\n// const ma...,<pl-solution-panel>\r\n <pl-figure file-nam...,,,,,,...,,,"[0.03559532016515732, 0.03461657837033272, 0.0...",InterestRateConversions,"{\n ""title"": ""InterestRateConversions"",\n ...",<pl-question-panel>\n <p>\n A loan h...,const math = require('mathjs');\n\nconst gener...,"<pl-solution-panel>\n <pl-hint level=""1"" da...",A loan has a nominal interest rate of 5% per y...,"[-0.02511769, 0.026525773, 0.035119213, 0.0217..."
9,Frame3,9.0,\r\n\r\n\r\n<pl-question-panel>\r\n <pl-fig...,const math = require('mathjs');\r\n// const ma...,<pl-solution-panel>\r\n <pl-figure file-nam...,,,,,,...,,,"[0.04400460049510002, -0.01800517551600933, 0....",IntrestPayed,"{\n ""title"": ""MortgageInterestCalculation"",...",<pl-question-panel>\n <p> You purchase a home...,const math = require('mathjs');\n\nconst gener...,"<pl-solution-panel>\n <pl-hint level=""1"" da...",You purchase a home with a **{{params.interest...,"[-0.029988721, 0.027915265, 0.05102884, 0.0762..."
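
The table above pairs each existing question with an unrelated new one (for example `3dMoment1` with `ddSwitchToSL`) because an index-aligned merge matches rows by their integer position labels. A minimal sketch on made-up rows contrasting that with an outer merge on shared columns, which keeps the rows from both sides separate:

```python
import pandas as pd

old = pd.DataFrame({"Question Title": ["3dMoment1", "Frame1"], "question": ["old q1", "old q2"]})
new = pd.DataFrame({"Question Title": ["ddSwitchToSL"], "question": ["new q1"]})

# Index merge: row 0 of `old` is glued to row 0 of `new`, although they are unrelated.
by_position = pd.merge(old, new, left_index=True, right_index=True)

# Outer merge on the shared columns: unmatched rows from both sides are kept,
# and the indicator column records where each row came from.
combined = pd.merge(old, new, on=["Question Title", "question"], how="outer", indicator=True)

print(by_position)
print(combined[["Question Title", "_merge"]])
```

When no rows match, the outer merge behaves like stacking the two tables, which is the behaviour wanted for appending new questions to the existing dataset.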


In [264]:
merged_df.to_csv('Question_Embedding_20241230.csv')
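
The existing dataset loaded earlier carries columns such as `Unnamed: 0`, `Unnamed: 0.1`, and `Unnamed: 0.2`, which is what accumulates when a DataFrame's index is written to CSV and read back repeatedly. A small sketch of the effect on made-up data, using in-memory buffers instead of files:

```python
import io

import pandas as pd

frame = pd.DataFrame({"Question Title": ["q1", "q2"], "question": ["a", "b"]})

# Default behaviour: the row index is written as an unnamed first column.
with_index = io.StringIO()
frame.to_csv(with_index)
with_index.seek(0)
columns_with_index = pd.read_csv(with_index).columns.tolist()

# index=False leaves the index out, so no extra column appears on reload.
without_index = io.StringIO()
frame.to_csv(without_index, index=False)
without_index.seek(0)
columns_without_index = pd.read_csv(without_index).columns.tolist()

print(columns_with_index)
print(columns_without_index)
```

Passing `index=False` to `to_csv` here would be one option, provided the code that later loads this file does not expect the `Unnamed` columns to be present.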