<a href="https://colab.research.google.com/github/lennyciotti/Feedback_squared/blob/main/feedback_desk_gpt_judge_.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [28]:
import os
import pandas as pd
import json
from openai import OpenAI
from dotenv import load_dotenv
from typing import Dict, Any
from google.colab import userdata
from tqdm import tqdm

In [29]:
api_key = userdata.get("OPENAI_API_KEY")
os.environ["OPENAI_API_KEY"] = api_key

client = OpenAI(api_key=api_key, base_url="https://us.api.openai.com/v1/")
print("API loaded:", bool(api_key))

if api_key:
    print("✅ OpenAI API key is loaded!")
else:
    print("❌ OpenAI API key NOT loaded.")

API loaded: True
✅ OpenAI API key is loaded!


In [30]:
SYSTEM_ROLE_JUDGE = """
You are a judge that evaluates feedback on written assignments.
You are judging the feedback's tone, its level of detail, and how appropriately it addresses grammar, structure, and content.
You will receive both the original essay and the feedback.
The feedback is broken into two sections: high-level feedback on the entire assignment, and specific comments.
Together they cover tone, level of detail, grammar, structure, and content.
For each assignment you will give one score for each dimension of feedback.
Your evaluation must strictly adhere to the provided dimensions.
Your response must be a JSON object only, with no introductory text outside the JSON (such as "Here is the evaluation...").
"""

In [31]:
EVALUATION_RUBRIC = """
Please rate and evaluate this feedback according to the following criteria.
[Evaluation Criteria]
1.  Tone [1-4 points]: For tone you will consider the high-level feedback and the comments and judge whether the feedback's tone is:
    4 = Very positive and encouraging
    3 = Somewhat positive and encouraging
    2 = Not really positive and encouraging
    1 = Not at all positive and encouraging
2.  Level of detail [1-4 points]: For the level of detail you will consider the high-level feedback and the comments and judge whether the feedback is:
    4 = Very detailed, specific and actionable
    3 = Somewhat detailed, specific and actionable
    2 = Not really detailed, specific and actionable
    1 = Not at all detailed, specific or actionable
3.  Grammar [1-4 points]: For grammar you will consider the original essay, the high-level feedback and the comments, and judge whether the grammar feedback:
    4 = Addressed all grammar and mechanics issues in the original essay
    3 = Addressed most grammar and mechanics issues in the original essay
    2 = Addressed some grammar and mechanics issues in the original essay
    1 = Addressed none of the grammar and mechanics issues in the original essay
4.  Structure [1-4 points]: For structure you will consider the original essay, the high-level feedback and the comments, and judge whether the structure feedback:
    4 = Addressed all structure and mechanics issues in the original essay
    3 = Addressed most structure and mechanics issues in the original essay
    2 = Addressed some structure and mechanics issues in the original essay
    1 = Addressed none of the structure and mechanics issues in the original essay
5.  Content [1-4 points]: For content you will consider the original essay, the high-level feedback and the comments, and judge whether the content feedback:
    4 = Addressed all content and mechanics issues in the original essay
    3 = Addressed most content and mechanics issues in the original essay
    2 = Addressed some content and mechanics issues in the original essay
    1 = Addressed none of the content and mechanics issues in the original essay

[Output Format]
Please strictly adhere to the following JSON format when returning your evaluation results:
{
  "Tone": <score (int)>,
  "Level of detail": <score (int)>,
  "Grammar": <score (int)>,
  "Stucture": <score (int)>,
  "Content": <score (int)>
}
"""

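The output format in the rubric can be checked before a judge response is stored. The following is a minimal sketch, not part of the original notebook: `EXPECTED_KEYS` and `validate_scores` are hypothetical names, and it assumes the judge echoes the five keys exactly as spelled in the rubric and scores on the 1 to 4 scale that the score descriptions define.

```python
from typing import Any, Dict, List

# Keys exactly as spelled in the rubric's [Output Format] block (assumption:
# the judge echoes them verbatim, including the "Stucture" spelling).
EXPECTED_KEYS = ("Tone", "Level of detail", "Grammar", "Stucture", "Content")


def validate_scores(evaluation: Dict[str, Any], low: int = 1, high: int = 4) -> List[str]:
    """Return a list of problems found in a judge response (empty if valid)."""
    problems = []
    for key in EXPECTED_KEYS:
        if key not in evaluation:
            problems.append(f"missing key: {key}")
            continue
        value = evaluation[key]
        # bool is a subclass of int, so exclude it explicitly.
        if isinstance(value, bool) or not isinstance(value, int):
            problems.append(f"{key}: expected int, got {type(value).__name__}")
        elif not low <= value <= high:
            problems.append(f"{key}: {value} outside {low}-{high}")
    return problems
```

An empty list means the response matches the rubric; anything else could be logged next to the `essay_id` so the row can be inspected or re-run.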
In [None]:
"""Load the student sample data and the Feedback Desk feedback data from the pickle files saved on Google Drive."""

In [32]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [33]:
# CONNECT TO THE SAMPLES DATABASE

# Define the path to your pickle file
pickle_file_path = '/content/drive/MyDrive/Fall 2025/Practicum in Data Analysis/Copy of SAMPLES.pkl'

# Unpickle the file into a DataFrame
samples_df = pd.read_pickle(pickle_file_path)

# Now 'samples_df' is your unpickled pandas DataFrame
print(samples_df.head())

   essay_id      created_at  \
0  2d75e7f0  11/18/25 18:22   
1  1f20d82f  11/18/25 18:22   
2  00b8f343  11/18/25 18:22   
3  b1b156cc  11/18/25 18:22   
4  332031c4  11/18/25 18:22   

                                               title  subject      grade  \
0  Thomas Jefferson And The Declaration Of Indepe...  History  9th grade   
1  Thomas Jefferson And The Declaration Of Indepe...  History  9th grade   
2  Thomas Jefferson And The Declaration Of Indepe...  History  9th grade   
3  Thomas Jefferson And The Declaration Of Indepe...  History  9th grade   
4  Thomas Jefferson And The Declaration Of Indepe...  History  9th grade   

             knowledge level                                   grammar level  \
0       3 - medium knowledge                  4 - good grammar and structure   
1         4 - deep knowledge  5 - great form, grammar and a broad vocabulary   
2  2 - superficial knowledge     3 - decent grammar but very basic structure   
3       3 - medium knowledge     3 -

In [34]:
# CONNECT TO THE FEEDBACK RESULTS DATABASE

# Define the path to your pickle file
pickle_file_path = '/content/drive/MyDrive/Fall 2025/Practicum in Data Analysis/feedback_results_df.pkl'

# Unpickle the file into a DataFrame
results_df = pd.read_pickle(pickle_file_path)

# Now 'results_df' is your unpickled pandas DataFrame
print(results_df.head())

   essay_id                                high_level_feedback  \
0  2d75e7f0  <p>This essay provides a comprehensive overvie...   
1  1f20d82f  <p>This essay provides a comprehensive overvie...   
2  00b8f343  <p>This assignment provides a clear and concis...   
3  b1b156cc  <p>This essay provides a comprehensive overvie...   
4  332031c4  <p>This assignment provides a comprehensive ov...   

                                            comments  
0  [{'label': 'Glow: Significance of Jefferson's ...  
1  [{'label': 'Glow: Clear and Structured Introdu...  
2  [{'label': 'Glow: Jefferson's Role', 'comment_...  
3  [{'label': 'Glow: Clear Outline of Jefferson's...  
4  [{'label': 'Historical Context', 'comment_text...  


In [35]:
"""***2. Core Evaluation Logic***

This function acts as the bridge between our data and the LLM API.

Input: It takes the Student Essay and the Feedback Desk Output.

Prompt Engineering: It constructs a structured prompt combining the essay, feedback, and the grading rubric.

API Call: It sends the request to the model (forcing a JSON output for structured data) and handles any potential connection errors.
"""

def get_llm_evaluation(essay: str, feedback: str, client: OpenAI) -> Dict[str, Any] | None:
    """
    Construct the prompt and invoke the GPT-4o-mini model to obtain the evaluation.
    """

    user_prompt = f"""
    [Original student essay]
    {essay}

    [Feedback Desk's feedback]
    {feedback}

    [Evaluation Criteria and Output Format]
    {EVALUATION_RUBRIC}
    """

    try:
        response = client.chat.completions.create(
            model="gpt-4o-mini",
            messages=[
                {"role": "system", "content": SYSTEM_ROLE_JUDGE},
                {"role": "user", "content": user_prompt}
            ],
            response_format={"type": "json_object"},
            temperature=0.2
        )

        # Chat Completions returns the generated text in choices[0].message.content
        output_text = response.choices[0].message.content

        # Parse JSON
        evaluation_data = json.loads(output_text)
        return evaluation_data

    except json.JSONDecodeError as e:
        print("❌ Error: Model did not return valid JSON.")
        print(f"Raw model output:\n{output_text}")
        return None

    except Exception as e:
        print(f"❌ Error calling OpenAI API: {e}")
        return None
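Because `get_llm_evaluation` returns `None` on any failure, a transient connection error costs that essay its scores. One way to soften this is to retry the call with exponential backoff. The sketch below is an illustration only, not part of the original notebook: `call_with_retries` and its parameters are hypothetical, and it assumes that a `None` result means "try again".

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


def call_with_retries(
    fn: Callable[[], Optional[T]],
    attempts: int = 3,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Optional[T]:
    """Call fn until it returns a non-None value or the attempts run out."""
    for attempt in range(attempts):
        result = fn()
        if result is not None:
            return result
        if attempt < attempts - 1:
            # Exponential backoff: base_delay, 2 * base_delay, 4 * base_delay, ...
            sleep(base_delay * (2 ** attempt))
    return None
```

In the batch loop this could wrap the existing call, for example `call_with_retries(lambda: get_llm_evaluation(current_essay, current_feedback, client))`. The `sleep` parameter exists only so the delay can be replaced in tests.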

In [36]:
"""***3. Data Processing & Batch Execution***

This is the main execution loop of the pipeline.

Data Loading: Reads the .pkl files using Pandas.

Data Merging: Joins the Samples and Results dataframes on essay_id to ensure we are evaluating the correct pairs.

Batch Processing: Iterates through every row (with a progress bar), cleans column names, sends data to the LLM, and collects the results.

Saving: Finally, exports the evaluation results to a new .pkl file in Google Drive.
"""

def main():
    # Define Path
    INPUT_PKL_PATH_SAMPLES = '/content/drive/MyDrive/Fall 2025/Practicum in Data Analysis/Copy of SAMPLES.pkl'
    INPUT_PKL_PATH_RESULTS = '/content/drive/MyDrive/Fall 2025/Practicum in Data Analysis/feedback_results_df.pkl'
    OUTPUT_PKL_PATH = '/content/drive/MyDrive/Fall 2025/Practicum in Data Analysis/GPT_judges_evaluation_results.pkl'

    results_list = []

    # 1. load data
    try:
        print("Loading data...")
        samples_df = pd.read_pickle(INPUT_PKL_PATH_SAMPLES)
        results_df = pd.read_pickle(INPUT_PKL_PATH_RESULTS)

        samples_df['essay_id'] = samples_df['essay_id'].astype(str)
        results_df['essay_id'] = results_df['essay_id'].astype(str)

        # Merge Data
        # Use inner join to ensure only records with both original text and feedback are processed
        merged_data = pd.merge(samples_df, results_df, on='essay_id', how='inner')

        # Drop duplicate essay_id rows so each essay is evaluated only once.
        merged_data = merged_data.drop_duplicates(subset=['essay_id'])

        print(f"✅ Data merged! Total unique essays to process: {len(merged_data)}")

        print(f"🔍 Debug: Merged Columns: {merged_data.columns.tolist()}")

    except Exception as e:
        print(f"❌ Error loading data: {e}")
        return

    # 2. Loop processing (using tqdm to display progress bar)
    print("🚀 Starting Batch Evaluation...")

    for index, row in tqdm(merged_data.iterrows(), total=len(merged_data)):
        # Retrieve the ID (preferably essay_id)
        current_id = row.get('essay_id', row.get('sample_id', f'Unknown_{index}'))

        # Retrieve the essay text
        if 'essay' in row:
            current_essay = row['essay']
        elif 'essay_text' in row:
            current_essay = row['essay_text']
        # Handling potential suffix combinations
        elif 'essay_x' in row:
            current_essay = row['essay_x']
        elif 'essay_y' in row:
            current_essay = row['essay_y']
        else:
            # Only when none of the above names can be found should an error be reported.
            print(f"⚠️ Column missing for ID {current_id}: Check column names!")
            current_essay = ""

        # Get feedback (also check if high_level_feedback has a suffix)
        high_level = row.get('high_level_feedback', row.get('high_level_feedback_x', row.get('high_level_feedback_y', '')))
        comments = row.get('comments', row.get('comments_x', row.get('comments_y', '')))
        current_feedback = f"High Level Feedback:\n{high_level}\n\nSpecific Comments:\n{comments}"

        # Validate the essay content
        if pd.isna(current_essay) or len(str(current_essay)) < 10:
            # Skip only when the essay is missing or shorter than 10 characters.
            print(f"⚠️ Skip ID {current_id}: Content empty.")
            continue

        # Call GPT (passing the current row's essay and feedback)
        evaluation = get_llm_evaluation(current_essay, current_feedback, client)

        if evaluation:
            evaluation['essay_id'] = current_id # Record the correct ID
            results_list.append(evaluation)
        else:
            # Evaluation failed: record only the ID so the essay still appears in the results
            results_list.append({
                "essay_id": current_id
            })

    # 3. Save results
    final_df = pd.DataFrame(results_list)

    # Deduplicate again as a safeguard.
    final_df = final_df.drop_duplicates(subset=['essay_id'])

    print(f"\n💾 Saving {len(final_df)} unique results to Drive...")
    final_df.to_pickle(OUTPUT_PKL_PATH)

    print("🎉 Done! Preview:")
    print(final_df.head())

if __name__ == "__main__":
    main()

Loading data...
✅ Data merged! Total unique essays to process: 122
🔍 Debug: Merged Columns: ['essay_id', 'created_at', 'title', 'subject', 'grade', 'knowledge level', 'grammar level', 'flow level', 'essay', 'high_level_feedback', 'comments']
🚀 Starting Batch Evaluation...


100%|██████████| 122/122 [02:53<00:00,  1.42s/it]


💾 Saving 122 unique results to Drive...
🎉 Done! Preview:
   Tone  Level of detail  Grammar  Stucture  Content  essay_id
0     4                4        1         1        3  2d75e7f0
1     4                4        1         4        3  1f20d82f
2     4                4        1         1        3  00b8f343
3     4                4        1         1        3  b1b156cc
4     4                4        1         4        3  332031c4
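
The collected results can be summarized before moving on to analysis. The sketch below is not part of the original notebook: `SCORE_KEYS` and `summarize` are hypothetical names, and it assumes the input is a list of dicts shaped like `results_list`, where a failed evaluation carries only an `essay_id`. The sample data reuses the five preview rows printed above plus one made-up failed row for illustration.

```python
from statistics import mean
from typing import Any, Dict, List

# Score keys as spelled in the rubric's output format.
SCORE_KEYS = ("Tone", "Level of detail", "Grammar", "Stucture", "Content")


def summarize(results: List[Dict[str, Any]]) -> Dict[str, Any]:
    """Mean score per dimension, plus the IDs of essays that have no scores."""
    failed = [r["essay_id"] for r in results if not any(k in r for k in SCORE_KEYS)]
    means = {}
    for key in SCORE_KEYS:
        values = [r[key] for r in results if isinstance(r.get(key), int)]
        if values:
            means[key] = mean(values)
    return {"means": means, "failed_ids": failed}


# The five preview rows shown above, plus one hypothetical failed row.
preview = [
    {"Tone": 4, "Level of detail": 4, "Grammar": 1, "Stucture": 1, "Content": 3, "essay_id": "2d75e7f0"},
    {"Tone": 4, "Level of detail": 4, "Grammar": 1, "Stucture": 4, "Content": 3, "essay_id": "1f20d82f"},
    {"Tone": 4, "Level of detail": 4, "Grammar": 1, "Stucture": 1, "Content": 3, "essay_id": "00b8f343"},
    {"Tone": 4, "Level of detail": 4, "Grammar": 1, "Stucture": 1, "Content": 3, "essay_id": "b1b156cc"},
    {"Tone": 4, "Level of detail": 4, "Grammar": 1, "Stucture": 4, "Content": 3, "essay_id": "332031c4"},
    {"essay_id": "hypothetical_failed_id"},  # made up for illustration
]

summary = summarize(preview)
print(summary["means"])
print(summary["failed_ids"])
```

Listing the failed IDs separately keeps them out of the averages and makes it easy to re-run only those essays.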



