# Historical Score Distribution Appender Notebook

This notebook computes, for each component, the proportion of training-set rows that received each score. It then appends that historical distribution to the component's rubric text from `Teach_1.json` and saves the result as a new file, `Teach_1_with_distribution.json`.


## 1. Setup and Imports
Import `pandas` for the transcript data, `json` for reading and writing the framework, and `os` for building the output path.

In [1]:
import pandas as pd
import json
import os

## 2. Define File Paths
Set the paths for the transcripts CSV and the TEACH framework JSON.

In [2]:
TRANSCRIPTS_PATH = '/Users/mkrasnow/Desktop/montesa/new/formattedData/peru_cleaned_transcripts.csv'
FRAMEWORK_PATH = '/Users/mkrasnow/Desktop/montesa/new/models/_context/Teach_1.json'

## 3. Load Data
Read the CSV into a DataFrame and load the JSON framework.

In [3]:
df = pd.read_csv(TRANSCRIPTS_PATH)
with open(FRAMEWORK_PATH, 'r') as f:
    framework = json.load(f)
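
Before going further, it may help to confirm that every component `id` in the framework has a matching column in the transcripts CSV, since the later loop indexes `train_df[col]` directly and would raise a `KeyError` otherwise. The following is a minimal sketch, assuming the `structure` / `domains` / `components` layout used later in this notebook. The helper name `missing_component_columns` and the toy data are hypothetical.

```python
def missing_component_columns(framework, columns):
    """Return the component ids that have no matching column name."""
    available = set(columns)
    return [
        component['id']
        for domain in framework['structure']['domains']
        for component in domain['components']
        if component['id'] not in available
    ]

# Toy framework and column list, for illustration only (not the real data)
toy_framework = {
    'structure': {
        'domains': [
            {'components': [{'id': 'comp_a'}, {'id': 'comp_b'}]}
        ]
    }
}
toy_columns = ['split', 'comp_a']

missing = missing_component_columns(toy_framework, toy_columns)
print(missing)
```

In the notebook itself, the equivalent call would be `missing_component_columns(framework, df.columns)`. An empty list means every component can be looked up.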

## 4. Filter Training Data
Keep only the rows whose `split` column equals `'train'`, so that scores from test rows do not leak into the historical distribution.

In [4]:
train_df = df[df['split'] == 'train']
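
A filter like this silently returns an empty frame if the label is spelled differently (for example `'Train'`). The sketch below is one possible guard. The helper name `select_train` and its parameters are hypothetical, and the toy frame stands in for the real transcripts.

```python
import pandas as pd

def select_train(df, split_col='split', train_label='train'):
    """Return a copy of the training rows, failing loudly if none match."""
    train = df[df[split_col] == train_label].copy()
    if train.empty:
        seen = sorted(df[split_col].dropna().unique())
        raise ValueError(
            f"No rows with {split_col} == {train_label!r}; labels seen: {seen}"
        )
    return train

# Toy data, for illustration only
toy_df = pd.DataFrame({
    'split': ['train', 'test', 'train', 'val'],
    'score': ['1', '2', 'n', '3'],
})
toy_train = select_train(toy_df)

# Inspecting the label counts first shows which splits actually exist
print(toy_df['split'].value_counts(dropna=False))
```

Taking a `.copy()` is a design choice here: it keeps later modifications of the training frame from triggering pandas chained-assignment warnings.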

## 5. Compute Score Proportion Distributions and Update Rubrics
- Replace blank values and any 'n' markers with "N/A" for clarity.
- Compute the proportion that each score takes up in the training set (rounded to two decimal places).
- Append the resulting JSON-formatted proportions to each component's rubric.
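
The first two steps above can be tried on a small example before running them on the real columns. This is a sketch on made-up scores, using the same `replace`, `fillna`, and `value_counts(normalize=True)` calls as the cell below. The function name `score_proportions` is hypothetical.

```python
import pandas as pd

def score_proportions(scores):
    """Label blanks and 'n' markers as 'N/A', then return rounded proportions."""
    cleaned = (
        scores
        .replace('', pd.NA)
        .replace('n', pd.NA)
        .fillna('N/A')
    )
    proportions = cleaned.value_counts(normalize=True)
    # str() keys and plain floats keep the result JSON-serializable
    return {str(score): round(float(pct), 2) for score, pct in proportions.items()}

# Six made-up entries: one '1', two '2', and three missing-style values
toy_scores = pd.Series(['1', '2', '2', 'n', '', None], dtype=object)
toy_distribution = score_proportions(toy_scores)
print(toy_distribution)
```

Half of the toy entries are missing-style values, so `'N/A'` gets 0.5, while `'2'` and `'1'` get 0.33 and 0.17. Because each value is rounded independently, the proportions are not guaranteed to sum to exactly 1.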

In [5]:
for domain in framework['structure']['domains']:
    for component in domain['components']:
        col = component['id']
        # Replace empty strings or 'n' markers with pandas NA, then fill with 'N/A'
        col_series = (
            train_df[col]
            .replace('', pd.NA)
            .replace('n', pd.NA)
            .fillna('N/A')
        )
        # Compute normalized proportions
        proportions = col_series.value_counts(normalize=True)
        # Round to two decimal places; 'n' markers were already relabeled as
        # 'N/A' above, and str() keeps every key JSON-serializable
        distribution_dict = {str(score): round(pct, 2) for score, pct in proportions.items()}
        # Convert to JSON string for inclusion in rubric
        distribution_str = json.dumps(distribution_dict)
        # Append to rubric text
        original_rubric = component.get('rubric', '')
        component['rubric'] = (
            original_rubric
            + ' The historical score proportions for this component are: '
            + distribution_str
        )
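
Because the cell above concatenates onto the existing rubric, running it twice in the same session appends the sentence twice. One way to make the step repeatable is to strip any earlier copy before appending. This is a sketch only, assuming the marker sentence never occurs in the original rubric text. The helper name `set_distribution` is hypothetical.

```python
import json

MARKER = ' The historical score proportions for this component are: '

def set_distribution(rubric, distribution):
    """Append the distribution sentence, replacing any earlier copy of it."""
    # Everything before the first marker is treated as the original rubric
    base = rubric.split(MARKER, 1)[0]
    return base + MARKER + json.dumps(distribution)

# Made-up rubric text and proportions, for illustration only
toy_rubric = 'Rate the lesson.'
toy_distribution = {'1': 0.25, '2': 0.75}

once = set_distribution(toy_rubric, toy_distribution)
twice = set_distribution(once, toy_distribution)
print(once == twice)
```

Inside the loop, this would take the place of the `original_rubric + ... + distribution_str` concatenation, with `component.get('rubric', '')` passed as the first argument.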

## 6. Save Updated Framework
Write the modified framework to a new JSON file.

In [6]:
OUTPUT_PATH = os.path.join(
    os.path.dirname(FRAMEWORK_PATH),
    'Teach_1_with_distribution.json'
)
with open(OUTPUT_PATH, 'w') as f:
    json.dump(framework, f, indent=2)
print(f"Updated framework saved to {OUTPUT_PATH}")

Updated framework saved to /Users/mkrasnow/Desktop/montesa/new/models/_context/Teach_1_with_distribution.json
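
As a follow-up check, the appended text can be parsed back out of a rubric to confirm that it is valid JSON and that the rounded proportions still sum to roughly 1. The sketch below assumes the marker sentence used in step 5. The helper names `parse_distribution` and `sums_to_one` are hypothetical, and the toy rubric is made up.

```python
import json

MARKER = ' The historical score proportions for this component are: '

def parse_distribution(rubric):
    """Return the distribution dict appended to a rubric, or None if absent."""
    # rpartition picks the last copy in case the sentence was appended twice
    _, found, tail = rubric.rpartition(MARKER)
    if not found:
        return None
    return json.loads(tail)

def sums_to_one(distribution):
    """Allow up to 0.005 of rounding error per entry."""
    tolerance = 0.005 * len(distribution) + 1e-9
    return abs(sum(distribution.values()) - 1.0) <= tolerance

# Made-up rubric, for illustration only
toy_rubric = 'Rate the lesson.' + MARKER + '{"1": 0.33, "2": 0.33, "3": 0.33}'
toy_parsed = parse_distribution(toy_rubric)
print(toy_parsed, sums_to_one(toy_parsed))
```

To apply it to the saved file, load `OUTPUT_PATH` with `json.load` and call `parse_distribution(component['rubric'])` for each component in each domain.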


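### Conclusion
The score proportion distributions were computed from the training split, with blanks and 'n' markers labeled "N/A", and appended to each component's rubric in `Teach_1_with_distribution.json`.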