# CP1 Evaluation Performance

Fortunately, the automated ASCOR assessment tool is straightforward to evaluate, because ground truth data exists in the form of ASCOR CP1 country assessments.

Due to computational constraints, we will run the CP1a and CP1b assessments for 10 countries that ASCOR has also evaluated, to see how our tools compare to the ground truth. We can also compare how the two implementations of the CP1a evaluation (separated into subcriteria vs. one large model) perform.


## Setup

In [None]:
# Import necessary modules
import sys
import os
from pathlib import Path
import pandas as pd
import matplotlib.pyplot as plt

# Resolve the project root, assumed to sit two levels above the notebook's working directory
notebook_dir = Path(os.getcwd())
project_root = notebook_dir.parent.parent

# Add the project root to the Python path so the `scripts` package is importable
if str(project_root) not in sys.path:
    sys.path.insert(0, str(project_root))
    print(f"Added {project_root} to sys.path")

from scripts.climate_policy_pipelines.cp1.pipeline import (
    run_cp1a_assessment,
    run_cp1a_assessment_large_context,
    run_cp1b_assessment,
)
from scripts.evaluation.eval_support import evaluate_cp1_assessments

Added c:\Users\User\GitHub\group-6-final-project to sys.path
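
The setup cell relies on the notebook's working directory being exactly two levels below the project root. As a hedged alternative, the sketch below walks upward until it finds a directory containing a `scripts` folder. The helper names `find_project_root` and `add_to_sys_path` are hypothetical, and using `scripts` as the marker is an assumption based on the imports above.

```python
import sys
from pathlib import Path


def find_project_root(start: Path, marker: str = "scripts") -> Path:
    """Walk upward from `start` until a directory containing `marker` is found.

    `marker` defaults to "scripts" (an assumption based on the notebook's imports).
    """
    start = start.resolve()
    for candidate in (start, *start.parents):
        if (candidate / marker).is_dir():
            return candidate
    raise FileNotFoundError(f"No parent of {start} contains a '{marker}' directory")


def add_to_sys_path(root: Path) -> bool:
    """Insert `root` at the front of sys.path; return True only if it was newly added."""
    root_str = str(root)
    if root_str in sys.path:
        return False
    sys.path.insert(0, root_str)
    return True
```

In the notebook this could be used as `project_root = find_project_root(Path.cwd())` followed by `add_to_sys_path(project_root)`, which would keep the imports working if the notebook is moved to a different depth in the repository.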




Let's also load the actual ASCOR assessments (the "ground truth") and look at the first few rows:

In [1]:
# Load ASCOR data
ascor_ground_truth = pd.read_excel("ASCOR_assessments_results.xlsx")

ascor_ground_truth.head(5)

NameError: name 'pd' is not defined

## Run Assessment

Here we iterate through 10 countries with high GDPs evaluated by ASCOR and run our assessment tools, compiling everything into a dataframe for direct comparison.

In [None]:
countries = ["USA", "Canada", "Mexico", "Germany", "France"]

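# Assumes evaluate_cp1_assessments returns the comparison DataFrame used in the cells below
comparison_df = evaluate_cp1_assessments(countries=countries, ascor_ground_truth=ascor_ground_truth)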

HFValidationError: Repo id must be in the form 'repo_name' or 'namespace/repo_name': 'local_model/climatebert/distilroberta-base-climate-f'. Use `repo_type` argument if needed.
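
One plausible cause of the `HFValidationError` above is that the local model directory does not exist relative to the current working directory, so the string is treated as a Hub repo id and rejected for having too many path segments. The sketch below is a minimal pre-flight check under that assumption. `resolve_local_model_dir` is a hypothetical helper, not part of the project or of any Hugging Face library.

```python
from pathlib import Path


def resolve_local_model_dir(model_dir: str, base_dir: Path) -> Path:
    """Return an absolute path to a local model directory, or fail with a clear message.

    Relative paths are interpreted against `base_dir` (for example the project
    root) instead of whatever the current working directory happens to be.
    """
    path = Path(model_dir)
    if not path.is_absolute():
        path = base_dir / path
    path = path.resolve()
    if not path.is_dir():
        raise FileNotFoundError(
            f"Local model directory not found: {path}. "
            "Download the model there or correct the configured path."
        )
    return path
```

For example, `resolve_local_model_dir("local_model/climatebert/distilroberta-base-climate-f", project_root)` would either return an absolute directory that can be passed on as a string to the model loader, or raise a `FileNotFoundError` that points at the missing folder. Where the pipeline actually loads the model is not shown in this notebook, so the call site is left open.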

### Compare Results

To evaluate performance, let's compute the accuracy of each assessment method against the ASCOR ground truth:

In [None]:
# Calculate accuracy for each assessment method
cp1a_accuracy = (comparison_df['ASCOR_True'] == comparison_df['CP1A_Assessment']).mean()
cp1a_large_accuracy = (comparison_df['ASCOR_True'] == comparison_df['CP1A_Large_Context_Assessment']).mean()
cp1b_accuracy = (comparison_df['ASCOR_True'] == comparison_df['CP1B_Assessment']).mean()

print(f"CP1A Accuracy: {cp1a_accuracy:.2%}")
print(f"CP1A Large Context Accuracy: {cp1a_large_accuracy:.2%}")
print(f"CP1B Accuracy: {cp1b_accuracy:.2%}")

# Collect labels and values for the bar chart in the next cell
methods = ['CP1A', 'CP1A Large Context', 'CP1B']
accuracies = [cp1a_accuracy, cp1a_large_accuracy, cp1b_accuracy]

KeyError: 'ASCOR_True'
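
The `KeyError` above suggests that `comparison_df` does not contain a column named `ASCOR_True`. As a hedged sketch, the accuracy computation can be wrapped in a function that validates the expected columns first and reports which ones are available. `method_accuracies` is a hypothetical helper, and the toy values are made up purely to exercise it; they are not real assessment results.

```python
import pandas as pd


def method_accuracies(df: pd.DataFrame, truth_col: str, method_cols: list) -> dict:
    """Return the share of rows where each method column equals the ground truth column."""
    missing = [c for c in (truth_col, *method_cols) if c not in df.columns]
    if missing:
        raise KeyError(f"Missing columns {missing}; available columns: {list(df.columns)}")
    return {col: float((df[truth_col] == df[col]).mean()) for col in method_cols}


# Made-up toy values for illustration only (NOT real ASCOR or pipeline output).
toy = pd.DataFrame({
    "ASCOR_True": ["Yes", "No", "Yes", "No"],
    "CP1A_Assessment": ["Yes", "No", "No", "No"],
    "CP1B_Assessment": ["Yes", "Yes", "No", "No"],
})
toy_accuracies = method_accuracies(toy, "ASCOR_True", ["CP1A_Assessment", "CP1B_Assessment"])
```

If a column is absent, the error message lists the columns that do exist, which makes it easier to spot a naming mismatch between the evaluation helper and this notebook.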

Let's also plot these accuracies:

In [None]:
plt.figure(figsize=(10, 6))
bars = plt.bar(methods, accuracies, color=['skyblue', 'lightgreen', 'coral'])
plt.title('Assessment Method Accuracy Comparison')
plt.ylabel('Accuracy')
plt.ylim(0, 1)

# Add percentage labels on bars
for bar, acc in zip(bars, accuracies):
    plt.text(bar.get_x() + bar.get_width()/2, bar.get_height() + 0.01, 
             f'{acc:.1%}', ha='center', va='bottom')

plt.tight_layout()
plt.show()

# Display comparison DataFrame
print("\nDetailed Results:")
print(comparison_df)