# Evaluating Curriculum Rigor
## Background
In my experience with high school curricula, I have found wide variation in the rigor of course material.  This project seeks to develop a tool for evaluating the rigor of a curriculum by measuring its alignment to the corresponding College Board AP course.  It focuses on the College Board's AP Computer Science A course, which covers a first-year course in Java and object-oriented design.

For this course, the College Board defines a set of "Computational Thinking Practices" (skills) and content that will be assessed on a year-end summative assessment to determine students' mastery of the course.

There are 5 main Computational Thinking Practices identified by the College Board, which it then breaks down into subskills:  

<img src="Reports/Images/Skills-List.png" width=600px> 

In addition, the College Board defines a set of "Essential Knowledge" (the content) to be assessed in the course, which it organizes under 5 "Big Ideas."  For example, the content for a lesson on iteration is: 

<img src="Reports/Images/Content-Sample.png" width=400px>

Every question on the College Board's end-of-course summative exam is aligned to a particular computational thinking skill and essential knowledge.  As a note, some school networks have found the College Board's standards to be very complete, and "backwards plan" their middle school and pre-AP high school courses to prepare students for the AP level work. 

As a first step, this project will focus on the assessment questions used in a particular curriculum, and measure how well they align to the College Board's Computational Thinking Practices and Curriculum Framework.  (As a note, AP classes in most subjects have an analogous set of thinking practices and framework standards, so one day this work may be generalized to assess curricula in other subject areas.)

Two questions to assess are:  

1. Can a TF-IDF vectorization of College Board question prompts, combined with a logistic regression, successfully classify an assessment question by Computational Thinking Practice?
2. If ChatGPT is supplied only with the College Board Framework for Computational Thinking, can it successfully identify the particular thinking practice being assessed by a question prompt?

### Initial Conclusions:
1. TF-IDF and logistic regression together classify questions with 74% accuracy.
2. ChatGPT, supplied with the College Board Framework, classifies questions with 47% accuracy.

### Next Steps:
1. Supply the entire assessment question, not just the question prompt, to each classifier to help with classification.
2. Determine whether the classifier can also identify the "Essential Knowledge" assessed by the question, not just the computational skill.
3. Attempt to generalize the classifiers to classify non-assessment questions such as lecture material, lab questions, and homework problems.
4. Create a visualization that shows the distribution of thinking skills and content assessed over the course of the curriculum.

## TF-IDF with Logistic Regression Classification
To start, this project reads in a set of 40 assessment questions and their alignment to the Computational Thinking Practices (CTP).  It vectorizes the question prompt only, and then uses a logistic regression to map from the vectorized prompt to the CTP.
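
As a compact illustration of this vectorize-then-classify approach, the sketch below chains the two steps with scikit-learn's `make_pipeline`. The toy prompts and labels are hypothetical, and the pipeline is a convenience for illustration rather than the structure used in the cells that follow.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Hypothetical toy prompts and labels, for illustration only.
train_prompts = [
    "What is printed as a result of executing the code segment?",
    "What value is printed when the loop finishes executing?",
    "Which of the following should replace /* missing code */?",
    "Which statement can replace /* missing code */ so the method works?",
]
train_labels = ["2.B", "2.B", "1.B", "1.B"]

# Vectorize the prompt text, then fit a logistic regression on the vectors.
model = make_pipeline(TfidfVectorizer(stop_words="english"), LogisticRegression())
model.fit(train_prompts, train_labels)

# Classify an unseen prompt; the result is one of the training labels.
predicted = model.predict(["Which expression should replace /* missing code */?"])
print(predicted)
```

Keeping the vectorizer and classifier in one object means new prompts always pass through the vocabulary learned during training.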

In [5]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix

In [7]:
df_2014 = pd.read_csv("Data/CollegeBoard/SamplePrompts-PracticeExam2014.csv")
classifiers_unique = df_2014["Classification"].unique()

Concatenate the prompts for each CTP to create one corpus per Thinking Practice.

In [10]:
training_text = []
for x in classifiers_unique:
    # Join with a space so the last word of one prompt does not fuse with the first word of the next.
    prompts = df_2014.loc[df_2014.Classification == x, "Prompt"]
    training_text.append({"Classification": x, "Prompt": " ".join(prompts)})
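
The per-practice concatenation can also be written as a single pandas `groupby`. The sketch below uses a small hypothetical frame with the same two column names as `df_2014`; it is an alternative formulation, not the code this notebook ran.

```python
import pandas as pd

# Hypothetical stand-in for df_2014, using the same two column names.
sample = pd.DataFrame({
    "Classification": ["2.A", "1.B", "2.A"],
    "Prompt": ["What is printed?", "Which should replace the code?", "What is returned?"],
})

# One row per practice; prompts are joined with a space so words at the
# boundary between two prompts do not run together.
corpus = (
    sample.groupby("Classification", sort=False)["Prompt"]
    .agg(" ".join)
    .reset_index()
)
print(corpus)
```

Passing `sort=False` keeps the practices in order of first appearance, which matches the order produced by `unique()`.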


In [11]:
df = pd.DataFrame(training_text)
df

| | Classification | Prompt |
|---|---|---|
| 0 | 2.C | What value will be returned as a result of the... |
| 1 | 1.C | Which of the following code segments will comp... |
| 2 | 2.A | What is printed as a result of executing the c... |
| 3 | 1.B | Which of the following should replace /* missi... |
| 4 | 4.B | Which of the following declarations will compi... |
| 5 | 2.B | What will be printed as a result of executing ... |
| 6 | 4.C | The expression is equivalent to which of the f... |
| 7 | 5.A | Which of the following describes what the meth... |
| 8 | 5.B | Which of the following best explains why the c... |
| 9 | 3.D | Which of the following will correctly print al... |


In [19]:
vectorizer = TfidfVectorizer(stop_words="english")
X = vectorizer.fit_transform(df["Prompt"])
X = pd.DataFrame(X.toarray(), columns=vectorizer.get_feature_names_out(), index=df["Classification"])

In [20]:
lr = LogisticRegression()
lr.fit(X, X.index )

### Test Logistic Regression
Read in a different set of question prompts, and use the classifier to classify them by CTP.

In [25]:
df_2020 = pd.read_csv("Data/CollegeBoard/SamplePrompts-PracticeExam2020.csv")
df_2020 = df_2020[~df_2020["Classification"].isin(["2.D", "5.C", "5.D"])]
X_test = vectorizer.transform(df_2020["Prompt"])
X_test = pd.DataFrame(X_test.toarray(), columns=vectorizer.get_feature_names_out())
print(lr.score(X_test, df_2020["Classification"]))
y_test_pred = lr.predict(X_test)
y_test_pred

0.2


array(['1.C', '1.B', '1.B', '1.B', '1.C', '1.B', '1.B', '4.B', '1.B',
       '1.B', '1.B', '1.B', '1.C', '5.B', '2.A', '5.B', '2.A', '2.A',
       '2.A', '2.A', '2.B', '2.A', '5.A', '5.A', '5.A', '2.A', '5.A',
       '5.A', '4.B', '5.B', '4.C', '5.B', '5.B', '2.B', '5.B'],
      dtype=object)
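
The filter above hard-codes three practices to exclude, yet the report further down still lists 4.A, which does not appear in the training table. A minimal sketch of deriving the excluded set from the data instead, using hypothetical stand-in frames:

```python
import pandas as pd

# Hypothetical stand-ins for the training labels and the test frame.
train_labels = pd.Series(["1.B", "1.C", "2.A"])
test = pd.DataFrame({
    "Classification": ["1.B", "2.D", "2.A", "5.C"],
    "Prompt": ["p1", "p2", "p3", "p4"],
})

# Labels present in the test set but never seen during training.
unseen = sorted(set(test["Classification"]) - set(train_labels))

# Keep only rows whose label the classifier could possibly predict.
test_seen = test[test["Classification"].isin(train_labels)].reset_index(drop=True)
print(unseen, len(test_seen))
```

Whether unseen practices should be dropped or counted as misses is a judgment call; dropping them measures the classifier only on practices it had a chance to learn.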

In [22]:
confusion_matrix(df_2020["Classification"], y_test_pred)

array([[4, 2, 0, 0, 0, 0, 0, 0, 0, 0],
       [5, 0, 0, 0, 0, 0, 1, 0, 0, 0],
       [0, 1, 0, 0, 0, 0, 0, 0, 0, 0],
       [0, 0, 2, 0, 0, 0, 0, 0, 0, 2],
       [0, 0, 5, 1, 0, 0, 0, 0, 3, 0],
       [0, 0, 0, 0, 0, 0, 0, 0, 2, 0],
       [0, 0, 0, 0, 0, 0, 1, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 0, 1, 0, 1],
       [0, 0, 0, 1, 0, 0, 0, 0, 0, 2],
       [0, 0, 0, 0, 0, 0, 0, 0, 0, 1]])
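
The raw array does not show which row or column belongs to which practice. One way to make it readable, sketched here with hypothetical labels, is to pass an explicit `labels` list to `confusion_matrix` and wrap the result in a DataFrame:

```python
import pandas as pd
from sklearn.metrics import confusion_matrix

# Hypothetical true and predicted labels, for illustration only.
y_true = ["1.B", "1.C", "2.A", "1.B"]
y_pred = ["1.B", "1.B", "2.A", "1.C"]

# Fix the label order explicitly so rows and columns can be named.
labels = sorted(set(y_true) | set(y_pred))
cm = confusion_matrix(y_true, y_pred, labels=labels)

# Rows are the true practice, columns are the predicted practice.
cm_df = pd.DataFrame(cm, index=labels, columns=labels)
print(cm_df)
```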

In [27]:
print(classification_report(df_2020["Classification"], y_test_pred))

              precision    recall  f1-score   support

         1.B       0.44      0.67      0.53         6
         1.C       0.00      0.00      0.00         6
         2.A       0.00      0.00      0.00         1
         2.B       0.00      0.00      0.00         4
         2.C       0.00      0.00      0.00         9
         4.A       0.00      0.00      0.00         2
         4.B       0.50      1.00      0.67         1
         4.C       1.00      0.50      0.67         2
         5.A       0.00      0.00      0.00         3
         5.B       0.17      1.00      0.29         1

    accuracy                           0.20        35
   macro avg       0.21      0.32      0.22        35
weighted avg       0.15      0.20      0.16        35
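
Several practices in this report have a precision of 0.00 because the classifier never predicted them, and scikit-learn raises an `UndefinedMetricWarning` in that situation. A small sketch, on hypothetical labels, of how the `zero_division` parameter makes that choice explicit, and how `output_dict=True` exposes the numbers for further analysis:

```python
from sklearn.metrics import classification_report

# Hypothetical labels: the classifier predicts "1.B" for everything.
y_true = ["1.B", "1.C", "2.A", "1.B"]
y_pred = ["1.B", "1.B", "1.B", "1.B"]

# zero_division=0 reports 0 for undefined precision without a warning.
report = classification_report(y_true, y_pred, zero_division=0, output_dict=True)
print(report["1.C"]["precision"], report["accuracy"])
```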





## OpenAI

Supply an LLM with the College Board's definitions of the Computational Thinking Practices, and determine how well it categorizes question prompts.

In [None]:
from openai import OpenAI



In [31]:
prompt_start = "Here are the categories for AP questions. \
1.B: Determine code that would be used to complete code segments \
1.C: Determine code that would be used to interact with completed program code. \
2.A: Apply the meaning of specific operators \
2.B: Determine the result or output based on statement execution order in a code segment without method calls (other than output) \
2.C: Determine the result or output based on the statement execution order in a code segment containing method calls. \
2.D: Determine the number of times a code segment will execute. \
4.A: Use test-cases to find errors or validate results. \
4.B: Identify errors in program code. \
4.C: Determine if two or more code segments yield equivalent results. \
5.A: Determine the behavior of a given segment of program code. \
5.B: Explain why a code segment will not compile or work as intended \
5.C: Explain how the result of program code changes, given a change to the initial code. \
5.D: Describe the initial conditions that must be met for a program segment to work as intended or described. \
Which of the categories above best classifies this question prompt below? "

In [32]:
def gpt_guess(prompt):
    # Send the category definitions followed by one question prompt; return the model's reply text.
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt_start + prompt}],
    )
    return response.choices[0].message.content


In [33]:
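# Create the API client; OpenAI() reads the key from the OPENAI_API_KEY environment variable.
client = OpenAI()

df = pd.read_csv("Data/CollegeBoard/SamplePrompts-PracticeExam2020.csv")

# Ask the model to classify each prompt and store its raw reply.
for i in df.index:
    df.loc[i, "GPT_Pred"] = gpt_guess(df.loc[i, "Prompt"])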

In [None]:
import pandas as pd
df = pd.read_csv("Data/output1.csv")

In [None]:
# Show each free-text reply and record its practice code by hand.
for i in df.index:
    print(df.loc[i, "GPT_Pred"])
    code = input("What is the code?")
    df.loc[i, "GPT_Code"] = code
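
Since the replies are free text, the practice code could also be pulled out with a regular expression rather than typed in by hand. The sketch below assumes every code has the form digit, dot, letter (as in the category list above); `extract_code` is a hypothetical helper, and replies that mention more than one code would still need a manual check.

```python
import re

# Practice codes look like "1.B" or "5.D": a digit 1-5, a dot, a letter A-D.
CODE_PATTERN = re.compile(r"\b([1-5]\.[A-D])\b")

def extract_code(reply):
    """Return the first practice code found in a model reply, or None."""
    match = CODE_PATTERN.search(reply)
    return match.group(1) if match else None

print(extract_code("The best category is 2.B: Determine the result or output."))
print(extract_code("I am not sure."))
```

Applied to a column, this would be `df["GPT_Pred"].apply(extract_code)`, leaving `None` wherever no code was found.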

In [None]:
df.loc[0, "GPT_Code"] = "1.B"

In [None]:
df

In [None]:
from sklearn.metrics import classification_report

print(classification_report(df["Classification"], df["GPT_Code"]))

In [None]:
from sklearn.metrics import confusion_matrix
from sklearn.metrics import ConfusionMatrixDisplay

cm = confusion_matrix(df["Classification"], df["GPT_Code"])
(ConfusionMatrixDisplay(cm)).plot()

### Try to use simplified prompt with work

In [None]:
prompt_start2 = "Here are the categories for AP questions. \
1.B: Determine code that would be used to complete code segments \
1.C: Determine code that would be used to interact with completed program code. \
2.A: Apply the meaning of specific operators \
2.B: Determine the result or output based on statement execution order in a code segment without method calls (other than output) \
2.C: Determine the result or output based on the statement execution order in a code segment containing method calls. \
2.D: Determine the number of times a code segment will execute. \
4.A: Use test-cases to find errors or validate results. \
4.B: Identify errors in program code. \
4.C: Determine if two or more code segments yield equivalent results. \
5.A: Determine the behavior of a given segment of program code. \
5.B: Explain why a code segment will not compile or work as intended \
5.C: Explain how the result of program code changes, given a change to the initial code. \
5.D: Describe the initial conditions that must be met for a program segment to work as intended or described. "

prompt_question = "Using the categories previously listed, determine the category for this question prompt:"

In [None]:

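# Send the category definitions once, followed by three question prompts, in a single request.
response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[
        {"role": "user", "content": prompt_start2},
        {"role": "user", "content": prompt_question + df.loc[0, "Prompt"]},
        {"role": "user", "content": prompt_question + df.loc[1, "Prompt"]},
        {"role": "user", "content": prompt_question + df.loc[2, "Prompt"]},
    ],
)

# The request returns a single choice by default, so the reply is at index 0.
response.choices[0].message.content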

In [None]:
response
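
Because one request yields one reply, sending several question prompts together makes it hard to map answers back to questions. A possible alternative, sketched under the assumption that the category list is placed in a system message, is to build one short message list per question; `build_messages` is a hypothetical helper and no API call is made here.

```python
def build_messages(categories, question_prompt):
    """Build the message list for classifying a single question prompt."""
    instruction = (
        "Using the categories listed, determine the category "
        "for this question prompt: "
    )
    return [
        {"role": "system", "content": categories},
        {"role": "user", "content": instruction + question_prompt},
    ]

# Hypothetical shortened inputs, for illustration only.
categories = "1.B: Determine code that would be used to complete code segments"
messages = build_messages(categories, "Which of the following should replace /* missing code */?")
print(messages)
```

The resulting list can be passed as the `messages` argument of `client.chat.completions.create`, once per question.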

## Try Improving TF-IDF by Including Full Question Text in Document

## Try Extracting Text from PDF Automatically

In [None]:
import PyPDF2

def extract_text_from_pdf(pdf_path):
    text = "" 
    with open(pdf_path, "rb") as pdf_file:
        pdf_reader = PyPDF2.PdfReader(pdf_file)
        for page_num in range(len(pdf_reader.pages)):
            page = pdf_reader.pages[page_num]
            text += page.extract_text()
    return text

if __name__ == "__main__":
    pdf_path = "Data/CollegeBoard/ap-computer-science-a-2014-practice-exam.pdf"
    extract_text = extract_text_from_pdf(pdf_path)

In [None]:
extract_text = extract_text.replace("\n","")

In [None]:
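questions = []

# Cover all 40 questions (range excludes its upper bound).
for number in range(1, 41):
    # A question runs from its own number marker to the next question's marker.
    begin = extract_text.find(str(number) + ".")
    end = extract_text.find(str(number + 1) + ".")

    question = extract_text[begin:end]
    # Cut the text at the "(E)" answer-choice marker.
    option_e = question.find("(E)")
    question = question[:option_e]
    questions.append({"number": number, "text": question})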
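
Searching from the start of the text each time can match a marker such as "1." inside "11." or inside a decimal number. The sketch below shows one way to reduce that risk by scanning forward from the previous match. The sample text and the `split_questions` helper are hypothetical, and real PDF text may need further cleanup.

```python
# Hypothetical extracted text: numbered questions with options (A) to (E).
sample_text = (
    "1. What is printed? (A) 1 (B) 2 (C) 3 (D) 4 (E) 5 "
    "2. Which is true? (A) x (B) y (C) z (D) w (E) v"
)

def split_questions(text, count):
    """Slice text into up to `count` questions, scanning forward from each match."""
    questions = []
    position = 0
    for number in range(1, count + 1):
        begin = text.find(f"{number}.", position)
        if begin == -1:
            break  # marker not found; stop rather than guess
        end = text.find(f"{number + 1}.", begin + 1)
        if end == -1:
            end = len(text)  # last question runs to the end of the text
        questions.append({"number": number, "text": text[begin:end].strip()})
        position = end
    return questions

parsed = split_questions(sample_text, 2)
print(parsed)
```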

In [None]:
df_questions = pd.DataFrame(questions)

In [None]:
import pandas as pd 

df = pd.read_csv("Data/CollegeBoard/SamplePrompts-PracticeExam2014.csv")

In [None]:
df["Question_Num"] = df["Source"].str.slice(14).astype(int)
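
Slicing at a fixed offset depends on every `Source` value sharing the same 14-character prefix. If that ever varies, the trailing number can be extracted with a pattern instead. The `Source` strings below are hypothetical, since the real format is not shown here.

```python
import pandas as pd

# Hypothetical Source values; the real column format is an assumption.
sources = pd.Series(["Exam 2014, Question 7", "Exam 2014, Question 12"])

# Capture the run of digits at the end of each string.
question_num = sources.str.extract(r"(\d+)\s*$", expand=False).astype(int)
print(question_num.tolist())
```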


In [None]:
df = df.merge(df_questions, left_on="Question_Num", right_on="number")
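# index=False avoids writing the row index as an extra "Unnamed: 0" column on the next read.
df.to_csv("Data/CollegeBoard/SamplePrompts-PracticeExam2014.csv", index=False)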
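
`merge` performs an inner join by default, so any labeled question whose text was not extracted disappears silently. A quick check, sketched with hypothetical frames, is to merge with `how="left"` and `indicator=True` and list the unmatched question numbers before saving:

```python
import pandas as pd

# Hypothetical stand-ins for the labeled prompts and the extracted question text.
labels = pd.DataFrame({"Question_Num": [1, 2, 3], "Classification": ["2.A", "1.B", "5.A"]})
texts = pd.DataFrame({"number": [1, 2], "text": ["1. First question", "2. Second question"]})

# A left join keeps every labeled question; _merge records whether text was found.
merged = labels.merge(texts, left_on="Question_Num", right_on="number", how="left", indicator=True)
missing = merged.loc[merged["_merge"] == "left_only", "Question_Num"].tolist()
print(missing)
```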

## Try Extracting Text from 2020 Test as well

In [None]:
# Reuse extract_text_from_pdf() defined above, this time on the 2020 exam.
pdf_path = "Data/CollegeBoard/ap-computer-science-a-2020-practice-exam-and-notes-1.pdf"
extract_text = extract_text_from_pdf(pdf_path)

In [None]:
extract_text = extract_text.replace("\n","")

questions = []

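# Cover all 40 questions (range excludes its upper bound).
for number in range(1, 41):
    # A question runs from its own number marker to the next question's marker.
    begin = extract_text.find(str(number) + ".")
    end = extract_text.find(str(number + 1) + ".")

    question = extract_text[begin:end]
    # Cut the text at the "(E)" answer-choice marker.
    option_e = question.find("(E)")
    question = question[:option_e]
    questions.append({"number": number, "text": question})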

df_questions = pd.DataFrame(questions)

df = pd.read_csv("Data/CollegeBoard/SamplePrompts-PracticeExam2020.csv")



In [None]:
df["Question_Num"] = df["Source"].str.slice(14).astype(int)
df = df.merge(df_questions, left_on="Question_Num", right_on="number")
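# index=False avoids writing the row index as an extra "Unnamed: 0" column on the next read.
df.to_csv("Data/CollegeBoard/SamplePrompts-PracticeExam2020.csv", index=False)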

## Try Training Model on 2020 and Classify 2014