Notes:
- Label each question's type in the question text itself (for example, written response vs. multiple choice), perhaps wrapped in ** to mark it.
- Also state the rubric / requirements clearly in the question, perhaps in small text at the bottom, so the AI can grade against the actual criteria.
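
A minimal sketch of how both notes could be applied in code, assuming each column header follows the pattern seen in this export (`<id>: **<label>** <question text>`) and that the rubric is appended after a `Rubric:` marker at the end of the question text. The `parse_header` helper and the `Rubric:` marker are hypothetical and only illustrate the idea; they are not part of the existing form.

```python
import re

# Assumed header pattern: "<numeric id>: **<label>** <question text>"
HEADER_RE = re.compile(
    r"^(?P<qid>\d+):\s*\*\*(?P<label>[^*]+)\*\*\s*(?P<body>.*)$",
    re.DOTALL,
)

def parse_header(header):
    """Split a column header into id, label, question text, and optional rubric."""
    match = HEADER_RE.match(header.strip())
    if match is None:
        return None  # header has no "**label**" marker
    # "Rubric:" is a hypothetical marker for the grading criteria
    question, _, rubric = match.group("body").partition("Rubric:")
    return {
        "id": match.group("qid"),
        "label": match.group("label").strip(),
        "question": question.strip(),
        "rubric": rubric.strip() or None,
    }
```

With labels parsed this way, the grader could route only `Written Response` questions to the model and pass the rubric along with the question text.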

In [145]:
import pandas as pd
from groq import Groq
import os
from dotenv import load_dotenv

load_dotenv()
GROQ_KEY = os.getenv("GROQ_KEY")

client = Groq(
    api_key=GROQ_KEY,
)

def call_groq(query):
    """Send the grading instructions plus `query` to Groq and return the reply text."""
    chat_completion = client.chat.completions.create(
        messages=[
            {
                "role": "system",
                "content": "You will be given a question and a set of responses to that question. Go through each response and evaluate each for correctness. Decide if they are \n3/3: Largely Correct \n2/3: Moderate mistakes / misunderstandings \n1/3: Mostly incorrect \n0/3: Not attempted or incoherent. \nProvide your response as a json object with the response number. For example, [{'id': 1, 'response': '<full text of their response>', 'eval': 'Your eval / feedback here...', 'score': 3}, {'id': 2, 'response': '<full text of their response>', 'eval': 'Your eval / feedback here...', 'score': 1}]",
            },
            {
                "role": "user",
                "content": query,
            }
        ],
        model="llama-3.1-70b-versatile",
        temperature=0,
    )

    return chat_completion.choices[0].message.content
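
The reply comes back as plain text, and in the output recorded in this notebook the model adds a sentence before the JSON array. A minimal sketch of one way to recover the scores, assuming the reply contains a single valid JSON array; `extract_json_array` is a hypothetical helper, not part of the Groq client.

```python
import json

def extract_json_array(reply):
    """Parse the span from the first '[' to the last ']' in a reply as JSON.

    Returns the parsed list, or None if no such span exists or it is not valid JSON.
    """
    start = reply.find("[")
    end = reply.rfind("]")
    if start == -1 or end <= start:
        return None
    try:
        return json.loads(reply[start:end + 1])
    except json.JSONDecodeError:
        return None
```

Returning `None` rather than raising lets the caller decide whether to retry the request or flag the question for manual grading.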

In [146]:
df = pd.read_csv('Submissions/M5Entry.csv')
df.head()

# Print every response in column 13; .iloc makes the positional lookup explicit
for index, row in df.iterrows():
    print(row.iloc[13])

90% training / 10% testing: In this option, this model will likely have a more general familiarity with the training dataset and when encountering new and unseen datasets, while the other option will have a lack of accuracy and will be familiar with unseen datasets. In conclusion, while they have their advantages and disadvantages, a banlance between testing and training will give out a better model.
Smaller training set would result on the model to learn only a small portion of the data, thus when testing on unseen data it would have a higher MSE value. A larger training set would reduce the likeliness of that happening but with a smaller testing set it wouldn't  be able to test on a larger range of data
Using a larger training dataset (90 / 10) can be beneficial because the model can have a better idea of the trends of the dataset. It can perform "well" with interpreting the data, thus creating better generalizations and trends. The disadvantages to this is that it may overfit the da



In [147]:
# Find the positions of all columns whose header contains '**Written Response**'
written_index = [i for i, col in enumerate(df.columns) if '**Written Response**' in col]
print(written_index)

# Keep only the columns with 'Written Response'
df = df.iloc[:, written_index]
df.head()

[13]


Unnamed: 0,"177945147: **Written Response** When splitting a dataset into training and testing subsets, you have the option to choose different proportions of the data for training vs testing. Discuss the advantages and disadvantages of using a larger training set (such as 90% training / 10% testing) compared to a smaller training set (like 10% training / 90% testing)."
0,"90% training / 10% testing: In this option, th..."
1,Smaller training set would result on the model...
2,Using a larger training dataset (90 / 10) can ...
3,"Overfitting could be a disadvantage, the model..."
4,When training with a larger proportion of the ...


In [148]:
# Build one prompt from a question and all of its written responses, send it to the model, and return the reply
def score_question(question, responses):
    # Assemble the query: the question followed by the numbered responses
    query = f"Question: '{question}':\n\nResponses:"
    for i, response in enumerate(responses, start=1):
        query += f"\n\n{i}:\n{response}"
    #print(query)
    # Call the Groq API
    result = call_groq(query)
    print(result)
    return result

# df now holds only the written-response columns (there may be several).
# Score each question together with all of its responses.
for question in df.columns:
    responses = df[question].tolist()
    score_question(question, responses)


Here are the evaluations of each response:

[
  {
    "id": 1,
    "response": "90% training / 10% testing: In this option, this model will likely have a more general familiarity with the training dataset and when encountering new and unseen datasets, while the other option will have a lack of accuracy and will be familiar with unseen datasets. In conclusion, while they have their advantages and disadvantages, a banlance between testing and training will give out a better model.",
    "eval": "The response touches on the idea of a balance between training and testing, but it's not clear what the advantages and disadvantages of each option are. The response also seems to contradict itself, stating that the 90% training option will have a 'more general familiarity' with the training dataset, but then saying that the 10% training option will be 'familiar with unseen datasets'.",
    "score": 1
  },
  {
    "id": 2,
    "response": "Smaller training set would result on the model to learn o