# Evaluating LLMs' Performance in Answering RD and CDCES Exam Prep Questions
This notebook provides code examples for retrieving the exam questions from the database and sending them to the LLMs in batches.
The LLMs evaluated in this project are:
- GPT 4/4o
- Gemini 1.5 Pro
- Claude 3 - Opus

We also tried Claude 3 - Haiku, Claude 3 - Sonnet, and Llama 3 8B, but because of their low accuracy we did not continue evaluating and analyzing these models. Their results will also be included in the notebook.

## Setup
Before running the rest of this notebook, uncomment and run the cell below to make sure the necessary libraries are installed.


In [None]:
#%pip install openai
#%pip install boto3
#%pip install -q -U google-generativeai

You will need to store your environment variables in a `.env` file.
Here is a sample `.env` file:

In [None]:
MYSQL_USER=<Your_MySQL_User>
MYSQL_PASSWORD=<Your_MySQL_Password>
MYSQL_HOST=<Your_MySQL_Host>
MYSQL_PORT=<Your_MySQL_Port>
DB_NAME=<Your_Database_Name>
OPENAI_API_KEY=<Your_OPENAI_API_Key>
GEMINI_API_KEY=<Your_GEMINI_API_Key>
GOOGLE_CLOUD_PROJECT=<Your_Google_Cloud_Project_ID>
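
The notebook does not show how the helper modules read these values, so the following is only a minimal sketch of one way to load a `.env` file with the standard library. The function names `load_env_text` and `load_env_file` are hypothetical and not part of this project; a library such as python-dotenv (`load_dotenv()`) would serve the same purpose.

```python
import os


def load_env_text(text, environ=None):
    """Parse KEY=VALUE lines and copy them into `environ` (os.environ by default)."""
    environ = os.environ if environ is None else environ
    loaded = {}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        # Skip blank lines, comments, and lines without an assignment.
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        key = key.strip()
        # Drop surrounding whitespace and optional quotes around the value.
        value = value.strip().strip('"').strip("'")
        loaded[key] = value
        # Variables that are already set in the environment are not overwritten.
        environ.setdefault(key, value)
    return loaded


def load_env_file(path=".env"):
    """Read the file at `path` and load its variables into os.environ."""
    with open(path, encoding="utf-8") as handle:
        return load_env_text(handle.read())
```

After calling `load_env_file()`, a value such as the OpenAI key would be available as `os.environ["OPENAI_API_KEY"]`.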

You will also need to set up AWS access. Please follow the instructions here: https://boto3.amazonaws.com/v1/documentation/api/latest/guide/quickstart.html


## Connect to the LLMs

### Connect to OpenAI API for GPT 4o to Answer RD/CDCES Exam Questions

In [None]:
import openai_api
import questions_mysql

api = openai_api.OpenAIAPI()
qsql = questions_mysql.QuestionsMysql()
# Connect to RD Exam Questions
question_dict = qsql.get_RD_questions()
# Connect to CDCES Exam Questions
# question_dict = qsql.get_questions()

response4 = "\n"

# Prompt the questions in batches of 1; you can adjust the number of questions in each batch
for startIndex in range(1, len(question_dict) + 1, 1):
    prompt_str = qsql.get_prompt_string(question_dict, startIndex, 1)
    print(prompt_str)
    # Use the GPT 4o model and set temperature to 0.
    # To use the GPT 4 or GPT 3.5 model, change "gpt-4o" to "gpt-4" or "gpt-3.5"
    response4 += api.ask_chatgpt(prompt_str, "gpt-4o", 0) 
    response4 += "\n"

score4 = qsql.get_score(response4, question_dict)
print("\n")
print('GPT 4o: ' + score4)
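
The loop above sends one question per request. To change the batch size, the step of `range` and the last argument of `get_prompt_string` presumably need to change together; that reading is an assumption based on the comment in the cell, since the implementation of `get_prompt_string` is not shown here. A minimal sketch of how the start indices fall out, using a hypothetical helper `batch_start_indices`:

```python
def batch_start_indices(total_questions, batch_size):
    """Return the 1-based index of the first question in each batch."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    return list(range(1, total_questions + 1, batch_size))


# Example: 10 questions in batches of 4 start at questions 1, 5 and 9.
total, size = 10, 4
for start in batch_start_indices(total, size):
    end = min(start + size - 1, total)  # the last batch may be shorter
    print(f"questions {start}-{end}")
```

Note that the final batch can contain fewer questions than `batch_size`; whether `get_prompt_string` handles that case is not something this notebook confirms.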

### Connect to Amazon Bedrock API for Claude 3 - Opus to Answer RD/CDCES Exam Questions

In [None]:
import anthropic_bedrock_api
import questions_mysql

api = anthropic_bedrock_api.AnthropicBedRockAPI()
qsql = questions_mysql.QuestionsMysql()
# Connect to RD Exam Questions
question_dict = qsql.get_RD_questions()
# Connect to CDCES Exam Questions
# question_dict = qsql.get_questions()

response_claude = ""

# Prompt the questions in batches of 1; you can adjust the number of questions in each batch
for startIndex in range(1, len(question_dict) + 1, 1):
    prompt_str = qsql.get_prompt_string(question_dict, startIndex, 1)
    print(prompt_str)
    # Use the Claude 3 - Opus model and set temperature to 0. 
    # To use Claude 3 Haiku/Sonnet, use "anthropic.claude-3-haiku-20240307-v1:0" or "anthropic.claude-3-sonnet-20240229-v1:0"
    response_claude += api.ask_claude(prompt_str, "anthropic.claude-3-opus-20240229-v1:0", 0)
    response_claude += "\n"

score_claude = qsql.get_score(response_claude, question_dict)
print("\n")
print('Claude 3 Opus: ' + score_claude)

### Connect to Gemini API for Gemini 1.5 Pro to Answer RD/CDCES Exam Questions 

In [None]:
import gemini_api
import questions_mysql

api = gemini_api.GeminiAIAPI()
qsql = questions_mysql.QuestionsMysql()
# Connect to RD Exam Questions
question_dict = qsql.get_RD_questions()
# Connect to CDCES Exam Questions
# question_dict = qsql.get_questions()

prompt_str = ""

response = "\n"
# Prompt the questions in batches of 1; you can adjust the number of questions in each batch
for startIndex in range(1, len(question_dict) + 1, 1):
    prompt_str = qsql.get_prompt_string(question_dict, startIndex,  1)
    print(prompt_str)
    # Use the Gemini 1.5 Pro model and set temperature to 0. 
    response += api.ask_gemini(prompt_str, 'gemini-1.5-pro', 0)


score = qsql.get_score(response, question_dict)
print("\n")
print('Gemini 1.5 Pro: ' + score)

### Connect to Amazon Bedrock API for Llama 3 8B to Answer RD/CDCES Exam Questions

In [None]:
import llama_bedrock_api
import questions_mysql

api = llama_bedrock_api.LlamaBedRockAPI()
qsql = questions_mysql.QuestionsMysql()
# Connect to RD Exam Questions
question_dict = qsql.get_RD_questions()
# Connect to CDCES Exam Questions
# question_dict = qsql.get_questions()

response_llama = ""

# Prompt the questions in batches of 1; you can adjust the number of questions in each batch
for startIndex in range(1, len(question_dict) + 1, 1):
    prompt_str = qsql.get_prompt_string_llama(question_dict, startIndex, 1)
    # Use the llama 3 8B model and set temperature to 0.
    response_llama += api.ask_llama(prompt_str, "meta.llama3-8b-instruct-v1:0", 0)
    response_llama += "\n"

score_llama = qsql.get_score(response_llama, question_dict)
print("\n")
print('Llama 3 8B: ' + score_llama)
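
The four cells above repeat the same loop and differ only in the client method, the prompt builder, and the model ID. As a sketch of how that shared logic could be factored out, the hypothetical helper `collect_responses` below takes the ask function and the prompt builder as arguments. It is not part of this project, and it assumes both callables accept the same positional arguments used in the cells above.

```python
def collect_responses(ask, build_prompt, question_dict, model_id,
                      batch_size=1, temperature=0):
    """Send every batch of questions to one model and join the answers with newlines."""
    parts = []
    for start_index in range(1, len(question_dict) + 1, batch_size):
        # Build the prompt for this batch, then ask the model for its answers.
        prompt_str = build_prompt(question_dict, start_index, batch_size)
        parts.append(ask(prompt_str, model_id, temperature))
    return "\n".join(parts)
```

For Claude 3 - Opus, for instance, the call might look like `collect_responses(api.ask_claude, qsql.get_prompt_string, question_dict, "anthropic.claude-3-opus-20240229-v1:0")`, with the result passed to `qsql.get_score` as before. Whether `get_score` accepts the joined string without a leading or trailing newline has not been verified here.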

## Results & Analysis
Each model's results on the RD/CDCES exam questions can be found here: https://www.notion.so/Response-from-AIs-for-Multiple-Choice-Questions-f0b87ff56e0e4ae99101db305f8f3596

A detailed question-level investigation and analysis of the RD exam questions can be found here: https://docs.google.com/spreadsheets/d/1ApqiEV2WM4n7Ytz5rNTgoUvWb5PT2MQlzMovsK3lZ38/edit#gid=1209999158

The analysis summary slides can be found here: https://docs.google.com/presentation/d/1gtmfSIIMp7jbRxFJ3DS_iQVBMiVdeINyT6PcpF7Dj98/edit#slide=id.g273d235a019_0_139