
galen-evals

A coworker for life sciences!

Most work we do in professional settings is task-based. Sure, we hire people based on how well they did on abstract tests like the GMAT, but what you actually want them to do is tasks. And tasks are what we need to test LLMs on.

We learnt this the hard way, starting with dreams of training a cool model before figuring out what's actually needed! That's the purpose of this repo: to test LLMs against a set list of tasks and evaluate how well they do.

What you need to run this

  1. An OpenAI API key
  2. A Replicate API key
  3. Both keys added to your .env file (a sketch of how the scripts can pick them up follows this list)
  4. A list of questions that are relevant to your job
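A minimal sketch of loading the keys from .env with python-dotenv, assuming the conventional OPENAI_API_KEY and REPLICATE_API_TOKEN variable names (check the repo's scripts for the exact names they read):

```python
# Load API keys from .env before running any of the scripts.
# The variable names below are assumptions; adjust to what the code expects.
import os
from dotenv import load_dotenv  # pip install python-dotenv

load_dotenv()  # reads key=value pairs from .env into the environment

for key in ("OPENAI_API_KEY", "REPLICATE_API_TOKEN"):
    if not os.getenv(key):
        raise RuntimeError(f"Missing {key} in .env")
```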

Steps

  1. Update configfile.json if needed. Also check knowledgebase.json and perturbations.json inside utils to make sure they ring true for your domain!
  2. Run the perturbations file to get responses from various OSS LLMs; choose the models you want in the code
  3. Run response_gpt4 the same way to get GPT-4 responses
  4. Do the same for the DB and RAG files, after putting the relevant material in the folders
  5. Combine the result files, then change the format to select the combined file (a sketch of one way to combine them follows this list)
  6. Run eval_by_gpt4 to have GPT-4 evaluate the answers
  7. Run a human evaluation if you can
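As an illustration of step 5, here is a minimal sketch of combining per-model result files with pandas. The file paths and column names are hypothetical; substitute whatever your runs actually produce.

```python
# Merge the per-model result files into one combined file for evaluation.
# Paths and column names here are illustrative assumptions.
import glob
import pandas as pd

frames = [pd.read_excel(path) for path in glob.glob("results/*_results.xlsx")]
combined = pd.concat(frames, ignore_index=True)

# Group answers by question so every model's response sits side by side.
combined = combined.sort_values(["question_id", "model"])
combined.to_excel("results_combined.xlsx", index=False)
```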

Charts!

Latency vs Ranking across models: Yi-34b seems remarkably good, with slightly lower latency but higher rankings. There's likely a cold-start data problem with Replicate, though.

Mean Rankings by Model: Yi's performance is seriously impressive!

Mean latency by model and type: Mixtral is really slow with DB, and GPT keeps winning on speed. Yi seems about the same throughout.

Latency: GPT is the one that has solved the cold-start problem best.

Files

The crucial ones are:

  1. questions.xlsx, which has the list of questions we want to ask. This is the starting point, and what you should generate from your own job!
  2. "results grouped by question", which has the final list of answers from all the LLMs
  3. model_rankings.xlsx, which has the rankings for the final list of answers that were evaluated (a sketch of how to inspect it follows this list)
  4. All other files are intermediate artifacts, kept for auditing and error checking
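As a quick sanity check on the outputs, a minimal sketch of computing the mean rank per model from model_rankings.xlsx. The column names ("model", "rank") are assumptions; adjust them to match the actual sheet.

```python
# Summarise model_rankings.xlsx: mean rank per model, best first.
# Column names are illustrative assumptions.
import pandas as pd

rankings = pd.read_excel("model_rankings.xlsx")
mean_ranks = rankings.groupby("model")["rank"].mean().sort_values()
print(mean_ranks)
```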

To do

There's plenty to do, in no particular order:

  1. Create a test for tasks that combine abilities: search, doc reading, coding, etc.
  2. Create tests against other hosting services (and more LLMs)
  3. Speed up GPT execution by parallelising the API calls (a sketch of one approach follows this list)
  4. Create the "knowledgebase" and "perturbations" automatically from given information/documents
  5. Enable LLMs to write reports on a given topic, then run PageRank over them based on RAG over a question set on that topic
  6. Create a "Best Answer" for each question in case we want to measure the answers against it (this can also be used to DPO the models later as needed)
  7. Create a way to perturb the questions to see how well the LLMs react to new information coming in [Done]
  8. Create a way to provide a "knowledgebase" to see how good the LLMs are at asking for help from the right quarters [Done]
  9. Add answer clusters and plotting w.r.t. categories to the automated model ranking file
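For item 3, one possible approach is to fire the GPT-4 calls concurrently with asyncio. This assumes the openai>=1.0 Python SDK; the repo's scripts may use an older client, so treat it as a sketch rather than a drop-in change.

```python
# Concurrent GPT-4 calls with a semaphore to stay under rate limits.
import asyncio
from openai import AsyncOpenAI

client = AsyncOpenAI()  # picks up OPENAI_API_KEY from the environment

async def ask(question: str) -> str:
    resp = await client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": question}],
    )
    return resp.choices[0].message.content

async def ask_all(questions: list[str]) -> list[str]:
    sem = asyncio.Semaphore(5)  # cap concurrency

    async def guarded(q: str) -> str:
        async with sem:
            return await ask(q)

    return await asyncio.gather(*(guarded(q) for q in questions))

# Example: answers = asyncio.run(ask_all(["Question 1", "Question 2"]))
```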
