This repo evaluates multiple code LLMs on HumanEval(+), computes per-model embeddings, and then clusters & selects an ensemble.
- Python 3.11 (required)
- (HPC) SLURM + a GPU partition
- (Optional) Hugging Face token for gated models
```bash
module load anaconda   # if your cluster uses modules
conda create -n eom python=3.11 -y
conda activate eom
pip install -r requirements.txt
```
Alternatively, with a plain virtual environment:

```bash
python3.11 -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
```
The SLURM script below assumes a conda env named `eom`. If you use a different env, change the `source activate` line.
```bash
# Option A: export directly
export HUGGINGFACE_HUB_TOKEN=hf_XXXXXXXXXXXXXXXXXXXXXXXX

# Option B: store once and let the job read it
echo hf_XXXXXXXXXXXXXXXXXXXXXXXX > ~/.hf_token
```
`split_data.py` writes JSON files (arrays of task IDs like `"HumanEval/0"`), used as subsets for validation/test.
```bash
python split_data.py \
  --dataset humaneval \
  --out_prefix splits/humaneval_split \
  --val_size 100 \
  --seed 42
```
This produces:
```
splits/humaneval_split_val_ids.json
splits/humaneval_split_test_ids.json
```
Format example (`*_ids.json`):

```json
[
  "HumanEval/0",
  "HumanEval/3",
  "HumanEval/17"
]
```
⚠️ Must be JSON arrays, not JSONL. Each entry should be a full task ID (e.g., `"HumanEval/0"`).
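You can also write a split file by hand; any JSON array of full task IDs works. A minimal sketch, assuming the standard 164-task HumanEval set:

```python
import json
import random
from pathlib import Path

# Illustrative only: sample 10 validation task IDs from HumanEval's 164 tasks.
random.seed(42)
val = sorted(random.sample(range(164), 10))
Path("splits").mkdir(exist_ok=True)
with open("splits/humaneval_split_val_ids.json", "w") as f:
    json.dump([f"HumanEval/{i}" for i in val], f, indent=2)
```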
Use `my_job_array.sh` to (a) generate & evaluate on the validation subset and (b) compute the model embedding.
```bash
#!/bin/bash
#SBATCH --partition main
#SBATCH --time 6-02:00:00
#SBATCH --job-name my_job
#SBATCH --output job-%J.out
#SBATCH --gpus=rtx_6000:1
#SBATCH --mail-user=you@example.com
#SBATCH --mail-type=ALL
#SBATCH --array=0-0 # <-- set to 0-(N-1) for N models
##SBATCH --mem=48G
##SBATCH --cpus-per-task=9
echo "$(date)"
echo -e "\nSLURM_JOBID:\t\t$SLURM_JOBID"
echo -e "SLURM_JOB_NODELIST:\t$SLURM_JOB_NODELIST\n"
# Model_Name Backend Param_B Optional_Flags
models_and_settings=(
deepseek-ai/deepseek-coder-6.7b-instruct hf 6.7 ""
# Add more lines in groups of four:
# Salesforce/codegen-6B-multi hf 6.0 ""
# meta-llama/CodeLlama-7b-Instruct-hf hf 6.74 ""
)
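# Each model occupies four consecutive slots in the array above
# (name, backend, parameter count in B, optional flags), so each
# array task reads its four fields at a stride of 4.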
i=$((SLURM_ARRAY_TASK_ID * 4))
model=${models_and_settings[$i]}
backend=${models_and_settings[$((i + 1))]}
param_num=${models_and_settings[$((i + 2))]}
optional_flags=${models_and_settings[$((i + 3))]}
temperature=0.0
dataset=humaneval
val_size=100
random_seed=42
validation_ids_path="splits/${dataset}_split_val_ids.json"
test_ids_path="splits/${dataset}_split_test_ids.json"
embedding_dir="embeddings"
validation_root="evalplus_validation_results"
test_root="evalplus_test_results"
echo "Running evaluation for model: $model (backend=$backend, params=${param_num}B, dataset=$dataset, T=$temperature) flags: $optional_flags"
module load anaconda
source activate eom
# HF token (optional)
export HUGGINGFACE_HUB_TOKEN=${HUGGINGFACE_HUB_TOKEN:-$(cat ~/.hf_token 2>/dev/null)}
python - <<'PY'
import os

try:
    from huggingface_hub import login
    tok = os.getenv("HUGGINGFACE_HUB_TOKEN")
    if tok:
        login(token=tok)
    else:
        print("Warning: Hugging Face token not found. Gated models may not be accessible.")
except Exception as e:
    print("HF login skipped or failed:", e)
PY
# 1) Create/overwrite splits (id files are JSON arrays)
python split_data.py \
--dataset $dataset \
--out_prefix splits/${dataset}_split \
--val_size $val_size \
--seed $random_seed
# 2) Generate, evaluate, and compute the model embedding (validation subset)
python embed_model.py \
--model $model \
--dataset $dataset \
--backend $backend \
--temperature $temperature \
--embedding_dir $embedding_dir \
--validation_ids_path $validation_ids_path \
--root $validation_root \
  $optional_flags
```
### Submit the array
For **N** models, set `--array=0-(N-1)`:
```bash
# Example: 3 models (indices 0,1,2)
sbatch --array=0-2 my_job_array.sh
```

Check status & logs:

```bash
squeue -u $USER
tail -f job-<JOBID>.out
```
- Generations / evaluation (validation):

  ```
  evalplus_validation_results/humaneval/
    <model>_<backend>_temp_<T>.jsonl              # sanitized generations
    <model>_<backend>_temp_<T>.raw.jsonl          # raw generations
    <model>_<backend>_temp_<T>.eval_results.json  # evaluation results
  ```

- Embeddings:

  ```
  embeddings/<model>_<backend>_temp_<T>/fingerprint.npy
  ```

- Ensemble selection output:

  ```
  ensembles/selected.json
  ```
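To spot-check an embedding after a run completes (substitute the placeholders in the path with a real model, backend, and temperature):

```python
import numpy as np

# Placeholder path: fill in the actual <model>, <backend>, <T> values.
fp = np.load("embeddings/<model>_<backend>_temp_<T>/fingerprint.npy")
print(fp.shape, fp.dtype)
```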
```bash
python ensemble_select.py \
  --embeddings_dir embeddings \
  --strategy cluster_representatives \
  --distance cosine \
  --k 5 \
  --output ensembles/selected.json
```
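For intuition, here is a minimal sketch of what a cluster-representatives strategy over cosine distance can look like. This is illustrative only, not the actual `ensemble_select.py` implementation; it assumes scikit-learn is installed and that all fingerprints share one length. The idea: L2-normalize fingerprints so Euclidean k-means approximates cosine clustering, then keep the model closest to each centroid.

```python
# Illustrative sketch only (not the actual ensemble_select.py logic).
from pathlib import Path

import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import normalize

def select_representatives(embeddings_dir: str, k: int = 5) -> list[str]:
    names, vecs = [], []
    for fp in sorted(Path(embeddings_dir).glob("*/fingerprint.npy")):
        names.append(fp.parent.name)
        vecs.append(np.load(fp).ravel())
    # L2-normalize so Euclidean k-means approximates cosine clustering.
    X = normalize(np.stack(vecs))
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    selected = []
    for c in range(k):
        members = np.where(km.labels_ == c)[0]
        dists = np.linalg.norm(X[members] - km.cluster_centers_[c], axis=1)
        selected.append(names[members[dists.argmin()]])
    return selected

print(select_representatives("embeddings", k=5))
```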
To run ensemble selection automatically once the whole array succeeds, chain a dependent job (`afterok` starts it only if every array task exits successfully):

```bash
JID=$(sbatch --array=0-2 my_job_array.sh | awk '{print $4}')

cat > run_ensemble.sh <<'BASH'
#!/bin/bash
#SBATCH --partition main
#SBATCH --time 01:00:00
#SBATCH --job-name ensemble
#SBATCH --output ensemble-%J.out

module load anaconda
source activate eom

python ensemble_select.py --embeddings_dir embeddings --strategy cluster_representatives --distance cosine --k 5 --output ensembles/selected.json
BASH

sbatch --dependency=afterok:$JID run_ensemble.sh
```
**“All samples are already cached. Skipping codegen.”**

You already have `...temp_<T>.jsonl` under `--root`. To force new generations, either:

- delete the old JSONL, or
- change `--root`, or
- change `--temperature` (it affects the filename/identifier).
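For example, to clear the cached generations (a sketch; adjust the glob to the model and temperature you want to regenerate):

```python
from pathlib import Path

# Illustrative: delete cached generation files so codegen reruns.
root = Path("evalplus_validation_results/humaneval")
for p in root.glob("*temp_0.0*.jsonl"):
    print("removing", p)
    p.unlink()
```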
**Splits not applied**

Ensure your `*_ids.json` is a JSON array of full task IDs like `"HumanEval/37"`. Numeric IDs (`[0, 1, 2]`) won’t match unless you normalize them to full keys.
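If you have a numeric-ID file, a small script like this (a hypothetical helper, not part of the repo) can rewrite it in place:

```python
import json

def normalize_ids(path: str, prefix: str = "HumanEval") -> None:
    """Rewrite a JSON array of numeric IDs into full task keys in place."""
    with open(path) as f:
        ids = json.load(f)
    # Turn entries like 37 (or "37") into "HumanEval/37"; leave full keys alone.
    fixed = [i if isinstance(i, str) and "/" in i else f"{prefix}/{int(i)}" for i in ids]
    with open(path, "w") as f:
        json.dump(fixed, f, indent=2)

normalize_ids("splits/humaneval_split_val_ids.json")
```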
**HF gated models fail to load**

Export `HUGGINGFACE_HUB_TOKEN` or create `~/.hf_token`.
```bash
# 1) env
conda create -n eom python=3.11 -y
conda activate eom
pip install -r requirements.txt

# 2) splits
python split_data.py --dataset humaneval --out_prefix splits/humaneval_split --val_size 10 --seed 42

# 3) run models (edit the script first)
sbatch --array=0-0 my_job_array.sh

# 4) ensemble selection (if not run inside the array)
python ensemble_select.py --embeddings_dir embeddings --strategy cluster_representatives --distance cosine --k 5 --output ensembles/selected.json
```
```
.
├── README.md
├── requirements.txt
└── evalplus/
    ├── split_data.py
    ├── embed_model.py
    ├── ensemble_select.py
    ├── my_job_array.sh
    ├── embeddings/
    ├── ensembles/
    └── evalplus_validation_results/
```
This repository includes a vendored snapshot of EvalPlus so that evaluation can run on validation/test splits.
**Upstream provenance**

- Source: https://github.com/evalplus/evalplus
- Upstream commit: `30f9f8eb2cdddd42906d1fb722756290fe7a1e17` (copy date: 2025-09-21)
- Location in this repo: `./evalplus/`
- License: MIT, preserved verbatim at `./evalplus/LICENSE`
**Local modifications**

- Added dataset split support (validation/test):
  - modified: `evalplus/evaluate.py` (read subset ID files; run selected tasks only)
  - modified: `evalplus/codegen.py` (plumb subset IDs through the codegen loop)
  - modified: `evalplus/provider/hf.py` (fixed the `dtype` key in `kwargs` and improved input handling in `HuggingFaceDecoder`)
- When no subset is provided, behavior matches upstream defaults.