From c2ae89819f426c3626fc3cfef6bf33589223e3b1 Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?Cl=C3=A9mentine?= Date: Tue, 6 Feb 2024 12:32:53 +0100 Subject: [PATCH 1/4] update thanks --- README.md | 147 ++++++++++++------------------------------------------ 1 file changed, 31 insertions(+), 116 deletions(-) diff --git a/README.md b/README.md index 05f295639..3b7dde4e9 100644 --- a/README.md +++ b/README.md @@ -1,16 +1,23 @@ # LightEval 🌤️ +LightEval is a lightweight LLM evaluation suite that Hugging Face has been using internally with the recently released LLM data processing library [datatrove](https://github.com/huggingface/datatrove), LLM training library [nanotron](https://github.com/huggingface/nanotron) and logging/experimentation code base [brrr](https://github.com/huggingface/brrr). -## Context -LightEval is an evaluation suite which gathers a selection of features from widely used benchmarks recently proposed: -- from the [Eleuther AI Harness](https://github.com/EleutherAI/lm-evaluation-harness), we use the nice request management -- from [HELM](https://crfm.stanford.edu/helm/latest/), we keep the qualitative and rich metrics -- from our previous internal evaluation suite, we keep the easy edition, evaluation loading and speed. - -It is still an early, internal version - it should be nice to use but don't expect 100% stability! +We're releasing it with the community in the spirit of building in the open. +Note that it is still very much early so don't expect 100% stability ^^' In case of problems or question, feel free to open an issue! +## Deep thanks +`lighteval` was originally built on top of the great [Eleuther AI Harness](https://github.com/EleutherAI/lm-evaluation-harness) (which is powering the [Open LLM Leaderboard](https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard)). We also took a lot of inspiration from the amazing [HELM](https://crfm.stanford.edu/helm/latest/), notably for metrics. + +Through adding more and more logging functionalities, and making it compatible with increasingly different workflows and model codebases (including 3D parallelism) as well as allowing custom evaluation experiments, metrics and benchmarks, we ended up needing to change the code more and more deeply until `lighteval` became the small standalone library that it is now. + +However, we are very grateful to the Harness and HELM teams for their continued work on better evaluations. + ## How to install and use +Note: +- Use the Eleuther AI Harness (`lm_eval`) to share comparable numbers with everyone (e.g. on the Open LLM Leaderboard). +- Use `lighteval` during training with the nanotron/datatrove LLM training stack and/or for quick eval/benchmark experimentations. + ### Requirements 0) Create your virtual environment using virtualenv or conda depending on your preferences. We require Python3.10 @@ -22,18 +29,14 @@ Optional: 2) Add your user token to the environment variable `HUGGING_FACE_HUB_TOKEN` if you want to push your results to the hub - ### Usage - Launching on CPU - - `python main.py --model_args="pretrained=" --device=cpu --output_dir output_dir` -- Launching on GPU - - On one GPU - - `python main.py --model_args="pretrained=" --device=gpu:0 --output_dir output_dir` - - Using data parallelism on several GPUs - - If you want to use data parallelism, first configure accelerate (`accelerate config`). 
- - `accelerate launch main.py --model_args="pretrained=" ` - for instance: `accelerate launch --multi_gpu --num_processes 8 main.py --model_args="pretrained=EleutherAI/gpt-j-6b,dtype=float16,model_parallel=True" --tasks "helm|hellaswag,lighteval|hellaswag" --override_batch_size 8 --num_fewshot 10 --output_dir output_dir` - - Note: if you use model_parallel, accelerate will use 2 processes for model parallel, num_processes for data parallel + - `python src/main.py --model_args="pretrained=" --output_dir output_dir` +- Using data parallelism on several GPUs (recommended) + - If you want to use data parallelism, first configure accelerate (`accelerate config`). + - `accelerate launch src/main.py --model_args="pretrained=" --output_dir=` + for instance: `python -m accelerate launch --multi_gpu --num_processes=8 src/main.py --model_args "pretrained=gpt2" --tasks tasks_examples/open_llm_leaderboard_tasks.txt --override_batch_size 1 --save_details --output_dir=tmp/` + - Note: if you use model_parallel, accelerate will use 2 processes for model parallel, num_processes for data parallel The task parameters indicate which tasks you want to launch. You can select: - one or several tasks, with `--tasks task_names`, with task_names in the [metadata table](metadata_table.json), separated by commas. You must specify which version of the task you want (= in which suite it is), by prepending the suite name, as well as the number of training few_shots prompts for the given task, and whether you want to automatically reduce the number of few_shots if they make the prompt too long (`suite|task|few_shot|1 or 0 to automatically reduce the number of few_shots or not`). @@ -41,17 +44,14 @@ The task parameters indicate which tasks you want to launch. You can select: Example If you want to compare hellaswag from helm and the harness on Gpt-6j, you can do -`python run_eval.py --model hf_causal --model_args="pretrained=EleutherAI/gpt-j-6b" --tasks helm|hellaswag,lighteval|hellaswag` - -Other cool parameters: -- `--save_queries` will print the prompts, generations and golds. -- `--max_samples num_samples` allows you to only run an eval on a subset of samples for debugging -- `--batch_size size` selects the batch size to use for your xp otherwise we use `accelerate` `find_executable_batch_size` auto detection of max batch size -- `--num_fewshots` selects the number of few-shot prompts you want to use to launch your experiment - it applies to all selected evals. -- `--num_fewshot_seeds` allows you to launch the same few-shot experiment, with several samplings for the few shot prompts (like is done in HELM). Careful, each added num_fewshot_trial increases the time the suite takes to run. +`python src/main.py --model hf_causal --model_args="pretrained=EleutherAI/gpt-j-6b" --tasks helm|hellaswag|0|0,lighteval|hellaswag|0|0 --output_dir output_dir` +## Customisation +### Adding a new metric +If you want to add a new metric, first check if you can use one of the parametrized functions in `src.lighteval.metrics.metrics_corpus` or `src.lighteval.metrics.metrics_sample`. If not, add it to either of these files depending on the level at which it is applied. +Then, follow the example in `src.lighteval.metrics.metrics` to register your metric. -## Adding a new task +### Adding a new task To add a new task, first **add its dataset** on the hub. Then, **find a suitable prompt function** or **create a new prompt function** in `src/prompt_formatting.py`. 
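A minimal, hypothetical sketch of such a prompt function for a multiple-choice dataset (the `Doc` fields follow the description just below; the import path, the function signature and the dataset column names are assumptions to check against the codebase):

```python
# Hypothetical prompt function for a multiple-choice dataset.
# The Doc field names follow the description in this README; the import path,
# the function signature and the dataset column names are assumptions; check
# the existing functions in src/prompt_formatting.py before reusing this.
from lighteval.tasks.requests import Doc  # assumed location of the Doc class


def my_multichoice_prompt(line: dict) -> Doc:
    """Turn one dataset row into a Doc with a query, choices and a gold index."""
    return Doc(
        query=f"Question: {line['question']}\nAnswer:",
        choices=[f" {choice}" for choice in line["choices"]],
        gold_index=int(line["label"]),  # index of the correct choice
        instruction="",  # only set this if the instruction must not be repeated in few-shot
    )
```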
This function must output a `Doc` object, which should contain `query`, your prompt, and either `gold`, the gold output, or `choices` and `gold_index`, the list of choices and index or indices of correct answers. If your query contains an instruction which should not be repeated in a few shot setup, add it to an `instruction` field.
@@ -61,11 +61,11 @@ Lastly, create a **line summary** of your evaluation, in `metadata_table.json`.
- `suite` (list), the suite(s) to which your evaluation should belong. This field allows us to compare different task implementations, and is used during task selection to differentiate the versions to launch. At the moment, you'll find the keywords ["helm", "bigbench", "original", "lighteval"]; you can also add new ones (for test, we recommend using "custom").
- `prompt_function` (str), the name of the prompt function you defined in the step above
- `hf_repo` (str), the path to your evaluation dataset on the hub
-- `hf_subset` (str), the specific subset you want to use for your evaluation (note: when the dataset has no subset, fill this field with `"default"`, not with `None` or `""`)
+- `hf_subset` (str), the specific subset you want to use for your evaluation (note: when the dataset has no subset, fill this field with `"default"`, not with `None` or `""`)
- `hf_avail_splits` (list), all the splits available for your dataset (train, valid or validation, test, other...)
- `evaluation_splits` (list), the splits you want to use for evaluation
- `few_shots_split` (str, can be `null`), the specific split from which you want to select samples for your few-shot examples. It should be different from the sets included in `evaluation_splits`
-- `few_shots_select` (str, can be `null`), the method that you will use to select items for your few-shot examples. Can be `null`, or one of:
+- `few_shots_select` (str, can be `null`), the method that you will use to select items for your few-shot examples. Can be `null`, or one of:
 - `balanced` selects examples from the `few_shots_split` with balanced labels, to avoid skewing the few shot examples (hence the model generations) towards one specific label
 - `random` selects examples at random from the `few_shots_split`
 - `random_sampling` selects new examples at random from the `few_shots_split` for every new item, but if a sampled item is equal to the current one, it is removed from the available samples
@@ -113,7 +113,7 @@ These metrics need the model to generate an output. They are therefore slower.
 - `exact_match_indicator`: Exact match with some preceding context (before an indicator) removed
 - `f1_score_quasi` (HELM): Average F1 score in terms of word overlap between the model output and gold, with both being normalized first
 - `f1_score`: Average F1 score in terms of word overlap between the model output and gold without normalisation
- - `f1_score_macro`: Corpus level macro F1 score
+ - `f1_score_macro`: Corpus level macro F1 score
 - `f1_score_micro`: Corpus level micro F1 score
- Summarization:
 - `rouge` (Harness): Average ROUGE score [(Lin, 2004)](https://aclanthology.org/W04-1013/)
@@ -151,9 +151,6 @@ To keep compatibility with the Harness for some specific tasks, we ported their
These metrics need both the generation and its logprob. They are not working at the moment, as this fn is not in the AI Harness.
- `prediction_perplexity` (HELM): Measure of the logprob of a given input.
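To make the distinction between sample-level and corpus-level metrics concrete, here is a small, standalone sketch of what a sample-level metric computes. This is only the idea, not the actual `lighteval` registration API; to register a real metric, follow the examples in `src.lighteval.metrics.metrics`.

```python
# Standalone sketch of a sample-level metric (a quasi exact match):
# it scores a single sample from the model predictions and the gold answers.
# Corpus-level metrics instead aggregate statistics over the whole corpus.
import string


def normalize(text: str) -> str:
    """Lowercase, drop punctuation and collapse whitespace."""
    text = text.lower().translate(str.maketrans("", "", string.punctuation))
    return " ".join(text.split())


def quasi_exact_match(predictions: list[str], golds: list[str]) -> float:
    """Return 1.0 if any normalized prediction equals any normalized gold, else 0.0."""
    normalized_predictions = {normalize(p) for p in predictions}
    normalized_golds = {normalize(g) for g in golds}
    return 1.0 if normalized_predictions & normalized_golds else 0.0


# Example: quasi_exact_match(["Paris!"], ["paris"]) == 1.0,
# while quasi_exact_match(["The answer is Paris."], ["paris"]) == 0.0.
```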
-## Adding a new metric -If you want to add a new metric, first check if you can use one of the parametrized functions in `src.lighteval.metrics.metrics_corpus` or `metrics_sample`. If not, add it to either of these files depending on the level at which it is applied. Then, follow the example in `src.lighteval.metrics.metrics` to register your metric. - ## Examples of scripts to launch lighteval on the cluster ### Evaluate a whole suite on one node, 8 GPUs 1) Create a config file for accelerate @@ -199,92 +196,10 @@ echo "START TIME: $(date)" # Activate your relevant virtualenv source /activate #or conda activate yourenv -cd /lighteval-harness - -accelerate_args="--config_file " +cd /lighteval export CUDA_LAUNCH_BLOCKING=1 -srun accelerate launch ${accelerate_args} run_eval.py --model "hf-causal" --model_args "pretrained=EleutherAI/gpt-j-6b" --suite "helm_general" --batch_size 8 -``` - -### Evaluate a whole suite on 3 node, 24 GPUs total -1) Create a shell script -```bash -#!/bin/bash -# HOSTNAMES MASTER_ADDR MASTER_PORT COUNT_NODE are coming from the main script - -set -x -e - -echo "START TIME: $(date)" - -export TMPDIR=/scratch - -echo myuser=`whoami` -echo COUNT_NODE=$COUNT_NODE -echo LD_LIBRARY_PATH = $LD_LIBRARY_PATH -echo PATH = $PATH -echo HOSTNAMES = $HOSTNAMES -echo hostname = `hostname` -echo MASTER_ADDR= $MASTER_ADDR -echo MASTER_PORT= $MASTER_PORT - -echo $SLURM_PROCID $SLURM_JOBID $SLURM_LOCALID $SLURM_NODEID - -H=`hostname` -RANK=`echo -e $HOSTNAMES | python3 -c "import sys;[sys.stdout.write(str(i)) for i,line in enumerate(next(sys.stdin).split(' ')) if line.strip() == '$H'.strip()]"` -echo RANK=$RANK - - -# Activate your relevant virtualenv -source /activate #or conda activate yourenv -# Check it worked -echo python3 version = `python3 --version` - -cd /lighteval-harness - -# These arguments manage the multi node env -accelerate_args="--num_processes $(( 8 * $COUNT_NODE )) --num_machines $COUNT_NODE --multi_gpu --machine_rank $RANK --main_process_ip $MASTER_ADDR --main_process_port $MASTER_PORT" -batch_size=$(( 8 * $COUNT_NODE )) - -srun accelerate launch ${accelerate_args} run_eval.py --model "hf-causal" --model_args "pretrained=EleutherAI/gpt-j-6b" --suite "helm_general" --batch_size ${batch_size} -``` - -2) Create the matching slurm file - -```bash -#!/bin/bash -#SBATCH --job-name=kirby-3-nodes -#SBATCH --nodes=3 -#SBATCH --exclusive -#SBATCH --ntasks-per-node=1 -#SBATCH --cpus-per-task=24 -#SBATCH --gres=gpu:8 -#SBATCH --mem-per-cpu=11G # This is essentially 1.1T / 96 -#SBATCH --partition=production-cluster -#SBATCH --mail-type=ALL -#SBATCH --output=slurm-%j-%x.out # output file name -#SBATCH --mail-user=clementine@huggingface.co - -set -x -e -export TMPDIR=/scratch - -echo "START TIME: $(date)" - -# Activate your relevant virtualenv -source /activate #or conda activate yourenv - -# sent to sub script -export HOSTNAMES=`scontrol show hostnames "$SLURM_JOB_NODELIST"` -export MASTER_ADDR=$(scontrol show hostnames "$SLURM_JOB_NODELIST" | head -n 1) -export MASTER_PORT=12802 -export COUNT_NODE=`scontrol show hostnames "$SLURM_JOB_NODELIST" | wc -l` - -echo go $COUNT_NODE -echo $HOSTNAMES - -mkdir -p logs/${SLURM_JOBID} - -srun --output=logs/%j/helm-%t.log bash launch_multinode.sh +srun accelerate launch --multi_gpu --num_processes=8 src/main.py --model_args "pretrained=your model name" --tasks tasks_examples/open_llm_leaderboard_tasks.txt --override_batch_size 1 --save_details --output_dir=your output dir ``` ## Releases From 927761bd3620fd64edc8faeb25a2e67cfb744748 Mon Sep 
17 00:00:00 2001
From: =?UTF-8?q?Cl=C3=A9mentine?=
Date: Tue, 6 Feb 2024 14:31:12 +0100
Subject: [PATCH 2/4] Update readme

---
 README.md | 66 ++++++++++++++++++++++++++++++++++++++++++++++++-------
 1 file changed, 58 insertions(+), 8 deletions(-)

diff --git a/README.md b/README.md
index 3b7dde4e9..61ab92de7 100644
--- a/README.md
+++ b/README.md
@@ -1,4 +1,7 @@
 # LightEval 🌤️
+A lightweight LLM evaluation
+
+## Context
 LightEval is a lightweight LLM evaluation suite that Hugging Face has been using internally with the recently released LLM data processing library [datatrove](https://github.com/huggingface/datatrove), LLM training library [nanotron](https://github.com/huggingface/nanotron) and logging/experimentation code base [brrr](https://github.com/huggingface/brrr).

 We're releasing it with the community in the spirit of building in the open.
@@ -6,6 +9,9 @@ We're releasing it with the community in the spirit of building in the open.
 Note that it is still very much early so don't expect 100% stability ^^'
 In case of problems or question, feel free to open an issue!

+## News
+- **Feb 07, 2024**: Release of `lighteval`
+
 ## Deep thanks
 `lighteval` was originally built on top of the great [Eleuther AI Harness](https://github.com/EleutherAI/lm-evaluation-harness) (which is powering the [Open LLM Leaderboard](https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard)). We also took a lot of inspiration from the amazing [HELM](https://crfm.stanford.edu/helm/latest/), notably for metrics.
@@ -13,21 +19,65 @@ Through adding more and more logging functionalities, and making it compatible w
 However, we are very grateful to the Harness and HELM teams for their continued work on better evaluations.

+## How to navigate this project
+`lighteval` is supposed to be used as a standalone evaluation library.
+- [src](https://github.com/huggingface/lighteval/tree/main/src) contains the lib itself
+  - [main.py](https://github.com/huggingface/lighteval/blob/main/src/main.py) is our launcher, that you should use to start evaluations
+  - [lighteval](https://github.com/huggingface/lighteval/tree/main/src/lighteval) contains the core of the library, divided in the following sections
+    - [logging](https://github.com/huggingface/lighteval/tree/main/src/lighteval/logging): Our loggers, to display experiment information and push it to the hub after a run
+    - [metrics](https://github.com/huggingface/lighteval/tree/main/src/lighteval/metrics): All the available metrics you can use. They are described in metrics, and divided between sample metrics (applied at the sample level, such as a prediction accuracy) and corpus metrics (applied over the whole corpus). You'll also find available normalisation functions.
+    - [models](https://github.com/huggingface/lighteval/tree/main/src/lighteval/models): Possible models to use. We cover transformers (base_model), with adapter or delta weights, as well as TGI models locally deployed (it's likely the code here is out of date though), and brrr/nanotron models.
+    - [tasks](https://github.com/huggingface/lighteval/tree/main/src/lighteval/tasks): Available tasks. The complete list is in `tasks_table.jsonl`, and you'll find all the prompts in `tasks_prompt_formatting.py`.
+- [tasks_examples](https://github.com/huggingface/lighteval/tree/main/tasks_examples) contains a list of available tasks you can launch. We advise using tasks in the `recommended_set`, as it's possible that some of the other tasks need double checking.
+- [tests](https://github.com/huggingface/lighteval/tree/main/tests) contains our test suite, that we run at each PR to prevent regressions in metrics/prompts/tasks, for a subset of important tasks. + ## How to install and use Note: - Use the Eleuther AI Harness (`lm_eval`) to share comparable numbers with everyone (e.g. on the Open LLM Leaderboard). - Use `lighteval` during training with the nanotron/datatrove LLM training stack and/or for quick eval/benchmark experimentations. -### Requirements -0) Create your virtual environment using virtualenv or conda depending on your preferences. We require Python3.10 +### Installation +Create your virtual environment using virtualenv or conda depending on your preferences. We require Python3.10 +```bash +conda create -n lighteval python==3.10 +``` + +Clone the package +```bash +git clone +cd lighteval-harness +``` + +Install the dependencies. For the default installation, you just need: +```bash +pip install -e . +cd src +``` + +If you want to run your models using accelerate, tgi or optimum, do quantization, or use adapter weights, you will need to specify the optional dependencies group fitting your use case (`accelerate`,`tgi`,`optimum`,`quantization`,`adapters`,`nanotron`) at install time +```bash +pip install -e .[optional1,optional2] +cd src +``` + +The setup we tested most is: +```bash +pip install -e .[accelerate,quantization,adapters] +cd src +``` -1) Clone the package using `git clone`, then `cd lighteval-harness`, `pip install -e .` Once the dependencies are installed, `cd src`. -Optional: -- if you want to run your models using accelerate, tgi or optimum, do quantization, or use adapter weights, you will need to specify the optional dependencies group fitting your use case (`accelerate`,`tgi`,`optimum`,`quantization`,`adapters`,`nanotron`) at install time using the following command `pip install -e .[optional1,optional2]`. +Optional steps. - to load and push big models/datasets, your machine likely needs Git LFS. You can install it with `sudo apt-get install git-lfs` - If you want to run bigbench evaluations, install bigbench `pip install "bigbench@https://storage.googleapis.com/public_research_data/bigbench/bigbench-0.0.1.tar.gz"` -2) Add your user token to the environment variable `HUGGING_FACE_HUB_TOKEN` if you want to push your results to the hub +If you want to push your results to the hub, don't forget to add your user token to the environment variable `HUGGING_FACE_HUB_TOKEN`. + +### Testing that everything was installed correctly +If you want to test your install, you can run your first evaluation on GPUs (8GPU, single node), using +```bash +mkdir tmp +python -m accelerate launch --multi_gpu --num_processes=8 src/main.py --model_args "pretrained=gpt2" --tasks tasks_examples/open_llm_leaderboard_tasks.txt --override_batch_size 1 --save_details --output_dir="tmp/" +``` ### Usage - Launching on CPU @@ -54,9 +104,9 @@ Then, follow the example in `src.lighteval.metrics.metrics` to register your met ### Adding a new task To add a new task, first **add its dataset** on the hub. -Then, **find a suitable prompt function** or **create a new prompt function** in `src/prompt_formatting.py`. This function must output a `Doc` object, which should contain `query`, your prompt, and either `gold`, the gold output, or `choices` and `gold_index`, the list of choices and index or indices of correct answers. If your query contains an instruction which should not be repeated in a few shot setup, add it to an `instruction` field. 
+Then, **find a suitable prompt function** or **create a new prompt function** in `src.lighteval.tasks.tasks_prompt_formatting.py`. This function must output a `Doc` object, which should contain `query`, your prompt, and either `gold`, the gold output, or `choices` and `gold_index`, the list of choices and index or indices of correct answers. If your query contains an instruction which should not be repeated in a few shot setup, add it to an `instruction` field.

-Lastly, create a **line summary** of your evaluation, in `metadata_table.json`. This summary should contain the following fields:
+Lastly, create a **line summary** of your evaluation, in `src/lighteval/tasks/tasks_table.jsonl`. This summary should contain the following fields:
 - `name` (str), your evaluation name
 - `suite` (list), the suite(s) to which your evaluation should belong. This field allows us to compare different task implementations, and is used during task selection to differentiate the versions to launch. At the moment, you'll find the keywords ["helm", "bigbench", "original", "lighteval"]; you can also add new ones (for test, we recommend using "custom").
 - `prompt_function` (str), the name of the prompt function you defined in the step above

From 77984bd340fda5c41622a9490772b86a0c291fc8 Mon Sep 17 00:00:00 2001
From: "clementine@huggingface.co"
Date: Wed, 7 Feb 2024 14:13:43 +0000
Subject: [PATCH 3/4] updated from review

---
 README.md | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/README.md b/README.md
index 6f5611b13..60774c695 100644
--- a/README.md
+++ b/README.md
@@ -2,7 +2,7 @@ A lightweight LLM evaluation
 ## Context
-LightEval is a lightweight LLM evaluation suite that Hugging Face has been using internally with the recently released LLM data processing library [datatrove](https://github.com/huggingface/datatrove), LLM training library [nanotron](https://github.com/huggingface/nanotron) and logging/experimentation code base [brrr](https://github.com/huggingface/brrr).
+LightEval is a lightweight LLM evaluation suite that Hugging Face has been using internally with the recently released LLM data processing library [datatrove](https://github.com/huggingface/datatrove) and LLM training library [nanotron](https://github.com/huggingface/nanotron).

 We're releasing it with the community in the spirit of building in the open.
@@ -37,7 +37,7 @@ Note:
 - Use `lighteval` during training with the nanotron/datatrove LLM training stack and/or for quick eval/benchmark experimentations.

 ### Installation
-Create your virtual environment using virtualenv or conda depending on your preferences. We require Python3.10
+Create your virtual environment using virtualenv or conda depending on your preferences. We require Python3.10 or above.
 ```bash
 conda create -n lighteval python==3.10
 ```

From ddb17c6c8fc66c2121a1a0e4cfc5e1117404f565 Mon Sep 17 00:00:00 2001
From: "clementine@huggingface.co"
Date: Wed, 7 Feb 2024 18:26:31 +0000
Subject: [PATCH 4/4] update given new paths of PR

---
 README.md | 17 +++++++++--------
 1 file changed, 9 insertions(+), 8 deletions(-)

diff --git a/README.md b/README.md
index 60774c695..af21c04f0 100644
--- a/README.md
+++ b/README.md
@@ -21,9 +21,10 @@ However, we are very grateful to the Harness and HELM teams for their continued
 ## How to navigate this project
 `lighteval` is supposed to be used as a standalone evaluation library.
-- [src](https://github.com/huggingface/lighteval/tree/main/src) contains the lib itself
-  - [main.py](https://github.com/huggingface/lighteval/blob/main/src/main.py) is our launcher, that you should use to start evaluations
+- To run the evaluations, you can use `run_evals_accelerate.py` or `run_evals_nanotron.py`.
+- [src/lighteval](https://github.com/huggingface/lighteval/tree/main/src/lighteval) contains the core of the lib itself
   - [lighteval](https://github.com/huggingface/lighteval/tree/main/src/lighteval) contains the core of the library, divided in the following sections
+  - [main_accelerate.py](https://github.com/huggingface/lighteval/blob/main/src/main_accelerate.py) and [main_nanotron.py](https://github.com/huggingface/lighteval/blob/main/src/main_nanotron.py) are our entry points to run evaluation
   - [logging](https://github.com/huggingface/lighteval/tree/main/src/lighteval/logging): Our loggers, to display experiment information and push it to the hub after a run
   - [metrics](https://github.com/huggingface/lighteval/tree/main/src/lighteval/metrics): All the available metrics you can use. They are described in metrics, and divided between sample metrics (applied at the sample level, such as a prediction accuracy) and corpus metrics (applied over the whole corpus). You'll also find available normalisation functions.
   - [models](https://github.com/huggingface/lighteval/tree/main/src/lighteval/models): Possible models to use. We cover transformers (base_model), with adapter or delta weights, as well as TGI models locally deployed (it's likely the code here is out of date though), and brrr/nanotron models.
   - [tasks](https://github.com/huggingface/lighteval/tree/main/src/lighteval/tasks): Available tasks. The complete list is in `tasks_table.jsonl`, and you'll find all the prompts in `tasks_prompt_formatting.py`.
@@ -80,16 +81,16 @@ Optional steps.
 If you want to test your install, you can run your first evaluation on GPUs (8GPU, single node), using
 ```bash
 mkdir tmp
-python -m accelerate launch --multi_gpu --num_processes=8 src/main.py --model_args "pretrained=gpt2" --tasks tasks_examples/open_llm_leaderboard_tasks.txt --override_batch_size 1 --save_details --output_dir="tmp/"
+python -m accelerate launch --multi_gpu --num_processes=8 run_evals_accelerate.py --model_args "pretrained=gpt2" --tasks tasks_examples/open_llm_leaderboard_tasks.txt --override_batch_size 1 --save_details --output_dir="tmp/"
 ```

 ### Usage
 - Launching on CPU
-  - `python src/main.py --model_args="pretrained=" --output_dir output_dir`
+  - `python run_evals_accelerate.py --model_args="pretrained=" --output_dir output_dir`
 - Using data parallelism on several GPUs (recommended)
   - If you want to use data parallelism, first configure accelerate (`accelerate config`).
-  - `accelerate launch src/main.py --model_args="pretrained=" --output_dir=`
-    for instance: `python -m accelerate launch --multi_gpu --num_processes=8 src/main.py --model_args "pretrained=gpt2" --tasks tasks_examples/open_llm_leaderboard_tasks.txt --override_batch_size 1 --save_details --output_dir=tmp/`
+  - `accelerate launch run_evals_accelerate.py --model_args="pretrained=" --output_dir=`
+    for instance: `python -m accelerate launch --multi_gpu --num_processes=8 run_evals_accelerate.py --model_args "pretrained=gpt2" --tasks tasks_examples/open_llm_leaderboard_tasks.txt --override_batch_size 1 --save_details --output_dir=tmp/`
   - Note: if you use model_parallel, accelerate will use 2 processes for model parallel, num_processes for data parallel

 The task parameters indicate which tasks you want to launch.
You can select: Example If you want to compare hellaswag from helm and the harness on Gpt-6j, you can do -`python src/main.py --model hf_causal --model_args="pretrained=EleutherAI/gpt-j-6b" --tasks helm|hellaswag|0|0,lighteval|hellaswag|0|0 --output_dir output_dir` +`python run_evals_accelerate.py --model hf_causal --model_args="pretrained=EleutherAI/gpt-j-6b" --tasks helm|hellaswag|0|0,lighteval|hellaswag|0|0 --output_dir output_dir` ## Customisation ### Adding a new metric @@ -253,7 +254,7 @@ source /activate #or conda activate yourenv cd /lighteval export CUDA_LAUNCH_BLOCKING=1 -srun accelerate launch --multi_gpu --num_processes=8 src/main.py --model_args "pretrained=your model name" --tasks tasks_examples/open_llm_leaderboard_tasks.txt --override_batch_size 1 --save_details --output_dir=your output dir +srun accelerate launch --multi_gpu --num_processes=8 run_evals_accelerate.py --model_args "pretrained=your model name" --tasks tasks_examples/open_llm_leaderboard_tasks.txt --override_batch_size 1 --save_details --output_dir=your output dir ``` ## Releases