Evaluate GPT-4 on classical NLP tasks #246

LifeIsStrange · 2023-03-16T14:05:46Z

Addressing the elephant in the room

When transformers were first unleashed, their revolutionary accuracy results were mostly shown on the standard NLP tasks, such as POS tagging, dependency parsing, coreference resolution, WSD, etc.
But I've observed that, since PaLM and other very large language models, the published benchmark results are on much higher-level tasks, such as common sense reasoning tests, question answering, etc.
Both sets of benchmarks are useful and needed, but I would like to highlight that the standard NLP tasks are now heavily under-benchmarked by those newer language models, and that this impairs progress towards AGI or industrial uses.

While it could be argued that purely symbolic AI progress has stalled for decades, there is huge potential for neuro-symbolic hybrid systems that use neural networks for low-level analysis tasks (POS tagging, etc.) and feed that linguistic data to other higher-level neural networks or to symbolic systems, in order to push the boundaries of what is possible, especially regarding semantic analysis, AKA true NLU systems.
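To make the hybrid idea concrete, here is a minimal sketch of the data flow, under loud assumptions: `toy_tagger` is a hypothetical lookup table standing in for a neural POS tagger, and `noun_phrases` is a hand-written symbolic rule that consumes its output. Neither is taken from an existing system; the point is only the hand-off between the two layers.

```python
from typing import Callable, List, Tuple

Tagged = List[Tuple[str, str]]  # (token, POS tag) pairs


def toy_tagger(tokens: List[str]) -> Tagged:
    """Hypothetical stand-in for a neural POS tagger (a lookup table, not a model)."""
    lexicon = {
        "the": "DET", "a": "DET", "small": "ADJ",
        "dog": "NOUN", "ball": "NOUN", "chased": "VERB",
    }
    # Unknown words get the catch-all tag "X".
    return [(tok, lexicon.get(tok.lower(), "X")) for tok in tokens]


def noun_phrases(tagged: Tagged) -> List[str]:
    """Symbolic layer: keep maximal DET/ADJ/NOUN runs that end in a NOUN."""
    phrases: List[str] = []
    current: Tagged = []
    for token, tag in tagged + [("", "END")]:  # sentinel flushes the last run
        if tag in {"DET", "ADJ", "NOUN"}:
            current.append((token, tag))
            continue
        # The run just ended: keep it only if its last word is a noun.
        if current and current[-1][1] == "NOUN":
            phrases.append(" ".join(tok for tok, _ in current))
        current = []
    return phrases


def analyze(tokens: List[str],
            tagger: Callable[[List[str]], Tagged] = toy_tagger) -> List[str]:
    """Low-level (neural) analysis feeds the higher-level symbolic rule."""
    return noun_phrases(tagger(tokens))


print(analyze("The small dog chased a ball".split()))
```

In a real system the `tagger` argument would be swapped for a model call, and the better its tags, the more the symbolic layer can be trusted, which is why benchmarking the low-level tasks matters.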

foundational NLP tasks of interest:


Therefore this issue is a call for contributions to implement evals on those standard tasks, especially dependency parsing.
I believe GPT-4 has the potential to improve the SOTA in at least some foundational NLP tasks, and an even greater potential once someone fine-tunes it and combines it with domain-specific optimizations (as is currently done on BERT SOTAs, such as HPSG for dependency parsing).
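As a starting point for a dependency parsing eval, the scoring side could look roughly like the sketch below. It computes the two standard attachment scores, UAS (correct head) and LAS (correct head and label), for one sentence. The `Arc` representation and the function name are assumptions made for illustration, not part of any existing eval, and how the model's output is parsed into arcs is left out.

```python
from typing import List, Tuple

# One arc per token: (index of the head token, dependency label).
# Head index 0 denotes the artificial root. This layout is an assumption.
Arc = Tuple[int, str]


def attachment_scores(gold: List[Arc], pred: List[Arc]) -> Tuple[float, float]:
    """Return (UAS, LAS) for a single sentence."""
    if len(gold) != len(pred):
        raise ValueError("gold and predicted parses must cover the same tokens")
    if not gold:
        raise ValueError("cannot score an empty sentence")
    # UAS: the predicted head matches the gold head.
    head_hits = sum(g[0] == p[0] for g, p in zip(gold, pred))
    # LAS: both the head and the label match.
    full_hits = sum(g == p for g, p in zip(gold, pred))
    n = len(gold)
    return head_hits / n, full_hits / n


# "She reads books": the prediction gets every head right but mislabels one arc.
gold = [(2, "nsubj"), (0, "root"), (2, "obj")]
pred = [(2, "nsubj"), (0, "root"), (2, "nmod")]
uas, las = attachment_scores(gold, pred)
print(f"UAS={uas:.2f} LAS={las:.2f}")
```

For a corpus-level number, the hit counts would normally be summed over all tokens of all sentences before dividing, rather than averaging per-sentence scores.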

andrew-openai · 2023-04-22T15:12:09Z

Great idea!

sudarshansivakumar · 2023-12-14T17:28:53Z

I know this thread is from a while back but curious if anyone has managed to do this?

Mukhsin0508 · 2023-12-14T18:29:05Z

I tried to do this, still working on it! how about you?

commit af58ab97de097ab65cf8c5bbf4b58c1abf645f6c Author: Andrei Alexandru <inwaves@users.noreply.github.com> Date: Tue Mar 19 03:14:40 2024 +0000 Remove Ainu dataset from skill acquisition (#345) * Remove Ainu dataset * Remove Ainu dataset commit 5317164a5091db464772da76e18c5806523a7cba Author: Oliver Jaffe <oliver.jaffe@hotmail.co.uk> Date: Fri Mar 15 14:37:48 2024 +0000 Multithreading for GeminiSolver (#330) * GeminiSolver supports multithreading * add documentation for glm client setup * better doc * only share client between solver copies * implement backoff * add model version property commit ac9024921855cff4e698718d10de11b91cab19b4 Author: Giulio Starace <giulio.starace@gmail.com> Date: Fri Mar 15 15:24:32 2024 +0100 AnthropicSolver (#331) * add anthropic to pyproject toml * anthropic solver skeleton * name property * mvp; need to fix backoff/rate limiting; need to test cot * forgot to commit registry entry * implement backoff * implement alternating roles to support CoT * fill in the rest of the yaml with remaining anthropic models * log usage * make it a static method * anthropic solver pytests * update docs; implement model_version * dont use abbreviation Co-authored-by: Oliver Jaffe <oliver.jaffe@hotmail.co.uk> * let _solve accept kwargs * include haiku (came out today) * switch ordering of haiku and sonnet * reqs handled by pyproject.toml --------- Co-authored-by: Oliver Jaffe <oliver.jaffe@hotmail.co.uk> commit 209fa98a10604b3d7645ab64dec148cb167abc16 Author: Dane <danesherbs@users.noreply.github.com> Date: Sat Mar 16 00:25:59 2024 +1100 Refactor MLAB v2 (#340) commit 081d74af1701fef72c8c0ad39e847e05a168289f Author: Oliver Jaffe <oliver.jaffe@hotmail.co.uk> Date: Fri Mar 15 10:26:42 2024 +0000 correct eval names for error recovery (#341) commit 66684a555b0abe51d1be3c09cf33278bf67a7f76 Author: Dane <danesherbs@users.noreply.github.com> Date: Fri Mar 15 19:56:27 2024 +1100 Update MLAB LICENSE.md (#338) commit f2a900a6aa6c356305817fb0282073b97deaa291 
Author: Chan Jun Shern <JunShern@users.noreply.github.com> Date: Fri Mar 15 06:33:07 2024 +0000 Final cleanup for release (#339) * Remove redundant httpx logging settings * Minor bugged tools readme * minor cleanup and precommit icrl * Cleanup fdeduct todos commit 7f5f8c6e44b22d11f51ccc506ebee0bd161b8d3d Author: James Aung <129281094+james-aung@users.noreply.github.com> Date: Thu Mar 14 14:05:49 2024 -0700 Update contribution statements in READMEs (#337) commit cadff08108a7c81c15561d968fb8ebc3ab7832c5 Author: James Aung <129281094+james-aung@users.noreply.github.com> Date: Thu Mar 14 05:40:14 2024 -0700 LICENSE files for Error Recover and 20 Questions (#335) * 20Q and Error Recover dataset Licenses * Create LICENSE files for 20Q and ER datasets commit b7cffc0cf06b2cd3fd42a9491f0f6fbffa15c242 Author: Oliver Jaffe <oliver.jaffe@hotmail.co.uk> Date: Thu Mar 14 12:11:17 2024 +0000 skip test (#334) commit ffd004e65c7b7dda4f726916dc1fd453d7368347 Author: Dane <danesherbs@users.noreply.github.com> Date: Thu Mar 14 19:54:07 2024 +1100 Add licenses to MLAB v2 (#333) * Create LICENSE * Update LICENSE.md commit 5815222c56f8e0b186bbe8ba69f189e704db17e6 Author: Dane <danesherbs@users.noreply.github.com> Date: Thu Mar 14 19:42:26 2024 +1100 Update LICENSE.md (#332) commit 3a00e556ea1a1e18eb7063321e14d8534bf0e48d Author: Giulio Starace <giulio.starace@gmail.com> Date: Thu Mar 14 08:53:16 2024 +0100 Coherence - final PR (#325) * get rid of .dev alias * ast precommit checks+fixes * tts precommit checks+fixes commit c71db16e4fafa93d44ec6dec58a24cbd5ad82259 Author: Dane <danesherbs@users.noreply.github.com> Date: Thu Mar 14 02:06:33 2024 +1100 Update MLAB v2 scripts (#323) commit 7399673ad004ca32a2bf668df1d7a3cdf11df5fa Author: Giulio Starace <giulio.starace@gmail.com> Date: Wed Mar 13 14:27:21 2024 +0100 wordnet license (#329) commit d3bf8c96c5aca7a3733dcee985de7f1b77ba32d4 Author: Giulio Starace <giulio.starace@gmail.com> Date: Wed Mar 13 12:53:48 2024 +0100 AST - Use wordnet 
corpus rather than brown corpus (#328) * wordnet, not brown * fix path * wordnet words, not brown words * wordnet corpus, not brown commit 024ee14e7d06bf8f1d3f5a83f6e0ce84794e4992 Author: Ian McKenzie <140545726+ianmckenzie-oai@users.noreply.github.com> Date: Tue Mar 12 23:17:23 2024 -0700 Multi-Step Web Tasks (formerly Bucket of Tasks) (#49) * add partial webarena * very basic task runnable (for real) * replace BoTSolver with Solver * change TaskState, change log, remove some redundant code/files * change logging debug to info * change to using local rather than global docker_client * delete unnecessary files * set up skeleton of BashBrowserEnv * add BashBrowser auxiliary classes * add more BashBrowser stuff, and fix running multiple tasks in sequence * simple task that uses bash and browser working! * setup and teardown code for simple-web working * abstract more container loading logic, downloading is untested * fix downloading, improve container setup * move Session class to separate file * add (untested) setups for webarena containers * basic setup working for all environments * remove accidentally committed scratch file * add gpt-3.5 solvers * refactor container setup, add networking * register bash with session, set up networking * remove input statement, add more waiting * move bash setup into Session and enable building from Dockerfile * add bugged internet disabling * disabling internet access seems to work * delete unused agent code * add try..except around Session setup * fix stopping condition to include early stopping * set up BoTTaskState such that default Solvers can try the task * add the three easy tasks * slightly edit the agent prompt to see if it helps * fix easy Python task! 
* change to gpt-4-32k * fix BashObservation to have data property, not method * more prompt tuning * add way to run environments in repro * start adding medium/hard tasks (not complete) * add homepage docker container (first draft) * remove unused apps from homepage * add homepage to Session * add medium tasks, fix evaluator * add individual tasks for testing * fix ProgramHTML evaluation * replace 'match' with 'elif's * fixes to make medium and hard tasks run * clean up datasets and yaml * fix '|' mistakes in match-statement-replacement elifs * add gitlab url fix * remove unused field from json * add reproducibility run_experiments.sh script * solvers and 3.9 compat: replacing '| None' with 'Optional' and 'match' with 'elif' * more 3.9 compat changes * add longer timeout to ready check (for simple-web) * small task fixes * some logging tweaks * some prompt tweaks * change to using StrongSolver in run_experiments.sh * re: Dane's review, add timeout in constants, fix elif * update gitlab task to say 'main' * update prompt to emphasize homepage * remove redundant network from bash env * fix homepage links * save final report as dictionaries with task_ids * add some scratch code to run the bash container too * fix task 7 to use different repo * add sleep and empty check to avoid issues with browser env * add hack to prevent 'goto' accessing the internet * use 0613 checkpoints * add homepage to all tasks * use model context length to choose observation history * change to using 0613 snapshots * change message fitting to cut long observations * have Session log errors that cause it to shut down * switch default action splitter to single backtick * add first draft of README * add explicit check for chat model * Revert "switch default action splitter to single backtick" This reverts commit 813d832b26ba3aefebc6144a12812adf0f9968f0. 
* add action splitter to previous actions in prompt * reduce chars per token estimate from 3.2 to 3 * update wikipedia task to use Lorentz rather than Croatian election * change logging file extensions to match other evals * update StrongSolver to use tiktoken to fit context length * start on data parsing for plots * first version of plotting * change wikipedia task to not be answerable by gpt-4 * modify prompts based on jun's feedback * remove vscode file * update import and remove redundant todo * make output dir automatically in make_plots, add to run_experiments.sh * tidy up plot, add descriptive labels * changes from Dane/Jun comments * add contribution statement to README * change Eval -> SolverEval * add cleanup script to reproducibility * remove old task file * remove 'BoTSolverResult' * remove unused 'browser_early_stop' fn * move requirements into pyproject.toml and add setup instructions * improve setup instructions in README and improve error handling in session.py * initial version of playwright-flask app * two options: exec or define all functions * have a draft of basic structure of client and server code down * remove unused 'run_function' methods * debugged some issues with return values * first draft of dockerizing api (untested) * more incremental progress on flask-playwright api * incremental: remove iptables, debug more commands * change cleanup to prune networks, and add wikipedia to run_environments * refactor to use 'Forwarder's everywhere, and change how ports are used * add separation between client urls and server urls * add flask-playwright to CLEANUP * end-to-end run of task 1 successful * add slightly better logging to failed actions * remove 'bridge' network, containers only accessible via 'bucket-of-tasks_network' now * fix task urls * update strong solver prompt (fix typo, add http://homepage.com) * change gitlab container to use 'http://gitlab.com' as its url * add better resetting, try to debug Execution context error * just wait 
after navigation commands * fix bugs with retrying and base urls inside containers * fix issue with quoting in page.evaluate and fix bug with retry logging * remove redundant url fixing method and continue after invalid url in goto * better error logging in evaluators.py * change urls on homepage to be docker internal * change to using homepage as start url for all tasks * update debugging script * remove hardcoded hack to ensure urls start with 127.0.0.1 * revert change to gitlab internal url (should be gitlab:8023 for git clone) * change curl command logging info -> debug to reduce spam * move BoT reproducibility to elsuite * fix output dir in run_experiments * remove messy debugging scripts * remove linux restriction now that we use playwright container * remove exposed ports from container setup * change cache dir to be more portable * rename bucket-of-tasks to multistep-web-tasks * rename bucket of tasks in README * change default task to simple-web for CI, fix minor bugs * Remove commented-out code * Add spacing for docstrings --------- Co-authored-by: Dane <danesherbs@gmail.com> commit c4c9517bb9ef2eee705a2d1ad00b9a805f2ab33c Author: Giulio Starace <giulio.starace@gmail.com> Date: Mon Mar 11 12:48:51 2024 +0100 Coherence - Readmes + minor cleanup (#316) commit ebf453c16aeace5d9884aa809aae3c0ebc5f3a99 Author: Oliver Jaffe <oliver.jaffe@hotmail.co.uk> Date: Mon Mar 11 10:37:37 2024 +0000 Bugged Tools - minor fixes (#315) * don't use judge with DummySolver * basic error handling around tools commit b1c5cc0339d66202efcf4a1868f07b96025fa6aa Author: Chan Jun Shern <JunShern@users.noreply.github.com> Date: Mon Mar 11 08:50:57 2024 +0000 Function Deduction plotting code (#305) * Redo plotting code * Make colors less jarring * Fix missing text annotations bug * Update evals/elsuite/function_deduction/scripts/make_plots.py Co-authored-by: James Aung <129281094+james-aung@users.noreply.github.com> * Update evals/elsuite/function_deduction/scripts/make_plots.py 
Co-authored-by: James Aung <129281094+james-aung@users.noreply.github.com> --------- Co-authored-by: James Aung <129281094+james-aung@users.noreply.github.com> commit ce40f1eafb653fa3da30c003f64b8132cbd0bb8b Author: Oliver Jaffe <oliver.jaffe@hotmail.co.uk> Date: Mon Mar 11 08:27:30 2024 +0000 don't attempt to sum None (#314) commit c92857de8df82cc74861b493954c28dd239ab0ef Author: Dane <danesherbs@users.noreply.github.com> Date: Mon Mar 11 19:20:18 2024 +1100 Standardize terminology in MLAB v2 baselines (#321) * Update baselines to report "return" instead of "reward" * Update bipedal walker baseline * Update random sampling seed in naive baselines commit e47ce920b4290209bf77216da1b5532eae788fed Merge: e3b7360b 7efc3f65 Author: Chan Jun Shern <JunShern@users.noreply.github.com> Date: Mon Mar 11 08:17:48 2024 +0000 Merge pull request #322 from openai/jun/change-steg-datasets Drop two datasets from steganography commit e3b7360b13de77c9b901d3e33d2eff939ffcac7e Author: James Aung <129281094+james-aung@users.noreply.github.com> Date: Mon Mar 11 01:02:30 2024 -0700 Add InContext RL READMEs (#318) * Create README * Add ICRL to main README * Add Dataset section --------- Co-authored-by: Chan Jun Shern <chanjunshern@gmail.com> commit 40b5f144325c75dfbf3087692348a3e214490129 Merge: 7e85428d 030b2324 Author: Chan Jun Shern <JunShern@users.noreply.github.com> Date: Mon Mar 11 06:15:55 2024 +0000 Merge pull request #320 from openai/jun/sync-oss-20240311 Jun/sync oss 20240311 commit 7efc3f6573182d039cd7fa5de729c195e26e8cbb Author: Chan Jun Shern <chanjunshern@gmail.com> Date: Mon Mar 11 12:50:58 2024 +0800 Drop two datasets from steganography commit 030b2324798b81a043c0a56fa0b75e33317a72bf Merge: 7e85428d 82ec660e Author: Chan Jun Shern <chanjunshern@gmail.com> Date: Mon Mar 11 12:12:22 2024 +0800 Merge remote-tracking branch 'public/main' into jun/sync-oss-20240311 commit 7e85428db2a1587f9a824be23543be32f78cd13e Author: James Aung <129281094+james-aung@users.noreply.github.com> 
Date: Sun Mar 10 20:20:48 2024 -0700 In-Context RL plotting (#280) * init template scripts * plotting code * correct plots * plotting code improvement * Got plotting working in notebook * Updated line styles for baselines * Change opacity of lines * Un-messup merge * Update anti-cot solver for 4-turbo and the new 3.5 * Run experiments * Enable printing of each command before execution in run_experiments.sh script * New plotting code * Plotting code working * Delete old files * Add final average reward as a metric * Fix threading issue? * fix explanations arg name * change what solver we use to be generation/direct * no longer run sequentially thanks to threadding fix * Change qlearning baseline to train for max steps instead of max episodes * fix too many messages * new plotting code * qlearning for 1m steps * Add loop to run experiments multiple times * pretty names * Add filter script for log file processing * adjust fig size and add labels * New custom map for FrozenLake * Change max_steps to 200 * add evaluation function * simplify plotting 😌 * Fix saving plot to correct directory * per-env window sized and prettify * annotate lines with final values * invalid response rate plots * Update qlearning_baseline.ipynb * add labels * Catch broader class of GoogleAPIError * Widen catch for response.text errors * adjust invalid action message * new episode reward metrics * del plotting playground * Fix _calculate_episode_rewards method signature * Add handler for response.parts as well * cleanup plotting * update qlearning baselien for correct custom map * add short variant * max rolling avg in json * remove filter notebook * fix max reward of windows * Fix final settings for experiment script * Clean up prettifying functions to be dicts instead * clenn up episode rewards calculation in eval.py --------- Co-authored-by: Chan Jun Shern <chanjunshern@gmail.com> commit 25aa67e6babcb32dfd6ac92ab1e87cb9387f321b Author: Dane <danesherbs@users.noreply.github.com> Date: Sat 
Mar 9 20:48:28 2024 +1100 Fix spelling mistake in HRMLAB prompts (#319) commit caa46568eafbc6e7fc7d7ccd662e70d141fe02eb Author: Oliver Jaffe <oliver.jaffe@hotmail.co.uk> Date: Fri Mar 8 16:28:12 2024 +0000 prevent division by zero (#317) commit d872aeb3bb326790078dc24fe9498e25ee802926 Author: Oliver Jaffe <oliver.jaffe@hotmail.co.uk> Date: Fri Mar 8 11:20:08 2024 +0000 CDTA - readme (#313) * add eval README * compute headline metric * add metrics + description to eval yaml. Add shorter * add shorter version of eval * fix dataset dupliation bug * readme updates * point users towards prompts in readme * comment on custom chess solvers * add CDTA to root README * Update evals/registry/evals/cant_do_that_anymore.yaml Co-authored-by: Giulio Starace <giulio.starace@gmail.com> * replace subsequent -> special * Update evals/elsuite/cant_do_that_anymore/README.md Co-authored-by: Giulio Starace <giulio.starace@gmail.com> * mention more explicitly that each solver has a unique dataset * Update evals/elsuite/cant_do_that_anymore/README.md Co-authored-by: Giulio Starace <giulio.starace@gmail.com> * include both input and output tokens * include all metrics in eval yaml --------- Co-authored-by: Giulio Starace <giulio.starace@gmail.com> commit 90f0bc6f95985243cb31107cbee5747772ea182a Author: Dane <danesherbs@users.noreply.github.com> Date: Fri Mar 8 21:21:07 2024 +1100 Add OpenAI Gym environments to HR-MLAB (#242) * Merge commit with main * Update plotting code * Update default max steps to be `30` * Add gym to HR-MLAB requirements * Tidy up vectorization's `train.py` * Add the BipedalWalker-v3 environment * Add the CartPole-v1 environment * Add partially-implemented llama inference task * Update Cart Pole baselines * Update baselines and scoring for BipedalWalker * Update Cart Pole baseline and grading * Update bipedal walker baselines and grading * Add the inverted pendulum env * Fix human baseline for inverted pendulum * Fix bipedal walker baselines * Add pusher environment * 
Add task descriptions and update yaml * Add the `Ant-v4` env * Add `Ant-v4` data and yaml * Add `Pong-ramDeterministic-v4` env * Make Ant normalization fn clearer * Update Ant grading * Add `Humanoid-v4` env * Update Cart Pole grading * Update inverted pendulum grading * Updating `Pusher-v4` grading * Update Pong human baseline and grading * Update yaml and add humanoid jsonl * Update humanoid grading * Update humanoid human baseline and grading * Update the bipedal walker grading script * Update bipedal walker grading docstring * Update `Cart Pole` time limit and baselines * Update time limits for inverted pendulum * Update bipedal walker max time limit * Update yaml file * Update `Ant-v4` environment * Update `BipedalWalker-v3` env * Update `CartPole-v1` env * Update `Humanoid-v4` env * Update `InvertedPendulum-v4` env * Update `PongNoFrameskip-v4` env * Update `Pusher-v4` env * Update .gitignore * Add time limit to attempt * Update the `Ant-v4` env * Update the `Humanoid-v3` env * Refactor get_baseline_score function to allow for additional files and saving checkpoints * Refactor human baseline script and add checkpoint file * Remove unnecessary code for file copying * Add cache decorator to score calculation functions * Add cache decorator to score calculation functions * Update CartPole human baseline * Cache baselines for humanoid env * Cache baselines for inverted pendulum env * Cache baselines for pong * Cache baselines for pusher * Update experiment scripts * Update timeout error message in environment.py * Create logic to stop eval on out of context error * Update .gitignore * Add time and steps remaining reminder for model * Make normalization functions linear * Handle context length exceeded in solver * Update README * Merge pyproject.toml * Add `max_time` parameter to v1 jsonl files * Remove default max steps etc. 
and improve instructions * Remove `shell=True` from file execution fn * Remove video recorder from baselines * Add script to calculate token estimates * Apply hooks * Refactor error message for invalid action input * Update task descriptions for llama inference and vectorization * Refactor solver class name to SimpleActionAgent * Update assertion error message in baseline solver * Remove commented-out code in autoeval.py * Update run_experiments.py * Update Bipedal Walker task description * Simplify token consumption estimate * Update README.md * Handle edge case for unknown completion fn * Change baseline solver to use `OpenAISolver` instead of `OpenAIChatCompletionFn` * Re-add files with Git LFS tracking commit 2e0676e14616895baa5456a72c822476feef906a Author: Chan Jun Shern <JunShern@users.noreply.github.com> Date: Fri Mar 8 09:55:40 2024 +0000 Final plots for error rec (#312) commit 1fa967fa8d470f88da0ea780666c250069a5a717 Author: Ian McKenzie <140545726+ianmckenzie-oai@users.noreply.github.com> Date: Fri Mar 8 01:48:48 2024 -0800 [Skill Acquisition] Changes from rerun (#300) * Fix copy bug for subclasses * Add retries with backoff * Add task_description to record_sampling * Add SkillAcqAssistantsSolver along with yaml entries * Log token usage in final report * reduce yaml + run exps script to just solvers used * changes for rerun * add back some solvers * remove my absolute path * Update plotting script * Remove commented out code --------- Co-authored-by: Chan Jun Shern <chanjunshern@gmail.com> commit e775967ecd1d9c639593d78d2ed5f612fafb1a5a Author: James Aung <129281094+james-aung@users.noreply.github.com> Date: Thu Mar 7 07:58:52 2024 -0800 error recovery README commit 752432e3112f2eb387fa6ff6213ee651c5c193f6 Author: James Aung <129281094+james-aung@users.noreply.github.com> Date: Thu Mar 7 07:58:23 2024 -0800 Update README.md (#309) commit b9938df47d4ddbc3baffbef859e0d18f5cdd5e29 Author: Oliver Jaffe <oliver.jaffe@hotmail.co.uk> Date: Thu Mar 7 15:53:35 
2024 +0000 Mode Collapse - Variants + Polish (#307) * add option to create control dataset * rename dataset creation script * various readability + documentation improvements * log how many games have been examined during dataset creation process * only update tqdm bar + dataset if more than one move is found * remove unnecessary any_previous_move_found var * add option to save dataset throughout creation * remove unused n_threads arg for dataset_creation.py * remove unused recorder * skeleton structure for creating diagonal dataset * logic for finding diagonal moves * add support for diagonal experiment * remove unused plot. Render headline metrics * add plot for diagonal variant. Plot performance vs. num previous bishop moves * update progress bar with correct num. new examples * fixed bug where dataset would contain duplicate examples * run reproducability multiple times to compute SE * reproducability script to run diagonal variant * wrap single move in main+control datasets in list * seed changes on each different eval run * combine run_experiments scripts * remove unnecessary sort * add chess to pyproject.toml. Also correct typing * rename eval dir mode_collapse -> cant_do_that_anymore * rename mode collapse -> cant do that anymore * add diagonal dataset. * add required base solvers * up reproducability n_samples to 1000 * remove marker from plots. 
Provide correct path for saved figure * update datasets to use new keys * use default dataset for DummySolver commit 7f5f2a61d530485892a1e9a006e82006a3c3a7a3 Author: Giulio Starace <giulio.starace@gmail.com> Date: Thu Mar 7 13:36:28 2024 +0100 Coherence reproducibility (#261) * repro script for already_said_that * make it exectuable * solvers on outer loop * use aliases rather than specific snapshot * autoformat * missing gpt-4-base solver * temporarily suspend cot solvers * switch to 10 threads * run_experiments script for track the stat * temporarily disable gpt-4-base * .log; temporarily skip gpt-4-base * specify seed * additional helper method for specs * distractorless baseline * formatting * script for making track_the_stat plots * already_said_that plotting * increase thread count to 100 * enable gpt-4-base * explicit gpt-4-base was missing * get gpt-4-base too! * 50 threads; 100 causes hanging issues * indefinite (or long) hanging iwth more than 10 threads. sorry do gpt-4 first * use first letter rather than find-letter * clarify task even more * switch to first letter * all seeds of one solver before moving to the next * first letters (plural) * first letters plural * fix label; make results_dict init more general * use a global var for MODELS; correct first-letters to plural * fix legends * rename ideal to max * divide up plots and json * update cot solver names; only run gpt-4-turbo-preview cot by default * prepare plotting for gpt-4-base and cot 4 turbo * n_samples to match what ive been running (250 rather than 500) and reproducibility * make distractorless a bar * implement token counting * token counts in track the stat * handle CoT models * styling * run exps wrapping up * integrate gemini * tts explicit state for gemini and together * allow specifying the role of the explicit state message * integrate gemini in tts run exps * random baseline * Catch broader class of GoogleAPIError * Widen catch for response.text errors * integrate random baseline * 
remove extra comment * add direct solvers for together models * integrate together models in tts run exps * fix shortsight in comment * Add handler for response.parts as well * move gemini to bottom * handle new models * integrate random baseline * move main to the bottom * all stats in json stats * handle single seeds; third party models * plot random and human baseline * adjust plot size * switch to 0.05 * snapshot name * move main to bottom * dont hardcode num repeats * gemini pro 1.0 * better labels * turns, not steps * annotate n_samples as Optional * contract array definition/declaration * add diagnostic echo statements in loop * add explanatory comment * no longer need stat to legend loc * dont hardcode the length of the samples to 500 --------- Co-authored-by: Chan Jun Shern <chanjunshern@gmail.com> commit 1b85e50f1bfdc76ce8258054700c3d17bba473ba Author: Giulio Starace <giulio.starace@gmail.com> Date: Thu Mar 7 10:40:20 2024 +0100 Id vars - support for mixtral/gemini/llama in plots (#304) * autoformat * integrate third party solvers * clarify that third party is corrset only * tell users why were skipping commit 0acb25cc4a79b5a22e4e674030a1cee6dd8c7ecf Author: Oliver Jaffe <oliver.jaffe@hotmail.co.uk> Date: Thu Mar 7 09:20:45 2024 +0000 Mode Collapse - Automatically Create Dataset (#294) * add model_version property to solvers * solvers that aren't nested now have str model_version * rename model -> solver. 
Initialise solver before calling get_solver_predictions * optionally create dataset for evaluated solver if doesn't exist * if solver isn't nested solver, clone to force temp=0 * special_move_creation now doesn't compute solver predictions * update datasets * update args for reproducability script * move model prediction logic outside of special_move_creation.py * don't attempt to remake solver with temp=0 * update default eval args to match standard setup * use dummy recorder for dataset creation * add special moves dataset to git lfs * move funcs in mode_collapse/scripts/utils.py -> mode_collapse/utils.py * add warning when generating dataset * up num. samples to 1000 to match default setup commit 751004bf500cca60ee581f1e7fee98579abe4289 Author: Andrei Alexandru <inwaves@users.noreply.github.com> Date: Thu Mar 7 09:09:46 2024 +0000 20 questions: add readme, final update to plotting code (#306) * WIP plotting and repro code * Add dataset generation script and generate datasets * Remove plurals from lexicon dataset * Update repro & plotting code, add explicit max_questions arg to specs * Uncomment other solvers * Remove sequential run, uncomment standard * Remove dataset creation script, other datasets, update YAML and repro * Add final dataset, accept lowercase guesses for proper nouns * Fix word difficulty constant in plotting code * Include variant as plot title * Less verbose plot titles, update prompt, temperature value * Add baseline, labels * Add readme, final updates to plotting code * uppercase 'they' in prompt * Swap names in contrb statement commit 2ca19c356ec79a9f24cf0ed37c041e97512bbc83 Author: Oliver Jaffe <oliver.jaffe@hotmail.co.uk> Date: Thu Mar 7 08:56:42 2024 +0000 Add model_version to solvers (#288) * add model_version property to solvers * solvers that aren't nested now have str model_version commit 3df5e01870829d641aad22edf7771d6d34bf1377 Author: Oliver Jaffe <oliver.jaffe@hotmail.co.uk> Date: Wed Mar 6 13:51:09 2024 +0000 Mode Collapse - 
Tweaks & Violations (#293) * update task description. Include more information on notation * raise error if notation parser is given incorrect input * construct controller for all solvers * measure and log violations * log std of previous move length * remove random solver * tidy incorrect notation handling * split variable init * tidy violation metric calculations * better naming of specific violation * move get_binary_avg to mode_collapse/utils.py, add typing commit cff5aba866cc85fd1e541de2afcd1be2594e05b7 Author: Chan Jun Shern <JunShern@users.noreply.github.com> Date: Wed Mar 6 03:11:04 2024 +0000 AssistantsSolver for SkillAcq (#291) * Fix copy bug for subclasses * Add retries with backoff * Add task_description to record_sampling * Add SkillAcqAssistantsSolver along with yaml entries * Place files under current_state['files'] * Include files on every message in thread instead of only one; remove redundant self.all_uploaded_files variable * Expand SkillAcqAssistantsSolver to SkillAcquisitionAssistantsSolver commit 037b1765e8c795d332261b022964e0834d0f10a5 Author: Chan Jun Shern <JunShern@users.noreply.github.com> Date: Wed Mar 6 03:10:28 2024 +0000 Log token usage in final report (#298) * Log token usage in final report * Print comma-delimited numbers commit 827f906736be5f2f55f43a2f0a8320e7f6ec4f4a Author: Chan Jun Shern <JunShern@users.noreply.github.com> Date: Tue Mar 5 03:03:27 2024 +0000 Jun/gemini bugfixes (#302) * Catch broader class of GoogleAPIError * Widen catch for response.text errors * Add handler for response.parts as well commit 038e43530f7dbf1c154a48952d3af6198f0144b1 Author: Giulio Starace <giulio.starace@gmail.com> Date: Mon Mar 4 12:06:01 2024 +0100 Coherence - solver config for third party models (gemini/together) (#301) * tts explicit state for gemini and together * allow specifying the role of the explicit state message commit 60be685819f24e44bab014fde330ecefc6c492b9 Author: Giulio Starace <giulio.starace@gmail.com> Date: Fri Mar 1 17:39:41 
2024 +0100 Coherence - Random baseline solvers (#266) * random baseline solver for track the stat * random baseline solver for already said that * add registry=None to inits so that they dont crash see #267 * more consistent rounding commit 9b6ee76d463683cd8974f34e4f03b6f866567d2e Author: Ian McKenzie <140545726+ianmckenzie-oai@users.noreply.github.com> Date: Fri Mar 1 01:26:43 2024 -0800 [Solvers] Add 'stop' sequences to base model calls (#295) * use msg seps as stop seqs for base model api * log warning if too many stop sequences are used commit aa7431deb3b493d72626699c3fda5ae30f2e23d0 Author: Ian McKenzie <140545726+ianmckenzie-oai@users.noreply.github.com> Date: Thu Feb 29 14:43:04 2024 -0800 [Solvers] Add open-source models from together.ai (#284) * initial version of together.ai solver * modify messages in TogetherSolver to fit format * add mixtral and 70b * add scripts for running on os models * add script for running evals over weekend * add sleep between evals * fix for loop in bugged_tools script * use 10 threads * make skill acq scripts executable * handle together.ai context length errors * add cot and custom os solvers * switch to using cot solvers * add run_os_experiments for error recovery * change run_os_experiments to executable * fix custom solvers to use correct model names * use n_repeat=1 to match gemini * remove print statement * fix bugged_tools log naming * fix skill_acquisition log naming * os plotting changes for fdeduct and ivariables * add os/gemini models to error recovery plots * revert os plotting/experiment scripts * add default api base url * remove unused completion_fn config for together * add optional message merging and tests commit 0e8b05d18b8bd737a39c48107c606b8c0dee91fb Author: Chan Jun Shern <JunShern@users.noreply.github.com> Date: Thu Feb 29 19:56:49 2024 +0000 [Solvers] Add GeminiSolver (#263) * Add initial working prototype for GeminiSolver * Add basic tests, add solvers/gemini.yaml * Add google-generativeai to 
pyproject.toml * Relax safety settings; handle API errors gracefully * Remove redundancy in model name * Revert to resp.text instead of longer version * Make test case less ambiguous * Update docstrings * Add postprocessors and CoT solver * Log messages in non-google format for logviz compatibility * Explicitly require EVALS_SEQUENTIAL while we haven't figured out threading * Catch known error about while we haven't figured it out * Drop postprocs that were causing issues with CoT * Fix failing test * Register gemini for fdeduct and skillacq eval-specific solvers * point to github issue for gemini threading todo --------- Co-authored-by: Ian <ian.mckenzie@c-openai.com> commit 0912197f18b0d4a27a2fb1d1764f5e52e2cfbccc Author: Chan Jun Shern <JunShern@users.noreply.github.com> Date: Thu Feb 29 18:17:04 2024 +0000 [Skill Acquisition] Jun + Ian bug fixes (#289) * Fix bug in get_average_bleu_score * Compute acc for non-translation q's only * Log more metrics * Use full paths for files in current_state * Fix wrong key name * Log number of translation and non-translation samples in final report * fix typo in 'wrong_section' prompt * stop removing quotes from model outputs --------- Co-authored-by: Ian McKenzie <ian.mckenzie@c-openai.com> commit 19aa0f48759fb6881486b8670dcdd2edb445f740 Author: Chan Jun Shern <JunShern@users.noreply.github.com> Date: Mon Feb 26 15:19:50 2024 +0000 Add handler for oai max messages exceeded (#290) commit 5706f315c94748e41e48ea02f71ce3da3b5657a7 Author: Oliver Jaffe <oliver.jaffe@hotmail.co.uk> Date: Fri Feb 23 09:42:49 2024 +0000 Mode Collapse - reproducibility (#276) * select dataset depending on evaluated model * add random solver * record avg. 
number of previous moves in dataset * reproducability code for running and plotting main experiment * dataset creation follows models defined in eval.py * tqdm has correct target length * fix plotting bug, check if value is True correctly * sample dataset consistently between runs * remove preprocessors * add strip to solver ouputs since postprocessors no longer used * only create controller if necessary * pass current state as None if no legal moves are passed * rename plot_experiments.py -> make_plots.py * add TODOs for remaining work * use snapshot for gpt-4-turbo commit f6f6a765477eda6391083cd92fa16a0f42797000 Author: Andrei Alexandru <inwaves@users.noreply.github.com> Date: Fri Feb 23 10:01:00 2024 +0200 20 questions: datasets + repro/plotting code (#258) * WIP plotting and repro code * Add dataset generation script and generate datasets * Remove plurals from lexicon dataset * Update repro & plotting code, add explicit max_questions arg to specs * Uncomment other solvers * Remove sequential run, uncomment standard * Remove dataset creation script, other datasets, update YAML and repro * Add final dataset, accept lowercase guesses for proper nouns * Fix word difficulty constant in plotting code * Include variant as plot title commit c6f24dca6788185af403834ee0b1a86008518064 Author: James Aung <129281094+james-aung@users.noreply.github.com> Date: Fri Feb 23 07:50:07 2024 +0000 String explanations and frozenlake variant for InContext RL (#283) * refactor main loop to use a single for loop * Change logging severity level * Switch out keys for an explanations string * Add FrozenLake variant with custom map * Update evals/elsuite/incontext_rl/defaults.py * Update use_explanations variable name commit 6f83b69be681b3fd7a79fd8ac7c8caf9a325c2b5 Author: Ian McKenzie <140545726+ianmckenzie-oai@users.noreply.github.com> Date: Thu Feb 22 21:27:40 2024 -0800 [Error recovery] Final eval + plot changes (#269) * partial progress on plots * progress on plots (still more to go) * 
small tweaks after putting plots in report * put step plots on same fig * make small changes to plots * messy own vs other plotting * change to have models next to each other, and refactor cli * add option to have reasoning in user message * add option to have answer prompt be user or system * parse mark_as_own_reasoning better * fix up run_experiments.sh * clean up own_reasoning vs other_reasoning * add gpt-4-base solver that continues assistant message * Remove unused function --------- Co-authored-by: Chan Jun Shern <JunShern@users.noreply.github.com> commit 82fb4401b2b88acda979dcc9df715777274dd426 Author: James Aung <129281094+james-aung@users.noreply.github.com> Date: Thu Feb 22 17:29:54 2024 +0000 Clean CurrentState for InContext RL (#282) * refactor main loop to use a single for loop * Change logging severity level * Clean CurrentState to have fewer properties * Fix random solver commit 1d0fcf608729b5ea4830faae72ca2f2698ab952f Author: James Aung <129281094+james-aung@users.noreply.github.com> Date: Thu Feb 22 15:55:26 2024 +0000 Refactor main loop of in-context RL to use a single for loop (#281) * refactor main loop to use a single for loop * Change logging severity level commit 2961e233dac2b5908a181c23d899413b0d53daeb Author: Giulio Starace <giulio.starace@gmail.com> Date: Thu Feb 22 14:55:28 2024 +0100 Coherence - missing solvers (#287) * missing gpt-4-base solver * CoT solvers for ast commit c65dac28ce91403ff4810623c00ea7540ad97927 Author: Chan Jun Shern <JunShern@users.noreply.github.com> Date: Thu Feb 22 08:25:46 2024 +0000 Rolling back postprocessors :( (#286) * rollback postproc * Make private interaction mismatch error msg more informative commit 819657351194b01048fd50ee14e20415015cdd6c Author: James Aung <129281094+james-aung@users.noreply.github.com> Date: Thu Feb 22 07:55:50 2024 +0000 Add anti-cot solver to In-Context RL (#279) commit 44dd342c3ab5bea59be36cec41e21cc818281f04 Author: James Aung <129281094+james-aung@users.noreply.github.com> Date: 
Wed Feb 21 14:40:44 2024 +0000 Correct token counts for SkillAcq README (#277) commit d9a8da7a9042adef3336c47b1720021f6689bf78 Author: James Aung <129281094+james-aung@users.noreply.github.com> Date: Wed Feb 21 14:40:29 2024 +0000 Correct token counts for IV README (#277) commit 3b741a791ec2724eca7405d539ab3c5fb9e006d8 Merge: 7dcd3a64 23ec5f9d Author: James Aung <129281094+james-aung@users.noreply.github.com> Date: Wed Feb 21 11:03:33 2024 +0000 Merge pull request #248 from openai/james/icrl First working version of InContext RL commit 7dcd3a6493f0e4f81a71974dde6f5d3487dfc868 Author: Oliver Jaffe <oliver.jaffe@hotmail.co.uk> Date: Wed Feb 21 08:03:35 2024 +0000 OpenAISolver generate from prefix (#264) * allow OpenAISolver to continue generating from prefix * return raw_completion_result in OpenAISolver SolverResult * rename start -> prefix. Check msgs length to avoid IndexError * correct typing * simplify prefix, store in fixed_start * add spaces around prefix commit 9c0804cfb9d55b5e47e644df0b0b50951e0148da Author: Giulio Starace <giulio.starace@gmail.com> Date: Wed Feb 21 08:47:21 2024 +0100 Replace `find-letter` with `first-letter` (#274) * use first letter rather than find-letter * first letters (plural) commit bd6f1850582e2c312ae546d3da0082384f4f34c5 Author: Chan Jun Shern <JunShern@users.noreply.github.com> Date: Wed Feb 21 07:45:20 2024 +0000 Bugfix: Postproc for empty/short strings (#275) * Apply handling for edge cases where string is empty or shorter than required * Add tests * Simplify RemoveQuotes using string.strip args Co-authored-by: Giulio Starace <giulio.starace@gmail.com> * Simplify RemovePeriod using string.strip args Co-authored-by: Giulio Starace <giulio.starace@gmail.com> * Revert RemoveQuotes to only remove matching pairs; extend tests --------- Co-authored-by: Giulio Starace <giulio.starace@gmail.com> commit 0d05bdb3472a982050049d3e2361308a65ae3a93 Author: Giulio Starace <giulio.starace@gmail.com> Date: Wed Feb 21 04:57:33 2024 +0100 HumanCLI 
solver convenience wrapper for Coherence evals (#262) * include task name in current state * allow for custom input prompt (default behaviour is unchanged) * human cli solver for track the stat * humancli wrapper for already said that * input_prompt is always a string, we provide a default * no need for `is None` check now Co-authored-by: Chan Jun Shern <JunShern@users.noreply.github.com> * autoformat with black * my solvers no longer need to override input prompt * need to provide at least one SolverSpec arg for it to be recognised as one --------- Co-authored-by: Chan Jun Shern <JunShern@users.noreply.github.com> commit 2f29882a14d61cca682773e0ac5adeef805c30ff Author: Giulio Starace <giulio.starace@gmail.com> Date: Wed Feb 21 04:28:04 2024 +0100 Address evals hanging on certain samples (#260) * dont do nested multithreading * more aggressive timeout * switch back to 40 seconds commit 53085bf9c6ce76489d824556d982d885c1a78d93 Author: Chan Jun Shern <JunShern@users.noreply.github.com> Date: Wed Feb 21 03:10:37 2024 +0000 add guards for solvers that subclass Solver in diff ways (#270) commit 23ec5f9dc1883088a3a5153b7b8660cfd46119f1 Author: james-aung <james.aung@c-openai.com> Date: Tue Feb 20 15:16:30 2024 +0000 create variant which only runs on built in gymnasium environments commit 78975fb5428855076fd3df19b09381ebe09fe6c1 Author: james-aung <james.aung@c-openai.com> Date: Tue Feb 20 15:11:13 2024 +0000 Add gymnasium as a dependency commit fcab6456b5b6ca7aa91083d016bebb0e6d7c77ed Author: james-aung <james.aung@c-openai.com> Date: Tue Feb 20 15:08:38 2024 +0000 Add Q-table initialization in QlearningSolver commit f263ba676c147d7d7aaa73e81245849c1cbaae2d Author: james-aung <james.aung@c-openai.com> Date: Tue Feb 20 14:54:55 2024 +0000 remove action and observation space counts from sample commit e850311a692e725035331f5511d88ac596c9f35c Author: Giulio Starace <giulio.starace@gmail.com> Date: Tue Feb 20 15:17:01 2024 +0100 clarify task even more (#273) commit 
c0416c152a93992e2f318b5b38a4fc56334dbb79 Author: Oliver Jaffe <oliver.jaffe@hotmail.co.uk> Date: Tue Feb 20 12:12:28 2024 +0000 Mode Collapse - Main Experiment (#265) * skeleton eval * add solver prediction evaluation * add measuring probability of predicting moves * remove probability calculation * add eval yaml, add n_samples param * add dataset creation script * add gpt-3.5-turbo dataset * add other model datasets * fix failing test, update eval call * add documentation for variant rules * improve task desc, define rules of all pieces * simplify message construction Co-authored-by: Giulio Starace <giulio.starace@gmail.com> * rename prop -> proportion * strip now handled by postprocessors * replace dataset creation notebook with script * update datasets * remove unused SolverEval arg * pass jsonl_dir rather than relying on global args * get_model_predictions returns rather than dumping * improve documentation, make clear which rules of chess apply * default args as False * fix typo * changed large list to set * rename n_samples n_special_moves * fix previous move filtering logic --------- Co-authored-by: Giulio Starace <giulio.starace@gmail.com> commit 2814496a01da52c1c365741134d58e15fc2c8b24 Author: Chan Jun Shern <JunShern@users.noreply.github.com> Date: Tue Feb 20 09:20:37 2024 +0000 Update gpt-3.5-turbo-0125 to just gpt-3.5-turbo (#271) commit 45aeb778dc5ec9637d8160e971bfe182b4a16fcb Author: Giulio Starace <giulio.starace@gmail.com> Date: Mon Feb 19 17:03:34 2024 +0100 Coherence - Task description adjustments (#268) * remove notion of distractors from track_the_stat * take a word from the distractor question to use in the example transcript commit 27bd87811d0b46e07f9822fbd66dd14b0c5d31a0 Author: Giulio Starace <giulio.starace@gmail.com> Date: Mon Feb 19 15:04:42 2024 +0100 Already said that - mvp (#259) * implement build message * implement build_base_task_message * implement build_distractor_question_message * implement parse_solver_output * minor fixes * 
fix old implementations; distractor also evaluated now * mark violations as mistakes * no longer need rng * fill out yaml * actually use slef.task_description 🤦 * clarify instructions * adjust weighting * fix parse solver output * clearer docstring * modularize a bit * track fp and fn rate * add adversarial flag * make it extra clear when a sample is for the main task * replace rectangles and next-val-series with find-letter and which-is-heavier * balance how often we show new vs old words * .strip() handled by postprocessor; .lower() handled if match * switch to dict * use oneliner commit 2e5c35437484629327d8bd037c1fceecd2fa52ab Author: James Aung <129281094+james-aung@users.noreply.github.com> Date: Mon Feb 19 07:49:42 2024 +0000 Update evals/registry/solvers/incontext_rl.yaml Co-authored-by: Chan Jun Shern <JunShern@users.noreply.github.com> commit 7d71c63b5b44673488514e6ae88de81d3b70e21a Author: Chan Jun Shern <JunShern@users.noreply.github.com> Date: Mon Feb 19 07:15:38 2024 +0000 Add postprocessors for Solver outputs (#245) * basic postprocessors working * Add postprocessors to subclass constructors * add to defaults * Add test for combination of postprocs * Add docstrings * Change import from just classname to full_path:classname * Log postprocessor events * Add README * Fix missing recorder in tests commit 5259ff8bb65c1668cdb8a7fba80d1e157a73e0e7 Author: james-aung <james.aung@c-openai.com> Date: Fri Feb 16 16:59:09 2024 +0000 cleanup commit edffa345885775aa792dc1d5b707e7d941c550cb Author: james-aung <james.aung@c-openai.com> Date: Fri Feb 16 16:31:27 2024 +0000 update qlearning baseline workbook commit e4a77cda0db30b8d351082e1ed8702207848295a Author: james-aung <james.aung@c-openai.com> Date: Fri Feb 16 15:51:52 2024 +0000 attempt at qlearning solver commit 26156168a2fcc64d9058ffe18c36edd8f67cbf05 Author: Giulio Starace <giulio.starace@gmail.com> Date: Fri Feb 16 15:33:36 2024 +0100 Already said that (coherence) - Dataset (#254) * move data generation to 
scripts/ subdir * implement bulk of high-level of data generation * only missing distractor corpora * the distractors dont live in the dataset they already exist as their own separate datasets * move distractor stuff to its own module * num words as an arg * fix hanging * script and words dataset * make it executable * distractorsample dataclass * skeleton/scaffolding * missing samples_jsonl from yaml * wrong import * leverage evals.registry and evals.data * make it more robust * implement rectangles * implement next-val-series * implement ambig sentences * implement reverse word sort * make ideal just a string, not a set of strings * show distractor-specific example * add note about ifmain * remove TODO * assume running from repo root * add distractors tests commit ecd011dff7d58cbd235fa7cd8c56707703fbbf5d Author: james-aung <james.aung@c-openai.com> Date: Fri Feb 16 14:17:20 2024 +0000 implement random solver commit 58ff0b1bf49d1cea9b4c4c1db2101a3493fdee08 Author: james-aung <james.aung@c-openai.com> Date: Fri Feb 16 13:40:28 2024 +0000 update yaml commit 68377f4f259bfd1049d8ec78f1cc2b771b36ecd8 Author: james-aung <james.aung@c-openai.com> Date: Fri Feb 16 13:39:57 2024 +0000 keep CurrentState up to date commit efd57bd44306c6605060bb14e98400b5e238235e Author: james-aung <james.aung@c-openai.com> Date: Fri Feb 16 13:17:46 2024 +0000 update samples commit 5d7fd725ada83c563a66f7db7d588e83293adb51 Author: james-aung <james.aung@c-openai.com> Date: Fri Feb 16 13:17:31 2024 +0000 major refacto commit e20bb6e6bb4f7d69a34e8b1e7e38b70f5bc4e06a Author: james-aung <james.aung@c-openai.com> Date: Fri Feb 16 13:17:25 2024 +0000 add more linebreaks to prompts commit a10afc384b3a6a844481a8d6ac69e2502a6c4c9a Merge: 6ef1e812 96a6bf34 Author: james-aung <james.aung@c-openai.com> Date: Fri Feb 16 11:41:38 2024 +0000 Merge branch 'main' of github.com:openai/dangerous-capability-evaluations into james/icrl merge changes from main commit 96a6bf3401baa78fdd676bad9dc1ee9648cf123d Merge: 
e807f2dc d6e3e915 Author: James Aung <129281094+james-aung@users.noreply.github.com> Date: Fri Feb 16 11:41:32 2024 +0000 Merge pull request #255 from openai/feature/20Questions Small improvements to human CLI usage commit 6ef1e8122f7d190da3f6520c2d7ee70238ac4820 Merge: 12b6b191 e807f2dc Author: james-aung <james.aung@c-openai.com> Date: Fri Feb 16 11:40:19 2024 +0000 Merge branch 'main' of github.com:openai/dangerous-capability-evaluations into james/icrl merge main commit e807f2dc4e85d3c8ea9c0207028ccb03caa06e25 Author: Giulio Starace <giulio.starace@gmail.com> Date: Fri Feb 16 12:05:34 2024 +0100 simplify median state (#257) commit 12b6b191985ec82512c2f0293a411124b266148e Author: james-aung <james.aung@c-openai.com> Date: Fri Feb 16 10:59:53 2024 +0000 Refactor environment setup and task description generation commit 96442b5bc254f2d3e4f9ae463d99f0de601511cb Author: james-aung <james.aung@c-openai.com> Date: Fri Feb 16 10:43:55 2024 +0000 add baselines notebook commit 471fc9be934eb12da52e76a358cfb0603d801027 Author: Oliver Jaffe <oliver.jaffe@hotmail.co.uk> Date: Fri Feb 16 09:32:19 2024 +0000 Add gpt-3.5-turbo-instruct, remove trailing whitespace from completion prompts (#256) * support gpt-3.5-turbo-instruct completion model * remove trailing whitespace from end of prompt * correct spelling of completion model * use chat_to_prefixes for rendering completion text, but strip whitespace * add documentation about removing trailing whitespace commit d6e3e915fcb9cb341538737f01d98103c30b050c Author: Andrei Alexandru <inwaves@live.com> Date: Fri Feb 16 09:47:18 2024 +0200 Small improvements to human CLI usage commit c8dc96a9ddf13a11e5fa890896032ee446c8f868 Author: Andrei Alexandru <inwaves@users.noreply.github.com> Date: Fri Feb 16 09:00:31 2024 +0200 20 questions: add shortlist variant (#253) * Add shortlist variant * Add a shortlist variant for the 'full' spec in the YAML file * Address feedback for shortlist variant * Replace how we extract the guess with regular 
expression * Address feedback for regex commit ca4773f0e624f66a0a8c15202b4b2328aea7d3a5 Author: Chan Jun Shern <JunShern@users.noreply.github.com> Date: Thu Feb 15 17:11:09 2024 +0000 Suppress excessive 'HTTP/1.1 200 OK' logs from openai library (#252) commit b9014652eec22e2ef2e5d4b540b10c175d3c3d31 Author: Giulio Starace <giulio.starace@gmail.com> Date: Thu Feb 15 14:13:54 2024 +0100 Coherence - Already said that distractors eval - Skeleton (#251) * scaffolding * further scaffoldign * [wip] more skeleton * cleaner skeleton * more skeleton, add utils * fix name * implement agg metrics * set constant for num turns * add docstrings * add mocks * fix flow to enable tracking perf on distractor task * rename violation to violation_occurred * move convo loop to its own func * make elif more readable * remove redudnant empty messages * address constants commit b4875c129c61b35aea0d1cf7c3d1cac262c4654d Author: Andrei Alexandru <inwaves@users.noreply.github.com> Date: Thu Feb 15 10:11:25 2024 +0200 20 questions: adding features I (#250) * Add features: logit bias, 'skip' option, handling incorrect guesses and rule violations * Add counter for gamemaster 'skip' replies * Remove finished TODOs commit b97ec783ec75483148f63c5e5c109fba7e9430c9 Author: james-aung <james.aung@c-openai.com> Date: Wed Feb 14 15:37:35 2024 +0000 calculate rolling reward commit 8797064be11d6fda3ab37dc529e81fb409321729 Author: james-aung <james.aung@c-openai.com> Date: Wed Feb 14 14:12:12 2024 +0000 track which steps episodes end commit 1b7661bea9c41902b5575a4ce05c8a13b338b535 Author: james-aung <james.aung@c-openai.com> Date: Wed Feb 14 14:04:44 2024 +0000 reintroduce backup max_steps commit 2466a1df070f7bbec90c26624692344bd6ea699c Author: james-aung <james.aung@c-openai.com> Date: Wed Feb 14 13:58:14 2024 +0000 sample now runs until context limit is reached commit 557ddddb33914a6e5e778d684b9a76f5a014c8ea Author: james-aung <james.aung@c-openai.com> Date: Wed Feb 14 12:50:38 2024 +0000 remove n_trials 
concept commit 5ee3fff6f1f04b535058da26de342401af1c6726 Author: james-aung <james.aung@c-openai.com> Date: Wed Feb 14 12:05:53 2024 +0000 adjust task description based on feedback commit a236da4c8bd8924343b1818d261951349c02b365 Author: james-aung <james.aung@c-openai.com> Date: Tue Feb 13 17:30:12 2024 +0000 tidy up commit 9a5a6f27a49e608df693c0631c0788d6744e2966 Author: James Aung <129281094+james-aung@users.noreply.github.com> Date: Tue Feb 13 16:59:42 2024 +0000 Update evals/elsuite/incontext_rl/requirements.txt commit 2f2609043b5293ffdc6fe8e47be33c2c71b4ae97 Author: Oliver Jaffe <oliver.jaffe@hotmail.co.uk> Date: Tue Feb 13 12:34:08 2024 +0000 Mode Collapse - Chess Framework (#247) * chess game skeleton logic * implemented algebraic notation parser * logic for moving pieces * prevent moves that threaten the king * add support for castling moves * add support for en passant moves * add support for pawn double moves * fix pawn moves, don't allow them to capture forwards * add support for promotion moves * add method to find piece on board * compute normal moves before promotions, otherwise no promotions are possible * add testing for our framework vs. python-chess * verbose option for test * add missing typing * skip board testing to stop CI failing * prevent pawns promoting to kings * simplify notation parsing. Assume full start and end position is given * rename notation function to intended name * rename board function to more descriptive name * rename possible_moves -> possible_transformations, since wasn't using Move object * move import within test to prevent CI failing * Piece object now calculates its possible moves * add file containing move variants * rename pieces -> piece_id_to_instance * add running test upon executing file. 
rename testing vars * notation parser no longer requires player_id * rename notation parser functions to reflect that move object is being processed * removed unused initial_board and _find_pieces func * add documentation for _update_board * remove unnecessary validation of move * split some get_piece_moves logic into separate function * rename functions to indicate if transformation or move is returned * refactor to avoid circular imports. now Board isn't passed to notation parser or Piece * add documentation to board classes * nit: update typing commit 36c7faac0ab15e53e523f14d801fef1306f7ce99 Author: Giulio Starace <giulio.starace@gmail.com> Date: Tue Feb 13 11:27:25 2024 +0100 Coherence - Explicit State baseline solver (#249) * implement eval side * fix serialization issue * make it a dict explicitly * implement explicitstate solver * idk how this crept back in * rename to track the stat * fix imports and ids * move helper functions to utils.py * keep logger, not logging commit b10f23c436b64562068c3c880dbd3494ece7da95 Author: Giulio Starace <giulio.starace@gmail.com> Date: Tue Feb 13 10:53:35 2024 +0100 Coherence - Implicit State Tracking (#246) * implement turns, add task desc * append messages * implement task_fn and parse_solver_output * singleturn was missing Message construction * fill in prompts * remove fulfilled todos * 500 samples * avoid info logs from httpx * integrate singleturn task desc * use stateless solver to avoid state leak in singleturn * make a submodule for the prompts * further modularize prompts * remove extra space in warning * fix lsit rendering in singleturn example * more metrics * switch to aritmean median * strip notion of singleturn/multiturn dichotomy * fix serialization issue * rename this to implicit state tracking * get existing logger and use that instead of logging * add note to todo * rename to track_the_stat * finish renaming commit 066a6bb011e50041e93b72ecf82cb9d6446e691a Author: Andrei Alexandru 
<inwaves@users.noreply.github.com> Date: Mon Feb 12 10:47:34 2024 +0200 Initial PR for 20 questions (#240) * Twenty questions MVP * Refactor to conversation loop * Fix tests to use Message class * Some questions have clarifications after the ?, like (eg.) * Log word complexity metric * Tweak guesser prompt * Add lexicon from Maddela and Xu, 2018 * Add dev5, remove completed TODOs * Add score hint in guesser prompt * Add flexible max score to prompt * Consistent naming for max_questions * Move from accuracy to score * Add scoring, TODOs, fix dev5 num_samples * Remove test file, it's no longer needed * Shuffle before we sample * Move from specific gpt-4 version to -turbo alias * Add fall-back condition commit f525f70c3f55533c9a328feef3fa69e59c05a03f Author: Chan Jun Shern <JunShern@users.noreply.github.com> Date: Mon Feb 12 08:41:13 2024 +0000 Add "System: " prefix when rendering system messages for base model (#244) * Modify base model msg rendering; OAISolver uses 'System:' prefix * Fix failing tests commit d99e6c46e512ab306b800050692a041f4179eb0c Author: james-aung <james.aung@c-openai.com> Date: Sun Feb 11 20:19:35 2024 +0000 change to 50 steps commit c71cb3a6b096a15612cdc54b6e5254c73d1fcdd0 Author: james-aung <james.aung@c-openai.com> Date: Sun Feb 11 19:22:26 2024 +0000 adjust variants commit c82ee84b33f4ac821275d00c169674075eca4642 Author: james-aung <james.aung@c-openai.com> Date: Sun Feb 11 19:21:49 2024 +0000 Cast cumulative_reward to float in InContextRl eval commit 9dd1ea31720e973f73ff9f3ae1498b835cda26a7 Author: james-aung <james.aung@c-openai.com> Date: Sun Feb 11 18:36:44 2024 +0000 fixed dataset by removing duplicate sample commit e17e8e87c4d93e8268d526708dee6a137a04720e Author: james-aung <james.aung@c-openai.com> Date: Sun Feb 11 18:19:03 2024 +0000 split out a explanations variant commit a0e2278fc1988b9eae4d483a9b4bf1b5fbe90bf9 Author: james-aung <james.aung@c-openai.com> Date: Sun Feb 11 18:18:37 2024 +0000 add cliffwalking and 10armed bandit envs 
commit d5725e76a69cd141e684619212dda0b8444d3193 Author: james-aung <james.aung@c-openai.com> Date: Sun Feb 11 18:18:10 2024 +0000 pass through explanined observation to step messages commit 20268cbc85977238aeeda3a24a87a80af4a54fd9 Author: james-aung <james.aung@c-openai.com> Date: Sun Feb 11 18:17:43 2024 +0000 update prompts commit 8a399ea54121b6f592aae5344a7d1e6347c09992 Author: james-aung <james.aung@c-openai.com> Date: Sat Feb 10 19:33:45 2024 +0000 Update to 60 steps commit f99a6130366fda51dcb45f3e1703666e1a688dc6 Author: james-aung <james.aung@c-openai.com> Date: Sat Feb 10 19:33:32 2024 +0000 Suppress annoying logging commit 13c05dc5b4956e10d26c3947510974d39ed1aaf5 Author: james-aung <james.aung@c-openai.com> Date: Sat Feb 10 19:11:12 2024 +0000 cleanup commit ea7d93f875e154c64ed0b12f800eb7a1bba6cea5 Author: james-aung <james.aung@c-openai.com> Date: Sat Feb 10 15:12:36 2024 +0000 add suport for multiple trials per env commit 8dd4ebe98b0795a611f95a6501d799fa6c414bd8 Author: james-aung <james.aung@c-openai.com> Date: Sat Feb 10 15:12:18 2024 +0000 toggle explanations on commit 300902e1233318deda0b467bb979ab2d98fa4ce3 Author: james-aung <james.aung@c-openai.com> Date: Sat Feb 10 15:12:06 2024 +0000 update samples commit e1a5499394394fb6a62ab4c4759f5ee024598adb Author: james-aung <james.aung@c-openai.com> Date: Sat Feb 10 14:05:10 2024 +0000 Update incontext_rl.yaml with new n_steps value commit 1a773de8876be16e5f9715f39adb1eaa1563b5c1 Author: james-aung <james.aung@c-openai.com> Date: Sat Feb 10 14:05:04 2024 +0000 add exlanations to dataset commit d9dd017a75eec5e387624ae00323d10130393325 Author: james-aung <james.aung@c-openai.com> Date: Sat Feb 10 14:04:57 2024 +0000 allow for explanations commit bfd80b7f864d8388e17798c0822c3b68867020f1 Author: james-aung <james.aung@c-openai.com> Date: Sat Feb 10 13:57:08 2024 +0000 ffix spacing commit 7188886ec34ceb57f84e362400019786554eb775 Author: james-aung <james.aung@c-openai.com> Date: Sat Feb 10 13:47:52 2024 +0000 
add action parsing commit 678957f66d2253d0740c2bffa111a66a19d45867 Author: james-aung <james.aung@c-openai.com> Date: Sat Feb 10 13:47:42 2024 +0000 remove old notebook commit 21132e552a46466efd5f97ca6aac5a3c5cbccf2a Author…

jwang47 added the "Idea for Eval" label (These issues keep track of requests for different kinds of eval PRs) on Apr 13, 2023



