LLM Data Curation Challenge

You will be given chunked C4 data annotated along several axes:

Writing Style
Facts and Trivia
Educational Value
Required Expertise
Cluster

Track 1, Ordering - Can you find an ordering of the data that gives the lowest final loss on a held-out evaluation set? Submit a function that orders the data to achieve the best validation perplexity after one pass over the data.
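As a rough illustration of what an ordering function could look like, here is a minimal sketch that sorts chunks by one annotation score. The chunk layout (a list of dicts) and the "educational_value" key are assumptions made for this example; check inspect_data.py for the actual field names and format before adapting it.

```python
from typing import Any, Dict, List


def order_by_score(
    chunks: List[Dict[str, Any]],
    key: str = "educational_value",  # hypothetical annotation field name
    ascending: bool = True,
) -> List[Dict[str, Any]]:
    """Return a new list of chunks sorted by a single annotation score.

    With ascending=True, low-scoring chunks come first and high-scoring
    chunks are seen last during the single training pass.
    """
    return sorted(chunks, key=lambda c: c[key], reverse=not ascending)


# Toy data standing in for annotated C4 chunks (assumed structure).
toy_chunks = [
    {"text": "a", "educational_value": 0.2},
    {"text": "b", "educational_value": 0.9},
    {"text": "c", "educational_value": 0.5},
]
ordered = order_by_score(toy_chunks)
print([c["text"] for c in ordered])
```

Whether an ascending or descending curriculum helps is exactly what this track asks you to find out, so treat the default direction as a placeholder.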

Track 2, Filtering - Can you find a subset of the data that is better to train on than the entire dataset? Submit a subset such that training on a random ordering of it for a fixed number of steps (say 20k) gives the best validation perplexity.
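One simple baseline for filtering is to keep only the top-scoring fraction of chunks under a single annotation. The sketch below assumes chunks are dicts with an "educational_value" field; both the structure and the field name are hypothetical, so confirm them against the real data first.

```python
import math
from typing import Any, Dict, List


def keep_top_fraction(
    chunks: List[Dict[str, Any]],
    key: str = "educational_value",  # hypothetical annotation field name
    fraction: float = 0.5,
) -> List[Dict[str, Any]]:
    """Keep the highest-scoring `fraction` of chunks, preserving input order."""
    if not 0 < fraction <= 1:
        raise ValueError("fraction must be in (0, 1]")
    n_keep = max(1, math.ceil(len(chunks) * fraction))
    # Rank indices by score (highest first), then restore the original order.
    ranked = sorted(range(len(chunks)), key=lambda i: chunks[i][key], reverse=True)
    kept_indices = sorted(ranked[:n_keep])
    return [chunks[i] for i in kept_indices]


# Toy data standing in for annotated C4 chunks (assumed structure).
toy_data = [
    {"text": "a", "educational_value": 0.1},
    {"text": "b", "educational_value": 0.9},
    {"text": "c", "educational_value": 0.5},
    {"text": "d", "educational_value": 0.7},
]
subset = keep_top_fraction(toy_data, fraction=0.5)
print([c["text"] for c in subset])
```

Because training runs for a fixed number of steps, a smaller subset is repeated more often, so the fraction trades off quality against repetition.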

I need to give tons of credit to the awesome work of QuRating for annotating all of this data and Bingbin Liu for setting up a minimal training code infrastructure! This repo is a very lightweight wrapper around this effort.

Setup

Create a conda environment (the dependencies are plain Hugging Face, torch, etc., so you may be able to skip this step):

conda env create -f environment.yml

Modify the order or contents of the dataset by playing around in modify_data.py (possibly poking around using inspect_data.py) and then run the following. Keep filtering submissions in filter_data.py and ordering submissions in order_data.py.

python modify_data.py

Train a model using the settings in conf/config.yaml

python train.py

Evaluate a model using the settings in conf/eval_config.yaml

python eval.py
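Both tracks are scored by validation perplexity, which is the exponential of the mean per-token cross-entropy loss (in nats). The helper below is a generic sketch of that relationship and is not taken from eval.py; the function name and input format are assumptions.

```python
import math
from typing import Sequence


def perplexity(token_nlls: Sequence[float]) -> float:
    """Perplexity from per-token negative log-likelihoods measured in nats."""
    if not token_nlls:
        raise ValueError("need at least one token loss")
    mean_nll = sum(token_nlls) / len(token_nlls)
    return math.exp(mean_nll)


# A model that assigns probability 1/4 to every token has perplexity 4.
uniform_ppl = perplexity([math.log(4)] * 3)
print(round(uniform_ppl, 6))
```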

kothasuhas/lm-data-challenge
