leetcode-python-dataset

Code for building and publishing the justindal/leetcode-python-dataset dataset on Hugging Face.

Merges two open-source LeetCode datasets into a unified schema with consistent formatting, field normalisation, and solution validation.
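The merge step can be pictured roughly as follows. This is a minimal sketch, not the repository's actual code: the raw input keys (`slug`, `description`), the helper names, the rule that the first source wins on a duplicate `task_id`, and the use of `ast.parse` as the validity check are all assumptions made for illustration.

```python
import ast

# Canonical difficulty labels used by the unified schema.
DIFFICULTIES = {"easy": "Easy", "medium": "Medium", "hard": "Hard"}


def normalise_record(raw: dict, source: str) -> dict:
    """Map a raw record onto the unified schema.

    The input keys ("slug", "description", ...) are hypothetical.
    """
    # Strip tag whitespace and drop duplicates while keeping the original order.
    tags = list(dict.fromkeys(t.strip() for t in raw.get("tags", []) if t.strip()))
    return {
        "task_id": raw["slug"].strip().lower(),
        "difficulty": DIFFICULTIES[raw["difficulty"].strip().lower()],
        "tags": tags,
        "problem": raw["description"].strip(),
        "starter_code": raw.get("starter_code", "").rstrip(),
        "solution": raw["solution"].rstrip(),
        "source": source,
    }


def is_valid_solution(code: str) -> bool:
    """Assumed validity check: the solution must at least parse as Python."""
    try:
        ast.parse(code)
    except (SyntaxError, ValueError):
        return False
    return True


def merge(primary: list[dict], secondary: list[dict]) -> list[dict]:
    """Combine two normalised lists, keeping the first row seen per task_id."""
    merged: dict[str, dict] = {}
    for row in primary + secondary:
        if is_valid_solution(row["solution"]):
            merged.setdefault(row["task_id"], row)
    return list(merged.values())
```

Passing the newfacade rows as `primary` would make that source take precedence whenever both datasets contain the same problem; whether the real build resolves duplicates this way is not stated here.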

Dataset

| Split | Rows | Source |
| --- | --- | --- |
| train | 5000 | newfacade + greengerong |
| test | 228 | newfacade only |

Schema

| Column | Type | Description |
| --- | --- | --- |
| task_id | string | Problem slug, e.g. two-sum |
| difficulty | string | Easy, Medium, or Hard |
| tags | list[string] | Topic tags, e.g. ["Array", "Hash Table"] |
| problem | string | Full problem description |
| starter_code | string | Function signature to complete |
| solution | string | Accepted Python solution |
| source | string | newfacade or greengerong |
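As a quick sanity check, a row can be compared against the columns listed above. The sketch below is illustrative only: `schema_errors` and the constant names are hypothetical helpers, not part of this package, and the allowed values are taken directly from the schema descriptions.

```python
# Column names and Python types corresponding to the schema above.
EXPECTED_COLUMNS = {
    "task_id": str,
    "difficulty": str,
    "tags": list,
    "problem": str,
    "starter_code": str,
    "solution": str,
    "source": str,
}
ALLOWED_DIFFICULTIES = {"Easy", "Medium", "Hard"}
ALLOWED_SOURCES = {"newfacade", "greengerong"}


def schema_errors(row: dict) -> list[str]:
    """Return a list of human-readable schema problems (empty if none)."""
    errors = []
    for name, expected_type in EXPECTED_COLUMNS.items():
        if name not in row:
            errors.append(f"missing column: {name}")
        elif not isinstance(row[name], expected_type):
            errors.append(f"wrong type for {name}: {type(row[name]).__name__}")
    if errors:
        return errors  # value checks below assume well-typed columns
    if row["difficulty"] not in ALLOWED_DIFFICULTIES:
        errors.append(f"unexpected difficulty: {row['difficulty']}")
    if row["source"] not in ALLOWED_SOURCES:
        errors.append(f"unexpected source: {row['source']}")
    if not all(isinstance(tag, str) for tag in row["tags"]):
        errors.append("tags must be a list of strings")
    return errors
```

Rows obtained through the Hugging Face `datasets` library (for example with `load_dataset("justindal/leetcode-python-dataset")`) are plain dictionaries, so they should be usable with a helper like this as-is.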

Sources

Usage

uv:

git clone https://github.com/justindal/leetcode-python-dataset
cd leetcode-python-dataset
uv sync

Run the build:

uv run leetcode-dataset

# or
./.venv/bin/leetcode-dataset

pip:

git clone https://github.com/justindal/leetcode-python-dataset
cd leetcode-python-dataset
python -m venv .venv && source .venv/bin/activate
pip install -e .

Build the dataset locally:

leetcode-dataset

# or
python3 main.py

Citation

newfacade/LeetCodeDataset:

@misc{xia2025leetcodedatasettemporaldatasetrobust,
	title={LeetCodeDataset: A Temporal Dataset for Robust Evaluation and Efficient Training of Code LLMs},
	author={Yunhui Xia and Wei Shen and Yan Wang and Jason Klein Liu and Huifeng Sun and Siyue Wu and Jian Hu and Xiaolong Xu},
	year={2025},
	eprint={2504.14655},
	archivePrefix={arXiv},
	primaryClass={cs.LG},
	url={https://arxiv.org/abs/2504.14655},
}

License

Apache 2.0
