190K data #4

Open
manojitrtc1in opened this issue Aug 21, 2023 · 2 comments

Comments

@manojitrtc1in

Where can I find the codeup_190k.json file? I want to train with this data. Thanks.

@juyongjiang
Owner

juyongjiang commented Aug 23, 2023

@manojitrtc1in Sorry for the late reply. I obtained codeup_190k.json by filtering an existing dataset from Hugging Face. You can download the original 200k data here: https://huggingface.co/datasets/rombodawg/Legacy_MegaCodeTraining200k

Then run the following commands to obtain the higher-quality instruction data, i.e., codeup_190k.json (I have not put it on GitHub because of its large size).

cd data
python preprocess.py
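
The thread does not show what preprocess.py actually does, so the following is only a hypothetical sketch of the kind of filtering that could reduce a raw instruction dataset to a cleaner subset. It assumes the raw file is a JSON list of records with Alpaca-style `instruction` and `output` fields; those field names, the function names, and the filtering rules (drop empty fields, drop exact duplicates) are all assumptions, not the repository's real logic.

```python
import json
from pathlib import Path


def filter_records(records, min_chars=1):
    """Hypothetical filter: keep records whose fields are non-empty,
    and drop exact (instruction, output) duplicates."""
    seen = set()
    kept = []
    for rec in records:
        # Field names are assumed; missing or None values become "".
        instruction = str(rec.get("instruction") or "").strip()
        output = str(rec.get("output") or "").strip()
        if len(instruction) < min_chars or len(output) < min_chars:
            continue  # skip records with an empty instruction or output
        key = (instruction, output)
        if key in seen:
            continue  # skip exact duplicates
        seen.add(key)
        kept.append({**rec, "instruction": instruction, "output": output})
    return kept


def preprocess_file(src, dst):
    """Read a JSON list from src, filter it, and write the result to dst.
    Returns (number of input records, number of kept records)."""
    records = json.loads(Path(src).read_text(encoding="utf-8"))
    kept = filter_records(records)
    Path(dst).write_text(
        json.dumps(kept, ensure_ascii=False, indent=2), encoding="utf-8"
    )
    return len(records), len(kept)
```

Under these assumptions, calling `preprocess_file("raw_200k.json", "codeup_190k.json")` (the input filename is a placeholder) would write the filtered records and return the before and after counts. The real script may apply different or additional criteria.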

@huliang2016

Could you please share the data? rombodawg/Legacy_MegaCodeTraining200k is missing...
