Training data format for Magicoder-OSS-Instruct-75K #23

Closed
VoiceBeer opened this issue Dec 27, 2023 · 4 comments

@VoiceBeer

Hi, thx for the work!

I was wondering how you format the OSS75K data for training. Is it an Alpaca-style format like this:

You are an exceptionally intelligent coding assistant that consistently delivers accurate and reliable responses to user instructions.

@@ Instruction
{instruction} # problem column of the OSS75k dataset

@@ Response
# solution column of the OSS75k dataset

Thx

UniverseFly self-assigned this Dec 28, 2023
UniverseFly added the "question" label (Further information is requested) Dec 28, 2023

@UniverseFly
Member

Hi, here is the exact format we used when finetuning the model on OSS75K:

You are an exceptionally intelligent coding assistant that consistently delivers accurate and reliable responses to user instructions.

@@ Instruction
Write a solution to the following coding problem:
{problem}

@@ Response
{solution}

We haven't tested other templates ourselves, but we encourage anyone interested to explore them!
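
For readers who want to reproduce the template above in code, here is a minimal sketch rather than the project's own implementation. The names `MAGICODER_PROMPT` and `format_example` are hypothetical, and the exact whitespace (one blank line before each `@@` header, the solution starting on the line right after `@@ Response`) is an assumption based on how the template renders in this thread.

```python
# Template transcribed from the reply above. The exact newline placement is an
# assumption; check it against the repository before relying on it.
MAGICODER_PROMPT = (
    "You are an exceptionally intelligent coding assistant that consistently "
    "delivers accurate and reliable responses to user instructions.\n"
    "\n"
    "@@ Instruction\n"
    "Write a solution to the following coding problem:\n"
    "{problem}\n"
    "\n"
    "@@ Response\n"
)


def format_example(problem: str, solution: str) -> str:
    """Build one training text from a problem/solution pair (hypothetical helper)."""
    # Only the template is parsed by str.format, so braces inside `problem`
    # are inserted verbatim. The solution is appended without formatting.
    return MAGICODER_PROMPT.format(problem=problem) + solution


if __name__ == "__main__":
    print(format_example("Add two numbers.", "def add(a, b):\n    return a + b"))
```

Keeping the prompt and the solution as separate pieces makes it straightforward to mask the prompt tokens in the loss, if the training setup calls for that.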

@VoiceBeer
Author

Thx! Appreciate it :>

@shatealaboxiaowang

Thx, is there a Python script to convert the Magicoder-OSS-Instruct-75K dataset downloaded from Hugging Face into the above format (instruction-response pairs)?

@shatealaboxiaowang

@UniverseFly Thx, I found the script: it is preprocess_data.py.
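
For illustration only, the conversion asked about above could look roughly like the sketch below. This is not the repository's preprocess_data.py. It assumes the dataset has been exported to a local JSONL file whose records carry the `problem` and `solution` columns mentioned earlier in the thread; the output keys `instruction` and `response` and both function names are hypothetical.

```python
import json
from typing import Dict, Iterable, Iterator

# Instruction wording taken from the maintainer's template in this thread.
INSTRUCTION_PREFIX = "Write a solution to the following coding problem:\n"


def to_instruction_response(
    records: Iterable[Dict[str, str]],
) -> Iterator[Dict[str, str]]:
    """Map dataset records to instruction-response pairs (hypothetical helper)."""
    for record in records:
        yield {
            # Output key names are an assumption, not the project's schema.
            "instruction": INSTRUCTION_PREFIX + record["problem"],
            "response": record["solution"],
        }


def convert_jsonl(src_path: str, dst_path: str) -> int:
    """Convert a JSONL export of the dataset; returns the number of pairs written."""
    count = 0
    with open(src_path, encoding="utf-8") as src, \
            open(dst_path, "w", encoding="utf-8") as dst:
        # Skip blank lines so a trailing newline does not break json.loads.
        rows = (json.loads(line) for line in src if line.strip())
        for pair in to_instruction_response(rows):
            dst.write(json.dumps(pair, ensure_ascii=False) + "\n")
            count += 1
    return count
```

The system prompt and the `@@ Instruction` / `@@ Response` headers are left out here on purpose, so the pairs stay template-agnostic and the template can be applied at training time.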
