How are you going about cleaning? #31

Open
cbsudux opened this issue Mar 29, 2023 · 4 comments

Comments

@cbsudux

cbsudux commented Mar 29, 2023

How are you going about cleaning this?

Manually or with GPT-4.

@HideLord
Contributor

I use a regex to search for a specific subset of instructions/inputs/outputs that are highly likely to be wrong. After that, I manually verify that each match is indeed incorrect, and then either query gpt-3.5-turbo or fix it by hand.
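
For readers who want to try the regex-first approach, here is a minimal sketch. It assumes entries are dicts with `instruction`, `input`, and `output` keys. The rule names and patterns in `SUSPECT_RULES` are hypothetical illustrations, not the expressions actually used on this dataset, and `flag_suspects` is an invented helper name.

```python
import re

# Hypothetical rules: (rule name, field to inspect, pattern).
# The real patterns used for this dataset are not given in the thread.
SUSPECT_RULES = [
    ("empty_output", "output", re.compile(r"^\s*$")),
    ("placeholder_input", "input", re.compile(r"<\s*no\s*input\s*>", re.IGNORECASE)),
    ("mentions_attachment", "instruction",
     re.compile(r"\b(?:image|photo|picture|url)\b", re.IGNORECASE)),
]


def flag_suspects(entries):
    """Return (index, [rule names]) for every entry matching at least one rule."""
    flagged = []
    for index, entry in enumerate(entries):
        hits = [
            name
            for name, field, pattern in SUSPECT_RULES
            if pattern.search(entry.get(field, ""))
        ]
        if hits:
            flagged.append((index, hits))
    return flagged


sample = [
    {"instruction": "Add 2 and 3.", "input": "", "output": "5"},
    {"instruction": "Describe the image below.", "input": "", "output": "A cat."},
    {"instruction": "Name a color.", "input": "<noinput>", "output": ""},
]

# Only the flagged entries go on to manual verification.
for index, hits in flag_suspects(sample):
    print(index, hits)
```

The regex pass only narrows the search; every flagged entry still needs a manual check, since a match is a hint rather than proof of an error.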

@gururise
Owner

gururise commented Mar 29, 2023

> How are you going about cleaning this?
>
> Manually or with GPT-4.

Much of it has been hand-curated with the help of various tools and regular expressions (see the tools directory). There is also a tool that can use GPT-3.5/4 to double-check the answers.

@ehartford

I bet one could convince the gpt-3.5 API to take in a line and decide whether it's acceptable or needs fixing, and, if it needs fixing, to suggest a fix, which could then be quick-reviewed (accept or reject) by a human.

@gururise
Owner

> I bet one could convince the gpt-3.5 API to take in a line and decide whether it's acceptable or needs fixing, and, if it needs fixing, to suggest a fix, which could then be quick-reviewed (accept or reject) by a human.

We have this capability already built out in one of the tools. I think the first 1000 entries have been checked in this manner.
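
The check, suggest, and accept-or-reject loop discussed above might look roughly like the sketch below. This is not the repository's tool. `Verdict`, `review`, and `stub_check` are hypothetical names, and `stub_check` is a hard-coded stand-in for the model call, so the example runs without any API access.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Verdict:
    acceptable: bool
    suggested_output: Optional[str] = None  # set only when a fix is proposed


def review(entries, check, approve):
    """Run `check` on each entry; apply a suggested fix only if `approve` accepts it."""
    cleaned = []
    for entry in entries:
        verdict = check(entry)
        if (
            not verdict.acceptable
            and verdict.suggested_output is not None
            and approve(entry, verdict.suggested_output)
        ):
            # Copy the entry so the original data is left untouched.
            entry = {**entry, "output": verdict.suggested_output}
        cleaned.append(entry)
    return cleaned


def stub_check(entry):
    # Stand-in for a model call (assumption): it knows a single correct answer.
    if entry["instruction"] == "Add 2 and 3." and entry["output"] != "5":
        return Verdict(acceptable=False, suggested_output="5")
    return Verdict(acceptable=True)


data = [{"instruction": "Add 2 and 3.", "input": "", "output": "6"}]

# An approver that accepts every suggestion; a human reviewer would decide here.
fixed = review(data, stub_check, approve=lambda entry, suggestion: True)
print(fixed[0]["output"])
```

In practice `approve` could prompt a reviewer with the original and suggested outputs side by side, so each decision stays a quick accept or reject.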

gururise mentioned this issue Apr 2, 2023