can we replace https://the-eye.eu/public/AI/pile/val.jsonl.zst #20

Open
luchangli03 opened this issue Jun 27, 2023 · 3 comments

Comments

@luchangli03

I can't download the file "https://the-eye.eu/public/AI/pile/val.jsonl.zst" in get_calib_dataset. Can we use other data to replace it?
Thanks a lot.

@tonylins
Contributor

Hi, you can swap in another general-domain corpus and see how it works. AWQ should be quite robust to the calibration data distribution, but the final results might be a bit different.

@abhinavkulkarni

In fact, the entire Pile dataset is not available on the website. I'm getting a 404 error, and I'm not sure whether that is temporary or permanent.

I'm not aware of any mirrors either.

@abhinavkulkarni

The Pile dataset has been uploaded again, so AWQ should work.

In case it disappears again, you can replace

dataset = load_dataset("json", data_files="https://the-eye.eu/public/AI/pile/val.jsonl.zst", split="train")

with

dataset = load_dataset("imdb", "plain_text", split="train")

The idea is to pipe a small portion of training data through the LLM to search for the optimal AWQ parameters, and the IMDB reviews are very likely part of the training data for most of these big LLMs.
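If you'd rather not edit the line by hand each time the Pile link goes down, here is a rough sketch of a fallback that tries the Pile URL first and drops to IMDB when the download fails. The helper names (`first_available`, `take_texts`, `pile_val`, `imdb_train`) are made up for illustration and are not part of the AWQ repo; the two `load_dataset` calls are the ones quoted above, and the sketch assumes each record has a "text" field, which holds for both datasets.

```python
from typing import Any, Callable, Iterable, List, Sequence, Tuple


def first_available(loaders: Sequence[Tuple[str, Callable[[], Any]]]) -> Tuple[str, Any]:
    """Try each (name, loader) pair in order and return the first that loads."""
    errors = []
    for name, loader in loaders:
        try:
            return name, loader()
        except Exception as exc:  # e.g. a 404 while the Pile mirror is down
            errors.append(f"{name}: {exc}")
    raise RuntimeError("no calibration corpus could be loaded: " + "; ".join(errors))


def take_texts(dataset: Iterable[dict], n_samples: int = 128) -> List[str]:
    """Collect up to n_samples non-empty, stripped "text" fields."""
    texts: List[str] = []
    for record in dataset:
        text = record["text"].strip()
        if text:
            texts.append(text)
        if len(texts) == n_samples:
            break
    return texts


def pile_val():
    # Hypothetical wrapper around the original call from the repo.
    from datasets import load_dataset
    return load_dataset(
        "json",
        data_files="https://the-eye.eu/public/AI/pile/val.jsonl.zst",
        split="train",
    )


def imdb_train():
    # Hypothetical wrapper around the IMDB replacement suggested above.
    from datasets import load_dataset
    return load_dataset("imdb", "plain_text", split="train")


# Usage (needs network access and the `datasets` package):
#   name, dataset = first_available([("pile", pile_val), ("imdb", imdb_train)])
#   texts = take_texts(dataset, n_samples=128)
```

The `datasets` import sits inside the wrappers so the fallback logic can be tried without the package installed. Any other general-domain corpus with a "text" field could be appended to the list in the same way, though, as noted earlier in the thread, the final results might differ a bit.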
