
Can you release the sharegpt dataset? #90

Closed
LZY-the-boys opened this issue Mar 31, 2023 · 26 comments
Labels: question (Further information is requested)

Comments

@LZY-the-boys

I am wondering, can the ShareGPT data be released?

@ari9dam

ari9dam commented Mar 31, 2023

If the data can't be released, could you please share the code for dataset crawling and all the processing you did to get Markdown from the HTML?

@MarkSchmidty

MarkSchmidty commented Mar 31, 2023

Up until two days ago, ShareGPT had an explore page which could easily be scraped. They removed that page to prevent scraping.


@Kreijstal

"Open-Source"

  • No Weights
  • No Dataset
  • No checkpoints

That's not open source. Not at all, don't claim to be open.

@merrymercy
Member

merrymercy commented Mar 31, 2023

Hi @Kreijstal, @LZY-the-boys and @ari9dam

Thanks for your interest! We plan to release the weights once we have addressed all concerns and have a low-resource version of the inference code ready. We released the demo first to get some early feedback on the model.

We have no current plans to release the dataset and will first communicate with the ShareGPT team.

The data cleaning script is used like this:

"""
Usage: python3 -m fastchat.data.clean_sharegpt --in sharegpt_html.json --out sharegpt_clean.json
"""

@timatom

timatom commented Mar 31, 2023

@merrymercy,

Regarding the dataset, is the decision not to release it out of respect for the ShareGPT team, who disabled their endpoint? My understanding is that they did so for security reasons, which I can respect.

If so, do you know of any efforts being made to build public datasets for training foundational models like Vicuna? If not, do you know of any resources that could help others interested in such efforts?

@Jeffwan
Contributor

Jeffwan commented Apr 1, 2023

@merrymercy It seems clean_sharegpt accepts a JSON file. I don't know whether https://sharegpt.com/ provides JSON or not. Do you have a process to convert the raw HTML pages into the JSON the script expects?
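
If the pages were saved as raw HTML, one possible way to produce that JSON is to parse each saved page and keep the per-message inner HTML in "value" (clean_sharegpt strips the markup afterwards). This is only a sketch, not the process the authors used, and the CSS selector below is a hypothetical placeholder; inspect the actual pages and adjust it.

# Rough sketch: convert saved ShareGPT share pages (HTML files) into the JSON layout
# that fastchat.data.clean_sharegpt appears to expect. Not the authors' pipeline.
import json
from pathlib import Path

from bs4 import BeautifulSoup  # pip install beautifulsoup4

records = []
for path in sorted(Path("saved_pages").glob("*.html")):
    soup = BeautifulSoup(path.read_text(encoding="utf-8"), "html.parser")
    conversations = []
    # "div.message" is a hypothetical selector; replace it with whatever wraps each turn.
    for i, block in enumerate(soup.select("div.message")):
        conversations.append({
            "from": "human" if i % 2 == 0 else "gpt",
            "value": block.decode_contents(),  # keep the inner HTML; it is cleaned later
        })
    records.append({"id": path.stem, "conversations": conversations})

with open("sharegpt_html.json", "w", encoding="utf-8") as f:
    json.dump(records, f, ensure_ascii=False, indent=2)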

@MarkSchmidty

MarkSchmidty commented Apr 2, 2023

ShareGPT Dataset:

Zipped JSONs with 90,000 conversations from ShareGPT, split into two files of 45k each:
part 1: https://files.catbox.moe/bhtp9i.zip
part 2: https://files.catbox.moe/ahoivx.zip

The format should work as-is for training. Use the cleaning tool to remove HTML markup: https://github.com/lm-sys/FastChat/blob/main/docs/commands/data_cleaning.md

(Note: I'm just relaying this info from someone who sent it my way. So I don't know anything more than anyone else.)


The entire pre-cleaned 90k conversation dataset is also available here: https://huggingface.co/datasets/anon8231489123/ShareGPT_Vicuna_unfiltered/tree/main/HTML_cleaned_raw_dataset

A pre-cleaned, English-only, "unfiltered," 2048-token-split version of the ShareGPT dataset, ready for fine-tuning, is available here: https://huggingface.co/datasets/anon8231489123/ShareGPT_Vicuna_unfiltered
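
For anyone reproducing the "English only, 2048 token split" variant from the raw HTML dump themselves, the FastChat data cleaning doc linked above describes a pipeline roughly along these lines; the module names come from that doc, but the exact flags may have changed since, so treat this as an approximation and check the doc (the split step caps conversations at the model's context length, 2048 for LLaMA):

python3 -m fastchat.data.clean_sharegpt --in sharegpt_html.json --out sharegpt_clean.json
python3 -m fastchat.data.optional_clean --in sharegpt_clean.json --out sharegpt_clean_en.json --keep-lang en
python3 -m fastchat.data.split_long_conversation --in sharegpt_clean_en.json --out sharegpt_clean_en_split.json --model-name /path/to/base-model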

@Kreijstal

@MarkSchmidty
You are doing god's work. Good job democratizing AI.

@timatom

timatom commented Apr 2, 2023

For all you scrapers out there, there's another site that also hosts ChatGPT conversations and is rather easy to scrape:

https://chatlogs.net/

It has around 80k conversations from what I can tell.

@BadisG

BadisG commented Apr 2, 2023

ShareGPT Dataset: Zipped JSONs with 90,000 conversations from ShareGPT … (quoting @MarkSchmidty's comment and links above)

Not all heroes wear capes!

@clulece

clulece commented Apr 3, 2023

@MarkSchmidty Thank you for providing the higher-quality version that has all the senseless/misguided OpenAI moralizing purged.

@DemonFemaleAlpha1

hello

@alanxmay
Contributor

alanxmay commented Apr 7, 2023

ShareGPT Dataset: Zipped JSONs with 90,000 conversations from ShareGPT … (quoting @MarkSchmidty's comment and links above)

I fine-tuned the 13B model using the dataset from the Hugging Face link above, but the model's performance was poor; in some cases it failed to correctly output the end-of-sequence token.
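
If a fine-tuned model never emits the end symbol, one thing worth ruling out is whether the end-of-sequence token was actually appended to each training target and is mapped to the expected id by the tokenizer. A minimal sanity check, assuming a LLaMA-style tokenizer (the model path below is only a placeholder for whatever base model is being fine-tuned):

# Check that the EOS token string in a training target maps to the tokenizer's eos_token_id.
# "huggyllama/llama-13b" is a placeholder; substitute your actual base model.
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("huggyllama/llama-13b")
target = "Sure, here is the answer." + tok.eos_token  # the training script should append EOS
ids = tok(target, add_special_tokens=False).input_ids

print("eos token:", tok.eos_token, "id:", tok.eos_token_id)
print("target ends with eos id:", ids[-1] == tok.eos_token_id)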

@BadisG

BadisG commented Apr 7, 2023

@alanxmay, you fine-tuned it with the unfiltered dataset?

@ethanyanjiali
Contributor

"Open-Source"

  • No Weights
  • No Dataset
  • No checkpoints

That's not open source. Not at all, don't claim to be open.

Don't take everything for granted. Given that OpenAI is so closed, I really respect Meta for releasing LLaMA, and also all the research groups that released follow-up work on LLaMA.

@BadisG

BadisG commented Apr 7, 2023

Don't take everything for granted. Given that OpenAI is so closed,

ClosedAI; they really parted ways with all the nice principles they once had.

@merrymercy added the question (Further information is requested) label on Apr 8, 2023
@zhisbug
Collaborator

zhisbug commented Apr 8, 2023

Closing this issue for now.

So far, we have released the

  • weights, 7B and 13B (and checkpoints)
  • training recipes
  • data processing scripts
  • various ways to run the bot on diverse hardware

We're unable to release the data due to various factors out of our control.

We'll keep pushing the limits and bring the community better and more open LLMs!

"Open-Source"

  • No Weights
  • No Dataset
  • No checkpoints

That's not open source. Not at all, don't claim to be open.

@zhisbug closed this as completed Apr 8, 2023
@alanxmay
Contributor

alanxmay commented Apr 10, 2023

@alanxmay, you fine-tuned it with the unfiltered dataset?

@BadisG Yes, I am using this one: https://huggingface.co/datasets/anon8231489123/ShareGPT_Vicuna_unfiltered/blob/main/ShareGPT_unfiltered_cleaned_split.json

@BadisG

BadisG commented Apr 10, 2023

@alanxmay How did it go? Did you manage to make it better?

@eeric

eeric commented Apr 15, 2023

@MarkSchmidty
How do you generate sg_90k_part1_clear.json?
In addition, how can I crawl data from sharegpt.com? Do you have a crawling script?

@MarkSchmidty

MarkSchmidty commented Apr 15, 2023

I didn't generate these. I was sent them by an anonymous source.

It's not possible to crawl ShareGPT anymore. ShareGPT used to have a page you could crawl, but now it does not.

@eeric

eeric commented Apr 15, 2023

ok, that's sad news.

@timatom

timatom commented Apr 16, 2023

ok, that's sad news.

Theoretically, you could scrape Twitter. Anything someone shares publicly on social media is fair game to scrape, technically.

@abhinavchoudhry

ok, that's sad news.

Theoretically, you could scrape Twitter. Anything someone shares publicly on social media is fair game to scrape, technically.

Yeah, but they are charging exorbitant fees for scraping now. Twitter is as good as closed now, at least for ordinary developers and researchers. Academic access RIP.

@timatom

timatom commented Jun 5, 2023

Yeah, but they are charging exorbitant fees for scraping now. Twitter is as good as closed now, at least for ordinary developers and researchers. Academic access RIP. (quoting @abhinavchoudhry above)

Ya, it's another case of sad news. Not sure how this is all going to play out long-term. Best of luck to everyone.

@kkkparty

@alanxmay, you fine-tuned it with the unfiltered dataset?

@BadisG Yes, I am using this one: https://huggingface.co/datasets/anon8231489123/ShareGPT_Vicuna_unfiltered/blob/main/ShareGPT_unfiltered_cleaned_split.json

I used this dataset with the Baichuan 7B model, with the following command:

CUDA_VISIBLE_DEVICES="7" torchrun --nproc_per_node=1 --master_port=20001 fastchat/train/train_baichuan.py \
  --model_name_or_path /workspace/baichuan/model_para/Baichuan-7B \
  --data_path /workspace/baichuan/dataset/ShareGPT_Vicuna_unfiltered/ShareGPT_V3_unfiltered_cleaned_split.json \
  --bf16 False \
  --output_dir output_baichuan \
  --num_train_epochs 3 \
  --per_device_train_batch_size 1 \
  --per_device_eval_batch_size 1 \
  --gradient_accumulation_steps 16 \
  --evaluation_strategy "no" \
  --save_strategy "steps" \
  --save_steps 1200 \
  --save_total_limit 10 \
  --learning_rate 2e-5 \
  --weight_decay 0. \
  --warmup_ratio 0.03 \
  --lr_scheduler_type "cosine" \
  --logging_steps 1 \
  --fsdp_transformer_layer_cls_to_wrap 'LlamaDecoderLayer' \
  --tf32 False \
  --model_max_length 64 \
  --gradient_checkpointing True \
  --lazy_preprocess True

But it crashed with dataset problems. Is there some procedure I should follow to use the ShareGPT dataset with the Baichuan 7B weights?
