load JSON files, get the errors #3333
Hi! The message you're getting is not an error. It simply says that your JSON dataset is being prepared to a location in your cache directory.
But I want to load a local JSON file from the command line. `squad-retrain-data/train-v2.0.json` is the local JSON file; how do I load it and map it to a specific structure?
You can load it with `load_dataset`. Also feel free to share your code.
I want to load a SQuAD-format JSON dataset instead.
If your JSON has the same format as the SQuAD dataset, then you need to pass `field="data"`:

```python
dataset = datasets.load_dataset('json', data_files=args.dataset, field="data")
```

Let me know if that helps :)
Yes, the code works, but the format is not as expected:

```
train_dataset: Dataset({
```

I want the JSON to have the same features as before. https://github.com/huggingface/datasets/blob/master/datasets/squad_v2/squad_v2.py is the script dealing with SQuAD, but how can I apply it when loading from JSON?
Ok I see, you have the paragraphs, so you just need to process them to extract the questions and answers. I think you can process the SQuAD-like data this way:

```python
def process_squad(articles):
    out = {
        "title": [],
        "context": [],
        "question": [],
        "id": [],
        "answers": [],
    }
    for title, paragraphs in zip(articles["title"], articles["paragraphs"]):
        for paragraph in paragraphs:
            for qa in paragraph["qas"]:
                out["title"].append(title)
                out["context"].append(paragraph["context"])
                out["question"].append(qa["question"])
                out["id"].append(qa["id"])
                out["answers"].append({
                    "answer_start": [answer["answer_start"] for answer in qa["answers"]],
                    "text": [answer["text"] for answer in qa["answers"]],
                })
    return out

dataset = dataset.map(process_squad, batched=True, remove_columns=["paragraphs"])
```

I adapted the code from squad.py. The code takes as input a batch of articles (title + paragraphs) and gets all the questions and answers from the JSON structure. The output is a dataset with the `title`, `context`, `question`, `id`, and `answers` columns. Let me know if that helps!
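As a sanity check, the flattening logic above can be exercised on a tiny hand-made batch. The article, question, and id below are made up for illustration; no library is needed, since `process_squad` is plain Python:

```python
def process_squad(articles):
    # Flatten a batch of SQuAD-style articles into one row per question.
    out = {"title": [], "context": [], "question": [], "id": [], "answers": []}
    for title, paragraphs in zip(articles["title"], articles["paragraphs"]):
        for paragraph in paragraphs:
            for qa in paragraph["qas"]:
                out["title"].append(title)
                out["context"].append(paragraph["context"])
                out["question"].append(qa["question"])
                out["id"].append(qa["id"])
                out["answers"].append({
                    "answer_start": [a["answer_start"] for a in qa["answers"]],
                    "text": [a["text"] for a in qa["answers"]],
                })
    return out

# A toy batch with one article, one paragraph, and one question.
batch = {
    "title": ["Example Article"],
    "paragraphs": [[
        {
            "context": "Paris is the capital of France.",
            "qas": [{
                "id": "q1",
                "question": "What is the capital of France?",
                "answers": [{"text": "Paris", "answer_start": 0}],
            }],
        },
    ]],
}

flat = process_squad(batch)
print(flat["question"])  # ['What is the capital of France?']
```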
Yes, this works. But how can I get the training output while training on SQuAD with the Trainer?
I think you may need to implement your own Trainer, subclassing the existing `Trainer` class.
Is there a function I can override to do this?
OK, I overrode `compute_loss`, thank you.
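For readers following along: the pattern being discussed is subclassing `Trainer` and overriding `compute_loss`. The sketch below illustrates the override pattern with a minimal stand-in base class rather than the real `transformers.Trainer`, so all names here are illustrative only:

```python
# Illustrative stand-in for transformers.Trainer: the real class has far
# more machinery, but the override pattern is the same.
class BaseTrainer:
    def compute_loss(self, model, inputs):
        # Base behavior: run the model to obtain a loss value.
        return model(inputs)


class LoggingTrainer(BaseTrainer):
    """Overrides compute_loss to record each step's loss."""

    def __init__(self):
        self.loss_history = []

    def compute_loss(self, model, inputs):
        loss = super().compute_loss(model, inputs)
        self.loss_history.append(loss)  # expose the training output
        return loss


# "model" is a stand-in callable; with transformers it would be an nn.Module.
trainer = LoggingTrainer()
loss = trainer.compute_loss(lambda x: x * 0.5, 4.0)
```

With the real library, the subclass would derive from `transformers.Trainer` and the overridden method would call `super().compute_loss(...)` in the same way.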
Hi, I added a field `example_id`, but I can't see it in the `compute_loss` function. How can I do this? Below is the information of the inputs.
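One likely reason an extra dataset column such as `example_id` never reaches `compute_loss`: by default, `Trainer` drops dataset columns that don't match the model's forward signature. In `transformers` this is controlled by the `remove_unused_columns` training argument; a config sketch (the output directory is taken from the command in this thread):

```python
# Assumes the transformers library is installed. Setting
# remove_unused_columns=False keeps extra dataset columns (e.g. example_id)
# in the batches handed to compute_loss instead of silently dropping them.
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="./re_trained_model/",
    remove_unused_columns=False,
)
```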
Hi, has this bug been fixed? When I load JSON files, I get the same errors from the command:

```
!python3 run.py --do_train --task qa --dataset squad-retrain-data/train-v2.0.json --output_dir ./re_trained_model/
```

I changed the dataset loading to JSON by referring to https://huggingface.co/docs/datasets/loading.html:

```python
dataset = datasets.load_dataset('json', data_files=args.dataset)
```

Errors:

```
Downloading and preparing dataset json/default (download: Unknown size, generated: Unknown size, post-processed: Unknown size, total: Unknown size) to /root/.cache/huggingface/datasets/json/default-c1e124ad488911b8/0.0.0/45636811569ec4a6630521c18235dfbbab83b7ab572e3393c5ba68ccabe98264...
```
Originally posted by @yanllearnn in #730 (comment)