
load JSON files, get the errors #3333
Closed · PatricYan opened this issue Nov 28, 2021 · 12 comments

@PatricYan

Hi, has this bug been fixed? When I load JSON files, I get the same errors from the command:
!python3 run.py --do_train --task qa --dataset squad-retrain-data/train-v2.0.json --output_dir ./re_trained_model/

I changed the dataset loading to JSON by referring to https://huggingface.co/docs/datasets/loading.html:
dataset = datasets.load_dataset('json', data_files=args.dataset)

Errors:
Downloading and preparing dataset json/default (download: Unknown size, generated: Unknown size, post-processed: Unknown size, total: Unknown size) to /root/.cache/huggingface/datasets/json/default-c1e124ad488911b8/0.0.0/45636811569ec4a6630521c18235dfbbab83b7ab572e3393c5ba68ccabe98264...

Originally posted by @yanllearnn in #730 (comment)

@lhoestq (Member) commented Nov 29, 2021

Hi! The message you're getting is not an error. It simply says that your JSON dataset is being prepared at a location under /root/.cache/huggingface/datasets.
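
For illustration (an editor's sketch, not code from this thread), here is a minimal example of that load; the file path is the one from the command above:

import datasets

# Loading a local JSON file: the "Downloading and preparing dataset" message is
# informational, not an error. The result is a DatasetDict with a "train" split.
dataset = datasets.load_dataset("json", data_files="squad-retrain-data/train-v2.0.json")
print(dataset)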

@PatricYan (Author)

But I want to load a local JSON file with the command:
python3 run.py --do_train --task qa --dataset squad-retrain-data/train-v2.0.json --output_dir ./re_trained_model/

squad-retrain-data/train-v2.0.json is the local JSON file. How do I load it and map it to a specific structure?

@lhoestq (Member) commented Nov 29, 2021

You can load it with dataset = datasets.load_dataset('json', data_files=args.dataset), as you said.
Then, if you need to apply additional processing to map it to a special structure, you can rename columns or use dataset.map. For more information, check the documentation here: https://huggingface.co/docs/datasets/process.html
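
As a hedged illustration of those two options (the column names and the add_prefix function below are made up for the example):

# Rename a column; returns a new dataset with the column renamed.
dataset = dataset.rename_column("old_name", "new_name")

# Or transform every example with map:
def add_prefix(example):
    example["question"] = "Q: " + example["question"]
    return example

dataset = dataset.map(add_prefix)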

Also feel free to share your run.py code so we can take a look

@PatricYan (Author) commented Nov 29, 2021

# Dataset selection
    if args.dataset.endswith('.json') or args.dataset.endswith('.jsonl'):
        dataset_id = None
        # Load from local json/jsonl file
        dataset = datasets.load_dataset('json', data_files=args.dataset)
        # By default, the "json" dataset loader places all examples in the train split,
        # so if we want to use a jsonl file for evaluation we need to get the "train" split
        # from the loaded dataset
        eval_split = 'train'
    else:
        default_datasets = {'qa': ('squad',), 'nli': ('snli',)}
        dataset_id = tuple(args.dataset.split(':')) if args.dataset is not None else \
            default_datasets[args.task]
        # MNLI has two validation splits (one with matched domains and one with mismatched domains). Most datasets just have one "validation" split
        eval_split = 'validation_matched' if dataset_id == ('glue', 'mnli') else 'validation'
        # Load the raw data
        dataset = datasets.load_dataset(*dataset_id)

I want to load the SQuAD dataset from a JSON file instead of dataset = datasets.load_dataset('squad'), in order to retrain the model.

@lhoestq (Member) commented Nov 29, 2021

If your JSON has the same format as the SQuAD dataset, then you need to pass field="data" to load_dataset, since the SQuAD format is one big JSON object in which the "data" field contains the list of questions and answers.

dataset = datasets.load_dataset('json', data_files=args.dataset, field="data")
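
As an aside (an editor's simplified sketch of the public SQuAD layout, not from this thread), this is why field="data" is needed:

# Roughly the SQuAD-style top-level layout (simplified):
squad_like = {
    "version": "v2.0",
    "data": [  # <-- load_dataset reads this list when field="data" is passed
        {
            "title": "...",
            "paragraphs": [
                {
                    "context": "...",
                    "qas": [
                        {"id": "...", "question": "...",
                         "answers": [{"text": "...", "answer_start": 0}]},
                    ],
                },
            ],
        },
    ],
}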

Let me know if that helps :)

@PatricYan (Author) commented Nov 29, 2021

Yes, the code works, but the format is not as expected.

dataset = datasets.load_dataset('json', data_files=args.dataset, field="data")
python3 run.py --do_train --task qa --dataset squad --output_dir ./re_trained_model/

************ train_dataset: Dataset({
features: ['id', 'title', 'context', 'question', 'answers'],
num_rows: 87599
})

python3 run.py --do_train --task qa --dataset squad-retrain-data/train-v2.0.json --output_dir ./re_trained_model/

************ train_dataset: Dataset({
features: ['title', 'paragraphs'],
num_rows: 442
})

I want the JSON-loaded dataset to have the same features as before. https://github.com/huggingface/datasets/blob/master/datasets/squad_v2/squad_v2.py is the script that processes SQuAD, but how can I apply it when loading from JSON?

@lhoestq (Member) commented Nov 29, 2021

OK, I see: you have the paragraphs, so you just need to process them to extract the questions and answers. I think you can process the SQuAD-like data this way:

def process_squad(articles):
    # Flatten a batch of SQuAD-style articles (title + paragraphs) into one
    # row per question, matching the features of the original "squad" dataset.
    out = {
        "title": [],
        "context": [],
        "question": [],
        "id": [],
        "answers": [],
    }
    for title, paragraphs in zip(articles["title"], articles["paragraphs"]):
        for paragraph in paragraphs:
            for qa in paragraph["qas"]:
                out["title"].append(title)
                out["context"].append(paragraph["context"])
                out["question"].append(qa["question"])
                out["id"].append(qa["id"])
                # SQuAD stores answers as parallel lists of start offsets and texts
                out["answers"].append({
                    "answer_start": [answer["answer_start"] for answer in qa["answers"]],
                    "text": [answer["text"] for answer in qa["answers"]],
                })
    return out

dataset = dataset.map(process_squad, batched=True, remove_columns=["paragraphs"])

I adapted the code from squad.py. The code takes as input a batch of articles (title + paragraphs) and gets all the questions and answers from the JSON structure.

The output is a dataset with features: ['answers', 'context', 'id', 'question', 'title']

Let me know if that helps!
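
One note on the design, as a hedged editor's aside: batched=True is what lets process_squad return more rows than it receives, so each article row can expand into one row per question; without it, map expects exactly one output example per input example.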

@PatricYan (Author) commented Nov 30, 2021

Yes, this works. But how do I get the training output while training on SQuAD with the Trainer,
for example https://github.com/huggingface/transformers/blob/master/examples/pytorch/question-answering/trainer_qa.py?
I want the training inputs, labels, and outputs for every epoch and step, to produce a training-dynamics graph.

@lhoestq (Member) commented Nov 30, 2021

I think you may need to implement your own Trainer, subclassing QuestionAnsweringTrainer for example.
That way you have the flexibility to save all the inputs/outputs used at each step.
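
As a hedged sketch of that idea (an editor's addition, not code from this thread): subclass the example's QuestionAnsweringTrainer and override compute_loss to record what each step sees. The step_records attribute below is a made-up name, not a Trainer API:

from trainer_qa import QuestionAnsweringTrainer  # from the transformers QA example

class LoggingTrainer(QuestionAnsweringTrainer):
    def compute_loss(self, model, inputs, return_outputs=False):
        # Run the model ourselves so the outputs stay available for logging.
        outputs = model(**inputs)
        loss = outputs.loss
        # Record what is needed for the training-dynamics graph.
        if not hasattr(self, "step_records"):
            self.step_records = []
        self.step_records.append({
            "step": self.state.global_step,
            "start_positions": inputs["start_positions"].detach().cpu(),
            "end_positions": inputs["end_positions"].detach().cpu(),
            "loss": loss.detach().cpu().item(),
        })
        return (loss, outputs) if return_outputs else loss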

@PatricYan (Author)

Is there a function I can override to do this?

@PatricYan (Author)

> Is there a function I can override to do this?

OK, I overrode compute_loss, thank you.

@PatricYan (Author)

Hi, I added a field example_id, but I can't see it in the compute_loss function. How can I do this? Below is the content of inputs:

*********************** inputs: {'attention_mask': tensor([[1, 1, 1,  ..., 0, 0, 0],
        [1, 1, 1,  ..., 0, 0, 0],
        [1, 1, 1,  ..., 0, 0, 0],
        ...,
        [1, 1, 1,  ..., 0, 0, 0],
        [1, 1, 1,  ..., 0, 0, 0],
        [1, 1, 1,  ..., 0, 0, 0]], device='cuda:0'), 'end_positions': tensor([ 25,  97,  93,  44,  25, 112, 109, 134], device='cuda:0'), 'input_ids': tensor([[ 101, 2054, 2390,  ...,    0,    0,    0],
        [ 101, 2054, 2515,  ...,    0,    0,    0],
        [ 101, 2054, 2106,  ...,    0,    0,    0],
        ...,
        [ 101, 2339, 2001,  ...,    0,    0,    0],
        [ 101, 2054, 2515,  ...,    0,    0,    0],
        [ 101, 2054, 2003,  ...,    0,    0,    0]], device='cuda:0'), 'start_positions': tensor([ 20,  90,  89,  41,  25,  96, 106, 132], device='cuda:0'), 'token_type_ids': tensor([[0, 0, 0,  ..., 0, 0, 0],
        [0, 0, 0,  ..., 0, 0, 0],
        [0, 0, 0,  ..., 0, 0, 0],
        ...,
        [0, 0, 0,  ..., 0, 0, 0],
        [0, 0, 0,  ..., 0, 0, 0],
        [0, 0, 0,  ..., 0, 0, 0]], device='cuda:0')}

# This function preprocesses a question answering dataset, tokenizing the question and context text
# and finding the right offsets for the answer spans in the tokenized context (to use as labels).
# Adapted from https://github.com/huggingface/transformers/blob/master/examples/pytorch/question-answering/run_qa.py
def prepare_train_dataset_qa(examples, tokenizer, max_seq_length=None):
    questions = [q.lstrip() for q in examples["question"]]
    max_seq_length = tokenizer.model_max_length
    # tokenize both questions and the corresponding context
    # if the context length is longer than max_length, we split it to several
    # chunks of max_length
    tokenized_examples = tokenizer(
        questions,
        examples["context"],
        truncation="only_second",
        max_length=max_seq_length,
        stride=min(max_seq_length // 2, 128),
        return_overflowing_tokens=True,
        return_offsets_mapping=True,
        padding="max_length"
    )

    # Since one example might give us several features if it has a long context,
    # we need a map from a feature to its corresponding example.
    sample_mapping = tokenized_examples.pop("overflow_to_sample_mapping")
    # The offset mappings will give us a map from token to character position
    # in the original context. This will help us compute the start_positions
    # and end_positions to get the final answer string.
    offset_mapping = tokenized_examples.pop("offset_mapping")

    tokenized_examples["start_positions"] = []
    tokenized_examples["end_positions"] = []

    tokenized_examples["example_id"] = []

    for i, offsets in enumerate(offset_mapping):
        input_ids = tokenized_examples["input_ids"][i]
        # We will label features not containing the answer the index of the CLS token.
        cls_index = input_ids.index(tokenizer.cls_token_id)
        sequence_ids = tokenized_examples.sequence_ids(i)
        # from the feature idx to sample idx
        sample_index = sample_mapping[i]
        # get the answer for a feature
        answers = examples["answers"][sample_index]

        tokenized_examples["example_id"].append(examples["id"][sample_index])

        if len(answers["answer_start"]) == 0:
            tokenized_examples["start_positions"].append(cls_index)
            tokenized_examples["end_positions"].append(cls_index)
        else:
            # Start/end character index of the answer in the text.
            start_char = answers["answer_start"][0]
            end_char = start_char + len(answers["text"][0])

            # Start token index of the current span in the text.
            token_start_index = 0
            while sequence_ids[token_start_index] != 1:
                token_start_index += 1

            # End token index of the current span in the text.
            token_end_index = len(input_ids) - 1
            while sequence_ids[token_end_index] != 1:
                token_end_index -= 1

            # Detect if the answer is out of the span (in which case this feature is labeled with the CLS index).
            if not (offsets[token_start_index][0] <= start_char and
                    offsets[token_end_index][1] >= end_char):
                tokenized_examples["start_positions"].append(cls_index)
                tokenized_examples["end_positions"].append(cls_index)
            else:
                # Otherwise move the token_start_index and token_end_index to the two ends of the answer.
                # Note: we could go after the last offset if the answer is the last word (edge case).
                while token_start_index < len(offsets) and \
                        offsets[token_start_index][0] <= start_char:
                    token_start_index += 1
                tokenized_examples["start_positions"].append(
                    token_start_index - 1)
                while offsets[token_end_index][1] >= end_char:
                    token_end_index -= 1
                tokenized_examples["end_positions"].append(token_end_index + 1)

    return tokenized_examples
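
A plausible explanation, offered as an editor's assumption rather than something confirmed in this thread: by default the Trainer drops dataset columns that the model's forward() does not accept, so example_id is removed before batching. If that is the cause, the column can be kept like this:

from transformers import TrainingArguments

# remove_unused_columns=False keeps extra dataset columns such as example_id
# in the batches passed to compute_loss. They must then be popped from `inputs`
# before calling the model, since forward() will not accept them.
training_args = TrainingArguments(
    output_dir="./re_trained_model/",
    remove_unused_columns=False,
)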
