
New model: BERT-Squad #190

Merged
merged 27 commits into from
Aug 14, 2019

Conversation

kundanapillari
Contributor

@kundanapillari kundanapillari commented Aug 1, 2019

BERT-Squad

Use-cases

This model answers questions based on the context of the given input paragraph.

Description

BERT (Bidirectional Encoder Representations from Transformers) applies Transformers, a popular attention model, to language modelling. A Transformer has an encoder that reads the input text and a decoder that produces a prediction for the task. BERT masks out some of the words in the input and conditions on each word bidirectionally to predict the masked words. BERT also learns to model relationships between sentences by predicting whether two sentences follow each other.
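The masking idea can be illustrated with a toy sketch (plain Python, fixed mask positions rather than BERT's random sampling of roughly 15% of tokens; `mask_tokens` is a hypothetical helper, not part of any BERT library):

```python
def mask_tokens(tokens, positions):
    """Replace tokens at the given positions with [MASK]; return the
    masked sequence plus the original tokens the model must recover."""
    masked = list(tokens)
    labels = {i: tokens[i] for i in positions}
    for i in positions:
        masked[i] = "[MASK]"
    return masked, labels

tokens = "the model reads the whole sentence at once".split()
masked, labels = mask_tokens(tokens, positions=[2, 5])
# masked[2] and masked[5] are now "[MASK]"; labels maps them back to
# "reads" and "sentence", and the model sees context on both sides.
```

During pre-training the model is scored only on how well it recovers the tokens in `labels`, which is what makes the learned representations bidirectional.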

Model

| Model | Download | Checksum | Download (with sample test data) | ONNX version | Opset version |
|-------|----------|----------|----------------------------------|--------------|---------------|
| BERT-Squad | 393 MB | MD5 | 788 MB | 1.3 | 8 |
| BERT-Squad | 393 MB | MD5 | 788 MB | 1.5 | 10 |

Dependencies

Inference

We used ONNX Runtime to perform the inference.

Input

The input is a paragraph and questions relating to that paragraph. The model uses the WordPiece tokenisation method to split the input text into a list of tokens from its 30,522-word vocabulary, and then converts these tokens into the following features:

  • input_ids: list of numerical ids for the tokenised text
  • input_mask: will be set to 1 for real tokens and 0 for the padding tokens
  • segment_ids: for our case, this will be set to the list of ones
  • label_ids: one-hot encoded labels for the text
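As a toy illustration of these features (a made-up 8-entry vocabulary instead of the real 30,522-word WordPiece vocabulary; `to_features` is a hypothetical helper, and `label_ids` is omitted since it is only needed for training):

```python
toy_vocab = {"[PAD]": 0, "[CLS]": 1, "[SEP]": 2, "where": 3,
             "is": 4, "the": 5, "center": 6, "[UNK]": 7}

def to_features(question_tokens, vocab, max_len):
    tokens = ["[CLS]"] + question_tokens + ["[SEP]"]
    input_ids = [vocab.get(t, vocab["[UNK]"]) for t in tokens]
    input_mask = [1] * len(input_ids)        # 1 for real tokens
    while len(input_ids) < max_len:          # pad to a fixed length
        input_ids.append(vocab["[PAD]"])
        input_mask.append(0)                 # 0 for padding tokens
    segment_ids = [1] * max_len              # a list of ones, as above
    return input_ids, input_mask, segment_ids

ids, mask, seg = to_features(["where", "is", "the", "center"], toy_vocab, 10)
# ids  -> [1, 3, 4, 5, 6, 2, 0, 0, 0, 0]
# mask -> [1, 1, 1, 1, 1, 1, 0, 0, 0, 0]
```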

    Preprocessing

    Write an inputs.json file that includes the context paragraph and questions.

    %%writefile inputs.json
    {
      "version": "1.4",
      "data": [
        {
          "paragraphs": [
            {
              "context": "In its early years, the new convention center failed to meet attendance and revenue expectations.[12] By 2002, many Silicon Valley businesses were choosing the much larger Moscone Center in San Francisco over the San Jose Convention Center due to the latter's limited space. A ballot measure to finance an expansion via a hotel tax failed to reach the required two-thirds majority to pass. In June 2005, Team San Jose built the South Hall, a $6.77 million, blue and white tent, adding 80,000 square feet (7,400 m2) of exhibit space",
              "qas": [
                {
                  "question": "where is the businesses choosing to go?",
                  "id": "1"
                },
                {
                  "question": "how may votes did the ballot measure need?",
                  "id": "2"
                },
                {
                  "question": "By what year many Silicon Valley businesses were choosing the Moscone Center?",
                  "id": "3"
                }
              ]
            }
          ],
          "title": "Conference Center"
        }
      ]
    }

    Get parameters and convert input examples into features

    # preprocess input
    import os

    import tokenization  # tokenization.py from the Google BERT repo
    from run_onnx_squad import read_squad_examples, convert_examples_to_features

    predict_file = 'inputs.json'

    # Use read_squad_examples method from run_onnx_squad to read the input file
    eval_examples = read_squad_examples(input_file=predict_file)
    
    max_seq_length = 256
    doc_stride = 128
    max_query_length = 64
    batch_size = 1
    n_best_size = 20
    max_answer_length = 30
    
    vocab_file = os.path.join('uncased_L-12_H-768_A-12', 'vocab.txt')
    tokenizer = tokenization.FullTokenizer(vocab_file=vocab_file, do_lower_case=True)
    
    # Use convert_examples_to_features method from run_onnx_squad to get parameters from the input 
    input_ids, input_mask, segment_ids, extra_data = convert_examples_to_features(eval_examples, tokenizer,
                                                                                  max_seq_length, doc_stride, max_query_length)
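Paragraphs longer than `max_seq_length` are split into overlapping windows, which is what the `doc_stride` parameter controls. A rough, self-contained sketch of that windowing (`doc_spans` is a simplification of what `convert_examples_to_features` does internally; the real code also reserves room for the question and the `[CLS]`/`[SEP]` tokens):

```python
def doc_spans(num_doc_tokens, max_tokens_per_span, doc_stride):
    """Overlapping (start, length) windows over a long paragraph."""
    spans, start = [], 0
    while start < num_doc_tokens:
        length = min(max_tokens_per_span, num_doc_tokens - start)
        spans.append((start, length))
        if start + length == num_doc_tokens:
            break
        start += doc_stride
    return spans

# A 300-token paragraph with 256-token windows and a stride of 128
# yields two spans that overlap on tokens 128-255:
print(doc_spans(300, 256, 128))   # [(0, 256), (128, 172)]
```

The overlap means every token appears in at least one window with enough surrounding context for the model to score it.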

    Output

    For each question about the context paragraph, the model predicts the start and end tokens of the span in the paragraph that most likely answers the question.
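The span selection can be sketched as follows (toy token list and logits, and a hypothetical `best_span` helper; the real `write_predictions` additionally builds an n-best list and maps WordPiece tokens back to the original text):

```python
def best_span(start_logits, end_logits, max_answer_length):
    """Pick the (start, end) pair with the highest combined score,
    subject to start <= end and a maximum answer length."""
    best, best_score = None, float("-inf")
    for s, s_logit in enumerate(start_logits):
        for e, e_logit in enumerate(end_logits):
            if s <= e < s + max_answer_length:
                score = s_logit + e_logit
                if score > best_score:
                    best, best_score = (s, e), score
    return best

doc_tokens = ["the", "moscone", "center", "in", "san", "francisco"]
start_logits = [0.1, 2.0, 0.3, 0.0, 0.2, 0.1]
end_logits   = [0.0, 0.1, 2.5, 0.2, 0.3, 0.4]
s, e = best_span(start_logits, end_logits, max_answer_length=5)
answer = " ".join(doc_tokens[s:e + 1])   # "moscone center"
```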

    Postprocessing

    Write the predictions (answers to the questions) in a file.

    # postprocess results
    import os
    from run_onnx_squad import write_predictions

    output_dir = 'predictions'
    os.makedirs(output_dir, exist_ok=True)
    output_prediction_file = os.path.join(output_dir, "predictions.json")
    output_nbest_file = os.path.join(output_dir, "nbest_predictions.json")
    write_predictions(eval_examples, extra_data, all_results,
                      n_best_size, max_answer_length,
                      True, output_prediction_file, output_nbest_file)

    Dataset (Train and Validation)

    The model was trained on the SQuAD v1.1 dataset, which contains 100,000+ question-answer pairs on 500+ articles.

    Validation accuracy

    This ONNX model achieves an Exact Match (EM) score of 80.7 on the SQuAD v1.1 dev set.

    Training

    The model was fine-tuned on the SQuAD v1.1 dataset. See BertTutorial.ipynb for more information on converting the model from TensorFlow to ONNX and on fine-tuning.

    References

    Contributors

    Kundana Pillari

    License

    Apache 2.0


  • text/machine_comprehension/bert-squad/README.md (7 resolved review conversations)
    @CLAassistant

    CLAassistant commented Aug 12, 2019

    CLA assistant check
    All committers have signed the CLA.


    @WilliamTambellini WilliamTambellini left a comment


    renaming onnx files bert.onnx to bertsquad.onnx

    @GeorgeS2019

    GeorgeS2019 commented Nov 7, 2019

    @kundanapillari ONNX needs more BERT related pre-trained models (e.g. other languages)

    This BERT-Squad works perfectly with Microsoft ML.NET

    @GeorgeS2019

    GeorgeS2019 commented Nov 11, 2019

    @kundanapillari Can someone show how to get this to work? PyTorch to ONNX

    For PyTorch, example of exporting ONNX model and inference using ONNX Runtime:
    https://github.com/tianleiwu/pytorch-pretrained-BERT/blob/master/examples/run_classifier.py

    This has been solved, thanks! (18 Nov)

    jcwchen pushed a commit to jcwchen/models that referenced this pull request Apr 26, 2022
    * Added bert-squad model and dependencies
    
    * Fixed README.md table
    
    * Added bert-squad notebook
    
    * Added bert-squad notebook
    
    * Delete BERT-Squad.ipynb
    
    * Added notebook
    
    * Updated README.md
    
    * Fixed file path
    
    * Updated input, output, and preprocessing sections
    
    * Added License information
    
    * Updated README.md
    
    * Updated README.md and added sample file
    
    * Updated README.md
    
    * Updated Readme.md
    
    * Changed Inference
    
    * Updated main README.md
    
    * Updated onnx version and opset
    
    * Made requested changes
    
    * Changed Metric
    
    * Reverted changes
    
    * Reverted changes
    
    * Added files
    
    * Fixed changes
    
    * Changed README.md
    
    * Changed onnx file names
    6 participants