Add cross encoder support #1615

Merged
merged 18 commits into from
Dec 7, 2023

Collaborator

@HenryL27 HenryL27 commented Nov 10, 2023

Description

Adds support for (Hugging Face) cross encoders to ml-commons. Uses a new function name (TEXT_SIMILARITY), which takes a query text and a list of text docs as input and returns, for each (query, doc) pair, a 1-dimensional tensor representing the similarity of the two items. E.g.

{
  "query_text": "today is sunny",
  "text_docs": [
    "today is sunny",
    "today is july fifth",
    "it is winter"
  ]
}

yields

{
  "inference_results": [
    {
      "output": [
        {
          "name": "logits",
          "data_type": "FLOAT32",
          "shape": [
            1
          ],
          "data": [
            10.939743
          ],
          "byte_buffer": {
            "array": "MAkvQQ==",
            "order": "LITTLE_ENDIAN"
          }
        }
      ]
    },
    {
      "output": [
        {
          "name": "logits",
          "data_type": "FLOAT32",
          "shape": [
            1
          ],
          "data": [
            -6.067284
          ],
          "byte_buffer": {
            "array": "MSfCwA==",
            "order": "LITTLE_ENDIAN"
          }
        }
      ]
    },
    {
      "output": [
        {
          "name": "logits",
          "data_type": "FLOAT32",
          "shape": [
            1
          ],
          "data": [
            -11.261627
          ],
          "byte_buffer": {
            "array": "oC80wQ==",
            "order": "LITTLE_ENDIAN"
          }
        }
      ]
    }
  ]
}
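
For readers consuming this response, here is a minimal Python sketch of decoding the `byte_buffer` field and ranking the docs by score. It assumes that results come back in the same order as `text_docs` and that each result carries a single output tensor; neither is stated in this PR. The helper names `decode_float32` and `rank_docs` are hypothetical. A later commit in this PR renames the output from `logits` to `similarity`, so the sketch reads the first output rather than matching on the name.

```python
import base64
import struct

# Trimmed copy of the response above: only the fields this sketch reads.
response = {
    "inference_results": [
        {"output": [{"name": "logits", "data": [10.939743],
                     "byte_buffer": {"array": "MAkvQQ==", "order": "LITTLE_ENDIAN"}}]},
        {"output": [{"name": "logits", "data": [-6.067284],
                     "byte_buffer": {"array": "MSfCwA==", "order": "LITTLE_ENDIAN"}}]},
        {"output": [{"name": "logits", "data": [-11.261627],
                     "byte_buffer": {"array": "oC80wQ==", "order": "LITTLE_ENDIAN"}}]},
    ]
}
text_docs = ["today is sunny", "today is july fifth", "it is winter"]


def decode_float32(byte_buffer):
    """Decode a base64-encoded buffer of FLOAT32 values into a list of floats."""
    raw = base64.b64decode(byte_buffer["array"])
    endian = "<" if byte_buffer["order"] == "LITTLE_ENDIAN" else ">"
    count = len(raw) // 4  # 4 bytes per float32
    return list(struct.unpack(f"{endian}{count}f", raw))


def rank_docs(response, docs):
    """Pair each doc with its score (assumed to be in request order), best first."""
    scores = [r["output"][0]["data"][0] for r in response["inference_results"]]
    return sorted(zip(docs, scores), key=lambda pair: pair[1], reverse=True)


if __name__ == "__main__":
    for doc, score in rank_docs(response, text_docs):
        print(f"{score:12.6f}  {doc}")
```

The `data` array and the `byte_buffer` carry the same values, so a client only needs one of them. The scores are raw logits, not probabilities, so they are mainly useful for ordering the docs against each other.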

This used the model cross-encoder/ms-marco-TinyBERT-L-2-v2. The config I used to upload it looked like:

{
  "name": model_name,
  "version": "1.0.0",
  "description": "Cross Encoder text similarity model",
  "model_format": "TORCH_SCRIPT",
  "function_name": "TEXT_SIMILARITY",
  "model_content_hash_value": hash_value,
  "model_config": {
    "model_type": "bert",
    "embedding_dimension": 1,
    "framework_type": "huggingface_transformers",
    "all_config": cfg.to_json_string(),
  }
}
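
The config above references `model_name`, `hash_value`, and `cfg` without showing how they were produced. The following Python sketch is one way they might be filled in, under two assumptions that are not confirmed in this PR: that `model_content_hash_value` is the SHA-256 hex digest of the model zip file (consistent with the 64-character hash quoted later in this thread), and that `cfg` is a Hugging Face config object whose JSON string is passed in by the caller. The function names are hypothetical.

```python
import hashlib


def sha256_of_file(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(chunk_size), b""):
            digest.update(block)
    return digest.hexdigest()


def build_upload_config(model_name, zip_path, all_config_json):
    """Assemble the upload config shown above for a cross encoder.

    all_config_json is assumed to be the string returned by the
    Hugging Face config object's to_json_string() method.
    """
    return {
        "name": model_name,
        "version": "1.0.0",
        "description": "Cross Encoder text similarity model",
        "model_format": "TORCH_SCRIPT",
        "function_name": "TEXT_SIMILARITY",
        "model_content_hash_value": sha256_of_file(zip_path),
        "model_config": {
            "model_type": "bert",
            "embedding_dimension": 1,
            "framework_type": "huggingface_transformers",
            "all_config": all_config_json,
        },
    }
```

Reading the file in chunks keeps memory use flat even for large model archives.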

Issues Resolved

Check List

  • [x] New functionality includes testing.
    • [x] All tests pass
  • [ ] New functionality has been documented.
    • [ ] New functionality has javadoc added
  • [x] Commits are signed per the DCO using --signoff

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.


codecov bot commented Nov 10, 2023

Codecov Report

Attention: 12 lines in your changes are missing coverage. Please review.

Comparison is base (df644ff) 80.83% compared to head (be113ed) 80.98%.
Report is 7 commits behind head on main.

Files                                                    Patch %   Lines
...rch/ml/common/input/nlp/TextSimilarityMLInput.java    86.95%    2 Missing and 4 partials ⚠️
...n/java/org/opensearch/ml/common/input/MLInput.java    66.66%    3 Missing and 2 partials ⚠️
...in/java/org/opensearch/ml/common/FunctionName.java    66.66%    1 Missing ⚠️
Additional details and impacted files
@@             Coverage Diff              @@
##               main    #1615      +/-   ##
============================================
+ Coverage     80.83%   80.98%   +0.15%     
- Complexity     4215     4246      +31     
============================================
  Files           404      408       +4     
  Lines         16977    17122     +145     
  Branches       1818     1835      +17     
============================================
+ Hits          13723    13867     +144     
+ Misses         2539     2534       -5     
- Partials        715      721       +6     
Flag Coverage Δ
ml-commons 80.98% <91.83%> (+0.15%) ⬆️


Collaborator

@austintlee austintlee left a comment

One minor question, but overall looks great!

austintlee previously approved these changes Nov 16, 2023
Signed-off-by: HenryL27 <hmlindeman@yahoo.com>
Signed-off-by: HenryL27 <hmlindeman@yahoo.com>
@dhrubo-os
Collaborator

Thanks for working on this. Approved.

Collaborator

@austintlee austintlee left a comment

LGTM (with one minor question. You can answer and resolve.)

@HenryL27 HenryL27 merged commit 2761d7d into opensearch-project:main Dec 7, 2023
8 of 12 checks passed
opensearch-trigger-bot bot pushed a commit that referenced this pull request Dec 7, 2023
* add text similarity inputs and function name

Signed-off-by: HenryL27 <hmlindeman@yahoo.com>

* add text similarity cross encoder model

Signed-off-by: HenryL27 <hmlindeman@yahoo.com>

* add text similarity unit tests

Signed-off-by: HenryL27 <hmlindeman@yahoo.com>

* add text similarity input unittests

Signed-off-by: HenryL27 <hmlindeman@yahoo.com>

* add text similarity dataset unittests

Signed-off-by: HenryL27 <hmlindeman@yahoo.com>

* add function name annotation

Signed-off-by: HenryL27 <hmlindeman@yahoo.com>

* refactor API to use single query

Signed-off-by: HenryL27 <hmlindeman@yahoo.com>

* omit private from class vars

Co-authored-by: Navneet Verma <vermanavneet003@gmail.com>
Signed-off-by: HenryL27 <hmlindeman@yahoo.com>

* change output name from logits to similarity

Signed-off-by: HenryL27 <hmlindeman@yahoo.com>

* hashify isDLModel

Signed-off-by: HenryL27 <hmlindeman@yahoo.com>

* add error message for non-torchscript cross encoders

Signed-off-by: HenryL27 <hmlindeman@yahoo.com>

* allow onnx, actually.

Signed-off-by: HenryL27 <hmlindeman@yahoo.com>

* apply spotless after rebase

Signed-off-by: HenryL27 <hmlindeman@yahoo.com>

* add unittest for new mlinput toXcontent clause

Signed-off-by: HenryL27 <hmlindeman@yahoo.com>

* static DLModels

Signed-off-by: HenryL27 <hmlindeman@yahoo.com>

* add tests and error message tweaks

Signed-off-by: HenryL27 <hmlindeman@yahoo.com>

* name test models w framework

Signed-off-by: HenryL27 <hmlindeman@yahoo.com>

* change pt->torch_script

Signed-off-by: HenryL27 <hmlindeman@yahoo.com>

---------

Signed-off-by: HenryL27 <hmlindeman@yahoo.com>
Co-authored-by: Navneet Verma <vermanavneet003@gmail.com>
(cherry picked from commit 2761d7d)
dhrubo-os pushed a commit that referenced this pull request Dec 7, 2023
(cherry picked from commit 2761d7d)

Co-authored-by: HenryL27 <hmlindeman@yahoo.com>
@martin-gaievski
Member

@HenryL27 can you please share the details of the meta config for the ms-marco-TinyBERT-L-2-v2 model?
I'm using the following request but I'm getting errors; probably some param is missing:

POST /_plugins/_ml/models/meta
{
    "name": "ms-marco-TinyBERT-L-2-v2",
    "version": "1.0.0",
    "function_name": "TEXT_SIMILARITY",
    "description": "test model",
    "model_format": "TORCH_SCRIPT",
    "model_group_id": "<MODEL_GROUP_ID>",
    "model_content_hash_value": "90e39a926101d1a4e542aade0794319404689b12acfd5d7e65c03d91c668b5cf",
    "model_config": {
        "model_type": "bert",
        "embedding_dimension": 1,
        "framework_type": "huggingface_transformers",
        "all_config": "{\"total_chunks\":2,\"is_hidden\":false}"
    },
    "url": "https://github.com/opensearch-project/ml-commons/blob/main/ml-algorithms/src/test/resources/org/opensearch/ml/engine/algorithms/text_similarity/TinyBERT-CE-torch_script.zip?raw=true"
}

error response:

        "type": "illegal_argument_exception",
        "reason": "total chunks field is null"
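
One possible reading of this error, offered as an assumption rather than a confirmed diagnosis: the request carries `total_chunks` only inside the `all_config` string, while the meta endpoint appears to expect it as a top-level field of the request body. The Python sketch below moves it to the top level. The helper name `hoist_total_chunks` is hypothetical, and whether this resolves the error has not been verified in this thread.

```python
import json


def hoist_total_chunks(body):
    """Return a copy of body with total_chunks moved from all_config to the top level.

    Assumes model_config.all_config is a JSON-encoded string, as in the request above.
    The input dict is left unmodified.
    """
    fixed = json.loads(json.dumps(body))  # deep copy via a JSON round trip
    model_config = fixed.get("model_config", {})
    all_config = json.loads(model_config.get("all_config") or "{}")
    if "total_chunks" not in fixed and "total_chunks" in all_config:
        fixed["total_chunks"] = all_config.pop("total_chunks")
        model_config["all_config"] = json.dumps(all_config)
    return fixed


if __name__ == "__main__":
    request_body = {
        "name": "ms-marco-TinyBERT-L-2-v2",
        "function_name": "TEXT_SIMILARITY",
        "model_config": {
            "model_type": "bert",
            "all_config": "{\"total_chunks\":2,\"is_hidden\":false}",
        },
    }
    print(json.dumps(hoist_total_chunks(request_body), indent=2))
```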

austintlee pushed a commit to austintlee/ml-commons that referenced this pull request Feb 29, 2024