Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add binary format support for Faiss IVF #2

Closed

Conversation

junqiu-lei
Copy link

@junqiu-lei junqiu-lei commented Jul 9, 2024

Because opensearch-project#1784 is merged, I rebased junqiu-lei:binary-ivf against opensearch-project:feature/binary-format

Description

Raising this PR in Heemin's forked k-NN repository to facilitate the review of the code changes from opensearch-project#1784

It contains code change in Java and JNI layer to support:

  • train model with binary format index
  • add data_type field when train model
  • create IVF index with binary format
  • query IVF index with binary format

Will add tests in another PR.

Example workflow

1. Create binary format train index

PUT /train-index HTTP/1.1
Host: localhost:9200
Content-Type: application/json
Content-Length: 481

{
  "settings": {
    "number_of_shards": 3,
    "number_of_replicas": 0,
    "index.knn": true
  },
  "mappings": {
    "properties": {
      "train-field": {
        "type": "knn_vector",
        "dimension": 8,
        "data_type": "binary",
        "method": {
          "name": "hnsw",
          "space_type": "hammingbit",
          "engine": "faiss",
          "parameters": {
            "ef_construction": 128,
            "m": 24
          }
        }
      }
    }
  }
}

2. Ingest tran index

POST /_bulk HTTP/1.1
Host: localhost:9200
Content-Type: application/json
Content-Length: 7067

{ "index": { "_index": "train-index", "_id": "1" } }
{ "train-field": [1]}
{ "index": { "_index": "train-index", "_id": "2" } }
{ "train-field": [2]}
{ "index": { "_index": "train-index", "_id": "3" } }
{ "train-field": [4]}
{ "index": { "_index": "train-index", "_id": "4" } }
........

3. Create train model

POST /_plugins/_knn/models/my-model/_train HTTP/1.1
Host: localhost:9200
Content-Type: application/json
Content-Length: 313

{
  "training_index": "train-index",
  "training_field": "train-field",
  "dimension": 8,
  "description": "My model description",
  "data_type": "binary",
  "method": {
    "name": "ivf",
    "engine": "faiss",
    "space_type": "hammingbit",
    "parameters": {
      "nlist": 4,
      "nprobes":2 
    }
  }
}

4. Create IVF binary format target index

PUT /target-index HTTP/1.1
Host: localhost:9200
Content-Type: application/json
Content-Length: 242

{
  "settings": {
    "number_of_shards": 1,
    "number_of_replicas": 1,
    "index.knn": true
  },
  "mappings": {
    "properties": {
      "target-field": {
        "type": "knn_vector",
        "model_id": "my-model"
      }
    }
  }
}

5. Bulk target index

POST /_bulk HTTP/1.1
Host: localhost:9200
Content-Type: application/json
Content-Length: 931

{ "index": { "_index": "target-index", "_id": "1" } }
{ "target-field": [2]}
{ "index": { "_index": "target-index", "_id": "2" } }
{ "target-field": [3]}
.......

6. Query target index

GET /target-index/_search HTTP/1.1
Host: localhost:9200
Content-Type: application/json
Content-Length: 110

{
  "query": {
    "knn": {
      "target-field": {
        "vector": [10],
        "k": 5
      }
    }
  }
}

7. Query result

{
    "took": 2898,
    "timed_out": false,
    "_shards": {
        "total": 1,
        "successful": 1,
        "skipped": 0,
        "failed": 0
    },
    "hits": {
        "total": {
            "value": 5,
            "relation": "eq"
        },
        "max_score": 1.0,
        "hits": [
            {
                "_index": "target-index",
                "_id": "9",
                "_score": 1.0,
                "_source": {
                    "target-field": [
                        10
                    ]
                }
            },
            {
                "_index": "target-index",
                "_id": "1",
                "_score": 0.5,
                "_source": {
                    "target-field": [
                        2
                    ]
                }
            },
            {
                "_index": "target-index",
                "_id": "7",
                "_score": 0.5,
                "_source": {
                    "target-field": [
                        8
                    ]
                }
            },
            {
                "_index": "target-index",
                "_id": "10",
                "_score": 0.5,
                "_source": {
                    "target-field": [
                        11
                    ]
                }
            },
            {
                "_index": "target-index",
                "_id": "2",
                "_score": 0.33333334,
                "_source": {
                    "target-field": [
                        3
                    ]
                }
            }
        ]
    }
}

@@ -29,6 +29,12 @@ namespace knn_jni {
jlong vectorsAddressJ, jint dimJ, jstring indexPathJ, jbyteArray templateIndexJ,
jobject parametersJ);

// Create an index with ids and vectors. Instead of creating a new index, this function creates the index
// based off of the template index passed in. The index is serialized to indexPathJ.
void CreateBinaryIndexFromTemplate(knn_jni::JNIUtilInterface * jniUtil, JNIEnv * env, jintArray idsJ,

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why introduce Binary<...> methods, but for Free take parameter?

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is to reflect underlying cpp method.

Faiss has two different method between non binary and binary for create index method. However, there is only one method, delete` to free the memory.

throw std::runtime_error("Template index cannot be null");
}

// Set thread count if it is passed in as a parameter. Setting this variable will only impact the current thread

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it possible to avoid code duplication?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tried to add it into faiss_index_service, but failed with some c++ null pointer error, will try again with workable solution

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll raise another PR to try to add it into faiss_index_service

@@ -596,8 +597,13 @@ protected List<Field> getFieldsForByteVector(final byte[] array, final FieldType
return fields;
}

protected void parseCreateField(ParseContext context, int dimension, SpaceType spaceType, MethodComponentContext methodComponentContext)
throws IOException {
protected void parseCreateField(

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why this need to be changed?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

when we create model based index, we don't specify the data_type, which makes vectorDataType to be default float in KNNVectorFieldMapper, so I update this function to be able to pass in the correct vector datatype for model based index from modelMetaData

@navneet1v
Copy link

@junqiu-lei in this PR I can see heemin's change and your change both. Is that intensional?

@junqiu-lei
Copy link
Author

@junqiu-lei in this PR I can see heemin's change and your change both. Is that intensional?

@navneet1v It might because Heemin updated his branch which I targeted on this PR, let me rebase and update it.

@junqiu-lei
Copy link
Author

@junqiu-lei in this PR I can see heemin's change and your change both. Is that intensional?

@navneet1v It might because Heemin updated his branch which I targeted on this PR, let me rebase and update it.

Rebased and only my commit on this PR.

@junqiu-lei junqiu-lei force-pushed the binary-ivf branch 3 times, most recently from b3fc615 to b8b127d Compare July 11, 2024 19:39
@@ -128,6 +137,7 @@ public ModelMetadata(
this.error = Objects.requireNonNull(error, "error must not be null");
this.trainingNodeAssignment = Objects.requireNonNull(trainingNodeAssignment, "node assignment must not be null");
this.methodComponentContext = Objects.requireNonNull(methodComponentContext, "method context must not be null");
this.vectorDataType = Objects.requireNonNull(vectorDataType, "vector data type must not be null");

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you update Java doc of the constructor?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ack

@junqiu-lei junqiu-lei force-pushed the binary-ivf branch 2 times, most recently from f5529f2 to 90479ed Compare July 11, 2024 21:21
@naveentatikonda
Copy link

@junqiu-lei @heemin32 Are you adding validation check to not use an encoder with binary datatype because I believe as of now all the encoders are not supported with Binary datatype ?

@heemin32
Copy link
Owner

@junqiu-lei @heemin32 Are you adding validation check to not use an encoder with binary datatype because I believe as of now all the encoders are not supported with Binary datatype ?

We need. https://github.com/opensearch-project/k-NN/pull/1781/files#diff-c6d56dded611534c5be8f8afdba6bf320f7892d47465deee44bd6390f2bd1f12R347

@junqiu-lei junqiu-lei changed the base branch from binary-java to feature/binary-format July 15, 2024 17:49
@junqiu-lei junqiu-lei force-pushed the binary-ivf branch 3 times, most recently from 3a5491b to 7037a6e Compare July 15, 2024 22:27
@junqiu-lei junqiu-lei requested a review from heemin32 July 15, 2024 22:28
@junqiu-lei junqiu-lei force-pushed the binary-ivf branch 2 times, most recently from 800c7c5 to 7294db4 Compare July 16, 2024 19:11
Signed-off-by: Junqiu Lei <junqiu@amazon.com>
@junqiu-lei
Copy link
Author

junqiu-lei commented Jul 16, 2024

Closing this PR, the comments on this PR are resolved and continue being reviewing on opensearch-project#1784

@junqiu-lei junqiu-lei closed this Jul 16, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
7 participants