API for get_entity_by_id() in specific partition #4564

Closed
shiyu22 opened this issue Jan 5, 2021 · 5 comments

@shiyu22
Contributor

shiyu22 commented Jan 5, 2021

Is your feature request related to a problem? Please describe.
To get the vectors I have inserted, I can now use get_entity_by_id(), but I can't specify partition_name.

Describe alternatives you've considered
I think lookups would be faster if get_entity_by_id() accepted a partition_name, so that only that partition has to be scanned.

shengjun1985 added the kind/feature label (issues related to feature requests from users) on Jan 5, 2021
@hs-pedro

Hi team. I'm currently facing the same situation: I need to get a vector from a specific partition. While this ticket for a new feature is still open, do you have any ideas for a workaround?

I'd reckon that the only way to look for a specific vector within a partition is the search API, which uses ANN, right? Is there any other way around it, even a costly one?

@hs-pedro

Looking into the code, I've seen that GetVectorsByIdHelper loops over a FilesHolder structure to find the vector. However, this structure is not filtered by partition: it comes from FilesByTypeEx(), which is called with the collection and all of its partitions.

milvus/core/src/db/DBImpl.cpp

Lines 1302 to 1319 in f709349

std::vector<meta::CollectionSchema> collection_array;
auto status = meta_ptr_->ShowPartitions(collection.collection_id_, collection_array);
collection_array.push_back(collection);
status = meta_ptr_->FilesByTypeEx(collection_array, file_types, files_holder);
if (!status.ok()) {
    std::string err_msg = "Failed to get files for GetVectorByID: " + status.message();
    LOG_ENGINE_ERROR_ << err_msg;
    return status;
}
if (files_holder.HoldFiles().empty()) {
    LOG_ENGINE_DEBUG_ << "No files to get vector by id from";
    return Status(DB_NOT_FOUND, "Collection is empty");
}
cache::CpuCacheMgr::GetInstance()->PrintInfo();
status = GetVectorsByIdHelper(id_array, vectors, files_holder);

milvus/core/src/db/DBImpl.cpp

Lines 1403 to 1420 in f709349

DBImpl::GetVectorsByIdHelper(const IDNumbers& id_array, std::vector<engine::VectorsData>& vectors,
                             meta::FilesHolder& files_holder) {
    // attention: this is a copy, not a reference, since the files_holder.UnMarkFile will change the array internal
    milvus::engine::meta::SegmentsSchema files = files_holder.HoldFiles();
    LOG_ENGINE_DEBUG_ << "Getting vector by id in " << files.size() << " files, id count = " << id_array.size();
    // sometimes not all of id_array can be found, we need to return empty vector for id not found
    // for example:
    // id_array = [1, -1, 2, -1, 3]
    // vectors should return [valid_vector, empty_vector, valid_vector, empty_vector, valid_vector]
    // the ID2RAW is to ensure returned vector sequence is consist with id_array
    using ID2VECTOR = std::map<int64_t, VectorsData>;
    ID2VECTOR map_id2vector;
    vectors.clear();
    IDNumbers temp_ids = id_array;
    for (auto& file : files) {

From my understanding, a possible implementation would be to use the partitions to build a "filtered" FilesHolder that could be passed to GetVectorsByIdHelper to loop over, similar to what is done in the Query methods:

milvus/core/src/db/DBImpl.cpp

Lines 1750 to 1758 in f709349

std::set<std::string> partition_name_array;
GetPartitionsByTags(collection_id, partition_tags, partition_name_array);
for (auto& partition_name : partition_name_array) {
    status = meta_ptr_->FilesToSearch(partition_name, files_holder);
    if (!status.ok()) {
        return Status(DB_ERROR, "get files to search failed in HybridQuery");
    }
}

Are those assumptions correct? If so, would it be preferable to add a new implementation rather than change the existing one, to avoid breaking the API?
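
For readers following along, here is a minimal sketch of the idea in Python rather than the C++ of DBImpl.cpp. It is not Milvus code: `SegmentFile`, `files_for_partitions`, and `get_vectors_by_id` are hypothetical names used only for illustration. It assumes each segment file knows its partition, narrows the file list to the requested partitions before the scan, and returns an empty vector for every ID that is not found, in the same order as the input IDs (the behavior described in the GetVectorsByIdHelper comments).

```python
from typing import Dict, List, Optional, Sequence


class SegmentFile:
    """Hypothetical stand-in for one segment file and the vectors it stores."""

    def __init__(self, partition: str, vectors: Dict[int, List[float]]):
        self.partition = partition
        self.vectors = vectors


def files_for_partitions(files: Sequence[SegmentFile],
                         partitions: Optional[Sequence[str]] = None) -> List[SegmentFile]:
    """Return every file when no partition is given, else only files in the named partitions."""
    if not partitions:
        return list(files)
    wanted = set(partitions)
    return [f for f in files if f.partition in wanted]


def get_vectors_by_id(files: Sequence[SegmentFile],
                      ids: Sequence[int],
                      partitions: Optional[Sequence[str]] = None) -> List[List[float]]:
    """Look up ids in the (optionally partition-filtered) files, preserving input order."""
    found: Dict[int, List[float]] = {}
    remaining = set(ids)
    for f in files_for_partitions(files, partitions):
        if not remaining:
            break  # every requested id has been located, so skip the other files
        for vector_id in list(remaining):
            if vector_id in f.vectors:
                found[vector_id] = f.vectors[vector_id]
                remaining.discard(vector_id)
    # Missing ids map to an empty vector so the result lines up with the input ids.
    return [found.get(vector_id, []) for vector_id in ids]
```

With no partitions given, the sketch scans every file, which mirrors the current unfiltered behavior. Passing partition names only shrinks the set of files that are scanned, so an ID stored in another partition comes back as an empty vector.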

@yhmo
Contributor

yhmo commented Mar 12, 2021

@hs-pedro

Basically you are right: to get a vector from a specified partition, we need to pass the partition names down to the loop so that the FilesHolder only contains segments from those partitions.
But that is only the internal implementation at the db level. There is more work to do:

  • change the request-level code in milvus/core/src/server to pass partition names
  • add a parameter to all SDKs (Python/Go/C++/Java/RESTful) to pass partition names
    for example, in pymilvus, the definition of get_entity_by_id is:
    def get_entity_by_id(self, collection_name, ids, timeout=None)
    and we need to redefine it as something like:
    def get_entity_by_id(self, collection_name, ids, timeout=None, partition_tags=None)
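
One detail on the SDK side: in Python, a parameter with a default value cannot come before a required one, so a new optional `partition_tags` has to follow `ids`. Appending it after `timeout` also keeps existing positional calls working. The sketch below illustrates this with a hypothetical `SketchClient`. It is not the pymilvus implementation, and the request dictionary it returns is an assumption made purely so the example can run without a server.

```python
from typing import Any, Dict, Optional, Sequence


class SketchClient:
    """Hypothetical client used only to illustrate the signature change."""

    def get_entity_by_id(self,
                         collection_name: str,
                         ids: Sequence[int],
                         timeout: Optional[float] = None,
                         partition_tags: Optional[Sequence[str]] = None) -> Dict[str, Any]:
        # Build the request that would be sent to the server (illustrative shape only).
        request: Dict[str, Any] = {
            "collection_name": collection_name,
            "ids": list(ids),
            "timeout": timeout,
        }
        # Only include the partition filter when the caller asked for one,
        # so requests from callers written before the change look the same as before.
        if partition_tags is not None:
            request["partition_tags"] = list(partition_tags)
        return request


client = SketchClient()
old_style = client.get_entity_by_id("demo", [1, 2], 10)                # existing call still valid
new_style = client.get_entity_by_id("demo", [1], partition_tags=["p1"])
```

Because the new parameter is last and optional, callers that never pass it see no change, which addresses the concern about breaking the API raised earlier in the thread.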

@jingkl
Contributor

jingkl commented Mar 15, 2021

For this question, refer to issue #4813.

shengjun1985 added this to the 1.1.0 milestone on Apr 15, 2021
yhmo changed the title from "get_entity_by_id with partition_name" to "API for get_entity_by_id in specific partition" on Apr 28, 2021
yhmo changed the title from "API for get_entity_by_id in specific partition" to "API for get_entity_by_id() in specific partition" on Apr 29, 2021
@yhmo
Contributor

yhmo commented May 7, 2021

Implemented in v1.1.0

yhmo closed this as completed on May 7, 2021