# Simple tag search

This notebook shows a simple example that combines tag-based search with approximate nearest neighbor (ANN) search.

Requirements:
- numpy
- rii (can be installed via `pip install rii`)
- nanopq (can be installed via `pip install nanopq`)


In [1]:
import rii
import nanopq
import numpy as np
from pprint import pprint

## Setup

Suppose we have $N$ images. A $D$-dimensional feature vector is extracted from each image, giving $X \in \mathbb{R}^{N \times D}$.

In [2]:
N, D = 30, 4
X = np.random.random((N, D)).astype(np.float32)
print(X)

[[0.23848394 0.29702175 0.65262675 0.9822919 ]
 [0.5894111  0.9024969  0.5392529  0.7023705 ]
 [0.9426218  0.26252207 0.39459682 0.9452917 ]
 [0.94124043 0.85057265 0.937756   0.17387015]
 [0.50857836 0.1293731  0.3028786  0.2306242 ]
 [0.76751685 0.7017813  0.9549601  0.8791828 ]
 [0.4735736  0.08341998 0.9536749  0.7296951 ]
 [0.91858566 0.843796   0.6069587  0.8784087 ]
 [0.88605714 0.99625313 0.9692241  0.65299237]
 [0.46498737 0.48663864 0.3083093  0.2550272 ]
 [0.9165077  0.8309953  0.09934966 0.78623253]
 [0.7734119  0.21923055 0.47940737 0.31913072]
 [0.40212712 0.02557702 0.10609309 0.7202728 ]
 [0.1250469  0.98909503 0.13674396 0.22888306]
 [0.7612986  0.474714   0.487191   0.08833417]
 [0.6271453  0.33875272 0.6161446  0.81554586]
 [0.00779804 0.85486376 0.9896356  0.25961038]
 [0.49262956 0.37964103 0.6773961  0.14222622]
 [0.24302055 0.5625525  0.5205382  0.15198916]
 [0.38390493 0.9842511  0.2981003  0.23134524]
 [0.04426325 0.44369084 0.36672053 0.07548269]
 [0.9181404  

In addition, each image has a tag as its attribute, in the form `attr = {"id": id, "tag": tag}`.

In [3]:
def random_tag():
    return ["cat", "dog", "horse", "rabbit"][np.random.randint(4)]   

attributes = [{"id": n, "tag": random_tag()} for n in range(N)]
pprint(attributes)

[{'id': 0, 'tag': 'dog'},
 {'id': 1, 'tag': 'horse'},
 {'id': 2, 'tag': 'horse'},
 {'id': 3, 'tag': 'dog'},
 {'id': 4, 'tag': 'cat'},
 {'id': 5, 'tag': 'horse'},
 {'id': 6, 'tag': 'cat'},
 {'id': 7, 'tag': 'rabbit'},
 {'id': 8, 'tag': 'rabbit'},
 {'id': 9, 'tag': 'dog'},
 {'id': 10, 'tag': 'horse'},
 {'id': 11, 'tag': 'horse'},
 {'id': 12, 'tag': 'rabbit'},
 {'id': 13, 'tag': 'horse'},
 {'id': 14, 'tag': 'cat'},
 {'id': 15, 'tag': 'dog'},
 {'id': 16, 'tag': 'rabbit'},
 {'id': 17, 'tag': 'cat'},
 {'id': 18, 'tag': 'dog'},
 {'id': 19, 'tag': 'dog'},
 {'id': 20, 'tag': 'horse'},
 {'id': 21, 'tag': 'horse'},
 {'id': 22, 'tag': 'rabbit'},
 {'id': 23, 'tag': 'dog'},
 {'id': 24, 'tag': 'cat'},
 {'id': 25, 'tag': 'rabbit'},
 {'id': 26, 'tag': 'cat'},
 {'id': 27, 'tag': 'dog'},
 {'id': 28, 'tag': 'rabbit'},
 {'id': 29, 'tag': 'horse'}]


Then let's set up a Rii instance for the search:

In [4]:
codec = nanopq.PQ(M=2, Ks=10).fit(vecs=X)
e = rii.Rii(fine_quantizer=codec).add_configure(vecs=X)

M: 2, Ks: 10, code_dtype: <class 'numpy.uint8'>
iter: 20, seed: 123
Training the subspace: 0 / 2
Training the subspace: 1 / 2
Encoding the subspace: 0 / 2
Encoding the subspace: 1 / 2
===== Threshold selection ====
L: [6]
threshold: [30]
polyfit coeff: [0, 30]
resultant func:  
30


## Search

Given a new query vector and a target tag, our task is to find items that (1) have the target tag and (2) are similar to the query. Let us prepare the query and the tag:

In [5]:
query = np.random.random((D, )).astype(np.float32)
target_tag = "dog"
print("query vector:", query)
print("target tag:", target_tag)


query vector: [0.46426806 0.5684307  0.30254945 0.4973088 ]
target tag: dog


Let's run the tag search, i.e., collect the IDs of the items that have the target tag. For simplicity, we run an exhaustive scan over the attributes. Any fast algorithm or library could be used here instead, such as SQL, pandas, or Elasticsearch.

In [6]:
target_ids = np.array([attr["id"] for attr in attributes if attr['tag'] == target_tag])
print("target IDs:", target_ids)

target IDs: [ 0  3  9 15 18 19 23 27]
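
The exhaustive scan above touches every attribute on each lookup. As one possible alternative, the sketch below builds a small inverted index (tag to list of IDs) once, so that later lookups are a single dictionary access. The helper name `build_tag_index` and the toy attribute values are illustrative assumptions, not part of Rii or of this notebook's data.

```python
from collections import defaultdict

# Toy attributes in the same {"id": ..., "tag": ...} form as above (made-up values).
attributes = [
    {"id": 0, "tag": "dog"},
    {"id": 1, "tag": "horse"},
    {"id": 2, "tag": "cat"},
    {"id": 3, "tag": "dog"},
    {"id": 4, "tag": "rabbit"},
]


def build_tag_index(attributes):
    """Hypothetical helper: map each tag to the sorted list of IDs carrying it."""
    index = defaultdict(list)
    for attr in attributes:
        index[attr["tag"]].append(attr["id"])
    # Convert to a plain dict so that missing tags do not create empty entries.
    return {tag: sorted(ids) for tag, ids in index.items()}


tag_index = build_tag_index(attributes)

# Look up the IDs for one tag; unknown tags yield an empty list.
dog_ids = tag_index.get("dog", [])
print("dog IDs:", dog_ids)
```

The resulting list can be converted with `np.array(dog_ids)` and used in place of `target_ids` in the cells that follow.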


With Rii, we can efficiently find the feature vectors most similar to the query, with the search restricted to the target IDs.

In [7]:
ids, dists = e.query(q=query, target_ids=target_ids, topk=3)
print("top-3 IDs:", ids)
print("top-3 dists:", dists)

top-3 IDs: [ 9 27 19]
top-3 dists: [0.09783517 0.1416896  0.22612099]


As this demo shows, Rii can easily handle this kind of combined tag-and-ANN search. In existing search pipelines, this can be achieved by running the ANN search first and then filtering the results by tag. However, that approach fails when the ANN results contain no items with the target tag.
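
The failure mode described above can be reproduced with a tiny hand-made dataset. The sketch below is an assumption-laden illustration: it uses exact brute-force search in NumPy as a stand-in for an ANN index (it does not call Rii), and the helper `exact_topk`, the six 2-D vectors, and their tags are all made up for the example.

```python
import numpy as np


def exact_topk(X, query, ids, topk):
    """Hypothetical helper: the topk IDs from `ids` closest to `query` (squared L2)."""
    dists = np.sum((X[ids] - query) ** 2, axis=1)
    order = np.argsort(dists)[:topk]
    return ids[order], dists[order]


# Six 2-D items: IDs 0-3 ("cat") sit near the query, IDs 4-5 ("dog") are far away.
X = np.array([[0.0, 0.1], [0.1, 0.0], [0.1, 0.1], [0.2, 0.0],
              [0.9, 0.9], [1.0, 0.8]], dtype=np.float32)
tags = np.array(["cat", "cat", "cat", "cat", "dog", "dog"])
query = np.zeros(2, dtype=np.float32)
target_tag, topk = "dog", 2

# Post-filtering: search all items first, then drop those with the wrong tag.
all_ids = np.arange(len(X))
nn_ids, _ = exact_topk(X, query, all_ids, topk)
post_ids = nn_ids[tags[nn_ids] == target_tag]

# Subset search: restrict the candidates to the tagged IDs, then search.
target_ids = np.flatnonzero(tags == target_tag)
pre_ids, _ = exact_topk(X, query, target_ids, topk)

print("search-then-filter:", post_ids)
print("filter-then-search:", pre_ids)
```

Here the two nearest items overall are both tagged "cat", so search-then-filter returns nothing, while searching only within the "dog" IDs returns items 4 and 5.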