In [1]:
%load_ext autoreload
%autoreload 2

## Using MTAnnoy Index

Annoy is a library for approximate nearest neighbour searches. The MTAnnoy class makes it easier to work with HathiTrust volumes.

See [CreatingMTAnnoyIndex](./CreatingMTAnnoyIndex.ipynb) for an example of building the index.

In [2]:
from compare_tools.MTAnnoy import MTAnnoy
ann = MTAnnoy('testsetGlove3.ann', dims=300)

Annoy identifies items by integer ID. MTAnnoy maps each integer ID to its HathiTrust volume ID (htid) and its chunk ID (mtid).

In [4]:
ann.get_htid_by_id(30), ann.get_mtid_by_id(30)

('hvd.32044020608253', 'hvd.32044020608253-0020')

`get_nns_by_item` takes an integer ID and returns the nearest neighbours as named mtids (the HathiTrust ID followed by a hyphen and a four-character, zero-padded chunk sequence number). It is usually easier to query by name with `get_nns_by_mtid`.

In [6]:
ann.get_nns_by_item(30, 3)

['uc2.ark:/13960/t26975w91-0016',
 'mdp.39015013285484-0032',
 'uc1.$b160093-0009']

In [7]:
ann.get_nns_by_mtid('hvd.32044020608253-0020', 3)

['uc2.ark:/13960/t26975w91-0016',
 'mdp.39015013285484-0032',
 'uc1.$b160093-0009']
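
Because an mtid is just the HathiTrust ID plus a chunk suffix, results can be split back into volume and chunk with plain string handling. The sketch below assumes the suffix always follows the last hyphen, as in the examples above; `split_mtid` is a hypothetical helper, not part of `compare_tools`.

```python
def split_mtid(mtid):
    """Split an mtid into (htid, chunk sequence number).

    Assumes the chunk suffix follows the last hyphen, as in the
    examples above. Hypothetical helper, not part of compare_tools.
    """
    htid, seq = mtid.rsplit('-', 1)
    return htid, int(seq)

split_mtid('hvd.32044020608253-0020')  # ('hvd.32044020608253', 20)
```

Splitting from the right keeps the helper safe even if an htid itself contains a hyphen.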

MTAnnoy is not a subclass of Annoy; it just wraps it. The underlying Annoy index, memory-mapped from disk, is available as `MTAnnoy.u`.

In [6]:
ann, ann.u

(<compare_tools.MTAnnoy.MTAnnoy at 0x7f6481f2dd30>,
 <annoy.Annoy at 0x7f6472c4bf70>)

In [8]:
ann.u.get_nns_by_item(30, 3)

[47212, 17247, 58029]
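
The named results can be rebuilt from the raw index by translating each integer ID. This is a sketch that uses only the calls shown in this notebook, and it is not necessarily how `MTAnnoy` implements the lookup internally; `named_nns` is a hypothetical name.

```python
def named_nns(ann, item, n):
    """Query the wrapped Annoy index, then map integer IDs to mtids.

    `ann` is assumed to be an MTAnnoy-like object exposing `.u` and
    `get_mtid_by_id`, as shown above. Hypothetical helper.
    """
    ids = ann.u.get_nns_by_item(item, n)
    return [ann.get_mtid_by_id(i) for i in ids]
```

If the wrapper behaves as the cells above suggest, `named_nns(ann, 30, 3)` should return the same list as `ann.get_nns_by_item(30, 3)`.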

To get matches with distances and ranks in a DataFrame, use `get_named_result_df`. This is useful in higher-level methods.

In [12]:
# The self-match is dropped, so n=5 returns 4 results. Changing
# that in code seemed more confusing, since the n elsewhere includes
# the self-match
ann.get_named_result_df('hvd.32044020608253-0020', n=5)

| | target | target_seq | match | match_seq | dist | match_rank |
|---|---|---|---|---|---|---|
| 0 | uc2.ark:/13960/t26975w91 | 16 | mdp.39015013285484 | 32 | 0.198782 | 1 |
| 1 | uc2.ark:/13960/t26975w91 | 16 | uc1.$b160093 | 9 | 0.211221 | 2 |
| 2 | uc2.ark:/13960/t26975w91 | 16 | njp.32101067176550 | 22 | 0.217921 | 3 |
| 3 | uc2.ark:/13960/t26975w91 | 16 | uc2.ark:/13960/t5j962g70 | 13 | 0.221345 | 4 |


Per-volume metadata, indexed by htid, is in `ann.ind`.

In [22]:
ann.ind.head(2)

| htid | min | max | length |
|---|---|---|---|
| aeu.ark:/13960/t0bv8gh1j | 11230 | 11231 | 2 |
| aeu.ark:/13960/t0vq4jb2c | 53970 | 53990 | 21 |
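
In the two rows shown, `length` equals `max - min + 1`, which suggests that `min` and `max` bound a contiguous run of integer IDs for each volume's chunks. That reading is an assumption rather than something this notebook confirms. Under it, the IDs for one volume could be listed like this (`volume_ids` is a hypothetical helper):

```python
def volume_ids(ann, htid):
    """List the integer IDs of one volume's chunks.

    Assumes ann.ind is indexed by htid and that min/max bound a
    contiguous, inclusive ID range (an unconfirmed reading).
    """
    row = ann.ind.loc[htid]
    return list(range(int(row['min']), int(row['max']) + 1))
```

Each returned ID could then be passed to `ann.get_mtid_by_id` to recover the chunk names.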


## Similarity Meta