In [1]:
%load_ext autoreload
%autoreload 2

## Using MTAnnoy Index

Annoy is a library for approximate nearest neighbour searches. The MTAnnoy class makes it easier to work with HathiTrust volumes.

See [CreatingMTAnnoyIndex](./CreatingMTAnnoyIndex.ipynb) for an example of building the index.

In [None]:
from compare_tools.MTAnnoy import MTAnnoy
ann = MTAnnoy('testsetGlove3.ann', dims=300)

Metadata is mapped to the integer IDs that Annoy uses, so an integer ID can be translated to a HathiTrust volume ID (htid) or a chunk ID (mtid).

In [3]:
ann.get_htid_by_id(30), ann.get_mtid_by_id(30)

('hvd.32044020608253', 'hvd.32044020608253-0020')

`get_nns_by_item` returns named mtids: the HathiTrust ID followed by a hyphen and a four-digit chunk sequence number. However, it is usually easier to work with `get_nns_by_mtid`, which takes an mtid instead of an integer ID.

In [4]:
ann.get_nns_by_item(30, 3)

['uc2.ark:/13960/t26975w91-0016',
 'mdp.39015013285484-0032',
 'uc1.$b160093-0009']

In [5]:
ann.get_nns_by_mtid('hvd.32044020608253-0020', 3)

['uc2.ark:/13960/t26975w91-0016',
 'mdp.39015013285484-0032',
 'uc1.$b160093-0009']
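The mtid format shown above (htid, hyphen, four-digit sequence number) can be taken apart and rebuilt with plain string operations. The following is a minimal sketch; `split_mtid` and `make_mtid` are hypothetical helper names for illustration and are not part of `compare_tools`.

```python
def split_mtid(mtid):
    """Split an mtid into (htid, chunk sequence number)."""
    # Split on the last hyphen only, in case the htid itself contains one.
    htid, seq = mtid.rsplit("-", 1)
    return htid, int(seq)


def make_mtid(htid, seq):
    """Join an htid and a chunk number, zero-padding the number to four digits."""
    return f"{htid}-{seq:04d}"


print(split_mtid("uc2.ark:/13960/t26975w91-0016"))
print(make_mtid("hvd.32044020608253", 20))
```

Splitting from the right keeps the sketch safe for identifiers that might contain a hyphen of their own.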

MTAnnoy is not a subclass of Annoy; it just wraps it. The underlying Annoy index, memory-mapped from disk, is available as `MTAnnoy.u`.

In [6]:
ann, ann.u

(<compare_tools.MTAnnoy.MTAnnoy at 0x7f5e6c26d160>,
 <annoy.Annoy at 0x7f5e6c2c3b30>)

In [7]:
ann.u.get_nns_by_item(30, 3)

[47212, 17247, 58029]
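The wrapping pattern can be illustrated with a small sketch. Everything here is an assumption made for illustration: `ToyIndex` is a brute-force stand-in for a real Annoy index, and `NamedIndex` is a hypothetical wrapper, not the actual MTAnnoy implementation. It only shows how a wrapper can return names while the integer-keyed index stays reachable under `.u`.

```python
class ToyIndex:
    """Stand-in for an Annoy index: brute-force search over integer-keyed vectors."""

    def __init__(self, vectors):
        self.vectors = vectors

    def get_nns_by_item(self, i, n):
        # Rank every item by squared Euclidean distance to item i
        # (item i itself is at distance zero).
        def dist(j):
            return sum((a - b) ** 2 for a, b in zip(self.vectors[i], self.vectors[j]))

        return sorted(range(len(self.vectors)), key=dist)[:n]


class NamedIndex:
    """Wraps an index (composition, not inheritance) and translates IDs to names."""

    def __init__(self, index, names):
        self.u = index      # the underlying integer-keyed index
        self.names = names  # names[i] is the label for integer ID i

    def get_nns_by_item(self, i, n):
        return [self.names[j] for j in self.u.get_nns_by_item(i, n)]


vectors = [[0.0, 0.0], [1.0, 0.0], [5.0, 5.0]]
named = NamedIndex(ToyIndex(vectors), ["a-0001", "a-0002", "b-0001"])
print(named.get_nns_by_item(0, 2))    # names from the wrapper
print(named.u.get_nns_by_item(0, 2))  # integer IDs from the wrapped index
```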

To get matches with distances and ranks in a DataFrame, use `get_named_result_df`. This is useful in higher-level methods.

In [8]:
# The self-match is dropped, so n=5 returns 4 results. Changing
# that in code seemed more confusing, since the n elsewhere includes
# the self-match
ann.get_named_result_df(mtid='hvd.32044020608253-0020', n=5)

| | target | target_seq | match | match_seq | dist | rank |
|---|---|---|---|---|---|---|
| 0 | hvd.32044020608253 | 20 | uc2.ark:/13960/t26975w91 | 16 | 0.194983 | 0 |
| 1 | hvd.32044020608253 | 20 | mdp.39015013285484 | 32 | 0.198782 | 1 |
| 2 | hvd.32044020608253 | 20 | uc1.$b160093 | 9 | 0.211221 | 2 |
| 3 | hvd.32044020608253 | 20 | njp.32101067176550 | 22 | 0.217921 | 3 |
| 4 | hvd.32044020608253 | 20 | uc2.ark:/13960/t5j962g70 | 13 | 0.221345 | 4 |
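As a rough illustration of how rows like these could be assembled, the sketch below turns a list of neighbour mtids and distances into ranked records and drops the self-match, as the comment in the cell describes. The function name `named_rows` and the toy identifiers are hypothetical, and the real method returns a DataFrame rather than a list of dicts.

```python
def named_rows(target_mtid, neighbours, distances):
    """Build ranked match records, skipping the target's match with itself."""
    target, target_seq = target_mtid.rsplit("-", 1)
    rows = []
    for mtid, dist in zip(neighbours, distances):
        if mtid == target_mtid:
            continue  # drop the self-match
        match, match_seq = mtid.rsplit("-", 1)
        rows.append({
            "target": target,
            "target_seq": int(target_seq),
            "match": match,
            "match_seq": int(match_seq),
            "dist": dist,
            "rank": len(rows),  # rank counts only the rows that were kept
        })
    return rows


# Toy input: the query chunk plus two neighbours, nearest first.
rows = named_rows(
    "vol_a-0003",
    ["vol_a-0003", "vol_b-0010", "vol_c-0002"],
    [0.0, 0.25, 0.5],
)
for row in rows:
    print(row)
```

Because the self-match is removed after the search, a request for `n` neighbours yields `n - 1` rows in this sketch.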


It can also be run for all the chunks in a volume by passing an `htid` instead of an `mtid`:

In [9]:
ann.get_named_result_df(htid='hvd.32044020608253', n=5).head(10)

| | target | target_seq | match | match_seq | dist | rank |
|---|---|---|---|---|---|---|
| 0 | hvd.32044020608253 | 1 | uc2.ark:/13960/t26975w91 | 16 | 0.246429 | 0 |
| 1 | hvd.32044020608253 | 1 | hvd.ah6mfg | 16 | 0.252218 | 1 |
| 2 | hvd.32044020608253 | 1 | hvd.hwkffx | 90 | 0.260438 | 2 |
| 3 | hvd.32044020608253 | 1 | mdp.39015000603004 | 29 | 0.267264 | 3 |
| 4 | hvd.32044020608253 | 1 | wu.89094554524 | 12 | 0.270478 | 4 |
| 0 | hvd.32044020608253 | 2 | uc1.b4632935 | 7 | 0.37648 | 0 |
| 1 | hvd.32044020608253 | 2 | hvd.hwdrt8 | 198 | 0.413332 | 1 |
| 2 | hvd.32044020608253 | 2 | hvd.hx14aq | 34 | 0.427328 | 2 |
| 3 | hvd.32044020608253 | 2 | uiug.30112098216234 | 12 | 0.433767 | 3 |
| 4 | hvd.32044020608253 | 2 | uiug.30112098216234 | 6 | 0.433869 | 4 |


Volume-level metadata is in `ann.ind`, which is indexed by htid.

In [10]:
ann.ind.head(2)

| htid | min | max | length |
|---|---|---|---|
| aeu.ark:/13960/t0bv8gh1j | 11230 | 11231 | 2 |
| aeu.ark:/13960/t0vq4jb2c | 53970 | 53990 | 21 |
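In the two rows shown, `length` equals `max - min + 1`, which suggests that `min` and `max` bound a contiguous, inclusive range of integer IDs for each volume. That reading is an assumption, not something the notebook states. Under it, a lookup from integer ID to volume could be sketched as follows, with a plain dict standing in for `ann.ind` and hypothetical function names.

```python
# Toy copy of the two rows above: htid -> (min, max) integer IDs.
ranges = {
    "aeu.ark:/13960/t0bv8gh1j": (11230, 11231),
    "aeu.ark:/13960/t0vq4jb2c": (53970, 53990),
}


def chunk_count(htid):
    """Number of IDs in a volume's range, assuming the range is inclusive."""
    lo, hi = ranges[htid]
    return hi - lo + 1


def htid_for_id(i):
    """Return the volume whose ID range contains i, or None if there is none."""
    for htid, (lo, hi) in ranges.items():
        if lo <= i <= hi:
            return htid
    return None


print(chunk_count("aeu.ark:/13960/t0vq4jb2c"))
print(htid_for_id(11231))
```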


A method that summarizes the output of `get_named_result_df` into basic per-volume stats is provided as `ann.doc_match_stats`.

In [11]:
stats = ann.doc_match_stats(htid='hvd.hnmf88', n=20)
stats.head()

| | match | target | count | mean | length | prop_target | prop_match |
|---|---|---|---|---|---|---|---|
| 0 | nnc2.ark:/13960/t9r21mv76 | hvd.hnmf88 | 6 | 0.280424 | 184 | 0.139535 | 0.032609 |
| 1 | uc1.b5053819 | hvd.hnmf88 | 2 | 0.286049 | 71 | 0.046512 | 0.028169 |
| 2 | mdp.39015013285484 | hvd.hnmf88 | 2 | 0.235767 | 54 | 0.046512 | 0.037037 |
| 3 | uva.x001475639 | hvd.hnmf88 | 1 | 0.411731 | 14 | 0.023256 | 0.071429 |
| 4 | mdp.39015077918871 | hvd.hnmf88 | 1 | 0.397935 | 40 | 0.023256 | 0.025 |
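The figures above are consistent with `prop_match` being `count / length` (for example, 6 / 184 ≈ 0.032609) and `prop_target` being `count` divided by the number of chunks in the target volume (6 / 43 ≈ 0.139535). That interpretation is inferred from the output rather than documented. A minimal sketch of such an aggregation, using plain dicts and a hypothetical `match_stats` function instead of the real method, might look like this:

```python
from collections import defaultdict


def match_stats(rows, target_length, match_lengths):
    """Summarize chunk-level matches into one record per matched volume."""
    # Collect the distances of every chunk-level match, grouped by matched volume.
    grouped = defaultdict(list)
    for row in rows:
        grouped[row["match"]].append(row["dist"])

    stats = []
    for match, dists in grouped.items():
        count = len(dists)
        stats.append({
            "match": match,
            "count": count,
            "mean": sum(dists) / count,
            "length": match_lengths[match],
            "prop_target": count / target_length,
            "prop_match": count / match_lengths[match],
        })
    # Volumes with the most matching chunks come first.
    stats.sort(key=lambda s: s["count"], reverse=True)
    return stats


# Toy input: three chunk-level matches against two volumes.
toy_rows = [
    {"match": "vol_b", "dist": 0.2},
    {"match": "vol_c", "dist": 0.5},
    {"match": "vol_b", "dist": 0.4},
]
toy_stats = match_stats(toy_rows, target_length=4, match_lengths={"vol_b": 10, "vol_c": 5})
for record in toy_stats:
    print(record)
```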
