In [1]:
%load_ext autoreload
%autoreload 2

## Using MTAnnoy Index

Annoy is a library for approximate nearest neighbour searches. The MTAnnoy class makes it easier to work with HathiTrust volumes.

See [CreatingMTAnnoyIndex](./CreatingMTAnnoyIndex.ipynb) for an example of building the index.

In [1]:
from compare_tools.MTAnnoy import MTAnnoy
ann = MTAnnoy('testsetGlove3.ann', dims=300)

MTAnnoy maps the integer IDs that Annoy uses to volume metadata: the HathiTrust volume ID (htid) and the chunk ID (mtid).

In [5]:
ann.get_htid_by_id(30), ann.get_mtid_by_id(30)

('uc1.31822023936982', 'uc1.31822023936982-0001')

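`get_nns_by_item` takes an Annoy integer ID and returns the nearest neighbours as named mtids (an mtid is the HathiTrust ID followed by a hyphen and the four-character chunk sequence number). The query item itself comes back as the first result. It is usually easier to query by name with `get_nns_by_mtid`.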

In [6]:
ann.get_nns_by_item(30, 3)

['uc1.31822023936982-0001',
 'inu.32000011561851-0002',
 'nc01.ark:/13960/t4pk0nx3c-0020']

In [19]:
ann.get_nns_by_mtid('nc01.ark:/13960/t4pk0nx3c-0020', 3)

['nc01.ark:/13960/t4pk0nx3c-0020',
 'nc01.ark:/13960/t4pk0nx3c-0002',
 'nc01.ark:/13960/t4pk0nx3c-0014']
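
The mtid format described above can be taken apart and rebuilt with plain string handling. The following is a minimal sketch; `split_mtid` and `make_mtid` are hypothetical helper names used for illustration and are not part of MTAnnoy.

```python
def split_mtid(mtid):
    """Split an mtid into (htid, chunk sequence number)."""
    # Split on the last hyphen only, in case an htid itself contains hyphens.
    htid, seq = mtid.rsplit('-', 1)
    return htid, int(seq)


def make_mtid(htid, seq):
    """Join an htid and a sequence number, zero-padded to four digits."""
    return '{}-{:04d}'.format(htid, seq)


print(split_mtid('nc01.ark:/13960/t4pk0nx3c-0020'))
# ('nc01.ark:/13960/t4pk0nx3c', 20)
print(make_mtid('uc1.31822023936982', 1))
# uc1.31822023936982-0001
```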

MTAnnoy is not a subclass of Annoy; it just wraps it. The underlying Annoy index, memory-mapped from disk, is available as `MTAnnoy.u`.

In [9]:
ann, ann.u

(<compare_tools.MTAnnoy.MTAnnoy at 0x7fae5efa3518>,
 <annoy.Annoy at 0x7fae6fbb0fb0>)

In [20]:
ann.u.get_nns_by_item(30, 3)

[30, 29, 9944]
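
The wrapping pattern shown above (names on the outside, integer IDs on the inside) can be sketched without Annoy at all. This is an illustration of the design, not MTAnnoy's actual code: `BruteForceIndex` is a hypothetical stand-in for the Annoy index, `NamedIndex` is a hypothetical wrapper, and the vectors and names are made up.

```python
import math


class BruteForceIndex:
    """Hypothetical stand-in for an Annoy index: exact search, integer IDs only."""

    def __init__(self, vectors):
        self.vectors = vectors

    def get_nns_by_item(self, i, n):
        # Rank every item by Euclidean distance to item i and keep the n closest.
        query = self.vectors[i]
        order = sorted(range(len(self.vectors)),
                       key=lambda j: math.dist(query, self.vectors[j]))
        return order[:n]


class NamedIndex:
    """Wraps an index (composition, not inheritance) and translates IDs to names."""

    def __init__(self, index, names):
        self.u = index        # the underlying index stays reachable
        self.names = names    # integer ID -> name
        self.ids = {name: i for i, name in enumerate(names)}  # name -> integer ID

    def get_nns_by_item(self, i, n):
        return [self.names[j] for j in self.u.get_nns_by_item(i, n)]

    def get_nns_by_name(self, name, n):
        return self.get_nns_by_item(self.ids[name], n)


vectors = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0), (0.0, 0.3)]
names = ['vol-a-0001', 'vol-a-0002', 'vol-b-0001', 'vol-c-0001']
index = NamedIndex(BruteForceIndex(vectors), names)

print(index.u.get_nns_by_item(0, 3))
# [0, 1, 3]
print(index.get_nns_by_name('vol-a-0001', 3))
# ['vol-a-0001', 'vol-a-0002', 'vol-c-0001']
```

Keeping the raw index on an attribute means callers can still reach any lower-level method that the wrapper does not expose.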

To get matches with distances and ranks in a DataFrame, use `get_named_result_df`. This is useful in higher-level methods.

In [21]:
# n counts the self-match (rank 0, dist 0.0), so n=5 gives the target
# chunk plus 4 other matches. Excluding it in code seemed more confusing,
# since the n elsewhere also includes the self-match.
ann.get_named_result_df(mtid='nc01.ark:/13960/t4pk0nx3c-0002', n=5)

| | target | target_seq | match | match_seq | dist | rank |
|---|---|---|---|---|---|---|
| 0 | nc01.ark:/13960/t4pk0nx3c | 2 | nc01.ark:/13960/t4pk0nx3c | 2 | 0.0 | 0 |
| 1 | nc01.ark:/13960/t4pk0nx3c | 2 | nc01.ark:/13960/t4pk0nx3c | 20 | 0.092129 | 1 |
| 2 | nc01.ark:/13960/t4pk0nx3c | 2 | nc01.ark:/13960/t4pk0nx3c | 14 | 0.099151 | 2 |
| 3 | nc01.ark:/13960/t4pk0nx3c | 2 | nc01.ark:/13960/t4pk0nx3c | 8 | 0.11137 | 3 |
| 4 | nc01.ark:/13960/t4pk0nx3c | 2 | nc01.ark:/13960/t4pk0nx3c | 10 | 0.116853 | 4 |


You can also run it for all the chunks in a volume by passing `htid` instead of `mtid`:

In [22]:
ann.get_named_result_df(htid='nc01.ark:/13960/t4pk0nx3c', n=5).head(10)

| | target | target_seq | match | match_seq | dist | rank |
|---|---|---|---|---|---|---|
| 0 | nc01.ark:/13960/t4pk0nx3c | 1 | nc01.ark:/13960/t4pk0nx3c | 1 | 0.0 | 0 |
| 1 | nc01.ark:/13960/t4pk0nx3c | 1 | nyp.33433082479092 | 5 | 0.151567 | 1 |
| 2 | nc01.ark:/13960/t4pk0nx3c | 1 | mdp.39015063787983 | 6 | 0.156182 | 2 |
| 3 | nc01.ark:/13960/t4pk0nx3c | 1 | uva.x001211590 | 1 | 0.168556 | 3 |
| 4 | nc01.ark:/13960/t4pk0nx3c | 1 | mdp.39015063787983 | 4 | 0.168888 | 4 |
| 0 | nc01.ark:/13960/t4pk0nx3c | 2 | nc01.ark:/13960/t4pk0nx3c | 2 | 0.0 | 0 |
| 1 | nc01.ark:/13960/t4pk0nx3c | 2 | nc01.ark:/13960/t4pk0nx3c | 20 | 0.092129 | 1 |
| 2 | nc01.ark:/13960/t4pk0nx3c | 2 | nc01.ark:/13960/t4pk0nx3c | 14 | 0.099151 | 2 |
| 3 | nc01.ark:/13960/t4pk0nx3c | 2 | nc01.ark:/13960/t4pk0nx3c | 8 | 0.11137 | 3 |
| 4 | nc01.ark:/13960/t4pk0nx3c | 2 | nc01.ark:/13960/t4pk0nx3c | 10 | 0.116853 | 4 |


Per-volume metadata is stored in the DataFrame `ann.ind`, indexed by htid.

In [23]:
ann.ind.head(2)

| htid | min | max | length |
|---|---|---|---|
| aeu.ark:/13960/t0cv4sg1m | 5925 | 5925 | 1 |
| aeu.ark:/13960/t0ft8q48g | 10041 | 10041 | 1 |
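
One plausible reading of these columns, and it is only an assumption based on the two rows shown (one-chunk volumes with equal `min` and `max`), is that `min` and `max` are the first and last Annoy integer IDs belonging to a volume and `length` is its chunk count. Under that assumption, a table like this could be derived from an ordered list of mtids as sketched below. `build_volume_index` is a hypothetical name and the mtids are made up.

```python
def build_volume_index(mtids):
    """Map each htid to the min/max integer ID and the number of its chunks.

    Assumes the position of an mtid in the list is its integer ID.
    """
    ranges = {}
    for item_id, mtid in enumerate(mtids):
        htid = mtid.rsplit('-', 1)[0]
        if htid not in ranges:
            ranges[htid] = {'min': item_id, 'max': item_id, 'length': 0}
        entry = ranges[htid]
        entry['min'] = min(entry['min'], item_id)
        entry['max'] = max(entry['max'], item_id)
        entry['length'] += 1
    return ranges


volume_index = build_volume_index(['vol-a-0001', 'vol-a-0002', 'vol-b-0001'])
print(volume_index['vol-a'])
# {'min': 0, 'max': 1, 'length': 2}
print(volume_index['vol-b'])
# {'min': 2, 'max': 2, 'length': 1}
```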


A method that summarizes the output of `get_named_result_df` into basic per-volume statistics is provided as `ann.doc_match_stats`.

In [25]:
stats = ann.doc_match_stats(htid='aeu.ark:/13960/t0ft8q48g', n=20)
stats.head()

| | match | target | count | mean | length | prop_target | prop_match |
|---|---|---|---|---|---|---|---|
| 0 | uc1.b4385719 | aeu.ark:/13960/t0ft8q48g | 6 | 0.319545 | 24 | 6.0 | 0.25 |
| 1 | mdp.39015012591510 | aeu.ark:/13960/t0ft8q48g | 2 | 0.324776 | 3 | 2.0 | 0.666667 |
| 2 | uc1.b4377472 | aeu.ark:/13960/t0ft8q48g | 2 | 0.318841 | 30 | 2.0 | 0.066667 |
| 3 | umn.319510008952947 | aeu.ark:/13960/t0ft8q48g | 1 | 0.337614 | 57 | 1.0 | 0.017544 |
| 4 | wu.89094610227 | aeu.ark:/13960/t0ft8q48g | 1 | 0.336061 | 2 | 1.0 | 0.5 |
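
The numbers above suggest how the columns relate: `prop_match` equals `count / length` (6 / 24 = 0.25 in the first row), and `prop_target` is consistent with `count` divided by the target's own chunk count, which `ann.ind` lists as 1 for this volume. The sketch below reproduces that arithmetic in plain Python. It is a guess at the logic and not the actual implementation: `match_stats` is a hypothetical name, the input data is made up, and whether the real method excludes self-matches is an assumption.

```python
from collections import defaultdict
from statistics import mean


def match_stats(rows, lengths, target):
    """Summarize chunk-level matches per matched volume.

    rows: (match_htid, dist) pairs, one per matching chunk (self-matches
          assumed to be excluded already).
    lengths: htid -> number of chunks in that volume.
    """
    dists = defaultdict(list)
    for match, dist in rows:
        dists[match].append(dist)

    stats = []
    for match, ds in dists.items():
        count = len(ds)
        stats.append({
            'match': match,
            'target': target,
            'count': count,
            'mean': mean(ds),
            'length': lengths[match],
            'prop_target': count / lengths[target],
            'prop_match': count / lengths[match],
        })
    # Most frequently matched volumes first.
    return sorted(stats, key=lambda row: row['count'], reverse=True)


# Illustrative data only: a one-chunk target matched against two volumes.
rows = [('vol-b', 0.30), ('vol-b', 0.34), ('vol-c', 0.40)]
lengths = {'vol-a': 1, 'vol-b': 4, 'vol-c': 2}
stats = match_stats(rows, lengths, target='vol-a')
for row in stats:
    print(row['match'], row['count'], row['prop_target'], row['prop_match'])
```

With a one-chunk target, `prop_target` can exceed 1 because several chunks of a matched volume may all be neighbours of the same target chunk.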
