### Akin - Example Usage

In [1]:
from akin import MultiHash, LSH
import pandas as pd

In [5]:
content = pd.DataFrame(
    {
        'text': [
            'Jupiter is primarily composed of hydrogen with a quarter of its mass being helium',
            'Jupiter moving out of the inner Solar System would have allowed the formation of inner planets.',
            'A helium atom has about four times as much mass as a hydrogen atom, so the composition changes '
            'when described as the proportion of mass contributed by different atoms.',
            'Jupiter is primarily composed of hydrogen and a quarter of its mass being helium',
            'A helium atom has about four times as much mass as a hydrogen atom and the composition changes '
            'when described as a proportion of mass contributed by different atoms.',
            'Theoretical models indicate that if Jupiter had much more mass than it does at present, it '
            'would shrink.',
            'This process causes Jupiter to shrink by about 2 cm each year.',
            'Jupiter is mostly composed of hydrogen with a quarter of its mass being helium',
            'The Great Red Spot is large enough to accommodate Earth within its boundaries.'
        ]
    }
)

In [6]:
content

| | text |
|---|---|
| 0 | Jupiter is primarily composed of hydrogen with... |
| 1 | Jupiter moving out of the inner Solar System w... |
| 2 | A helium atom has about four times as much mas... |
| 3 | Jupiter is primarily composed of hydrogen and ... |
| 4 | A helium atom has about four times as much mas... |
| 5 | Theoretical models indicate that if Jupiter ha... |
| 6 | This process causes Jupiter to shrink by about... |
| 7 | Jupiter is mostly composed of hydrogen with a ... |
| 8 | The Great Red Spot is large enough to accommod... |


**Create MinHash object:**

In [7]:
minhash = MultiHash(n_gram=9, permutations=100, hash_bits=64, seed=3)

In [8]:
signatures = minhash.transform(content['text'])

In [9]:
content['signature'] = signatures
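
For intuition about what a signature contains, the following is a minimal pure-Python sketch of MinHash. It is not akin's implementation: the helper names (`shingles`, `seeded_hash`, `minhash_signature`, `estimated_jaccard`), the use of character 9-grams, and the choice of `hashlib.blake2b` as the seeded hash are all assumptions made for illustration. The idea is that each position in the signature stores the smallest hash of the text's n-grams under one hash function, so the fraction of positions on which two signatures agree approximates the Jaccard similarity of the two n-gram sets.

```python
import hashlib


def shingles(text, n=9):
    """Return the set of character n-grams in `text`."""
    if len(text) <= n:
        return {text}
    return {text[i:i + n] for i in range(len(text) - n + 1)}


def seeded_hash(gram, seed):
    """64-bit hash of `gram`; each `seed` behaves like a different hash function."""
    digest = hashlib.blake2b(
        gram.encode("utf-8"), digest_size=8, salt=seed.to_bytes(8, "little")
    ).digest()
    return int.from_bytes(digest, "little")


def minhash_signature(text, n=9, permutations=100):
    """Keep the smallest shingle hash under each of `permutations` hash functions."""
    grams = shingles(text, n)
    return tuple(
        min(seeded_hash(gram, seed) for gram in grams)
        for seed in range(permutations)
    )


def estimated_jaccard(sig_a, sig_b):
    """Fraction of signature positions on which the two signatures agree."""
    matches = sum(a == b for a, b in zip(sig_a, sig_b))
    return matches / len(sig_a)


a = minhash_signature(
    "Jupiter is primarily composed of hydrogen with a quarter of its mass being helium"
)
b = minhash_signature(
    "Jupiter is primarily composed of hydrogen and a quarter of its mass being helium"
)
print(estimated_jaccard(a, b))
```

More permutations give a tighter estimate at the cost of longer signatures, which is the trade-off behind the `permutations` argument above.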

**Create LSH object:**

In [10]:
lsh = LSH(no_of_bands=50)

In [11]:
lsh.update(signatures)
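
The sketch below illustrates the standard banding technique behind LSH. It assumes the signature is split evenly into bands (100 permutations over 50 bands would give 2 rows per band) and that texts sharing any identical band become candidate pairs. Whether akin splits and buckets signatures exactly this way is an assumption, and `split_into_bands`, `build_buckets`, and `candidate_probability` are hypothetical names.

```python
from collections import defaultdict


def split_into_bands(signature, no_of_bands):
    """Split a signature into equally sized bands of consecutive values."""
    if len(signature) % no_of_bands != 0:
        raise ValueError("signature length must be divisible by no_of_bands")
    rows = len(signature) // no_of_bands
    return [tuple(signature[i:i + rows]) for i in range(0, len(signature), rows)]


def build_buckets(signatures, no_of_bands):
    """Map (band index, band values) to the ids of signatures sharing that band."""
    buckets = defaultdict(set)
    for doc_id, signature in enumerate(signatures):
        for band_index, band in enumerate(split_into_bands(signature, no_of_bands)):
            buckets[(band_index, band)].add(doc_id)
    return buckets


def candidate_probability(jaccard, no_of_bands, rows):
    """Chance that two texts with this Jaccard similarity share at least one band."""
    return 1 - (1 - jaccard ** rows) ** no_of_bands


# 100 permutations in 50 bands gives 2 rows per band.
for s in (0.1, 0.3, 0.5, 0.7):
    print(s, round(candidate_probability(s, no_of_bands=50, rows=2), 3))
```

With many bands and few rows per band, even moderately similar texts are very likely to collide in some band, so a setting like this favours recall and relies on a later similarity threshold to filter candidates.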

**Query to find near duplicates for text 1 (row 0):**

In [12]:
text_1_minhash = signatures[0]

In [14]:
near_duplicates = lsh.query(text_1_minhash, min_jaccard=0.5)

In [15]:
content.loc[content['signature'].isin(near_duplicates), 'text']

3    Jupiter is primarily composed of hydrogen and ...
7    Jupiter is mostly composed of hydrogen with a ...
Name: text, dtype: object
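
As a sanity check on the `min_jaccard=0.5` threshold, the exact Jaccard similarity between the query text and the two returned rows can be computed directly. This is a sketch under the assumption that similarity is measured over character 9-grams; `shingles` and `jaccard` are hypothetical helpers, not part of akin.

```python
def shingles(text, n=9):
    """Return the set of character n-grams in `text`."""
    if len(text) <= n:
        return {text}
    return {text[i:i + n] for i in range(len(text) - n + 1)}


def jaccard(a, b, n=9):
    """Exact Jaccard similarity of the n-gram sets of two texts."""
    grams_a, grams_b = shingles(a, n), shingles(b, n)
    return len(grams_a & grams_b) / len(grams_a | grams_b)


query = "Jupiter is primarily composed of hydrogen with a quarter of its mass being helium"
matches = {
    3: "Jupiter is primarily composed of hydrogen and a quarter of its mass being helium",
    7: "Jupiter is mostly composed of hydrogen with a quarter of its mass being helium",
}

# Print the exact similarity of each returned row to the query text.
for row, text in matches.items():
    print(row, round(jaccard(query, text), 3))
```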

**Generate MinHash signatures for new texts and add them to the LSH model:**

In [17]:
new_text = [
    'Jupiter is primarily composed of hydrogen with a quarter of its mass being helium',
    'Jupiter moving out of the inner Solar System would have allowed the formation of '
    'inner planets.'
]

In [18]:
new_signatures = minhash.transform(new_text)
lsh.update(new_signatures)

**Inspect the MinHash signatures stored in the model:**

In [19]:
lsh.get_minhashes()

{(-9220181301477941095,
  -9070612480513123769,
  -8758334065048863870,
  -9088599755423898248,
  -9112237402113485154,
  -8724804833834128190,
  -9021086049868851433,
  -9104488982795034825,
  -8447732875455317765,
  -9217921711869696650,
  -8824513710337024782,
  -9210625045199069429,
  -9053985727061122284,
  -9051570627014427002,
  -9199199950870842079,
  -8971423630062409303,
  -8800578919216967380,
  -9170044917287851553,
  -9190785196225892758,
  -8659574165287706090,
  -9200757596398149156,
  -9095662641992575632,
  -9121808719474503607,
  -9081440341094702971,
  -8654682464023772450,
  -9120493561370521530,
  -9174317440215446398,
  -9199140433181872723,
  -8861390946988569149,
  -9173216789070224585,
  -9044845962479753263,
  -8742355895639115659,
  -8783381914956342161,
  -9198692403682590824,
  -9193180543760552918,
  -9198116881577692843,
  -8545429115445750215,
  -9082476343236263349,
  -9121405371872816943,
  -8879611928913067939,
  -8757852097886192412,
  -8807576779975

**Remove a text from the model by its signature:**

In [22]:
# Remove the signature of row 4, then list the signatures left in the model.
lsh.remove([signatures[4]])
lsh.get_minhashes()

KeyError: 1117104749162048440

**Return adjacency list for all similar texts:**

In [12]:
adjacency_list = lsh.adjacency_list(min_jaccard=0.55)
adjacency_list

{1: ['doc1', 4],
 2: ['doc2'],
 3: [],
 4: [1, 'doc1'],
 6: [],
 7: [],
 8: [],
 9: [],
 'doc1': [1, 4],
 'doc2': [2]}
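
For small inputs, an adjacency list like the one above can be cross-checked against a brute-force reference that compares every pair of texts exactly. The sketch below assumes character 9-gram Jaccard similarity; `brute_force_adjacency` is a hypothetical helper, not an akin function, and it scales quadratically with the number of texts, which is the cost LSH is meant to avoid.

```python
from itertools import combinations


def shingles(text, n=9):
    """Return the set of character n-grams in `text`."""
    if len(text) <= n:
        return {text}
    return {text[i:i + n] for i in range(len(text) - n + 1)}


def brute_force_adjacency(texts, min_jaccard, n=9):
    """Map each label to the labels whose exact Jaccard similarity meets the threshold."""
    grams = {label: shingles(text, n) for label, text in texts.items()}
    adjacency = {label: [] for label in texts}
    for a, b in combinations(texts, 2):
        similarity = len(grams[a] & grams[b]) / len(grams[a] | grams[b])
        if similarity >= min_jaccard:
            adjacency[a].append(b)
            adjacency[b].append(a)
    return adjacency


texts = {
    0: "Jupiter is primarily composed of hydrogen with a quarter of its mass being helium",
    3: "Jupiter is primarily composed of hydrogen and a quarter of its mass being helium",
    8: "The Great Red Spot is large enough to accommodate Earth within its boundaries.",
}
print(brute_force_adjacency(texts, min_jaccard=0.55))
# {0: [3], 3: [0], 8: []}
```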