Akin API Guide

MinHash

Creates a MinHash object that contains a matrix of MinHash signatures, one signature per text.

MinHash Parameters

MinHash(
    text, 
    method='multi_hash', 
    n_gram=9, 
    n_gram_type='char', 
    permutations=100, 
    hash_bits=64, 
    seed=None
)

text
{list or ndarray}
Iterable containing strings of text for each text in a corpus.

method
str, optional, default: 'multi_hash'
Method for random sampling via hashing, must be 'multi_hash' or 'k_smallest_values'.
If 'multi_hash' is selected, each text is hashed once per permutation and the minimum hash value is kept each time to construct the signature.

If 'k_smallest_values' is selected, each text is hashed once and the k smallest values are kept for k permutations. This method is less computationally intensive than 'multi_hash' but also less stable.
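The difference between the two methods can be illustrated with a minimal sketch. This is not the library's implementation: the helper names (`hash_shingle`, `multi_hash_signature`, `k_smallest_signature`) are hypothetical, and a salted BLAKE2b hash is assumed purely for illustration.

```python
import hashlib


def hash_shingle(shingle: str, seed: int) -> int:
    """Hash a shingle to a 64-bit integer; `seed` selects the hash function."""
    digest = hashlib.blake2b(
        shingle.encode("utf-8"),
        digest_size=8,
        salt=seed.to_bytes(8, "little"),
    ).digest()
    return int.from_bytes(digest, "little")


def multi_hash_signature(shingles: set, permutations: int) -> list:
    # One hash function per permutation; keep the minimum value from each.
    return [
        min(hash_shingle(s, seed) for s in shingles)
        for seed in range(permutations)
    ]


def k_smallest_signature(shingles: set, permutations: int) -> list:
    # A single hash function; keep the k smallest distinct values.
    # Fewer than k shingles yields a signature shorter than k.
    values = sorted({hash_shingle(s, 0) for s in shingles})
    return values[:permutations]
```

The multi-hash variant hashes every shingle `permutations` times, whereas the k-smallest variant hashes each shingle only once, which is where the difference in cost comes from.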

n_gram
int, optional, default: 9
Size of each overlapping text shingle to break text into prior to hashing. Shingle size should be chosen based on average text length: too small a shingle size will yield false similarities, whereas too large a shingle size will fail to return similar documents.

For character shingles, a size of 5 is recommended for shorter texts such as emails. The default size of 9 is recommended for longer texts or documents.

n_gram_type
str, optional, default: 'char'
Type of n-gram to use for shingles, must be 'char' to split text into overlapping character shingles or 'term' to split text into overlapping sequences of words.
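As a rough illustration of how `n_gram` and `n_gram_type` interact, the sketch below splits a text into overlapping shingles. The `shingle` function is a hypothetical helper, and splitting terms on whitespace is an assumption; the library may tokenize differently.

```python
def shingle(text: str, n_gram: int = 9, n_gram_type: str = "char") -> set:
    """Return the set of overlapping shingles of size `n_gram`."""
    if n_gram_type == "char":
        units, joiner = text, ""
    elif n_gram_type == "term":
        # Assumption: terms are whitespace-separated words.
        units, joiner = text.split(), " "
    else:
        raise ValueError("n_gram_type must be 'char' or 'term'")
    # A text shorter than n_gram produces no shingles at all.
    return {
        joiner.join(units[i:i + n_gram])
        for i in range(len(units) - n_gram + 1)
    }


print(shingle("minhash", n_gram=5))
print(shingle("the quick brown fox", n_gram=2, n_gram_type="term"))
```

Note that a text shorter than the shingle size yields an empty set, which is one reason to pick a smaller `n_gram` for short texts.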

permutations
int, optional, default: 100
Number of randomly sampled hash values to use for generating each text's MinHash signature. The larger the number of permutations, the more accurate the estimated Jaccard similarity between texts, but the longer the algorithm takes to run.
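The accuracy trade-off can be made concrete with a small sketch. It assumes the multi-hash scheme, where each signature position matches with probability equal to the true Jaccard similarity, so the estimate is the fraction of matching positions. Both function names are hypothetical.

```python
import math


def estimated_jaccard(sig_a: list, sig_b: list) -> float:
    """Estimate Jaccard similarity as the share of matching positions."""
    if len(sig_a) != len(sig_b):
        raise ValueError("signatures must have the same length")
    matches = sum(a == b for a, b in zip(sig_a, sig_b))
    return matches / len(sig_a)


def standard_error(true_jaccard: float, permutations: int) -> float:
    # Each position is a Bernoulli trial, so the estimator's standard
    # error shrinks with the square root of the number of permutations.
    return math.sqrt(true_jaccard * (1 - true_jaccard) / permutations)
```

Under this assumption, quadrupling the number of permutations halves the standard error of the estimate while roughly quadrupling the hashing work.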

hash_bits
int, optional, default: 64
Hash value size, in bits, used to generate MinHash signatures from shingles; must be 32, 64 or 128. Choose the hash size based on text length and the trade-off between performance and accuracy. Smaller hash sizes risk hash collisions, which can produce false similarities between documents in larger corpora.
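To get a feel for the collision risk at each hash size, the sketch below uses the standard birthday-bound approximation for the probability that at least two of `n` distinct shingles collide. It assumes uniformly distributed hash values; `collision_probability` is a hypothetical helper, not part of the library.

```python
import math


def collision_probability(n_shingles: int, hash_bits: int) -> float:
    """Approximate P(at least one collision) among n distinct shingles."""
    space = 2 ** hash_bits
    exponent = n_shingles * (n_shingles - 1) / (2 * space)
    # 1 - exp(-x), computed with expm1 to stay accurate for tiny x.
    return -math.expm1(-exponent)


for bits in (32, 64, 128):
    print(bits, collision_probability(1_000_000, bits))
```

With a million distinct shingles, a collision is almost certain at 32 bits and very unlikely at 64 or 128 bits under this approximation.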

seed
int, optional, default: None
Seed used to generate the random hash functions. Set a seed for reproducibility, or to allow the LSH model to be updated later with new MinHash signatures.

MinHash Properties

n_gram: int

.n_gram

Returns size of each overlapping text shingle used to create minhash signatures.

n_gram_type: str

.n_gram_type

Returns type of n-gram used for text shingling.

permutations: int

.permutations

Returns number of permutations used to create signatures.

hash_bits: int

.hash_bits

Returns hash value size used to create signatures.

method: str

.method

Returns hashing method used in minhash function.

seed: int

.seed

Returns seed value used to generate random hashes in minhash function.

signatures: numpy.array

.signatures

Returns the matrix of text signatures generated by the MinHash function.
The matrix has shape n × m, where n is the number of texts (one row per text) and m is the number of selected permutations.

LSH

Creates an LSH model of text similarity that can be used to return similar texts based on estimated Jaccard similarity.

LSH Parameters

LSH(minhash=None, labels=None, no_of_bands=None)

minhash
MinHash object, optional, default: None
Minhash object containing minhash signatures returned by MinHash class.

labels
{list or ndarray}, optional, default: None
List, array or Pandas series containing unique labels for each text in minhash object signature. This should be provided in the same order as texts passed to the MinHash class. Example labels include filepaths and database ids.

no_of_bands
int, optional, default: permutations // 2
Number of bands to break each MinHash signature into before hashing into buckets. A smaller number of bands results in a stricter algorithm, because each band is larger and a larger portion of two signatures must match; this may lead to false negatives that miss some similar texts. A higher number of bands may lead to false similarities.
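The effect of the band count can be sketched with the standard LSH banding formula: with b bands of r rows each, two texts with Jaccard similarity s share at least one bucket with probability 1 - (1 - s^r)^b. The sketch assumes the signature is split evenly with r = permutations // no_of_bands; how the library handles any remainder is not covered here, and `candidate_probability` is a hypothetical helper.

```python
def candidate_probability(jaccard: float, permutations: int,
                          no_of_bands: int) -> float:
    """P(two texts share at least one bucket) under even banding."""
    rows = permutations // no_of_bands  # assumed rows per band
    return 1 - (1 - jaccard ** rows) ** no_of_bands


for bands in (10, 20, 50):
    print(bands, candidate_probability(0.5, 100, bands))
```

For a pair with similarity 0.5 and 100 permutations, the probability of becoming a candidate rises sharply as the number of bands increases, which matches the stricter behaviour described for smaller band counts.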

LSH Methods

update
Updates model from a MinHash object containing signatures generated from new texts and their corresponding labels.

.update(minhash, new_labels)

minhash: MinHash object containing signatures of new texts; its parameters must match those of any MinHash objects already in the model.
new_labels: List, array or Pandas series containing text labels.

query
Takes a label and returns the labels of any similar texts.

.query(label, min_jaccard=None, sensitivity=1)

label: Label of text to return list of similar texts for.
min_jaccard: Jaccard similarity threshold texts have to exceed to be returned as similar.
sensitivity: Number of buckets texts must share to be returned as similar.
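How `sensitivity` and `min_jaccard` narrow the results can be shown with a toy model. This is a simplified sketch rather than the library's code: `build_buckets` and `query_sketch` are hypothetical names, signatures are plain lists, and the Jaccard threshold is applied to the estimate from the signatures as a strict "exceeds" test, which is an assumption.

```python
from collections import Counter, defaultdict


def build_buckets(signatures: dict, no_of_bands: int) -> dict:
    """Map (band index, band contents) to the labels sharing that band."""
    buckets = defaultdict(set)
    for label, sig in signatures.items():
        rows = len(sig) // no_of_bands
        for band in range(no_of_bands):
            key = (band, tuple(sig[band * rows:(band + 1) * rows]))
            buckets[key].add(label)
    return buckets


def query_sketch(label, signatures, buckets, min_jaccard=None, sensitivity=1):
    # Count how many buckets each other label shares with `label`.
    shared = Counter()
    for members in buckets.values():
        if label in members:
            shared.update(members - {label})
    results = []
    for other, count in shared.items():
        if count < sensitivity:
            continue
        if min_jaccard is not None:
            sig_a, sig_b = signatures[label], signatures[other]
            estimate = sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)
            if estimate <= min_jaccard:  # must exceed the threshold
                continue
        results.append(other)
    return sorted(results)


signatures = {"a": [1, 2, 3, 4], "b": [1, 2, 3, 9],
              "c": [1, 2, 7, 8], "d": [5, 6, 7, 8]}
buckets = build_buckets(signatures, no_of_bands=2)
print(query_sketch("a", signatures, buckets))
print(query_sketch("a", signatures, buckets, min_jaccard=0.6))
```

Raising `sensitivity` or `min_jaccard` can only shrink the result list, since both act as filters on the labels that share at least one bucket.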

remove
Removes a text label and its MinHash signature from the model.

.remove(label)

label: Label of text to remove from LSH model.

contains
Returns a list of the labels contained in the model.

.contains()

adjacency_list
Returns an adjacency list that can be used to create a text similarity graph.

.adjacency_list(min_jaccard=None, sensitivity=1)

min_jaccard: Jaccard similarity threshold texts have to exceed to be returned as similar.
sensitivity: Number of buckets texts must share to be returned as similar.
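One possible use of a similarity graph is grouping near-duplicate texts into clusters. The sketch below assumes the adjacency list is a dict mapping each label to a list of similar labels; that shape is an assumption and should be checked against the actual return value. `connected_components` is a hypothetical helper.

```python
from collections import deque


def connected_components(adjacency: dict) -> list:
    """Group labels into clusters of directly or indirectly similar texts."""
    seen = set()
    components = []
    for start in adjacency:
        if start in seen:
            continue
        # Breadth-first search from each unvisited label.
        queue = deque([start])
        seen.add(start)
        component = []
        while queue:
            node = queue.popleft()
            component.append(node)
            for neighbour in adjacency.get(node, []):
                if neighbour not in seen:
                    seen.add(neighbour)
                    queue.append(neighbour)
        components.append(sorted(component))
    return components


# Hypothetical adjacency list in the assumed dict-of-lists shape.
adjacency = {"a": ["b"], "b": ["a", "c"], "c": ["b"], "d": []}
print(connected_components(adjacency))
```

Here "a" and "c" end up in the same cluster even though they are not directly similar, because both are similar to "b".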

LSH Properties

no_of_bands: int

.no_of_bands

Number of bands used in LSH model.

permutations: int

.permutations

Number of permutations used to create minhash signatures used in LSH model.