Creates a MinHash object that contains a matrix of MinHash signatures, one signature per text.
MinHash(
text,
method='multi_hash',
n_gram=9,
n_gram_type='char',
permutations=100,
hash_bits=64,
seed=None
)
text
{list or ndarray}
Iterable containing strings of text for each text in a corpus.
method
str, optional, default: 'multi_hash'
Method for random sampling via hashing, must be 'multi_hash' or 'k_smallest_values'.
If 'multi_hash' is selected, texts are hashed once per permutation and the minimum hash value is selected each time to
construct a signature.
If 'k_smallest_values' is selected, each text is hashed once and the k smallest values are selected for k permutations.
This method is less computationally intensive than multi_hash but also less stable.
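The difference between the two methods can be sketched in a few lines. This is a minimal illustration, not the library's implementation: `hash_shingle`, `multi_hash_signature` and `k_smallest_signature` are hypothetical names, and `hashlib.blake2b` is only a stand-in for whatever hash function the library uses internally.

```python
import hashlib


def hash_shingle(shingle: str, seed: int = 0) -> int:
    """Return a 64-bit hash of a shingle (blake2b is an assumed stand-in)."""
    digest = hashlib.blake2b(
        shingle.encode("utf-8"),
        digest_size=8,
        salt=seed.to_bytes(8, "little"),  # a different salt acts as a different hash function
    ).digest()
    return int.from_bytes(digest, "little")


def multi_hash_signature(shingles, permutations):
    # One hash function per permutation; keep the minimum value each time.
    return [min(hash_shingle(s, seed) for s in shingles) for seed in range(permutations)]


def k_smallest_signature(shingles, permutations):
    # Hash every shingle once, then keep the k smallest distinct values.
    return sorted({hash_shingle(s) for s in shingles})[:permutations]


text = "the quick brown fox"
shingles = {text[i:i + 5] for i in range(len(text) - 4)}  # character 5-grams

multi = multi_hash_signature(shingles, 10)
bottom = k_smallest_signature(shingles, 10)
```

In this sketch the multi-hash version hashes every shingle ten times while the k-smallest version hashes each shingle once, which is where the difference in cost comes from. The k-smallest version also needs at least k distinct shingles to fill a signature.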
n_gram
int, optional, default: 9
Size of each overlapping text shingle to break text into prior to hashing. Shingle size should be selected carefully
based on average text length: too small a shingle size will yield false similarities, whereas too large a shingle
size will fail to return similar documents.
For character shingles a size of 5 is recommended for shorter texts such as emails; the default size of 9 is
recommended for longer texts or documents.
n_gram_type
str, optional, default: 'char'
Type of n-gram to use for shingles, must be 'char' to split text into overlapping character shingles or 'term' to
split text into overlapping sequences of words.
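As a rough illustration of the two shingle types, the sketch below splits a text into overlapping character or word n-grams. The function names are hypothetical, and any preprocessing the library applies (lowercasing, punctuation handling) is not shown because it is not described here.

```python
def char_shingles(text: str, n: int) -> set:
    """Overlapping character n-grams; empty set if the text is shorter than n."""
    return {text[i:i + n] for i in range(len(text) - n + 1)}


def term_shingles(text: str, n: int) -> set:
    """Overlapping word n-grams, splitting on whitespace (an assumption)."""
    terms = text.split()
    return {" ".join(terms[i:i + n]) for i in range(len(terms) - n + 1)}


chars = char_shingles("abcde", 3)
terms = term_shingles("the cat sat down", 2)
```

Here `chars` holds the three shingles `abc`, `bcd` and `cde`, and `terms` holds `the cat`, `cat sat` and `sat down`.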
permutations
int, optional, default: 100
Number of randomly sampled hash values to use for generating each text's minhash signature. Intuitively, the larger the
number of permutations, the more accurate the estimated Jaccard similarity between the texts, but the longer the
algorithm will take to run.
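The accuracy trade-off can be made concrete. With the multi-hash approach, the fraction of positions at which two signatures agree estimates the Jaccard similarity, and under the usual assumption of independent hash functions its standard error shrinks with the square root of the number of permutations. The sketch below uses hypothetical helper names and toy signatures.

```python
import math


def estimated_jaccard(sig_a, sig_b) -> float:
    """Fraction of signature positions where the two texts agree."""
    matches = sum(a == b for a, b in zip(sig_a, sig_b))
    return matches / len(sig_a)


def standard_error(jaccard: float, permutations: int) -> float:
    """Standard error of the estimate, assuming independent hash functions."""
    return math.sqrt(jaccard * (1 - jaccard) / permutations)


estimate = estimated_jaccard([1, 2, 3, 4], [1, 2, 9, 4])  # 3 of 4 positions match
error_100 = standard_error(0.5, 100)
error_400 = standard_error(0.5, 400)
```

Quadrupling the permutations from 100 to 400 halves the standard error in this model, at roughly four times the hashing cost.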
hash_bits
int, optional, default: 64
Hash value size used to generate minhash signatures from shingles, must be 32, 64 or 128 bits. Hash value size
should be chosen based on text length and a trade-off between performance and accuracy. Smaller hash sizes risk
hash collisions, leading to false similarities between documents in larger corpora of texts.
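To get a feel for the collision risk, the birthday-bound approximation below estimates the probability that at least two of `n_values` distinct shingles receive the same hash. It assumes uniformly distributed hash values and is only a rough guide; `collision_probability` is a hypothetical helper, not part of the library.

```python
import math


def collision_probability(n_values: int, hash_bits: int) -> float:
    """Approximate P(at least one collision) among n uniformly hashed values."""
    pairs = n_values * (n_values - 1) / 2
    # 1 - exp(-pairs / 2**bits), computed with expm1 for accuracy near zero.
    return -math.expm1(-pairs / 2 ** hash_bits)


p32 = collision_probability(100_000, 32)
p64 = collision_probability(100_000, 64)
```

For 100,000 distinct shingles, a collision somewhere is more likely than not with 32-bit hashes and vanishingly unlikely with 64-bit hashes under this model.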
seed
int, optional, default: None
Seed from which to generate the random hash functions. Setting a seed is necessary for reproducibility or to allow
updating of the LSH model with new minhash values later.
n_gram: int
.n_gram
Returns size of each overlapping text shingle used to create minhash signatures.
n_gram_type: str
.n_gram_type
Returns type of n-gram used for text shingling.
permutations: int
.permutations
Returns number of permutations used to create signatures.
hash_bits: int
.hash_bits
Returns hash value size used to create signatures.
method: str
.method
Returns hashing method used in minhash function.
seed: int
.seed
Returns seed value used to generate random hashes in minhash function.
signatures: numpy.array
.signatures
Returns matrix of text signatures generated by minhash function.
The matrix has shape n × m, where n is the number of texts (one row per text) and m is the number of permutations.
Creates an LSH model of text similarity that can be used to return similar texts based on estimated Jaccard similarity.
LSH(minhash=None, labels=None, no_of_bands=None)
minhash
optional, default: None
Minhash object containing minhash signatures returned by MinHash class.
labels
{list or ndarray}, optional, default: None
List, array or Pandas series containing unique labels for each text in minhash object signature. This should be provided in the same order as texts passed to the MinHash class. Example labels include filepaths and database ids.
no_of_bands
optional, default: permutations // 2
Number of bands to break each minhash signature into before hashing into buckets. A smaller number of bands results in a stricter algorithm that requires texts to be more similar before they share a bucket, possibly leading to false negatives that miss some similar texts, whereas a higher number may lead to false similarities.
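The effect of the band count follows from the standard LSH banding formula: with b bands of r rows each, two texts with Jaccard similarity s share at least one bucket with probability 1 - (1 - s^r)^b. The sketch below assumes the number of permutations divides evenly by the number of bands; how the library handles other cases is not covered here, and `candidate_probability` is a hypothetical helper.

```python
def candidate_probability(jaccard: float, permutations: int, no_of_bands: int) -> float:
    """Probability that two texts share at least one bucket under banding."""
    rows = permutations // no_of_bands  # rows per band, assuming an even split
    return 1 - (1 - jaccard ** rows) ** no_of_bands


# Two texts with Jaccard similarity 0.5 and 100 permutations:
loose = candidate_probability(0.5, 100, 50)   # 50 bands of 2 rows
strict = candidate_probability(0.5, 100, 10)  # 10 bands of 10 rows
```

With 50 bands the pair is almost certain to be returned as a candidate, while with 10 bands it is returned only about 1% of the time, which illustrates why fewer bands behave more strictly.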
update
Updates model from a MinHash object containing signatures generated from new texts and their corresponding labels.
.update(minhash, new_labels)
minhash: MinHash object containing signatures of new texts, parameters must match any previous MinHash objects.
new_labels: List, array or Pandas series containing text labels.
query
Takes a label and returns the labels of any similar texts.
.query(label, min_jaccard=None, sensitivity=1)
label: Label of text to return list of similar texts for.
min_jaccard: Jaccard similarity threshold texts have to exceed to be returned as similar.
sensitivity: Number of buckets texts must share to be returned as similar.
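The role of `sensitivity` can be illustrated with a toy banding scheme. This is a sketch under stated assumptions, not the library's internals: each signature is cut into equal bands, each band is used directly as a bucket key, and `band_keys`, `build_buckets` and `query_sketch` are hypothetical names.

```python
from collections import Counter, defaultdict


def band_keys(signature, no_of_bands):
    """Split a signature into equal bands; each (band index, values) is a bucket key."""
    rows = len(signature) // no_of_bands
    return [(b, tuple(signature[b * rows:(b + 1) * rows])) for b in range(no_of_bands)]


def build_buckets(signatures, labels, no_of_bands):
    buckets = defaultdict(set)
    for label, signature in zip(labels, signatures):
        for key in band_keys(signature, no_of_bands):
            buckets[key].add(label)
    return buckets


def query_sketch(label, signatures, labels, buckets, no_of_bands, sensitivity=1):
    """Return labels sharing at least `sensitivity` buckets with `label`."""
    signature = signatures[labels.index(label)]
    shared = Counter()
    for key in band_keys(signature, no_of_bands):
        for other in buckets.get(key, ()):
            if other != label:
                shared[other] += 1
    return sorted(other for other, count in shared.items() if count >= sensitivity)


labels = ["a", "b", "c", "d"]
signatures = [[1, 2, 3, 4], [1, 2, 9, 9], [1, 2, 3, 4], [7, 7, 7, 7]]
buckets = build_buckets(signatures, labels, no_of_bands=2)

one_bucket = query_sketch("a", signatures, labels, buckets, 2, sensitivity=1)
two_buckets = query_sketch("a", signatures, labels, buckets, 2, sensitivity=2)
```

In this toy data, text `b` shares one band with `a` and text `c` shares both, so raising `sensitivity` from 1 to 2 drops `b` from the result.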
remove
Removes a text's label and minhash signature from the model.
.remove(label)
label: Label of text to remove from LSH model.
contains
Returns list of labels contained in the model.
.contains()
adjacency_list
Returns an adjacency list that can be used to create a text similarity graph.
.adjacency_list(min_jaccard=None, sensitivity=1)
min_jaccard: Jaccard similarity threshold texts have to exceed to be returned as similar.
sensitivity: Number of buckets texts must share to be returned as similar.
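One way to picture an adjacency list built with a `min_jaccard` threshold is the brute-force sketch below, which compares every pair of signatures and keeps pairs whose estimated Jaccard similarity exceeds the threshold. The dictionary layout and the name `adjacency_sketch` are assumptions for illustration; the library's actual return format and its use of buckets to avoid comparing all pairs are not shown.

```python
from itertools import combinations


def estimated_jaccard(sig_a, sig_b) -> float:
    """Fraction of signature positions where the two texts agree."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)


def adjacency_sketch(signatures, labels, min_jaccard):
    """Map each label to the labels whose estimated Jaccard exceeds min_jaccard."""
    adjacency = {label: [] for label in labels}
    for (label_a, sig_a), (label_b, sig_b) in combinations(zip(labels, signatures), 2):
        if estimated_jaccard(sig_a, sig_b) > min_jaccard:
            adjacency[label_a].append(label_b)
            adjacency[label_b].append(label_a)
    return adjacency


labels = ["a", "b", "c"]
signatures = [[1, 2, 3, 4], [1, 2, 3, 9], [5, 6, 7, 8]]
graph = adjacency_sketch(signatures, labels, min_jaccard=0.5)
```

Each key is a node and each list holds its neighbours, so a structure like this can be passed to a graph library or walked directly to find clusters of similar texts.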
no_of_bands: int
.no_of_bands
Number of bands used in LSH model.
permutations: int
.permutations
Number of permutations used to create minhash signatures used in LSH model.