# Sample implementation

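As discussed above, locality-sensitive hashing can reduce the dimensionality of data. What we care about most here is the collision probability in practice: given two objects, how similar are the features that survive hashing?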

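Strictly speaking, this corresponds only to the part of locality-sensitive hashing that computes the collision probability of a single hash function. Then again, what we need to simulate here is probably not the highly specific dimensionality-reduction side anyway.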
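As a reference point for that single-function collision probability, here is a minimal sketch that assumes an idealized hash: a uniformly random permutation of the elements, rather than the linear hash used in the cells below. Under that assumption, the two minima coincide exactly when the smallest-ranked element of the union lies in the intersection, so the collision rate should approach the Jaccard similarity. The function name `permutation_collision_rate` and its `trials` and `seed` parameters are hypothetical.

```python
import random


def permutation_collision_rate(set_a, set_b, trials=20000, seed=0):
    """Estimate Pr[min rank over set_a == min rank over set_b] under random permutations."""
    rng = random.Random(seed)
    universe = sorted(set_a | set_b)
    hits = 0
    for _ in range(trials):
        order = universe[:]
        rng.shuffle(order)                          # one random permutation
        rank = {x: r for r, x in enumerate(order)}  # element -> rank in that permutation
        if min(rank[x] for x in set_a) == min(rank[x] for x in set_b):
            hits += 1
    return hits / trials


rate = permutation_collision_rate({0, 2, 3, 4, 7, 8}, {0, 1, 3, 8})
print(rate)  # should be close to 3/7
```

Comparing this idealized rate with the MinHash estimate computed later gives a sense of how much the concrete hash family deviates from the ideal.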

In [1]:
import random

DEBUG = False

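# Global constants
rand_seed = None  # None seeds from system entropy, so runs are not reproducible
P = 31  # prime modulus of the linear hash
M = 9   # number of buckets; sketch values lie in 0..M-1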
# K is roughly O(1 / relative_accuracy^2), so a fairly large number is needed to get to 10% relative error.
K = 100000

data = [[0, 2, 3, 4, 7, 8], [0, 1, 3, 8]]
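The comment on `K` can be made concrete. Treating each of the K sketch positions as an independent Bernoulli trial with collision probability p (an assumption of this sketch, not something the notebook states), the estimate has relative standard error sqrt((1 - p) / (p * K)), which is where the 1 / relative_accuracy^2 scaling comes from. The helper name `relative_std_error` is hypothetical.

```python
import math


def relative_std_error(p, k):
    """Relative standard error of the mean of k independent Bernoulli(p) trials."""
    return math.sqrt((1 - p) / (p * k))


# The error shrinks like 1 / sqrt(k).
for k in (100, 10_000, 100_000):
    print(k, relative_std_error(0.5, k))
```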

## Jaccard similarity

In [2]:
def jaccard_similarity(list1, list2):
    set1, set2 = set(list1), set(list2)
    return len(set1 & set2) / len(set1 | set2)

jaccard_similarity(data[0], data[1])

0.42857142857142855
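The value above can be checked by hand: the two lists share three elements out of seven distinct ones. A small sketch that spells out the arithmetic with exact fractions (the variable names are illustrative only):

```python
from fractions import Fraction

a = {0, 2, 3, 4, 7, 8}
b = {0, 1, 3, 8}

shared = a & b    # elements present in both sets
combined = a | b  # elements present in either set
ratio = Fraction(len(shared), len(combined))

print(sorted(shared), sorted(combined))
print(ratio, float(ratio))
```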

## Hashing

In [3]:
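def init_coeffs():
    """Draw K coefficient pairs (a, b), one pair per hash function."""
    random.seed(rand_seed)
    # random.randint includes both endpoints: a is drawn from 0..P, b from 1..P.
    # Any a with a % P == 0 makes the hash constant in x.
    return [(random.randint(0, P), random.randint(1, P)) for _ in range(K)]


def coeffs_hash(i, x):
    """Apply the i-th hash function, h(x) = ((a*x + b) mod P) mod M."""
    a, b = coeffs[i]
    return ((a * x + b) % P) % M


def sketch(initial_list):
    """Return the MinHash sketch: the minimum hash value under each of the K functions."""
    return [min(coeffs_hash(i, x) for x in initial_list) for i in range(K)]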
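To make one sketch position concrete, the following traces a single hash function over both lists. The coefficients a = 5 and b = 7 are hypothetical, picked only for illustration; the notebook draws its coefficients at random. `linear_hash` is likewise an illustrative name.

```python
P, M = 31, 9  # same constants as the notebook
a, b = 5, 7   # hypothetical coefficients, for illustration only


def linear_hash(x):
    """One linear hash, reduced modulo P and then modulo M."""
    return ((a * x + b) % P) % M


hashes_0 = [linear_hash(x) for x in [0, 2, 3, 4, 7, 8]]
hashes_1 = [linear_hash(x) for x in [0, 1, 3, 8]]

print(hashes_0, min(hashes_0))  # [7, 8, 4, 0, 2, 7] 0
print(hashes_1, min(hashes_1))  # [7, 3, 4, 7] 3
```

For this particular pair of coefficients the two minima differ (0 versus 3), so this position would count as a mismatch when the sketches are compared.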

## MinHash similarity

In [4]:
def minhash_similarity(list1, list2):
    # Fraction of sketch positions at which the two sketches agree.
    return sum(x == y for x, y in zip(list1, list2)) / len(list1)

coeffs = init_coeffs()
minhash_data = [sketch(each_list) for each_list in data]

minhash_similarity(minhash_data[0], minhash_data[1])

0.61017
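The estimate sits well above the Jaccard similarity of about 0.43. A likely reason is the hash family itself: with only M = 9 buckets, distinct elements often tie for the minimum, and any multiplier with a % P == 0 makes the hash constant, which forces a collision. Because the coefficient space is small, the collision probability of this family can be enumerated exactly instead of sampled. The sketch below assumes the same coefficient ranges that `init_coeffs` draws from (a in 0..P, b in 1..P); the name `min_hash` is illustrative.

```python
P, M = 31, 9
A = [0, 2, 3, 4, 7, 8]
B = [0, 1, 3, 8]


def min_hash(a, b, items):
    """Minimum of ((a*x + b) mod P) mod M over the items."""
    return min(((a * x + b) % P) % M for x in items)


# Every coefficient pair that random.randint(0, P), random.randint(1, P) can produce.
pairs = [(a, b) for a in range(0, P + 1) for b in range(1, P + 1)]

collisions = sum(min_hash(a, b, A) == min_hash(a, b, B) for a, b in pairs)
exact = collisions / len(pairs)

# Share of pairs whose multiplier vanishes modulo P (the hash ignores x).
degenerate = sum(1 for a, _ in pairs if a % P == 0) / len(pairs)

print(exact, degenerate)
```

If the reasoning above holds, `exact` should land close to the simulated value rather than to the Jaccard similarity, which suggests the gap comes from the hash family and not from sampling noise.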