## Hash

### Hashing

In [1]:
help(hash)

'''
Return the hash value of the object (if it has one).
Hash values are integers.
---->  They are used to quickly compare dictionary keys during a dictionary lookup.
Numeric values that compare equal have the same hash value (even if they are of different types, as is the case for 1 and 1.0).
Note For objects with custom __hash__() methods, note that hash() truncates the return value based on the bit width of the host machine.
'''


Help on built-in function hash in module builtins:

hash(obj, /)
    Return the hash value for the given object.
    
    Two objects that compare equal must also have the same hash value, but the
    reverse is not necessarily true.




In [2]:
li = ['abc', 123, (0, 1)]
try:
    hash(li)
except Exception as e:
    print(e)

unhashable type: 'list'
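
The properties quoted from the documentation above can be checked directly. The following is a minimal sketch; `is_hashable` is a hypothetical helper introduced here for illustration, not part of the standard library.

```python
def is_hashable(obj):
    """Return True if hash(obj) succeeds (illustrative helper)."""
    try:
        hash(obj)
    except TypeError:
        return False
    return True

# Numeric values that compare equal share a hash, even across types.
print(hash(1) == hash(1.0))            # True

# Immutable containers of hashable items are hashable ...
print(is_hashable((0, 1)))             # True

# ... but lists are not, and neither is a tuple that contains a list.
print(is_hashable([0, 1]))             # False
print(is_hashable(('abc', [0, 1])))    # False
```

This is why a list cannot be used as a dictionary key, while a tuple of hashable items can.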


In [3]:
import hashlib
print(hashlib.algorithms_available)

{'MD4', 'SHA224', 'ecdsa-with-SHA1', 'DSA', 'sha3_256', 'sha3_512', 'SHA', 'SHA384', 'dsaWithSHA', 'md5', 'SHA1', 'md4', 'blake2s', 'SHA256', 'sha', 'ripemd160', 'MD5', 'sha3_224', 'sha3_384', 'SHA512', 'sha224', 'sha256', 'sha512', 'DSA-SHA', 'dsaEncryption', 'RIPEMD160', 'shake_256', 'sha384', 'blake2b', 'whirlpool', 'shake_128', 'sha1'}
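
As a short illustration of how one of these algorithms is used, the sketch below computes a SHA-256 digest with `hashlib`. The helper name `sha256_hex` is an assumption made for this example. Unlike the built-in `hash()` on strings, which is randomized per process by default, a `hashlib` digest is the same on every run.

```python
import hashlib

def sha256_hex(text):
    """Hex digest of the UTF-8 encoding of text (illustrative helper)."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

digest = sha256_hex("hello")
print(len(digest))                      # 64 hex characters = 256 bits
print(digest == sha256_hex("hello"))    # True: same input, same digest
print(digest == sha256_hex("hellp"))    # False: a one-letter change alters the digest
```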


In [4]:
def myHash(i):
    return "%.8X" % (i % 2147483647)

myHash(1)

'00000001'

### Bloom Filter: a membership function

[Wiki] A Bloom filter is a space-efficient probabilistic data structure, conceived by Burton Howard Bloom in 1970, that is used **to test whether an element is a member of a set**. False positive matches are possible, but false negatives are not – in other words, a query returns either "possibly in set" or "definitely not in set". Elements can be added to the set, but not removed (though this can be addressed with a "counting" filter); the more elements that are added to the set, the larger the probability of false positives.

In [5]:
class myBM:
    
    # moduli that define the hash functions h_k(i) = i % k
    my_k = [13, 41, 71, 107, 307, 419, 877]
    size = 1024 #* 1024
    
    def __init__(self, data=None):
        # one flag per bit position; every filter gets its own list
        self.bits = [False] * self.size
        if data is not None:
            for i in data:
                self.add(i)
        
    def add(self, newItem):
        # set the bit selected by each hash function
        for k in self.my_k:
            v = self.__myhash(newItem, k)
            self.bits[v] = True
    
    def __myhash(self, i, k):
        return i % k
    
    def __contains__(self, i):
        # a single unset bit proves that i was never added
        for k in self.my_k:
            v = self.__myhash(i, k)
            if not self.bits[v]:
                return False
        return True

In [6]:
bm = myBM()

members = [18, 346, 672, 823, 74]
for m in members:
    bm.add(m)

In [7]:
print(74 in bm)
print(346 in bm)
print(823 in bm)

print(1 in bm)
print(88 in bm)
print(298 in bm)

True
True
True
False
False
False


### Find examples of false positives

In [8]:
count = 0
for i in range(bm.size):
    if i in bm and i not in members:
        count += 1
        print(i)
print("FP rate:", count/bm.size)

3
4
5
8
9
16
30
FP rate: 0.0068359375


In [9]:
# Question: how to design a better my_k?

A theoretical analysis of the false-positive rate is available here:

https://people.eecs.berkeley.edu/~daw/teaching/cs170-s03/Notes/lecture10.pdf
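
As a rough guide to the question above, the standard approximation for a Bloom filter with `m` bits, `k` hash functions, and `n` inserted items is $p \approx (1 - e^{-kn/m})^k$, which is minimized near $k = (m/n)\ln 2$. This estimate assumes that the hash functions are independent and spread items uniformly over all `m` bits. The sketch below evaluates both expressions; the function names are illustrative and do not come from the notebook.

```python
import math

def bloom_fp_rate(m, k, n):
    """Approximate false-positive probability, assuming uniform independent hashes."""
    return (1.0 - math.exp(-k * n / m)) ** k

def optimal_k(m, n):
    """Number of hash functions that roughly minimizes the false-positive rate."""
    return max(1, round(m / n * math.log(2)))

# Parameters of the filter above: 1024 bits, 7 hash functions, 5 members.
print(bloom_fp_rate(1024, 7, 5))
print(optimal_k(1024, 5))
```

The measured rate above is much higher than this estimate, most likely because `i % k` with small moduli such as 13 or 41 only ever touches the first few bits, so the uniformity assumption does not hold. A better `my_k` would use hash functions whose values cover the whole bit array.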

### Collisions

The simplest way to handle collisions is a method called **separate chaining**. Each entry in the hash table serves as the head of a list containing all the keys that hash to that entry.

In [10]:
class mySet:
    li = None
    k = 13
    
    def __init__(self):
        self.li = [None]*self.k
        
    def __myhash(self, item):
        return item % self.k
        
    def add(self, newItem):
        h = self.__myhash(newItem)
        if self.li[h] is not None:
            if newItem not in self.li[h]:
                self.li[h].append(newItem) # try to rewrite this by using sorted list
        else:
            self.li[h] = [newItem]
    
    def __str__(self):
        return str(self.li)
    
    def __contains__(self, item):
        h = self.__myhash(item)
        # an unused slot is still None, so guard before searching the chain
        return self.li[h] is not None and item in self.li[h] # try to rewrite it by using binary search

In [11]:
s = mySet()
for i in range(30):
    s.add(i)

In [12]:
print(s)

[[0, 13, 26], [1, 14, 27], [2, 15, 28], [3, 16, 29], [4, 17], [5, 18], [6, 19], [7, 20], [8, 21], [9, 22], [10, 23], [11, 24], [12, 25]]


In [13]:
print(45 in s)
print(11 in s)

False
True
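
The two "try to rewrite" hints in `mySet` could be addressed as follows. This is one possible sketch, assuming that items are integers and mutually comparable; the class name `SortedChainSet` and its attributes are hypothetical. Each chain is kept sorted, so both insertion and lookup can use binary search through the standard `bisect` module.

```python
import bisect

class SortedChainSet:
    """Separate chaining where every chain is kept in sorted order."""

    def __init__(self, k=13):
        self.k = k
        # one independent (initially empty) chain per slot
        self.buckets = [[] for _ in range(k)]

    def _bucket(self, item):
        return self.buckets[item % self.k]

    def add(self, item):
        bucket = self._bucket(item)
        pos = bisect.bisect_left(bucket, item)   # binary search for the insert position
        if pos == len(bucket) or bucket[pos] != item:
            bucket.insert(pos, item)             # keeps the chain sorted, skips duplicates

    def __contains__(self, item):
        bucket = self._bucket(item)
        pos = bisect.bisect_left(bucket, item)
        return pos < len(bucket) and bucket[pos] == item

s2 = SortedChainSet()
for i in (26, 0, 13, 5, 13):
    s2.add(i)
print(s2.buckets[0])   # [0, 13, 26]
print(13 in s2)        # True
print(39 in s2)        # False
```

Using empty lists instead of `None` for unused slots also removes the need for a special case in `__contains__`.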


### Collisions: open addressing

Given a collection of items, a hash function that maps each item into a unique **slot** is referred to as a **perfect hash function**. If the hash function is perfect, collisions will never occur. However, since this is often not possible, collision resolution becomes a very important part of hashing.

Open addressing: Another simple way to deal with collision is to start at the original hash value position and then move in a sequential manner through the slots until we encounter the first slot that is empty. By systematically visiting each slot one at a time, we are performing an open addressing technique called **linear probing**.

Note that we may need to go back to the first slot (circularly) to cover the entire hash table.

A disadvantage of linear probing is the tendency for **clustering**. This means that if many collisions occur at the same hash value, a number of surrounding slots will be filled by the linear probing resolution, which in turn lengthens the probe sequence for other keys that hash nearby.
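
The effect can be seen in a small experiment. The sketch below is illustrative only: `linear_probe_insert` is a hypothetical helper that stores integer keys in a plain list and reports how many slots it had to examine.

```python
def linear_probe_insert(table, key):
    """Insert key with linear probing; return the number of slots examined."""
    size = len(table)
    start = key % size
    for step in range(size):
        slot = (start + step) % size          # wrap around circularly
        if table[slot] is None or table[slot] == key:
            table[slot] = key
            return step + 1
    raise OverflowError("table is full")

table = [None] * 11
# 0, 11, 22 and 33 all hash to slot 0; key 1 hashes to slot 1.
probes = [linear_probe_insert(table, key) for key in (0, 11, 22, 33, 1)]
print(table)    # [0, 11, 22, 33, 1, None, None, None, None, None, None]
print(probes)   # [1, 2, 3, 4, 4]
```

Key 1 does not collide with any other key's hash value, yet it needs four probes because the cluster that grew from slot 0 has spilled over its home slot.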

### Collisions: rehashing

One way to deal with clustering is to extend the linear probing technique so that instead of looking sequentially for the next open slot, we skip slots, thereby more evenly distributing the items that have caused collisions. This will potentially reduce the clustering that occurs.

For example, once a collision occurs, we will look at every third slot until we find one that is empty.

$\text{newhashvalue} = \text{rehash}(\text{oldhashvalue})$

$\text{rehash}(\text{pos}) = (\text{pos} + 3) \bmod \text{sizeoftable}$

A variation of the linear probing idea is called **quadratic probing**.

Instead of using a constant skip value, we use a rehash function that increments the hash value by 1, 3, 5, 7, 9, and so on. This means that if the first hash value is $h$, the successive values are $h+1$, $h+4$, $h+9$, $h+16$, and so on.

Question: when to stop rehashing?
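
One way to explore this question is to list the slots that each scheme can actually reach. The sketch below is illustrative; both function names are hypothetical. It suggests a possible answer: stop once the probe sequence returns to a slot it has already visited, or after as many probes as there are slots.

```python
def plus_k_sequence(start, size, skip=3):
    """Slots visited by rehash(pos) = (pos + skip) % size until the sequence cycles."""
    seen, pos = [], start
    while pos not in seen:
        seen.append(pos)
        pos = (pos + skip) % size
    return seen

def quadratic_sequence(start, size):
    """Distinct slots among h, h+1, h+4, h+9, ... over `size` probes."""
    seen = []
    for i in range(size):
        pos = (start + i * i) % size
        if pos not in seen:
            seen.append(pos)
    return seen

print(plus_k_sequence(0, 11))      # [0, 3, 6, 9, 1, 4, 7, 10, 2, 5, 8]
print(plus_k_sequence(0, 12))      # [0, 3, 6, 9]
print(quadratic_sequence(0, 11))   # [0, 1, 4, 9, 5, 3]
```

With a skip of 3, every slot is reached when the table size is 11, but only a quarter of them when the size is 12, because 3 divides 12. This is one reason table sizes are often chosen to be prime. Quadratic probing on a table of size 11 reaches only six distinct slots, so an insert can fail even though empty slots remain.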

### Map

One of the most useful Python collections is the dictionary. Recall that a dictionary is an associative data type where you can store **key–value** pairs. The key is used to look up the associated data value. We often refer to this idea as a **map**.


* **Map()** Create a new, empty map. It returns an empty map collection.
* **put(key,val)** Add a new key-value pair to the map. If the key is already in the map then replace the old value with the new value.
* **get(key)** Given a key, return the value stored in the map or None otherwise.
* **del** Delete the key-value pair from the map using a statement of the form del map[key].
* **len()** Return the number of key-value pairs stored in the map.
* **in** Return True for a statement of the form key in map, if the given key is in the map, False otherwise.
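
For reference, each operation in the list above corresponds to something the built-in `dict` already provides. The sketch below shows the correspondence; the variable name `m` and the sample keys are illustrative.

```python
m = {}                  # Map(): create a new, empty map
m[54] = "cat"           # put(key, val)
m[54] = "lion"          # put on an existing key replaces the old value
print(m.get(54))        # get(key) -> lion
print(m.get(99))        # get on a missing key -> None
print(len(m))           # len() -> 1
print(54 in m)          # in -> True
del m[54]               # del map[key]
print(len(m))           # 0
```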

In [14]:
class MyHashTable:
    def __init__(self):
        self.size = 11
        self.keys = [None] * self.size
        self.values = [None] * self.size
        
    def put(self, key, value):
        hashvalue = self.hashfunction(key, self.size)

        # add new item to the empty slot
        if self.keys[hashvalue] is None:
            self.keys[hashvalue] = key
            self.values[hashvalue] = value
        else:
            # replace the old value with the new value
            if self.keys[hashvalue] == key:
                self.values[hashvalue] = value
            # collision
            else:
                # probe for the next empty slot or the slot that already holds key
                nextslot = self.rehash(hashvalue, self.size)
                while self.keys[nextslot] is not None and self.keys[nextslot] != key:
                    nextslot = self.rehash(nextslot, self.size)
                    # back at the start: every slot is taken by another key
                    if nextslot == hashvalue:
                        raise OverflowError("hash table is full")
                
                # add new item to the empty slot
                if self.keys[nextslot] is None: # add
                    self.keys[nextslot] = key
                    self.values[nextslot] = value
                else: # replace
                    self.values[nextslot] = value

    def hashfunction(self, key, size):
        return key % size
    
    def rehash(self, oldhash, size):
        return (oldhash + 1) % size
    
    def get(self, key):
        startslot = self.hashfunction(key, self.size)
        
        value = None
        stop = False
        found = False
        position = startslot
        while self.keys[position] is not None and not found and not stop:
            if self.keys[position] == key:
                found = True
                value = self.values[position]
            else:
                position = self.rehash(position, self.size)
                if position == startslot:
                    stop = True
        return value
    
    # support the [] syntax: H[key] and H[key] = value
    def __getitem__(self,key):
        return self.get(key)
    
    def __setitem__(self, key, value):
        self.put(key, value)

In [15]:
H = MyHashTable()
H[54] = "cat"
H[26] = "dog"
H[93] = "lion"
H[17] = "tiger"
H[77] = "bird"
H[31] = "cow"
H[44] = "goat"
H[55] = "pig"
H[20] = "chicken"

In [16]:
print(H.keys)
print(H.values)

[77, 44, 55, 20, 26, 93, 17, None, None, 31, 54]
['bird', 'goat', 'pig', 'chicken', 'dog', 'lion', 'tiger', None, None, 'cow', 'cat']


In [17]:
print(list(zip(H.keys, H.values)))

[(77, 'bird'), (44, 'goat'), (55, 'pig'), (20, 'chicken'), (26, 'dog'), (93, 'lion'), (17, 'tiger'), (None, None), (None, None), (31, 'cow'), (54, 'cat')]


In [18]:
H[20] = 'duck'
print(list(zip(H.keys, H.values)))

[(77, 'bird'), (44, 'goat'), (55, 'pig'), (20, 'duck'), (26, 'dog'), (93, 'lion'), (17, 'tiger'), (None, None), (None, None), (31, 'cow'), (54, 'cat')]
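
`MyHashTable` covers `put` and `get`, but not the `del`, `len()`, and `in` operations from the map description. The following self-contained sketch shows one way they might be added to a linear-probing table, assuming integer keys. The class name `ProbingMap` and the `_DELETED` marker are hypothetical and are not the notebook's implementation. A deleted slot is marked with a tombstone instead of being reset to `None`, so that probe sequences passing through it are not cut short.

```python
class ProbingMap:
    """Linear-probing map with deletion via tombstones (illustrative sketch)."""

    _DELETED = object()   # tombstone: the slot once held a key that was removed

    def __init__(self, size=11):
        self.size = size
        self.keys = [None] * size
        self.values = [None] * size
        self.count = 0

    def _find(self, key):
        """Return (slot holding key or None, first reusable slot or None)."""
        free = None
        pos = key % self.size
        for _ in range(self.size):
            k = self.keys[pos]
            if k is None:                       # end of the probe chain
                return None, (pos if free is None else free)
            if k is self._DELETED:
                if free is None:
                    free = pos                  # remember the first tombstone
            elif k == key:
                return pos, free
            pos = (pos + 1) % self.size
        return None, free

    def __setitem__(self, key, value):
        slot, free = self._find(key)
        if slot is None:                        # new key: claim a free slot
            if free is None:
                raise OverflowError("hash table is full")
            slot = free
            self.keys[slot] = key
            self.count += 1
        self.values[slot] = value

    def __getitem__(self, key):
        slot, _ = self._find(key)
        return None if slot is None else self.values[slot]

    def __delitem__(self, key):
        slot, _ = self._find(key)
        if slot is None:
            raise KeyError(key)
        self.keys[slot] = self._DELETED
        self.values[slot] = None
        self.count -= 1

    def __contains__(self, key):
        return self._find(key)[0] is not None

    def __len__(self):
        return self.count

P = ProbingMap()
P[77] = "bird"; P[44] = "goat"; P[55] = "pig"   # all three hash to slot 0
del P[44]
print(55 in P, 44 in P, len(P))                 # True False 2
```

Without the tombstone, deleting 44 would leave an empty slot between 77 and 55, and a later lookup of 55 would stop too early.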


### Analysis of Hashing

In the best case, hashing provides an $O(1)$, constant-time search. However, because of collisions, the number of comparisons is usually larger and depends on how full the table is.
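
How full the table is can be expressed as the load factor $\lambda$, the number of stored items divided by the table size. The sketch below evaluates the commonly quoted textbook approximations for the average number of comparisons. The formulas assume a well-behaved hash function, and the function names are illustrative.

```python
def linear_probing_comparisons(load):
    """Approximate average comparisons for open addressing with linear probing."""
    successful = 0.5 * (1 + 1 / (1 - load))
    unsuccessful = 0.5 * (1 + (1 / (1 - load)) ** 2)
    return successful, unsuccessful

def chaining_comparisons(load):
    """Approximate average comparisons for separate chaining."""
    successful = 1 + load / 2
    unsuccessful = load
    return successful, unsuccessful

for load in (0.25, 0.5, 0.75, 0.9):
    print(load, linear_probing_comparisons(load), chaining_comparisons(load))
```

For linear probing the cost grows quickly as $\lambda$ approaches 1, which is why open-addressing tables are usually resized well before they fill up.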