## Hashing

In [1]:
## Define some functions useful for testing
import random

## Generate a list of n random integers between 0 and b (inclusive)
def get_random_array(n, b = 50):
    return [random.randint(0, b) for _ in range(n)]

---

### Open Addressing with linear probing

[Open addressing](https://en.wikipedia.org/wiki/Open_addressing) is a collision resolution technique used for handling collisions in hashing. 

All the items are stored in a table of size $\alpha n$, where $n$ is the number of keys and $\alpha > 1$ is a constant, so the load factor (the fraction of occupied slots) is $1/\alpha < 1$.

Initially, the table contains only the special value ```None```, which marks an entry as empty. Another 
special value, say the character ```'D'```, marks an entry that contained a key which has since been deleted.

A hash function $h()$ is used to specify the order of entries to probe for a key to be inserted/searched/deleted. 
We start by probing $h(k)$ and, with linear probing, the sequence of probes $S(k)$ is $h(k), h(k)+1, h(k)+2, \ldots$ , modulo $\alpha n$.
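
The probe sequence can be sketched as a small generator. This is only an illustration: the name `probe_sequence` and its parameters `h0` (the value $h(k)$) and `m` (the table size $\alpha n$) are assumptions made here, not part of the exercise code.

```python
def probe_sequence(h0, m):
    """Yield the m slots visited by linear probing, starting at h0."""
    for i in range(m):
        yield (h0 + i) % m

# With a table of 8 slots and h(k) = 6 the probes wrap around the end.
print(list(probe_sequence(6, 8)))  # [6, 7, 0, 1, 2, 3, 4, 5]
```

After $m$ probes every slot has been visited once, which is why a search can safely give up at that point.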

- **Insert** adds the key in the first free slot (one holding ```None``` or ```'D'```) found among the positions in $S(k)$.
- **Lookup** is performed by checking the positions in $S(k)$ until we find either the key or ```None```.
- **Delete** is performed by first searching for the key and then replacing it with ```'D'```. Why don't we use ```None``` instead? 
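
One way to explore the question about ```None``` is the small experiment below. It is a sketch under stated assumptions: the standalone `lookup` function, the hand-built table and the claim that both keys hash to slot 1 are made up for illustration.

```python
def lookup(table, key, h0):
    """Linear-probing lookup that stops at the first None."""
    m = len(table)
    for i in range(m):
        slot = table[(h0 + i) % m]
        if slot is None:
            return False
        if slot == key:
            return True
    return False

# Keys 10 and 20 are assumed to hash to slot 1, so 20 was pushed to slot 2.
table = [None, 10, 20, None]

with_none = list(table)
with_none[1] = None      # delete 10 by emptying its slot
with_marker = list(table)
with_marker[1] = 'D'     # delete 10 by leaving a marker

print(lookup(with_none, 20, 1))    # False: the search stops at the hole
print(lookup(with_marker, 20, 1))  # True: the marker keeps the probe run intact
```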


![alt text](LinearProbing.jpg "Example")

### Exercise: Open Addressing with linear probing
Complete the implementation below by implementing ```Lookup``` and ```Delete```.


**Optional:** Try to implement quadratic probing. Python's set and dictionary also use open addressing, although CPython's probe sequence is a pseudo-random one rather than plain quadratic probing.  
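
As a starting point, here is a minimal sketch of one common quadratic variant, which uses triangular-number offsets $h(k) + i(i+1)/2$. The function name and the choice of offsets are assumptions for illustration, not necessarily what the exercise intends. With this variant, a table whose size is a power of two has every slot visited exactly once.

```python
def quadratic_probe_sequence(h0, m):
    """Yield slots h0 + i*(i+1)/2 (mod m) for i = 0, 1, ..., m-1."""
    for i in range(m):
        yield (h0 + i * (i + 1) // 2) % m

m = 8  # a power of two
visited = list(quadratic_probe_sequence(3, m))
print(visited)                            # [3, 4, 6, 1, 5, 2, 0, 7]
print(sorted(visited) == list(range(m)))  # True: all slots are reached
```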

In [2]:
## Your implementation goes here

class linear_probing_set:
    def __init__(self, size):
        
        self.T = [None]*size
        self.prime = 993319
        self.a = random.randint(2, self.prime-1)
        self.b = random.randint(2, self.prime-1)
        self.n_keys = 0
        
    def insert(self, key):
        # Check first to avoid duplicates. This costs an extra scan, because
        # lookup stops only at None while the loop below also stops at 'D'.
        if self.lookup(key):
            return
        if self.n_keys == len(self.T):  # no free slot: the loop below would never end
            raise RuntimeError("table is full")
        h = self.hash(key)
        while self.T[h] is not None and self.T[h] != 'D':
            h += 1
            if h == len(self.T):
                h = 0
        self.T[h] = key
        self.n_keys += 1
    
    # Return True if key is in the set, False otherwise
    def lookup(self, key):
        h = self.hash(key)
        visite = 0  # number of slots probed so far
        while self.T[h] is not None:
            if self.T[h] == key:
                return True
            h += 1
            visite += 1
            if h == len(self.T):
                h = 0
            if visite == len(self.T):  # every slot was probed: avoid an endless search
                return False
        return False
    
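    # Return True if key was found and deleted, False otherwise
    def delete(self, key):
        h = self.hash(key)
        visite = 0  # number of slots probed so far
        while self.T[h] is not None:
            if self.T[h] == key:
                self.T[h] = 'D'  # leave a marker so later lookups keep probing
                self.n_keys -= 1
                return True  # returning the slot index would be falsy for slot 0
            h += 1
            visite += 1
            if h == len(self.T):
                h = 0
            if visite == len(self.T):
                return False
        return False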
    
    
    def hash(self, key):
        return ((self.a*key + self.b) % self.prime) % len(self.T)
    
    def len(self):
        return self.n_keys

In [3]:
## Test your implementation

n = 10000

a = get_random_array(n, n)

queries = get_random_array(n, n)

lp_set = linear_probing_set(2*n)
std_set = set()

for key in a:
    lp_set.insert(key)
    std_set.add(key)

assert len(std_set) == lp_set.len(), "Fail len!"     
    
for key in a:
    assert lp_set.lookup(key) == True, "Lookup fail a"

for key in queries:
    assert lp_set.lookup(key) == (key in std_set), "Lookup fail queries"
    
for key in a[:300]:
    lp_set.delete(key)
    std_set.discard(key)  # no error if the key has already been removed
          
    assert lp_set.lookup(key) == (key in std_set), "Lookup fail delete"

In [4]:
%timeit for key in queries: lp_set.lookup(key)
    
%timeit for key in queries: key in std_set

16.6 ms ± 364 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
1.28 ms ± 4.18 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)


----
### Hashing with Chains
Instead of storing the elements directly in the slots of the table $T$, let every slot hold a list (a chain) of all the elements that hash to that slot. Our operations now become:

- `Insert` $(k)$: hash $k$ to an index $i$ in the table and append $k$ to the list at slot $i$. You may want to check if $k$ is already in the set first.
- `Lookup` $(k)$: hash $k$ and search for it by iterating through the list at that slot.
- `Delete` $(k)$: search for $k$ and then remove it from the list.

Lookup and Delete take $O(s)$ time, where $s$ is the length of the list. Let $m$ be the number of slots and define the **load factor** as $\alpha = \frac{n}{m}$. Under simple uniform hashing each element is equally likely to land in any slot, so after $n$ elements have been inserted the expected length of each chain is $\frac{n}{m} = \alpha$ by linearity of expectation. The running time of each operation is therefore the time to hash plus the time to scan a chain, which is $O(1 + \alpha)$ in expectation.
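
The chain lengths can be inspected with a quick experiment. The sketch below is illustrative only: `chain_lengths`, the Mersenne prime `p`, the fixed seed and the sizes `n` and `m` are choices made here. Averaged over all slots the chain length is $n/m$ by construction; the point of the experiment is to see how far the longest chain deviates from it.

```python
import random
from collections import Counter

def chain_lengths(keys, m, a, b, p):
    """Return a list with the chain length of each of the m slots."""
    counts = Counter(((a * k + b) % p) % m for k in keys)
    return [counts.get(i, 0) for i in range(m)]

random.seed(0)                      # fixed seed so the run is repeatable
p = 2**31 - 1                       # a prime larger than every key
n, m = 10_000, 2_000
keys = random.sample(range(p), n)   # n distinct keys
a, b = random.randint(1, p - 1), random.randint(0, p - 1)

lengths = chain_lengths(keys, m, a, b, p)
print(sum(lengths) / m)   # average chain length: n / m = 5.0
print(max(lengths))       # length of the longest chain in this run
```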

![alt text](Chaining.gif "Example")

### Exercise: Hashing with Chains
Complete the implementation below by implementing ```Lookup``` and ```Delete```.

In [5]:
## Your implementation goes here

class chaining_set:
    def __init__(self, size):
        
        self.T = []
        for _ in range(size):
            self.T.append([]) 
        ## why not self.T = [[]] * size ?
            
        self.prime = 993319
        self.a = random.randint(2, self.prime-1)
        self.b = random.randint(2, self.prime-1)
        self.n_keys = 0
        
    def insert(self, key):
        if self.lookup(key):
            return
        
        h = self.hash(key)
        self.T[h].append(key)
        self.n_keys += 1
    
    # return True if key is in the set, False otherwise
    def lookup(self, key):
        h = self.hash(key)
        # Scan the chain stored at slot h
        for item in self.T[h]:
            if item == key:
                return True
        return False
            
    
    # Return True if key was found and deleted, False otherwise
    def delete(self, key):
        h = self.hash(key)
        for i in range(len(self.T[h])):
            if self.T[h][i] == key:
                # Swap the key with the last element and pop it, so that
                # the other elements of the chain do not have to be shifted.
                self.T[h][i], self.T[h][-1] = self.T[h][-1], self.T[h][i]
                self.T[h].pop()
                self.n_keys -= 1
                return True
        return False
            
    def hash(self, key):
        return ((self.a*key + self.b) % self.prime) % len(self.T)
    
    def len(self):
        return self.n_keys

In [6]:
## Test your implementation

n = 10000

a = get_random_array(n, n)

queries = get_random_array(n, n)

c_set = chaining_set(2*n)
std_set = set()

for key in a:
    c_set.insert(key)
    std_set.add(key)

assert len(std_set) == c_set.len(), "Fail len!"     
    
for key in a:
    assert c_set.lookup(key) == True, "Lookup fail a"

    
for key in queries:
    assert c_set.lookup(key) == (key in std_set), "Lookup fail queries"

for key in a[:300]:
    c_set.delete(key)
    std_set.discard(key)  # no error if the key has already been removed
          
    assert c_set.lookup(key) == (key in std_set), "Lookup fail delete"  

In [7]:
%timeit for key in queries: c_set.lookup(key)
    
%timeit for key in queries: key in std_set

16 ms ± 156 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
1.3 ms ± 4.29 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)


----

### Exercise: Dictionary
Modify the previous code to implement a dictionary, i.e., store a value together with each key. 
You need to implement the following methods:
- ```Insert(key, value)```: insert the key with its value. If the key was already present, change its value;
- ```Delete(key)```: remove the key;
- ```Lookup(key)```: return True if the key is present, False otherwise;
- ```Value(key)```: return the value associated with the key, or None if the key is not present.

I suggest storing (key, value) pairs within the lists.


**Optional**. 
Implement ```keys()```, ```values()```, and ```items()```, which allow you to iterate over keys, values, and (key, value) pairs respectively. You have to use ```yield``` to implement each generator.  
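
A possible shape for these generators is sketched below as standalone functions over a list of chains. The function signatures and the hand-built `table` are assumptions for illustration. In the exercise they would be methods reading the class's own table.

```python
def items(table):
    """Yield every (key, value) pair stored in a list of chains."""
    for chain in table:
        for pair in chain:
            yield pair

def keys(table):
    """Yield only the keys, reusing items()."""
    for key, _ in items(table):
        yield key

def values(table):
    """Yield only the values, reusing items()."""
    for _, value in items(table):
        yield value

# A hand-built table with 3 slots; the placement of the pairs is made up.
table = [[(3, 'c')], [], [(1, 'a'), (4, 'd')]]
print(list(keys(table)))    # [3, 1, 4]
print(list(values(table)))  # ['c', 'a', 'd']
```

The pairs come out in table order, not in insertion order, because the slot of each key is decided by the hash function.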

In [8]:
## Your implementation goes here

class Dizionario:
    def __init__(self, size):
        
        self.T = []
        for _ in range(size):
            self.T.append([]) 
            
        self.prime = 993319
        self.a = random.randint(2, self.prime-1)
        self.b = random.randint(2, self.prime-1)
        self.n_keys = 0
        
             
    def insert(self, key, value):
        h = self.hash(key)
        esiste = False  # becomes True if the key is already stored
        
        lista = self.T[h]
        for tup_pos in range(len(lista)):
            if lista[tup_pos][0] == key:  # the key is already present
                esiste = True
                lista[tup_pos] = (key, value)  # overwrite the old value
                break
        if not esiste:  # new key: append it, so no duplicates are created
            lista.append((key, value))
            self.n_keys += 1
    
    
    # return True if key is in the dictionary, False otherwise
    def lookup(self, key):
        h = self.hash(key)
        lista = self.T[h]
        for i in range(len(lista)):
            if lista[i][0] == key:
                return True
        return False
            
    
    # Return True if key was found and deleted, False otherwise
    def delete(self, key):
        h = self.hash(key)
        lista = self.T[h]
        for i in range(len(lista)):
            if lista[i][0] == key:
                # Swap the pair with the last one and pop it, so that
                # the other pairs of the chain do not have to be shifted.
                lista[i], lista[-1] = lista[-1], lista[i]
                lista.pop()
                self.n_keys -= 1
                return True
        return False
    
    
    def value(self, key):
        h = self.hash(key)
        lista = self.T[h]
        for i in range(len(lista)):
            if lista[i][0] == key:
                return lista[i][1]
        return None
    
    
    def hash(self, key):
        return ((self.a*key + self.b) % self.prime) % len(self.T)
    
    
    def len(self):
        return self.n_keys

In [9]:
## Write some tests for your implementation here
n = 100

chiavi = get_random_array(n, n)
valori = get_random_array(n, n)
coppia = list(zip(chiavi, valori))

diz = Dizionario(2*n)
test_diz = dict()

for key, value in coppia:
    diz.insert(key, value)
    test_diz[key] = value

    
assert len(test_diz) == diz.len(), "Fail len!"     
    
for key in chiavi:
    assert diz.lookup(key) == True, "Lookup fail key"
    
for key in chiavi[:300]:
    diz.delete(key)
    test_diz.pop(key, None)
    assert diz.lookup(key) == (key in test_diz), 'Lookup fail delete'

for key in chiavi:
    assert diz.value(key) == test_diz.get(key), 'Value fail queries'

In [10]:
diz.insert(10, 20)
diz.value(10)

20

In [11]:
diz.insert(10, 2)
diz.value(10)

2

In [12]:
diz.insert(10, 37)
diz.value(10)

37