# Hash Tables



## 1 Building Energy Management System

### 1.1 Building Energy Management System

**A building energy management system (BEMS)** is a computer-based system that monitors and controls a building's electrical and mechanical equipment, such as lighting, power systems, heating, ventilation, and air conditioning (HVAC), and security systems.

![](./img/ds/bems.jpg)



### 1.2 A Simple Example: Measurement Tags of VCR

The table below stores the measurement tags of VCR Example 11-2. Every tag record has a unique `TagID`.

Refrigerant-134a enters the compressor of a refrigerator as superheated vapor at 0.14 MPa and -10°C at a rate of 0.05 kg/s and leaves at 0.8 MPa and 50°C.

The refrigerant is cooled in the condenser to 26°C and 0.72 MPa and is throttled to 0.15 MPa.

Disregarding any heat transfer and pressure drops in the connecting lines between the components,

**Determine:** the power input to the compressor.

![](./img/vcr/vcr-11-2.jpg)


In [None]:
%%file ./data/vcr-11-2.csv
TagID,Tag,Desc,Unit,Value
600,CompressorIPortM,压缩机入口流量,kg/s,0.05
616,CompressorOPortP,压缩机出口压力,MPa,0.8
613,CompressorOPortT,压缩机出口温度,°C,50
714,CondenserOPortP,冷凝器出口压力,MPa,0.72
708,CondenserOPortT,冷凝器出口温度,°C,26.0
814,ExpansionValveOPortP,膨胀阀出口压力,MPa,0.15
914,EvaporatorValveOPortP,蒸发器出口压力,MPa,0.14
908,EvaporatorValveOPortT,蒸发器出口温度,°C,-10.0

**Data Structure of Tags**

```python
# each tag record is a tuple: (id, (tag, desc, unit, value))
tag = (id, (tag, desc, unit, value))

# the tag table is a list of such tuples
vcc_table = [(id, (tag, desc, unit, value)), ...]
```

In [None]:
import  csv
filename="./data/vcr-11-2.csv"
csvfile = open(filename, 'r',encoding="utf-8")
csvdata = csv.DictReader(csvfile)
vcc_table=[]
for line in csvdata:
    id = int(line['TagID']) # convert to int
    tag=line['Tag']
    desc=line['Desc']
    unit=line['Unit']
    value=float(line['Value'])
    vcc_table.append((id,(tag,desc,unit,value)))
csvfile.close()  

In [None]:
for item in  vcc_table:
    print(item)

Find a tag by its `TagID` using linear search:

In [None]:
tagid=616
for item in vcc_table:
    if tagid==item[0]:
        print(tagid,item[1])       

Linear search takes $O(N)$ time, where $N$ is the number of records in the table.

Here are two data structures and the complexities of their key lookup operations:

| Data structure | Lookup |
| -------------- |:------:|
| Array | $O(N)$ |
| Sorted array | $O(\log N)$ |
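
The $O(\log N)$ entry for a sorted array comes from binary search. As a minimal sketch, assuming the tag IDs and names from the CSV above are held in a list sorted by TagID, Python's standard `bisect` module can do the lookup. The helper name `binary_lookup` is made up for this illustration.

```python
from bisect import bisect_left

# Tag IDs and names copied from the CSV above, sorted by TagID
sorted_tags = sorted([
    (600, "CompressorIPortM"), (616, "CompressorOPortP"),
    (613, "CompressorOPortT"), (714, "CondenserOPortP"),
    (708, "CondenserOPortT"), (814, "ExpansionValveOPortP"),
    (914, "EvaporatorValveOPortP"), (908, "EvaporatorValveOPortT"),
])
tag_ids = [tag_id for tag_id, _ in sorted_tags]


def binary_lookup(tag_id):
    """Return the tag name for tag_id, or None if it is absent."""
    i = bisect_left(tag_ids, tag_id)  # binary search: O(log N) comparisons
    if i < len(tag_ids) and tag_ids[i] == tag_id:
        return sorted_tags[i][1]
    return None


print(binary_lookup(616))
print(binary_lookup(808))  # 808 is not in the table
```

The price is that the list must be kept sorted, so every insertion costs $O(N)$ in the worst case.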
 
Is there a data structure that can do better? It turns out that there is: **the hash table**, one of the best and most useful data structures.

In Python, dictionaries (type `dict`) use **hashing** to do lookups in time that is nearly `independent` of the `size` of the dictionary.
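
As a small illustration, here is a `dict` keyed by TagID. The three records are copied from the CSV above; the variable name `tags` is only for this sketch.

```python
# Tag records keyed by TagID: (tag, unit, value)
tags = {
    600: ("CompressorIPortM", "kg/s", 0.05),
    616: ("CompressorOPortP", "MPa", 0.8),
    613: ("CompressorOPortT", "°C", 50.0),
}

print(tags[616])      # direct lookup by key
print(616 in tags)    # membership test, also hash-based
print(tags.get(808))  # a missing key: get() returns None instead of raising KeyError
```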


The basic idea behind hashing is to

* **convert the key to an integer, and then use that integer to index into a list**,

which can be done in `constant` time.

**Hash function**: any function that can be used to map data of `arbitrary` size to `fixed-size` values.

* `CurTagID % ListSize` (the division method, k mod m: the remainder of key k divided by table length m)

![](./img/ds/hash1.png)

**Hash value**: the value returned by a hash function.

* `Index_TagID = CurTagID % ListSize`

**Hash table**: the data structure that maps keys to values with hashing.

* `vcc_table = [None for i in range(ListSize)]`

> A hash table speeds up lookup by mapping each `key` to `one position` in the table, where the record is stored. The mapping function is called the hash function, and the array that stores the records is called the hash table.


For example, we use the remainder `key % ListSize` as the index into the list:

In [None]:
import  csv
filename="./data/vcr-11-2.csv"
csvfile = open(filename, 'r',encoding="utf-8")
csvdata = csv.DictReader(csvfile)

# set the size of the storage list
ListSize=30
# the storage table
vcc_table=[None for i in range(ListSize)]
for line in csvdata:
    id = int(line['TagID'])
    tag=line['Tag']
    desc=line['Desc']
    unit=line['Unit']
    value=float(line['Value'])
    # convert the key to an integer:  address in the list
    address= id%ListSize
    # put the record in the address of the list
    vcc_table[address]=(id,(tag,desc,unit,value))
    print(id,address)
csvfile.close() 

In [None]:
for  i,item in  enumerate(vcc_table):
    print(i,item)

Look up one tag by its `unique` TagID:


In [None]:
tagid=616
address=tagid%ListSize
print(tagid,vcc_table[address])   
print(tagid,vcc_table[address][1])   

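The lookup is done in **constant** time that is `independent` of the `size` of `vcc_table`.

* The search time complexity is $O(1)$

When every key maps to its own storage position in the list, records can be **addressed directly**, and the lookup time complexity is $O(1)$.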

## 2 Collision 

### 2.1 Collision 

If the space of possible `outputs` is **smaller** than the space of possible `inputs`,

* a hash function is a `many`-to-`one` mapping.

When different keys are mapped to the same hash value, it is called a **collision**.

> Hash collision: in a hash table, different keys map to the same storage position.


**For example**

* the number of keys (inputs) in the CSV above is 8

* the number of possible hash values (outputs), `ListSize`, is 5

The hash function: `id % ListSize`


In [None]:
import  csv
filename="./data/vcr-11-2.csv"
csvfile = open(filename, 'r',encoding="utf-8")
csvdata = csv.DictReader(csvfile)

# set the size of the storage list
ListSize=5
for line in csvdata:
    id = int(line['TagID'])
    # convert the key to an integer: address of the list
    address= id%ListSize
    print(id,address)
csvfile.close()  

**Many collisions!**
```
613 3
708 3
908 3
```
```
714 4
814 4
914 4
```
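
The colliding keys can also be found programmatically. The sketch below groups the eight TagIDs from the CSV by their address and keeps only the addresses that receive more than one key. The names `by_address` and `collisions` are chosen for this illustration.

```python
from collections import defaultdict

tag_ids = [600, 616, 613, 714, 708, 814, 914, 908]  # TagIDs from the CSV above
list_size = 5

# group the keys by their hash address
by_address = defaultdict(list)
for tag_id in tag_ids:
    by_address[tag_id % list_size].append(tag_id)

# an address with more than one key is a collision
collisions = {addr: ids for addr, ids in by_address.items() if len(ids) > 1}
print(collisions)  # {3: [613, 708, 908], 4: [714, 814, 914]}
```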

**If a collision occurs, the later record overwrites the earlier one**

In [None]:
import  csv
filename="./data/vcr-11-2.csv"
csvfile = open(filename, 'r',encoding="utf-8")
csvdata = csv.DictReader(csvfile)

# set the size of the storage list
ListSize=5
# the storage table
vcc_table=[None for i in range(ListSize)]
for line in csvdata:
    id = int(line['TagID'])
    tag=line['Tag']
    desc=line['Desc']
    unit=line['Unit']
    value=float(line['Value'])
    # convert the key to an integer:address of the list
    address= id%ListSize
    # put the record in the index of the list
    vcc_table[address]=(id,(tag,desc,unit,value))
csvfile.close() 

In [None]:
CompressorTagIDList=[600,616,613,914,908]
for tagid in CompressorTagIDList:
    address=tagid%ListSize
    print(vcc_table[address]) 

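In the output you will see the same record twice, once for tagid 613 and once for tagid 908:
```
(908, ('EvaporatorValveOPortT', '蒸发器出口温度', '°C', -10.0))
(908, ('EvaporatorValveOPortT', '蒸发器出口温度', '°C', -10.0))
```
because of the collision: both keys map to address 3, and the record for 908, which was stored last, replaced the earlier record.
```
613 3
908 3
```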

If the space of possible outputs is bigger than the number of keys actually stored, are there

* **no collisions**?

For example: `ListSize=20`

In [None]:
import  csv
filename="./data/vcr-11-2.csv"
csvfile = open(filename, 'r',encoding="utf-8")
csvdata = csv.DictReader(csvfile)

# set the size of the storage list
ListSize=20
# the storage table
vcc_table=[None for i in range(ListSize)]
for line in csvdata:
    id = int(line['TagID'])
    tag=line['Tag']
    desc=line['Desc']
    unit=line['Unit']
    value=float(line['Value'])
    # convert the key to an integer: address of the list
    address= id%ListSize
    # put the record in the index of the list
    vcc_table[address]=(id,(tag,desc,unit,value))
    print(id, address)
csvfile.close() 

**There are still collisions:**

```
714 14
814 14
914 14

708 8
908 8
```


### 2.2 Handle the collision 

There are two ways to handle collisions in a hash table:

1. **Minimize collisions**:

   * a `good` hash function produces a **uniform distribution**: every output in the range is equally probable, which `minimizes` the probability of `collisions`

   * the `sweet spot` size of the hash table

2. **Collision resolution**: Separate Chaining, Open Addressing
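
Separate chaining is developed in section 3. Open addressing is only named here, so the following is a rough sketch of one of its variants, linear probing, under stated assumptions: a table size of 11 picked for illustration, no support for deletion, and hypothetical helper names `probe_insert` and `probe_lookup`. On a collision the record goes into the next free slot instead of a separate list.

```python
def probe_insert(table, key, value):
    """Store (key, value); on a collision, try the following slots in turn."""
    size = len(table)
    for step in range(size):
        slot = (key % size + step) % size
        if table[slot] is None or table[slot][0] == key:
            table[slot] = (key, value)
            return slot
    raise RuntimeError("hash table is full")


def probe_lookup(table, key):
    """Return the value stored for key, or None if the key is absent."""
    size = len(table)
    for step in range(size):
        slot = (key % size + step) % size
        if table[slot] is None:
            return None  # reached an empty slot: the key was never stored
        if table[slot][0] == key:
            return table[slot][1]
    return None


table = [None] * 11  # illustrative size, not taken from the text
for tag_id, name in [(600, "CompressorIPortM"), (616, "CompressorOPortP"),
                     (613, "CompressorOPortT"), (714, "CondenserOPortP"),
                     (708, "CondenserOPortT"), (814, "ExpansionValveOPortP"),
                     (914, "EvaporatorValveOPortP"), (908, "EvaporatorValveOPortT")]:
    probe_insert(table, tag_id, name)

# 616 and 814 both hash to 0, so 814 is placed in slot 1
print(probe_lookup(table, 814))
print(probe_lookup(table, 808))  # absent key
```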


### 2.3 Choice of hash table size

Assuming you have a good hash function, by making the hash table large enough we can **reduce** the number of collisions sufficiently to allow us to treat the complexity as $O(1)$.

Let's think about the extreme:

* You create a hash table with 1,000,000 buckets and you add 1,000 items to it. With a uniform hash function, each insertion has less than a 0.1% chance of landing in an occupied bucket, so collisions are rare and this will perform amazingly.

* An array large enough to reserve a position for every key allows **direct addressing**, with $O(1)$ time complexity.

But it will **waste a lot of space**. Therefore, you need to find the `"sweet spot"` for the size of the hash table vs. the number of items you plan to put into it.

The choice of hash table size depends in part on the choice of hash function and the collision resolution strategy.

But a good general **rule of thumb** is:

* The hash table should be an array with length about **1.3** times the maximum number of keys that will actually be in the table, and

* the size of the hash table array should be a **prime** number
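
The rule of thumb can be turned into a small helper. This is only a sketch: the function names `is_prime` and `suggest_table_size` are invented here, and rounding 1.3 times the key count up to the next prime is one possible reading of "about 1.3 times".

```python
def is_prime(n):
    """Trial division; adequate for table sizes of this scale."""
    if n < 2:
        return False
    i = 2
    while i * i <= n:
        if n % i == 0:
            return False
        i += 1
    return True


def suggest_table_size(max_keys):
    """Smallest prime that is at least 1.3 * max_keys."""
    size = (13 * max_keys + 9) // 10  # ceil(1.3 * max_keys) in integer arithmetic
    while not is_prime(size):
        size += 1
    return size


print(suggest_table_size(8))     # 11 for the eight tags in the CSV
print(suggest_table_size(1000))  # 1301
```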



## 3 Separate Chaining

### 3.1 Separate Chaining
There are different ways to resolve a collision. We will look at a method called **Separate Chaining**.

**Chain hashing** does not avoid collisions; it resolves them. The idea is to make each cell of the hash table point to a linked list of records (a `bucket`) that have the same hash function value.

* All elements that hash to the same value are kept in one `linked list`.

**Bucket**: a linked list of records with the same hash function value

The hash table is a list of `hash buckets`.

For Example:
```
keys :   [36,18,72,43,6,10,5,15]
tab size : 8
hash function : key % tab size
```
![](./img/ds/hashtable_separatechaining.gif)



### 3.2 Hash Table in Python

The basic idea is to represent the hash table by a list where **each item** is a list of **key/value** pairs that have the `same` hash index

```python
[

    [bucket of the same hash value1],

    [bucket of the same hash value2]
,...
]
```

Every key/value pair in a bucket is a tuple:
```python
(key, value)
```

In [None]:
keyvalues = [(36, "赵"), (18, "钱"), (72, "孙"), (43, "李"), (6, "周"), (10, "吴"), (5, "郑"), (15, "王")]
num_buckets=8

buckets=[[] for i in range(num_buckets)]

print("Key","The address in buckets","\n"+20*"-")
for item in keyvalues:
    #hash function: key % num_buckets
    address= item[0] % num_buckets
    buckets[address].append(item)
    print(item[0],address)

print("\nNo.","Bucket","\n"+20*"-")   
for  i,bucket in  enumerate(buckets):
    print(i,bucket)

### 3.3 Search

In [None]:
key=10
hashvalue=key % num_buckets
for item in buckets[hashvalue]:
    if item[0]==key:
        print(key,item[1])  

## 4 The Class of Separate Chaining

* **key** is an integer or a string

* **hash function**: `key % numBuckets` for integer keys and djb2 for string keys

> Hash function for string
>
>* http://www.cse.yorku.ca/~oz/hash.html
>
>**djb2**
>
>This algorithm (k=33) was first reported by `Dan Bernstein` many years ago in comp.lang.c.
>
>The magic of number 33 (why it works better than many other constants, prime or not) has never been adequately explained.
>```python
> hash = 5381
> for c in dictKey:
>     hash = (hash * 33) + ord(c)
> hash % numBuckets
>```
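
The C original on the linked page computes `hash = ((hash << 5) + hash) + c`. Shifting left by 5 multiplies by 32, so `(hash << 5) + hash` equals `hash * 33`. A standalone Python sketch follows; the function name `djb2` and the optional `num_buckets` parameter are choices made for this illustration. Unlike the C version, which wraps around in an `unsigned long`, Python integers grow without bound, so no overflow occurs here.

```python
def djb2(text, num_buckets=None):
    """djb2 string hash; reduced modulo num_buckets when one is given."""
    h = 5381
    for ch in text:
        h = h * 33 + ord(ch)  # same as ((h << 5) + h) + ord(ch)
    return h if num_buckets is None else h % num_buckets


print(djb2("a"))                    # 5381 * 33 + 97 = 177670
print(djb2("CompressorOPortP", 5))  # a bucket index in range(5)
```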


In [None]:
class hashTable:
    """A dictionary with integer and string keys"""
    
    def __init__(self, numBuckets):
        """Create an empty dictionary
           buckets is initialized to a list of numBuckets empty lists.
        """
        self.numBuckets=numBuckets
        self.buckets=[[] for i in range(self.numBuckets)] 
            
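    def getHashValue(self, dictKey):
        """Returns the bucket index for an int or str key"""
        if isinstance(dictKey, int):
            return dictKey % self.numBuckets
        if isinstance(dictKey,str):
            # djb2 hash algorithm by Dan Bernstein
            hash = 5381
            for c in dictKey:
                hash = (hash * 33) + ord(c)
            return hash % self.numBuckets
        # other key types are not supported by this class
        raise TypeError("key must be an int or a str")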

    
    def addEntry(self, dictKey, dictVal):
        """Assumes dictKey is an int or a str. Adds an entry, or replaces the entry that has the same key."""
        address=self.getHashValue(dictKey)
        hashBucket = self.buckets[address]
        for i in range(len(hashBucket)):
            if hashBucket[i][0] == dictKey:
                hashBucket[i] = (dictKey, dictVal) #if one was found,replace
                return
        hashBucket.append((dictKey, dictVal)) # append a new entry (dictKey, dictVal) to the bucket if none was found.
        
    def getValue(self, dictKey):
        """Returns the value associated with the key dictKey, or None if there is no such key"""
        hashBucket = self.buckets[self.getHashValue(dictKey)]
        for e in hashBucket:
            if e[0] == dictKey: # key
                return e[1]     # value
        return None
    
    def __str__(self):
        entries = []
        for b in self.buckets:
            for e in b:
                entries.append(str(e[0]) + ':' + str(e[1]))
        return '{' + ','.join(entries) + '}' # also yields '{}' for an empty table


### 4.1 Initialize the hash table

```python
def __init__(self, numBuckets):
    """
    The instance variable buckets is initialized to a list of numBuckets empty lists
    """
    self.numBuckets = numBuckets
    self.buckets = [[] for i in range(self.numBuckets)]
```

### 4.2 Hash function

```python
def getHashValue(self, dictKey):
    if isinstance(dictKey, int):
        return dictKey % self.numBuckets
    if isinstance(dictKey, str):
        # djb2 hash algorithm by Dan Bernstein
        hash = 5381
        for c in dictKey:
            hash = (hash * 33) + ord(c)
        return hash % self.numBuckets
```


### 4.3 addEntry

By making each bucket a list, we handle collisions by storing all of the values that hash to the same bucket in the list.

```python
def addEntry(self, dictKey, dictVal):
    """
    Store an entry with key dictKey, replacing any existing entry with that key
    """
    hashBucket = self.buckets[self.getHashValue(dictKey)] # locate the bucket for dictKey in self.buckets
    for i in range(len(hashBucket)):
        if hashBucket[i][0] == dictKey: # each item in a bucket is a tuple: (dictKey, dictVal)
            hashBucket[i] = (dictKey, dictVal) # if one was found, replace
            return
    hashBucket.append((dictKey, dictVal)) # append a new entry (dictKey, dictVal) to the bucket if none was found
```
   
We use the hash function to convert dictKey into an integer, and use that integer to index into buckets
```python
hashBucket = self.buckets[self.getHashValue(dictKey)]
```
to find the hash bucket associated with **dictKey**. We then scan that bucket for an entry whose key matches:
```python
hashBucket[i][0] == dictKey
```
If **a value is to be stored**, then

* if one was found: **replace** the value in the existing entry,

* if none was found: **append** a new entry to the bucket
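
The two cases can be seen in a standalone sketch that mirrors `addEntry` with a plain function. The name `add_entry` and the updated reading 55.0 are hypothetical and serve only this illustration.

```python
def add_entry(buckets, key, value):
    """Replace the entry for key if it exists, otherwise append a new one."""
    bucket = buckets[key % len(buckets)]
    for i, (existing_key, _) in enumerate(bucket):
        if existing_key == key:
            bucket[i] = (key, value)  # key found: replace
            return
    bucket.append((key, value))       # key not found: append


buckets = [[] for _ in range(5)]
add_entry(buckets, 613, 50.0)
add_entry(buckets, 708, 26.0)  # collides with 613 (both hash to 3): appended
add_entry(buckets, 613, 55.0)  # same key again (hypothetical new reading): replaced
print(buckets[3])              # [(613, 55.0), (708, 26.0)]
```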


### 4.4 getValue

```python
def getValue(self, dictKey):
```
We hash dictKey to find its bucket, then search that bucket (which is a list) linearly to see if there is an entry with the key dictKey.

```python
for e in hashBucket:
    if e[0] == dictKey: # key
        return e[1]     # value
```

If we are doing **a lookup** and there is an entry with the key, we simply return the value stored with that key.

If there is no entry with that key, we return None.




### 4.5 Measurement Tags of VCC

#### 4.5.1 Integer keys
The hash table for Measurement Tags of VCC


In [None]:
import  csv
filename="./data/vcr-11-2.csv"
csvfile = open(filename, 'r',encoding="utf-8")
csvdata = csv.DictReader(csvfile)
Entrys=[]
for line in csvdata:
    id = int(line['TagID'])
    tag=line['Tag']
    desc=line['Desc']
    unit=line['Unit']
    value=float(line['Value'])
    Entrys.append((id,(tag,desc,unit,value))) 
csvfile.close()  

**A hash table with fewer buckets than entries: collisions**

* numBuckets=5

In [None]:
numBuckets=5
# numBuckets (5) < number of entries (8)
D = hashTable(numBuckets)
for item in Entrys:
    D.addEntry(item[0],item[1])

print('The hashTable(integer key) is:')
print(D)

print('\n', 'The hash buckets are:')
for i,hashBucket in enumerate(D.buckets):
    print('BucketID',i,'  ', hashBucket)


Each bucket holds zero or more tuples (here up to three), depending upon **the number of collisions** that occurred.

In [None]:
CompressorTagIDList=[600,616,613,914,908]
for tagid in CompressorTagIDList:
    value=D.getValue(tagid)
    print(tagid,value)

In [None]:
# 808 is not in the table, so getValue returns None
tagid=808
value=D.getValue(tagid)
print(tagid,value)

#### 4.5.2 String keys

In [None]:
import  csv
filename="./data/vcr-11-2.csv"
csvfile = open(filename, 'r',encoding="utf-8")
csvdata = csv.DictReader(csvfile)
Entrys=[]
for line in csvdata:
    tag=line['Tag']
    desc=line['Desc']
    unit=line['Unit']
    value=float(line['Value'])
    Entrys.append((tag,(desc,unit,value))) 
csvfile.close()  

In [None]:
numBuckets=5
# numBuckets (5) < number of entries (8)
D = hashTable(numBuckets)
for item in Entrys:
    D.addEntry(item[0],item[1])

print('The hashTable(String key) is:')
print(D)

print('\n', 'The hash buckets are:')
for i,hashBucket in enumerate(D.buckets):
    print('\tBucketID',i,'  ', hashBucket)


In [None]:
tagname='CompressorOPortP'
value=D.getValue(tagname)
print(tagname,value)

## Further Reading

* Yan Weimin, Li Dongmei, Wu Weimin. Data Structures (C Language Edition), 2nd ed. Posts & Telecom Press, February 2015

* Mark Allen Weiss. Data Structures and Algorithm Analysis in C
