# Arranging and Searching Data
The four data operations are create, read, update, and delete (CRUD). Nearly every task depends on getting to the data you need quickly and easily, so placing data in an order that simplifies CRUD operations is important: the less code you need to make data access work, the better.

Sorted data makes searches considerably faster, as long as the sort matches the search. Sorting and searching go together: you sort the data in a way that makes searching faster.
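
As a small illustration of why the sort has to match the search, the sketch below compares a one-item-at-a-time scan with a binary search that relies on sorted input. It uses Python's standard `bisect` module; the helper names `linear_search` and `binary_search` are assumptions made for this example.

```python
from bisect import bisect_left

def linear_search(items, target):
    """Check items one at a time; works on unsorted data."""
    for index, item in enumerate(items):
        if item == target:
            return index
    return -1

def binary_search(sorted_items, target):
    """Halve the search range at each step; requires sorted input."""
    index = bisect_left(sorted_items, target)
    if index < len(sorted_items) and sorted_items[index] == target:
        return index
    return -1

data = [9, 5, 7, 4, 2, 8, 1, 10, 6, 3]
ordered = sorted(data)

print(linear_search(data, 8))      # position in the unsorted list
print(binary_search(ordered, 8))   # position in the sorted list
```

The binary search only gives correct answers when the list is sorted on the same criterion used for the lookup.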

There are many different ways available to search for data. Some of these techniques are slower than others; some have attributes that make them attractive to developers.

The use of indexing (in hash maps/dictionaries) makes sorting and searching significantly faster but also comes with trade-offs that you need to consider (such as the use of additional resources).

## Introduction to Sorting Algorithms

### Defining Why Sorting Data is Important

When the data is unsorted, you need to search one item at a time, and you don't even know whether you'll find what you need without searching every item in the dataset first. The need to maintain several sorted orders for the same data is the reason that developers created indexes. Sorting a small index is faster than sorting the entire dataset. By maintaining an index for each sort requirement, you can effectively cut data access time and allow several people to access the data at the same time in the order in which they need to access it.
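
One way to picture an index is as a sorted list of record positions rather than a sorted copy of the records. The sketch below assumes a made-up `records` list with hypothetical `name` and `age` fields, and builds one index per sort requirement.

```python
# Hypothetical records; the field names are assumptions for this example
records = [
    {"name": "Cara", "age": 35},
    {"name": "Abe", "age": 52},
    {"name": "Bea", "age": 28},
]

# Each index stores record positions, ordered by one field
by_name = sorted(range(len(records)), key=lambda i: records[i]["name"])
by_age = sorted(range(len(records)), key=lambda i: records[i]["age"])

print(by_name)  # positions in name order
print(by_age)   # positions in age order

# The records themselves never move; each index gives a different view
for position in by_age:
    print(records[position]["name"], records[position]["age"])
```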

When considering how effective a particular sort algorithm is at arranging data, timing benchmarks typically look at two factors:
- **Comparisons:** The number of times the target data is compared against existing data in the dataset
- **Exchanges:** The number of times data changes place in a dataset during the sort process
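
Both factors can be measured by instrumenting a sort. The following is a minimal sketch that counts comparisons and exchanges in a selection sort; the function name `count_selection_sort` is an assumption made for this example, and the function works on a copy so the input list is left untouched.

```python
def count_selection_sort(values):
    """Selection sort on a copy, returning (sorted_list, comparisons, exchanges)."""
    data = list(values)
    comparisons = 0
    exchanges = 0
    for scan in range(len(data)):
        smallest = scan
        for comp in range(scan + 1, len(data)):
            comparisons += 1
            if data[comp] < data[smallest]:
                smallest = comp
        if smallest != scan:
            data[smallest], data[scan] = data[scan], data[smallest]
            exchanges += 1
    return data, comparisons, exchanges

result, comparisons, exchanges = count_selection_sort([9, 5, 7, 4, 2, 8, 1, 10, 6, 3])
print(result, comparisons, exchanges)
```

For a list of n items, this selection sort always makes n(n-1)/2 comparisons, while the number of exchanges depends on how many items already sit in their final position.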

### Ordering Data Naively

This approach orders data by brute force, without making any guess as to where each item should appear in the list. These algorithms tend to work with the entire dataset at once (as opposed to taking a divide-and-conquer approach, for example), and are also relatively easy to understand while using compute resources efficiently. The trade-off is that their runtime can be slower than other 'smarter' algorithms.

#### Selection Sort

https://en.wikipedia.org/wiki/Selection_sort

A selection sort works in one of two ways: it either looks for the smallest item in the list and places it in the front of the list (ensuring that the item is in its correct location) or looks for the largest item and places it in the back of the list. 

*Worst-Case Runtime:* O(n<sup>2</sup>)

*Benefits:*
- Easy to implement
- Guarantees that items immediately appear in their final location once moved (minimal exchanges)

In [12]:
data = [9, 5, 7, 4, 2, 8, 1, 10, 6, 3]

def selectionSort(data):
    for scanIndex in range(0, len(data)):
        minIndex = scanIndex
        
        for compIndex in range(scanIndex + 1, len(data)):
            if data[compIndex] < data[minIndex]:
                minIndex = compIndex
        
        if minIndex != scanIndex:
            data[minIndex], data[scanIndex] = data[scanIndex], data[minIndex]
        
        print(data)
            
    return data
            
data = selectionSort(data)

[1, 5, 7, 4, 2, 8, 9, 10, 6, 3]
[1, 2, 7, 4, 5, 8, 9, 10, 6, 3]
[1, 2, 3, 4, 5, 8, 9, 10, 6, 7]
[1, 2, 3, 4, 5, 8, 9, 10, 6, 7]
[1, 2, 3, 4, 5, 8, 9, 10, 6, 7]
[1, 2, 3, 4, 5, 6, 9, 10, 8, 7]
[1, 2, 3, 4, 5, 6, 7, 10, 8, 9]
[1, 2, 3, 4, 5, 6, 7, 8, 10, 9]
[1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
[1, 2, 3, 4, 5, 6, 7, 8, 9, 10]


#### Insertion Sort

https://en.wikipedia.org/wiki/Insertion_sort

An insertion sort works by using a single item as a starting point and adding items to the left or right of it based on whether these items are less than or greater than the selected item.

*Best-Case Runtime:* O(n) - When the entire dataset is already sorted, no values need to be moved 

*Worst-Case Runtime:* O(n<sup>2</sup>) - When the entire dataset is in reverse order, every insertion requires moving every value that already appears in the output

*Benefits:*
- Easy to implement
- Can require fewer comparisons than a selection sort
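
The best and worst cases can be checked by counting comparisons. The sketch below is an instrumented insertion sort; the name `insertion_sort_comparisons` is an assumption made for this example, and the function sorts a copy of its input.

```python
def insertion_sort_comparisons(values):
    """Insertion sort on a copy, returning (sorted_list, comparisons)."""
    data = list(values)
    comparisons = 0
    for scan in range(1, len(data)):
        temp = data[scan]
        pos = scan
        while pos > 0:
            comparisons += 1
            if temp < data[pos - 1]:
                data[pos] = data[pos - 1]   # shift the larger value right
                pos -= 1
            else:
                break
        data[pos] = temp
    return data, comparisons

print(insertion_sort_comparisons(list(range(1, 11)))[1])       # already sorted
print(insertion_sort_comparisons(list(range(10, 0, -1)))[1])   # reverse order
```

With ten items, the sorted input needs one comparison per inserted item, while the reversed input compares each new item against everything already placed.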

In [8]:
data = [9, 5, 7, 4, 2, 8, 1, 10, 6, 3]

def insertionSort(data):
    for scanIx in range(1, len(data)):
        temp = data[scanIx]
        
        while scanIx > 0 and temp < data[scanIx - 1]:
            data[scanIx] = data[scanIx - 1]
            scanIx -= 1
            
        data[scanIx] = temp
        print(data)
    
    return data

data = insertionSort(data)

[5, 9, 7, 4, 2, 8, 1, 10, 6, 3]
[5, 7, 9, 4, 2, 8, 1, 10, 6, 3]
[4, 5, 7, 9, 2, 8, 1, 10, 6, 3]
[2, 4, 5, 7, 9, 8, 1, 10, 6, 3]
[2, 4, 5, 7, 8, 9, 1, 10, 6, 3]
[1, 2, 4, 5, 7, 8, 9, 10, 6, 3]
[1, 2, 4, 5, 7, 8, 9, 10, 6, 3]
[1, 2, 4, 5, 6, 7, 8, 9, 10, 3]
[1, 2, 3, 4, 5, 6, 7, 8, 9, 10]


### Employing Better Sort Techniques

As technology improves, the sort algorithms begin taking a more intelligent approach to getting data into the right order. Rather than work with an entire dataset, smart sorting algorithms work with individual items, reducing the work required to perform the task.

#### Merge Sort

https://en.wikipedia.org/wiki/Merge_sort

A merge sort works by applying the divide-and-conquer approach. The sort begins by breaking the dataset into individual pieces and sorting the pieces. It then merges the pieces in a manner that ensures that the merged piece is sorted. This process continues until the entire dataset is again a single, sorted piece.

*Worst-Case Runtime:* O(n log n) - This runtime is considerably faster than the previous examples, as log n is always < n

**Python Tip:** When slicing a list, the slice is inclusive of the begin index and exclusive of the end index
- The slice data[2:] will return a list from index position 2 onwards
- The slice data[:2] will return a list up to but excluding index position 2

In [20]:
# Python Tip Demo
data = [5, 3, 7, 4, 9]

mid = len(data) // 2 # floor division to return int
print("Mid: ", mid)
print("data[{}:] - ".format(mid), data[mid:])
print("data[:{}] - ".format(mid), data[:mid])

Mid:  2
data[2:] -  [7, 4, 9]
data[:2] -  [5, 3]


In [1]:
# Mergesort Demo
data = [5, 3, 7, 4, 9]

def mergeSort(data):
    # Check Base Case: data is one (or zero) elements
    if len(data) < 2:
        return data
    
    # Calculate midpoint using floor division
    mid = len(data) // 2
    
    # Split dataset
    left = mergeSort(data[:mid])
    right = mergeSort(data[mid:])
    
    # Merge the sorted pieces
    print("Left Side: ", left)
    print("Right Side: ", right)
    merged = merge(left, right)
    print("Merged: ", merged)
    return merged

def merge(left, right):
    result = []
    leftIx = 0
    rightIx = 0
    totalLen = len(left) + len(right)
    
    while len(result) < totalLen:
        if left[leftIx] < right[rightIx]:
            result.append(left[leftIx])
            leftIx += 1
        else:
            result.append(right[rightIx])
            rightIx += 1
    
        if leftIx == len(left) or rightIx == len(right):
            result.extend(left[leftIx:] or right[rightIx:])
            break
            
    return result

mergeSort(data)

Left Side:  [5]
Right Side:  [3]
Merged:  [3, 5]
Left Side:  [4]
Right Side:  [9]
Merged:  [4, 9]
Left Side:  [7]
Right Side:  [4, 9]
Merged:  [4, 7, 9]
Left Side:  [3, 5]
Right Side:  [4, 7, 9]
Merged:  [3, 4, 5, 7, 9]


[3, 4, 5, 7, 9]

#### Quick Sort

https://en.wikipedia.org/wiki/Quicksort

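Quick sort also works by applying a divide-and-conquer approach. It picks a pivot point in the dataset, moves lower data below the pivot and higher data above it, then quick sorts the partitions on either side of the split. In partition schemes such as Lomuto's, the pivot is locked in its final position once a partition is processed; the Hoare scheme used below only guarantees that every value ends up on the correct side of the returned split point.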

*Average-Case Runtime:* O(n log n)

*Worst-Case Runtime:* O(n<sup>2</sup>) - With simple pivot choices such as the first or last element, several inputs can cause the worst-case runtime, although it is rare in practice: the dataset is already sorted, the dataset is sorted in reverse order, or (in some partition schemes) all elements are the same.

In [2]:
data = [9, 5, 7, 4, 2, 8, 1, 10, 6, 3]


def quickSort(data, low, high):
    if low < high:
        part = partition(data, low, high)
        quickSort(data, low, part)
        quickSort(data, part + 1, high)

# Implementation of the Hoare partition scheme
def partition(data, low, high):
    pivot = data[(low + high) // 2]
    i = low
    j = high
    
    while True:
        while data[i] < pivot:
            i += 1
        while data[j] > pivot:
            j -= 1
        
        if i >= j:
            return j
    
        print("\nSwapping high ({}) and low ({})".format(data[i], data[j]))
        data[i], data[j] = data[j], data[i]
        print(data)
        
        # Step past the swapped pair; without this, two values equal to the
        # pivot would be swapped back and forth forever
        i += 1
        j -= 1

print("Before sort: ", data)
quickSort(data, 0, len(data) - 1)

Before sort:  [9, 5, 7, 4, 2, 8, 1, 10, 6, 3]

Swapping high (9) and low (1)
[1, 5, 7, 4, 2, 8, 9, 10, 6, 3]

Swapping high (5) and low (2)
[1, 2, 7, 4, 5, 8, 9, 10, 6, 3]

Swapping high (8) and low (3)
[1, 2, 7, 4, 5, 3, 9, 10, 6, 8]

Swapping high (9) and low (6)
[1, 2, 7, 4, 5, 3, 6, 10, 9, 8]

Swapping high (7) and low (3)
[1, 2, 3, 4, 5, 7, 6, 10, 9, 8]

Swapping high (7) and low (6)
[1, 2, 3, 4, 5, 6, 7, 10, 9, 8]

Swapping high (10) and low (8)
[1, 2, 3, 4, 5, 6, 7, 8, 9, 10]


## Using Search Trees and the Heap

Search trees enable you to look for data quickly. Obtaining data items, placing them in sorted order in a tree, and then searching that tree is one of the faster ways to find information. 

A special kind of tree structure is the *binary heap*, which places each of the node elements in a special order: in a min heap, upper-level branches are always smaller in value than lower-level branches and leaves. The effect is to keep the tree balanced and in a predictable order so that searching becomes extremely efficient. The cost is in keeping the tree balanced.

### Considering the Need to Search Effectively

Of all the tasks that applications perform, searching is among the most time consuming and also the one required most. The benefit of creating and maintaining a dataset comes from using it to perform useful work, which means searching it for important information. Consequently, searches must proceed as efficiently as possible. The only problem is that no one search performs every task with absolute efficiency, so you must weigh your options based on what you expect to do as part of the search routines.

Two of the more efficient methods of searching involve the use of the binary search tree and binary heap.

#### Binary Search Tree

https://en.wikipedia.org/wiki/Binary_search_tree

In a binary search tree, the keys follow an order in which lesser numbers appear to the left of a node, and greater numbers appear to the right. The root node contains a value that is somewhere in the middle of the range of keys, giving the BST an easily understood structure. 

**Advantages over Binary Heap for Searching:**
- Searching for an element requires O(log n) time in a balanced tree, which is less than O(n) for a binary heap
- Printing the elements in order requires only O(n) time, which is less than O(n log n) for a binary heap
- Finding the floor and ceiling requires O(log n) time in a balanced tree
- Locating K<sup>th</sup> smallest/largest element requires O(log n) when the tree is properly configured

In [25]:
class Node:
    def __init__(self, data, left=None, right=None):
        self.data = data
        self.left = left
        self.right = right
        
    def __str__(self):
        return str(self.data)

class BinarySearchTree:
    def __init__(self, rootNode=None):
        self.rootNode = rootNode
        
    def __str__(self):
        return "Root Node: {}".format(self.rootNode.data)
        
    def push(self, node):
        pointer = self.rootNode
        nodeAdded = False
        
        # Iterate through tree until suitable location is found
        while(not nodeAdded):
            if pointer == None:
                self.rootNode = node
                nodeAdded = True
            elif node.data > pointer.data:
                if pointer.right == None:
                    pointer.right = node
                    nodeAdded = True
                else:
                    pointer = pointer.right
            else:
                if pointer.left == None:
                    pointer.left = node
                    nodeAdded = True
                else:
                    pointer = pointer.left
        
        return nodeAdded
    
    def traverse(self, pointer=None, level=0):
        if pointer == None:
            pointer = self.rootNode
        
        if pointer.left != None:
            self.traverse(pointer.left, level + 1)
        if pointer.right != None:
            self.traverse(pointer.right, level + 1)
        
        print("\t" * level, pointer.data)
        

data = [5, 9, 7, 4, 2, 8, 1, 10, 6, 3]

myTree = BinarySearchTree()

for val in data:
    myTree.push(Node(val))

myTree.traverse()

			 1
			 3
		 2
	 4
			 6
			 8
		 7
		 10
	 9
 5
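
The class above inserts and traverses but does not search. As a hedged sketch of how a lookup and an in-order listing might work on the same kind of tree, the following standalone example uses hypothetical names (`TreeNode`, `insert`, `contains`, `in_order`) that are separate from the classes above.

```python
class TreeNode:
    def __init__(self, data):
        self.data = data
        self.left = None
        self.right = None

def insert(root, value):
    """Insert value and return the (possibly new) root; duplicates go left."""
    if root is None:
        return TreeNode(value)
    if value > root.data:
        root.right = insert(root.right, value)
    else:
        root.left = insert(root.left, value)
    return root

def contains(root, target):
    """Follow a single path down the tree, discarding one subtree per step."""
    node = root
    while node is not None:
        if target == node.data:
            return True
        node = node.right if target > node.data else node.left
    return False

def in_order(root):
    """Yield values in ascending order: left subtree, node, right subtree."""
    if root is not None:
        yield from in_order(root.left)
        yield root.data
        yield from in_order(root.right)

root = None
for val in [5, 9, 7, 4, 2, 8, 1, 10, 6, 3]:
    root = insert(root, val)

print(contains(root, 8), contains(root, 11))
print(list(in_order(root)))
```

The lookup follows one path from the root, so its cost depends on the height of the tree, while the in-order listing visits every node once.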


#### Binary Heap

https://en.wikipedia.org/wiki/Binary_heap

There are two main kinds of binary heaps: a *binary max heap*, where each level of the heap contains values that are less than the previous level and the root contains the maximum key value for the tree, and a *binary min heap*, where each level of the heap contains values that are greater than the previous level and the root contains the minimum key value for the tree.

**Advantages Over BST for Searching:**
- Creating the required structures requires fewer resources because binary heaps can rely on arrays, making them cache friendlier as well.
- Building a binary heap requires O(n) time, whereas building a BST requires O(n log n) time
- Using pointers to implement the tree isn't necessary
- Relying on binary heap variations (e.g., the Fibonacci heap) offers advantages such as changing a key's priority (decrease-key in a min heap) in O(1) amortized time

**Binary Heap Implementation:**

The below implementation is sourced directly from: https://interactivepython.org/courselib/static/pythonds/Trees/BinaryHeapImplementation.html

Refer to the above link for a detailed explanation of the implementation.

In [19]:
class BinaryHeap:
    def __init__(self):
        self.heapList = [0]
        self.currentSize = 0 
        
    def insert(self, data):
        self.heapList.append(data)
        self.currentSize += 1
        self.percUp(self.currentSize)
    
    def percUp(self,i):
        while i // 2 > 0:
            if self.heapList[i] < self.heapList[i // 2]:
                self.heapList[i], self.heapList[i // 2] = self.heapList[i // 2], self.heapList[i]
            i = i // 2
        
    def delMin(self):
        minVal = self.heapList[1] # get minimum value
        
        self.heapList[1] = self.heapList[self.currentSize] # put last value in list at top of heap
        self.heapList.pop() # pop the last value out of the list
        self.currentSize = self.currentSize - 1 # reduce size by one to account for popping
        
        self.percDown(1) # the index value to be pushed down the tree structure
        
        return minVal
    
    def percDown(self,i):
        while (i * 2) <= self.currentSize:
            mc = self.getMinChildIndex(i)
            
            if self.heapList[i] > self.heapList[mc]:
                self.heapList[i], self.heapList[mc] = self.heapList[mc], self.heapList[i]

            i = mc
            
    def getMinChildIndex(self,i):
        if i * 2 + 1 > self.currentSize:
            return i * 2
        else:
            if self.heapList[i*2] < self.heapList[i*2+1]:
                return i * 2
            else:
                return i * 2 + 1
    
    # if we start with an entire list then we can build the whole heap in O(n) operations
    # iterating through the list and using insert results in O(n log n) time
    def buildHeap(self,alist):
        i = len(alist) // 2
        self.currentSize = len(alist)
        self.heapList = [0] + alist[:]
        while (i > 0):
            self.percDown(i)
            i = i - 1
        
data = [5, 9, 7, 4, 2, 8, 1, 10, 6, 3]

binHeap = BinaryHeap()
binHeap.buildHeap(data)

print(binHeap.heapList)

[0, 1, 2, 5, 4, 3, 8, 7, 10, 6, 9]
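
For comparison, Python's standard library ships an array-based min heap in the `heapq` module. The short sketch below uses it on the same data; unlike the class above, `heapq` stores the root at index 0 with no placeholder element.

```python
import heapq

data = [5, 9, 7, 4, 2, 8, 1, 10, 6, 3]

heap = list(data)
heapq.heapify(heap)          # rearranges the list into a min heap in place
print(heap[0])               # the smallest value sits at index 0

heapq.heappush(heap, 0)      # insert a new value
smallest = heapq.heappop(heap)
print(smallest)

# Popping repeatedly yields the values in ascending order
drained = [heapq.heappop(heap) for _ in range(len(heap))]
print(drained)
```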


## Relying on Hashing

A major problem with most sorting algorithms is that they sort all the data in a dataset. When the dataset is small, you hardly notice, but as the dataset grows larger, the data movement becomes noticeable. A way around this problem is to sort just the key information. A *key* is the identifying data for a particular data record.

You gain a major speed advantage by sorting the smaller amount of data presented by the keys, rather than the records as a whole.

### Putting Everything into Buckets

Until now, the search and sort routines presented have worked by performing a series of comparisons until the algorithm finds the correct value. The act of performing comparisons slows the algorithms because each comparison takes some amount of time to complete.

A smarter way to perform the task involves predicting the location of a particular data item in the data structure (whatever that structure might be) before actually looking for it. A *hash table* does this by providing the means to create an index of keys that points to individual items in a data structure so that an algorithm can easily predict the location of the data. The index of keys is created by a *hash function*, which converts keys into numerical values which serve as the index. 

Because a hash function produces repeatable results, you can easily predict the location of required data - in many cases, a hash table provides a search time of O(1).

A hash table contains a specific number of *slots* that you can view as buckets for holding data. The number of filled slots when compared to the number of available slots is the *load factor*. When the load factor increases, the potential for *collisions*, where two data entries have the same hash value, increases as well. 
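
To make the terms concrete, the sketch below fills a small table using a modulus hash, counts collisions, and computes the load factor. The helper name `load_factor` and the sample values are assumptions for illustration; a colliding value is simply counted and skipped here rather than stored.

```python
def load_factor(table):
    """Fraction of slots that currently hold a value."""
    filled = sum(1 for slot in table if slot is not None)
    return filled / len(table)

size = 15
table = [None] * size
collisions = 0

for value in [22, 40, 102, 105, 23, 31, 6, 5, 37]:
    slot = value % size
    if table[slot] is not None:
        collisions += 1      # another value already hashed to this slot
    else:
        table[slot] = value

print("Collisions:", collisions)
print("Load factor:", load_factor(table))
```

In this run, 37 and 22 both hash to slot 7, which produces the single collision.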

One of the more typical methods for calculating the hash value for an input is to obtain the modulus of the value divided by the number of slots. Theoretically, if you have a perfect hash function and an infinite number of slots, every value you present to the hash function will produce a unique value (thus avoiding collisions). However, the more complex your hash function, the less benefit you receive from hashing, so keeping things simple is the best way to go. 

In [39]:
class HashTable:
    def __init__(self, size=15):
        self.table = [None] * size
        self.size = size
    
    def __str__(self):
        return str(self.table)
    
    def hashFx(self, key):
        return key % self.size
    
    def insert(self, value):
        self.table[self.hashFx(value)] = value
        
    def lookup(self, key):
        return self.table[self.hashFx(key)]

data = [22, 40, 102, 105, 23, 31, 6, 5]

ht = HashTable()
for val in data:
    ht.insert(val)
    
print("Values Stored in Hash Table: ", ht)
val = 22
print("Lookup {} in Hash Table: ".format(val), ht.lookup(val))

Values Stored in Hash Table:  [105, 31, None, None, None, 5, 6, 22, 23, None, 40, None, 102, None, None]
Lookup 22 in Hash Table:  22


### Avoiding Collisions

A problem occurs when two data entries have the same hash value. If you simply write the value into the hash table, the second entry will overwrite the first, resulting in data loss. *Collisions*, the use of the same hash value by two values, require you to have some sort of strategy in mind for handling them.

One of the methods for avoiding collisions is to ensure that you have a large enough hash table - keeping the load factor low is your first line of defense. However, sometimes the potential dataset is so large, but the used dataset is so small, that avoiding the problem becomes impossible. Consequently, a hash function may have to use more than just a simple modulus output to create the hash value. 

**Techniques for Avoiding Collisions:**
- **Partial values:** When working with some types of information, part of that information repeats, which can create collisions. For example, the first three digits of a phone number can repeat for a given area, so removing those numbers and using just the remaining ones may solve the collision problem.
- **Folding:** Creating a unique number might be as easy as dividing the original number into pieces, adding the pieces together, and using the result as the hash value. For example: 555-1234 -> 55 + 51 + 234 = 340
- **Mid-Square:** The hash squares the value in question, uses some number of digits from the center of the resulting number, and discards the rest of those digits. For example: 120<sup>2</sup> = 14,400, then use 440 as the hash value
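
The three techniques above can be sketched in a few lines. The function names `fold` and `mid_square` and the piece sizes passed to `fold` are assumptions chosen to reproduce the examples in the list.

```python
def fold(digits, sizes):
    """Split a digit string into pieces of the given sizes and add them up."""
    total = 0
    start = 0
    for size in sizes:
        total += int(digits[start:start + size])
        start += size
    return total

def mid_square(value, width):
    """Square the value and keep `width` digits from the middle of the result."""
    squared = str(value * value)
    start = (len(squared) - width) // 2
    return int(squared[start:start + width])

phone = "5551234"

partial = int(phone[3:])            # partial value: drop the repeating prefix
folded = fold(phone, (2, 2, 3))     # folding: 55 + 51 + 234
middle = mid_square(120, 3)         # mid-square: middle digits of 14400

print(partial, folded, middle)
```

Each result would still typically be reduced with a modulus so that it fits the number of slots in the table.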

There are as many ways to build a hash function as you can imagine. Unfortunately, no hash function can guarantee that collisions won't happen. When they do occur, you can use one of the following methods to address them.

**Techniques for Addressing Collisions:**
- **Open addressing:** Your hash function stores the value in the next open slot by looking through the slots sequentially until it finds an open slot to use. The problem with this approach is that it assumes an open slot for each potential value, which may not be the case. In addition, open addressing means that the search slows considerably after the load factor increases. You can no longer find the needed value on the first comparison.
- **Rehashing:** Your hash function hashes the hash value plus some constant. Suppose you have a hash value of 22, a constant of 100, and a table containing 30 slots. If slot 22 already has a value, you can rehash with the function (22 + 100) % 30, which produces a new hash value of 2. In this case, you don't need to search the hash table sequentially for a value. When implemented correctly, a search might still require only a low number of comparisons to find the target value. 
- **Chaining:** Each slot in the hash table can hold multiple values. You can implement this approach by using a list within a list. Every time a collision occurs, the code simply appends the value to the list in the target slot. This approach offers the benefit of knowing that the hash will always produce the correct slot, but the list within that slot will still require some sort of sequential (or other) search to find the specific value.

In [46]:
class HashTableOpenAdd:
    def __init__(self, size=15):
        self.table = [None] * size
        self.size = size
    
    def __str__(self):
        return str(self.table)
    
    def hashFx(self, key):
        return key % self.size
    
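    def insert(self, value):
        hashVal = self.hashFx(value)
        
        if self.table[hashVal] is None:
            self.table[hashVal] = value
        else:
            print("Collision! Hash Value: {}".format(hashVal))
            # Probe sequentially, wrapping around at the end of the table
            for _ in range(self.size):
                if self.table[hashVal] is None:
                    self.table[hashVal] = value
                    return
                hashVal = (hashVal + 1) % self.size
            raise OverflowError("Hash table is full")
    
    def lookup(self, key):
        hashVal = self.hashFx(key)
        
        # Probe at most `size` slots; an empty slot means the key is absent
        for _ in range(self.size):
            if self.table[hashVal] is None:
                return None
            if self.table[hashVal] == key:
                return self.table[hashVal]
            print("Found unexpected key at Hash Value {}. Checking next slot ...".format(hashVal))
            hashVal = (hashVal + 1) % self.size
        return None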
                 
data = [22, 40, 102, 105, 23, 31, 6, 5, 34, 68]

ht = HashTableOpenAdd(10)
for val in data:
    ht.insert(val)

print("Values Stored in Hash Table: ", ht)
val = 34
print("\nLookup {} in Hash Table: ".format(val))
print("Result: ", ht.lookup(val))

Collision! Hash Value: 2
Collision! Hash Value: 3
Collision! Hash Value: 5
Collision! Hash Value: 4
Collision! Hash Value: 8
Values Stored in Hash Table:  [40, 31, 22, 102, 23, 105, 6, 5, 34, 68]

Lookup 34 in Hash Table: 
Found unexpected key at Hash Value 4. Checking next slot ...
Found unexpected key at Hash Value 5. Checking next slot ...
Found unexpected key at Hash Value 6. Checking next slot ...
Found unexpected key at Hash Value 7. Checking next slot ...
Result:  34


In [55]:
class HashTableRehashing:
    def __init__(self, size=15, const=123):
        self.table = [None] * size
        self.size = size
        self.const = const
    
    def __str__(self):
        return str(self.table)
    
    def hashFx(self, key):
        return key % self.size
    
    def insert(self, value):
        hashVal = self.hashFx(value)
        
        if self.table[hashVal] is None:
            self.table[hashVal] = value
        else:
            print("Collision! Rehashing ...")
            # Rehash at most `size` times so a full table cannot loop forever
            for _ in range(self.size):
                hashVal = self.hashFx(hashVal + self.const)
                if self.table[hashVal] is None:
                    self.table[hashVal] = value
                    return
            raise OverflowError("No open slot found by rehashing")
        
    def lookup(self, key):
        hashVal = self.hashFx(key)
        # Follow the same rehash sequence as insert; give up after `size` probes
        for _ in range(self.size):
            if self.table[hashVal] == key:
                return self.table[hashVal]
            print("Wrong value at Hash Value {}. Rehashing...".format(hashVal))
            hashVal = self.hashFx(hashVal + self.const)
        return None
                  
data = [22, 40, 102, 105, 23, 31, 6, 5, 34, 68]

ht = HashTableRehashing(size=10)
for val in data:
    ht.insert(val)

print("Values Stored in Hash Table: ", ht)
val = 34
print("\nLookup {} in Hash Table: ".format(val))
print("Result: ", ht.lookup(val))

Collision! Rehashing ...
Collision! Rehashing ...
Collision! Rehashing ...
Collision! Rehashing ...
Collision! Rehashing ...
Values Stored in Hash Table:  [40, 31, 22, 23, 5, 102, 6, 34, 105, 68]

Lookup 34 in Hash Table: 
Wrong value at Hash Value 4. Rehashing...
Result:  34


In [76]:
class HashTableChain:
    def __init__(self, size=15):
        self.table = [None] * size
        self.size = size
    
    def __str__(self):
        return str(self.table)
    
    def hashFx(self, key):
        return key % self.size
    
    def insert(self, value):
        hashVal = self.hashFx(value)
        if self.table[hashVal] == None:
            self.table[hashVal] = [value]
        else:
            self.table[hashVal].append(value)
    
    def lookup(self, key):
        hashVal = self.hashFx(key)
        # An empty slot holds None, so fall back to an empty list
        for val in self.table[hashVal] or []:
            if val == key:
                return val
            else:
                print("Wrong value. Checking next element in list ...")
        return None
    
data = [22, 40, 102, 105, 23, 31, 6, 5, 34, 68]

ht = HashTableChain(size=10)

for val in data:
    ht.insert(val)

print("Values Stored in Hash Table: ", ht)
val = 102
print("\nLookup {} in Hash Table: ".format(val))
print("Result: ", ht.lookup(val))

Values Stored in Hash Table:  [[40], [31], [22, 102], [23], [34], [105, 5], [6], None, [68], None]

Lookup 102 in Hash Table: 
Wrong value. Checking next element in list ...
Result:  102


### Hash Functions in Python

If you wonder where else hash tables appear, check out Python's dictionaries. Dictionaries are, in fact, hash tables; they handle collisions internally, so you won't lose data when two hashed keys happen to produce the same result. The use of a hash explains why you can't use every data type as a key: mutable built-in types, like lists, are unhashable.
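
A quick way to see the difference is to compare a tuple with a list using the built-in `hash` function. The dictionary contents below are made up for illustration.

```python
# Immutable values such as tuples are hashable, so they can be dictionary keys
coordinates = {(40, -74): "New York"}
print(coordinates[(40, -74)])

# Equal values hash to the same number within one run of the interpreter
same_hash = hash((40, -74)) == hash((40, -74))
print(same_hash)

# A list is mutable, so Python refuses to hash it
list_is_hashable = True
try:
    hash([40, -74])
except TypeError as error:
    list_is_hashable = False
    print("TypeError:", error)
```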

You can find many examples of different hash functions in the Python `hashlib` module. It contains many algorithms, including the Secure Hash Algorithms and RSA's MD5 algorithm.

- **Secure Hash Algorithm (SHA):** These algorithms include SHA1, SHA224, SHA256, SHA384 and SHA512. Released by the National Institute of Standards and Technology as a U.S. Federal Information Processing Standard, SHA algorithms provide support for security applications and protocols.
- **RSA's MD5 Algorithm:** Initially designed for security applications, this hash turned into a popular way to checksum files. *Checksums* reduce files to a single number that enables you to determine whether the file was modified since hash creation - they let you determine if the file you downloaded was corrupted or altered by a hacker. To ensure file integrity, just check whether the MD5 checksum of your copy corresponds to the original one communicated by the author of the file.
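
As a minimal sketch of the checksum idea, the example below hashes two byte strings with `hashlib` and compares the digests. The helper name `checksum` and the sample payloads are assumptions for illustration; a real file would be read in binary mode and passed in as bytes.

```python
from hashlib import md5, sha256

def checksum(payload, algorithm=sha256):
    """Return the hex digest of a bytes payload."""
    return algorithm(payload).hexdigest()

original = b"The quick brown fox jumps over the lazy dog"
tampered = b"The quick brown fox jumps over the lazy cog"

print(checksum(original, md5))
print(checksum(original) == checksum(original))   # same input, same digest
print(checksum(original) == checksum(tampered))   # one changed byte, different digest
```

The sketch defaults to SHA-256 and accepts `md5` as an alternative, since both are exposed by `hashlib` with the same interface.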

You can combine the output of multiple hash functions when working with complex applications that rely on a large dataset. Simply sum the results of the various outputs after having done a multiplication on one or more of them. The sum of two hash functions treated in this way retains the qualities of the original hash functions even though the result is different and the original elements of the sum cannot be recovered from it.

The following code snippet relies on the `hashlib` module and the `md5` and `sha1` hash algorithms. You just provide a number to use for the multiplication inside the hash sum.

In [84]:
from hashlib import md5, sha1

# A function to create infinite hash functions (dependent on the i provided)
def hashFx(element, i, length=10**8):
    h1 = int(md5(element.encode('ascii')).hexdigest(), 16)
    h2 = int(sha1(element.encode('ascii')).hexdigest(), 16)
    return (h1 + i*h2) % length

print("Hash of 'CAT', Multiple of 1: {}".format(str(hashFx("CAT", 1))))
print("Hash of 'CAT', Multiple of 2: {}".format(str(hashFx("CAT", 2))))

Hash of 'CAT', Multiple of 1: 21064018
Hash of 'CAT', Multiple of 2: 72743738
