#   Big Data
## Algorithms: Searching, Recursion and Data Structures
## Victor P. Debattista March 2017


Welcome to the second lecture on algorithms and data structures.  This one focusses on Searching, especially on trees and hashes

In [103]:
import numpy as np
import math
import random
import time

We are going to adapt some code from the sorting exercise.  We want to create two lists of N numbers which we will use as our list for storing and searching.  One of our lists has uniformly random numbers, the other has a Gaussian (normal/Bell curve) distribution

In [106]:
random.seed(22)
N = 10000
#N = 10

# Data1 is uniformly distributed
data1 = []
for i in range(N):
    data1.append(random.uniform(0.,10000.))

# Data2 is distributed as a Gaussian with average = 100 and sigma = 20
data2 = []
for i in range(N):
    data2.append(random.normalvariate(100.,20.))

We are going to experiment with open hashing, which we will implement as a list of lists in Python, akin to how we did a BinSort in week 1.

In [108]:
def InitHash(nbins):
    htable = []    # This is the empty hash table
    for i in range(nbins):
        htable.append([])
    return htable

We want to define a few functions to determine some statistics of our hash table occupation.  We want three quantities, the minimum entires, the maximum entries, and the average entries.

In [109]:
def HashStats(htable, N):
    sizes = np.zeros(N)
    for i in range(N):
        sizes[i] = (len(htable[i]))
    min_entries = min(sizes)
    max_entries = max(sizes)
    avg_entries = np.mean(sizes)
    return min_entries, avg_entries, max_entries    

From last week's BinSort exercise, let's borrow the indexing function: given a value which is within a given range [lo,hi], finds the bin to place the element into if there are N bins.  If the value is out of range some flag value should be returned.  This will be the basis of our hash function

In [112]:
def bin_index(val, lo, hi, N):
    if( (val < lo) or (val > hi) ):
        return -1
    else: 
        tmp = (val - lo) * N/(hi -lo)
        return int(tmp)

For convenience let's add a function that takes a list, a hash function and number of buckets and hashes it

In [113]:
def Hashify(arr, hashfunction, lo, hi, N):
    htable = InitHash(N)
    for val in arr:
        ind = hashfunction(val, lo, hi, N)
        if(ind >=0):
            htable[ind].append(val)
    return htable

In [114]:
# first compute statistics if the numbers are uniform
hashTable = Hashify(data1, bin_index, 0., 10000., 50000)
minbin1, avgbin1, maxbin1 = HashStats(hashTable, 50000)
print('Minimum: {} Average: {} Maximum: {}'.format(minbin1, avgbin1, maxbin1))

# now compute statistics if the data are more bunched
hashTable = Hashify(data2, bin_index, 0, 10000. ,50000)
minbin2, avgbin2, maxbin2 = HashStats(hashTable, 50000)
print('Minimum: {} Average: {} Maximum: {}'.format(minbin2, avgbin2, maxbin2))

Minimum: 0.0 Average: 0.2 Maximum: 4.0
Minimum: 0.0 Average: 0.2 Maximum: 57.0


So this is not very satisfying, our hash function is causing a lot of collisions, which are going to slow down our searches.  Develop a new hash function and compare it with the one above.

Here we're going to define a new hash function based on inverting the order of digits

In [189]:
def invert_digits(num):
    '''Reverses the digit order of a floating point number keeping the delimiter in the same place. '''
    tmp = list(str(num))
    i = tmp.index('.')
    tmp = tmp[::-1]
    tmp.remove('.')
    tmp.insert(i,'.')
    tmp = ''.join(tmp)
    return float(tmp)

def hash_fun2(val, lo, hi, N):
    tmp = invert_digits(val)
    j = bin_index(tmp, lo, hi, N)
    return j

In [190]:
# first compute statistics if the numbers are uniform
hashTable = Hashify(data1, hash_fun2 ,0. ,10000. ,50000)
minbin1, avgbin1, maxbin1 = HashStats(hashTable, 50000)
print('Minimum: {} Average: {} Maximum: {}'.format(minbin1, avgbin1, maxbin1))

#now compute statistics if the data are more bunched
hashTable = Hashify(data2, hash_fun2, 0., 10000., 50000)
minbin2, avgbin2, maxbin2 = HashStats(hashTable, 50000)
print('Minimum: {} Average: {} Maximum: {}'.format(minbin2, avgbin2, maxbin2))

Minimum: 0.0 Average: 0.2 Maximum: 4.0
Minimum: 0.0 Average: 0.2 Maximum: 22.0


So we still have too many collisions.  Need to try a different approach.  In the next hashing we're going to use those digits after the decimal point for our hashing

In [209]:
def get_digits(num):
    '''Returns the digits of num after the decimal point. '''
    tmp = str(num)
    i = tmp.index('.')
    tmp = tmp[i+1:]
    return int(tmp)
    
def hash_fun3(val, lo, hi, N):
    tmp = val - int(val)
    while(tmp < hi):
        tmp = tmp*10.
    tmp = tmp % hi
    j = bin_index(tmp, lo, hi, N)
    return j

In [211]:
# first compute statistics if the numbers are uniform
hashTable = Hashify(data1, hash_fun3, 0., 10000., 50000)
minbin1, avgbin1, maxbin1 = HashStats(hashTable, 50000)
print('Minimum: {} Average: {} Maximum: {}'.format(minbin1, avgbin1, maxbin1))

# now compute statistics if the data are more bunched
hashTable = Hashify(data2, hash_fun3, 0., 10000., 50000)
minbin2, avgbin2, maxbin2 = HashStats(hashTable, 50000)
print('Minimum: {} Average: {} Maximum: {}'.format(minbin2, avgbin2, maxbin2))

Minimum: 0.0 Average: 0.2 Maximum: 4.0
Minimum: 0.0 Average: 0.2 Maximum: 4.0


Let us now develop the functionality for a binary search tree.  Since this involves defining some classes, we develop that here before moving on to some questions.  We start by developing the class Node, which is the basic nodes of a binary tree

In [216]:
class Node:
    def __init__(self,val):
        self.l = None # Left child
        self.r = None # Right child
        self.v = val # Value of the node

Now we need to build the Tree class.  We build the functionality for inserting and finding

In [393]:
class Tree:
    def __init__(self):
        self.root = None
        self.counter = 0
        self.leaves = 0
    
    def add(self, val):
        if(self.root == None):
            self.root = Node(val)
        else:
            self._add(val, self.root)
    
    def _add(self, val, node):
        if(val < node.v):
            if(node.l is not None):
                self._add(val, node.l)
            else:
                node.l = Node(val)
        else:
            if(node.r is not None):
                self._add(val, node.r)
            else:
                node.r = Node(val)
                
    def find(self, val):
        if(self.root != None):
            return self._find(val, self.root)
        else:
            return None
    
    def _find(self, val, node):
        if(val == node.v):
            return node
        elif(val < node.v and node.l is not None):
            return self._find(val, node.l)
        elif(val > node.v and node.r is not None):
            return self._find(val, node.r)
        
    def count_nodes(self, node):
        if node is None:
            return
        else:
            self.count_nodes(node.l)
            self.counter += 1
            self.count_nodes(node.r)
        return self.counter
                
    def depth(self, node, maxormin):
        if node is None:
            return 0
        else:
            if maxormin == 'max':
                return max(self.depth(node.l, maxormin), self.depth(node.r, maxormin)) + 1
            elif maxormin == 'min':
                return min(self.depth(node.l, maxormin), self.depth(node.r, maxormin)) + 1
            else:
                return 'Not a valid argument'

    def count_leaves(self, node):
        if node is None:
            return
        elif node.l is None and node.r is None:
            self.leaves += 1
            return
        else:
            self.count_leaves(node.l)
            self.count_leaves(node.r)
        return self.leaves
            
    def inorder(self, node):
        if node is None:
            return
        else:
            self.inorder(node.l)
            print(node.v)
            self.inorder(node.r)

In [290]:
# And here are some examples on how to use this functionality
bst = Tree()
bst.add(3)
bst.add(4)
bst.find(4)
print(bst.root.r.v)

4


OK with this build two BST called "tr" with the elements of data1 generated at the top of this exercise (you can try it with data2 also)

In [400]:
'''My solution: '''
tr = Tree()
#for val in data1[0:10]:
for val in data1:
    tr.add(val)
print('Total number of nodes in the tree: {}'.format(tr.count_nodes(tr.root)))
print('Minimal Depth: {}'.format(tr.depth(tr.root, 'min')))
print('Maximal Depth: {}'.format(tr.depth(tr.root, 'max')))
print('Number of leaves: {}'.format(tr.count_leaves(tr.root)))
print('In order listing of the tree:')
tr.inorder(tr.root)

Total number of nodes in the tree: 10000
Minimal Depth: 4
Maximal Depth: 34
Number of leaves: 3340
In order listing of the tree:
0.13393762374525053
0.3062384190777312
1.26708702534728
2.961197729346443
4.908756848494011
5.125039670373921
5.283946623794167
5.4342714850963425
5.7266200916850085
5.856266724588721
7.101599520924484
7.872340325080218
8.0505966744715
9.046769304964508
10.152647079678667
10.290272011553858
11.431390455889368
12.632956620867164
13.383384161104184
13.856996503586183
14.19595637134119
15.202994378135104
15.273444439588557
16.458492625418543
18.627653780133315
24.582342445047534
24.968650964095616
25.65288650420383
26.089287599714297
26.349254369729678
27.749067034571475
28.37172717109704
30.317692513512906
30.72030692929051
30.805742976781623
31.02824113720648
33.15201163174963
33.393759471659344
35.270328739854804
35.49771808817992
36.08260271464658
37.98506879418406
38.52633858870802
38.73848315194439
40.68927581108994
42.48149957450931
46.71913191452104
48.3

Returning to the Tree class definition above, add methods for computing the number of values stored, the maximum and minimum distance to all leaves, the number of leaves and an Inorder listing of the tree.  Once you have that compute the following (during debugging work with only 10 elements of data1, i.e. data1[0:10], by uncommenting the appropriate line above and commenting the one below it

In [395]:
''' This should be the the result. '''
print('Total number of nodes in tree =',tr.count_nodes(tr.root))
print('Minimum depth =',tr.depth(tr.root,min))
print('Maximum depth =',tr.depth(tr.root,max))
print('Total number of leaves =',tr.count_leaves(tr.root))
tr.inorder(tr.root)

Total number of nodes in tree = 20
Minimum depth = Not a valid argument
Maximum depth = Not a valid argument
Total number of leaves = 8
236.1614713882554
1205.9206321532502
1403.685900763948
1842.5364570285308
2317.41489986076
3456.448375625667
6514.212405579194
8895.509397958029
9582.093798172727
9986.306536729146
