## Search & Sort Algorithms

The search algorithms are:
1. sequential search (improved on an ordered list)
2. binary search

The sort algorithms are:
1. selection sort
2. bubble sort
3. merge sort
4. quick sort
5. insertion sort
6. shell sort

### 1. Sequential search

#### 1.1 Unordered list
| Case               | Best Case   | Worst Case | Average Case  |
|--------------------|-------------|------------|---------------|
|item is present     |    1        |  n         |  n/2          |
|item is not present |    n        |  n         |  n            |

#### 1.2 Ordered list
| Case               | Best Case   | Worst Case | Average Case  |
|--------------------|-------------|------------|---------------|
|item is present     |    1        |  n         |  n/2          |
|item is not present |    1        |  n         |  n/2          |

However, this technique is still O(n). In summary, ordering the list improves a sequential search only in the case where the item is not present, because the search can stop as soon as it reaches an item larger than the one we are looking for.
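The table entries can be checked empirically by counting how many items a search examines. The following is a minimal sketch; `count_comparisons` and its `ordered` flag are hypothetical names introduced here for illustration, and one "comparison" is counted per item examined.

```python
def count_comparisons(alist, item, ordered=False):
    """Return (found, comparisons) for a sequential search.

    If ordered is True, alist is assumed to be sorted in ascending order and
    the search stops as soon as it sees a value larger than item.
    """
    comparisons = 0
    for value in alist:
        comparisons += 1
        if value == item:
            return True, comparisons
        if ordered and value > item:
            return False, comparisons
    return False, comparisons


data = [0, 1, 2, 8, 13, 17, 19, 32, 42]
print(count_comparisons(data, 3))                 # (False, 9): scans all n items
print(count_comparisons(data, 3, ordered=True))   # (False, 4): stops at 8
print(count_comparisons(data, 17, ordered=True))  # (True, 6)
```

For a missing item, the ordered search stops early, while the unordered search always examines every item.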

#### Unordered List

In [11]:
def sequentialSearch(alist, item):
    pos = 0
    
    while pos < len(alist):
        if alist[pos]==item:
            return True
        else:
            pos += 1
    return False

In [10]:
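def sequentialSearch(alist, item):
    # Variant without an index: compare the head of the list, then drop it.
    # Note: each slice alist[1:] copies the rest of the list, so this
    # version does O(n^2) work overall; the index-based version above is O(n).
    while alist:
        if alist[0]==item:
            return True
        alist = alist[1:]

    return False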
    

In [13]:
sequentialSearch([],3)

False

#### Ordered list

In [14]:
def sequentialSearchOrdered(alist, item):
    pos=0
    
    while pos < len(alist):
        if alist[pos]==item:
            return True
        elif alist[pos] > item:
            return False
        else:
            pos += 1
            
    return False

In [16]:
sequentialSearchOrdered([0, 1, 2, 8, 13, 17, 19, 32, 42,],17)

True

### 2. Binary search

If the list is sorted, we can use binary search, which is a divide-and-conquer algorithm.

In [19]:
def binarySearch(alist, item):
    if not alist:
        return False
    
    while alist:
        mid = len(alist)//2
    
        if alist[mid]==item:
            return True
        elif alist[mid] > item:
            alist = alist[:mid]
        else:
            alist = alist[mid+1:]
    return False
    

In [23]:
binarySearch([0, 1, 2, 8, 13, 17, 19, 32, 42,], 32)

True

In [16]:
def binarySearchRecurion(alist, item):
    if not alist:
        return False
    mid = len(alist)//2
    
    if alist[mid]==item:
        return True
    elif alist[mid] > item:
        return binarySearchRecurion(alist[:mid], item)
    else:
        return binarySearchRecurion(alist[mid+1:], item)

In [28]:
binarySearchRecurion([0, 1, 2, 8, 13, 17, 19, 32, 42,], 33)

False

In [17]:
binarySearchRecurion([4,5,6,7],5)

True

### Analysis of Binary search:
http://interactivepython.org/runestone/static/pythonds/SortSearch/TheBinarySearch.html#lst-binarysearchpy

Each comparison halves the portion of the list that remains, so a binary search needs at most about log2(n) comparisons, i.e. O(log n). However, a call such as `binarySearch(alist[:midpoint],item)` uses the slice operator to create the left half of the list, which is then passed to the next invocation (and similarly for the right half). The O(log n) analysis assumes that the slice operator takes constant time, but slicing in Python is actually O(k), where k is the length of the slice. This means that a binary search using slices will not run in strict logarithmic time. This can be remedied by passing the list along with the starting and ending indices, as the version below does.

In [32]:
def binarySearchRecurion(alist, item):
    return helper(alist, item, 0, len(alist)-1)

def helper(alist, item, start, end):
    if start > end:
        return False
    
    mid = (start + end)//2
    
    if alist[mid]==item:
        return True
    elif alist[mid] > item:
        return helper(alist, item, start, mid-1)
    else:
        return helper(alist, item, mid+1, end)

In [35]:
binarySearchRecurion([0, 1, 2, 8, 13, 17, 19, 32, 42,], 32)

True
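The same index-based idea also works without recursion. The following is a minimal sketch of an iterative variant; the name `binary_search_iterative` is hypothetical, and the list is assumed to be sorted in ascending order.

```python
def binary_search_iterative(alist, item):
    """Return True if item is in the sorted list alist, without slicing."""
    start, end = 0, len(alist) - 1
    while start <= end:
        mid = (start + end) // 2
        if alist[mid] == item:
            return True
        if alist[mid] > item:
            end = mid - 1      # continue in the left half
        else:
            start = mid + 1    # continue in the right half
    return False


print(binary_search_iterative([0, 1, 2, 8, 13, 17, 19, 32, 42], 32))  # True
print(binary_search_iterative([0, 1, 2, 8, 13, 17, 19, 32, 42], 33))  # False
```

Because only the two indices change, no copies of the list are made and each step costs constant time.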

Even though a binary search is generally better than a sequential search, it is important to note that for small values of n, the additional cost of sorting is probably not worth it. In fact, we should always consider whether it is cost effective to take on the extra work of sorting to gain searching benefits. If we can sort once and then search many times, the cost of the sort is not so significant. However, for large lists, sorting even once can be so expensive that simply performing a sequential search from the start may be the best choice.
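The "sort once, search many times" pattern can be sketched with the standard-library `bisect` module, which performs the binary search on an already sorted list. The helper name `contains_sorted` and the sample data are assumptions made for this illustration.

```python
import bisect


def contains_sorted(sorted_list, item):
    """Binary search via bisect_left; sorted_list must be in ascending order."""
    i = bisect.bisect_left(sorted_list, item)
    return i < len(sorted_list) and sorted_list[i] == item


data = [32, 8, 42, 0, 17, 2, 19, 1, 13]
sorted_data = sorted(data)      # pay the sorting cost once
queries = [17, 33, 0, 42]       # then answer many queries with binary search
print([contains_sorted(sorted_data, q) for q in queries])
# [True, False, True, True]
```

Whether this pays off depends on how many searches follow the single sort.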

In [38]:
(len([3, 5, 6, 8, 11, 12, 14, 15, 17, 18] )-1)//2

4

### Hashing
http://interactivepython.org/runestone/static/pythonds/SortSearch/Hashing.html

**Hashing**: A technique for building a data structure that can be searched in O(1) time. To search for an item in a hash table, the hash function takes the item and computes an integer in O(1); we then check whether the item exists at that slot position, which is also O(1).

**Hash table**: A collection of items stored in such a way that they are easy to find later, ideally in O(1). Each position in the hash table is called a slot, and slots are numbered by integers starting from 0.

**Hash function**: A mapping between an item and the slot where that item belongs in the hash table. A hash function takes an item and returns its slot number (an integer).

**Remainder method**: A simple hash function that divides the item by the table size and returns the remainder as the hash value.

**Load factor**: λ = number of items / table size (that is, the number of keys stored divided by the number of slots).

**Collision or clash**: When two items have the same slot number.

**Perfect hash function**: A hash function that maps each item into a unique slot position, so there are no collisions.
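The definitions above can be tried out on a small set of integers. The following is a minimal sketch; the function name `remainder_hash` and the sample items are chosen here for illustration.

```python
def remainder_hash(item, tablesize):
    """Remainder method: the slot is the remainder of item / tablesize."""
    return item % tablesize


items = [54, 26, 93, 17, 77, 31]
tablesize = 11

slots = {item: remainder_hash(item, tablesize) for item in items}
print(slots)                           # {54: 10, 26: 4, 93: 5, 17: 6, 77: 0, 31: 9}

load_factor = len(items) / tablesize   # 6 items in 11 slots
print(load_factor)

# Every item above lands in its own slot, so the function is perfect for
# this particular set. Adding 44 causes a collision with 77 in slot 0.
print(remainder_hash(44, tablesize))   # 0
```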

Our goal is to create a hash function that minimizes the number of collisions, is easy to compute, and evenly distributes the items in the hash table. There are a number of common ways to extend the simple remainder method. We will consider a few of them here:

**Folding method**: The folding method for constructing a hash function starts by dividing the item into equal-size pieces (the last piece may not be of equal size). These pieces are then added together to give the resulting hash value. For example, if our item was the phone number 436-555-4601, we would take the digits and divide them into groups of 2 (43, 65, 55, 46, 01). After the addition, 43+65+55+46+01, we get 210. If we assume our hash table has 11 slots, then we need to perform the extra step of dividing by 11 and keeping the remainder. In this case 210 % 11 is 1, so the phone number 436-555-4601 hashes to slot 1. Some folding methods go one step further and reverse every other piece before the addition. For the above example, we get 43+56+55+64+01=219, which gives 219 % 11 = 10.
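The phone-number example can be reproduced with a short function. This is a sketch under stated assumptions: the name `folding_hash` and its parameters `piece_size` and `reverse_alternate` are hypothetical, and the item is passed as a string of digits.

```python
def folding_hash(digits, tablesize, piece_size=2, reverse_alternate=False):
    """Split a digit string into pieces, add them, and take the remainder."""
    pieces = [digits[i:i + piece_size]
              for i in range(0, len(digits), piece_size)]
    total = 0
    for index, piece in enumerate(pieces):
        # Optionally reverse every other piece (the 2nd, 4th, ...).
        if reverse_alternate and index % 2 == 1:
            piece = piece[::-1]
        total += int(piece)
    return total % tablesize


phone = "436-555-4601".replace("-", "")
print(folding_hash(phone, 11))                          # 1  (210 % 11)
print(folding_hash(phone, 11, reverse_alternate=True))  # 10 (219 % 11)
```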

**Mid-square method**: We first square the item, and then extract some portion of the resulting digits. For example, if the item were 44, we would first compute 44^2 = 1,936. By extracting the middle two digits, 93, and performing the remainder step, we get 5 (93 % 11).
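A minimal sketch of the mid-square computation follows. The function name `mid_square_hash` and the `width` parameter are hypothetical, and the choice of which digits count as "the middle" when the square has an odd number of digits is an assumption of this sketch.

```python
def mid_square_hash(item, tablesize, width=2):
    """Square the item, take `width` middle digits, then take the remainder."""
    squared = str(item * item)
    start = (len(squared) - width) // 2   # leftmost position of the middle digits
    middle = int(squared[start:start + width])
    return middle % tablesize


print(mid_square_hash(44, 11))  # 5: 44*44 = 1936, middle digits 93, 93 % 11 = 5
```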

We can also create hash functions for character-based items such as strings. The word “cat” can be thought of as a sequence of ordinal values:

In [40]:
ord('c')+ord('a')+ord('t')

312

In [41]:
def hashString(astring, tablesize):
    sum=0
    for char in astring:
        sum += ord(char)
    return sum%tablesize

In [42]:
hashString('cat', 11)

4

It is interesting to note that when using this hash function, anagrams will always be given the same hash value. To remedy this, we could use the position of the character as a weight. Figure 7 shows one possible way to use the positional value as a weighting factor. 

In [43]:
def hashString(astring, tablesize):
    sum=0
    
    for i in range(len(astring)):
        sum += ord(astring[i])*(i+1)
        
    return sum%tablesize

In [44]:
hashString('cat', 11)

3

**Collision resolution**

**Open addressing**: Finding the next open slot, or address, in the hash table.

**Linear probing**: Systematically visiting each slot one at a time. The disadvantage of linear probing is its tendency toward **clustering**. This means that if many collisions occur at the same hash value, a number of surrounding slots will be filled by the linear probing resolution. This will have an impact on other items that are being inserted.

Once we have built a hash table using open addressing and linear probing, it is essential that we use the same method to search for items. If we find a different item at the expected address, a collision may have occurred when the item was inserted, so we must search sequentially from that slot to see whether the item exists somewhere else.

**Rehashing**: The process of looking for another slot after a collision. With simple linear probing, the rehash function is newhashvalue = *rehash*(oldhashvalue), where *rehash(pos) = (pos+1) % sizeoftable*.

In general, rehash(pos)=(pos+skip)%sizeoftable. It is important to note that the size of the “skip” must be such that all the slots in the table will eventually be visited. Otherwise, part of the table will be unused. To ensure this, it is often suggested that the table size be a prime number. This is the reason we have been using 11 in our examples.
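The effect of the skip value and the table size can be seen by listing the slots a probe sequence visits. In this sketch the function name `probe_sequence` is hypothetical; it applies rehash(pos) = (pos + skip) % sizeoftable until the sequence returns to its starting slot.

```python
def probe_sequence(start, skip, tablesize):
    """Slots visited by repeated rehashing before returning to start."""
    visited = [start]
    pos = (start + skip) % tablesize
    while pos != start:
        visited.append(pos)
        pos = (pos + skip) % tablesize
    return visited


print(probe_sequence(0, 3, 11))  # [0, 3, 6, 9, 1, 4, 7, 10, 2, 5, 8]
print(probe_sequence(0, 3, 12))  # [0, 3, 6, 9]
```

With the prime table size 11, a skip of 3 reaches all 11 slots. With table size 12, the same skip only ever reaches 4 slots, so the rest of the table would go unused for that probe sequence.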

**Quadratic probing**: Instead of using a constant “skip” value, quadratic probing uses a skip consisting of successive perfect squares. This means that if the first hash value is h, the successive values are h+1, h+4, h+9, h+16, and so on.
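The probe positions can be listed with a one-line computation. This sketch assumes that each value wraps around with `% tablesize`, as in the rehash functions above; the function name `quadratic_probes` is hypothetical.

```python
def quadratic_probes(h, tablesize, count):
    """First `count` quadratic probe positions after the home slot h."""
    return [(h + i * i) % tablesize for i in range(1, count + 1)]


print(quadratic_probes(0, 11, 5))  # [1, 4, 9, 5, 3]
```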

**Chaining**: An alternative method for handling the collision problem is to allow each slot to hold a reference to a collection (or chain) of items. Chaining allows many items to exist at the same location in the hash table. When collisions happen, the item is still placed in the proper slot of the hash table. As more and more items hash to the same location, the difficulty of searching for the item in the collection increases.
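Chaining can be sketched with a list of Python lists, one chain per slot. The class name `ChainedHashTable` is hypothetical, and the sketch assumes integer keys with the simple remainder method as the hash function.

```python
class ChainedHashTable:
    """Hash table where each slot holds a chain of [key, data] pairs."""

    def __init__(self, size=11):
        self.size = size
        self.slots = [[] for _ in range(size)]

    def put(self, key, data):
        chain = self.slots[key % self.size]
        for pair in chain:
            if pair[0] == key:       # key already present: replace its data
                pair[1] = data
                return
        chain.append([key, data])    # otherwise extend the chain

    def get(self, key):
        for stored_key, data in self.slots[key % self.size]:
            if stored_key == key:
                return data
        return None


table = ChainedHashTable()
table.put(77, "bird")
table.put(44, "goat")
table.put(55, "pig")     # 77, 44 and 55 all hash to slot 0
print(table.slots[0])    # [[77, 'bird'], [44, 'goat'], [55, 'pig']]
print(table.get(44))     # goat
```

A lookup still has to scan the chain sequentially, which is why long chains slow the search down.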

### Map Abstract Data Type (ADT)


The map abstract data type is defined as follows. The structure is an **unordered collection** of **associations between a key and a data value**. The **keys in a map are all unique** so that there is a one-to-one relationship between a key and a value. The operations are given below.

**Map()** Create a new, empty map. It returns an empty map collection.

**put(key,val)** Add a new key-value pair to the map. **If the key is already in the map then replace the old value with the new value**.

**get(key)** Given a key, return the value stored in the map or None otherwise.

**del** Delete the key-value pair from the map using a statement of the form **del map[key]**.

**len()** Return the number of key-value pairs stored in the map.

**in** Return True for a statement of the form **key in map**, if the given key is in the map, False otherwise.

One of the great benefits of a dictionary is the fact that given a key, we can look up the associated data value very quickly. In order to provide this fast look up capability, we need an implementation that supports an efficient search. We could use a list with sequential or binary search but it would be even better to use a hash table as described above since looking up an item in a hash table can approach O(1) performance.

We use two lists to create a HashTable class that implements the Map abstract data type. One list, called slots, will hold the key items and a parallel list, called data, will hold the data values. When we look up a key, the corresponding position in the data list will hold the associated data value. Note that the initial size for the hash table has been chosen to be 11. Although this is arbitrary, it is important that the size be a prime number so that the collision resolution algorithm can be as efficient as possible.

*hashFunction* implements the **simple remainder** method. The **collision resolution technique is linear probing** with a “plus 1” rehash function. The **put** function assumes that there will eventually be an empty slot unless the key is already present in self.slots. It computes the original hash value and, if that slot is not empty, iterates the rehash function until an empty slot occurs. If a nonempty slot already contains the key, the old data value is replaced with the new data value.


Likewise, the get function (see Listing 4) begins by computing the initial hash value. If the value is not in the initial slot, rehash is used to locate the next possible position. Notice that the `nextslot!=hashValue` test in the while loop guarantees that the search will terminate by checking that we have not returned to the initial slot. If that happens, we have exhausted all possible slots and the item must not be present.

The final methods of the HashTable class provide additional dictionary functionality. We overload the `__getitem__` and `__setitem__` methods to allow access using `[]`. This means that once a HashTable has been created, the familiar index operator will be available. We leave the remaining methods as exercises.

In [2]:
class HashTable(object):
    def __init__(self, size):
        self.size=size
        self.slots=[None]*self.size
        self.data=[None]*self.size
        
    def put(self,key,data):
        hashValue = self.hashFunction(key,len(self.slots))
        
        if self.slots[hashValue] is None:
            self.slots[hashValue]=key
            self.data[hashValue] = data
        else:
            if self.slots[hashValue]==key:
                self.data[hashValue]=data #replace
            else:
                nextslot = self.rehash(hashValue, len(self.slots))

                # Probe until we reach an empty slot or the key itself.
                # Note: this never terminates if the table is full and
                # the key is absent.
                while self.slots[nextslot] is not None and\
                self.slots[nextslot]!=key:
                    nextslot = self.rehash(nextslot, len(self.slots))

                if self.slots[nextslot] is None:
                    self.slots[nextslot]=key
                    self.data[nextslot]=data
                else:
                    self.data[nextslot]=data #replace
                
    # Collision resolution (rehashing) by linear probing           
    def rehash(self,oldhash, size):
        return (oldhash+1)%size
        
    def hashFunction(self,key,size):
        return key%size
    
    
    def get(self,key):
        hashValue = self.hashFunction(key,len(self.slots))
        
        if self.slots[hashValue]==key:
            return self.data[hashValue]
        else:
            nextslot = self.rehash(hashValue, len(self.slots)) 
            
            while self.slots[nextslot]!=key and nextslot!=hashValue:
                nextslot = self.rehash(nextslot, len(self.slots))
                
            if self.slots[nextslot]==key:
                return self.data[nextslot]
            else:
                return None
    # Special Methods for use with Python indexing        
    def __getitem__(self,key):
        return self.get(key)
    
    def __setitem__(self,key,data):
        return self.put(key,data)
        

In [3]:
H=HashTable(11)
H[54]="cat"
H[26]="dog"
H[93]="lion"
H[17]="tiger"
H[77]="bird"
H[31]="cow"
H[44]="goat"
H[55]="pig"
H[20]="chicken"

In [4]:
H.slots

[77, 44, 55, 20, 26, 93, 17, None, None, 31, 54]

In [5]:
H.data

['bird',
 'goat',
 'pig',
 'chicken',
 'dog',
 'lion',
 'tiger',
 None,
 None,
 'cow',
 'cat']

In [6]:
H[20]

'chicken'

In [7]:
H[17]

'tiger'

In [8]:
H[20]='duck'
print(H[20])

duck


In [9]:
H.data

['bird',
 'goat',
 'pig',
 'duck',
 'dog',
 'lion',
 'tiger',
 None,
 None,
 'cow',
 'cat']

In [10]:
print(H[99])

None


In [12]:
# Load factor 
itemSize=0
for item in H.data:
    if item != None:
        itemSize+=1

load_factor= itemSize/len(H.data)
print(load_factor)

0.8181818181818182
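Counting occupied slots is also the core of a `len()` operation, and a membership test (`in`) follows the same probing pattern as `get`. The following is a minimal sketch written as standalone functions on a plain slots list rather than as methods of the class. The names `table_len` and `table_contains` are hypothetical, and the sketch assumes integer keys, the simple remainder hash, "plus 1" linear probing, and that no key has ever been deleted (so an empty slot ends the probe).

```python
def table_len(slots):
    """Count the occupied slots (keys that are not None)."""
    return sum(1 for key in slots if key is not None)


def table_contains(slots, key):
    """Return True if key is stored in slots, probing linearly from its home slot."""
    size = len(slots)
    start = key % size
    pos = start
    while slots[pos] is not None:
        if slots[pos] == key:
            return True
        pos = (pos + 1) % size
        if pos == start:          # wrapped around: every slot was checked
            break
    return False


# The slots list of the example table built above.
slots = [77, 44, 55, 20, 26, 93, 17, None, None, 31, 54]
print(table_len(slots))                # 9
print(table_len(slots) / len(slots))   # load factor: 9 keys in 11 slots
print(table_contains(slots, 20))       # True
print(table_contains(slots, 99))       # False
```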


### Analysis of Hashing

The most important piece of information we need to analyze the use of a hash table is the load factor, λ. Conceptually, if λ is small, then there is a lower chance of collisions, meaning that items are more likely to be in the slots where they belong. If λ is large, meaning that the table is filling up, then there are more and more collisions. This means that collision resolution is more difficult, requiring more comparisons to find an empty slot. With chaining, increased collisions means an increased number of items on each chain.


As before, we will have a result for both a successful and an unsuccessful search. For a successful search using **open addressing with linear probing**, the average number of comparisons is approximately 1/2 (1 + 1/(1−λ)), and an unsuccessful search gives 1/2 (1 + (1/(1−λ))^2). If we are using chaining, the average number of comparisons is 1 + λ/2 for the successful case, and simply λ comparisons if the search is unsuccessful.
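To get a feel for these formulas, they can be evaluated for a few load factors. This is a small sketch; the function names are hypothetical and the values are only the approximations stated above, valid for λ < 1 in the linear probing case.

```python
def linear_probing_comparisons(lam):
    """Approximate (successful, unsuccessful) comparisons for linear probing."""
    successful = 0.5 * (1 + 1 / (1 - lam))
    unsuccessful = 0.5 * (1 + (1 / (1 - lam)) ** 2)
    return successful, unsuccessful


def chaining_comparisons(lam):
    """Average (successful, unsuccessful) comparisons for chaining."""
    return 1 + lam / 2, lam


for lam in (0.25, 0.5, 0.75):
    print(lam, linear_probing_comparisons(lam), chaining_comparisons(lam))

# For example, at lam = 0.5 linear probing gives (1.5, 2.5)
# and chaining gives (1.25, 0.5).
```

As λ approaches 1, the cost of an unsuccessful search with linear probing grows much faster than with chaining.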