# Data Structures and Algorithms in Python - Search Algorithms
### AJ Zerouali, 2023/10/12

## 0) Introduction

**References:**

- "Data structures and algorithms in Python", by Goodrich, Tamassia and Goldwasser (primary, abbreviated [GTG13]). In particular:
    * Sec.4.1.3: Sequential search and binary search.
    * Sec.10.2: Hash tables
- Section 17 of "Python for Data Structures, Algorithms, and Interviews!" by Jose Portilla (to a much lesser extent).

**Comments:**
- I am unsure which topic I will end up focusing on.

## 1) Sequential search and binary search

This is a review of Section 4.1.3 of [GTG13], and follows lectures 122-126 of Portilla's course.

### 1.a - Sequential search

Sequential search is the naive searching algorithm for elements in a container. For simplicity we only discuss the sequential search over arrays, but this can easily be generalized to other positional lists. The idea is to sequentially check if the entry at a given index is equal to a target value.

Starting with the easy case of a sorted array, sequential search can be implemented as follows:

In [29]:
# Sorted array version
## Stops if we find a value larger than the target before finding the target
def seq_search_sorted(arr, tgt):
    N = len(arr)
    found_tgt = False
    found_greater = False
    i = -1
    # Loop while i+1 is still a valid index (i < N would read arr[N] and raise IndexError)
    while i < N-1 and (not found_tgt) and (not found_greater):
        i+=1
        found_tgt = (arr[i] == tgt)
        found_greater = (arr[i] > tgt)
    return found_tgt, i

It is easily seen that this is an $O(n)$ worst-case algorithm. On average, it inspects about $n/2$ entries whether or not the target is present (which is still $O(n)$), a remark that will become relevant for the case of an unsorted array.

Now if the array is not sorted, we need to check the entirety of the array, meaning that the following implementation:

In [None]:
# Unsorted array version
def seq_search_unsorted(arr, tgt):
    N = len(arr)
    found_tgt = False
    i = -1
    # Loop while i+1 is still a valid index (i < N would read arr[N] and raise IndexError)
    while i < N-1 and (not found_tgt):
        i+=1
        found_tgt = (arr[i] == tgt)
    return found_tgt, i

executes in $O(n)$ time in both the worst case and the average case, since an unsuccessful search has to inspect all $n$ entries.
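
To make the comparison counts concrete, here is a small sketch that counts how many entries each scan inspects during unsuccessful searches. The helpers `count_sorted` and `count_unsorted` are illustrative names, not part of the implementations above, and the test data (even numbers, searched for odd targets that are all absent) is an assumption chosen for convenience.

```python
def count_sorted(arr, tgt):
    """Entries inspected by an early-stopping scan of a sorted list."""
    steps = 0
    for x in arr:
        steps += 1
        if x >= tgt:   # found the target, or passed the place where it would be
            break
    return steps

def count_unsorted(arr, tgt):
    """Entries inspected by a plain scan that can only stop on a match."""
    steps = 0
    for x in arr:
        steps += 1
        if x == tgt:
            break
    return steps

n = 10
arr = list(range(0, 2 * n, 2))       # 0, 2, ..., 18 (sorted)
missing = list(range(1, 2 * n, 2))   # 1, 3, ..., 19 (none are in arr)

avg_sorted = sum(count_sorted(arr, t) for t in missing) / n
avg_unsorted = sum(count_unsorted(arr, t) for t in missing) / n
print(avg_sorted, avg_unsorted)      # 6.4 10.0
```

With early stopping, the average is roughly $n/2$, while the plain scan always pays the full $n$ for a missing target.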

#### Verifications

In [23]:
import random

In [30]:
nums = []
for i in range(10):
    nums.append(random.randint(0,50))
nums = sorted(nums)

In [31]:
nums

[2, 6, 20, 25, 25, 33, 33, 37, 40, 50]

In [32]:
seq_search_sorted(nums,23)

(False, 3)

In [34]:
seq_search_sorted(nums,33)

(True, 5)

### 1.b - Binary search

We discussed binary search in notebooks 01 and 02 of DSA_Python. We provided two implementations, first an iterative one:

In [None]:
def binary_search_iter(arr, target):
    '''
        Iterative implementation of binary search.
        :param arr: sorted list of numbers
        :param target: target value to find in arr.
        :return mid: index of target in arr if search is successful and False otherwise.
            Test the result with "is False", since index 0 is also falsy.
    '''
    # Initialization of high and low indices
    low = 0
    high = len(arr)-1
    
    # Main loop (either low is increased or
    # high is decreased at each step)
    while low <= high:
        
        # Get midpoint idx
        mid = (low+high)//2
        
        # Case where we found target
        if target == arr[mid]:
            return mid
        
        # Increase low if target > arr[mid]
        elif target > arr[mid]:
            low = mid + 1
        
        # Decrease high if target < arr[mid]
        elif target < arr[mid]:
            high = mid - 1
            
    # If low > high was reached 
    # then the search was unsuccessful
    return False

and secondly a recursive one:

In [None]:
def binary_search(arr, target, low = 0, high = None):
    '''
        Recursive implementation of binary search.
        :param arr: sorted list of numbers
        :param target: target value to find in arr.
        :param low: lower-bound index for search. 0 by default.
        :param high: upper-bound index. None by default. 
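        :return: index of target in arr if search is successful and False otherwise.
            Test the result with "is False", since index 0 is also falsy.
    '''
    # Compare with None explicitly: "not high" is also True when high == 0,
    # which would wrongly reset high in the middle of the recursion
    if high is None:
        high = len(arr)-1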
    if high < low:
        return False
    else:
        mid = (high+low)//2
        if target == arr[mid]:
            return mid
        elif target > arr[mid]:
            return binary_search(arr, target, mid + 1, high)
        elif target < arr[mid]:
            return binary_search(arr, target, low, mid -1)

We also sketched the proof of why binary search runs in $O(\log n)$ time in the worst case, which is considerably better than $O(n)$ for very large $n$. The main assumption to keep in mind is that the array in which we perform the search **has to be sorted**.
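
As a quick sanity check of the logarithmic bound, the sketch below counts loop iterations. Each iteration at least halves the candidate range, so the loop should run at most $\lfloor \log_2 n \rfloor + 1$ times. The function `binary_search_steps` is a hypothetical variant written for this check: it returns `None` instead of `False` on failure, together with the iteration count.

```python
import math

def binary_search_steps(arr, target):
    """Return (index or None, number of loop iterations)."""
    low, high = 0, len(arr) - 1
    steps = 0
    while low <= high:
        steps += 1
        mid = (low + high) // 2
        if arr[mid] == target:
            return mid, steps
        elif arr[mid] < target:
            low = mid + 1
        else:
            high = mid - 1
    return None, steps

n = 1_000_000
arr = list(range(n))
bound = math.floor(math.log2(n)) + 1   # 20 for n = 10**6

# Try absent targets on both sides and a few present ones
for target in (-1, 0, n // 2, n - 1, n):
    idx, steps = binary_search_steps(arr, target)
    print(target, idx, steps, steps <= bound)
```

A sequential scan over the same array would need up to a million comparisons for a missing target.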

## 2) Hash tables and hash functions



### 2.a - Overview: Hash tables, hash functions, and collision resolution

This part summarizes Lecture 127 of Portilla's course:
- A hash table is a collection of items stored in slots, organized so that the items are easy to retrieve.
- A hash function is a mapping that assigns slot addresses to the items. Since this mapping is not necessarily injective, several items can be sent to the same slot, thereby causing a collision/clash. An injective hash function is called a perfect hash function.
- Hash tables are assigned a certain size at their instantiation. The ratio of occupied slots is called the load factor: $\lambda = \text{Num. stored items}/\text{Num. slots}.$
- The principal advantage of using hash tables is that they can perform item addition, deletion and search in $O(1)$ complexity in the average case, thanks to the use of hash functions for addressing.
- The main hash function discussed in this lecture is the *remainder* hash function. As alternative hash functions, Portilla gives brief examples of the folding and mid-square methods.
- This lecture discusses two examples of collision resolution: (i) Open addressing with linear probing. (ii) Chaining.
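
The three hash functions named in the list above can be sketched as follows. This is a minimal illustration assuming non-negative integer keys; the function names, the `piece_len` and `n_mid` parameters, and the sample keys are my own choices and are not taken from the lecture.

```python
def remainder_hash(key, size):
    """Remainder method: the slot is the key modulo the table size."""
    return key % size

def folding_hash(key, size, piece_len=2):
    """Folding method: cut the digits into pieces, add them, take the remainder."""
    digits = str(key)
    pieces = [int(digits[i:i + piece_len]) for i in range(0, len(digits), piece_len)]
    return sum(pieces) % size

def mid_square_hash(key, size, n_mid=2):
    """Mid-square method: square the key, keep the middle digits, take the remainder."""
    squared = str(key * key)
    start = (len(squared) - n_mid) // 2
    return int(squared[start:start + n_mid]) % size

print(remainder_hash(54, 11))         # 54 % 11 = 10
print(folding_hash(4365554601, 11))   # 43+65+55+46+01 = 210, and 210 % 11 = 1
print(mid_square_hash(44, 11))        # 44**2 = 1936, middle digits 93, and 93 % 11 = 5
```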

There are more details in section 10.2 in [GTG13]. In Python, the most notable example of a hash table is the *dict* class.
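
For comparison with linear probing, here is a minimal sketch of collision resolution by chaining, where each slot holds a list of *(key, value)* pairs. The class `ChainedHashTable` and its `load_factor` method are hypothetical names introduced for illustration; integer keys and the remainder hash function are assumed.

```python
class ChainedHashTable:
    """Hypothetical minimal map using separate chaining (integer keys)."""

    def __init__(self, size=11):
        self.buckets = [[] for _ in range(size)]   # one independent list per slot
        self.n_items = 0

    def _bucket(self, key):
        # Remainder hash function selects the bucket
        return self.buckets[key % len(self.buckets)]

    def __setitem__(self, key, val):
        bucket = self._bucket(key)
        for i, (k, _) in enumerate(bucket):
            if k == key:                 # key already stored: replace its value
                bucket[i] = (key, val)
                return
        bucket.append((key, val))        # new key: append to the chain
        self.n_items += 1

    def __getitem__(self, key):
        for k, v in self._bucket(key):
            if k == key:
                return v
        raise KeyError(key)

    def load_factor(self):
        return self.n_items / len(self.buckets)

table = ChainedHashTable(10)
table[1] = "I"
table[11] = "J"    # collides with key 1, stored in the same bucket
table[21] = "K"    # collides again
table[1] = "L"     # replaces the value of key 1
print(table[1], table[11], table[21], table.load_factor())   # L J K 0.3
```

With chaining the load factor can exceed 1, at the cost of longer chains to scan.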

### 2.b - Implementation of a hash table

This part is based on Lecture 128 of Portilla's course. The objective is to implement a map as a hash table. We want our *HashTable* class to have the following interface:
* **HashTable(size)** to call the constructor and return an empty map with *size* slots.
* **__setitem__(key, val)** to store *val* under the key *key*.
* **__getitem__(key)** to access the value associated to *key*.
* **__delitem__(key)** to delete the entry associated to *key* from our map (not implemented in the class below).
* **__len__** to return the number of *(key, value)* items stored in the map.
* **__contains__(key)** to check if a value is associated to *key* in the map.
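
As a reminder of which syntax triggers each of these special methods, here is a small sketch. The `Tracer` class is a hypothetical helper that only records the calls it receives; it stores nothing.

```python
class Tracer:
    """Hypothetical helper recording which special method each syntax triggers."""

    def __init__(self):
        self.calls = []

    def __setitem__(self, key, val):
        self.calls.append("__setitem__")

    def __getitem__(self, key):
        self.calls.append("__getitem__")
        return None

    def __delitem__(self, key):
        self.calls.append("__delitem__")

    def __len__(self):
        self.calls.append("__len__")
        return 0

    def __contains__(self, key):
        self.calls.append("__contains__")
        return False

t = Tracer()
t[1] = "a"    # calls __setitem__(1, "a")
t[1]          # calls __getitem__(1)
del t[1]      # calls __delitem__(1)
len(t)        # calls __len__()
1 in t        # calls __contains__(1)
print(t.calls)
```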

A comment that Portilla gives in this lecture is that we typically don't have to implement our own hash tables in an interview setting.



In [5]:
class HashTable(object):
    
    def __init__(self, size):
        self.size = size
        self.slots = [None]*self.size
        self.data = [None]*self.size
        
    def __setitem__(self, key, val):
        
        # Compute hash value of current key
        hash_val = self.hashfunction(key, len(self.slots))
        
        # If slot is empty, add value to hash_val address
        if self.slots[hash_val] is None:
            self.slots[hash_val] = key
            self.data[hash_val] = val
            
        # If slot isn't empty, use linear probing
        else: 
            # Replace current elem
            if self.slots[hash_val] == key:
                self.data[hash_val] = val
                
            # Collision resolution
            else:
                next_slot = self.rehash(hash_val, len(self.slots))
                
                # Find next available slot for key
                while self.slots[next_slot] is not None and self.slots[next_slot] != key:
                    # Back at the starting slot: every slot holds another key
                    # (without this check, a full table would loop forever)
                    if next_slot == hash_val:
                        raise RuntimeError("HashTable is full")
                    next_slot = self.rehash(next_slot, len(self.slots))
                
                # Add new key, or replace the value if key is already stored
                self.slots[next_slot] = key
                self.data[next_slot] = val
        
        
    def hashfunction(self, key, size):
        '''
            Actual hash function. Use remainder method for simplicity
        '''
        return key % size
    
    def rehash(self, old_hash_val, size):
        return (old_hash_val+1)%size
    
    def __getitem__(self, key):
        # Initializations
        ## Hash value is starting slot for linear probing
        start_slot = self.hashfunction(key, len(self.slots))
        value = None
        position = start_slot
        found_key = False
        stop_search = False
        
        # Search for key in slots
        while not found_key and not stop_search and self.slots[position] is not None:
            
            # Found key
            if self.slots[position] == key:
                found_key = True
                value = self.data[position]
            # If key not found
            else:
                # Rehash
                position = self.rehash(position, len(self.slots))
                # If all slots were visited
                if position == start_slot:
                    stop_search = True
        
        return value
    
    def __contains__(self, key):
        return self.__getitem__(key) is not None
        
    def __len__(self):
        n_items = 0
        for i in range(len(self.slots)):
            if self.slots[i] is not None:
                n_items += 1
        return n_items
            
            
        
        

#### Implementation comments:

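- Keys are integers, for simplicity and ease of use.
- The hash function is the remainder method, for simplicity.
- Collisions are resolved by linear probing: *rehash* moves to the next slot, wrapping around at the end of the table.
- A missing key makes *__getitem__* return *None* instead of raising *KeyError*, so *None* should not be stored as a value.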
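
Deletion is the delicate operation under linear probing: simply emptying a slot can cut the probe chain of keys stored after it. One common remedy, which is not covered in the notes above, is to mark deleted slots with a "tombstone". The sketch below is a standalone, hypothetical class (`ProbingTable`, with the `_DELETED` marker) written under the same assumptions of integer keys and remainder hashing. Unlike the class above, it raises `KeyError` for missing keys.

```python
_DELETED = object()   # hypothetical tombstone marker for deleted slots

class ProbingTable:
    """Hypothetical compact linear-probing map with deletion (integer keys)."""

    def __init__(self, size):
        self.slots = [None] * size
        self.data = [None] * size

    def _probe(self, key):
        """Yield slot indices in linear-probing order, visiting each slot once."""
        size = len(self.slots)
        start = key % size
        for offset in range(size):
            yield (start + offset) % size

    def _find(self, key):
        """Return the index holding key, or None if key is absent."""
        for pos in self._probe(key):
            slot = self.slots[pos]
            if slot is None:              # truly empty: key cannot be further along
                return None
            if slot is not _DELETED and slot == key:
                return pos
        return None

    def __setitem__(self, key, val):
        pos = self._find(key)
        if pos is None:
            # Reuse the first empty or tombstoned slot on the probe path
            for cand in self._probe(key):
                if self.slots[cand] is None or self.slots[cand] is _DELETED:
                    pos = cand
                    break
            else:
                raise RuntimeError("table is full")
        self.slots[pos] = key
        self.data[pos] = val

    def __getitem__(self, key):
        pos = self._find(key)
        if pos is None:
            raise KeyError(key)
        return self.data[pos]

    def __delitem__(self, key):
        pos = self._find(key)
        if pos is None:
            raise KeyError(key)
        self.slots[pos] = _DELETED        # tombstone keeps later probe chains intact
        self.data[pos] = None

    def __contains__(self, key):
        return self._find(key) is not None

    def __len__(self):
        return sum(1 for s in self.slots if s is not None and s is not _DELETED)

t = ProbingTable(10)
t[3] = "K"
t[13] = "Z"       # collides with 3, lands in slot 4
del t[3]          # slot 3 becomes a tombstone
print(13 in t, t[13], len(t))   # True Z 1
```

If slot 3 were reset to `None` instead, the lookup of key 13 would stop at slot 3 and wrongly report the key as missing.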

#### Examples



In [6]:
hash_tab = HashTable(10)

In [7]:
hash_tab[1] = "I"
hash_tab[5] = "A"
hash_tab[9] = "R"
hash_tab[3] = "K"


In [8]:
len(hash_tab)

4

In [9]:
hash_tab[12] = "Z"
hash_tab[16] = "Y"
hash_tab[20] = "X"
hash_tab[3] = "L"

In [10]:
len(hash_tab)

7

In [11]:
hash_tab[20]

'X'
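
None of the keys used above collide. To see linear probing at work, the standalone sketch below replays the same insertions on a bare list of slots and then adds the key 11, which is my own addition to the example. The helper `linear_probe_slot` is hypothetical and mirrors the remainder hash and the `(old_hash_val+1)%size` rehash of the class.

```python
def linear_probe_slot(slots, key):
    """Return the index where key lands (or already sits) under linear probing."""
    size = len(slots)
    pos = key % size                  # remainder hash gives the starting slot
    for _ in range(size):             # visit each slot at most once
        if slots[pos] is None or slots[pos] == key:
            return pos
        pos = (pos + 1) % size        # rehash: move to the next slot
    raise RuntimeError("table is full")

slots = [None] * 10
for key in [1, 5, 9, 3, 12, 16, 20, 11]:
    slots[linear_probe_slot(slots, key)] = key

print(slots)   # [20, 1, 12, 3, 11, 5, 16, None, None, 9]
```

Key 11 hashes to slot 1, finds slots 1, 2 and 3 taken by the keys 1, 12 and 3, and settles in slot 4.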