## Hash Table Review

In [20]:
import functools
import collections

In [5]:
## A hash function for strings

def string_hash(s, modulus):
    MULT = 997
    return functools.reduce(lambda v, c: (v * MULT + ord(c)) % modulus, s, 0)

In [15]:
string_hash("tat", 701)

445

What we're doing here is representing the string as a base-MULT integer. We then divide this integer by the modulus and take the remainder. This gives an integer in the range [0, modulus-1], which is our hash value, i.e. the "slot" in the array into which the key (string) is placed. Note that a larger modulus takes more space, while a smaller modulus may cause a large number of collisions, so this value must be chosen wisely. A sufficiently large (but not too large) prime number is usually a good choice, as it reduces the likelihood that some pattern in the underlying data prevents a uniform distribution of the keys. Choosing a power of 2, for example, simply selects the low-order bits and can be prone to placing similar strings in the same bucket - bad!
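To make the trade-off concrete, one can count how many keys land in each slot for a given modulus. The sketch below repeats `string_hash` from above; the helper `bucket_counts`, the sample `keys`, and the two moduli are assumptions chosen only for illustration.

```python
import collections
import functools

def string_hash(s, modulus):
    MULT = 997
    return functools.reduce(lambda v, c: (v * MULT + ord(c)) % modulus, s, 0)

def bucket_counts(keys, modulus):
    """Hypothetical helper: map each occupied slot to the number of keys hashed into it."""
    return collections.Counter(string_hash(k, modulus) for k in keys)

# Sample keys, chosen arbitrarily for this illustration.
keys = ['debitcard', 'elvis', 'silent', 'badcredit', 'lives',
        'freedom', 'listen', 'levis', 'money']

for modulus in (8, 701):
    counts = bucket_counts(keys, modulus)
    print(modulus, "occupied slots:", len(counts),
          "largest bucket:", max(counts.values()))
```

With nine keys and only eight slots, at least one collision is unavoidable for modulus 8. How evenly the larger table spreads the keys depends on the data, which is why the modulus is worth checking against realistic inputs.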

Note that taking the modulus at each step, rather than once at the very end, yields the same number. This is because modular reduction is compatible with addition and multiplication: (v * MULT + c) mod m equals ((v mod m) * MULT + c) mod m, so reducing the intermediate value inside the reduce function does not change the final remainder. Let us verify this below:

In [18]:
def string_hash2(s, modulus):
    MULT = 997
    hash_value = functools.reduce(lambda v, c: (v * MULT + ord(c)), s, 0)
    hash_value %= modulus
    return hash_value

In [19]:
string_hash2("tat", 701)

445
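A single input is a thin check, so the comparison can be repeated over several strings and moduli. This is a minimal sketch: the two functions mirror `string_hash` and `string_hash2` above, while the `samples` and `moduli` lists are arbitrary assumptions.

```python
import functools

MULT = 997

def string_hash(s, modulus):
    # Reduce modulo `modulus` after every character.
    return functools.reduce(lambda v, c: (v * MULT + ord(c)) % modulus, s, 0)

def string_hash2(s, modulus):
    # Build the full base-MULT integer, then reduce once at the end.
    return functools.reduce(lambda v, c: v * MULT + ord(c), s, 0) % modulus

samples = ["", "a", "tat", "hash table", "a much longer string " * 20]  # arbitrary strings
moduli = [2, 701, 10**9 + 7]                                            # arbitrary moduli

for s in samples:
    for m in moduli:
        assert string_hash(s, m) == string_hash2(s, m)

# The unreduced integer grows with the string length, whereas the stepwise
# version keeps every intermediate value bounded by about modulus * MULT.
unreduced = functools.reduce(lambda v, c: v * MULT + ord(c), samples[-1], 0)
print(unreduced.bit_length(), "bits before the final reduction")
```

Both versions agree on every pair. The stepwise version is usually preferable in practice because it never builds the very large intermediate integer.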

## Anagram groups

In [25]:
words = ['debitcard', 'elvis', 'silent', 'badcredit', 'lives', 'freedom', 'listen', 'levis', 'money']

In [28]:
def find_anagrams(l):
    anagram_map = collections.defaultdict(list)
    
    for s in l:
        anagram_map[''.join(sorted(s))].append(s)
        
    return [v for v in anagram_map.values() if len(v) > 1]

In [29]:
find_anagrams(words)

[['debitcard', 'badcredit'], ['elvis', 'lives', 'levis'], ['silent', 'listen']]

## Contact List

In [34]:
class Contact:
    
    def __init__(self, names):
        self.names = names # List of names belonging to this contact
        
    def __hash__(self):
        # Want to hash set of names so we must use frozenset. Repeats in the contact list do not matter
        # order does not matter either. This makes the set data structure an ideal choice.
        return hash(frozenset(self.names))
    
    def __eq__(self, other):
        return set(self.names) == set(other.names)

In [35]:
def merge_contact_lists(contacts):
    return list(set(contacts))
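A short usage example may help show what "merging" means here. The class and function are repeated from above so the block runs on its own; the sample names are made up for illustration.

```python
class Contact:

    def __init__(self, names):
        self.names = names

    def __hash__(self):
        return hash(frozenset(self.names))

    def __eq__(self, other):
        return set(self.names) == set(other.names)


def merge_contact_lists(contacts):
    return list(set(contacts))


# Made-up sample data: the first two contacts hold the same set of names.
contacts = [
    Contact(["Ada", "Grace"]),
    Contact(["Grace", "Ada", "Ada"]),  # different order plus a repeat
    Contact(["Linus"]),
]

merged = merge_contact_lists(contacts)
# Sets are unordered, so normalise before printing.
print(sorted(sorted(set(c.names)) for c in merged))
# [['Ada', 'Grace'], ['Linus']]
```

The first two contacts compare equal and hash to the same value, so the set keeps only one of them. Which of the two objects survives is not specified by this approach.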

## Find smallest subarray covering all values 

Problem Statement: https://leetcode.com/problems/minimum-window-substring/

This problem is the same as the one already done in the sliding window section. This is an alternative O(n) algorithm with a tighter constant. Conceptually, the problem is very simple to solve this way: for each character in t (the query), we keep track of the latest index at which it was seen. As soon as all len(t) characters are accounted for, the span from the earliest of those indices to the latest one is a candidate window.

The sliding window approach handles this by contracting the window from the left, skipping any surplus query characters. That is unnecessary IF we have a data structure that supports O(1) retrieval of the min, O(1) append, and O(1) removal of elements at arbitrary positions. The last requirement makes it clear we need a linked list (the first two alone could be met by simpler structures such as an array-backed queue). Here is a solution using a doubly linked list that keeps track of the order and finds the min in O(1) time (i.e. it is just the head of the list). The hash map is used for fast lookup and stores the character:node mapping; the linked list keeps the indices in sorted order. Because indices are always appended in increasing order and a removal does not disturb the order of the remaining nodes, the list stays sorted and the head is always the min.

**Note**: This solution assumes the query string t contains only unique letters, i.e. "ABC" is a valid query string but "AABC" is not. More work is needed to deal with repeats.

In [106]:
def minWindow(s: str, t: str) -> str:

    from collections import namedtuple
    Substring = namedtuple("Substring", ("left", "right"))
    import math

    class DoublyLinkedListNode:

        def __init__(self, value=None):
            self.next = self.prev = None
            self.value = value


    class DoublyLinkedList:

        def __init__(self):
            self.head = self.tail = None
            self._size = 0

        # Define len() so the list size can be compared with the number of query characters
        def __len__(self):
            return self._size

        def append(self, value):
            node = DoublyLinkedListNode(value)
            node.prev = self.tail

            if self.tail:
                self.tail.next = node
            else:
                self.head = node
            self.tail = node
            self._size += 1

        def remove(self, node):

            if node.prev:
                node.prev.next = node.next
            else:
                self.head = node.next

            if node.next:
                node.next.prev = node.prev
            else:
                self.tail = node.prev

            node.next = None
            node.prev = None

            self._size -= 1


    candidate_list = DoublyLinkedList()
    candidate_dict = {c: None for c in t}
    substring = Substring(left=-math.inf, right=math.inf)

    for i, c in enumerate(s):

        if c in candidate_dict:
            prev_node = candidate_dict[c]
            if prev_node:
                candidate_list.remove(prev_node)
            candidate_list.append(i)
            candidate_dict[c] = candidate_list.tail

        if len(candidate_list) == len(candidate_dict) \
                and (i-candidate_list.head.value) < (substring.right-substring.left):
            substring = Substring(left=candidate_list.head.value, right=i)

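    # If no window covers every character of t, the bounds are still infinite
    # and cannot be used as slice indices, so return an empty slice instead.
    if substring.left == -math.inf:
        return s[:0]
    return s[substring.left:substring.right+1]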

In [105]:
s = ["A", "A", "B", "E", "C", "A", "B", "C"]
t = ["A", "B", "C"]

In [103]:
minWindow(s, t)

['C', 'A', 'B']

**Time and Space Complexity**: Same as before, but the O(n) time bound is now tighter. The sliding window version can visit each element twice (about 2n steps), whereas this one visits each element exactly once (n steps). This makes the algorithm well suited to streaming situations, as we don't need to keep track of previous elements after we've processed them.
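Python's `collections.OrderedDict` already pairs a hash map with a doubly linked list, so the same idea can be sketched without a hand-written list class. This is an alternative formulation, not the solution above: the function name `min_window_ordered` is hypothetical, and the sketch keeps the same assumption that t has no repeated characters.

```python
from collections import OrderedDict

def min_window_ordered(s, t):
    """Smallest slice of s covering every element of t (t assumed to have no repeats)."""
    needed = set(t)
    latest = OrderedDict()  # element -> most recent index, oldest entry first
    best = None             # (left, right) bounds of the best window so far

    for i, c in enumerate(s):
        if c not in needed:
            continue
        # Removing and re-inserting moves c to the end, so the first entry
        # always holds the smallest of the tracked indices.
        latest.pop(c, None)
        latest[c] = i
        if len(latest) == len(needed):
            left = next(iter(latest.values()))
            if best is None or i - left < best[1] - best[0]:
                best = (left, i)

    # s[:0] is an empty slice of the same type as s (str or list).
    return s[:0] if best is None else s[best[0]:best[1] + 1]


print(min_window_ordered("AABECABC", "ABC"))       # CAB
print(min_window_ordered("ADOBECODEBANC", "ABC"))  # BANC
```

Each element of s is still handled once, with constant-time dictionary operations per element. The trade-off is that the linked list is hidden inside the standard library rather than spelled out.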