# Grokking Algorithms

In [209]:
from itertools import chain, combinations
import math
import random
import time

## Chapter 1: Introduction to Algorithms

### Binary Search

Given a sorted collection, binary search finds an item in `O(log n)` time by halving the search range at each step.

In [81]:
def binary_search(haystack, needle):
    """Returns the index of needle in the sorted haystack, or None if absent."""
    start = 0
    end = len(haystack) - 1
    while start <= end:
        # integer midpoint; named `mid` to avoid shadowing the built-in next()
        mid = start + (end - start) // 2
        if haystack[mid] == needle:
            return mid
        elif haystack[mid] < needle:
            start = mid + 1
        else:
            end = mid - 1
    return None


def assert_find(haystack, needle, expect_idx):
    found_idx = binary_search(haystack, needle)
    print(f'{needle} is at index: {found_idx}')
    assert found_idx == expect_idx


def assert_find_all(haystack):
    for expect_idx in range(len(haystack)):
        assert_find(haystack, haystack[expect_idx], expect_idx)

In [112]:
haystack = list(range(1, 20, 2)) # odd numbers [1, 3, 5, ..., 19]

# Should find
assert_find_all(haystack)

# Shouldn't find
assert binary_search(haystack, 12) is None
assert binary_search(haystack, 21) is None
assert binary_search(haystack, -1) is None

1 is at index: 0
3 is at index: 1
5 is at index: 2
7 is at index: 3
9 is at index: 4
11 is at index: 5
13 is at index: 6
15 is at index: 7
17 is at index: 8
19 is at index: 9


In [113]:
haystack = ('Alice', 'Bob', 'Duckling', 'Pigeon')

# Should find
assert_find_all(haystack)

# Shouldn't find
assert binary_search(haystack, 'Gerald') is None

Alice is at index: 0
Bob is at index: 1
Duckling is at index: 2
Pigeon is at index: 3
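
For comparison, Python's standard library ships a `bisect` module that performs the same halving search. The sketch below assumes the input is already sorted; `bisect_search` is a hypothetical helper name, not something from the book.

```python
from bisect import bisect_left


def bisect_search(haystack, needle):
    """Return the index of needle in the sorted haystack, or None if absent."""
    # leftmost position where needle could be inserted while keeping order
    idx = bisect_left(haystack, needle)
    if idx < len(haystack) and haystack[idx] == needle:
        return idx
    return None


odds = list(range(1, 20, 2))
print(bisect_search(odds, 7))   # 3
print(bisect_search(odds, 12))  # None
```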


## Chapter 2: Selection Sort

### Arrays vs Lists

| | Arrays | Lists |
| --- | --- | --- |
| Reading | `O(1)` | `O(n)` |
| Insertion | `O(n)` | `O(1)` |
| Deletion | `O(n)` | `O(1)` |

* **Arrays** are good for random access, while (linked) **lists** are good for frequent insertions and for deletions from the first/last positions
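
To make the trade-off concrete, here is a minimal singly linked list sketch. The class and method names (`Node`, `LinkedList`, `push_front`, `get`) are illustrative assumptions, not from the book. Inserting at the head touches one pointer, while reading index `i` has to walk `i` links.

```python
class Node:
    def __init__(self, value, next_node=None):
        self.value = value
        self.next_node = next_node


class LinkedList:
    def __init__(self):
        self.head = None

    def push_front(self, value):
        # O(1): only the head pointer changes
        self.head = Node(value, self.head)

    def get(self, index):
        # O(n): must walk the chain from the head
        node = self.head
        for _ in range(index):
            if node is None:
                break
            node = node.next_node
        if node is None:
            raise IndexError(index)
        return node.value


linked = LinkedList()
for value in (3, 2, 1):
    linked.push_front(value)
print(linked.get(0), linked.get(2))  # 1 3
```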

### Selection sort

**Selection sort** builds the sorted result by finding one item (the next smallest) per iteration, and is `O(n^2)`.

In [102]:
def find_index_smallest(a_list: list):
    smallest_idx = None
    for idx, x in enumerate(a_list):
        if smallest_idx is None or x < a_list[smallest_idx]:
            smallest_idx = idx
    return smallest_idx


def selection_sort(a_list: list) -> list: 
    sorted = []
    unsorted = a_list.copy()
    while len(unsorted) > 0:
        smallest_idx = find_index_smallest(unsorted)
        smallest_val = unsorted.pop(smallest_idx)
        sorted.append(smallest_val)
    return sorted


def assert_selection_sort(a_list: list, expected_list):
    sorted = selection_sort(a_list)
    print(f"Sorted list: {sorted}")
    assert sorted == expected_list

In [114]:
assert_selection_sort([5, 2, 1, 3], [1, 2, 3, 5])
assert_selection_sort(['bob', 'gerald', 'piggie', 'alice'], ['alice', 'bob', 'gerald', 'piggie'])

Sorted list: [1, 2, 3, 5]
Sorted list: ['alice', 'bob', 'gerald', 'piggie']


## Chapter 3: Recursion

* Recursive functions include a *base case* and a *recursive case*:
```
def count_down(i):
    print(i)
    if i <= 0: # the *base case*
        return
    else:      # the *recursive case*
        count_down(i - 1)
```
* Recursion is often easier for humans to read, but it takes up extra memory because every call is pushed onto the call stack
* A *stack* is a data structure with two operations: `push` and `pop`

In [123]:
class Stack:
    def __init__(self):
        self._stack = []

    def push(self, item):
        self._stack.append(item)

    def pop(self):
        return self._stack.pop(-1) if len(self._stack) > 0 else None

In [124]:
my_stack = Stack()

my_stack.push('foo')
my_stack.push('bar')
assert my_stack.pop() == 'bar'
my_stack.push('baz')
assert my_stack.pop() == 'baz'
assert my_stack.pop() == 'foo'
assert my_stack.pop() is None

In [131]:
def factorial_recursive(i: int) -> int:
    return i * factorial_recursive(i - 1) if i > 1 else 1


def factorial_iterative(i: int) -> int:
    val = 1
    for next in range(2, i + 1):
        val *= next
    return val

In [133]:
assert factorial_recursive(5) == 5 * 4 * 3 * 2
assert factorial_iterative(5) == 5 * 4 * 3 * 2

## Chapter 4: Quicksort

### Divide & Conquer

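* **Divide & Conquer** (**D&C**): solve a problem recursively by (1) figuring out the base case, and (2) dividing or reducing the problem until it becomes the base case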

In [136]:
def divide_field_into_even_plots(width, height):
    """
    If we want to divide a field into even-sized plots, what's the largest possible size plot? 
    Uses divide and conquer.
    """
    # wide
    if width > height:
        remainder = width % height
        return (height, height) if remainder == 0 else divide_field_into_even_plots(remainder, height)

    # tall
    elif height > width:
        remainder = height % width
        return (width, width) if remainder == 0 else divide_field_into_even_plots(width, remainder)

    # square
    return (width, width)

print(f'Dividing 1680m x 640m field into {divide_field_into_even_plots(1680, 640)}')

Dividing 1680m x 640m field into (80, 80)
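
The field problem is Euclid's algorithm in disguise: the side of the largest square plot is the greatest common divisor of the two sides. A quick cross-check using `math.gcd` from the standard library:

```python
import math

# side length of the largest square that evenly tiles a 1680 x 640 field
side = math.gcd(1680, 640)
print((side, side))  # (80, 80)
```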


In [142]:
def add(vals:list[int]) -> int:
    if len(vals) == 0:
        return 0
    return vals[0] + add(vals[1:])

add([2, 4, 5, 6])

17

In [143]:
def count(vals:list[int]) -> int:
    if len(vals) == 0:
        return 0
    return 1 + count(vals[1:])

count([2, 2, 3, 5])

4

In [147]:
def max(vals: list[int]) -> int:
    if len(vals) == 0:
        raise ValueError('Must pass in list with at least one element')
    elif len(vals) == 1:
        return vals[0]
    other = max(vals[1:])
    return other if other > vals[0] else vals[0]

print(f'max: {max([5, 1, 0])}')
print(f'max: {max([1, 14, 1, 0])}')
print(f'max: {max([5, 1, 0, 6])}')

max: 5
max: 14
max: 6


In [150]:
def _binary_search(haystack, needle, start, end) -> int:
    next = start + int((end - start) / 2)
    if start > end or next >= len(haystack) or next < 0:
        return None
    elif haystack[next] == needle:
        return next
    elif haystack[next] < needle:
        return _binary_search(haystack, needle, next + 1, end)
    else:
        return _binary_search(haystack, needle, start, next - 1)


def binary_search(haystack, needle) -> int:
    return _binary_search(haystack, needle, 0, len(haystack) - 1)


haystack = [1, 2, 3, 5]
print(binary_search(haystack, 0))
print(binary_search(haystack, 1))
print(binary_search(haystack, 2))
print(binary_search(haystack, 3))
print(binary_search(haystack, 4))
print(binary_search(haystack, 5))
print(binary_search(haystack, 6))

None
0
1
2
None
3
None


In [175]:
US_COINS = {1,5,10,25}
cache = {}


def _make_change(amount: int, valid_coins:set[int]) -> list[int]:
    least_coins = None
    for coin in valid_coins:
        if coin <= amount:
            next_coins = [coin] + make_change(amount - coin, valid_coins)
            if least_coins is None or len(next_coins) < len(least_coins):
                least_coins = next_coins
    return [] if least_coins is None else least_coins


def make_change(amount: int, valid_coins:set[int]) -> list[int]:
    """
    Makes change for a bill using the least number of coins.
    """
    # key on the coin set too, so results for different denominations don't collide
    key = (amount, frozenset(valid_coins))
    if key not in cache:
        cache[key] = _make_change(amount, valid_coins)
    return cache[key]


print(make_change(9, US_COINS))
print(make_change(9, {1,3}))
print(make_change(50, US_COINS))
print(make_change(92, US_COINS))

[1, 1, 1, 1, 5]
[3, 3, 3]
[25, 25]
[1, 1, 10, 5, 25, 25, 25]


### Quicksort

In [183]:
def quicksort(values: list[int]) -> list[int]:
    if len(values) < 2:
        return values
    pivot = values[0]
    return quicksort([v for v in values[1:] if v <= pivot]) + [pivot] + quicksort([v for v in values[1:] if v > pivot])


print(quicksort([]))
print(quicksort([1]))
print(quicksort([1, 2, 3]))
print(quicksort([3, 2, 1]))
print(quicksort([4, 7, 2, 9, 6, 3, 1, 5, 8, 10]))

[]
[1]
[1, 2, 3]
[1, 2, 3]
[1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
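
Always taking the first element as the pivot degrades to `O(n^2)` (and very deep recursion) on input that is already sorted. One common mitigation is to pick the pivot at random, which gives `O(n log n)` on average regardless of input order. A sketch, with `quicksort_random_pivot` as a hypothetical name:

```python
import random


def quicksort_random_pivot(values):
    """Quicksort variant that picks a random pivot at each level."""
    if len(values) < 2:
        return values
    pivot_idx = random.randrange(len(values))
    pivot = values[pivot_idx]
    # everything except the pivot itself
    rest = values[:pivot_idx] + values[pivot_idx + 1:]
    smaller = [v for v in rest if v <= pivot]
    larger = [v for v in rest if v > pivot]
    return quicksort_random_pivot(smaller) + [pivot] + quicksort_random_pivot(larger)


print(quicksort_random_pivot([4, 7, 2, 9, 6, 3, 1, 5, 8, 10]))
# [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
```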


### Merge sort

In [197]:
def mergesort(values: list[int]) -> list[int]:
    if len(values) < 2:
        return values
    half = math.ceil(len(values)/2)
    #print(f'len(values) = {len(values)}; half = {half}')
    left = mergesort(values[0:half])
    right = mergesort(values[half:])
    sorted = []
    while len(left) > 0 or len(right) > 0:
        # only left remaining, append remaining left
        if len(left) > 0 and len(right) == 0:
            sorted += left
            break
        # only right remaining, append remaining right
        elif len(right) > 0 and len(left) == 0:
            sorted += right
            break
        elif left[0] < right[0]:
            sorted.append(left.pop(0))
        else:
            sorted.append(right.pop(0))
            
    return sorted

print(mergesort([3, 2, 1]))
print(mergesort([1, 2]))
print(mergesort([1, 5, 3, 3, 4, 2, 1]))

[1, 2, 3]
[1, 2]
[1, 1, 2, 3, 3, 4, 5]


## Chapter 5: Hash Tables

| | Hash Tables (Average) | Hash Tables (Worst) | Arrays | Linked Lists |
| --- | --- | --- | --- | --- |
| Search | `O(1)` | `O(n)` | `O(1)` | `O(n)` |
| Insert | `O(1)` | `O(n)` | `O(n)` | `O(1)` |
| Delete | `O(1)` | `O(n)` | `O(n)` | `O(1)` |

* Due to **collisions**, performance of **hash tables** isn't always **constant time**.
* The performance of a hash table depends heavily on the load factor and on the quality of the **hashing function**: a good one distributes keys evenly across slots and so reduces the number of collisions
* **Load factor** is the number of items in the hash table divided by the number of slots
* As the load factor goes up, resize the array backing the hash table (rule of thumb: double the array size)
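
The load-factor and resizing ideas can be sketched with a toy hash table that resolves collisions by chaining. Everything here is illustrative: the class name `ChainedHashTable`, the starting size of 4 slots, and the 0.7 resize threshold are assumptions chosen for the example, and Python's own `dict` works differently.

```python
class ChainedHashTable:
    """Toy hash table using separate chaining; doubles when the load factor exceeds 0.7."""

    MAX_LOAD_FACTOR = 0.7  # assumed threshold for this sketch

    def __init__(self, slots=4):
        self._buckets = [[] for _ in range(slots)]
        self._count = 0

    def load_factor(self):
        return self._count / len(self._buckets)

    def _bucket_for(self, key, buckets):
        return buckets[hash(key) % len(buckets)]

    def put(self, key, value):
        bucket = self._bucket_for(key, self._buckets)
        for i, (existing_key, _) in enumerate(bucket):
            if existing_key == key:
                bucket[i] = (key, value)  # overwrite an existing key
                return
        bucket.append((key, value))
        self._count += 1
        if self.load_factor() > self.MAX_LOAD_FACTOR:
            self._resize(2 * len(self._buckets))

    def get(self, key, default=None):
        for existing_key, value in self._bucket_for(key, self._buckets):
            if existing_key == key:
                return value
        return default

    def _resize(self, new_size):
        # every item must be re-hashed because the slot count changed
        new_buckets = [[] for _ in range(new_size)]
        for bucket in self._buckets:
            for key, value in bucket:
                self._bucket_for(key, new_buckets).append((key, value))
        self._buckets = new_buckets


table = ChainedHashTable()
for i, fruit in enumerate(['apple', 'banana', 'cherry', 'date', 'elderberry']):
    table.put(fruit, i)
print(table.get('cherry'), table.load_factor())  # 2 0.625
```

The third insert pushes the load factor to 0.75, which triggers one doubling from 4 to 8 slots; five items in eight slots gives 0.625.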

## Chapter 6: Breadth-First Search

* **Breadth-first search** (**BFS**) is used to answer two questions: 
    1. is there a path from one node to another
    2. what's the shortest path from one node to another (**shortest-path problems**)
* A **stack** is last-in, first-out (**LIFO**), and a **queue** is first-in, first-out (**FIFO**)

In [232]:
def find_shortest_path(graph, starting_loc, destination) -> list:
    # ("location", [how, i, got, here])
    queue = [(starting_loc, [starting_loc])]
    
    while len(queue) > 0:
        current_loc, visited = queue.pop(0)
        
        # are we there yet?
        if current_loc == destination:
            return visited
            
        # find our neighbors
        for next_edge in graph.get(current_loc, []):
            # if already visited, avoid cycle
            if next_edge in visited:
                continue
            
            go_to = (next_edge, [*visited, next_edge])
            queue.append(go_to)


my_graph = {
    'Twin Peaks': ['1', '3'],
    '1': ['2', 'Twin Peaks'], # cycle: Twin Peaks -> 1 -> Twin Peaks
    '2': ['Golden Gate Bridge'],
    '3': ['4', '5'],
    '4': ['2'],
    '5': ['2']
}

# Should exist
print(find_shortest_path(my_graph, 'Twin Peaks', 'Golden Gate Bridge'))

# Same
print(find_shortest_path(my_graph, '2', '2'))

# Unreachable
print(find_shortest_path(my_graph, 'Golden Gate Bridge', 'Twin Peaks'))

['Twin Peaks', '1', '2', 'Golden Gate Bridge']
['2']
None
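
`list.pop(0)` shifts every remaining element, so each dequeue costs `O(n)`. A variant using `collections.deque`, whose `popleft()` is `O(1)`, is sketched below. It also records one parent per node instead of copying the whole path for every queue entry. The name `bfs_shortest_path` is hypothetical, and the graph is a copy of the one above so the block runs on its own.

```python
from collections import deque


def bfs_shortest_path(graph, start, goal):
    """Return the path with the fewest edges from start to goal, or None."""
    parents = {start: None}  # doubles as the visited set
    queue = deque([start])
    while queue:
        current = queue.popleft()  # O(1), unlike list.pop(0)
        if current == goal:
            # walk the parent links back to the start
            path = []
            while current is not None:
                path.append(current)
                current = parents[current]
            return path[::-1]
        for neighbor in graph.get(current, []):
            if neighbor not in parents:
                parents[neighbor] = current
                queue.append(neighbor)
    return None


demo_graph = {
    'Twin Peaks': ['1', '3'],
    '1': ['2', 'Twin Peaks'],
    '2': ['Golden Gate Bridge'],
    '3': ['4', '5'],
    '4': ['2'],
    '5': ['2'],
}
print(bfs_shortest_path(demo_graph, 'Twin Peaks', 'Golden Gate Bridge'))
# ['Twin Peaks', '1', '2', 'Golden Gate Bridge']
```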


## Chapter 7: Dijkstra's algorithm

* **Weighted graphs** contain nodes, edges, and weights
* To calculate shortest path in unweighted graph, use breadth-first search; for weighted graphs, use Dijkstra's algorithm
* Dijkstra's algorithm requires that there are no negative weights. (The book presents it on **directed acyclic graphs** (**DAGs**), but cycles are fine as long as every weight is non-negative.)
* If you want to find the shortest path in a weighted graph with negative weights, use the **Bellman-Ford algorithm**
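
As a rough sketch of the Bellman-Ford idea: relax every edge `|V| - 1` times, then make one more pass to detect a negative-weight cycle. The graph uses the same `{node: {neighbor: weight}}` shape as the cells in this chapter; the function name `bellman_ford` and the `trades` graph are made up for illustration.

```python
def bellman_ford(graph, start):
    """Return {node: cost} of the cheapest paths from start; raise on a negative cycle."""
    nodes = set(graph)
    for edges in graph.values():
        nodes.update(edges)
    costs = {node: float('inf') for node in nodes}
    costs[start] = 0

    # relax every edge |V| - 1 times
    for _ in range(len(nodes) - 1):
        for node, edges in graph.items():
            for neighbor, weight in edges.items():
                if costs[node] + weight < costs[neighbor]:
                    costs[neighbor] = costs[node] + weight

    # one more pass: any further improvement means a negative-weight cycle
    for node, edges in graph.items():
        for neighbor, weight in edges.items():
            if costs[node] + weight < costs[neighbor]:
                raise ValueError('graph contains a negative-weight cycle')
    return costs


# hypothetical trade graph with one negative edge (LP -> POSTER pays you 7)
trades = {
    'BOOK': {'LP': 5, 'POSTER': 0},
    'LP': {'POSTER': -7},
    'POSTER': {'DRUMS': 35},
}
print(bellman_ford(trades, 'BOOK')['DRUMS'])  # 33
```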

In [290]:
DEFAULT_NODE_COST = (None, float('inf'))


def nodes_to_visit(graph, starting_loc, destination):
    nodes = set(graph.keys())
    # note that the destination may not have any outbound edges (so it may not be a key)
    if destination in graph:
        nodes.remove(destination)
    return nodes


def next_node_to_visit(graph, remaining_nodes, nodes_costs):
    next_loc = None
    next_dist = float('inf')
    for next_node in remaining_nodes:
        _, dist = nodes_costs.get(next_node, DEFAULT_NODE_COST)
        if next_loc is None or dist < next_dist:
            next_loc = next_node
            next_dist = dist
    return (next_loc, next_dist)


def find_nodes_costs(graph, starting_loc, nodes):
    nodes_costs = {
        starting_loc: (None, 0)
    }
    
    # visit each node in order of increasing cost
    while len(nodes) > 0:
        
        # find the node with lowest cost
        parent_node, parent_dist = next_node_to_visit(graph, nodes, nodes_costs)

        # remove next node from remaining nodes
        nodes.remove(parent_node)

        # update costs for this node's immediate neighbors
        edges = graph.get(parent_node, {})
        for next_loc, next_dist in edges.items():
            existing_parent, existing_distance = nodes_costs.get(next_loc, DEFAULT_NODE_COST)
            # we found a better way
            updated_dist = parent_dist + next_dist
            if updated_dist < existing_distance:
                nodes_costs[next_loc] = (parent_node, updated_dist)   

    return nodes_costs


def find_lowest_cost_path(nodes_costs, starting_loc, destination):
    if destination == starting_loc:
        return [starting_loc]
    return find_lowest_cost_path(nodes_costs, starting_loc, nodes_costs[destination][0]) + [destination]


def dijkstras_algorith(graph, starting_loc, destination):
    nodes = nodes_to_visit(graph, starting_loc, destination)
    nodes_costs = find_nodes_costs(graph, starting_loc, nodes)
    return find_lowest_cost_path(nodes_costs, starting_loc, destination)

In [291]:
my_graph = {
    "BOOK": { "LP": 5, "POSTER": 0 },
    "LP": { "BASS GUITAR": 15, "DRUMS": 20 },
    "POSTER": { "BASS GUITAR": 30, "DRUMS": 35 },
    "BASS GUITAR": { "PIANO": 20 },
    "DRUMS": { "PIANO": 10 },
}

my_nodes_to_visit = nodes_to_visit(my_graph, "BOOK", "PIANO")
EXPECTED_NODES_COSTS = {'BOOK': (None, 0), 'LP': ('BOOK', 5), 'POSTER': ('BOOK', 0), 'BASS GUITAR': ('LP', 20), 'DRUMS': ('LP', 25), 'PIANO': ('DRUMS', 35)}

# test cases
assert my_nodes_to_visit == {'BASS GUITAR', 'BOOK', 'DRUMS', 'LP', 'POSTER'}
assert ('BOOK', 0) == next_node_to_visit(my_graph, my_nodes_to_visit, { 'BOOK': (None, 0) })
assert EXPECTED_NODES_COSTS == find_nodes_costs(my_graph, 'BOOK', my_nodes_to_visit)
assert [ 'BOOK', 'LP', 'DRUMS', 'PIANO' ] == find_lowest_cost_path(EXPECTED_NODES_COSTS, 'BOOK', 'PIANO')

# end-to-end test
dijkstras_algorith(my_graph, "BOOK", "PIANO")

['BOOK', 'LP', 'DRUMS', 'PIANO']

In [292]:
my_graph = {
    "START": { "A": 6, "B": 2 },
    "A": { "FINISH": 1 },
    "B": { "A": 3, "FINISH": 5 },
}

dijkstras_algorith(my_graph, "START", "FINISH")

['START', 'B', 'A', 'FINISH']

## Chapter 8: Greedy Algorithms

* A **greedy algorithm** is an algorithm where you pick the locally optimal move at each step, in the hope that this leads to the globally optimal solution
* The greedy algorithm doesn't always work; e.g., the knapsack problem
* **Knapsack problem**: your knapsack holds a specific maximum weight. How do you maximize the value of the things you can carry?
* However, the greedy solution to the knapsack problem (adding the most valuable item) might be close enough to be useful
* **Set-covering problem**: given a set of items and a collection of subsets, what's the smallest collection of subsets that contains the entire set of items?
* **Approximation algorithms** are used when an exact algorithm is intractable; e.g., greedy algorithms can give fast approximate solutions to **NP-complete problems**.
* **Powerset**: the collection of all subsets of a set; its size is `2^n` for a set of `n` items
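
The exponential growth of the powerset is easy to see in code. This sketch follows the well-known `itertools` recipe built from `chain` and `combinations`; the station names are borrowed from the set-covering example only as sample data.

```python
from itertools import chain, combinations


def powerset(items):
    """Return an iterator over every subset of items (as tuples), smallest first."""
    items = list(items)
    return chain.from_iterable(combinations(items, size) for size in range(len(items) + 1))


subsets = list(powerset(['kone', 'ktwo', 'kthree']))
print(len(subsets))  # 8, i.e. 2^3
```

An exact set-covering solver would have to examine every one of these subsets, which is why the exact approach is `O(2^n)`.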

| Problem | Performance |
| --- | --- |
| Knapsack  | `O(2^n)` |
| Set-covering problem | `O(2^n)` |
| Traveling Salesman | `O(n!)` |

* There's no easy way to tell if a problem is NP-complete; just look for heuristics (e.g., is it the set-covering problem or traveling salesperson problem? Does it grow really fast? Is it an "all combinations" problem? etc)

In [305]:
def find_next_option_ends_earliest(options, earliest_start):
    best_option = None
    for option in options:
        if (earliest_start is None or earliest_start <= option[1]) and (best_option is None or option[2] < best_option[2]):
            best_option = option
    return best_option


def _maximize_schedule(committed, options):
    # we've exhausted our options
    if len(options) == 0:
        return committed
        
    free_after = None if len(committed) == 0 else committed[-1][2]
    next_option = find_next_option_ends_earliest(options, free_after)

    # we've fit everything we could fit
    if next_option is None:
        return committed
    
    committed.append(next_option)
    options.remove(next_option)
    return _maximize_schedule(committed, options)

def maximize_schedule(options):
    """
    Returns combination of options that maximize the # that can be satisfied without overlap.
    @param options Collection of (name, start, end) tuples
    """
    return _maximize_schedule([], options.copy())

In [302]:
classes = {
    ('ART', 9, 10),
    ('ENG', 9.5, 10.5),
    ('MATH', 10, 11),
    ('CS', 10.5, 11.5),
    ('MUSIC', 11, 12)
}

maximize_schedule(classes)

[('ART', 9, 10), ('MATH', 10, 11), ('MUSIC', 11, 12)]

In [304]:
classes = {
    ('ART', 9, 10),
    ('ENG', 9.5, 10.5),
    ('MATH', 10, 11),
    ('CS', 10.25, 10.5),
    ('MUSIC', 11, 12),
    ('SCIENCE', 10.5, 11.5)
}

maximize_schedule(classes)

[('ART', 9, 10), ('CS', 10.25, 10.5), ('SCIENCE', 10.5, 11.5)]

In [372]:
def set_covering_problem_greedy_approx(the_set, available_subsets):
    selected_subsets = set()
    remaining_items = the_set.copy()
    while True:
        selected_name, selected_coverage = None, 0
        for next_name, next_items in available_subsets.items():
            next_coverage = len(remaining_items & next_items)
            #print(f'DEBUG: next_name={next_name}; next_items = {next_items}; next_coverage = {next_coverage}')
            if next_coverage > selected_coverage:
                selected_name = next_name
                selected_coverage = next_coverage
        
        # nothing found, no more progress available
        if selected_name is None:
            return selected_subsets

        # we've found the next item
        selected_subsets.add(selected_name)
        remaining_items -= available_subsets[selected_name]

In [341]:
states_needed = { "mt", "wa", "or", "id", "nv", "ut", "ca", "az" }
stations = {
    "kone": { "id", "nv", "ut" },
    "ktwo": { "wa", "id", "mt" },
    "kthree": { "or", "nv", "ca" },
    "kfour": { "nv", "ut" },
    "kfive": { "ca", "az" },
}

set_covering_problem_greedy_approx(states_needed, stations)

{'kfive', 'kone', 'kthree', 'ktwo'}

## Chapter 9: Dynamic Programming
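* **Dynamic programming** is a technique that breaks a problem into subproblems, solves those first, and builds up to the full solution
* Dynamic programming works when:
    - You are optimizing something subject to a constraint (e.g., maximum value within a weight limit)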
    - Problems can be broken down into discrete subproblems
* Tips for using dynamic programming:
    - Every solution involves a grid
    - You're optimizing the value of the cells
    - Think of the cell as the subproblem; what axes do you need to solve the subproblem?
* Things you _cannot_ do with dynamic programming:
    - You can't use dynamic programming if the value of an item depends on other items (e.g., you can't discount the cost of traveling to attractions during a trip based on where you are)
    - You can't use dynamic programming to steal fractions of items (but you can use a greedy algorithm!)
* Applications of dynamic programming:
    - biologists to find similarities in DNA
    - git diff
    - Levenshtein distance

In [140]:
def knapsack_problem(items: list, weight_capacity):
    """
    Note this only works for integer weight_capacity
    """
    # |        | 1   | 2   | ... | weight_capacity | 
    # | item_1 | ... | ... | ... | ... |
    # | item_2 | ... | ... | ... | ... |
    # etc
    #
    # Where each cell is { price, {item1, item2} }

    # initialize table
    table = [[None] * weight_capacity for _ in items]

    # construct table, one item row at a time...
    for item_idx, (item_name, item_weight, item_price) in enumerate(items):
        for col_idx in range(0, weight_capacity): # if want to support non-integer weights, adjust step size
            #
            # Algorithm: the cell is populated with items with max price, based on: 
            #   (a) price of cell[i-1][j] or 
            #   (b) price of current item + price of cell corresponding to remaining weight capacity
            #
            col_weight = col_idx + 1
            current_item_fits = item_weight <= col_weight
            
            # populate the first row
            if item_idx == 0:
                table[item_idx][col_idx] = (item_price, {item_name}) if current_item_fits else (0, set())

            # populate all other rows (which reference previous row)
            else:
                above_table_entry = table[item_idx - 1][col_idx]
                table[item_idx][col_idx] = above_table_entry # initialize to cell[i-1][j] (may swap below)
                
                # let's see if the current item + value of remaining space is better
                if current_item_fits:
                    remaining_space = col_weight - item_weight
                    items_price_total, items_names = item_price, {item_name}
                    if remaining_space > 0:
                        prev_price, prev_item_names = table[item_idx - 1][remaining_space - 1]
                        items_price_total += prev_price
                        items_names |= prev_item_names

                    # the next item (or the next item + other items) is worth more, so replace cell[i-1][j]
                    if items_price_total > above_table_entry[0]:
                        table[item_idx][col_idx] = (items_price_total, items_names)
                            
    #print(f'DEBUG: {table}')
    return table[-1][-1] # the solution is the bottom right cell of table


def test_knapsack_problem(stealable_items, weight_capacity):
    expected = knapsack_problem(stealable_items, weight_capacity)
    print(expected)
    for _ in range(0, 100):
        random.shuffle(stealable_items) # just to make sure algorithm is working!
        assert expected == knapsack_problem(stealable_items, weight_capacity)

In [141]:
stealable_items = [
    ( 'GUITAR', 1, 1500 ),
    ( 'STEREO', 4, 3000 ),
    ( 'LAPTOP', 3, 2000 ),
]

test_knapsack_problem(stealable_items, 4)

(3500, {'GUITAR', 'LAPTOP'})


In [142]:
# ( what, weight, value )
stealable_items = [
    ( 'GUITAR', 1, 1500 ),
    ( 'STEREO', 4, 3000 ),
    ( 'LAPTOP', 3, 2000 ),
    ( 'IPHONE', 1, 2000 ),
]

test_knapsack_problem(stealable_items, 4)

(4000, {'LAPTOP', 'IPHONE'})


In [143]:
# ( what, weight, value )
stealable_items = [
    ( 'GUITAR', 1, 1500 ),
    ( 'STEREO', 4, 3000 ),
    ( 'LAPTOP', 3, 2000 ),
    ( 'IPHONE', 1, 2000 ),
    ( 'MP3', 1, 1000 ),
]

test_knapsack_problem(stealable_items, 4)

(4500, {'GUITAR', 'MP3', 'IPHONE'})


In [144]:
# ( where, days, value )
itinerary = [
    ( 'WESTMINSTER ABBEY', .5, 7),
    ( 'GLOBE THEATER', .5, 6),
    ( 'NATIONAL GALLERY', 1, 9),
    ( 'BRITISH MUSEUM', 2, 9),
    ( 'ST. PAUL\'S CATHEDRAL', .5, 8),
]

TRIP_DAYS = 2
SCALE = 2

# transform stop times to be whole numbers
itinerary = [(stop[0], int(SCALE * stop[1]), stop[2]) for stop in itinerary]

test_knapsack_problem(itinerary, TRIP_DAYS * SCALE)

(24, {'WESTMINSTER ABBEY', 'NATIONAL GALLERY', "ST. PAUL'S CATHEDRAL"})


In [145]:
# ( item, weight, value )
camping_items = [
    ( 'WATER', 3, 10 ),
    ( 'BOOK', 1, 3 ),
    ( 'FOOD', 2, 9 ),
    ( 'JACKET', 2, 5 ),
    ( 'CAMERA', 1, 6 ),
]

test_knapsack_problem(camping_items, 6)

(25, {'CAMERA', 'FOOD', 'WATER'})


In [189]:
def longest_common_substring(word_1, word_2):
    # |   | F | I | S | H | 
    # | H | 0 | 0 | 0 | 0 |
    # | I | 0 | 1 | 0 | 0 |
    # | S | 0 | 0 | 2 | 0 |
    # | H | 0 | 0 | 0 | 3 |
    #
    # initialize table
    table = [[None] * len(word_2) for _ in word_1]

    # step 1: build the table. 
    #   if letters match between words, value of cell is '1' + value of cell to upper left
    for i in range(0, len(word_1)):
        for j in range(0, len(word_2)):
            if word_2[j] != word_1[i]:
                table[i][j] = 0
            else:
                table[i][j] = 1
                if i - 1 >= 0 and j - 1 >= 0:
                    table[i][j] += table[i - 1][j - 1]

    # step 2: find largest value
    largest_value = 0
    for row in table:
        for cell in row:
            if cell > largest_value:
                largest_value = cell
    
    return largest_value


def find_word_with_longest_common_substring(target_word, candidate_words: list):
    longest_word, longest_match = None, 0
    for candidate in candidate_words:
        match = longest_common_substring(target_word, candidate)
        if match > longest_match:
            longest_word, longest_match = candidate, match
    return (longest_word, longest_match)


def test_longest_common_substring(target_word, candidate_words: list):
    longest_word, longest_match = find_word_with_longest_common_substring(target_word, candidate_words)
    print(f'The longest matching word is "{longest_word}", with {longest_match} matching characters')
    for _ in range(0, 100):
        random.shuffle(candidate_words)
        next_longest_word, next_longest_match = find_word_with_longest_common_substring(target_word, candidate_words)
        assert longest_word == next_longest_word
        assert longest_match == next_longest_match

In [190]:
test_longest_common_substring('HISH', ['FISH', 'TISSUE', 'VISTA'])

The longest matching word is "FISH", with 3 matching characters


In [191]:
test_longest_common_substring('BLUE', ['CLUES',])

The longest matching word is "CLUES", with 3 matching characters


In [196]:
def longest_common_sequence(word_1, word_2):
    # |   | F | O | S | H | 
    # | F | 1 | 1 | 1 | 1 |
    # | I | 1 | 1 | 1 | 1 |
    # | S | 1 | 1 | 2 | 2 |
    # | H | 1 | 1 | 2 | 3 |
    #
    # initialize table
    table = [[None] * len(word_2) for _ in word_1]

    # Algorithm:
    #   - if the letters match, cell value is 1 + value of cell to the upper left
    #   - else, cell value is the larger of the cell above and the cell to the left
    #     (using only the cell to the left loses matches found in earlier rows,
    #      e.g. 'ABX' vs 'AB' would give 1 instead of 2)

    # step 1: build the table. 
    for i in range(0, len(word_1)):
        for j in range(0, len(word_2)):
            # characters match
            if word_2[j] == word_1[i]:
                table[i][j] = 1
                if i - 1 >= 0 and j - 1 >= 0:
                    table[i][j] += table[i-1][j-1]
            # characters don't match: carry forward the best result so far
            else:
                above = table[i-1][j] if i > 0 else 0
                left = table[i][j-1] if j > 0 else 0
                # plain comparison, because max() was redefined earlier in this notebook
                table[i][j] = above if above > left else left

    return table[-1][-1]

In [197]:
longest_common_sequence('FOSH', 'FISH')

3

In [200]:
longest_common_sequence('FSH', 'FISH')

3

In [201]:
longest_common_sequence('FOHS', 'FISH')

2

## Chapter 10: K-Nearest Neighbors
* Instead of using **Euclidean distance** for KNN, consider **cosine similarity**, which measures the angle between two vectors instead of the distance between points. This finds similar people even if one person rates more conservatively (e.g., scores a movie 4) than another (e.g., scores the same movie a perfect 5)
* You can use KNN for **regression** as well as classification: take the k nearest neighbors and average their scores to predict the score for another point
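
A minimal sketch of cosine similarity, assuming equal-length, non-zero vectors. The two rating tuples are invented: the second rater has the same taste but scores everything at 80% of the first, so the angle between the vectors is zero even though the Euclidean distance is not.

```python
import math


def cosine_similarity(vec_1, vec_2):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    if len(vec_1) != len(vec_2):
        raise ValueError(f'{len(vec_1)} != {len(vec_2)}')
    dot = sum(a * b for a, b in zip(vec_1, vec_2))
    norm_1 = math.sqrt(sum(a * a for a in vec_1))
    norm_2 = math.sqrt(sum(b * b for b in vec_2))
    return dot / (norm_1 * norm_2)


generous = (5, 4, 5, 2)
conservative = (4, 3.2, 4, 1.6)  # 0.8 * generous
print(round(cosine_similarity(generous, conservative), 6))  # 1.0
```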

In [225]:
def distance(point_1, point_2):
    """
    finds distance between two n-dimensional points
    """
    if len(point_1) != len(point_2):
        raise ValueError(f'{len(point_1)} != {len(point_2)}')

    total = 0
    for i in range(0, len(point_1)):
        total += (point_1[i] - point_2[i]) ** 2
        
    return math.sqrt(total)

assert distance((1,), (2,)) == 1.0
assert distance((1,1),(2,2)) == 1.4142135623730951 
assert distance((1,1,1),(2,2,2)) == 1.7320508075688772

In [248]:
def find_most_common(collection: list[str]) -> str:
    counts = {}
    for item in collection:
        counts[item] = counts.get(item, 0) + 1
    most_common = None
    for item, count in counts.items():
        if most_common is None or count > counts[most_common]:
            most_common = item
    return most_common

assert 'orange' == find_most_common(['orange']) # works if only one
assert 'orange' == find_most_common(['orange', 'orange']) # works if only one type
assert 'orange' == find_most_common(['orange', 'orange', 'grapefruit']) # finds most common
assert 'orange' == find_most_common(['orange', 'grapefruit', 'orange']) # order doesn't matter
assert 'orange' == find_most_common(['orange', 'grapefruit', 'banana']) # returns first if no clear winner

In [261]:
def k_nearest_neighbors(collection, target, k=3):
    distances = []
    for label, features in collection.items():
        for next_feature in features:
            dist = distance(target, next_feature)
            distances.append((label, dist))
    distances.sort(key=lambda item: item[1])
    closest = [item[0] for item in distances][0:k]
    return find_most_common(closest)

In [262]:
# is it an orange or a grapefruit?
collection = {
    'orange': [(1,1), (2,1), (3,1), (2,1), (4, 1), (2,2), (3,3), (3,2)],
    'grapefruit': [(5,4), (6,4), (5,5), (6,5), (7,5), (4, 7), (6,7), (8,7)],
}

assert k_nearest_neighbors(collection, (4,3)) == 'orange'
assert k_nearest_neighbors(collection, (6,6)) == 'grapefruit'

In [263]:
# who is most like Priyanka (so we can recommend similar movies to her)?
# (comedy, action, drama, horror, romance)
PRIYANKA = (3, 4, 4, 1, 4)

collection = {
    'Sarah': [(1, 1, 1, 5, 3)],
    'Justin': [(4, 3, 5, 1, 5)],
    'Morpheus': [(2, 5, 1, 3, 1)]
}

print(k_nearest_neighbors(collection, PRIYANKA, k=1))

Justin


## Chapter 11: Where to Go Next

### Trees
* A **binary search tree** gives you an average-case lookup of `O(log n)` and a worst-case lookup of `O(n)`, while a sorted array has a worst-case lookup of `O(log n)`. So why would you use a binary search tree? Because insertions and deletions are `O(log n)` on average, versus `O(n)` for a sorted array!
* Binary search trees can become **imbalanced**, leading to poor performance. You can use a **red-black tree** to self-balance.
* **B-trees** are self-balancing search trees whose nodes can hold many keys and children (so not strictly binary); they are commonly used to store data in databases
* Other trees to review: **heaps**, **splay trees**
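
The effect of imbalance can be seen with a plain (non-balancing) binary search tree. In this sketch, all names (`BSTNode`, `bst_insert`, `bst_contains`, `bst_height`) are hypothetical. Inserting keys in sorted order produces a tree as tall as a linked list, which is where the `O(n)` worst case comes from.

```python
class BSTNode:
    def __init__(self, key):
        self.key = key
        self.left = None
        self.right = None


def bst_insert(root, key):
    """Insert key and return the (possibly new) root. Duplicates are ignored."""
    if root is None:
        return BSTNode(key)
    node = root
    while True:
        if key < node.key:
            if node.left is None:
                node.left = BSTNode(key)
                break
            node = node.left
        elif key > node.key:
            if node.right is None:
                node.right = BSTNode(key)
                break
            node = node.right
        else:
            break
    return root


def bst_contains(root, key):
    node = root
    while node is not None:
        if key == node.key:
            return True
        node = node.left if key < node.key else node.right
    return False


def bst_height(root):
    if root is None:
        return 0
    left, right = bst_height(root.left), bst_height(root.right)
    return 1 + (left if left > right else right)


balanced = None
for key in [4, 2, 6, 1, 3, 5, 7]:
    balanced = bst_insert(balanced, key)

skewed = None
for key in [1, 2, 3, 4, 5, 6, 7]:  # sorted input degenerates into a chain
    skewed = bst_insert(skewed, key)

print(bst_height(balanced), bst_height(skewed))              # 3 7
print(bst_contains(balanced, 5), bst_contains(balanced, 8))  # True False
```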

### Inverted indexes
* An **inverted index** maps each word to the locations (e.g., pages) where it appears; search engines use it to answer queries
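
A small sketch of the idea: map each word to the set of documents containing it. The `pages` data and the function name `build_inverted_index` are made up for the example, and the tokenization (lowercase, split on whitespace) is deliberately naive.

```python
from collections import defaultdict


def build_inverted_index(documents):
    """Map each lowercase word to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for word in text.lower().split():
            index[word].add(doc_id)
    return index


# hypothetical pages
pages = {
    'a.com': 'hi there',
    'b.com': 'hi adit',
    'c.com': 'there we go',
}
index = build_inverted_index(pages)
print(sorted(index['hi']))     # ['a.com', 'b.com']
print(sorted(index['there']))  # ['a.com', 'c.com']
```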

### The Fourier transform
* The **Fourier transform** is a signal-processing algorithm that breaks a signal into its component frequencies (by analogy: given a smoothie, it tells you the ingredients). Applications include separating a song into frequencies, compressing music and images (JPEG), predicting earthquakes, and identifying songs (like Shazam)

### Parallel algorithms
* **Parallel algorithms** split work across multiple cores to speed up tasks like sorting

### MapReduce
* **Distributed algorithms** are a special case of **parallel algorithms** that run across many machines, generally intended to speed up very long-running jobs
* **MapReduce** is a distributed algorithm that applies the `map` step in parallel to gather and/or process data, and then aggregates the results using the `reduce` step
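
The shape of the two steps can be illustrated on a single machine with a word count. This is only a sketch of the pattern: `map_step` and `reduce_step` are hypothetical names, and a real MapReduce system would run the map calls on separate workers rather than sequentially.

```python
from collections import Counter
from functools import reduce


def map_step(line):
    """Count the words in a single line (would run on one worker)."""
    return Counter(line.lower().split())


def reduce_step(left, right):
    """Merge two partial counts into one."""
    return left + right


lines = ['the quick brown fox', 'the lazy dog', 'the end']
partial_counts = map(map_step, lines)  # sequential here; parallel on a real cluster
totals = reduce(reduce_step, partial_counts, Counter())
print(totals['the'])  # 3
```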

### Bloom filters and HyperLogLog
* **Probabilistic data structures** are data structures that give you an answer that could be wrong, but is probably right
* A **Bloom filter** is a probabilistic data structure that operates like a set, but with a significantly reduced memory footprint. False positives are possible, but false negatives aren't.
* **HyperLogLog** is a probabilistic data structure that approximates the number of unique elements in a set. This is useful for approximating amounts of user behavior (e.g., Google estimating the number of unique searches performed by users)
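
A toy Bloom filter helps show why false negatives are impossible: adding an item sets a few bits, and a lookup only answers "maybe" when all of those bits are set. The sizes (1024 bits, 3 hash positions) and the trick of salting SHA-256 with an index are assumptions made for this sketch, not a tuned design.

```python
import hashlib


class BloomFilter:
    def __init__(self, size_bits=1024, num_hashes=3):
        self.size_bits = size_bits
        self.num_hashes = num_hashes
        self.bits = 0  # a Python int used as a bit array

    def _positions(self, item):
        # derive num_hashes bit positions by salting SHA-256 with the hash index
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f'{seed}:{item}'.encode()).digest()
            yield int.from_bytes(digest[:8], 'big') % self.size_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits |= 1 << pos

    def might_contain(self, item):
        # True means "possibly present"; False means "definitely absent"
        return all((self.bits >> pos) & 1 for pos in self._positions(item))


seen_urls = BloomFilter()
seen_urls.add('adit.io')
seen_urls.add('example.com')
print(seen_urls.might_contain('adit.io'))  # True
```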

### The SHA algorithms
* **SHA** is a family of **one-way hash functions** that generate fixed-size hashes for strings (e.g., used to generate keys for a dictionary, compare the contents of two files, or store passwords for verification without keeping the plain text)
* SHA is **locality insensitive**, meaning it generates a completely different hash for even a single character change
* If you want a **locality sensitive** hash function, consider **Simhash**, which can be used to compare how similar two strings are (e.g., detecting plagiarism, Scribd detecting copyrighted material)
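
Both properties, the fixed size and the locality insensitivity, can be checked with `hashlib` from the standard library. The inputs `dog` and `dot` are arbitrary examples that differ by one character.

```python
import hashlib

digest_1 = hashlib.sha256(b'dog').hexdigest()
digest_2 = hashlib.sha256(b'dot').hexdigest()

print(len(digest_1))         # 64 hex characters = 256 bits, regardless of input length
print(digest_1 == digest_2)  # False
```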

### Diffie-Hellman key exchange
* **Diffie-Hellman** lets two parties communicate securely without agreeing on a cipher in advance. As the book describes it, others send you a message encrypted with your **public key**, and the message can only be decrypted using your **private key**. (**RSA** is the successor to Diffie-Hellman.)
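
In its classic form, the exchange has each party combine its own private number with the other party's public number, and both arrive at the same shared secret. A toy sketch with deliberately tiny, insecure numbers follows. The values of `p`, `g`, and both private keys are illustrative assumptions, and real deployments use far larger parameters.

```python
# public parameters, known to everyone (toy-sized for readability)
p, g = 23, 5

alice_private, bob_private = 6, 15       # kept secret by each party
alice_public = pow(g, alice_private, p)  # sent in the clear
bob_public = pow(g, bob_private, p)      # sent in the clear

# each side combines its own private key with the other's public value
alice_shared = pow(bob_public, alice_private, p)
bob_shared = pow(alice_public, bob_private, p)
print(alice_public, bob_public, alice_shared, bob_shared)  # 8 19 2 2
```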

### Linear Programming
* **Linear Programming** is used to maximize a value given a set of constraints, using the **Simplex algorithm**
* All graph algorithms can be accomplished using linear programming; linear programming is a more general framework, and graph problems are a subset