# Data Structures & Algorithms

Notes from reading *Grokking Algorithms* by Aditya Bhargava

## Arrays & Linked Lists

When working with data, we often need to work with (ordered) sequences of objects. There is more than one approach to structuring this data. Two common approaches are:
1. Arrays
2. Linked lists

### Arrays

When a new array is created, we reserve a contiguous block of memory to house the contents. Each address in the block stores a reference to the object that is placed at the corresponding element of the array.

Because we have a pre-defined block of memory, we always know which address we need to access to retrieve an element of the array. We therefore get fast reads of the contents of the array, even when we're asking for arbitrary elements eg. `elem = l[999]`.

However, over time, we may wish to change the contents of our array: for example, as we receive new values in a time series, we may need to add new elements to the end of our array. The memory block that we had reserved may now be full and we therefore need to reserve additional addresses to house all of the elements. If neighbouring addresses are already being used by other programmes, then we will need to provision a new contiguous memory block and relocate the entire array in order to keep all the elements together. 

We can mitigate the cost of having to relocate the array when its size changes by reserving some buffer in the blocks that we request - this way we can add new elements to the array without having to move the whole array.
But this redundancy is also a drain on memory (no one else can use it) and doesn't completely remove the problem - we can still use up the buffer and find ourselves needing to relocate the array.
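
The buffer-and-relocate behaviour described above can be sketched with a toy growable array. This is only an illustration: the `DynamicArray` class, its starting capacity of 2 and the doubling rule are assumptions made for the example, and a Python list stands in for the contiguous memory block.

```python
class DynamicArray:
    """Toy growable array: a fixed-size block plus spare capacity (illustrative only)."""

    def __init__(self, capacity: int = 2):
        self.capacity = capacity          # slots reserved
        self.size = 0                     # slots actually in use
        self.block = [None] * capacity    # stands in for a contiguous memory block
        self.relocations = 0              # how many times everything had to move

    def append(self, value) -> None:
        # If the reserved block is full, move everything to a bigger block first
        if self.size == self.capacity:
            self._relocate(self.capacity * 2)
        self.block[self.size] = value
        self.size += 1

    def _relocate(self, new_capacity: int) -> None:
        # Copy every element across: this is the expensive O(n) step
        new_block = [None] * new_capacity
        for i in range(self.size):
            new_block[i] = self.block[i]
        self.block = new_block
        self.capacity = new_capacity
        self.relocations += 1

    def __getitem__(self, index: int):
        # Reads go straight to the slot, no traversal needed
        if not 0 <= index < self.size:
            raise IndexError(index)
        return self.block[index]


arr = DynamicArray()
for value in range(10):
    arr.append(value)

print(arr[7], arr.capacity, arr.relocations)  # 7 16 3
```

Ten appends only trigger three relocations here, but six of the sixteen reserved slots sit unused: the trade-off between spare memory and relocation cost.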

### Linked lists

Linked lists allow us to avoid this need for relocating arrays when we add new elements. The elements of a linked list are not stored in a contiguous block - they can each be stored at arbitrary positions in memory.
In order to keep track of all these positions, each element is responsible for storing the location of the next element in the LL. In this way, adding a new element is easy: you store it anywhere in memory, and then store its location with the previous element. As a result, we never need to worry about having to relocate the whole LL.

The downside with LLs is that in order to retrieve an element from an arbitrary position in the LL, we still have to start at the beginning and work our way through each element to get there (because each element holds the only record of the address of the next element).

Thus, LLs are great for scenarios where we want to retrieve each and every element in order. But they're far from ideal if we are more likely to be requesting individual elements from arbitrary positions in the collection.
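
A minimal singly linked list might look like the sketch below. The `Node` and `LinkedList` names and the `tail` pointer (kept so that appending does not require a traversal) are assumptions for illustration, not something the book prescribes.

```python
from typing import Any, Iterator, Optional


class Node:
    """One element of the list: a value plus the location of the next element."""

    def __init__(self, value: Any):
        self.value = value
        self.next_node: Optional["Node"] = None


class LinkedList:
    def __init__(self):
        self.head: Optional[Node] = None
        self.tail: Optional[Node] = None

    def append(self, value: Any) -> None:
        # O(1): create the node anywhere, then record its location on the previous tail
        node = Node(value)
        if self.tail is None:
            self.head = node
        else:
            self.tail.next_node = node
        self.tail = node

    def get(self, index: int) -> Any:
        # O(n): we have to walk from the head, one link at a time
        if index < 0:
            raise IndexError(index)
        node = self.head
        for _ in range(index):
            if node is None:
                raise IndexError(index)
            node = node.next_node
        if node is None:
            raise IndexError(index)
        return node.value

    def __iter__(self) -> Iterator[Any]:
        # Reading every element in order is where linked lists are comfortable
        node = self.head
        while node is not None:
            yield node.value
            node = node.next_node


stops = LinkedList()
for name in ["a", "b", "c"]:
    stops.append(name)

print(list(stops))   # ['a', 'b', 'c']
print(stops.get(2))  # c
```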


### Performance

| Operation       | Array    | Linked list |
|:-----           |:-----:   |:-----:      |
| Read (by index) | $O(1)$   | $O(n)$      |
| Insert          | $O(n)$   | $O(1)$      |
| Delete          | $O(n)$   | $O(1)$      |


- Prefer **arrays** when the contents don't change size frequently
- Prefer **arrays** when we'll typically want to retrieve values from arbitrary indices / slices, rather than reading the sequence in order
- Prefer **linked lists** when we typically read (all) elements, in order
- Prefer **linked lists** when we regularly change the size of the collection

## Hash tables

> AKA hash map, map, dictionary & associative array

- implementation
- collisions
- hash functions

Hash tables map inputs to outputs as key, value pairs. They are composed of:
1. Hash function
2. Array (for storage of values)


The input is a sequence of bytes that acts as a key to access the value, and the hash table returns the value each time that key is provided.
A hash function takes a key as input and returns the index of the array at which the corresponding value is stored. (ie. it doesn't return the value directly, but it tells us where to store / find it).

No duplicate keys allowed.

### Collisions

Sometimes a hash function will assign two different keys to the same slot. Left unhandled, the second value would overwrite the first and the hash table could return the wrong value for a key.

One way around this is to create a linked-list at a slot whenever multiple keys map to it.
But this can make hash tables inefficient if the collisions are not distributed evenly through our data (we could end up with many data points in a single linked list, and few values in the remaining array slots).
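
The chaining idea can be sketched as follows. Everything here is an assumption made for illustration: the `ChainedHashTable` name, the deliberately weak byte-sum hash in `_index` (chosen so that a collision is easy to trigger), and the use of a Python list per slot as a stand-in for the linked list.

```python
class ChainedHashTable:
    """Toy hash table for string keys; each slot holds a chain of (key, value) pairs."""

    def __init__(self, num_slots: int = 8):
        self.slots = [[] for _ in range(num_slots)]

    def _index(self, key: str) -> int:
        # Toy hash function: sum of the key's bytes, wrapped to the number of slots
        return sum(key.encode("utf-8")) % len(self.slots)

    def put(self, key: str, value) -> None:
        chain = self.slots[self._index(key)]
        for i, (existing_key, _) in enumerate(chain):
            if existing_key == key:
                chain[i] = (key, value)   # no duplicate keys: overwrite in place
                return
        chain.append((key, value))        # new key (possibly a collision): extend the chain

    def get(self, key: str):
        # Walk the chain at this slot until the matching key turns up
        for existing_key, value in self.slots[self._index(key)]:
            if existing_key == key:
                return value
        raise KeyError(key)


table = ChainedHashTable()
table.put("ab", 1)
table.put("ba", 2)   # same bytes in a different order, so the toy hash picks the same slot

print(table._index("ab") == table._index("ba"))  # True
print(table.get("ab"), table.get("ba"))          # 1 2
```

Both values survive the collision, at the price of a short walk along the chain on every lookup in that slot.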

### Performance

| Operation | Avg | Worst |
|:-----|:-----|:-----|
| Insert | $O(1)$ | $O(n)$ |
| Search | $O(1)$ | $O(n)$ |
| Delete | $O(1)$ | $O(n)$ |


As the table fills up, collisions become more and more frequent, and performance drifts away from the average case towards the worst case.

### Load factors & resizing

The likelihood of collisions is measured with the **load factor**:

load factor = (Num. items in the hash table) / (Num. slots available)

As the load factor increases, we may need to add slots to the values array; this is called "resizing".

Resizing is often triggered once the load factor exceeds 0.7.

To resize the values array:
- Create a new array (often double the size)
- Re-assign all existing items to the new array using the hash function

Resizing is expensive, but it happens rarely enough that hash table operations are still $O(1)$ once averaged out
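
The load factor check and the resize steps can be sketched like this. The helper names (`toy_hash`, `load_factor`, `resize`, `insert`) and the byte-sum hash are hypothetical; the 0.7 threshold and the doubling rule are the ones mentioned in these notes.

```python
def toy_hash(key: str, num_slots: int) -> int:
    # Illustrative hash only: sum of bytes wrapped to the table size
    return sum(key.encode("utf-8")) % num_slots


def load_factor(slots: list) -> float:
    num_items = sum(len(chain) for chain in slots)
    return num_items / len(slots)


def resize(slots: list) -> list:
    # Create a new array of double the size, then re-hash every existing item into it
    new_slots = [[] for _ in range(len(slots) * 2)]
    for chain in slots:
        for key, value in chain:
            new_slots[toy_hash(key, len(new_slots))].append((key, value))
    return new_slots


def insert(slots: list, key: str, value, max_load: float = 0.7) -> list:
    chain = slots[toy_hash(key, len(slots))]
    for i, (existing_key, _) in enumerate(chain):
        if existing_key == key:
            chain[i] = (key, value)
            return slots
    chain.append((key, value))
    # Resize once the table is too full; the caller keeps the returned array
    if load_factor(slots) > max_load:
        slots = resize(slots)
    return slots


slots = [[] for _ in range(4)]
for number, key in enumerate(["a", "b", "c", "d"]):
    slots = insert(slots, key, number)

print(len(slots), load_factor(slots))  # 8 0.5
```

The third insert pushes the load factor to 0.75, which triggers the resize from 4 to 8 slots.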

### Hash functions

A good hash function distributes values evenly across array slots - this reduces the likelihood of collisions

An example of a hash function: SHA

Hash functions should exhibit two properties:
1. Consistency: Each time the same key is provided, it must return the same array index
2. Divergence: Different keys should, as far as possible, map to different array indices (this can rarely be achieved perfectly, which is why collisions need handling)

> Common hash functions & how they work?

## What is an algorithm?

An algorithm is a set of instructions for accomplishing a task.

There are often many ways to accomplish a task, and so we need to understand the trade-offs in order to know how best to accomplish a task in a given context.

### A motivating example: binary search

Imagine that we need to find an element in a sorted list. More specifically, we want to return the position of the desired element in that list, or `null` if it cannot be found.

**Simple search** might be the most obvious & basic attempt to solve that task. It starts at the first element, and moves through them one-by-one, in sequence, until it finds the right answer. If our desired value happens to be the first element, then we're in luck. If it isn't, and our list is long, then we might have to check a lot of values before we get our answer...

**Binary search** dramatically reduces the number of values we have to check before we find the position of our desired element. At each step it selects the mid-point of the remaining elements and establishes whether that location is too low or too high in order to determine which elements it can remove from consideration, and which elements are still contenders (remember, it's a sorted list). This process is repeated with each step until we find the target value (or exhaust the possibilities).

```python
from typing import Union

def simple_search(sorted_elements: list, target_value: int) -> Union[int, None]:
    for index, element in enumerate(sorted_elements):
        if element == target_value:
            return index
    return None  # target not found
```

```python
sorted_list = [0, 1, 2, 3, 4, 5, 6, 7]

print(simple_search(sorted_list, 3))  # 3
print(simple_search(sorted_list, 6))  # 6
print(simple_search(sorted_list, 9))  # None
```


```python
from typing import Union

def binary_search(sorted_elements: list, target_value: int) -> Union[int, None]:
    low = 0
    high = len(sorted_elements) - 1
    guesses = 0
    while low <= high:
        mid_point = (low + high) // 2  # integer division keeps the index an int
        midpoint_value = sorted_elements[mid_point]
        guesses += 1
        if midpoint_value == target_value:
            print(f"{guesses} guesses")
            return mid_point
        if midpoint_value > target_value:
            high = mid_point - 1  # target must be in the lower half
        else:
            low = mid_point + 1   # target must be in the upper half
    return None

binary_search([0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20], 16)
# prints "4 guesses"
# returns 16
```
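
To put rough numbers on the difference between the two searches, the sketch below compares the worst-case number of checks for a list of `n` elements. The function names are made up for this example; the binary search figure uses the fact that each guess halves the remaining range, giving $floor(log_2(n)) + 1$ guesses for $n \geq 1$.

```python
def worst_case_simple(n: int) -> int:
    # Simple search may have to look at every element
    return n


def worst_case_binary(n: int) -> int:
    # floor(log2(n)) + 1 for n >= 1, computed exactly via the integer bit length
    return n.bit_length()


for n in (21, 1_000, 1_000_000):
    print(n, worst_case_simple(n), worst_case_binary(n))
# 21 21 5
# 1000 1000 10
# 1000000 1000000 20
```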

## Selection Sort\*

## Recursion

A recursive recipe must contain three ingredients:

1. Stopping criteria
2. A first step (gets the ball rolling)
3. A repetitive component (that will lead us to the stopping criteria) ie. a function that calls itself.

Recursion does not improve computational performance, but may be faster to write. It should be used when it helps to make the solution easier to intuit.

### Structure

In order to avoid infinite loops of recursion, each recursive method needs to have two components:
1. A base case: catches the stopping criteria & breaks the loop
2. A recursive case: takes the recursion a layer deeper
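
As a small sketch of the two components, here is a recursive factorial (a standard textbook example, not taken from the notes above) with the base case and recursive case marked:

```python
def factorial(n: int) -> int:
    if n < 0:
        raise ValueError("factorial is not defined for negative numbers")
    # Base case: stops the recursion
    if n <= 1:
        return 1
    # Recursive case: one layer deeper, on a smaller input
    return n * factorial(n - 1)


print(factorial(5))  # 120
```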

### The call stack

The *stack* data structure acts like a LIFO inventory: new items go to the top of the pile as they're added and we remove those most recent items from the pile first as we work through it.

The **call stack** is a stack data structure that our computers use to keep track of function calls that are WIP:
- New function calls are added to the stack (memory is allocated to that call, and variables in that scope are saved to that memory)
- Those function calls are removed from the stack as they return their results

Because recursion involves calling the same function many times, with a chain of dependency between each call, it can lead to a lot of function calls being added to the call stack. With enough calls, or large state stored for each call, recursion can exhaust a machine's memory.

When this occurs, there are two options:
1. Use a loop instead
2. Use tail recursion (which only helps in languages that optimise tail calls; CPython, for example, does not)
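
The first option can be sketched as follows. The two function names are hypothetical, and the example assumes CPython, which guards the call stack with a recursion limit and raises `RecursionError` rather than letting memory run out.

```python
import sys


def sum_to_n_recursive(n: int) -> int:
    """Sum 1..n using one stack frame per number."""
    if n == 0:  # base case
        return 0
    return n + sum_to_n_recursive(n - 1)


def sum_to_n_loop(n: int) -> int:
    """Same result, but the call stack never grows."""
    total = 0
    for i in range(1, n + 1):
        total += i
    return total


# Deliberately ask for more nested calls than the interpreter allows
depth = sys.getrecursionlimit() + 100

try:
    sum_to_n_recursive(depth)
    hit_limit = False
except RecursionError:
    hit_limit = True

print(hit_limit)
print(sum_to_n_loop(depth) == depth * (depth + 1) // 2)
```

The recursive version is expected to hit the limit, while the loop handles the same input with a single stack frame.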

## Divide & Conquer

**Divide & Conquer** is a general technique for solving problems that uses recursion.

1. Identify simple conditions in which the problem can be solved.
2. Break the problem down so that you are only trying to solve it for increasingly small / simple inputs; continue until you reach a situation where you only need to deal with the simple conditions above. 

> NB. When working with recursion on problems involving arrays, the base / simple case is often an array of length 0 or 1.

Preferring the recursive approach over loops is typical of **functional programming** - Haskell doesn't even have loops!
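
As a small illustration of the "array of length 0 or 1" base case, here is a sketch that sums a list recursively instead of with a loop. The function name is made up for the example.

```python
def recursive_sum(numbers: list) -> int:
    # Base cases: an empty list or a single element needs no further work
    if len(numbers) == 0:
        return 0
    if len(numbers) == 1:
        return numbers[0]
    # Recursive case: peel off the first element and solve a smaller problem
    return numbers[0] + recursive_sum(numbers[1:])


print(recursive_sum([2, 4, 6]))  # 12
```

Each call works on a list one element shorter, so the input is guaranteed to shrink to a base case.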

The Binary Search algorithm that we saw earlier is also a type of D&C solution which can be coded using recursion. See the re-write of our earlier function below:

```python
from typing import Union

def recursive_binary_search(arr: list, target: int, low_idx: int, high_idx: int, guesses: int = 0) -> Union[int, None]:
    # Base case: the (inclusive) search range is empty, so the target isn't present
    if low_idx > high_idx:
        return None
    guesses += 1
    mid_point = (low_idx + high_idx) // 2
    print('mid: ', mid_point)
    midpoint_value = arr[mid_point]
    # Base case: we've found the target
    if midpoint_value == target:
        print(f"{guesses} guesses")
        return mid_point
    # Recursive cases: repeat the search on the half that can still contain the target
    elif midpoint_value > target:
        return recursive_binary_search(arr, target, low_idx, mid_point-1, guesses=guesses)
    else:
        return recursive_binary_search(arr, target, mid_point+1, high_idx, guesses=guesses)
```

```python
arr = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20]
recursive_binary_search(arr, 4, 0, len(arr) - 1)  # high_idx is inclusive
# prints:
# mid:  10
# mid:  4
# 2 guesses
# returns 4
```

### Quicksort

Quicksort - as the name suggests - is a sorting algorithm that uses D&C.

We'll work through it using the example of sorting an array:

**1. Identify simple conditions in which the problem can be solved.**

The simplest situation we can hope to work with is a case where the array is either empty, or only has one element: in this scenario, we can just return the array as it is because there's no sorting to be done!

This could look something like:
```python
def quicksort(arr):
    if len(arr) < 2:
        return arr
```

The next simplest situation we can have is an array with 2 elements. In this scenario, if the first element is larger than the second, then swap them around and return.

```python
def quicksort(arr):
    if len(arr) < 2:
        return arr
    if len(arr) == 2:
        return arr if arr[0] < arr[1] else arr[::-1]
    # longer arrays aren't handled yet

quicksort([2, 1])  # returns [1, 2]
```

**2. Then break the problem down until you reach those conditions.**

And now we need to think through how we'd approach longer arrays. With 3 elements, we can no longer do the direct comparison above. We need to Divide & Conquer... How could we break up the process so that we end up only having to deal with many instances of the base case?

One approach could be to:
- Select an element; we'll call our selected element the ***pivot***
- Work through & throw all other elements into one of two buckets (leaving three in total, once the pivot is included): those less than the pivot, and those greater than the pivot
- For each of the non-pivot buckets, choose a new pivot and repeat
- Eventually, each non-pivot bucket will simplify to an array of 2 or fewer elements, and from these we can build a complete array of sorted elements!

```python
lesser = []
greater = []
arr = [5, 4, 3, 2, 1]
pivot = 3

for elem in arr:
    if elem <= pivot:
        lesser.append(elem)
    else:
        greater.append(elem)

print(lesser)   # [3, 2, 1] (the pivot lands here too, because it is still in arr)
print(greater)  # [5, 4]
```


```python
def quicksort(arr: list) -> list:
    # Base cases: nothing to sort, or a single comparison
    if len(arr) < 2:
        return arr
    if len(arr) == 2:
        return arr if arr[0] < arr[1] else arr[::-1]

    # Recursive case: partition around the last element
    # (indexing rather than pop() so the caller's list isn't modified)
    pivot = arr[-1]
    lesser = []
    greater = []
    for elem in arr[:-1]:
        if elem <= pivot:
            lesser.append(elem)
        else:
            greater.append(elem)

    return quicksort(lesser) + [pivot] + quicksort(greater)


assert quicksort([1, 2]) == [1, 2]
assert quicksort([1, 2, 3]) == [1, 2, 3], f"returns {quicksort([1, 2, 3])}"
assert quicksort([7, 6, 5, 4, 2, 3, 1]) == [1, 2, 3, 4, 5, 6, 7], f"returns {quicksort([7, 6, 5, 4, 2, 3, 1])}"
```

### A note on complexity

The speed with which quicksort sorts an array depends on which pivots we choose:
- In the worst case, the complexity of running quicksort is $O(n^{2})$ (ie. as bad as **selection sort**)
- In the average case, the complexity of running quicksort is $O(nlog(n))$

The worst case occurs if we manage to always select our pivot such that all remaining elements lie in just one partition (ie. all are lesser than, or greater than, the pivot). This results in the worst case, because we're effectively removing one item from the array at a time, rather than breaking it into many chunks. This would occur when the array passed in is already sorted and we select the first element as pivot on each split.

To avoid the worst case, and aim for the average case, we usually select the pivot element at random on each split.
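
A random-pivot variant could be sketched as below. This is one possible way to write it, not the book's code: the function name is made up, and it uses list comprehensions for the partition step.

```python
import random


def quicksort_random_pivot(arr: list) -> list:
    # Base case: empty or single-element lists are already sorted
    if len(arr) < 2:
        return list(arr)

    # Pick the pivot position at random so that already-sorted input is no longer a bad case
    pivot_index = random.randrange(len(arr))
    pivot = arr[pivot_index]
    rest = arr[:pivot_index] + arr[pivot_index + 1:]

    lesser = [elem for elem in rest if elem <= pivot]
    greater = [elem for elem in rest if elem > pivot]
    return quicksort_random_pivot(lesser) + [pivot] + quicksort_random_pivot(greater)


print(quicksort_random_pivot([7, 6, 5, 4, 2, 3, 1]))  # [1, 2, 3, 4, 5, 6, 7]
```

The result is the same whichever pivots get picked; only the number of recursive levels varies from run to run.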

In contrast, the best case occurs if each pivot is bang in the middle of the array, so the partitions are split equally - which speeds up how quickly we break all branches down to a base case scenario. In this case there are only $O(log(n))$ levels of recursion, and each level still has to touch all $n$ elements, so the runtime complexity is $O(n log(n))$.

The average case complexity is equal to the best case complexity! (the runtime will be slower because we're unlikely to get exactly the ideal splits every time, but it scales with $n$ in the same order of complexity).

### Parallelism?

Looking at the shape of the quicksort method we've written above, it's interesting to see that not only are we using recursion, we're using recursion twice (in that last line).
The fact that we're breaking the problem down into many branching pieces more quickly (by calling it twice, rather than having a single chain of many dependent calls) suggests that there may be more opportunity for parallel processing to help us here?

## Merge sort



## Avg-case Vs Worst case

In all the complexity notation, there's a hidden constant; figuratively and literally...

In all cases, the runtime of an algorithm depends not just on $n$, but also on $c$: the (constant) time taken to perform the required operations on each element.
For some algorithms, the order of complexity may be higher than others, but $c$ may be small enough that it's still faster to use for the size of $n$ you're dealing with.
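
A toy calculation makes the point. The two cost functions and their constants (100 and 1) are invented purely for illustration; they don't describe any real algorithm.

```python
def cost_a(n: int) -> int:
    """Hypothetical O(n) algorithm with a large constant: c = 100."""
    return 100 * n


def cost_b(n: int) -> int:
    """Hypothetical O(n^2) algorithm with a small constant: c = 1."""
    return 1 * n * n


for n in (10, 100, 1000):
    print(n, cost_a(n), cost_b(n))
# 10 1000 100
# 100 10000 10000
# 1000 100000 1000000
```

Below $n = 100$ the "worse" $O(n^{2})$ algorithm is actually cheaper; beyond it, the order of complexity takes over.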

## Recap


| Algorithm     | Worst-case runtime complexity      |
|:--------------|:-------------:|
| Binary search | $O(log(n))$ |
| Selection sort| $O(n^{2})$   |
| Quicksort     | $O(n^{2})$   |
| Merge sort    | $O(n log(n))$ |


## Graphs & graph-search

### Graphs

Graphs are composed of the following element types:
- Nodes
- Edges
- (Directions)

Neighbours are nodes which are directly connected by an edge. These are "first-degree" connections.
Nodes that are connected to our first-degree connections (but not to us), are "second-degree" connections.

### Shortest path problems & Breadth-first search (BFS)

Shortest path problems involve finding the shortest route between two nodes in a graph.

BFS is the algorithm most commonly used to solve these types of problems.

"Breadth-first" refers to the fact that we exhaustively search each degree of connections before searching any connections with a higher degree.
In BFS, the edges are not weighted.

To implement this kind of search, we can use a Queue data structure:
- **Queues** are similar to **stack** data structures; the difference is in the order with which items are removed from the list
- **Stacks** provide the most recently added item when removing items: LIFO
- **Queues** provide the earliest-added item when removing items: FIFO

> When we queue for the till whilst shopping, the **f**irst person to join is the **f**irst person to get served

> Whereas, if I make a **stack** of books, I can only (easily) remove the book at the top - the **l**ast one to be added to the stack is the **f**irst one to be used

Queues support two operations:
1. `enqueue`: add an item to the queue
2. `dequeue`: remove an item from the queue
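
In Python, both behaviours can be sketched with standard containers: `collections.deque` for the queue (`append` to enqueue, `popleft` to dequeue) and a plain list for the stack. The variable names are just for illustration.

```python
from collections import deque

# Queue: first in, first out
queue = deque()
for person in ["ann", "bob", "cat"]:
    queue.append(person)        # enqueue at the back
first_served = queue.popleft()  # dequeue from the front

# Stack: last in, first out
stack = []
for book in ["ann", "bob", "cat"]:
    stack.append(book)          # push onto the top
first_removed = stack.pop()     # pop from the top

print(first_served)   # ann
print(first_removed)  # cat
```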


### Representing graphs in data structures

If we're working with directed graphs without weights, then our data structure only needs to accommodate nodes & edges (connections). We can do this with a hash table:
- Keys: one for each node
- Values: a list of nodes that each key-node has a directed connection to

Imagine we have a collection of bus stops, connected by sections of several bus routes.
- Each bus stop is a node,
- Each section of a route between two stops is an edge

<diagram>

If we start at stop 1, is there a route to stop 5?

```python
bus_stop_graph = {
    1: [2, 3],
    2: [3, 4],
    3: [4, 5],
    4: [5],
    5: []
}
```

```python
from collections import deque

def search_routes(start, destination):
    # create a double-ended queue, used here as a FIFO queue
    search_roster = deque()
    # add the first degree of route sections to the queue
    search_roster += bus_stop_graph[start]
    # keep track of stops we've already searched (avoids duplicates / infinite loops)
    searched_stops = set()

    while search_roster:  # keep searching so long as there are still items in the search roster
        next_stop = search_roster.popleft()  # take the first / earliest-added item from the queue
        if next_stop not in searched_stops:
            # is it our destination?
            if next_stop == destination:
                return True
            else:
                # otherwise, add its route sections to our search list
                search_roster += bus_stop_graph[next_stop]
                # and add it to searched_stops
                searched_stops.add(next_stop)
    return False  # if we exhaust the queue without finding the destination, there was no route from 'start' to 'destination'
```

```python
assert search_routes(start=1, destination=5)
assert search_routes(start=2, destination=5)
assert not search_routes(start=5, destination=1)
```

### Complexity

In the worst case, we'll end up following every connection / edge: $O(Edges)$.
We'll also need to add every node to the search list (each add op is $O(1)$), which works out to $O(Nodes)$.
Combined, then, we're facing $O(Nodes + Edges)$.

\* With that implementation, the BFS confirms whether or not there is a path from start to destination, and it finds it efficiently. But it doesn't tell me which path to take. What's the best way of tracking that!?
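
One common way to recover the route itself is to record, for each stop, the stop it was first reached from, and then walk those records backwards from the destination. The sketch below assumes the same adjacency-list shape as the bus stop graph; `find_route`, `reached_from` and `bus_stops` are names made up for this example.

```python
from collections import deque
from typing import Optional


def find_route(graph: dict, start, destination) -> Optional[list]:
    # reached_from[stop] is the stop we were at when we first discovered `stop`
    reached_from = {start: None}
    queue = deque([start])

    while queue:
        stop = queue.popleft()
        if stop == destination:
            # Walk the records backwards to rebuild the route, then reverse it
            route = []
            while stop is not None:
                route.append(stop)
                stop = reached_from[stop]
            return route[::-1]
        for neighbour in graph[stop]:
            if neighbour not in reached_from:  # doubles as the "already seen" check
                reached_from[neighbour] = stop
                queue.append(neighbour)
    return None  # no route exists


bus_stops = {1: [2, 3], 2: [3, 4], 3: [4, 5], 4: [5], 5: []}

print(find_route(bus_stops, 1, 5))  # [1, 3, 5]
print(find_route(bus_stops, 5, 1))  # None
```

Because BFS reaches each stop via the fewest possible edges first, the route rebuilt this way is one with the fewest hops.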

### Trees are graphs

As a side-note, ***trees*** are directed graphs in which no edge points back: every node except the root has exactly one parent, so there are no cycles and branches never share a node.

## Dijkstra's Algorithm

Working with the Breadth-First Search algo was appropriate when:
1. We want to find the path with the smallest number of edges between two nodes
2. We are **NOT** considering weights on the edges

> If we want to consider weights (ie. find the smallest weighted 'distance' between two nodes) then we should use Dijkstra's algorithm.

Dijkstra's algorithm makes the assumption that there is no 'cheaper' route to the cheapest first-degree connection from a given node. This is because the first edge you'd have to travel along on any other path would already have a higher cost than the cheapest - so long as ALL weights are positive. If weights can be negative, then a 2nd-degree edge could more than offset the higher cost of the first, and thereby offer a cheaper path.
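
A minimal sketch of the algorithm, under a few assumptions: the graph is a dict of dicts (`graph[node][neighbour] = weight`), all weights are non-negative, and a `heapq` priority queue is used to pick the cheapest unprocessed node (one common implementation choice). The function name and the example graph are hypothetical.

```python
import heapq


def dijkstra(graph: dict, start) -> dict:
    """Return the cheapest known cost from `start` to every reachable node."""
    costs = {start: 0}
    processed = set()
    heap = [(0, start)]  # (cost so far, node); the cheapest entry is popped first

    while heap:
        cost, node = heapq.heappop(heap)
        if node in processed:
            continue  # stale entry: this node was already finalised more cheaply
        processed.add(node)
        for neighbour, weight in graph[node].items():
            new_cost = cost + weight
            # Keep the new route only if it beats the best one found so far
            if new_cost < costs.get(neighbour, float("inf")):
                costs[neighbour] = new_cost
                heapq.heappush(heap, (new_cost, neighbour))
    return costs


weighted_graph = {
    "start": {"a": 6, "b": 2},
    "a": {"fin": 1},
    "b": {"a": 3, "fin": 5},
    "fin": {},
}

print(dijkstra(weighted_graph, "start"))
# {'start': 0, 'a': 5, 'b': 2, 'fin': 6}
```

The direct edge to `a` costs 6, but going via the cheapest first-degree connection `b` brings it down to 5, which in turn makes `fin` reachable for 6.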

\* Bellman-Ford Vs Dijkstra?

\* What to do if multiple edges have the same (cheapest) cost?

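Dijkstra's algorithm is not guaranteed to find the cheapest path if:
- Weights can be negative

Cycles are not a problem in themselves (undirected graphs are implicitly cyclical): with non-negative weights, going around a cycle can never make a path cheaper, and each node is only processed once.

==> Only rely on it for graphs where all weights are non-negative; if weights can be negative, use Bellman-Ford (which in turn needs the graph to have no negative-weight cycles)


## Summary of graph search

| Algorithm     | Directed? | Cyclical? | Weighted? |
|:--------------|:---------:|:---------:|:---------:|
| Breadth-first search | Either | Either (track visited nodes) | Unweighted |
| Dijkstra's | Either | Either | Weighted (non-negative only) |
| Bellman-Ford | Either | Either (no negative-weight cycles) | Weighted (negative allowed) |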


## Resources

[Grokking Algorithms; Aditya Bhargava](https://www.manning.com/books/grokking-algorithms)

[Eliana Lopez's crib sheet](https://github.com/elianalopez/Data-Structures-and-Algorithms-Notes-with-Python)