## <center>Scientific Programming - 7MRI0020 - 2021/2022</center>


## <center>Week 06 - Data Structures and Algorithms - Part 01</center>


### <center>School of Biomedical Engineering & Imaging Sciences</center>
### <center>King's College London</center>

## What we'll cover
* Data structures
* Abstract Data Types
* Time complexity
* Choosing the right one for a task

## Data Structures
* Put simply, the internal constructs (i.e. objects) used for storing data
* Many ways exist to store the same data; the choice depends on application, efficiency, or some other quality metric
* We have already seen these in the built-in types `list`, `dict`, `set`, etc.

* Lists are the simplest type, storing data in some contiguous way
* Individual items of data are accessed by index, so in lower-level languages the size, type, etc. must be known beforehand
* Python handles all of these issues for you
* For fast lists of numeric values use NumPy arrays

* We can differentiate between types of data structures
  * Collections like `list`, `tuple`, etc. are designed to store data objects generically; they provide routines and facilities related to the storage mechanism only, not for manipulating or computing with the stored data
  * Compound data types, like a simple vector in 3D space with `x`, `y`, and `z` numeric components or a NumPy array, store a specifically defined sort of data and provide operations over that data
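
* A minimal sketch of the distinction above: `Vec3` is a hypothetical class invented for illustration, not part of any library

```python
class Vec3:
    """A simple 3D vector: a compound type with operations over its data."""

    def __init__(self, x, y, z):
        self.x, self.y, self.z = x, y, z

    def __add__(self, other):
        # component-wise addition, an operation specific to this kind of data
        return Vec3(self.x + other.x, self.y + other.y, self.z + other.z)

    def dot(self, other):
        return self.x * other.x + self.y * other.y + self.z * other.z


# A collection only stores values: `+` on lists concatenates, it does not add
as_list = [1, 2, 3] + [4, 5, 6]

# The compound type defines what `+` means for its own data
a, b = Vec3(1, 2, 3), Vec3(4, 5, 6)
c = a + b
```

* Here `as_list` has six elements, while `c` is a new vector with components `5`, `7`, `9`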

## Example: Linked List


* A list of nodes, each with a data item and a reference to the next (or None) <img src="https://upload.wikimedia.org/wikipedia/commons/thumb/6/6d/Singly-linked-list.svg/600px-Singly-linked-list.svg.png">
* Interface works like a sequence, stack, and/or queue

In [1]:
class LLNode:
    def __init__(self, data, nextNode=None):
        self.data = data
        self.nextNode = nextNode # None or LLNode instance

In [2]:
class LinkedList:
    def __init__(self):
        self.head = None
        
    def append(self, data):
        """Add an item to the end of the list."""
        if self.head is None:
            self.head = LLNode(data) # start a new list
        else:
            tail = self.head
            while tail.nextNode is not None: # get tail of list
                tail = tail.nextNode
                
            tail.nextNode = LLNode(data) # add to the tail

* Could add more methods for putting an item at the front of the list, getting the n'th item, finding an item in the list, removal, etc.
* This would be used in place of a `list` or other type to take advantage of the time requirements of these operations when appropriate
* Internal implementation of the linked list is hidden behind this interface, that is, the details are abstracted away from any client
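
* One possible sketch of such extra methods, using Python's special methods so the list behaves like a sequence; the method choices (`__iter__`, `__len__`, `__getitem__`, `remove`) are assumptions for illustration, not the only way to do it

```python
class LLNode:
    def __init__(self, data, nextNode=None):
        self.data = data
        self.nextNode = nextNode  # None or LLNode instance


class LinkedList:
    def __init__(self):
        self.head = None

    def append(self, data):
        """Add an item to the end of the list."""
        if self.head is None:
            self.head = LLNode(data)
        else:
            tail = self.head
            while tail.nextNode is not None:
                tail = tail.nextNode
            tail.nextNode = LLNode(data)

    def __iter__(self):
        """Yield each data item in order; also enables `in` and `list(...)`."""
        node = self.head
        while node is not None:
            yield node.data
            node = node.nextNode

    def __len__(self):
        return sum(1 for _ in self)

    def __getitem__(self, n):
        """Return the n'th item (0-based, non-negative indices only)."""
        if n >= 0:
            for i, data in enumerate(self):
                if i == n:
                    return data
        raise IndexError("index out of range")

    def remove(self, data):
        """Remove the first node holding `data`, ValueError if absent."""
        prev, node = None, self.head
        while node is not None:
            if node.data == data:
                if prev is None:
                    self.head = node.nextNode  # removing the head
                else:
                    prev.nextNode = node.nextNode  # unlink from the chain
                return
            prev, node = node, node.nextNode
        raise ValueError("item not in list")


ll = LinkedList()
for value in "abc":
    ll.append(value)
```

* Clients can now write `ll[1]`, `len(ll)`, or `"c" in ll` without knowing that nodes and references sit behind the interface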


## Abstract Data Type
* An abstract data type (ADT) is composed of two constructs:
  * The data definition, stated as a structure of values
  * The operations which create and manipulate these structures
* Operations are exposed to clients as interface only, hiding the implementation details
* This abstraction allows the data structure to be used by clients without knowing anything about internal details
* Classes are a natural way to implement ADTs, combining data and operations in a single definition
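
* A hedged sketch of the idea: one stack interface with two interchangeable implementations. The names `Stack`, `ListStack`, `LinkedStack`, and `reverse` are hypothetical and chosen only for this example

```python
from abc import ABC, abstractmethod


class Stack(ABC):
    """The interface clients see: operations only, no storage details."""

    @abstractmethod
    def push(self, item): ...

    @abstractmethod
    def pop(self): ...

    @abstractmethod
    def is_empty(self): ...


class ListStack(Stack):
    """Implementation backed by a built-in list."""

    def __init__(self):
        self._items = []

    def push(self, item):
        self._items.append(item)

    def pop(self):
        return self._items.pop()

    def is_empty(self):
        return not self._items


class LinkedStack(Stack):
    """Implementation backed by chained (item, rest) pairs."""

    def __init__(self):
        self._head = None

    def push(self, item):
        self._head = (item, self._head)

    def pop(self):
        if self._head is None:
            raise IndexError("pop from empty stack")
        item, self._head = self._head
        return item

    def is_empty(self):
        return self._head is None


def reverse(items, stack):
    """Client code: relies only on the Stack interface."""
    for item in items:
        stack.push(item)
    out = []
    while not stack.is_empty():
        out.append(stack.pop())
    return out
```

* `reverse([1, 2, 3], ListStack())` and `reverse([1, 2, 3], LinkedStack())` give the same result, since the client never touches the internal storage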

## Time Complexity
* We estimate the amount of computation time an operation on a collection will take as a function of the structure's size
* Big-O notation is used to state the estimated worst-case time an operation will require as a function of the size `n`
* E.g. if operation `T` must visit each item in a collection and spend the same amount of time on each, the time requirement scales linearly with the size of the collection: `T(n) = O(n)`
* If an operation `R` must visit the whole collection when visiting each item, then the time scales by the square of the collection size, thus `R(n) = O(n**2)`
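
* To make the two cases concrete, this sketch counts loop steps instead of measuring time; `linear_steps` and `quadratic_steps` are hypothetical helper names

```python
def linear_steps(items):
    """Visit each item once: the step count grows as O(n)."""
    steps = 0
    for _ in items:
        steps += 1
    return steps


def quadratic_steps(items):
    """Visit the whole collection for each item: O(n**2)."""
    steps = 0
    for _ in items:
        for _ in items:
            steps += 1
    return steps


for n in (10, 100, 1000):
    data = range(n)
    print(n, linear_steps(data), quadratic_steps(data))
```

* Each tenfold increase in `n` multiplies the first count by 10 and the second by 100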

* Big-O notation is an approximation of the runtime cost, so there are generally a few classes of complexity that almost all algorithms fall into:
  * Constant: `O(1)`
  * Logarithmic: `O(log n)`
  * Linear: `O(n)`
  * Linearithmic: `O(n log n)`
  * Quadratic: `O(n**2)`
  * Polynomial: `O(n**c)`
* Some problems have no known polynomial-time algorithm. NP (non-deterministic polynomial time) is the class of problems whose solutions can be verified in polynomial time; the hardest problems in NP are called NP-complete and (almost certainly) require greater than polynomial time to solve

* Consider the `append` operation of the `LinkedList` type
* Each node in the list must be visited to find the end node
* This would be a linear-time function: `append(n) = O(n)`
* One clue to determining the time complexity of a routine is to see how many loops are nested in one another; `append` has only one loop, so `O(n)`

* Consider a `prepend` operation for `LinkedList`:

In [3]:
class LinkedList:
    # ---previous definitions here---
    def prepend(self,data):
        # replace head node, old head becomes next in chain
        self.head = LLNode(data, self.head)

* Operation not dependent on the size of the list, whether list is empty or massive this will take the same amount of time
* A constant-time operation: `prepend(n) = O(1)`

* `prepend` is going to be a lot faster than `append`, especially as the list grows
* Depending on application, using a linked list like this to build a list may be more efficient than using `list`, which has to resize its allocated storage occasionally (although `list.append` is still `O(1)` on average)
* Choosing the data structure having operations of the right time complexity is key to fast code
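
* A sketch comparing the two operations by counting node visits instead of timing them. `CountingLinkedList` is a hypothetical variant of `LinkedList` with a `visits` counter added purely for this illustration

```python
class LLNode:
    def __init__(self, data, nextNode=None):
        self.data = data
        self.nextNode = nextNode


class CountingLinkedList:
    """LinkedList variant that records how many nodes `append` walks past."""

    def __init__(self):
        self.head = None
        self.visits = 0

    def append(self, data):
        if self.head is None:
            self.head = LLNode(data)
        else:
            tail = self.head
            while tail.nextNode is not None:
                tail = tail.nextNode
                self.visits += 1  # one step along the chain
            tail.nextNode = LLNode(data)

    def prepend(self, data):
        # no traversal at all, whatever the list size
        self.head = LLNode(data, self.head)


appended = CountingLinkedList()
prepended = CountingLinkedList()
for i in range(1000):
    appended.append(i)
    prepended.prepend(i)

print(appended.visits, prepended.visits)  # 498501 0
```

* Building a 1000-item list with `append` takes roughly `n**2 / 2` steps in total, while `prepend` takes none; the trade-off is that the prepended list ends up in reverse order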

* E.g. `set` and `dict` rely on hashable types, i.e. types implementing `__hash__`, which returns a semi-unique constant hash value (an `int`)
* Internally they store values in arrays called hash tables
* Hash values are used to calculate an index in the table
* Insertion, search, and other operations on `set` and `dict` thus average `O(1)` time complexity
* Using `set` to accumulate unique instances of objects then converting to `list` may be faster than other approaches using lists only
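
* A small sketch of accumulating unique items; the three function names are hypothetical, and the ordered variant is one common pattern, not the only option

```python
def unique_with_list(items):
    """Lists only: each `not in` check scans the list, O(n**2) overall."""
    seen = []
    for item in items:
        if item not in seen:
            seen.append(item)
    return seen


def unique_with_set(items):
    """Set then list: O(n) on average, but the original order may be lost."""
    return list(set(items))


def unique_ordered(items):
    """Set for membership tests, list for output order: O(n) on average."""
    seen = set()
    out = []
    for item in items:
        if item not in seen:
            seen.add(item)
            out.append(item)
    return out


data = [3, 1, 3, 2, 1]
print(unique_with_list(data))  # [3, 1, 2]
print(unique_ordered(data))    # [3, 1, 2]
```

* All three require nothing beyond equality for the list version, but the set versions only work when the items are hashable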

* Lookup with the `in` keyword should be faster with sets than with lists:

In [4]:
nums=list(range(1000))

%timeit 150 in nums # scans from the start until 150 is found, O(n)
%timeit 750 in nums # scans further before finding 750, O(n)

1.65 µs ± 25 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
7.83 µs ± 221 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)


In [5]:
setnums=set(nums)

%timeit 150 in setnums # look up position for 150 using hash, O(1)
%timeit 750 in setnums # look up position for 750 using hash, O(1)
%timeit 1111 in setnums # look up position for 1111 using hash (not present), O(1)

31.4 ns ± 0.771 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)
43.2 ns ± 0.598 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)
30.2 ns ± 0.33 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)


## Set, Dict, Hashable
* A hashable object is one implementing `__hash__` which should return a pseudo-unique integer
* Hashes must not change throughout the object's lifetime, so they make sense only for immutable types (e.g. `tuple`)
* Calculating them is a complex subject for efficiency and security reasons; built-in immutable types hash their values, while user-defined classes by default hash on object identity (derived from the memory address in CPython)
* Collisions will occur; how these are handled depends on the implementation
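
* A minimal sketch of a user-defined hashable type; `Point` is a hypothetical class, treated as immutable by convention, which hashes by value by delegating to the hash of a tuple

```python
class Point:
    """2D point compared and hashed by value rather than by identity."""

    def __init__(self, x, y):
        self._x, self._y = x, y  # not meant to be changed after creation

    def __eq__(self, other):
        return (isinstance(other, Point)
                and (self._x, self._y) == (other._x, other._y))

    def __hash__(self):
        # equal objects must have equal hashes, so hash the same values
        # that __eq__ compares
        return hash((self._x, self._y))


p, q = Point(1, 2), Point(1, 2)
points = {p, q}          # p == q, so the set keeps only one of them
lookup = {p: "first"}
found = lookup[q]        # q finds the entry stored under p

try:
    hash([1, 2])         # mutable built-ins such as list are not hashable
    list_is_hashable = True
except TypeError:
    list_is_hashable = False
```

* Defining `__eq__` without `__hash__` makes a class unhashable, so the two are normally written together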

* Hashes are used to calculate an index in a table for fast insertion and lookup
* `set` could be implemented as a sparse list of objects of length `n` where the index of an object `a` is `hash(a)%n`
* `dict` could be implemented as a set of key-value pairs whose hash is that of the key only
* A collision, two distinct objects with the same index, can be addressed by placing the new object in another free slot of the table (open addressing), or by having a linked list at each index (chaining)
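
* A toy sketch of the scheme described above, not how CPython's `set` is actually implemented. `ChainedHashSet` is a hypothetical name, it uses a plain Python list per slot in place of a linked list, and it never resizes its table

```python
class ChainedHashSet:
    """Fixed-size hash table where each slot holds a chain of items."""

    def __init__(self, n=8):
        self.table = [[] for _ in range(n)]

    def _bucket(self, item):
        # the hash value picks the slot: index = hash(item) % n
        return self.table[hash(item) % len(self.table)]

    def add(self, item):
        bucket = self._bucket(item)
        if item not in bucket:  # only this one chain is scanned
            bucket.append(item)

    def __contains__(self, item):
        return item in self._bucket(item)

    def __len__(self):
        return sum(len(bucket) for bucket in self.table)


s = ChainedHashSet()
s.add(1)
s.add(9)   # 9 % 8 == 1, so this collides with 1 and joins its chain
s.add(1)   # duplicate, ignored
```

* With few collisions each chain stays short, which is where the average `O(1)` cost comes from; a real implementation grows the table as it fills to keep it that way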

## Other Structures
* More complex forms of sequential collections exist:
  * Stack: items can only be added to or removed from end of list (LIFO: last-in-first-out)
  * Queue: items can only be added to the end but removed from the start (FIFO: first-in-first-out)
  * Deque (double-ended queue): items can be added to or removed from either the start or the end
  * Priority queue: items are stored in order as defined by a priority criterion, only the highest priority item can be removed
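
* A short sketch of these four structures using the standard library: a plain `list` as a stack, `collections.deque` as queue and deque, and `heapq` as a priority queue. Note that `heapq` treats the smallest value as the highest priority; the task names are made up for the example

```python
from collections import deque
import heapq

# Stack (LIFO): add and remove at the end of a list
stack = []
for item in (1, 2, 3):
    stack.append(item)
top = stack.pop()                 # 3, the last item added

# Queue (FIFO): add at the end, remove from the start
queue = deque()
for item in ("a", "b", "c"):
    queue.append(item)
first = queue.popleft()           # "a", the first item added

# Deque: add or remove at either end
d = deque([1, 2, 3])
d.appendleft(0)
d.append(4)
ends = (d.popleft(), d.pop())     # (0, 4)

# Priority queue: (priority, item) pairs, smallest priority comes out first
pq = []
heapq.heappush(pq, (2, "write report"))
heapq.heappush(pq, (1, "fix bug"))
heapq.heappush(pq, (3, "lunch"))
most_urgent = heapq.heappop(pq)   # (1, "fix bug")
```

* `deque` is preferred over `list` for queues because removing from the start of a `list` is `O(n)`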

## Other Structures

* Other non-linear collections exist:
  * Binary trees: composed of nodes with data plus left and right subnodes
  * Red-black trees: binary search tree with a color assigned to each node, used to keep the tree balanced upon insertion to minimize lookup cost
  * Heaps: a type of binary tree in which each parent is ordered before its children (partially sorted), stored as a list where element index indicates tree position
  * N-ary trees: like binary trees but with up to N subnodes per node
  * Graphs: composed of nodes containing data plus links to any other arbitrary node, if links are directional it is a digraph (graph theory is a huge area)
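
* A brief sketch of three of these structures. `TreeNode`, `insert`, and `in_order` are hypothetical names; the search-tree ordering (smaller values to the left) is an assumption added for the example, and the digraph is stored as a dictionary of adjacency lists, one common representation among several

```python
import heapq


class TreeNode:
    """Binary tree node: data plus left and right subnodes."""

    def __init__(self, data, left=None, right=None):
        self.data = data
        self.left = left
        self.right = right


def insert(node, data):
    """Insert with smaller values going left, others right."""
    if node is None:
        return TreeNode(data)
    if data < node.data:
        node.left = insert(node.left, data)
    else:
        node.right = insert(node.right, data)
    return node


def in_order(node):
    """Yield left subtree, then the node, then right subtree."""
    if node is not None:
        yield from in_order(node.left)
        yield node.data
        yield from in_order(node.right)


root = None
for value in [5, 2, 8, 1, 9]:
    root = insert(root, value)
sorted_values = list(in_order(root))   # [1, 2, 5, 8, 9]

# Heap stored as a list: the children of index i sit at 2*i + 1 and 2*i + 2
heap = [5, 2, 8, 1, 9]
heapq.heapify(heap)
smallest = heap[0]                      # 1, the root of the heap

# Digraph: each node maps to the nodes its outgoing links point to
digraph = {"a": ["b", "c"], "b": ["c"], "c": []}
```

* The heap is only partially ordered: every parent is no larger than its children, but the list as a whole is not sorted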

# That's it! Questions?

## Next: Exercises

## Tomorrow: algorithms