# Data Structures

# (Re)Implementing data structures

As a learning exercise, we're going to see how to build a few classic data structures using Python.  Python provides a variety of nice data structures ready for use, but rebuilding some classic data structures will help you understand more complex data structures generally.

A key idea that will make all of this work is that when an object has an attribute (field) that is itself an object, the field actually holds a reference to that object: a link. These links "hold together" complex multi-part data structures, in which each piece of the data structure is itself an object.
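To make the idea of a field holding a reference concrete, here is a minimal sketch. The class `Box` and its fields are hypothetical names invented for this illustration only.

```python
class Box:
  """Hypothetical class used only for this illustration."""
  def __init__(self, label):
    self.label = label
    self.inner = None   # will hold a reference to another Box

a = Box("outer")
b = Box("inner")
a.inner = b             # a.inner now refers to the same object as b

b.label = "renamed"     # change the object through one name...
print(a.inner.label)    # ...and the change is visible through the other: renamed
print(a.inner is b)     # True: one object, two references
```

Nothing was copied when `a.inner = b` ran; both names lead to the same object, which is what lets nodes be chained together.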

# Search on a Linked List

The first structure we'll look at is a linked list.  We don't particularly need it in Python with the existence of built-in lists, but it's the easiest of these complex multi-part data structures to reason about.

Something that you might find surprising is that we define only a class for the individual "nodes", and don't create a separate class for the list. Knowing where the first node is amounts to knowing where the whole list is: each node holds a reference to the next node, and the last node's next field is None, which signals that nothing follows it.

In [2]:
class ll_node:
  def __init__(self, num):
    self.number = num
    self.next = None

  def append(self, num):
    if self.next is None:     # End of the list
      self.next = ll_node(num)
    else:
      self.next.append(num)
    
  def contains(self, othernum):
    if self.number == othernum:
      return True
    elif self.next is None:
      return False
    return self.next.contains(othernum)

In [3]:
mylist = ll_node(5)
mylist.append(7)
mylist.append(9)
print(mylist.contains(7))
print(mylist.contains(6))

True
False


Code on lists tends to be of the form "if the current node is relevant, do x, else proceed to the next element." That's certainly the case here for the contains method. It has three cases: we found the value; we didn't find it and there's nowhere else to go; or we didn't find it but we can move on to the next node.

Both contains() and append() make recursive calls: the method calls itself on the next node. Since we don't know the length of the list, and so don't know how many steps we need to take down it, it makes sense to write code that talks about the "head" of the list and then proceeds recursively to the "rest" of the list.
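The same search can also be written with a loop. This matters for long lists, because Python caps recursion depth (the default limit is typically 1000 in CPython). The sketch below is one possible loop version; `contains_iterative` is a hypothetical helper name, not part of the class above.

```python
class ll_node:
  def __init__(self, num):
    self.number = num
    self.next = None

def contains_iterative(head, othernum):
  """Walk the list with a loop instead of recursion (hypothetical helper)."""
  current = head
  while current is not None:
    if current.number == othernum:
      return True
    current = current.next
  return False

# Build the list 0 -> 1 -> ... -> 4999 by keeping track of the tail
head = ll_node(0)
tail = head
for i in range(1, 5000):
  tail.next = ll_node(i)
  tail = tail.next

print(contains_iterative(head, 4999))
print(contains_iterative(head, -1))
```

The three cases are the same as in the recursive version; the loop variable `current` takes the place of `self` in each recursive call.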

The search is slow, and linked lists tend to be a little slow generally: if the list is unordered, the desired element could be waiting at the very end, so a search may have to visit every node. That's "worst-case linear" time to find something in the list, meaning the time grows in proportion to the list's length.


Python lists are actually arrays that get resized when they run out of room; they don't follow this structure at all.  But being able to think about a linked list is often seen as a fundamental skill, for the purpose of coding interviews and such.

A "doubly-linked list" also has a field that points to the previous item in the list, making a backwards traversal possible.  This isn't really necessary in many cases, and in Python, you'll mostly work with the built-in lists anyway.  But this also shows up in coding interviews.
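As a rough sketch of the doubly-linked idea, the node below adds a `prev` field alongside `next`. The class name `dll_node` and the choice to have `append` return the new last node are assumptions made for this example.

```python
class dll_node:
  """Hypothetical doubly-linked node: like ll_node plus a prev link."""
  def __init__(self, num):
    self.number = num
    self.next = None
    self.prev = None

  def append(self, num):
    # Walk to the end, then link the new node in both directions
    if self.next is None:
      new_node = dll_node(num)
      self.next = new_node
      new_node.prev = self
      return new_node
    return self.next.append(num)

head = dll_node(5)
head.append(7)
tail = head.append(9)   # append returns the new last node

# Backwards traversal from the tail using the prev links
values = []
node = tail
while node is not None:
  values.append(node.number)
  node = node.prev
print(values)   # [9, 7, 5]
```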

# Example of Linked List Code

Suppose we want to write a method filter_less_than(), which takes a number, and returns a linked list with just the values in the current list that are smaller than the number.  So filter_less_than(6) with a linked list [5,4,6,9] returns [5,4].

The solution follows a pattern of getting the rest of the list, then attaching something to it or not depending on whether the condition is met.

In [6]:
class ll_node:
  def __init__(self, num):
    self.number = num
    self.next = None
    
  def __str__(self):
    if self.next is not None:
      result = str(self.number) + "-" + str(self.next)
    else:
      result = str(self.number)
    return result
    
  def append(self, num):
    if self.next is None:
      self.next = ll_node(num)
    else:
      self.next.append(num)
    
  def contains(self, othernum):
    if self.number == othernum:
      return True
    elif self.next is None:
      return False
    return self.next.contains(othernum)
  
  def filter_less_than(self, number):
    if self.next is not None:
      result = self.next.filter_less_than(number)
    else:
      result = None
    if self.number < number:
      new_node = ll_node(self.number)
      new_node.next = result
      return new_node
    return result  # Drop this node: its value is not less than number


node1 = ll_node(6)
node2 = ll_node(8)
node1.next = node2
node3 = ll_node(10)
node2.next = node3
print(node1.filter_less_than(9))

6-8


# Trees

Python has no built-in general-purpose tree type, but trees come up in a wide variety of contexts. Written text gets parsed into trees, code itself gets parsed into trees by compilers, game-playing AIs view the possible future moves as a tree, search algorithms build trees while exploring, and so on.

Trees are generally good at representing a hierarchy, like folders in a filesystem.




In a rooted tree, exactly one node -- the "root" -- has no parent nodes, and all other nodes have one parent node.  In a binary tree, each node has up to two children.  (In other trees, there may not be a cap on the number of children.) A binary tree is a bit easier to think about, but it wouldn't be good at representing some situations, like a hard drive where there could be more than two folders at a given level of the directory.

Our example will be just searching for a file in a tree, similar to the linked list application.  The "leaves," or elements with no children, will represent files, while interior nodes represent folders.  To keep things simple, we'll assume a binary tree, even though we'd want to use a list of children for trees that branch more.

In [5]:
class FolderTree:
  # A binary tree node: its fields are the left and right links plus a value
  def __init__(self, val):
    self.left = None
    self.right = None
    self.val = val
  
  def addLeft(self, node):
    self.left = node
  
  def addRight(self, node):
    self.right = node
  
  def find(self, v):
    if self.val == v:
      return True
    if self.left and self.left.find(v):
      return True
    if self.right and self.right.find(v):
      return True
    return False

root = FolderTree("root")
leftparent = FolderTree("folder1")
rightparent = FolderTree("folder2")
leftleftchild = FolderTree("wow.exe")
leftrightchild = FolderTree("xls.exe")
rightleftchild = FolderTree("lec12.pdf")
rightrightchild = FolderTree("lec14.pdf")
leftparent.addLeft(leftleftchild)
leftparent.addRight(leftrightchild)
rightparent.addLeft(rightleftchild)
rightparent.addRight(rightrightchild)
root.addLeft(leftparent)
root.addRight(rightparent)

print(root.find("wow.exe"))
print(root.find("lec13.exe"))


True
False


The keyword None can often play a handy role of "no value here" for both links (to signal end of the line) and data (maybe interior nodes don't get values).
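For trees that branch more than two ways, a list of children can replace the left and right fields, and an empty list then plays the role that None plays above. The following is only a sketch of that variant; `FolderNode` and `add_child` are hypothetical names.

```python
class FolderNode:
  """Hypothetical n-ary variant of FolderTree: any number of children."""
  def __init__(self, val):
    self.val = val
    self.children = []   # an empty list means this node is a leaf

  def add_child(self, node):
    self.children.append(node)

  def find(self, v):
    if self.val == v:
      return True
    # any() stops as soon as one subtree reports a match
    return any(child.find(v) for child in self.children)

root = FolderNode("root")
docs = FolderNode("docs")
root.add_child(docs)
root.add_child(FolderNode("music"))
root.add_child(FolderNode("photos"))
docs.add_child(FolderNode("lec12.pdf"))

print(root.find("lec12.pdf"))
print(root.find("lec13.pdf"))
```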


The above implementation leaves a little something to be desired in terms of efficiency, because in the worst case the whole tree needs to be searched. If we instead require that every value in a node's left subtree be less than the node's value, and every value in its right subtree be greater, then we get a "binary search tree," and searching becomes much faster as long as the tree is also roughly balanced.

In [7]:
class BinarySearchTree:
  # A binary tree node: its fields are the left and right links plus a value
  def __init__(self, val):
    self.left = None
    self.right = None
    self.val = val
  
  def addLeft(self, node):
    self.left = node
  
  def addRight(self, node):
    self.right = node
  
  def find(self, v):
    if self.val == v:
      return True
    if v < self.val:
      if self.left:
        print("Going Left")
        return self.left.find(v)
      else:
        return False
    else:
      if self.right:
        print("Going Right")
        return self.right.find(v)
      else:
        return False

root = BinarySearchTree("m")
leftparent = BinarySearchTree("f")
rightparent = BinarySearchTree("q")
leftleftchild = BinarySearchTree("a")
leftrightchild = BinarySearchTree("h")
rightleftchild = BinarySearchTree("o")
rightrightchild = BinarySearchTree("u")
leftparent.addLeft(leftleftchild)
leftparent.addRight(leftrightchild)
rightparent.addLeft(rightleftchild)
rightparent.addRight(rightrightchild)
root.addLeft(leftparent)
root.addRight(rightparent)

print(root.find("h"))
print(root.find("d"))

Going Left
Going Right
True
Going Left
Going Left
False


The code is superficially similar, but at each node at most one of the two recursive calls is made, so a search follows a single path down from the root instead of exploring the whole tree.

Binary search trees illustrate the broader principle that a little bit of organization in the data can make the code a lot faster. (Here the time a search takes grows logarithmically with the number of items, as long as the tree is roughly balanced. Logarithmic growth is the inverse of exponential growth: doubling the number of items adds only about one more step to the search.)
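To see the logarithmic growth in numbers, the sketch below builds perfectly balanced trees of several sizes and counts how many nodes the worst search visits. The helpers `build_balanced` and `steps_to_find`, and the bare `Node` class, are assumptions made for this illustration.

```python
import math

class Node:
  def __init__(self, val):
    self.val = val
    self.left = None
    self.right = None

def build_balanced(sorted_vals):
  """Build a balanced BST by making the middle element the root."""
  if not sorted_vals:
    return None
  mid = len(sorted_vals) // 2
  node = Node(sorted_vals[mid])
  node.left = build_balanced(sorted_vals[:mid])
  node.right = build_balanced(sorted_vals[mid + 1:])
  return node

def steps_to_find(node, v):
  """Return how many nodes are visited while searching for v."""
  steps = 0
  while node is not None:
    steps += 1
    if v == node.val:
      break
    node = node.left if v < node.val else node.right
  return steps

results = {}
for n in (15, 1023, 4095):
  tree = build_balanced(list(range(n)))
  worst = max(steps_to_find(tree, v) for v in range(n))
  results[n] = worst
  print(n, worst, math.log2(n + 1))
```

With these sizes the worst search visits 4, 10, and 12 nodes, matching log2(n + 1), whereas a linked list of 4095 items could need 4095 steps.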

# Example of Binary Search Tree code

Binary search trees insert a value by following the same path a search for that value would take, then placing a new node wherever that search runs out of tree. So an insert operation looks very similar to a find, except for what it does when it gets to the right place. (In the version below, a value equal to an existing one goes to the left.)

In [8]:
class BinarySearchTree:
  # A binary tree node: its fields are the left and right links plus a value
  def __init__(self, val):
    self.left = None
    self.right = None
    self.val = val

  def __str__(self):  # To help visualize the output
    result = str(self.val)  # str() so non-string values also work
    if self.left:
      result += ", L" + str(self.left)
    if self.right:
      result += ", R" + str(self.right)
    return result

  def addLeft(self, node):
    self.left = node
  
  def addRight(self, node):
    self.right = node
  
  def find(self, v):
    if self.val == v:
      return True
    if v < self.val:
      if self.left:
        return self.left.find(v)
      else:
        return False
    else:
      if self.right:
        return self.right.find(v)
      else:
        return False

  def insert(self, v):
    if v <= self.val:
      if self.left is None:
        self.left = BinarySearchTree(v)
      else:
        self.left.insert(v)
    else:
      if self.right is None:
        self.right = BinarySearchTree(v)
      else:
        self.right.insert(v)

root = BinarySearchTree("m")
leftparent = BinarySearchTree("f")
rightparent = BinarySearchTree("q")
root.addLeft(leftparent)
root.addRight(rightparent)

root.insert("z")

print(root)

m, Lf, Rq, Rz
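Because insert simply follows the search path, the shape of the tree depends on the order in which values arrive. The sketch below uses the same insert logic as above and compares two insertion orders; `height` and `build` are hypothetical helpers added for the comparison.

```python
class BinarySearchTree:
  def __init__(self, val):
    self.left = None
    self.right = None
    self.val = val

  def insert(self, v):
    # Same logic as the insert method above
    if v <= self.val:
      if self.left is None:
        self.left = BinarySearchTree(v)
      else:
        self.left.insert(v)
    else:
      if self.right is None:
        self.right = BinarySearchTree(v)
      else:
        self.right.insert(v)

def height(node):
  """Number of nodes on the longest root-to-leaf path (hypothetical helper)."""
  if node is None:
    return 0
  return 1 + max(height(node.left), height(node.right))

def build(keys):
  root = BinarySearchTree(keys[0])
  for k in keys[1:]:
    root.insert(k)
  return root

sorted_tree = build(list("abcdefg"))   # keys arrive in sorted order
mixed_tree = build(list("dbfaceg"))    # same keys, middle values first

print(height(sorted_tree))   # 7: every node hangs off the right, like a linked list
print(height(mixed_tree))    # 3: a balanced tree
```

This is why the "roughly balanced" condition matters: inserting already-sorted data with this simple insert gives a tree that searches no faster than a linked list.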


# Dynamic arrays

Python's list isn't a linked list, but a dynamic array.  This means a fixed amount of memory is allocated to the items, and if adding more items exceeds the memory that was allocated, everything needs to be moved to a new, more spacious location. 

Here's a re-implementation using a NumPy array.

In [None]:
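import numpy as np

class dynamic_array:
  def __init__(self, initial_size):
    self.memory = np.zeros(initial_size)  # backing storage (floats)
    self.occupied = 0                     # number of slots in use
    self.size = initial_size              # current capacity

  def append(self, val):
    if self.occupied == self.size:
      # Out of room: allocate twice the space and copy everything over
      new_size = max(1, self.size * 2)    # max() handles an initial size of 0
      new_memory = np.zeros(new_size)
      for i in range(self.occupied):
        new_memory[i] = self.memory[i]
      self.memory = new_memory
      self.size = new_size
    self.memory[self.occupied] = val
    self.occupied += 1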
  

Python uses dynamic arrays instead of linked lists because a variety of operations, like reading the value in the middle of the list, are faster when you don't need to start from one end of the list and work your way in. But there is a slight "speed bump" every time the underlying array gets full, because a new home needs to be found for the data and every item has to be copied there.
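How big is the speed bump overall? The sketch below simulates the doubling strategy without storing any data and counts how many element copies the resizes cause. The function `count_copies` is a hypothetical helper, and the doubling rule is the one used in the class above, not necessarily the exact growth rule of Python's own list.

```python
def count_copies(n_appends, initial_size=1):
  """Simulate a doubling array; return (resizes, elements copied)."""
  size = initial_size
  occupied = 0
  resizes = 0
  copied = 0
  for _ in range(n_appends):
    if occupied == size:
      copied += occupied   # every existing element moves to the new array
      size *= 2
      resizes += 1
    occupied += 1
  return resizes, copied

for n in (10, 1000, 100000):
  resizes, copied = count_copies(n)
  print(n, resizes, copied, copied / n)
```

Under this doubling rule the total number of copies stays below twice the number of appends, so the occasional expensive resize averages out to a small constant cost per append.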


You may be wondering, looking at the preceding code: what happens to all that old memory that was left lying around? Python's memory management frees any data that is no longer being "held on to" with a reference (CPython does this mainly by counting references, with a garbage collector to handle reference cycles). Once nothing refers to the old array any more, its memory is freed for later allocation.
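One way to watch this happen is with a weak reference, which can observe an object without keeping it alive. The sketch below assumes CPython, where an object is usually reclaimed as soon as its last reference goes away; other Python implementations may reclaim it later. The class `Block` is a hypothetical stand-in for the old array.

```python
import gc
import weakref

class Block:
  """Hypothetical stand-in for a chunk of memory held by a data structure."""
  def __init__(self, size):
    self.data = [0] * size

old = Block(4)
watcher = weakref.ref(old)   # a weak reference does not keep the object alive

print(watcher() is old)      # True: the object is still reachable via `old`

old = Block(8)               # rebind the name, much as append() rebinds self.memory
gc.collect()                 # request a collection; CPython usually frees sooner
print(watcher() is None)     # True once the old object has been reclaimed
```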