# Python Data Structures

## Tom Davison

- Teaching Fellow in Computational Data Science
- MSci Geology and Geophysics
- PhD Planetary Science

### Research interests
- Impact cratering
- Asteroid collisions
- Shock physics
- Early solar system processes
- Planetary defence

## How to find me

- [**Tom Davison**](https://profiles.imperial.ac.uk/thomas.davison)
- email: thomas.davison@imperial.ac.uk
- GitHub: `@tmdavison`
- MS Teams: `@Davison, Tom`
- Office: 4.85, Royal School of Mines

## Today's learning outcomes

At the end of the day you should be able to:
- Describe the features and properties of the built-in Python data structures
- Select the best choice of data structure for a given usage
- Understand more about how Python classes work and when to use them

## What is a data structure?

On Monday we reviewed the raw storage of data in binary representations.

Python (and computer science in general) also stores data in a number of more complex ways:

- for efficiency
- for convenience

Today we will look at some of them:

1. Python's built-in types - "what's in the box?"
2. Extended data structures in the Python standard library
3. Review of other useful data structures in common Python packages
4. More info on data structures that **you** create yourself (Python classes).

# Part I : Core Python data structures & types

## Objects

All Python data is an "object".

Unlike C/Java, Python functions will accept any object as input ("duck typing"). 

Objects can be "bound" to names using the `=` operator.

<img src="./images/name1.png">

In [1]:
x = 7

<img src="./images/name2.png">

In [2]:
y = x

<img src="./images/name4.png">

In [3]:
a = [1, 2, 3, 4]
b = a

**(live coding)**

Objects come in two "flavours":
- _mutable_ and
- _immutable_.

Mutable objects allow their underlying data to change, whereas immutable objects are fixed when they're created.
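
A minimal sketch of the difference, using a list (mutable) and a tuple (immutable). The variable names are illustrative only.

```python
# A list is mutable: changing it in place affects every name bound to it.
a = [1, 2, 3, 4]
b = a              # b is a second name for the same list object
b.append(5)
print(a)           # [1, 2, 3, 4, 5]
print(a is b)      # True

# A tuple is immutable: in-place changes are rejected.
t = (1, 2, 3)
try:
    t[0] = 99
except TypeError as err:
    print(f"TypeError: {err}")

# "Updating" an immutable object really binds the name to a new object.
u = t
t = t + (4,)
print(u, t)        # (1, 2, 3) (1, 2, 3, 4)
print(u is t)      # False
```

Note that `b.append(5)` changed what `a` shows, because both names refer to one list, whereas `t + (4,)` left the original tuple (still bound to `u`) untouched.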

## Integers

It may be a surprise, but the built-in Python integers (& floats & strings) are immutable.

To "update" an integer, Python creates a new object with the updated value. We can check this using the `id()` function.

This returns an integer that is unique to each Python object for as long as that object exists in memory.

In [4]:
x = 3
print(f'1: id of object named x is {id(x)}')
x += 5
print(f'2: id of object named x is {id(x)}')

1: id of object named x is 4307935296
2: id of object named x is 4307935456


For efficiency, CPython actually stores a single set of Python objects representing the small integers (from -5 to 256).

Remember we can use the `is` operator to check if two variable names point to the same object.

In [5]:
x = 3; y = 3
print(f'ids: {id(x)}, {id(y)} same object: {x is y}')

a = 12345; b = 12345
print(f'ids: {id(a)}, {id(b)} same object: {a is b}')    

ids: 4307935296, 4307935296 same object: True
ids: 4358892112, 4358891856 same object: False


As we heard on Monday, Python has "arbitrary precision" integers.

The fundamental limit on the largest integer your Python code can work with is the memory available on your computer.

In [6]:
import sys

a = 10**0
print(f'a uses {sys.getsizeof(a)} bytes')
b = 10**5
print(f'b uses {sys.getsizeof(b)} bytes')
c = 10**10
print(f'c uses {sys.getsizeof(c)} bytes')
d = 10**1000
print(f'd uses {sys.getsizeof(d)} bytes')

a uses 28 bytes
b uses 28 bytes
c uses 32 bytes
d uses 468 bytes


### Floats, strings and complex numbers

- Just like Python integers (`int`), floating point numbers (`float`), character strings (`str`) and complex numbers (`complex`) are all immutable.
- Updating or modifying any of these data types results in Python creating a new object, and changing the variable to reference this new object.

## Lists

Python lists are formed using `[]` brackets, or using the `list()` function on another collection or iterator.


```python
list_a = [1, 3, 5]
list_b = list((3, 4, 5))
list_c = [ _ for _ in range(6, 10) ]
```

Python lists are:
- mutable
- one-dimensional (i.e. single index)
- ordered

_collections_ of other Python objects.

They also support _iteration_ over their elements, in index order.

Lists are (relatively) fast at:
- finding/changing their elements,
- adding/deleting the _last_ element in the list.

They are much slower at adding or removing elements anywhere other than the end. The cost of these operations scales with the length of the list.
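
A rough sketch of this cost difference, using the standard `timeit` module in place of the `%timeit` notebook magic so that it runs in plain Python. The function names and the value of `n` are illustrative assumptions, and the timings will vary between machines.

```python
import timeit

def append_at_end(n):
    """Build a list by adding each new element at the end."""
    out = []
    for i in range(n):
        out.append(i)
    return out

def insert_at_front(n):
    """Build a list by inserting each new element at index 0."""
    out = []
    for i in range(n):
        out.insert(0, i)   # every existing element has to shift along by one
    return out

n = 10_000
t_end = timeit.timeit(lambda: append_at_end(n), number=3)
t_front = timeit.timeit(lambda: insert_at_front(n), number=3)
print(f"append at end  : {t_end:.4f} s")
print(f"insert at front: {t_front:.4f} s")
```

Increasing `n` should widen the gap, since each insert at the front has to move more elements as the list grows.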

### List comprehension

The pattern `[ _ for _ in range(6, 10) ]` is called a list comprehension:
- _Usually_ the fastest way to build a list in Python.
- Good idea to use as long as it stays readable.

In [7]:
def list_method_one():
    out = []
    for i in range(10000):
        out.append(i)
    return out

%timeit a = list_method_one()

195 μs ± 250 ns per loop (mean ± std. dev. of 7 runs, 1,000 loops each)


In [8]:
def list_method_two():
    return [i for i in range(10000)]

%timeit b = list_method_two()

155 μs ± 404 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)


Notice that method two (the list comprehension) runs faster than the loop in method one (the loop takes about 25% longer on my computer).

### Implementation

In CPython (i.e. the Python on most computers), lists are a block of memory holding a sequence of "memory addresses".

|list index| address|
|-|-|
|0|#2343|
|1|#323|
|2|#12|
|3|#323|
|4| _not used yet_|
|5| _not used yet_|
|6| _not used yet_|
|7| _not used yet_|

Looking up the `n`th element is quick (jump straight to the `n`th slot in the block, then follow the address stored there).

Implementations grab spare memory to allow room to expand. Adding a new element (`x.append()`) is normally cheap unless the block is "full".

Appending to a full list can require copying the whole block of addresses into a new, larger block. Inserting & deleting elements is also expensive, because every later element has to shift along by one.

|index| _original_ | _append_ | _insert_ | _delete_ |
|-|-|-|-|-|
|0|#2343|#2343|#2343|#2343|
|1|#323|#323|#323|#323|
|2|#12|#12|#496|#323|
|3|#323|#323|#12|#5|
|4|#5|#5|#323|#67|
|5|#67|#67|#5| _not used yet_|
|6| _not used yet_|#226|#67| _not used yet_|

**(live coding)**

## Tuples

Python tuples are formed using commas (usually in `()` brackets) or using the `tuple()` function on another collection or iterator.


```python
tuple_a = 1, 3, 5
tuple_b = (2, 3, 4)
tuple_c = tuple([3, 4, 5])
```

Python tuples are:
- immutable
- one-dimensional
- ordered

_collections_ of other Python objects.

Tuples are the immutable equivalent of lists.

Tuples are fast at:
-  finding their elements

They tend to use slightly less memory than a list.

If you know that something should never change, then use a tuple.

### Implementation

In CPython, tuples are implemented a lot like lists.

The block can be chosen to be exactly the right size to hold all the addresses, as there is no need for room to expand.
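
A quick check of the memory claim, again assuming CPython (the byte counts are implementation details):

```python
import sys

items = list(range(10))
as_list = list(items)
as_tuple = tuple(items)

# getsizeof reports the container itself, not the integers it refers to.
print(f"list : {sys.getsizeof(as_list)} bytes")
print(f"tuple: {sys.getsizeof(as_tuple)} bytes")
```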

### Tuple unpacking

A useful feature of tuples is that they can be unpacked into multiple variables in a single statement. For example, you can write

```python
a, b, c = 1, 2, 3
```

to assign the integers 1, 2, and 3 to the variable names `a`, `b` and `c`.
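
A few more common uses of unpacking, as a short sketch (the function `min_and_max` is a hypothetical helper written for this example):

```python
# Swap two names without a temporary variable.
a, b = 1, 2
a, b = b, a
print(a, b)              # 2 1

# Functions that "return several values" really return one tuple.
def min_and_max(values):
    return min(values), max(values)

lo, hi = min_and_max([4, 8, 15, 16, 23, 42])
print(lo, hi)            # 4 42

# A starred name collects whatever is left over into a list.
first, *rest = (10, 20, 30, 40)
print(first, rest)       # 10 [20, 30, 40]
```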

**(live coding)**

## Generators

Unlike lists, you can't build a tuple using a comprehension (tuple comprehensions don't exist).

But there is a similar syntax which looks like what you might expect a tuple comprehension to look like; these are called **generator expressions**, and they create **generators**.

In [9]:
for x in (x**2 for x in range(3)):
    print(x)

0
1
4


Generators can only be iterated over once. They do not store their contents, but instead generate the relevant values as they are needed, and then discard them once they've been used.

They are therefore useful as they are memory efficient.

However, this can lead to confusion:

In [10]:
gen = (x for x in range(4))
print(list(gen))
print(list(gen))

[0, 1, 2, 3]
[]
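
The memory efficiency mentioned above can be sketched by comparing a list comprehension with the equivalent generator. This assumes CPython, and the reported sizes cover only the container objects themselves, not the integers they produce.

```python
import sys

n = 100_000
squares_list = [x**2 for x in range(n)]   # all values stored at once
squares_gen = (x**2 for x in range(n))    # values produced on demand

print(f"list     : {sys.getsizeof(squares_list)} bytes")
print(f"generator: {sys.getsizeof(squares_gen)} bytes")

# Both give the same total, but the generator is exhausted afterwards.
total_from_list = sum(squares_list)
total_from_gen = sum(squares_gen)
print(total_from_list == total_from_gen)   # True
print(sum(squares_gen))                    # 0
```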


## Sets

Python sets are formed using `{}` brackets or using the `set()` function on another collection or iterator. (An empty `{}` creates a dict, so use `set()` to make an empty set.)

```python
set_a = {1, 3, 5}
set_b = set([3, 4, 5])
```

Python sets are:
- unordered
- mutable

_collections_ of **distinct** objects.

Sets are quick at testing membership using the `x in S` syntax.

In [11]:
import random

x = 12345
S = { random.randint(0, 1_000_000) for _ in range(1000)}
L = list(S)
%timeit x in S
%timeit x in L

15.3 ns ± 0.0143 ns per loop (mean ± std. dev. of 7 runs, 100,000,000 loops each)
4.88 μs ± 5.24 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)


So it's a couple of orders of magnitude quicker to test if something is in the set compared to the list.
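
Because a set only holds distinct objects, it is also a handy way to remove duplicates. A small sketch with made-up values:

```python
readings = [3, 1, 3, 2, 1, 3]

unique = set(readings)               # duplicates are dropped
print(len(readings), len(unique))    # 6 3

unique.add(2)                        # already present, so nothing changes
unique.add(7)                        # new value, so the set grows
print(sorted(unique))                # [1, 2, 3, 7]
print(7 in unique, 5 in unique)      # True False
```

`sorted()` is used for printing because a set itself has no defined order.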

### Implementation

CPython sets are implemented via a "hash table".

This consists of:
- a large (mostly empty) block of memory, of length N
- a "hash" function which maps inputs to the numbers 0 to N-1.

|index| address|
|-|-|
|0| value1|
|1| _not used yet_|
|2| value3|
|3| _not used yet_|
|4| _not used yet_|
|5| _not used yet_|
|6| value2|
|7| _not used yet_|

To add a member:
- calculate hash of value
- insert in that space

To check if a value is in the set:
- calculate hash of value
- check that space

This is all relatively fast.

You may have spotted that a hash function can give the same location for two different inputs.

We need a process to deal with these "collisions".

Python's method is essentially to carry on looking in other slots, following a predictable pattern based on the hash, until it finds either:
- what it's looking for, or
- an empty entry.

This is why sets need to be kept fairly empty, which wastes memory.
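
The idea can be sketched with a toy class. This is an illustration only, not CPython's implementation: `ToyHashSet` is a hypothetical name, it uses simple linear probing (try the next slot along) rather than CPython's probing pattern, it never grows, it cannot delete, and it uses `None` to mark empty slots, so `None` itself cannot be stored.

```python
class ToyHashSet:
    """Toy open-addressing hash set (illustrative only, not CPython's code)."""

    def __init__(self, n_slots=8):
        self._slots = [None] * n_slots   # mostly empty block of "memory"

    def _probe(self, value):
        """Yield the slot indices to try, starting at hash(value) % N."""
        n = len(self._slots)
        start = hash(value) % n
        for step in range(n):
            yield (start + step) % n     # linear probing: try the next slot

    def add(self, value):
        for i in self._probe(value):
            if self._slots[i] is None:   # empty slot: store the value here
                self._slots[i] = value
                return
            if self._slots[i] == value:  # already present: nothing to do
                return
        raise RuntimeError("table is full (a real set would grow instead)")

    def __contains__(self, value):
        for i in self._probe(value):
            if self._slots[i] is None:   # reached an empty slot: not present
                return False
            if self._slots[i] == value:
                return True
        return False


s = ToyHashSet()
for v in (1, 9, 17):    # in CPython these all map to slot 1 when N = 8
    s.add(v)
print(s._slots)
print(9 in s, 25 in s)
```

The three values collide, so they end up in neighbouring slots, and the lookup for `25` only stops when it reaches an empty slot. The fuller the table, the longer such searches become.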

## frozensets

Python frozensets are formed by using the `frozenset()` function on another collection or iterator.

A frozenset is the immutable equivalent of a set.
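
A short sketch of what immutability buys you here: a frozenset is hashable, so it can be used as a dict key or as a member of another set, which a regular set cannot. The names are illustrative.

```python
fs = frozenset([3, 4, 5, 3])
print(fs)                       # duplicates removed, like a set

# No in-place modification: frozenset has no add() method.
print(hasattr(fs, "add"))       # False

# A frozenset can be a dict key; element order does not matter.
groups = {frozenset({"a", "b"}): 1}
print(groups[frozenset({"b", "a"})])   # 1

# A regular (mutable) set cannot.
try:
    {set([1, 2]): "nope"}
except TypeError as err:
    print(f"TypeError: {err}")
```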

## dictionaries (dicts)

Python dictionaries are mapping objects that map _keys_ to arbitrary _values_. 

They are formed using `{}` brackets on comma separated, colon delimited pairs of objects. Alternatively, the `dict()` function can also be used to create a new dictionary, or you can use a dict comprehension.

```python
dict_a = {"a": 1, "b": 10, "c": 25}
dict_b = dict(x=1, y=2, z=100)
dict_c = {x: x ** 2 for x in range(10)}
```


Like sets, dicts are optimized to test whether a key is present. Keys must be hashable, which in practice usually means _immutable_ (e.g. no lists).
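
A minimal sketch of the restriction on keys (the dict name `grid` and its contents are made up for illustration):

```python
grid = {}
grid[(0, 1)] = "tuple keys are fine"      # a tuple of immutables is hashable
grid["name"] = "so are strings"
grid[42] = "and numbers"

try:
    grid[[0, 1]] = "lists are not"        # a list is mutable, so unhashable
except TypeError as err:
    print(f"TypeError: {err}")

print((0, 1) in grid)      # True
print("missing" in grid)   # False
```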

In [12]:
def num(max=1_000_000):
    return random.randint(0, max)

# make a dictionary using a dict comprehension
D = { num():num() for _ in range(1000)}

# get the keys as a list, so both tests search the same values
L = list(D.keys())

x = 12345
%timeit x in D
%timeit x in L

15.8 ns ± 0.0219 ns per loop (mean ± std. dev. of 7 runs, 100,000,000 loops each)
4.81 μs ± 1.6 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)


Again, it's a couple of orders of magnitude quicker to check if a key is in the dict than to search the list.

Dictionary comprehensions are useful to build a dict quickly, but do not always give a speed benefit.

In [13]:
# Reversing a dictionary via a "for" loop versus a dict comprehension

def method_one(input):
    out = {}
    for key, val in input.items():
        out[val] = key
    return out

def method_two(input):
    return {val:key for key, val in input.items()}

In [14]:
test = {_:20000-_ for _ in range(10000)}
%timeit t1 = method_one(test)  # for loop
%timeit t2 = method_two(test)  # comprehension

251 μs ± 1.87 μs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
233 μs ± 876 ns per loop (mean ± std. dev. of 7 runs, 1,000 loops each)


Both methods take around the same time in this case.

### Implementation

CPython implements dicts using hash tables, much like sets. Each entry now stores both the key and the address of the object holding the value.

There are some interesting differences, principally that a `dict` remembers insertion order (guaranteed since Python 3.7), whereas a `set` doesn't.
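
A small sketch of the ordering behaviour, assuming Python 3.7 or later (the keys are arbitrary):

```python
d = {}
for key in ("banana", "apple", "cherry"):
    d[key] = len(key)

print(list(d))       # ['banana', 'apple', 'cherry']

d["apple"] = 0       # updating a value keeps the key's position
d["date"] = 4        # new keys go at the end
print(list(d))       # ['banana', 'apple', 'cherry', 'date']
```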

# Part II: Extended types in standard library

## arrays

Python has a (rather limited) "array" type in the standard `array` module, which can be used to store collections of the same sort of object (e.g. all positive integers, or all floating point numbers). It's not very easy to use, and most people ignore it. Overall, I'd certainly agree with them.

### Implementation

Arrays differ from lists in that rather than the block of memory storing the addresses of other python objects (which then store the values), arrays store a homogeneous collection of data of the same type, typically with a more memory-efficient representation for numerical data.

In [15]:
import array
import sys

a = [1,2,3,4,5]
b = array.array('b', a)

print(sys.getsizeof(a))
print(sys.getsizeof(b))

104
85


## Other container datatypes in the `collections` module

The standard `collections` module contains several other extended container datatypes to be aware of, including:

- `namedtuple`:
  - much like a regular `tuple`, but with named fields. 
- `deque`:
  - similar to a list
  - optimised to add and remove items from the beginning as well as the end.

**(live coding)**

- `OrderedDict`:
  - subclass of a `dict`
  - remembers the order that items were added, and has methods to rearrange those items.
  - Less useful now that standard Python `dict`s also store the order. You will still find them in legacy code.
- `defaultdict`:
  - a dictionary that can automatically provide a default value for missing keys.
  - based on a factory function you provide when creating it.
  - e.g. `dd = defaultdict(int)`
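
A brief sketch of three of these containers in use. The `Crater` type and all the values are hypothetical, chosen only for illustration.

```python
from collections import namedtuple, deque, defaultdict

# namedtuple: a tuple whose fields can also be accessed by name.
Crater = namedtuple("Crater", ["name", "diameter_km"])
c = Crater("Example", 1.2)             # made-up values
print(c.name, c[1])                    # Example 1.2

# deque: cheap appends and pops at both ends.
q = deque([1, 2, 3])
q.appendleft(0)
q.append(4)
print(q.popleft(), q.pop(), list(q))   # 0 4 [1, 2, 3]

# defaultdict: missing keys get a default from the factory (int() == 0).
counts = defaultdict(int)
for letter in "banana":
    counts[letter] += 1
print(dict(counts))                    # {'b': 1, 'a': 3, 'n': 2}
```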

## numpy.ndarrays

The `numpy` package (remember, not in the standard Python library, but available on many systems) provides easy-to-use, multidimensional arrays (i.e. 1D, 2D, 3D etc.; up to 32 dimensions in NumPy 1.x, raised to 64 in NumPy 2).

- Like the 1D arrays we saw above (from the `array` module), `numpy` `ndarrays` _must_ have all elements stored as the same type of data.
- If no datatype (also known as `dtype`) is specified at time of creation, then `numpy` chooses a type which supports all the data in the input.

In [16]:
import numpy as np

a1 = np.array([0, 1, 2])
print(a1, a1.dtype)
a2 = np.array([0, 1, 2.5])
print(a2, a2.dtype)
a3 = np.array([0, 'hat', sum])
print(a3, a3.dtype)

[0 1 2] int64
[0.  1.  2.5] float64
[0 'hat' <built-in function sum>] object


In [17]:
a1 = np.array([0, 1, 2])
print(a1, a1.dtype)
a1[1] = 2.5
print(a1, a1.dtype)
a4 = a1 + 0.5
print(a4, a4.dtype)

[0 1 2] int64
[0 2 2] int64
[0.5 2.5 2.5] float64


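Note that assigning `2.5` into the integer array silently truncated it to `2`, because an array's `dtype` is fixed when it is created, whereas `a1 + 0.5` built a new `float64` array. Look back at lecture 2 for more on `numpy` arrays.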

## pandas Series/DataFrames

Again, look back at Tuesday's lecture for more info here, but as a reminder:

- `pandas.Series`
  - Similar to a `numpy` array, but can also be indexed by a label
- `pandas.DataFrame`
  - A table of data which has an index and named columns

## How do I choose an appropriate datatype?

<img src="./images/choose_dtype.png">

# Part III: Python Classes

Python allows programmers to build their own data structures, called _classes_. This follows a world-view called "object oriented programming".

Classes allow you to create your own holders for data, plus the functions (methods) that you need to work with it.

As you saw in the presessional Python material, a class (say, `MyClass`) is defined using a syntax like this:

In [18]:
class MyClass:
    """This is my cool class."""
    
    mylist = []

    def my_method(self, a, b):
        """Add two values, store the result, then return the product."""
        self.val = a + b
        return a*b

With the above definition, we can create a new object of our class (an _instance_):
```python
cls = MyClass()
```

We can then use `cls.my_method(1, 3)` to do the same as `MyClass.my_method(cls, 1, 3)`.
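
A quick check of that equivalence, as a sketch. The class is repeated in cut-down form so that the block runs on its own, and the instance is named `obj` here purely for readability.

```python
class MyClass:
    def my_method(self, a, b):
        self.val = a + b
        return a * b

obj = MyClass()
r1 = obj.my_method(1, 3)            # the instance is passed as self automatically
r2 = MyClass.my_method(obj, 1, 3)   # the same call, with self passed by hand
print(r1, r2, obj.val)              # 3 3 4
```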

Note that attributes at class level are the _same_ object for all instances (e.g. `mylist` in the example above).

Attributes set at the instance level (e.g. `val`) can differ from each other:

In [19]:
# Get two instances
cls1 = MyClass()
cls2 = MyClass()

In [20]:
# Attributes created on the class are the same variable
cls1.mylist.append(1)
cls2.mylist.append(2)
print(f'cls1.mylist is cls2.mylist : {cls1.mylist is cls2.mylist}')

print(f'cls1.mylist: {cls1.mylist}')
print(f'cls2.mylist: {cls2.mylist}')

cls1.mylist is cls2.mylist : True
cls1.mylist: [1, 2]
cls2.mylist: [1, 2]


In [21]:
# Attributes created on instances can differ
cls1.my_method(1, 2)
cls2.my_method(4, 5)

print(f'cls1.val is cls2.val: {cls1.val is cls2.val}')

print(f'cls1.val: {cls1.val}')
print(f'cls2.val: {cls2.val}')

cls1.val is cls2.val: False
cls1.val: 3
cls2.val: 9


Mostly we want each instance to store its own data. We can use the special `__init__()` method to _initialize_ instance variables when the new object is created:

In [22]:
class Point3D:
    def __init__(self, x, y, z):
        self.x = x
        self.y = y
        self.z = z
        
pnt1 = Point3D(1, 1, 1)
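
To tie this back to the shared `mylist` seen earlier, here is a sketch contrasting a list created at class level with one created in `__init__`. The class names `SharedList` and `OwnList` are hypothetical.

```python
class SharedList:
    items = []                  # one list, shared by every instance

class OwnList:
    def __init__(self):
        self.items = []         # a fresh list for each instance

s1, s2 = SharedList(), SharedList()
s1.items.append(1)
print(s2.items)                 # [1]

o1, o2 = OwnList(), OwnList()
o1.items.append(1)
print(o2.items)                 # []
```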

#### A note on `__init__` and `__new__`

[*Note: This is advanced usage--can safely ignore this for now*]

As always, things are actually a little more complicated.
When a new instance is created, Python first calls the class's `__new__` method to create the object, and then calls `__init__` on the object that `__new__` returns.

Generally, this is what we want to happen, so we can ignore `__new__` and just write the `__init__` method.

However, very rarely, you may want to write a `__new__` method to do something special.

In [23]:
# An example of a class with a __new__ method

import weakref

point_cache = []

class Point3D:
    
    def __new__(cls, *args):
        print('in __new__')
        # call the original object creation method
        point = super().__new__(cls) 
        point_cache.append(weakref.proxy(point))
        return point  # important!
    
    def __init__(self, x, y, z):
        print('in __init__')
        self.x = x
        self.y = y
        self.z = z
        
    def __del__(self):
        point_cache.remove(self)
        
    def __repr__(self):
        return f'Point3D({self.x}, {self.y}, {self.z})'

In [24]:
point1 = Point3D(1, 1, 1)
print(point_cache)
del point1
print(point_cache)

in __new__
in __init__
[<weakproxy at 0x10f93a570 to Point3D at 0x10f940b60>]
[]


As a note, the opposite of the `__new__` method is the `__del__` method. This gets called at the point the Python garbage collector believes the object is no longer needed, so the `__del__` method can be used to clean up anything special.

### Why use classes at all?

Classes (and class objects) are not a required part of a programming language, and many languages using other programming paradigms (for example, purely procedural languages such as C) have existed for a long time. 

However, an object-oriented approach works well at avoiding coding mistakes when combining structured collections of data and functions modifying or processing that data. It does this by binding the data and the methods together, making it much harder to apply them to the wrong thing.

#### Simple class example: Trees
Here's a simple example of a Python class, in which:
* Each instance of the class stores some information about a tree (species, age, height)
* There are some methods which interact with that data
* We can easily write code to access the data from these instances

In [25]:
class Tree:
    def __init__(self, species, age, height):
        self.species = species
        self.age = age  # age in years
        self.height = height  # height in metres

    def __str__(self):
        return f"{self.species} tree is {self.age} years old, {self.height} metres tall"
    
    def estimate_canopy_diameter(self):
        # Simplified estimation: canopy diameter is roughly 1/3 of the height plus 0.1 metres per year of age
        return (self.height / 3) + (0.1 * self.age)

In [26]:
tree1 = Tree("Oak", 30, 25)
tree2 = Tree("Sycamore", 75, 15)
tree3 = Tree("Birch", 5, 8)
forest = [tree1, tree2, tree3]

for tree in forest:
    print(tree)
    print(f"{tree.species} canopy diameter = {tree.estimate_canopy_diameter(): .2f} m")
    print()
    
# Find the tallest tree in the forest
tallest = max(forest, key=lambda tree: tree.height)
print(f"The tallest tree is the {tallest.species} tree, which is {tallest.height} m tall.")

# Find the widest tree in the forest
widest = max(forest, key=lambda tree: tree.estimate_canopy_diameter())
print(f"The widest tree is the {widest.species} tree, which is {widest.estimate_canopy_diameter(): .2f} m wide.")

Oak tree is 30 years old, 25 metres tall
Oak canopy diameter =  11.33 m

Sycamore tree is 75 years old, 15 metres tall
Sycamore canopy diameter =  12.50 m

Birch tree is 5 years old, 8 metres tall
Birch canopy diameter =  3.17 m

The tallest tree is the Oak tree, which is 25 m tall.
The widest tree is the Sycamore tree, which is  12.50 m wide.


### properties: combining class methods and attributes

[*Note: This is also quite advanced usage, but is still useful to know about*]

Sometimes we want a quantity which "feels" like data (i.e. an _attribute_) but is computed or set by a function.

We can do this with the Python _decorators_: `@property` and `@<attribute>.setter`:

In [27]:
class Temperature:
    
    def __init__(self, temp):
        self.celsius = temp
       
    @property
    def fahrenheit(self):
        return self.celsius * 9 / 5 + 32
    
    @fahrenheit.setter
    def fahrenheit(self, temp):
        self.celsius = (temp - 32) * 5 / 9

Note how `fahrenheit` is calculated from the `celsius` attribute, and we define a separate _setter_ method, which updates the `celsius` value.

In [28]:
t1 = Temperature(30.)
    
print(t1.celsius, t1.fahrenheit)

t1.celsius = 20.

print(t1.celsius, t1.fahrenheit)

t1.fahrenheit = 50.

print(t1.celsius, t1.fahrenheit)

30.0 86.0
20.0 68.0
10.0 50.0


### A note on overloading

Python classes have a large number of ["magic methods"](https://docs.python.org/3/reference/datamodel.html#specialnames). Remember you saw some of these in the 4th presessional lecture.

These can be used to implement all the regular Python operators. For an instance (`cls`) of a class (`MyClass`):

- `cls()` goes to `MyClass.__call__(self):`
- `print(cls)` goes to `MyClass.__str__(self):`
- `repr(cls)` (or just `cls` at an interactive prompt) goes to `MyClass.__repr__(self):`
- `cls1 + cls2` is a little more complicated - to some **live coding!**

### Rules of thumb for overloads

- Don't surprise people, e.g.
  - `+` should add, increase or append some quantity or object
  - `-` should subtract, remove or delete something
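
A minimal sketch of some of the magic methods listed above, using a hypothetical `Vector2D` class. This is one possible design rather than a prescribed pattern; in particular, making a vector callable to scale it is only there to show `__call__` in action.

```python
class Vector2D:
    """Hypothetical 2D vector used to illustrate operator overloading."""

    def __init__(self, x, y):
        self.x = x
        self.y = y

    def __repr__(self):
        # Used by repr() and when the object is echoed at the prompt.
        return f"Vector2D({self.x}, {self.y})"

    def __str__(self):
        # Used by print() and str().
        return f"({self.x}, {self.y})"

    def __add__(self, other):
        # Called for self + other. Returning NotImplemented for unknown
        # types lets Python try other.__radd__ or raise a TypeError.
        if not isinstance(other, Vector2D):
            return NotImplemented
        return Vector2D(self.x + other.x, self.y + other.y)

    def __call__(self, scale):
        # Called for self(scale); here it returns a scaled copy.
        return Vector2D(self.x * scale, self.y * scale)


v = Vector2D(1, 2) + Vector2D(3, 4)
print(v)          # (4, 6)
print(repr(v))    # Vector2D(4, 6)
print(v(2))       # (8, 12)
```

Here `+` returns a new vector and leaves both operands unchanged, which matches what most readers would expect from addition.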

## Summary of the day

- We've reviewed many of the built-in data types and their properties.
- We've looked through a few extension types.
- We've revised classes, how to use them and _why_ to use them.