# Data structures

Python (and the theory of Computer Science) stores data in a number of ways, and today we will look at some of them:

1. A review of Python's built-in types - a look at what's available "out of the box".
2. A look at (some) extended data structures in the Python standard library.
3. A quick look at other useful data structures in some common Python packages.
4. More info on data structures that **you** create yourself (Python classes).


## Introduction


## Part I : Core Python data structures & types

### Objects

All Python data is considered an "object", and unlike statically typed languages such as C or Java, functions will accept any object as input. Objects can be "bound" to names using the `=` operator.

Objects come in two categories, _mutable_ and _immutable_. Mutable objects allow their underlying data to change, whereas immutable objects are fixed at creation time.
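As a small illustration of the difference (the names `numbers` and `text` are made up for this sketch), a list can be changed in place and remains the same object, whereas a string refuses any in-place change:

```python
# A list is mutable: it can be changed in place and stays the same object.
numbers = [1, 2, 3]
id_before = id(numbers)
numbers.append(4)
print(numbers, id(numbers) == id_before)

# A string is immutable: trying to change it in place raises a TypeError.
text = 'hello'
try:
    text[0] = 'j'
except TypeError as err:
    print(f'TypeError: {err}')
```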

### Integers

It may surprise some of you, but the built-in Python integers (and floats & strings) are actually immutable. When an action which appears to update an integer occurs, Python actually creates a new object with the updated value. We can check this using the `id()` function, which returns a value (actually an integer) that is unique among all the Python objects existing at that moment.

In [1]:
a = 1

In [6]:
def axd():
    print(a)
    return 

In [7]:
axd()

1


In [None]:
x = 3
print(f'1: id of object named x is {id(x)}')
x += 5
print(f'2: id of object named x is {id(x)}')

For efficiency, CPython actually stores a single set of Python objects representing the small integers (from -5 to 256), rather than making new ones each time. Remember that we can use the `is` operator to check if two variable names point to the same instance of an object.

In [None]:
x = 3; y = 3
print(f'ids: {id(x)}, {id(y)} same object: {x is y}')
a = 12345; b = 12345
print(f'ids: {id(a)}, {id(b)} same object: {a is b}')    

From a numerical point of view Python implements "arbitrary precision" integers, which is to say the first limit on the largest magnitude integer your Python code can work with is the amount of memory available in your computer. "Under the hood", CPython stores an integer as an expandable sequence of fixed-size chunks (its "digits"), together with a count of how many are being used.

In [None]:
import sys

a = 1
print(f'a uses {sys.getsizeof(a)} bytes')
b = 10**5
print(f'b uses {sys.getsizeof(b)} bytes')
c = 10**10
print(f'c uses {sys.getsizeof(c)} bytes')
d = 10**1000
print(f'd uses {sys.getsizeof(d)} bytes')

### Lists

Python lists are formed using `[]` brackets, or using the `list()` function on another collection or iterator.


```python
list_a = [1, 3, 5]
list_b = list((3, 4, 5))
list_c = [ _ for _ in range(6, 10) ]
```

Python lists are a mutable, one-dimensional, ordered collection of other Python objects. Lists are (relatively) fast at finding & changing their elements, as well as adding or deleting the _last_ element in the list. They are much slower at adding or removing elements other than the last one (with a cost that scales with the length of the list).

The last creation method in the block above, `[ _ for _ in range(6, 10) ]` is called a list comprehension. This is usually the fastest way to build a list in Python and should be preferred if things will stay readable.

In [None]:
def list_method_one():
    out = []
    for i in range(10000):
        out.append(i)
    return out
    
def list_method_two():
    return [i for i in range(10000)]

%timeit a = list_method_one()
%timeit b = list_method_two()

#### Implementation

In CPython, lists are implemented as a block of memory holding a sequence of "memory addresses" (i.e. where the computer needs to look to find each element) of the contents of the list. This explains why looking up the `n`th element of the list is pretty quick (jump to the `n`th slot, look up the address in it, and that's your object). In addition, implementations normally grab more memory than they think they'll need, so adding a new element to the end of the list is normally cheap unless it's "full". In the latter case, you'll have to wait while Python sets up a new, bigger block of memory, and copies all the old entries across to it. Deleting elements, or inserting them towards the beginning of the list, can be much more expensive. For example, to insert a new second element means copying nearly every other address in the block one place down, then adding the new address.
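The spare capacity described above can be glimpsed with `sys.getsizeof`. This is only a rough sketch, and it assumes CPython, where the reported size of a list includes its unused slots; the exact numbers vary between versions.

```python
import sys

# Record the size of the list object after each append.
# Assumption: CPython, where sys.getsizeof includes the over-allocated slots.
items = []
sizes = []
for i in range(100):
    items.append(i)
    sizes.append(sys.getsizeof(items))

# The size only changes occasionally: most appends reuse spare capacity.
n_sizes = len(set(sizes))
print(f'{len(items)} appends, but only {n_sizes} distinct sizes')
```

Each jump in size corresponds to one of the (relatively rare) occasions on which a bigger block had to be set up.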

### Tuples

Python tuples are formed using comma separated values (often placed in `()` brackets) or using the `tuple()` function on another collection or iterator.


```python
tuple_a = 1, 3, 5
tuple_b = tuple([3, 4, 5])
```

Python tuples are immutable, one-dimensional, ordered collections of other Python objects (so the immutable equivalent of lists). Tuples are fast at finding their elements and tend to use slightly less memory than a similar list. If you know that something should never change, then it's a better idea to store it in a tuple than in a list, since it means Python will catch your mistake if you try to change it.


#### Implementation

In CPython, tuples are implemented a lot like lists, except that the block can be chosen to be exactly the right size to hold all the addresses it needs to contain, rather than needing to leave room to expand.
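A short sketch of both points, immutability and memory use (the names `point` and `as_list` are just for illustration, and the byte counts assume CPython and will differ between versions and platforms):

```python
import sys

point = (1, 3, 5)

# Tuples are immutable, so item assignment raises a TypeError.
try:
    point[0] = 2
except TypeError as err:
    print(f'TypeError: {err}')

# Compare the memory used by a tuple and a list with the same contents.
as_list = [1, 3, 5]
print(f'tuple: {sys.getsizeof(point)} bytes, list: {sys.getsizeof(as_list)} bytes')
```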


### Sets

Python sets are formed using `{}` brackets on comma delimited collections of objects, or using the `set()` function on another collection or iterator.
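In the same style as the list and tuple examples, a few ways of creating sets (the variable names are arbitrary):

```python
set_a = {1, 3, 5}
set_b = set([3, 4, 5, 5])            # duplicates are discarded
set_c = {n % 3 for n in range(10)}   # a set comprehension
empty = set()                        # note: {} on its own makes an empty dict
print(set_a, set_b, set_c, empty)
```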

Python sets are an unordered, mutable collection of other Python objects. They are tuned to make it quick to test whether an object `x` is in the set `S` using the `x in S` syntax. We can compare this to the same test applied to a list:

In [None]:
import random

x = 12345
S = { random.randint(0, 1_000_000) for _ in range(1000)}
L = list(S)
%timeit x in S
%timeit x in L

#### Implementation

Python sets are implemented via what is known as a "hash table". This consists of a large block of memory, of length N (which is significantly larger than the number of members of the set) and a "hash" function which maps inputs to the numbers 0 to N-1. To add a member, its hash is calculated and a note is made at that position in the memory. To check if a value is in the set, we hash it, and then check if there's a record in the right space. This is all relatively fast.

As you may have spotted, it's always possible that a hash function will give the same location for two different inputs, so we actually need a process to deal with these "collisions". Python's method is essentially to carry on looking in other positions, in a predictable pattern based on the hash, until it finds either what it's looking for or an empty entry. Needing this to be efficient is the primary reason why the memory block needs to be larger than the number of members of the set.
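The idea can be sketched with a toy class. This is a deliberately simplified illustration and not CPython's actual implementation: the class name `ToyHashSet` is made up, the table never grows, deletion is not supported, and on a collision it simply tries the next slot along ("linear probing") rather than using CPython's real probing pattern.

```python
class ToyHashSet:
    """A toy fixed-size hash set using open addressing with linear probing."""

    _EMPTY = object()  # sentinel marking an unused slot

    def __init__(self, n_slots=8):
        self.slots = [self._EMPTY] * n_slots

    def _probe(self, value):
        """Yield the slot indices to try for value, starting at its hash position."""
        n = len(self.slots)
        start = hash(value) % n
        for step in range(n):
            yield (start + step) % n

    def add(self, value):
        for i in self._probe(value):
            if self.slots[i] is self._EMPTY or self.slots[i] == value:
                self.slots[i] = value
                return
        raise RuntimeError('table is full (a real implementation would resize)')

    def __contains__(self, value):
        for i in self._probe(value):
            if self.slots[i] is self._EMPTY:
                return False  # reached an empty slot: value was never added
            if self.slots[i] == value:
                return True
        return False


s = ToyHashSet()
s.add(1)
s.add(9)  # in CPython hash(9) % 8 == 1, so this collides with 1 and uses the next slot
print(1 in s, 9 in s, 17 in s)
```

Looking up `17` also starts at slot 1, steps past the two occupied slots, and stops at the first empty one, which is why a mostly empty table keeps lookups short.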


### Frozensets

Python frozensets are formed by using the `frozenset()` function on another collection or iterator.

Python frozensets are the immutable equivalent of a set, with an almost identical implementation, and are likely among the least used Python built-in data structures.
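One place where they do earn their keep: because a frozenset cannot change, it is hashable, so it can be used as a dictionary key or as a member of another set. A brief sketch (the names `fs` and `lookup` are illustrative):

```python
fs = frozenset([1, 2, 3])

# Being immutable, a frozenset is hashable, so it can be a dict key.
lookup = {fs: 'small numbers'}
print(lookup[frozenset([3, 2, 1])])

# An ordinary set is not hashable, so it cannot be used this way.
try:
    {set([1, 2, 3]): 'small numbers'}
except TypeError as err:
    print(f'TypeError: {err}')

# There are no mutating methods such as add().
print(hasattr(fs, 'add'))
```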

### Dictionaries (dicts)

Python dictionaries are formed using `{}` brackets on collections of comma separated, colon delimited pairs (named the "key" and the "value" respectively) of objects. The `dict()` function can also be used to create a new dictionary, for example from keyword arguments or from an iterable of key-value pairs.
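A few ways of building the same dictionary, in the style of the list and tuple examples (the keys and values are arbitrary):

```python
dict_a = {'x': 1, 'y': 2}
dict_b = dict(x=1, y=2)                    # keyword arguments become string keys
dict_c = dict([('x', 1), ('y', 2)])        # from an iterable of (key, value) pairs
dict_d = {key: ord(key) for key in 'xy'}   # a dict comprehension
print(dict_a == dict_b == dict_c)
print(dict_d)
```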

Like sets, dicts are optimized to find whether a key `k` is among those in the dictionary, and then to set or return the corresponding value, or to delete the entry.

In [None]:
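import random

def num(upper=1_000_000):
    """Return a random integer between 0 and upper (inclusive)."""
    return random.randint(0, upper)

# make a dictionary with random keys and values
D = {num(): num() for _ in range(1000)}
# the same values, keyed by their position instead
seq_D = {i: val for i, val in enumerate(D.values())}

# get the equivalent list of keys
L = list(D)

x = 12345
%timeit x in D
%timeit x in L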

In [None]:
# Reversing a dictionary via a "for" loop and a dict comprehension

def method_one(mapping):
    out = {}
    for key, val in mapping.items():
        out[val] = key
    return out

def method_two(mapping):
    return {val: key for key, val in mapping.items()}


test = {_:20000-_ for _ in range(10000)}
%timeit t1 = method_one(test)
%timeit t2 = method_two(test)

%timeit t3  = {val:key for key, val in test.items()}
    

#### Implementation

Like sets, CPython implements dicts using hash tables, this time storing both a note of the key and the address of the object containing the value assigned to it. There are some interesting differences, principally that the dict implementation tracks and uses the insertion order of the entries, whereas the set implementation does not.
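The insertion-order behaviour (guaranteed by the language since Python 3.7) is easy to see in a small example; the keys used here are arbitrary:

```python
d = {}
d['b'] = 1
d['a'] = 2
d['c'] = 3
print(list(d))   # ['b', 'a', 'c']: keys come back in insertion order

d['b'] = 10      # updating an existing key keeps its position
del d['a']
d['a'] = 20      # deleting and re-adding moves the key to the end
print(list(d))   # ['b', 'c', 'a']
```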


## Extended types

### Other collections


### arrays

Python has a (rather limited) "array" type in the standard `array` module, which can be used to store collections of the same sort of object (e.g. all positive integers, or all floating point numbers). It's not very easy to use, and most people ignore it. Overall, I'd certainly agree with them.

#### Implementation

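Arrays differ from lists in that, rather than the block of memory storing the addresses of other Python objects (which then store the values), the block stores the values themselves, packed one after another in a fixed-size format set by the array's type code.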

In [None]:
import array
import sys

a = [1,2,3,4,5]
b = array.array('b', a)

print(sys.getsizeof(a))
print(sys.getsizeof(b))

### numpy.ndarrays

The `numpy` package (remember, not in the standard Python libraries, but frequently available on many systems) provides easy-to-use, multidimensional arrays (i.e. 1D, 2D, 3D etc., up to 32 dimensions in NumPy 1.x releases). Like the 1D arrays above, `numpy` `ndarrays` _must_ have all elements stored as the same type of data. If no datatype (also known as `dtype`) is specified at creation, then `numpy` chooses a type which supports all the data in the input.

In [None]:
import numpy as np

a1 = np.array([0, 1, 2])
print(a1, a1.dtype)
a2 = np.array([0, 1, 2.5])
print(a2, a2.dtype)
a3 = np.array([0, 'hat', sum])
print(a3, a3.dtype)

In [None]:
a1 = np.array([0, 1, 2])
print(a1, a1.dtype)
a1[1] = 2.5
print(a1, a1.dtype)
a4 = a1 + 0.5
print(a4, a4.dtype)

### DataFrames

## Part III: Python Classes

Python allows programmers to build their own data structures called "classes". Following a world-view called "object oriented programming", these allow you to create your own holders for data, as well as the functions (called methods) which you need to modify it.

As you saw in the presessional Python material, a class (say, `MyClass`) is defined using a syntax like this:

In [10]:
class MyClass:
    """This is my cool class."""
    
    mylist = []

    def my_method(self, a, b):
        """Add two values, store the result, then return the product."""
        self.val = a + b
        return a*b

With the above definition, we can create a new object in our class (an "instance") with the syntax `cls = MyClass()`. We can then use the syntax `cls.my_method(1, 3)` to get the same result as calling `MyClass.my_method(cls, 1, 3)`. Note that attributes set at the class level are the _same_ object for all instances, while attributes set at the instance level can differ from each other:

In [11]:
# Get two instances
cls1 = MyClass()
cls2 = MyClass()

In [12]:
# Attributes created on the class are the same variable
cls1.mylist.append(1)
cls2.mylist.append(2)
print(f'cls1.mylist is cls2.mylist : {cls1.mylist is cls2.mylist}')

print(f'cls1.mylist: {cls1.mylist}')
print(f'cls2.mylist: {cls2.mylist}')

cls1.mylist is cls2.mylist : True
cls1.mylist: [1, 2]
cls2.mylist: [1, 2]


In [None]:
# Attributes created on instances can differ
cls1.my_method(1, 2)
cls2.my_method(4, 5)

print(f'cls1.val is cls2.val: {cls1.val is cls2.val}')

print(f'cls1.val: {cls1.val}')
print(f'cls2.val: {cls2.val}')

Since in most cases we do want our class objects to store unique data, the special `__init__()` method can be used to _initialize_ instance variables at the time when the new object is created.

In [13]:
class Point3D:
    def __init__(self, x, y, z):
        self.x = x
        self.y = y
        self.z = z
        
pnt1 = Point3D(1, 1, 1)

#### A note on `__init__` and `__new__`

As so often in programming, things are actually a little more complicated than they initially appear. When a new instance is created, Python first calls the class's `__new__` method to create the object, and then calls the `__init__` method of the object which that returns (provided it is an instance of the class). Generally, this is what we want to happen, so we can ignore this and just write the `__init__` method to initialise the class instance the way we want it. However, very rarely, you may want to write a `__new__` method to do something special at the point the object is created, before the initialization itself is handled.

As a final point, the opposite of the `__new__` method is the `__del__` method. This gets called at the point the Python garbage collector believes the object is no longer needed, so the `__del__` method can be used to clean up anything special.

In [9]:
#An example of a class with a __new__ method

import weakref

point_cache = []

class Point3D:
    
    def __new__(cls, *args):
        print('in __new__')
        point = super().__new__(cls)  # call the original object creation method
        # store a weak proxy, so that the cache does not keep the point alive
        point_cache.append(weakref.proxy(point))
        return point  # the new instance, which __init__ will then initialise
    
    def __init__(self, x, y, z):
        print('in __init__')
        self.x = x
        self.y = y
        self.z = z
        
    def __del__(self):
        point_cache.remove(self)
        
    def __repr__(self):
        return f'Point3D({self.x}, {self.y}, {self.z})'
        
point1 = Point3D(1, 1, 1)
print(point_cache)
del point1
print(point_cache)

in __new__
in __init__
[<weakproxy at 0x11bbaf4a0 to Point3D at 0x1196e2be0>]
[]


### Why use classes at all?

Classes (and class objects) are not a required part of a programming language, and many languages using other programming paradigms (for example, purely procedural languages such as C) have existed for a long time. However, an object-oriented approach works well at avoiding coding mistakes when combining structured collections of data and functions modifying or processing that data. It does this by binding the data and the methods together, making it much harder to apply them to the wrong thing.

### properties: combining class methods and attributes

A common situation is to have a quantity which "feels" like data (i.e. an attribute)
