# The notes for this class are mainly in OneNote

# Data structures

James Percival

## Today's learning outcomes

At the end of the day you should:
- be able to describe the properties of the built-in Python data structures
- be able to select the best choice for a given usage
- understand more about how Python classes work and why to use them

## What is a data structure?

On Monday we reviewed raw storage of data in binary representations.

Python (and computer science in general) stores data in a number of more
complex ways:

- for efficiency
- for convenience

Today we will look at some of them:

1. Python's built in types - "what's in the box?"
2. Extended data structures in the Python standard library
3. Review of other useful data structures in common Python packages
4. More info on data structures that **you** create yourself (Python classes).

## Part I : Core Python data structures & types

### Objects

All Python data is an "object".

Unlike in C or Java, Python functions will accept any object as input ("duck typing").
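
A minimal sketch of duck typing. The helper `describe_length` is a made-up name for illustration and is not part of the lecture material. It never checks the type of its argument and only relies on the object supporting `len()`:

```python
def describe_length(thing):
    """Works for any object that supports len(), whatever its type."""
    return f'{type(thing).__name__} with {len(thing)} items'

print(describe_length([1, 2, 3]))      # a list
print(describe_length('abc'))          # a string
print(describe_length({'a': 1}))       # a dict

try:
    describe_length(7)                 # integers have no len()
except TypeError as err:
    print(f'TypeError: {err}')
```

The failure only shows up when the function actually runs, which is the price paid for the flexibility.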

Objects can be "bound" to names using the `=` operator

<img src="./images/name1.png">

In [1]:
x = 7

<img src="./images/name2.png">

In [3]:
y = x
y = 8 # create a new object 8 and bind y to it
x

7

<img src="./images/name4.png">

In [5]:
a = list([1,2,3,4])
b = a

b[2]  = 'abc'

a

[1, 2, 'abc', 4]

Objects come in two "flavours":
-  _mutable_ and 
- _immutable_.

Mutable objects allow their underlying data to change, immutable objects are fixed when they're created.
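
As a rough illustration of the two flavours (the variable names are arbitrary), a list can be changed in place while a tuple cannot, so "changing" a tuple really means building a new object and rebinding the name:

```python
# Mutable: a list can be changed in place, so its identity stays the same.
numbers = [1, 2, 3]
id_before = id(numbers)
numbers.append(4)
print(id(numbers) == id_before)

# Immutable: a tuple offers no way to change its contents.
fixed = (1, 2, 3)
try:
    fixed[0] = 99
except TypeError as err:
    print(f'TypeError: {err}')

# "Updating" an immutable object builds a new one and rebinds the name.
original = fixed
fixed = fixed + (4,)
print(fixed is original)
```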

### Integers

It may be a surprise, but the built-in Python integers (& floats & strings) are immutable.

To "update" an integer, Python creates a new object with the updated value.  We can check this using the `id()` function.

This returns a unique value (actually an integer) for every Python object in memory.

In [6]:
x = 3
print(f'1: id of object named x is {id(x)}')
x += 5
print(f'2: id of object named x is {id(x)}')

1: id of object named x is 4313557360
2: id of object named x is 4313557520


For efficiency, Python actually stores a single set of Python objects representing the small integers (from -5 to 256).

Remember we can use the `is` operator to check if two variable names point to the same object.

In [7]:
x = 3; y = 3
print(f'ids: {id(x)}, {id(y)} same object: {x is y}')

a = 12345; b = 12345
print(f'ids: {id(a)}, {id(b)} same object: {a is b}')    

ids: 4313557360, 4313557360 same object: True
ids: 4375210352, 4375211120 same object: False


As we heard on Monday, Python has "arbitrary precision" integers.

The fundamental limit on the largest integer your Python code can work with is the memory available in your computer.

In [8]:
import sys

a = 10**0
print(f'a uses {sys.getsizeof(a)} bytes')
b = 10**5
print(f'b uses {sys.getsizeof(b)} bytes')
c = 10**10
print(f'c uses {sys.getsizeof(c)} bytes')
d = 10**1000
print(f'd uses {sys.getsizeof(d)} bytes')

a uses 28 bytes
b uses 28 bytes
c uses 32 bytes
d uses 468 bytes


### Lists

Python lists are formed using `[]` brackets, or using the `list()` function on another collection or iterator.


```python
list_a = [1, 3, 5]
list_b = list((3, 4, 5))
list_c = [ _ for _ in range(6, 10) ]
```

Python lists are:
- mutable
- one-dimensional (i.e. single index)
- ordered

_collections_ of other Python objects.

They also support _iteration_ over their elements, in index order.

Lists are (relatively) fast at:
- finding/changing their elements,
- adding/deleting the _last_ element in the list.

They are much slower at adding or removing elements other than the last one. The cost scales with the length of the list.
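
One way to see this for yourself is to time the two patterns outside a notebook with the standard `timeit` module. This is only a sketch: the function names and the size `n` are arbitrary choices, and the timings will vary from machine to machine.

```python
import timeit

def append_to_end(n):
    out = []
    for i in range(n):
        out.append(i)        # cheap: nothing else has to move
    return out

def insert_at_front(n):
    out = []
    for i in range(n):
        out.insert(0, i)     # every existing element shifts along by one
    return out

n = 5_000
t_end = timeit.timeit(lambda: append_to_end(n), number=5)
t_front = timeit.timeit(lambda: insert_at_front(n), number=5)
print(f'append at end:   {t_end:.4f} s')
print(f'insert at front: {t_front:.4f} s')
```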

The pattern `[ _ for _ in range(6, 10) ]` is called a list comprehension. It is _usually_ the fastest way to build a list in Python.

It is a good idea to use one as long as it stays readable.

In [9]:
def list_method_one():
    out = []
    for i in range(10000):
        out.append(i)
    return out

%timeit a = list_method_one()

298 µs ± 1.33 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)


In [11]:
def list_method_two():
    return [i for i in range(10000)]

%timeit b = list_method_two()

146 µs ± 682 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)


#### Implementation

In CPython (i.e. the Python on most computers), a list is a block of memory holding a sequence of "memory addresses".

|list index| address|
|-|-|
|0|#2343|
|1|#323|
|2|#12|
|3|#323|
|4| _not used yet_|
|5| _not used yet_|
|6| _not used yet_|
|7| _not used yet_|

Looking up the `n`th element is quick (jump straight to the nth block, look up the contents of the address).

Implementations grab spare memory to allow room to expand. Adding a new element (`x.append()`) is normally cheap unless the block is "full".

Adding to a full list requires copying the whole list. Deleting & inserting elements is also expensive, because all the later elements have to shift along:

|index| _original_ | _append_ | _insert_ | _delete_ |
|-|-|-|-|-|
|0|#2343|#2343|#2343|#2343|
|1|#323|#323|#323|#323|
|2|#12|#12|#496|#323|
|3|#323|#323|#12|#5|
|4|#5|#5|#323|#67|
|5|#67|#67|#5|_not used yet_|
|6| _not used yet_|#226|#67|_not used yet_|
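
The spare room can be observed indirectly. In the sketch below (CPython-specific, and the exact sizes depend on the Python version), `sys.getsizeof` only changes on the occasional append that forces the list to grab a bigger block:

```python
import sys

values = []
last_size = sys.getsizeof(values)
resizes = []                      # lengths at which the block was reallocated

for i in range(64):
    values.append(i)
    size = sys.getsizeof(values)
    if size != last_size:         # the list grabbed a new, larger block
        resizes.append(len(values))
        last_size = size

print(f'{len(values)} appends triggered {len(resizes)} resizes')
print(f'resized at lengths: {resizes}')
```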

### Tuples

Python tuples are formed using commas (usually in `()` brackets) or using the `tuple()` function on another collection or iterator.


```python
tuple_a = 1, 3, 5
tuple_b = tuple([3, 4, 5])
```

Python tuples are:
- immutable
- one-dimensional
- ordered

collections of other Python objects.

They are the immutable equivalent of lists.

Tuples are fast at:
-  finding their elements

They tend to use slightly less memory than a list.

If you know that something should never change, then use a tuple.

#### Implementation

In CPython, tuples are implemented a lot like lists.

The block can be chosen to be exactly the right size to hold all the addresses, as there is no need for room to expand.
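
A quick check of both claims, as a sketch. The byte counts are CPython-specific and vary with version and platform, so none are quoted here:

```python
import sys

as_list = [1, 2, 3, 4, 5]
as_tuple = tuple(as_list)

# The exact byte counts depend on the Python version and platform.
print(f'list:  {sys.getsizeof(as_list)} bytes')
print(f'tuple: {sys.getsizeof(as_tuple)} bytes')

# Tuples have no append/insert/remove methods, and item assignment fails.
try:
    as_tuple[0] = 99
except TypeError as err:
    print(f'TypeError: {err}')
```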

### Sets

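Python sets are formed using `{}` brackets around comma-separated objects, or using the `set()` function on another collection or iterator. Note that empty `{}` brackets create a dictionary, so an empty set has to be written `set()`.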

Python sets are:
- unordered
- mutable

collections of objects.
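
The ways of building a set described above can be sketched as follows (the names `set_a` etc. simply mirror the list and tuple examples earlier):

```python
set_a = {1, 3, 5, 3}                # duplicates are discarded
set_b = set([3, 4, 5])
set_c = {n % 3 for n in range(10)}  # set comprehension

print(set_a, set_b, set_c)

# Careful: empty braces make a dict, not a set.
print(type({}).__name__, type(set()).__name__)

# The usual mathematical operations are available as operators.
print(set_a | set_b)   # union
print(set_a & set_b)   # intersection
print(set_a - set_b)   # difference
```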

Sets are quick to test for membership using the `x in S` syntax.

In [12]:
import random

x = 12345
S = { random.randint(0, 1_000_000) for _ in range(1000)}
L = list(S)
%timeit x in S
%timeit x in L

30.5 ns ± 0.2 ns per loop (mean ± std. dev. of 7 runs, 10,000,000 loops each)
6.36 µs ± 46.5 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)


#### Implementation

CPython sets are implemented via a "hash table".

This consists of:
- a large (mostly empty) block of memory, of length N
- a "hash" function which maps inputs to the numbers 0 to N-1.

|index| address|
|-|-|
|0| value1|
|1|_not used yet_|
|2| value3|
|3|  _not used yet_|
|4| _not used yet_|
|5| _not used yet_|
|6| value2|
|7| _not used yet_|

To add a member:
- calculate hash of value
- insert in that space

To check if a value is in the set:
- calculate hash of value
- check that space

This is all relatively fast.

You may have spotted that it's possible for a hash function to give the same location for two different inputs.

We need a process to deal with these "collisions".

Python's method is essentially to carry on looking in other blocks via a predictable pattern based on the hash until:
-  it finds what it's looking for, or
- an empty entry.

This is why the table behind a set needs to be kept quite empty, which wastes some memory.
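
To make the idea concrete, here is a toy hash table. It is a simplified sketch, not CPython's implementation: the class `ToyHashSet` is invented for illustration, it uses plain linear probing (try the next slot along) rather than CPython's actual probing pattern, it never resizes, and it cannot store `None` because `None` is used to mark empty slots.

```python
class ToyHashSet:
    """Fixed-size hash table using linear probing (no resizing or deletion)."""

    def __init__(self, n_slots=8):
        self.slots = [None] * n_slots      # None marks "not used yet"

    def _probe(self, value):
        """Yield slot indices to try, starting at the hashed position."""
        start = hash(value) % len(self.slots)
        for step in range(len(self.slots)):
            yield (start + step) % len(self.slots)

    def add(self, value):
        for i in self._probe(value):
            if self.slots[i] is None or self.slots[i] == value:
                self.slots[i] = value
                return
        raise RuntimeError('table is full')

    def __contains__(self, value):
        for i in self._probe(value):
            if self.slots[i] is None:      # empty entry: value is absent
                return False
            if self.slots[i] == value:     # found it
                return True
        return False


s = ToyHashSet()
for v in (3, 11, 19):      # all three collide on the same slot of an 8-slot table
    s.add(v)
print(s.slots)
print(11 in s, 27 in s)
```

Notice that looking up a colliding value has to walk past the earlier entries, which is why a crowded table gets slow.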

### Frozensets

Python frozensets are formed by using the `frozenset()` function on another collection or iterator.

They are the immutable equivalent of a set.
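
A short sketch of what immutability buys you here (the names `vowels` and `groups` are arbitrary): a frozenset is hashable, so it can be used as a dictionary key or as a member of another set, which an ordinary set cannot.

```python
vowels = frozenset('aeiou')

# No add/remove methods exist on a frozenset.
print(hasattr(vowels, 'add'))

# Being immutable makes frozensets hashable, so they can be
# members of other sets or keys of a dict; ordinary sets cannot.
groups = {frozenset({1, 2}): 'low', frozenset({8, 9}): 'high'}
print(groups[frozenset({2, 1})])

try:
    {set([1, 2]): 'low'}
except TypeError as err:
    print(f'TypeError: {err}')
```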

### Dictionaries (dicts)

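Python dictionaries are formed using `{}` brackets around comma-separated, colon-delimited pairs of objects. The `dict()` function creates an empty dictionary when called with no arguments, and can also build one from another mapping, from an iterable of key-value pairs, or from keyword arguments.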
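
For illustration, here are several equivalent ways of building the same small dictionary (the names `dict_a` etc. just mirror the list and tuple examples):

```python
dict_a = {'x': 1, 'y': 2}                    # literal syntax
dict_b = dict([('x', 1), ('y', 2)])          # from (key, value) pairs
dict_c = dict(x=1, y=2)                      # from keyword arguments
dict_d = dict(zip(['x', 'y'], [1, 2]))       # from two parallel sequences
dict_e = {k: v for k, v in dict_a.items()}   # dict comprehension

print(dict_a == dict_b == dict_c == dict_d == dict_e)
```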

Like sets, dicts are optimized for testing whether a key is in them.

In [13]:
def num(max=1_000_000):
    return random.randint(0, max)

# make a dictionary with random keys and values
D = { num():num() for _ in range(1000)}
# the same values, keyed by their position instead
seq_D = {i: val for i, val in enumerate(D.values())}

# get the equivalent list of keys
L = list(D)

x = 12345
%timeit x in D
%timeit x in L

Dictionary comprehensions are useful to build a dictionary quickly, but do not always give a speed benefit.

In [14]:
# Reversing a dictionary via a "for" loop versus a dict comprehension

def method_one(input):
    out = {}
    for key, val in input.items():
        out[val] = key
    return out

def method_two(input):
    return {val:key for key, val in input.items()}

In [15]:
test = {_:20000-_ for _ in range(10000)}
%timeit t1 = method_one(test)
%timeit t2 = method_two(test)

%timeit t3  = {val:key for key, val in test.items()}
    

297 µs ± 2.54 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
274 µs ± 1.34 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
272 µs ± 2.24 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)


#### Implementation

CPython implements dicts using hash tables, much like sets. Each entry now stores both the key and (the address of) the object storing the value.

There are some interesting differences, principally that a `dict` remembers the insertion order, whereas a `set` doesn't.
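
A small sketch of the ordering difference, using a made-up word list. Insertion order for dicts has been a language guarantee since Python 3.7, whereas the order you get from iterating a set should not be relied on:

```python
words = ['pear', 'apple', 'fig', 'apple', 'pear']

counts = {}
for w in words:
    counts[w] = counts.get(w, 0) + 1

print(list(counts))          # keys come back in first-insertion order

unique = set(words)
print(sorted(unique))        # a set has no defined order, so sort for a stable display
```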

## Extended types in standard library

### arrays

Python has a (rather limited) "array" type in the standard `array` module, which can be used to store collections of the same sort of object (e.g. all positive integers, or all floating point numbers). It's not very easy to use, and most people ignore it. Overall, I'd certainly agree with them.

#### Implementation

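Arrays differ from lists in that, rather than the block of memory storing the addresses of other Python objects (which then store the values), it stores the values themselves directly, packed one after another. This is why the array below uses less memory than the equivalent list.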

In [16]:
import array
import sys

a = [1,2,3,4,5]
b = array.array('b', a)

print(sys.getsizeof(a))
print(sys.getsizeof(b))

120
69


### numpy.ndarrays

The `numpy` package (remember, not in the standard Python libraries, but frequently available on many systems) provides easy-to-use, multidimensional arrays (i.e. 1D, 2D, 3D etc. up to 32 dimensional). Like the 1D arrays above, `numpy` `ndarrays` _must_ have all elements stored as the same type of data. If no datatype (also known as `dtype`) is specified at creation, then `numpy` chooses a type which supports all the data in the input.

In [17]:
import numpy as np

a1 = np.array([0, 1, 2])
print(a1, a1.dtype)
a2 = np.array([0, 1, 2.5])
print(a2, a2.dtype)
a3 = np.array([0, 'hat', sum])
print(a3, a3.dtype)

[0 1 2] int64
[0.  1.  2.5] float64
[0 'hat' <built-in function sum>] object


In [19]:
a1 = np.array([0, 1, 2])
print(a1, a1.dtype)
a1[1] = 2.5
print(a1, a1.dtype)
a4 = a1 + 0.5
print(a4, a4.dtype)

[0 1 2] int64
[0 2 2] int64
[0.5 2.5 2.5] float64


### pandas Series / DataFrames

Look back at Tuesday's lecture for more info here.

<img src="./images/choose_dtype.png">

## Python Classes

Python allows programmers to "build your own" data structures, called _classes_. This follows a world-view called "object oriented programming".

Classes allow you to create your own holders for data, plus the functions (methods) which you need to modify it.

As you saw in the presessional Python material, a class (say, `MyClass`) is defined using a syntax like this:

In [20]:
class MyClass:
    """This is my cool class."""
    
    mylist = []

    def my_method(self, a, b):
        """Add two values, store the result, then return the product"""
        self.val = a + b
        return a*b

With the above definition, we can create a new object of our class (an _instance_):
```python
cls = MyClass()
```
We can then use `cls.my_method(1, 3)` to do the same as `MyClass.my_method(cls, 1, 3)`.

Note that attributes at class level are the _same_ object for all instances.

Attributes set at the instance level can differ from each other:

In [21]:
# Get two instances
cls1 = MyClass()
cls2 = MyClass()

In [22]:
# Attributes created on the class are the same variable
cls1.mylist.append(1)
cls2.mylist.append(2)
print(f'cls1.mylist is cls2.mylist : {cls1.mylist is cls2.mylist}')

print(f'cls1.mylist: {cls1.mylist}')
print(f'cls2.mylist: {cls2.mylist}')

cls1.mylist is cls2.mylist : True
cls1.mylist: [1, 2]
cls2.mylist: [1, 2]


In [23]:
# Attributes created on instances can differ
cls1.my_method(1, 2)
cls2.my_method(4, 5)

print(f'cls1.val is cls2.val: {cls1.val is cls2.val}')

print(f'cls1.val: {cls1.val}')
print(f'cls2.val: {cls2.val}')

cls1.val is cls2.val: False
cls1.val: 3
cls2.val: 9


Mostly we want each instance to store its own data. We can use the special `__init__()` method to _initialize_ instance variables when the new object is created.

In [24]:
class Point3D:
    def __init__(self, x, y, z):
        self.x = x
        self.y = y
        self.z = z
        
pnt1 = Point3D(1, 1, 1)
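
This also gives a way around the shared `mylist` seen earlier. In the sketch below (the `Basket` class is invented for illustration) the list is created inside `__init__`, so every instance gets its own:

```python
class Basket:
    def __init__(self, owner):
        self.owner = owner
        self.items = []        # a fresh list for every instance

b1 = Basket('Ann')
b2 = Basket('Bob')
b1.items.append('apple')

print(b1.items, b2.items)
print(b1.items is b2.items)
```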

#### A note on `__init__` and `__new__`

As always, things are actually a little more complicated.
When a new instance is created, Python first calls the class's `__new__` method to create the object, and then calls `__init__` on the object that `__new__` returns.

Generally this is what we want to happen, so we can ignore `__new__` and just write the `__init__` method.

However, very rarely, you may want to write a `__new__` method to do something special.

In [54]:
# An example of a class with a __new__ method

import weakref

# weak references to every live Point3D; they do not keep the points alive
point_cache = []

class Point3D:

    def __new__(cls, *args):
        print('in __new__')
        point = super().__new__(cls) # call the original object creation method
        point_cache.append(weakref.proxy(point))
        return point

    def __init__(self, x, y, z):
        print('in __init__')
        self.x = x
        self.y = y
        self.z = z

    def __del__(self):
        print('in __del__')
        # drop this point's proxy from the cache
        point_cache.remove(self)

    def __repr__(self):
        return f'Point3D({self.x}, {self.y}, {self.z})'

In [55]:
point1 = Point3D(1, 1, 1)
print(point_cache)
print('------')
del point1  # the last reference is gone, so __del__ runs
print(point_cache)

in __new__
in __init__
[<weakproxy at 0x11496b590 to Point3D at 0x1149f24f0>]
------
in __del__
[]


As a note, the opposite of the `__new__` method is the `__del__` method. This gets called at the point the Python garbage collector believes the object is no longer needed, so the `__del__` method can be used to clean up anything special.

### Why use classes at all?

Classes (and class objects) are not a required part of a programming language, and many languages using other programming paradigms (for example, purely procedural languages such as C) have existed for a long time. However, an object-oriented approach works well at avoiding coding mistakes when combining structured collections of data and functions modifying or processing that data. It does this by binding the data and the methods together, making it much harder to apply them to the wrong thing.

### properties: combining class methods and attributes

Sometimes we want a quantity which "feels" like data (i.e. an _attribute_) but is computed or set by a function.

In [57]:
class Temperature:

    def __init__(self, temp):
        self.celsius = temp

    @property
    def fahrenheit(self):
        # computed from celsius on every access
        return self.celsius*9/5 + 32

    @fahrenheit.setter
    def fahrenheit(self, temp):
        # the setter must reuse the property's name, otherwise
        # assigning to t1.fahrenheit raises an AttributeError
        self.celsius = (temp-32)*5/9

In [58]:
t1 = Temperature(30)

print(t1.celsius, t1.fahrenheit)

t1.celsius = 20

print(t1.celsius, t1.fahrenheit)

t1.fahrenheit = 100.
print(t1.celsius, t1.fahrenheit)

30 86.0
20 68.0
37.77777777777778 100.0

### A note on overloading

Overloading means that functions (or operators) with the same name behave differently depending on the arguments passed to them.

Python classes have a large number of ["magic methods"](https://docs.python.org/3/reference/datamodel.html#specialnames). Remember you saw these in the 4th presessional lecture.

These can be used to implement all the regular Python operators:
(here `mycls` is an instance of the class `Mycls`)

- `mycls()` goes to `Mycls.__call__(self)`
- `print(mycls)` goes to `Mycls.__str__(self)`
- `mycls` (evaluated at the interactive prompt) goes to `Mycls.__repr__(self)`
- `mycls1 + mycls2` is a little more complicated - to some live coding!
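
The first three mappings can be sketched with a small class (`Greeter` is a made-up example, not part of the lecture code):

```python
class Greeter:
    def __init__(self, name):
        self.name = name

    def __call__(self, greeting='Hello'):
        # runs when an instance is called like a function
        return f'{greeting}, {self.name}!'

    def __str__(self):
        # used by print() and str()
        return f'greeter for {self.name}'

    def __repr__(self):
        # used at the interactive prompt and by repr()
        return f'Greeter({self.name!r})'

g = Greeter('Ada')
print(g())
print(g)
print(repr(g))
```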

#### rules of thumb for overloads

- Don't surprise people, e.g.
  - `+` should add, increase or append
  - `-` should subtract, remove or delete
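
One possible way to follow these rules, sketched with an invented `Vec2` class: `+` and `-` do element-wise arithmetic, and anything the class does not understand returns `NotImplemented`, so that Python raises an ordinary `TypeError` instead of doing something unexpected.

```python
class Vec2:
    def __init__(self, x, y):
        self.x = x
        self.y = y

    def __repr__(self):
        return f'Vec2({self.x}, {self.y})'

    def __eq__(self, other):
        if not isinstance(other, Vec2):
            return NotImplemented
        return (self.x, self.y) == (other.x, other.y)

    def __add__(self, other):
        if not isinstance(other, Vec2):
            return NotImplemented      # lets Python raise a clear TypeError
        return Vec2(self.x + other.x, self.y + other.y)

    def __sub__(self, other):
        if not isinstance(other, Vec2):
            return NotImplemented
        return Vec2(self.x - other.x, self.y - other.y)

a = Vec2(1, 2)
b = Vec2(3, 5)
print(a + b, b - a)
```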

## Summary of the day

- We've reviewed many of the builtin data types and their properties
- We've looked through a few extension types.
- We've revised classes, how to use them and _why_ to use them

In [73]:
class MyCoolClass(object):  # the explicit (object) base can be omitted in Python 3:
    # every class inherits the methods of `object` anyway

    def __init__(self, x, y):
        self.x = x
        self.y = y

    def __repr__(self):
        return f'MyCoolClass({self.x},{self.y})'

    def __str__(self):
        return f'x:{self.x}, y:{self.y}'

    def __add__(self, val):  # defines what `instance + val` means
        return MyCoolClass(self.x + val, self.y+val)

    def __radd__(self, val):  # called for `val + instance` when val cannot handle the addition
        return self + val

    def __getitem__(self,n):  # makes `instance[n]` work, much like a call but with [] instead of ()
        return self.x**n
    
my_cl = MyCoolClass(2,2)
my_cl

MyCoolClass(2,2)

In [74]:
print(my_cl+5)

x:7, y:7


In [75]:
print(5+my_cl) # without __radd__ this would raise a TypeError

x:7, y:7


In [76]:
print(my_cl[10])

1024
