# STA 141B Data & Web Technologies for Data Analysis

### Lecture 3, 10/5/23, Memory handling


### Announcements

 - First HW due tomorrow. 

### Today's topics

 - Basics of Python (cont'd)
 - Memory Handling in Python
     - Stack and Heap
     - Types
     - Reference Semantics
     - Interning

#### 3. Comprehensions and generators

A comprehension is a Python expression that transforms a sequence, element-by-element.

In [165]:
[x**2 for x in range(5)]

[0, 1, 4, 9, 16]

In [169]:
x = [4, 3, 1]

In [171]:
[y + 12 for y in x]

[16, 15, 13]

Think of this as Python's counterpart to R's `lapply`. You can include a condition in a comprehension:

In [172]:
y = [x**2 for x in range(11) if x % 2 == 0]
y

[0, 4, 16, 36, 64, 100]

You can also iterate over subelements.

In [None]:
x = [[1, 2, 3], [4, 5, 6]] # print 1, 2, 3, 4, 5, 6

In [None]:
# somewhat clumsy
for sublist in x:
    for elt in sublist:
        print(elt)

In [None]:
[y for sublist in x for y in sublist]

Note that `for sublist in x` is the outer loop and the inner loop follows to its right. In other words, the `for` clauses appear in the same order as in the equivalent nested loops, outermost first.
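The correspondence between nested loops and the order of `for` clauses can be sketched as follows. The variable names and the data are only illustrative.

```python
# Illustrative data, same shape as the example above
nested = [[1, 2, 3], [4, 5, 6]]

# Explicit nested loops: outer loop first, inner loop second
flat_loop = []
for sublist in nested:
    for elt in sublist:
        flat_loop.append(elt)

# Comprehension: the `for` clauses appear in the same order, left to right
flat_comp = [elt for sublist in nested for elt in sublist]

print(flat_comp)               # [1, 2, 3, 4, 5, 6]
print(flat_loop == flat_comp)  # True
```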

A comprehension surrounded by `[ ]` is called a list comprehension and produces a <kbd>list</kbd>. A comprehension surrounded by `{ }` and including `:` is called a dictionary comprehension and produces a <kbd>dict</kbd>. A comprehension surrounded by `{ }` without `:` is called a set comprehension and produces a <kbd>set</kbd>.

In [None]:
x = ["hello", "goodbye"]

lens = {len(name): name for name in x} # map each length to a name of that length
lens

Remember that a <kbd>dict</kbd> cannot hold duplicate keys and a <kbd>set</kbd> cannot hold duplicate items, but a <kbd>list</kbd> can.

In [None]:
{x**2 for x in [-1, 0, 1]} # set; uniqueness is checked with ==, not with is
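As a small sketch of how duplicates are handled, the example below uses made-up words in which two entries have the same length. In the dictionary the later entry overwrites the earlier one. In the set, values that compare equal collapse into a single item.

```python
words = ["hello", "world", "goodbye"]  # illustrative data

# "hello" and "world" both have length 5, so the later one replaces the earlier
by_length = {len(w): w for w in words}
print(by_length)  # {5: 'world', 7: 'goodbye'}

# 1, 1.0 and True compare equal (and hash equally), so only one item survives
unique = {1, 1.0, True}
print(len(unique))  # 1
```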

There's no such thing as a tuple comprehension. Instead, a comprehension surrounded by `( )` is called a generator expression.

In [None]:
y = (x**2 for x in range(1001) if x % 2 == 0)
type(y)

In [None]:
import sys
sys.getsizeof(y)

In [None]:
sys.getsizeof([x**2 for x in range(1001) if x % 2 == 0]) # produces a list, i.e., is evaluated

Operating on a generator forces its evaluation. 

In [None]:
sum(y)

The loop below prints nothing, because *a generator can only be used once*: `sum(y)` has already iterated through `y`, so it is exhausted. Because a generator never holds all of its elements at once, it needs *much* less memory than a <kbd>list</kbd>.

In [None]:
for i in y:
    print(i, end=" ")

In [None]:
y = (x**2 for x in range(101) if x % 2 == 0)

In [None]:
for i in y:
    print(i, end=" ")
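The exhaustion of a generator can be condensed into a few lines. This is a minimal sketch with a short, illustrative range.

```python
squares = (n**2 for n in range(5))

first_pass = sum(squares)    # consumes all elements: 0 + 1 + 4 + 9 + 16
second_pass = sum(squares)   # nothing left to iterate over
leftover = list(squares)     # still nothing

print(first_pass, second_pass, leftover)  # 30 0 []

# To iterate again, a new generator has to be created
squares = (n**2 for n in range(5))
print(list(squares))  # [0, 1, 4, 9, 16]
```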

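The difference also shows when we time operations. Note that the second statement only creates the generator object and does not compute any of its elements.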

In [None]:
import timeit

In [None]:
print(timeit.timeit('''list_com = [i for i in range(100) if i % 2 == 0]''', number=1000000))
print(timeit.timeit('''gen_exp = (i for i in range(100) if i % 2 == 0)''', number=1000000))
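To compare the cost of actually producing the values, both variants can be consumed, here with `sum`. This is only a sketch: the helper names are made up, and the timings depend on the machine and Python version, so no expected numbers are given. For inputs this small the list variant may well be as fast or faster; the generator's advantage is memory, not necessarily speed.

```python
import timeit

def total_via_list():
    # Builds the full list first, then sums it
    return sum([i for i in range(100) if i % 2 == 0])

def total_via_generator():
    # Feeds the elements to sum one at a time
    return sum(i for i in range(100) if i % 2 == 0)

# Both variants compute the same value
print(total_via_list() == total_via_generator())

print("list:     ", timeit.timeit(total_via_list, number=10_000))
print("generator:", timeit.timeit(total_via_generator, number=10_000))
```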

A generator is a special kind of iterator which computes its elements on demand; generator expressions are one example. A <kbd>range</kbd> is lazy in the same way, but it is a sequence rather than a generator and can be iterated over repeatedly. 
Generators are especially useful for working with data that are __too large__ to fit in memory. While making a huge list (say $10^9$ elements) might use enough memory to crash Python, making a generator with the same number of elements uses almost no memory. See more examples [here](https://zacks.one/python-generators/). 

Python's `itertools` module has functions for manipulating generators and iterable objects.
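Two of these functions, `itertools.count` and `itertools.islice`, are enough to sketch how an endless stream can be handled lazily. `itertools.chain` is shown as a second example. The variable names are illustrative.

```python
import itertools

# An endless stream of squares; nothing is computed until it is requested
squares = (n**2 for n in itertools.count())

# islice takes the first five elements without touching the rest
first_five = list(itertools.islice(squares, 5))
print(first_five)  # [0, 1, 4, 9, 16]

# chain links several iterables lazily, without building an intermediate list
combined = list(itertools.chain(range(3), "ab"))
print(combined)  # [0, 1, 2, 'a', 'b']
```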

### Stack and Heap

In [None]:
x = True
type(x)

`x` is a variable, which corresponds to a <kbd>bool</kbd> object with value `True`. The variable itself merely holds a reference to a specific object. This reference is stored in local memory (the *stack*). The Python interpreter takes care of allocating stack memory, so we don't have to do that. 

The <kbd>bool</kbd> object and its value are stored in dynamically allocated memory (the *heap*). We can inspect the address of the object on the heap (which is what the reference on the stack holds) with `id`; in CPython, `id` returns the object's memory address: 

In [None]:
hex(id(x)) 

In [None]:
y = float(x)
hex(id(y))

In Python, we can change the type of a variable by rebinding it to an object of a different type.

In [None]:
hex(id(x))

In [None]:
x = int(x)
type(x)

In [None]:
hex(id(x))

<img src="../images/memory1.png" alt="" width="1000"/>

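As soon as no variable references the <kbd>bool</kbd> object any more (because `x` was deleted or rebound to another object), the object on the heap can be reclaimed by the garbage collector and its memory reused. (Built-in singletons such as `True` are an exception: the interpreter itself keeps them alive.) 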
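The lifetime of an object can be observed with a weak reference, which points to an object without keeping it alive. The sketch below is not part of the lecture code: the class `Payload` is a made-up stand-in, and the immediate reclamation shown here is a property of CPython's reference counting that other implementations need not share.

```python
import weakref

class Payload:
    """Hypothetical stand-in for any object on the heap."""

obj = Payload()
probe = weakref.ref(obj)  # observes the object without keeping it alive
alias = obj               # a second reference to the same object

del obj
alive_with_alias = probe() is not None
print(alive_with_alias)    # True: `alias` still references the object

del alias
alive_without_refs = probe() is not None
print(alive_without_refs)  # False in CPython: no references are left
```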



Let's work through the phrase *Everything in Python is an object*. Some basic built-in object *types* we have already met are 

- Numeric: <kbd>int</kbd>, <kbd>float</kbd>, <kbd>complex</kbd>
- Boolean: <kbd>bool</kbd>
- String: <kbd>str</kbd>
- Sequence: <kbd>list</kbd>, <kbd>tuple</kbd>, <kbd>range</kbd>
- Mapping: <kbd>dict</kbd>

The function `sys.getsizeof` ([docs](https://docs.python.org/3/library/sys.html?highlight=getsizeof#sys.getsizeof)) returns the size in bytes of the object the variable points to. 

In [None]:
import sys
sys.getsizeof(x)

In [None]:
sys.getsizeof(y)

A <kbd>float</kbd> is less expensive than an <kbd>int</kbd>. This is because an <kbd>int</kbd> has arbitrary precision and stores information about its size together with the actual value, whereas a <kbd>float</kbd> always has the same fixed size. The larger the integer, the more memory it requires. 

In [None]:
sys.getsizeof(100 ** 10)

In [None]:
sys.getsizeof(100.0 ** 10)

However, an <kbd>int</kbd> can store larger values than a <kbd>float</kbd>. 

In [None]:
x = 500 ** 500 
type(x)

In [None]:
x

In [None]:
sys.getsizeof(x)

In [None]:
float(x) # raises OverflowError: the value exceeds the float range
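The two observations, growing integer size and the limited float range, can be put side by side. This is a minimal sketch; the exact byte counts depend on the Python build, so only their ordering is relied upon.

```python
import sys

# Integer objects grow with the magnitude of the value
sizes = [sys.getsizeof(10**k) for k in (0, 10, 100, 1000)]
print(sizes)

# Float objects all have the same size, whatever the value
float_sizes = {sys.getsizeof(10.0**k) for k in (0, 10, 100, 300)}
print(float_sizes)

# Converting a very large int to float fails instead of losing the value silently
try:
    float(500**500)
    overflowed = False
except OverflowError:
    overflowed = True
print(overflowed)          # True
print(sys.float_info.max)  # largest representable float
```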

The function `range(start, stop, step)` ([docs](https://docs.python.org/3/library/stdtypes.html#range)) creates a <kbd>range</kbd> object. It starts at `start` and stops before reaching `stop` (so for `step = 1` the last element is `stop - 1`), but it does not store its elements in memory; they are computed on demand. 

In [None]:
x = range(0, 500**500)
sys.getsizeof(x)

In [None]:
sys.getsizeof(500**500)
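Although a <kbd>range</kbd> never stores its elements, it still behaves like a sequence. The sketch below assumes a 64-bit Python build, where `len` can handle a range of $10^{12}$ elements.

```python
import sys

r = range(0, 10**12)

print(len(r))        # 1000000000000, computed from start, stop and step
print(r[10], r[-1])  # 10 999999999999, indexing works without a list
print(10**11 in r)   # True, membership is computed arithmetically

# The range object is smaller than even a modest list
print(sys.getsizeof(r) < sys.getsizeof(list(range(1000))))  # True

# Unlike a generator, a range can be iterated over repeatedly
reusable = range(3)
print(list(reusable), list(reusable))  # [0, 1, 2] [0, 1, 2]
```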

A <kbd>tuple</kbd> is an ordered collection of values. Think of coordinates. A <kbd>tuple</kbd> is immutable, which means it can't be changed after it is created.

In [None]:
x = 1, 3.0, "horse" # parentheses are optional, but should be used for clarity 
x

In [None]:
type(x)

In [None]:
sys.getsizeof(x)

A <kbd>tuple</kbd> is immutable: once created, it can't be changed, and trying to assign to one of its elements raises a `TypeError`.

In [None]:
try:
    x[2] = 'horsies'
except TypeError:
    print('Tuples are immutable!')

This is a feature, not a shortcoming, of <kbd>tuple</kbd>. Since tuples can neither be changed nor appended to, they are more economical than <kbd>list</kbd>. A <kbd>list</kbd> is the mutable counterpart of a <kbd>tuple</kbd>. Lists are instantiated with square brackets. 

In [None]:
y = [1, 3.0, "horse"]
y

In [None]:
type(y)

In [None]:
sys.getsizeof(y)

Lists are mutable, and in particular appendable. Since these actions are allowed, <kbd>list</kbd> objects require more memory. The value returned by `sys.getsizeof` does not include the sizes of the values in the list! Instead, `y` is a variable with a reference to a <kbd>list</kbd> object on the heap, *which itself is a collection of addresses*. This collection of addresses takes $120$ bytes here; the exact number depends on the Python version and platform. 

In [None]:
sys.getsizeof(y)

In [None]:
sum([sys.getsizeof(i) for i in y])

In [None]:
sys.getsizeof(1) + sys.getsizeof(3.0) + sys.getsizeof("horse")
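That a list only stores addresses can be checked by comparing two lists of equal length whose elements differ greatly in size. The helpers `make_list` and `total_size` are made up for this sketch, and the comparison assumes CPython, where each list slot holds one pointer.

```python
import sys

def make_list(a, b, c):
    # Building both lists the same way keeps their allocation comparable
    return [a, b, c]

def total_size(items):
    """Size of the container plus the size of each element (one level deep)."""
    return sys.getsizeof(items) + sum(sys.getsizeof(e) for e in items)

small = make_list(1, 2, 3)
large = make_list("x" * 1000, "y" * 1000, "z" * 1000)

# The list objects themselves are equally large ...
print(sys.getsizeof(small) == sys.getsizeof(large))  # True
# ... the difference lies in the objects they point to
print(total_size(small) < total_size(large))         # True

# A tuple with the same elements needs less memory than the list
print(sys.getsizeof((1, 3.0, "horse")) < sys.getsizeof([1, 3.0, "horse"]))  # True
```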

In contrast to a <kbd>tuple</kbd>, a <kbd>list</kbd> is mutable. 

In [None]:
y[2] = "horsies"
y

### Reference Semantics

Lists use *reference semantics*, which means that if you assign a list to two different variables, there's still only one list in memory, and both variables refer to it. As a result, changing the list with one variable changes the list for the other variable.

In [None]:
x = y

In [None]:
hex(id(x))

In [None]:
hex(id(y))

In [None]:
x[0] = "my"
y
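The same aliasing happens when a list is passed to a function: the parameter is just another name for the same object. A minimal sketch with made-up names:

```python
original = [1, 3.0, "horse"]
alias = original              # no copy: both names refer to one list

alias.append("cart")
print(original)               # [1, 3.0, 'horse', 'cart']
print(alias is original)      # True

def add_marker(items):
    # `items` refers to the caller's list, so the change is visible outside
    items.append("seen")

add_marker(original)
print(alias)                  # [1, 3.0, 'horse', 'cart', 'seen']
```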

A new, independent list object can be created by slicing. 

In [None]:
z = y[:]

In [None]:
hex(id(z))

<img src="../images/memory2.png" alt="" width="1000"/>

Alternatively, we can call the `copy` method of the original list (see also the `copy` module [docs](https://docs.python.org/3/library/copy.html)). 

In [None]:
z = y.copy()
hex(id(z))

While the copies `y` and `z` are *equal*, they are not *identical*, because they point to different objects. 

In [None]:
y == z # equal

In [None]:
y is z # identical

In [None]:
y is x # identical

In [None]:
y[1] = 2
print(y)
print(z) 

Attention! This is a *shallow copy*, i.e., objects within the list are not copied; only the references to them are. Above, the command `y[1] = 2` does not touch `z`: it just replaces the former reference in `y` with a reference to an <kbd>int</kbd> object of value `2` on the heap. 

In [None]:
hex(id(z[1])) == hex(id(y[1]))

This becomes tricky if the list contains a reference to another list: 

In [None]:
a = ['a', 'list']

In [None]:
y = [1, 2, 'three', a]

In [None]:
z = y.copy()

In [None]:
print(y)
print(z)

In [None]:
hex(id(y[0]))

In [None]:
hex(id(z[0]))

In [None]:
z[0] = 1

In [None]:
y

In [None]:
y[3][1] = 'ha'

In [None]:
print(y)
print(z)

In [None]:
hex(id(z[3])) == hex(id(y[3])) 

Although `y` and `z` are separate list objects, they both reference the same inner list `a`, which has not been copied. 

In [None]:
hex(id(z[3])) == hex(id(y[3]))

This behaviour does not depend on the variable `a`. We can remove it from the scope. Since the list object that `a` pointed to is still referenced by `y` and `z`, it will not be reclaimed by the garbage collector. 

In [None]:
hex(id(a))

In [None]:
del a

In [None]:
hex(id(z[3]))
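The whole sequence can be condensed into one self-contained sketch. The names `inner`, `outer` and `shallow` are illustrative and replace `a`, `y` and `z` from the cells above.

```python
inner = ["a", "list"]
outer = [1, 2, "three", inner]
shallow = outer.copy()

# Top-level slots are independent ...
shallow[0] = "changed"
# ... but the nested list is shared between both copies
outer[3][1] = "ha"

print(outer)                   # [1, 2, 'three', ['a', 'ha']]
print(shallow)                 # ['changed', 2, 'three', ['a', 'ha']]
print(shallow[3] is outer[3])  # True

# Deleting the name does not delete the object: the lists still reference it
del inner
print(shallow[3])              # ['a', 'ha']
```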

We can copy the nested lists as well by calling `copy.deepcopy`. 

In [None]:
from copy import deepcopy
z = deepcopy(y)

In [None]:
hex(id(z[3])) == hex(id(y[3]))

While `y` and its deep copy `z` are *equal*, they are not *identical*, because they point to different objects. 

In [None]:
y == z # equal

In [None]:
y is z # identical
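A short sketch of the effect, using illustrative names: after a deep copy, changing the nested list of the original no longer affects the copy.

```python
from copy import deepcopy

outer = [1, 2, "three", ["a", "list"]]
deep = deepcopy(outer)

equal_before = deep == outer
print(equal_before)         # True: same values
print(deep[3] is outer[3])  # False: the nested list was copied too

outer[3][1] = "ha"
print(outer)                # [1, 2, 'three', ['a', 'ha']]
print(deep)                 # [1, 2, 'three', ['a', 'list']]
```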

### Interning 

The heap is memory that a program can reserve and release at run time. Doing this by hand is tedious, so Python does it automatically. To optimize this process, Python uses *interning*, i.e., it reuses an existing object instead of allocating a new one with the same value. Since `x` is merely a pointer to the <kbd>int</kbd> object with value `1`, any other variable can point to the same address.  

In [None]:
x = 1

In [None]:
y = 1

In [None]:
hex(id(x)) == hex(id(y))

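This does not mean that changing `y` changes `x`! Integers are immutable, so assigning a new value to `y` rebinds `y` to a different object and leaves `x` untouched. 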

In [None]:
y = 2
x

In [None]:
hex(id(x)) == hex(id(y))

In CPython, integers are only interned from `-5` to `256`. Other values, such as larger integers or floats, are not guaranteed to share one object. 

In [None]:
x = 21.0
y = 21.0
hex(id(x)) == hex(id(y))
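The limits of the small-integer cache can be probed by creating each value twice at run time, which avoids the sharing of literals that the compiler may introduce within a single cell. The helper `shares_object` is made up for this sketch, and the results describe CPython; other implementations may behave differently.

```python
def shares_object(n):
    """Create the integer n twice at run time and test for identity."""
    first = int(str(n))
    second = int(str(n))
    return first is second

for n in (-6, -5, 0, 256, 257):
    print(n, shares_object(n))  # CPython: True only for -5, 0 and 256

# Floats created at run time are separate objects in CPython
floats_shared = float("21.0") is float("21.0")
print(floats_shared)            # False
```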

Interning works for several simple types: 

In [None]:
x = "Hi"
y = "Hi"

In [None]:
hex(id(x)) == hex(id(y))

Interning can be forced using `sys.intern`. 

In [None]:
a = "This is quite a long string."
b = "This is quite a long string."

In [None]:
hex(id(a)) == hex(id(b))

In [None]:
b = sys.intern(a)

In [None]:
hex(id(a)) == hex(id(b))

In [None]:
print(a)
print(b)

In [None]:
a = sys.intern("This is quite a long") # alternative 
b = a
hex(id(a)) == hex(id(b))

For recurring data, interning allows us to use the heap economically. 
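The effect of `sys.intern` is easiest to see on strings that are built at run time, since those are separate objects even when their values agree. A minimal sketch with illustrative names, as observed in CPython:

```python
import sys

parts = ["This is quite ", "a long string."]
a = "".join(parts)  # built at run time
b = "".join(parts)  # equal value, but a separate object

print(a == b)       # True
print(a is b)       # False

# After interning, both names refer to the single interned object
a = sys.intern(a)
b = sys.intern(b)
print(a is b)       # True
```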

### Summary 

- There is stack and heap memory
- All objects are stored on the heap
- Lists are versatile, but use more memory than tuples or generators
- Optimize heap usage via interning