# STA 220 Data & Web Technologies for Data Analysis

### Lecture 1, 1/9/24, Memory handling


### Today's topics
 - Course outline
 - Memory Handling in Python
     - Stack and Heap
     - Types
     - Reference Semantics
     - Interning

### Course Outline

In this course we will learn how to acquire and process publicly available information from the internet. To this end, we will inspect the structure of websites, monitor their communication with servers, and retrieve the data we are interested in. The acquired data is then processed for further analysis, i.e., brought into tidy form, and visualized using static and interactive methods. Natural language data is processed to obtain insights into its topics or sentiments.


The final grade is determined by 
- homework assignments (40%),
- project due March 20th (60%).

For comprehensive and updated information about the course, please consult [Canvas](https://canvas.ucdavis.edu/courses/866299).  

The project will be collaborative work with two to three group members. You will use the methods learned in this class to procure a data set, preferably from multiple sources, and process it to make it accessible for further investigation. This involves displaying its properties by visual means, so that statistical hypotheses can be formed.

All material of this course will be made available online on [GitHub](https://github.com/kramlinger/STA220_WQ24). Use [Piazza](https://piazza.com/class/lr575o6i3nk330/) for any inquiries regarding organization, homework, or lectures. We will monitor this site M-F during business hours. Please do not write emails! Screen recordings and all other class-related administrative information will be made available on Canvas.

Office hours: 
* Peter Kramlinger: T 11-12 AM, MSB 1143
* Sophia Sun, t.b.a.

#### Ethics

This is a programming class. Using assistance is part of programming and is encouraged. This can be AI based, or from online sources (e.g., [stackoverflow](https://stackoverflow.com/questions)). 

However, you will be graded on your proficiency in coding. In all assignments, make sure that you display your own contribution. Submitting AI-generated code, answers from online sources, or even classmates' solutions will not be enough to pass the course. Furthermore, if you pass off someone else's work as your own, then you are engaging in academic misconduct.


### Stack and Heap

In [1]:
x = True
type(x)

bool

`x` is a variable that refers to a <kbd>bool</kbd> object with value `True`. The variable itself merely holds a reference to a specific object. This reference is stored in local memory (the *stack*). The Python interpreter takes care of allocating stack memory; we don't have to do that.

The <kbd>bool</kbd> object and its value are stored in random access memory (RAM, the *heap*). We can look up the address of the object on the heap, which in CPython is exactly what the reference on the stack holds:

In [2]:
hex(id(x))

'0x1011e7530'

In [3]:
y = float(x)
hex(id(y))

'0x7facf8bff2d0'

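In Python, we can change the type of a variable by assigning it an object of a different type. The variable then refers to a new object at a different address.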

In [4]:
hex(id(x))

'0x1011e7530'

In [5]:
x = int(x)
type(x)

int

In [6]:
hex(id(x))

'0x7fad2802e930'

<img src="../images/memory1.png" alt="" width="1000"/>

As soon as the variable `x`, which previously referenced the <kbd>bool</kbd> object, no longer refers to it (either by deletion or by reassignment), the object on the heap becomes eligible to be reclaimed by the garbage collector, provided nothing else still references it.
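As a minimal sketch of this behaviour, the following cell tracks an object with a weak reference, which does not keep the object alive. The class `Payload` and all variable names are hypothetical and only serve the illustration. The prompt reclamation relies on CPython's reference counting; other interpreters may free the object later.

```python
import gc
import weakref


class Payload:
    """Hypothetical stand-in for any object living on the heap."""


obj = Payload()
probe = weakref.ref(obj)           # weak reference: does not keep obj alive

alive_before = probe() is not None

del obj                            # remove the only strong reference
gc.collect()                       # not needed in CPython, harmless elsewhere

alive_after = probe() is not None
print(alive_before, alive_after)
```

Once the last strong reference is gone, calling `probe()` returns `None`, which signals that the object has been reclaimed.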



Let's work through the phrase: *Everything in Python is an object*. Some basic default objects (*types*) are 

- Numeric: <kbd>int</kbd>, <kbd>float</kbd>, <kbd>complex</kbd>
- Boolean: <kbd>bool</kbd>
- String: <kbd>str</kbd>
- Sequence: <kbd>list</kbd>, <kbd>tuple</kbd>, <kbd>range</kbd>
- Mapping: <kbd>dict</kbd>

The function `sys.getsizeof` ([docs](https://docs.python.org/3/library/sys.html?highlight=getsizeof#sys.getsizeof)) returns the size in bytes of the object the variable points to. 

In [7]:
import sys
sys.getsizeof(x)

28

In [8]:
sys.getsizeof(y)

24

In this example, a <kbd>float</kbd> is less expensive than an <kbd>int</kbd>. This is because an <kbd>int</kbd> stores additional information about its size together with the actual value. The larger the integer, the more memory is required.

In [9]:
sys.getsizeof(100 ** 10)

36

In [10]:
sys.getsizeof(100.0 ** 10)

24
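To make the growth visible, the sketch below measures a few powers of two as <kbd>int</kbd> and as <kbd>float</kbd>. The chosen exponents are arbitrary, and the exact byte counts depend on the Python version and platform, so none are quoted here.

```python
import sys

exponents = [0, 30, 60, 120, 240]
int_sizes = [sys.getsizeof(2 ** k) for k in exponents]
float_sizes = [sys.getsizeof(2.0 ** k) for k in exponents]

# The int column grows with the magnitude, the float column stays constant.
for k, i_size, f_size in zip(exponents, int_sizes, float_sizes):
    print(f"2**{k:<3}  int: {i_size:3d} bytes   float: {f_size} bytes")
```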

However, an <kbd>int</kbd> can store larger values than a <kbd>float</kbd>.

In [11]:
x = 500 ** 500 
type(x)

int

In [12]:
x

3054936363499604682051979393213617699789402740572326663893613909281291626524720457701857235108015228256875152693590467155317853427804283969735133114200917889630724420533772852222035588819531883700816508667930179487913663389937052516364978922702120035245082091219087448202119601494637211093403079855076782836518362040933993739599827677011489868164062500000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000

In [13]:
sys.getsizeof(x)

624

In [14]:
float(x)

OverflowError: int too large to convert to float
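The limit can be inspected with `sys.float_info.max`. The following sketch, with arbitrarily chosen test values, converts one integer just below that limit and catches the error for one above it.

```python
import sys

largest_float = sys.float_info.max   # largest finite float

ok = float(10 ** 308)                # still below the largest float

try:
    float(10 ** 309)                 # exceeds the largest float
    overflowed = False
except OverflowError:
    overflowed = True

print(largest_float, ok, overflowed)
```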

The function `range(start, stop, step)` ([docs](https://docs.python.org/3/library/stdtypes.html#range)) creates a <kbd>range</kbd> object. It starts at `start` and stops before reaching `stop` (at `stop - 1` for the default `step` of 1), but it does not store its elements in memory; they are computed on demand.

In [15]:
x = range(0, 500**500)
sys.getsizeof(x)

48

In [16]:
sys.getsizeof(500**500)

624
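A <kbd>range</kbd> still behaves like a sequence: length, indexing, and membership are computed from `start`, `stop`, and `step` alone. A minimal sketch with arbitrarily chosen bounds:

```python
import sys

r = range(0, 10 ** 12, 3)            # never materialised as a list

n_elements = len(r)                  # computed, not counted
last = r[-1]                         # indexing works without iterating
is_member = 999_999_999_999 in r     # arithmetic check for integers
same_size = sys.getsizeof(r) == sys.getsizeof(range(10))

print(n_elements, last, is_member, same_size)
```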

A <kbd>tuple</kbd> is an ordered collection of values. Think of coordinates. A <kbd>tuple</kbd> is immutable, which means it can't be changed after it is created.

In [18]:
x = (1, 3.0, "horse") # parentheses are optional, but should be used for clarity 
x

(1, 3.0, 'horse')

In [19]:
type(x)

tuple

In [20]:
sys.getsizeof(x)

64

A <kbd>tuple</kbd> is immutable. As stated above, once created, it can't be changed!

In [21]:
x[2] = 'horsies' 

TypeError: 'tuple' object does not support item assignment

In [23]:
try:
    x[2] = 'horsies'
except TypeError:
    print('Tuples are immutable!')

Tuples are immutable!
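Immutability concerns the references stored in the tuple, not the objects they point to. The sketch below uses a hypothetical tuple `point` whose last element is a list: the list can still be changed in place, but the slot itself cannot be reassigned.

```python
point = (1, 3.0, ["horse"])          # the last element is a mutable list

inner_id_before = id(point[2])
point[2].append("pony")              # allowed: the list changes, not the tuple
inner_id_after = id(point[2])

try:
    point[2] = ["donkey"]            # not allowed: would replace a reference
    reassigned = True
except TypeError:
    reassigned = False

print(point, inner_id_before == inner_id_after, reassigned)
```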


Immutability is a feature, not a shortcoming, of <kbd>tuple</kbd>. Since tuples can be neither changed nor appended to, they are more economical than <kbd>list</kbd>. <kbd>list</kbd> is the mutable counterpart of <kbd>tuple</kbd>. Lists are instantiated with square brackets.

In [24]:
y = [1, 3.0, "horse"]
y

[1, 3.0, 'horse']

In [25]:
type(y)

list

In [26]:
sys.getsizeof(y)

120

Lists are mutable, and in particular appendable. Since these actions are allowed, <kbd>list</kbd> objects require more memory. The value returned by `sys.getsizeof` does not include the sizes of the values in the list! Instead, `y` is a variable with a reference to a <kbd>list</kbd> object on the heap, *which itself is a collection of addresses*. This collection of addresses takes $120$ bytes.

In [27]:
sys.getsizeof(y)

120

In [28]:
sum([sys.getsizeof(i) for i in y])

106

In [29]:
sys.getsizeof(1) + sys.getsizeof(3.0) + sys.getsizeof("horse")

106
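To estimate the full footprint, the size of the list object and the sizes of its elements can be added. The function `total_size` below is a hypothetical helper, not part of the standard library, and it is only a rough sketch: it ignores nested containers and counts an element twice if it appears twice.

```python
import sys


def total_size(container):
    """Hypothetical helper: size of a flat container plus its elements."""
    return sys.getsizeof(container) + sum(sys.getsizeof(e) for e in container)


sample = [1, 3.0, "horse"]
shell = sys.getsizeof(sample)                     # the collection of addresses
elements = sum(sys.getsizeof(e) for e in sample)  # the referenced objects

print(shell, elements, total_size(sample))
```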

In contrast to a <kbd>tuple</kbd>, however, a <kbd>list</kbd> is mutable.

In [30]:
y[2] = "horsies"
y

[1, 3.0, 'horsies']
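Part of the extra memory of a <kbd>list</kbd> comes from reserving room for future appends. The following sketch records the size of a list while it grows. That the size increases in occasional jumps rather than with every append is a CPython implementation detail, and the exact values vary between versions.

```python
import sys

items = []
sizes = [sys.getsizeof(items)]
for i in range(32):
    items.append(i)
    sizes.append(sys.getsizeof(items))

# Far fewer distinct sizes than appends: space is reserved in advance.
distinct_sizes = sorted(set(sizes))
print(distinct_sizes)
```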

### Reference Semantics

Lists use *reference semantics*, which means that if you assign a list to two different variables, there's still only one list in memory, and both variables refer to it. As a result, changing the list with one variable changes the list for the other variable.

In [37]:
x = y

In [32]:
hex(id(x))

'0x7fad08eba4c0'

In [34]:
hex(id(y))

'0x7fad08eba4c0'

In [42]:
x is y 

True

In [43]:
x[0] = "my"
y

['my', 3.0, 'horsies']
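The same mechanism applies when a list is passed to a function: the function receives a reference to the very same object. A minimal sketch, where `add_item` and the variable names are hypothetical:

```python
def add_item(target, item):
    """Hypothetical helper that appends to the list it is given."""
    target.append(item)              # mutates the caller's list in place


stable = ["my", 3.0, "horsies"]
alias = stable                       # second name for the same list object

add_item(alias, "hay")
print(stable)                        # the change is visible under both names
```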

A new, independent list object can be created by slicing.

In [44]:
z = y[:]

In [49]:
hex(id(z))

'0x7facf8c099c0'

In [48]:
hex(id(y))

'0x7fad08eba4c0'

In [46]:
z is y

False

In [50]:
z

['my', 3.0, 'horsies']

In [51]:
z[1] = 3

In [52]:
y

['my', 3.0, 'horsies']

In [53]:
z

['my', 3, 'horsies']

<img src="../images/memory2.png" alt="" width="1000"/>

Alternatively, we can call the `copy` method of the original list (see also the `copy` module [docs](https://docs.python.org/3/library/copy.html)).

In [54]:
z = y.copy()
hex(id(z))

'0x7face80f8b80'

In [56]:
hex(id(y))

'0x7fad08eba4c0'

In [58]:
y == z

True

In [59]:
y is z

False

While the copies `y` and `z` are *equal*, they are not *identical*, because they point to different objects.

In [None]:
y == z # equal

In [None]:
y is z # identical

In [None]:
y is x # identical

In [60]:
y[1] = 2
print(y)
print(z) 

['my', 2, 'horsies']
['my', 3.0, 'horsies']


Attention! This is a *shallow copy*, i.e., the objects within the list are not copied; both lists initially hold references to the same element objects. Above, the command `y[1] = 2` merely replaces the former reference in `y` with a reference to the <kbd>int</kbd> object of value `2`, while `z` keeps its reference to the old element.

In [62]:
hex(id(z[1])) == hex(id(y[1]))

False

In [65]:
z[0] is y[0]

True
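The sharing of elements can be checked for all positions at once. The sketch below uses hypothetical names `original` and `shallow` so that it does not interfere with the variables above.

```python
original = ["my", 3.0, "horsies"]
shallow = original.copy()

# Same element objects, but two different list objects.
shares_elements = all(a is b for a, b in zip(original, shallow))
same_container = shallow is original

original[1] = 2                      # rebinds one slot in one list only
print(shares_elements, same_container, original, shallow)
```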

This becomes tricky if the list contains a reference to another list:

In [66]:
a = ['a', 'list']

In [67]:
y = [1, 2, 'three', a]

In [68]:
z = y.copy()

In [69]:
hex(id(y))

'0x7face80fa0c0'

In [70]:
hex(id(z))

'0x7fad286dbc00'

In [71]:
hex(id(y[3]))

'0x7face80ff400'

In [76]:
hex(id(z[3]))

'0x7face80ff400'

In [75]:
del a

In [77]:
z[0] = 3

In [78]:
y

[1, 2, 'three', ['a', 'list']]

In [79]:
z

[3, 2, 'three', ['a', 'list']]

In [82]:
y[3][1] = 'ha'

In [83]:
print(y)
print(z)

[1, 2, 'three', ['a', 'ha']]
[3, 2, 'three', ['a', 'ha']]


In [81]:
hex(id(z[3])) == hex(id(y[3])) 

True

Although `y` and `z` are separate list objects, they both reference the same inner list `a`, which has not been copied.

In [84]:
hex(id(z[3])) == hex(id(y[3]))

True

This behaviour is independent of the variable `a`. We can remove it from the scope. Since the list object that `a` pointed to is still referenced by `y` and `z`, it will not be reclaimed by the garbage collector.

In [None]:
hex(id(a))

In [None]:
del(a)

In [None]:
hex(id(z[3]))

We can copy the nested lists as well by calling `copy.deepcopy`.

In [129]:
y = [2, 2.0, [2, 'hi']]

In [130]:
from copy import deepcopy
z = deepcopy(y)

In [131]:
y

[2, 2.0, [2, 'hi']]

In [132]:
y[0] is z[0]

True

In [133]:
y[1] is z[1]

True

In [134]:
y[2] is z[2]

False

While the copies `y` and `z` are *equal*, they are not *identical*, because they point to different objects.

In [98]:
y == z # equal

True

In [99]:
y is z # identical

False
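The practical difference between the two kinds of copy shows up when the inner list is changed in place. A minimal side-by-side sketch, with hypothetical variable names:

```python
import copy

nested = [2, 2.0, [2, "hi"]]
shallow = copy.copy(nested)          # new outer list, shared inner list
deep = copy.deepcopy(nested)         # new outer list and new inner list

nested[2].append("changed")          # mutate the inner list in place

print(shallow[2])                    # follows the change
print(deep[2])                       # unaffected
```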

### Interning 

Heap memory is memory that a program can reserve and access at run time. Managing it by hand is tedious, so Python does it automatically. To use the heap economically, Python applies *interning*: equal immutable values may be stored only once and shared. Since `x` is merely a reference to the <kbd>int</kbd> object with value `1`, any other variable can point to the same address.

In [100]:
x = 1

In [101]:
y = 1

In [102]:
hex(id(x)) == hex(id(y))

True

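This does not mean that integers use reference semantics in the way lists do! Integers are immutable, so assigning to `y` only rebinds `y` and can never change `x`.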

In [105]:
y = 1
x

1

In [106]:
x is y

True

In CPython, integer interning is only done for the values from `-5` to `256`. Other numbers, such as the floats below, are not interned.

In [127]:
x = 200.0
y = 200.0
x is y  # hex(id(x)) == hex(id(y))

False
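The boundary of the integer cache can be probed by creating the integers at run time, which avoids any sharing of literals by the compiler. This sketch relies on a CPython implementation detail and may behave differently on other interpreters. It also shows why values should be compared with `==` rather than `is`.

```python
small_a, small_b = int("256"), int("256")
large_a, large_b = int("257"), int("257")

small_shared = small_a is small_b    # CPython: both names hit the cache
large_shared = large_a is large_b    # CPython: two separate objects
values_equal = large_a == large_b    # value comparison is always reliable

print(small_shared, large_shared, values_equal)
```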

Interning also applies to some other simple objects, for example short strings:

In [110]:
x = "Hi"
y = "Hi"

In [111]:
x is y

True

Interning can be forced using `sys.intern`. 

In [112]:
a = "This is quite a long string."
b = "This is quite a long string."

In [113]:
a is b 

False

In [114]:
a = sys.intern("This is quite a long string.")
b = sys.intern("This is quite a long string.")

In [115]:
a is b

True

In [116]:
c = "This is quite a long string."

In [117]:
a is c

False

In [124]:
a = sys.intern("This is another long string!") # alternative 
print(hex(id(a))) 

0x7face8121850


In [125]:
del a
b = "This is another long string!"
print(hex(id(b)))

0x7face8121f80


In [126]:
c = sys.intern("This is another long string!")
print(hex(id(c)))

0x7face8121bc0


For recurring data, interning allows the heap to be used economically.
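A minimal sketch of this saving, assuming a column of labels that are built at run time and therefore start out as separate objects. The data and variable names are hypothetical.

```python
import sys

words = ["red", "green", "red", "green", "red"]

# Build the labels at run time so that each one is a separate str object.
raw = ["colour: " + w for w in words]
interned = [sys.intern(s) for s in raw]

distinct_raw = len({id(s) for s in raw})
distinct_interned = len({id(s) for s in interned})

print(distinct_raw, distinct_interned)
```

After interning, the list holds references to only as many string objects as there are distinct values, while the contents compare equal to the original list.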

### Summary 

- There is stack and heap memory
- All objects are stored on the heap
- Lists are versatile, but generally inefficient
- Optimize heap usage via interning