# STA 220 Data & Web Technologies for Data Analysis

### Lecture 1, 1/7/25, Memory handling


### Today's topics

 - Course Organization
 - Recap and Memory Handling in Python
     - Stack and Heap
     - Types
     - Reference Semantics
     - Interning

### Course Organization

This course covers topics of data acquisition and processing. 
We will learn how to automatically retrieve information from publicly available sources on the internet. 
This includes processing these data so that they can then be studied statistically. 

The course consists of two parts: 

1. Data acquisition: Web scraping from online sources. We’ll examine network traffic to understand API calls and learn how to configure our own requests, as well as how to scrape websites and extract data by navigating HTML files.
2. Data processing: Natural language and visualization, which involves tokenization and chunking as well as statistical models for language data such as LDA and Naive Bayes.

The final grade is determined by 
- homework assignments (40%),
- project due March 16 (60%).

For comprehensive and updated information about the course, please consult [Canvas](https://canvas.ucdavis.edu/courses/947187).  

The project will be collaborative work with two to three group members. You will use the methods learned in this class to procure a data set, preferably from multiple sources, and process it to make it accessible for further investigation. This involves displaying its properties by visual means, so that statistical hypotheses can be formed.

Lecture notes will be made available online on [GitHub](https://github.com/kramlinger/sta220). Use [Piazza](https://piazza.com/class/m5f6f5d9l6n3pm/) for any inquiries regarding organization, homework, or lectures. We will monitor this site M-F during business hours. Please do not write emails! Screen recordings will be made available on Canvas. Homework assignments will be published on Piazza.

Office hours: 
* Peter Kramlinger: T, 2-3 PM, MSB 1143
* Zhongxuan Liu: t.b.a.

#### Ethics

This is a programming class. Using assistance is part of programming and is encouraged. This can be AI-based or come from online sources (e.g., [stackoverflow](https://stackoverflow.com/questions)).

However, you will be graded on your proficiency in coding. In all assignments, make sure that you display your own contribution. Submitting AI-generated code, answers from online sources, or even classmates' solutions will not be enough to pass the course. If you pass off someone else's work as your own, then you are engaging in academic misconduct. 


### Python

For this course, we will use Python to retrieve data. Due to its simplicity, it is one of the most popular programming languages. Today and Thursday we will introduce and review some basic aspects.

In [None]:
import this

### Stack and Heap

In [None]:
x = True
type(x)

`x` is a variable, which corresponds to a <kbd>bool</kbd> object with value `True`. The variable itself merely holds a reference to a specific object. This reference is stored in local memory (the *stack*). The Python interpreter takes care of allocating stack memory; we don't have to do that.

The <kbd>bool</kbd> object and its value are stored in a separate region of random access memory (RAM, the *heap*). With `id`, we can look up the address of the object on the heap, i.e., the location the reference on the stack points to (in CPython, `id` returns the object's memory address):

In [None]:
hex(id(x))

In [None]:
y = float(x)
hex(id(y))

A new object will be stored at a different location. 

In [None]:
hex(id(x))

In [None]:
x = int(x)
type(x)

In [None]:
hex(id(x))

<img src="source/memory1.png" alt="" width="1000"/>

As soon as the variable `x`, which previously referenced the <kbd>bool</kbd> object, no longer refers to it (either by deletion or by reassignment), the object on the heap is eligible to be reclaimed by the garbage collector.
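Reclamation of an unreferenced object can be observed with a weak reference, which does not keep its target alive. The following is a minimal sketch under stated assumptions: the class `Payload` is a hypothetical placeholder, and the immediate cleanup after `del` relies on CPython's reference counting; other interpreters may reclaim the object later.

```python
import weakref


class Payload:
    """Hypothetical placeholder class; its instances support weak references."""


obj = Payload()
watcher = weakref.ref(obj)            # weak reference: does not keep the object alive

alive_before = watcher() is obj       # the object is still reachable through `obj`

del obj                               # remove the only strong reference
alive_after = watcher() is not None   # in CPython the object is reclaimed right away

print(alive_before, alive_after)
```

A weak reference returns `None` once its target is gone, which makes it a convenient probe that does not itself prevent the cleanup.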



Let's work through the phrase *Everything in Python is an object*. Some basic built-in objects (*types*) we have already met are

- Numeric: <kbd>int</kbd>, <kbd>float</kbd>, <kbd>complex</kbd>
- Boolean: <kbd>bool</kbd>
- String: <kbd>str</kbd>
- Sequence: <kbd>list</kbd>, <kbd>tuple</kbd>, <kbd>range</kbd>
- Mapping: <kbd>dict</kbd>

The function `sys.getsizeof` ([docs](https://docs.python.org/3/library/sys.html?highlight=getsizeof#sys.getsizeof)) returns the size in bytes of the object the variable points to. 

In [None]:
import sys
sys.getsizeof(x)

In [None]:
sys.getsizeof(y)

A <kbd>float</kbd> is less expensive than an <kbd>int</kbd>. This is because an <kbd>int</kbd> stores additional information about its size together with the actual value. The larger the integer, the more memory it requires.

In [None]:
sys.getsizeof(100 ** 10)

In [None]:
sys.getsizeof(100.0 ** 10)
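To see how the size of an <kbd>int</kbd> grows with its magnitude while a <kbd>float</kbd> stays fixed, one can tabulate `sys.getsizeof` for a few values. This is only a sketch: the exponents are chosen arbitrarily, and the exact byte counts depend on the interpreter build, so none are hard-coded.

```python
import sys

# Size of int objects for growing magnitudes.
int_sizes = {k: sys.getsizeof(2 ** k) for k in (0, 30, 60, 120, 240, 480)}

# Size of float objects for small and large values.
float_sizes = {v: sys.getsizeof(v) for v in (1.0, 1e100, 1e300)}

for k, size in int_sizes.items():
    print(f"2**{k:<3} -> {size} bytes")
print("distinct float sizes:", set(float_sizes.values()))
```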

However, an <kbd>int</kbd> can store larger values than a <kbd>float</kbd>.

In [None]:
x = 500 ** 500 
type(x)

In [None]:
x

In [None]:
sys.getsizeof(x)

In [None]:
float(x)
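The failed conversion can be handled explicitly. The sketch below catches the `OverflowError` and compares the integer with `sys.float_info.max`, the largest finite <kbd>float</kbd>; the variable names are chosen for illustration only.

```python
import sys

big = 500 ** 500                 # exact integer, far beyond the float range

try:
    as_float = float(big)
except OverflowError as err:     # raised because the value exceeds the float range
    as_float = None
    print("conversion failed:", err)

print("largest finite float:", sys.float_info.max)
print("big > largest float: ", big > sys.float_info.max)
```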

The function `range(start, stop, step)` ([docs](https://docs.python.org/3/library/stdtypes.html#range)) creates a <kbd>range</kbd> object. It starts at `start` and stops before `stop` (at `stop - 1` for the default `step` of `1`), but does not store its elements in memory, so its size does not depend on its length.

In [None]:
x = range(0, 500**500)
sys.getsizeof(x)

In [None]:
sys.getsizeof(500**500)
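Because a <kbd>range</kbd> only stores `start`, `stop` and `step`, its elements are computed on demand. A short sketch of what this allows, assuming CPython (where `sys.getsizeof` is available and a range object has a fixed size):

```python
import sys

small = range(0, 10)
large = range(0, 10 ** 12)      # a trillion elements, none of them stored

# Both objects only store start, stop and step, so they have the same size.
same_size = sys.getsizeof(small) == sys.getsizeof(large)

# Elements are computed on demand.
last = large[-1]                # indexing without materializing the sequence
hit = 10 ** 11 in large         # membership test by arithmetic, not by scanning
every_third = large[::3]        # slicing returns another range object

print(same_size, last, hit, every_third)
```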

A <kbd>tuple</kbd> is an ordered collection of values. Think of coordinates. A <kbd>tuple</kbd> is immutable, which means it can't be changed after it is created.

In [None]:
x = 1, 3.0, "horse" # parentheses are optional, but should be used for clarity
x

In [None]:
type(x)

In [None]:
sys.getsizeof(x)

A <kbd>tuple</kbd> is immutable. As we have learned, it can't be changed once created!

In [None]:
x[2] = 'horsies' 

In [None]:
try:
    x[2] = 'horsies'
except TypeError:  # item assignment on a tuple raises TypeError
    print('Tuples are immutable!')

This is a feature, not a shortcoming, of <kbd>tuple</kbd>. Since tuples can be neither changed nor appended to, they are more economical than <kbd>list</kbd> objects. A <kbd>list</kbd> is the mutable counterpart of a <kbd>tuple</kbd>. Lists are instantiated with square brackets.

In [None]:
y = [1, 3.0, "horse"]
y

In [None]:
type(y)

In [None]:
sys.getsizeof(y)

Lists are mutable, and in particular appendable. Since these actions are allowed, <kbd>list</kbd> objects require more memory. The value returned by `sys.getsizeof` does not include the sizes of the values in the list! Instead, `y` is a variable with a reference to a <kbd>list</kbd> object on the heap, *which itself is a collection of addresses*. This collection of addresses takes $120$ bytes.

In [None]:
sys.getsizeof(y)

In [None]:
sum([sys.getsizeof(i) for i in y])

In [None]:
sys.getsizeof(1) + sys.getsizeof(3.0) + sys.getsizeof("horse")
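The two quantities above can be combined into a rough total footprint. The helper `total_size` below is hypothetical and written for illustration only: it handles flat containers, does not follow nested ones, and counts an object twice if it appears twice.

```python
import sys


def total_size(container):
    """Size of a flat list or tuple plus the sizes of its elements, in bytes."""
    return sys.getsizeof(container) + sum(sys.getsizeof(item) for item in container)


as_list = [1, 3.0, "horse"]
as_tuple = (1, 3.0, "horse")

print("list object only: ", sys.getsizeof(as_list))
print("list + elements:  ", total_size(as_list))
print("tuple object only:", sys.getsizeof(as_tuple))
print("tuple + elements: ", total_size(as_tuple))
```

The element sizes are identical for both containers, so any difference in the totals comes from the container objects themselves.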

In contrast to a <kbd>tuple</kbd>, however, a <kbd>list</kbd> is mutable.

In [None]:
y[2] = "horsies"
y

### Reference Semantics

Lists use *reference semantics*, which means that if you assign a list to two different variables, there's still only one list in memory, and both variables refer to it. As a result, changing the list with one variable changes the list for the other variable.

In [None]:
x = y

In [None]:
hex(id(x))

In [None]:
hex(id(y))

In [None]:
x[0] = "my"
y
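The same aliasing occurs when a list is passed to a function: the function receives a reference, not a copy. A minimal sketch, in which the function `add_unit` and the variable names are hypothetical:

```python
def add_unit(values):
    """Hypothetical helper that modifies its argument in place."""
    values.append("kg")


measurements = [1, 3.0]
alias = measurements          # second name for the same list object

add_unit(alias)               # the function receives a reference, not a copy

print(measurements)           # the change is visible through both names
print(alias is measurements)
```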

A new, independent list object can be created by slicing.

In [None]:
z = y[:]

In [None]:
hex(id(z))

In [None]:
z

In [None]:
z[1] = 3

In [None]:
hex(id(z[1]))

In [None]:
hex(id(y[1]))

<img src="source/memory2.png" alt="" width="1000"/>

Alternatively, we can apply the `copy` method ([docs](https://docs.python.org/3/library/copy.html)) to the original list.

In [None]:
z = y.copy()
hex(id(z))

In [None]:
hex(id(y))

While the copies `y` and `z` are *equal*, they are not *identical*, because they point to different objects.

In [None]:
y == z # equal

In [None]:
y is z # identical

In [None]:
y is x # identical

In [None]:
y[1] = 2
print(y)
print(z) 

Attention! This is a *shallow copy*, i.e., objects within the list will not be reinstantiated! Above, the command `y[1] = 2` just replaces the former reference in `y` with a reference to an <kbd>int</kbd> object of value `2` on the heap; the reference stored in `z` is unaffected.

In [None]:
hex(id(z[1])) == hex(id(y[1]))

This becomes tricky if the list references another list:

In [None]:
a = ['a', 'list']

In [None]:
y = [1, 2, 'three', a]

In [None]:
z = y.copy()

In [None]:
hex(id(y))

In [None]:
hex(id(z))

In [None]:
hex(id(y[3]))

In [None]:
hex(id(z[3]))

In [None]:
z[0] = 3

In [None]:
y

In [None]:
z

In [None]:
y[3][1] = 'ha'

In [None]:
print(y)
print(z)

In [None]:
hex(id(z[3])) == hex(id(y[3])) 

Although both lists are real copies, they reference the same nested list `a`, which has not been copied.

In [None]:
hex(id(z[3])) == hex(id(y[3]))

This behaviour is irrespective of the variable `a`. We can remove it from the scope. Since the list object that `a` pointed to is still referenced by `y` and `z`, it will not be reclaimed by the garbage collector.

In [None]:
hex(id(a))

In [None]:
del a

In [None]:
hex(id(z[3]))

We can copy the nested lists as well by calling `copy.deepcopy`.

In [None]:
from copy import deepcopy
z = deepcopy(y)

In [None]:
y

In [None]:
hex(id(z[3]))

In [None]:
hex(id(y[3]))

While the copies `y` and `z` are *equal*, they are not *identical*, because they point to different objects.

In [None]:
y == z # equal

In [None]:
y is z # identical
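The three ways of obtaining a second handle on a list (assignment, shallow copy, deep copy) can be compared side by side. This is a compact sketch with freshly chosen variable names, so it does not depend on the state of the cells above.

```python
from copy import deepcopy

inner = ["a", "list"]
original = [1, 2, "three", inner]

alias = original              # same outer list
shallow = original.copy()     # new outer list, shared inner list
deep = deepcopy(original)     # new outer list and new inner list

original[3][1] = "ha"         # modify the nested list through `original`

print(alias[3], alias is original)
print(shallow[3], shallow[3] is original[3])
print(deep[3], deep[3] is original[3])
```

Only the deep copy keeps the nested list in its original state, because it is the only one that owns a separate inner list.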

### Interning 

Heap memory is memory that can be accessed and reserved by the programmer. Doing this by hand is tedious, so it is usually done automatically. To optimize this process, Python uses *interning* to allocate resources. Since `x` is merely a pointer to the <kbd>int</kbd> object with value `1`, any other variable can point to the same address.

In [None]:
x = 1

In [None]:
y = 1

In [None]:
hex(id(x)) == hex(id(y))

This does not mean that integers use reference semantics! 

In [None]:
y = 2
x

In [None]:
hex(id(x)) == hex(id(y))

In CPython, integer interning is only done for values from `-5` to `256`.

In [None]:
x = 4.0
y = 4.0
hex(id(x)) == hex(id(y))
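The upper end of the small-integer cache can be probed directly. The sketch below assumes CPython and builds the integers at run time with `int(...)`, because equal literals inside one cell may be merged into a single constant, which would hide the effect.

```python
# Build the integers at run time so that equal literals cannot be merged
# into a single constant.
a, b = int("256"), int("256")
c, d = int("257"), int("257")

print(a is b)   # 256 lies inside CPython's small-integer cache
print(c is d)   # 257 lies outside the cache, so two objects are created
print(c == d)   # the values are still equal
```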

Interning works for several simple types: 

In [None]:
x = "Hi"
y = "Hi"

In [None]:
hex(id(x)) == hex(id(y))

Interning can be forced using `sys.intern`. 

In [None]:
a = "This is quite a long string."
b = "This is quite a long string."
hex(id(a)) == hex(id(b))

In [None]:
a = "This is quite a long string."
b = a
hex(id(a)) == hex(id(b))

In [None]:
import sys
a = sys.intern("This is quite a long string.")
b = sys.intern("This is quite a long string.")
hex(id(a)) == hex(id(b))

In [None]:
c = "This is quite a long string."
hex(id(a)) == hex(id(c))

When using `sys.intern`, we can intern an object without any variable pointing to it on the heap.

In [None]:
a = sys.intern("This is quite a long string.")
hex(id(a))

In [None]:
del a
b = sys.intern("This is quite a long string.")
hex(id(b))

For recurring data, interning allows us to use the heap economically.
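A minimal sketch of this effect, assuming CPython: equal strings built at run time are separate objects, while their interned versions all refer to one shared object. The label text and the count of 1000 are arbitrary choices for illustration.

```python
import sys

# Build equal strings at run time; each call to join creates a new object.
labels = ["".join(["long label: ", "Davis, California"]) for _ in range(1000)]
interned = [sys.intern(label) for label in labels]

distinct_before = len({id(s) for s in labels})     # one object per entry
distinct_after = len({id(s) for s in interned})    # a single shared object

print(distinct_before, distinct_after)
```

The memory is only freed once the non-interned list `labels` is no longer referenced, so in practice one would intern the strings as they are read rather than afterwards.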

### Summary 

- There is stack and heap memory
- All objects are stored on the heap
- Lists are versatile, but generally inefficient
- Optimize heap usage via interning