Q: how do references to frames work? 

In [1]:
frame = { 'pet' : ['cat', 'dog'], 'number': [10,20] }
frame

{'pet': ['cat', 'dog'], 'number': [10, 20]}

In [2]:
for i in frame: 
    print(i)

pet
number


In [7]:
for i in frame: 
    print("{}: {}".format(i, frame[i]))

pet: ['cat', 'dog']
number: [10, 20]


In [8]:
for i in frame: 
    print(frame[i])

['cat', 'dog']
[10, 20]


In [9]:
for i in frame: 
    column = frame[i]
    print(column)

['cat', 'dog']
[10, 20]


In [11]:
for i in frame: 
    column = frame[i]
    print(column[0])

cat
10


In [13]:
for i in frame: 
    column = frame[i]
    print(column)
    for c in column: 
        print(c)

['cat', 'dog']
cat
dog
[10, 20]
10
20


In [15]:
for i in frame: 
    print(i)
    column = frame[i]
    print(column)
    length = len(column)
    print("length={}".format(length))
    for row in range(length):  # separate name, so the outer loop's i is not reused
        print(column[row])

pet
['cat', 'dog']
length=2
cat
dog
number
[10, 20]
length=2
10
20


In [20]:
# you may assume that columns have the same length. 
column = frame['pet']
length = len(column)
print("length={}".format(length))  
for row in range(length):
    for index in frame: 
        column = frame[index]
        print(column[row])       

length=2
cat
10
dog
20
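The row-by-row loop above can also be written with `zip`, which pairs up the n-th element of every column. This is a minimal sketch that reuses the `frame` dictionary from the earlier cells and, as before, assumes all columns have the same length. The names `rows` and `row` are illustrative.

```python
frame = {'pet': ['cat', 'dog'], 'number': [10, 20]}

# zip(*columns) yields one tuple per row, taking one value from each column.
rows = list(zip(*frame.values()))
for row in rows:
    print(row)
```

Each printed tuple is one row of the table: first `('cat', 10)`, then `('dog', 20)`.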


In [24]:
class Act(): 
    name = 'Alva'
    @property
    def hello(radish): 
        print("hello {}".format(radish.name))
a = Act()
a.name = 'Frank'
a.hello


hello Frank


# some comments about names

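In a function, 

1. names that are assigned inside the function are local, unless they are declared `global` or `nonlocal`. 
2. local variables are deleted when the function returns. 
3. a method reaches the attributes of its object only through its first parameter (conventionally named `self`); a bare name such as `name` does not refer to them. 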

In [26]:
class Act(): 
    name = 'Alva'

    def hello(radish): 
        print("hello {}".format(radish.name))
        
    def set(foo, bar): 
        name = bar  # a local variable
    
a = Act()
a.name = 'Frank'
a.hello()
a.set("Rupert")  # doesn't change anything. 
a.hello()

hello Frank
hello Frank
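For contrast, here is a sketch of a setter that does change the object, because it assigns through the first parameter. The method name `set_name` is made up for this illustration, and `hello` returns the string instead of printing it so the result is easy to check.

```python
class Act():
    name = 'Alva'  # class attribute: the default every instance starts with

    def hello(radish):
        return "hello {}".format(radish.name)

    def set_name(radish, bar):
        radish.name = bar  # attribute on this instance, not a local variable


a = Act()
a.set_name("Rupert")
print(a.hello())   # hello Rupert
print(Act.name)    # Alva: the class attribute itself is untouched
```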


# a summary of the concepts central to Python

* mutability
* iterables

In [30]:
# Let's create something that is mutable. 
foo =  { 'morris', 'snoopy', 'goldie', 'lassie' }
bar = foo
print("foo={}".format(foo))
print("bar={}".format(bar))


foo={'goldie', 'lassie', 'snoopy', 'morris'}
bar={'goldie', 'lassie', 'snoopy', 'morris'}


In [31]:
bar.add('pinky')
print("foo={}".format(foo))
print("bar={}".format(bar))

foo={'morris', 'goldie', 'snoopy', 'pinky', 'lassie'}
bar={'morris', 'goldie', 'snoopy', 'pinky', 'lassie'}


In [33]:
bar = set(foo)  # shallow copy
bar.add('bruff')
print("foo={}".format(foo))
print("bar={}".format(bar))

foo={'morris', 'goldie', 'snoopy', 'pinky', 'lassie'}
bar={'goldie', 'bruff', 'snoopy', 'pinky', 'lassie', 'morris'}


In [35]:
# what is shallow? 
cat = ['cats', ['are', 'a'], 'blast']
dog = list(cat)  # shallow copy. 
print("cat={}".format(cat))
print("dog={}".format(dog))
cat[1].append('real')
print("cat={}".format(cat))
print("dog={}".format(dog))

cat=['cats', ['are', 'a'], 'blast']
dog=['cats', ['are', 'a'], 'blast']
cat=['cats', ['are', 'a', 'real'], 'blast']
dog=['cats', ['are', 'a', 'real'], 'blast']
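If the nested list should not be shared, the standard library's `copy.deepcopy` copies the inner objects too. A minimal sketch using the same `cat` list as above:

```python
import copy

cat = ['cats', ['are', 'a'], 'blast']
dog = copy.deepcopy(cat)   # copies the outer list and the inner list
cat[1].append('real')
print("cat={}".format(cat))  # cat=['cats', ['are', 'a', 'real'], 'blast']
print("dog={}".format(dog))  # dog=['cats', ['are', 'a'], 'blast']
```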


In [42]:
jest = "this is a jest"
print(jest[10])
for c in jest: 
    print(c)
# jest[10] = 'p'  # raises TypeError: strings are immutable
pest = jest.replace('j', 'p')
print("jest={}".format(jest))  # jest is unchanged
print("pest={}".format(pest))  # pest is a new string

j
t
h
i
s
 
i
s
 
a
 
j
e
s
t
jest=this is a jest
pest=this is a pest
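To see what the commented-out assignment does without stopping the notebook, it can be wrapped in `try`/`except`. This is only a sketch; the variable `message` is introduced here for illustration.

```python
jest = "this is a jest"
try:
    jest[10] = 'p'            # strings do not support item assignment
except TypeError as err:
    message = str(err)
    print("TypeError:", message)

print("jest={}".format(jest))  # still the original string
```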


# memory management
There is a fundamental difference between memory management in C/C++ and in garbage-collected languages such as Java, Python, Lisp, ML, and Erlang. 

| C/C++ | Python | 
|-------|--------|
| malloc | no malloc, things created when needed. |
| new | no new; class construction is just class_name(args) |
| free | no free; the memory is recovered by garbage collection |

Garbage collection requires bookkeeping for every value: 
* conceptually, every value has a flag that says whether it is still in use. 
* While that flag is True, the value is preserved. 
* When that flag becomes False, the value's memory is freed. 
* The flag becomes False when the last reference to the value is destroyed. 
* CPython, the standard Python implementation, tracks this mainly via reference counts. 

Reference counts: 
* keep a count, for every value, of how many times it's referenced. 
* when there's a new reference, add 1
* when a reference is destroyed, subtract 1. 
* when the count is zero, mark for recovery

In [43]:
foo = [1, 2, 3]  # reference count is 1
bar = foo        # reference count is 2
bar = None       # reference count is 1
foo = 42         # reference count to [1,2,3] is 0 -> recovered. 

foo = [1, 2, 3]  # reference count 1
bar = [1, 2, 3]  # different object, reference count 1
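The counts in the comments above can be observed, at least roughly, in CPython. The sketch below assumes CPython: `sys.getrefcount` is implementation-specific and its absolute values include temporary references, so only the difference between two calls is meaningful. The class `Blob` and the names `baz` and `watcher` are made up for this illustration.

```python
import sys
import weakref

foo = [1, 2, 3]
bar = [1, 2, 3]
print(foo == bar)   # True: equal contents
print(foo is bar)   # False: two separate objects, each with its own count

before = sys.getrefcount(foo)
baz = foo                      # one more name bound to the same list
after = sys.getrefcount(foo)
print(after - before)          # 1


class Blob:
    """Hypothetical empty class; unlike a list, its instances allow weak references."""


blob = Blob()
watcher = weakref.ref(blob)    # a weak reference does not add to the count
blob = None                    # the last strong reference is gone
print(watcher() is None)       # True in CPython: the object was recovered at once
```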

# why classes are important

1. you can keep data together with the functions that are needed to work with it. 
2. methods can read the object's attributes directly, so you don't have to pass them explicitly as arguments. 
3. you can define very powerful operations on class data with rather simple calling syntax. 

Ex.: `DataFrame` is a class that allows one to very simply compute a new column as a function of several existing columns of a table. You don't need iteration, for loops, intermediate variables, etc. 
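To make the three points concrete without depending on any library, here is a toy sketch. `TinyFrame`, `add_column`, and `to_dict` are hypothetical names invented for this illustration; this is not how `DataFrame` is implemented, only the same idea in miniature: the loop lives inside the class, and the caller writes a single line.

```python
class TinyFrame:
    """Hypothetical stand-in for a table class; not the real DataFrame."""

    def __init__(self, columns):
        self._columns = dict(columns)   # internal storage, hidden from callers

    def add_column(self, name, func, *sources):
        """Compute a new column by applying func row by row to the source columns."""
        inputs = [self._columns[s] for s in sources]
        self._columns[name] = [func(*values) for values in zip(*inputs)]

    def to_dict(self):
        """Emit the data as a plain dict of lists, on request."""
        return {key: list(values) for key, values in self._columns.items()}


table = TinyFrame({'pet': ['cat', 'dog'], 'number': [10, 20]})
table.add_column('label', lambda pet, n: "{} x{}".format(pet, n), 'pet', 'number')
print(table.to_dict())
```

The caller never writes a loop or touches `_columns`; the printed dictionary gains a `label` column holding `'cat x10'` and `'dog x20'`.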

Most data scientists work with classes at a higher level of abstraction than the primitive types list, set, dict, etc. Classes can emit these types on request, but they keep their implementations hidden. This is a form of "information hiding". Its purpose is not to make your life difficult but to limit what you need to remember. 



# Mission of the course
(Or, why I have the audacity to think this fits into 6 weeks) 

Primary mission: 
* give you enough exposure that you can succeed in subsequent data science courses. 
* in particular, give you facility with a specific set of extremely common data analysis workflow tasks. 
* and the ability to add and remove steps without error. 
* In this, the **big problem** is data cleaning and formatting. "Data wrangling". This can take significant programming, before data can even enter the standard machine learning workflows. 
* By contrast, the output stages of a data science workflow are much better determined. 
* So, I will be spending more time on data preparation than on output generation, because it is much, much easier to invoke visualizations than to prepare data! 

# Details on the whole data science program
For the undergraduate program, see
https://docs.google.com/document/d/1q8GjBb2uQkZGkrMFp_C39coy6pJbSrvBfzy0diiRtdw/edit
and I will write an MS program guide later this summer; it starts Sept 1. 

This course is a prerequisite for 
* COMP 135 Machine Learning
* COMP 136 Statistical Machine Learning
* COMP 137 Deep Learning
* COMP 138 Reinforcement Learning