# STA 141B Data & Web Technologies for Data Analysis

### Lecture 1, 1/9/24, Basics of Python

### Today's topics


- Course Organization
- Basics of Python

### Course Organization

This course covers topics of data acquisition and processing. 
We will learn how to automatically retrieve information from publicly available sources on the internet. 
This includes processing these data so that they can then be studied statistically. 

The course consists of three parts: 

1. Introduction to Python
2. Data acquisition
3. Data Processing: Natural languages and visualisation

The final grade is determined by 
- homework assignments (40%),
- one exam on the basics of Python on January 30th (20%),
- project due March 20th (40%).

For comprehensive and updated information about the course, please consult [Canvas](https://canvas.ucdavis.edu/courses/868132).  

The project will be collaborative work in groups of two to three members. You will use the methods learned in this class to procure a data set, preferably from multiple sources, and process it to make it accessible for further investigation. This involves displaying its properties by visual means, so that statistical hypotheses can be formed. 

All material of this course will be made available online on [GitHub](https://github.com/kramlinger/STA141B_WQ24). Use [Piazza](https://piazza.com/class/lr56jub0plrc) for any inquiries regarding organization, homework or lectures. We will monitor this site M-F during business hours. Please do not write emails! Screen recordings and all other class-related administrative information will be made available on Canvas. 

Office hours: 
* Peter Kramlinger: R 11-12 AM, MSB 1143
* Yichen Hu, t.b.a.
* Yejiong Zhu, t.b.a., via Zoom

#### Ethics

This is a programming class. Using assistance is part of programming and is encouraged. This can be AI-based, or from online sources (e.g., [Stack Overflow](https://stackoverflow.com/questions)). 

However, you will be graded on your proficiency in coding. In all assignments, make sure that you display your own contribution. Submitting AI-generated code, answers from online sources, or even classmates' solutions will not be enough to pass the course. Furthermore, if you pass off someone else's work as your own, then you are engaging in academic misconduct. 


### Basics of Python

For this course, we will use Python to retrieve data. Due to its simplicity, Python is one of the most popular programming languages. Today and next Tuesday we will introduce and review some of its basic aspects. 

In [1]:
import this

The Zen of Python, by Tim Peters

Beautiful is better than ugly.
Explicit is better than implicit.
Simple is better than complex.
Complex is better than complicated.
Flat is better than nested.
Sparse is better than dense.
Readability counts.
Special cases aren't special enough to break the rules.
Although practicality beats purity.
Errors should never pass silently.
Unless explicitly silenced.
In the face of ambiguity, refuse the temptation to guess.
There should be one-- and preferably only one --obvious way to do it.
Although that way may not be obvious at first unless you're Dutch.
Now is better than never.
Although never is often better than *right* now.
If the implementation is hard to explain, it's a bad idea.
If the implementation is easy to explain, it may be a good idea.
Namespaces are one honking great idea -- let's do more of those!


#### Arithmetic operations

Python is a fancy calculator that allows for all basic arithmetic operations. The first notable difference to R is that not every result is printed: a cell only displays the value of its last expression. 

In [2]:
12 + 4
20 - 4 
2 * 8
32 / 2

16.0

In [3]:
4 ** 2 # exponentiation

16

In [4]:
4 ^ 2 # ^ is the bitwise XOR operator in Python, we won't use it here

6

In [5]:
33 % 17 # modulus

16

In [7]:
(1 + 12 / 4) ** (3 + 2) / 8 ** 2

16.0
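
Note that `/` returned the float `16.0` above even though the result is a whole number. As a small sketch going slightly beyond the cells above, the floor division operator `//` and the built-in `divmod` (neither is used elsewhere in this lecture) keep integer results as <kbd>int</kbd>:

```python
# True division always returns a float, even for whole-number results.
quotient = 32 / 2
print(quotient)   # 16.0

# Floor division rounds down and keeps integers as int.
floored = 33 // 2
print(floored)    # 16

# divmod returns the floor quotient and the remainder (modulus) together.
q, r = divmod(33, 17)
print(q, r)       # 1 16
```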

The assignment operator in Python is `=`. 

In [14]:
x = 4 

In [15]:
x

4

Python is all about cleanliness! Very useful are the augmented assignment operators, which combine an arithmetic operation with an assignment and avoid redundant code. 

In [16]:
x += 3
x

7

In [17]:
x -= 2
x

5

In [18]:
x *= (x + 1)
x

30

In [19]:
x /= 6
x **= 2
x %= 5
x

0.0
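
To see where the `0.0` comes from, the last cell can be written out without the shorthand operators. This is only an illustrative trace that starts from the value `30` left by the previous cell:

```python
x = 30
x = x / 6    # same as x /= 6: true division gives the float 5.0
x = x ** 2   # same as x **= 2: 25.0
x = x % 5    # same as x %= 5: 0.0
print(x)     # 0.0
```

The result is a <kbd>float</kbd> because `/` always returns one, and the later operations preserve that type.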

Comparison and Boolean operators are: 

In [20]:
4 == 5

False

In [21]:
4 != 5

True

In [22]:
4 >= 5

False

In [23]:
4 >= 5 and True

False

In [24]:
4 >= 5 or True

True

In [25]:
4 <= 5 or not True

True

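Brackets are important when combining comparisons. Without them, Python chains the comparisons, so `4 <= 5 == True` is evaluated as `(4 <= 5) and (5 == True)`. 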

In [26]:
4 <= 5 == True

False

In [27]:
(4 <= 5) == True 

True

In [28]:
(4 <= 5) is True 

True
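
The surprising `False` for `4 <= 5 == True` comes from comparison chaining: Python reads `a <= b == c` as `(a <= b) and (b == c)`. A short sketch of that rule, with variable names chosen only for illustration:

```python
# A chained comparison is split into pairwise comparisons joined by `and`.
chained = 4 <= 5 == True
expanded = (4 <= 5) and (5 == True)   # 5 == True is False, since True equals 1
grouped = (4 <= 5) == True            # brackets force the intended reading
print(chained, expanded, grouped)     # False False True

# Chaining is convenient for range checks.
in_range = 0 < 3 < 10
print(in_range)                       # True
```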

The distinction between `==` and `is` is not merely semantic: `==` compares values, whereas `is` checks whether both sides are the very same object. 

In [29]:
x = 3
y = 3.0
x == y # equality

True

In [30]:
x is y # identity

False
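
The difference matters most for mutable objects such as lists, which are introduced later in this lecture. The following sketch uses hypothetical names `a`, `b` and `c` to show that two names can refer to equal but distinct objects, or to one and the same object:

```python
a = [1, 2, 3]
b = [1, 2, 3]   # a second list with the same contents
c = a           # a second name for the same list

print(a == b)   # True: same values
print(a is b)   # False: two separate objects
print(a is c)   # True: one object with two names

c.append(4)     # changing the object through c...
print(a)        # [1, 2, 3, 4] ...is visible through a as well
```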

#### Syntax

Python code is user-friendly and much cleaner than, e.g., R. We want to keep it that way. Therefore, let's adhere to some principles: 

Instead of curly brackets, Python uses indentation. The `if` clause below enters the indented chunk if the condition is `True`. 

In [33]:
if 4 > 0: 
    print('Four is strictly greater than zero.')
else:  
 print('Four is not strictly greater than zero.')
print('Indeed.')

Four is strictly greater than zero.
Indeed.


We will learn more about loops and `if` statements next lecture. Indentation works as long as at least one space is used, as in the `else` branch above. However, it is advised to use four spaces. In any case, be consistent throughout your code. 

In [36]:
k = 0
for i in range(1,100):
 k = k + i
 print(k)

1
3
6
10
15
21
28
36
45
55
66
78
91
105
120
136
153
171
190
210
231
253
276
300
325
351
378
406
435
465
496
528
561
595
630
666
703
741
780
820
861
903
946
990
1035
1081
1128
1176
1225
1275
1326
1378
1431
1485
1540
1596
1653
1711
1770
1830
1891
1953
2016
2080
2145
2211
2278
2346
2415
2485
2556
2628
2701
2775
2850
2926
3003
3081
3160
3240
3321
3403
3486
3570
3655
3741
3828
3916
4005
4095
4186
4278
4371
4465
4560
4656
4753
4851
4950


Indentation really matters in Python!

In [None]:
k = 0
for i in range(1, 100): # 1, 2, 3, ..., 99
    k = k + i
print(k)
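
The two loops differ only in the indentation of the last line. As a sketch that makes the effect countable, the list `running` (a name introduced here for illustration) collects what the first loop printed in every iteration, while `total` is what the second loop prints once:

```python
running = []
k = 0
for i in range(1, 100):
    k = k + i
    running.append(k)   # indented: executed in every iteration

total = k               # not indented: executed once, after the loop
print(len(running))     # 99
print(total)            # 4950
```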

Indentation is necessary everywhere other languages would put curly brackets...

In [37]:
def square_it(x):
    return x ** 2

In [38]:
square_it(123)

15129

Although statements can be separated using `;`, it's better to put every statement on a new line. Screen space doesn't cost anything!

In [39]:
y = 2; x = y + 1; print(y % x); (y ** 2) == 2 # runs, but is bad style!

2


False

If you don't want to break down lengthy formulae into smaller steps, use brackets to spread an expression over several lines: 

In [41]:
x = (2    # imagine a full line of code...
     + 3  # ...and a second one...
     - 4) # ...until you're done

Keep the operators at the beginning of the line to increase readability. 

Variable names must not contain `.`, because the dot accesses an attribute of an object. 

In [42]:
x.y = 3

AttributeError: 'int' object has no attribute 'y'

In [43]:
x_y = 3

Cleanliness and convention are very important in Python. Make sure to follow the [PEP 8 style guide](https://peps.python.org/pep-0008/) in your code. 

#### Types

So far we have used <kbd>int</kbd>, <kbd>float</kbd> and <kbd>bool</kbd>. We will use the following basic types: 
- Numeric: <kbd>int</kbd>, <kbd>float</kbd>, <kbd>complex</kbd>
- Boolean: <kbd>bool</kbd>
- String: <kbd>str</kbd>
- Sequence: <kbd>list</kbd>, <kbd>tuple</kbd>, <kbd>range</kbd>
- Mapping: <kbd>dict</kbd>
- Set: <kbd>set</kbd>

We can check the type by calling `type`. 

In [44]:
type(1)

int

In [45]:
type(1.0)

float

In [46]:
type(1 + 1j) # not i, but j! 

complex

In [47]:
type(True) 

bool

In [48]:
type('Hello World!')

str

Values can be cast to another type using the type constructors.  

In [49]:
x = 1
type(x)

int

In [51]:
hex(id(x))

'0x7f9cd802e930'

In [52]:
x = float(x)
type(x)

float

In [53]:
hex(id(x))

'0x7f9c98d2d370'

Although recasting makes things easy to work with, always be aware of all the work you require your interpreter to do! As the changed `id` shows, casting creates a new object. 

In [54]:
x = bool(x)
type(x)

bool

In [55]:
x 

True

In [60]:
bool(-1.0)

True

In [57]:
bool(0)

False

In [58]:
bool(1)

True

In [59]:
bool(0.1)

True

In [61]:
x = str(x)
x

'True'

In [62]:
x + ' or False?'

'True or False?'

We can force printing by using `print`. 

In [63]:
x = bool(x)
print(x)
complex(x)

True


(1+0j)
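
Casting a string to <kbd>bool</kbd> deserves some care: the result depends on whether the string is empty, not on what it says. The following sketch illustrates this for a few values that do not appear in the cells above:

```python
# Truthiness of a string depends on its length, not on its content.
print(bool('False'))   # True, because the string is non-empty
print(bool(''))        # False, the empty string is falsy

# Empty containers are falsy, non-empty ones are truthy.
print(bool([]))        # False
print(bool([0]))       # True, the list holds one element

# Strings that hold numbers can be converted with int and float.
answer = int('42') + 1
ratio = float('3.5')
print(answer, ratio)   # 43 3.5
```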

You can check what a function does by running `help`. 

In [64]:
help(print)

Help on built-in function print in module builtins:

print(...)
    print(value, ..., sep=' ', end='\n', file=sys.stdout, flush=False)
    
    Prints the values to a stream, or to sys.stdout by default.
    Optional keyword arguments:
    file:  a file-like object (stream); defaults to the current sys.stdout.
    sep:   string inserted between values, default a space.
    end:   string appended after the last value, default a newline.
    flush: whether to forcibly flush the stream.



We are using Python because it is an industry standard, not because it is superior to, e.g., R. It is an industry standard because it is so easy and simple. We should keep it that way. 

__Adhere to the principles of proper programming!__

- K.I.S.S. (Keep It Simple, Stupid): Functions should perform one task, and one task only. 
- Rule of Three (avoid code duplication): Duplication is a bad programming habit because it makes code harder to maintain. 
- Clarity before Efficiency: Never sacrifice clarity for some perceived efficiency. Donald Knuth: "Premature optimization is the root of all evil."
- Naming: Stick to consistency and conventions. 
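
As a minimal sketch of these principles, the hypothetical function `standardize` below performs a single task and replaces what would otherwise be the same formula copied for every data set. The function name and the data are made up for illustration:

```python
def standardize(values):
    """Center the values at their mean and scale by their standard deviation."""
    mean = sum(values) / len(values)
    variance = sum((v - mean) ** 2 for v in values) / len(values)
    return [(v - mean) / variance ** 0.5 for v in values]

# The same function is reused instead of duplicating the formula.
heights = [150.0, 160.0, 170.0]
weights = [50.0, 60.0, 70.0]
print(standardize(heights))
print(standardize(weights))
```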

#### Sequence

##### Range

We have already used a <kbd>range</kbd> object in `for` loops. The function `range(start, stop, step)` creates a <kbd>range</kbd> type object. Note that it starts at `start` and ends before `stop`, i.e., at `stop - 1` for the default step size of 1.

In [65]:
x = range(1, 10) 
len(x)

9

In [66]:
x = range(1, 10)
for n in x:
    print(n)

1
2
3
4
5
6
7
8
9


In [67]:
type(x)

range

`range(0, 100)` creates an iterable object that can be used in, e.g., `for` loops. It does not instantiate a vector of length $100$, which would take up unnecessary space. 

In [68]:
x = range(0, 100**100)

In [69]:
import sys
sys.getsizeof(x)

48

In [70]:
sys.getsizeof(100**100) # just the upper bound of that range...

116
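
Even though a <kbd>range</kbd> does not store its elements, it still behaves like a sequence. The following sketch, using a step size and names not found in the cells above, shows that length, indexing and membership are computed on demand:

```python
evens = range(0, 10, 2)   # start 0, stop before 10, step size 2
print(list(evens))        # [0, 2, 4, 6, 8]
print(len(evens))         # 5
print(evens[1])           # 2
print(8 in evens)         # True
print(10 in evens)        # False, since stop is excluded

# Indexing works even when the elements could never be stored.
huge = range(0, 100 ** 100)
print(huge[-1] == 100 ** 100 - 1)   # True
```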

##### Tuple

A <kbd>tuple</kbd> is an ordered collection of values. Think of coordinates. Tuples are immutable, which means they can't be changed after they're created.

In [72]:
x = (1, 3.0, "horse") # parentheses are optional, but should be used for clarity 
x

(1, 3.0, 'horse')

In [73]:
type(x)

tuple

There are three ways to get elements from a tuple:
- Indexing with `[]` 
- Slicing with `[a:b]`: a slice `a:b` gets the elements at positions `a` through `b - 1`
- Unpacking: assign to a same-shape tuple of variables on the left-hand side

Note that sequences in Python are indexed starting with zero!

In [74]:
x[0]

1

In [75]:
type(x[2])

str

In [76]:
x

(1, 3.0, 'horse')

In [78]:
x[2]

'horse'

In [85]:
x[1:]

(3.0, 'horse')

In [86]:
x[0:1] 

(1,)

In [88]:
type(x[:1][0]) # leaving the a-position blank assumes the first entry at position zero

int

Python has the ability to assign multiple variables at once! This is called unpacking. 

In [93]:
u, v, w = x
print(u)
print(w)

1
horse


In [94]:
u, _, _ = x # by convention, `_` collects values we do not need
_

'horse'

In [90]:
type(v)

float

For unpacking, the number of variables must match the number of elements. 

In [95]:
u, v = x
print(v)
print(w)

ValueError: too many values to unpack (expected 2)
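
If only some of the elements are needed, a starred variable can collect the remainder, so that the number of variables no longer has to match exactly. This is a sketch of standard Python syntax that goes slightly beyond the cells above:

```python
x = (1, 3.0, "horse")

# The starred name collects all remaining elements in a list.
first, *rest = x
print(first)   # 1
print(rest)    # [3.0, 'horse']

# Unpacking also allows swapping two variables without a temporary one.
a, b = 1, 2
a, b = b, a
print(a, b)    # 2 1
```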

Once created, tuples can't be changed!

In [98]:
x[2] = 'horsies'

TypeError: 'tuple' object does not support item assignment

This is a feature, not a shortcoming, of <kbd>tuple</kbd>. Since tuples can be neither changed nor appended to, they are more economical than a <kbd>list</kbd>. 

##### Lists

<kbd>list</kbd> is the mutable counterpart of <kbd>tuple</kbd>. Lists are instantiated with square brackets. 

In [99]:
y = [1, 3.0, "horse"]
type(y)

list

In [100]:
sys.getsizeof(x)

64

In [101]:
sys.getsizeof(y)

120

Lists take up more space than tuples. In return, they are mutable. 

In [102]:
y[2] = "horsies"
y 

[1, 3.0, 'horsies']

Accessing lists works just like for tuples. 

In [103]:
y[0]

1

In [104]:
y[3]

IndexError: list index out of range

In [105]:
y[1:]

[3.0, 'horsies']

In [109]:
y[2:] # a slice always returns a list, here with one element

['horsies']

In [107]:
y[2:][0]

'horsies'

There are two ways to read the first and third elements of `y`. 

In [110]:
z = [y[i] for i in range(0, 3, 2)] # more on that syntax next lecture! 
z

[1, 'horsies']

Alternatively, we can slice:

In [111]:
z = y[0:3:2] # start at 0, stop at 3 - 1, use step size 2
z

[1, 'horsies']
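
The general slice syntax is `[start:stop:step]`, and each part may be omitted or negative. The following sketch recreates the list from above and shows a few variants that are not covered by the previous cells:

```python
y = [1, 3.0, "horsies"]

print(y[-1])     # horsies: negative indices count from the end
print(y[:2])     # [1, 3.0]
print(y[::2])    # [1, 'horsies']: every second element
print(y[::-1])   # ['horsies', 3.0, 1]: a reversed copy
print(y[5:])     # []: slices never raise an IndexError
```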

Just as for tuples, we can use unpacking. 

In [112]:
u, v, w = y
print(u)
print(v)
print(w)

1
3.0
horsies


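Elements can be deleted from a list with `del`, but not from a tuple. 

In [115]:
del x[1] # x is still the tuple from above
x

TypeError: 'tuple' object doesn't support item deletion

In [114]:
type(x)

tuple

In [None]:
del y[1] # lists are mutable, so this removes 3.0
y

[1, 'horsies']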

Some other important methods for <kbd>list</kbd> are: 

In [116]:
y.append(4) #appends argument to list, does not return anything

In [117]:
y

[1, 'horsies', 4]

In [118]:
y.append([6, 7]) #appends list!
y

[1, 'horsies', 4, [6, 7]]

In [119]:
y.index('horsies') # returns index of argument

1

In [120]:
y.index(4)

2

In [121]:
y.pop(1) # removes element on argument position and returns it

'horsies'

In [122]:
y

[1, 4, [6, 7]]

In [123]:
y.reverse() # reverses the order of elements, returns nothing
y

[[6, 7], 4, 1]
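
Two details of these methods are easy to trip over: `append` adds its argument as a single element, and methods that modify the list in place return `None`. The sketch below contrasts them with `extend` and the built-in `sorted`, which are not used in the cells above:

```python
nested = [1, 4]
nested.append([6, 7])    # the list [6, 7] becomes one element
print(nested)            # [1, 4, [6, 7]]
print(len(nested))       # 3

flat = [1, 4]
flat.extend([6, 7])      # the elements 6 and 7 are added individually
print(flat)              # [1, 4, 6, 7]

result = flat.reverse()  # in-place methods return None
print(result)            # None
print(flat)              # [7, 6, 4, 1]

ordered = sorted(flat)   # sorted returns a new list instead
print(ordered)           # [1, 4, 6, 7]
print(flat)              # [7, 6, 4, 1]
```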

#### Mapping

<kbd>dict</kbd> type objects map keys to values, where each key points to exactly one value. In other words, you use a key to look up a value. Dictionaries are mutable. They are instantiated with curly brackets `{}` and colons `:`. 

In [125]:
x = {"hello": 1, 3: 5.0}
x

{'hello': 1, 3: 5.0}

In [126]:
type(x)

dict

We can access the values of a dictionary by indexing with a key or with its `get` method. 

In [127]:
x

{'hello': 1, 3: 5.0}

In [128]:
x['hello']

1

In [129]:
x[3]

5.0

In [130]:
x.get('hello')

1

In [131]:
x.get('house') # returns None, which is not displayed

In [132]:
x.get('house', 'not here') # second argument is returned if first argument is not in dictionary

'not here'
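
The difference between the two access styles shows up for missing keys: indexing raises a `KeyError`, while `get` quietly returns `None` or the given default. A short sketch, recreating the dictionary from above:

```python
x = {"hello": 1, 3: 5.0}

# Membership tests look at the keys.
print("house" in x)   # False

# Indexing with a missing key raises a KeyError.
try:
    x["house"]
except KeyError as error:
    print("missing key:", error)   # missing key: 'house'

# get returns None (or the default) instead of raising.
print(x.get("house") is None)      # True
print(x.get("house", 0))           # 0
```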

The keys of a dictionary must be unique and of an immutable (more precisely, hashable) type, e.g., numeric, boolean, string, or a tuple of such values. 

In [133]:
x.keys() #  returns a dict_keys type object

dict_keys(['hello', 3])

In [134]:
x['new key'] = [8]
x

{'hello': 1, 3: 5.0, 'new key': [8]}

In [None]:
x[(1,2)] = "lists are mutable and cant be keys"

In [137]:
x

{'hello': 1,
 3: 5.0,
 'new key': [8],
 (1, 2): 'lists are mutable and cant be keys'}
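
As the string stored above says, a list cannot serve as a key. The following sketch shows the failing attempt next to the working tuple; the dictionary `lookup` and its contents are made up for illustration:

```python
lookup = {}
lookup[(1, 2)] = "a tuple works as a key"

# A list is mutable and therefore not hashable, so this raises a TypeError.
try:
    lookup[[1, 2]] = "a list does not"
except TypeError as error:
    print("TypeError:", error)

print(len(lookup))   # 1: only the tuple key was stored
```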

Only the value stored under a key can be changed, not the key itself. 

In [138]:
x

{'hello': 1,
 3: 5.0,
 'new key': [8],
 (1, 2): 'lists are mutable and cant be keys'}

In [144]:
x['hello'] = 5.0 ** 0.5
x

{'hello': 2.23606797749979,
 3: 5.0,
 'new key': [8],
 (1, 2): 'lists are mutable and cant be keys'}

The <kbd>dict</kbd> type is useful because values can be looked up efficiently by their key. 
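
A typical use of this fast lookup is counting. The sketch below, with made-up data, uses `get` with a default of `0` to tally how often each word occurs, and then iterates over the key-value pairs with the `items` method:

```python
words = ["apple", "horse", "apple", "zebra", "apple"]

counts = {}
for word in words:
    # Look up the current count (0 if the word is new) and add one.
    counts[word] = counts.get(word, 0) + 1

print(counts)   # {'apple': 3, 'horse': 1, 'zebra': 1}

# items() yields (key, value) pairs, which can be unpacked directly.
for word, count in counts.items():
    print(word, count)
```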

#### Set

A <kbd>set</kbd> is an unordered collection of unique items. It is instantiated with curly brackets. So that uniqueness can be checked, the items must be immutable (more precisely, hashable)!

In [142]:
x = {"apple", True, 2} # display order changed, they are unordered
x

{2, True, 'apple'}

In [141]:
x[2]

TypeError: 'set' object is not subscriptable

In [143]:
{"apple", [2,3], 2}

TypeError: unhashable type: 'list'

Sets are unordered. Hence, they do not support indexing. 

In [145]:
x[1] 

TypeError: 'set' object is not subscriptable

In [146]:
x.add("new item")
x

{2, True, 'apple', 'new item'}

In [147]:
x.add("new item") # the items are unique
x

{2, True, 'apple', 'new item'}

In [150]:
x.remove(12)
x

KeyError: 12
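
If it is not known whether an item is present, the `discard` method removes it without raising an error. The sketch below also shows the usual set operations and how a set removes duplicates from a list. All names and values are chosen for illustration, and `sorted` is only used to make the printed order predictable:

```python
fruits = {"apple", "pear"}
fruits.discard("kiwi")     # no error, although "kiwi" is not in the set
fruits.discard("pear")
print(fruits)              # {'apple'}

a = {1, 2, 3}
b = {3, 4}
print(sorted(a | b))       # [1, 2, 3, 4]: union
print(sorted(a & b))       # [3]: intersection
print(sorted(a - b))       # [1, 2]: difference

# Converting a list to a set drops duplicate entries.
unique = set([1, 1, 2, 3, 3])
print(sorted(unique))      # [1, 2, 3]
```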