# STA160 2024 Fall Discussion 01: Python Basics

## References

* Python for Data Analysis
* Python Data Science Handbook
* Hands-On Machine Learning with Scikit-Learn & TensorFlow

## Preparation

The primary programming language for the discussion is **Python 3**.

If your operating system is Windows or macOS, the best way to set up Python 3 is to get Anaconda. Anaconda is a Python distribution designed specifically for scientific computing. Anaconda includes many of the packages we'll use this quarter. You can find Anaconda at: <https://www.anaconda.com/download/>

If your operating system is Linux, the best way to set up Python 3 is through your distribution's package manager. You'll also need to install Jupyter this way.

### Python Environment Setup

For this class, installing [Anaconda](https://www.anaconda.com/) for setting up your programming environment is strongly recommended.

Several tools you may use for programming:

- [Jupyter Notebook](https://jupyter.org/): a great platform to organize your project with Markdown and Python. Jupyter notebooks enable you to write text and code in the same document, and to run the code to display the results. Jupyter notebooks are very popular in the data science community and are convenient for data analysis.
- [PyCharm](https://www.jetbrains.com/pycharm/): a great IDE for Python.


(Advanced) Text Editors you can use for coding:

- [Vim](https://www.vim.org/)
- [Sublime Text 3](https://www.sublimetext.com/)
- [Visual Studio Code](https://code.visualstudio.com/)

Markdown references:

- [Markdown Syntax](https://www.markdownguide.org/basic-syntax/)
- [Markdown Cheatsheet](https://www.markdownguide.org/cheat-sheet/)


### Python Basic Syntax

Please quickly review all the listed commands.

- Operators: `+`, `-`, `*`, `**`, `/`, `//`, `=`, `==`, `is`, `in`, ...
- Control flow tools: `if...elif...else...`, `for`, `while`, `break`, `continue`, `pass`, ...
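
As a quick refresher, the following sketch exercises a few of the operators and control-flow statements listed above. The variable names and the `classify` helper are illustrative only and are not part of the course material.

```python
# Arithmetic operators
print(7 / 2)    # true division -> 3.5
print(7 // 2)   # floor division -> 3
print(2 ** 10)  # exponentiation -> 1024

# `==` compares values, `is` compares object identity
a = [1, 2, 3]
b = [1, 2, 3]
print(a == b)   # True: same contents
print(a is b)   # False: two distinct list objects

# `in` tests membership
print(2 in a)   # True

# if / elif / else inside a function (hypothetical helper)
def classify(x):
    if x < 0:
        return 'negative'
    elif x == 0:
        return 'zero'
    else:
        return 'positive'

# for loop with continue and break
labels = []
for x in [-2, 0, 5, 99, 7]:
    if x == 0:
        continue      # skip zero and move to the next element
    if x == 99:
        break         # stop the loop entirely
    labels.append(classify(x))
print(labels)   # ['negative', 'positive']
```

Note that `=` assigns a name while `==` tests equality; mixing them up is a common source of errors.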

### Python Basic Data Structures

To warm up, we summarize several basic data structures in Python. Below are four commonly-used structures.

#### Strings

Strings in Python 3 are immutable sequences of Unicode characters.
Square brackets can be used to access elements of the string. Elements are numbered starting from 0, not 1 (unlike R).

In [1]:
s='Python'
print(s[0])  # select the first character
print(s[1])  # select the second character
print(s[-1]) # select the last character
print(s[-2]) # select the second last character

P
y
n
o


You can get a slice (multiple elements) of the container with the `:` operator.
The first number is included, but the last number is not (unlike R).

In [2]:
s[0:4]   # from s[0] to s[3]

'Pyth'

In [3]:
s[1:]    # from s[1] to end

'ython'

In [4]:
s[:4]    # from s[0] to s[3]

'Pyth'

In [5]:
s[:]     # from s[0] to end

'Python'

In [6]:
s[0:5:2]  # from s[0] to s[4] with increment 2, so returns s[0] s[2] s[4]

'Pto'

In [7]:
len(s) # returns the length of a string

6

In [8]:
s.lower() # returns the string in lower case

'python'

In [9]:
s.upper() # returns the string in upper case

'PYTHON'

In [10]:
s.replace('hon','orch') # replaces a string with another string

'Pytorch'

## List, Tuple, Set and Dictionary
There are four collection data types in the Python programming language:

__List__ is a collection which is ordered and changeable. Allows duplicate members.

__Tuple__ is a collection which is ordered and unchangeable. Allows duplicate members.

__Set__ is a collection which is unordered and unindexed. No duplicate members.

__Dictionary__ is a collection which is changeable and indexed by key. Since Python 3.7 it preserves insertion order. No duplicate keys.
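
To make the differences concrete, here is a small sketch that builds each of the four collection types from the same data. The variable names are illustrative.

```python
items = [3, 1, 3, 2]

as_list = list(items)     # ordered, mutable, keeps duplicates
as_tuple = tuple(items)   # ordered, immutable, keeps duplicates
as_set = set(items)       # no duplicates, no positional indexing
as_dict = dict.fromkeys(items, 0)  # unique keys, kept in insertion order

as_list[0] = 10           # lists support item assignment
print(as_list)            # [10, 1, 3, 2]
print(as_tuple)           # (3, 1, 3, 2)
print(len(as_set))        # 3
print(list(as_dict))      # [3, 1, 2]
```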

#### List

In [11]:
L=[1,'a',3.0]  # list can have different types
print(L)

[1, 'a', 3.0]


In [12]:
L[1]        # selection

'a'

In [13]:
L[0:3]      # slicing

[1, 'a', 3.0]

In [14]:
L.append(6) # append(): add an element at the end of the list
print(L)

[1, 'a', 3.0, 6]


In [15]:
L.extend([1,2,3,4]) # extend(): add the elements of a list to the end of the current list
print(L)
print(len(L))

[1, 'a', 3.0, 6, 1, 2, 3, 4]
8


In [16]:
L=[1,'a',3.0]
L.append(6)
L.append([1,2,3,4])
print(L)
print(len(L))

[1, 'a', 3.0, 6, [1, 2, 3, 4]]
5


In [17]:
L=[1,'a',3]
L.insert(2,'b') # insert() adds an element at the specified position
print(L)

[1, 'a', 'b', 3]


In [18]:
print(L)
L.pop(1)      # pop(): removes the element at the specified position
print(L)      # remove L[1]

[1, 'a', 'b', 3]
[1, 'b', 3]


In [19]:
print(L)
L.pop()       # remove L[-1]
print(L)

[1, 'b', 3]
[1, 'b']


In [20]:
L=[1,5,6,3,2,7,8,3]
L.remove(3)  # remove(): removes the first occurrence of the element with the specified value
print(L)     # remove the first value 3

[1, 5, 6, 2, 7, 8, 3]


In [21]:
print(L)
L.reverse()  # reverse(): reverses the order of the list
print(L)

[1, 5, 6, 2, 7, 8, 3]
[3, 8, 7, 2, 6, 5, 1]


In [22]:
L.sort()              # sort(): sorts the list ascending by default
print(L)

L.sort(reverse=True)  # sort descending
print(L)

[1, 2, 3, 5, 6, 7, 8]
[8, 7, 6, 5, 3, 2, 1]


### Tuple (an unchangeable list)

In [23]:
T=(1,3,5,6)
print(T)
print(type(T))

(1, 3, 5, 6)
<class 'tuple'>


In [24]:
T[0:2]

(1, 3)

In [25]:
T[0]=0  # Error

TypeError: 'tuple' object does not support item assignment

### Set

In [26]:
S={1,1,2,5,7,7}
print(S)
print(type(S))

{1, 2, 5, 7}
<class 'set'>


In [27]:
S.add(9) # add(): add one element to a set
print(S)

{1, 2, 5, 7, 9}


In [28]:
S.remove(2) # remove(): remove an element from a set; it must be a member
print(S)

{1, 5, 7, 9}


In [29]:
S1={1,3,4,5}
S2={3,4,6,7}
## S1 | S2
print(S1.union(S2)) # Union
## S1 & S2
print(S1.intersection(S2)) # Intersection

{1, 3, 4, 5, 6, 7}
{3, 4}


In [30]:
print(S2.union(S1)) # Union

{1, 3, 4, 5, 6, 7}


### Dictionary

A dictionary is a container for {key: value} pairs. You can use any __hashable__ type (e.g. numbers, strings, tuples of hashable items) as a key and __any__ type as a value.
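
One caveat on keys: they need to be hashable, which rules out mutable containers such as lists. The sketch below (with made-up keys and values) shows a tuple working as a key and a list being rejected.

```python
D = {}
D[(1, 2)] = 'tuple key'      # tuples of hashable items are hashable
D['name'] = 'string key'
D[3.5] = ['any', 'value']    # values can be any object, including lists

try:
    D[[1, 2]] = 'list key'   # lists are mutable, hence unhashable
except TypeError as err:
    error_message = str(err)
    print('TypeError:', error_message)

print(len(D))   # 3
```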

Three ways to define a dictionary:

In [31]:
# 1
D={1:'STA' , 'Hi':160 , 2.0:'2020'}
print(D)

{1: 'STA', 'Hi': 160, 2.0: '2020'}


In [32]:
D['Hi']

160

In [33]:
# 2
dict(A='STA', B=160, C='2020')

{'A': 'STA', 'B': 160, 'C': '2020'}

In [34]:
# 3
D={}
D[1]='STA'
D['Hi']=160
D[2.0]='2020'
print(D)

{1: 'STA', 'Hi': 160, 2.0: '2020'}


In [35]:
D.items()

dict_items([(1, 'STA'), ('Hi', 160), (2.0, '2020')])

In [36]:
D.keys()

dict_keys([1, 'Hi', 2.0])

In [37]:
D.values()

dict_values(['STA', 160, '2020'])

In [38]:
for x,y in D.items():
    print(x,y)

1 STA
Hi 160
2.0 2020


In [39]:
for x in D.values():
    print(x)
print('finish')

STA
160
2020
finish


In [40]:
a=[str(x) for x in D.values()] ## list comprehension [function(i) for i in ... ]
print(a)

['STA', '160', '2020']


From __List__ to __String__

In [41]:
a

['STA', '160', '2020']

In [42]:
' '.join(a)

'STA 160 2020'

In [43]:
b=[1,2,3]
', '.join(map(str,b)) # map() to convert each item in the list to a string then join them

'1, 2, 3'

From __String__ to __List__

In [44]:
list('Python')

['P', 'y', 't', 'h', 'o', 'n']

From __List__ to __Set__

In [45]:
set([1,2,3])

{1, 2, 3}

From __Set__ to __List__

In [46]:
list({1,2,3})

[1, 2, 3]

From __Set__ to __Tuple__

In [47]:
tuple({1,2,3})

(1, 2, 3)

From __Tuple__ to __List__

In [48]:
list((1,2,3))

[1, 2, 3]

### Inserting values into strings

You can use the string method `format()` to create new strings with inserted values.
The curly braces show where the inserted value should go.

In [49]:
"Month {}, Year {}.".format('April', 2020)

'Month April, Year 2020.'

For `%` operator formatting, you show where the inserted values should go using a `%` character followed by a format specifier, which says how the value should be inserted.

In [50]:
# Notice the %s marker to insert a string, and the %d marker to insert an integer.
"Month %s, Year %d." % ('April', 2020)

'Month April, Year 2020.'
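
A third option, not covered in the cells above, is the f-string (available since Python 3.6). The following sketch produces the same string; the variable names are illustrative.

```python
month = 'April'
year = 2020

# f-string: expressions inside {} are evaluated and inserted
s1 = f"Month {month}, Year {year}."
print(s1)   # Month April, Year 2020.

# format specifiers follow a colon, e.g. two decimal places
pi_approx = 3.14159
s2 = f"pi is about {pi_approx:.2f}"
print(s2)   # pi is about 3.14
```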

## Iterators

Strings, lists, tuples, sets, and dictionaries are all iterable objects: containers from which you can get an iterator. Each of them implements the `__iter__()` method, which the built-in `iter()` function calls to obtain an iterator.

In [51]:
it=iter([1,2,3])
for x in it:
    print(x)

1
2
3


In [52]:
it=iter({'a':1,'b':2,'c':3})
print(next(it))
print(next(it))
print(next(it))

a
b
c
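
An iterator can only be consumed once. The sketch below shows what happens when `next()` is called on an exhausted iterator, and how a default value avoids the exception.

```python
it = iter('ab')
first = next(it)    # 'a'
second = next(it)   # 'b'

try:
    next(it)        # nothing left, so StopIteration is raised
    exhausted = False
except StopIteration:
    exhausted = True

# next() accepts a default to return instead of raising
fallback = next(it, None)
print(first, second, exhausted, fallback)   # a b True None
```

A `for` loop handles `StopIteration` internally, which is why looping over an iterator simply ends when the elements run out.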


In [53]:
# iterator: pairs each element with its index, starting from 0
enumerate('Python')

<enumerate at 0x7fc43e24aec0>

In [54]:
list(enumerate('Python')) # convert to list

[(0, 'P'), (1, 'y'), (2, 't'), (3, 'h'), (4, 'o'), (5, 'n')]

In [55]:
dict(enumerate('Python')) # convert to dictionary

{0: 'P', 1: 'y', 2: 't', 3: 'h', 4: 'o', 5: 'n'}

In [56]:
# iterator: element-wise pairs
zip([1,2,3],['a','b','c'])

<zip at 0x7fc43e2eb0c0>

In [57]:
list(zip([1,2,3],['a','b'])) # convert to list

[(1, 'a'), (2, 'b')]

In [58]:
dict(zip([1,2,3],['a','b'])) # convert to dictionary

{1: 'a', 2: 'b'}

In [59]:
t=zip('abc','ABC')
[ u+l for l,u in t]

['Aa', 'Bb', 'Cc']

## For, While loops

In [60]:
L=[1,3,4]
# record L squared in result
result=[]
for i in range(len(L)):
    result.append(L[i]**2)
print(result)

[1, 9, 16]
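
The index-based loop above can also be written by iterating over the elements directly. This sketch shows three equivalent variants; which one to use is mostly a matter of style.

```python
L = [1, 3, 4]

# iterate over the elements directly
result = []
for value in L:
    result.append(value ** 2)

# the same computation as a list comprehension
result_comp = [value ** 2 for value in L]

# enumerate() gives the index and the element together
indexed = [(i, value ** 2) for i, value in enumerate(L)]

print(result)       # [1, 9, 16]
print(result_comp)  # [1, 9, 16]
print(indexed)      # [(0, 1), (1, 9), (2, 16)]
```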


In [61]:
# range() function
print(list(range(3)))
print(list(range(1,4)))

[0, 1, 2]
[1, 2, 3]


In [62]:
M=[[1,2,3],[4,5,6],[7,8,9]]

for i in range(len(M)):
    for j in range(len(M[0])):
        if M[i][j]==5:
            M[i][j]*=10
        else:
            M[i][j]+=1
print(M)

[[2, 3, 4], [5, 50, 7], [8, 9, 10]]


In [63]:
x=10
addup=0
while x>0:
    print(addup)
    addup+=x
    x-=1
print("The total sum is: ",addup)

0
10
19
27
34
40
45
49
52
54
The total sum is:  55


In [None]:
# x -= 1 is shorthand for x = x - 1

### break and continue

In [64]:
i = 0
while i < 6:
    i += 1
    print(i)
    if i == 3:
        break
    print('finish an iteration')

1
finish an iteration
2
finish an iteration
3


In [65]:
i = 0
while i < 6:
    i += 1
    print(i)
    if i == 3:
        continue # skip the rest and do next iteration
    print('finish an iteration')

1
finish an iteration
2
finish an iteration
3
4
finish an iteration
5
finish an iteration
6
finish an iteration


### Advanced Data Structures from Packages

For example,

- [numpy.array](https://numpy.org/devdocs/user/quickstart.html)

- [pandas.Series, pandas.DataFrame](https://pandas.pydata.org/pandas-docs/stable/getting_started/10min.html)

### Module and Package

*Python* is **open**. Python is developed under an OSI-approved open source license, making it freely usable and distributable, even for commercial use. Everyone can contribute to this community, such as developing useful modules.

It is much more convenient to manage your programming environment using *conda*. Read the [user guide](https://conda.io/projects/conda/en/latest/user-guide/getting-started.html).

In [66]:
import math
#import numpy

math.sqrt(2)

1.4142135623730951

In [67]:
import math as m

m.sqrt(2)

1.4142135623730951

In [68]:
from math import sqrt

sqrt(2)

1.4142135623730951

In [69]:
from math import *

sqrt(2)

1.4142135623730951

Below are some useful modules you may use in this class.

- [numpy](https://numpy.org/): The fundamental package for scientific computing with Python
- [scikit-learn](https://scikit-learn.org/stable/): Machine Learning in Python
- [matplotlib](https://matplotlib.org/): Visualization with Python
- [seaborn](https://seaborn.pydata.org/): Statistical data visualization
- [statsmodels](https://www.statsmodels.org/stable/index.html): statistical models, hypothesis tests, and data exploration
- [PyTorch](https://pytorch.org/): An open source machine learning framework that accelerates the path from research prototyping to production deployment


### Function

You can define your own function or call functions from modules.

In [70]:
# define your own functions
def myfunction(a):
    b = a+10
    return b

In [71]:
myfunction(2)

12
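
Functions can also take default argument values and a docstring. The function below is a hypothetical extension of `myfunction`, written only to illustrate the syntax.

```python
def shift_and_scale(a, shift=10, scale=1):
    """Return (a + shift) * scale."""
    return (a + shift) * scale

print(shift_and_scale(2))              # 12, both defaults used
print(shift_and_scale(2, shift=0))     # 2
print(shift_and_scale(2, scale=3))     # 36
```

Passing arguments by keyword, as in `scale=3`, lets you override one default without having to specify the others.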

In Python, we usually use [NumPy](https://numpy.org/) to implement the matrix computations for the sake of efficiency. The package `NumPy` provides several powerful classes and methods for numerical computations.

To be specific, the class `np.ndarray` (`np.array`) is commonly used for matrix computations. Please check out this [manual](https://numpy.org/doc/stable/reference/generated/numpy.ndarray.html).

### Vector - 1d array

In [72]:
import numpy as np
vec = np.array([1,2,3])
vec.shape

(3,)

`(3,)` is a tuple with only one element: a 1-D array has a single axis, so its shape consists only of its length.

In [73]:
len(vec)

3

In [74]:
vec.max()

3

### Matrix - 2d array

In [75]:
mat = np.array([[1,2,3],[4,5,9],[7,8,9],[10,11,12]])
mat

array([[ 1,  2,  3],
       [ 4,  5,  9],
       [ 7,  8,  9],
       [10, 11, 12]])

In [76]:
mat.shape

(4, 3)

`(4, 3)` is a tuple with two elements, which correspond to (number of rows, number of columns).

### Tensor - multidimensional array

Tensors generalize vectors and matrices to arrays with any number of dimensions, and they have a wide range of applications in modern machine learning. We will discuss them later in the section on deep neural networks. For example, in [PyTorch](https://pytorch.org/), we will use tensors to encode the inputs and outputs of a model, as well as the model’s parameters. Tensors can also run on GPUs or other specialized hardware to accelerate computing.


### Matrix Computations


#### 1. Matrix Addition, Subtraction, Element-wise Multiplication & Division

In [77]:
A = np.array([[1,2],[3,4]])
B = np.array([[2,3],[4,5]])

In [78]:
A+B

array([[3, 5],
       [7, 9]])

In [79]:
A-B

array([[-1, -1],
       [-1, -1]])

In [80]:
A*B

array([[ 2,  6],
       [12, 20]])

In [81]:
A/B

array([[0.5       , 0.66666667],
       [0.75      , 0.8       ]])

`+`, `-`, `*`, and `/` perform element-wise addition, subtraction, multiplication, and division. If the two shapes are not compatible (NumPy cannot broadcast them to a common shape), the operation raises a `ValueError`. Matrix-scalar operations always work, because the scalar is applied to every entry.

In [82]:
A + 1

array([[2, 3],
       [4, 5]])

In [83]:
A * 2

array([[2, 4],
       [6, 8]])
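
The scalar case is a special instance of NumPy broadcasting. As a sketch (the arrays `row` and `col` are made up for illustration), a 1-D array or a column can also be combined with a matrix when the shapes are compatible, while incompatible shapes raise a `ValueError`.

```python
import numpy as np

A = np.array([[1, 2], [3, 4]])        # shape (2, 2)
row = np.array([10, 20])              # shape (2,)
col = np.array([[10], [20]])          # shape (2, 1)

sum_row = A + row   # row is added to every row of A
sum_col = A + col   # col is added to every column of A
print(sum_row)
print(sum_col)

# Shapes (2, 2) and (3,) are not broadcast-compatible
try:
    A + np.array([1, 2, 3])
    failed = False
except ValueError:
    failed = True
print(failed)    # True
```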

#### 2. Matrix Multiplication

When computing a matrix product using NumPy and the operator `@`, make sure that the shapes of the two operands follow the rule `(m,p) @ (p,n)` $\rightarrow$ `(m,n)`, i.e., the inner dimensions must match.

- Matrix-Vector Product (2darray-1darray Product)

In [84]:
mat = np.array([[1,2,3],[4,5,9],[7,8,9],[10,11,12]])
b = np.array([1,1,1])

In [85]:
mat

array([[ 1,  2,  3],
       [ 4,  5,  9],
       [ 7,  8,  9],
       [10, 11, 12]])

In [86]:
b

array([1, 1, 1])

In [87]:
mat @ b

array([ 6, 18, 24, 33])

In [88]:
(mat @ b).shape

(4,)

Here the right operand is a 1-D array whose length matches the number of columns of `mat`, so the product is a 1-D array with one entry per row of `mat`.

- Matrix-Matrix Product (2darray-2darray Product)

In [89]:
A @ B

array([[10, 13],
       [22, 29]])

In [90]:
np.matmul(A,B)

array([[10, 13],
       [22, 29]])

In [91]:
b = b.reshape(-1,1)
b
# b.reshape(1,-1)

array([[1],
       [1],
       [1]])

In [92]:
b.shape

(3, 1)

In [93]:
(mat @ b).shape

(4, 1)

With a 2-D right operand of shape `(3, 1)`, the product is a 2-D array of shape `(4, 1)`, following the multiplication rule.
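
The shape behaviour of `@` can be summarized in one sketch. The arrays here are filled with ones purely so that only the shapes matter.

```python
import numpy as np

mat = np.ones((4, 3))
v = np.ones(3)            # 1-D, shape (3,)
c = np.ones((3, 1))       # 2-D column, shape (3, 1)

print((mat @ v).shape)    # (4,)
print((mat @ c).shape)    # (4, 1)
print((v @ v).shape)      # (): 1-D @ 1-D is the inner product, a scalar

# inner dimensions must agree: (4, 3) @ (4,) is invalid
try:
    mat @ np.ones(4)
    mismatch = False
except ValueError:
    mismatch = True
print(mismatch)           # True
```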

- Matrix Transpose

In [94]:
mat.T

array([[ 1,  4,  7, 10],
       [ 2,  5,  8, 11],
       [ 3,  9,  9, 12]])

#### 3. Eigenvalues and Eigenvectors

[`np.linalg.eig`](https://numpy.org/doc/stable/reference/generated/numpy.linalg.eig.html)



In [95]:
# set a random seed
np.random.seed(2021)
# Create a 3 by 3 random matrix
A = np.random.rand(3,3)
print(A)

# Compute eigenvalues and eigenvectors
eigenvals, eigenvecs = np.linalg.eig(A)

[[0.60597828 0.73336936 0.13894716]
 [0.31267308 0.99724328 0.12816238]
 [0.17899311 0.75292543 0.66216051]]


In [96]:
eigenvals

array([1.46993771, 0.28721889, 0.50822548])

In [97]:
eigenvecs

array([[-0.55986774, -0.76336389, -0.14005302],
       [-0.54059029,  0.42411145, -0.16626493],
       [-0.62794128, -0.48724229,  0.97608459]])

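Note that the eigendecomposition $A = U\Lambda U^{-1}$, where the columns of $U$ are the eigenvectors and $\Lambda$ is the diagonal matrix of eigenvalues, exists only if the eigenvectors are linearly independent. When $A$ is symmetric, $U$ can be chosen orthogonal, which gives $A = U\Lambda U^\top$.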
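
The decomposition can be checked numerically. The sketch below assumes a random matrix (which is diagonalizable with probability one) and rebuilds it as $U\Lambda U^{-1}$; for the symmetric case it uses `np.linalg.eigh`, whose eigenvectors are orthonormal. The seed and variable names are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(0)         # seed chosen arbitrarily
A = rng.random((3, 3))

eigenvals, U = np.linalg.eig(A)        # columns of U are eigenvectors
Lam = np.diag(eigenvals)

# A = U Lam U^{-1} (entries may be complex for a general real matrix)
A_rebuilt = U @ Lam @ np.linalg.inv(U)
print(np.allclose(A, A_rebuilt))

# For a symmetric matrix, eigh returns orthonormal eigenvectors,
# so the inverse can be replaced by the transpose
S = A + A.T
w, Q = np.linalg.eigh(S)
print(np.allclose(S, Q @ np.diag(w) @ Q.T))
```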


#### 4. Singular Value Decomposition (SVD)

[`np.linalg.svd`](https://numpy.org/doc/stable/reference/generated/numpy.linalg.svd.html)

In [98]:
# Create a 5 by 3 random matrix
A = np.random.randn(5, 3)
# Singular Value Decomposition
U, S, Vt = np.linalg.svd(A, full_matrices=True)

In [99]:
U, U.shape

(array([[ 0.22073128, -0.03043587,  0.16712615, -0.89000162,  0.36099491],
        [-0.11471826,  0.92761524, -0.33552023, -0.09536938,  0.06856048],
        [ 0.22007922,  0.12998267,  0.32382857,  0.43547724,  0.80010267],
        [ 0.68167637, -0.16826726, -0.70307225,  0.0763847 ,  0.08281451],
        [-0.65192015, -0.30560473, -0.51021515, -0.057678  ,  0.46686145]]),
 (5, 5))

In [100]:
S, S.shape

(array([3.36166521, 2.40725415, 1.48282724]), (3,))

In [101]:
Vt, Vt.shape

(array([[-0.5160804 ,  0.28883446, -0.80637193],
        [ 0.19164251,  0.95649989,  0.21995704],
        [ 0.83482583, -0.04101963, -0.5489838 ]]),
 (3, 3))
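
Because `S` is returned as a 1-D array of singular values, rebuilding `A` requires placing those values on the diagonal of a matrix with the same shape as `A`. The following sketch does this on a fresh random matrix (seed chosen arbitrarily) and also shows the reduced form obtained with `full_matrices=False`.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((5, 3))

U, S, Vt = np.linalg.svd(A, full_matrices=True)

# S holds only the singular values; embed them in a 5 x 3 matrix
Sigma = np.zeros(A.shape)
Sigma[:len(S), :len(S)] = np.diag(S)
print(np.allclose(A, U @ Sigma @ Vt))

# With full_matrices=False, U is 5 x 3 and scaling its columns suffices
U_r, S_r, Vt_r = np.linalg.svd(A, full_matrices=False)
print(U_r.shape)
print(np.allclose(A, (U_r * S_r) @ Vt_r))
```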

#### 5. Norms

[`np.linalg.norm`](https://numpy.org/doc/stable/reference/generated/numpy.linalg.norm.html)`(x, ord=None, axis=None, keepdims=False)`

Below are several norms that are useful in this class:

| ord  | matrix norm                     | vector norm |
| ---- | ------------------------------- | ----------- |
| None | Frobenius norm                  | 2-norm      |
| 1    | max(sum(abs(x), axis=0))        | 1-norm      |
| 2    | 2-norm (largest singular value) | 2-norm      |
| inf  | max(sum(abs(x), axis=1))        | max(abs(x)) |
| 0    | –                               | sum(x != 0) |

In [102]:
np.linalg.norm(A) # Frobenius norm

4.392543930507632

In [103]:
np.linalg.norm(b, 1) # b has shape (3, 1), so this is the matrix 1-norm (max column sum); here it equals the L1-norm of the vector

3.0
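
To connect the table with concrete numbers, the sketch below evaluates several norms on a small made-up matrix `M` and vector `v` and compares two of them with their definitions.

```python
import numpy as np

M = np.array([[1.0, -2.0], [3.0, 4.0]])
v = np.array([3.0, -4.0])

fro = np.linalg.norm(M)                       # Frobenius norm
print(np.isclose(fro, np.sqrt((M ** 2).sum())))

two = np.linalg.norm(M, 2)                    # largest singular value
print(np.isclose(two, np.linalg.svd(M, compute_uv=False)[0]))

print(np.linalg.norm(M, 1))        # max column sum of |M|: max(4, 6) = 6.0
print(np.linalg.norm(M, np.inf))   # max row sum of |M|: max(3, 7) = 7.0
print(np.linalg.norm(v))           # 2-norm: 5.0
print(np.linalg.norm(v, 1))        # 1-norm: 7.0
print(np.linalg.norm(v, np.inf))   # max(abs(v)): 4.0
```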

#### Random Variables

Draw random samples from a normal (Gaussian) distribution.

`np.random.normal(loc=0.0, scale=1.0, size=None)`

- `loc`: float or array_like of floats. Mean (“centre”) of the distribution.
- `scale`: float or array_like of floats. Standard deviation (spread or “width”) of the distribution. Must be non-negative.
- `size`: int or tuple of ints, optional. Output shape. If the given shape is, e.g., `(m, n, k)`, then `m * n * k` samples are drawn. If `size` is `None` (default), a single value is returned if `loc` and `scale` are both scalars. Otherwise, `np.broadcast(loc, scale).size` samples are drawn.

In [104]:
import numpy as np

In [105]:
n=10 ## number of samples
p=5 ## dimension
Z=np.random.normal(0,1,size=(n,p))
print(Z)

[[-2.68680391  0.55787157  0.77617621  0.20264991 -0.50735635]
 [ 1.05898217  0.46323535  0.49699852 -1.25014539 -1.51195802]
 [ 0.89457475  0.24281041  1.00678612 -0.04124402  0.34396583]
 [ 0.02548591 -1.07844433  0.81565795 -0.13376737  0.61319221]
 [ 0.32808139  1.7748439   1.15295013  1.02961911 -0.50175762]
 [-1.09042286 -1.10397942 -0.64776684 -0.82386881 -0.97420544]
 [ 0.42570077  1.59288883 -1.2010321  -1.76059321 -0.8979381 ]
 [-0.21097685  1.78542198  0.31632373  0.42776574  1.32762063]
 [ 0.42865209 -0.80177471 -1.40404436  0.4366976   0.86278299]
 [-0.48927924  1.50419932 -1.05216008 -1.47562916  0.92864955]]


In the code above, we generate an $n\times p$ matrix $Z$ whose entries are i.i.d. samples from the standard Gaussian distribution. To generate $n$ samples from the multivariate normal distribution with mean zero and a specified covariance matrix $\Sigma$, we can multiply $Z$ by the symmetric square root of the covariance matrix, i.e.
$$X=Z\Sigma^{\frac{1}{2}}\in\mathbb{R}^{n\times p}$$
By the properties of the multivariate normal distribution, each row of $X$ is then a sample from the multivariate normal distribution with mean zero and covariance matrix $\Sigma$.

In [106]:
Sigmasqrt = 5*np.identity(p) ## square root of Sigma = 25*I; any symmetric positive definite matrix can be used instead
print(Sigmasqrt)
X = np.matmul(Z,Sigmasqrt)
print(X)
print(X.shape)

[[5. 0. 0. 0. 0.]
 [0. 5. 0. 0. 0.]
 [0. 0. 5. 0. 0.]
 [0. 0. 0. 5. 0.]
 [0. 0. 0. 0. 5.]]
[[-13.43401957   2.78935784   3.88088105   1.01324957  -2.53678177]
 [  5.29491087   2.31617676   2.4849926   -6.25072696  -7.55979012]
 [  4.47287377   1.21405207   5.03393061  -0.20622012   1.71982915]
 [  0.12742956  -5.39222167   4.07828977  -0.66883685   3.06596105]
 [  1.64040697   8.87421949   5.76475063   5.14809554  -2.5087881 ]
 [ -5.45211431  -5.51989708  -3.23883419  -4.11934406  -4.87102721]
 [  2.12850386   7.96444413  -6.00516048  -8.80296605  -4.4896905 ]
 [ -1.05488426   8.92710991   1.58161866   2.13882868   6.63810317]
 [  2.14326046  -4.00887353  -7.02022181   2.18348798   4.31391497]
 [ -2.44639619   7.5209966   -5.26080041  -7.37814582   4.64324774]]
(10, 5)
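
For a covariance matrix that is not a multiple of the identity, the square root has to be computed. One possible approach, sketched below under the assumption that $\Sigma$ is symmetric positive definite, is to use the eigendecomposition $\Sigma = Q\,\mathrm{diag}(w)\,Q^\top$ and set $\Sigma^{1/2} = Q\,\mathrm{diag}(\sqrt{w})\,Q^\top$. The matrix `Sigma`, the seed, and the sample size are made-up values for illustration.

```python
import numpy as np

p = 3
Sigma = np.array([[4.0, 1.0, 0.0],
                  [1.0, 3.0, 1.0],
                  [0.0, 1.0, 2.0]])   # symmetric positive definite

# Symmetric square root: Sigma^{1/2} = Q diag(sqrt(w)) Q^T
w, Q = np.linalg.eigh(Sigma)
Sigmasqrt = Q @ np.diag(np.sqrt(w)) @ Q.T
print(np.allclose(Sigmasqrt @ Sigmasqrt, Sigma))

rng = np.random.default_rng(0)
n = 200_000
Z = rng.standard_normal((n, p))
X = Z @ Sigmasqrt                      # each row ~ N(0, Sigma)

# rowvar=False: columns are variables, rows are observations
sample_cov = np.cov(X, rowvar=False)
print(np.round(sample_cov, 1))
```

With this many samples, the empirical covariance of `X` should be close to `Sigma`, which gives a simple sanity check of the construction.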


The function `np.random.multivariate_normal()` does the same thing directly: it returns one or more samples (controlled by the `size` argument) from the multivariate normal distribution with the specified mean vector and covariance matrix.

In [107]:
Omega = np.identity(p)*16
for i in range(p-1):
    Omega[i,i+1] = 5
    Omega[i+1,i] = 5
print(Omega)
# draw one sample and reshape it into a p x 1 column vector
beta = np.random.multivariate_normal(mean=np.zeros(p),cov=Omega).reshape(-1,1)
print(beta)
print(beta)

[[16.  5.  0.  0.  0.]
 [ 5. 16.  5.  0.  0.]
 [ 0.  5. 16.  5.  0.]
 [ 0.  0.  5. 16.  5.]
 [ 0.  0.  0.  5. 16.]]
[[-1.14230096]
 [ 7.97492267]
 [ 2.12671097]
 [ 0.17632685]
 [ 2.75735767]]
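
To draw several samples at once, the `size` argument can be used. The sketch below rebuilds the same tridiagonal `Omega` and uses NumPy's `Generator` interface (`np.random.default_rng`), which is an alternative to the legacy `np.random.multivariate_normal` call above; the seed and the number of samples are arbitrary.

```python
import numpy as np

p = 5
Omega = 16 * np.identity(p)
idx = np.arange(p - 1)
Omega[idx, idx + 1] = 5     # superdiagonal
Omega[idx + 1, idx] = 5     # subdiagonal

rng = np.random.default_rng(0)
one = rng.multivariate_normal(mean=np.zeros(p), cov=Omega)
many = rng.multivariate_normal(mean=np.zeros(p), cov=Omega, size=1000)
print(one.shape)    # (5,)
print(many.shape)   # (1000, 5)
```

Each row of `many` is one sample, matching the row-per-sample layout of `X` earlier in this section.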
