# Module 0: Python Introduction

**This file is a very incomplete refresher on some Python basics.** There are many resources online for learning Python. Again, this is not a coding class whose primary goal is to teach you Python, but a class that uses Python as a tool.


## 1. Syntax Basics
Python syntax is direct and easy to read. Below are some examples of basic operations and data types in Python. You can try these out in a blank Jupyter notebook:

In [None]:
print("Hello World!")  # print a string to the output

Hello World!


In [None]:
1 + 2 + 3  # add numbers

6

In [None]:
for i in range(10):  # loop over the integers 0 through 9
    print(i)

0
1
2
3
4
5
6
7
8
9


In [None]:
x = "Hello" + " " + "World" + "!"  # define and concatenate strings
x

'Hello World!'

In [None]:
x.split("o")  # split the string at every "o"

['Hell', ' W', 'rld!']

In [None]:
" real estate  ".strip()  # remove whitespace at the start (head) and end (tail) of the string

'real estate'
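
The `split` and `strip` methods shown above are often used together to clean up text. The following is a minimal sketch; the input string `raw` is made up purely for illustration.

```python
# hypothetical input string, chosen only for illustration
raw = " office , retail,  industrial "

# split at each comma, then strip the whitespace around every piece
parts = [piece.strip() for piece in raw.split(",")]
print(parts)  # ['office', 'retail', 'industrial']

# join the cleaned pieces back into a single string
print(" | ".join(parts))  # office | retail | industrial
```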

## 2. The Anonymous Function lambda

A `lambda` expression defines a small anonymous function in a single line, which makes it popular in Python. It is handy whenever another function expects a function as an input and you do not need a name to refer to that function elsewhere.

These two definitions are equivalent:

```python
# method one: define a function explicitly with a name
def add_two_numbers(x, y):
    return x + y

# method two: define a function using lambda
add_two_numbers = lambda x, y: x + y
```

Here is another example:

```python
# method one: define a function which maps x to x + 1
def add_one(x):
    return add_two_numbers(x, 1)

# method two: lambda
add_one = lambda x: add_two_numbers(x, 1)
```
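
A typical place where a function is needed as an input is the `key` argument of the built-in `sorted`, or the first argument of `map`. The sketch below assumes a small list of (name, price) pairs, `listings`, invented for illustration.

```python
# hypothetical (name, price) pairs, made up for illustration
listings = [("condo", 350), ("house", 520), ("studio", 210)]

# sort by the second element of each pair; the lambda serves as the key function
by_price = sorted(listings, key=lambda pair: pair[1])
print(by_price)  # [('studio', 210), ('condo', 350), ('house', 520)]

# map applies a function to every element; here the lambda extracts the name
names = list(map(lambda pair: pair[0], by_price))
print(names)  # ['studio', 'condo', 'house']
```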

With `lambda`, you can easily reduce the number of inputs of a function by fixing some of them. This may remind you of partial derivatives, where all variables but one are held constant:

```python
import numpy as np

original_function = lambda x, y: np.power(x, 3) + x * np.power(y, 2) + x / y - np.sqrt(x)

def derivative(func, x0, delta=0.01):
    """Approximate the derivative of func at x0 with a central difference of half-width delta."""
    return (func(x0 + delta) - func(x0 - delta)) / delta / 2
```

If you want to calculate the partial derivative of `original_function` with respect to `y` while `x` is held fixed, `lambda` can help:
```python
ys = np.arange(5, 6, 0.01)  # generate evenly spaced numbers from 5 up to (but not including) 6 with step size 0.01
x_fixed = 10

# use lambda as the first argument (func) for function `derivative`
derivatives = [derivative(lambda y: original_function(x_fixed, y), y) for y in ys]
```
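
One way to sanity-check the numerical result is to compare it with the partial derivative worked out by hand, which for this function is 2xy - x/y^2. The sketch below repeats the definitions so that it runs on its own; the names `numerical`, `analytic`, and `max_abs_error` are introduced here only for illustration.

```python
import numpy as np

# same function and numerical derivative as in the text
original_function = lambda x, y: np.power(x, 3) + x * np.power(y, 2) + x / y - np.sqrt(x)

def derivative(func, x0, delta=0.01):
    return (func(x0 + delta) - func(x0 - delta)) / delta / 2

ys = np.arange(5, 6, 0.01)
x_fixed = 10

# numerical partial derivative w.r.t. y at every point in ys
numerical = np.array(
    [derivative(lambda y: original_function(x_fixed, y), y0) for y0 in ys]
)

# analytic partial derivative w.r.t. y, derived by hand: 2*x*y - x / y**2
analytic = 2 * x_fixed * ys - x_fixed / ys**2

# the largest gap between the two should be tiny compared with the values themselves
max_abs_error = np.max(np.abs(numerical - analytic))
print(max_abs_error)
```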

## 3. Scientific Computing With NumPy

[NumPy](https://numpy.org/) is a powerful package for scientific computing. Here are some examples to help you understand its basic usage.

In [None]:
# import package
import numpy as np

# create a one-dim vector of zeros
x = np.zeros(5)

# add one to each element of the vector
x += 1
# display the result; \n starts a new line
print("\nThis is x: \n",x)

# generate a (4, 5) matrix of i.i.d. normal random variables with mean loc and standard deviation scale
y = np.random.normal(loc=4.2, scale=0.7, size=(4, 5))
print("\nThis is y: \n",y)

# reshape x from shape (5,) to (5, 1) and compute the matrix product y*x
z = y.dot(x.reshape(-1, 1))
print("\nThis is z: \n",z)

# transpose y and compute y*y'
a = y.dot(y.T)
print("\nThis is a: \n",a)

# compute the rank of matrix a
print("\nThe rank of matrix a is: ", np.linalg.matrix_rank(a))

# compute the inverse of matrix a
print("\nHere's the inverse of matrix a: \n", np.linalg.inv(a))  


This is x: 
 [1. 1. 1. 1. 1.]

This is y: 
 [[4.63657071 4.1571001  4.45117085 4.5471641  3.43995896]
 [3.8932531  4.20553194 3.55473942 3.49193479 5.43409597]
 [4.02373813 4.35155406 3.70249128 5.53659936 2.73536869]
 [4.73048213 4.31688768 4.71588022 3.56954387 3.54339158]]

This is z: 
 [[21.23196472]
 [20.57955522]
 [20.34975152]
 [20.87618547]]

This is a: 
 [[91.1022101  85.92838063 87.81199529 89.29056097]
 [85.92838063 87.20309849 81.32512212 85.05524283]
 [87.81199529 81.32512212 86.97110729 84.73551347]
 [89.29056097 85.05524283 84.73551347 88.54977394]]

The rank of matrix a is:  4

Here's the inverse of matrix a: 
 [[ 9.45193381 -0.32336456 -3.80591124 -5.5784307 ]
 [-0.32336456  0.19275775  0.13225526  0.01436061]
 [-3.80591124  0.13225526  1.70241704  2.08162886]
 [-5.5784307   0.01436061  2.08162886  3.63063458]]
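
A quick way to check an inverse is to multiply it by the original matrix and compare the product with the identity. The sketch below is one possible check, not part of the original notebook: it uses a seeded generator (`np.random.default_rng`) so that the numbers are reproducible, and the `@` operator as an alternative spelling of `.dot`.

```python
import numpy as np

rng = np.random.default_rng(seed=0)  # seeded generator so results are reproducible
y = rng.normal(loc=4.2, scale=0.7, size=(4, 5))

a = y @ y.T  # `@` is the matrix-multiplication operator, equivalent to y.dot(y.T)
a_inv = np.linalg.inv(a)

# multiplying a matrix by its inverse should give the identity, up to rounding error
print(np.allclose(a @ a_inv, np.eye(4)))

# `*` multiplies element by element, which is not the same as the matrix product
print(np.allclose(a * a_inv, np.eye(4)))
```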


## 4. Handling Data With Pandas

[Pandas](https://pandas.pydata.org/) provides tables (DataFrames) in Python that are easier to manipulate than raw arrays of numbers. Note that the version of Pandas matters, so please keep your Pandas up to date. If you run into errors, they may be caused by deprecated functions or methods.

In [None]:
import pandas as pd

# define a blank dataframe
df = pd.DataFrame()

# define the first column val and populate it with normally distributed random numbers
df["val"] = np.random.normal(loc=4.2, scale=0.7, size=(4,))

# define another column tag and populate it with letters a and b
df["tag"] = ["a", "b", "a", "b"]

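# define two more columns that are shifted versions of val by one period
# note: shift(1) moves values down one row, so each row holds the previous row's value (usually called a lag),
# while shift(-1) moves values up, so each row holds the next row's value (usually called a lead);
# the column names below are therefore swapped relative to the usual convention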
df["val_lead"] = df["val"].shift(1)
df["val_lag"] = df["val"].shift(-1)

# display the result
df

        val tag  val_lead   val_lag
0   4.05258   a       NaN  5.331946
1  5.331946   b   4.05258  5.016573
2  5.016573   a  5.331946  3.299166
3  3.299166   b  5.016573       NaN
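
The direction in which `shift` moves values is easy to mix up, so it can help to try it on a tiny deterministic example. This is a minimal sketch; the frame `s` and its column names are invented for illustration.

```python
import pandas as pd

# a small deterministic column so the direction of each shift is easy to see
s = pd.DataFrame({"val": [10, 20, 30, 40]})

s["shift_plus_1"] = s["val"].shift(1)    # row t holds the value from row t-1
s["shift_minus_1"] = s["val"].shift(-1)  # row t holds the value from row t+1
print(s)
```

The first row of `shift_plus_1` and the last row of `shift_minus_1` are `NaN` because there is no neighboring row to take a value from.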


In [None]:
# group the rows by the variable tag using groupby and report the minimum within each group
df.groupby("tag").min()
# note that NaN values are ignored when computing the minimum

          val  val_lead   val_lag
tag
a     4.05258  5.331946  3.299166
b    3.299166   4.05258  5.016573


In [None]:
# group the rows by the variable tag using groupby and report the mean within each group
df.groupby("tag").mean()

          val  val_lead   val_lag
tag
a    4.534576  5.331946  4.315556
b    4.315556  4.534576  5.016573
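
Several group statistics can also be computed in a single call with `agg`. The sketch below uses a small frame `df_demo` with made-up values, chosen so that the results are easy to verify by hand.

```python
import pandas as pd

# hypothetical values, chosen so the group statistics are easy to verify by hand
df_demo = pd.DataFrame({"tag": ["a", "b", "a", "b"], "val": [1.0, 2.0, 3.0, 6.0]})

# compute the minimum and the mean of val within each tag group in one call
summary = df_demo.groupby("tag")["val"].agg(["min", "mean"])
print(summary)
```

Group `a` contains 1.0 and 3.0 (minimum 1.0, mean 2.0), and group `b` contains 2.0 and 6.0 (minimum 2.0, mean 4.0).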


In [None]:
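# fill in the NaN using fillna, and assign the result back to replace the original column
# (calling fillna with inplace=True on a selected column is deprecated in recent pandas versions)
df["val_lead"] = df["val_lead"].fillna(2.2)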

df

        val tag  val_lead   val_lag
0   4.05258   a       2.2  5.331946
1  5.331946   b   4.05258  5.016573
2  5.016573   a  5.331946  3.299166
3  3.299166   b  5.016573       NaN


In [None]:
# groupby and report the first element
df.groupby("tag").first()

          val  val_lead   val_lag
tag
a     4.05258       2.2  5.331946
b    5.331946   4.05258  5.016573


In [None]:
# use ".apply" to exponentiate the val_lead column
# this creates a new column in memory and outputs the result but does not overwrite the original column
df["val_lead"].apply(np.exp)

0      9.025013
1     57.545715
2    206.840196
3    150.893247
Name: val_lead, dtype: float64

In [None]:
# use apply together with lambda
df["val_lead"].apply(lambda x: np.around(x / 6, 4))  # divide by 6 and keep at most 4 decimals

0    0.3667
1    0.6754
2    0.8887
3    0.8361
Name: val_lead, dtype: float64
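
For simple arithmetic like this, the same result can usually be obtained without `apply` by operating on the whole column at once. The sketch below is one way to compare the two approaches; the series `s` copies the `val_lead` values displayed earlier.

```python
import numpy as np
import pandas as pd

# values copied from the val_lead column displayed earlier
s = pd.Series([2.2, 4.05258, 5.331946, 5.016573], name="val_lead")

# element-by-element version with apply and lambda
via_apply = s.apply(lambda x: np.around(x / 6, 4))

# whole-column (vectorized) version: divide the column, then round it
vectorized = (s / 6).round(4)
print(np.allclose(via_apply, vectorized))

# NumPy functions such as np.exp also accept a whole column directly
print(np.allclose(np.exp(s), s.apply(np.exp)))
```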

In [None]:
# merge two dataframes
# first define two dfs
left = pd.DataFrame(
    {
        "key": ["K0", "K1", "K2", "K3"],
        "A": ["A0", "A1", "A2", "A3"],
        "B": ["B0", "B1", "B2", "B3"],
    }
)
right = pd.DataFrame(
    {
        "key": ["K0", "K1", "K2", "K3"],
        "C": ["C0", "C1", "C2", "C3"],
        "D": ["D0", "D1", "D2", "D3"],
    }
)


In [None]:
left

  key   A   B
0  K0  A0  B0
1  K1  A1  B1
2  K2  A2  B2
3  K3  A3  B3


In [None]:
right

  key   C   D
0  K0  C0  D0
1  K1  C1  D1
2  K2  C2  D2
3  K3  C3  D3


In [None]:
# merge by "key"
pd.merge(left, right, on="key")

  key   A   B   C   D
0  K0  A0  B0  C0  D0
1  K1  A1  B1  C1  D1
2  K2  A2  B2  C2  D2
3  K3  A3  B3  C3  D3
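
In the example above every key appears in both tables. When that is not the case, the `how` argument of `pd.merge` controls which rows are kept. The sketch below uses two smaller, made-up frames whose keys only partly overlap.

```python
import pandas as pd

# hypothetical frames whose keys only partly overlap
left = pd.DataFrame({"key": ["K0", "K1", "K2"], "A": ["A0", "A1", "A2"]})
right = pd.DataFrame({"key": ["K1", "K2", "K3"], "C": ["C1", "C2", "C3"]})

# default how="inner": keep only the keys found in both frames
inner = pd.merge(left, right, on="key")
print(inner)

# how="outer": keep all keys; indicator=True adds a _merge column showing each row's origin
outer = pd.merge(left, right, on="key", how="outer", indicator=True)
print(outer)
```

Rows without a match on the other side are filled with `NaN` in the outer merge.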


## 5. Saving Python Objects Using Pickle

[Pickle](https://docs.python.org/3/library/pickle.html) is a common and convenient way to save most Python objects to a file. It preserves the data structure between the time you save (dump) an object and the time you load the file again. Saving a data frame in Excel format makes the data easy to view, but the next time you open the file, pandas has to infer the data type of each column and may guess wrong for some of them. With pickle, you avoid this: the object you load will be identical to the one you saved, and all data types will be preserved.

In [None]:
import pickle  # import the package

# define a dictionary which you want to save to a file
dic = {"a": [1, 9, 10], "b": "key", "c": {"d": 5, "e": ["real", "estate"]}}

# check the dictionary (a is a list, b is a string, and c is a dictionary itself with keys d and e)
dic

{'a': [1, 9, 10], 'b': 'key', 'c': {'d': 5, 'e': ['real', 'estate']}}

In [None]:
dic["c"]

{'d': 5, 'e': ['real', 'estate']}

In [None]:
# save the dictionary dic to a pickle file
with open("save_temp.pkl", "wb") as f:  # "wb" means write in binary mode
    pickle.dump(dic, f)  # dump to file

# load the dictionary saved in this pickle file
with open("save_temp.pkl", "rb") as f:  # "rb" means read in binary mode
    dic_loaded = pickle.load(f)

# check it again
dic_loaded

{'a': [1, 9, 10], 'b': 'key', 'c': {'d': 5, 'e': ['real', 'estate']}}

For a `pandas.DataFrame` object, pandas provides native methods to save it to a pickle file and load it afterwards.

In [None]:
# define a DataFrame
df = pd.DataFrame()
df["a"] = [1, 2, 3]
df["b"] = ["real", "estate", "analytics"]

# check the dataframe
df

   a          b
0  1       real
1  2     estate
2  3  analytics


In [None]:
# save it to pickle file
df.to_pickle("save_df.pkl")  # no need to specify "wb"

# load it to a python object
df_loaded = pd.read_pickle("save_df.pkl")  # also no need to specify "rb"

# check the loaded dataframe
df_loaded

   a          b
0  1       real
1  2     estate
2  3  analytics
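
The point about data types made at the start of this section can be illustrated with a small round trip. The sketch below rests on two assumptions: the ZIP-code data is made up, and CSV is used as a plain-text stand-in for a spreadsheet so that no Excel engine is needed (an actual Excel file may behave differently for this particular column). The pickle round trip is done in memory with `pickle.dumps` and `pickle.loads`.

```python
import io
import pickle

import pandas as pd

# hypothetical data: ZIP codes stored as strings so the leading zero is kept
df_zip = pd.DataFrame({"zip": ["02139", "10001"], "price": [1.5, 2.0]})

# round trip through a text format (CSV here, standing in for a spreadsheet)
csv_text = df_zip.to_csv(index=False)
from_csv = pd.read_csv(io.StringIO(csv_text))

# round trip through pickle, done in memory with dumps/loads
from_pickle = pickle.loads(pickle.dumps(df_zip))

print(from_csv["zip"].tolist())     # [2139, 10001]: pandas guessed integers
print(from_pickle["zip"].tolist())  # ['02139', '10001']: strings preserved
print(from_pickle.equals(df_zip))   # True
```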
