<a href="https://colab.research.google.com/github/kerryback/2022-BUSI520/blob/main/Intro.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

### Code cells and markdown cells in Jupyter notebooks

* Execute either with the run icon or Shift-Enter
* Can use LaTeX in markdown cells
* Can make bulleted lists with *, large font with # or ## or ###, and much more
* Double-click a markdown cell to put it in edit mode and press Shift-Enter to put it back in viewing mode
* In a code cell, everything on a line after a # is ignored by the python interpreter
* Use # to add comments to a file for yourself or others to read
* Can also have multi-line comments using triple-quoted strings (later)
* If the last line in a code cell is an expression that is not assigned to a variable, its value will be displayed below the cell.

### Jupyter notebooks vs .py scripts

You can run python from a Jupyter notebook or by executing a script.  A script is just a text file with a sequence of python commands (a program).  The conventional file extension for a python script is .py.  

Notebooks are interactive and are better for exploring data and developing code.  Notebooks can be confusing, because you can skip around and execute cells in a different order than they appear in the file.  When your code is complete (until you have to revise your paper), it can be better to put it in script form so it is clearer when you come back to it.  If I have something that takes hours or days to execute, I usually write it as a script and run it from a terminal window on the JGSB server.

### IDEs

In addition to notebook vs .py, you have a choice of IDE (Integrated Development Environment).  It is possible to write python code in a text editor like Notepad and run scripts in a terminal window, but there are better choices.  Possibilities include

* Jupyter Notebook or JupyterLab installed by Anaconda
* stand-alone JupyterLab
* PyCharm
* VS Code (Visual Studio Code)
* Spyder
* JupyterHub on the JGSB server
* Cloud servers including Colab and Paperspace Gradient

PyCharm and VS Code can run either notebooks or scripts.  They have the best code completion and syntax highlighting.  The extra versatility means the learning curve for them is a little steeper.

If you ever want to do deep learning (neural networks) you will probably want to use a cloud server that provides GPUs.

### Modules

Python consists of an interpreter and modules (or libraries or packages).  If you install from python.org, you get the interpreter and core modules.  If you install from Anaconda, you get the interpreter, core modules, and standard scientific modules.

If you install from Anaconda, you by default work in a "conda environment."  It is best to use the Anaconda Navigator or "conda install" in a terminal window (in the conda environment) to install new modules when possible.  Not all modules have been catalogued by conda.  For those, you need to use "pip install" in a terminal window (in the conda environment).

New modules are created daily and published to PyPI for downloading with pip.  It is actually quite easy to create and publish your own modules.  There is no quality control, but there are millions of users of standard packages, and bugs are quickly reported.  When you get stuck, search Google and especially Stack Overflow.

We can issue operating system commands from inside a Jupyter notebook by prefacing them with !.  So, we can bypass using the terminal.  The following will install pandas-datareader.  It is already installed on Colab, and with Anaconda you can use conda install instead.

In [1]:
!pip install pandas-datareader

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


### Start by importing modules or things from modules

For clarity, it is a good idea to import everything you need at the top of a notebook or script.  

* "import ..." or "import ... as" provides access to everything in the module by prefixing with the module name or alias (e.g., np.whatever)
* "from ... import ..." or "from ... import ... as ..." provides direct access to whatever is imported using its name without a prefix
* The reason for the prefix style is to avoid having duplicate names in the workspace (only one would work).
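
The two import styles and the name-clash problem can be sketched with the standard math module. The alias m and the replacement sqrt function are made up for illustration only.

```python
import math as m        # prefix style: everything is reached through the alias m
from math import sqrt   # direct style: sqrt becomes a name in the workspace

root_prefixed = m.sqrt(16.0)
root_direct = sqrt(16.0)

# Defining another object named sqrt replaces the imported one (only one works).
def sqrt(x):
    return "not a square root"

shadowed = sqrt(16.0)       # calls the new definition
still_math = m.sqrt(16.0)   # the prefixed name is unaffected

print(root_prefixed, root_direct, shadowed, still_math)
```

After the redefinition, only the prefixed call still reaches the math module, which is the reason for preferring prefixes when names might collide.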

In [2]:
import numpy as np
import pandas as pd
from pandas_datareader import DataReader as pdr
from pandas_datareader.famafrench import get_available_datasets as gad 
from seaborn import load_dataset
from math import sqrt, exp, log
from scipy.stats import norm

### Assignment statements

* a=b means "evaluate b and assign the result to a variable named a"
* spaces around = are optional, and are optional in most other places as well (with one important exception we will discuss)
* can use any name not reserved by python for a variable, must start with letter or underscore, cannot have spaces or some other characters
* can have multiple assignments on a single line


In [3]:
x = 3
x = x + 1
x

4

In [4]:
x, y, z = 3, 4, 5
z

5

### Basic object types

In [5]:
# type(3)
# type(2.718)
# type('some text')
# type(['a', 'b', 'c'])
# type(('a', 'b', 'c'))
# type({'a': 1, 'b': 2})
# type(sqrt)
# type(True)
# type(False)
# type(None)

### Some basic functions

In [6]:
# round(3.14, 1)
# int(3.14)
# int('3')
# str(3)

### Print statements


In [7]:
x = 3.14
y = 2.718
print(x)
print(x, y)
print("x is", x, "and y is", y)


3.14
3.14 2.718
x is 3.14 and y is 2.718


### Working with lists

* Counting starts from zero.  
* Ranges int1:int2 start with int1 and go to but not including int2.
* Consequently, the subset has int2 - int1 elements
* Last item can be accessed as -1, next-to-last as -2, etc.
* Can use range int1:int2:int3 to go from int1 to int2 stepping by int3
* In range int1:int2:int3, int3 can be negative and int1>int2
* Concatenate lists with +



In [8]:
x = ['a', 'b', 'c', 'd']
y = ['m', 'n', 'o', 'p', 'q']

# len(x)
# x[0]
# x[1]
# x[:2]
# x[0:2]
# x[1:4]
# x[1:4:2]
# x[4:1:-1]
# x[-1]
# x[:-1]
# x[-2]
# x[-3:]
# x[-3:-1]
# x[1]='w'
# x[1:3] = ['y','z']
# x.append('u')
# indx = x.index('a')
# x.remove('a')
# x.insert(indx, 's')
# x.reverse()
# x + y
# 3 * x
# 7 * [0]
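
As a quick check of the slicing rules above, here is a small sketch with the results written out. The variable names on the left are arbitrary.

```python
x = ['a', 'b', 'c', 'd']

first_two = x[0:2]       # positions 0 and 1, so 2 - 0 = 2 elements
every_other = x[1:4:2]   # positions 1 and 3
last = x[-1]             # last item
backwards = x[3:0:-1]    # positions 3, 2, 1 (stops before position 0)
joined = x + ['e']       # + concatenates lists
repeated = 2 * x         # * repeats a list

print(first_two, every_other, last, backwards, joined, repeated)
```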

### Working with strings

* Can use either single or double quotes
* Concatenate with +
* Strings are sequences of characters, and many list operations (indexing, slicing, +, *) work on strings.  Unlike lists, strings cannot be modified in place.

In [9]:
string1 = 'This is some text'
string2 = "This is some different text"

# string1[:4]
# string1[-4:]
# string1 + ". " + string2 + "."
# string1.split(" ")
# string1.split(" ")[0] + " " + string1.split(" ")[3]
# string1.title()
# string1.upper()



### Logical conditions

A single = is an assignment, so to test for equality we use ==

In [10]:
# 3 == 3
# 3 == 5
# 3 != 5
# not(3==5)
# 3 > 5
# (2<4) and (3>5)
# (2<4) & (3>5)
# (2<4) or (3>5)
# (2<4) | (3>5)
# 1 in [3, 4]
# 1 not in [3, 4]
# 1 * (3>5)
# 1 * (2<4)
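
The cell above uses both and/or and &/|. They agree on parenthesized True/False values, but they are not the same operators. The sketch below (the numbers are arbitrary) shows why the parentheses matter with & and how the two differ on integers.

```python
# With parentheses, & combines the two comparison results.
with_parens = (5 > 3) & (4 > 1)

# Without parentheses, & binds tighter than the comparisons, so this is
# parsed as 5 > (3 & 4) > 1, that is, 5 > 0 > 1.
without_parens = 5 > 3 & 4 > 1

# "and" returns one of its operands; & works bit by bit on integers.
and_ints = 6 and 3   # 6 is truthy, so the result is the second operand
amp_ints = 6 & 3     # binary 110 & 011 = 010

print(with_parens, without_parens, and_ints, amp_ints)
```

Element-wise conditions on numpy arrays and pandas columns use & and |, so wrapping each comparison in parentheses is a good habit.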


### Ternary operator




In [11]:
# "yes" if 3>5 else "no"
# 10 if 3>5 else 100
# 10 if (2<4) or (3>5) else 100
# 10 if 0 else 100
# 10 if None else 100
# 10 if 16 else 100

### List comprehensions

In [12]:
letters = ['a', 'b', 'c', 'd']
# [x + '1' for x in letters]
# [x + '1' for x in letters if not x=='c']


### Range objects

In [13]:
# [i for i in range(6)]
# [i for i in range(1,7)]
# [i for i in range(2,8,2)]
# [i for i in range(6,2,-1)]

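### Assign by reference or by value

"By reference" means the memory location is assigned to a variable.  Multiple variables can be assigned to the same location, and changes made to the object through any of them are visible through all of them.  Integers and floats behave as if they are assigned by value: they are immutable, so an operation like x = x + 1 creates a new object and leaves other variables untouched.  Lists and other mutable objects, including numpy arrays, are assigned by reference.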

Lists have a copy method that creates a copy at a new memory location.

In [14]:
x = 3
y = x
x = x + 1
y

3

In [15]:
x = ['a', 'b', 'c']
y = x
x.append('d')
y

['a', 'b', 'c', 'd']

In [16]:
x = ['a', 'b', 'c']
y = x.copy()
x.append('d')
y

['a', 'b', 'c']
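
One caveat about the copy method is that it copies only the outer list. The sketch below uses a nested list to show the difference; copy.deepcopy from the standard library goes one step beyond what the cells above cover.

```python
import copy

x = [['a', 'b'], ['c']]
alias = x                 # same object as x
shallow = x.copy()        # new outer list, but the inner lists are shared
deep = copy.deepcopy(x)   # new outer list and new inner lists

x.append(['d'])           # change to the outer list: seen through alias only
x[0].append('z')          # change to an inner list: seen by alias and shallow

print(alias is x, len(shallow), shallow[0], deep[0])
```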

### Numpy

Numpy is a wrapper to code written in C or a variant that performs fast mathematical operations.  Numpy functions operate element-wise on lists and return numpy arrays.

In [17]:
x = [1, 2, 3]
# np.sqrt(x)
# np.log(x)
# np.exp(x)

### Numpy arrays

* For lists, + concatenates and * duplicates.  For numpy arrays, + adds and * multiplies element-wise.
* Exponentiation in python is **.  It operates on numpy arrays element-wise.

In [18]:
x = np.array([1, 2, 3])
y = np.array([4, 5, 6])
# x + y
# x * y
# x**2
# 3**x
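
To make the contrast between lists and numpy arrays concrete, here is a minimal side-by-side sketch. The variable names are arbitrary.

```python
import numpy as np

lst = [1, 2, 3]
arr = np.array(lst)

list_plus = lst + lst    # concatenation
arr_plus = arr + arr     # element-wise addition
list_times = 2 * lst     # duplication
arr_times = 2 * arr      # element-wise multiplication
arr_power = arr ** 2     # element-wise exponentiation
roots = np.sqrt(lst)     # numpy functions accept lists and return arrays

print(list_plus, arr_plus, list_times, arr_times, arr_power, type(roots))
```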

### Higher-dimensional arrays



In [19]:
x = np.array([[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12]])
x

array([[ 1,  2,  3,  4],
       [ 5,  6,  7,  8],
       [ 9, 10, 11, 12]])

The following illustrates a formatting convention that can make code more readable.  It is useful in many places and is not specific to numpy arrays.  Arbitrary indentation and line breaks are allowed inside parentheses, brackets, or braces.

In [20]:
x = np.array(
    [ 
        [1, 2, 3, 4],
        [5, 6, 7, 8],
        [9, 10, 11, 12]
    ]
)
x

array([[ 1,  2,  3,  4],
       [ 5,  6,  7,  8],
       [ 9, 10, 11, 12]])

### Shape and reshape

In [21]:
x.shape

(3, 4)

In [22]:
x = np.array([i for i in range(1,13)]).reshape(3,4)
x

array([[ 1,  2,  3,  4],
       [ 5,  6,  7,  8],
       [ 9, 10, 11, 12]])

### Some standard arrays

In [23]:
# np.ones(4)
# np.zeros(3)
# np.identity(3)

### Matrix algebra

In [24]:
x = np.array([i for i in range(1,13)]).reshape(3, 4)
y = np.array([1, 2, 5, 10])
x @ y

array([ 60, 132, 204])

In [25]:
x.T

array([[ 1,  5,  9],
       [ 2,  6, 10],
       [ 3,  7, 11],
       [ 4,  8, 12]])

In [26]:
x = norm.rvs(loc=0, scale=1, size=9).reshape(3, 3)
np.linalg.inv(x)

array([[-0.06797128, -1.5429288 , -4.08519515],
       [ 1.0147743 , -0.74504471, -1.93364472],
       [ 2.42699669, -1.95814352, -9.41291272]])

In [27]:
np.linalg.inv(x) @ x

array([[ 1.00000000e+00,  7.85511212e-16, -2.60873972e-16],
       [-1.30515574e-16,  1.00000000e+00,  5.63686522e-17],
       [-3.53332462e-17, -1.29789806e-16,  1.00000000e+00]])

### Concatenating numpy arrays

In [28]:
x = np.array([1, 2, 3])
y = np.array([4, 5, 6, 7])
np.concatenate((x, y))

array([1, 2, 3, 4, 5, 6, 7])

### Defining functions

Why functions?  Modularized code is easier to test, maintain, and reuse.

* The def keyword starts the function definition.
* Arguments are enclosed in parentheses and followed by a colon.
* The return keyword indicates the value returned by the function.
* Functions can return numbers, lists, strings, ...
* Indentation is crucial.  All lines within the function definition must be indented the same number of spaces, unless there is a reason the line must be indented further (more later) or it is within parentheses or braces.  A good IDE will prompt you to indent and tell you when you have indentation wrong.


In [29]:
def double(x):
    return 2*x

double(3)

6

### Passing arguments by name

In [30]:
def exponentiate(base, exponent):
    return base ** exponent

# exponentiate(2, 5)
# exponentiate(base=2, exponent=5)
# exponentiate(exponent=5, base=2)
# exponentiate(5, 2)

### Returning tuples


In [31]:
def f(x):
    return 2*x, 3*x 

a, b = f(2)
b

6

### Defining classes

Why classes?  To store data so we don't have to input it repeatedly into functions.

* Main takeaways: objects have attributes and methods
* Attributes are data (in a general sense - not necessarily numbers) that are stored in the object
* Methods are functions that operate on the data and possibly on other arguments
* A class definition is initiated with the class keyword
* The __init__ method is called when an instance of the class is created.  It usually defines the attributes.
* Note that the lines following the class keyword must be indented, and method definitions must be further indented.
* In general, indentation is sequential.  Each class / function / for or while block / if-else block must be further indented.


In [32]:
class multiplier():
    def __init__(self,x) :
        self.factor = x
    def multiply(self, y) :
        return y * self.factor

x = multiplier(3)
# x.multiply(4)
# x.factor

### Sets

Sets are unordered collections with no repeated items.

In [33]:
x = set([1, 1, 2, 3])
x

{1, 2, 3}

In [34]:
x.issubset([1, 2, 2, 3, 4, 4])

True

### Dictionaries

* Dictionaries are collections of key/value pairs (since python 3.7 they keep the order in which keys were inserted)
* Sometimes called look-up tables or hash tables
* By analogy with ordinary dictionaries, key $\sim$ word and value $\sim$ definition.
* Created with the dict function or by enclosing key/value pairs in {}
* Values can be any type of object; keys must be hashable (e.g., strings, numbers, tuples)

In [35]:
x = {'a': 1, 'b': 2}
x['a']

1

In [36]:
x = dict(a=1, b=2)
# x['a']
# list(x.keys())
# list(x.values())
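
A few more everyday dictionary operations, as a small sketch. The keys and values are arbitrary.

```python
x = {'a': 1, 'b': 2}

x['c'] = 3                 # add a new key/value pair
x['a'] = 10                # overwrite the value for an existing key
has_b = 'b' in x           # membership tests look at the keys
missing = x.get('z', 0)    # get returns a default instead of raising KeyError

# items() yields (key, value) pairs in insertion order
pairs = [(k, v) for k, v in x.items()]

print(has_b, missing, pairs)
```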

### Pandas

* Pandas was created by Wes McKinney while he was at AQR and is now maintained by a community of open-source developers.
* .loc extracts rows using row labels (index).  In this case, the labels are just 0, 1, 2, ...
* .iloc extracts rows using the index location.  It works like extracting items from a list, starting at 0.
* Because the row labels here are just 0, 1, 2, ... .loc and .iloc work almost the same.  The only difference is that .iloc works like extracting items from a list, so it goes up to but not including the last number.  .loc uses the row labels including the last label.

In [37]:
tips = load_dataset("tips")
# tips.info()
# tips.describe()
# tips.dtypes
# tips.head()
# tips.tail()
# tips.columns
# tips.index
# tips.day.unique()
# tips.loc[0]
# tips.loc[3:6]
# tips.loc[-1]
# tips.loc[-4:]
# tips.iloc[0]
# tips.iloc[3:6]
# tips.iloc[3:10:2]
# tips.iloc[-1]
# tips.iloc[-4:]
# tips['tip']
# tips.tip
# tips[['total_bill', 'tip']]
# tips[['total_bill', 'tip']].loc[3:7]
# tips.loc[3:7][['total_bill', 'tip']]
# tips.to_dict()
# tips.to_dict('records')
# tips[['total_bill', 'tip']].to_numpy()
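
The difference between .loc and .iloc is easiest to see on a tiny dataframe. The sketch below uses a small hand-typed frame rather than the tips data so that it runs without a download. It also shows that once the rows are sorted, labels and positions no longer coincide.

```python
import pandas as pd

df = pd.DataFrame({
    'total_bill': [16.99, 10.34, 21.01, 23.68, 24.59],
    'tip': [1.01, 1.66, 3.50, 3.31, 3.61],
})

by_label = df.loc[1:3]       # labels 1, 2, 3 (the last label is included)
by_position = df.iloc[1:3]   # positions 1, 2 (the last position is excluded)

# After sorting, the row labels travel with the rows.
sorted_df = df.sort_values(by='tip', ascending=False)
top_by_position = sorted_df.iloc[0]   # first row in the sorted order
row_label_zero = sorted_df.loc[0]     # the row whose label is 0

print(len(by_label), len(by_position))
print(top_by_position['tip'], row_label_zero['tip'])
```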

In [38]:
tips2 = tips.copy()
tips2.columns = ['new_' + c for c in tips2.columns]
tips2.head(3)

Unnamed: 0,new_total_bill,new_tip,new_sex,new_smoker,new_day,new_time,new_size
0,16.99,1.01,Female,No,Sun,Dinner,2
1,10.34,1.66,Male,No,Sun,Dinner,3
2,21.01,3.5,Male,No,Sun,Dinner,3


In [39]:
tips2 = tips2.rename(columns={'new_smoker': 'new_new_smoker'})
tips2.head(3)

Unnamed: 0,new_total_bill,new_tip,new_sex,new_new_smoker,new_day,new_time,new_size
0,16.99,1.01,Female,No,Sun,Dinner,2
1,10.34,1.66,Male,No,Sun,Dinner,3
2,21.01,3.5,Male,No,Sun,Dinner,3


### Sorting

In [40]:
tips = tips.sort_values(by=['sex', 'total_bill'])
tips.head()

Unnamed: 0,total_bill,tip,sex,smoker,day,time,size
172,7.25,5.15,Male,Yes,Sun,Dinner,2
149,7.51,2.0,Male,No,Thur,Lunch,2
195,7.56,1.44,Male,No,Thur,Lunch,2
218,7.74,1.44,Male,Yes,Sat,Dinner,2
126,8.52,1.48,Male,No,Thur,Lunch,2


### Inserting columns

We can add new rows and columns.  More often, we want to add new columns.  Operations on columns are element-wise as with numpy.

In [41]:
tips['pct'] = tips.tip / tips.total_bill
tips['day_type'] = tips.day.map(lambda x: 'Weekend' if x in ['Sat', 'Sun'] else 'Weekday')
tips.head(3)

Unnamed: 0,total_bill,tip,sex,smoker,day,time,size,pct,day_type
172,7.25,5.15,Male,Yes,Sun,Dinner,2,0.710345,Weekend
149,7.51,2.0,Male,No,Thur,Lunch,2,0.266312,Weekday
195,7.56,1.44,Male,No,Thur,Lunch,2,0.190476,Weekday


### Filtering

In [42]:
# tips[tips.sex=="Male"].head()
# tips[(tips.sex=="Male") & (tips.pct>0.2)].head()
# tips[(tips.sex=="Male") | (tips.pct>0.2)].head()


### Aggregating

There are built-in methods, and you can create your own with apply.  Often it is more convenient to define a short function inline with the "lambda" keyword instead of "def ...", but "def" is useful for longer functions.

In [43]:
tips2 = tips[['total_bill', 'tip']]

# tips2.sum()
# tips[['total_bill', 'tip']].sum()
# tips.total_bill.sum()
# tips['total_bill'].sum()
# tips2.mean()
# tips2.std()
# tips2.corr()
# tips2.cov()
# tips2.median()
# tips2.quantile([0.25, 0.5, 0.75])
# tips2.sum(axis=1)
tips2.apply(lambda x: (x**2).sum())
# def sumsquares(x):
#     return (x**2).sum()
# tips2.apply(sumsquares)


total_bill    114780.4443
tip             2658.6932
dtype: float64

### Aggregating by groups

In empirical asset pricing, we are constantly either grouping by stock and doing something to each time series of stock data, or we are grouping by date and doing something to each cross-section.  

In [44]:
tips.head()

Unnamed: 0,total_bill,tip,sex,smoker,day,time,size,pct,day_type
172,7.25,5.15,Male,Yes,Sun,Dinner,2,0.710345,Weekend
149,7.51,2.0,Male,No,Thur,Lunch,2,0.266312,Weekday
195,7.56,1.44,Male,No,Thur,Lunch,2,0.190476,Weekday
218,7.74,1.44,Male,Yes,Sat,Dinner,2,0.186047,Weekend
126,8.52,1.48,Male,No,Thur,Lunch,2,0.173709,Weekday


In [45]:
# tips.groupby('sex').total_bill.mean()
# tips.groupby(['sex', 'time']).total_bill.mean()
# tips.groupby(['sex']).total_bill.apply(lambda x: (x**2).sum())

### Transform

When we aggregate, we get a lower-dimensional object - for example, just one number for each column.  With transform, we get an object of the same dimension we started with, which is useful when we want to paste it into the original object.  This new object will repeat the aggregate in order to be of the same dimension as the original.  This is useful, for example, if we want to include a group mean as a characteristic in a model.

In [46]:
tips['total_by_sex'] = tips.groupby('sex').total_bill.transform(lambda x: x.mean())
tips.head()

Unnamed: 0,total_bill,tip,sex,smoker,day,time,size,pct,day_type,total_by_sex
172,7.25,5.15,Male,Yes,Sun,Dinner,2,0.710345,Weekend,20.744076
149,7.51,2.0,Male,No,Thur,Lunch,2,0.266312,Weekday,20.744076
195,7.56,1.44,Male,No,Thur,Lunch,2,0.190476,Weekday,20.744076
218,7.74,1.44,Male,Yes,Sat,Dinner,2,0.186047,Weekend,20.744076
126,8.52,1.48,Male,No,Thur,Lunch,2,0.173709,Weekday,20.744076


### Demeaning by group

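If we want to demean by group, we can use apply instead of transform, because demeaning is not an aggregation: the function returns one value per row rather than one value per group.  In recent pandas versions, apply may add the group keys to the index of the result; if the assignment below fails for that reason, pass group_keys=False to groupby or use transform, which also accepts functions that return one value per row.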

In [47]:
tips['total_dev_from_mean_by_sex'] = tips.groupby('sex').total_bill.apply(lambda x: x - x.mean())
tips.head()

Unnamed: 0,total_bill,tip,sex,smoker,day,time,size,pct,day_type,total_by_sex,total_dev_from_mean_by_sex
172,7.25,5.15,Male,Yes,Sun,Dinner,2,0.710345,Weekend,20.744076,-13.494076
149,7.51,2.0,Male,No,Thur,Lunch,2,0.266312,Weekday,20.744076,-13.234076
195,7.56,1.44,Male,No,Thur,Lunch,2,0.190476,Weekday,20.744076,-13.184076
218,7.74,1.44,Male,Yes,Sat,Dinner,2,0.186047,Weekend,20.744076,-13.004076
126,8.52,1.48,Male,No,Thur,Lunch,2,0.173709,Weekday,20.744076,-12.224076
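
The difference in shape between aggregating and transforming can be checked on a tiny example. This is a sketch on a made-up four-row frame, not the tips data, and it uses transform for the demeaning step as well.

```python
import pandas as pd

df = pd.DataFrame({
    'sex': ['Male', 'Female', 'Male', 'Female'],
    'total_bill': [10.0, 20.0, 30.0, 40.0],
})

# Aggregation: one number per group.
agg = df.groupby('sex').total_bill.mean()

# Transform: one number per original row, aligned with the original index.
df['group_mean'] = df.groupby('sex').total_bill.transform('mean')
df['dev'] = df.groupby('sex').total_bill.transform(lambda x: x - x.mean())

print(agg)
print(df)
```

Because the transformed series shares the index of df, it can be assigned directly as a new column.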


### Index and reset index

In [48]:
avgs = tips.groupby(['sex', 'time']).total_bill.mean()
# avgs
# avgs.loc[('Male', 'Dinner')]
# avgs.index
# avgs.reset_index()
# avgs.reset_index().set_index(['sex', 'time'])

### Wide versus long form

In [49]:
# avgs.unstack()
# avgs.unstack().stack()

### Dataframes versus series

* Pandas provides two classes: DataFrame and Series.  
* A series is like a one column dataframe but not quite.  There are occasionally dataframe methods that are not available for a series.  
* We can convert a series to a dataframe with pd.DataFrame.

In [50]:
# type(tips)
# type(tips.tip)
# type(avgs)
# type(avgs.unstack())
# type(pd.DataFrame(tips.tip))

### Pandas data reader

In [51]:
treasury10 = pdr('DGS10', 'fred', start=1980)
treasury30 = pdr('DGS30', 'fred', start=1980)
both = pdr(['DGS10', 'DGS30'], 'fred', start=1980)

# treasury10.head()
# treasury30.head()
# both.head()
# treasury10.index
# type(treasury10)
# np.log(1+both/100)

### Merging dataframes

* Merge, join, and concatenate are different methods for combining dataframes.  
* Merge provides the finest control.

In [52]:
both1 = treasury10.merge(treasury30, left_index=True, right_index=True)
both2 = treasury10.merge(treasury30, on='DATE')
both3 = treasury10.join(treasury30)
both4 = pd.concat((treasury10, treasury30), axis=1)
[both.equals(b) for b in [both1, both2, both3, both4]]


[True, True, True, True]
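
The four methods agree above because the two series share the same dates. When the indexes differ, the defaults differ too. The sketch below uses two made-up frames (t10 and t30, with invented values) to show this; how is the merge parameter that selects which rows are kept.

```python
import pandas as pd

idx10 = pd.to_datetime(['2020-01-01', '2020-01-02', '2020-01-03'])
idx30 = pd.to_datetime(['2020-01-02', '2020-01-03', '2020-01-06'])
t10 = pd.DataFrame({'DGS10': [1.0, 2.0, 3.0]}, index=idx10)
t30 = pd.DataFrame({'DGS30': [4.0, 5.0, 6.0]}, index=idx30)
t10.index.name = 'DATE'
t30.index.name = 'DATE'

# merge keeps only dates found in both frames unless told otherwise
inner = t10.merge(t30, left_index=True, right_index=True)
outer = t10.merge(t30, left_index=True, right_index=True, how='outer')

# join keeps every date of the left frame by default
left = t10.join(t30)

# concat along axis=1 keeps the union of dates by default
union = pd.concat((t10, t30), axis=1)

print(len(inner), len(outer), len(left), len(union))
```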

### Missing data

* Missing values are recorded as NaN (not a number).  
* We can drop them or fill them.  
* We can fill with a specific value or fill from the previous entry or the next entry.

In [53]:
# both.isna()
# both.dropna()
# both.dropna(subset=['DGS10'])
# both.fillna(0)
# both.bfill()
# both.ffill()
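
The drop and fill options are easiest to compare on a short series. This is a sketch with made-up values.

```python
import numpy as np
import pandas as pd

s = pd.Series([np.nan, 1.0, np.nan, np.nan, 4.0])

dropped = s.dropna()    # remove missing entries
zero = s.fillna(0)      # fill with a specific value
forward = s.ffill()     # fill from the previous valid entry
backward = s.bfill()    # fill from the next valid entry

print(dropped.tolist(), zero.tolist(), forward.tolist(), backward.tolist())
```

Note that forward filling leaves the first entry missing, because there is no earlier value to copy.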

### Working with time series



In [54]:
# both.shift().head()
# both.shift(2).head()
# both.shift(-1).head()
# both.diff().head()
# both.diff(2).head()
# both.pct_change()
# both.rolling(5).mean()
# both.rolling(5).std()
# both.resample('M').last()
# both.resample('MS').first()
# both.resample('M').mean()

The datetime format is a standard format.  When pdr gets daily data, it returns dates in the datetime format.  The datetime module contains functions for working with datetime objects.  Pandas implements some of them.  strftime will format datetime objects in many different ways.  Its inverse is strptime, which converts dates in different string formats into datetime objects.

In [55]:
# both.index.dtype
# [x.year for x in both.index]
# both.index.map(lambda x: x.month)
# both.index.astype(str)
# both.index.strftime("%b %d, %Y")
# both.resample('M').last().index.to_period('M')
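
The round trip between datetime objects and strings can be sketched with the standard datetime module alone. The date and the second string format are arbitrary choices, and the month abbreviation produced by %b depends on the locale.

```python
from datetime import datetime

d = datetime(1980, 1, 2)

# strftime: datetime object -> formatted string
as_text = d.strftime("%b %d, %Y")

# strptime: string in a known format -> datetime object
back = datetime.strptime(as_text, "%b %d, %Y")
other = datetime.strptime("02/01/1980", "%d/%m/%Y")   # day/month/year

print(as_text, back, other)
```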

### Upsampling

We can map a time series into a higher frequency index with reindex.  This creates NaN's for the new dates.  We might want to fill those NaN's by inserting the most recent valid value with ffill (forward filling the time series).  Downsampling is done with resample as shown above.

In [56]:
monthly = both.resample('MS').first()
min_date = monthly.index.min()
max_date = monthly.index.max()
new_index = pd.date_range(start=min_date, end=max_date, freq="D")
monthly = monthly.reindex(new_index).ffill()
monthly.head()

Unnamed: 0,DGS10,DGS30
1980-01-01,10.5,10.23
1980-01-02,10.5,10.23
1980-01-03,10.5,10.23
1980-01-04,10.5,10.23
1980-01-05,10.5,10.23


### Saving and reading dataframes

* To save or read in Colab, you need to "mount" your Google Drive.  Click on the file icon in the left toolbar and then click the Google Drive icon.
* pandas has read_csv, read_stata, read_sas, and read_excel functions.

In [57]:
both.to_csv('filename.csv')
newboth = pd.read_csv('filename.csv', parse_dates=['DATE'])

### Loops

* A loop is a block of code that is executed repeatedly, for a given number of times (for loop) or until some condition is met (while loop).
* It is not common to need a while loop.
* On the other hand, zip and enumerate are often useful (especially zip) in for loops.
* Indentation is again crucial.  

In [58]:
for i in range(5):
    print(i)

0
1
2
3
4


In [59]:
for ltr in ['a', 'b', 'c']:
    print(ltr)

a
b
c


In [60]:
both5 = None
for data in [treasury10, treasury30]:
    both5 = pd.concat((both5, data), axis=1)
both5.head()

DATE,DGS10,DGS30
1980-01-01,,
1980-01-02,10.5,10.23
1980-01-03,10.6,10.31
1980-01-04,10.66,10.34
1980-01-07,10.63,10.35


In [61]:
lst1 = ['a', 'b', 'c']
lst2 = ['1', '2', '3']
for a, b in zip(lst1, lst2):
    print(a+b)

a1
b2
c3


In [62]:
for i, ltr in enumerate(lst1):
    print(ltr+lst2[i])

a1
b2
c3


In [63]:
i = 0
while i<3:
    print(i)
    i = i + 1

0
1
2


### Conditional execution

* An indented block following an if statement is executed only if the condition evaluates to True.
* Often but not always there is an else with another indented block following the if block.
* There can also be one or more elif (else if) blocks based on additional conditions.

In [64]:
def f(number):
    if number < 10:
        return 'small' 
    elif number < 100:
        return 'medium' 
    else:
        return 'large' 

f(20)

'medium'

In [65]:
def g(number):
    return 'small' if number<10 else ('medium' if number<100 else 'large')

g(20)

'medium'

### Working with dates

* The following use the pandas period M and period Y formats.  
* The datetime module has more functions for working with dates (and times of course).
* Google strftime for more ways to format dates as strings.
* More often, we need to convert dates in various formats to a standard format using strptime.

In [83]:
monthly = both.resample('M').last()
# monthly.index
# monthly.index = monthly.index.to_period('M')
# monthly.index
# monthly.index = monthly.index.astype(str)
# monthly.index

In [84]:
annual = both.resample('Y').last()
annual = annual.reset_index()
# annual.head()
# annual.DATE = annual.DATE.dt.to_period('Y')
# annual.dtypes
# annual.DATE = annual.DATE.astype(str).astype(int)
# annual.dtypes

In [85]:
both['Year'] = both.index.map(lambda x: x.year)
both['Month'] = both.index.map(lambda x: x.month)
both['date'] = both.index.strftime("%b %d, %Y")
both.head()

DATE,DGS10,DGS30,Year,Month,date
1980-01-01,,,1980,1,"Jan 01, 1980"
1980-01-02,10.5,10.23,1980,1,"Jan 02, 1980"
1980-01-03,10.6,10.31,1980,1,"Jan 03, 1980"
1980-01-04,10.66,10.34,1980,1,"Jan 04, 1980"
1980-01-07,10.63,10.35,1980,1,"Jan 07, 1980"

