<a href="https://colab.research.google.com/github/kerryback/2022-BUSI520/blob/main/Intro.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

### Code cells and markdown cells in Jupyter notebooks

* Execute either type of cell with the run icon or Shift-Enter
* Can use LaTeX in markdown cells
* Can make bulleted lists with *, large font with # or ## or ###, much more
* Double-click a markdown cell to put it in edit mode, and press Shift-Enter to return it to viewing mode
* In a code cell, everything on a line after a # is ignored by the Python interpreter
* Use # to add comments to a file for yourself or others to read
* Can also have multi-line comments using triple-quoted strings (later)
* If the last expression in a code cell is not assigned to a variable, its value will be printed below the cell.
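
A minimal sketch of the comment styles listed above.  The variable name `rate` is made up for illustration.

```python
# Everything on a line after a # is ignored by the interpreter.
rate = 0.05  # a comment can also follow code on the same line

"""
A triple-quoted string that is not assigned to anything
can serve as a multi-line comment.
"""

# In a notebook, the value of this last expression is displayed below the cell.
rate * 100
```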

### Jupyter notebooks vs .py scripts

You can run python from a Jupyter notebook or by executing a script.  A script is just a text file with a sequence of python commands (a program).  The conventional file extension for a python script is .py.  

Notebooks are interactive and are better for exploring data and developing code.  Notebooks can be confusing, because you can skip around and execute things in a different order than they appear in the file.  When your code is complete (until you have to revise your paper), it can be better to put it in script form so that it is clearer when you come back to it.  If I have something that takes hours or days to execute, I usually write it as a script and run it from a terminal window on the JGSB server.
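
As a rough sketch of what a script might look like, the following could be saved as a text file and run from a terminal with `python example_script.py`.  The file name `example_script.py` and the function name `main` are hypothetical.

```python
# example_script.py (hypothetical file name)
# Unlike a notebook, a script always runs from top to bottom.

def main():
    total = sum(range(1, 11))  # 1 + 2 + ... + 10
    print("The total is", total)
    return total

# This block runs when the file is executed as a script,
# but not when the file is imported as a module.
if __name__ == "__main__":
    main()
```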

### IDEs

In addition to notebook vs .py, you have a choice of IDE (Integrated Development Environment).  It is possible to write python code in a text editor like Notepad and run scripts in a terminal window, but there are better choices.  Possibilities include

* Jupyter Notebook or JupyterLab installed by Anaconda
* JupyterLab Desktop
* PyCharm
* VS Code (Visual Studio Code)
* Spyder
* JupyterHub on the JGSB server
* Cloud servers including Colab and Paperspace Gradient

PyCharm and VS Code can run either notebooks or scripts.  They have the best code completion and syntax highlighting.  The extra versatility means the learning curve for them is a little steeper.

If you ever want to do deep learning (neural networks) you will probably want to use a cloud server that provides GPUs.

### Modules

Python consists of an interpreter and modules (or libraries or packages).  If you install from python.org, you get the interpreter and core modules.  If you install from Anaconda, you get the interpreter, core modules, and standard scientific modules.

If you install from Anaconda, you by default work in a "conda environment."  It is best to use the Anaconda Navigator or "conda install" in a terminal window (in the conda environment) to install new modules when possible.  Not all modules have been catalogued by conda.  For those, you need to use "pip install" in a terminal window (in the conda environment).

New modules are created daily and published to PyPI for downloading with pip.  It is actually quite easy to create and publish your own modules.  There is no quality control, but there are millions of users of standard packages, and bugs are quickly reported.  If you run into problems, search Google and especially StackOverflow.

We can issue operating system commands from inside a Jupyter notebook by prefacing them with !, so we can bypass the terminal.  The following cell installs pandas-datareader.  It is already installed on Colab, and with Anaconda you can use conda install instead.

In [5]:
!pip install pandas-datareader

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


### Start by importing modules or things from modules

For clarity, it is a good idea to import everything you need at the top of a notebook or script.  

* "import ..." or "import ... as" provides access to everything in the module by prefixing with the module name or alias (e.g., np.whatever)
* "from ... import ..." or "from ... import ... as ..." provides direct access to whatever is imported using its name without a prefix
* The reason for the prefix style is to avoid having duplicate names in the workspace (only one would work).
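
The import styles above can be sketched with the standard-library math module, so nothing needs to be installed.  The aliases `m` and `root` are arbitrary names chosen for illustration.

```python
import math                      # prefix style: use math.sqrt
import math as m                 # prefix style with an alias: use m.sqrt
from math import sqrt            # direct style: use sqrt
from math import sqrt as root    # direct style with an alias: use root

# All four names refer to the same function.
print(math.sqrt(16), m.sqrt(16), sqrt(16), root(16))   # 4.0 4.0 4.0 4.0
```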

In [6]:
import numpy as np
import pandas as pd
from pandas_datareader import DataReader as pdr
from seaborn import load_dataset
from scipy.stats import norm


### Assignment statements

* a=b means "evaluate b and assign the result to a variable named a"
* spaces around = are optional, and are optional in most other places as well (with one important exception we will discuss)
* variable names can contain letters, digits, and underscores, must start with a letter or underscore, and cannot be a word reserved by python (such as if, for, or class)
* can have multiple assignments on a single line


In [None]:
x = 3
x = x + 1
x

In [None]:
x, y, z = 3, 4, 5
z

### Basic object types

Uncomment (delete the # and following space) each of the following one at a time and execute the cell.  You should do this in cells throughout the notebook.

In [None]:
# type(3)
# type(2.718)
# type('some text')
# type(['a', 'b', 'c'])
# type(('a', 'b', 'c'))
# type({'a': 1, 'b': 2})
# type(np.sqrt)
# type(True)
# type(False)
# type(None)

### Some basic functions

In [None]:
# round(3.14, 1)
# int(3.14)
# int('3')
# str(3)

### Help

In [9]:
# help(round)
# help(np.sqrt)
# help(np)

Help on built-in function round in module builtins:

round(number, ndigits=None)
    Round a number to a given precision in decimal digits.
    
    The return value is an integer if ndigits is omitted or None.  Otherwise
    the return value has the same type as the number.  ndigits may be negative.



### Print statements


In [None]:
x = 3.14
y = 2.718
print(x)
print(x, y)
print("x is", x, "and y is", y)


### Working with lists

* Counting starts from zero.  
* Ranges int1:int2 start with int1 and go to *but not including* int2.
* Consequently, the subset has int2 - int1 elements.
* If int1 is omitted (as in :int2), then start with 0.  If int2 is omitted (as in int1:), then go through the last element (including the last element).
* Last item can be accessed as -1, next-to-last as -2, etc.
* In range int1:int2 if int2 is -1 then go to *but do not include* the last element, analogously for -2, etc.
* Can use range int1:int2:int3 to go from int1 to int2 stepping by int3
* In range int1:int2:int3, int3 can be negative and int1>int2
* Concatenate lists with +



In [None]:
x = ['a', 'b', 'c', 'd']
y = ['m', 'n', 'o', 'p', 'q']

# len(x)
# x[0]
# x[1]
# x[:2]
# x[0:2]
# x[1:4]
# x[1:4:2]
# x[4:1:-1]
# x[-1]
# x[:-1]
# x[-2]
# x[-3:]
# x[-3:-1]
# x[1]='w'
# x[1:3] = ['y','z']
# x.append('u')
# indx = x.index('a')
# x.remove('a')
# x.insert(indx, 's')
# x.reverse()
# x + y
# 3 * x
# 7 * [0]
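
As a worked illustration of the slicing rules above, here are a few slices with their results shown in comments.

```python
x = ['a', 'b', 'c', 'd']

print(x[1:3])     # ['b', 'c']: 3 - 1 = 2 elements
print(x[:2])      # ['a', 'b']: start omitted, so begin at 0
print(x[2:])      # ['c', 'd']: stop omitted, so go through the last element
print(x[:-1])     # ['a', 'b', 'c']: everything except the last element
print(x[::2])     # ['a', 'c']: every second element
print(x[::-1])    # ['d', 'c', 'b', 'a']: a negative step reverses the list
```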

### Working with strings

* Can use either single or double quotes
* Concatenate with +
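* Strings behave much like lists of characters, and many list operations (indexing, slicing, concatenation) work on strings, but strings are immutable, so they cannot be changed in place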

In [None]:
string1 = 'This is some text'
string2 = "This is some different text"

# string1[:4]
# string1[-4:]
# string1 + ". " + string2 + "."
# string1.split(" ")
# string1.split(" ")[0] + " " + string1.split(" ")[3]
# string1.title()
# string1.upper()



### Logical conditions

A single = is an assignment, so to test for equality we use ==

In [None]:
# 3 == 3
# 3 == 5
# 3 != 5
# not 3 == 5
# 3 > 5
# 2 < 4 and 3 > 5
# 2 < 4 or 3 > 5
# 1 in [3, 4]
# 1 not in [3, 4]
# 1 * (3>5)
# 1 * (2<4)


### Ternary operator




In [None]:
# "yes" if 3>5 else "no"
# 10 if 3>5 else 100
# 10 if 2 < 4 or 3 > 5 else 100
# 10 if 0 else 100
# 10 if None else 100
# 10 if 16 else 100

### List comprehensions

In [None]:
letters = ['a', 'b', 'c', 'd']
# [x + '1' for x in letters]
# [x + '1' for x in letters if not x=='c']


### Range objects

In [None]:
# [i for i in range(6)]
# [i for i in range(1,7)]
# [i for i in range(2,8,2)]
# [i for i in range(6,2,-1)]

### Assign by reference or by value

An assignment a=b can behave as if it were by reference or by value.  Strictly speaking, python always makes the name a refer to the same object as b, but the practical effect depends on whether the object can be changed in place.  Integers, floats, and strings are immutable: an operation like x = x + 1 creates a new object, so they behave as if assigned by value, and changes to one variable will not affect the other.  Lists, dictionaries, and numpy arrays are mutable: both names are associated with the same memory location, so a change made in place through either variable will affect the other, which is what assignment by reference means.

The takeaway is that if you do a=b for a list or array and then change one, the other will also be changed.  In this situation, you probably want to make a copy of b, so you can change a without changing b.  Lists have a copy method that creates a copy at a new memory location.

In [None]:
x = 3
y = x
x = x + 1
y

In [None]:
x = ['a', 'b', 'c']
y = x
x.append('d')
y

In [None]:
x = ['a', 'b', 'c']
y = x.copy()
x.append('d')
y
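
One way to check whether two names refer to the same object is the `is` operator.  The following sketch repeats the list example above with that check added; the name `z` is introduced only for illustration.

```python
x = ['a', 'b', 'c']
y = x           # y refers to the same list object as x
z = x.copy()    # z is a new list with the same contents

print(y is x)   # True: same object
print(z is x)   # False: different objects
print(z == x)   # True: equal contents

x.append('d')
print(y)        # ['a', 'b', 'c', 'd']: y sees the change
print(z)        # ['a', 'b', 'c']: the copy is unaffected
```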

### Numpy

Numpy is a wrapper to code written in C or a variant that performs fast mathematical operations.  Numpy functions operate element-wise on lists and return numpy arrays.

In [None]:
x = [1, 2, 3]
# np.sqrt(x)
# np.log(x)
# np.exp(x)

### Numpy arrays

* For lists, + concatenates and * duplicates.  For numpy arrays, + adds and * multiplies element-wise.
* Exponentiation in python is **.  It operates on numpy arrays element-wise.

In [None]:
x = np.array([1, 2, 3])
y = np.array([4, 5, 6])
# x + y
# x * y
# x**2
# 3**x

### Higher-dimensional arrays



In [None]:
x = np.array([[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12]])
x

The following illustrates a formatting convention that can make code more readable.  It is useful in many places and is not actually related to numpy arrays.  Arbitrary indentation and line breaks are allowed inside parentheses, brackets, or braces.

In [None]:
x = np.array(
    [ 
        [1, 2, 3, 4],
        [5, 6, 7, 8],
        [9, 10, 11, 12]
    ]
)
x

### Shape and reshape

In [None]:
x.shape

In [None]:
x = np.array([i for i in range(1,13)]).reshape(3,4)
x

### Some standard arrays

In [None]:
# np.ones(4)
# np.zeros(3)
# np.identity(3)

### Matrix algebra

In [None]:
x = np.array([i for i in range(1,13)]).reshape(3, 4)
y = np.array([1, 2, 5, 10])
x @ y

In [None]:
x.T

In [None]:
x = norm.rvs(loc=0, scale=1, size=9).reshape(3, 3)
np.linalg.inv(x)

In [None]:
np.linalg.inv(x) @ x

### Concatenating numpy arrays

In [None]:
x = np.array([1, 2, 3])
y = np.array([4, 5, 6, 7])
np.concatenate((x, y))

### Defining functions

Why functions?  Modularized code is easier to test, maintain, and reuse.

* The def keyword starts the function definition.
* Arguments are enclosed in parentheses and followed by a colon.
* The return keyword indicates the value returned by the function.
* Functions can return numbers, lists, strings, ...
* Indentation is crucial.  All lines within the function definition must be indented the same number of spaces, unless there is a reason the line must be indented further (more later) or it is within parentheses, brackets, or braces.  A good IDE will prompt you to indent and tell you when you have indentation wrong.


In [None]:
def double(x):
    return 2*x

double(3)

### Passing arguments by name

In [None]:
def exponentiate(base, exponent):
    return base ** exponent

# exponentiate(2, 5)
# exponentiate(base=2, exponent=5)
# exponentiate(exponent=5, base=2)
# exponentiate(5, 2)

### Returning tuples


In [None]:
def f(x):
    return 2*x, 3*x 

a, b = f(2)
b

### Local and global variables

* If a variable is created inside a function definition, it will not exist outside the function (it is local to the function).
* If a variable has been assigned a value outside a function definition and the variable name is used again inside the definition, the value outside the function will not change (there will be two variables with the same name but different values in *different namespaces*).
* If a variable has been assigned a value outside a function definition, then the variable with the assigned value can be used in a function definition.  But changes to the variable outside the function *will change the operation of the function.*  This is dangerous to do.
* An exception to the second bullet point is if the global keyword is used inside the function definition.  This removes the local property of the variable inside the function and causes changes within the function to affect the value outside.  This is sometimes what you want to do, but you should use special names for such variables so that it is clear (to you) that this is being done.  A convention is to use all caps for global variables.

In [16]:
def addone(x):
  aNewVariableName = 1
  return x + aNewVariableName

print(addone(2))
print(aNewVariableName)

NameError: ignored

In [17]:
anOldVariableName = 5
def addone(x):
  anOldVariableName = 1
  return x + anOldVariableName
  
print(addone(2))
print(anOldVariableName)

3
5


In [13]:
anOldVariableName = 1
def addone(x):
  return x + anOldVariableName
  
print(addone(2))
print(anOldVariableName)

anOldVariableName = 5
print(addone(2))

3
1
7

In [18]:
anOldVariableName = 5
def addone(x):
  global anOldVariableName
  anOldVariableName = 1
  return x + anOldVariableName

print(addone(2))
print(anOldVariableName)

3
1


### Defining classes

Why classes?  To store data so we don't have to input it repeatedly into functions.

* Main takeaways: objects have attributes and methods.
* Attributes are data (in a general sense - not necessarily numbers) that are stored in the object.
* Methods are functions that operate on the data and possibly on other arguments.
* A class definition is initiated with the class keyword.
* The __init__ method is run when an instance of the class is created.  It usually defines the attributes.
* Note that the lines following the class keyword must be indented, and method definitions must be further indented.
* In general, indentation is sequential.  Each class / function / for or while block / if-else block must be further indented.


In [None]:
class multiplier():
    def __init__(self, x):
        self.factor = x

    def multiply(self, y):
        return y * self.factor

x = multiplier(3)
# x.multiply(4)
# x.factor

### Sets

Sets are unordered collections with no repeated items.

In [None]:
x = set([1, 1, 2, 3])
x

In [None]:
x.issubset([1, 2, 2, 3, 4, 4])

### Dictionaries

* Dictionaries are collections of key/value pairs (since python 3.7 they preserve insertion order)
* Sometimes called look-up tables or hash tables
* Compared to normal dictionaries, key $\sim$ word and value $\sim$ definition.
* Created with dict function or by enclosing key/value pairs in {}
* Values can be any type of object; keys must be hashable (e.g., strings, numbers, or tuples, but not lists)

In [None]:
x = {'a': 1, 'b': 2}
x['a']

In [None]:
x = dict(a=1, b=2)
# x['a']
# list(x.keys())
# list(x.values())
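
A short sketch of a few other common dictionary operations.  The dictionary `prices` and its numbers are made up for illustration.

```python
prices = {'AAPL': 150.0, 'MSFT': 300.0}   # made-up numbers
prices['GOOG'] = 120.0                    # add a new key/value pair

# Loop over key/value pairs
for ticker, price in prices.items():
    print(ticker, price)

print('AAPL' in prices)          # True: membership tests look at keys
print(prices.get('IBM'))         # None: .get avoids a KeyError for a missing key
print(prices.get('IBM', 0.0))    # 0.0: an optional default value
```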

### Pandas

* pandas was created by Wes McKinney while he was at AQR and is now maintained by a community of open-source developers.
* The load_dataset function returns a pandas dataframe.
* .loc extracts rows using row labels (index).  In this case, the labels are just 0, 1, 2, ...
* .iloc extracts rows using the index location.  It works like extracting items from a list, starting at 0.
* Because the row labels here are just 0, 1, 2, ... .loc and .iloc work almost the same.  The only difference is that .iloc works like extracting items from a list, so it goes up to but not including the last number.  .loc uses the row labels including the last label.
* pandas has a copy method like lists so you can create a copy of a dataframe and change the copy without affecting the original.

In [None]:
tips = load_dataset("tips")
# tips.info()
# tips.describe()
# tips.dtypes
# tips.head()
# tips.tail()
# tips.columns
# tips.index
# tips.day.unique()
# tips.loc[0]
# tips.loc[3:6]
# tips.loc[-1]    # raises KeyError: no row is labeled -1
# tips.loc[-4:]
# tips.iloc[0]
# tips.iloc[3:6]
# tips.iloc[3:10:2]
# tips.iloc[-1]
# tips.iloc[-4:]
# tips['tip']
# tips.tip
# tips.loc[3, 'tip']
# tips[['total_bill', 'tip']]
# tips[['total_bill', 'tip']].loc[3:7]
# tips.loc[3:7][['total_bill', 'tip']]
# tips.to_dict()
# tips[['total_bill', 'tip']].to_numpy()
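
The difference between .loc and .iloc can be seen on a small dataframe that does not require downloading the tips data.  This is only a sketch; the dataframe `df` and its numbers are made up.

```python
import pandas as pd

df = pd.DataFrame(
    {'total_bill': [10.0, 20.0, 30.0, 40.0, 50.0]},
    index=[0, 1, 2, 3, 4]
)

by_label = df.loc[1:3]       # labels 1, 2, 3 (the end label is included)
by_position = df.iloc[1:3]   # positions 1, 2 (the end position is excluded)

print(len(by_label))      # 3
print(len(by_position))   # 2

# After sorting, labels and positions no longer coincide
df_sorted = df.sort_values(by='total_bill', ascending=False)
print(df_sorted.loc[0, 'total_bill'])    # 10.0: the row labeled 0
print(df_sorted.iloc[0]['total_bill'])   # 50.0: the first row in the sorted order
```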

In [None]:
tips2 = tips.copy()
tips2.columns = ['new_' + c for c in tips2.columns]
tips2.head(3)

In [None]:
tips2 = tips2.rename(columns={'new_smoker': 'new_new_smoker'})
tips2.head(3)

### Sorting

In [None]:
tips = tips.sort_values(by=['sex', 'total_bill'])
tips.head()

### Inserting columns

We can add new rows and columns.  More often, we want to add new columns.  Operations on columns are element-wise as with numpy.

In [None]:
tips['pct'] = tips.tip / tips.total_bill
tips['day_type'] = tips.day.map(lambda x: 'Weekend' if x in ['Sat', 'Sun'] else 'Weekday')
tips.head(3)

### Filtering

In [None]:
# tips[tips.sex=="Male"].head()
# tips[(tips.sex=="Male") & (tips.pct>0.2)].head()
# tips[(tips.sex=="Male") | (tips.pct>0.2)].head()


### Aggregating

In [None]:
tips2 = tips[['total_bill', 'tip']]

# tips2.sum()
# tips2.mean()
# tips2.std()
# tips2.corr()
# tips2.cov()
# tips2.median()
# tips2.quantile([0.25, 0.5, 0.75])
# tips2.sum(axis=1)
# tips2.apply(lambda x: (x**2).sum())


### Aggregating by groups

In empirical asset pricing, we are constantly either grouping by stock and doing something to each time series of stock data, or we are grouping by date and doing something to each cross-section.

In [None]:
# tips.groupby('sex').total_bill.mean()
# tips.groupby(['sex', 'time']).total_bill.mean()
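
To connect this to the asset pricing use case, here is a minimal sketch with a made-up panel of returns for two hypothetical stocks.  The dataframe `panel` and its column names are assumptions for illustration only.

```python
import pandas as pd

# Made-up returns for two hypothetical stocks on three dates
panel = pd.DataFrame({
    'stock': ['A', 'A', 'A', 'B', 'B', 'B'],
    'date': ['2020-01', '2020-02', '2020-03'] * 2,
    'ret': [0.01, 0.03, -0.01, 0.02, 0.00, 0.04],
})

# Group by stock: one number per time series
mean_by_stock = panel.groupby('stock').ret.mean()

# Group by date: one number per cross-section
mean_by_date = panel.groupby('date').ret.mean()

print(mean_by_stock)
print(mean_by_date)
```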

### Index and reset index

In [None]:
avgs = tips.groupby(['sex', 'time']).total_bill.mean()
# avgs
# avgs.loc[('Male', 'Dinner')]
# avgs.index
# avgs.reset_index()
# avgs.reset_index().set_index(['sex', 'time'])

### Wide versus long form

In [None]:
# avgs.unstack()
# avgs.unstack().stack()

### Dataframes versus series

* Pandas provides two classes: DataFrame and Series.  
* A series is like a one column dataframe but not quite.  There are occasionally dataframe methods that are not available for a series.  
* We can convert a series to a dataframe with pd.DataFrame.

In [None]:
# type(tips)
# type(tips.tip)
# type(avgs)
# type(avgs.unstack())
# type(pd.DataFrame(tips.tip))

### Pandas data reader

In [None]:
treasury10 = pdr('DGS10', 'fred', start=1980)
treasury30 = pdr('DGS30', 'fred', start=1980)
both = pdr(['DGS10', 'DGS30'], 'fred', start=1980)

# treasury10.head()
# treasury30.head()
# both.head()
# treasury10.index
# type(treasury10)
# np.log(1+both/100)

### Merging dataframes

* Merge, join, and concatenate are different methods for combining dataframes.  
* Merge provides the finest control.

In [None]:
both1 = treasury10.merge(treasury30, left_index=True, right_index=True)
both2 = treasury10.merge(treasury30, on='DATE')
both3 = treasury10.join(treasury30)
both4 = pd.concat((treasury10, treasury30), axis=1)
[both.equals(b) for b in [both1, both2, both3, both4]]
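
Part of the finer control that merge provides is the `how` argument, which determines which rows are kept when the two dataframes do not cover the same keys.  A small sketch with made-up dataframes `left` and `right`:

```python
import pandas as pd

left = pd.DataFrame({'DATE': ['2020-01', '2020-02', '2020-03'], 'a': [1, 2, 3]})
right = pd.DataFrame({'DATE': ['2020-02', '2020-03', '2020-04'], 'b': [20, 30, 40]})

inner = left.merge(right, on='DATE', how='inner')      # dates in both
outer = left.merge(right, on='DATE', how='outer')      # dates in either
left_only = left.merge(right, on='DATE', how='left')   # all dates in left

print(len(inner), len(outer), len(left_only))   # 2 4 3
```

Rows that have no match in the other dataframe get NaN in the columns that come from it.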


### Missing data

* Missing values are recorded as NaN (not a number).  
* We can drop them or fill them.  
* We can fill with a specific value or fill from the previous entry or the next entry.

In [None]:
# both.isna()
# both.dropna()
# both.dropna(subset=['DGS10'])
# both.fillna(0)
# both.bfill()
# both.ffill()
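
The effect of each option is easier to see on a short series.  The series `s` below is made up for illustration.

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, np.nan, 4.0])

print(s.isna().sum())         # 2 missing values
print(s.dropna().tolist())    # [1.0, 4.0]
print(s.fillna(0).tolist())   # [1.0, 0.0, 0.0, 4.0]
print(s.ffill().tolist())     # [1.0, 1.0, 1.0, 4.0]: fill from the previous entry
print(s.bfill().tolist())     # [1.0, 4.0, 4.0, 4.0]: fill from the next entry
```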

### Working with time series



In [None]:
# both.shift().head()
# both.shift(2).head()
# both.shift(-1).head()
# both.diff().head()
# both.diff(2).head()
# both.pct_change()
# both.rolling(5).mean()
# both.rolling(5).std()
# both.resample('M').last()

### Working with dates

* The following use the pandas period M and period Y formats.  
* The datetime module has more functions for working with dates (and times of course).
* Google strftime for more ways to format dates as strings.
* More often, we need to convert dates in various formats to the standard datetime format using strptime.
* To convert an index to period format, use .to_period(), but to convert a date column of a dataframe to period format, use .dt.to_period().

In [None]:
monthly = both.resample('M').last()    # in pandas 2.2 or later, 'ME' is the preferred alias for month end
# monthly.index
# monthly.index.to_period('M')
# monthly.index.to_period('M').astype(str)
# monthly.index = monthly.index.to_period('M')
# monthly.index

In [None]:
annual = both.resample('Y').last()    # in pandas 2.2 or later, 'YE' is the preferred alias for year end
annual = annual.reset_index()
# annual.head()
# annual.DATE = annual.DATE.dt.to_period('Y')
# annual.dtypes
# annual.DATE = annual.DATE.astype(str).astype(int)
# annual.dtypes

In [None]:
both['Year'] = both.index.map(lambda x: x.year)
both['Month'] = both.index.map(lambda x: x.month)
both['date'] = both.index.strftime("%b %d, %Y")
both.head()
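
The bullets above mention strptime for converting date strings to the standard datetime format.  A minimal sketch using the standard-library datetime module, with made-up date strings:

```python
from datetime import datetime

# strptime parses a string into a datetime, given the string's format
d1 = datetime.strptime("03/15/2021", "%m/%d/%Y")
d2 = datetime.strptime("2021-03-15", "%Y-%m-%d")

print(d1 == d2)                    # True: both are March 15, 2021
print(d1.year, d1.month, d1.day)   # 2021 3 15

# strftime goes the other way, from a datetime to a string
print(d1.strftime("%Y%m%d"))       # 20210315
```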

### Saving and reading dataframes

* To save or read files in Colab, you need to "mount" your Google Drive.  Click on the file icon in the left toolbar and then click the Google Drive icon.
* pandas has pd.read_csv, pd.read_stata, pd.read_sas, and pd.read_excel functions.

In [None]:
both.to_csv('filename.csv')
newboth = pd.read_csv('filename.csv')

### Loops

* A loop is a block of code that is executed repeatedly, for a given number of times (for loop) or until some condition is met (while loop).
* It is not common to need a while loop.
* On the other hand, zip and enumerate are often useful (especially zip) in for loops.
* Indentation is again crucial.  

In [None]:
for i in range(5):
    print(i)

In [None]:
for ltr in ['a', 'b', 'c']:
    print(ltr)

In [None]:
both5 = None
for data in [treasury10, treasury30]:
    both5 = pd.concat((both5, data), axis=1)
both5.head()

In [None]:
lst1 = ['a', 'b', 'c']
lst2 = ['1', '2', '3']
for a, b in zip(lst1, lst2):
    print(a+b)

In [None]:
for i, ltr in enumerate(lst1):
    print(ltr+lst2[i])

In [None]:
i = 0
while i<3:
    print(i)
    i = i + 1

### Conditional execution

* An indented block following an if statement is executed only if the condition evaluates to True.
* Often but not always there is an else with another indented block following the if block.
* There can also be one or more elif (else if) blocks based on additional conditions.

In [None]:
def f(number):
    if number < 10:
        return 'small' 
    elif number < 100:
        return 'medium' 
    else:
        return 'large' 

f(20)

In [None]:
def g(number):
    return 'small' if number<10 else ('medium' if number<100 else 'large')

g(20)