# `DataFrame` Basics

In this chapter we cover the basics of working with `DataFrame` objects in **pandas**.

## Importing Packages

Let's begin by importing the packages that we will need.

In [1]:
import pandas as pd
import yfinance as yf
pd.set_option('display.max_rows', 10)

## Reading-In Data

Next, let's use **yfinance** to read in SPY prices covering 2020-02-28 through 2020-03-30 (the `end` date passed to `yf.download()` is not included in the result).  SPY is an ETF that tracks the S&P 500 index.

In [2]:
df_spy = yf.download('SPY', start='2020-02-28', end='2020-03-31', auto_adjust=False, rounding=True)
df_spy.head()

[*********************100%***********************]  1 of 1 completed


Price,Adj Close,Close,High,Low,Open,Volume
Ticker,SPY,SPY,SPY,SPY,SPY,SPY
Date,,,,,,
2020-02-28,273.04,296.26,297.89,285.54,288.7,384975800
2020-03-02,284.86,309.09,309.16,294.46,298.21,238703600
2020-03-03,276.71,300.24,313.84,297.57,309.5,300139100
2020-03-04,288.34,312.86,313.1,303.33,306.12,176613400
2020-03-05,278.75,302.46,308.47,300.01,304.98,186366800


As a bit of clean-up, let's do the following:

- drop the `SPY` level of the column index
- make the `Date` a regular column instead of an index
- make the column names snake-case.

In [3]:
df_spy = df_spy.droplevel(level=1, axis=1)
df_spy = df_spy.rename_axis(None, axis=1)
df_spy.reset_index(drop=False, inplace=True)
df_spy.columns = df_spy.columns.str.lower().str.replace(' ', '_')
df_spy.head()

,date,adj_close,close,high,low,open,volume
0,2020-02-28,273.04,296.26,297.89,285.54,288.7,384975800
1,2020-03-02,284.86,309.09,309.16,294.46,298.21,238703600
2,2020-03-03,276.71,300.24,313.84,297.57,309.5,300139100
3,2020-03-04,288.34,312.86,313.1,303.33,306.12,176613400
4,2020-03-05,278.75,302.46,308.47,300.01,304.98,186366800
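
To see what each clean-up step does in isolation, here is a minimal sketch on a small hand-built frame. The frame `df_raw` is an assumption made for illustration: it mimics the two-level column layout of the download using only the first two rows and two columns shown above, so it runs without a network connection.

```python
import pandas as pd

# Hand-built stand-in for the raw download (an assumption for illustration):
# a two-level column index (Price, Ticker) and a DatetimeIndex named 'Date'.
columns = pd.MultiIndex.from_tuples(
    [("Adj Close", "SPY"), ("Close", "SPY")], names=["Price", "Ticker"]
)
index = pd.DatetimeIndex(["2020-02-28", "2020-03-02"], name="Date")
df_raw = pd.DataFrame(
    [[273.04, 296.26], [284.86, 309.09]], index=index, columns=columns
)

# Step 1: drop the Ticker level, leaving a single-level column index.
df_clean = df_raw.droplevel(level=1, axis=1)

# Step 2: remove the leftover 'Price' label from the column axis.
df_clean = df_clean.rename_axis(None, axis=1)

# Step 3: turn the Date index into a regular column.
df_clean = df_clean.reset_index(drop=False)

# Step 4: convert the column names to snake case.
df_clean.columns = df_clean.columns.str.lower().str.replace(" ", "_")

print(df_clean.columns.tolist())
```

After the four steps the toy frame has a default integer index and the columns `date`, `adj_close`, and `close`.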


## Exploring a `DataFrame`

We can explore our `df_spy` `DataFrame` in a variety of ways.

First, we can use the built-in `type()` function to confirm that what we have created is in fact a `DataFrame`.

In [4]:
type(df_spy)

pandas.core.frame.DataFrame

Next, we can use the `.dtypes` attribute of the `DataFrame` to see the data type of each column.

In [5]:
df_spy.dtypes

date         datetime64[ns]
adj_close           float64
close               float64
high                float64
low                 float64
open                float64
volume                int64
dtype: object

We can also check the number of rows and columns by using the `.shape` attribute.

In [6]:
df_spy.shape

(22, 7)

As we can see, our `DataFrame` `df_spy` consists of 22 rows and 7 columns.
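
Because `.shape` is an ordinary Python tuple, it can be unpacked into separate row and column counts. The sketch below uses a small toy frame (`df_toy` is a made-up name, with values taken from the first three rows above) rather than the full SPY data.

```python
import pandas as pd

# Toy frame for illustration only: 3 rows and 2 columns.
df_toy = pd.DataFrame(
    {
        "close": [296.26, 309.09, 300.24],
        "volume": [384975800, 238703600, 300139100],
    }
)

# .shape is a plain (rows, columns) tuple, so it can be unpacked.
n_rows, n_cols = df_toy.shape
print(n_rows, n_cols)

# len() of a DataFrame is its number of rows.
print(len(df_toy))
```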

 ---

**Code Challenge:** Try the `DataFrame.info()` and `DataFrame.describe()` methods on `df_spy`.

In [7]:
#| code-fold: true
#| code-summary: "Solution"
df_spy.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 22 entries, 0 to 21
Data columns (total 7 columns):
 #   Column     Non-Null Count  Dtype         
---  ------     --------------  -----         
 0   date       22 non-null     datetime64[ns]
 1   adj_close  22 non-null     float64       
 2   close      22 non-null     float64       
 3   high       22 non-null     float64       
 4   low        22 non-null     float64       
 5   open       22 non-null     float64       
 6   volume     22 non-null     int64         
dtypes: datetime64[ns](1), float64(5), int64(1)
memory usage: 1.3 KB


In [8]:
#| code-fold: true
#| code-summary: "Solution"
df_spy.describe().round(2)

,date,adj_close,close,high,low,open,volume
count,22,22.0,22.0,22.0,22.0,22.0,22.0
mean,2020-03-14 12:00:00,246.07,266.54,272.89,258.81,265.03,278005100.0
min,2020-02-28 00:00:00,206.68,222.95,229.68,218.26,228.19,171369500.0
25%,2020-03-06 18:00:00,226.22,244.06,256.22,237.14,243.12,236296800.0
50%,2020-03-14 12:00:00,242.35,261.42,264.73,250.05,255.85,282883000.0
75%,2020-03-22 06:00:00,271.23,294.3,295.55,282.53,287.68,321873200.0
max,2020-03-30 00:00:00,288.34,312.86,313.84,303.33,309.5,392220700.0
std,,25.07,27.55,25.44,26.98,26.43,61345510.0


## `DataFrame` Columns

In order to isolate a particular column of a `DataFrame` we can use square brackets (`[ ]`).  The following code isolates the `close` price column of `df_spy`.

In [9]:
df_spy['close']

0     296.26
1     309.09
2     300.24
3     312.86
4     302.46
       ...  
17    243.15
18    246.79
19    261.20
20    253.42
21    261.65
Name: close, Length: 22, dtype: float64

---

**Code Challenge:** Isolate the `date` column of `df_spy`.

In [10]:
#| code-fold: true
#| code-summary: "Solution"
df_spy['date']

0    2020-02-28
1    2020-03-02
2    2020-03-03
3    2020-03-04
4    2020-03-05
        ...    
17   2020-03-24
18   2020-03-25
19   2020-03-26
20   2020-03-27
21   2020-03-30
Name: date, Length: 22, dtype: datetime64[ns]

--- 

As we can see from the following code, each column of a `DataFrame` is actually a different kind of **pandas** structure called a `Series`. 

In [11]:
type(df_spy['close'])

pandas.core.series.Series

Here is a bit of **pandas** inside baseball:

- A `DataFrame` is a collection of columns that are glued together.

- Each column is a `Series`.

- A `Series` has two main attributes: 1) `.values`; 2) `.index`.

- The `.values` component of a `Series` is a NumPy array (`numpy.ndarray`).

Let's look at the `.values` attribute of the `close` column of `df_spy`.

In [12]:
df_spy['close'].values

array([296.26, 309.09, 300.24, 312.86, 302.46, 297.46, 274.23, 288.42,
       274.36, 248.11, 269.32, 239.85, 252.8 , 240.  , 240.51, 228.8 ,
       222.95, 243.15, 246.79, 261.2 , 253.42, 261.65])

 ---

**Code Challenge:** Verify that the `.values` component of the `close` column of `df_spy` is in fact a `numpy.ndarray`.

In [13]:
#| code-fold: true
#| code-summary: "Solution"
type(df_spy['close'].values)

numpy.ndarray
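
The two main pieces of a `Series` can also be seen by building one by hand. The following is a minimal sketch: `ser_close` and `ser_rebuilt` are illustrative names, and the three prices are the first three `close` values shown above. The `.to_numpy()` method is an alternative to `.values` that the pandas documentation recommends for obtaining a NumPy array.

```python
import numpy as np
import pandas as pd

# First three close prices from the output above, with a default integer index.
ser_close = pd.Series([296.26, 309.09, 300.24], name="close")

print(ser_close.index)       # the row labels
print(ser_close.values)      # the underlying data as a NumPy array
print(ser_close.to_numpy())  # alternative accessor for the same data

# A Series can be put back together from its values and its index.
ser_rebuilt = pd.Series(ser_close.values, index=ser_close.index, name="close")
print(ser_rebuilt.equals(ser_close))
```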

---

## Component-wise Column Operations

We can perform component-wise (i.e. vector-like) calculations with `DataFrame` columns.

The following code divides all the `close` prices by 100.

In [14]:
df_spy['close'] / 100

0     2.9626
1     3.0909
2     3.0024
3     3.1286
4     3.0246
       ...  
17    2.4315
18    2.4679
19    2.6120
20    2.5342
21    2.6165
Name: close, Length: 22, dtype: float64

We can also perform component-wise calculations between two columns.

Let's say we want to calculate the *intraday range* of SPY for each of the trade-dates in `df_spy`; this is the difference between the `high` and the `low` of each day.  We can do this easily from the columns of our `DataFrame`.

In [15]:
df_spy['high'] - df_spy['low']

0     12.35
1     14.70
2     16.27
3      9.77
4      8.46
      ...  
17    10.30
18    16.60
19    13.75
20     9.76
21     8.90
Length: 22, dtype: float64

 ---

**Code Challenge:** Calculate the difference between the `close` and `open` columns of `df_spy`.

In [16]:
#| code-fold: true
#| code-summary: "Solution"
df_spy['close'] - df_spy['open']

0      7.56
1     10.88
2     -9.26
3      6.74
4     -2.52
      ...  
17     8.73
18     1.92
19    11.68
20     0.15
21     5.95
Length: 22, dtype: float64
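
Since the values of a column are stored in a NumPy array, column arithmetic behaves like NumPy's element-wise arithmetic. Here is a small sketch that compares the two on a toy frame; `df_toy` is a made-up name and holds only the first five `high` and `low` values from the output above.

```python
import numpy as np
import pandas as pd

# First five high/low prices from the table above (toy frame for illustration).
df_toy = pd.DataFrame(
    {
        "high": [297.89, 309.16, 313.84, 313.10, 308.47],
        "low": [285.54, 294.46, 297.57, 303.33, 300.01],
    }
)

# Column arithmetic in pandas: returns a Series that keeps the row index.
range_pandas = df_toy["high"] - df_toy["low"]

# The same calculation on the underlying NumPy arrays: returns a bare array.
range_numpy = df_toy["high"].values - df_toy["low"].values

print(range_pandas.round(2).tolist())
print(np.allclose(range_pandas.values, range_numpy))
```

The numbers agree; the practical difference is that the pandas result carries the index along with the values.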

 ---

## Adding Columns via Variable Assignment

Let's say we want to save our intraday ranges back into `df_spy` for further analysis later.  The most straightforward way to do this is using variable assignment as follows.

In [17]:
df_spy['intraday_range'] = df_spy['high'] - df_spy['low']
df_spy.head()

,date,adj_close,close,high,low,open,volume,intraday_range
0,2020-02-28,273.04,296.26,297.89,285.54,288.7,384975800,12.35
1,2020-03-02,284.86,309.09,309.16,294.46,298.21,238703600,14.7
2,2020-03-03,276.71,300.24,313.84,297.57,309.5,300139100,16.27
3,2020-03-04,288.34,312.86,313.1,303.33,306.12,176613400,9.77
4,2020-03-05,278.75,302.46,308.47,300.01,304.98,186366800,8.46


---

**Code Challenge:**  Add a new column to `df_spy` called `open_to_close` that consists of the difference between the `close` and `open` of each day.

In [18]:
#| code-fold: true
#| code-summary: "Solution"
df_spy['open_to_close'] = df_spy['close'] - df_spy['open']
df_spy.head()

,date,adj_close,close,high,low,open,volume,intraday_range,open_to_close
0,2020-02-28,273.04,296.26,297.89,285.54,288.7,384975800,12.35,7.56
1,2020-03-02,284.86,309.09,309.16,294.46,298.21,238703600,14.7,10.88
2,2020-03-03,276.71,300.24,313.84,297.57,309.5,300139100,16.27,-9.26
3,2020-03-04,288.34,312.86,313.1,303.33,306.12,176613400,9.77,6.74
4,2020-03-05,278.75,302.46,308.47,300.01,304.98,186366800,8.46,-2.52


---

## Adding Columns via `.assign()` 

A powerful but less intuitive way of adding a column to a `DataFrame` uses the `.assign()` method, which is typically combined with `lambda` functions (i.e. anonymous functions).  

The following code adds another column called `intraday_range_assign`.

In [19]:
df_spy.assign(intraday_range_assign = lambda df: df['high'] - df['low'])

,date,adj_close,close,high,low,open,volume,intraday_range,open_to_close,intraday_range_assign
0,2020-02-28,273.04,296.26,297.89,285.54,288.70,384975800,12.35,7.56,12.35
1,2020-03-02,284.86,309.09,309.16,294.46,298.21,238703600,14.70,10.88,14.70
2,2020-03-03,276.71,300.24,313.84,297.57,309.50,300139100,16.27,-9.26,16.27
3,2020-03-04,288.34,312.86,313.10,303.33,306.12,176613400,9.77,6.74,9.77
4,2020-03-05,278.75,302.46,308.47,300.01,304.98,186366800,8.46,-2.52,8.46
...,...,...,...,...,...,...,...,...,...,...
17,2020-03-24,225.41,243.15,244.10,233.80,234.42,235494500,10.30,8.73,10.30
18,2020-03-25,228.78,246.79,256.35,239.75,244.87,299430300,16.60,1.92,16.60
19,2020-03-26,242.14,261.20,262.80,249.05,249.52,257632800,13.75,11.68,13.75
20,2020-03-27,234.93,253.42,260.81,251.05,253.27,224341200,9.76,0.15,9.76


---

**Code Challenge:** Verify that the column `intraday_range_assign` was not actually added to `df_spy`.

In [20]:
#| code-fold: true
#| code-summary: "Solution"
df_spy.head()

,date,adj_close,close,high,low,open,volume,intraday_range,open_to_close
0,2020-02-28,273.04,296.26,297.89,285.54,288.7,384975800,12.35,7.56
1,2020-03-02,284.86,309.09,309.16,294.46,298.21,238703600,14.7,10.88
2,2020-03-03,276.71,300.24,313.84,297.57,309.5,300139100,16.27,-9.26
3,2020-03-04,288.34,312.86,313.1,303.33,306.12,176613400,9.77,6.74
4,2020-03-05,278.75,302.46,308.47,300.01,304.98,186366800,8.46,-2.52


---

In order to add the `intraday_range_assign` column to `df_spy`, we need to assign the result of `.assign()` back to `df_spy`.

In [21]:
df_spy = df_spy.assign(intraday_range_assign = lambda df: df['high'] - df['low'])
df_spy.head()

,date,adj_close,close,high,low,open,volume,intraday_range,open_to_close,intraday_range_assign
0,2020-02-28,273.04,296.26,297.89,285.54,288.7,384975800,12.35,7.56,12.35
1,2020-03-02,284.86,309.09,309.16,294.46,298.21,238703600,14.7,10.88,14.7
2,2020-03-03,276.71,300.24,313.84,297.57,309.5,300139100,16.27,-9.26,16.27
3,2020-03-04,288.34,312.86,313.1,303.33,306.12,176613400,9.77,6.74,9.77
4,2020-03-05,278.75,302.46,308.47,300.01,304.98,186366800,8.46,-2.52,8.46
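
The behavior of `.assign()` can be summarized in a small, self-contained sketch. The names `df_toy`, `df_new`, and `df_alt` are illustrative, and the prices are the first two rows from the tables above. The `lambda` receives the `DataFrame` that `.assign()` is called on, and the result is a new `DataFrame`, which is why the original is untouched until we reassign.

```python
import pandas as pd

# Toy frame for illustration: first two high/low prices from the table above.
df_toy = pd.DataFrame({"high": [297.89, 309.16], "low": [285.54, 294.46]})

# .assign() returns a new DataFrame; the lambda is called with df_toy as `df`.
df_new = df_toy.assign(intraday_range=lambda df: df["high"] - df["low"])

print(df_toy.columns.tolist())  # original is unchanged
print(df_new.columns.tolist())  # new frame has the extra column

# The lambda is optional: a precomputed Series can be passed directly.
df_alt = df_toy.assign(intraday_range=df_toy["high"] - df_toy["low"])
print(df_new.equals(df_alt))
```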


---

**Code Challenge:** Use `.assign()` to create a new column in `df_spy`, call it `open_to_close_assign`, that contains the difference between the `close` and `open`.

In [22]:
#| code-fold: true
#| code-summary: "Solution"
df_spy = df_spy.assign(open_to_close_assign = lambda df: df['close'] - df['open'])
df_spy.head()

,date,adj_close,close,high,low,open,volume,intraday_range,open_to_close,intraday_range_assign,open_to_close_assign
0,2020-02-28,273.04,296.26,297.89,285.54,288.7,384975800,12.35,7.56,12.35,7.56
1,2020-03-02,284.86,309.09,309.16,294.46,298.21,238703600,14.7,10.88,14.7,10.88
2,2020-03-03,276.71,300.24,313.84,297.57,309.5,300139100,16.27,-9.26,16.27,-9.26
3,2020-03-04,288.34,312.86,313.1,303.33,306.12,176613400,9.77,6.74,9.77,6.74
4,2020-03-05,278.75,302.46,308.47,300.01,304.98,186366800,8.46,-2.52,8.46,-2.52


---

## Method Chaining

The value of `.assign()` becomes clear when we start *chaining* methods together.  

In order to see this let's first `drop` the columns that we created.

In [23]:
lst_cols = ['intraday_range', 'open_to_close', 'intraday_range_assign', 'open_to_close_assign']
df_spy.drop(columns=lst_cols, inplace=True)
df_spy.head()

,date,adj_close,close,high,low,open,volume
0,2020-02-28,273.04,296.26,297.89,285.54,288.7,384975800
1,2020-03-02,284.86,309.09,309.16,294.46,298.21,238703600
2,2020-03-03,276.71,300.24,313.84,297.57,309.5,300139100
3,2020-03-04,288.34,312.86,313.1,303.33,306.12,176613400
4,2020-03-05,278.75,302.46,308.47,300.01,304.98,186366800


The following code adds the `intraday_range` and `open_to_close` columns in a single chained statement.

In [24]:
df_spy = \
    (
    df_spy
        .assign(intraday_range = lambda df: df['high'] - df['low'])
        .assign(open_to_close = lambda df: df['close'] - df['open'])
    )
df_spy.head()

,date,adj_close,close,high,low,open,volume,intraday_range,open_to_close
0,2020-02-28,273.04,296.26,297.89,285.54,288.7,384975800,12.35,7.56
1,2020-03-02,284.86,309.09,309.16,294.46,298.21,238703600,14.7,10.88
2,2020-03-03,276.71,300.24,313.84,297.57,309.5,300139100,16.27,-9.26
3,2020-03-04,288.34,312.86,313.1,303.33,306.12,176613400,9.77,6.74
4,2020-03-05,278.75,302.46,308.47,300.01,304.98,186366800,8.46,-2.52


---

**Code Challenge:** Use `.assign()` to add two new columns to `df_spy`:
    
1. the difference between the `close` and `adj_close`
1. the average of the `low` and `open`

In [25]:
#| code-fold: true
#| code-summary: "Solution"
df_spy = \
    (
    df_spy
        .assign(div = lambda df: df['close'] - df['adj_close'])
        .assign(avg = lambda df: (df['low'] + df['open']) / 2)
    )
df_spy.head()

,date,adj_close,close,high,low,open,volume,intraday_range,open_to_close,div,avg
0,2020-02-28,273.04,296.26,297.89,285.54,288.7,384975800,12.35,7.56,23.22,287.12
1,2020-03-02,284.86,309.09,309.16,294.46,298.21,238703600,14.7,10.88,24.23,296.335
2,2020-03-03,276.71,300.24,313.84,297.57,309.5,300139100,16.27,-9.26,23.53,303.535
3,2020-03-04,288.34,312.86,313.1,303.33,306.12,176613400,9.77,6.74,24.52,304.725
4,2020-03-05,278.75,302.46,308.47,300.01,304.98,186366800,8.46,-2.52,23.71,302.495
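
As a variation on the chained calls above, a single `.assign()` call can take several keyword arguments, and a later one may refer to a column created by an earlier one in the same call. The sketch below is only an illustration on a toy frame: `df_toy`, `df_out`, and the `range_share` column are made-up names, and the prices are the first two rows from the tables above.

```python
import pandas as pd

# Toy frame for illustration: first two rows of prices from the table above.
df_toy = pd.DataFrame(
    {
        "open": [288.70, 298.21],
        "high": [297.89, 309.16],
        "low": [285.54, 294.46],
        "close": [296.26, 309.09],
    }
)

df_out = (
    df_toy
    .assign(
        intraday_range=lambda df: df["high"] - df["low"],
        open_to_close=lambda df: df["close"] - df["open"],
        # Hypothetical extra column: uses the two columns created just above.
        range_share=lambda df: df["open_to_close"] / df["intraday_range"],
    )
    .round(2)  # chaining continues after .assign()
)
print(df_out)
```

Whether to use one `.assign()` call with several arguments or several chained calls is largely a matter of readability.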


---

## Aggregating Calculations on `Series`

`Series` have a variety of built-in aggregation methods.

For example, we can use the following code to calculate the total SPY volume across the trade dates in `df_spy`.

In [26]:
df_spy['volume'].sum()

6116112300

Here are some summary statistics on the `intraday_range` column that we added to our `DataFrame` earlier.

In [27]:
print("Mean:  ", df_spy['intraday_range'].mean()) # average
print("St Dev: ", df_spy['intraday_range'].std()) # standard deviation
print("Min:    " , df_spy['intraday_range'].min()) # minimum
print("Max:   " , df_spy['intraday_range'].max()) # maximum

Mean:   14.077727272727275
St Dev:  4.28352428533215
Min:     8.460000000000036
Max:    22.960000000000008


---

**Code Challenge:** Calculate the average daily `volume` for the trade dates in `df_spy`.

In [28]:
#| code-fold: true
#| code-summary: "Solution"
df_spy['volume'].mean()

278005104.54545456
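
Several aggregations can also be requested at once with the `.agg()` method, which accepts a list of function names. This is a minimal sketch on a toy `Series`: `ser_range` and `summary` are illustrative names, and the values are only the first five intraday ranges shown earlier, so the results differ from the 22-row statistics above.

```python
import pandas as pd

# First five intraday ranges from the earlier output (toy data, not all 22 rows).
ser_range = pd.Series([12.35, 14.70, 16.27, 9.77, 8.46], name="intraday_range")

# Passing a list of names returns a Series with one entry per aggregation.
summary = ser_range.agg(["mean", "std", "min", "max"])
print(summary.round(2))
```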

---

## Related Reading

*Python Data Science Handbook* - Section 3.1 - Introducing Pandas Objects

*Python Data Science Handbook* - Section 2.1 - Understanding Data Types in Python

*Python Data Science Handbook* - Section 2.2 - The Basics of NumPy Arrays

*Python Data Science Handbook* - Section 2.3 - Computation on NumPy Arrays: Universal Functions

*Python Data Science Handbook* - Section 2.4 - Aggregations: Min, Max, and Everything In Between