# `DataFrame` Basics

In this tutorial we cover the basics of working with `DataFrames` in `pandas`.

### Importing Packages

Let's begin by importing the packages that we will need.

In [1]:
import pandas as pd
import pandas_datareader as pdr
pd.set_option('display.max_rows', 10)

### Reading-In Data

Next, let's use `pandas_datareader` to read in SPY prices for March 2020 (starting from the last trading day of February).  SPY is an ETF that tracks the S&P 500 index.

In [2]:
df_spy = pdr.get_data_yahoo('SPY', start='2020-02-28', end='2020-03-31')
df_spy = df_spy.round(2)
df_spy.head()

Date,High,Low,Open,Close,Volume,Adj Close
2020-02-28,297.89,285.54,288.7,296.26,384975800.0,288.93
2020-03-02,309.16,294.46,298.21,309.09,238703600.0,301.45
2020-03-03,313.84,297.57,309.5,300.24,300139100.0,292.82
2020-03-04,313.1,303.33,306.12,312.86,176613400.0,305.12
2020-03-05,308.47,300.01,304.98,302.46,186366800.0,294.98


Let's make `Date` a regular column instead of the index, and also convert the column names to snake case.

In [3]:
df_spy.reset_index(drop=False, inplace=True)
df_spy.columns = df_spy.columns.str.lower().str.replace(' ', '_')
df_spy.head()

,date,high,low,open,close,volume,adj_close
0,2020-02-28,297.89,285.54,288.7,296.26,384975800.0,288.93
1,2020-03-02,309.16,294.46,298.21,309.09,238703600.0,301.45
2,2020-03-03,313.84,297.57,309.5,300.24,300139100.0,292.82
3,2020-03-04,313.1,303.33,306.12,312.86,176613400.0,305.12
4,2020-03-05,308.47,300.01,304.98,302.46,186366800.0,294.98
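
The two clean-up steps above can be tried without downloading anything. The following is a minimal sketch, assuming a small hand-built `DataFrame` (the name `df_demo` is hypothetical and not part of the tutorial data) that reuses three rows from the output above:

```python
import pandas as pd

# Hand-built stand-in for the downloaded data (three rows from the output above).
df_demo = pd.DataFrame(
    {
        'High': [297.89, 309.16, 313.84],
        'Low': [285.54, 294.46, 297.57],
        'Adj Close': [288.93, 301.45, 292.82],
    },
    index=pd.to_datetime(['2020-02-28', '2020-03-02', '2020-03-03']),
)
df_demo.index.name = 'Date'

# reset_index() turns the index into a regular column named after the index.
df_demo = df_demo.reset_index()

# Lower-case the column names and replace spaces with underscores.
df_demo.columns = df_demo.columns.str.lower().str.replace(' ', '_')

print(df_demo.columns.tolist())
```

Here `reset_index()` is called without `inplace=True` and the result is assigned back to the variable, which has the same effect as the in-place call used above.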


### Exploring a `DataFrame`

We can explore our `df_spy` `DataFrame` in a variety of ways.

First, we can use the built-in `type()` function to confirm that what we have created is in fact a `DataFrame`.

In [4]:
type(df_spy)

pandas.core.frame.DataFrame

Next, we can use the `.dtypes` attribute of the `DataFrame` to see the data type of each column.

In [5]:
df_spy.dtypes

date         datetime64[ns]
high                float64
low                 float64
open                float64
close               float64
volume              float64
adj_close           float64
dtype: object

We can also check the number of rows and columns by using the `.shape` attribute.

In [6]:
df_spy.shape

(23, 7)

As we can see, our `DataFrame` `df_spy` consists of 23 rows and 7 columns.
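
Since `.shape` is an ordinary tuple and `.dtypes` is a `Series` indexed by column name, both can be used programmatically. A minimal sketch, assuming a small hand-built `DataFrame` (`df_demo`, `n_rows`, and `n_cols` are hypothetical names used only for illustration):

```python
import pandas as pd

# Small stand-in frame built from values shown in the outputs above.
df_demo = pd.DataFrame({
    'date': pd.to_datetime(['2020-02-28', '2020-03-02', '2020-03-03']),
    'close': [296.26, 309.09, 300.24],
    'volume': [384975800.0, 238703600.0, 300139100.0],
})

# .shape is a plain tuple, so it can be unpacked directly.
n_rows, n_cols = df_demo.shape
print(n_rows, n_cols)

# len() of a DataFrame gives the number of rows.
print(len(df_demo))

# .dtypes can be indexed by column name to get a single column's type.
print(df_demo.dtypes['close'])
```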

**Code Challenge:** Try the `DataFrame.info()` and `DataFrame.describe()` methods on `df_spy`.

In [7]:
df_spy.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 23 entries, 0 to 22
Data columns (total 7 columns):
 #   Column     Non-Null Count  Dtype         
---  ------     --------------  -----         
 0   date       23 non-null     datetime64[ns]
 1   high       23 non-null     float64       
 2   low        23 non-null     float64       
 3   open       23 non-null     float64       
 4   close      23 non-null     float64       
 5   volume     23 non-null     float64       
 6   adj_close  23 non-null     float64       
dtypes: datetime64[ns](1), float64(6)
memory usage: 1.4 KB


In [8]:
df_spy.describe().round(2)

,high,low,open,close,volume,adj_close
count,23.0,23.0,23.0,23.0,23.0,23.0
mean,272.47,258.7,264.83,266.16,274391000.0,260.07
std,24.93,26.36,25.84,26.98,62390950.0,25.96
min,229.68,218.26,228.19,222.95,171369500.0,218.72
25%,256.26,237.22,243.7,244.97,232080800.0,240.25
50%,263.33,251.05,256.0,261.2,276444100.0,256.24
75%,293.2,279.52,286.67,292.34,317721200.0,285.11
max,313.84,303.33,309.5,312.86,392220700.0,305.12


### `DataFrame` Columns

In order to isolate a particular column we can use square brackets (`[ ]`).  The following code isolates the `close` price column of `df_spy`:

In [9]:
df_spy['close']

0     296.26
1     309.09
2     300.24
3     312.86
4     302.46
       ...  
18    246.79
19    261.20
20    253.42
21    261.65
22    257.75
Name: close, Length: 23, dtype: float64

 


**Code Challenge:** Isolate the `date` column of `df_spy`.

In [10]:
df_spy['date']

0    2020-02-28
1    2020-03-02
2    2020-03-03
3    2020-03-04
4    2020-03-05
        ...    
18   2020-03-25
19   2020-03-26
20   2020-03-27
21   2020-03-30
22   2020-03-31
Name: date, Length: 23, dtype: datetime64[ns]

As we can see from the following code, each column of a `DataFrame` is actually a different kind of `pandas` structure called a `Series`. 

In [11]:
type(df_spy['close'])

pandas.core.series.Series
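
Square brackets also accept a list of column names, in which case the result is a `DataFrame` rather than a `Series`. A minimal sketch, assuming a small hand-built `DataFrame` (`df_demo`, `single`, and `several` are hypothetical names):

```python
import pandas as pd

df_demo = pd.DataFrame({
    'high': [297.89, 309.16],
    'low': [285.54, 294.46],
    'close': [296.26, 309.09],
})

single = df_demo['close']           # one column name -> Series
several = df_demo[['high', 'low']]  # list of column names -> DataFrame

print(type(single).__name__)
print(type(several).__name__)
```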

Here is a bit of `pandas` inside baseball:

- A `DataFrame` is a collection of columns that are glued together.

- Each column is a `Series`.

- A `Series` has two main components: 1) `.values`; 2) `.index`.

- The `.values` component of a `Series` is a `numpy.ndarray`.

Let's look at the `.values` attribute of the `close` column of `df_spy`:

In [12]:
df_spy['close'].values

array([296.26, 309.09, 300.24, 312.86, 302.46, 297.46, 274.23, 288.42,
       274.36, 248.11, 269.32, 239.85, 252.8 , 240.  , 240.51, 228.8 ,
       222.95, 243.15, 246.79, 261.2 , 253.42, 261.65, 257.75])

**Code Challenge:** Verify that the `.values` component of the `close` column of `df_spy` is in fact a `numpy.ndarray`.

In [13]:
type(df_spy['close'].values)

numpy.ndarray
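
Both components of a `Series` can be inspected on a small stand-alone example. This is a minimal sketch, assuming a hand-built `Series` (the name `ser_close` is hypothetical) holding the first three closing prices from above; it also shows the `.to_numpy()` method, which returns the same data as `.values` here:

```python
import numpy as np
import pandas as pd

# First three closing prices from the output above, as a stand-alone Series.
ser_close = pd.Series([296.26, 309.09, 300.24], name='close')

print(ser_close.index)       # the row labels (a default integer index here)
print(ser_close.values)      # the underlying data
print(ser_close.to_numpy())  # the same data via the to_numpy() method

# Check the type of the underlying data.
is_ndarray = isinstance(ser_close.values, np.ndarray)
print(is_ndarray)
```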

### Component-wise Column Operations

We can perform component-wise (i.e. vector-like) calculations with `DataFrame` columns.

The following code divides all the `close` prices by 100.

In [14]:
df_spy['close'] / 100

0     2.9626
1     3.0909
2     3.0024
3     3.1286
4     3.0246
       ...  
18    2.4679
19    2.6120
20    2.5342
21    2.6165
22    2.5775
Name: close, Length: 23, dtype: float64

We can also perform component-wise calculations between two columns.

Let's say we want to calculate the *intraday range* of SPY for each of the trade-dates in `df_spy`; this is the difference between the `high` and the `low` of each day.  We can do this easily from the columns of our `DataFrame`.

In [15]:
df_spy['high'] - df_spy['low']

0     12.35
1     14.70
2     16.27
3      9.77
4      8.46
      ...  
18    16.60
19    13.75
20     9.76
21     8.90
22     7.11
Length: 23, dtype: float64

**Code Challenge:** Calculate the difference between the `close` and `open` columns of `df_spy`.

In [16]:
df_spy['close'] - df_spy['open']

0      7.56
1     10.88
2     -9.26
3      6.74
4     -2.52
      ...  
18     1.92
19    11.68
20     0.15
21     5.95
22    -2.81
Length: 23, dtype: float64
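
Component-wise operations can also be combined in a single expression. As a minimal sketch, assuming a small hand-built `DataFrame` (`df_demo` and `range_pct` are hypothetical names), the following expresses the intraday range as a percentage of the opening price:

```python
import pandas as pd

# Three rows of high, low, and open prices taken from the output above.
df_demo = pd.DataFrame({
    'high': [297.89, 309.16, 313.84],
    'low': [285.54, 294.46, 297.57],
    'open': [288.70, 298.21, 309.50],
})

# Each operation is applied row by row; no explicit loop is needed.
range_pct = (df_demo['high'] - df_demo['low']) / df_demo['open'] * 100
print(range_pct.round(2))
```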

### Adding Columns via Variable Assignment

Let's say we want to save our intraday ranges back into `df_spy` for further analysis later.  The most straightforward way to do this is with variable assignment, as follows:

In [17]:
df_spy['intraday_range'] = df_spy['high'] - df_spy['low']
df_spy.head()

,date,high,low,open,close,volume,adj_close,intraday_range
0,2020-02-28,297.89,285.54,288.7,296.26,384975800.0,288.93,12.35
1,2020-03-02,309.16,294.46,298.21,309.09,238703600.0,301.45,14.7
2,2020-03-03,313.84,297.57,309.5,300.24,300139100.0,292.82,16.27
3,2020-03-04,313.1,303.33,306.12,312.86,176613400.0,305.12,9.77
4,2020-03-05,308.47,300.01,304.98,302.46,186366800.0,294.98,8.46


**Code Challenge:**  Add a new column to `df_spy` called `open_to_close` that consists of the difference between the `close` and `open` of each day.

In [18]:
df_spy['open_to_close'] = df_spy['close'] - df_spy['open']
df_spy.head()

,date,high,low,open,close,volume,adj_close,intraday_range,open_to_close
0,2020-02-28,297.89,285.54,288.7,296.26,384975800.0,288.93,12.35,7.56
1,2020-03-02,309.16,294.46,298.21,309.09,238703600.0,301.45,14.7,10.88
2,2020-03-03,313.84,297.57,309.5,300.24,300139100.0,292.82,16.27,-9.26
3,2020-03-04,313.1,303.33,306.12,312.86,176613400.0,305.12,9.77,6.74
4,2020-03-05,308.47,300.01,304.98,302.46,186366800.0,294.98,8.46,-2.52


### Adding Columns via `.assign()` 

A powerful but less intuitive way of adding a column to a `DataFrame` is the `.assign()` method, which is typically used with `lambda` functions (i.e. anonymous functions).

The following code returns a copy of `df_spy` with an additional column called `intraday_range_assign`.

In [19]:
df_spy.assign(intraday_range_assign = lambda df: df['high'] - df['low'])

,date,high,low,open,close,volume,adj_close,intraday_range,open_to_close,intraday_range_assign
0,2020-02-28,297.89,285.54,288.70,296.26,384975800.0,288.93,12.35,7.56,12.35
1,2020-03-02,309.16,294.46,298.21,309.09,238703600.0,301.45,14.70,10.88,14.70
2,2020-03-03,313.84,297.57,309.50,300.24,300139100.0,292.82,16.27,-9.26,16.27
3,2020-03-04,313.10,303.33,306.12,312.86,176613400.0,305.12,9.77,6.74,9.77
4,2020-03-05,308.47,300.01,304.98,302.46,186366800.0,294.98,8.46,-2.52,8.46
...,...,...,...,...,...,...,...,...,...,...
18,2020-03-25,256.35,239.75,244.87,246.79,299430300.0,242.10,16.60,1.92,16.60
19,2020-03-26,262.80,249.05,249.52,261.20,257632800.0,256.24,13.75,11.68,13.75
20,2020-03-27,260.81,251.05,253.27,253.42,224341200.0,248.61,9.76,0.15,9.76
21,2020-03-30,262.43,253.53,255.70,261.65,171369500.0,256.68,8.90,5.95,8.90


**Code Challenge:** Verify that the column `intraday_range_assign` was not actually added to `df_spy`.

In [20]:
df_spy.head()

,date,high,low,open,close,volume,adj_close,intraday_range,open_to_close
0,2020-02-28,297.89,285.54,288.7,296.26,384975800.0,288.93,12.35,7.56
1,2020-03-02,309.16,294.46,298.21,309.09,238703600.0,301.45,14.7,10.88
2,2020-03-03,313.84,297.57,309.5,300.24,300139100.0,292.82,16.27,-9.26
3,2020-03-04,313.1,303.33,306.12,312.86,176613400.0,305.12,9.77,6.74
4,2020-03-05,308.47,300.01,304.98,302.46,186366800.0,294.98,8.46,-2.52


Because `.assign()` returns a new `DataFrame` rather than modifying the original in place, we need to assign the result back to the variable `df_spy`.

In [21]:
df_spy = df_spy.assign(intraday_range_assign = lambda df: df['high'] - df['low'])
df_spy.head()

,date,high,low,open,close,volume,adj_close,intraday_range,open_to_close,intraday_range_assign
0,2020-02-28,297.89,285.54,288.7,296.26,384975800.0,288.93,12.35,7.56,12.35
1,2020-03-02,309.16,294.46,298.21,309.09,238703600.0,301.45,14.7,10.88,14.7
2,2020-03-03,313.84,297.57,309.5,300.24,300139100.0,292.82,16.27,-9.26,16.27
3,2020-03-04,313.1,303.33,306.12,312.86,176613400.0,305.12,9.77,6.74,9.77
4,2020-03-05,308.47,300.01,304.98,302.46,186366800.0,294.98,8.46,-2.52,8.46
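
The copy-versus-original behavior of `.assign()` can be checked directly. A minimal sketch, assuming a small hand-built `DataFrame` (`df_demo` and `df_new` are hypothetical names):

```python
import pandas as pd

df_demo = pd.DataFrame({'high': [297.89, 309.16], 'low': [285.54, 294.46]})

# .assign() returns a new DataFrame; df_demo itself is left unchanged.
df_new = df_demo.assign(intraday_range=lambda df: df['high'] - df['low'])

print('intraday_range' in df_demo.columns)  # checks the original
print('intraday_range' in df_new.columns)   # checks the result of .assign()
```

Only `df_new` contains the new column, which is why the result has to be assigned to a variable if we want to keep it.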


**Code Challenge:** Use `.assign()` to create a new column in `df_spy`, call it `open_to_close_assign`, that contains the difference between the `close` and `open`.

In [22]:
df_spy = df_spy.assign(open_to_close_assign = lambda df: df['close'] - df['open'])
df_spy.head()

,date,high,low,open,close,volume,adj_close,intraday_range,open_to_close,intraday_range_assign,open_to_close_assign
0,2020-02-28,297.89,285.54,288.7,296.26,384975800.0,288.93,12.35,7.56,12.35,7.56
1,2020-03-02,309.16,294.46,298.21,309.09,238703600.0,301.45,14.7,10.88,14.7,10.88
2,2020-03-03,313.84,297.57,309.5,300.24,300139100.0,292.82,16.27,-9.26,16.27,-9.26
3,2020-03-04,313.1,303.33,306.12,312.86,176613400.0,305.12,9.77,6.74,9.77,6.74
4,2020-03-05,308.47,300.01,304.98,302.46,186366800.0,294.98,8.46,-2.52,8.46,-2.52


### Method Chaining

The value of `.assign()` becomes clear when we start *chaining* methods together.

In order to see this, let's first `.drop()` the columns that we created.

In [23]:
lst_cols = ['intraday_range', 'open_to_close', 'intraday_range_assign', 'open_to_close_assign']
df_spy.drop(columns=lst_cols, inplace=True)
df_spy.head()

,date,high,low,open,close,volume,adj_close
0,2020-02-28,297.89,285.54,288.7,296.26,384975800.0,288.93
1,2020-03-02,309.16,294.46,298.21,309.09,238703600.0,301.45
2,2020-03-03,313.84,297.57,309.5,300.24,300139100.0,292.82
3,2020-03-04,313.1,303.33,306.12,312.86,176613400.0,305.12
4,2020-03-05,308.47,300.01,304.98,302.46,186366800.0,294.98


The following code adds the `intraday_range` and `open_to_close` columns:

In [24]:
df_spy = \
    (
    df_spy
        .assign(intraday_range = lambda df: df['high'] - df['low'])
        .assign(open_to_close = lambda df: df['close'] - df['open'])
    )
df_spy.head()

,date,high,low,open,close,volume,adj_close,intraday_range,open_to_close
0,2020-02-28,297.89,285.54,288.7,296.26,384975800.0,288.93,12.35,7.56
1,2020-03-02,309.16,294.46,298.21,309.09,238703600.0,301.45,14.7,10.88
2,2020-03-03,313.84,297.57,309.5,300.24,300139100.0,292.82,16.27,-9.26
3,2020-03-04,313.1,303.33,306.12,312.86,176613400.0,305.12,9.77,6.74
4,2020-03-05,308.47,300.01,304.98,302.46,186366800.0,294.98,8.46,-2.52


**Code Challenge:** Use `.assign()` to add two new columns to `df_spy`:

1. the difference between the `close` and `adj_close`
1. the average of the `low` and `open`

In [25]:
df_spy = \
    (
    df_spy
        .assign(div = lambda df: df['close'] - df['adj_close'])
        .assign(avg = lambda df: (df['low'] + df['open']) / 2)
    )
df_spy.head()

,date,high,low,open,close,volume,adj_close,intraday_range,open_to_close,div,avg
0,2020-02-28,297.89,285.54,288.7,296.26,384975800.0,288.93,12.35,7.56,7.33,287.12
1,2020-03-02,309.16,294.46,298.21,309.09,238703600.0,301.45,14.7,10.88,7.64,296.335
2,2020-03-03,313.84,297.57,309.5,300.24,300139100.0,292.82,16.27,-9.26,7.42,303.535
3,2020-03-04,313.1,303.33,306.12,312.86,176613400.0,305.12,9.77,6.74,7.74,304.725
4,2020-03-05,308.47,300.01,304.98,302.46,186366800.0,294.98,8.46,-2.52,7.48,302.495
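
As an alternative to chaining several `.assign()` calls, a single call can take several keyword arguments. The following is a minimal sketch, assuming a reasonably recent version of `pandas` (in which the columns are created in the order given, so a later `lambda` can use a column defined earlier in the same call) and a small hand-built `DataFrame`; `df_demo` and the column `range_to_move` are hypothetical names used only for illustration:

```python
import pandas as pd

df_demo = pd.DataFrame({
    'high': [297.89, 309.16],
    'low': [285.54, 294.46],
    'open': [288.70, 298.21],
    'close': [296.26, 309.09],
})

# One .assign() call with several keyword arguments. The columns are created
# in order, so range_to_move can refer to the two columns defined before it.
df_demo = df_demo.assign(
    intraday_range=lambda df: df['high'] - df['low'],
    open_to_close=lambda df: df['close'] - df['open'],
    range_to_move=lambda df: df['intraday_range'] / df['open_to_close'].abs(),
)
print(df_demo.columns.tolist())
```

Whether to use one call or a chain of calls is largely a matter of readability; the chained form used above keeps each step on its own line.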


### Aggregating Calculations on `Series`

A `Series` has a variety of built-in aggregation methods.

For example, we can use the following code to calculate the total SPY volume across the trade dates in `df_spy` (February 28 through March 31, 2020):

In [26]:
df_spy['volume'].sum()

6310993400.0

Here are some summary statistics on the `intraday_range` column that we added to our `DataFrame` earlier.

In [27]:
print("Mean: ", df_spy['intraday_range'].mean()) # average
print("St Dev:", df_spy['intraday_range'].std()) # standard deviation
print("Min:" , df_spy['intraday_range'].min()) # minimum
print("Max:" , df_spy['intraday_range'].max()) # maximum

Mean:  13.774782608695652
St Dev: 4.430055273167614
Min: 7.109999999999957
Max: 22.960000000000008


**Code Challenge:** Calculate the average daily volume for the trade dates in `df_spy`.

In [28]:
df_spy['volume'].mean()

274391017.3913044
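
Several aggregations can also be computed in one step with the `.agg()` method, which accepts a list of aggregation names and returns a `Series` indexed by those names. A minimal sketch, assuming a hand-built `Series` (`ser_range` and `summary` are hypothetical names) containing the first five intraday ranges from above:

```python
import pandas as pd

# First five intraday ranges from the output above.
ser_range = pd.Series([12.35, 14.70, 16.27, 9.77, 8.46], name='intraday_range')

# Passing a list of names computes each aggregation and labels the results.
summary = ser_range.agg(['mean', 'std', 'min', 'max'])
print(summary)
```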

## Related Reading

*PDSH* - Section 3.1 - Introducing Pandas Objects

*PDSH* - Section 2.1 - Understanding Data Types in Python

*PDSH* - Section 2.2 - The Basics of NumPy Arrays

*PDSH* - Section 2.3 - Computation on NumPy Arrays: Universal Functions

*PDSH* - Section 2.4 - Aggregations: Min, Max, and Everything In Between