# `DataFrame` Basics

In this tutorial we cover the basics of working with `DataFrames` in `pandas`.

### Importing Packages

Let's begin by importing the packages that we will need.

In [1]:
##> import pandas as pd
##> import pandas_datareader as pdr
##> pd.set_option('display.max_rows', 10)




### Reading-In Data

Next, let's use `pandas_datareader` to read in SPY prices from 2020-02-28 (the last trading day of February) through 2020-03-31.  SPY is an ETF that tracks the S&P 500 index.

In [2]:
##> df_spy = pdr.get_data_yahoo('SPY', start='2020-02-28', end='2020-03-31')
##> df_spy = df_spy.round(2)
##> df_spy.head()




Let's make the date a regular column, instead of the index, and also convert the column names to snake case.

In [3]:
##> df_spy.reset_index(drop=False, inplace=True)
##> df_spy.columns = df_spy.columns.str.lower().str.replace(' ', '_')
##> df_spy.head()
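
The two clean-up steps above can be tried without a network connection. The following is a minimal sketch on a made-up two-row frame; `df_toy` and its prices are hypothetical stand-ins for the downloaded data, not actual SPY quotes.

```python
import pandas as pd

# Hypothetical prices standing in for the downloaded SPY data.
df_toy = pd.DataFrame(
    {
        "High": [297.89, 309.16],
        "Low": [285.54, 294.46],
        "Adj Close": [280.12, 292.25],
    },
    index=pd.DatetimeIndex(["2020-02-28", "2020-03-02"], name="Date"),
)

# Move the index into a regular column (drop=False keeps its values).
df_toy = df_toy.reset_index(drop=False)

# Lower-case each column name and replace spaces with underscores.
df_toy.columns = df_toy.columns.str.lower().str.replace(" ", "_")

print(list(df_toy.columns))
```

After `reset_index()`, the rows are labeled by a default integer index and the former index appears as the first column.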




### Exploring a `DataFrame`

We can explore our `df_spy` `DataFrame` in a variety of ways.

First, we can use the built-in `type()` function to make sure that what we have created is in fact a `DataFrame`.

In [4]:
##> type(df_spy)



Next, we can use the `.dtypes` attribute of the `DataFrame` to see the data types of each of the columns.

In [5]:
##> df_spy.dtypes



We can also check the number of rows and columns by using the `.shape` attribute.

In [6]:
##> df_spy.shape



As we can see, our `DataFrame` `df_spy` consists of 23 rows and 7 columns.
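
As a quick offline illustration of the same three checks, here is a small sketch on a hypothetical frame (`df_toy`, with invented values). It also shows that `.shape` is a plain `(rows, columns)` tuple that can be unpacked.

```python
import pandas as pd

# Hypothetical two-row frame; the values are made up for illustration.
df_toy = pd.DataFrame(
    {
        "date": pd.to_datetime(["2020-03-02", "2020-03-03"]),
        "close": [309.09, 300.24],
        "volume": [238703600, 300139100],
    }
)

print(type(df_toy))            # the container class
print(df_toy.dtypes)           # one dtype per column
n_rows, n_cols = df_toy.shape  # shape is a (rows, columns) tuple
print(n_rows, n_cols)
```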

**Code Challenge:** Try the `DataFrame.info()` and `DataFrame.describe()` methods on `df_spy`.

### `DataFrame` Columns

In order to isolate a particular column we can use square brackets (`[ ]`).  The following code isolates the `close` price column of `df_spy`:

In [7]:
##> df_spy['close']



 


**Code Challenge:** Isolate the `date` column of `df_spy`.

As we can see from the following code, each column of a `DataFrame` is actually a different kind of `pandas` structure called a `Series`. 

In [8]:
##> type(df_spy['close'])




Here is a bit of `pandas` inside baseball:

- A `DataFrame` is a collection of columns that are glued together.

- Each column is a `Series`.

- A `Series` has two main components: 1) `.values`; 2) `.index`.

- The `.values` component of a `Series` is a `numpy.ndarray`.

Let's look at the `.values` attribute of the `close` column of `df_spy`:

In [9]:
##> df_spy['close'].values
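
The two components of a `Series` can be inspected side by side. This is a minimal sketch on a hypothetical column (`df_toy` and its prices are invented). It also shows `.to_numpy()`, which the `pandas` documentation recommends over `.values` for getting a NumPy array.

```python
import pandas as pd

# Hypothetical closing prices, for illustration only.
df_toy = pd.DataFrame({"close": [309.09, 300.24, 312.86]})

ser_close = df_toy["close"]   # selecting one column yields a Series
print(ser_close.index)        # the row labels
print(ser_close.values)       # the underlying data
print(ser_close.to_numpy())   # same data via the recommended accessor
```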




**Code Challenge:** Verify that the `.values` component of the `close` column of `df_spy` is in fact a `numpy.ndarray`.

### Component-wise Column Operations

We can perform component-wise (i.e. vector-like) calculations with `DataFrame` columns.

The following code divides all the `close` prices by 100.

In [10]:
##> df_spy['close'] / 100




We can also perform component-wise calculations between two columns.

Let's say we want to calculate the *intraday range* of SPY for each of the trade-dates in `df_spy`; this is the difference between the `high` and the `low` of each day.  We can do this easily from the columns of our `DataFrame`.

In [11]:
##> df_spy['high'] - df_spy['low']
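
To make the element-by-element behavior concrete, here is a small sketch with hypothetical numbers (`df_toy` is invented and chosen so the arithmetic is easy to check by hand).

```python
import pandas as pd

# Hypothetical highs and lows, picked for easy mental arithmetic.
df_toy = pd.DataFrame(
    {"high": [310.0, 305.0, 313.5], "low": [300.0, 297.5, 303.0]}
)

scaled = df_toy["high"] / 100         # the scalar is applied to every element
rng = df_toy["high"] - df_toy["low"]  # elements are paired row by row

print(rng.tolist())  # [10.0, 7.5, 10.5]
```

Neither operation modifies `df_toy`; each returns a new `Series`.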




**Code Challenge:** Calculate the difference between the `close` and `open` columns of `df_spy`.

### Adding Columns via Variable Assignment

Let's say we want to save our intraday ranges back into `df_spy` for further analysis later.  The most straightforward way to do this is to use variable assignment as follows:

In [12]:
##> df_spy['intraday_range'] = df_spy['high'] - df_spy['low']
##> df_spy.head()




**Code Challenge:**  Add a new column to `df_spy` called `open_to_close` that consists of the difference between the `close` and `open` of each day.

### Adding Columns via `.assign()` 

A powerful, but less intuitive, way of adding a column to a `DataFrame` is the `.assign()` method, which makes use of `lambda` functions (i.e. anonymous functions).

The following code adds another column called `intraday_range_assign`.

In [13]:
##> df_spy.assign(intraday_range_assign = lambda df: df['high'] - df['low'])




**Code Challenge:** Verify that the column `intraday_range_assign` was not actually added to `df_spy`.

In order to modify the original `DataFrame` we will need to reassign to the variable.

In [14]:
##> df_spy = df_spy.assign(intraday_range_assign = lambda df: df['high'] - df['low'])
##> df_spy.head()
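
The reason reassignment is needed is that `.assign()` returns a new `DataFrame` rather than modifying the one it is called on. Here is a minimal sketch of that behavior on a hypothetical frame (`df_toy` and `df_new` are illustrative names).

```python
import pandas as pd

# Hypothetical highs and lows, for illustration only.
df_toy = pd.DataFrame({"high": [310.0, 305.0], "low": [300.0, 297.5]})

# .assign() builds and returns a new DataFrame; df_toy is left unchanged.
df_new = df_toy.assign(intraday_range=lambda df: df["high"] - df["low"])

print("intraday_range" in df_toy.columns)  # False
print("intraday_range" in df_new.columns)  # True
```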




**Code Challenge:** Use `.assign()` to create a new column in `df_spy`, call it `open_to_close_assign`, that contains the difference between the `close` and `open`.

### Method Chaining

The value of `.assign()` becomes clear when we start *chaining* methods together.

In order to see this, let's first `drop` the columns that we created:

In [15]:
##> lst_cols = ['intraday_range', 'open_to_close', 'intraday_range_assign', 'open_to_close_assign']
##> df_spy.drop(columns=lst_cols, inplace=True)
##> df_spy.head()




The following code adds the `intraday_range` and `open_to_close` columns:

In [16]:
##> df_spy = \
##>     (
##>     df_spy
##>         .assign(intraday_range = lambda df: df['high'] - df['low'])
##>         .assign(open_to_close = lambda df: df['close'] - df['open'])
##>     )
##> df_spy.head()
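
As an alternative to chaining several `.assign()` calls, a single call accepts multiple keyword arguments, and a later argument may refer to a column created by an earlier one in the same call. The sketch below uses a hypothetical frame. The `range_pct` column is invented purely to show that dependency.

```python
import pandas as pd

# Hypothetical prices, for illustration only.
df_toy = pd.DataFrame(
    {
        "open": [300.0, 298.0],
        "high": [310.0, 305.0],
        "low": [300.0, 297.5],
        "close": [308.0, 300.0],
    }
)

df_toy = df_toy.assign(
    intraday_range=lambda df: df["high"] - df["low"],
    open_to_close=lambda df: df["close"] - df["open"],
    # range_pct is a hypothetical column; it uses intraday_range from above.
    range_pct=lambda df: df["intraday_range"] / df["close"],
)

print(df_toy.columns.tolist())
```

Whether to use one call or several chained calls is largely a matter of readability.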




**Code Challenge:** Use `.assign()` to add two new columns to `df_spy`:

1. the difference between the `close` and `adj_close`
1. the average of the `low` and `open`

### Aggregating Calculations on `Series`

`Series` have a variety of built-in aggregation functions.

For example, we can use the following code to calculate the total SPY volume over the trade dates in `df_spy`:

In [18]:
##> df_spy['volume'].sum()




Here are some summary statistics on the `intraday_range` column that we added to our `DataFrame` earlier.

In [19]:
##> print("Mean:", df_spy['intraday_range'].mean())   # average
##> print("St Dev:", df_spy['intraday_range'].std())  # standard deviation
##> print("Min:", df_spy['intraday_range'].min())     # minimum
##> print("Max:", df_spy['intraday_range'].max())     # maximum
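
Several aggregations can also be requested at once with `.agg()`. Here is a minimal sketch on a hypothetical `Series` (`ser_range` and its values are invented). Note that `.std()` in `pandas` computes the sample standard deviation (`ddof=1`) by default.

```python
import pandas as pd

# Hypothetical intraday ranges, for illustration only.
ser_range = pd.Series([10.0, 7.5, 10.5, 12.0], name="intraday_range")

print(ser_range.sum())  # 40.0

# Passing a list of function names returns a Series of results,
# labeled by the function names.
stats = ser_range.agg(["mean", "std", "min", "max"])
print(stats)
```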




**Code Challenge:** Calculate the average daily volume for the trade dates in `df_spy`.

## Related Reading

*PDSH* - Section 3.1 - Introducing Pandas Objects

*PDSH* - Section 2.1 - Understanding Data Types in Python

*PDSH* - Section 2.2 - The Basics of NumPy Arrays

*PDSH* - Section 2.3 - Computation on NumPy Arrays: Universal Functions

*PDSH* - Section 2.4 - Aggregations: Min, Max, and Everything In Between