# `DataFrame` Indexing and Slicing

Accessing specific rows of a `DataFrame` by their location is referred to as *indexing*.

If you are accessing a sequence of contiguous rows, this action is sometimes called *slicing*.

The purpose of this tutorial is to survey various methods for indexing and slicing in `pandas`.

### Importing Packages

Let's begin by importing the packages that we will need.

In [1]:
import numpy as np
import pandas as pd
import pandas_datareader as pdr

### Reading-In Data

Next, let's grab some data from Yahoo Finance.  In particular, we'll grab `SPY` price data for July 2021; the request starts on June 30, 2021, so the last trading day of June is included as well.

In [2]:
df_spy = pdr.get_data_yahoo('SPY', start='2021-06-30', end='2021-07-31')
df_spy = df_spy.round(2)
df_spy.head()

Date,High,Low,Open,Close,Volume,Adj Close
2021-06-30,428.78,427.18,427.21,428.06,64827900,428.06
2021-07-01,430.6,428.8,428.87,430.43,53441000,430.43
2021-07-02,434.1,430.52,431.67,433.72,57697700,433.72
2021-07-06,434.01,430.01,433.78,432.93,68710400,432.93
2021-07-07,434.76,431.51,433.66,434.46,63549500,434.46
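
If the download is not available, the slicing examples can still be tried on a small stand-in. The sketch below is only an assumption about how one might do that: it types in the five rows shown above by hand, and the name `df_demo` is hypothetical rather than part of the tutorial.

```python
import pandas as pd

# Hypothetical offline stand-in: the five rows printed above, typed in by hand.
df_demo = pd.DataFrame(
    {
        "High": [428.78, 430.60, 434.10, 434.01, 434.76],
        "Low": [427.18, 428.80, 430.52, 430.01, 431.51],
        "Open": [427.21, 428.87, 431.67, 433.78, 433.66],
        "Close": [428.06, 430.43, 433.72, 432.93, 434.46],
        "Volume": [64827900, 53441000, 57697700, 68710400, 63549500],
        "Adj Close": [428.06, 430.43, 433.72, 432.93, 434.46],
    },
    index=pd.to_datetime(
        ["2021-06-30", "2021-07-01", "2021-07-02", "2021-07-06", "2021-07-07"]
    ),
)
df_demo.index.name = "Date"  # match the index name in the output above

print(df_demo)
```

A stand-in like this has only five rows, so examples that reach further into the month will return fewer rows than the outputs shown in this tutorial.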


The following code resets the index so that the dates become a regular column; it also converts the column names to snake case, so `Date` becomes `date` and `Adj Close` becomes `adj_close`.

In [3]:
df_spy.reset_index(inplace=True)
df_spy.columns = df_spy.columns.str.lower().str.replace(' ', '_')
df_spy.head()

Unnamed: 0,date,high,low,open,close,volume,adj_close
0,2021-06-30,428.78,427.18,427.21,428.06,64827900,428.06
1,2021-07-01,430.6,428.8,428.87,430.43,53441000,430.43
2,2021-07-02,434.1,430.52,431.67,433.72,57697700,433.72
3,2021-07-06,434.01,430.01,433.78,432.93,68710400,432.93
4,2021-07-07,434.76,431.51,433.66,434.46,63549500,434.46
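
The cell above modifies `df_spy` in place. As a rough sketch of an alternative, `reset_index()` can also be called without `inplace=True`, in which case it returns a new `DataFrame` and leaves the original untouched. The frame `df_raw` below is a hypothetical two-row example, not the tutorial's data set.

```python
import pandas as pd

# Hypothetical two-row frame with a named index and mixed-case column names.
df_raw = pd.DataFrame(
    {"High": [428.78, 430.60], "Adj Close": [428.06, 430.43]},
    index=pd.DatetimeIndex(["2021-06-30", "2021-07-01"], name="Date"),
)

# Without inplace=True, reset_index() returns a new DataFrame.
df_clean = df_raw.reset_index()
df_clean.columns = df_clean.columns.str.lower().str.replace(" ", "_")

print(list(df_clean.columns))  # the cleaned-up names, including 'date'
print(list(df_raw.columns))    # the original frame still has its old names
```

Which style to use is largely a matter of taste; keeping the original frame around can make it easier to re-run a cell without re-downloading the data.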


It is often useful to look at the data type of each of the columns of a new data set.  We can do so with the `DataFrame.dtypes` attribute.

In [4]:
df_spy.dtypes

date         datetime64[ns]
high                float64
low                 float64
open                float64
close               float64
volume                int64
adj_close           float64
dtype: object

### Row Slicing

The simplest way to slice a `DataFrame` is to use square brackets: `[]`.  The syntax `df[i:j]` will generate a `DataFrame` whose first row is the `i`th row of `df` and whose last row is the `(j-1)`th row of `df`, where rows are counted starting from `0`.  Let's demonstrate this with some examples:

Starting from the 0th row, and ending with the 0th row:

In [5]:
df_spy[0:1]

Unnamed: 0,date,high,low,open,close,volume,adj_close
0,2021-06-30,428.78,427.18,427.21,428.06,64827900,428.06


Starting with the 3rd row, and ending with the 6th row:

In [6]:
df_spy[3:7]

Unnamed: 0,date,high,low,open,close,volume,adj_close
3,2021-07-06,434.01,430.01,433.78,432.93,68710400,432.93
4,2021-07-07,434.76,431.51,433.66,434.46,63549500,434.46
5,2021-07-08,431.73,427.52,428.78,430.92,97595200,430.92
6,2021-07-09,435.84,430.71,432.53,435.52,76238600,435.52


**Code Challenge:** Retrieve the 15th, 16th, and 17th rows of `df_spy`.

In [7]:
df_spy[15:18]

Unnamed: 0,date,high,low,open,close,volume,adj_close
15,2021-07-22,435.72,433.69,434.74,435.46,47878500,435.46
16,2021-07-23,440.3,436.79,437.52,439.94,63766600,439.94
17,2021-07-26,441.03,439.26,439.31,441.02,43719200,441.02


The syntax `df[:n]` automatically starts the slice at row `0`, so it retrieves the first `n` rows of `df`.  For example, the following code retrieves all of `df_spy` (notice that `len(df_spy)` gives the number of rows of `df_spy`):

In [8]:
df_spy[:len(df_spy)]

Unnamed: 0,date,high,low,open,close,volume,adj_close
0,2021-06-30,428.78,427.18,427.21,428.06,64827900,428.06
1,2021-07-01,430.6,428.8,428.87,430.43,53441000,430.43
2,2021-07-02,434.1,430.52,431.67,433.72,57697700,433.72
3,2021-07-06,434.01,430.01,433.78,432.93,68710400,432.93
4,2021-07-07,434.76,431.51,433.66,434.46,63549500,434.46
5,2021-07-08,431.73,427.52,428.78,430.92,97595200,430.92
6,2021-07-09,435.84,430.71,432.53,435.52,76238600,435.52
7,2021-07-12,437.35,434.97,435.43,437.08,52889600,437.08
8,2021-07-13,437.84,435.31,436.24,435.59,52911300,435.59
9,2021-07-14,437.92,434.91,437.4,436.24,64130400,436.24


**Code Challenge:** Retrieve the first five rows of `df_spy`.

In [9]:
df_spy[:5]

Unnamed: 0,date,high,low,open,close,volume,adj_close
0,2021-06-30,428.78,427.18,427.21,428.06,64827900,428.06
1,2021-07-01,430.6,428.8,428.87,430.43,53441000,430.43
2,2021-07-02,434.1,430.52,431.67,433.72,57697700,433.72
3,2021-07-06,434.01,430.01,433.78,432.93,68710400,432.93
4,2021-07-07,434.76,431.51,433.66,434.46,63549500,434.46


There are a couple of row slicing tricks that involve negative numbers that are worth mentioning.

The syntax `df[-n:]` retrieves the last `n` rows of `df`.  The following code retrieves the last five rows of `df_spy`.

In [10]:
df_spy[-5:]

Unnamed: 0,date,high,low,open,close,volume,adj_close
17,2021-07-26,441.03,439.26,439.31,441.02,43719200,441.02
18,2021-07-27,439.94,435.99,439.91,439.01,67397100,439.01
19,2021-07-28,440.3,437.31,439.68,438.83,52472400,438.83
20,2021-07-29,441.8,439.81,439.82,440.65,47435300,440.65
21,2021-07-30,440.06,437.77,437.91,438.51,68890600,438.51


The syntax `df[:-n]` retrieves all but the last `n` rows of `df`.  The following code retrieves all but the last 10 rows of `df_spy`:

In [11]:
df_spy[:-10]

Unnamed: 0,date,high,low,open,close,volume,adj_close
0,2021-06-30,428.78,427.18,427.21,428.06,64827900,428.06
1,2021-07-01,430.6,428.8,428.87,430.43,53441000,430.43
2,2021-07-02,434.1,430.52,431.67,433.72,57697700,433.72
3,2021-07-06,434.01,430.01,433.78,432.93,68710400,432.93
4,2021-07-07,434.76,431.51,433.66,434.46,63549500,434.46
5,2021-07-08,431.73,427.52,428.78,430.92,97595200,430.92
6,2021-07-09,435.84,430.71,432.53,435.52,76238600,435.52
7,2021-07-12,437.35,434.97,435.43,437.08,52889600,437.08
8,2021-07-13,437.84,435.31,436.24,435.59,52911300,435.59
9,2021-07-14,437.92,434.91,437.4,436.24,64130400,436.24


**Code Challenge:** Retrieve the first row of `df_spy` with negative indexing.

In [12]:
df_spy[:-(len(df_spy)-1)]

Unnamed: 0,date,high,low,open,close,volume,adj_close
0,2021-06-30,428.78,427.18,427.21,428.06,64827900,428.06


**Code Challenge:** Use simple slicing to select the last three rows of `df_spy` without explicitly using row numbers.

In [13]:
df_spy[len(df_spy)-3:len(df_spy)]

Unnamed: 0,date,high,low,open,close,volume,adj_close
19,2021-07-28,440.3,437.31,439.68,438.83,52472400,438.83
20,2021-07-29,441.8,439.81,439.82,440.65,47435300,440.65
21,2021-07-30,440.06,437.77,437.91,438.51,68890600,438.51


In [14]:
df_spy[-3:]

Unnamed: 0,date,high,low,open,close,volume,adj_close
19,2021-07-28,440.3,437.31,439.68,438.83,52472400,438.83
20,2021-07-29,441.8,439.81,439.82,440.65,47435300,440.65
21,2021-07-30,440.06,437.77,437.91,438.51,68890600,438.51
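
The slicing patterns in this section can be collected in one place. The sketch below uses a hypothetical six-row frame named `df_demo` (not the tutorial's data) and also compares two of the slices with the `DataFrame.head()` and `DataFrame.tail()` methods, which select the same rows.

```python
import pandas as pd

# Hypothetical frame with six rows, labelled 0..5 by the default index.
df_demo = pd.DataFrame({"x": [10, 11, 12, 13, 14, 15]})

middle = df_demo[1:3]        # rows 1 and 2; the stop position 3 is excluded
first_two = df_demo[:2]      # the first two rows
last_two = df_demo[-2:]      # the last two rows
all_but_two = df_demo[:-2]   # everything except the last two rows

print(list(middle["x"]))
print(first_two.equals(df_demo.head(2)))  # same rows as head(2)
print(last_two.equals(df_demo.tail(2)))   # same rows as tail(2)
print(len(all_but_two))
```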


### `DataFrame` Indexes

Under the hood, a `DataFrame` has several `indexes`:

`columns` - the set of column names is an (explicit) index.

`row` - whenever a `DataFrame` is created, there is an explicit row index that is created.  If one isn't specified, then a sequence of non-negative integers is used.

`implicit` - each row has an implicit row-number, and each column has an implicit column-number.

Let's take a look at the `columns` index of `df_spy`:

In [15]:
df_spy.columns

Index(['date', 'high', 'low', 'open', 'close', 'volume', 'adj_close'], dtype='object')

In [16]:
type(df_spy.columns)

pandas.core.indexes.base.Index

Next, let's take a look at the explicit row `index` attribute of `df_spy`:

In [17]:
df_spy.index

RangeIndex(start=0, stop=22, step=1)

In [18]:
type(df_spy.index)

pandas.core.indexes.range.RangeIndex

Since we reset the index for `df_spy`, a `RangeIndex` object is used for the explicit row `index`.  You can think of a `RangeIndex` object as a glorified set of consecutive integers.
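
To make this concrete, here is a small sketch on a hypothetical frame (`df_demo` and `df_labeled` are illustrative names only). It shows that a frame built without a row index gets a `RangeIndex`, while one built with explicit labels does not.

```python
import pandas as pd

# Hypothetical frame built without specifying a row index.
df_demo = pd.DataFrame({"x": [10, 11, 12]})

print(df_demo.index)                             # a RangeIndex
print(list(df_demo.index))                       # the integer labels it represents
print(isinstance(df_demo.index, pd.RangeIndex))

# Supplying labels explicitly gives a regular Index instead of a RangeIndex.
df_labeled = pd.DataFrame({"x": [10, 11, 12]}, index=["a", "b", "c"])
print(isinstance(df_labeled.index, pd.RangeIndex))
```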

For the most part, we won't be too concerned with `indexes`.  A lot of data analysis can be done without worrying about them.  However, it's good to be aware that `indexes` exist because they can come into play for more advanced topics, such as joining tables together; they also come up frequently in Stack Overflow examples.

For the purposes of this tutorial, our interest in `indexes` comes from how they are related to two built-in `DataFrame` *indexers*: `DataFrame.iloc` and `DataFrame.loc`.

### Indexing with `DataFrame.iloc`

The indexer `DataFrame.iloc` can be used to access rows and columns using their implicit row and column numbers.

Here is an example of `iloc` that retrieves the first two rows of `df_spy`:

In [19]:
df_spy.iloc[0:2,]

Unnamed: 0,date,high,low,open,close,volume,adj_close
0,2021-06-30,428.78,427.18,427.21,428.06,64827900,428.06
1,2021-07-01,430.6,428.8,428.87,430.43,53441000,430.43


Notice that, because we didn't specify any column numbers, the code above retrieves all columns.

The following code grabs the first three rows and the first three columns of `df_spy`:

In [20]:
df_spy.iloc[0:3, 0:3]

Unnamed: 0,date,high,low
0,2021-06-30,428.78,427.18
1,2021-07-01,430.6,428.8
2,2021-07-02,434.1,430.52


We can also supply `.iloc` with `lists` rather than slices to specify custom sets of columns and rows:

In [21]:
lst_row = [0, 2] # 0th and 2nd row
lst_col = [0, 6] # date and adj_close columns
df_spy.iloc[lst_row, lst_col]

Unnamed: 0,date,adj_close
0,2021-06-30,428.06
2,2021-07-02,433.72


Using `lists` as a means of indexing is sometimes referred to as *fancy indexing*.

**Code Challenge:** Use fancy indexing to grab the 14th, 0th, and 5th rows of `df_spy` - in that order.

In [22]:
df_spy.iloc[[14, 0, 5]]

Unnamed: 0,date,high,low,open,close,volume,adj_close
14,2021-07-21,434.7,431.01,432.34,434.55,64724400,434.55
0,2021-06-30,428.78,427.18,427.21,428.06,64827900,428.06
5,2021-07-08,431.73,427.52,428.78,430.92,97595200,430.92
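
One detail of `.iloc` that the examples above do not show is what comes back when a single integer is passed instead of a slice or a list. The sketch below, on a hypothetical three-row frame `df_demo`, illustrates the difference, along with negative positions and single-cell access.

```python
import pandas as pd

# Hypothetical frame with three rows and two columns.
df_demo = pd.DataFrame({"a": [1, 2, 3], "b": [4.0, 5.0, 6.0]})

row_as_series = df_demo.iloc[0]    # single integer: the row comes back as a Series
row_as_frame = df_demo.iloc[[0]]   # one-element list: a DataFrame with one row
last_row = df_demo.iloc[-1]        # negative positions count from the end
cell = df_demo.iloc[1, 0]          # row 1, column 0: a single value

print(type(row_as_series).__name__)
print(type(row_as_frame).__name__)
print(last_row["b"])
print(cell)
```

Wrapping a single position in a list is a common way to keep the result as a `DataFrame` when later code expects one.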


### Indexing with `DataFrame.loc`

Rather than using the implicit row or column numbers, it is often more useful to access data by using the explicit row or column indices.

Let's use the `DataFrame.set_index()` method to set the `date` column as our new index.  The dates will be a more interesting explicit index.

In [23]:
df_spy.set_index('date', inplace=True)
df_spy.head()

date,high,low,open,close,volume,adj_close
2021-06-30,428.78,427.18,427.21,428.06,64827900,428.06
2021-07-01,430.6,428.8,428.87,430.43,53441000,430.43
2021-07-02,434.1,430.52,431.67,433.72,57697700,433.72
2021-07-06,434.01,430.01,433.78,432.93,68710400,432.93
2021-07-07,434.76,431.51,433.66,434.46,63549500,434.46


To see the effect of the above code, we can have a look at the `index` of `df_spy`.

In [24]:
df_spy.index

DatetimeIndex(['2021-06-30', '2021-07-01', '2021-07-02', '2021-07-06',
               '2021-07-07', '2021-07-08', '2021-07-09', '2021-07-12',
               '2021-07-13', '2021-07-14', '2021-07-15', '2021-07-16',
               '2021-07-19', '2021-07-20', '2021-07-21', '2021-07-22',
               '2021-07-23', '2021-07-26', '2021-07-27', '2021-07-28',
               '2021-07-29', '2021-07-30'],
              dtype='datetime64[ns]', name='date', freq=None)

And notice that `date` is no longer a column of `df_spy`:

In [25]:
df_spy.columns

Index(['high', 'low', 'open', 'close', 'volume', 'adj_close'], dtype='object')

Now that we have successfully set the row `index` of `df_spy` to be the `date`, let's see how we can use this `index` to access the data via `.loc`.

Here is an example of how we can grab a slice of rows associated with a date range:

In [26]:
df_spy.loc['2021-07-23':'2021-07-31']

date,high,low,open,close,volume,adj_close
2021-07-23,440.3,436.79,437.52,439.94,63766600,439.94
2021-07-26,441.03,439.26,439.31,441.02,43719200,441.02
2021-07-27,439.94,435.99,439.91,439.01,67397100,439.01
2021-07-28,440.3,437.31,439.68,438.83,52472400,438.83
2021-07-29,441.8,439.81,439.82,440.65,47435300,440.65
2021-07-30,440.06,437.77,437.91,438.51,68890600,438.51
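
Notice that the output above includes the row for the start date, and that the end date `2021-07-31` does not have to be present in the index. Unlike slicing with `[]` or `.iloc`, a label slice with `.loc` includes both endpoints. The sketch below illustrates this on a hypothetical frame `df_demo` indexed by five consecutive calendar days.

```python
import pandas as pd

# Hypothetical frame indexed by five consecutive calendar days.
dates = pd.date_range("2021-07-01", periods=5, freq="D")
df_demo = pd.DataFrame({"x": [0, 1, 2, 3, 4]}, index=dates)

by_label = df_demo.loc["2021-07-02":"2021-07-04"]      # both end dates are included
by_position = df_demo.iloc[1:4]                        # stop position 4 is excluded
past_the_end = df_demo.loc["2021-07-04":"2021-07-31"]  # end label need not exist

print(list(by_label["x"]))
print(by_label.equals(by_position))
print(list(past_the_end["x"]))
```

Slicing with an end label that is missing from the index relies on the index being sorted, which is the case for the dates in `df_spy`.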


If we want to select only the `volume` and `adj_close` columns for these dates, we would type the following:

In [27]:
df_spy.loc['2021-07-23':'2021-07-31', ['volume', 'adj_close']]

date,volume,adj_close
2021-07-23,63766600,439.94
2021-07-26,43719200,441.02
2021-07-27,67397100,439.01
2021-07-28,52472400,438.83
2021-07-29,47435300,440.65
2021-07-30,68890600,438.51


**Code Challenge:** Use `.loc` to grab the `volume` and `close` columns from `df_spy`.  Since `date` is now the row index rather than a column, the dates come along automatically.

In [28]:
df_spy.loc[:,['volume', 'close']]

date,volume,close
2021-06-30,64827900,428.06
2021-07-01,53441000,430.43
2021-07-02,57697700,433.72
2021-07-06,68710400,432.93
2021-07-07,63549500,434.46
2021-07-08,97595200,430.92
2021-07-09,76238600,435.52
2021-07-12,52889600,437.08
2021-07-13,52911300,435.59
2021-07-14,64130400,436.24


## Related Reading

*Python Data Science Handbook* (*PDSH*) - 2.7 - Fancy Indexing

*PDSH* - 3.2 - Data Indexing and Selection