# `DataFrame` Indexing and Slicing

Accessing a specific row of a `DataFrame` by its location is referred to as *indexing*.  Accessing a sequence of contiguous rows is referred to as *slicing*.

The purpose of this chapter is to survey various methods for indexing and slicing in **pandas**.

## Importing Packages

Let's begin by importing the packages that we will need.

In [None]:
import pandas as pd
import yfinance as yf
yf.pdr_override()
from pandas_datareader import data as pdr
pd.set_option('display.max_rows', 10)

## Reading-In Data

Next, let's grab some data from Yahoo Finance.  In particular, we'll grab `SPY` price data from June 30, 2021 through the end of July 2021.

In [None]:
df_spy = pdr.get_data_yahoo('SPY', start='2021-06-30', end='2021-07-31')
df_spy = df_spy.round(2)
df_spy.head()

[*********************100%***********************]  1 of 1 completed


Unnamed: 0_level_0,Open,High,Low,Close,Adj Close,Volume
Date,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
2021-06-30,427.21,428.78,427.18,428.06,415.28,64827900
2021-07-01,428.87,430.6,428.8,430.43,417.58,53441000
2021-07-02,431.67,434.1,430.52,433.72,420.77,57697700
2021-07-06,433.78,434.01,430.01,432.93,420.0,68710400
2021-07-07,433.66,434.76,431.51,434.46,421.49,63549500


The following code resets the index so that `Date` becomes a regular column; it also converts the column names to snake case.

In [None]:
df_spy.reset_index(inplace=True)
df_spy.columns = df_spy.columns.str.lower().str.replace(' ', '_')
df_spy.head()

Unnamed: 0,date,open,high,low,close,adj_close,volume
0,2021-06-30,427.21,428.78,427.18,428.06,415.28,64827900
1,2021-07-01,428.87,430.6,428.8,430.43,417.58,53441000
2,2021-07-02,431.67,434.1,430.52,433.72,420.77,57697700
3,2021-07-06,433.78,434.01,430.01,432.93,420.0,68710400
4,2021-07-07,433.66,434.76,431.51,434.46,421.49,63549500
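
If the Yahoo Finance download is unavailable, the reset-and-rename step can still be tried on a small stand-in. The sketch below rebuilds the first three rows shown above by hand (the name `df_demo` is hypothetical and not part of the chapter's data) and applies the same two operations. It uses the non-`inplace` form of `reset_index()`, which returns a new `DataFrame`.

```python
import pandas as pd

# Hypothetical stand-in for the downloaded data: the first three rows shown above.
df_demo = pd.DataFrame(
    {
        "Open": [427.21, 428.87, 431.67],
        "High": [428.78, 430.60, 434.10],
        "Low": [427.18, 428.80, 430.52],
        "Close": [428.06, 430.43, 433.72],
        "Adj Close": [415.28, 417.58, 420.77],
        "Volume": [64827900, 53441000, 57697700],
    },
    index=pd.DatetimeIndex(["2021-06-30", "2021-07-01", "2021-07-02"], name="Date"),
)

# Move `Date` out of the index and into a regular column.
df_demo = df_demo.reset_index()

# Lower-case the column names and replace spaces with underscores.
df_demo.columns = df_demo.columns.str.lower().str.replace(" ", "_")

print(df_demo.columns.tolist())
```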


It is often useful to look at the data type of each of the columns of a new data set.  We can do so with the `DataFrame.dtypes` attribute.

In [None]:
df_spy.dtypes

date         datetime64[ns]
open                float64
high                float64
low                 float64
close               float64
adj_close           float64
volume                int64
dtype: object

## Row Slicing

The simplest way to slice a `DataFrame` is to use square brackets: `[]`.  The syntax `df[i:j]` will generate a `DataFrame` whose first row is the `i`th row of `df` and whose last row is the `(j-1)`th row of `df`.   Let's demonstrate this with some examples:

Starting from the 0th row, and ending with the 0th row:

In [None]:
df_spy[0:1]

Unnamed: 0,date,open,high,low,close,adj_close,volume
0,2021-06-30,427.21,428.78,427.18,428.06,415.28,64827900


Starting with the 3rd row, and ending with the 6th row:

In [None]:
df_spy[3:7]

Unnamed: 0,date,open,high,low,close,adj_close,volume
3,2021-07-06,433.78,434.01,430.01,432.93,420.0,68710400
4,2021-07-07,433.66,434.76,431.51,434.46,421.49,63549500
5,2021-07-08,428.78,431.73,427.52,430.92,418.05,97595200
6,2021-07-09,432.53,435.84,430.71,435.52,422.51,76238600
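
One consequence of this half-open convention is that `df[i:j]` contains exactly `j - i` rows, as long as both bounds fall inside the table. A minimal sketch on toy data (`df_toy` is a hypothetical name, not part of the chapter's data):

```python
import pandas as pd

# Hypothetical toy data: ten rows with a default integer index.
df_toy = pd.DataFrame({"x": range(10)})

i, j = 3, 7
df_slice = df_toy[i:j]

# The slice starts at row i and stops just before row j.
print(df_slice.index.tolist())
print(len(df_slice) == j - i)
```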


---

**Code Challenge:** Retrieve the 15th, 16th, and 17th rows of `df_spy`.

In [None]:
#| code-fold: true
#| code-summary: "Solution"
df_spy[15:18]

Unnamed: 0,date,open,high,low,close,adj_close,volume
15,2021-07-22,434.74,435.72,433.69,435.46,422.46,47878500
16,2021-07-23,437.52,440.3,436.79,439.94,426.8,63766600
17,2021-07-26,439.31,441.03,439.26,441.02,427.85,43719200


---

The syntax `df[:n]` starts the slice at row `0` automatically.  For example, the following code retrieves all of `df_spy` (notice that `len(df_spy)` gives the number of rows of `df_spy`):

In [None]:
df_spy[:len(df_spy)]

Unnamed: 0,date,open,high,low,close,adj_close,volume
0,2021-06-30,427.21,428.78,427.18,428.06,415.28,64827900
1,2021-07-01,428.87,430.60,428.80,430.43,417.58,53441000
2,2021-07-02,431.67,434.10,430.52,433.72,420.77,57697700
3,2021-07-06,433.78,434.01,430.01,432.93,420.00,68710400
4,2021-07-07,433.66,434.76,431.51,434.46,421.49,63549500
...,...,...,...,...,...,...,...
17,2021-07-26,439.31,441.03,439.26,441.02,427.85,43719200
18,2021-07-27,439.91,439.94,435.99,439.01,425.90,67397100
19,2021-07-28,439.68,440.30,437.31,438.83,425.73,52472400
20,2021-07-29,439.82,441.80,439.81,440.65,427.49,47435300
21,2021-07-30,437.91,440.06,437.77,438.51,425.42,68951200


---

**Code Challenge:** Retrieve the first five rows of `df_spy`.

In [None]:
#| code-fold: true
#| code-summary: "Solution"
df_spy[:5]

Unnamed: 0,date,open,high,low,close,adj_close,volume
0,2021-06-30,427.21,428.78,427.18,428.06,415.28,64827900
1,2021-07-01,428.87,430.6,428.8,430.43,417.58,53441000
2,2021-07-02,431.67,434.1,430.52,433.72,420.77,57697700
3,2021-07-06,433.78,434.01,430.01,432.93,420.0,68710400
4,2021-07-07,433.66,434.76,431.51,434.46,421.49,63549500


---

There are a couple of row slicing tricks that involve negative numbers that are worth mentioning.

The syntax `df[-n:]` retrieves the last `n` rows of `df`.  The following code retrieves the last five rows of `df_spy`.

In [None]:
df_spy[-5:]

Unnamed: 0,date,open,high,low,close,adj_close,volume
17,2021-07-26,439.31,441.03,439.26,441.02,427.85,43719200
18,2021-07-27,439.91,439.94,435.99,439.01,425.9,67397100
19,2021-07-28,439.68,440.3,437.31,438.83,425.73,52472400
20,2021-07-29,439.82,441.8,439.81,440.65,427.49,47435300
21,2021-07-30,437.91,440.06,437.77,438.51,425.42,68951200


The syntax `df[:-n]` retrieves all but the last `n` rows of `df`.  The following code retrieves all but the last 10 rows of `df_spy`:

In [None]:
df_spy[:-10]

Unnamed: 0,date,open,high,low,close,adj_close,volume
0,2021-06-30,427.21,428.78,427.18,428.06,415.28,64827900
1,2021-07-01,428.87,430.60,428.80,430.43,417.58,53441000
2,2021-07-02,431.67,434.10,430.52,433.72,420.77,57697700
3,2021-07-06,433.78,434.01,430.01,432.93,420.00,68710400
4,2021-07-07,433.66,434.76,431.51,434.46,421.49,63549500
...,...,...,...,...,...,...,...
7,2021-07-12,435.43,437.35,434.97,437.08,424.03,52889600
8,2021-07-13,436.24,437.84,435.31,435.59,422.58,52911300
9,2021-07-14,437.40,437.92,434.91,436.24,423.21,64130400
10,2021-07-15,434.81,435.53,432.72,434.75,421.77,55126400
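
Both negative-number slices have method equivalents: `DataFrame.tail(n)` returns the last `n` rows, and `DataFrame.head(-n)` returns all but the last `n` rows. A small sketch on hypothetical toy data (`df_toy` is an illustrative name) checks that the two spellings agree:

```python
import pandas as pd

# Hypothetical toy data: ten rows with a default integer index.
df_toy = pd.DataFrame({"x": range(10)})
n = 3

last_n = df_toy[-n:]          # last n rows
all_but_last_n = df_toy[:-n]  # everything except the last n rows

# Compare each slice with its method-based equivalent.
print(last_n.equals(df_toy.tail(n)))
print(all_but_last_n.equals(df_toy.head(-n)))
```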


---

**Code Challenge:** Retrieve the first row of `df_spy` with negative indexing.

In [None]:
#| code-fold: true
#| code-summary: "Solution"
df_spy[:-(len(df_spy)-1)]

Unnamed: 0,date,open,high,low,close,adj_close,volume
0,2021-06-30,427.21,428.78,427.18,428.06,415.28,64827900


---

**Code Challenge:** Use simple slicing to select the last three rows of `df_spy`: 1) without explicitly using row numbers; 2) with explicitly using row numbers.

In [None]:
#| code-fold: true
#| code-summary: "Solution"
df_spy[len(df_spy)-3:len(df_spy)]

Unnamed: 0,date,open,high,low,close,adj_close,volume
19,2021-07-28,439.68,440.3,437.31,438.83,425.73,52472400
20,2021-07-29,439.82,441.8,439.81,440.65,427.49,47435300
21,2021-07-30,437.91,440.06,437.77,438.51,425.42,68951200


In [None]:
#| code-fold: true
#| code-summary: "Solution"
df_spy[-3:]

Unnamed: 0,date,open,high,low,close,adj_close,volume
19,2021-07-28,439.68,440.3,437.31,438.83,425.73,52472400
20,2021-07-29,439.82,441.8,439.81,440.65,427.49,47435300
21,2021-07-30,437.91,440.06,437.77,438.51,425.42,68951200


---

## `DataFrame` Indexes

Under the hood, a `DataFrame` has several `indexes`:

`columns` - the set of column names is an (explicit) index.

`row` - whenever a `DataFrame` is created, there is an explicit row index that is created.  If one isn't specified, then a sequence of non-negative integers is used.

`implicit` - each row has an implicit row-number, and each column has an implicit column-number.

Let's take a look at the `columns` index of `df_spy`.

In [None]:
df_spy.columns

Index(['date', 'open', 'high', 'low', 'close', 'adj_close', 'volume'], dtype='object')

In [None]:
type(df_spy.columns)

pandas.core.indexes.base.Index

Next, let's take a look at the explicit row `index` attribute of `df_spy`.

In [None]:
df_spy.index

RangeIndex(start=0, stop=22, step=1)

In [None]:
type(df_spy.index)

pandas.core.indexes.range.RangeIndex

Since we reset the index for `df_spy`, a `RangeIndex` object is used for the explicit row `index`.  You can think of a `RangeIndex` object as a glorified set of consecutive integers.
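
To make the "consecutive integers" picture concrete, the sketch below builds a `RangeIndex` directly and compares it with Python's built-in `range`. It also shows that a `DataFrame` created without an explicit index gets a `RangeIndex` by default (`df_toy` is a hypothetical example, not the chapter's data).

```python
import pandas as pd

# A RangeIndex with the same parameters as the one shown above.
idx = pd.RangeIndex(start=0, stop=22, step=1)
print(list(idx) == list(range(22)))

# A DataFrame built without an explicit index gets a RangeIndex.
df_toy = pd.DataFrame({"x": [10, 20, 30]})
print(df_toy.index)
```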

For the most part, we won't be too concerned with `indexes`.  A lot of data analysis can be done without worrying about them.  However, it's good to be aware that `indexes` exist because they can come into play for more advanced topics, such as joining tables together; they also come up frequently in Stack Overflow examples.

For the purposes of this chapter, our interest in `indexes` comes from how they are related to two built-in `DataFrame` *indexers*: `DataFrame.iloc` and `DataFrame.loc`.

## Indexing with `DataFrame.iloc`

The indexer `DataFrame.iloc` can be used to access rows and columns using their implicit row and column numbers.

Here is an example of `iloc` that retrieves the first two rows of `df_spy`.

In [None]:
df_spy.iloc[0:2,]

Unnamed: 0,date,open,high,low,close,adj_close,volume
0,2021-06-30,427.21,428.78,427.18,428.06,415.28,64827900
1,2021-07-01,428.87,430.6,428.8,430.43,417.58,53441000


Notice that because we didn't specify any column numbers, the code above retrieves all columns.

The following code grabs the first three rows and the first three columns of `df_spy`.

In [None]:
df_spy.iloc[0:3, 0:3]

Unnamed: 0,date,open,high
0,2021-06-30,427.21,428.78
1,2021-07-01,428.87,430.6
2,2021-07-02,431.67,434.1


We can also supply `.iloc` with `lists` rather than slices to specify custom sets of columns and rows:

In [None]:
lst_row = [0, 2] # 0th and 2nd row
lst_col = [0, 6] # date and volume columns
df_spy.iloc[lst_row, lst_col]

Unnamed: 0,date,volume
0,2021-06-30,64827900
2,2021-07-02,57697700


Using `lists` as a means of indexing is sometimes referred to as *fancy indexing*.
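
Because `.iloc` works purely by position, it behaves the same way whatever the explicit index labels are. The sketch below uses a hypothetical toy `DataFrame` with string row labels (`df_toy` and its values are made up for illustration) to show fancy indexing by position, including rows requested out of order:

```python
import pandas as pd

# Hypothetical toy data with string labels as the explicit row index.
df_toy = pd.DataFrame(
    {"a": [1, 2, 3, 4], "b": [10, 20, 30, 40], "c": [100, 200, 300, 400]},
    index=["w", "x", "y", "z"],
)

# Rows at positions 3 and 0 (in that order), columns at positions 0 and 2.
df_pick = df_toy.iloc[[3, 0], [0, 2]]
print(df_pick)
```

The result keeps the explicit labels of the selected rows and columns, even though they were chosen by position.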

---

**Code Challenge:** Use fancy indexing to grab the 14th, 0th, and 5th rows of `df_spy`, in that order.

In [None]:
#| code-fold: true
#| code-summary: "Solution"
df_spy.iloc[[14, 0, 5]]

Unnamed: 0,date,open,high,low,close,adj_close,volume
14,2021-07-21,432.34,434.7,431.01,434.55,421.57,64724400
0,2021-06-30,427.21,428.78,427.18,428.06,415.28,64827900
5,2021-07-08,428.78,431.73,427.52,430.92,418.05,97595200


---

## Indexing with `DataFrame.loc`

Rather than using the implicit row or column numbers, it is often more useful to access data by using the explicit row or column indices.

Let's use the `DataFrame.set_index()` method to set the `date` column as our new index.  The dates will make a more interesting explicit index.

In [None]:
df_spy.set_index('date', inplace = True)
df_spy.head()

Unnamed: 0_level_0,open,high,low,close,adj_close,volume
date,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
2021-06-30,427.21,428.78,427.18,428.06,415.28,64827900
2021-07-01,428.87,430.6,428.8,430.43,417.58,53441000
2021-07-02,431.67,434.1,430.52,433.72,420.77,57697700
2021-07-06,433.78,434.01,430.01,432.93,420.0,68710400
2021-07-07,433.66,434.76,431.51,434.46,421.49,63549500


To see the effect of the above code, we can have a look at the `index` of `df_spy`.

In [None]:
df_spy.index

DatetimeIndex(['2021-06-30', '2021-07-01', '2021-07-02', '2021-07-06',
               '2021-07-07', '2021-07-08', '2021-07-09', '2021-07-12',
               '2021-07-13', '2021-07-14', '2021-07-15', '2021-07-16',
               '2021-07-19', '2021-07-20', '2021-07-21', '2021-07-22',
               '2021-07-23', '2021-07-26', '2021-07-27', '2021-07-28',
               '2021-07-29', '2021-07-30'],
              dtype='datetime64[ns]', name='date', freq=None)

And notice that `date` is no longer a column of `df_spy`.

In [None]:
df_spy.columns

Index(['open', 'high', 'low', 'close', 'adj_close', 'volume'], dtype='object')
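
`set_index()` and `reset_index()` can be thought of as undoing one another. The sketch below uses the non-`inplace` forms, which return new `DataFrame` objects, on a hypothetical two-row table (`df_toy`, with dates and closing prices copied from the output above):

```python
import pandas as pd

# Hypothetical two-row table with `date` as a regular column.
df_toy = pd.DataFrame(
    {"date": pd.to_datetime(["2021-06-30", "2021-07-01"]), "close": [428.06, 430.43]}
)

# Move `date` into the row index, then move it back out again.
df_indexed = df_toy.set_index("date")
df_restored = df_indexed.reset_index()

print(df_indexed.columns.tolist())
print(df_restored.equals(df_toy))
```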

Now that we have successfully set the row `index` of `df_spy` to be the `date`, let's see how we can use this `index` to access the data via `.loc`.

Here is an example of how we can grab a slice of rows associated with a date range.

In [None]:
df_spy.loc['2021-07-23':'2021-07-31']

Unnamed: 0_level_0,open,high,low,close,adj_close,volume
date,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
2021-07-23,437.52,440.3,436.79,439.94,426.8,63766600
2021-07-26,439.31,441.03,439.26,441.02,427.85,43719200
2021-07-27,439.91,439.94,435.99,439.01,425.9,67397100
2021-07-28,439.68,440.3,437.31,438.83,425.73,52472400
2021-07-29,439.82,441.8,439.81,440.65,427.49,47435300
2021-07-30,437.91,440.06,437.77,438.51,425.42,68951200


If we want to select only the `volume` and `adj_close` columns for these dates, we would type the following:

In [None]:
df_spy.loc['2021-07-23':'2021-07-31', ['volume', 'adj_close']]

Unnamed: 0_level_0,volume,adj_close
date,Unnamed: 1_level_1,Unnamed: 2_level_1
2021-07-23,63766600,426.8
2021-07-26,43719200,427.85
2021-07-27,67397100,425.9
2021-07-28,52472400,425.73
2021-07-29,47435300,427.49
2021-07-30,68951200,425.42
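
One difference from `.iloc` is worth keeping in mind: label-based slices with `.loc` include both endpoints, whereas position-based slices stop just before the end position. A minimal sketch on a hypothetical three-row table (`df_toy`, with dates and volumes copied from the output above):

```python
import pandas as pd

# Hypothetical three-row table indexed by date.
df_toy = pd.DataFrame(
    {"volume": [63766600, 43719200, 67397100]},
    index=pd.DatetimeIndex(["2021-07-23", "2021-07-26", "2021-07-27"], name="date"),
)

# Label-based slice: both the start and the end date are included.
by_label = df_toy.loc["2021-07-23":"2021-07-26"]

# Position-based slice: position 1 is excluded.
by_position = df_toy.iloc[0:1]

print(len(by_label))
print(len(by_position))
```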


---

**Code Challenge:** Use `.loc` to grab the `volume` and `close` columns from `df_spy` (since `date` is now the index, it comes along automatically).

In [None]:
#| code-fold: true
#| code-summary: "Solution"
df_spy.loc[:,['volume', 'close']]

Unnamed: 0_level_0,volume,close
date,Unnamed: 1_level_1,Unnamed: 2_level_1
2021-06-30,64827900,428.06
2021-07-01,53441000,430.43
2021-07-02,57697700,433.72
2021-07-06,68710400,432.93
2021-07-07,63549500,434.46
...,...,...
2021-07-26,43719200,441.02
2021-07-27,67397100,439.01
2021-07-28,52472400,438.83
2021-07-29,47435300,440.65
2021-07-30,68951200,438.51


---

## Related Reading

*Python Data Science Handbook* - Section 2.7 - Fancy Indexing

*Python Data Science Handbook* - Section 3.2 - Data Indexing and Selection 