# `DataFrame` Indexing and Slicing

Accessing a specific row of a `DataFrame` by its location is referred to as *indexing*.  Accessing a sequence of contiguous rows is referred to as *slicing*.

The purpose of this chapter is to survey various methods for indexing and slicing in **pandas**.

## Importing Packages

Let's begin by importing the packages that we will need.

In [1]:
import pandas as pd
import yfinance as yf
pd.set_option('display.max_rows', 10)

## Reading-In Data

Next, let's grab some data from Yahoo Finance.  In particular, we'll grab `SPY` price data from June 30, 2021 through the end of July 2021.

In [2]:
df_spy = yf.download('SPY', start='2021-06-30', end='2021-07-31', auto_adjust=False, rounding=True)
df_spy.head()

[*********************100%***********************]  1 of 1 completed


Price,Adj Close,Close,High,Low,Open,Volume
Ticker,SPY,SPY,SPY,SPY,SPY,SPY
Date,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2,Unnamed: 6_level_2
2021-06-30,404.51,428.06,428.78,427.18,427.21,64827900
2021-07-01,406.75,430.43,430.6,428.8,428.87,53441000
2021-07-02,409.86,433.72,434.1,430.52,431.67,57697700
2021-07-06,409.11,432.93,434.01,430.01,433.78,68710400
2021-07-07,410.56,434.46,434.76,431.51,433.66,63549500


The following code:

- removes the `SPY` level of the column index
- resets the index so that `Date` is a regular column
- converts the column names to snake case.

In [3]:
df_spy = df_spy.droplevel(level=1, axis=1)
df_spy = df_spy.rename_axis(None, axis=1)
df_spy.reset_index(inplace=True)
df_spy.columns = df_spy.columns.str.lower().str.replace(' ', '_')
df_spy.head()

Unnamed: 0,date,adj_close,close,high,low,open,volume
0,2021-06-30,404.51,428.06,428.78,427.18,427.21,64827900
1,2021-07-01,406.75,430.43,430.6,428.8,428.87,53441000
2,2021-07-02,409.86,433.72,434.1,430.52,431.67,57697700
3,2021-07-06,409.11,432.93,434.01,430.01,433.78,68710400
4,2021-07-07,410.56,434.46,434.76,431.51,433.66,63549500
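
Because `yf.download` needs a network connection, it can help to rehearse the same clean-up steps on a small hand-built frame. The sketch below is only an illustration: `df_raw` is a made-up stand-in for the download (with values copied from the first two rows above), and the steps are chained rather than applied with `inplace=True`.

```python
import pandas as pd

# Hypothetical stand-in for the yfinance output: two-level columns
# (Price, Ticker) and a DatetimeIndex named "Date".
cols = pd.MultiIndex.from_tuples(
    [("Adj Close", "SPY"), ("Close", "SPY"), ("Volume", "SPY")],
    names=["Price", "Ticker"],
)
idx = pd.DatetimeIndex(["2021-06-30", "2021-07-01"], name="Date")
df_raw = pd.DataFrame(
    [[404.51, 428.06, 64827900], [406.75, 430.43, 53441000]],
    index=idx,
    columns=cols,
)

df_clean = (
    df_raw.droplevel(level=1, axis=1)  # drop the "SPY" level
    .rename_axis(None, axis=1)         # remove the leftover "Price" axis name
    .reset_index()                     # move Date from the index to a column
)
# Lower-case the names and replace spaces with underscores.
df_clean.columns = df_clean.columns.str.lower().str.replace(" ", "_")
print(df_clean.columns.tolist())
```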


It is often useful to look at the data type of each of the columns of a new data set.  We can do so with the `DataFrame.dtypes` attribute.

In [4]:
df_spy.dtypes

date         datetime64[ns]
adj_close           float64
close               float64
high                float64
low                 float64
open                float64
volume                int64
dtype: object

## Row Slicing

The simplest way to slice a `DataFrame` is to use square brackets: `[]`.  The syntax `df[i:j]` will generate a `DataFrame` whose first row is the `i`th row of `df` and whose last row is the `(j-1)`th row of `df`, counting rows from zero.   Let's demonstrate this with some examples:

Starting from the 0th row, and ending with the 0th row:

In [5]:
df_spy[0:1]

Unnamed: 0,date,adj_close,close,high,low,open,volume
0,2021-06-30,404.51,428.06,428.78,427.18,427.21,64827900


Starting with the 3rd row, and ending with the 6th row:

In [6]:
df_spy[3:7]

Unnamed: 0,date,adj_close,close,high,low,open,volume
3,2021-07-06,409.11,432.93,434.01,430.01,433.78,68710400
4,2021-07-07,410.56,434.46,434.76,431.51,433.66,63549500
5,2021-07-08,407.21,430.92,431.73,427.52,428.78,97595200
6,2021-07-09,411.56,435.52,435.84,430.71,432.53,76238600
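
The half-open convention used above (start included, end excluded) can be checked on a toy frame. This is a minimal sketch; `df_toy` is a hypothetical ten-row frame, not part of the SPY data.

```python
import pandas as pd

# Hypothetical toy frame with ten rows and the default 0..9 row index.
df_toy = pd.DataFrame({"x": range(10)})

i, j = 3, 7
df_slice = df_toy[i:j]

print(len(df_slice))            # the slice has j - i rows
print(df_slice.index.tolist())  # rows i through j - 1
```

One consequence of this convention is that `df[i:j]` always has `j - i` rows, as long as both numbers fall within the frame.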


---

**Code Challenge:** Retrieve the 15th, 16th, and 17th rows of `df_spy`.

In [7]:
#| code-fold: true
#| code-summary: "Solution"
df_spy[15:18]

Unnamed: 0,date,adj_close,close,high,low,open,volume
15,2021-07-22,411.5,435.46,435.72,433.69,434.74,47878500
16,2021-07-23,415.74,439.94,440.3,436.79,437.52,63766600
17,2021-07-26,416.76,441.02,441.03,439.26,439.31,43719200


---

The syntax `df[:n]` starts the slice at row `0`, so it returns the first `n` rows of `df`.  For example, the following code retrieves all of `df_spy` (notice that `len(df_spy)` gives the number of rows of `df_spy`):

In [8]:
df_spy[:len(df_spy)]

Unnamed: 0,date,adj_close,close,high,low,open,volume
0,2021-06-30,404.51,428.06,428.78,427.18,427.21,64827900
1,2021-07-01,406.75,430.43,430.60,428.80,428.87,53441000
2,2021-07-02,409.86,433.72,434.10,430.52,431.67,57697700
3,2021-07-06,409.11,432.93,434.01,430.01,433.78,68710400
4,2021-07-07,410.56,434.46,434.76,431.51,433.66,63549500
...,...,...,...,...,...,...,...
17,2021-07-26,416.76,441.02,441.03,439.26,439.31,43719200
18,2021-07-27,414.86,439.01,439.94,435.99,439.91,67397100
19,2021-07-28,414.69,438.83,440.30,437.31,439.68,52472400
20,2021-07-29,416.41,440.65,441.80,439.81,439.82,47435300


---

**Code Challenge:** Retrieve the first five rows of `df_spy`.

In [9]:
#| code-fold: true
#| code-summary: "Solution"
df_spy[:5]

Unnamed: 0,date,adj_close,close,high,low,open,volume
0,2021-06-30,404.51,428.06,428.78,427.18,427.21,64827900
1,2021-07-01,406.75,430.43,430.6,428.8,428.87,53441000
2,2021-07-02,409.86,433.72,434.1,430.52,431.67,57697700
3,2021-07-06,409.11,432.93,434.01,430.01,433.78,68710400
4,2021-07-07,410.56,434.46,434.76,431.51,433.66,63549500


---

There are a couple of row slicing tricks that involve negative numbers that are worth mentioning.

The syntax `df[-n:]` retrieves the last `n` rows of `df`.  The following code retrieves the last five rows of `df_spy`.

In [10]:
df_spy[-5:]

Unnamed: 0,date,adj_close,close,high,low,open,volume
17,2021-07-26,416.76,441.02,441.03,439.26,439.31,43719200
18,2021-07-27,414.86,439.01,439.94,435.99,439.91,67397100
19,2021-07-28,414.69,438.83,440.3,437.31,439.68,52472400
20,2021-07-29,416.41,440.65,441.8,439.81,439.82,47435300
21,2021-07-30,414.39,438.51,440.06,437.77,437.91,68951200


The syntax `df[:-n]` retrieves all but the last `n` rows of `df`.  The following code retrieves all but the last 10 rows of `df_spy`:

In [11]:
df_spy[:-10]

Unnamed: 0,date,adj_close,close,high,low,open,volume
0,2021-06-30,404.51,428.06,428.78,427.18,427.21,64827900
1,2021-07-01,406.75,430.43,430.60,428.80,428.87,53441000
2,2021-07-02,409.86,433.72,434.10,430.52,431.67,57697700
3,2021-07-06,409.11,432.93,434.01,430.01,433.78,68710400
4,2021-07-07,410.56,434.46,434.76,431.51,433.66,63549500
...,...,...,...,...,...,...,...
7,2021-07-12,413.03,437.08,437.35,434.97,435.43,52889600
8,2021-07-13,411.63,435.59,437.84,435.31,436.24,52911300
9,2021-07-14,412.24,436.24,437.92,434.91,437.40,64130400
10,2021-07-15,410.83,434.75,435.53,432.72,434.81,55126400
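
The two negative-number patterns can be compared side by side on a toy frame. The sketch below uses a hypothetical ten-row `df_toy`, and it also shows one edge case worth knowing: since `-0` is just `0`, `df[:-0]` is the same as `df[:0]` and returns no rows.

```python
import pandas as pd

# Hypothetical toy frame with ten rows and the default 0..9 row index.
df_toy = pd.DataFrame({"x": range(10)})

last_three = df_toy[-3:]     # the last 3 rows
all_but_three = df_toy[:-3]  # everything except the last 3 rows

# Edge case: with n = 0, df[:-n] becomes df[:0], which is empty.
n = 0
empty = df_toy[:-n]

print(last_three.index.tolist())
print(len(all_but_three), len(empty))
```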


---

**Code Challenge:** Retrieve the first row of `df_spy` with negative indexing.

In [12]:
#| code-fold: true
#| code-summary: "Solution"
df_spy[:-(len(df_spy)-1)]

Unnamed: 0,date,adj_close,close,high,low,open,volume
0,2021-06-30,404.51,428.06,428.78,427.18,427.21,64827900


---

**Code Challenge:** Use simple slicing to select the last three rows of `df_spy`: 1) without explicitly using row numbers; 2) by explicitly using row numbers.

In [13]:
#| code-fold: true
#| code-summary: "Solution"
df_spy[len(df_spy)-3:len(df_spy)]

Unnamed: 0,date,adj_close,close,high,low,open,volume
19,2021-07-28,414.69,438.83,440.3,437.31,439.68,52472400
20,2021-07-29,416.41,440.65,441.8,439.81,439.82,47435300
21,2021-07-30,414.39,438.51,440.06,437.77,437.91,68951200


In [14]:
#| code-fold: true
#| code-summary: "Solution"
df_spy[-3:]

Unnamed: 0,date,adj_close,close,high,low,open,volume
19,2021-07-28,414.69,438.83,440.3,437.31,439.68,52472400
20,2021-07-29,416.41,440.65,441.8,439.81,439.82,47435300
21,2021-07-30,414.39,438.51,440.06,437.77,437.91,68951200


---

## `DataFrame` Indexes

Under the hood, a `DataFrame` has several `indexes`:

`columns` - the set of column names is an (explicit) index.

`row` - whenever a `DataFrame` is created, there is an explicit row index that is created.  If one isn't specified, then a sequence of non-negative integers is used.

`implicit` - each row has an implicit row-number, and each column has an implicit column-number.
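
The difference between the explicit and implicit indexes is easiest to see when the row labels are not integers. The following is a small sketch with a hypothetical frame `df_toy` whose rows are labeled with strings.

```python
import pandas as pd

# Hypothetical frame with string row labels, so labels and positions differ.
df_toy = pd.DataFrame({"x": [10, 20, 30]}, index=["a", "b", "c"])

print(df_toy.index.tolist())    # explicit row index
print(df_toy.columns.tolist())  # explicit column index

# Implicit row number (position) of the row labeled "b".
pos_b = df_toy.index.get_loc("b")
print(pos_b)
```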

Let's take a look at the `columns` index of `df_spy`.

In [15]:
df_spy.columns

Index(['date', 'adj_close', 'close', 'high', 'low', 'open', 'volume'], dtype='object')

In [16]:
type(df_spy.columns)

pandas.core.indexes.base.Index

Next, let's take a look at the explicit row `index` attribute of `df_spy`.

In [17]:
df_spy.index

RangeIndex(start=0, stop=22, step=1)

In [18]:
type(df_spy.index)

pandas.core.indexes.range.RangeIndex

Since we reset the index for `df_spy`, a `RangeIndex` object is used for the explicit row `index`.  You can think of a `RangeIndex` object as a glorified set of consecutive integers.
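
As a quick illustration, a `RangeIndex` can be built directly and compared with plain integers. The names `idx` and `df_toy` below are made up for this sketch.

```python
import pandas as pd

# A RangeIndex is described by start, stop, and step, much like Python's range.
idx = pd.RangeIndex(start=0, stop=5, step=1)
print(list(idx))

# A frame built without an explicit index gets a RangeIndex by default.
df_toy = pd.DataFrame({"x": [1.5, 2.5, 3.5]})
print(type(df_toy.index).__name__, df_toy.index.tolist())
```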

For the most part, we won't be too concerned with `indexes`.  A lot of data analysis can be done without worrying about them.  However, it's good to be aware that `indexes` exist because they can come into play for more advanced topics, such as joining tables together; they also come up frequently in Stack Overflow examples.

For the purposes of this chapter, our interest in `indexes` comes from how they are related to two built-in `DataFrame` *indexers*: `DataFrame.iloc` and `DataFrame.loc`.

## Indexing with `DataFrame.iloc`

The indexer `DataFrame.iloc` can be used to access rows and columns using their implicit row and column numbers.

Here is an example of `iloc` that retrieves the first two rows of `df_spy`.

In [19]:
df_spy.iloc[0:2,]

Unnamed: 0,date,adj_close,close,high,low,open,volume
0,2021-06-30,404.51,428.06,428.78,427.18,427.21,64827900
1,2021-07-01,406.75,430.43,430.6,428.8,428.87,53441000


Notice that because we didn't specify any column numbers, the code above retrieves all columns.

The following code grabs the first three rows and the first three columns of `df_spy`.

In [20]:
df_spy.iloc[0:3, 0:3]

Unnamed: 0,date,adj_close,close
0,2021-06-30,404.51,428.06
1,2021-07-01,406.75,430.43
2,2021-07-02,409.86,433.72


We can also supply `.iloc` with `lists` rather than slices to specify custom sets of rows and columns:

In [21]:
lst_row = [0, 2] # 0th and 2nd row
lst_col = [0, 6] # date and volume columns
df_spy.iloc[lst_row, lst_col]

Unnamed: 0,date,volume
0,2021-06-30,64827900
2,2021-07-02,57697700


Using `lists` as a means of indexing is sometimes referred to as *fancy indexing*.
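
Two properties of fancy indexing are worth seeing on a toy example: the result follows the order of the lists, and a position may be repeated. This sketch uses a hypothetical frame `df_toy` rather than the SPY data.

```python
import pandas as pd

# Hypothetical toy frame with five rows and three columns.
df_toy = pd.DataFrame({"a": range(5), "b": range(10, 15), "c": range(20, 25)})

# Rows and columns come back in the order listed;
# repeating a position repeats the corresponding row.
picked = df_toy.iloc[[4, 0, 0], [2, 0]]
print(picked)
```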

---

**Code Challenge:** Use fancy indexing to grab the 14th, 0th, and 5th rows of `df_spy` - in that order.

In [22]:
#| code-fold: true
#| code-summary: "Solution"
df_spy.iloc[[14, 0, 5]]

Unnamed: 0,date,adj_close,close,high,low,open,volume
14,2021-07-21,410.64,434.55,434.7,431.01,432.34,64724400
0,2021-06-30,404.51,428.06,428.78,427.18,427.21,64827900
5,2021-07-08,407.21,430.92,431.73,427.52,428.78,97595200


---

## Indexing with `DataFrame.loc`

Rather than using the implicit row or column numbers, it is often more useful to access data by using the explicit row or column indices.

Let's use the `DataFrame.set_index()` method to set the `date` column as our new index.  The dates make for a more interesting explicit index than consecutive integers.

In [23]:
df_spy.set_index('date', inplace=True)
df_spy.head()

Unnamed: 0_level_0,adj_close,close,high,low,open,volume
date,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
2021-06-30,404.51,428.06,428.78,427.18,427.21,64827900
2021-07-01,406.75,430.43,430.6,428.8,428.87,53441000
2021-07-02,409.86,433.72,434.1,430.52,431.67,57697700
2021-07-06,409.11,432.93,434.01,430.01,433.78,68710400
2021-07-07,410.56,434.46,434.76,431.51,433.66,63549500


To see the effect of the above code, we can have a look at the `index` of `df_spy`.

In [24]:
df_spy.index

DatetimeIndex(['2021-06-30', '2021-07-01', '2021-07-02', '2021-07-06',
               '2021-07-07', '2021-07-08', '2021-07-09', '2021-07-12',
               '2021-07-13', '2021-07-14', '2021-07-15', '2021-07-16',
               '2021-07-19', '2021-07-20', '2021-07-21', '2021-07-22',
               '2021-07-23', '2021-07-26', '2021-07-27', '2021-07-28',
               '2021-07-29', '2021-07-30'],
              dtype='datetime64[ns]', name='date', freq=None)

And notice that `date` is no longer a column of `df_spy`.

In [25]:
df_spy.columns

Index(['adj_close', 'close', 'high', 'low', 'open', 'volume'], dtype='object')
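
The move from column to index is reversible. Here is a short sketch of the round trip on a hypothetical two-row frame `df_toy` (values borrowed from the `close` column above); it uses the non-`inplace` form, which returns a new frame each time.

```python
import pandas as pd

# Hypothetical two-row frame with a date column and a close column.
df_toy = pd.DataFrame(
    {"date": pd.to_datetime(["2021-07-01", "2021-07-02"]), "close": [430.43, 433.72]}
)

df_indexed = df_toy.set_index("date")   # "date" becomes the row index
df_restored = df_indexed.reset_index()  # and goes back to being a column

print(df_indexed.columns.tolist())
print(df_restored.columns.tolist())
```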

Now that we have successfully set the row `index` of `df_spy` to be the `date`, let's see how we can use this `index` to access the data via `.loc`.
        
Here is an example of how we can grab a slice of rows associated with a date range.

In [26]:
df_spy.loc['2021-07-23':'2021-07-31']

Unnamed: 0_level_0,adj_close,close,high,low,open,volume
date,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
2021-07-23,415.74,439.94,440.3,436.79,437.52,63766600
2021-07-26,416.76,441.02,441.03,439.26,439.31,43719200
2021-07-27,414.86,439.01,439.94,435.99,439.91,67397100
2021-07-28,414.69,438.83,440.3,437.31,439.68,52472400
2021-07-29,416.41,440.65,441.8,439.81,439.82,47435300
2021-07-30,414.39,438.51,440.06,437.77,437.91,68951200
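
Notice that the slice above includes its end date and that `2021-07-31` is not even a row label. The sketch below, on a hypothetical four-row frame `df_toy` (closing prices copied from the output above), contrasts this label-based behavior of `.loc` with the half-open behavior of `.iloc`.

```python
import pandas as pd

# Hypothetical four-row frame indexed by date.
dates = pd.to_datetime(["2021-07-27", "2021-07-28", "2021-07-29", "2021-07-30"])
df_toy = pd.DataFrame({"close": [439.01, 438.83, 440.65, 438.51]}, index=dates)

by_label = df_toy.loc["2021-07-28":"2021-07-29"]  # both end labels included
by_position = df_toy.iloc[1:2]                    # end position excluded

# With a sorted DatetimeIndex, the end label need not be present in the index.
to_month_end = df_toy.loc["2021-07-29":"2021-07-31"]

print(len(by_label), len(by_position), len(to_month_end))
```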


If we want to select only the `volume` and `adj_close` columns for these dates, we would type the following:

In [27]:
df_spy.loc['2021-07-23':'2021-07-31', ['volume', 'adj_close']]

Unnamed: 0_level_0,volume,adj_close
date,Unnamed: 1_level_1,Unnamed: 2_level_1
2021-07-23,63766600,415.74
2021-07-26,43719200,416.76
2021-07-27,67397100,414.86
2021-07-28,52472400,414.69
2021-07-29,47435300,416.41
2021-07-30,68951200,414.39
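
In the call above the columns are passed as a list, and the result is a `DataFrame` with the columns in the requested order. As a small sketch of a related detail, passing a single label instead of a list returns a `Series`. The frame `df_toy` below is hypothetical, with values copied from the last two rows above.

```python
import pandas as pd

# Hypothetical two-row frame indexed by date.
dates = pd.to_datetime(["2021-07-29", "2021-07-30"])
df_toy = pd.DataFrame(
    {"adj_close": [416.41, 414.39], "volume": [47435300, 68951200]}, index=dates
)

as_series = df_toy.loc[:, "volume"]               # single label -> Series
as_frame = df_toy.loc[:, ["volume"]]              # list of labels -> DataFrame
reordered = df_toy.loc[:, ["volume", "adj_close"]]  # list order is preserved

print(type(as_series).__name__, type(as_frame).__name__)
print(reordered.columns.tolist())
```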


---

**Code Challenge:** Use `.loc` to grab the `volume` and `close` columns from `df_spy` for all dates (`date` is now the index, so it comes along automatically).

In [28]:
#| code-fold: true
#| code-summary: "Solution"
df_spy.loc[:,['volume', 'close']]

Unnamed: 0_level_0,volume,close
date,Unnamed: 1_level_1,Unnamed: 2_level_1
2021-06-30,64827900,428.06
2021-07-01,53441000,430.43
2021-07-02,57697700,433.72
2021-07-06,68710400,432.93
2021-07-07,63549500,434.46
...,...,...
2021-07-26,43719200,441.02
2021-07-27,67397100,439.01
2021-07-28,52472400,438.83
2021-07-29,47435300,440.65


---

## Related Reading

*Python Data Science Handbook* - Section 2.7 - Fancy Indexing

*Python Data Science Handbook* - Section 3.2 - Data Indexing and Selection 