# Tutorial 04 - `DataFrame` Indexing and Slicing

While performing data analysis, you often need to access specific rows of a `DataFrame`. This is referred to as *indexing*. If the rows you are accessing are contiguous, it is referred to as *slicing*.

The purpose of this tutorial is to survey various methods for indexing and slicing.

## Importing Packages

Let's begin by importing the packages that we will be using in this tutorial.

In [1]:
import numpy as np
import pandas as pd

## Reading-In Sample Data

Next, we'll use the `read_csv()` function from `pandas` to read in some sample data to work with. We'll again use the SPY prices from December 2018.

In [2]:
df_spy = pd.read_csv("../data/spy_dec_2018.csv")
df_spy

,date,open,high,low,close,volume,adjusted
0,2018-12-03,280.279999,280.399994,277.51001,279.299988,103176300,277.678436
1,2018-12-04,278.369995,278.850006,269.899994,270.25,177986000,268.681
2,2018-12-06,265.920013,269.970001,262.440002,269.839996,204185400,268.273376
3,2018-12-07,269.459991,271.220001,262.630005,263.570007,161018900,262.039795
4,2018-12-10,263.369995,265.160004,258.619995,264.070007,151445900,262.536896
5,2018-12-11,267.660004,267.869995,262.480011,264.130005,121504400,262.596527
6,2018-12-12,267.470001,269.0,265.369995,265.459991,97976700,263.918793
7,2018-12-13,266.519989,267.48999,264.119995,265.369995,96662700,263.829315
8,2018-12-14,262.959991,264.029999,259.850006,260.470001,116961100,258.957794
9,2018-12-17,259.399994,260.649994,253.529999,255.360001,165492300,253.877457


It is often useful to check the data type of each column of a new data set, which is what the following line of code does:

In [3]:
df_spy.dtypes

date         object
open        float64
high        float64
low         float64
close       float64
volume        int64
adjusted    float64
dtype: object

- Notice that the `date` column has a `dtype` of `object`.
- This means that `pandas` is interpreting it as a text column, rather than a date.
- Formatting problems with dates are very common in data analysis.
- We'll address this issue later in the tutorial.

## Simple Row Slicing

The simplest way to slice a `DataFrame` is to use square brackets. The syntax `df[i:j]` generates a `DataFrame` whose first row is the `i`th row of `df` and whose last row is the `(j-1)`th row of `df`, counting rows from zero.

Let's demonstrate with a couple of examples:

In [4]:
df_spy[0:1] # starting from the 0th row, and ending with the 0th row
df_spy[3:7] # starting with the 3rd row, and ending with the 6th row
df_spy # full dataframe

,date,open,high,low,close,volume,adjusted
0,2018-12-03,280.279999,280.399994,277.51001,279.299988,103176300,277.678436
1,2018-12-04,278.369995,278.850006,269.899994,270.25,177986000,268.681
2,2018-12-06,265.920013,269.970001,262.440002,269.839996,204185400,268.273376
3,2018-12-07,269.459991,271.220001,262.630005,263.570007,161018900,262.039795
4,2018-12-10,263.369995,265.160004,258.619995,264.070007,151445900,262.536896
5,2018-12-11,267.660004,267.869995,262.480011,264.130005,121504400,262.596527
6,2018-12-12,267.470001,269.0,265.369995,265.459991,97976700,263.918793
7,2018-12-13,266.519989,267.48999,264.119995,265.369995,96662700,263.829315
8,2018-12-14,262.959991,264.029999,259.850006,260.470001,116961100,258.957794
9,2018-12-17,259.399994,260.649994,253.529999,255.360001,165492300,253.877457
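
To make the end-exclusive behavior of `df[i:j]` concrete, here is a minimal sketch. The name `df_demo` is hypothetical: it is a small stand-in built from three columns of the first five rows shown above, so the example runs without the CSV file.

```python
import pandas as pd

# Hypothetical stand-in for df_spy: three columns of the first five rows.
df_demo = pd.DataFrame({
    "date": ["2018-12-03", "2018-12-04", "2018-12-06", "2018-12-07", "2018-12-10"],
    "close": [279.299988, 270.25, 269.839996, 263.570007, 264.070007],
    "volume": [103176300, 177986000, 204185400, 161018900, 151445900],
})

i, j = 1, 4
df_slice = df_demo[i:j]

# The slice keeps rows i, i+1, ..., j-1, so it contains j - i rows.
print(len(df_slice))         # 3
print(list(df_slice.index))  # [1, 2, 3]
```

Row `j` itself is never included, which is why `df_spy[0:1]` returns a single row.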


## `DataFrame` Index

Under the hood of `pandas`, a `DataFrame` has several `index` attributes associated with it, which help to organize the data and keep it internally consistent. The following are worth mentioning:

`columns` - the set of column names is an (explicit) index.

`row` - whenever a `DataFrame` is created, an explicit row index is created with it. If one isn't specified, then a sequence of consecutive non-negative integers starting at 0 is used.

`implicit` - each row has an implicit row-number, and each column has an implicit column-number.

Let's take a look at the `columns` index of `df_spy`:

In [5]:
df_spy.columns
type(df_spy.columns)

pandas.core.indexes.base.Index

Next, let's take a look at the explicit row `index` attribute of `df_spy`: 

In [6]:
df_spy.index
type(df_spy.index)

pandas.core.indexes.range.RangeIndex

Since we didn't specify one when reading in the data, a `RangeIndex` object is used for the explicit row `index`.  You can think of a `RangeIndex` object as a glorified set of consecutive non-negative integers that start at 0.
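
As a small illustration, the sketch below builds a hypothetical three-row `DataFrame` named `df_demo` without specifying an index and then inspects the default row index and the column index.

```python
import pandas as pd

# Hypothetical three-row frame; no index is specified on creation.
df_demo = pd.DataFrame({"close": [279.299988, 270.25, 269.839996]})

row_index = df_demo.index
print(row_index)              # RangeIndex(start=0, stop=3, step=1)
print(list(row_index))        # [0, 1, 2]
print(list(df_demo.columns))  # ['close']
```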

We won't be too concerned with `index` attributes at the moment. A lot of data analysis can be done without worrying about them. However, it's good to be aware that `indexes` exist because they come into play for more advanced topics.

The reason I mention `indexes` now is that they are related to two built-in `DataFrame` attributes that are used for indexing and slicing data: `DataFrame.iloc` and `DataFrame.loc`.

## Indexing with `DataFrame.iloc`

The indexer attribute `DataFrame.iloc` can be used to access rows and columns using their implicit row and column numbers.

Here is an example of `iloc` that retrieves the first two rows of `df_spy`:

In [7]:
df_spy.iloc[0:2]

,date,open,high,low,close,volume,adjusted
0,2018-12-03,280.279999,280.399994,277.51001,279.299988,103176300,277.678436
1,2018-12-04,278.369995,278.850006,269.899994,270.25,177986000,268.681


Notice that because no column numbers were specified, all the columns are retrieved.

The following code grabs just the first three columns of the first two rows of `df_spy`:

In [8]:
df_spy.iloc[0:2, 0:3]

,date,open,high
0,2018-12-03,280.279999,280.399994
1,2018-12-04,278.369995,278.850006
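
A few other position-based patterns follow the same row-then-column convention. The sketch below is illustrative only and uses a hypothetical stand-in named `df_demo` that holds four columns of the first three rows of the sample data.

```python
import pandas as pd

# Hypothetical stand-in for df_spy: four columns of the first three rows.
df_demo = pd.DataFrame({
    "date": ["2018-12-03", "2018-12-04", "2018-12-06"],
    "open": [280.279999, 278.369995, 265.920013],
    "high": [280.399994, 278.850006, 269.970001],
    "low": [277.51001, 269.899994, 262.440002],
})

last_two = df_demo.iloc[-2:]       # negative positions count from the end
two_cols = df_demo.iloc[:, 0:2]    # a bare ":" keeps every row
one_cell = df_demo.iloc[0, 1]      # row 0, column 1

print(list(last_two.index))    # [1, 2]
print(list(two_cols.columns))  # ['date', 'open']
print(one_cell)                # 280.279999
```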


We can also supply `.iloc` with lists rather than ranges to specify custom sets of columns and rows:

In [9]:
lst_row = [0, 2] # first and third rows
lst_col = [0, 6] # date and adjusted columns
df_spy.iloc[lst_row, lst_col]

,date,adjusted
0,2018-12-03,277.678436
2,2018-12-06,268.273376


Using `lists` as a means of indexing is sometimes referred to as *fancy indexing*.
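
One practical consequence of passing a list is worth a quick sketch: a single integer given to `.iloc` returns a `Series`, while a one-element list returns a one-row `DataFrame`. The frame `df_demo` below is a hypothetical stand-in for `df_spy`.

```python
import pandas as pd

# Hypothetical stand-in for df_spy: three columns of the first three rows.
df_demo = pd.DataFrame({
    "date": ["2018-12-03", "2018-12-04", "2018-12-06"],
    "close": [279.299988, 270.25, 269.839996],
    "adjusted": [277.678436, 268.681, 268.273376],
})

row_as_series = df_demo.iloc[0]    # single integer -> Series
row_as_frame = df_demo.iloc[[0]]   # one-element list -> DataFrame

print(type(row_as_series).__name__)  # Series
print(type(row_as_frame).__name__)   # DataFrame
print(row_as_frame.shape)            # (1, 3)
```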

## Indexing with `DataFrame.loc`

Rather than using the implicit row or column numbers, it is often more useful to access data by using the explicit row or column indices.

As a preliminary step, let's first recast the `date` column as a `datetime`. This can be done easily by using the built-in `pandas` function `to_datetime()`:

In [10]:
df_spy['date'] = pd.to_datetime(df_spy['date']) 
df_spy.dtypes

date        datetime64[ns]
open               float64
high               float64
low                float64
close              float64
volume               int64
adjusted           float64
dtype: object
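
As an alternative sketch, the conversion can also be requested at read time through the `parse_dates` argument of `read_csv()`. The example below is not part of the original workflow: it reads a few rows from an in-memory string (the hypothetical `csv_text`) instead of the CSV file so that it is self-contained.

```python
import io

import pandas as pd

# Hypothetical in-memory CSV holding three rows of the sample data.
csv_text = """date,close
2018-12-03,279.299988
2018-12-04,270.25
2018-12-06,269.839996
"""

df_text = pd.read_csv(io.StringIO(csv_text))                          # dates stay as text
df_parsed = pd.read_csv(io.StringIO(csv_text), parse_dates=["date"])  # dates parsed on read

print(pd.api.types.is_datetime64_any_dtype(df_text["date"]))    # False
print(pd.api.types.is_datetime64_any_dtype(df_parsed["date"]))  # True
```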

Next we'll use the `DataFrame.set_index()` method to set the `date` column as our new index.  We are doing this because we ultimately want to use our dates as an index to access the data.

In [11]:
df_spy.set_index('date', inplace = True)
df_spy.head()

date,open,high,low,close,volume,adjusted
2018-12-03,280.279999,280.399994,277.51001,279.299988,103176300,277.678436
2018-12-04,278.369995,278.850006,269.899994,270.25,177986000,268.681
2018-12-06,265.920013,269.970001,262.440002,269.839996,204185400,268.273376
2018-12-07,269.459991,271.220001,262.630005,263.570007,161018900,262.039795
2018-12-10,263.369995,265.160004,258.619995,264.070007,151445900,262.536896


To see the effect of the above code, we can have a look at the `index` of `df_spy`. Notice that `date` is no longer a column:

In [12]:
df_spy.index

DatetimeIndex(['2018-12-03', '2018-12-04', '2018-12-06', '2018-12-07',
               '2018-12-10', '2018-12-11', '2018-12-12', '2018-12-13',
               '2018-12-14', '2018-12-17', '2018-12-18', '2018-12-19',
               '2018-12-20', '2018-12-21', '2018-12-24', '2018-12-26',
               '2018-12-27', '2018-12-28'],
              dtype='datetime64[ns]', name='date', freq=None)
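
The `set_index()` call used earlier relies on `inplace=True`, which modifies `df_spy` directly. As a hedged alternative, the sketch below shows the non-inplace form, which returns a new `DataFrame`, along with `reset_index()` to turn the index back into a column. The names `df_demo`, `df_indexed`, and `df_restored` are hypothetical.

```python
import pandas as pd

# Hypothetical stand-in for df_spy: two columns of the first three rows.
df_demo = pd.DataFrame({
    "date": pd.to_datetime(["2018-12-03", "2018-12-04", "2018-12-06"]),
    "close": [279.299988, 270.25, 269.839996],
})

df_indexed = df_demo.set_index("date")   # returns a new frame; df_demo is unchanged
df_restored = df_indexed.reset_index()   # moves the index back into a column

print("date" in df_demo.columns)     # True
print("date" in df_indexed.columns)  # False
print(list(df_restored.columns))     # ['date', 'close']
```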

Now that we have successfully set the row `index` of `df_spy` to be the `date` column, let's see how we can use this `index` to access the data via `.loc`.

Here is an example of how we can grab a slice of rows associated with a date range:

In [13]:
df_spy.loc['2018-12-21':'2018-12-28']

date,open,high,low,close,volume,adjusted
2018-12-21,246.740005,249.710007,239.979996,240.699997,255345600,240.699997
2018-12-24,239.039993,240.839996,234.270004,234.339996,147311600,234.339996
2018-12-26,235.970001,246.179993,233.759995,246.179993,218485400,246.179993
2018-12-27,242.570007,248.289993,238.960007,248.070007,186267300,248.070007
2018-12-28,249.580002,251.399994,246.449997,247.75,153100200,247.75
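
Notice that the row for `2018-12-28` is part of the result: label-based slices with `.loc` include both endpoints, whereas position-based slices with `.iloc` exclude the end position. Here is a minimal sketch of the difference, using a hypothetical stand-in `df_demo` that holds only the `adjusted` column of the rows above.

```python
import pandas as pd

# Hypothetical stand-in for df_spy: the adjusted column for the last five dates.
df_demo = pd.DataFrame(
    {"adjusted": [240.699997, 234.339996, 246.179993, 248.070007, 247.75]},
    index=pd.to_datetime(
        ["2018-12-21", "2018-12-24", "2018-12-26", "2018-12-27", "2018-12-28"]
    ),
)

by_label = df_demo.loc["2018-12-24":"2018-12-27"]  # both end labels are included
by_position = df_demo.iloc[1:3]                    # position 3 is excluded

print(len(by_label))     # 3
print(len(by_position))  # 2
```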


If we want to select only the `volume` and `adjusted` columns for these dates, we would type the following: 

In [14]:
df_spy.loc['2018-12-21':'2018-12-28', ['volume', 'adjusted']]

date,volume,adjusted
2018-12-21,255345600,240.699997
2018-12-24,147311600,234.339996
2018-12-26,218485400,246.179993
2018-12-27,186267300,248.070007
2018-12-28,153100200,247.75
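
A few other `.loc` patterns may be useful with a date index. The sketch below is illustrative only and uses a hypothetical stand-in `df_demo` built from three of the rows above. It shows a single-cell lookup, a whole-column selection, and selecting every row in a month by passing a partial date string.

```python
import pandas as pd

# Hypothetical stand-in for df_spy: two columns for three of the dates above.
df_demo = pd.DataFrame(
    {
        "volume": [255345600, 147311600, 218485400],
        "adjusted": [240.699997, 234.339996, 246.179993],
    },
    index=pd.to_datetime(["2018-12-21", "2018-12-24", "2018-12-26"]),
)

vol = df_demo.loc["2018-12-24", "volume"]  # one row label, one column label
volume_col = df_demo.loc[:, "volume"]      # every row, one column -> Series
december = df_demo.loc["2018-12"]          # partial string: all rows in that month

print(vol)            # 147311600
print(len(december))  # 3
```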


## Related Reading

*PDSH* - 2.6 - Comparisons, Masks, and Boolean Logic

*PDSH* - 2.7 - Fancy Indexing

*PDSH* - 3.2 - Data Indexing and Selection 