# Tutorial 04 - `DataFrame` Indexing and Slicing

While performing data analysis, you often need to access specific rows of a `DataFrame`. This is referred to as *indexing*. If the rows you are accessing are contiguous, it is referred to as *slicing*.

The purpose of this tutorial is to survey various methods for indexing and slicing.

## Importing Packages

Let's begin by importing the packages that we will be using in this tutorial.

In [None]:
##> import numpy as np
##> import pandas as pd



## Reading-In Sample Data

Next, we'll use the `read_csv()` function from `pandas` to read-in some sample data to work with.  We'll again use the SPY prices from December 2018.

In [None]:
##> df_spy = pd.read_csv("../data/spy_dec_2018.csv")
##> df_spy



It is often useful to look at the data type of each of the columns of a new data set.

That is what the following line of code accomplishes:

In [None]:
##> df_spy.dtypes



- Notice that the `date` column has a `dtype` of `object`.
- This means that `pandas` is interpreting it as a text column, rather than a date.
- Formatting problems with dates are very common in data analysis.
- We'll address this issue later in the tutorial.

## Simple Row Slicing

The simplest way to slice a `DataFrame` is to use square brackets.  The syntax `df[i:j]` will generate a `DataFrame` whose first row is the `i`th row of `df` and whose last row is the `(j-1)`th row of `df`, counting rows from 0.

Let's demonstrate with a couple of examples:

In [None]:
##> df_spy[0:1] # starting from the 0th row, and ending with the 0th row
##> df_spy[3:7] # starting with the 3rd row, and ending with the 6th row
##> df_spy # full dataframe
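The half-open behavior of `df[i:j]` can be checked on a small stand-in `DataFrame`. This is a minimal sketch: `df_toy` and its values are made up for illustration and are not the SPY data.

```python
import pandas as pd

# Hypothetical toy data for illustration only; the values are made up.
df_toy = pd.DataFrame({"close": [10.0, 11.0, 12.0, 13.0, 14.0]})

# Rows 1, 2, and 3: the start position is included, the stop position is not.
df_slice = df_toy[1:4]

print(len(df_slice))            # 3, which is j - i
print(list(df_slice["close"]))  # [11.0, 12.0, 13.0]
```

A slice `df[i:j]` therefore contains `j - i` rows whenever both positions fall inside the `DataFrame`.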




## `DataFrame` Index

Under the hood of `pandas`, a `DataFrame` has several `index` attributes associated with it, which help to organize the data and keep it internally consistent.  The following are worth mentioning:

`columns` - the set of column names is an (explicit) index.

`row` - whenever a `DataFrame` is created, an explicit row index is created with it.  If one isn't specified, then a sequence of consecutive non-negative integers starting at 0 is used.

`implicit` - each row has an implicit row-number, and each column has an implicit column-number.

Let's take a look at the `columns` index of `df_spy`:

In [None]:
##> df_spy.columns
##> type(df_spy.columns)



Next, let's take a look at the explicit row `index` attribute of `df_spy`: 

In [None]:
##> df_spy.index
##> type(df_spy.index)



Since we didn't specify one when reading in the data, a `RangeIndex` object is used for the explicit row `index`.  You can think of a `RangeIndex` object as a glorified set of consecutive non-negative integers that start at 0.
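The default row index can be seen on any `DataFrame` built without one. The sketch below uses a hypothetical `df_toy` with made-up values rather than `df_spy`.

```python
import pandas as pd

# Hypothetical toy data; no index is passed, so pandas supplies a default one.
df_toy = pd.DataFrame({"close": [10.0, 11.0, 12.0]})

idx = df_toy.index
print(type(idx).__name__)  # RangeIndex
print(list(idx))           # [0, 1, 2]
```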

We won't be too concerned with `index` attributes at the moment.  A lot of data analysis can be done without worrying about them.  However, it's good to be aware that `indexes` exist because they come into play for more advanced topics.

The reason I mention `indexes` now is that they are related to two built-in `DataFrame` attributes that are used for indexing and slicing data: `DataFrame.iloc` and `DataFrame.loc`.

## Indexing with `DataFrame.iloc`

The indexer attribute `DataFrame.iloc` can be used to access rows and columns using their implicit row and column numbers.

Here is an example of `iloc` that retrieves the first two rows of `df_spy`:

In [None]:
##> df_spy.iloc[0:2]



Notice that, because no column numbers were specified, all the columns are retrieved.

The following code grabs just the first three columns of the first two rows of `df_spy`:

In [None]:
##> df_spy.iloc[0:2, 0:3]



We can also supply `.iloc` with lists rather than ranges to specify custom sets of columns and rows:

In [None]:
##> lst_row = [0, 2] # first and third rows
##> lst_col = [0, 6] # date and adjusted columns
##> df_spy.iloc[lst_row, lst_col]




Using lists as a means of indexing is sometimes referred to as *fancy indexing*.
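Lists and ranges can also be combined in a single `.iloc` call. The following sketch assumes a hypothetical three-column `df_toy` with made-up values; the column positions differ from those of `df_spy`.

```python
import pandas as pd

# Hypothetical toy data for illustration only; the values are made up.
df_toy = pd.DataFrame({
    "date": ["2018-12-03", "2018-12-04", "2018-12-06", "2018-12-07"],
    "close": [10.0, 11.0, 12.0, 13.0],
    "volume": [100, 200, 300, 400],
})

# Lists for both axes: rows 0 and 2, columns 0 (date) and 2 (volume).
df_fancy = df_toy.iloc[[0, 2], [0, 2]]

# A range for the rows (positions 1 and 2) and a list for the columns.
df_mixed = df_toy.iloc[1:3, [1]]

print(list(df_fancy.columns))  # ['date', 'volume']
print(df_mixed.shape)          # (2, 1)
```

Passing a list such as `[1]` for the columns keeps the result a `DataFrame`; passing the bare integer `1` would return a `Series` instead.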

## Indexing with `DataFrame.loc`

Rather than using the implicit row or column numbers, it is often more useful to access data by using the explicit row or column indices.



As a preliminary step, let's first recast the `date` column as a `datetime`.  This can be done easily by using the built-in `pandas` function `to_datetime()`:

In [None]:
##> df_spy['date'] = pd.to_datetime(df_spy['date']) 
##> df_spy.dtypes
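One way to confirm that a conversion like this worked is to test the column's type directly instead of reading the `dtypes` output by eye. This is a sketch on a hypothetical `df_toy` with made-up values; `pd.api.types.is_datetime64_any_dtype` is a `pandas` helper that the tutorial itself does not use.

```python
import pandas as pd

# Hypothetical toy data; the dates start out as plain text.
df_toy = pd.DataFrame({
    "date": ["2018-12-03", "2018-12-04"],
    "close": [10.0, 11.0],
})

is_dt_before = pd.api.types.is_datetime64_any_dtype(df_toy["date"])
df_toy["date"] = pd.to_datetime(df_toy["date"])
is_dt_after = pd.api.types.is_datetime64_any_dtype(df_toy["date"])

print(is_dt_before, is_dt_after)      # False True
print(df_toy["date"].dt.day.tolist()) # [3, 4]
```

Once the column is a `datetime`, date components such as the day are available through the `.dt` accessor.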



Next, we'll use the `DataFrame.set_index()` method to set the `date` column as our new index.  We are doing this because we ultimately want to use our dates as an index to access the data.

In [None]:
##> df_spy.set_index('date', inplace = True)
##> df_spy.head()



To see the effect of the above code, we can have a look at the `index` of `df_spy`.  Notice that `date` is no longer a column:

In [None]:
##> df_spy.index
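As an alternative to `inplace = True`, `set_index()` can return a new `DataFrame` and leave the original untouched, and `reset_index()` moves the index back into a column. The sketch below shows this on a hypothetical `df_toy` with made-up values; it is not the approach used in this tutorial.

```python
import pandas as pd

# Hypothetical toy data for illustration only; the values are made up.
df_toy = pd.DataFrame({
    "date": pd.to_datetime(["2018-12-03", "2018-12-04"]),
    "close": [10.0, 11.0],
})

# Without inplace, set_index() returns a new DataFrame; df_toy is unchanged.
df_indexed = df_toy.set_index("date")

# reset_index() turns the index back into an ordinary column.
df_restored = df_indexed.reset_index()

print("date" in df_toy.columns)      # True
print("date" in df_indexed.columns)  # False
print(list(df_restored.columns))     # ['date', 'close']
```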



Now that we have successfully set the row `index` of `df_spy` to be the `date` column, let's see how we can use this `index` to access the data via `.loc`.
        
Here is an example of how we can grab a slice of rows, associated with a date-range:

In [None]:
##> df_spy.loc['2018-12-21':'2018-12-28']



If we want to select only the `volume` and `adjusted` columns for these dates, we would type the following: 

In [None]:
##> df_spy.loc['2018-12-21':'2018-12-28', ['volume', 'adjusted']]
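One difference from `.iloc` is worth checking: a `.loc` slice includes both of its endpoints, whereas an `.iloc` slice excludes its stop position. The sketch below demonstrates this on a hypothetical date-indexed `df_toy` with made-up values, not the SPY data.

```python
import pandas as pd

# Hypothetical toy data for illustration only; the values are made up.
dates = pd.to_datetime(["2018-12-03", "2018-12-04", "2018-12-06", "2018-12-07"])
df_toy = pd.DataFrame(
    {"close": [10.0, 11.0, 12.0, 13.0], "volume": [100, 200, 300, 400]},
    index=dates,
)
df_toy.index.name = "date"

# Label-based: both the start date and the end date are included.
by_label = df_toy.loc["2018-12-04":"2018-12-06"]

# Position-based: position 1 is included, position 3 is excluded.
by_position = df_toy.iloc[1:3]

print(len(by_label))                 # 2
print(by_label.equals(by_position))  # True

# Restricting the columns works the same way as in the tutorial's example.
df_volume = df_toy.loc["2018-12-04":"2018-12-06", ["volume"]]
print(list(df_volume["volume"]))     # [200, 300]
```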



## Related Reading

*PDSH* - 2.6 - Comparisons, Masks, and Boolean Logic

*PDSH* - 2.7 - Fancy Indexing

*PDSH* - 3.2 - Data Indexing and Selection 