<a href="https://colab.research.google.com/github/gt-cse-6040/bootcamp/blob/main/Module%201/Session%201/s16nb1_pandas_loc_iloc_SP25.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Pandas Selection on DataFrames using .loc[ ] and .iloc[ ]

#### Historically, significant numbers of students have struggled with these two Pandas DataFrame attributes.

#### Specifically, when to use which one.

#### This notebook introduces the two attributes and gives some general guidance on when to use each.

Pandas DataFrames have the special attributes `loc` and `iloc` for label-based and integer-based indexing, respectively.

Since the DataFrame is two-dimensional, you can select a subset of the rows and columns with NumPy-like notation using either axis labels (`loc`) or integer positions (`iloc`).

**The cell below simply loads the required modules into the notebook.**

In [None]:
import pandas as pd  # Standard idiom for loading pandas
from pandas import DataFrame, Series
import numpy as np

### Create a dataframe to work with.

In [None]:
# create a DataFrame
courses = DataFrame({'course': ['ISYE6501', 'CSE6040', 'MGT6203', 'ISYE6740', 'ISYE6644', 'CSE6242'],
                     'students': [1200, 1400, 1000, 400, 700, 900],
                     'instructor': ['Sokol', 'Vuduc', 'Bien', 'Xie', 'Goldsman', 'Chau'],
                     'credit_hours': ['3', '3', '3', '3', '3', '3']})

### Now let's look at four ways of indexing rows in the dataframe.

There are other methods, but these are the main four that you will see, both in this course and "in the wild".

#### First is the default method.

The index is integer-based, with the first row at index `0` and the indexes running up to the length of the DataFrame minus 1. Students should be familiar with this indexing scheme, as it is the same one used by Python lists.

Notice in the view below, there is no column name above the index, and the indices are numbered 0-5, for the 6 rows.

In [None]:
courses
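To confirm that the default index is nothing more than the positions 0 through the length minus 1, you can inspect the `index` attribute directly. The cell below is a minimal sketch; it rebuilds a shortened, illustrative copy of the frame (named `demo` here, not part of the original notebook) so that it runs on its own.

```python
import pandas as pd

# Shortened, illustrative copy of the courses frame so this cell is self-contained
demo = pd.DataFrame({'course': ['ISYE6501', 'CSE6040', 'MGT6203'],
                     'students': [1200, 1400, 1000]})

print(demo.index)         # the default index is a RangeIndex
print(demo.index.name)    # the default index has no name
print(list(demo.index))   # labels run from 0 up to len(demo) - 1
```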

#### The second method is when the index is the same as one of the columns. The index has the same name as the column from which it is derived.

The column is still a part of the dataframe, which means the index is duplicated by the column. See the first example below.

In [None]:
courses_name_index_1 = courses.copy()
courses_name_index_1.index = courses_name_index_1['course']
courses_name_index_1

#### The third method is when the index is the same as one of the columns, but it does not have a name.

The column is still a part of the dataframe, which means the index is duplicated by the column, but without the column name. See the example below.

In [None]:
courses_name_index_2 = courses.copy()
courses_name_index_2.index = courses_name_index_2['course']

# remove the index name
courses_name_index_2.index.name = None
courses_name_index_2

#### The final method is when the index retains the column name from which it is derived, and the column itself is removed.

See the example below.

In [None]:
courses_name_index_3 = courses.copy()
courses_name_index_3.index = courses_name_index_3['course']
courses_name_index_3 = courses_name_index_3.drop('course', axis=1)
courses_name_index_3
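The three column-based indexes above were built by assigning to `.index`. As an aside, pandas also provides `set_index`, which can produce the same three layouts in one call each. This is a sketch of an alternative, not the approach this notebook uses; the frame `demo` and the variable names are illustrative.

```python
import pandas as pd

# Shortened, illustrative copy of the courses frame
demo = pd.DataFrame({'course': ['ISYE6501', 'CSE6040', 'MGT6203'],
                     'students': [1200, 1400, 1000]})

# Second method: index named 'course', column kept
kept = demo.set_index('course', drop=False)

# Third method: same as above, with the index name removed
unnamed = demo.set_index('course', drop=False).rename_axis(None)

# Final method: index named 'course', column removed (the set_index default)
dropped = demo.set_index('course')

print(kept.index.name, list(kept.columns))
print(unnamed.index.name, list(unnamed.columns))
print(dropped.index.name, list(dropped.columns))
```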

## Let's look at how .loc[ ] and .iloc[ ] work with two of these scenarios: the default index, and an index taken from a column (`courses_name_index_2`).

## The other column-based scenarios are not materially different, so we leave those for the students to work through on their own.

### From above, we know that `.loc[ ]` selects rows (and columns) using the `axis labels`.

### Also from above, we know that `.iloc[ ]` selects rows (and columns) using `integers`, which are the `index positions` of the row or column.

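#### When the index is the default (integers with no name), `.loc[ ]` and `.iloc[ ]` return the same rows for a single value or a list of values.

This is because each row's integer position is the same as its index label. (Slices are the exception: a `.loc[ ]` slice includes its end label, while an `.iloc[ ]` slice excludes its end position.)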

The result of selecting a single row is a Series with an index that contains the DataFrame's column labels.

In [None]:
courses

In [None]:
courses.loc[0]

In [None]:
courses.iloc[0]
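The cell below is a small sketch of what "a Series with an index that contains the DataFrame's column labels" means in practice. It uses a shortened, illustrative copy of the frame (`demo`) so that it runs on its own.

```python
import pandas as pd

# Shortened, illustrative copy of the courses frame
demo = pd.DataFrame({'course': ['ISYE6501', 'CSE6040', 'MGT6203'],
                     'students': [1200, 1400, 1000],
                     'instructor': ['Sokol', 'Vuduc', 'Bien']})

row = demo.loc[0]

print(type(row).__name__)   # a single row comes back as a Series
print(list(row.index))      # the Series index holds the column labels of demo
print(row['instructor'])    # so each field is looked up by column label
print(row.name)             # the Series name is the row's index label
```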

#### To select multiple rows, creating a new DataFrame, pass a sequence of labels (for `.loc[ ]`) or positions (for `.iloc[ ]`).

Notice that the sequence is enclosed in its own set of brackets.

In [None]:
courses.loc[[0,1]]

In [None]:
courses.iloc[[0,1]]
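The two attributes agree here only because each integer label still sits at the matching position. Once the rows are reordered, the labels travel with the rows and the two lookups diverge. The cell below is a minimal sketch of that, using a shortened, illustrative copy of the frame (`demo`) and a sort on `students`.

```python
import pandas as pd

# Shortened, illustrative copy of the courses frame
demo = pd.DataFrame({'course': ['ISYE6501', 'CSE6040', 'MGT6203'],
                     'students': [1200, 1400, 1000]})

# Sorting keeps each row's original label, so the index order becomes 2, 0, 1
by_size = demo.sort_values('students')
print(list(by_size.index))

print(by_size.loc[0, 'course'])    # the row whose label is 0
print(by_size.iloc[0]['course'])   # the row in the first position

# Resetting the index makes labels and positions line up again
realigned = by_size.reset_index(drop=True)
print(realigned.loc[0, 'course'] == realigned.iloc[0]['course'])
```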

### Now let's look at the column-based index, using `courses_name_index_2`.

In [None]:
courses_name_index_2

In [None]:
# raises a KeyError because 0 is not one of the index labels; uncomment to see the error
# courses_name_index_2.loc[0]

In [None]:
courses_name_index_2.iloc[0]

#### As we can see in this scenario, `.iloc[ ]` returns the correct row, even though no row has the index label `0`.

#### This is because `.iloc[ ]` is based on the position of the row in the DataFrame. So we are asking for the first row (position 0), and it returns the first row.
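If you would rather see the failure without stopping the notebook, the lookup can be wrapped in `try`/`except`. The cell below is a sketch using a shortened, illustrative copy of the frame (`demo`), indexed the same way as `courses_name_index_2`.

```python
import pandas as pd

# Shortened, illustrative copy of the courses frame, indexed by course with no index name
demo = pd.DataFrame({'course': ['ISYE6501', 'CSE6040', 'MGT6203'],
                     'students': [1200, 1400, 1000]})
demo.index = demo['course']
demo.index.name = None

try:
    demo.loc[0]                  # 0 is not one of the index labels
    outcome = 'found'
except KeyError as err:
    outcome = f'KeyError: {err}'

print(outcome)
print(demo.iloc[0]['course'])    # position 0 exists regardless of the labels
```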

#### What is the takeaway?

### Using `.iloc[ ]` will always return the row at that 0-based position in the DataFrame, no matter what the index contains or what the index is named.

### To use `.loc[ ]`, we have to address the row by its `index label`.

In [None]:
courses_name_index_2.loc['ISYE6501']

### Selecting multiple rows works the same way in each scenario: pass a list, enclosed in its own brackets.

In [None]:
courses_name_index_2.iloc[[0,1,2]]

In [None]:
courses_name_index_2.loc[['ISYE6501','CSE6040','MGT6203']]

### What if we want to return a subset of rows and a subset of columns?

#### You can combine both row and column selection in `.loc[ ]` and `.iloc[ ]` by separating the selections with a comma.

In [None]:
courses_name_index_2.iloc[[0,1,2],[0,2]]

In [None]:
courses_name_index_2.loc[['ISYE6501','CSE6040','MGT6203'],['course','instructor']]

### You can use slicing notation with both methods.

#### Note that with the slicing notation, you have both the rows and columns within the same/single set of brackets, separated by the comma.

In [None]:
# return the first 3 rows and the first 3 columns
courses_name_index_2.iloc[0:3,0:3]

In [None]:
# return all of the rows and the first 3 columns
courses_name_index_2.iloc[:,0:3]

In [None]:
# return the first 3 rows and the first 3 columns (label slices include both endpoints)
courses_name_index_2.loc['ISYE6501':'MGT6203','course':'instructor']

In [None]:
# return all of the rows and the first 3 columns
courses_name_index_2.loc[:,'course':'instructor']
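One detail of slicing is easy to miss: an `.iloc[ ]` slice stops before its end position, as with Python lists, while a `.loc[ ]` slice includes its end label. The cell below is a minimal sketch of the difference, using a shortened, illustrative copy of the frame (`demo`).

```python
import pandas as pd

# Shortened, illustrative copy of the courses frame
demo = pd.DataFrame({'course': ['ISYE6501', 'CSE6040', 'MGT6203', 'ISYE6740'],
                     'students': [1200, 1400, 1000, 400]})

# Default integer index: the same slice text selects different numbers of rows
print(len(demo.iloc[0:2]))   # positions 0 and 1
print(len(demo.loc[0:2]))    # labels 0, 1, and 2

# String index: the end label is part of the result
labeled = demo.copy()
labeled.index = labeled['course']
labeled.index.name = None
print(list(labeled.loc['ISYE6501':'MGT6203'].index))
```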

### The decision of when to use one or the other is completely up to the student.

#### -- If you are selecting columns, `.loc[ ]` is almost universally preferred, because you are going to be selecting the columns you want by their name.

#### -- If you are selecting rows, and the row indexes are strings, then `.loc[ ]` is generally going to be better, because you are referencing the rows by their names.

#### -- If the rows are indexed by the default integer method, then either is fine for a single row or a list of rows, as they will return the same result. Slices differ, because `.loc[ ]` includes the end label and `.iloc[ ]` does not include the end position.

#### -- Finally, if you want to select some count/subset of rows, then using `.iloc[ ]` is probably the best. For example, if you want the first 500 rows, then you can use slicing with the integer positions -- `df.iloc[0:500]` or `df.iloc[0:500,:]`
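Because an `.iloc[ ]` slice excludes its stop position, the number of rows it returns is the stop minus the start. The cell below is a sketch that checks this on a hypothetical 1,000-row frame (`big`, made up for illustration), and also shows one way to take rows by position and columns by name in a single selection.

```python
import numpy as np
import pandas as pd

# Hypothetical 1,000-row frame, used only to count rows
big = pd.DataFrame({'row_id': np.arange(1000), 'value': np.arange(1000) * 2})

first_500 = big.iloc[0:500]            # positions 0 through 499
print(len(first_500))
print(first_500['row_id'].iloc[-1])    # row_id of the last row in the slice

# Rows by position, columns by name:
# turn the positions into labels with big.index, then select with .loc[ ]
top3 = big.loc[big.index[:3], ['value']]
print(top3)
```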



## What are your questions concerning `.loc[ ]` and `.iloc[ ]`?