<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Setting-a-column-as-the-index" data-toc-modified-id="Setting-a-column-as-the-index-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Setting a column as the index</a></span></li><li><span><a href="#Why-is-custom-indexing-important?" data-toc-modified-id="Why-is-custom-indexing-important?-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>Why is custom indexing important?</a></span></li><li><span><a href="#Removing/Resetting-Custom-Index" data-toc-modified-id="Removing/Resetting-Custom-Index-3"><span class="toc-item-num">3&nbsp;&nbsp;</span>Removing/Resetting Custom Index</a></span></li><li><span><a href="#Multi-level-/-Hierarchical-Index" data-toc-modified-id="Multi-level-/-Hierarchical-Index-4"><span class="toc-item-num">4&nbsp;&nbsp;</span>Multi-level / Hierarchical Index</a></span></li><li><span><a href="#Problems-with-using-column-data-as-index" data-toc-modified-id="Problems-with-using-column-data-as-index-5"><span class="toc-item-num">5&nbsp;&nbsp;</span>Problems with using column data as index</a></span></li></ul></div>

In [1]:
import pandas as pd

In [2]:
# Data Frame from cars.csv
cars = pd.read_csv('../datasets/cars.csv') # We omitted index_col=0 here to make a point
cars

,Unnamed: 0,cars_per_cap,country,drives_right
0,US,809,United States,True
1,AUS,731,Australia,False
2,JPN,588,Japan,False
3,IN,18,India,False
4,RU,200,Russia,True
5,MOR,70,Morocco,True
6,EG,45,Egypt,True
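
The comment in the cell above mentions `index_col=0`. As a minimal sketch of what that option changes, the example below reads a small in-memory stand-in for `cars.csv` both ways. The CSV text is an assumption re-typed from the first rows of the output above, not the real file.

```python
import io
import pandas as pd

# Stand-in for cars.csv (assumption: the real file has this layout,
# with an unnamed first column holding the country codes)
csv_text = """,cars_per_cap,country,drives_right
US,809,United States,True
AUS,731,Australia,False
JPN,588,Japan,False
"""

# Without index_col, the unnamed column is imported as a regular column
plain = pd.read_csv(io.StringIO(csv_text))
print(plain.columns.tolist())

# With index_col=0, the first column becomes the index at import time
indexed = pd.read_csv(io.StringIO(csv_text), index_col=0)
print(indexed.index.tolist())
```

With `index_col=0` there is no `Unnamed: 0` column left to rename, which is why the notebook omits it to demonstrate the post-import route.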


## Setting a column as the index

If we want to use an existing column as the index of the dataframe, we can set it after import with `set_index()`.

In [3]:
# Rename the column to be used as index first
cars.rename(columns={'Unnamed: 0':'idx'}, inplace=True)
# cars.columns = ["idx", "cars_per_cap", "country", "drives_right"]

<span class="mark">**It is better to assign the result of `set_index()` to a new dataframe than to modify the original in place**</span>

In [4]:
# Then, assign it as the new index
# cars.set_index("idx", drop=True, inplace=True) # in-place
cars_with_idx = cars.set_index("idx", drop=True)
cars_with_idx.head()

idx,cars_per_cap,country,drives_right
US,809,United States,True
AUS,731,Australia,False
JPN,588,Japan,False
IN,18,India,False
RU,200,Russia,True


**Note**: The values in the new index column do not need to be unique, but when subsetting with any duplicate index, all duplicate rows will be returned.
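
A minimal sketch of the note above, using a hypothetical toy frame (not the cars data) in which the label `"US"` appears twice:

```python
import pandas as pd

# Hypothetical toy data with a repeated label, for illustration only
df = pd.DataFrame(
    {"idx": ["US", "US", "JPN"], "value": [1, 2, 3]}
).set_index("idx")

# False here, because "US" occurs more than once
print(df.index.is_unique)

# Selecting the duplicated label returns every matching row
us_rows = df.loc[["US"]]
print(us_rows)
```

Checking `index.is_unique` before relying on label lookups is one way to avoid being surprised by extra rows.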

## Why is custom indexing important?

A custom index lets us select rows by their labels with `.loc[]` instead of by position or with a boolean filter.

In [5]:
cars_with_idx.loc[['US', 'JPN']]

idx,cars_per_cap,country,drives_right
US,809,United States,True
JPN,588,Japan,False
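
`.loc[]` also accepts column labels as a second argument. The sketch below rebuilds a few rows of the frame by hand (an assumption, re-typed from the output above) so that it runs without the CSV file.

```python
import pandas as pd

# First rows re-typed from the output above (assumed to match cars.csv)
cars = pd.DataFrame({
    "idx": ["US", "AUS", "JPN", "IN"],
    "cars_per_cap": [809, 731, 588, 18],
    "country": ["United States", "Australia", "Japan", "India"],
    "drives_right": [True, False, False, False],
})
cars_with_idx = cars.set_index("idx")

# One row label plus one column label gives a single value
us_cars = cars_with_idx.loc["US", "cars_per_cap"]

# Lists of row and column labels give a smaller dataframe
subset = cars_with_idx.loc[["US", "JPN"], ["country", "drives_right"]]
print(us_cars)
print(subset)
```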


## Removing/Resetting Custom Index

We can remove a custom index and go back to Pandas' default integer index with `reset_index()`. Passing `drop=True` discards the old index values instead of keeping them as a column.

In [6]:
cars_with_idx.reset_index(drop=True, inplace=True)
cars_with_idx.head()

,cars_per_cap,country,drives_right
0,809,United States,True
1,731,Australia,False
2,588,Japan,False
3,18,India,False
4,200,Russia,True
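
The cell above uses `drop=True`, so the `idx` values are gone afterwards. As a small sketch of the difference, the example below compares both forms on a hand-built frame (an assumption, re-typed from the earlier output).

```python
import pandas as pd

# First rows re-typed from the earlier output (assumed to match cars.csv)
cars_with_idx = pd.DataFrame({
    "idx": ["US", "AUS", "JPN"],
    "cars_per_cap": [809, 731, 588],
    "country": ["United States", "Australia", "Japan"],
}).set_index("idx")

# Default: the old index comes back as a regular column
restored = cars_with_idx.reset_index()

# drop=True: the old index is discarded
dropped = cars_with_idx.reset_index(drop=True)

print(restored.columns.tolist())
print(dropped.columns.tolist())
```

Both results are new dataframes, so `cars_with_idx` keeps its custom index here.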


## Multi-level / Hierarchical Index

A dataframe can use several columns as its index, which creates a multi-level (hierarchical) index.

In [7]:
# Use two columns as a multi-level index
cars_with_idx.set_index(["cars_per_cap", "country"], drop=True, inplace=True)
cars_with_idx.head()

cars_per_cap,country,drives_right
809,United States,True
731,Australia,False
588,Japan,False
18,India,False
200,Russia,True


To subset rows by the outer level only, pass a list of outer-level keys to `.loc[]`

In [8]:
cars_with_idx.loc[[809, 18]]

cars_per_cap,country,drives_right
809,United States,True
18,India,False


- To subset on several index levels at once, pass a list of tuples, where each tuple holds one key per level
- A row is returned only if every key in a tuple matches

In [9]:
cars_with_idx.loc[[(809, "United States"), (18, "India")]]

cars_per_cap,country,drives_right
809,United States,True
18,India,False
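
A single tuple (rather than a list of tuples) selects one row, and a column label can follow it. The sketch below assumes a hand-built copy of the data, re-typed from the output above.

```python
import pandas as pd

# Rows re-typed from the output above (assumed to match cars.csv)
cars = pd.DataFrame({
    "cars_per_cap": [809, 731, 588, 18],
    "country": ["United States", "Australia", "Japan", "India"],
    "drives_right": [True, False, False, False],
})
multi = cars.set_index(["cars_per_cap", "country"])

# One full tuple selects a single row, returned as a Series
row = multi.loc[(809, "United States")]

# A tuple plus a column label selects a single value
value = multi.loc[(18, "India"), "drives_right"]

print(row)
print(value)
```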


By default, `sort_index()` sorts a multi-level index from the outer level to the inner level, in ascending order

In [10]:
cars_with_idx.sort_index()

cars_per_cap,country,drives_right
18,India,False
45,Egypt,True
70,Morocco,True
200,Russia,True
588,Japan,False
731,Australia,False
809,United States,True


We can also choose which levels to sort by and the direction for each one

In [11]:
# Sort by country (ascending) first, then by cars_per_cap (descending)
cars_with_idx.sort_index(level=["country", "cars_per_cap"], ascending=[True, False])

cars_per_cap,country,drives_right
731,Australia,False
45,Egypt,True
18,India,False
588,Japan,False
70,Morocco,True
200,Russia,True
809,United States,True
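
The two sorting calls can be checked side by side. This is a sketch on a hand-built copy of the data (an assumption, re-typed from the output above); it also shows a single-level descending sort that the notebook does not run.

```python
import pandas as pd

# Rows re-typed from the output above (assumed to match cars.csv)
cars = pd.DataFrame({
    "cars_per_cap": [809, 731, 588, 18, 200, 70, 45],
    "country": ["United States", "Australia", "Japan", "India",
                "Russia", "Morocco", "Egypt"],
    "drives_right": [True, False, False, False, True, True, True],
})
multi = cars.set_index(["cars_per_cap", "country"])

# Default: outer level (cars_per_cap) first, ascending
default_sorted = multi.sort_index()

# Explicit levels and one direction per level, as in the cell above
by_country = multi.sort_index(
    level=["country", "cars_per_cap"], ascending=[True, False]
)

# A single level can be named as well (not part of the original notebook)
by_country_desc = multi.sort_index(level="country", ascending=False)

print(by_country.index.get_level_values("country").tolist())
```

Because every country appears once in this data, the second sort key never has to break a tie; it only matters when the first level contains repeated values.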


## Problems with using column data as index

- Makes the dataframe harder to reason about, because index values are data stored differently from the other columns
- Violates the "tidy data" principles
- You need to learn two syntaxes: one for `iloc` and one for `loc`

If you decide you don't want to use custom indexes, that is perfectly reasonable. Simply use the default index. It is still good to know how custom indexes work, because that helps when reading other people's code.

In [12]:
cars = pd.read_csv('../datasets/cars.csv')

# Rename the unnamed column
cars.rename(columns={'Unnamed: 0':'country_abbr'}, inplace=True)
cars.head()

,country_abbr,cars_per_cap,country,drives_right
0,US,809,United States,True
1,AUS,731,Australia,False
2,JPN,588,Japan,False
3,IN,18,India,False
4,RU,200,Russia,True
