# __Slicing and Indexing DataFrames__

# Outline
- [1 Explicit Indexes](#exp-ind)
- [&nbsp;&nbsp; 1.1 Setting and removing indexes](#set-rm-ind)
- [&nbsp;&nbsp; 1.2 Subsetting with .loc[]](#sub-loc)
- [&nbsp;&nbsp; 1.3 Setting multi-level indexes](#sub-multi-lvl-ind)
- [&nbsp;&nbsp; 1.4 Sorting by index values](#sortby-ind-val)
- [2 Slicing and subsetting with .loc and .iloc](#slice-sub-loc-iloc)
- [&nbsp;&nbsp; 2.1 Slicing index values](#slice-ind-vals)
- [&nbsp;&nbsp; 2.2 Slicing in both directions](#slice-directions)
- [&nbsp;&nbsp; 2.3 Slicing time series](#slice-time)
- [&nbsp;&nbsp; 2.4 Subsetting by row/column number](#sub-row-col)
- [3 Working with pivot tables](#pvt-tbl)
- [&nbsp;&nbsp; 3.1 Pivot temperatures by city and year](#pvt-temp)
- [&nbsp;&nbsp; 3.2 Subsetting pivot tables](#sub-pvt-tbl)
- [&nbsp;&nbsp; 3.3 Calculating on a pivot table](#calc-pvt-tbl)

<a id="exp-ind"></a>
# 1 Explicit Indexes
<div align="middle">
<video width="60%" controls>
      <source src="./../../res/videos/3.slicing-indexing-dataframes/1.explicit_indexes.mp4" type="video/mp4">
</video></div>

<a id="set-rm-ind"></a>
## 1.1 Setting and removing indexes
pandas allows you to designate columns as an index. This enables cleaner code when taking subsets (as well as providing more efficient lookup under some circumstances).

In this chapter, you'll be exploring temperatures, a DataFrame of average temperatures in cities around the world.

In [None]:
import pandas as pd

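# Parse "date" as datetime so that date-based slicing with .loc[] works later on
temperatures = pd.read_csv("./../../data/temperatures.csv", index_col=0, parse_dates=["date"])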

Set the index of temperatures to "city", assigning to temperatures_ind.

In [None]:
temperatures_ind = temperatures.set_index("city")
temperatures_ind

Look at temperatures_ind. How is it different from temperatures?

Reset the index of temperatures_ind, keeping its contents.

In [None]:
temperatures_ind.reset_index()

Reset the index of temperatures_ind, dropping its contents.

In [None]:
temperatures_ind.reset_index(drop=True)
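
The three calls above can be tried on a small stand-in table. This is a minimal sketch: the `toy` DataFrame and its values are made up for illustration and are not taken from the temperatures file.

```python
import pandas as pd

# Hypothetical toy data standing in for the temperatures file.
toy = pd.DataFrame({
    "city": ["Cairo", "Lahore", "Moscow"],
    "country": ["Egypt", "Pakistan", "Russia"],
    "avg_temp_c": [22.5, 24.1, 5.8],
})

toy_ind = toy.set_index("city")            # "city" moves from the columns into the index
restored = toy_ind.reset_index()           # "city" comes back as an ordinary column
dropped = toy_ind.reset_index(drop=True)   # "city" is discarded entirely

print(toy_ind.columns.tolist())
print(restored.columns.tolist())
print(dropped.columns.tolist())
```

Each call returns a new DataFrame and leaves its input unchanged, so the result has to be assigned if you want to keep it.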

<a id="sub-loc"></a>
## 1.2 Subsetting with .loc[]
The killer feature for indexes is .loc[]: a label-based indexer that accepts index values. When you pass it a single argument, it takes a subset of rows.

The code for subsetting using .loc[] can be easier to read than standard square bracket subsetting, which can make your code less burdensome to maintain.

Create a list called cities that contains "Moscow" and "Saint Petersburg".

In [None]:
cities = ["Moscow", "Saint Petersburg"]

Use [] subsetting to filter temperatures for rows where the city column takes a value in the cities list.

In [None]:
temperatures[temperatures["city"].isin(cities)]

Use .loc[] subsetting to filter temperatures_ind for rows where the city is in the cities list.

In [None]:
temperatures_ind.loc[cities]
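
The two approaches select the same rows but do not order them the same way. The sketch below uses a made-up `toy` table (not the real data) to show the difference.

```python
import pandas as pd

# Hypothetical toy data; the values are illustrative only.
toy = pd.DataFrame({
    "city": ["Cairo", "Moscow", "Lahore", "Saint Petersburg"],
    "avg_temp_c": [22.5, 5.8, 24.1, 5.1],
})
toy_ind = toy.set_index("city")
cities = ["Saint Petersburg", "Moscow"]

by_mask = toy[toy["city"].isin(cities)]   # Boolean mask: rows keep their original order
by_label = toy_ind.loc[cities]            # label lookup: rows follow the order of `cities`

print(by_mask["city"].tolist())
print(by_label.index.tolist())
```

Another difference worth keeping in mind: in recent pandas versions, .loc[] raises a KeyError if any label in the list is missing from the index, whereas .isin() simply matches nothing for that value.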

<a id="sub-multi-lvl-ind"></a>
## 1.3 Setting multi-level indexes
Indexes can also be made out of multiple columns, forming a multi-level index (sometimes called a hierarchical index). There is a trade-off to using these.

The benefit is that multi-level indexes make it more natural to reason about nested categorical variables. For example, in a clinical trial, you might have control and treatment groups. Then each test subject belongs to one or another group, and we can say that a test subject is nested inside the treatment group. Similarly, in the temperature dataset, the city is located in the country, so we can say a city is nested inside the country.

The main downside is that the code for manipulating indexes is different from the code for manipulating columns, so you have to learn two syntaxes and keep track of how your data is represented.

Set the index of temperatures to the "country" and "city" columns, and assign this to temperatures_ind.

In [None]:
temperatures_ind = temperatures.set_index(["country", "city"])
temperatures_ind

Specify two country/city pairs to keep: "Brazil"/"Rio De Janeiro" and "Pakistan"/"Lahore", assigning to rows_to_keep.

In [None]:
rows_to_keep = [("Brazil", "Rio De Janeiro"), ("Pakistan", "Lahore")]

Subset temperatures_ind for rows_to_keep using .loc[] and print the result.

In [None]:
temperatures_ind.loc[rows_to_keep]
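
As a rough illustration of how labels map onto the levels of a multi-level index, the sketch below builds a small made-up table (`toy`, with invented values). A list of plain strings is matched against the outer level, while a list of tuples names exact outer/inner pairs.

```python
import pandas as pd

# Hypothetical toy data; the values are illustrative only.
toy = pd.DataFrame({
    "country": ["Brazil", "Brazil", "Pakistan", "Pakistan"],
    "city": ["Rio De Janeiro", "Salvador", "Karachi", "Lahore"],
    "avg_temp_c": [25.3, 26.0, 26.4, 24.1],
})
toy_ind = toy.set_index(["country", "city"])

# A list of outer-level labels returns every row under those countries.
outer = toy_ind.loc[["Brazil"]]

# A list of tuples returns exactly the requested (country, city) pairs.
pairs = toy_ind.loc[[("Brazil", "Rio De Janeiro"), ("Pakistan", "Lahore")]]

print(outer)
print(pairs)
```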

<a id="sortby-ind-val"></a>
## 1.4 Sorting by index values
Previously, you changed the order of the rows in a DataFrame by calling .sort_values(). It's also useful to be able to sort by elements in the index. For this, you need to use .sort_index().

Sort temperatures_ind by the index values.

In [None]:
temperatures_ind.sort_index()

Sort temperatures_ind by the index values at the "city" level.

In [None]:
temperatures_ind.sort_index(level="city")

Sort temperatures_ind by ascending country then descending city.

In [None]:
temperatures_ind.sort_index(level=["country", "city"], ascending=[True, False])
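
To see how the three sort orders differ, here is a small sketch on a made-up four-row table (the `toy` name and values are assumptions for illustration).

```python
import pandas as pd

# Hypothetical toy data with a two-level index, deliberately unsorted.
toy = pd.DataFrame(
    {"avg_temp_c": [5.8, 24.1, 26.4, 26.0]},
    index=pd.MultiIndex.from_tuples(
        [("Russia", "Moscow"), ("Pakistan", "Lahore"),
         ("Pakistan", "Karachi"), ("Brazil", "Salvador")],
        names=["country", "city"],
    ),
)

by_all = toy.sort_index()                  # country first, then city
by_city = toy.sort_index(level="city")     # city name decides the row order
mixed = toy.sort_index(level=["country", "city"], ascending=[True, False])

print(by_all.index.tolist())
print(by_city.index.tolist())
print(mixed.index.tolist())
```

In `mixed`, the two Pakistani cities appear as Lahore then Karachi, because the city level is sorted in descending order within each country.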

<a id="slice-sub-loc-iloc"></a>
# 2 Slicing and subsetting with .loc and .iloc
<div align="middle">
<video width="60%" controls>
      <source src="./../../res/videos/3.slicing-indexing-dataframes/2.slicing_and_subsetting_loc_iloc.mp4" type="video/mp4">
</video></div>

<a id="slice-ind-vals"></a>
## 2.1 Slicing index values
Slicing lets you select consecutive elements of an object using first:last syntax. DataFrames can be sliced by index values or by row/column number; we'll start with the first case. This involves slicing inside the .loc[] method.

Compared to slicing lists, there are a few things to remember.

- You can only slice an index if the index is sorted (using .sort_index()).
- To slice at the outer level, first and last can be strings.
- To slice at inner levels, first and last should be tuples.
- If you pass a single slice to .loc[], it will slice the rows.

Sort the index of temperatures_ind.

In [None]:
temperatures_srt = temperatures_ind.sort_index()

Use slicing with .loc[] to get these subsets:
- from Pakistan to Russia.
- from Lahore to Moscow. (This will return nonsense.)
- from Pakistan, Lahore to Russia, Moscow.

In [None]:
# Subset rows from Pakistan to Russia
print(temperatures_srt.loc["Pakistan":"Russia"])

# Try to subset rows from Lahore to Moscow (returns nonsense: the
# strings are compared against the outer level, country, not city)
print(temperatures_srt.loc["Lahore":"Moscow"])

# Subset rows from Pakistan, Lahore to Russia, Moscow
print(temperatures_srt.loc[("Pakistan", "Lahore"):("Russia", "Moscow")])
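
One more difference from list slicing is that label slices in .loc[] include the last label, whereas positional slices stop just before the end. The sketch below shows this on a made-up, sorted table (`toy` and its values are assumptions for illustration).

```python
import pandas as pd

# Hypothetical toy data with a sorted two-level index.
toy = pd.DataFrame(
    {"avg_temp_c": [22.5, 26.4, 24.1, 5.8, 5.1]},
    index=pd.MultiIndex.from_tuples(
        [("Egypt", "Cairo"), ("Pakistan", "Karachi"), ("Pakistan", "Lahore"),
         ("Russia", "Moscow"), ("Russia", "Saint Petersburg")],
        names=["country", "city"],
    ),
).sort_index()

# Outer-level slice: every Pakistan row and every Russia row.
outer = toy.loc["Pakistan":"Russia"]

# Inner-level slice with tuples: both endpoints are part of the result.
inner = toy.loc[("Pakistan", "Lahore"):("Russia", "Moscow")]

# A plain Python list, by contrast, excludes the end position.
print([10, 20, 30][0:2])
print(outer.index.tolist())
print(inner.index.tolist())
```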

<a id="slice-directions"></a>
## 2.2 Slicing in both directions
You've seen slicing DataFrames by rows and by columns, but since DataFrames are two-dimensional objects, it is often natural to slice both dimensions at once. That is, by passing two arguments to .loc[], you can subset by rows and columns in one go.

Use .loc[] slicing to subset rows from India, Hyderabad to Iraq, Baghdad.

In [None]:
temperatures_srt.loc[("India", "Hyderabad") : ("Iraq", "Baghdad")]

Use .loc[] slicing to subset columns from date to avg_temp_c.

In [None]:
temperatures_srt.loc[:, "date" : "avg_temp_c"]

Slice in both directions at once from Hyderabad to Baghdad, and date to avg_temp_c.

In [None]:
temperatures_srt.loc[("India","Hyderabad"):("Iraq","Baghdad"), "date":"avg_temp_c"]
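
Here is a compact sketch of slicing rows and columns together. The table is made up, and the `station` column is a hypothetical extra column added only so that the column slice has something to exclude.

```python
import pandas as pd

# Hypothetical toy data; "station" is an invented column for illustration.
toy = pd.DataFrame(
    {
        "date": ["2010-01-01"] * 4,
        "avg_temp_c": [14.0, 22.3, 9.6, 12.7],
        "station": ["a", "b", "c", "d"],
    },
    index=pd.MultiIndex.from_tuples(
        [("India", "Delhi"), ("India", "Hyderabad"),
         ("Iraq", "Baghdad"), ("Pakistan", "Lahore")],
        names=["country", "city"],
    ),
).sort_index()

# First argument slices rows, second argument slices columns.
both = toy.loc[("India", "Hyderabad"):("Iraq", "Baghdad"), "date":"avg_temp_c"]
print(both)
```

Column slices follow the order in which the columns appear in the DataFrame, so "date":"avg_temp_c" selects those two columns and everything between them.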

<a id="slice-time"></a>
## 2.3 Slicing time series
Slicing is particularly useful for time series since it's a common thing to want to filter for data within a date range. Add the date column to the index, then use .loc[] to perform the subsetting. The important thing to remember is to keep your dates in ISO 8601 format, that is, "yyyy-mm-dd" for year-month-day, "yyyy-mm" for year-month, and "yyyy" for year.

Recall from Chapter 1 that you can combine multiple Boolean conditions using logical operators, such as &. To do so in one line of code, you'll need to add parentheses () around each condition.

Use Boolean conditions, not .isin() or .loc[], and the full date "yyyy-mm-dd", to subset temperatures for rows in 2010 and 2011 and print the results.

In [None]:
# Use full dates: comparing against "2011" alone would cut off almost all of 2011
temperatures_bool = temperatures[(temperatures["date"] >= "2010-01-01") & (temperatures["date"] <= "2011-12-31")]
temperatures_bool

Set the index of temperatures to the date column and sort it.

In [None]:
temperatures_ind = temperatures.set_index("date").sort_index()
temperatures_ind

Use .loc[] to subset temperatures_ind for rows in 2010 and 2011.

In [None]:
temperatures_ind.loc["2010":"2011"]

Use .loc[] to subset temperatures_ind for rows from Aug 2010 to Feb 2011.

In [None]:
temperatures_ind.loc["2010-08":"2011-02"]
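
Slicing with partial dates such as "2010" or "2010-08" relies on the index holding real datetime values. The sketch below assumes a made-up monthly series (`toy`, invented values) and contrasts a datetime index with the same labels stored as plain strings.

```python
import pandas as pd

# Hypothetical monthly series from Nov 2009 to Feb 2012 (values are made up).
dates = pd.date_range("2009-11-01", "2012-02-01", freq="MS")
toy = pd.DataFrame({"avg_temp_c": [float(i) for i in range(len(dates))]}, index=dates)
toy.index.name = "date"

years = toy.loc["2010":"2011"]          # every row in 2010 and 2011
months = toy.loc["2010-08":"2011-02"]   # Aug 2010 through Feb 2011

# The same labels stored as plain strings are compared alphabetically,
# so "2011-01-01" sorts after "2011" and falls outside the slice.
as_text = toy.copy()
as_text.index = toy.index.strftime("%Y-%m-%d")
text_years = as_text.loc["2010":"2011"]

print(len(years), len(months), len(text_years))
```

If a date slice returns fewer rows than expected, checking the dtype of the index is a reasonable first step.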

<a id="sub-row-col"></a>
## 2.4 Subsetting by row/column number
The most common ways to subset rows are the ways we've previously discussed: using a Boolean condition or by index labels. However, it is also occasionally useful to pass row numbers.

This is done using .iloc[], and like .loc[], it can take two arguments to let you subset by rows and columns.

Use .iloc[] on temperatures to take subsets.

Get the 23rd row, 2nd column (index positions 22 and 1).

In [None]:
temperatures.iloc[22, 1]

Get the first 5 rows (index positions 0 to 4).

In [None]:
temperatures.iloc[:5]

Get all rows, columns 3 and 4 (index positions 2 and 3).

In [None]:
temperatures.iloc[:, 2:4]

Get the first 5 rows, columns 3 and 4.

In [None]:
temperatures.iloc[:5, 2:4]
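
Unlike .loc[], .iloc[] follows Python's usual convention: positions start at 0 and the end of a slice is excluded. A minimal sketch on a made-up table (`toy` is an assumption for illustration):

```python
import pandas as pd

# Hypothetical toy data with string row labels.
toy = pd.DataFrame(
    {"a": [10, 11, 12, 13], "b": [20, 21, 22, 23], "c": [30, 31, 32, 33]},
    index=["w", "x", "y", "z"],
)

cell = toy.iloc[2, 1]          # 3rd row, 2nd column
first_two = toy.iloc[:2]       # positions 0 and 1; position 2 is excluded
last_cols = toy.iloc[:, 1:3]   # columns at positions 1 and 2
by_label = toy.loc["w":"x"]    # label slice: "x" is included

print(cell)
print(first_two.index.tolist(), last_cols.columns.tolist())
```

Here `toy.iloc[:2]` and `toy.loc["w":"x"]` return the same two rows, even though one slice ends at 2 and the other at the second label.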

<a id="pvt-tbl"></a>
# 3 Working with pivot tables
<div align="middle">
<video width="60%" controls>
      <source src="./../../res/videos/3.slicing-indexing-dataframes/3.working_with_pivot_tables.mp4" type="video/mp4">
</video></div>

<a id="pvt-temp"></a>
## 3.1 Pivot temperatures by city and year
It's interesting to see how temperatures for each city change over time, but looking at every month results in a big table, which can be tricky to reason about. Instead, let's look at how temperatures change by year.

You can access the components of a date (year, month and day) using code of the form dataframe["column"].dt.component. For example, the month component is dataframe["column"].dt.month, and the year component is dataframe["column"].dt.year.

Once you have the year column, you can create a pivot table with the data aggregated by city and year, which you'll explore in the coming exercises.

Convert the date column to datetime so that the .dt accessor is available (pd.to_datetime() leaves a column that is already datetime unchanged).

In [None]:
temperatures["date"] = pd.to_datetime(temperatures["date"])

Add a year column to temperatures, from the year component of the date column.

In [None]:
temperatures["year"] = temperatures["date"].dt.year
temperatures["year"]

Make a pivot table of the avg_temp_c column, with country and city as rows, and year as columns. Assign to temp_by_country_city_vs_year, and look at the result.

In [None]:
temp_by_country_city_vs_year = temperatures.pivot_table(values="avg_temp_c", index=["country", "city"], columns="year")
temp_by_country_city_vs_year
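
The whole sequence (convert to datetime, extract the year, pivot) can be sketched on a few made-up rows. The `toy` table and its values are assumptions for illustration; by default .pivot_table() aggregates with the mean.

```python
import pandas as pd

# Hypothetical toy data: two readings per year for Cairo, one per year for Delhi.
toy = pd.DataFrame({
    "date": pd.to_datetime([
        "2010-01-01", "2010-07-01", "2011-01-01", "2011-07-01",
        "2010-01-01", "2011-01-01",
    ]),
    "country": ["Egypt"] * 4 + ["India"] * 2,
    "city": ["Cairo"] * 4 + ["Delhi"] * 2,
    "avg_temp_c": [14.0, 28.0, 13.0, 29.0, 14.0, 16.0],
})

toy["year"] = toy["date"].dt.year   # integer year taken from each date

# Rows: (country, city); columns: year; cells: mean of avg_temp_c.
pivot = toy.pivot_table(values="avg_temp_c", index=["country", "city"], columns="year")
print(pivot)
```

Note that the resulting column labels are integers (2010, 2011), not strings.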

<a id="sub-pvt-tbl"></a>
## 3.2 Subsetting pivot tables
A pivot table is just a DataFrame with sorted indexes, so the techniques you have learned already can be used to subset them. In particular, the .loc[] + slicing combination is often helpful.

Use .loc[] on temp_by_country_city_vs_year to take subsets.

From Egypt to India.

In [None]:
temp_by_country_city_vs_year.loc["Egypt":"India"]

From Egypt, Cairo to India, Delhi.

In [None]:
temp_by_country_city_vs_year.loc[("Egypt", "Cairo") : ("India", "Delhi")]

From Egypt, Cairo to India, Delhi, and 2005 to 2010.

In [None]:
# The year columns come from .dt.year, so their labels are integers, not strings
temp_by_country_city_vs_year.loc[("Egypt", "Cairo") : ("India", "Delhi"), 2005:2010]
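
When slicing pivot-table columns, the slice labels need to match the type of the column labels. The sketch below assumes integer year columns, as produced by .dt.year, on a made-up table (`pivot` here is hand-built, not the real result).

```python
import pandas as pd

# Hypothetical pivot-like table with integer year columns.
pivot = pd.DataFrame(
    [[21.0, 21.5, 22.0], [14.0, 15.0, 16.0], [24.0, 24.5, 25.0]],
    index=pd.MultiIndex.from_tuples(
        [("Egypt", "Cairo"), ("India", "Delhi"), ("Pakistan", "Lahore")],
        names=["country", "city"],
    ),
    columns=pd.Index([2009, 2010, 2011], name="year"),
)

# Rows by (country, city) tuples, columns by integer year labels;
# .loc[] treats 2010:2011 as labels, so both years are included.
subset = pivot.loc[("Egypt", "Cairo"):("India", "Delhi"), 2010:2011]
print(subset)
```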

<a id="calc-pvt-tbl"></a>
## 3.3 Calculating on a pivot table
Pivot tables are filled with summary statistics, but they are only a first step to finding something insightful. Often you'll need to perform further calculations on them. A common thing to do is to find the rows or columns where the highest or lowest value occurs.

Recall from Chapter 1 that you can easily subset a Series or DataFrame to find rows of interest using a logical condition inside of square brackets. For example: series[series > value].

Calculate the mean temperature for each year, assigning to mean_temp_by_year.

In [None]:
mean_temp_by_year = temp_by_country_city_vs_year.mean()
mean_temp_by_year

Filter mean_temp_by_year for the year that had the highest mean temperature.

In [None]:
mean_temp_by_year[mean_temp_by_year == mean_temp_by_year.max(axis="index")]

Calculate the mean temperature for each city (across columns), assigning to mean_temp_by_city.

In [None]:
mean_temp_by_city = temp_by_country_city_vs_year.mean(axis="columns")
mean_temp_by_city

Filter mean_temp_by_city for the city that had the lowest mean temperature.

In [None]:
mean_temp_by_city[mean_temp_by_city == mean_temp_by_city.min()]
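
As a possible alternative to filtering with a Boolean condition, .idxmax() and .idxmin() return the label of the extreme value directly. The sketch below uses a hand-built, made-up pivot table to compare the two approaches.

```python
import pandas as pd

# Hypothetical pivot-like table with invented values.
pivot = pd.DataFrame(
    [[21.0, 21.5, 22.0], [14.0, 15.0, 16.0], [24.0, 24.5, 25.0]],
    index=pd.MultiIndex.from_tuples(
        [("Egypt", "Cairo"), ("India", "Delhi"), ("Pakistan", "Lahore")],
        names=["country", "city"],
    ),
    columns=pd.Index([2009, 2010, 2011], name="year"),
)

mean_by_year = pivot.mean()                 # one value per column (year)
mean_by_city = pivot.mean(axis="columns")   # one value per row (country, city)

# Boolean filtering keeps the value together with its label.
hottest_year = mean_by_year[mean_by_year == mean_by_year.max()]
coldest_city = mean_by_city[mean_by_city == mean_by_city.min()]

# idxmax / idxmin return only the label.
print(mean_by_year.idxmax())
print(mean_by_city.idxmin())
```

The Boolean filter returns every entry tied for the extreme value, while .idxmax() and .idxmin() return a single label, so the two can differ when there are ties.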