# Explicit indexes

## Setting and removing indexes
pandas allows you to designate columns as an index. This enables cleaner code when taking subsets (as well as providing more efficient lookup under some circumstances).

In this chapter, you'll be exploring temperatures, a DataFrame of average temperatures in cities around the world. pandas is loaded as pd.
### Instructions
- Look at temperatures.
- Set the index of temperatures to "city", assigning to temperatures_ind.
- Look at temperatures_ind. How is it different from temperatures?
- Reset the index of temperatures_ind, keeping its contents.
- Reset the index of temperatures_ind, dropping its contents.

In [2]:
import pandas as pd
# Create a DataFrame of temperatures
temperatures = pd.read_csv('data/temperatures.csv')

In [3]:
# Look at temperatures
print(temperatures)

# Set the index of temperatures to city
temperatures_ind = temperatures.set_index('city')

# Look at temperatures_ind
print(temperatures_ind)

# Reset the temperatures_ind index, keeping its contents
print(temperatures_ind.reset_index())

# Reset the temperatures_ind index, dropping its contents
print(temperatures_ind.reset_index(drop=True))

             date     city        country  avg_temp_c  year
0      2000-01-01  Abidjan  Côte D'Ivoire      27.293  2000
1      2000-02-01  Abidjan  Côte D'Ivoire      27.685  2000
2      2000-03-01  Abidjan  Côte D'Ivoire      29.061  2000
3      2000-04-01  Abidjan  Côte D'Ivoire      28.162  2000
4      2000-05-01  Abidjan  Côte D'Ivoire      27.547  2000
...           ...      ...            ...         ...   ...
16495  2013-05-01     Xian          China      18.979  2013
16496  2013-06-01     Xian          China      23.522  2013
16497  2013-07-01     Xian          China      25.251  2013
16498  2013-08-01     Xian          China      24.528  2013
16499  2013-09-01     Xian          China         NaN  2013

[16500 rows x 5 columns]
               date        country  avg_temp_c  year
city                                                
Abidjan  2000-01-01  Côte D'Ivoire      27.293  2000
Abidjan  2000-02-01  Côte D'Ivoire      27.685  2000
Abidjan  2000-03-01  Côte D'Ivoire      29
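
To see the three operations side by side without the full dataset, here is a minimal sketch on a made-up three-row DataFrame. The name `toy` and its values are invented for illustration and are not part of the temperatures data.

```python
import pandas as pd

# Toy data standing in for temperatures (values are made up)
toy = pd.DataFrame({
    "city": ["Lima", "Oslo", "Lima"],
    "avg_temp_c": [20.1, -3.2, 19.5],
})

# set_index() moves the column into the index and returns a new DataFrame
toy_ind = toy.set_index("city")

# reset_index() turns the index back into a regular column
restored = toy_ind.reset_index()

# reset_index(drop=True) discards the index values entirely
dropped = toy_ind.reset_index(drop=True)

print(list(toy_ind.columns))
print(list(restored.columns))
print(list(dropped.columns))
```

Note that `set_index()` does not modify `toy` itself, which is why the exercise assigns the result to a new name.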

## Subsetting with .loc[]
The killer feature for indexes is .loc[]: a subsetting method that accepts index values. When you pass it a single argument, it will take a subset of rows.

The code for subsetting using .loc[] can be easier to read than standard square bracket subsetting, which can make your code less burdensome to maintain.

pandas is loaded as pd. temperatures and temperatures_ind are available; the latter is indexed by city.
### Instructions
- Create a list called cities that contains "London" and "Paris".
- Use [] subsetting to filter temperatures for rows where the city column takes a value in the cities list.
- Use .loc[] subsetting to filter temperatures_ind for rows where the city is in the cities list.

In [4]:
# Make a list of cities to subset on
cities = ["London", "Paris"]

# Subset temperatures using square brackets
print(temperatures[temperatures['city'].isin(cities)])

# Subset temperatures_ind using .loc[]
print(temperatures_ind.loc[cities])

             date    city         country  avg_temp_c  year
8910   2000-01-01  London  United Kingdom       4.693  2000
8911   2000-02-01  London  United Kingdom       6.115  2000
8912   2000-03-01  London  United Kingdom       7.422  2000
8913   2000-04-01  London  United Kingdom       8.246  2000
8914   2000-05-01  London  United Kingdom      12.491  2000
...           ...     ...             ...         ...   ...
12040  2013-05-01   Paris          France      11.703  2013
12041  2013-06-01   Paris          France      16.340  2013
12042  2013-07-01   Paris          France      21.186  2013
12043  2013-08-01   Paris          France      19.235  2013
12044  2013-09-01   Paris          France         NaN  2013

[330 rows x 5 columns]
              date         country  avg_temp_c  year
city                                                
London  2000-01-01  United Kingdom       4.693  2000
London  2000-02-01  United Kingdom       6.115  2000
London  2000-03-01  United Kingdom       7.4
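
The two styles select the same rows but can order them differently. The sketch below uses a made-up four-row DataFrame (the name `toy` and its values are assumptions for illustration) to compare a Boolean mask with passing the list of labels straight to .loc[].

```python
import pandas as pd

# Toy data (values are made up); London appears twice on purpose
toy = pd.DataFrame({
    "city": ["London", "Paris", "Rome", "London"],
    "avg_temp_c": [4.7, 3.8, 8.0, 6.1],
})
toy_ind = toy.set_index("city")
cities = ["London", "Paris"]

# Boolean mask on a column: rows keep their original order
by_mask = toy[toy["city"].isin(cities)]

# List of index labels: rows come back grouped in the order of the list
by_label = toy_ind.loc[cities]

print(by_mask)
print(by_label)
```

In this sketch the mask returns London, Paris, London, while .loc[] returns both London rows followed by Paris.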

## Setting multi-level indexes
Indexes can also be made out of multiple columns, forming a multi-level index (sometimes called a hierarchical index). There is a trade-off to using these.

The benefit is that multi-level indexes make it more natural to reason about nested categorical variables. For example, in a clinical trial, you might have control and treatment groups. Then each test subject belongs to one or another group, and we can say that a test subject is nested inside the treatment group. Similarly, in the temperature dataset, the city is located in the country, so we can say a city is nested inside the country.

The main downside is that the code for manipulating indexes is different from the code for manipulating columns, so you have to learn two syntaxes and keep track of how your data is represented.

pandas is loaded as pd. temperatures is available.
### Instructions
- Set the index of temperatures to the "country" and "city" columns, and assign this to temperatures_ind.
- Specify two country/city pairs to keep: "Brazil"/"Rio De Janeiro" and "Pakistan"/"Lahore", assigning to rows_to_keep.
- Print and subset temperatures_ind for rows_to_keep using .loc[].

In [5]:
# Index temperatures by country & city
temperatures_ind = temperatures.set_index(['country', 'city'])

# List of tuples: Brazil, Rio De Janeiro & Pakistan, Lahore
rows_to_keep = [('Brazil', 'Rio De Janeiro'), ('Pakistan', 'Lahore')]

# Subset for rows to keep
print(temperatures_ind.loc[rows_to_keep])


                               date  avg_temp_c  year
country  city                                        
Brazil   Rio De Janeiro  2000-01-01      25.974  2000
         Rio De Janeiro  2000-02-01      26.699  2000
         Rio De Janeiro  2000-03-01      26.270  2000
         Rio De Janeiro  2000-04-01      25.750  2000
         Rio De Janeiro  2000-05-01      24.356  2000
...                             ...         ...   ...
Pakistan Lahore          2013-05-01      33.457  2013
         Lahore          2013-06-01      34.456  2013
         Lahore          2013-07-01      33.279  2013
         Lahore          2013-08-01      31.511  2013
         Lahore          2013-09-01         NaN  2013

[330 rows x 3 columns]
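
As a smaller illustration of how tuples address a multi-level index, the sketch below builds a made-up four-row DataFrame (the name `toy` and its values are assumptions) and selects rows in two ways.

```python
import pandas as pd

# Toy data (values are made up)
toy = pd.DataFrame({
    "country": ["Brazil", "Brazil", "Pakistan", "Pakistan"],
    "city": ["Rio De Janeiro", "Salvador", "Karachi", "Lahore"],
    "avg_temp_c": [25.9, 26.3, 18.5, 12.8],
})
toy_ind = toy.set_index(["country", "city"])

# A list of tuples selects specific (outer, inner) combinations
pairs = toy_ind.loc[[("Brazil", "Rio De Janeiro"), ("Pakistan", "Lahore")]]

# A single outer-level label selects every row nested under it
brazil = toy_ind.loc["Brazil"]

print(pairs)
print(brazil)
```

Each tuple reads as (country, city), matching the order of the columns passed to `set_index()`.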


## Sorting by index values
Previously, you changed the order of the rows in a DataFrame by calling .sort_values(). It's also useful to be able to sort by elements in the index. For this, you need to use .sort_index().

pandas is loaded as pd. temperatures_ind has a multi-level index of country and city, and is available.
### Instructions
- Sort temperatures_ind by the index values.
- Sort temperatures_ind by the index values at the "city" level.
- Sort temperatures_ind by ascending country then descending city.

In [6]:
# Sort temperatures_ind by index values
print(temperatures_ind.sort_index())

# Sort temperatures_ind by index values at the city level
print(temperatures_ind.sort_index(level='city'))

# Sort temperatures_ind by country then descending city
print(temperatures_ind.sort_index(level=['country', 'city'], ascending=[True, False]))

                          date  avg_temp_c  year
country     city                                
Afghanistan Kabul   2000-01-01       3.326  2000
            Kabul   2000-02-01       3.454  2000
            Kabul   2000-03-01       9.612  2000
            Kabul   2000-04-01      17.925  2000
            Kabul   2000-05-01      24.658  2000
...                        ...         ...   ...
Zimbabwe    Harare  2013-05-01      18.298  2013
            Harare  2013-06-01      17.020  2013
            Harare  2013-07-01      16.299  2013
            Harare  2013-08-01      19.232  2013
            Harare  2013-09-01         NaN  2013

[16500 rows x 3 columns]
                             date  avg_temp_c  year
country       city                                 
Côte D'Ivoire Abidjan  2000-01-01      27.293  2000
              Abidjan  2000-02-01      27.685  2000
              Abidjan  2000-03-01      29.061  2000
              Abidjan  2000-04-01      28.162  2000
              Abidjan  20
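
Because the full output is long, the effect of each call is easier to check on a few rows. The following sketch uses a made-up four-row DataFrame (the name `toy` and its values are assumptions for illustration).

```python
import pandas as pd

# Toy data (values are made up), indexed by country and city
toy = pd.DataFrame({
    "country": ["Peru", "China", "China", "Peru"],
    "city": ["Lima", "Xian", "Harbin", "Cusco"],
    "avg_temp_c": [20.1, 14.0, 4.9, 12.3],
}).set_index(["country", "city"])

# Default: sort by every level, outer to inner, ascending
by_all = toy.sort_index()

# Sort by the inner level first
by_city = toy.sort_index(level="city")

# One direction per level: country ascending, city descending
mixed = toy.sort_index(level=["country", "city"], ascending=[True, False])

print(list(by_all.index))
print(list(by_city.index))
print(list(mixed.index))
```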

# Slicing and subsetting with .loc and .iloc

## Slicing index values
Slicing lets you select consecutive elements of an object using first:last syntax. DataFrames can be sliced by index values or by row/column number; we'll start with the first case. This involves slicing inside the .loc[] method.

Compared to slicing lists, there are a few things to remember.

- You can only slice an index if the index is sorted (using .sort_index()).
- To slice at the outer level, first and last can be strings.
- To slice at inner levels, first and last should be tuples.
- If you pass a single slice to .loc[], it will slice the rows.

pandas is loaded as pd. temperatures_ind has country and city in the index, and is available.
### Instructions
- Sort the index of temperatures_ind.
- Use slicing with .loc[] to get these subsets:
    - from Pakistan to Philippines.
    - from Lahore to Manila. (This will return nonsense.)
    - from Pakistan, Lahore to Philippines, Manila.

In [7]:
# Sort the index of temperatures_ind
temperatures_srt = temperatures_ind.sort_index()

# Subset rows from Pakistan to Philippines
print(temperatures_srt.loc['Pakistan':'Philippines'])

# Try to subset rows from Lahore to Manila
print(temperatures_srt.loc['Lahore':'Manila'])

# Subset rows from Pakistan, Lahore to Philippines, Manila
print(temperatures_srt.loc[('Pakistan', 'Lahore'):('Philippines', 'Manila')])

                              date  avg_temp_c  year
country     city                                    
Pakistan    Faisalabad  2000-01-01      12.792  2000
            Faisalabad  2000-02-01      14.339  2000
            Faisalabad  2000-03-01      20.309  2000
            Faisalabad  2000-04-01      29.072  2000
            Faisalabad  2000-05-01      34.845  2000
...                            ...         ...   ...
Philippines Manila      2013-05-01      29.552  2013
            Manila      2013-06-01      28.572  2013
            Manila      2013-07-01      27.266  2013
            Manila      2013-08-01      26.754  2013
            Manila      2013-09-01         NaN  2013

[825 rows x 3 columns]
Empty DataFrame
Columns: [date, avg_temp_c, year]
Index: []
                          date  avg_temp_c  year
country     city                                
Pakistan    Lahore  2000-01-01      12.792  2000
            Lahore  2000-02-01      14.339  2000
            Lahore  2000-03-01 
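
The three slices can be reproduced on a handful of rows. This is a minimal sketch on made-up data (the name `toy` and its values are assumptions); it shows that label slices include both endpoints and that a plain string is compared against the outer level only.

```python
import pandas as pd

# Toy data (values are made up), indexed by country and city
toy = pd.DataFrame({
    "country": ["Poland", "Pakistan", "Philippines", "Peru", "Pakistan"],
    "city": ["Warsaw", "Lahore", "Manila", "Lima", "Faisalabad"],
    "avg_temp_c": [8.0, 24.0, 27.0, 19.0, 24.5],
}).set_index(["country", "city"])

# Sort first so that label slicing is well defined
toy_srt = toy.sort_index()

# Outer-level slice: both endpoints are included
outer = toy_srt.loc["Pakistan":"Philippines"]

# City names are compared against country labels here, so nothing matches
inner_wrong = toy_srt.loc["Lahore":"Manila"]

# Tuples pin down a position at the inner level
inner = toy_srt.loc[("Pakistan", "Lahore"):("Philippines", "Manila")]

print(len(outer), len(inner_wrong), len(inner))
```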

## Slicing in both directions
You've seen slicing DataFrames by rows and by columns, but since DataFrames are two-dimensional objects, it is often natural to slice both dimensions at once. That is, by passing two arguments to .loc[], you can subset by rows and columns in one go.

pandas is loaded as pd. temperatures_srt is indexed by country and city, has a sorted index, and is available.
### Instructions
- Use .loc[] slicing to subset rows from India, Hyderabad to Iraq, Baghdad.
- Use .loc[] slicing to subset columns from date to avg_temp_c.
- Slice in both directions at once from Hyderabad to Baghdad, and date to avg_temp_c.

In [8]:
# Subset rows from India, Hyderabad to Iraq, Baghdad
print(temperatures_srt.loc[('India','Hyderabad'):('Iraq','Baghdad')])

                         date  avg_temp_c  year
country city                                   
India   Hyderabad  2000-01-01      23.779  2000
        Hyderabad  2000-02-01      25.826  2000
        Hyderabad  2000-03-01      28.821  2000
        Hyderabad  2000-04-01      32.698  2000
        Hyderabad  2000-05-01      32.438  2000
...                       ...         ...   ...
Iraq    Baghdad    2013-05-01      28.673  2013
        Baghdad    2013-06-01      33.803  2013
        Baghdad    2013-07-01      36.392  2013
        Baghdad    2013-08-01      35.463  2013
        Baghdad    2013-09-01         NaN  2013

[2145 rows x 3 columns]


In [9]:
# Subset columns from date to avg_temp_c
print(temperatures_srt.loc[:, 'date':'avg_temp_c'])

                          date  avg_temp_c
country     city                          
Afghanistan Kabul   2000-01-01       3.326
            Kabul   2000-02-01       3.454
            Kabul   2000-03-01       9.612
            Kabul   2000-04-01      17.925
            Kabul   2000-05-01      24.658
...                        ...         ...
Zimbabwe    Harare  2013-05-01      18.298
            Harare  2013-06-01      17.020
            Harare  2013-07-01      16.299
            Harare  2013-08-01      19.232
            Harare  2013-09-01         NaN

[16500 rows x 2 columns]


In [10]:
# Subset in both directions at once
print(temperatures_srt.loc[('India', 'Hyderabad'):('Iraq', 'Baghdad'), 'date':'avg_temp_c'])

                         date  avg_temp_c
country city                             
India   Hyderabad  2000-01-01      23.779
        Hyderabad  2000-02-01      25.826
        Hyderabad  2000-03-01      28.821
        Hyderabad  2000-04-01      32.698
        Hyderabad  2000-05-01      32.438
...                       ...         ...
Iraq    Baghdad    2013-05-01      28.673
        Baghdad    2013-06-01      33.803
        Baghdad    2013-07-01      36.392
        Baghdad    2013-08-01      35.463
        Baghdad    2013-09-01         NaN

[2145 rows x 2 columns]
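
As a compact check of the two-argument form, the sketch below slices rows and columns of a made-up DataFrame in one call (the name `toy_srt` is reused here for a toy object, and its values are assumptions).

```python
import pandas as pd

# Toy data (values are made up), indexed by country and city and sorted
toy_srt = pd.DataFrame({
    "country": ["India", "India", "Indonesia", "Iraq", "Italy"],
    "city": ["Delhi", "Hyderabad", "Jakarta", "Baghdad", "Rome"],
    "date": ["2000-01-01"] * 5,
    "avg_temp_c": [15.2, 23.8, 26.6, 9.7, 7.5],
    "year": [2000] * 5,
}).set_index(["country", "city"]).sort_index()

# First argument slices rows, second slices columns; both ends are included
both = toy_srt.loc[("India", "Hyderabad"):("Iraq", "Baghdad"), "date":"avg_temp_c"]

print(both)
```

The column slice follows the order in which the columns appear in the DataFrame, so "date":"avg_temp_c" stops before year.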


## Slicing time series
Slicing is particularly useful for time series, because filtering for data within a date range is such a common task. Add the date column to the index, then use .loc[] to perform the subsetting. The important thing to remember is to keep your dates in ISO 8601 format, that is, "yyyy-mm-dd" for year-month-day, "yyyy-mm" for year-month, and "yyyy" for year.

Recall from Chapter 1 that you can combine multiple Boolean conditions using logical operators, such as &. To do so in one line of code, you'll need to add parentheses () around each condition.

pandas is loaded as pd and temperatures, with no index, is available.
### Instructions
- Use Boolean conditions, not .isin() or .loc[], and the full date "yyyy-mm-dd", to subset temperatures for rows where the date column is in 2010 and 2011 and print the results.
- Set the index of temperatures to the date column and sort it.
- Use .loc[] to subset temperatures_ind for rows in 2010 and 2011.
- Use .loc[] to subset temperatures_ind for rows from August 2010 to February 2011.

In [11]:
# Use Boolean conditions to subset temperatures for rows in 2010 and 2011
temperatures_bool = temperatures[(temperatures['date'] >= '2010-01-01') & (temperatures['date'] <= '2011-12-31')]
print(temperatures_bool)

             date     city        country  avg_temp_c  year
120    2010-01-01  Abidjan  Côte D'Ivoire      28.270  2010
121    2010-02-01  Abidjan  Côte D'Ivoire      29.262  2010
122    2010-03-01  Abidjan  Côte D'Ivoire      29.596  2010
123    2010-04-01  Abidjan  Côte D'Ivoire      29.068  2010
124    2010-05-01  Abidjan  Côte D'Ivoire      28.258  2010
...           ...      ...            ...         ...   ...
16474  2011-08-01     Xian          China      23.069  2011
16475  2011-09-01     Xian          China      16.775  2011
16476  2011-10-01     Xian          China      12.587  2011
16477  2011-11-01     Xian          China       7.543  2011
16478  2011-12-01     Xian          China      -0.490  2011

[2400 rows x 5 columns]


In [12]:

# Set date as the index and sort the index
temperatures_ind = temperatures.set_index('date').sort_index()

# Use .loc[] to subset temperatures_ind for rows in 2010 and 2011
print(temperatures_ind.loc['2010':'2011'])

# Use .loc[] to subset temperatures_ind for rows from Aug 2010 to Feb 2011
print(temperatures_ind.loc['2010-08':'2011-02'])

                  city    country  avg_temp_c  year
date                                               
2010-01-01  Faisalabad   Pakistan      11.810  2010
2010-01-01   Melbourne  Australia      20.016  2010
2010-01-01   Chongqing      China       7.921  2010
2010-01-01   São Paulo     Brazil      23.738  2010
2010-01-01   Guangzhou      China      14.136  2010
...                ...        ...         ...   ...
2010-12-01     Jakarta  Indonesia      26.602  2010
2010-12-01       Gizeh      Egypt      16.530  2010
2010-12-01      Nagpur      India      19.120  2010
2010-12-01      Sydney  Australia      19.559  2010
2010-12-01    Salvador     Brazil      26.265  2010

[1200 rows x 4 columns]
                     city        country  avg_temp_c  year
date                                                      
2010-08-01       Calcutta          India      30.226  2010
2010-08-01           Pune          India      24.941  2010
2010-08-01          Izmir         Turkey      28.352  2010
2010
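
One caveat is worth checking here: partial date strings such as "2011" are only expanded to the whole year when the index is a DatetimeIndex. The first table printed above has 1200 rows and ends at 2010-12-01, which is what you would expect if date were stored as plain strings and compared as text. The sketch below uses made-up values and assumes that explanation; converting with pd.to_datetime() before setting the index is one way to get both years.

```python
import pandas as pd

# Toy data (values are made up); dates start out as plain strings
toy = pd.DataFrame({
    "date": ["2010-01-01", "2010-12-01", "2011-01-01", "2011-12-01", "2012-01-01"],
    "avg_temp_c": [1.0, 2.0, 3.0, 4.0, 5.0],
})

# String index: bounds are compared as text, and "2011-01-01" sorts
# after "2011", so the 2011 rows fall outside the slice
str_ind = toy.set_index("date").sort_index()
str_slice = str_ind.loc["2010":"2011"]

# DatetimeIndex: "2011" is read as the whole of 2011
dt_ind = (
    toy.assign(date=pd.to_datetime(toy["date"]))
    .set_index("date")
    .sort_index()
)
dt_slice = dt_ind.loc["2010":"2011"]

print(len(str_slice))
print(len(dt_slice))
```

With the string index only the two 2010 rows are returned; with the DatetimeIndex all four rows from 2010 and 2011 are returned.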

## Subsetting by row/column number
The most common ways to subset rows are the ways we've previously discussed: using a Boolean condition or by index labels. However, it is also occasionally useful to pass row numbers.

This is done using .iloc[], and like .loc[], it can take two arguments to let you subset by rows and columns.

pandas is loaded as pd. temperatures (without an index) is available.
### Instructions
+ Use .iloc[] on temperatures to take subsets.
    - Get the 23rd row, 2nd column (index positions 22 and 1).
    - Get the first 5 rows (index positions 0 to 4, that is, the slice 0:5).
    - Get all rows, columns 3 and 4 (index positions 2 and 3, that is, the slice 2:4).
    - Get the first 5 rows, columns 3 and 4.

In [13]:
# Get 23rd row, 2nd column (index 22, 1)
print(temperatures.iloc[22, 1])

Abidjan


In [14]:
# Use slicing to get the first 5 rows
print(temperatures.iloc[0:5])

         date     city        country  avg_temp_c  year
0  2000-01-01  Abidjan  Côte D'Ivoire      27.293  2000
1  2000-02-01  Abidjan  Côte D'Ivoire      27.685  2000
2  2000-03-01  Abidjan  Côte D'Ivoire      29.061  2000
3  2000-04-01  Abidjan  Côte D'Ivoire      28.162  2000
4  2000-05-01  Abidjan  Côte D'Ivoire      27.547  2000


In [15]:
# Use slicing to get columns 3 to 4
print(temperatures.iloc[:, 2:4])

             country  avg_temp_c
0      Côte D'Ivoire      27.293
1      Côte D'Ivoire      27.685
2      Côte D'Ivoire      29.061
3      Côte D'Ivoire      28.162
4      Côte D'Ivoire      27.547
...              ...         ...
16495          China      18.979
16496          China      23.522
16497          China      25.251
16498          China      24.528
16499          China         NaN

[16500 rows x 2 columns]


In [16]:

# Use slicing in both directions at once
print(temperatures.iloc[0:5, 2:4])

         country  avg_temp_c
0  Côte D'Ivoire      27.293
1  Côte D'Ivoire      27.685
2  Côte D'Ivoire      29.061
3  Côte D'Ivoire      28.162
4  Côte D'Ivoire      27.547
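
A detail that is easy to miss: .iloc[] slices exclude the end position, like Python lists, while .loc[] slices include the end label. The sketch below shows the difference on a made-up four-row DataFrame (the name `toy` and its values are assumptions).

```python
import pandas as pd

# Toy data (values are made up) with the default 0, 1, 2, ... index
toy = pd.DataFrame({
    "date": ["2000-01-01", "2000-02-01", "2000-03-01", "2000-04-01"],
    "city": ["Lima", "Lima", "Oslo", "Oslo"],
    "country": ["Peru", "Peru", "Norway", "Norway"],
    "avg_temp_c": [22.0, 22.5, -1.0, 4.0],
})

cell = toy.iloc[1, 1]          # second row, second column
first_two = toy.iloc[0:2]      # positions 0 and 1; position 2 is excluded
cols = toy.iloc[:, 2:4]        # columns at positions 2 and 3
by_label = toy.loc[0:2]        # labels 0, 1 and 2; the end label is included

print(cell)
print(len(first_two), len(by_label))
print(list(cols.columns))
```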


# Working with pivot tables

## Pivot temperature by city and year
It's interesting to see how temperatures for each city change over time, but looking at every month results in a big table, which can be tricky to reason about. Instead, let's look at how temperatures change by year.

You can access the components of a date (year, month and day) using code of the form dataframe["column"].dt.component. For example, the month component is dataframe["column"].dt.month, and the year component is dataframe["column"].dt.year.

Once you have the year column, you can create a pivot table with the data aggregated by city and year, which you'll explore in the coming exercises.

pandas is loaded as pd. temperatures is available.
### Instructions
- Add a year column to temperatures, from the year component of the date column.
- Make a pivot table of the avg_temp_c column, with country and city as rows, and year as columns. Assign to temp_by_country_city_vs_year, and look at the result.

In [17]:
# Add a year column to temperatures
temperatures['year'] = pd.to_datetime(temperatures['date']).dt.year

# Pivot avg_temp_c by country and city vs year
temp_by_country_city_vs_year = temperatures.pivot_table(
    values='avg_temp_c',
    index=['country', 'city'],
    columns='date',
)

# See the result
print(temp_by_country_city_vs_year)

date                            2000-01-01  2000-02-01  2000-03-01  \
country       city                                                   
Afghanistan   Kabul                  3.326       3.454       9.612   
Angola        Luanda                25.077      25.493      26.496   
Australia     Melbourne             18.527      22.095      18.945   
              Sydney                18.470      20.713      20.220   
Bangladesh    Dhaka                 18.829      20.947      26.035   
...                                    ...         ...         ...   
United States Chicago                0.137       4.083       8.274   
              Los Angeles           10.772      10.262      12.335   
              New York              -3.168      -0.162       6.391   
Vietnam       Ho Chi Minh City      26.647      26.672      27.655   
Zimbabwe      Harare                22.119      21.569      22.370   

date                            2000-04-01  2000-05-01  2000-06-01  \
country       city 
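
The table printed above has one column per date because the call passes columns='date'. To get one column per year, as the instructions describe, columns='year' would be the value to pass. A minimal sketch of the difference on made-up data follows; the name `toy` and all values are assumptions for illustration.

```python
import pandas as pd

# Toy data (values are made up): two cities, two readings per year
toy = pd.DataFrame({
    "date": pd.to_datetime(["2000-01-01", "2000-07-01", "2001-01-01", "2001-07-01"] * 2),
    "country": ["Peru"] * 4 + ["China"] * 4,
    "city": ["Lima"] * 4 + ["Harbin"] * 4,
    "avg_temp_c": [22.0, 16.0, 23.0, 17.0, -18.0, 22.0, -16.0, 24.0],
})

# Extract the year component of each date
toy["year"] = toy["date"].dt.year

# One column per year; each cell is the mean of that year's readings
by_year = toy.pivot_table(values="avg_temp_c", index=["country", "city"], columns="year")

# One column per date; nothing is aggregated because each date occurs once per city
by_date = toy.pivot_table(values="avg_temp_c", index=["country", "city"], columns="date")

print(by_year)
print(by_date.shape)
```

pivot_table() averages by default, so the Harbin cell for 2000 is the mean of -18.0 and 22.0.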

## Subsetting pivot tables
A pivot table is just a DataFrame with sorted indexes, so the techniques you have learned already can be used to subset them. In particular, the .loc[] + slicing combination is often helpful.

pandas is loaded as pd. temp_by_country_city_vs_year is available.
### Instructions
- Use .loc[] on temp_by_country_city_vs_year to take subsets.

    - From Egypt to India.
    - From Egypt, Cairo to India, Delhi.
    - From Egypt, Cairo to India, Delhi, and 2005 to 2010.

In [18]:
# Subset for Egypt to India
print(temp_by_country_city_vs_year.loc['Egypt':'India'])

date                  2000-01-01  2000-02-01  2000-03-01  2000-04-01  \
country  city                                                          
Egypt    Alexandria       13.579      14.300      15.266      19.556   
         Cairo            12.669      13.728      16.026      22.396   
         Gizeh            12.669      13.728      16.026      22.396   
Ethiopia Addis Abeba      17.391      19.183      20.230      20.398   
France   Paris             3.845       6.587       7.872      10.067   
Germany  Berlin            1.324       4.718       5.806      11.805   
India    Ahmadabad        20.781      21.246      26.565      32.275   
         Bangalore        23.673      25.351      27.238      28.501   
         Bombay           25.599      24.076      25.489      28.188   
         Calcutta         19.196      21.275      26.881      30.165   
         Delhi            15.201      16.388      22.921      31.266   
         Hyderabad        23.779      25.826      28.821      32

In [19]:
# Subset for Egypt, Cairo to India, Delhi
print(temp_by_country_city_vs_year.loc[('Egypt', 'Cairo'):('India', 'Delhi')])

date                  2000-01-01  2000-02-01  2000-03-01  2000-04-01  \
country  city                                                          
Egypt    Cairo            12.669      13.728      16.026      22.396   
         Gizeh            12.669      13.728      16.026      22.396   
Ethiopia Addis Abeba      17.391      19.183      20.230      20.398   
France   Paris             3.845       6.587       7.872      10.067   
Germany  Berlin            1.324       4.718       5.806      11.805   
India    Ahmadabad        20.781      21.246      26.565      32.275   
         Bangalore        23.673      25.351      27.238      28.501   
         Bombay           25.599      24.076      25.489      28.188   
         Calcutta         19.196      21.275      26.881      30.165   
         Delhi            15.201      16.388      22.921      31.266   

date                  2000-05-01  2000-06-01  2000-07-01  2000-08-01  \
country  city                                                  

In [20]:
# Subset for Egypt, Cairo to India, Delhi, and 2005 to 2010
print(temp_by_country_city_vs_year.loc[('Egypt', 'Cairo'):('India', 'Delhi'), '2005':'2010'])

date                  2005-01-01  2005-02-01  2005-03-01  2005-04-01  \
country  city                                                          
Egypt    Cairo            14.030      14.640      18.033      21.585   
         Gizeh            14.030      14.640      18.033      21.585   
Ethiopia Addis Abeba      17.940      19.785      20.685      20.464   
France   Paris             5.053       2.884       7.415      10.884   
Germany  Berlin            3.103      -0.236       3.808      10.174   
India    Ahmadabad        18.898      21.815      27.380      30.601   
         Bangalore        23.972      25.618      27.825      27.795   
         Bombay           24.447      24.707      25.961      28.128   
         Calcutta         19.341      23.576      27.695      30.152   
         Delhi            14.383      17.669      24.469      29.322   

date                  2005-05-01  2005-06-01  2005-07-01  2005-08-01  \
country  city                                                  
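
The string bounds '2005':'2010' work above because the column labels are date strings. If the columns were integer years instead, the bounds would need to be integers too. The sketch below assumes such a table; `toy_pivot` and its values are made up for illustration.

```python
import pandas as pd

# Toy pivot-style table (values are made up): sorted row index, integer year columns
index = pd.MultiIndex.from_tuples(
    [("Egypt", "Alexandria"), ("Egypt", "Cairo"), ("France", "Paris"),
     ("India", "Delhi"), ("India", "Pune")],
    names=["country", "city"],
)
years = list(range(2004, 2012))
toy_pivot = pd.DataFrame(
    [[r * 100 + c for c in range(len(years))] for r in range(len(index))],
    index=index,
    columns=years,
)

# Rows by tuple, columns by integer label; both ends are included
subset = toy_pivot.loc[("Egypt", "Cairo"):("India", "Delhi"), 2005:2010]

print(subset)
```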

## Calculating on a pivot table
Pivot tables are filled with summary statistics, but they are only a first step to finding something insightful. Often you'll need to perform further calculations on them. A common thing to do is to find the rows or columns where the highest or lowest value occurs.

Recall from Chapter 1 that you can easily subset a Series or DataFrame to find rows of interest using a logical condition inside of square brackets. For example: series[series > value].

pandas is loaded as pd and the DataFrame temp_by_country_city_vs_year is available. The .head() for this DataFrame is shown below, with only a few of the year columns displayed:

![Calculating on a pivot table](<matplotlib/data/Calculating on a pivot table.png>)


In [21]:
# Get the worldwide mean temp by year
mean_temp_by_year = temp_by_country_city_vs_year.mean()

# Filter for the year that had the highest mean temp
print(mean_temp_by_year[mean_temp_by_year == mean_temp_by_year.max()])

date
2002-07-01    25.35836
dtype: float64


In [22]:
# Get the mean temp by city
mean_temp_by_city = temp_by_country_city_vs_year.mean(axis=1)

# Filter for the city that had the lowest mean temp
print(mean_temp_by_city[mean_temp_by_city == mean_temp_by_city.min()])

country  city  
China    Harbin    4.858494
dtype: float64
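
The axis argument decides which direction is summarized. The sketch below uses a made-up two-by-two table (`toy_pivot` and its values are assumptions) and also shows .idxmax() and .idxmin(), which return just the label when that is all you need.

```python
import pandas as pd

# Toy pivot-style table (values are made up)
toy_pivot = pd.DataFrame(
    {2000: [2.0, 19.0], 2001: [4.0, 20.0]},
    index=pd.MultiIndex.from_tuples(
        [("China", "Harbin"), ("Peru", "Lima")], names=["country", "city"]
    ),
)

# Default axis: one mean per column
mean_by_year = toy_pivot.mean()

# axis="columns": one mean per row
mean_by_city = toy_pivot.mean(axis="columns")

# Boolean filter keeps the label and the value
hottest = mean_by_year[mean_by_year == mean_by_year.max()]
coldest = mean_by_city[mean_by_city == mean_by_city.min()]

# idxmax / idxmin return only the label
print(mean_by_year.idxmax())
print(mean_by_city.idxmin())
```

The Boolean filter returns every tied row if several share the extreme value, whereas .idxmax() and .idxmin() return only the first.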
