
# Setting and removing indexes  

In pandas, you can use a column as an index to simplify subsetting and sometimes improve lookup performance.

Todo: 

- Set the index of temperatures to "city", assigning to `temperatures_ind`.
- Look at `temperatures_ind`. How is it different from temperatures?
- Reset the index of `temperatures_ind`, keeping its contents.
- Reset the index of `temperatures_ind`, dropping its contents.

Import the dataset and save as `temperatures` dataframe.

In [105]:
import pandas as pd 

url = 'https://raw.githubusercontent.com/joseeden/joeden/refs/heads/master/docs/021-Software-Engineering/021-Jupyter-Notebooks/000-Sample-Datasets/data-manipulation-using-pandas/temperatures.csv'
temperatures = pd.read_csv(url)

print(temperatures.head())

   Unnamed: 0        date     city        country  avg_temp_c
0           0  2000-01-01  Abidjan  Côte D'Ivoire      27.293
1           1  2000-02-01  Abidjan  Côte D'Ivoire      27.685
2           2  2000-03-01  Abidjan  Côte D'Ivoire      29.061
3           3  2000-04-01  Abidjan  Côte D'Ivoire      28.162
4           4  2000-05-01  Abidjan  Côte D'Ivoire      27.547


Set the index to "city".

In [106]:
temperatures_ind = temperatures.set_index("city")
print(temperatures_ind)

         Unnamed: 0        date        country  avg_temp_c
city                                                      
Abidjan           0  2000-01-01  Côte D'Ivoire      27.293
Abidjan           1  2000-02-01  Côte D'Ivoire      27.685
Abidjan           2  2000-03-01  Côte D'Ivoire      29.061
Abidjan           3  2000-04-01  Côte D'Ivoire      28.162
Abidjan           4  2000-05-01  Côte D'Ivoire      27.547
...             ...         ...            ...         ...
Xian          16495  2013-05-01          China      18.979
Xian          16496  2013-06-01          China      23.522
Xian          16497  2013-07-01          China      25.251
Xian          16498  2013-08-01          China      24.528
Xian          16499  2013-09-01          China         NaN

[16500 rows x 4 columns]


Reset the `temperatures_ind` index, keeping its contents

In [107]:
print(temperatures_ind.reset_index())

          city  Unnamed: 0        date        country  avg_temp_c
0      Abidjan           0  2000-01-01  Côte D'Ivoire      27.293
1      Abidjan           1  2000-02-01  Côte D'Ivoire      27.685
2      Abidjan           2  2000-03-01  Côte D'Ivoire      29.061
3      Abidjan           3  2000-04-01  Côte D'Ivoire      28.162
4      Abidjan           4  2000-05-01  Côte D'Ivoire      27.547
...        ...         ...         ...            ...         ...
16495     Xian       16495  2013-05-01          China      18.979
16496     Xian       16496  2013-06-01          China      23.522
16497     Xian       16497  2013-07-01          China      25.251
16498     Xian       16498  2013-08-01          China      24.528
16499     Xian       16499  2013-09-01          China         NaN

[16500 rows x 5 columns]


Reset the `temperatures_ind` index, dropping its contents

In [108]:
print(temperatures_ind.reset_index(drop=True))

       Unnamed: 0        date        country  avg_temp_c
0               0  2000-01-01  Côte D'Ivoire      27.293
1               1  2000-02-01  Côte D'Ivoire      27.685
2               2  2000-03-01  Côte D'Ivoire      29.061
3               3  2000-04-01  Côte D'Ivoire      28.162
4               4  2000-05-01  Côte D'Ivoire      27.547
...           ...         ...            ...         ...
16495       16495  2013-05-01          China      18.979
16496       16496  2013-06-01          China      23.522
16497       16497  2013-07-01          China      25.251
16498       16498  2013-08-01          China      24.528
16499       16499  2013-09-01          China         NaN

[16500 rows x 4 columns]
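
The same steps can be tried on a tiny DataFrame. This is a minimal sketch with made-up values (not taken from the temperatures dataset). It shows that `set_index()` returns a new DataFrame and that index values may repeat.

```python
import pandas as pd

# Toy data: hypothetical values for illustration only.
df = pd.DataFrame({
    "city": ["Oslo", "Oslo", "Lima"],
    "avg_temp_c": [1.5, 3.0, 19.2],
})

# set_index returns a new DataFrame; df itself is unchanged.
df_ind = df.set_index("city")

# Index values do not have to be unique ("Oslo" appears twice).
has_duplicates = df_ind.index.has_duplicates

# reset_index() moves "city" back into a column;
# reset_index(drop=True) discards it.
restored = df_ind.reset_index()
dropped = df_ind.reset_index(drop=True)

print(list(df.columns))        # ['city', 'avg_temp_c']
print(list(df_ind.columns))    # ['avg_temp_c']
print(has_duplicates)          # True
print(list(restored.columns))  # ['city', 'avg_temp_c']
print(list(dropped.columns))   # ['avg_temp_c']
```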


# Subsetting 

The `.loc[]` indexer subsets rows by index value. Compared with a Boolean filter inside square brackets, the code is often shorter and easier to read and maintain.

Todo: 

- Create a list called cities that contains "Moscow" and "Saint Petersburg".
- Use [] subsetting to filter temperatures for rows where the city column takes a value in the cities list.
- Use .loc[] subsetting to filter temperatures_ind for rows where the city is in the cities list.

Make a list of cities to subset on, then subset `temperatures` using square brackets.

In [109]:
cities = ["Moscow", "Saint Petersburg"]

print(temperatures[
    temperatures["city"].isin(cities)
])


       Unnamed: 0        date              city country  avg_temp_c
10725       10725  2000-01-01            Moscow  Russia      -7.313
10726       10726  2000-02-01            Moscow  Russia      -3.551
10727       10727  2000-03-01            Moscow  Russia      -1.661
10728       10728  2000-04-01            Moscow  Russia      10.096
10729       10729  2000-05-01            Moscow  Russia      10.357
...           ...         ...               ...     ...         ...
13360       13360  2013-05-01  Saint Petersburg  Russia      12.355
13361       13361  2013-06-01  Saint Petersburg  Russia      17.185
13362       13362  2013-07-01  Saint Petersburg  Russia      17.234
13363       13363  2013-08-01  Saint Petersburg  Russia      17.153
13364       13364  2013-09-01  Saint Petersburg  Russia         NaN

[330 rows x 5 columns]


Subset `temperatures_ind` using `.loc[]`

In [110]:
print(temperatures_ind.loc[cities])

                  Unnamed: 0        date country  avg_temp_c
city                                                        
Moscow                 10725  2000-01-01  Russia      -7.313
Moscow                 10726  2000-02-01  Russia      -3.551
Moscow                 10727  2000-03-01  Russia      -1.661
Moscow                 10728  2000-04-01  Russia      10.096
Moscow                 10729  2000-05-01  Russia      10.357
...                      ...         ...     ...         ...
Saint Petersburg       13360  2013-05-01  Russia      12.355
Saint Petersburg       13361  2013-06-01  Russia      17.185
Saint Petersburg       13362  2013-07-01  Russia      17.234
Saint Petersburg       13363  2013-08-01  Russia      17.153
Saint Petersburg       13364  2013-09-01  Russia         NaN

[330 rows x 4 columns]
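
The two approaches select the same rows but can return them in a different order. The sketch below uses a small made-up DataFrame to show this: `.isin()` keeps the original row order, while `.loc[]` follows the order of the list. Note also that `.loc[]` raises a `KeyError` when a label in the list is missing from the index, whereas `.isin()` simply matches nothing.

```python
import pandas as pd

# Toy data: hypothetical values for illustration only.
df = pd.DataFrame({
    "city": ["Moscow", "Paris", "Saint Petersburg", "Moscow"],
    "avg_temp_c": [-7.3, 4.1, -5.0, -3.6],
})
cities = ["Moscow", "Saint Petersburg"]

# Column-based filter: rows stay in their original order.
by_column = df[df["city"].isin(cities)]

# Index-based lookup: rows are grouped in the order of the list.
by_index = df.set_index("city").loc[cities]

print(by_column["city"].tolist())  # ['Moscow', 'Saint Petersburg', 'Moscow']
print(by_index.index.tolist())     # ['Moscow', 'Moscow', 'Saint Petersburg']
```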


# Setting Multi-Level Indexes  

A multi-level index, also called a **hierarchical index**, uses multiple columns as the index. This approach can simplify working with nested categories. For example, in a clinical trial, test subjects can be grouped under control or treatment groups, and in a dataset, cities can be grouped within countries. 

However, working with indexes has a downside: the syntax for handling them differs from working with columns, so you need to learn two approaches and keep track of the data structure.

Todo:

- Set the index of temperatures to the "country" and "city" columns, and assign this to temperatures_ind.
- Specify two country/city pairs to keep: "Brazil"/"Rio De Janeiro" and "Pakistan"/"Lahore", assigning to rows_to_keep.
- Print and subset temperatures_ind for rows_to_keep using .loc[].

Index temperatures by country & city

In [111]:
temperatures_ind = temperatures.set_index([
    "country",
    "city"
])
print(temperatures_ind)

                       Unnamed: 0        date  avg_temp_c
country       city                                       
Côte D'Ivoire Abidjan           0  2000-01-01      27.293
              Abidjan           1  2000-02-01      27.685
              Abidjan           2  2000-03-01      29.061
              Abidjan           3  2000-04-01      28.162
              Abidjan           4  2000-05-01      27.547
...                           ...         ...         ...
China         Xian          16495  2013-05-01      18.979
              Xian          16496  2013-06-01      23.522
              Xian          16497  2013-07-01      25.251
              Xian          16498  2013-08-01      24.528
              Xian          16499  2013-09-01         NaN

[16500 rows x 3 columns]


Create a list of tuples, then use it to subset the rows to keep.

In [112]:
rows_to_keep = [
    ("Brazil", "Rio De Janeiro"),
    ("Pakistan", "Lahore")
]

print(temperatures_ind.loc[rows_to_keep])

                         Unnamed: 0        date  avg_temp_c
country  city                                              
Brazil   Rio De Janeiro       12540  2000-01-01      25.974
         Rio De Janeiro       12541  2000-02-01      26.699
         Rio De Janeiro       12542  2000-03-01      26.270
         Rio De Janeiro       12543  2000-04-01      25.750
         Rio De Janeiro       12544  2000-05-01      24.356
...                             ...         ...         ...
Pakistan Lahore                8575  2013-05-01      33.457
         Lahore                8576  2013-06-01      34.456
         Lahore                8577  2013-07-01      33.279
         Lahore                8578  2013-08-01      31.511
         Lahore                8579  2013-09-01         NaN

[330 rows x 3 columns]
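
To see the difference between selecting on the outer level and selecting exact pairs, here is a minimal sketch on made-up data (the cities and values are illustrative). A list of strings matches the outer level only, while a list of tuples matches specific `(country, city)` combinations.

```python
import pandas as pd

# Toy data: hypothetical values for illustration only.
df = pd.DataFrame({
    "country": ["Brazil", "Brazil", "Pakistan", "Pakistan"],
    "city": ["Rio De Janeiro", "Salvador", "Karachi", "Lahore"],
    "avg_temp_c": [25.9, 26.1, 18.3, 12.8],
})
df_ind = df.set_index(["country", "city"])

# A list of strings selects on the outer level: every city in Brazil.
outer = df_ind.loc[["Brazil"]]

# A list of tuples selects exact (country, city) pairs.
pairs = df_ind.loc[[("Brazil", "Rio De Janeiro"), ("Pakistan", "Lahore")]]

print(len(outer))             # 2
print(pairs.index.tolist())   # [('Brazil', 'Rio De Janeiro'), ('Pakistan', 'Lahore')]
```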


# Sorting by Index  

In addition to sorting rows with `.sort_values()`, you can rearrange rows based on index values using `.sort_index()`. This helps organize data more effectively when working with indexed DataFrames.

Todo:

- Sort `temperatures_ind` by the index values.
- Sort `temperatures_ind` by the index values at the "city" level.
- Sort `temperatures_ind` by ascending country then descending city.

Sort `temperatures_ind` by the index values.

In [113]:
print(temperatures_ind.sort_index())

                    Unnamed: 0        date  avg_temp_c
country     city                                      
Afghanistan Kabul         7260  2000-01-01       3.326
            Kabul         7261  2000-02-01       3.454
            Kabul         7262  2000-03-01       9.612
            Kabul         7263  2000-04-01      17.925
            Kabul         7264  2000-05-01      24.658
...                        ...         ...         ...
Zimbabwe    Harare        5605  2013-05-01      18.298
            Harare        5606  2013-06-01      17.020
            Harare        5607  2013-07-01      16.299
            Harare        5608  2013-08-01      19.232
            Harare        5609  2013-09-01         NaN

[16500 rows x 3 columns]


Sort `temperatures_ind` by the index values at the "city" level.

In [114]:
print(temperatures_ind.sort_index(level="city"))

                       Unnamed: 0        date  avg_temp_c
country       city                                       
Côte D'Ivoire Abidjan           0  2000-01-01      27.293
              Abidjan           1  2000-02-01      27.685
              Abidjan           2  2000-03-01      29.061
              Abidjan           3  2000-04-01      28.162
              Abidjan           4  2000-05-01      27.547
...                           ...         ...         ...
China         Xian          16495  2013-05-01      18.979
              Xian          16496  2013-06-01      23.522
              Xian          16497  2013-07-01      25.251
              Xian          16498  2013-08-01      24.528
              Xian          16499  2013-09-01         NaN

[16500 rows x 3 columns]


Sort `temperatures_ind` by ascending country then descending city.

In [115]:
print(temperatures_ind.sort_index(
    level=["country", "city"],
    ascending=[True, False]
))


                    Unnamed: 0        date  avg_temp_c
country     city                                      
Afghanistan Kabul         7260  2000-01-01       3.326
            Kabul         7261  2000-02-01       3.454
            Kabul         7262  2000-03-01       9.612
            Kabul         7263  2000-04-01      17.925
            Kabul         7264  2000-05-01      24.658
...                        ...         ...         ...
Zimbabwe    Harare        5605  2013-05-01      18.298
            Harare        5606  2013-06-01      17.020
            Harare        5607  2013-07-01      16.299
            Harare        5608  2013-08-01      19.232
            Harare        5609  2013-09-01         NaN

[16500 rows x 3 columns]
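
With 16,500 rows, the printed output hides most of the reordering. The sketch below repeats the three sorts on four made-up rows so the effect of `level` and `ascending` is visible in full.

```python
import pandas as pd

# Toy data: hypothetical values for illustration only.
df = pd.DataFrame({
    "country": ["India", "China", "India", "China"],
    "city": ["Delhi", "Xian", "Pune", "Harbin"],
    "avg_temp_c": [25.0, 12.0, 24.0, 4.0],
}).set_index(["country", "city"])

# Default: sort by every level, outer to inner, ascending.
default_sorted = df.sort_index()

# Sort by the inner level first.
by_city = df.sort_index(level="city")

# Ascending country, descending city within each country.
mixed = df.sort_index(level=["country", "city"], ascending=[True, False])

print(default_sorted.index.tolist())
# [('China', 'Harbin'), ('China', 'Xian'), ('India', 'Delhi'), ('India', 'Pune')]
print(by_city.index.tolist())
# [('India', 'Delhi'), ('China', 'Harbin'), ('India', 'Pune'), ('China', 'Xian')]
print(mixed.index.tolist())
# [('China', 'Xian'), ('China', 'Harbin'), ('India', 'Pune'), ('India', 'Delhi')]
```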


# Slicing index values

Slicing selects consecutive elements using the `first:last` syntax. For DataFrames, slicing by index values is done inside `.loc[]`. Unlike list slicing, a label slice in `.loc[]` includes the `last` value in the result.

- The index must be sorted (`.sort_index()`) before slicing.
- Use strings for slicing outer-level indexes.
- Use tuples for slicing inner-level indexes.
- A single slice passed to `.loc[]` slices the rows.

Todo:

- Sort the index of `temperatures_ind`.
- Use slicing with `.loc[]` to get these subsets:
  - from Pakistan to Russia.
  - from Lahore to Moscow. (This will return nonsense.)
  - from Pakistan, Lahore to Russia, Moscow.

Sort the index of `temperatures_ind`

In [116]:
temperatures_srt = temperatures_ind.sort_index()
print(temperatures_srt)

                    Unnamed: 0        date  avg_temp_c
country     city                                      
Afghanistan Kabul         7260  2000-01-01       3.326
            Kabul         7261  2000-02-01       3.454
            Kabul         7262  2000-03-01       9.612
            Kabul         7263  2000-04-01      17.925
            Kabul         7264  2000-05-01      24.658
...                        ...         ...         ...
Zimbabwe    Harare        5605  2013-05-01      18.298
            Harare        5606  2013-06-01      17.020
            Harare        5607  2013-07-01      16.299
            Harare        5608  2013-08-01      19.232
            Harare        5609  2013-09-01         NaN

[16500 rows x 3 columns]


Subset rows from Pakistan to Russia. You'll be slicing at the outer level index, which is the country.

In [117]:
print(temperatures_srt.loc["Pakistan":"Russia"])

                           Unnamed: 0        date  avg_temp_c
country  city                                                
Pakistan Faisalabad              4785  2000-01-01      12.792
         Faisalabad              4786  2000-02-01      14.339
         Faisalabad              4787  2000-03-01      20.309
         Faisalabad              4788  2000-04-01      29.072
         Faisalabad              4789  2000-05-01      34.845
...                               ...         ...         ...
Russia   Saint Petersburg       13360  2013-05-01      12.355
         Saint Petersburg       13361  2013-06-01      17.185
         Saint Petersburg       13362  2013-07-01      17.234
         Saint Petersburg       13363  2013-08-01      17.153
         Saint Petersburg       13364  2013-09-01         NaN

[1155 rows x 3 columns]


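Try to subset rows from Lahore to Moscow. Cities sit on the inner index level, so both ends of the slice need to be `(country, city)` tuples. With plain strings, pandas compares the bounds against the outer level (country), so the command below runs without error but returns the countries that fall alphabetically between "Lahore" and "Moscow".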

In [118]:
print(temperatures_srt.loc["Lahore":"Moscow"])

                    Unnamed: 0        date  avg_temp_c
country city                                          
Mexico  Mexico           10230  2000-01-01      12.694
        Mexico           10231  2000-02-01      14.677
        Mexico           10232  2000-03-01      17.376
        Mexico           10233  2000-04-01      18.294
        Mexico           10234  2000-05-01      18.562
...                        ...         ...         ...
Morocco Casablanca        3130  2013-05-01      19.217
        Casablanca        3131  2013-06-01      23.649
        Casablanca        3132  2013-07-01      27.488
        Casablanca        3133  2013-08-01      27.952
        Casablanca        3134  2013-09-01         NaN

[330 rows x 3 columns]


Subset rows from Pakistan, Lahore to Russia, Moscow. Use the same command as above, but this time pass `(country, city)` tuples as the slice bounds.

In [119]:
print(temperatures_srt.loc[("Pakistan", "Lahore"):("Russia", "Moscow")])

                 Unnamed: 0        date  avg_temp_c
country  city                                      
Pakistan Lahore        8415  2000-01-01      12.792
         Lahore        8416  2000-02-01      14.339
         Lahore        8417  2000-03-01      20.309
         Lahore        8418  2000-04-01      29.072
         Lahore        8419  2000-05-01      34.845
...                     ...         ...         ...
Russia   Moscow       10885  2013-05-01      16.152
         Moscow       10886  2013-06-01      18.718
         Moscow       10887  2013-07-01      18.136
         Moscow       10888  2013-08-01      17.485
         Moscow       10889  2013-09-01         NaN

[660 rows x 3 columns]
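
The three slices can be compared side by side on a few made-up rows. This is a sketch under the assumption that the index is sorted first; the values are illustrative only.

```python
import pandas as pd

# Toy data: hypothetical values for illustration only.
df = pd.DataFrame({
    "country": ["Mexico", "Pakistan", "Pakistan", "Russia", "Russia"],
    "city": ["Mexico", "Karachi", "Lahore", "Moscow", "Saint Petersburg"],
    "avg_temp_c": [12.7, 18.3, 12.8, -7.3, -5.0],
}).set_index(["country", "city"]).sort_index()

# String bounds are matched against the outer level (country).
outer = df.loc["Pakistan":"Russia"]

# City names as plain strings are still compared with countries:
# only "Mexico" sorts between "Lahore" and "Moscow".
wrong = df.loc["Lahore":"Moscow"]

# Tuples give (country, city) bounds, so the inner level is respected.
inner = df.loc[("Pakistan", "Lahore"):("Russia", "Moscow")]

print(outer.index.tolist())
print(wrong.index.tolist())   # [('Mexico', 'Mexico')]
print(inner.index.tolist())   # [('Pakistan', 'Lahore'), ('Russia', 'Moscow')]
```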


# Slicing in both directions

You can slice both rows and columns at once in a DataFrame. By passing two arguments to `.loc[]`, you can subset the DataFrame by both rows and columns in one step.

Todo:

- Use `.loc[]` slicing to subset rows from India, Hyderabad to Iraq, Baghdad.
- Use `.loc[]` slicing to subset columns from `date` to `avg_temp_c`.
- Slice in both directions at once from Hyderabad to Baghdad, and `date` to `avg_temp_c`.

Print the dataframe that has been indexed by country and city, and then sorted.

In [120]:
print(temperatures_srt)

                    Unnamed: 0        date  avg_temp_c
country     city                                      
Afghanistan Kabul         7260  2000-01-01       3.326
            Kabul         7261  2000-02-01       3.454
            Kabul         7262  2000-03-01       9.612
            Kabul         7263  2000-04-01      17.925
            Kabul         7264  2000-05-01      24.658
...                        ...         ...         ...
Zimbabwe    Harare        5605  2013-05-01      18.298
            Harare        5606  2013-06-01      17.020
            Harare        5607  2013-07-01      16.299
            Harare        5608  2013-08-01      19.232
            Harare        5609  2013-09-01         NaN

[16500 rows x 3 columns]


Subset rows from India, Hyderabad to Iraq, Baghdad

In [121]:
print(temperatures_srt.loc[("India", "Hyderabad"):("Iraq", "Baghdad")])

                   Unnamed: 0        date  avg_temp_c
country city                                         
India   Hyderabad        5940  2000-01-01      23.779
        Hyderabad        5941  2000-02-01      25.826
        Hyderabad        5942  2000-03-01      28.821
        Hyderabad        5943  2000-04-01      32.698
        Hyderabad        5944  2000-05-01      32.438
...                       ...         ...         ...
Iraq    Baghdad          1150  2013-05-01      28.673
        Baghdad          1151  2013-06-01      33.803
        Baghdad          1152  2013-07-01      36.392
        Baghdad          1153  2013-08-01      35.463
        Baghdad          1154  2013-09-01         NaN

[2145 rows x 3 columns]


Subset columns from date to avg_temp_c

In [122]:
print(temperatures_srt.loc[:, "date":"avg_temp_c"])

                          date  avg_temp_c
country     city                          
Afghanistan Kabul   2000-01-01       3.326
            Kabul   2000-02-01       3.454
            Kabul   2000-03-01       9.612
            Kabul   2000-04-01      17.925
            Kabul   2000-05-01      24.658
...                        ...         ...
Zimbabwe    Harare  2013-05-01      18.298
            Harare  2013-06-01      17.020
            Harare  2013-07-01      16.299
            Harare  2013-08-01      19.232
            Harare  2013-09-01         NaN

[16500 rows x 2 columns]


Subset in both directions at once

In [123]:
print(temperatures_srt.loc[
    ("India", "Hyderabad"):("Iraq", "Baghdad"), 
    "date":"avg_temp_c"
    ])

                         date  avg_temp_c
country city                             
India   Hyderabad  2000-01-01      23.779
        Hyderabad  2000-02-01      25.826
        Hyderabad  2000-03-01      28.821
        Hyderabad  2000-04-01      32.698
        Hyderabad  2000-05-01      32.438
...                       ...         ...
Iraq    Baghdad    2013-05-01      28.673
        Baghdad    2013-06-01      33.803
        Baghdad    2013-07-01      36.392
        Baghdad    2013-08-01      35.463
        Baghdad    2013-09-01         NaN

[2145 rows x 2 columns]
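
One detail worth checking on a small example: a column slice follows the order in which the columns appear, not alphabetical order. In this sketch (made-up data, with a hypothetical `station_id` column), `"date":"avg_temp_c"` works because `date` comes before `avg_temp_c` in the DataFrame, while the reversed slice selects no columns.

```python
import pandas as pd

# Toy data: hypothetical values for illustration only.
df = pd.DataFrame(
    {
        "station_id": [1, 2, 3, 4],
        "date": ["2000-01-01"] * 4,
        "avg_temp_c": [23.8, 14.1, 9.9, 30.2],
    },
    index=pd.Index(["Baghdad", "Cairo", "Delhi", "Hyderabad"], name="city"),
)

# Rows and columns in one step; both ends of each slice are included.
subset = df.loc["Cairo":"Delhi", "date":"avg_temp_c"]

# Reversing the column bounds yields an empty selection.
reversed_cols = df.loc[:, "avg_temp_c":"date"]

print(subset.shape)            # (2, 2)
print(list(subset.columns))    # ['date', 'avg_temp_c']
print(reversed_cols.shape)
```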


# Slicing time series

Slicing is helpful for time series data, especially when filtering by a date range. Set the date column as the index and use `.loc[]` for subsetting. Ensure your dates are in ISO 8601 format: "yyyy-mm-dd" for full dates, "yyyy-mm" for months, and "yyyy" for years.

Todo:

- Use Boolean conditions, not .isin() or .loc[], and the full date "yyyy-mm-dd", to subset temperatures for rows where the date column is in 2010 and 2011 and print the results.
- Set the index of temperatures to the date column and sort it.
- Use .loc[] to subset temperatures_ind for rows in 2010 and 2011.
- Use .loc[] to subset temperatures_ind for rows from August 2010 to February 2011.

Use Boolean conditions to subset temperatures for rows in 2010 and 2011

In [124]:
temperatures_bool = temperatures[
    (temperatures["date"] >= "2010-01-01") & 
    (temperatures["date"] <= "2011-12-31") 
    ]
print(temperatures_bool)

       Unnamed: 0        date     city        country  avg_temp_c
120           120  2010-01-01  Abidjan  Côte D'Ivoire      28.270
121           121  2010-02-01  Abidjan  Côte D'Ivoire      29.262
122           122  2010-03-01  Abidjan  Côte D'Ivoire      29.596
123           123  2010-04-01  Abidjan  Côte D'Ivoire      29.068
124           124  2010-05-01  Abidjan  Côte D'Ivoire      28.258
...           ...         ...      ...            ...         ...
16474       16474  2011-08-01     Xian          China      23.069
16475       16475  2011-09-01     Xian          China      16.775
16476       16476  2011-10-01     Xian          China      12.587
16477       16477  2011-11-01     Xian          China       7.543
16478       16478  2011-12-01     Xian          China      -0.490

[2400 rows x 5 columns]


Set date as the index and sort the index.

In [125]:
temperatures_ind = temperatures.set_index("date").sort_index()
print(temperatures_ind)

            Unnamed: 0       city        country  avg_temp_c
date                                                        
2000-01-01           0    Abidjan  Côte D'Ivoire      27.293
2000-01-01        8415     Lahore       Pakistan      12.792
2000-01-01       15345   Tangshan          China      -5.406
2000-01-01        5115      Gizeh          Egypt      12.669
2000-01-01        8580    Lakhnau          India      15.152
...                ...        ...            ...         ...
2013-09-01       11549    Nanjing          China         NaN
2013-09-01       11714  New Delhi          India         NaN
2013-09-01       11879   New York  United States      17.408
2013-09-01       12209     Peking          China         NaN
2013-09-01       16499       Xian          China         NaN

[16500 rows x 4 columns]


Use .loc[] to subset temperatures_ind for rows in 2010 and 2011.

In [126]:
print(temperatures_ind.loc["2010-01-01":"2011-12-31"])

            Unnamed: 0        city    country  avg_temp_c
date                                                     
2010-01-01        4905  Faisalabad   Pakistan      11.810
2010-01-01       10185   Melbourne  Australia      20.016
2010-01-01        3750   Chongqing      China       7.921
2010-01-01       13155   São Paulo     Brazil      23.738
2010-01-01        5400   Guangzhou      China      14.136
...                ...         ...        ...         ...
2011-12-01       11033      Nagoya      Japan       6.476
2011-12-01        6083   Hyderabad      India      23.613
2011-12-01        2783        Cali   Colombia      21.559
2011-12-01        8888        Lima       Peru      18.293
2011-12-01        1463     Bangkok   Thailand      25.021

[2400 rows x 4 columns]


Use .loc[] to subset temperatures_ind for rows from Aug 2010 to Feb 2011.

In [127]:
print(temperatures_ind.loc["2010-08-01":"2011-02-28"])

            Unnamed: 0      city        country  avg_temp_c
date                                                       
2010-08-01        2602  Calcutta          India      30.226
2010-08-01       12337      Pune          India      24.941
2010-08-01        6562     Izmir         Turkey      28.352
2010-08-01       15637   Tianjin          China      25.543
2010-08-01        9862    Manila    Philippines      27.101
...                ...       ...            ...         ...
2011-02-01        7393     Kabul    Afghanistan       3.914
2011-02-01        3598   Chicago  United States       0.276
2011-02-01         628    Aleppo          Syria       8.246
2011-02-01        4423     Delhi          India      18.136
2011-02-01       12508   Rangoon          Burma      26.631

[700 rows x 4 columns]
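
The shorter "yyyy" and "yyyy-mm" forms behave differently depending on the index type. The sketch below uses made-up monthly data and assumes the index has been converted to datetimes, which the cells above do not do. With a `DatetimeIndex`, a partial date covers the whole period. With a plain string index, `"2011"` is compared as text and the slice stops before any 2011 row.

```python
import pandas as pd

# Toy data: one hypothetical reading per month, June 2010 to March 2012.
dates = pd.date_range("2010-06-01", "2012-03-01", freq="MS", name="date")
ts = pd.DataFrame({"avg_temp_c": range(len(dates))}, index=dates)

# DatetimeIndex: partial strings cover the whole year or month.
years = ts.loc["2010":"2011"]          # all of 2010 and 2011
months = ts.loc["2010-08":"2011-02"]   # Aug 2010 through Feb 2011

# String index: bounds are compared as text, so "2011" sorts
# before "2011-01-01" and every 2011 row is left out.
str_ts = ts.copy()
str_ts.index = ts.index.strftime("%Y-%m-%d")
str_years = str_ts.loc["2010":"2011"]

print(len(years))      # 19
print(len(months))     # 7
print(len(str_years))  # 7
```

This is why the cells above spell out full dates such as `"2011-12-31"` when slicing the string-valued `date` index.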


# Subsetting by row/column number

Subsetting by row/column number is another way to filter data. Instead of using index labels or conditions, you can use row numbers with `.iloc[]`. Like `.loc[]`, `.iloc[]` accepts two arguments to subset both rows and columns.

Todo:

- Get the 23rd row, 2nd column (index positions 22 and 1).
- Get the first 5 rows (index positions 0 to 4, written as the slice `0:5` because the end position is excluded).
- Get all rows, columns 3 and 4 (index positions 2 and 3, written as the slice `2:4`).
- Get the first 5 rows, columns 3 and 4.

Get 23rd row, 2nd column (index 22, 1).

In [128]:
print(temperatures.iloc[22, 1])

2001-11-01


Use slicing to get the first 5 rows.

In [129]:
print(temperatures.iloc[:5, :])

   Unnamed: 0        date     city        country  avg_temp_c
0           0  2000-01-01  Abidjan  Côte D'Ivoire      27.293
1           1  2000-02-01  Abidjan  Côte D'Ivoire      27.685
2           2  2000-03-01  Abidjan  Côte D'Ivoire      29.061
3           3  2000-04-01  Abidjan  Côte D'Ivoire      28.162
4           4  2000-05-01  Abidjan  Côte D'Ivoire      27.547


Use slicing to get columns 3 to 4.

In [130]:
print(temperatures.iloc[:, 2:4])

          city        country
0      Abidjan  Côte D'Ivoire
1      Abidjan  Côte D'Ivoire
2      Abidjan  Côte D'Ivoire
3      Abidjan  Côte D'Ivoire
4      Abidjan  Côte D'Ivoire
...        ...            ...
16495     Xian          China
16496     Xian          China
16497     Xian          China
16498     Xian          China
16499     Xian          China

[16500 rows x 2 columns]


Use slicing in both directions at once.

In [131]:
print(temperatures.iloc[:5, 2:4])

      city        country
0  Abidjan  Côte D'Ivoire
1  Abidjan  Côte D'Ivoire
2  Abidjan  Côte D'Ivoire
3  Abidjan  Côte D'Ivoire
4  Abidjan  Côte D'Ivoire
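
The main trap with `.iloc[]` is that the end position of a slice is excluded, whereas `.loc[]` includes the end label. A minimal sketch on made-up data with a default integer index shows the difference.

```python
import pandas as pd

# Toy data: hypothetical values for illustration only.
df = pd.DataFrame({
    "a": [10, 11, 12, 13, 14, 15],
    "b": list("uvwxyz"),
    "c": [0.0, 0.1, 0.2, 0.3, 0.4, 0.5],
})

first_five = df.iloc[:5]   # positions 0..4, end excluded
by_label = df.loc[:5]      # labels 0..5, end included
cols = df.iloc[:, 1:3]     # second and third columns: "b" and "c"
cell = df.iloc[2, 1]       # third row, second column

print(len(first_five))     # 5
print(len(by_label))       # 6
print(list(cols.columns))  # ['b', 'c']
print(cell)                # w
```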


# Pivot temperature 

To observe how temperatures change over time, looking at monthly data can be overwhelming. Instead, we can focus on how temperatures vary by year.

You can extract components like year and month from a date using `dataframe["column"].dt.component`. For example, use `dataframe["column"].dt.month` for the month and `dataframe["column"].dt.year` for the year.

Once you have the year column, you can create a pivot table to aggregate data by city and year.
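
As a quick illustration of both steps, here is a sketch on made-up data. It assumes the `date` column starts out as strings, so it has to be converted with `pd.to_datetime()` before `.dt` can be used. `pivot_table()` aggregates with the mean by default, and combinations with no data become `NaN`.

```python
import pandas as pd

# Toy data: hypothetical values for illustration only.
df = pd.DataFrame({
    "date": ["2000-01-01", "2000-07-01", "2001-01-01", "2001-07-01",
             "2000-01-01", "2000-07-01"],
    "city": ["Oslo"] * 4 + ["Lima"] * 2,
    "avg_temp_c": [-4.0, 16.0, -2.0, 18.0, 22.0, 16.0],
})

# .dt is only available on datetime columns, so convert first.
df["date"] = pd.to_datetime(df["date"])
df["year"] = df["date"].dt.year

# One row per city, one column per year, mean temperature in each cell.
by_year = df.pivot_table(values="avg_temp_c", index="city", columns="year")

print(by_year.loc["Oslo", 2000])  # 6.0
print(by_year.loc["Lima", 2000])  # 19.0
print(list(by_year.columns))      # [2000, 2001]
```

The year labels in the result are integers, which matters when slicing the columns later.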

Add a year column to temperatures.

In [136]:
temperatures["date"] = pd.to_datetime(temperatures["date"])
temperatures["year"] = temperatures["date"].dt.year 
print(temperatures.head())

   Unnamed: 0       date     city        country  avg_temp_c  year
0           0 2000-01-01  Abidjan  Côte D'Ivoire      27.293  2000
1           1 2000-02-01  Abidjan  Côte D'Ivoire      27.685  2000
2           2 2000-03-01  Abidjan  Côte D'Ivoire      29.061  2000
3           3 2000-04-01  Abidjan  Côte D'Ivoire      28.162  2000
4           4 2000-05-01  Abidjan  Côte D'Ivoire      27.547  2000


Make a pivot table of the avg_temp_c column, with country and city as rows, and year as columns. Assign to temp_by_country_city_vs_year, and look at the result.

In [138]:
temp_by_country_city_vs_year = temperatures.pivot_table(
    values='avg_temp_c', 
    index=['country','city'],
    columns='year'
    
)

print(temp_by_country_city_vs_year)


year                                 2000       2001       2002       2003  \
country       city                                                           
Afghanistan   Kabul             15.822667  15.847917  15.714583  15.132583   
Angola        Luanda            24.410333  24.427083  24.790917  24.867167   
Australia     Melbourne         14.320083  14.180000  14.075833  13.985583   
              Sydney            17.567417  17.854500  17.733833  17.592333   
Bangladesh    Dhaka             25.905250  25.931250  26.095000  25.927417   
...                                   ...        ...        ...        ...   
United States Chicago           11.089667  11.703083  11.532083  10.481583   
              Los Angeles       16.643333  16.466250  16.430250  16.944667   
              New York           9.969083  10.931000  11.252167   9.836000   
Vietnam       Ho Chi Minh City  27.588917  27.831750  28.064750  27.827667   
Zimbabwe      Harare            20.283667  20.861000  21.079333 

Subset for Egypt to India.

In [139]:
print(temp_by_country_city_vs_year.loc["Egypt":"India"])

year                       2000       2001       2002       2003       2004  \
country  city                                                                 
Egypt    Alexandria   20.744500  21.454583  21.456167  21.221417  21.064167   
         Cairo        21.486167  22.330833  22.414083  22.170500  22.081917   
         Gizeh        21.486167  22.330833  22.414083  22.170500  22.081917   
Ethiopia Addis Abeba  18.241250  18.296417  18.469750  18.320917  18.292750   
France   Paris        11.739667  11.371250  11.871333  11.909500  11.338833   
Germany  Berlin       10.963667   9.690250  10.264417  10.065750   9.822583   
India    Ahmadabad    27.436000  27.198083  27.719083  27.403833  27.628333   
         Bangalore    25.337917  25.528167  25.755333  25.924750  25.252083   
         Bombay       27.203667  27.243667  27.628667  27.578417  27.318750   
         Calcutta     26.491333  26.515167  26.703917  26.561333  26.634333   
         Delhi        26.048333  25.862917  26.63433

Subset for Egypt, Cairo to India, Delhi.

In [140]:
print(temp_by_country_city_vs_year.loc[
    ("Egypt","Cairo"):("India","Delhi")
    ])

year                       2000       2001       2002       2003       2004  \
country  city                                                                 
Egypt    Cairo        21.486167  22.330833  22.414083  22.170500  22.081917   
         Gizeh        21.486167  22.330833  22.414083  22.170500  22.081917   
Ethiopia Addis Abeba  18.241250  18.296417  18.469750  18.320917  18.292750   
France   Paris        11.739667  11.371250  11.871333  11.909500  11.338833   
Germany  Berlin       10.963667   9.690250  10.264417  10.065750   9.822583   
India    Ahmadabad    27.436000  27.198083  27.719083  27.403833  27.628333   
         Bangalore    25.337917  25.528167  25.755333  25.924750  25.252083   
         Bombay       27.203667  27.243667  27.628667  27.578417  27.318750   
         Calcutta     26.491333  26.515167  26.703917  26.561333  26.634333   
         Delhi        26.048333  25.862917  26.634333  25.721083  26.239917   

year                       2005       2006       20

Subset for Egypt, Cairo to India, Delhi, and 2005 to 2010.

In [141]:
print(temp_by_country_city_vs_year.loc[
    ("Egypt","Cairo"):("India","Delhi"),
    2005:2010  # year labels are integers (from .dt.year), so slice with integers
    ])

year                       2005       2006       2007       2008       2009  \
country  city                                                                 
Egypt    Cairo        22.006500  22.050000  22.361000  22.644500  22.625000   
         Gizeh        22.006500  22.050000  22.361000  22.644500  22.625000   
Ethiopia Addis Abeba  18.312833  18.427083  18.142583  18.165000  18.765333   
France   Paris        11.552917  11.788500  11.750833  11.278250  11.464083   
Germany  Berlin        9.919083  10.545333  10.883167  10.657750  10.062500   
India    Ahmadabad    26.828083  27.282833  27.511167  27.048500  28.095833   
         Bangalore    25.476500  25.418250  25.464333  25.352583  25.725750   
         Bombay       27.035750  27.381500  27.634667  27.177750  27.844500   
         Calcutta     26.729167  26.986250  26.584583  26.522333  27.153250   
         Delhi        25.716083  26.365917  26.145667  25.675000  26.554250   

year                       2010  
country  city    

# Calculating on a Pivot Table

Pivot tables summarize data, but often more calculations are needed to uncover insights. A common task is identifying rows or columns with the highest or lowest values.

You can easily filter a Series or DataFrame using a condition inside square brackets, as shown in the example: `series[series > value]`.

Below are the first rows of the `temp_by_country_city_vs_year` DataFrame created earlier, as returned by `.head()`, with only a few of the year columns shown:

| country      | city      | 2000   | 2001   | 2002   | ... | 2013   |
|--------------|-----------|--------|--------|--------|-----|--------|
| Afghanistan  | Kabul     | 15.823 | 15.848 | 15.715 | ... | 16.206 |
| Angola       | Luanda    | 24.410 | 24.427 | 24.791 | ... | 24.554 |
| Australia    | Melbourne | 14.320 | 14.180 | 14.076 | ... | 14.742 |
| Australia    | Sydney    | 17.567 | 17.854 | 17.734 | ... | 18.090 |
| Bangladesh   | Dhaka     | 25.905 | 25.931 | 26.095 | ... | 26.587 |

Get the worldwide mean temp by year.

In [142]:
mean_temp_by_year = temp_by_country_city_vs_year.mean()
print(mean_temp_by_year)

year
2000    19.506243
2001    19.679352
2002    19.855685
2003    19.630197
2004    19.672204
2005    19.607239
2006    19.793993
2007    19.854270
2008    19.608778
2009    19.833752
2010    19.911734
2011    19.549197
2012    19.668239
2013    20.312285
dtype: float64


Filter for the year that had the highest mean temp.

In [143]:
highest_temp = mean_temp_by_year[mean_temp_by_year == mean_temp_by_year.max()]
print(highest_temp)

year
2013    20.312285
dtype: float64


Get the mean temp by city.

In [145]:
mean_temp_by_city = temp_by_country_city_vs_year.mean(axis="columns")
print(mean_temp_by_city)

country        city            
Afghanistan    Kabul               15.541955
Angola         Luanda              24.391616
Australia      Melbourne           14.275411
               Sydney              17.799250
Bangladesh     Dhaka               26.174440
                                     ...    
United States  Chicago             11.330825
               Los Angeles         16.675399
               New York            10.911034
Vietnam        Ho Chi Minh City    27.922857
Zimbabwe       Harare              20.699000
Length: 100, dtype: float64


Filter for the city that had the lowest mean temp.

In [146]:
lowest_temp = mean_temp_by_city[mean_temp_by_city == mean_temp_by_city.min()]
print(lowest_temp)

country  city  
China    Harbin    4.876551
dtype: float64
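
The same pattern can be checked on a small table where the answers are easy to verify by hand. This sketch uses made-up numbers. It also shows `.idxmax()` and `.idxmin()`, which are swapped in here as an alternative (not used in the cells above) when only the label of the extreme value is needed.

```python
import pandas as pd

# Toy pivot-style table: hypothetical values for illustration only.
table = pd.DataFrame(
    {2000: [4.0, 20.0, 27.0], 2001: [5.0, 21.0, 26.0]},
    index=pd.Index(["Harbin", "Harare", "Manila"], name="city"),
)
table.columns.name = "year"

mean_by_year = table.mean()                # one value per column
mean_by_city = table.mean(axis="columns")  # one value per row

# Boolean filter: keeps the label together with the value.
hottest_year = mean_by_year[mean_by_year == mean_by_year.max()]
coldest_city = mean_by_city[mean_by_city == mean_by_city.min()]

# idxmax / idxmin return only the label of the extreme value.
hottest_label = mean_by_year.idxmax()
coldest_label = mean_by_city.idxmin()

print(hottest_year.index.tolist())  # [2001]
print(coldest_city.index.tolist())  # ['Harbin']
print(coldest_label)                # Harbin
```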
