<a href="https://colab.research.google.com/github/joseeden/joeden/blob/master/docs/021-Software-Engineering/021-Jupyter-Notebooks/001-Sample-Notebooks/005-indexes.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Setting and removing indexes  

In pandas, you can use a column as an index to simplify subsetting and sometimes improve lookup performance.

Todo: 

- Set the index of `temperatures` to `"city"`, assigning to `temperatures_ind`.
- Look at `temperatures_ind`. How is it different from `temperatures`?
- Reset the index of `temperatures_ind`, keeping its contents.
- Reset the index of `temperatures_ind`, dropping its contents.

Import the dataset and save it as the `temperatures` DataFrame.

In [None]:
import pandas as pd 

url = 'https://raw.githubusercontent.com/joseeden/joeden/refs/heads/master/docs/021-Software-Engineering/021-Jupyter-Notebooks/000-Sample-Datasets/data-manipulation-using-pandas/temperatures.csv'
temperatures = pd.read_csv(url)

print(temperatures.head())

Set the index to "city".

In [None]:
temperatures_ind = temperatures.set_index("city")
print(temperatures_ind)

Reset the `temperatures_ind` index, keeping its contents. The old index comes back as a regular column.

In [None]:
print(temperatures_ind.reset_index())

Reset the `temperatures_ind` index, dropping its contents. With `drop=True`, the old index values are discarded.

In [None]:
print(temperatures_ind.reset_index(drop=True))

# Subsetting 

The `.loc[]` indexer subsets rows by their index values. Once the column you filter on is the index, `.loc[]` is usually shorter and easier to read than the equivalent Boolean condition inside square brackets.

Todo: 

- Create a list called `cities` that contains `"Moscow"` and `"Saint Petersburg"`.
- Use `[]` subsetting to filter `temperatures` for rows where the `city` column takes a value in the `cities` list.
- Use `.loc[]` subsetting to filter `temperatures_ind` for rows where the city is in the `cities` list.

Make a list of cities to subset on, then subset `temperatures` using square brackets.

In [None]:
cities = ["Moscow", "Saint Petersburg"]

print(temperatures[
    temperatures["city"].isin(cities)
])


Subset `temperatures_ind` using `.loc[]`

In [None]:
print(temperatures_ind.loc[cities])
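
The two approaches above select the same rows, but they do not behave identically. The sketch below uses a small made-up DataFrame (the values are hypothetical, not taken from the temperatures dataset) to show one difference: `.isin()` keeps the original row order, while `.loc[]` with a list returns rows in the order of the list.

```python
import pandas as pd

# Toy data with hypothetical values, not from the temperatures dataset
df = pd.DataFrame({
    "city": ["Moscow", "Lahore", "Saint Petersburg", "Moscow"],
    "avg_temp_c": [-7.5, 12.8, -8.1, -4.0],
})
cities = ["Moscow", "Saint Petersburg"]

# Boolean subsetting on a column: rows stay in their original order
by_column = df[df["city"].isin(cities)]

# Label subsetting on the index: rows follow the order of the list
by_index = df.set_index("city").loc[cities]

print(by_column)
print(by_index)
```

Another difference worth keeping in mind: `.isin()` silently ignores values that do not occur in the column, whereas in recent pandas versions `.loc[]` raises a `KeyError` if any label in the list is missing from the index.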

# Setting Multi-Level Indexes  

A multi-level index, also called a **hierarchical index**, uses multiple columns as the index. This approach can simplify working with nested categories. For example, in a clinical trial, test subjects can be grouped under control or treatment groups, and in a dataset, cities can be grouped within countries. 

However, working with indexes has a downside: the syntax for handling them differs from working with columns, so you need to learn two approaches and keep track of the data structure.
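
To make the "two approaches" concrete, here is a minimal sketch on a made-up DataFrame (hypothetical values) that selects the same row once through columns and once through a multi-level index. Note that after `set_index()`, `country` and `city` are no longer columns, so column-style code such as `df["country"]` stops working on the indexed frame.

```python
import pandas as pd

# Toy data with hypothetical values
df = pd.DataFrame({
    "country": ["Brazil", "Brazil", "Pakistan"],
    "city": ["Rio De Janeiro", "Salvador", "Lahore"],
    "avg_temp_c": [25.0, 26.1, 24.3],
})

# Column-based syntax: combine Boolean conditions
as_columns = df[(df["country"] == "Brazil") & (df["city"] == "Salvador")]

# Index-based syntax: pass a list of (country, city) tuples to .loc[]
indexed = df.set_index(["country", "city"])
as_index = indexed.loc[[("Brazil", "Salvador")]]

print(as_columns)
print(as_index)

# The index levels are no longer regular columns
print(list(indexed.index.names))
print("country" in indexed.columns)
```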

Todo:

- Set the index of `temperatures` to the `"country"` and `"city"` columns, and assign this to `temperatures_ind`.
- Specify two country/city pairs to keep: `"Brazil"`/`"Rio De Janeiro"` and `"Pakistan"`/`"Lahore"`, assigning to `rows_to_keep`.
- Subset `temperatures_ind` for `rows_to_keep` using `.loc[]` and print the result.

Index `temperatures` by country and city.

In [None]:
temperatures_ind = temperatures.set_index([
    "country",
    "city"
])
print(temperatures_ind)

Create a list of tuples, one per country/city pair, and use it to subset the rows to keep.

In [None]:
rows_to_keep = [
    ("Brazil", "Rio De Janeiro"),
    ("Pakistan", "Lahore")
]

print(temperatures_ind.loc[rows_to_keep])

# Sorting by Index  

In addition to sorting rows with `.sort_values()`, you can rearrange rows based on index values using `.sort_index()`. This helps organize data more effectively when working with indexed DataFrames.

Todo:

- Sort `temperatures_ind` by the index values.
- Sort `temperatures_ind` by the index values at the "city" level.
- Sort `temperatures_ind` by ascending country then descending city.

Sort `temperatures_ind` by the index values.

In [None]:
print(temperatures_ind.sort_index())

Sort `temperatures_ind` by the index values at the "city" level.

In [None]:
print(temperatures_ind.sort_index(level="city"))

Sort `temperatures_ind` by ascending country then descending city.

In [None]:
print(temperatures_ind.sort_index(
    level=["country", "city"],
    ascending=[True, False]
))


# Slicing index values

Slicing allows you to select consecutive elements using the `first:last` syntax. For DataFrames, slicing by index values is done inside the `.loc[]` indexer.

- The index must be sorted (`.sort_index()`) before slicing.
- Use strings for slicing outer-level indexes.
- Use tuples for slicing inner-level indexes.
- A single slice passed to `.loc[]` slices the rows.
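
The rules above can be tried out on a small made-up DataFrame (hypothetical values). This sketch also shows two things that are easy to miss: label slices in `.loc[]` include the end label, and slicing with plain strings is matched against the outer level even when the strings are city names.

```python
import pandas as pd

# Toy data with hypothetical values, indexed by country and city, then sorted
df = pd.DataFrame({
    "country": ["Russia", "Pakistan", "Pakistan", "Russia", "Mexico"],
    "city": ["Moscow", "Lahore", "Karachi", "Saint Petersburg", "Mexico"],
    "avg_temp_c": [-7.5, 24.3, 26.0, -8.1, 16.2],
}).set_index(["country", "city"]).sort_index()

# Outer level: strings. Both Pakistan and Russia are included.
outer = df.loc["Pakistan":"Russia"]

# Inner level: tuples of (country, city). Both endpoints are included.
inner = df.loc[("Pakistan", "Lahore"):("Russia", "Moscow")]

# City names passed as plain strings are compared against the country level,
# so this returns countries that sort between "Lahore" and "Moscow".
wrong = df.loc["Lahore":"Moscow"]

print(outer)
print(inner)
print(wrong)
```

In this toy data the last slice returns only the Mexico row, because "Mexico" is the only country name that falls alphabetically between "Lahore" and "Moscow".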

Todo:

- Sort the index of `temperatures_ind`.
- Use slicing with `.loc[]` to get these subsets:
  - from Pakistan to Russia.
  - from Lahore to Moscow. (This will return nonsense.)
  - from Pakistan, Lahore to Russia, Moscow.

Sort the index of `temperatures_ind`

In [None]:
temperatures_srt = temperatures_ind.sort_index()
print(temperatures_srt)

Subset rows from Pakistan to Russia. You'll be slicing at the outer level index, which is the country.

In [None]:
print(temperatures_srt.loc["Pakistan":"Russia"])

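Try to subset rows from Lahore to Moscow. Because cities are on the inner level, both endpoints of the slice need to be tuples. With plain strings, pandas compares the values against the outer (country) level instead, so the command below runs without an error but returns rows that have nothing to do with Lahore or Moscow.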

In [None]:
print(temperatures_srt.loc["Lahore":"Moscow"])

Subset rows from Pakistan, Lahore to Russia, Moscow. This is the same command as above, but with each endpoint given as a (country, city) tuple.

In [None]:
print(temperatures_srt.loc[("Pakistan", "Lahore"):("Russia", "Moscow")])

# Slicing in both directions

You can slice both rows and columns at once in a DataFrame. By passing two arguments to `.loc[]`, you can subset the DataFrame by both rows and columns in one step.

Todo:

- Use `.loc[]` slicing to subset rows from India, Hyderabad to Iraq, Baghdad.
- Use `.loc[]` slicing to subset columns from `date` to `avg_temp_c`.
- Slice in both directions at once: rows from India, Hyderabad to Iraq, Baghdad, and columns from `date` to `avg_temp_c`.

Print the dataframe that has been indexed by country and city, and then sorted.

In [None]:
print(temperatures_srt)

Subset rows from India, Hyderabad to Iraq, Baghdad

In [None]:
print(temperatures_srt.loc[("India", "Hyderabad"):("Iraq", "Baghdad")])

Subset columns from date to avg_temp_c

In [None]:
print(temperatures_srt.loc[:, "date":"avg_temp_c"])

Subset in both directions at once

In [None]:
print(temperatures_srt.loc[
    ("India", "Hyderabad"):("Iraq", "Baghdad"), 
    "date":"avg_temp_c"
    ])

# Slicing time series

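Slicing is helpful for time series data, especially when filtering by a date range. Set the date column as the index and use `.loc[]` for subsetting. Ensure your dates are in ISO 8601 format: "yyyy-mm-dd" for full dates, "yyyy-mm" for months, and "yyyy" for years. Partial dates such as "yyyy-mm" or "yyyy" only behave as date ranges when the index holds datetimes; if the dates are plain strings, as they are when a CSV is read without date parsing, the comparison is alphabetical, so use full "yyyy-mm-dd" dates.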
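
As a minimal sketch of partial-date slicing, the example below builds a small made-up DataFrame (hypothetical values) and converts the `date` column with `pd.to_datetime()` before setting it as the index. The conversion step is an assumption added here for illustration; the cells in this notebook keep the dates as strings and therefore use full dates.

```python
import pandas as pd

# Toy data with hypothetical values
df = pd.DataFrame({
    "date": ["2009-12-01", "2010-01-01", "2010-08-01",
             "2011-02-01", "2011-12-01", "2012-01-01"],
    "avg_temp_c": [20.1, 19.5, 27.3, 18.9, 20.4, 19.8],
})

# Convert strings to datetimes so the index becomes a DatetimeIndex
df["date"] = pd.to_datetime(df["date"])
ts = df.set_index("date").sort_index()

# Year-level slice: everything from the start of 2010 to the end of 2011
years = ts.loc["2010":"2011"]

# Month-level slice: August 2010 through February 2011
months = ts.loc["2010-08":"2011-02"]

print(years)
print(months)
```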

Todo:

- Use Boolean conditions, not `.isin()` or `.loc[]`, and the full date "yyyy-mm-dd", to subset `temperatures` for rows where the `date` column is in 2010 and 2011 and print the results.
- Set the index of `temperatures` to the `date` column and sort it.
- Use `.loc[]` to subset `temperatures_ind` for rows in 2010 and 2011.
- Use `.loc[]` to subset `temperatures_ind` for rows from August 2010 to February 2011.

Use Boolean conditions to subset temperatures for rows in 2010 and 2011

In [None]:
temperatures_bool = temperatures[
    (temperatures["date"] >= "2010-01-01") & 
    (temperatures["date"] <= "2011-12-31") 
    ]
print(temperatures_bool)

Set date as the index and sort the index.

In [None]:
temperatures_ind = temperatures.set_index("date").sort_index()
print(temperatures_ind)

Use .loc[] to subset temperatures_ind for rows in 2010 and 2011.

In [None]:
print(temperatures_ind.loc["2010-01-01":"2011-12-31"])

Use .loc[] to subset temperatures_ind for rows from Aug 2010 to Feb 2011.

In [None]:
print(temperatures_ind.loc["2010-08-01":"2011-02-28"])

# Subsetting by row/column number

Subsetting by row/column number is another way to filter data. Instead of using index labels or conditions, you can use row numbers with `.iloc[]`. Like `.loc[]`, `.iloc[]` accepts two arguments to subset both rows and columns.
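
One detail that often causes off-by-one mistakes: `.iloc[]` slices follow Python's usual rule and exclude the end position, while `.loc[]` label slices include the end label. The sketch below illustrates this on a small made-up DataFrame (hypothetical values and labels).

```python
import pandas as pd

# Toy data with hypothetical values and a labelled index
df = pd.DataFrame(
    {"a": [10, 20, 30, 40], "b": [1.5, 2.5, 3.5, 4.5], "c": list("pqrs")},
    index=["w", "x", "y", "z"],
)

# Positions 0 and 1 for rows and columns; position 2 is excluded
by_position = df.iloc[0:2, 0:2]

# Labels "w" through "y" and "a" through "b"; the end labels are included
by_label = df.loc["w":"y", "a":"b"]

# A single cell: third row, second column
single = df.iloc[2, 1]

print(by_position)
print(by_label)
print(single)
```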

Todo:

- Get the 23rd row, 2nd column (index positions 22 and 1).
- Get the first 5 rows (index positions 0 to 4, written as the slice `:5`).
- Get all rows, columns 3 and 4 (index positions 2 and 3, written as the slice `2:4`).
- Get the first 5 rows, columns 3 and 4.

Get the 23rd row, 2nd column (index positions 22 and 1).

In [73]:
print(temperatures.iloc[22, 1])

2001-11-01


Use slicing to get the first 5 rows.

In [74]:
print(temperatures.iloc[:5, :])

   Unnamed: 0        date     city        country  avg_temp_c
0           0  2000-01-01  Abidjan  Côte D'Ivoire      27.293
1           1  2000-02-01  Abidjan  Côte D'Ivoire      27.685
2           2  2000-03-01  Abidjan  Côte D'Ivoire      29.061
3           3  2000-04-01  Abidjan  Côte D'Ivoire      28.162
4           4  2000-05-01  Abidjan  Côte D'Ivoire      27.547


Use slicing to get all rows of columns 3 and 4.

In [75]:
print(temperatures.iloc[:, 2:4])

          city        country
0      Abidjan  Côte D'Ivoire
1      Abidjan  Côte D'Ivoire
2      Abidjan  Côte D'Ivoire
3      Abidjan  Côte D'Ivoire
4      Abidjan  Côte D'Ivoire
...        ...            ...
16495     Xian          China
16496     Xian          China
16497     Xian          China
16498     Xian          China
16499     Xian          China

[16500 rows x 2 columns]


Use slicing in both directions at once.

In [76]:
print(temperatures.iloc[:5, 2:4])

      city        country
0  Abidjan  Côte D'Ivoire
1  Abidjan  Côte D'Ivoire
2  Abidjan  Côte D'Ivoire
3  Abidjan  Côte D'Ivoire
4  Abidjan  Côte D'Ivoire
