# Working with Multi-Index Dataframes

In [3]:
import pandas as pd

file_name = "data/nyc-parking-violations-2020.csv"

voilations_df = pd.read_csv(file_name, usecols = ["Date First Observed", "Registration State", "Plate ID", "Issue Date", "Vehicle Make", "Street Name", "Vehicle Color"])

# Preview the Data
voilations_df.head()

,Plate ID,Registration State,Issue Date,Vehicle Make,Street Name,Date First Observed,Vehicle Color
0,J58JKX,NJ,05/08/1972 12:00:00 AM,HONDA,43 ST,0,BK
1,KRE6058,PA,08/29/1977 12:00:00 AM,ME/BE,UNION ST,0,BLK
2,444326R,NJ,10/03/1988 12:00:00 AM,LEXUS,CLERMONT AVENUE,0,BLACK
3,F728330,OH,01/03/1990 12:00:00 AM,CHEVR,DIVISION AVE,0,
4,FMY9090,NY,02/14/1990 12:00:00 AM,JEEP,GRAND ST,0,GREY


> Once the data frame was loaded, we were going to perform several queries based on the parking
tickets' issue date. As a result, it made sense to set the index to the Issue Date column:

In [4]:
voilations_df = voilations_df.set_index("Issue Date")

voilations_df.head()

Issue Date,Plate ID,Registration State,Vehicle Make,Street Name,Date First Observed,Vehicle Color
05/08/1972 12:00:00 AM,J58JKX,NJ,HONDA,43 ST,0,BK
08/29/1977 12:00:00 AM,KRE6058,PA,ME/BE,UNION ST,0,BLK
10/03/1988 12:00:00 AM,444326R,NJ,LEXUS,CLERMONT AVENUE,0,BLACK
01/03/1990 12:00:00 AM,F728330,OH,CHEVR,DIVISION AVE,0,
02/14/1990 12:00:00 AM,FMY9090,NY,JEEP,GRAND ST,0,GREY


Notice that `set_index` returns a new data frame, based on the original one, which we assign
back to `voilations_df`. From this point on, any query that involves the index (typically using `loc`)
is based on the value of Issue Date. Also, as far as the data frame is concerned, there is no
longer an Issue Date column! Its identity as a named column is gone, at least for now.
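
A minimal sketch of this behavior on a tiny made-up frame. The `tickets` name and its values are illustrative assumptions, not rows from the parking data:

```python
import pandas as pd

# Hypothetical miniature stand-in for the parking-violations data
tickets = pd.DataFrame({
    "Issue Date": ["01/02/2020", "01/02/2020", "01/03/2020"],
    "Vehicle Make": ["HONDA", "FORD", "HONDA"],
})

# set_index returns a new frame; the original keeps its column
indexed = tickets.set_index("Issue Date")
print("Issue Date" in tickets.columns)   # True
print("Issue Date" in indexed.columns)   # False: it is now the index
print(indexed.index.name)                # Issue Date

# Label-based lookups with loc now match against the issue date
same_day = indexed.loc["01/02/2020", "Vehicle Make"]
print(list(same_day))                    # ['HONDA', 'FORD']

# reset_index turns the index back into an ordinary column
restored = indexed.reset_index()
print("Issue Date" in restored.columns)  # True
```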

In [9]:
# Query the index

voilations_df.loc["01/02/2020 12:00:00 AM", ["Vehicle Make"]].value_counts().head(3)

Vehicle Make
TOYOT           3829
HONDA           3593
FORD            3164
dtype: int64

In [10]:
# Reset the index, then use Vehicle Color as the new index

voilations_df = voilations_df.reset_index().set_index("Vehicle Color")

In [13]:
voilations_df.loc[["BLUE", "RED"], ["Vehicle Make"]].value_counts().head(3)

Vehicle Make
HONDA           39353
FORD            30990
TOYOT           30925
dtype: int64

## Multi-Index DF

In [15]:
# Building a DF for Multi-Indexing

import numpy as np

np.random.seed(0)

df = pd.DataFrame(np.random.randint(0, 100, [36,3]), columns = ["A", "B", "C"])

df["Year"] = ([2018] * 12) + ([2019] * 12) + ([2020] * 12)

df["Month"] = "Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec".split() * 3

df.head()

,A,B,C,Year,Month
0,44,47,64,2018,Jan
1,67,67,9,2018,Feb
2,83,21,36,2018,Mar
3,87,70,88,2018,Apr
4,88,12,58,2018,May


In [16]:
# Create a Multi-Index

df = df.set_index(["Year", "Month"])

df.head()

Year,Month,A,B,C
2018,Jan,44,47,64
2018,Feb,67,67,9
2018,Mar,83,21,36
2018,Apr,87,70,88
2018,May,88,12,58


Remember that when you’re creating a multi-index, you want the most
general part to be on the outside, and thus be mentioned first. If you were to
create a multi-index with dates, you would do it using year, month, and day, in
that order. If you were to create a multi-index for your company’s sales data,
you might use region, country, department, and product, in that order.
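
One way to see why the general-to-specific order helps is the rough sketch below. The `sales` frame, its level names, and its numbers are made up for illustration. Selecting on outer levels needs only a label or a short tuple, while selecting on the innermost level alone needs a placeholder for every level before it:

```python
import pandas as pd

# Hypothetical sales figures; names and numbers are made up
sales = pd.DataFrame({
    "region": ["EMEA", "EMEA", "EMEA", "APAC"],
    "country": ["FR", "FR", "DE", "JP"],
    "product": ["A", "B", "A", "A"],
    "revenue": [10, 20, 30, 40],
})

# Most general level first, most specific last
sales = sales.set_index(["region", "country", "product"]).sort_index()

# Outer level only: a single label is enough
emea_total = sales.loc["EMEA", "revenue"].sum()

# Outer two levels: a tuple that walks inward
france_total = sales.loc[("EMEA", "FR"), "revenue"].sum()

# Innermost level alone: each outer level needs a slice(None) placeholder
product_a_total = sales.loc[(slice(None), slice(None), "A"), "revenue"].sum()

print(emea_total, france_total, product_a_total)  # 60 30 80
```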

In [19]:
# Query the Multi-Index for 2019 Sales Data for Products A and C

df.loc[2019, ["A", "C"]]

Month,A,C
Jan,29,19
Feb,14,32
Mar,65,57
Apr,32,74
May,23,75
Jun,55,34
Jul,0,36
Aug,53,38
Sep,17,4
Oct,42,31


In [22]:
# Query the Multi-Index for Dec 2019 Sales Data for Products A and C

df.loc[(2019, "Dec"), ["A", "C"]]

A    57
C    11
Name: (2019, Dec), dtype: int32

In [25]:
# More than 1 year at a time

df.loc[[2018, 2019], ["A", "C"]]

Year,Month,A,C
2018,Jan,44,64
2018,Feb,67,9
2018,Mar,83,36
2018,Apr,87,88
2018,May,88,58
2018,Jun,65,87
2018,Jul,46,81
2018,Aug,37,77
2018,Sep,72,20
2018,Oct,80,79


In [26]:
# Query Multiple Index over Years and Months for specific columns

df.loc[([2019, 2020], ["Jul", "Aug", "Sep", "Oct"]), ["A", "C"]]

Year,Month,A,C
2019,Jul,0,36
2019,Aug,53,38
2019,Sep,17,4
2019,Oct,42,31
2020,Jul,78,20
2020,Aug,99,23
2020,Sep,79,85
2020,Oct,48,69


In [27]:
# Query Multiple Index over Years and Months for all columns

df.loc[([2019, 2020], ["Jul", "Aug", "Sep", "Oct"]), : ]

Year,Month,A,B,C
2019,Jul,0,0,36
2019,Aug,53,5,38
2019,Sep,17,79,4
2019,Oct,42,58,31
2020,Jul,78,15,20
2020,Aug,99,58,23
2020,Sep,79,13,85
2020,Oct,48,49,69


In [28]:
data_file = "data/sat-scores.csv"

df = pd.read_csv(data_file, usecols = ["Year", "State.Code", "Total.Math", "Total.Test-takers", "Total.Verbal"], index_col = ["Year", "State.Code"])

df.head()

Year,State.Code,Total.Math,Total.Test-takers,Total.Verbal
2005,AL,559,3985,567
2005,AK,519,3996,523
2005,AZ,530,18184,526
2005,AR,552,1600,563
2005,CA,522,186552,504


In [30]:
# I wanted to find out the mean math score for students in four states—New York, New
# Jersey, Massachusetts, and Illinois, in the year 2010.

df.loc[(2010, ["NY", "NJ", "MA", "IL"]), "Total.Math"].mean()

535.25

The next question asks for a similar calculation, but on several years, as well as several states.
Once again, that’s not an issue, if we think carefully about how to construct the query:

- From the first part (Year) of the multi-index, we want 2012, 2013, 2014, and 2015.
- From the second part (State.Code) of the multi-index, we want AZ, CA, and TX.
- From the columns, we are again interested in Total.Math
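
The three-part plan above can be checked by hand on a small frame. This is only a sketch: the `scores` frame and its values are made up, and it uses just two years and three states so the arithmetic is easy to follow:

```python
import pandas as pd

# Made-up scores, indexed the same way as the SAT data (Year, State.Code)
scores = pd.DataFrame(
    {"Total.Math": [500, 510, 520, 530, 540, 550]},
    index=pd.MultiIndex.from_product(
        [[2012, 2013], ["AZ", "CA", "TX"]], names=["Year", "State.Code"]
    ),
)

# The tuple holds one entry per index level; a list means "any of these"
subset = scores.loc[([2012, 2013], ["AZ", "TX"]), "Total.Math"]

print(subset.to_list())  # [500, 520, 530, 550]
print(subset.mean())     # 525.0
```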

In [32]:
df.loc[([2012, 2013, 2014, 2015], ["AZ", "CA", "TX"]), "Total.Math"].mean()

511.5833333333333

In this exercise, we’re going to build a deep multi-index, allowing us to retrieve data from a
variety of levels and in a number of ways. Specifically, I want you to work through the following:

- Read the data file (olympic_athlete_events.csv) into a data frame. We only care
about some of the columns: Age, Height, Team, Year, Season, City, Sport, Event, and
Medal. The multi-index should be based on Year, Season, Sport, and Event.
- What is the average age for winning athletes in summer games held between 1936 and
2000?
- What team has won the greatest number of medals for all archery events?
- Starting in 1980, what is the average height of athletes in the event known as "Table Tennis
Women’s Team"?
- Starting in 1980, what is the average height of athletes in either "Table Tennis Women’s Team" or
"Table Tennis Men’s Team"?
- How tall was the tallest-ever tennis player in Olympic games from 1980 until 2020?

In [42]:
data_file = "data/olympic_athlete_events.csv"

df = pd.read_csv(data_file, usecols = ["Age", "Height", "Team", "Year", "Season", "City", "Sport", "Event", "Medal"], index_col = ["Year", "Season", "Sport", "Event"])

# To perform slicing on a multi-index, sort the index first with sort_index
df = df.sort_index()

df.head()

Year,Season,Sport,Event,Age,Height,Team,City,Medal
1896,Summer,Athletics,"Athletics Men's 1,500 metres",24.0,,United States,Athina,Silver
1896,Summer,Athletics,"Athletics Men's 1,500 metres",,,Greece,Athina,
1896,Summer,Athletics,"Athletics Men's 1,500 metres",22.0,,Australia,Athina,Gold
1896,Summer,Athletics,"Athletics Men's 1,500 metres",23.0,154.0,Germany,Athina,
1896,Summer,Athletics,"Athletics Men's 1,500 metres",21.0,,Greece,Athina,
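
The `sort_index` call in the cell above matters because pandas refuses to slice a multi-index that is not sorted. The sketch below uses a made-up `games` frame with rows deliberately out of order. It catches `KeyError` on the assumption that the error pandas raises here, `UnsortedIndexError`, is a subclass of it:

```python
import pandas as pd

# Hypothetical rows, deliberately out of order
games = pd.DataFrame(
    {"Age": [30.0, 22.0, 26.0]},
    index=pd.MultiIndex.from_tuples(
        [(2000, "Summer"), (1936, "Summer"), (1980, "Winter")],
        names=["Year", "Season"],
    ),
)

try:
    games.loc[(slice(1936, 2000), "Summer"), "Age"]
    slicing_failed = False
except KeyError as exc:
    # Slicing an unsorted multi-index is rejected
    slicing_failed = True
    print(type(exc).__name__)

# After sorting, the same slice works
games = games.sort_index()
summer_ages = games.loc[(slice(1936, 2000), "Summer"), "Age"]
print(summer_ages.to_list())  # [22.0, 30.0]
```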


In [43]:
# What is the average age for winning athletes in summer games held between 1936 and 2000?
# Note: this query averages Age over every row in that range; it does not filter on Medal.

df.loc[(slice(1936, 2000), "Summer"), "Age"].mean()

25.026883940421765
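
The query above averages every athlete in those games. If "winning" is taken to mean rows with a non-missing Medal (an assumption, since the notebook does not spell it out), one possible refinement is to drop rows without a medal before averaging. A sketch on made-up rows:

```python
import numpy as np
import pandas as pd

# Made-up rows with the same index levels and columns as the Olympic data
athletes = pd.DataFrame(
    {
        "Age": [20.0, 36.0, 40.0, 50.0],
        "Medal": ["Gold", np.nan, "Silver", "Bronze"],
    },
    index=pd.MultiIndex.from_tuples(
        [(1936, "Summer"), (1936, "Summer"), (2000, "Summer"), (2004, "Summer")],
        names=["Year", "Season"],
    ),
).sort_index()

# Summer games from 1936 through 2000 (label slices include both ends)
in_range = athletes.loc[(slice(1936, 2000), "Summer"), :]

# All athletes in range, medal or not
all_mean = in_range["Age"].mean()

# Only rows where a medal is recorded
medalist_mean = in_range.dropna(subset=["Medal"])["Age"].mean()

print(all_mean, medalist_mean)  # 32.0 30.0
```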

Next, I asked you to find which team has won the greatest number of medals for all archery
events. How will we construct this query? We need to think through each of the levels in our
multi-index:
- We’re interested in all years, which means that we’ll specify slice(None) for the first
index level
- Archery is only a summer sport, so we can either indicate Summer for the second level or slice(None)
- In the third level, we’ll explicitly specify Archery, so that we only get those rows for
archery events.
- Finally, we’ll ignore the fourth level, effectively making it a wildcard
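
The level-by-level plan above can be sketched on a made-up frame (the `events` name and its rows are illustrative). The sketch also shows `pd.IndexSlice`, a pandas helper that lets you write `:` in place of `slice(None)`. The notebook itself does not use it, so treat it as an optional alternative:

```python
import pandas as pd

# Made-up rows with the same four index levels as the Olympic data
events = pd.DataFrame(
    {"Team": ["South Korea", "South Korea", "Italy", "France"]},
    index=pd.MultiIndex.from_tuples(
        [
            (2000, "Summer", "Archery", "Archery Men's Individual"),
            (2004, "Summer", "Archery", "Archery Women's Team"),
            (2004, "Summer", "Archery", "Archery Men's Team"),
            (2004, "Summer", "Fencing", "Fencing Men's Foil"),
        ],
        names=["Year", "Season", "Sport", "Event"],
    ),
).sort_index()

# slice(None) means "all years"; the trailing Event level is simply omitted
with_slices = events.loc[(slice(None), "Summer", "Archery"), "Team"]

# Same selection with pd.IndexSlice, where ":" stands in for slice(None)
idx = pd.IndexSlice
with_index_slice = events.loc[idx[:, "Summer", "Archery"], "Team"]

top_team = with_slices.value_counts().idxmax()
print(top_team)  # South Korea
```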

In [47]:
# Find which team has won the greatest number of medals for all archery events
# Note: value_counts tallies every row for a team, whether or not Medal is filled in

df.loc[(slice(None), ["Summer"], ["Archery"]), ["Team"]].value_counts().head(1)

Team         
United States    155
dtype: int64

Next, I asked you to find the average height of athletes in one specific event, namely Table
Tennis Women’s Team. Once again, we can consider all of the parts of our multi-index:
- We want to get results from all years, so we specify slice(None) for the first level
- Table tennis is only played in the summer games, so we can either specify Summer or slice(None)
- The sport is "Table Tennis", so we can specify that if we want. But given that all of these
events fall under the same sport, we can also leave it as a wildcard with slice(None).
- Finally, we specify Table Tennis Women’s Team for the event.
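
A small sketch of the point about the Sport level, using made-up heights in a hypothetical `heights` frame. Naming the sport and leaving it as a wildcard select the same rows, because the event name already pins down the sport:

```python
import pandas as pd

# Made-up heights; index levels mirror the Olympic data
heights = pd.DataFrame(
    {"Height": [160.0, 170.0, 180.0, 190.0]},
    index=pd.MultiIndex.from_tuples(
        [
            (2008, "Summer", "Table Tennis", "Table Tennis Women's Team"),
            (2012, "Summer", "Table Tennis", "Table Tennis Women's Team"),
            (2012, "Summer", "Table Tennis", "Table Tennis Men's Team"),
            (2012, "Summer", "Tennis", "Tennis Men's Singles"),
        ],
        names=["Year", "Season", "Sport", "Event"],
    ),
).sort_index()

event = "Table Tennis Women's Team"

# Sport left as a wildcard
via_wildcard = heights.loc[(slice(None), "Summer", slice(None), [event]), "Height"]

# Sport named explicitly
via_sport = heights.loc[(slice(None), "Summer", "Table Tennis", [event]), "Height"]

print(via_wildcard.mean(), via_sport.mean())  # 165.0 165.0
```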

In [54]:
df.loc[(slice(None), ["Summer"], slice(None), ["Table Tennis Women's Team"]), ["Height"]].mean()

Height    165.048276
dtype: float64

In [56]:
df.loc[(slice(None), ["Summer"], slice(None), ["Table Tennis Men's Team","Table Tennis Women's Team"]), ["Height"]].mean()

Height    171.266436
dtype: float64

## Using xs()

In [59]:
df.xs("Table Tennis Women's Team", level = "Event")["Height"].mean()

165.04827586206898

In [61]:
df.xs(("Summer", "Table Tennis Women's Team"), level=["Season", "Event"])

Year,Sport,Age,Height,Team,City,Medal
2008,Table Tennis,26.0,181.0,Croatia,Beijing,
2008,Table Tennis,21.0,164.0,Germany,Beijing,
2008,Table Tennis,36.0,171.0,Croatia,Beijing,
2008,Table Tennis,27.0,158.0,South Korea,Beijing,Bronze
2008,Table Tennis,20.0,169.0,Spain,Beijing,
...,...,...,...,...,...,...
2016,Table Tennis,26.0,166.0,Singapore,Rio de Janeiro,
2016,Table Tennis,20.0,166.0,United States,Rio de Janeiro,
2016,Table Tennis,25.0,162.0,Australia,Rio de Janeiro,
2016,Table Tennis,28.0,165.0,United States,Rio de Janeiro,


## Pivot Tables

In [64]:
np.random.seed(0)
df = pd.DataFrame(np.random.randint(0, 100, [36, 3]), columns=list('ABC'))
df['year'] = [2018] * 12 + [2019] * 12 + [2020] * 12
df['month'] = 'Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec'.split() * 3

df.head()

,A,B,C,year,month
0,44,47,64,2018,Jan
1,67,67,9,2018,Feb
2,83,21,36,2018,Mar
3,87,70,88,2018,Apr
4,88,12,58,2018,May


In [66]:
df.pivot_table(index = "month", columns = "year", values = "A", sort = False)

year,2018,2019,2020
month,,,
Jan,44,29,46
Feb,67,14,0
Mar,83,65,53
Apr,87,32,84
May,88,23,6
Jun,65,55,3
Jul,46,0,78
Aug,37,53,99
Sep,72,17,79
Oct,80,42,48


In [67]:
df.pivot_table(index='month', columns='year', values='A', sort=False, aggfunc=np.size)

year,2018,2019,2020
month,,,
Jan,1,1,1
Feb,1,1,1
Mar,1,1,1
Apr,1,1,1
May,1,1,1
Jun,1,1,1
Jul,1,1,1
Aug,1,1,1
Sep,1,1,1
Oct,1,1,1


In [76]:
# Olympic Pivots

data_file = "data/olympic_athlete_events.csv"

df = pd.read_csv(data_file, usecols = ["Age", "Height", "Team", "Year", "Season", "City", "Sport", "Event", "Medal"])

df.head()

,Age,Height,Team,Year,Season,City,Sport,Event,Medal
0,24.0,180.0,China,1992,Summer,Barcelona,Basketball,Basketball Men's Basketball,
1,23.0,170.0,China,2012,Summer,London,Judo,Judo Men's Extra-Lightweight,
2,24.0,,Denmark,1920,Summer,Antwerpen,Football,Football Men's Football,
3,34.0,,Denmark/Sweden,1900,Summer,Paris,Tug-Of-War,Tug-Of-War Men's Tug-Of-War,Gold
4,21.0,185.0,Netherlands,1988,Winter,Calgary,Speed Skating,Speed Skating Women's 500 metres,


In [77]:
# Select Specified Countries
df = df.loc[df["Team"].isin(['Great Britain', 'France', 'United States', 'Switzerland', 'China', 'India'])]

# Remove rows before 1980
df = df.loc[(df["Year"] >= 1980)]

# Preview the filtered data
df.head()

,Age,Height,Team,Year,Season,City,Sport,Event,Medal
0,24.0,180.0,China,1992,Summer,Barcelona,Basketball,Basketball Men's Basketball,
1,23.0,170.0,China,2012,Summer,London,Judo,Judo Men's Extra-Lightweight,
10,31.0,188.0,United States,1992,Winter,Albertville,Cross Country Skiing,Cross Country Skiing Men's 10 kilometres,
11,31.0,188.0,United States,1992,Winter,Albertville,Cross Country Skiing,Cross Country Skiing Men's 50 kilometres,
12,31.0,188.0,United States,1992,Winter,Albertville,Cross Country Skiing,Cross Country Skiing Men's 10/15 kilometres Pu...,


In [78]:
pd.pivot_table(df, index="Year", columns="Team", values="Age")

Team,China,France,Great Britain,India,Switzerland,United States
Year,,,,,,
1980,21.868421,23.52459,22.882507,25.506667,24.557823,22.770992
1984,22.076336,24.36983,24.445423,24.90566,23.589744,24.437118
1988,22.358447,24.520076,25.43956,24.0,26.218868,24.904977
1992,21.955752,25.140187,25.584055,24.184615,25.413194,25.474866
1994,20.627907,24.601307,25.282051,,25.5,24.976744
1996,22.021531,25.296629,26.746032,24.62963,27.122093,26.273277
1998,21.784091,25.462069,27.243902,16.0,25.641509,25.146154
2000,22.515306,25.982833,26.406948,25.4,27.376812,26.576203
2002,23.127451,25.737805,26.833333,20.0,26.23871,25.726316
2004,23.006122,26.139073,26.303977,24.728395,27.343284,26.439093


Next, I asked you to find how many medals each country received at each of the games. Once
again, let’s do a bit of planning before creating our pivot table:
- The rows (index) will be the unique values from the Year column
- The columns will be the unique values from the Team column
- The values themselves will come from the Medal column. However, we’re interested in
counting the medals, not in getting their average values (as if that’s even possible). This
means that we’ll need to provide a function argument to the aggfunc parameter, namely
np.size.
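
One caveat worth checking on a small example: `np.size` reports how many rows fall into each cell, including rows where Medal is missing, whereas the string aggregation `"count"` tallies only non-missing values. The sketch below uses a made-up `results` frame. Whether row counts or medal counts are wanted depends on how the question is read:

```python
import numpy as np
import pandas as pd

# Made-up results; a missing Medal means the athlete did not place
results = pd.DataFrame({
    "Year":  [2000, 2000, 2000, 2004, 2004, 2004],
    "Team":  ["A", "A", "B", "A", "B", "B"],
    "Medal": ["Gold", np.nan, np.nan, "Silver", "Bronze", np.nan],
})

# np.size: number of rows in each group, missing medals included
rows = pd.pivot_table(results, index="Year", columns="Team",
                      values="Medal", aggfunc=np.size)

# "count": only the non-missing values in each group
medals = pd.pivot_table(results, index="Year", columns="Team",
                        values="Medal", aggfunc="count")

print(rows)
print(medals)
```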

In [79]:
pd.pivot_table(df, index="Year", columns="Team", values="Medal", aggfunc=np.size)

Team,China,France,Great Britain,India,Switzerland,United States
Year,,,,,,
1980,38.0,244.0,384.0,78.0,147.0,131.0
1984,393.0,411.0,569.0,53.0,312.0,821.0
1988,438.0,524.0,547.0,58.0,265.0,886.0
1992,452.0,642.0,578.0,65.0,288.0,936.0
1994,43.0,153.0,39.0,,94.0,215.0
1996,418.0,445.0,379.0,54.0,172.0,827.0
1998,88.0,145.0,41.0,1.0,106.0,260.0
2000,392.0,466.0,403.0,70.0,138.0,748.0
2002,102.0,164.0,54.0,1.0,155.0,285.0
2004,490.0,453.0,352.0,81.0,134.0,706.0


Finally, I wanted to find the tallest players in each sport from each year. Given that we are
looking at a large number of sports, and a relatively small number of years, I thought that it
would be wise to use the years in the columns this time around:
- The rows (index) will be the unique values from the Sport column
- The columns will be the unique values from the Year column
- The values themselves will come from the Height column. We’re interested in the
highest value, and will thus provide a function argument to the aggfunc parameter,
namely np.max.
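
The plan above, sketched on a made-up `players` frame. The sketch passes the string `"max"`, which is assumed here to be interchangeable with `np.max` for this purpose. Cells with no matching rows come out as NaN, as in the real output:

```python
import pandas as pd

# Made-up heights in centimetres
players = pd.DataFrame({
    "Sport":  ["Archery", "Archery", "Archery", "Boxing"],
    "Year":   [1980, 1980, 1984, 1984],
    "Height": [180.0, 185.0, 188.0, 190.0],
})

# Sports as rows, years as columns, tallest height in each cell
tallest = pd.pivot_table(players, index="Sport", columns="Year",
                         values="Height", aggfunc="max")

print(tallest)
```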

In [80]:
pd.pivot_table(df, index="Sport", columns="Year", values="Height", aggfunc=np.max)

Year,1980,1984,1988,1992,1994,1996,1998,2000,2002,2004,2006,2008,2010,2012,2014,2016
Sport,,,,,,,,,,,,,,,,
Alpine Skiing,184.0,184.0,185.0,185.0,188.0,,188.0,,189.0,,193.0,,193.0,,200.0,
Archery,185.0,188.0,188.0,191.0,,191.0,,191.0,,193.0,,193.0,,193.0,,188.0
Athletics,197.0,203.0,203.0,200.0,,198.0,,197.0,,203.0,,203.0,,208.0,,203.0
Badminton,,,,186.0,,189.0,,187.0,,190.0,,190.0,,191.0,,191.0
Baseball,,,,198.0,,195.0,,206.0,,,,198.0,,,,
Basketball,196.0,216.0,216.0,216.0,,216.0,,226.0,,226.0,,226.0,,221.0,,218.0
Beach Volleyball,,,,,,193.0,,195.0,,192.0,,202.0,,202.0,,188.0
Biathlon,190.0,190.0,188.0,192.0,192.0,,192.0,,192.0,,193.0,,193.0,,193.0,
Bobsleigh,,184.0,,,,,198.0,,190.0,,193.0,,191.0,,189.0,
Boxing,190.0,195.0,196.0,193.0,,191.0,,198.0,,190.0,,203.0,,201.0,,200.0


