# Hierarchical indexing (MultiIndex) with GroupBy

In this section we look at the result produced by a GroupBy operation, and at the concept and handling of its MultiIndex (a.k.a. hierarchical index).

In [1]:
import pandas as pd

In [2]:
titanic = pd.read_csv("../data/titanic_ver01.csv", usecols = ["Survived", "Pclass", "Sex", "Age", "Fare"])

In [3]:
titanic

,Survived,Pclass,Sex,Age,Fare
0,0,3,male,22.0,7.2500
1,1,1,female,38.0,71.2833
2,1,3,female,26.0,7.9250
3,1,1,female,35.0,53.1000
4,0,3,male,35.0,8.0500
5,0,3,male,,8.4583
6,0,1,male,54.0,51.8625
7,0,3,male,2.0,21.0750
8,1,3,female,27.0,11.1333
9,1,2,female,14.0,30.0708


In [28]:
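# summary = titanic.groupby(["Sex", "Pclass"]).mean()
# Grouping by Sex and Pclass yields six groups,
# then the mean of Survived, Age and Fare is computed for each group.
#
# The form below is equivalent, and agg() also accepts a list so that more statistics can be included.
summary = titanic.groupby(["Sex", "Pclass"]).agg("mean")
# summary = titanic.groupby(["Sex", "Pclass"]).agg(["mean", "max"])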

In [29]:
summary

Sex,Pclass,Survived,Age,Fare
female,1,0.968085,34.611765,106.125798
female,2,0.921053,28.722973,21.970121
female,3,0.5,21.75,16.11881
male,1,0.368852,41.281386,67.226127
male,2,0.157407,30.740707,19.741782
male,3,0.135447,26.507589,12.661633


The result above shows statistics computed within each combination of Sex and Pclass,  
so we can think of the (Sex, Pclass) combination as the index.

The video calls Sex the outer index level and Pclass the inner index level.
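As a minimal sketch of how two grouping keys become two index levels, the snippet below repeats the grouping on a small made-up frame. The `toy` data and its values are assumptions for illustration only, not rows from the Titanic file.

```python
import pandas as pd

# Hypothetical toy data, not the Titanic file used in this notebook.
toy = pd.DataFrame({
    "Sex": ["female", "female", "male", "male", "male"],
    "Pclass": [1, 3, 1, 3, 3],
    "Fare": [80.0, 8.0, 50.0, 7.0, 9.0],
})

# Two grouping keys give a two-level MultiIndex on the rows.
toy_summary = toy.groupby(["Sex", "Pclass"]).agg("mean")
print(toy_summary.index.nlevels)      # number of index levels
print(list(toy_summary.index.names))  # outer level first, then inner

# Passing a list of statistics adds a second level on the columns too.
toy_stats = toy.groupby(["Sex", "Pclass"]).agg(["mean", "max"])
print(toy_stats.columns.tolist())
```

The order of the keys passed to `groupby` decides which level is outer and which is inner.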

In [30]:
summary.index

MultiIndex(levels=[['female', 'male'], [1, 2, 3]],
           codes=[[0, 0, 0, 1, 1, 1], [0, 1, 2, 0, 1, 2]],
           names=['Sex', 'Pclass'])
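The `levels` and `codes` in this output can be read as a lookup table: each code is a position into the matching list of levels. A small sketch, assuming an index rebuilt by hand with `pd.MultiIndex.from_product` rather than taken from `summary`:

```python
import pandas as pd

# Rebuild an index with the same shape as the one printed above.
idx = pd.MultiIndex.from_product(
    [["female", "male"], [1, 2, 3]], names=["Sex", "Pclass"]
)

# Each code is a position into the corresponding entry of `levels`.
labels = [
    (idx.levels[0][sex_code], idx.levels[1][class_code])
    for sex_code, class_code in zip(idx.codes[0], idx.codes[1])
]
print(labels)  # the six (Sex, Pclass) pairs, in row order
```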

Now that we know what a MultiIndex looks like and how it is laid out,  
we can select any row we like.

In [31]:
summary.loc[("female", 2), :]
# Same effect:
# summary.loc[("female", 2), ] 

Survived     0.921053
Age         28.722973
Fare        21.970121
Name: (female, 2), dtype: float64

In [32]:
summary.loc[("female", 2), "Age"]

28.722972972972972
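Besides a full `(outer, inner)` tuple, a MultiIndex can also be selected by one level only. The sketch below uses a hand-built frame (`toy`, with made-up ages) to contrast the three cases; `DataFrame.xs` is one common way to select on the inner level.

```python
import pandas as pd

# Hypothetical frame with the same index layout as `summary`.
idx = pd.MultiIndex.from_product(
    [["female", "male"], [1, 2, 3]], names=["Sex", "Pclass"]
)
toy = pd.DataFrame({"Age": [34.0, 28.0, 21.0, 41.0, 30.0, 26.0]}, index=idx)

one_row = toy.loc[("female", 2), :]   # full key: a Series for that row
outer = toy.loc["female"]             # outer level only: rows indexed by Pclass
inner = toy.xs(2, level="Pclass")     # inner level only: rows indexed by Sex

print(one_row)
print(outer)
print(inner)
```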

---
- swap 
    - (vt.) to exchange; to trade ... for; to exchange with [(+for/with)]
    - (vi.) to exchange, to trade
    - (n.) an exchange; something exchanged

---
To swap the order of the MultiIndex levels,  
use `DataFrame.swaplevel()`.

In [38]:
# The brute-force way: regroup from scratch. This is also the result we want.
titanic.groupby(["Pclass", "Sex"]).agg("mean")

Pclass,Sex,Survived,Age,Fare
1,female,0.968085,34.611765,106.125798
1,male,0.368852,41.281386,67.226127
2,female,0.921053,28.722973,21.970121
2,male,0.157407,30.740707,19.741782
3,female,0.5,21.75,16.11881
3,male,0.135447,26.507589,12.661633


In [39]:
summary.swaplevel()

Pclass,Sex,Survived,Age,Fare
1,female,0.968085,34.611765,106.125798
2,female,0.921053,28.722973,21.970121
3,female,0.5,21.75,16.11881
1,male,0.368852,41.281386,67.226127
2,male,0.157407,30.740707,19.741782
3,male,0.135447,26.507589,12.661633


In [41]:
type(summary.swaplevel())
# Still a DataFrame!

pandas.core.frame.DataFrame

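Swapping directly like this does not give what we pictured.  
We expected Pclass = 1 to be split into female and male first, then Pclass = 2, and so on,  
but `swaplevel()` only exchanges the two levels; the rows stay in their original order.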

---
So one more step is needed: `sort_index()`.  
It sorts by the outermost (leftmost) level of the MultiIndex first, then works inward level by level.

In [43]:
summary.swaplevel().sort_index()

Pclass,Sex,Survived,Age,Fare
1,female,0.968085,34.611765,106.125798
1,male,0.368852,41.281386,67.226127
2,female,0.921053,28.722973,21.970121
2,male,0.157407,30.740707,19.741782
3,female,0.5,21.75,16.11881
3,male,0.135447,26.507589,12.661633
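The difference between the two steps can be checked on a small hand-built frame. This is only a sketch with made-up ages (`toy` is hypothetical): `swaplevel()` exchanges the levels without moving any rows, and `sort_index()` then reorders the rows.

```python
import pandas as pd

# Hypothetical frame with the same index layout as `summary`.
idx = pd.MultiIndex.from_product(
    [["female", "male"], [1, 2, 3]], names=["Sex", "Pclass"]
)
toy = pd.DataFrame({"Age": [34.0, 28.0, 21.0, 41.0, 30.0, 26.0]}, index=idx)

swapped = toy.swaplevel()        # levels exchanged, row order untouched
ordered = swapped.sort_index()   # rows sorted by Pclass, then by Sex

print(swapped.index.tolist())
print(ordered.index.tolist())
print(swapped.index.is_monotonic_increasing, ordered.index.is_monotonic_increasing)
```

Both calls return new objects, so `toy` itself keeps its original (Sex, Pclass) index.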


In [44]:
type(summary.swaplevel().sort_index())

pandas.core.frame.DataFrame

If we are happy with the regrouped DataFrame,  
we can reset its index with `reset_index()`, which turns the index levels back into ordinary columns.

In [47]:
summary.reset_index()

,Sex,Pclass,Survived,Age,Fare
0,female,1,0.968085,34.611765,106.125798
1,female,2,0.921053,28.722973,21.970121
2,female,3,0.5,21.75,16.11881
3,male,1,0.368852,41.281386,67.226127
4,male,2,0.157407,30.740707,19.741782
5,male,3,0.135447,26.507589,12.661633
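`reset_index()` can be undone with `set_index()`. A minimal round-trip sketch on a hand-built frame (the `toy` data is an assumption for illustration):

```python
import pandas as pd

# Hypothetical frame with the same index layout as `summary`.
idx = pd.MultiIndex.from_product(
    [["female", "male"], [1, 2, 3]], names=["Sex", "Pclass"]
)
toy = pd.DataFrame({"Age": [34.0, 28.0, 21.0, 41.0, 30.0, 26.0]}, index=idx)

flat = toy.reset_index()                      # Sex and Pclass become columns
restored = flat.set_index(["Sex", "Pclass"])  # and back to a MultiIndex

print(flat.columns.tolist())
print(restored.index.names)
```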


---
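These methods return a new object rather than changing `summary`, so to keep or export the rearranged DataFrame later, just assign it to a variable.  
Tidied data is often easier to inspect,  
so this is a rather important technique.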