# Generalizing split-apply-combine with apply()

---
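- After a `groupby`, we can use the built-in aggregation functions such as `mean`, `max`, and `min`.
- But what if we want some other statistic, one that requires writing our own function? How do we fit that into the split-apply-combine workflow?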

So we need to write a user-defined function and apply it to the groupby object, which is what the `apply()` method is for.

In [1]:
import pandas as pd

In [2]:
titanic = pd.read_csv("../data/titanic_ver01.csv", usecols = ["Survived", "Pclass", "Sex", "Age", "Fare"])

In [3]:
titanic.head()

| | Survived | Pclass | Sex | Age | Fare |
|---|---|---|---|---|---|
| 0 | 0 | 3 | male | 22.0 | 7.25 |
| 1 | 1 | 1 | female | 38.0 | 71.2833 |
| 2 | 1 | 3 | female | 26.0 | 7.925 |
| 3 | 1 | 1 | female | 35.0 | 53.1 |
| 4 | 0 | 3 | male | 35.0 | 8.05 |


In [4]:
titanic.groupby("Sex").mean()

| Sex | Survived | Pclass | Age | Fare |
|---|---|---|---|---|
| female | 0.742038 | 2.159236 | 27.915709 | 44.479818 |
| male | 0.188908 | 2.389948 | 30.726645 | 25.523893 |


In [5]:
list(titanic.groupby("Sex"))

[('female',      Survived  Pclass     Sex   Age      Fare
  1           1       1  female  38.0   71.2833
  2           1       3  female  26.0    7.9250
  3           1       1  female  35.0   53.1000
  8           1       3  female  27.0   11.1333
  9           1       2  female  14.0   30.0708
  10          1       3  female   4.0   16.7000
  11          1       1  female  58.0   26.5500
  14          0       3  female  14.0    7.8542
  15          1       2  female  55.0   16.0000
  18          0       3  female  31.0   18.0000
  19          1       3  female   NaN    7.2250
  22          1       3  female  15.0    8.0292
  24          0       3  female   8.0   21.0750
  25          1       3  female  38.0   31.3875
  28          1       3  female   NaN    7.8792
  31          1       1  female   NaN  146.5208
  32          1       3  female   NaN    7.7500
  38          0       3  female  18.0   18.0000
  39          1       3  female  14.0   11.2417
  40          0       3  femal

In [6]:
# type(list(titanic.groupby("Sex")))
# list
# 
# list(titanic.groupby("Sex"))[0][0]
# list(titanic.groupby("Sex"))[0][1]
female_group = list(titanic.groupby("Sex"))[0][1]
female_group

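| | Survived | Pclass | Sex | Age | Fare |
|---|---|---|---|---|---|
| 1 | 1 | 1 | female | 38.0 | 71.2833 |
| 2 | 1 | 3 | female | 26.0 | 7.9250 |
| 3 | 1 | 1 | female | 35.0 | 53.1000 |
| 8 | 1 | 3 | female | 27.0 | 11.1333 |
| 9 | 1 | 2 | female | 14.0 | 30.0708 |
| 10 | 1 | 3 | female | 4.0 | 16.7000 |
| 11 | 1 | 1 | female | 58.0 | 26.5500 |
| 14 | 0 | 3 | female | 14.0 | 7.8542 |
| 15 | 1 | 2 | female | 55.0 | 16.0000 |
| 18 | 0 | 3 | female | 31.0 | 18.0000 |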
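
Pulling a group out with `list(...)[0][1]` relies on knowing where that group sits in the list. As a minimal sketch, assuming a small made-up DataFrame (the `toy` data below is hypothetical, only the column names mirror the Titanic file), `get_group()` fetches the same group by its key instead:

```python
import pandas as pd

# Hypothetical toy data; only the column names mirror the Titanic file.
toy = pd.DataFrame({
    "Sex": ["male", "female", "female", "male"],
    "Age": [22.0, 38.0, 26.0, 35.0],
})

grouped = toy.groupby("Sex")

# Positional lookup, as in the cell above: groups come out sorted by key,
# so position 0 holds the "female" group.
by_position = list(grouped)[0][1]

# Lookup by key: does not depend on the order of the groups.
by_key = grouped.get_group("female")

print(by_key)
```

Both variables hold the rows of the female group; the key-based lookup simply reads more clearly and keeps working if another group sorts ahead of it.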


In [7]:
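# Check that the values above are correct
# female_group.mean().astype("float")
# numeric_only=True skips the string column Sex; newer pandas versions raise a TypeError without it
female_group.mean(numeric_only=True)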

Survived     0.742038
Pclass       2.159236
Age         27.915709
Fare        44.479818
dtype: float64

---
## Custom functions

In [8]:
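def group_mean(group):
    # numeric_only=True skips the string column Sex
    return group.mean(numeric_only=True)
# The function takes one group (a DataFrame holding that group's rows, not the groupby object itself) and returns the result of calling .mean() on it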

In [9]:
group_mean(female_group)

Survived     0.742038
Pclass       2.159236
Age         27.915709
Fare        44.479818
dtype: float64

That was easy!!

---
## Fitting a custom function into the split-apply-combine workflow

In [10]:
titanic.groupby("Sex").apply(group_mean)

| Sex | Survived | Pclass | Age | Fare |
|---|---|---|---|---|
| female | 0.742038 | 2.159236 | 27.915709 | 44.479818 |
| male | 0.188908 | 2.389948 | 30.726645 | 25.523893 |
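
To make the three steps visible, here is a rough sketch of what `apply()` does conceptually, written out by hand. It is an illustration under assumptions, not pandas' actual implementation: the `toy` DataFrame is made up, and the real `apply()` handles many more return types.

```python
import pandas as pd

# Hypothetical toy data standing in for the Titanic columns used above.
toy = pd.DataFrame({
    "Sex": ["male", "female", "female", "male"],
    "Age": [22.0, 38.0, 26.0, 35.0],
    "Fare": [7.25, 71.28, 7.93, 8.05],
})

def group_mean(group):
    # group is a DataFrame holding the rows of one group
    return group.mean(numeric_only=True)

# Split: iterating a groupby yields (key, sub-DataFrame) pairs.
# Apply: call the function on each sub-DataFrame.
results = {key: group_mean(group) for key, group in toy.groupby("Sex")}

# Combine: put the per-group Series side by side, then transpose
# so that each group becomes one row.
manual = pd.DataFrame(results).T
manual.index.name = "Sex"

print(manual)
```

The result has one row per group and one column per numeric column, which is the same shape as the `apply(group_mean)` output above.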


---
## More advanced custom functions

What if, after grouping, we want the five oldest passengers in each group (here, among the survivors)?

In [11]:
titanic.nlargest(n = 5, columns = "Age")

| | Survived | Pclass | Sex | Age | Fare |
|---|---|---|---|---|---|
| 630 | 1 | 1 | male | 80.0 | 30.0 |
| 851 | 0 | 3 | male | 74.0 | 7.775 |
| 96 | 0 | 1 | male | 71.0 | 34.6542 |
| 493 | 0 | 1 | male | 71.0 | 49.5042 |
| 116 | 0 | 3 | male | 70.5 | 7.75 |


In [12]:
def five_elder_surv(group):
    return group[group.Survived == 1].nlargest(n = 5, columns = "Age")

In [13]:
titanic.groupby("Sex").apply(five_elder_surv)

| Sex | | Survived | Pclass | Sex | Age | Fare |
|---|---|---|---|---|---|---|
| female | 275 | 1 | 1 | female | 63.0 | 77.9583 |
| female | 483 | 1 | 3 | female | 63.0 | 9.5875 |
| female | 829 | 1 | 1 | female | 62.0 | 80.0 |
| female | 366 | 1 | 1 | female | 60.0 | 75.25 |
| female | 11 | 1 | 1 | female | 58.0 | 26.55 |
| male | 630 | 1 | 1 | male | 80.0 | 30.0 |
| male | 570 | 1 | 2 | male | 62.0 | 10.5 |
| male | 587 | 1 | 1 | male | 60.0 | 79.2 |
| male | 647 | 1 | 1 | male | 56.0 | 35.5 |
| male | 449 | 1 | 1 | male | 52.0 | 30.5 |
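
When the custom function returns a DataFrame rather than a single row of statistics, the combined result gets a two-level index: the group key on the outside and the original row labels on the inside. A small sketch on made-up data (the `toy` DataFrame and `two_elder_surv` are hypothetical, scaled down to the top two so the result is easy to check by eye):

```python
import pandas as pd

# Hypothetical toy data; only the column names mirror the Titanic file.
toy = pd.DataFrame({
    "Survived": [1, 1, 0, 1, 1, 1],
    "Sex": ["male", "female", "female", "male", "female", "male"],
    "Age": [22.0, 38.0, 60.0, 35.0, 26.0, 50.0],
})

def two_elder_surv(group):
    # Keep only the survivors, then take the two oldest of them.
    return group[group["Survived"] == 1].nlargest(n=2, columns="Age")

# Selecting the columns explicitly keeps the grouping column out of
# what is passed to the function.
top = toy.groupby("Sex")[["Survived", "Age"]].apply(two_elder_surv)

print(top)
print(top.index.names)
```

Row 2 (female, age 60) is the oldest passenger in the toy data but did not survive, so it is filtered out before `nlargest` runs.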


---
### Notes

- **Note that the simple `mean()` example at the start does not actually need a separate custom function**

In [14]:
# The following two give the same result
# 
# titanic.groupby("Sex").mean()
titanic.groupby("Sex").apply(group_mean)

| Sex | Survived | Pclass | Age | Fare |
|---|---|---|---|---|
| female | 0.742038 | 2.159236 | 27.915709 | 44.479818 |
| male | 0.188908 | 2.389948 | 30.726645 | 25.523893 |


- **But for anything a bit more involved, we do have to write a custom function**

In [15]:
# The following two do NOT behave the same!!!
# 
# titanic.groupby("Sex").nlargest(n = 5, columns = "Age")  # raises an error
titanic.groupby("Sex").apply(five_elder_surv)

| Sex | | Survived | Pclass | Sex | Age | Fare |
|---|---|---|---|---|---|---|
| female | 275 | 1 | 1 | female | 63.0 | 77.9583 |
| female | 483 | 1 | 3 | female | 63.0 | 9.5875 |
| female | 829 | 1 | 1 | female | 62.0 | 80.0 |
| female | 366 | 1 | 1 | female | 60.0 | 75.25 |
| female | 11 | 1 | 1 | female | 58.0 | 26.55 |
| male | 630 | 1 | 1 | male | 80.0 | 30.0 |
| male | 570 | 1 | 2 | male | 62.0 | 10.5 |
| male | 587 | 1 | 1 | male | 60.0 | 79.2 |
| male | 647 | 1 | 1 | male | 56.0 | 35.5 |
| male | 449 | 1 | 1 | male | 52.0 | 30.5 |


---
As for why there is such a difference?  
I don't fully understand it yet!!
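
One likely explanation, offered as a sketch rather than a definitive answer: a groupby object only exposes the methods pandas has chosen to implement on it. `mean()` is one of them, while `nlargest()` is not available on a grouped DataFrame, so it has to go through `apply()`. A grouped single column (a grouped Series) does have `nlargest()`. The check below uses made-up data (`toy` is hypothetical):

```python
import pandas as pd

# Hypothetical toy data; only the column names mirror the Titanic file.
toy = pd.DataFrame({
    "Sex": ["male", "female", "female", "male", "female", "male"],
    "Age": [22.0, 38.0, 60.0, 35.0, 26.0, 50.0],
})

grouped = toy.groupby("Sex")

# A grouped DataFrame has mean() but no nlargest().
print(hasattr(grouped, "mean"))
print(hasattr(grouped, "nlargest"))

# A grouped single column does have nlargest().
print(hasattr(grouped["Age"], "nlargest"))

# So the two oldest ages per group can be taken without apply(),
# as long as we only need that one column back.
oldest = grouped["Age"].nlargest(2)
print(oldest)
```

This returns only the `Age` values (indexed by group and original row label), so when the full rows are needed, `apply()` with a custom function is still the way to go.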