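## **Data Processing**
### **▶*apply***

- Use it for complex processing along rows or columns
- Each row or column is passed to the function as input, and the call is repeated for every row or column
- Used in the form apply(function, axis=0 or 1): axis=0 passes each column to the function, axis=1 passes each row
- Lambda function: an anonymous function defined inline at the point of use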
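
The `apply(function, axis=0 or 1)` form can be sketched on a small hand-made DataFrame. The name `toy_df` and its values are assumptions made up for illustration; they are not part of the dataset used later.

```python
import pandas as pd

# Hypothetical toy data, used only to illustrate the axis argument.
toy_df = pd.DataFrame({"a": [1, 2, 3], "b": [10, 20, 30]})

# axis=0 (the default): the lambda receives each column as a Series.
col_range = toy_df.apply(lambda col: col.max() - col.min(), axis=0)

# axis=1: the lambda receives each row as a Series.
row_sum = toy_df.apply(lambda row: row["a"] + row["b"], axis=1)

print(col_range)
print(row_sum)
```

With `axis=0` the result has one entry per column; with `axis=1` it has one entry per row.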

## **Data Summarization**
### **▶*Group by***

- Similar in function to a pivot table in Excel
- Useful for exploring hierarchically structured data
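
To make the pivot comparison concrete, here is a minimal sketch on made-up records. The `sales` frame and its column names are hypothetical. Grouping by two keys gives a hierarchical result, and `pivot_table` lays out the same totals as a grid.

```python
import pandas as pd

# Hypothetical records with two levels: region, then product.
sales = pd.DataFrame({
    "region": ["east", "east", "west", "west", "west"],
    "product": ["x", "y", "x", "x", "y"],
    "amount": [10, 20, 30, 40, 50],
})

# Grouping by two keys gives a hierarchical (MultiIndex) result.
by_group = sales.groupby(["region", "product"])["amount"].sum()

# pivot_table arranges the same totals as a region x product grid.
as_pivot = sales.pivot_table(index="region", columns="product",
                             values="amount", aggfunc="sum")

print(by_group)
print(as_pivot)
```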

## **Data: the built-in CAR_CRASHES dataset**

### **Data description**
- total: Number of drivers involved in fatal collisions per billion miles (5.900–23.900)
- speeding: Percentage Of Drivers Involved In Fatal Collisions Who Were Speeding (1.792–9.450)
- alcohol: Percentage Of Drivers Involved In Fatal Collisions Who Were Alcohol-Impaired (1.593–10.038)
- not_distracted: Percentage Of Drivers Involved In Fatal Collisions Who Were Not Distracted (1.760–23.661)
- no_previous: Percentage Of Drivers Involved In Fatal Collisions Who Had Not Been Involved In Any Previous Accidents (5.900–21.280)
- ins_premium: Car Insurance Premiums (641.960–1301.520)
- ins_losses: Losses incurred by insurance companies for collisions per insured driver (82.75–194.780)


## **apply**

In [4]:
from seaborn import load_dataset
import numpy as np
import pandas as pd

## **apply: applies a function repeatedly, row by row**

In [5]:
car_df = load_dataset('car_crashes')
car_df.head()

,total,speeding,alcohol,not_distracted,no_previous,ins_premium,ins_losses,abbrev
0,18.8,7.332,5.64,18.048,15.04,784.55,145.08,AL
1,18.1,7.421,4.525,16.29,17.014,1053.48,133.93,AK
2,18.6,6.51,5.208,15.624,17.856,899.47,110.35,AZ
3,22.4,4.032,5.824,21.056,21.28,827.34,142.39,AR
4,12.0,4.2,3.36,10.92,10.68,878.41,165.63,CA


In [6]:
car_df[car_df['total']>20]

,total,speeding,alcohol,not_distracted,no_previous,ins_premium,ins_losses,abbrev
3,22.4,4.032,5.824,21.056,21.28,827.34,142.39,AR
17,21.4,4.066,4.922,16.692,16.264,872.51,137.13,KY
18,20.5,7.175,6.765,14.965,20.09,1281.55,194.78,LA
26,21.4,8.346,9.416,17.976,18.19,816.21,85.15,MT
34,23.9,5.497,10.038,23.661,20.554,688.75,109.72,ND
40,23.9,9.082,9.799,22.944,19.359,858.97,116.29,SC
48,23.8,8.092,6.664,23.086,20.706,992.61,152.56,WV


In [7]:
def plus(x, y):
    # Return the sum; without `return` the function would yield None
    return x + y
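
A two-argument helper like the one above can also be handed to `apply`. This is a minimal sketch, assuming the intent is to add a fixed offset to every value: each element is passed as the first argument, and `args` supplies the rest. The `values` Series and the offset `10` are made up for illustration.

```python
import pandas as pd

def plus(x, y):
    """Return the sum of x and y."""
    return x + y

values = pd.Series([1.0, 2.0, 3.0])  # hypothetical values

# Each element is passed as x; args supplies the remaining positional argument y.
shifted = values.apply(plus, args=(10,))

print(shifted.tolist())
```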

In [8]:
car_df['alcohol'].apply(np.square).head(5)  # np.square: squares each value

0    31.809600
1    20.475625
2    27.123264
3    33.918976
4    11.289600
Name: alcohol, dtype: float64
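
For a simple element-wise operation like squaring, `apply` is not the only option. The sketch below uses the five `alcohol` values from `car_df.head()`, typed in by hand so that it runs without downloading the dataset. It compares `apply` with calling the NumPy function on the Series directly and with the `**` operator.

```python
import numpy as np
import pandas as pd

# First five alcohol values shown in car_df.head() above.
alcohol = pd.Series([5.640, 4.525, 5.208, 5.824, 3.360], name="alcohol")

via_apply = alcohol.apply(np.square)   # through apply
via_ufunc = np.square(alcohol)         # NumPy function applied to the whole Series
via_power = alcohol ** 2               # plain arithmetic operator

# All three approaches should agree up to floating-point tolerance.
print(np.allclose(via_apply, via_ufunc), np.allclose(via_apply, via_power))
```

On large data the vectorized forms are usually faster, so `apply` is best kept for logic that cannot be written as a whole-column operation.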

## **Using lambda functions**
- An anonymous function can be defined and applied on the spot

In [9]:
car_df.apply(lambda x:np.square(x)
            if x.name in ['alcohol', 'total'] else x).head(5)

,total,speeding,alcohol,not_distracted,no_previous,ins_premium,ins_losses,abbrev
0,353.44,7.332,31.8096,18.048,15.04,784.55,145.08,AL
1,327.61,7.421,20.475625,16.29,17.014,1053.48,133.93,AK
2,345.96,6.51,27.123264,15.624,17.856,899.47,110.35,AZ
3,501.76,4.032,33.918976,21.056,21.28,827.34,142.39,AR
4,144.0,4.2,11.2896,10.92,10.68,878.41,165.63,CA


In [10]:
car_df[['total','speeding','alcohol']].apply(lambda x: x.max()-x.min())

total       18.000
speeding     7.658
alcohol      8.445
dtype: float64

In [11]:
# Pass axis=1 to apply the function row by row
car_df[['total','speeding','alcohol']].apply(
    lambda x: x.max() - x.min(), axis =1 ).head(5)

0    13.160
1    13.575
2    13.392
3    18.368
4     8.640
dtype: float64
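
The row-wise range can also be computed without `apply`. This sketch uses the first three rows of `total`, `speeding`, and `alcohol` from the output of `car_df.head()`, typed in by hand, and checks that the built-in `max` and `min` methods with `axis=1` give the same values.

```python
import pandas as pd

# First three rows of total, speeding, alcohol from car_df.head() above.
subset = pd.DataFrame({
    "total": [18.8, 18.1, 18.6],
    "speeding": [7.332, 7.421, 6.510],
    "alcohol": [5.640, 4.525, 5.208],
})

# Row-wise range through apply, as in the cell above.
via_apply = subset.apply(lambda row: row.max() - row.min(), axis=1)

# Same result using the built-in reductions along axis=1.
via_methods = subset.max(axis=1) - subset.min(axis=1)

print(via_methods.round(3).tolist())
```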

In [12]:
car_df.head()

,total,speeding,alcohol,not_distracted,no_previous,ins_premium,ins_losses,abbrev
0,18.8,7.332,5.64,18.048,15.04,784.55,145.08,AL
1,18.1,7.421,4.525,16.29,17.014,1053.48,133.93,AK
2,18.6,6.51,5.208,15.624,17.856,899.47,110.35,AZ
3,22.4,4.032,5.824,21.056,21.28,827.34,142.39,AR
4,12.0,4.2,3.36,10.92,10.68,878.41,165.63,CA


In [13]:
car_df['alcohol'].apply(
    lambda x : np.square(x) if x>5 else x).head(5)

0    31.809600
1     4.525000
2    27.123264
3    33.918976
4     3.360000
Name: alcohol, dtype: float64
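
The same conditional transformation can be sketched with `Series.where`, which keeps a value where the condition holds and substitutes another value elsewhere. The input is again the first five `alcohol` values, typed in by hand.

```python
import numpy as np
import pandas as pd

alcohol = pd.Series([5.640, 4.525, 5.208, 5.824, 3.360], name="alcohol")

# Element-wise conditional through apply, as in the cell above.
via_apply = alcohol.apply(lambda x: np.square(x) if x > 5 else x)

# where keeps values that are <= 5 and replaces the rest with their square.
via_where = alcohol.where(alcohol <= 5, np.square(alcohol))

print(np.allclose(via_apply, via_where))
```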

In [14]:
car_df['abbrev'].apply(
    lambda x: (x,x.count('A'))).head(10)

0    (AL, 1)
1    (AK, 1)
2    (AZ, 1)
3    (AR, 1)
4    (CA, 1)
5    (CO, 0)
6    (CT, 0)
7    (DE, 0)
8    (DC, 0)
9    (FL, 0)
Name: abbrev, dtype: object
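
For string columns, the `.str` accessor offers a vectorized counterpart to calling a string method inside a lambda. A minimal sketch, using a few abbreviations copied from the output above:

```python
import pandas as pd

# A few state abbreviations taken from the output above.
abbrev = pd.Series(["AL", "AK", "CA", "CO", "DC"], name="abbrev")

# Count occurrences of 'A' in each string without a lambda.
a_counts = abbrev.str.count("A")

# Pair each abbreviation with its count, mirroring the tuples above.
pairs = list(zip(abbrev, a_counts))

print(pairs)
```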

In [16]:
def ac_score(x):
    if x>5:
        return 'bad'
    elif x> 3:
        return 'normal'
    else :
        return 'good'

car_df['alcohol'].apply(lambda x: ac_score(x)).head(10)

0       bad
1    normal
2       bad
3       bad
4    normal
5    normal
6    normal
7    normal
8      good
9       bad
Name: alcohol, dtype: object

In [19]:
new_car_df = car_df.copy()  # copy so that adding columns does not modify car_df
new_car_df['ac_score'] = new_car_df['alcohol'].apply(
    lambda x : 'bad' if x >5 else ('normal' if x>3 else 'good'))
new_car_df.head(10)

,total,speeding,alcohol,not_distracted,no_previous,ins_premium,ins_losses,abbrev,ac_score
0,18.8,7.332,5.64,18.048,15.04,784.55,145.08,AL,bad
1,18.1,7.421,4.525,16.29,17.014,1053.48,133.93,AK,normal
2,18.6,6.51,5.208,15.624,17.856,899.47,110.35,AZ,bad
3,22.4,4.032,5.824,21.056,21.28,827.34,142.39,AR,bad
4,12.0,4.2,3.36,10.92,10.68,878.41,165.63,CA,normal
5,13.6,5.032,3.808,10.744,12.92,835.5,139.91,CO,normal
6,10.8,4.968,3.888,9.396,8.856,1068.73,167.02,CT,normal
7,16.2,6.156,4.86,14.094,16.038,1137.87,151.48,DE,normal
8,5.9,2.006,1.593,5.9,5.9,1273.89,136.05,DC,good
9,17.9,3.759,5.191,16.468,16.826,1160.13,144.18,FL,bad
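
Threshold-based labeling like this can also be expressed with `pd.cut`. The sketch below is an alternative, not the approach used in the notebook: it bins the ten `alcohol` values shown above with right-closed intervals, so the boundaries match `x > 5` and `x > 3`. The result is a categorical Series rather than an object Series.

```python
import numpy as np
import pandas as pd

# Alcohol values for the first ten rows shown above.
alcohol = pd.Series([5.640, 4.525, 5.208, 5.824, 3.360,
                     3.808, 3.888, 4.860, 1.593, 5.191])

# Right-closed bins: (-inf, 3] -> good, (3, 5] -> normal, (5, inf) -> bad.
score = pd.cut(alcohol, bins=[-np.inf, 3, 5, np.inf],
               labels=["good", "normal", "bad"])

print(score.tolist())
```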


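## **Group by functions**
- count: number of rows
- sum: sum of the values
- mean: mean of the values
- median: median of the values
- prod: product of the values
- first, last: the first and last values in each group
- max, min: the largest and smallest values
- agg: applies the named aggregation; same effect as calling the aggregation method directly
- transform(): maps the function's result back onto the rows of the DataFrame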
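
A minimal sketch of how `agg` relates to the individual aggregation methods, on a made-up frame (`toy` and its values are assumptions for illustration). Passing a function name to `agg` matches calling the method directly, and passing a list produces one output column per function.

```python
import pandas as pd

# Hypothetical data: two groups with two rows each.
toy = pd.DataFrame({
    "grade": ["bad", "bad", "good", "good"],
    "total": [20.0, 18.0, 8.0, 10.0],
})

grouped = toy.groupby("grade")["total"]

# Calling the method directly and passing its name to agg give the same result.
same = grouped.mean().equals(grouped.agg("mean"))

# agg also accepts several functions at once, one output column per function.
summary = grouped.agg(["count", "mean", "max"])

print(same)
print(summary)
```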

In [28]:
# new_car_df.groupby(['ac_score']).median()  # This raises an error: median requires numeric dtypes, and abbrev is a string column

In [26]:
new_car_df.groupby(['ac_score'])[new_car_df.select_dtypes(include='number').columns].median()

ac_score,total,speeding,alcohol,not_distracted,no_previous,ins_premium,ins_losses
bad,19.1,6.7375,5.713,16.97,16.852,843.155,143.285
good,8.9,2.107,2.296,7.791,7.504,910.26,134.49
normal,13.8,4.2,4.08,12.328,12.375,869.85,137.13
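
As an alternative to selecting the numeric columns by hand, the aggregation can be asked to skip non-numeric columns. This sketch assumes a pandas version in which `median` accepts the `numeric_only` argument, and it uses a three-row toy frame rather than the full dataset.

```python
import pandas as pd

# Toy frame with one numeric column and one string column.
toy = pd.DataFrame({
    "ac_score": ["bad", "bad", "good"],
    "total": [18.8, 22.4, 5.9],
    "abbrev": ["AL", "AR", "DC"],   # string column, has no median
})

# numeric_only=True asks pandas to skip non-numeric columns such as abbrev.
medians = toy.groupby("ac_score").median(numeric_only=True)

print(medians)
```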


In [29]:
new_car_df.groupby(['ac_score']).count()

ac_score,total,speeding,alcohol,not_distracted,no_previous,ins_premium,ins_losses,abbrev
bad,20,20,20,20,20,20,20,20
good,4,4,4,4,4,4,4,4
normal,27,27,27,27,27,27,27,27


In [32]:
new_car_df.groupby(['ac_score'])[new_car_df.select_dtypes(include='number').columns].median()['total']

ac_score
bad       19.1
good       8.9
normal    13.8
Name: total, dtype: float64

In [33]:
new_car_df['median_by_alchol'] = new_car_df.groupby(
    ['ac_score'])['total'].transform('median')
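
The difference between aggregating and transforming can be sketched on made-up data (the values in `toy` and the column name `median_by_group` are hypothetical). An aggregation returns one value per group, while `transform` returns one value per original row, which is why its result can be assigned straight back as a new column.

```python
import pandas as pd

# Hypothetical data: three 'bad' rows and two 'good' rows.
toy = pd.DataFrame({
    "ac_score": ["bad", "good", "bad", "good", "bad"],
    "total": [18.0, 6.0, 20.0, 8.0, 22.0],
})

grouped = toy.groupby("ac_score")["total"]

per_group = grouped.median()            # one value per group
per_row = grouped.transform("median")   # one value per original row

# The transformed Series shares the original index, so it aligns row by row.
toy["median_by_group"] = per_row

print(len(per_group), len(per_row))
print(toy)
```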

In [38]:
new_car_df.groupby(['ac_score'])[new_car_df.select_dtypes(include='number').columns].agg('mean')

ac_score,total,speeding,alcohol,not_distracted,no_previous,ins_premium,ins_losses,median_by_alchol
bad,19.395,6.3986,6.47415,16.84645,17.15035,865.914,136.4155,19.1
good,8.75,2.73975,2.26375,7.8565,7.939,967.8975,128.6275,8.9
normal,14.162963,4.295444,4.099556,11.995444,12.573556,890.554444,133.938148,13.8
