# 0. About `mutate`

`mutate` is useful when you want to `add` a new column.

It is less well suited to `update`-ing the contents of an existing column.

This is because `modify` is the better fit when only part of a column needs an `update`.

# 1. Create the data frame `tb`.

In [40]:
library(tidyverse)
tb<-tibble(x1=1:10,x2=rnorm(10))
tb

x1,x2
<int>,<dbl>
1,0.5180116
2,0.9744405
3,0.6251501
4,-1.33082
5,-0.8325119
6,-0.1634481
7,0.5440492
8,-0.4732586
9,-0.3014033
10,-1.1256243


# 2. `x3 = x1 + x2`

This is an element-wise vector + vector operation.

In [41]:
mutate(tb,x3=x1+x2)

x1,x2,x3
<int>,<dbl>,<dbl>
1,0.5180116,1.518012
2,0.9744405,2.974441
3,0.6251501,3.62515
4,-1.33082,2.66918
5,-0.8325119,4.167488
6,-0.1634481,5.836552
7,0.5440492,7.544049
8,-0.4732586,7.526741
9,-0.3014033,8.698597
10,-1.1256243,8.874376


# 3. Saving the result of `mutate`

## (1) `mutate` returns a new tibble and does not modify `tb`, so the result is lost unless it is assigned.

In [42]:
mutate(tb,x3=x1+x2)

x1,x2,x3
<int>,<dbl>,<dbl>
1,0.5180116,1.518012
2,0.9744405,2.974441
3,0.6251501,3.62515
4,-1.33082,2.66918
5,-0.8325119,4.167488
6,-0.1634481,5.836552
7,0.5440492,7.544049
8,-0.4732586,7.526741
9,-0.3014033,8.698597
10,-1.1256243,8.874376


In [43]:
print(tb)

# A tibble: 10 x 2
      x1     x2
   <int>  <dbl>
 1     1  0.518
 2     2  0.974
 3     3  0.625
 4     4 -1.33
 5     5 -0.833
 6     6 -0.163
 7     7  0.544
 8     8 -0.473
 9     9 -0.301
10    10 -1.13


## (2) To keep the result, assign it to a variable as below.

In [32]:
tb2 <- tb %>% mutate(x3=x1+x2) 

In [33]:
tb2

x1,x2,x3
<int>,<dbl>,<dbl>
1,0.6980523,1.698052
2,-0.126694,1.873306
3,0.4529776,3.452978
4,0.5192793,4.519279
5,0.4563271,5.456327
6,-1.7494421,4.250558
7,0.0902628,7.090263
8,-0.4175539,7.582446
9,1.1028135,10.102814
10,0.2276045,10.227604
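The behaviour described in this section can be checked directly. The following is a minimal sketch; the fixed seed and the name `tb_new` are assumptions added here for reproducibility and are not part of the original notebook.

```r
library(dplyr)

set.seed(1)                               # assumption: fixed seed so the run is reproducible
tb <- tibble(x1 = 1:10, x2 = rnorm(10))

mutate(tb, x3 = x1 + x2)                  # the result is printed and then discarded
names(tb)                                 # tb still has only x1 and x2

tb_new <- tb %>% mutate(x3 = x1 + x2)     # assign the result to keep it (hypothetical name)
names(tb_new)                             # now x1, x2 and x3
```

Assigning back to the same name (`tb <- tb %>% mutate(...)`) would overwrite `tb` instead of creating a second object.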


# 4. Instead of adding a new variable, you can also overwrite an existing column with the result.

## (1) For example, to `update` `x2` with the result of `x1+x2`, do the following.

In [34]:
tb 

x1,x2
<int>,<dbl>
1,0.6980523
2,-0.126694
3,0.4529776
4,0.5192793
5,0.4563271
6,-1.7494421
7,0.0902628
8,-0.4175539
9,1.1028135
10,0.2276045


In [35]:
tb %>% mutate(x2=x1+x2)

x1,x2
<int>,<dbl>
1,1.698052
2,1.873306
3,3.452978
4,4.519279
5,5.456327
6,4.250558
7,7.090263
8,7.582446
9,10.102814
10,10.227604


## (2) How to `update` only a subset of a data frame ($\star\star\star$)

### **Method 1**: using `filter(...) %>% mutate(...)`

#### The `update` itself works, but the result cannot be `save`d back into the original rows!!! $\to$ Not usable.

***update***

In [72]:
tb %>% filter(x1>5, x1<8) %>% mutate(x2=x2+x1)

x1,x2
<int>,<dbl>
6,5.836552
7,7.544049


***but can't save***

In [92]:
tb2<-tb
tb2 %>% filter(x1>5, x1<8) <- tb2 %>% filter(x1>5, x1<8) %>% mutate(x2=x2+x1)

ERROR: Error in tb2 %>% filter(x1 > 5, x1 < 8) <- tb2 %>% filter(x1 > 5, x1 < : could not find function "%>%<-"
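One possible workaround, not used in the original notebook, is to compute the updated rows first and then write them back with `rows_update()`. This sketch assumes dplyr 1.0.0 or later and that `x1` uniquely identifies each row; the name `patch` and the fixed seed are illustrative.

```r
library(dplyr)

set.seed(1)                               # assumption: fixed seed for reproducibility
tb <- tibble(x1 = 1:10, x2 = rnorm(10))

# Step 1: compute only the rows that should change (same pipeline as Method 1)
patch <- tb %>% filter(x1 > 5, x1 < 8) %>% mutate(x2 = x2 + x1)

# Step 2: write those rows back into a copy of tb, matching on the key column x1
tb2 <- rows_update(tb, patch, by = "x1")
tb2
```

Rows whose `x1` does not appear in `patch` are left unchanged, so `tb2` keeps all 10 rows.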


### **Method 2**: a `dummy function` combined with `mutate` (really poor coding style)

***update***

In [86]:
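# Helper: add x to y only where 5 < x < 8; all other elements of y stay as they are
dum<-function(x,y){
    idx <- (x>5 & x<8)        # logical mask of the rows to update
    y[idx] <- x[idx]+y[idx]   # update only the selected elements
    y                         # return the full-length vector
}
tb %>% mutate(x2=dum(x1,x2))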

x1,x2
<int>,<dbl>
1,0.5180116
2,0.9744405
3,0.6251501
4,-1.33082
5,-0.8325119
6,5.8365519
7,7.5440492
8,-0.4732586
9,-0.3014033
10,-1.1256243


***save***

In [94]:
tb2<- tb %>% mutate(x2=dum(x1,x2))
tb2

x1,x2
<int>,<dbl>
1,0.5180116
2,0.9744405
3,0.6251501
4,-1.33082
5,-0.8325119
6,5.8365519
7,7.5440492
8,-0.4732586
9,-0.3014033
10,-1.1256243


### **Method 3**: `mutate(...=ifelse(...))`

#### `This is the cleanest approach I have found so far!!!`

***update***

In [88]:
tb %>% mutate(x2=ifelse(x1>5 & x1<8, x1+x2 ,x2))

x1,x2
<int>,<dbl>
1,0.5180116
2,0.9744405
3,0.6251501
4,-1.33082
5,-0.8325119
6,5.8365519
7,7.5440492
8,-0.4732586
9,-0.3014033
10,-1.1256243


***save***

In [96]:
tb2 <- tb %>% mutate(x2=ifelse(x1>5 & x1<8, x1+x2 ,x2))
tb2

x1,x2
<int>,<dbl>
1,0.5180116
2,0.9744405
3,0.6251501
4,-1.33082
5,-0.8325119
6,5.8365519
7,7.5440492
8,-0.4732586
9,-0.3014033
10,-1.1256243
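For comparison, dplyr also provides `if_else()` and `case_when()`, which can express the same conditional update. The sketch below is an illustration rather than part of the original notebook; the fixed seed and the result names are assumptions.

```r
library(dplyr)

set.seed(1)                               # assumption: fixed seed for reproducibility
tb <- tibble(x1 = 1:10, x2 = rnorm(10))

# Base R ifelse(), as in Method 3
base_version <- tb %>% mutate(x2 = ifelse(x1 > 5 & x1 < 8, x1 + x2, x2))

# if_else() additionally checks that both branches have compatible types
strict_version <- tb %>% mutate(x2 = if_else(x1 > 5 & x1 < 8, x1 + x2, x2))

# case_when() reads well once there is more than one condition
multi_version <- tb %>% mutate(x2 = case_when(
    x1 > 5 & x1 < 8 ~ x1 + x2,
    TRUE            ~ x2
))
```

All three produce the same `x2` here; the choice is mostly about type strictness and readability.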


### **Method 4**: `mutate(...=replace(...))`

If the replacement is a single `value`, `replace` works just as well.

In [106]:
tb %>% mutate(x2=replace(x2, x1>5 & x1<8,100))

x1,x2
<int>,<dbl>
1,0.5180116
2,0.9744405
3,0.6251501
4,-1.33082
5,-0.8325119
6,100.0
7,100.0
8,-0.4732586
9,-0.3014033
10,-1.1256243


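However, an expression like the one below does not give the intended result. `replace` expects one replacement value per selected element, but `x1+x2` has length 10 while only 2 rows are selected, so R warns and uses the first two elements of `x1+x2` (the sums for rows 1 and 2) for rows 6 and 7.

In [105]:
tb %>% mutate(x2=replace(x2, x1>5 & x1<8,x1+x2))

“Problem with `mutate()` input `x2`.
ℹ number of items to replace is not a multiple of replacement length
ℹ Input `x2` is `replace(x2, x1 > 5 & x1 < 8, x1 + x2)`.”
“number of items to replace is not a multiple of replacement length”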


x1,x2
<int>,<dbl>
1,0.5180116
2,0.9744405
3,0.6251501
4,-1.33082
5,-0.8325119
6,1.5180116
7,2.9744405
8,-0.4732586
9,-0.3014033
10,-1.1256243
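The warning above goes away if the replacement values are subset with the same condition, so that their length matches the number of selected rows. This is a minimal sketch under that assumption; the fixed seed and the name `fixed` are illustrative.

```r
library(dplyr)

set.seed(1)                               # assumption: fixed seed for reproducibility
tb <- tibble(x1 = 1:10, x2 = rnorm(10))

fixed <- tb %>% mutate(
    # subset the replacement values with the same mask used to select the rows
    x2 = replace(x2, x1 > 5 & x1 < 8, (x1 + x2)[x1 > 5 & x1 < 8])
)
fixed
```

The condition has to be written twice, which is one reason the `ifelse` form tends to be easier to read.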


In the end, anything you want to do with `mutate(...=replace(...))` can also be done with `mutate(...=ifelse(...))`.

In [107]:
tb %>% mutate(x2=ifelse(x1>5 & x1<8,100, x2))

x1,x2
<int>,<dbl>
1,0.5180116
2,0.9744405
3,0.6251501
4,-1.33082
5,-0.8325119
6,100.0
7,100.0
8,-0.4732586
9,-0.3014033
10,-1.1256243


# 5. `mutate` can refer to a column created earlier in the same call.
- `x3=x1+x2`
- `x4=x1-mean(x3)`

In [38]:
mutate(tb,x3=x1+x2,x4=x1-mean(x3))

x1,x2,x3,x4
<int>,<dbl>,<dbl>,<dbl>
6,4.250558,10.25056,-9.850737
7,7.090263,14.09026,-8.850737
8,7.582446,15.58245,-7.850737
9,10.102814,19.10281,-6.850737
10,10.227604,20.2276,-5.850737
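The arguments of `mutate` are evaluated from left to right, which is why `x4` can use `x3`. The sketch below checks this on a freshly generated `tb`; the fixed seed and the names `res`, `x4_manual` and `reversed_fails` are assumptions made for illustration.

```r
library(dplyr)

set.seed(1)                               # assumption: fixed seed for reproducibility
tb <- tibble(x1 = 1:10, x2 = rnorm(10))

res <- mutate(tb, x3 = x1 + x2, x4 = x1 - mean(x3))

# The same x4 computed by hand, outside mutate()
x4_manual <- tb$x1 - mean(tb$x1 + tb$x2)
all.equal(res$x4, x4_manual)

# Reversing the order fails, because x3 does not exist yet when x4 is computed
reversed_fails <- inherits(
    try(mutate(tb, x4 = x1 - mean(x3), x3 = x1 + x2), silent = TRUE),
    "try-error"
)
reversed_fails
```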


# 6. If you only want to see the newly created columns, use `transmute`.

In [39]:
transmute(tb,x3=x1+x2,x4=x1-mean(x3))

x3,x4
<dbl>,<dbl>
10.25056,-9.850737
14.09026,-8.850737
15.58245,-7.850737
19.10281,-6.850737
20.2276,-5.850737
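Recent dplyr documentation marks `transmute()` as superseded by `mutate()` with `.keep = "none"`. The sketch below assumes dplyr 1.0.0 or later, where the `.keep` argument is available; the fixed seed and the name `only_new` are illustrative.

```r
library(dplyr)

set.seed(1)                               # assumption: fixed seed for reproducibility
tb <- tibble(x1 = 1:10, x2 = rnorm(10))

# Keep only the columns created in this call
only_new <- mutate(tb, x3 = x1 + x2, x4 = x1 - mean(x3), .keep = "none")
names(only_new)
```

`transmute()` still works, so existing code does not need to change.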


# 7. Combining `lag` with `mutate` is very powerful, because `lag` pads the shifted position with `NA` and keeps the column length unchanged.

In [9]:
mutate(tb,x3=lag(x2))

x1,x2,x3
<int>,<dbl>,<dbl>
1,-0.66774172,
2,0.97955613,-0.66774172
3,-0.23543652,0.97955613
4,-0.05123645,-0.23543652
5,-0.92831697,-0.05123645
6,-0.2415183,-0.92831697
7,-2.29371835,-0.2415183
8,1.84967922,-2.29371835
9,-0.85967047,1.84967922
10,-0.28646204,-0.85967047


In [10]:
mutate(tb,x3=lag(x2),x4=x2-x3)

x1,x2,x3,x4
<int>,<dbl>,<dbl>,<dbl>
1,-0.66774172,,
2,0.97955613,-0.66774172,1.6472979
3,-0.23543652,0.97955613,-1.2149926
4,-0.05123645,-0.23543652,0.1842001
5,-0.92831697,-0.05123645,-0.8770805
6,-0.2415183,-0.92831697,0.6867987
7,-2.29371835,-0.2415183,-2.0522001
8,1.84967922,-2.29371835,4.1433976
9,-0.85967047,1.84967922,-2.7093497
10,-0.28646204,-0.85967047,0.5732084


To compute the first difference, you can also write it more simply as below.

In [11]:
mutate(tb,x3=x2-lag(x2))

x1,x2,x3
<int>,<dbl>,<dbl>
1,-0.66774172,
2,0.97955613,1.6472979
3,-0.23543652,-1.2149926
4,-0.05123645,0.1842001
5,-0.92831697,-0.8770805
6,-0.2415183,0.6867987
7,-2.29371835,-2.0522001
8,1.84967922,4.1433976
9,-0.85967047,-2.7093497
10,-0.28646204,0.5732084


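### Note that `mutate(tb,x3=diff(x2))` raises an error, because `diff` returns one element fewer than the number of rows.
### Conclusion: inside `mutate`, `lag` is more convenient than `diff` because it handles the `NA` for you. So avoid `diff` here.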
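The length mismatch can be seen directly, and padding `diff` with a leading `NA` reproduces the `lag` result. This is a minimal sketch; the fixed seed and the names `diff_fails`, `with_lag` and `with_diff` are assumptions made for illustration.

```r
library(dplyr)

set.seed(1)                               # assumption: fixed seed for reproducibility
tb <- tibble(x1 = 1:10, x2 = rnorm(10))

length(diff(tb$x2))                       # one element fewer than nrow(tb)

# mutate() rejects the shorter vector, so this call errors
diff_fails <- inherits(try(mutate(tb, x3 = diff(x2)), silent = TRUE), "try-error")

# Padding with a leading NA makes diff() match the lag() version
with_lag  <- mutate(tb, x3 = x2 - lag(x2))
with_diff <- mutate(tb, x3 = c(NA, diff(x2)))
all.equal(with_lag$x3, with_diff$x3)
```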

# 8. `x!=lag(x)` is useful for finding change points.

Let us create a new dataset `tb2` as below.

In [12]:
tb2<-as_tibble(cbind(tb,comp=c(0,0,0,0,0,40,40,40,40,40)))
tb2

x1,x2,comp
<int>,<dbl>,<dbl>
1,-0.66774172,0
2,0.97955613,0
3,-0.23543652,0
4,-0.05123645,0
5,-0.92831697,0
6,-0.2415183,40
7,-2.29371835,40
8,1.84967922,40
9,-0.85967047,40
10,-0.28646204,40


Suppose we are interested in the point in `tb2` where the `comp` value changes from `0` to `40`.

In [13]:
mutate(tb2,comp!=lag(comp))

x1,x2,comp,comp != lag(comp)
<int>,<dbl>,<dbl>,<lgl>
1,-0.66774172,0,
2,0.97955613,0,FALSE
3,-0.23543652,0,FALSE
4,-0.05123645,0,FALSE
5,-0.92831697,0,FALSE
6,-0.2415183,40,TRUE
7,-2.29371835,40,FALSE
8,1.84967922,40,FALSE
9,-0.85967047,40,FALSE
10,-0.28646204,40,FALSE


To extract only the rows at that point, do the following. `filter` also drops the first row, where the comparison is `NA`.

In [14]:
filter(tb2,comp!=lag(comp))

x1,x2,comp
<int>,<dbl>,<dbl>
6,-0.2415183,40
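If the `NA` in the first row is unwanted, `lag` accepts a `default` value, and a running sum of the change flag labels each constant stretch. The sketch below is an illustration with a simplified `tb2` (without `x2`); the column names `changed` and `run_id` are hypothetical.

```r
library(dplyr)

tb2 <- tibble(x1 = 1:10, comp = c(0, 0, 0, 0, 0, 40, 40, 40, 40, 40))

marked <- tb2 %>% mutate(
    # the first row is compared with itself, so it is FALSE instead of NA
    changed = comp != lag(comp, default = first(comp)),
    # counter that increases by one at every change point
    run_id  = cumsum(changed)
)
marked

# Rows where the value changes
filter(marked, changed)
```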
