# Lesson 05 - Transforming data

## 5.1. Learning objectives

### 5.1.1. Basic

1. Use the six basic dplyr verbs on a single table:
    * `select()`
    * `filter()`
    * `arrange()`
    * `mutate()`
    * `summarise()`
    * `group_by()`
    
### 5.1.2. Intermediate

2. Use some additional one-table verbs:
    * `rename()`
    * `distinct()`
    * `slice()`
    * `pull()`

### 5.1.3. Advanced

3. Fine-tune `select()` operations
4. Use _window functions_

## 5.2. Resources

* [Chapter 5: Data Transformation](http://r4ds.had.co.nz/transform.html) in _R for Data Science_
* [Chapter 16: Dates and Times](http://r4ds.had.co.nz/dates-and-times.html) in _R for Data Science_
* [Data transformation _cheat sheet_](https://github.com/rstudio/cheatsheets/raw/master/data-transformation.pdf)

## 5.3. Setup

In [1]:
home <- path.expand("~")
lib_dir <- file.path(file.path(home, "R"), "lib")
dir.create(lib_dir, showWarnings = FALSE)

library(utils)
.libPaths(c(lib_dir, .libPaths()))

# libraries needed for these examples
install.packages('tidyverse')
library(tidyverse)
install.packages('lubridate')
library(lubridate)
set.seed(8675309) # makes sure random numbers are reproducible

Installing package into ‘/home/eduardo/R/lib’
(as ‘lib’ is unspecified)

── Attaching packages ─────────────────────────────────────── tidyverse 1.3.0 ──

✔ ggplot2 3.2.1     ✔ purrr   0.3.3
✔ tibble  2.1.3     ✔ dplyr   0.8.3
✔ tidyr   1.0.0     ✔ stringr 1.4.0
✔ readr   1.3.1     ✔ forcats 0.4.0

── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()

Installing package into ‘/home/eduardo/R/lib’
(as ‘lib’ is unspecified)


Attaching package: ‘lubridate’


The following object is masked from ‘package:base’:

    date




## 5.4. The `disgust` dataset

The examples in this section use data from the [disgust.csv](https://psyteachr.github.io/data/disgust.csv) dataset. Each participant is identified by a unique `user_id`, and each completed questionnaire has a unique `id`.

In [2]:
disgust <- read_csv("https://psyteachr.github.io/msc-data-skills/data/disgust.csv")

Parsed with column specification:
cols(
  .default = col_double(),
  date = col_date(format = "")
)

See spec(...) for full column specifications.



**Questionnaire instructions**: The following items describe a variety of concepts. Please rate how disgusting you find the concept described in each item, where 0 means you do not find it disgusting at all and 6 means you find it extremely disgusting.

colname |	question
--------|-------------
moral1 |	Stealing candy from a convenience store
moral2 |	Stealing from a neighbor
moral3 |	A student cheating on an exam
moral4 |	Deceiving a friend
moral5 |	Forging someone's signature on a document
moral6 |	Cutting in line to buy the last tickets to a show
moral7 |	Intentionally lying during a business transaction
sexual1 |	Hearing two strangers having sex
sexual2 |	Performing oral sex
sexual3 |	Watching a pornographic video
sexual4 |	Finding out that someone you don't like has sexual fantasies about you
sexual5 |	Bringing someone you just met back to your room and having sex with them
sexual6 |	A stranger of the opposite sex intentionally running a hand over your legs in an elevator
sexual7 |	Having anal sex with someone of the opposite sex
pathogen1 |	Stepping on dog poop
pathogen2 |	Sitting next to someone who has red sores on their arm
pathogen3 |	Shaking hands with a stranger who has sweaty palms
pathogen4 |	Seeing green mold on spoiled food in the refrigerator
pathogen5 |	Sitting next to someone who smells bad
pathogen6 |	Seeing a cockroach run across the floor
pathogen7 |	Accidentally touching another person's bloody wound

## 5.5. The six main dplyr verbs

Most of the data transformations you will do with psychological data involve the `tidyr` verbs introduced in Lesson 03 and the six main dplyr verbs: `select`, `filter`, `arrange`, `mutate`, `summarise`, and `group_by`.

### 5.5.1. `select()`

Selects columns by name or number. You can select columns individually, separated by commas (e.g., `col1, col2`), or select a range of columns with `:` (e.g., `start_col:end_col`).

In [3]:
moral <- disgust %>% select(user_id, moral1:moral7)
names(moral)

You can also select columns by number, which is useful when the column names are long or complicated.

In [4]:
sexual <- disgust %>% select(2, 11:17)
names(sexual)

Use the `-` symbol to exclude columns and keep all the others. To exclude a range of columns, wrap the range in parentheses (e.g., `-(moral1:moral7)`, not `-moral1:moral7`).

In [5]:
pathogen <- disgust %>% select(-id, -date, -(moral1:sexual7))
names(pathogen)
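
As a minimal sketch of the parenthesised form, the toy table below (its column names are hypothetical and not part of `disgust`) drops a whole range of columns in one step:

```r
library(dplyr)

# Hypothetical toy table, used only for illustration
toy <- tibble(id = 1:2, a = 1:2, b = 3:4, c = 5:6, z = 7:8)

# -(a:c) removes every column from a through c and keeps the rest
kept <- toy %>% select(-(a:c))
names(kept)
```

Only `id` and `z` remain, because the parentheses make the minus sign apply to the whole range rather than to `a` alone.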

#### 5.5.1.1. `starts_with()`

Selects columns whose names start with a given string.

In [6]:
u <- disgust %>% select(starts_with("u"))
names(u)

#### 5.5.1.2. `ends_with()`

Selects columns whose names end with a given string.

In [7]:
firstq <- disgust %>% select(ends_with("1"))
names(firstq)

#### 5.5.1.3. `contains()`

Selects columns whose names contain a given string.

In [8]:
pathogen <- disgust %>% select(contains("pathogen"))
names(pathogen)

#### 5.5.1.4. `num_range()`

Selects columns whose names consist of the given `prefix` followed by a number in the given range.

In [9]:
moral2_4 <- disgust %>% select(num_range("moral", 2:4))
names(moral2_4)

Use `width` to set the number of digits, padded with leading zeros. For example, `num_range('var_', 8:10, width = 2)` selects the columns `var_08`, `var_09`, and `var_10`.
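
The `width` argument can be tried on a small table. This is only a sketch: the table `toy` and its `var_` columns are hypothetical and exist just to match the call quoted above.

```r
library(dplyr)

# Hypothetical toy table whose column names use zero-padded numbers
toy <- tibble(
  var_08 = 1:2,
  var_09 = 3:4,
  var_10 = 5:6,
  var_11 = 7:8
)

# width = 2 pads 8 and 9 to "08" and "09" before matching the names
padded <- toy %>% select(num_range("var_", 8:10, width = 2))
names(padded)
```

Without `width = 2`, the helper would look for `var_8` and `var_9`, which do not exist in this table.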

### 5.5.2. `filter()`

Selects rows that meet a criterion. For example, select all rows where `user_id == 1`.

In [11]:
disgust %>% filter(user_id == 1)

id,user_id,date,moral1,moral2,moral3,moral4,moral5,moral6,moral7,⋯,sexual5,sexual6,sexual7,pathogen1,pathogen2,pathogen3,pathogen4,pathogen5,pathogen6,pathogen7
<dbl>,<dbl>,<date>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,⋯,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>
1,1,2008-07-10,2,2,1,2,1,1,1,⋯,1,2,2,3,2,3,3,2,3,3


You can combine multiple criteria by separating them with commas; a row is kept only if it meets all of them.

In [13]:
amoral <- disgust %>% filter(
  moral1 == 0, 
  moral2 == 0,
  moral3 == 0, 
  moral4 == 0,
  moral5 == 0,
  moral6 == 0,
  moral7 == 0
)
amoral

id,user_id,date,moral1,moral2,moral3,moral4,moral5,moral6,moral7,⋯,sexual5,sexual6,sexual7,pathogen1,pathogen2,pathogen3,pathogen4,pathogen5,pathogen6,pathogen7
<dbl>,<dbl>,<date>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,⋯,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>
710,44759,2008-08-26,0,0,0,0,0,0,0,⋯,0,0,0,0,6,0,0,2,2,1
7036,88227,2011-01-16,0,0,0,0,0,0,0,⋯,0,0,0,6,1,1,4,3,0,1
43,156076,2008-07-18,0,0,0,0,0,0,0,⋯,1,1,3,5,0,3,6,2,6,1
89,156634,2008-07-22,0,0,0,0,0,0,0,⋯,0,0,6,2,1,1,1,1,6,3
93,156722,2008-07-23,0,0,0,0,0,0,0,⋯,2,2,3,5,3,2,4,3,6,5
170,157975,2008-08-03,0,0,0,0,0,0,0,⋯,6,3,6,0,1,2,0,6,0,0
414,161765,2008-08-21,0,0,0,0,0,0,0,⋯,1,5,2,4,0,2,2,2,2,3
475,162435,2008-08-22,0,0,0,0,0,0,0,⋯,2,5,1,3,2,4,2,2,1,4
560,163796,2008-08-24,0,0,0,0,0,0,0,⋯,0,1,4,2,1,1,1,1,3,2
638,164674,2008-08-25,0,0,0,0,0,0,0,⋯,5,5,0,4,4,3,4,6,3,6


You can use the symbols `&`, `|`, and `!`, which mean _and_, _or_, and _not_. You can also use other operators to build equations.

In [16]:
# everyone who chose either 0 or 6 (the ends of the 0-6 scale) for question moral1
moral_extremes <- disgust %>% 
  filter(moral1 == 0 | moral1 == 6)

# everyone who chose the same answer for all moral questions
moral_consistent <- disgust %>% 
  filter(
    moral2 == moral1 & 
      moral3 == moral1 & 
      moral4 == moral1 &
      moral5 == moral1 &
      moral6 == moral1 &
      moral7 == moral1
  )

moral_consistent

id,user_id,date,moral1,moral2,moral3,moral4,moral5,moral6,moral7,⋯,sexual5,sexual6,sexual7,pathogen1,pathogen2,pathogen3,pathogen4,pathogen5,pathogen6,pathogen7
<dbl>,<dbl>,<date>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,⋯,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>
23,2311,2008-07-15,4,4,4,4,4,4,4,⋯,1,1,5,5,5,4,4,5,4,3
26188,34951,2014-01-04,6,6,6,6,6,6,6,⋯,6,3,6,6,6,6,6,6,6,6
710,44759,2008-08-26,0,0,0,0,0,0,0,⋯,0,0,0,0,6,0,0,2,2,1
2673,67077,2009-01-05,6,6,6,6,6,6,6,⋯,6,4,3,,4,0,5,4,3,6
152,72482,2008-07-30,6,6,6,6,6,6,6,⋯,6,6,6,5,1,2,3,4,1,0
7036,88227,2011-01-16,0,0,0,0,0,0,0,⋯,0,0,0,6,1,1,4,3,0,1
18898,93123,2012-11-03,6,6,6,6,6,6,6,⋯,6,6,6,5,4,3,5,6,4,0
3560,95309,2009-05-04,5,5,5,5,5,5,5,⋯,0,0,1,4,4,4,4,4,3,5
15649,104323,2012-05-01,6,6,6,6,6,6,6,⋯,6,0,6,6,6,0,6,6,6,6
35250,133155,2015-07-16,6,6,6,6,6,6,6,⋯,0,6,0,6,3,3,6,3,4,6


Now change the filter to keep everyone who did not give the maximum answer (6) to all 7 moral questions.

In [17]:
# everyone who did not answer 6 for all 7 moral questions
moral_no_ceiling <- disgust %>%
  filter(moral1+moral2+moral3+moral4+moral5+moral6+moral7 != 6*7)

moral_no_ceiling

id,user_id,date,moral1,moral2,moral3,moral4,moral5,moral6,moral7,⋯,sexual5,sexual6,sexual7,pathogen1,pathogen2,pathogen3,pathogen4,pathogen5,pathogen6,pathogen7
<dbl>,<dbl>,<date>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,⋯,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>
1199,0,2008-10-07,5,6,4,6,5,5,6,⋯,1,4,5,6,1,6,5,4,5,6
1,1,2008-07-10,2,2,1,2,1,1,1,⋯,1,2,2,3,2,3,3,2,3,3
13332,2118,2012-01-02,0,1,1,1,1,2,1,⋯,0,3,5,5,6,4,6,5,5,4
23,2311,2008-07-15,4,4,4,4,4,4,4,⋯,1,1,5,5,5,4,4,5,4,3
7980,4458,2011-09-05,3,4,3,4,4,3,3,⋯,1,5,4,6,4,4,3,3,2,3
552,4651,2008-08-23,2,4,3,5,5,5,3,⋯,6,6,2,5,6,6,4,6,1,6
37829,4976,2016-03-22,6,6,6,0,6,0,0,⋯,0,0,0,6,6,6,6,0,0,6
6902,5469,2010-12-06,0,1,3,4,1,0,1,⋯,6,6,5,5,2,4,4,2,2,6
6158,6066,2010-04-18,4,5,6,5,5,4,4,⋯,3,5,3,6,5,5,5,5,5,5
4850,6093,2009-11-09,1,2,2,2,1,2,1,⋯,0,4,4,4,3,1,1,4,1,3


Sometimes you need to exclude certain participant IDs for reasons that cannot be expressed as a rule in code. The `%in%` operator tests whether a value is in a list. To test whether a value is not in the list, wrap the expression in parentheses and put `!` in front of them.

In [18]:
no_researchers <- disgust %>%
  filter(!(user_id %in% c(1,2)))

no_researchers

id,user_id,date,moral1,moral2,moral3,moral4,moral5,moral6,moral7,⋯,sexual5,sexual6,sexual7,pathogen1,pathogen2,pathogen3,pathogen4,pathogen5,pathogen6,pathogen7
<dbl>,<dbl>,<date>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,⋯,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>
1199,0,2008-10-07,5,6,4,6,5,5,6,⋯,1,4,5,6,1,6,5,4,5,6
13332,2118,2012-01-02,0,1,1,1,1,2,1,⋯,0,3,5,5,6,4,6,5,5,4
23,2311,2008-07-15,4,4,4,4,4,4,4,⋯,1,1,5,5,5,4,4,5,4,3
1160,3630,2008-10-06,1,5,,5,5,5,1,⋯,0,1,0,6,3,1,1,3,1,0
7980,4458,2011-09-05,3,4,3,4,4,3,3,⋯,1,5,4,6,4,4,3,3,2,3
552,4651,2008-08-23,2,4,3,5,5,5,3,⋯,6,6,2,5,6,6,4,6,1,6
37829,4976,2016-03-22,6,6,6,0,6,0,0,⋯,0,0,0,6,6,6,6,0,0,6
6902,5469,2010-12-06,0,1,3,4,1,0,1,⋯,6,6,5,5,2,4,4,2,2,6
6158,6066,2010-04-18,4,5,6,5,5,4,4,⋯,3,5,3,6,5,5,5,5,5,5
4850,6093,2009-11-09,1,2,2,2,1,2,1,⋯,0,4,4,4,3,1,1,4,1,3
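
The negated `%in%` test can be checked on a small table. This is a sketch with made-up data: `participants` and `excluded_ids` are hypothetical names, with ids 1 and 2 standing in for researcher test accounts.

```r
library(dplyr)

# Hypothetical participants; ids 1 and 2 play the role of researcher accounts
participants <- tibble(
  user_id = c(1, 2, 3, 4, 5),
  score   = c(10, 20, 30, 40, 50)
)
excluded_ids <- c(1, 2)

# %in% returns one TRUE/FALSE value per element of user_id
participants$user_id %in% excluded_ids

# Negating the parenthesised test keeps only the ids that are NOT in the list
kept <- participants %>% filter(!(user_id %in% excluded_ids))
kept$user_id
```

Keeping the ids to exclude in a named vector makes the list easy to extend without touching the filter itself.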


#### 5.5.2.1. Dates

The `lubridate` package is very useful for working with dates. As an example, we use the `year()` function to extract just the year from the `date` column and then keep only the data collected in 2010.

In [19]:
disgust2010 <- disgust  %>%
  filter(year(date) == 2010)

disgust2010

id,user_id,date,moral1,moral2,moral3,moral4,moral5,moral6,moral7,⋯,sexual5,sexual6,sexual7,pathogen1,pathogen2,pathogen3,pathogen4,pathogen5,pathogen6,pathogen7
<dbl>,<dbl>,<date>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,⋯,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>
6902,5469,2010-12-06,0,1,3,4,1,0,1,⋯,6,6,5,5,2,4,4,2,2,6
6158,6066,2010-04-18,4,5,6,5,5,4,4,⋯,3,5,3,6,5,5,5,5,5,5
6362,7129,2010-06-09,4,4,4,4,3,3,2,⋯,2,3,6,5,2,0,4,5,5,4
6302,39318,2010-05-20,2,4,1,4,5,6,0,⋯,0,0,1,3,2,3,2,3,2,4
5429,43029,2010-01-02,1,1,1,3,6,4,2,⋯,6,6,6,4,6,6,6,6,6,4
6732,71955,2010-10-15,2,5,3,6,3,2,5,⋯,6,6,5,4,2,6,5,6,6,3
6367,84622,2010-06-13,4,6,6,6,6,6,6,⋯,1,0,0,6,5,6,2,6,5,6
6476,93120,2010-07-12,3,6,4,6,5,3,4,⋯,5,4,3,5,6,4,5,6,2,6
5778,96537,2010-03-05,5,5,3,4,5,5,5,⋯,0,4,3,6,0,1,4,5,1,2
6181,131633,2010-04-23,0,6,4,6,0,6,6,⋯,0,0,6,4,4,0,6,6,6,6


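The next example keeps only the data collected more than 5 years ago by comparing `date` with `today() - dyears(5)`. The `range` function then returns the earliest and latest dates in the filtered data.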

In [21]:
disgust_5ago <- disgust %>%
  filter(date < today() - dyears(5))

range(disgust_5ago$date)

disgust_5ago

id,user_id,date,moral1,moral2,moral3,moral4,moral5,moral6,moral7,⋯,sexual5,sexual6,sexual7,pathogen1,pathogen2,pathogen3,pathogen4,pathogen5,pathogen6,pathogen7
<dbl>,<dbl>,<date>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,⋯,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>
1199,0,2008-10-07,5,6,4,6,5,5,6,⋯,1,4,5,6,1,6,5,4,5,6
1,1,2008-07-10,2,2,1,2,1,1,1,⋯,1,2,2,3,2,3,3,2,3,3
1599,2,2008-10-27,1,1,1,1,,,1,⋯,1,,,,,1,,,,
13332,2118,2012-01-02,0,1,1,1,1,2,1,⋯,0,3,5,5,6,4,6,5,5,4
23,2311,2008-07-15,4,4,4,4,4,4,4,⋯,1,1,5,5,5,4,4,5,4,3
1160,3630,2008-10-06,1,5,,5,5,5,1,⋯,0,1,0,6,3,1,1,3,1,0
7980,4458,2011-09-05,3,4,3,4,4,3,3,⋯,1,5,4,6,4,4,3,3,2,3
552,4651,2008-08-23,2,4,3,5,5,5,3,⋯,6,6,2,5,6,6,4,6,1,6
6902,5469,2010-12-06,0,1,3,4,1,0,1,⋯,6,6,5,5,2,4,4,2,2,6
6158,6066,2010-04-18,4,5,6,5,5,4,4,⋯,3,5,3,6,5,5,5,5,5,5


### 5.5.3. `arrange()`

Sort the rows with `arrange()`. By default, values are sorted in ascending order.

In [22]:
disgust_order <- disgust %>%
  arrange(id)

head(disgust_order)

id,user_id,date,moral1,moral2,moral3,moral4,moral5,moral6,moral7,⋯,sexual5,sexual6,sexual7,pathogen1,pathogen2,pathogen3,pathogen4,pathogen5,pathogen6,pathogen7
<dbl>,<dbl>,<date>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,⋯,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>
1,1,2008-07-10,2,2,1,2,1,1,1,⋯,1,2,2,3,2,3,3,2,3,3
3,155324,2008-07-11,2,4,3,5,2,1,4,⋯,2,6,1,4,3,1,0,4,4,2
4,155366,2008-07-12,6,6,6,3,6,6,6,⋯,0,0,3,4,4,5,5,4,6,0
5,155370,2008-07-12,6,6,4,6,6,6,6,⋯,6,6,6,6,6,6,2,4,4,6
6,155386,2008-07-12,2,4,0,4,0,0,0,⋯,4,4,6,4,5,5,1,6,4,2
7,155409,2008-07-12,4,5,5,4,5,1,5,⋯,2,0,0,5,5,3,4,4,2,6


`desc()` reverses the order.

In [23]:
disgust_order <- disgust %>%
  arrange(desc(id))

head(disgust_order)

id,user_id,date,moral1,moral2,moral3,moral4,moral5,moral6,moral7,⋯,sexual5,sexual6,sexual7,pathogen1,pathogen2,pathogen3,pathogen4,pathogen5,pathogen6,pathogen7
<dbl>,<dbl>,<date>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,⋯,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>
39456,356866,2017-08-21,1,1,1,1,1,1,1,⋯,1,1,1,1,1,1,1,1,1,1
39447,128727,2017-08-13,2,4,1,2,2,5,3,⋯,0,2,1,2,0,2,1,1,1,1
39371,152955,2017-06-13,6,6,3,6,6,6,6,⋯,1,4,4,5,0,5,4,3,6,3
39342,48303,2017-05-22,4,5,4,4,6,4,5,⋯,1,3,1,5,5,4,4,4,4,5
39159,151633,2017-04-04,4,5,6,5,3,6,2,⋯,0,3,6,4,4,6,6,6,6,4
38942,370464,2017-02-01,1,5,0,6,5,5,5,⋯,0,0,0,5,0,3,3,1,6,3


### 5.5.4. `mutate()`

Adds new columns. It is probably one of the most useful functions in the _tidyverse_.

Refer to columns by name, without quotes. You can add more than one column by separating the definitions with commas. Once a column has been created, you can use it in later column definitions. Note the `total` column in the example below:

In [27]:
disgust_total <- disgust %>%
  mutate(
    pathogen = pathogen1 + pathogen2 + pathogen3 + pathogen4 + pathogen5 + pathogen6 + pathogen7,
    moral = moral1 + moral2 + moral3 + moral4 + moral5 + moral6 + moral7,
    sexual = sexual1 + sexual2 + sexual3 + sexual4 + sexual5 + sexual6 + sexual7,
    total = pathogen + moral + sexual,
    user_id = paste0("U", user_id)
  )

disgust_total

id,user_id,date,moral1,moral2,moral3,moral4,moral5,moral6,moral7,⋯,pathogen2,pathogen3,pathogen4,pathogen5,pathogen6,pathogen7,pathogen,moral,sexual,total
<dbl>,<chr>,<date>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,⋯,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>
1199,U0,2008-10-07,5,6,4,6,5,5,6,⋯,1,6,5,4,5,6,33,37,15,85
1,U1,2008-07-10,2,2,1,2,1,1,1,⋯,2,3,3,2,3,3,19,10,12,41
1599,U2,2008-10-27,1,1,1,1,,,1,⋯,,1,,,,,,,,
13332,U2118,2012-01-02,0,1,1,1,1,2,1,⋯,6,4,6,5,5,4,35,7,21,63
23,U2311,2008-07-15,4,4,4,4,4,4,4,⋯,5,4,4,5,4,3,30,28,13,71
1160,U3630,2008-10-06,1,5,,5,5,5,1,⋯,3,1,1,3,1,0,15,,8,
7980,U4458,2011-09-05,3,4,3,4,4,3,3,⋯,4,4,3,3,2,3,25,24,21,70
552,U4651,2008-08-23,2,4,3,5,5,5,3,⋯,6,6,4,6,1,6,34,27,30,91
37829,U4976,2016-03-22,6,6,6,0,6,0,0,⋯,6,6,6,0,0,6,30,24,0,54
6902,U5469,2010-12-06,0,1,3,4,1,0,1,⋯,2,4,4,2,2,6,25,10,31,66


### 5.5.5. `summarise()`

Creates summary statistics for the dataset. The cheat sheets can help you look up the available functions: [Data Wrangling](https://www.rstudio.org/links/data_wrangling_cheat_sheet) and [Data Transformation](https://github.com/rstudio/cheatsheets/raw/master/source/pdfs/data-transformation-cheatsheet.pdf). Some of the most commonly used functions are `mean()`, `sd()`, `n()`, `sum()`, and `quantile()`.

In [29]:
disgust_total %>%
  summarise(
    n = n(),
    q25 = quantile(total, .25, na.rm = TRUE),
    q50 = quantile(total, .50, na.rm = TRUE),
    q75 = quantile(total, .75, na.rm = TRUE),
    avg_total = mean(total, na.rm = TRUE),
    sd_total  = sd(total, na.rm = TRUE),
    min_total = min(total, na.rm = TRUE),
    max_total = max(total, na.rm = TRUE)
  )


n,q25,q50,q75,avg_total,sd_total,min_total,max_total
<int>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>
20000,59,71,83,70.6868,18.24253,0,126


### 5.5.6. `group_by()`

Creates subsets of the data. You can use it to compute summaries, such as the mean, for each group in an experimental dataset.

Here we use `mutate` to create a new column called `year`, group by `year`, and calculate the mean scores.

In [30]:
disgust_total %>%
  mutate(year = year(date)) %>%
  group_by(year) %>%
  summarise(
    n = n(),
    avg_total = mean(total, na.rm = TRUE),
    sd_total  = sd(total, na.rm = TRUE),
    min_total = min(total, na.rm = TRUE),
    max_total = max(total, na.rm = TRUE)
  )

year,n,avg_total,sd_total,min_total,max_total
<dbl>,<int>,<dbl>,<dbl>,<dbl>,<dbl>
2008,2578,70.29975,18.46251,0,126
2009,2580,69.74481,18.61959,3,126
2010,1514,70.59238,18.86846,6,126
2011,6046,71.34425,17.79446,0,126
2012,5938,70.4253,18.35782,0,126
2013,1251,71.59574,17.61375,0,126
2014,58,70.46296,17.23502,19,113
2015,21,74.26316,16.89787,43,107
2016,8,67.875,32.62531,0,110
2017,6,57.16667,27.93862,21,90


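You can use `filter` after `group_by`. The following example returns the lowest total score in each year. Years in which two or more participants tie for the lowest score (2008, 2011, and 2012 here) are missing from the output, because `rank()` gives tied values their average rank, so none of them has rank 1.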

In [31]:
disgust_total %>%
  mutate(year = year(date)) %>%
  select(user_id, year, total) %>%
  group_by(year) %>%
  filter(rank(total) == 1) %>%
  arrange(year)

user_id,year,total
<chr>,<dbl>,<dbl>
U236585,2009,3
U292359,2010,6
U245384,2013,0
U206293,2014,19
U407089,2015,43
U453237,2016,0
U356866,2017,21
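
The way `rank()` treats ties can be seen in a small sketch. The `scores` table below is hypothetical. `min_rank()` is a dplyr ranking function that gives every tied minimum the rank 1, so it is one possible alternative when tied rows should be kept.

```r
library(dplyr)

# Hypothetical scores: 2008 has two participants tied for the lowest total
scores <- tibble(
  year  = c(2008, 2008, 2008, 2009, 2009),
  total = c(0, 0, 5, 3, 7)
)

# rank() averages tied ranks: both zeros in 2008 get rank 1.5,
# so no 2008 row satisfies rank(total) == 1
with_rank <- scores %>%
  group_by(year) %>%
  filter(rank(total) == 1) %>%
  ungroup()

# min_rank() assigns rank 1 to every tied minimum, so both 2008 rows are kept
with_min_rank <- scores %>%
  group_by(year) %>%
  filter(min_rank(total) == 1) %>%
  ungroup()

nrow(with_rank)      # 1 (only the 2009 minimum)
nrow(with_min_rank)  # 3 (two tied rows for 2008, one row for 2009)
```

Which behaviour is preferable depends on the analysis: `rank(total) == 1` silently drops years with tied minimums, while `min_rank(total) == 1` can return several rows per year.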


You can also use `mutate` after `group_by`. The following example calculates _subject-mean-centered_ scores by grouping the scores by `user_id` and then subtracting the group mean from each score. Note the use of `gather` to reshape the data into long format first.

In [33]:
disgust_smc <- disgust %>%
  gather("question", "score", moral1:pathogen7) %>%
  group_by(user_id) %>%
  mutate(score_smc = score - mean(score, na.rm = TRUE))

disgust_smc

id,user_id,date,question,score,score_smc
<dbl>,<dbl>,<date>,<chr>,<dbl>,<dbl>
1199,0,2008-10-07,moral1,5,0.95238095
1,1,2008-07-10,moral1,2,0.04761905
1599,2,2008-10-27,moral1,1,0.00000000
13332,2118,2012-01-02,moral1,0,-3.00000000
23,2311,2008-07-15,moral1,4,0.61904762
1160,3630,2008-10-06,moral1,1,-1.25000000
7980,4458,2011-09-05,moral1,3,-0.33333333
552,4651,2008-08-23,moral1,2,-2.33333333
37829,4976,2016-03-22,moral1,6,3.42857143
6902,5469,2010-12-06,moral1,0,-3.14285714


### 5.5.7. Putting it all together

Much of what we did in this lesson would be easier if the data were tidy, so let's start there. Then we use `group_by` to calculate the domain scores.

It is good practice to use `ungroup()` after `group_by` and `summarise`. Forgetting to ungroup the dataset does not necessarily affect the next steps of the script, but it can produce unexpected results in later operations that respect the grouping.
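
To illustrate why ungrouping matters, here is a small sketch with a hypothetical toy table. The same `mutate()` expression gives different results depending on whether the data are still grouped.

```r
library(dplyr)

# Hypothetical toy data with two groups
toy <- tibble(
  group = c("a", "a", "b", "b"),
  score = c(1, 3, 10, 20)
)

# Still grouped: mean(score) is computed within each group
grouped <- toy %>%
  group_by(group) %>%
  mutate(centered = score - mean(score))

# After ungroup(): mean(score) is the mean of the whole column
ungrouped <- grouped %>%
  ungroup() %>%
  mutate(centered_all = score - mean(score))

grouped$centered        # -1  1 -5  5
ungrouped$centered_all  # -7.5 -5.5  1.5 11.5
```

If the grouped result was not intended, nothing in the output warns about it, which is why an explicit `ungroup()` at the end of a grouped step is a reasonable habit.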

Now we spread the three domains into columns, calculate the total score, remove all rows with a missing (`NA`) total, and calculate the means by year.

In [36]:
disgust_tidy <- read_csv("https://psyteachr.github.io/msc-data-skills/data/disgust.csv") %>%
  gather("question", "score", moral1:pathogen7) %>%
  separate(question, c("domain","q_num"), sep = -1) %>%
  group_by(id, user_id, date, domain) %>%
  summarise(score = mean(score)) %>%
  ungroup() 

disgust_tidy

Parsed with column specification:
cols(
  .default = col_double(),
  date = col_date(format = "")
)

See spec(...) for full column specifications.



id,user_id,date,domain,score
<dbl>,<dbl>,<date>,<chr>,<dbl>
1,1,2008-07-10,moral,1.4285714
1,1,2008-07-10,pathogen,2.7142857
1,1,2008-07-10,sexual,1.7142857
3,155324,2008-07-11,moral,3.0000000
3,155324,2008-07-11,pathogen,2.5714286
3,155324,2008-07-11,sexual,1.8571429
4,155366,2008-07-12,moral,5.5714286
4,155366,2008-07-12,pathogen,4.0000000
4,155366,2008-07-12,sexual,0.4285714
5,155370,2008-07-12,moral,5.7142857


In [37]:
disgust_tidy2 <- disgust_tidy %>%
  spread(domain, score) %>%
  mutate(
    total = moral + sexual + pathogen,
    year = year(date)
  ) %>%
  filter(!is.na(total)) %>%
  arrange(user_id) 

disgust_tidy2

id,user_id,date,moral,pathogen,sexual,total,year
<dbl>,<dbl>,<date>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>
1199,0,2008-10-07,5.2857143,4.714286,2.1428571,12.142857,2008
1,1,2008-07-10,1.4285714,2.714286,1.7142857,5.857143,2008
13332,2118,2012-01-02,1.0000000,5.000000,3.0000000,9.000000,2012
23,2311,2008-07-15,4.0000000,4.285714,1.8571429,10.142857,2008
7980,4458,2011-09-05,3.4285714,3.571429,3.0000000,10.000000,2011
552,4651,2008-08-23,3.8571429,4.857143,4.2857143,13.000000,2008
37829,4976,2016-03-22,3.4285714,4.285714,0.0000000,7.714286,2016
6902,5469,2010-12-06,1.4285714,3.571429,4.4285714,9.428571,2010
6158,6066,2010-04-18,4.7142857,5.142857,3.0000000,12.857143,2010
4850,6093,2009-11-09,1.5714286,2.428571,1.7142857,5.714286,2009


In [38]:
disgust_tidy3 <- disgust_tidy2 %>%
  group_by(year) %>%
  summarise(
    n = n(),
    avg_pathogen = mean(pathogen),
    avg_moral = mean(moral),
    avg_sexual = mean(sexual),
    first_user = first(user_id),
    last_user = last(user_id)
  )

disgust_tidy3

year,n,avg_pathogen,avg_moral,avg_sexual,first_user,last_user
<dbl>,<int>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>
2008,2392,3.697265,3.806259,2.539298,0,188708
2009,2410,3.674333,3.760937,2.528275,6093,251959
2010,1418,3.731412,3.843139,2.510075,5469,319641
2011,5586,3.756918,3.806506,2.628612,4458,406569
2012,5375,3.740465,3.774591,2.545701,2118,458194
2013,1222,3.77192,3.906944,2.5491,7646,462428
2014,54,3.759259,4.0,2.306878,11090,461307
2015,19,3.781955,4.451128,2.37594,102699,460283
2016,8,3.696429,3.625,2.375,4976,453237
2017,6,3.071429,3.690476,1.404762,48303,370464


## 5.6. Other one-table dplyr verbs

Use the following examples and the built-in help to work out what each function does. Most have self-explanatory names.

### 5.6.1. `rename()`

In [39]:
iris_underscore <- iris %>%
  rename(sepal_length = Sepal.Length,
         sepal_width = Sepal.Width,
         petal_length = Petal.Length,
         petal_width = Petal.Width)

names(iris_underscore)

### 5.6.2. `distinct()`

In [40]:
# create a data table with duplicated values
dupes <- tibble(
  id = rep(1:5, 2),
  dv = rep(LETTERS[1:5], 2)
)

distinct(dupes)

id,dv
<int>,<chr>
1,A
2,B
3,C
4,D
5,E


### 5.6.3. `count()`

In [41]:
# how many observations from each species are in iris?
count(iris, Species)

Species,n
<fct>,<int>
setosa,50
versicolor,50
virginica,50


### 5.6.4. `slice()`

In [42]:
tibble(
  id = 1:10,
  condition = rep(c("A","B"), 5)
) %>%
  slice(3:6, 9)

id,condition
<int>,<chr>
3,A
4,B
5,A
6,B
9,A


### 5.6.5. `pull()`

In [43]:
iris %>%
  group_by(Species) %>%
  summarise_all(mean) %>%
  pull(Sepal.Length)
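
As a brief sketch of what `pull()` returns, the example below compares it with `select()` on the built-in `iris` data. The names `means`, `mean_sl`, `as_vector`, and `as_table` are made up for this illustration.

```r
library(dplyr)

# Mean sepal length per species
means <- iris %>%
  group_by(Species) %>%
  summarise(mean_sl = mean(Sepal.Length))

# pull() extracts the column as a plain vector
as_vector <- pull(means, mean_sl)

# select() keeps the column inside a table
as_table <- select(means, mean_sl)

is.numeric(as_vector)    # TRUE
is.data.frame(as_table)  # TRUE
```

A plain vector is convenient when the values are passed to a function that does not accept a data frame.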

## 5.7. _Window functions_

_Window functions_ use the order of the rows to calculate a value. You can use them for anything that requires ranking or ordering, such as picking the top scores in each class.

### 5.7.1. Ranking functions

In [44]:
tibble(
  id = 1:5,
  "Data Skills" = c(16, 17, 17, 19, 20), 
  "Statistics"  = c(14, 16, 18, 18, 19)
) %>%
  gather(class, grade, 2:3) %>%
  group_by(class) %>%
  mutate(row_number = row_number(),
         rank       = rank(grade),
         min_rank   = min_rank(grade),
         dense_rank = dense_rank(grade),
         quartile   = ntile(grade, 4),
         percentile = ntile(grade, 100))

id,class,grade,row_number,rank,min_rank,dense_rank,quartile,percentile
<int>,<chr>,<dbl>,<int>,<dbl>,<int>,<int>,<int>,<int>
1,Data Skills,16,1,1.0,1,1,1,1
2,Data Skills,17,2,2.5,2,2,1,21
3,Data Skills,17,3,2.5,2,2,2,41
4,Data Skills,19,4,4.0,4,3,3,61
5,Data Skills,20,5,5.0,5,4,4,81
1,Statistics,14,1,1.0,1,1,1,1
2,Statistics,16,2,2.0,2,2,1,21
3,Statistics,18,3,3.5,3,3,2,41
4,Statistics,18,4,3.5,3,3,3,61
5,Statistics,19,5,5.0,5,4,4,81


_Window functions_ can also be used to group the data into quantiles.

In [45]:
iris %>%
  group_by(tertile = ntile(Sepal.Length, 3)) %>%
  summarise(mean.Sepal.Length = mean(Sepal.Length))

tertile,mean.Sepal.Length
<int>,<dbl>
1,4.936
2,5.81
3,6.784


### 5.7.2. Offset

In [46]:
tibble(
  trial = 1:10,
  cond = rep(c("exp", "ctrl"), c(6, 4)),
  score = rpois(10, 4)
) %>%
  mutate(
    score_change = score - lag(score, order_by = trial),
    last_cond_trial = cond != lead(cond, default = TRUE)
  )

trial,cond,score,score_change,last_cond_trial
<int>,<chr>,<int>,<int>,<lgl>
1,exp,2,,False
2,exp,4,2.0,False
3,exp,5,1.0,False
4,exp,5,0.0,False
5,exp,3,-2.0,False
6,exp,5,2.0,True
7,ctrl,9,4.0,False
8,ctrl,6,-3.0,False
9,ctrl,6,0.0,False
10,ctrl,4,-2.0,True
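
As a rough illustration of how the offset functions shift values, the sketch below applies `lag()` and `lead()` to a small hypothetical vector `x`.

```r
library(dplyr)

# Hypothetical scores from four consecutive trials
x <- c(2, 4, 5, 3)

lag(x)                # NA 2 4 5   (each value is replaced by the previous one)
lead(x)               # 4 5 3 NA   (each value is replaced by the next one)
lead(x, default = 0)  # 4 5 3 0    (default fills the position with no neighbour)

# Change from the previous trial; the first trial has nothing to compare with
change <- x - lag(x)
change                # NA 2 1 -2
```

The `NA` at the edge is why the first row of `score_change` in the table above is empty.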


### 5.7.3. Cumulative aggregates

`cumsum()`, `cummin()`, and `cummax()` are base R functions for calculating the cumulative sum, minimum, and maximum. The dplyr package adds `cumany()` and `cumall()`, which return `TRUE` if any or all, respectively, of the values up to that point meet the criterion.

In [47]:
tibble(
  time = 1:10,
  obs = c(1, 0, 1, 2, 4, 3, 1, 0, 3, 5)
) %>%
  mutate(
    cumsum = cumsum(obs),
    cummin = cummin(obs),
    cummax = cummax(obs),
    cumany = cumany(obs == 3),
    cumall = cumall(obs < 4)
  )

time,obs,cumsum,cummin,cummax,cumany,cumall
<int>,<dbl>,<dbl>,<dbl>,<dbl>,<lgl>,<lgl>
1,1,1,1,1,False,True
2,0,1,0,1,False,True
3,1,2,0,1,False,True
4,2,4,0,2,False,True
5,4,8,0,4,False,False
6,3,11,0,4,True,False
7,1,12,0,4,True,False
8,0,12,0,4,True,False
9,3,15,0,4,True,False
10,5,20,0,5,True,False
