Visualization is an important tool for generating insight, but you rarely get the data in exactly the form you need. You will often need to create new variables, rename variables, or reorder observations to make the data easier to work with.





<!-- ```{r} -->
<!-- babynames <- read_csv("/Users/Zhichao/Dropbox/courses/UMass/R\ course/lectures/data\ transformation/babynames.csv") -->
<!-- ``` -->

# Data transformation

The `babynames` data set from the `babynames` package includes every baby name with at least 5 uses.

**What geoms should be used for this graph?**

<img src="./figures/transformation/propgarret.jpg" alt="ds" style="width: 750px;"/>

We will learn the five key `dplyr` functions that allow you to solve the vast majority of data transformation challenges:

* Pick observations by their values (`filter()`).

* Reorder the rows (`arrange()`).

* Pick variables by their names (`select()`).

* Create new variables with functions of existing variables (`mutate()`).

* Collapse many values down to a single summary (`summarize()`).

These can all be used in conjunction with `group_by()`, which changes the scope of each function from operating on the entire dataset to operating on it group by group. These six functions provide the verbs for a language of data transformation.

All verbs work similarly:

1. The first argument is a data frame (tibble).

2. The subsequent arguments describe what to do with the data frame, using the variable names (without quotes).

3. The result is a new data frame.

Together these properties make it easy to chain together multiple simple steps to achieve a complex result. 
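As a minimal sketch of these three properties, the example below uses a tiny made-up tibble (`toy` and its values are hypothetical, not part of `babynames`):

```r
library(dplyr)

# A tiny hypothetical tibble, used only for illustration
toy <- tibble(name = c("Ann", "Bob", "Cy"), n = c(12, 7, 30))

# First argument: the data frame; later arguments: unquoted column names
big <- filter(toy, n > 10)

# The result is a new tibble, so it can be handed straight to another verb
result <- arrange(big, desc(n))
result
```

Because every verb takes a data frame and returns a data frame, the output of one step is always a valid input for the next.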

## `select()`
<img src="./figures/transformation/select.jpg" alt="ds" style="width: 750px;"/>

```r
select(babynames,name,prop)
```


## Select helpers

* Use `:` to select a range of columns:
```r
select(babynames, name:prop)
```

* Use `-` to select every column except the ones named:
```r
select(babynames, -c(name, prop))
```

* Use `starts_with()` to select columns whose names start with a given string:
```r
select(babynames, starts_with("n"))
```

* Use `ends_with()` to select columns whose names end with a given string:
```r
select(babynames, ends_with("e"))
```

* Use `contains()` to select columns whose names contain a given string:
```r
select(babynames, contains("e"))
```

* Use `num_range()` to select columns named with a prefix followed by a number:
```r
# `xxxx` is a placeholder for a data frame with columns x1, x2, ...
select(xxxx, num_range("x", 1:5))
```

| X2 | X3 | X4 | X5 |
|:--|:--|:--|:--|
| `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` |
| 2 | 3 | 4 | 5 |
| 7 | 8 | 9 | 10 |
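Since `babynames` has no numbered columns, here is a small self-contained sketch of `num_range()`. The tibble `wide` and its column names are made up for illustration:

```r
library(dplyr)

# Hypothetical tibble whose columns follow a prefix + number pattern
wide <- tibble(id = c("a", "b"), x1 = 1:2, x2 = 3:4, x3 = 5:6, x4 = 7:8)

# Keeps x2 and x3; the id, x1 and x4 columns are dropped
picked <- select(wide, num_range("x", 2:3))
names(picked)
```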


## `$` and `select()`

`$` extracts column contents as a vector. `select()` extracts column contents as a tibble.
```r
select(babynames, n)
babynames$n
```
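The difference is easy to check on a small example. The tibble `toy` below is hypothetical and only serves to show the two return types:

```r
library(dplyr)

# Hypothetical two-row tibble
toy <- tibble(name = c("Ann", "Bob"), n = c(12L, 7L))

as_column <- select(toy, n)  # still a tibble, with a single column
as_vector <- toy$n           # a plain integer vector

is.data.frame(as_column)
is.data.frame(as_vector)
```

Use `select()` when the next step expects a data frame (for example another dplyr verb), and `$` when a function such as `mean()` needs a bare vector.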

## `filter()`

`filter()` allows you to subset observations based on their values. The first argument is the name of the data frame. The second and subsequent arguments are the expressions that filter the data frame. 
<img src="./figures/transformation/filter.jpg" alt="ds" style="width: 750px;"/>


<img src="./figures/transformation/logical.jpg" alt="ds" style="width: 750px;"/>


```r
filter(babynames, name == "Garret")
```


The `slice()` function chooses rows by their ordinal position, given as a vector of row numbers.

```r
slice(babynames, 2:4)
```


### Missing values
One important feature of R that can make comparison tricky is missing values, or `NA`s ("not availables"). `NA` represents an unknown value, so missing values are "contagious": almost any operation involving an unknown value will also be unknown.

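```r
NA > 5      # NA
NA + 10     # NA
NA == NA    # NA: two unknown values may or may not be equal

NA | FALSE  # NA: the result depends on the unknown value
NA & FALSE  # FALSE: anything AND FALSE is FALSE
NA * 0      # NA
Inf * 0     # NaN
```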

To determine if a value is missing, use `is.na()`; for example, `is.na(1)` returns `FALSE`.

`filter()` only includes rows where the condition is `TRUE`; it excludes both `FALSE` and `NA` values. If you want to preserve missing values, ask for them explicitly:
```r
df <- tibble(x = c(1, NA, 3))
filter(df, x > 1)
filter(df, is.na(x) | x > 1)
```


### Your turn

* Use `filter()`, `babynames`, and the logical operators to find:
  + All of the rows where `prop` is greater than or equal to 0.08
  + All of the children named “Sea”




###  Boolean operators

```r
filter(babynames, name == "Garrett", year == 1880)
```


<!-- ![](fig/boolean.jpg) -->

```r
filter(babynames, name == "Garrett" & year == 1880)
```
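Conditions separated by commas are combined with "and", so the two calls above should return the same rows. A minimal sketch on a made-up tibble (`toy` is hypothetical), which also shows `|` for "or":

```r
library(dplyr)

# Hypothetical data, small enough to check by eye
toy <- tibble(name = c("Ann", "Ann", "Bob"), year = c(1880, 1881, 1880))

with_comma <- filter(toy, name == "Ann", year == 1880)
with_and   <- filter(toy, name == "Ann" & year == 1880)

# Both forms keep only the first row
all.equal(with_comma, with_and)

# `|` keeps rows where at least one condition holds
either <- filter(toy, name == "Bob" | year == 1881)
either
```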

**Common mistakes**

Collapsing multiple tests into one:
```r
filter(babynames, 10 < n < 20)     # wrong: evaluated as (10 < n) < 20, which is always TRUE
filter(babynames, 10 < n, n < 20)  # right: one test per argument
```

Stringing together many tests (when you could use `%in%`):

```r
filter(babynames, n == 5 | n == 6 | n == 7 | n == 8)  # works, but verbose
filter(babynames, n %in% c(5, 6, 7, 8))               # equivalent and shorter
```


### Your turn
* Use Boolean operators to find the rows that contain:
  + Boys named Sue
  + Names that were used by exactly 5 or 6 children in 1880
  + Names that are one of Acura, Lexus, or Yugo




When you run a line of code such as `filter(babynames, name == "Garrett")`, dplyr executes the filtering operation and returns a new data frame. dplyr functions never modify their inputs, so if you want to save the result, you need to use the assignment operator, `<-`. R either prints the result or saves it to a variable.
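A short sketch of the difference between printing and saving, using a hypothetical tibble `toy`:

```r
library(dplyr)

# Hypothetical data for illustration
toy <- tibble(name = c("Ann", "Bob", "Cy"), n = c(12, 7, 30))

filter(toy, n > 10)              # result is printed, not saved

popular <- filter(toy, n > 10)   # result is saved, not printed

(popular2 <- filter(toy, n > 10))  # wrapping in parentheses saves and prints

nrow(toy)  # the input is untouched and still has all of its rows
```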

##  `arrange()`
<img src="./figures/transformation/arrange.jpg" alt="ds" style="width: 750px;"/>

```r
arrange(babynames,n)
```
Use `desc()` to re-order by a column in descending order.
```r
arrange(babynames,desc(n))
```
Missing values are always sorted at the end:

```r
df <- tibble(x = c(5, 2, NA))
arrange(df, x)
arrange(df, desc(x))
```


##  Combining multiple operations with the pipe `%>%`



```r
babynames %>% filter(n == 1000)  # the pipe passes babynames as the first argument
filter(babynames, n == 1000)     # the equivalent call without the pipe
```

<img src="./figures/transformation/shortcutpipe.jpg" alt="ds" style="width: 750px;"/>

1. Filter `babynames` to just boys born in 2015

2. Select the `name` and `n` columns from the result

3. Arrange the rows so that the most popular names appear near the top.

```r
arrange(select(filter(babynames, year == 2015, 
  sex == "M"), name, n), desc(n))
```


```r
boys_2015 <- filter(babynames, year == 2015, sex == "M")
boys_2015 <- select(boys_2015, name, n)
boys_2015 <- arrange(boys_2015, desc(n))
```


```r
babynames %>%
  filter(year == 2015, sex == "M") %>%
  select(name, n) %>%
  arrange(desc(n))
```

You can pipe a data frame into `ggplot()`, but once inside the plot you add layers with `+` instead of `%>%`:

```r
babynames %>%
  filter(sex == "F", name == "Michael") %>%
  ggplot() +
  geom_line(aes(x = year, y = prop))
```

### Plotting groups

```r
babynames %>% 
  filter(name == "Michael") %>%
  ggplot() +
    geom_point(mapping = aes(year, prop))
```

```r
babynames %>% 
  filter(name == "Michael") %>%
  ggplot() +
    geom_line(mapping = aes(year, prop, group = sex))
```

```r
babynames %>% filter(name == "Michael") %>%
  ggplot() +
    geom_line(mapping = aes(year, prop)) + 
    facet_wrap(~ sex, scales = "free_y")
```


### Your turn
* Use `%>%` to write a sequence of functions that:
  + Filters babynames to the girls that were born in 2017, then…
  + Selects the `name` and `n` columns, then…
  + Arranges the results so that the most popular names are near the top.

<!-- ```{r eval=FALSE, echo=FALSE} -->
<!--babynames %>%  -->
<!--  filter(year == 2017, sex == "F") %>%  -->
<!-- select(name, n) %>%  -->
<!-- arrange(desc(n)) -->
<!-- ``` -->



* Use `%>%` to write a sequence of functions that:
  + Trims `babynames` to just the rows for one `name` and `sex` of your choice
  + Trims the result to just the columns that will appear in your graph (not strictly necessary, but useful practice)
  + Plots the results as a line graph with `year` on the x axis and `prop` on the y axis


<!--```{r eval = FALSE, echo = FALSE}-->
<!--babynames %>% -->
<!--  filter(name == "Garrett", sex == "M") %>%-->
<!--  select(year, prop) %>%-->
<!--  ggplot() +-->
<!--    geom_line(mapping = aes(year, prop))-->
<!--```-->

## `summarize()`
```r
babynames %>% summarize(total = sum(n), max = max(n))
```
The `summarize()` function creates one or more scalar variables summarizing the variables of a tibble.

`n()` returns the number of rows in a dataset (or in each group, once the data are grouped).

`n_distinct()` returns the number of distinct values in a variable.

```r
babynames %>% summarize(n = n(), nname = n_distinct(name))
```
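On a data set small enough to count by hand, the same helpers can be checked directly. The tibble `toy` below is made up for illustration:

```r
library(dplyr)

# Hypothetical data: three rows, two distinct names
toy <- tibble(name = c("Ann", "Ann", "Bob"), n = c(5, 8, 6))

totals <- toy %>%
  summarize(
    rows = n(),                        # number of rows
    distinct_names = n_distinct(name), # number of different names
    total = sum(n)                     # sum of the n column
  )
totals
```

Note that `n` (the column) and `n()` (the function) can be used side by side: R looks up a function when the name is followed by parentheses.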

### Your turn
Extract the rows where `name == "Khaleesi"`. Then use `summarize()` and `sum()` and `min()` to find:

1. The total number of children named Khaleesi
2. The first year Khaleesi appeared in the data

<!-- ```{r echo=FALSE, eval=FALSE}
babynames %>% filter(name == "Khaleesi") %>% 
  summarize(total = sum(n), first = min(year))
```-->



## Grouping

Can we repeat the previous exercise for every name at once?

Let's work with another dataset to learn `group_by()`.

```r
pollution <- tribble(
       ~city,   ~size, ~amount, 
  "New York", "large",      23,
  "New York", "small",      14,
    "London", "large",      22,
    "London", "small",      16,
   "Beijing", "large",      121,
   "Beijing", "small",      56
)
```

```r
pollution %>% 
 summarize(mean = mean(amount), sum = sum(amount), n = n())
```

<!-- <img src="./figures/transformation/groupby.jpg" alt="ds" style="width: 750px;"/> -->


```r
pollution %>% 
  group_by(city) %>%
  summarize(mean = mean(amount), sum = sum(amount), n = n())
```

<img src="./figures/transformation/groupsummarize.jpg" alt="ds" style="width: 750px;"/>

```r
pollution %>% 
  group_by(city, size) %>%
  summarize(mean = mean(amount), sum = sum(amount), n = n())
```

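```r
# `weather` is assumed here to be the hourly weather data from the nycflights13 package
weather %>% 
  group_by(month) %>% 
  summarize(mean = mean(temp, na.rm = TRUE), 
            std_dev = sd(temp, na.rm = TRUE))
```

```r
# `diamonds` ships with ggplot2
diamonds %>% 
  group_by(cut) %>% 
  summarize(avg_price = mean(price))
```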

```r
babynames %>% 
  group_by(sex) %>%
  summarize(total = sum(n))
```


Use `ungroup()` to remove the grouping from a data frame.

```r
babynames %>% 
  group_by(sex) %>%
  ungroup() %>%
  summarize(total = sum(n))
```
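One way to see what `group_by()` and `ungroup()` change is to inspect the grouping variables with `group_vars()`. The sketch below uses a hypothetical tibble `toy`:

```r
library(dplyr)

# Hypothetical data for illustration
toy <- tibble(sex = c("F", "F", "M"), n = c(5, 8, 6))

grouped <- group_by(toy, sex)
group_vars(grouped)           # the data are grouped by sex

ungrouped <- ungroup(grouped)
group_vars(ungrouped)         # no grouping variables remain

# After ungroup(), summarize() collapses the whole table to one row again
overall <- summarize(ungrouped, total = sum(n))
overall
```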

### Your turn 

* Display the ten most popular name and sex combinations. Compute popularity as the total number of children with a given name and sex.

<!-- ```{r eval=FALSE,echo=FALSE}
babynames %>% 
  group_by(name,sex) %>% 
  summarize(popularity = sum(n)) %>% 
  arrange(desc(popularity))
```  -->


*  Use `group_by()` to calculate the total number of children born for every year. 
Plot the results as a line graph: year vs. total.
<!-- ```{r eval=FALSE,echo =FALSE}
babynames %>% 
  group_by(year) %>% 
  summarize(total = sum(n)) %>% 
  ggplot() +
  geom_line(aes(x=year,y=total))
``` -->

##  `mutate()`

**What was the top ranked name for each year?**


We can obtain the rank using `min_rank()`:
```r
babynames %>% 
  mutate(rank = min_rank(desc(prop)))
```

```r
weather <- weather %>% 
 mutate(temp_in_C = (temp - 32) / 1.8)

weather %>% 
  group_by(month) %>% 
  summarize(mean_temp_in_F = mean(temp, na.rm = TRUE), 
            mean_temp_in_C = mean(temp_in_C, na.rm = TRUE))

```

If you only want to keep the new variables, use `transmute()`.
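A minimal side-by-side sketch of the two verbs, using a made-up tibble `temps` rather than the `weather` data:

```r
library(dplyr)

# Hypothetical temperatures in degrees Fahrenheit
temps <- tibble(day = 1:3, temp = c(32, 50, 212))

# mutate() keeps the existing columns and adds the new one
kept_all <- mutate(temps, temp_in_C = (temp - 32) / 1.8)
names(kept_all)

# transmute() returns only the columns named in the call
only_new <- transmute(temps, temp_in_C = (temp - 32) / 1.8)
names(only_new)
```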


### Your turn 
* Group babynames by `year` and then re-rank the data. Filter the results to just rows where `rank == 1`.

<!-- ```{r echo=FALSE,eval=FALSE}
babynames %>% 
  group_by(year) %>% 
  mutate(rank = min_rank(desc(prop))) %>% 
  filter(rank == 1)
```  -->



### Useful creation functions
There are many functions for creating new variables that you can use with `mutate()`. The key property is that the function must be vectorised: it must take a vector of values as input and return a vector with the same number of values as output. Here is a selection of functions that are frequently useful:

* Arithmetic operators: `+`, `-`, `*`, `/`, `^`. These are all vectorised, using the so-called "recycling rules." If one argument is shorter than the other, it will be automatically extended to be the same length. This is most useful when one of the arguments is a single number: `air_time / 60`, `hours * 60 + minute`, etc.

* Modular arithmetic: `%/%` (integer division) and `%%` (remainder). Modular arithmetic is a handy tool because it allows you to break integers up into pieces. For example, in the `flights` dataset, you can compute hour and minute from `air_time` with:
```r
transmute(flights,
  air_time,
  hour = air_time %/% 60,
  minute = air_time %% 60
)
```

* Offsets: `lead()` and `lag()` allow you to refer to leading or lagging values. This allows you to compute running differences (e.g. `x - lag(x)`) or find when values change (`x != lag(x)`). 
```r
x <- 1:10
lead(x)
lag(x)
```
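The running difference and change detection mentioned above can be sketched on a short made-up vector (the values of `v` are hypothetical):

```r
library(dplyr)

v <- c(1, 3, 3, 7)

# Difference from the previous value; the first element has no predecessor
diffs <- v - lag(v)
diffs      # NA 2 0 4

# TRUE wherever the value differs from the one before it
changed <- v != lag(v)
changed    # NA TRUE FALSE TRUE
```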

* Cumulative and rolling aggregates: R provides functions for running sums, products, mins and maxes: `cumsum()`, `cumprod()`, `cummin()`, `cummax()`; and dplyr provides `cummean()` for cumulative means. 

```r
cumsum(x)
```
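For comparison, here are all five cumulative functions applied to one short made-up vector (`w` is hypothetical):

```r
library(dplyr)

w <- c(2, 4, 1, 5)

cumsum(w)   # 2 6 7 12
cumprod(w)  # 2 8 8 40
cummin(w)   # 2 2 1 1
cummax(w)   # 2 4 4 5
cummean(w)  # mean of the first 1, 2, 3, 4 values
```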

* Logical comparisons, `<`, `<=`, `>`, `>=`, `!=`, and `==`, which you learned about earlier. If you are doing a complex sequence of logical operations it is often a good idea to store the interim values in new variables so you can check that each step is working as expected.

* Ranking: there are a number of ranking functions, but you should start with `min_rank()`. It does the most usual type of ranking (e.g. 1st, 2nd, 2nd, 4th). The default gives the smallest values the smallest ranks; use `desc(x)` to give the largest values the smallest ranks.
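A small sketch of how `min_rank()` handles ties, on a made-up vector `score`:

```r
library(dplyr)

score <- c(10, 20, 20, 40)

# Smallest value gets rank 1; tied values share the lower rank
ascending <- min_rank(score)
ascending    # 1 2 2 4

# desc() reverses the order, so the largest value gets rank 1
descending <- min_rank(desc(score))
descending   # 4 2 2 1
```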

<img src="./figures/transformation/summary.jpg" alt="ds" style="width: 750px;"/>