# Statistical transformations

Let's consider the difference between `geom_bar()` and `geom_point()`:

```r
ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point()

ggplot(mpg, aes(x = drv)) +
  geom_bar()
```


`geom_point()`: data ---> plot

`geom_bar()`: data ---> counts ---> plot


`geom_bar()` has one more step between data and plot. It uses the data to calculate the counts and then draws the plot based on those counts.

Many graphs, like scatterplots, plot the raw values of your dataset. Other graphs, like bar charts, calculate new values to plot.



To illustrate this, we will use a new dataset, `diamonds`, which ships with ggplot2.
```r
ggplot(data = diamonds) + 
  geom_bar(mapping = aes(x = cut))
```

<img src="./figures/visualization/stat_process.jpg" alt="ds" style="width: 1000px;"/>

On the x-axis, the chart displays `cut`, a variable from diamonds. On the y-axis, it displays `count`, which is calculated from the data. The algorithm used to calculate new values for a graph is called a `stat`, short for statistical transformation. 
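One way to see the count step is to do it by hand. The sketch below assumes ggplot2 is installed and uses the illustrative names `cut_counts` and `p_manual`. It tabulates `cut` with base R and then plots the precomputed counts with `geom_col()`, which draws the y values as given instead of counting rows.

```r
library(ggplot2)

# Count the rows in each level of cut (the same per-level counts the bar chart shows)
cut_counts <- as.data.frame(table(cut = diamonds$cut), responseName = "count")
cut_counts

# geom_col() plots the supplied y values directly, so no further counting happens
p_manual <- ggplot(data = cut_counts) +
  geom_col(mapping = aes(x = cut, y = count))
p_manual
```

The resulting plot should look the same as the `geom_bar()` version, which suggests that the only extra work `geom_bar()` does is the counting.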

You can learn which `stat` a geom uses by inspecting the default value for the `stat` argument. For example, `?geom_bar` shows that the default value for stat is `count`, which means that `geom_bar()` uses `stat_count()`. `stat_count()` is documented on the same page as `geom_bar()`, and if you scroll down you can find a section called "Computed variables." That describes how it computes two new variables: `count` and `prop`.
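Besides reading the help page, you can inspect the computed variables directly. A minimal sketch, assuming ggplot2 is installed (the names `p` and `computed` are only for illustration): `layer_data()` returns the data frame that a layer actually draws, including the `count` and `prop` columns.

```r
library(ggplot2)

p <- ggplot(data = diamonds) +
  geom_bar(mapping = aes(x = cut))

# Data frame used to draw the first (and only) layer of p
computed <- layer_data(p)

# One row per bar, with the variables computed by the stat
computed[, c("x", "count", "prop")]
```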

You can generally use geoms and stats interchangeably. For example, you can recreate the previous plot using `stat_count()` instead of `geom_bar()`:

```r
ggplot(data = diamonds) + 
  stat_count(mapping = aes(x = cut))
```

This works because every geom has a default stat, and every stat has a default geom.
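To make the defaults visible, the sketch below writes them out explicitly and checks that both layers compute the same counts. It assumes ggplot2 is installed; `p_geom` and `p_stat` are illustrative names.

```r
library(ggplot2)

# geom_bar() with its default stat written out
p_geom <- ggplot(data = diamonds) +
  geom_bar(mapping = aes(x = cut), stat = "count")

# stat_count() with its default geom written out
p_stat <- ggplot(data = diamonds) +
  stat_count(mapping = aes(x = cut), geom = "bar")

# Both layers are built from the same computed counts
identical(layer_data(p_geom)$count, layer_data(p_stat)$count)
```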

We can override the default mapping from transformed variables to aesthetics. For example, you might want to display a bar chart of proportion, rather than count.


```r
ggplot(data = diamonds) +
  geom_bar(mapping = aes(x = cut, y = ..prop.., group = 1))
```

Each stat creates additional variables to map aesthetics to. These variables use a common **..name..** syntax. In ggplot2 3.3.0 and later the same variable can also be written as `after_stat(prop)`, and recent versions mark the **..name..** form as deprecated.

`group = 1` is a "dummy" grouping that overrides the default behavior, which is to group by the `x` variable (here, `cut`). `geom_bar()` groups by `x` by default so that it can count the number of observations in each level of the `x` variable separately. Here, for example, the default is for `geom_bar()` to return the number of observations with `cut` equal to "Fair", "Good", and so on.

However, if we want proportions, we need to consider all levels of `cut` together. Without `group = 1`, each proportion is computed within its own group, so the proportion of Fair in Fair is 100%, as is the proportion of Good in Good, and so on. `group = 1` prevents this, so that the proportion for each level of `cut` is relative to all levels of `cut`.
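The effect of `group = 1` can be checked numerically. The sketch below assumes ggplot2 is installed and uses the illustrative names `by_level` and `overall`. It compares the computed `prop` column with and without the dummy group, and then computes the same proportions with base R.

```r
library(ggplot2)

# Default grouping: one group per level of cut, so each prop is computed within its own level
by_level <- layer_data(ggplot(data = diamonds) + geom_bar(mapping = aes(x = cut)))
by_level$prop  # every value is 1

# group = 1: a single group, so each prop is relative to all rows
overall <- layer_data(ggplot(data = diamonds) + geom_bar(mapping = aes(x = cut, group = 1)))
overall$prop

# The same proportions computed directly from the data
prop.table(table(diamonds$cut))
```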

Boxplots also use data to calculate summary statistics and then draw the plots.
```r
ggplot(data = diamonds, mapping = aes(x = cut, y = depth)) +
  geom_boxplot()
```
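As with bar charts, the summary statistics behind a boxplot can be inspected with `layer_data()`. This is a sketch assuming ggplot2 is installed; `p_box` and `box_stats` are illustrative names.

```r
library(ggplot2)

p_box <- ggplot(data = diamonds, mapping = aes(x = cut, y = depth)) +
  geom_boxplot()

# One row per box: whisker ends (ymin, ymax), hinges (lower, upper), and the middle line
box_stats <- layer_data(p_box)
box_stats[, c("ymin", "lower", "middle", "upper", "ymax")]

# The middle line of each box is the median depth for that cut
tapply(diamonds$depth, diamonds$cut, median)
```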


# Summary statistics
In descriptive statistics, summary statistics are used to summarize a set of observations in order to communicate the largest amount of information as simply as possible. For a single variable, statisticians commonly describe the observations with:

* a measure of location, or central tendency, such as the mean and the median

* a measure of statistical dispersion, such as the standard deviation, variance, range, or interquartile range.

## Mean

The arithmetic mean, called simply the mean or average when the context is clear, is the sum of a collection of numbers divided by the count of numbers in the collection.

```r
a <- c(1,1,2,2,3,3,3,3,3,4,5,6)
b <- c(-3,2,4,2,5,-6,1,2,2,4,4,-2)
mean(a)
mean(b)
```
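The definition can be checked by computing the mean by hand. A minimal sketch (the names `mean_a_by_hand` and `mean_b_by_hand` are illustrative):

```r
a <- c(1, 1, 2, 2, 3, 3, 3, 3, 3, 4, 5, 6)
b <- c(-3, 2, 4, 2, 5, -6, 1, 2, 2, 4, 4, -2)

# Sum of the values divided by how many values there are
mean_a_by_hand <- sum(a) / length(a)  # 36 / 12 = 3
mean_b_by_hand <- sum(b) / length(b)  # 15 / 12 = 1.25

# Both agree with the built-in mean()
all.equal(mean_a_by_hand, mean(a))
all.equal(mean_b_by_hand, mean(b))
```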

## Median and quantiles

A median is a value separating the higher half from the lower half of a data sample.

```r
median(a)
median(b)
```
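To see what "separating the halves" means in practice, the sketch below sorts the values and picks the middle by hand. Because `a` has an even number of values, the median is the average of the two middle ones. The names `sorted_a` and `median_by_hand` are illustrative.

```r
a <- c(1, 1, 2, 2, 3, 3, 3, 3, 3, 4, 5, 6)

sorted_a <- sort(a)
n <- length(sorted_a)  # 12, an even count

# With an even count, average the two middle values (positions 6 and 7)
median_by_hand <- mean(sorted_a[c(n / 2, n / 2 + 1)])

median_by_hand == median(a)
```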

Quantiles are cut points that divide the sample into intervals containing pre-specified proportions of the observations.
Quartiles are the three cut points that divide a dataset into four equal-sized groups.
```r
quantile(a,probs=c(0.25,0.5,0.75))
quantile(b,probs=c(0.25,0.5,0.75))
```


## Range 
The range of a data set is the difference between its largest and smallest values. Note that R's `range()` returns the minimum and maximum themselves rather than their difference.

```r
range(a)
range(b)
```
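Since `range()` returns the two extremes, the range as a single number needs one more step. A short sketch:

```r
a <- c(1, 1, 2, 2, 3, 3, 3, 3, 3, 4, 5, 6)
b <- c(-3, 2, 4, 2, 5, -6, 1, 2, 2, 4, 4, -2)

range(a)         # smallest and largest values: 1 6
diff(range(a))   # their difference: 5
max(b) - min(b)  # the same calculation written out for b: 11
```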

## Interquartile range
The interquartile range is the difference between the third and the first quartiles.
```r
IQR(a)
IQR(b)
```
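The same value can be obtained from the quartiles directly. A minimal sketch, relying on the default quantile method that both `quantile()` and `IQR()` use (`iqr_by_hand` is an illustrative name):

```r
a <- c(1, 1, 2, 2, 3, 3, 3, 3, 3, 4, 5, 6)

# First and third quartiles
q <- quantile(a, probs = c(0.25, 0.75))

# Third quartile minus first quartile; unname() drops the "75%" label
iqr_by_hand <- unname(q[2] - q[1])

all.equal(iqr_by_hand, IQR(a))
```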


## Variance and standard deviation
Variance is the expectation of the squared deviation of a random variable from its mean. Informally, it measures how far a set of numbers is spread out from its average value. The standard deviation is the square root of the variance, so it is expressed in the same units as the data.

```r
var(a)
var(b)
sd(a)
sd(b)
```
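Note that `var()` and `sd()` compute the sample versions, which divide by `n - 1` rather than `n`. The sketch below reproduces both by hand (`var_by_hand` and `sd_by_hand` are illustrative names):

```r
a <- c(1, 1, 2, 2, 3, 3, 3, 3, 3, 4, 5, 6)
n <- length(a)

# Sum of squared deviations from the mean, divided by n - 1
var_by_hand <- sum((a - mean(a))^2) / (n - 1)

# Standard deviation is the square root of the variance
sd_by_hand <- sqrt(var_by_hand)

all.equal(var_by_hand, var(a))
all.equal(sd_by_hand, sd(a))
```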



## `summary()`
```r
summary(a)
summary(b)
```
## Your turn 
* Obtain the summary statistics for the variables in the `diamonds` dataset. Does `summary()` calculate different summary statistics for different types of variables?

* Calculate the mean, median, variance, and IQR (`mean()`, `median()`, `var()`, `IQR()`) for the following two vectors. Explain your findings.
```r
a <- 1:7
b <- c(1:6, 10000)
```



# Position adjustment
Position adjustments control how overlapping objects are arranged by adjusting the position of each geom. There are five different adjustments:

1. "identity": default of most geoms
2. "jitter": default of `geom_jitter()`
3. "dodge": default of `geom_boxplot()` (in current ggplot2 versions, the closely related "dodge2")
4. "stack": default of `geom_bar()` and `geom_histogram()`
5. "fill": useful for `geom_bar()` and `geom_histogram()`


In a bar chart, if you map the `fill` aesthetic to another variable, like `clarity`, the bars are automatically stacked. In the following plot, each colored rectangle represents a combination of `cut` and `clarity`.
```r
ggplot(data = diamonds) + 
  geom_bar(mapping = aes(x = cut, fill = clarity))
```

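The stacking is performed automatically by the position adjustment specified by the `position` argument, which defaults to `"stack"` for `geom_bar()`. Setting `position = "fill"` also stacks the bars but scales every stack to the same height, which makes proportions easier to compare across groups. Setting `position = "dodge"` places the bars side by side, which makes individual counts easier to compare.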

```r
ggplot(data = diamonds) + 
  geom_bar(mapping = aes(x = cut, fill = clarity), position = "fill")
ggplot(data = diamonds) + 
  geom_bar(mapping = aes(x = cut, fill = clarity), position = "dodge")
```
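The `"jitter"` adjustment from the list is easier to see on a scatterplot than on a bar chart. The sketch below assumes ggplot2 is installed and uses the `mpg` dataset from the start of this section; `p_overlap` and `p_jitter` are illustrative names.

```r
library(ggplot2)

# Many cars share the same rounded displ and hwy values, so their points overlap
p_overlap <- ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) +
  geom_point()

# "jitter" adds a small amount of random noise to each point to spread them out
p_jitter <- ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) +
  geom_point(position = "jitter")

p_overlap
p_jitter
```

Because the noise is random, the jittered plot looks slightly different each time it is drawn. It trades a little positional accuracy for a clearer view of where the data are concentrated.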


## Your turn

* Create barplots for `color` of the `diamonds` dataset split by `cut`:
    1. compare the counts of the different levels of `cut` within each value of `color`
    2. compare the proportions of the different levels of `cut` across the values of `color`

* What’s the default position adjustment for `geom_boxplot()`? Create a visualization of the mpg dataset that demonstrates it.

# Coordinate systems

The default coordinate system is the Cartesian coordinate system, where the x and y positions act independently to determine the location of each point. There are a number of other coordinate systems that are occasionally helpful.

`coord_flip()` switches the x and y axes.

```r
ggplot(data = diamonds) +
  geom_bar(mapping = aes(x = cut)) +
  coord_flip()
```

`coord_polar()` uses a polar coordinate system: a two-dimensional coordinate system in which each point on a plane is determined by its distance from a reference point and its angle from a reference direction.

<img src="./figures/visualization/polar.jpg" alt="ds" style="width: 750px;"/>
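The relationship between polar and Cartesian coordinates can be sketched in a few lines of base R. The helper `polar_to_cartesian()` is hypothetical and written only for illustration. It follows the textbook convention (angle in radians, measured counterclockwise from the positive x-axis), which is not necessarily how `coord_polar()` orients its angles.

```r
# Hypothetical helper: convert a distance r and an angle theta (in radians)
# into Cartesian x and y coordinates
polar_to_cartesian <- function(r, theta) {
  data.frame(x = r * cos(theta), y = r * sin(theta))
}

# A point 2 units from the origin at 90 degrees lies on the positive y-axis
pt <- polar_to_cartesian(r = 2, theta = pi / 2)
pt
```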

```r
bar <- ggplot(data = diamonds) +
  geom_bar(
    mapping = aes(x = cut, fill = cut),
    show.legend = FALSE,
    width = 1
  ) +
  labs(x = NULL, y = NULL)

bar + coord_flip()
bar + coord_polar()
```

## Your turn
* Turn a stacked bar chart into a pie chart using `coord_polar()`.

* What does the plot below tell you about the relationship between city and highway mpg? Why is `coord_fixed()` important? What does `geom_abline()` do?
```r
ggplot(data = mpg, mapping = aes(x = cty, y = hwy)) +
  geom_point() + 
  geom_abline() +
  coord_fixed()
```