---
title: "Histogram Plots"
output:
  rmarkdown::html_vignette:
    toc: true
vignette: >
  %\VignetteIndexEntry{Histogram Plots}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---
```{r, include = FALSE}
knitr::opts_chunk$set(
collapse = TRUE,
comment = "#>",
dpi = 300,
fig.align = "center",
out.width = "100%",
error = TRUE,
fig.height = 6,
fig.width = 8,
fig.showtext = TRUE
)
```
```{r setup}
library(tlf)
```
# 1. Introduction
This vignette documents and illustrates workflows for producing histograms with the function `plotHistogram` from the `tlf` package.
# 2. Illustration of basic histograms
## 2.1. Data
The data used in the following examples is available at the following path: `system.file("extdata", "test-data.csv", package = "tlf")`.
In the code below, the data is loaded and assigned to `histData`.
```{r load-data, results='asis'}
# Load example
histData <- read.csv(
system.file("extdata", "test-data.csv", package = "tlf"),
stringsAsFactors = FALSE
)
# histData
knitr::kable(utils::head(histData), digits = 2)
```
## 2.2. `plotHistogram`
Besides the usual `tlf` input arguments shared by the plot functions (`data`, `metaData`, `dataMapping`, `plotConfiguration` and `plotObject`), the function `plotHistogram` also includes the following optional input arguments:
- `x`: Numeric values used in the histogram instead of `data` and `dataMapping`.
- `bins`: Number of bins, or bin edges when several values are provided.
- `binwidth`: Width of each bin, overriding the number of bins.
- `stack`: Logical defining whether the histogram bars are stacked.
- `distribution`: Name of a distribution to fit to the data.
Currently, only normal and log-normal distributions are available.
## 2.3. Minimal examples
Most of the time, the optional input `x` is the most convenient way to quickly assess the distribution of the data.
```{r minimal-example-x}
# Use x directly for a quick histogram
plotHistogram(x = histData$Ratio)
# Use x and bins for a quick histogram with a defined number of bins
plotHistogram(x = histData$Ratio, bins = 7)
```
## 2.4. Examples using `data` and `dataMapping`
Workflows in `tlf` usually include the definition of `data`, its `metaData` and its `dataMapping`.
```{r example-data}
# Create HistogramDataMapping object
histoMapping <- HistogramDataMapping$new(
x = "Ratio",
fill = "Sex"
)
plotHistogram(
data = histData,
dataMapping = histoMapping
)
```
In such cases, the optional arguments previously presented can be included in `dataMapping`.
```{r example-stack}
# Create HistogramDataMapping object
histoMapping <- HistogramDataMapping$new(
x = "Ratio",
fill = "Sex",
stack = TRUE
)
plotHistogram(
data = histData,
dataMapping = histoMapping
)
```
If these optional arguments are also passed directly to `plotHistogram`, they override the values defined in `dataMapping`.
```{r example-bins}
# Create HistogramDataMapping object
histoMapping <- HistogramDataMapping$new(
x = "Ratio",
fill = "Sex",
bins = 3
)
# bins is defined in both places: the plotHistogram argument takes priority over dataMapping
plotHistogram(
data = histData,
dataMapping = histoMapping,
bins = 6
)
```
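The precedence rule can be pictured with a small base R sketch. The helper `resolveOption` is hypothetical and is not part of `tlf`; it only illustrates the idea that a value passed in the call replaces the value stored in the mapping, and it makes no claim about how `tlf` implements this internally.

```r
# Hypothetical helper (not part of tlf): return the explicitly supplied
# argument when there is one, otherwise fall back to the mapping value.
resolveOption <- function(argumentValue, mappingValue) {
  if (is.null(argumentValue)) {
    return(mappingValue)
  }
  argumentValue
}

# bins is set to 3 in the mapping and to 6 in the call: the call wins
binsFromCall <- resolveOption(argumentValue = 6, mappingValue = 3)

# bins is only set in the mapping: the mapping value is kept
binsFromMapping <- resolveOption(argumentValue = NULL, mappingValue = 3)
```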
## 2.5. Focus on binning
There are 3 ways of defining how the data is binned.
The priority between the methods is defined according to their specificity.
Method 1, the simplest and the default, defines the number of bins.
It can be overridden by method 3, which defines the width of each bin.
Method 2 is the most specific and defines the bin edges; consequently, it cannot be overridden by method 3.

1. Define the number of bins with the input argument `bins` (using a single value)
```{r example-bins-single}
# Create HistogramDataMapping object
histoMapping <- HistogramDataMapping$new(
x = "Ratio",
fill = "Sex"
)
# Define the number of bins in final plot
plotHistogram(
data = histData,
dataMapping = histoMapping,
bins = 6
)
```
2. Define the edges of the bins with the input argument `bins` (using an array of values)
```{r example-bins-array}
# Create HistogramDataMapping object
histoMapping <- HistogramDataMapping$new(
x = "Ratio",
fill = "Sex"
)
# Define the edges of bins in final plot
plotHistogram(
data = histData,
dataMapping = histoMapping,
bins = seq(0, 6, 0.2)
)
```
3. Define the width of the bins with the input argument `binwidth` (using a single value)
```{r example-binwidth}
# Create HistogramDataMapping object
histoMapping <- HistogramDataMapping$new(
x = "Ratio",
fill = "Sex"
)
# Define the width of bins in final plot
plotHistogram(
data = histData,
dataMapping = histoMapping,
binwidth = 0.4
)
```
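The priority between the three methods can be sketched in base R as follows. The function `computeBreaks`, its default of 10 bins and the example values are assumptions made for illustration; they are not taken from `tlf`, whose internal binning may differ in its details.

```r
# Hypothetical helper (not part of tlf) that turns the binning inputs into
# bin edges, following the priority described in this section.
computeBreaks <- function(x, bins = 10, binwidth = NULL) {
  # Method 2: a vector of edges is used as is and ignores binwidth
  if (length(bins) > 1) {
    return(sort(bins))
  }
  xRange <- range(x, na.rm = TRUE)
  # Method 3: a bin width overrides the number of bins
  if (!is.null(binwidth)) {
    nBins <- max(1, ceiling(diff(xRange) / binwidth))
    return(xRange[1] + binwidth * seq(0, nBins))
  }
  # Method 1: equally spaced edges for the requested number of bins
  seq(xRange[1], xRange[2], length.out = bins + 1)
}

# Illustrative values, not taken from the test data set
x <- c(0.5, 1.2, 1.9, 2.4, 3.1, 4.5)

breaksFromNumber <- computeBreaks(x, bins = 4)
breaksFromWidth <- computeBreaks(x, binwidth = 0.5)
# Edges are provided, so binwidth is ignored
breaksFromEdges <- computeBreaks(x, bins = seq(0, 6, 2), binwidth = 0.5)
```

In this sketch, the edges are checked first, then the bin width, and the number of bins is only used when neither of the other two inputs is provided.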
## 2.6. Focus on distribution fit
The optional input `distribution` fits a theoretical distribution to the data.
Currently, two distributions can be fitted by the function `plotHistogram`:
- `"normal"`: fits a normal distribution and draws the distribution mean as a vertical line
- `"logNormal"`: fits a log-normal distribution and draws the distribution mode as a vertical line
```{r example-distribution}
# Plot normal distribution
plotHistogram(
x = histData$Ratio,
distribution = "normal"
)
# Plot log-normal distribution
plotHistogram(
x = histData$Ratio,
distribution = "logNormal"
)
```
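For readers who want to see which quantities such fits involve, here is a minimal base R sketch. It assumes that the parameters are estimated from the sample mean and standard deviation (of the log-transformed values for the log-normal case); this estimation method and the example values are assumptions, not a description of the `tlf` internals.

```r
# Illustrative positive values, not taken from the test data set
x <- c(0.8, 1.1, 1.3, 1.6, 2.0, 2.7, 3.5)

# Normal fit: sample mean and standard deviation
normalMean <- mean(x)
normalSd <- sd(x)

# Log-normal fit: mean and standard deviation of the log-transformed values
meanLog <- mean(log(x))
sdLog <- sd(log(x))
# Mode of a log-normal distribution: exp(meanlog - sdlog^2)
logNormalMode <- exp(meanLog - sdLog^2)

# Densities evaluated on a grid of x values
xGrid <- seq(min(x), max(x), length.out = 100)
normalDensity <- dnorm(xGrid, mean = normalMean, sd = normalSd)
logNormalDensity <- dlnorm(xGrid, meanlog = meanLog, sdlog = sdLog)
```

The mode of a log-normal distribution is always smaller than its median `exp(meanLog)`, which is why the vertical line of a log-normal fit sits to the left of the center of the log-transformed data.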
To compare the distributions of several groups, the grouping can be defined through the `dataMapping`:
```{r example-2-distributions}
# Create HistogramDataMapping object split by gender
histoMapping <- HistogramDataMapping$new(
x = "Ratio",
fill = "Sex"
)
# Plot normal distribution for each gender
plotHistogram(
data = histData,
dataMapping = histoMapping,
distribution = "normal"
)
```
With the option `stack`, it is also possible to fit a single distribution to the pooled data while the content of the bars remains split by group.
```{r example-2-distributions-stack}
# Create HistogramDataMapping object split by gender
histoMapping <- HistogramDataMapping$new(
x = "Ratio",
fill = "Sex"
)
# Plot normal distribution of sum but bars are split by gender
plotHistogram(
data = histData,
dataMapping = histoMapping,
distribution = "normal",
stack = TRUE
)
```
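The difference between the two situations can be sketched in base R. The data frame `exampleData` and the bin edges are made up for illustration, and the sketch assumes that stacking corresponds to one fit on the pooled values while the unstacked case corresponds to one fit per group.

```r
# Illustrative data, not taken from the test data set
exampleData <- data.frame(
  Ratio = c(0.9, 1.4, 1.8, 2.2, 1.1, 1.6, 2.5, 3.0),
  Sex = rep(c("Female", "Male"), each = 4)
)

# Without stacking: one normal fit per group
meanByGroup <- tapply(exampleData$Ratio, exampleData$Sex, mean)
sdByGroup <- tapply(exampleData$Ratio, exampleData$Sex, sd)

# With stacking: a single normal fit on the pooled values
pooledMean <- mean(exampleData$Ratio)
pooledSd <- sd(exampleData$Ratio)

# Bar content: counts per bin and group; stacked bar heights are the row sums
binEdges <- seq(0.5, 3.5, by = 1)
countsByGroup <- table(
  cut(exampleData$Ratio, breaks = binEdges),
  exampleData$Sex
)
stackedHeights <- rowSums(countsByGroup)
```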
The `HistogramPlotConfiguration` objects can be used to tune the final plot aesthetics.
```{r example-2-distributions-plot-configuration}
histoConfiguration <- HistogramPlotConfiguration$new(
xlabel = "Ratios",
ylabel = "Occurrences"
)
histoConfiguration$ribbons$fill <- "grey80"
histoConfiguration$lines$color <- "firebrick"
# Plot normal distribution using the customized plot configuration
plotHistogram(
x = histData$Ratio,
plotConfiguration = histoConfiguration,
distribution = "normal"
)
```