/
hagis.Rmd
312 lines (219 loc) · 12.1 KB
/
hagis.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
---
title: "hagis: Tools for Analysis of Plant Pathogen Pathotype Complexities, Distributions and Diversity"
author: "Austin McCoy and Zachary Noel"
date: "`r Sys.Date()`"
vignette: >
%\VignetteIndexEntry{hagis: Tools for Analysis of Plant Pathogen Pathotype Complexities, Distributions and Diversity}
%\VignetteEngine{knitr::rmarkdown}
%\VignetteEncoding{UTF-8}
%\VignetteDepends{pander}
---
```{r setup, include = FALSE}
knitr::opts_chunk$set(
collapse = TRUE,
comment = "#>",
fig.width = 6,
fig.height = 4
)
data.table::setDTthreads(1L)
```
## Getting Started With {hagis}
The following examples are based on a dataset from Michigan State University *Phytophthora sojae* surveys for soybean phytophthora root rot pathotyping efforts.
First you'll want to load in your data set, for right now let's use a practice data set made for the {hagis} package, named `P_sojae_survey`.
The data set is available in your R session automatically when you load the {hagis} package.
```{r load_data}
library("hagis")
head(P_sojae_survey)
```
We see in the `gene` column that each gene is prepended with "Rps".
We can remove this to make the graphs cleaner and report the genes in tables as we would in a manuscript.
Note that this will work for any string you enter as the first value, `pattern`.
The second string, `replacement`, is the replacement value, the third, `x`, is where to look and make the changes.
```{r remove-gene}
P_sojae_survey$Rps <-
gsub(pattern = "Rps ",
replacement = "",
x = P_sojae_survey$Rps)
head(P_sojae_survey)
```
This practice data set contains 21 isolates', `Isolate`, virulence data on a set of 14 differential soybean cultivars, `Line`.
This package uses the *percentage* of susceptible, inoculated, plants to determine effective resistance genes, pathotype diversity and frequency, as well as individual isolates pathotypes.
To help ensure that the proper data are used in calculations, the user is asked to provide some information that instruct {hagis} about what data to use.
------------------------------------------------------------------------
## Function Arguments Used in {hagis}
We have striven to make {hagis} as intuitive to use as possible.
Part of that means that we have used the same arguments for the three main functions, `summarize_gene()`, `calculate_complexities()` and `calculate_diversities()`.
Each of these functions take the same arguments:
- `x` this is your data set name, *e.g.*, `P_sojae_survey` from the example above, allows for the function to identify where it will be pulling these columns (and their associated row values) from to use (*i.e.* your data collection Excel spreadsheet)
- `cutoff` this value sets the cutoff for susceptible reactions.
For example, `cutoff = 60` means that all genes with 60% or more of the plants rated susceptible will be treated as susceptible.
You can change this to whatever percentage you require for your study.
- `control` specifies the value used in the `gene` column to denote a susceptible control used in the study
- `sample` specifies the column header for the column which identifies the isolates tested
- `gene` specifies the column header for the column which identifies the genes tested
- `perc_susc` specifies the column header for the column which identifies the percent susceptible plants for each gene
Ordinarily you would use functions in {hagis} or other R packages like this:
```{r example-function, eval=FALSE}
Rps.summary <- summarize_gene(
x = P_sojae_survey,
cutoff = 60,
control = "susceptible",
sample = "Isolate",
gene = "Rps",
perc_susc = "perc.susc"
)
```
However, because the functions share arguments we can create a `list()` of arguments and share them with some of the functions from {hagis}.
First, we make a list of the arguments that `summarize_gene()`, `calculate_diversities()` and `calculate_complexities()` use that specify our inputs based on our example data:
```{r shared-args}
hagis_args <- list(
x = P_sojae_survey,
cutoff = 60,
control = "susceptible",
sample = "Isolate",
gene = "Rps",
perc_susc = "perc.susc"
)
```
Now that we have a list of arguments, we can now save time entering the same data for each function and also avoid typos or entering different cutoff values, etc. between the functions.
------------------------------------------------------------------------
## Determination of Effective Resistance Genes
Below is an example of tables and graphics that can be produced using the `summarize_gene()` function to identify effective resistance genes tested against the sampled *Phytophthora sojae* population.
The `summarize_gene()` function allows you to produce a detailed table showing the number of virulent isolates (`N_virulent_isolates`), as well as offering a percentage of the isolates tested which are pathogenic on each gene (`percent_pathogenic`).
```{r, echo=TRUE}
Rps.summary <- do.call(summarize_gene, hagis_args)
Rps.summary
```
Using the *pander* library we can make the table much more attractive in RMarkdown.
```{r pander-print-Rps, echo=TRUE}
library(pander)
pander(Rps.summary)
```
### Plotting Rps Summary Data
{hagis} also provides functions to quickly graph your data using {ggplot2}.
Two functions are provided to plot the summary depending on your needs.
If you need the frequency, use `autoplot(Rps.summary, type = "percentage")`, or if you desire the distribution `autoplot(Rps.summary, type = "count")`.
Both return the same graph, only the y-axis change; percent for frequency and n for distribution.
```{r plot-summary, echo=TRUE}
autoplot(Rps.summary, type = "percentage")
autoplot(Rps.summary, type = "count")
```
------------------------------------------------------------------------
## Pathotype Complexities
Pathotype frequency, distribution as well as statistics such as mean pathotype complexity can be calculated using the `calculate_complexities()` function.
This function will return a `list()` of two `data.table()` objects, `grouped_complexities` and `individual_complexities`.
`grouped_complexities` returns a `list()` as a `data.table()` object showing the frequency and distribution of pathotype complexities for the sampled population.
`individual_complexities()` returns a `list()` as a `data.table()` object showing each individual isolates pathotype complexity.
An isolates pathotype complexity refers to the number of resistance genes that it is able to overcome and cause disease on, *i.e.*, a pathotype complexity of "7" would mean that isolate can cause disease on 7 different resistance genes.
```{r complexities, echo=TRUE, message=FALSE, warning=FALSE}
complexities <- do.call(calculate_complexities, hagis_args)
complexities
```
Once again, using *pander* we can make these tables much more attractive in RMarkdown.
Since `complexities` is a `list()` object, we can refer to each object directly by name and print them as follows.
```{r pander-print-complexities}
pander(complexities$grouped_complexities)
pander(complexities$indvidual_complexities)
```
Using `summary()` will return the mean, standard error (se) and standard deviation (sd) for pathotypes of a complexities object.
```{r summary-complexities}
pander(summary(complexities))
```
### Plotting Complexities Data
Two functions are provided to plot the complexities depending on your needs.
If you need the frequency, use `autoplot(complexties, type = "percentage")`, or if you desire the distribution `autoplot(complexities, type = "count")`.
Both return the same graph, only the y-axis change; percent for frequency and n for distribution.
```{r complexities-plot}
autoplot(complexities, type = "percentage")
autoplot(complexities, type = "count")
```
------------------------------------------------------------------------
## Diversity Indices, Frequency of Unique Pathotypes and Individual Isolate Pathotypes
Diversity indices are extremely useful when trying to identify differences between two populations.
Here, pathotype diversities are calculated for the isolate population using the `calculate_diversities()` function.
Likewise, individual isolates' pathotypes, number of isolates used in the study, number of pathotypes within the study are calculated.
Five diversity indices are calculated when calling `calculate_diversities()`.
- Simple diversity index, which will show the proportion of unique pathotypes to total samples.
As the values gets closer to 1, there is greater diversity in pathoypes within the population.
Simple diversity is calculated as: $$D = \frac{Np}{Ns}$$ where $Np$ is the number of pathotypes and $Ns$ is the number of samples.
- Gleason diversity index, an alternate version of Simple diversity index, is less sensitive to sample size than the Simple index.
$$D = \frac{ (Np - 1) }{ log(Ns)}$$ Where $Np$ is the number of pathotypes and $Ns$ is the number of samples.
- Shannon diversity index is typically between 1.5 and 3.5, as richness and evenness of the population increase, so does the Shannon index value.
$$D = -\sum_{i = 1}^{R} p_i \log p_i$$ Where $p_i$ is the proportional abundance of species $i$.
- Simpson diversity index values range from 0 to 1, 1 represents high diversity and 0 represents no diversity.
Where diversity is calculated as: $$D = \sum_{i = 1}^{R} p_i^2$$
- Evenness ranges from 0 to 1, as the Evenness value approaches 1, there is a more even distribution of each pathotype's frequency within the population.
Where Evenness is calculated as: $$D = \frac{H'}{log(Np)}$$ where $H'$ is the Shannon diversity index and $Np$ is the number of pathotypes.
```{r calculate-diversities, echo=TRUE}
diversity <- do.call(calculate_diversities, hagis_args)
diversity
```
Or using `pander` for reporting, a nice table is generated.
```{r diversity-pander}
pander(diversity)
```
### Table of Diversities
To generate a table of diversities, use `diversities_table()`.
{hagis} will automatically create a `pander` object for you.
This is because it is much easier to read the resulting table in the console than the raw `data.frame` and insert into reports.
```{r diversities-table}
diversities_table(diversity)
```
To generate a table of individual pathotypes, use `individual_pathotypes()`.
Here again, {hagis} provides a `pander` object for ease of use.
### Table of Individual Pathotypes
```{r individual-pathotypes}
individual_pathotypes(diversity)
```
# Advanced Plotting {#AdvancedPlotting}
## {hagis} autoplot() Objects
Since {hagis} uses {ggplot2} to generate its plots, you can easily theme the outputs using common {ggplot2} themes and other options provided by {hagis} directly through `autoplot()`.
```{r set-up-adv.plot}
library(ggplot2)
Rps.plot <- autoplot(Rps.summary, type = "percentage")
Rps.plot
```
### Changing the ggplot2 Theme
Use {ggplot2}'s `theme_minimal()` theme.
```{r change-plot-theme}
Rps.plot <- Rps.plot +
theme_minimal()
Rps.plot
```
### Changing the Font
Set the font to be a bold-face serif family font.
```{r change-plot-font}
Rps.plot <- Rps.plot +
theme(text = element_text(face = "bold",
family = "serif"))
Rps.plot
```
### Make a Horizontal Plot
If your *Rps* gene names are too long, flipping the axis can make the graph more legible without rotating the x-axis labels.
```{r horizontal-plot}
Rps.plot <- Rps.plot +
coord_flip()
Rps.plot
```
### Use Colors in Autoplot Objects
You can use named, *e.g.*, "red", "yellow", "blue", colors in R or you can use custom hexadecimal color codes.
Illustrated below is using Michigan State University (MSU) Green, hex code #18453b, using `theme_bw()` with a serif font.
```{r use-Colors}
autoplot(Rps.summary,
type = "percentage",
color = "#18453b") +
theme_bw() +
theme(text = element_text(face = "bold",
family = "serif"))
```
### Sorting the x-axis
You can sort the x-axis of any graph produced using `autoplot()` in an `ascending` or `descending` order using the `order` parameter in `autoplot()`.
```{r sort-axis}
autoplot(Rps.summary,
type = "percentage",
color = "#18453b",
order = "ascending") +
theme_bw() +
theme(text = element_text(face = "bold",
family = "serif"))
```