/
01_rphenoscape.Rmd
216 lines (150 loc) · 14.3 KB
/
01_rphenoscape.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
---
title: "RPhenoscape Tutorial"
output: rmarkdown::html_vignette
vignette: >
%\VignetteIndexEntry{RPhenoscape}
%\VignetteEngine{knitr::rmarkdown}
%\VignetteEncoding{UTF-8}
---
```{r setup, include = FALSE}
knitr::opts_chunk$set(
collapse = TRUE,
comment = "#>"
)
```
## 1. Introduction
The desired input of many comparative studies is a matrix that includes all the data available for a set of traits and a focal taxon. Bringing similar types of trait data together across studies is notoriously difficult, but is straightforward with ontology annotations. Trait data that are tagged using similar ontology terms can be automatically called using R functions.
This workshop uses trait data for vertebrates that has been annotated (tagged) with ontology terms. These data are accessed from the [Phenoscape KB]. In this lesson we use R functions to obtain and view character matrices from the KB, obtain synthetic matrices of inferred presence/absence characters, and understand the meaning and relations of ontology terms. The lesson assumes basic knowledge of R and RStudio.
## 2. Browse and filter data in the KB to understand the scope of available data.
You would like to understand the evolution of morphological traits in a taxonomic group, and you need a matrix to map to your (typically molecular) phylogenetic tree. You come to the Phenoscape KB to see what types of data are available across the members of your clade. In this exercise, you are an ichthyologist, interested in catfishes (Siluriformes), specifically the bullhead catfishes in the family Ictaluridae.
You begin by querying ‘Siluriformes’ in the Phenoscape KB [faceted browsing page]. You immediately see the scope of the data:
From left to right across the top tabs, for ‘Siluriformes’:
- There are 4,969 unique phenotypes
- There are 292,853 taxon annotations
- There are 1,505 siluriform taxa with data in the KB
- There are 40 publications in the KB that include catfish phenotypes
- There are 19,232 candidate genes linked to the phenotypes
Taxon annotations are the ontology terms tagged to original free text descriptions of phenotypes for each siluriform taxon. Click on the ‘Taxon annotations’ tab, and then the ‘Sources’ box on the right hand side of each row to see the publication source(s) and free text description of the phenotype to which each ontology annotation was tagged.
You are specifically interested in Ictaluridae (bullhead catfishes), and you click on that family in the faceted browsing interface. Now filter using ‘fin’ as the anatomical entity under ‘Query’ and include parts (check the ‘parts’ box.)
Download these taxa and their morphological trait data using the link provided.
Go to the publications tab, which shows nine publications that contain fin data. Note that any of these studies, e.g., Lundberg (1992), can be entered under ‘publications’ in faceted browsing, but the original matrix cannot be downloaded.
## 3. Retrieve term info using RPhenoscape
Now we will repeat some of the above steps in R, beginning with a query for ‘Ictaluridae’ and ‘fin’ (including parts) to retrieve the list of studies that contain fin characters for this taxon.
The [RPhenoscape] package provides convenient access to the Phenoscape Knowledgebase in the form of R functions returning R-native data structures. So that we don't have to prefix every call of the package's functions with the package name, we will first attach the package:
```{R}
library(rphenoscape)
```
First let’s look up information about the query terms. The [`find_term()`](https://rphenoscape.phenoscape.org/reference/find_term.html) function searches the KB for terms matching a given search text, and will return the ID (term IRI), label, source ontology IRI, and type of match (e.g., exact, broad). For example, search for terms containing the string “fin” in the label or synonym:
```{r}
find_term("fin")
```
Note that because the KB integrates model organism phenotypes, the results show types of fin specific to model organisms, such as "fin (zebrafish)".
There are several parameters available to narrow and filter search results. For example, the parameter `matchType` allows restricting to exact matches only:
```{r}
find_term("fin", matchTypes = c("exact"))
```
To retrieve more details about a given term, such as definition, synonyms and classification, you can pass the term IRI (e.g., 'http://purl.obolibrary.org/obo/UBERON_0008897' for fin) to the function [`as.terminfo()`](https://rphenoscape.phenoscape.org/reference/terminfo.html). Classification details include the superclass (parent) and subclass (child) relationships for a given ontology term, along with other relationships such as part_of.
For example, view the relationships and term info for “pectoral fin”:
```{r}
pecfin <- find_term('pectoral fin', matchType='exact')
as.terminfo(pecfin)
```
For a specific taxon, such as Siluriformes, you can also view the taxon rank, whether or not the taxon is extinct, and its common name:
```{r}
taxon <- find_term('Siluriformes')
as.terminfo(taxon)
```
## 4. Retrieve study matrices using RPhenoscape
Now we will query for ‘Ictaluridae’ and ‘fin’ (including parts) to retrieve the list of studies that contain fin characters for this taxon.
```{r}
slist <- get_studies(taxon = "Ictaluridae", entity = "fin", includeRels = "part of")
```
The list is returned as a data.frame containing the study IDs and citations. Navigate to Global Environment and click on ‘slist’ to view the ID and citation (author, year) for each study in the KB that contains any ictalurid species associated with fin data. Note that taxon, entity, and quality are optional arguments in this query, and we have also included fin parts. Here we found nine studies. Click on ‘slist’ in the Global Environment to view these studies and their IDs.
Next we want to obtain the character matrix for one of these studies. Using the study IDs from the previous step, we can get a list of the character matrices as NeXML objects. We will choose the Lundberg (1992) publication, the fifth item on the list above.
```{r}
lundberg_nexml <- get_study_data(slist$id[5])
```
Now go to Global Environment pane and click on ‘lundberg_nexml’ to view the study ID and associated NeXML object.
To view the full Lundberg (1992) matrix within RStudio, we need to first retrieve the matrix as a data.frame from the NeXML object:
```{r}
lundberg_matrix <- get_char_matrix(lundberg_nexml[[1]], otus_id = FALSE)
```
You can take a look at a small part of the Lundberg (1992) matrix (e.g., the first five taxa and first five characters):
```{r}
lundberg_matrix[1:5, 1:7]
```
Now navigate to Global Environment and click on ‘lundberg_matrix’ to view the matrix, including the total number of taxa and characters (i.e., the dimensions of the data.frame). To view the entire matrix in this panel, click on the first icon that appears to the right when you hover over the second row.
We can use the information in the NeXML object to convert character state symbols to labels:
```{r}
state_symbols2labels(lundberg_nexml[[1]], charmat = lundberg_matrix)[1:5, 1:7]
```
Note that all taxa and characters from this study are returned, not just those pertaining to our original search terms. In this example, many anatomical entities other than ‘fin’ and its parts are returned as part of the original study matrix. In a later step, we will learn how to subset a matrix using anatomy or taxonomy terms.
## 5. Get a presence/absence character matrix using OntoTrace in RPhenoscape
In the previous step, we queried the KB for individual studies that contain characters pertaining to the fins in ictalurid catfish, and we obtained character matrices for source publications that included those characters. Now we want to get a matrix that synthesizes these characters into a single matrix.
[OntoTrace] from the KB combines characters from multiple publications into a single synthetic matrix. It allows a user to specify any number or combination of anatomical entities and taxa. It returns a matrix as a nexml object that includes only presence or absence of anatomical entities. It uses asserted data of presence or absence, i.e., originally described by an author, or inferred data. Data are inferred from asserted characters, e.g., if an author describes a part of a fin such as ‘fin ray’, the fin is inferred present and scored as such in the matrix.
The OntoTrace default settings are to include parts of the anatomical entity and to include variable-only characters, but here we will include those that are invariant as well.
```{r}
nex <- get_ontotrace_data(taxon = "Ictaluridae", entity = "fin", variable_only = FALSE)
```
To view the matrix within RStudio, get the matrix (m) as a `data.frame` from the NeXML object. Opening this (‘m’) in the Global Environment shows the taxa, character names, and state assignments (0, absent; 1, present) (note: the otu and otus are NeXML-internal IDs):
```{r}
m <- get_char_matrix(nex)
```
The character and taxon lists for the matrix can also be returned. Click on ‘meta’ in the Global Environment pane to view the data.frames for taxa and entities:
```{r}
meta <- get_char_matrix_meta(nex)
```
The above returns 29 species of ictalurid catfishes and 124 anatomical entities for fins and their parts.
You will need to go to the [Phenoscape KB] to view the supporting states, i.e., the source data upon which inferences were made or those that were directly asserted.
For example, in the returned OntoTrace matrix (see ‘m’ in Global Environment (row 10)) the pelvic fin is scored as present ('1') for *Ictalurus australis*. Enter these data on the [faceted browsing page]:
- Anatomical entity = ‘pelvic fin’
- Quality = ‘present’
- Taxon = ‘Ictalurus australis’
Looking at ‘Taxon annotations’, you will find that there are zero phenotypes, i.e., no asserted character states to ‘pelvic fin present’. Now click the ‘inferred presence’ radio button under Phenotypic quality. You will now see two phenotypes for this taxon in the KB. These phenotypes ‘pelvic fin lepidotrichium amount’ and ‘pelvic splint present’ logically imply the presence of a pelvic fin. Thus, though an author did not directly assert the presence of a pelvic fin in *Ictalurus australis*, it is inferred and thus reported as present (‘1’) in the OntoTrace matrix.
## 6. Subset the matrix taxonomically
In the above step you retrieved a synthetic matrix for the Ictaluridae. Now you would like to pare it down in taxonomic and anatomical scope. Let’s begin by trimming the matrix down to the presence and absence characters pertaining only to the genus *Ictalurus*. We will reduce the matrix to include only descendents of *Ictalurus*.
First, we need to parse the list to determine which taxa in the matrix (‘m’) are descendants of *Ictalurus*. This returns a list of TRUE or FALSE assignments for each taxon in the matrix:
```{r}
is_desc <- is_descendant('Ictalurus', m$taxa)
```
We then subset the matrix m to the descendants of *Ictalurus*:
```{r}
ictalurus_m <- m[is_desc, ]
```
View the new matrix by clicking on the ‘ictalurus_m’ `data.frame` in the Global Environment.
We can also subset the matrix by anatomy. In this case, let’s select characters for the pectoral fin and its parts. We will use [`is_descendant()`](https://rphenoscape.phenoscape.org/reference/is_relation.html) to first parse the characters (column names; note the first three taxon-related columns are excluded) to determine which are descendents (types or parts) of the term ‘pectoral fin’. Because this function only looks for descendent terms, we also need to specify that any characters pertaining to 'pectoral fin' itself are included.
```{r}
is_desc_pecfin <- is_descendant('pectoral fin', colnames(ictalurus_m)[-c(1,2,3)], includeRels = "part_of") | colnames(ictalurus_m)[-c(1,2,3)] == 'pectoral fin'
table(is_desc_pecfin)
```
Using the ‘colnames’ command below, view the character names to verify that pectoral fin related characters will be selected:
```{r}
colnames(ictalurus_m)[-c(1,2,3)][is_desc_pecfin]
```
We now subset the matrix m using ‘is_desc_pecfin’ and retain the first three taxon-related columns:
```{r}
to_keep <- c(TRUE, TRUE, TRUE, is_desc_pecfin)
pecfin_m <- ictalurus_m[, to_keep]
```
View this matrix of pectoral fin characters for *Ictalurus* by clicking on ‘pecfin_m’ in the Global Environment.
## 6. Retrieve matrix of anatomical dependencies
Knowing how the information about an anatomical structure may rely on other layers of anatomical information is critical, if the goal is to isolate and analyze independent features. We can query the KB for a presence-absence dependency matrix for a specified set of anatomical structures. The dependency matrix is derived from the formalized knowledge of relationships among entities encoded in the ontology. In the example below, the presence of ‘pelvic splint’ is dependent on the presence of ‘pelvic fin’. In other words, the pelvic fin is implied present when the pelvic splint is present because ‘pelvic splint’ is part_of ‘pelvic fin'.
```{r}
tl <- c("pectoral fin", "pelvic fin", "pelvic splint", "antorbital", "preopercle")
dep_mat <- pa_dep_matrix(tl, .names = "label")
dep_mat
```
In the PARAMO tutorial, we will learn how to use these dependency matrices to reconstruct individual traits and entire phenotypes.
## 7. Retrieve semantic similarity matrix
We can also obtain similarity metrics for phenotypes based on ontological distance. Three similarity metrics - Jaccard, Tanimoto, and Cosine - are computed by the KB to measure similarity between terms (see RPhenoscape documentation in the Help panel for how these are measured). For example, here we compute and visualize Jaccard similarity for premaxillary tooth phenotypes in Ictaluridae:
```{r}
phens <- get_phenotypes(entity = "premaxillary tooth", taxon = "Ictaluridae")
nrow(phens)
sm <- jaccard_similarity(terms = phens$id, .labels = phens$label, .colnames = "label")
plot(hclust(as.dist(1-sm)))
```
In the PARAMO tutorial, we'll use a semantic similarity matrix to visualize traits in a heat map. We are also developing methods that use semantic similarity to cluster phenotypes into characters and character states.
[OntoTrace]: http://kb.phenoscape.org/ontotrace
[Phenoscape KB]: http://kb.phenoscape.org/
[RPhenoscape]: http://rphenoscape.phenoscape.org/
[faceted browsing page]: https://kb.phenoscape.org/facet