---
title: "Explore InterProScan profile"
output: rmarkdown::html_vignette
vignette: >
%\VignetteIndexEntry{Explore InterProScan profile}
%\VignetteEngine{knitr::rmarkdown}
%\VignetteEncoding{UTF-8}
---
```{r, include = FALSE}
knitr::opts_chunk$set(
collapse = TRUE,
comment = "#>",
fig.width=7, fig.height=7
)
```
First, load the rbims package.
```{r setup}
library(rbims)
```
# Example with PFAM database
First, I will read the InterProScan output in long format and extract the PFAM abundance information.
If you want to follow this example, you can download the rbims [test](https://github.com/mirnavazquez/RbiMs/blob/main/inst/extdata/Interpro_test.tsv) file.
```{r, eval=FALSE}
interpro_pfam_long <- read_interpro(data_interpro = "../inst/extdata/Interpro_test.tsv",
                                    database = "Pfam",
                                    profile = FALSE)
```
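To make the difference between a long table and a profile concrete, here is a minimal base R sketch on invented data. It assumes that a profile is a PFAM-by-bin count table; the column names `Bin_name` and `PFAM` are borrowed from the plotting calls in this vignette and may not match the exact columns that `read_interpro` returns.

```r
# Hypothetical long table: one row per domain hit in a bin (invented data).
long <- data.frame(
  Bin_name = c("Bin_1", "Bin_1", "Bin_2", "Bin_3", "Bin_3", "Bin_3"),
  PFAM     = c("PF00001", "PF00002", "PF00001", "PF00002", "PF00002", "PF00003"),
  stringsAsFactors = FALSE
)

# Count the hits per PFAM and bin, then spread the bins into columns.
# The result is a wide table with one row per PFAM and one column per bin.
wide <- as.data.frame.matrix(table(long$PFAM, long$Bin_name))

wide
```

This only illustrates the reshaping idea; it is not how rbims builds its tables internally.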
You can use the [subsetting functions](https://mirnavazquez.github.io/RbiMs/articles/Explore_KEGG_profile.html#metabolism-subsetting-1) to create subsets of the InterPro profile table. Here, we will extract the most important PFAMs; the plots that follow take this subset as input, not the profile output from `read_interpro`.
The function [get_subset_pca](https://mirnavazquez.github.io/RbiMs/reference/get_subset_pca.html) runs a PCA on the data to find the PFAMs that explain most of the variation within the data.
```{r}
important_PFAMs <- get_subset_pca(tibble_rbims = interpro_pfam_profile,
                                  cos2_val = 0.95,
                                  analysis = "PFAM")
```
```{r}
head(important_PFAMs)
```
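The idea of choosing features by how well a PCA represents them can be sketched in base R. This is not the rbims implementation: it assumes that `cos2_val` acts as a cutoff on the squared cosine (quality of representation) of each feature, and that the first two components are the ones of interest. The toy table and PFAM IDs are invented.

```r
# Toy abundance table (invented): rows are bins, columns are PFAM domains.
toy <- data.frame(
  PF00001 = c(1, 2, 3, 4, 5),
  PF00002 = c(2, 4, 6, 8, 10),
  PF00003 = c(5, 3, 4, 1, 2),
  PF00004 = c(1, 0, 1, 0, 1),
  row.names = paste0("Bin_", 1:5)
)

# PCA on centered and scaled data, with bins as observations.
pca <- prcomp(toy, center = TRUE, scale. = TRUE)

# Variable coordinates: loadings multiplied by each component's standard deviation.
var_coord <- sweep(pca$rotation, 2, pca$sdev, "*")

# cos2: squared coordinates. On scaled data they sum to 1 per variable
# across all components.
cos2 <- var_coord^2

# Quality of representation on the first two components (assumed choice).
cos2_dim12 <- rowSums(cos2[, 1:2])

# Keep the features at or above the cutoff (assumed meaning of cos2_val).
threshold <- 0.95
important <- names(cos2_dim12)[cos2_dim12 >= threshold]

round(cos2_dim12, 3)
important
```

The [get_subset_pca](https://mirnavazquez.github.io/RbiMs/reference/get_subset_pca.html) reference describes how the package itself applies `cos2_val`.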
## The distance argument
Let's plot the results.
[plot_heatmap](https://mirnavazquez.github.io/RbiMs/reference/plot_heatmap.html) can help explore the results. It supports two types of analysis. If we set the `distance` option to **TRUE**, the plot shows how the samples cluster based on their protein domains.
```{r}
plot_heatmap(important_PFAMs, y_axis = PFAM, analysis = "INTERPRO", distance = TRUE)
```
If we set it to **FALSE**, the plot shows the presence and absence of the domains across the genome samples.
```{r}
plot_heatmap(important_PFAMs, y_axis = PFAM, analysis = "INTERPRO", distance = FALSE)
```
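The two views can be illustrated with base R on an invented abundance matrix. This sketch assumes a binary (Jaccard-type) distance between bins and average-linkage clustering; the metric and clustering method that `plot_heatmap` actually uses may differ.

```r
# Toy abundance matrix (invented): rows are PFAM domains, columns are bins.
abund <- matrix(
  c(3, 0, 2,
    0, 1, 0,
    5, 0, 4,
    0, 2, 1),
  nrow = 4, byrow = TRUE,
  dimnames = list(c("PF00001", "PF00002", "PF00003", "PF00004"),
                  c("Bin_1", "Bin_2", "Bin_3"))
)

# Presence/absence view: 1 if the domain occurs in the bin, 0 otherwise.
presence <- (abund > 0) * 1L

# Distance view: pairwise dissimilarity between bins (columns), computed
# on the presence/absence profile with the "binary" method of dist().
sample_dist <- dist(t(presence), method = "binary")

# Hierarchical clustering of the bins based on that distance.
clust <- hclust(sample_dist, method = "average")

presence
round(as.matrix(sample_dist), 2)
```

Here `Bin_1` and `Bin_3` share two of the three domains found in either bin, so they sit closest together, while `Bin_1` and `Bin_2` share none.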
We can also visualize using a bubble plot.
```{r, message=FALSE, warning=FALSE}
plot_bubble(important_PFAMs,
            y_axis = PFAM,
            x_axis = Bin_name,
            calc = "Binary",
            analysis = "INTERPRO",
            data_experiment = metadata,
            color_character = Clades)
```
# Example with INTERPRO database
First, I will read the InterProScan output in a wide format and extract the INTERPRO abundance information.
```{r, eval=FALSE}
interpro_INTERPRO_profile <- read_interpro(data_interpro = "Interpro_test.tsv",
                                           database = "INTERPRO",
                                           profile = FALSE)
```
```{r, eval=FALSE}
head(interpro_INTERPRO_profile)
```
We are going to look for the InterPro IDs that make up `DNA topoisomerase 1`. To do this, we will create a vector of the IDs associated with that enzyme.
```{r}
DNA_topoisomerase_1<-c("IPR013497", "IPR023406", "IPR013824")
```
With the function [get_subset_pathway](https://mirnavazquez.github.io/RbiMs/reference/get_subset_pathway.html) we can create a subset of the INTERPRO table.
```{r}
DNA_tipo_INTERPRO <- get_subset_pathway(interpro_INTERPRO_profile,
                                        type_of_interest_feature = INTERPRO,
                                        interest_feature = DNA_topoisomerase_1)
```
```{r}
head(DNA_tipo_INTERPRO)
```
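Conceptually, subsetting by a vector of IDs is a row filter. The base R sketch below shows the idea on an invented table; it assumes a column named `INTERPRO` holding the IDs and does not reproduce the internals of `get_subset_pathway`.

```r
# IDs of interest, as defined earlier in this section.
DNA_topoisomerase_1 <- c("IPR013497", "IPR023406", "IPR013824")

# Toy table (invented counts): one row per InterPro ID, one column per bin.
# IPR000001 and IPR000002 stand in for unrelated entries.
toy_interpro <- data.frame(
  INTERPRO = c("IPR013497", "IPR000001", "IPR023406", "IPR013824", "IPR000002"),
  Bin_1    = c(1, 4, 0, 2, 1),
  Bin_2    = c(0, 3, 1, 1, 0),
  stringsAsFactors = FALSE
)

# Keep only the rows whose ID is in the vector of interest.
toy_subset <- toy_interpro[toy_interpro$INTERPRO %in% DNA_topoisomerase_1, ]

toy_subset
```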
We can create a bubble plot to visualize the distribution of these enzymes across the bins.
```{r}
plot_bubble(DNA_tipo_INTERPRO,
y_axis=INTERPRO,
x_axis=Bin_name,
calc = "Binary",
analysis = "INTERPRO",
data_experiment = metadata,
color_character = Sample_site)
```
# Example with KEGG database
First, I will read the InterProScan output in a long format and extract the KEGG information. When you use the `KEGG` option, the profile option is disabled.
```{r, eval=FALSE}
interpro_KEGG_long<-read_interpro(data_interpro = "Interpro_test.tsv", database="KEGG")
```
```{r}
head(interpro_KEGG_long)
```
## Mapping INTERPRO to KEGG database
We can use the [mapping_ko](https://mirnavazquez.github.io/RbiMs/reference/mapping_ko.html) function here to get the extended KEGG table.
```{r, eval=FALSE}
interpro_map<-mapping_ko(tibble_interpro = interpro_KEGG_long)
```
```{r}
head(interpro_map)
```
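One way to picture an "extended" table is as a join between the KO hits and a lookup table that annotates each KO. The sketch below uses base R `merge()` with invented KO IDs and module labels; whether `mapping_ko` works this way internally is an assumption, and the lookup table here is purely hypothetical.

```r
# Toy long table of KO hits per bin (invented data).
ko_long <- data.frame(
  Bin_name = c("Bin_1", "Bin_1", "Bin_2"),
  KO       = c("K00001", "K00002", "K00001"),
  stringsAsFactors = FALSE
)

# Hypothetical lookup table: KO -> Module (labels invented for illustration).
ko_lookup <- data.frame(
  KO     = c("K00001", "K00002"),
  Module = c("Module_A", "Module_B"),
  stringsAsFactors = FALSE
)

# Left join: keep every hit and attach its annotation where one is available.
ko_extended <- merge(ko_long, ko_lookup, by = "KO", all.x = TRUE)

ko_extended
```

With `all.x = TRUE`, hits without a matching annotation are kept and get `NA` in the `Module` column.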
We can plot all the KOs and the Modules to which they belong. Note that we set `analysis = "KEGG"` here, even though this workflow started from the InterProScan output.
```{r}
plot_heatmap(tibble_ko = interpro_map,
             data_experiment = metadata,
             y_axis = KO,
             order_y = Module,
             order_x = Sample_site,
             split_y = TRUE,
             analysis = "KEGG",
             calc = "Percentage")
```