vignettes/Explore_InterProScan_profile.Rmd

---
title: "Explore InterProScan profile"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Explore InterProScan profile}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  fig.width=7, fig.height=7
)
```

First, load the rbims package.

```{r setup}
library(rbims)
```

# Example with PFAM database

First, I will read the InterProScan output in a long format and extract the PFAM abundance information.

If you want to follow this example, you can download the rbims [test](https://github.com/mirnavazquez/RbiMs/blob/main/inst/extdata/Interpro_test.tsv) file.

```{r, eval=FALSE}
interpro_pfam_profile<-read_interpro(data_interpro = "../inst/extdata/Interpro_test.tsv", database="Pfam", profile = FALSE)
```
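To make the difference between a long table and a wide profile concrete, here is a minimal base R sketch on toy data. The column names `Bin_name` and `PFAM` mirror those used later in this vignette, but the values are made up, and this is only an illustration of the two shapes, not how `read_interpro` builds its output internally.

```r
# Toy long-format table: one row per domain hit (hypothetical values).
toy_long <- data.frame(
  Bin_name = c("Bin_1", "Bin_1", "Bin_1", "Bin_2", "Bin_2", "Bin_3"),
  PFAM     = c("PF00001", "PF00001", "PF00002", "PF00001", "PF00003", "PF00002"),
  stringsAsFactors = FALSE
)

# Wide profile: rows are domains, columns are bins, cells are hit counts.
toy_profile <- as.data.frame.matrix(table(toy_long$PFAM, toy_long$Bin_name))
toy_profile
```

The long table keeps one observation per row, which is convenient for filtering and plotting, while the wide table is easier to read as an abundance matrix.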

You can use the [subsetting functions](https://mirnavazquez.github.io/RbiMs/articles/Explore_KEGG_profile.html#metabolism-subsetting-1) to create subsets of the InterPro table. Here, we will extract the most important PFAMs. Note that the input must be the long-format table (read with `profile = FALSE`), not the wide profile output of `read_interpro`.

The function [get_subset_pca](https://mirnavazquez.github.io/RbiMs/reference/get_subset_pca.html) runs a PCA on the data to find the PFAMs that explain most of the variation in the data.

```{r}
important_PFAMs<-get_subset_pca(tibble_rbims=interpro_pfam_profile, 
               cos2_val=0.95,
               analysis="PFAM")
```

```{r}
head(important_PFAMs)
```
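The idea of selecting features by a PCA threshold can be sketched with base R alone. The example below assumes that `cos2_val` acts as a cutoff on the squared cosine (cos2), a common measure of how well a variable is represented by the principal components, and that the first two components are used. Both are assumptions made for illustration; the toy matrix and the names `toy`, `PF_a` to `PF_d` are hypothetical, and `get_subset_pca` may be implemented differently.

```r
set.seed(1)

# Toy abundance matrix: 6 bins (rows) x 4 domains (columns), hypothetical values.
n_bins <- 6
driver <- rnorm(n_bins)
toy <- cbind(
  PF_a = driver + rnorm(n_bins, sd = 0.05),
  PF_b = -driver + rnorm(n_bins, sd = 0.05),
  PF_c = rnorm(n_bins),
  PF_d = rnorm(n_bins)
)

# PCA on centered and scaled variables.
pca <- prcomp(toy, center = TRUE, scale. = TRUE)

# Variable coordinates: loadings multiplied by each component's standard deviation.
coord <- sweep(pca$rotation, 2, pca$sdev, `*`)

# cos2: squared coordinates; for scaled data they sum to 1 per variable.
cos2 <- coord^2

# Quality of representation on the first two components (assumed choice).
cos2_12 <- rowSums(cos2[, 1:2])

# Keep the variables at or above the cutoff.
names(cos2_12)[cos2_12 >= 0.95]
```

A high cutoff such as 0.95 keeps only the features that are almost fully described by the retained components, so the resulting subset is usually much smaller than the input.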

## The distance argument

Let's plot the results.

[plot_heatmap](https://mirnavazquez.github.io/RbiMs/reference/plot_heatmap.html) can help explore the results. It supports two types of analysis. If we set the `distance` option to **TRUE**, the plot shows how the samples cluster based on their protein domains.

```{r}
plot_heatmap(important_PFAMs, y_axis=PFAM, analysis = "INTERPRO", distance = TRUE)
```

If we set it to **FALSE**, we observe the presence and absence of the domains across the genome samples.

```{r}
plot_heatmap(important_PFAMs, y_axis=PFAM, analysis = "INTERPRO", distance = FALSE)
```
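The two views can be related with a small base R sketch: a presence/absence matrix on one side, and pairwise distances between samples derived from it on the other. The toy counts, the Jaccard-type `"binary"` distance, and average-linkage clustering are all assumptions chosen for illustration; the distance and clustering that `plot_heatmap` applies may differ.

```r
# Toy count matrix: bins in rows, domains in columns (hypothetical values).
toy_counts <- matrix(
  c(2, 0, 1,
    3, 0, 0,
    0, 4, 1),
  nrow = 3, byrow = TRUE,
  dimnames = list(c("Bin_1", "Bin_2", "Bin_3"), c("PF_a", "PF_b", "PF_c"))
)

# Presence/absence view: 1 if the domain was found in the bin, 0 otherwise.
toy_binary <- (toy_counts > 0) * 1

# Distance view: pairwise dissimilarity between bins based on shared domains.
bin_dist <- dist(toy_binary, method = "binary")

# Hierarchical clustering of the bins from those distances.
bin_tree <- hclust(bin_dist, method = "average")
bin_tree$labels[bin_tree$order]
```

Bins that share most of their domains end up close together in the tree, which is the kind of pattern a distance heatmap is meant to reveal.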



We can also visualize using a bubble plot. 

```{r, message=FALSE, warning=FALSE}
plot_bubble(important_PFAMs, 
            y_axis=PFAM, 
            x_axis=Bin_name, 
            calc = "Binary",
            analysis = "INTERPRO", 
            data_experiment = metadata, 
            color_character = Clades)
```

# Example with INTERPRO database

First, I will read the InterProScan output in a long format and extract the INTERPRO abundance information.

```{r, eval=FALSE}
interpro_INTERPRO_profile<-read_interpro(data_interpro = "Interpro_test.tsv", database="INTERPRO", profile = FALSE)
```


```{r, eval=FALSE}
head(interpro_INTERPRO_profile)
```

We are going to look for the InterPro IDs that make up `DNA topoisomerase 1`. To do this, we will create a vector of the IDs associated with that enzyme.

```{r}
DNA_topoisomerase_1<-c("IPR013497", "IPR023406", "IPR013824")
```

With the function [get_subset_pathway](https://mirnavazquez.github.io/RbiMs/reference/get_subset_pathway.html) we can create a subset of the INTERPRO table.

```{r}
DNA_tipo_INTERPRO<-get_subset_pathway(interpro_INTERPRO_profile, type_of_interest_feature=INTERPRO,
                   interest_feature=DNA_topoisomerase_1)
```

```{r}
head(DNA_tipo_INTERPRO)
```
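Conceptually, this kind of subsetting keeps only the rows whose ID appears in the vector of interest. A minimal base R sketch on toy data is shown below. The table `toy_interpro`, its `Abundance` column, and the two non-target IDs are placeholders for illustration, and the sketch is not the implementation of `get_subset_pathway`.

```r
# IDs of interest, as defined earlier in this vignette.
DNA_topoisomerase_1 <- c("IPR013497", "IPR023406", "IPR013824")

# Toy long-format table (hypothetical bins and abundances).
toy_interpro <- data.frame(
  Bin_name  = c("Bin_1", "Bin_1", "Bin_2", "Bin_3"),
  INTERPRO  = c("IPR013497", "IPR000001", "IPR023406", "IPR000002"),
  Abundance = c(1, 3, 2, 1),
  stringsAsFactors = FALSE
)

# Keep only the rows whose INTERPRO ID is in the vector of interest.
toy_subset <- toy_interpro[toy_interpro$INTERPRO %in% DNA_topoisomerase_1, ]
toy_subset
```

IDs from the vector that never occur in the table, such as `IPR013824` in this toy case, simply produce no rows.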

We can create a bubble plot to visualize the distribution of these enzymes across the bins.

```{r}
plot_bubble(DNA_tipo_INTERPRO, 
            y_axis=INTERPRO,
            x_axis=Bin_name,
            calc = "Binary",
            analysis = "INTERPRO", 
            data_experiment = metadata, 
            color_character = Sample_site)
```

# Example with KEGG database

First, I will read the InterProScan output in a long format and extract the KEGG information. When you use the `KEGG` option, the profile option is disabled. 

```{r, eval=FALSE}
interpro_KEGG_long<-read_interpro(data_interpro = "Interpro_test.tsv", database="KEGG")
```

```{r}
head(interpro_KEGG_long)
```

## Mapping INTERPRO to KEGG database

We can use the [mapping_ko](https://mirnavazquez.github.io/RbiMs/reference/mapping_ko.html) function here to get the extended KEGG table.

```{r, eval=FALSE}
interpro_map<-mapping_ko(tibble_interpro = interpro_KEGG_long)
```


```{r}
head(interpro_map)
```
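Extending a KO table with higher-level annotation is, in essence, a join against a lookup table. The base R sketch below shows that idea on toy data. The column names `KO` and `Module` follow the ones used in this vignette, but the lookup table, the KO IDs, and the module labels are hypothetical, and `mapping_ko` may use a different reference and add more columns.

```r
# Toy KO hits per bin (hypothetical values).
toy_ko <- data.frame(
  Bin_name = c("Bin_1", "Bin_1", "Bin_2"),
  KO       = c("K00001", "K00002", "K00001"),
  stringsAsFactors = FALSE
)

# Toy lookup table linking each KO to a module (hypothetical labels).
toy_lookup <- data.frame(
  KO     = c("K00001", "K00002"),
  Module = c("Module_A", "Module_B"),
  stringsAsFactors = FALSE
)

# Left join: keep every hit and attach its module annotation.
toy_map <- merge(toy_ko, toy_lookup, by = "KO", all.x = TRUE)
toy_map
```

Using `all.x = TRUE` keeps hits whose KO has no entry in the lookup table, with `NA` in the annotation column, so no observation is silently dropped.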

We can plot all the KOs and the Modules to which they belong. Importantly, we set `analysis = "KEGG"` even though this workflow started from InterProScan output.

```{r}
plot_heatmap(tibble_ko=interpro_map,
             data_experiment = metadata,
             y_axis=KO,
             order_y = Module,
             order_x = Sample_site,
             split_y = TRUE,
             analysis = "KEGG",
             calc="Percentage")
```
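The `calc` argument changes how the counts are summarised before plotting. As a rough illustration, the base R sketch below contrasts a binary and a percentage transformation of a toy count matrix. It assumes that a percentage means each feature's share of the total counts within a bin; this is only one plausible reading, the matrix and its labels are hypothetical, and the definition used by rbims may differ.

```r
# Toy count matrix: features in rows, bins in columns (hypothetical values).
toy_ko_counts <- matrix(
  c(2, 0, 2,
    1, 3, 0),
  nrow = 3,
  dimnames = list(c("K_a", "K_b", "K_c"), c("Bin_1", "Bin_2"))
)

# Binary: 1 if the feature is present in the bin, 0 otherwise.
toy_ko_binary <- (toy_ko_counts > 0) * 1

# Percentage (assumed definition): share of each feature within its bin.
toy_ko_percent <- prop.table(toy_ko_counts, margin = 2) * 100
toy_ko_percent
```

The binary view hides differences in copy number, while the percentage view makes bins with different total counts comparable.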