Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Exploit gapfilled model for metabolic comparison #219

Open
mgabriell1 opened this issue May 31, 2024 · 5 comments
Open

Exploit gapfilled model for metabolic comparison #219

mgabriell1 opened this issue May 31, 2024 · 5 comments

Comments

@mgabriell1
Copy link

Hi,
First of all thanks for developing and maintaining Gapseq!
From what I've understood the *-Pathways.tbl file summarise the completeness of the MetaCYC pathways found in the genome analyzed. After that gapfilling is performed to (as said in the name ;)) fill the gaps potential arising from incompleteness or missed genes.
If someone would want to look at the diversity among different metabolic profiles (e.g., as done here: 10.1038/s43705-023-00221-z, incompleteness (either observed or hidden) might be skew the results.
Given this, I was thinking that creating another version of *-Pathways.tbl after gapfilling would address the issue as this in theory would take care of incompleteness in the pathways completion percentages.
Do you think this makes sense? If so, what would be the best method to do so?
Thanks again!

@jotech
Copy link
Owner

jotech commented May 31, 2024

Hi @mgabriell1

Thank you for your interest in gapseq!

What you are proposing is indeed an interesting application! My first guess would be to check out the reaction attributes table of the final model after gap-filling. Using that table, you can trace the origin of each reaction (added because of sequence homology, gap filling, etc.), which can then be linked back to the pathway level.
More details can be found here: https://gapseq.readthedocs.io/en/latest/tutorials/traceability.html

Let me know if this can work for you.

@mgabriell1
Copy link
Author

Hi,
Thanks for the quick reply!

That table seems to contain all the info that I need, but I am not completely sure on how to link it to meta metaCYC pathways. Using the ecoli.RDS test model I see in the reaction attribute table there are 3029 reactions which are distributed in 560 pathways, while in the all-Pathways.tbl file there are 2900 pathways. So in the model a lot are excluded (makes sense), but I would also like to know which pathways are not complete.

Looking at the source file I found the file meta_rea_pwy-gapseq.tbl which seems to contain the mapping between pathways and reactions which could be used to estimate the pathway completeness, but I'm having troubles linking that file with the reactions attribute table.
For example, \|PWY-7184\| (a random pwy) has a completeness of 88% in all-Pathways.tbl and the rxn DCDPKIN-RXN is among the ones found. However, reaction attribute table this rxn ID is not present in the column rxn, but rather in the column biocyc ID. So if I would use the column rxn I would not detect this. Conversely, column biocyc ID is often empty and so I cannot use that column as well. Other rxn ids are present in the column gapseq in meta_rea_pwy-gapseq.tbl, but often these are empty and so again not useful to map the two tables.
Given that the information seems to be linked across different columns, I'm not sure if I'm using the correct information. Could you confirm that this seems to be the right track?
If so, I guess I can check the different columns and then check whether any of them has a match.

Another question that I have is: gapseq estimates pwy completeness just by dividing the number of genes present over the total number of genes? I'm asking this to understand whether I wonder to consider "parallel" pwy routes or that has already be taken into account in metaCYC.

I hope this was clear enough.
Thanks again for the support!

@Waschina
Copy link
Collaborator

Waschina commented Jun 24, 2024

Hi @mgabriell1

I would not use the meta_rea_pwy-gapseq.tbl. Instead, the problem could be addressed for example like this:

library(sybil)
library(data.table)

# read model data
mod_filled <- readRDS("yogurt/ldel.RDS")
mod_draft  <- readRDS("yogurt/ldel-draft.RDS")
gs_findR   <- fread("yogurt/ldel-all-Reactions.tbl")
gs_findP   <- fread("yogurt/ldel-all-Pathways.tbl")

# read and prepare pathway DB
pwyDB <- rbind(fread("~/Software/gapseq/dat/meta_pwy.tbl"),
               fread("~/Software/gapseq/dat/custom_pwy.tbl"))
pwyDB <- pwyDB[!duplicated(id)]
pwyDB <- pwyDB[, .(id, name, spont, reaId)]
pwyDB <- pwyDB[id %in% gs_findP$ID]
pwyDB[, reaNr := length(unlist(strsplit(reaId, ","))), by = id]
pwyDB[, spontNr := length(unlist(strsplit(spont, ","))), by = id]

# identify which reactions were added during gapfilling
rxnGF <- mod_filled@react_id[!(mod_filled@react_id %in% mod_draft@react_id)]
rxnGF <- rxnGF[!grepl("^EX_",rxnGF)] # exclude exchange reactions
rxnGF <- gsub("_c0$","",rxnGF)

# get BioCyc-IDs of added reactions
newBC <- lapply(rxnGF, function(x) gs_findR[grepl(x, dbhit), rxn])
newBC <- unique(unlist(newBC))

# get Pathways, in which die newBC participate
newBC_pwys <- lapply(newBC, function(x) gs_findR[rxn == x, unique(pathway)])
names(newBC_pwys) <- newBC

# add new reaction to pathway table
gs_findP$newReactionsFound <- ""
for(rxni in newBC) {
  for(pwyi in newBC_pwys[[rxni]]) {
    gs_findP[ID == pwyi, newReactionsFound := paste0(newReactionsFound,rxni, sep = " ")]
  }
}
gs_findP[, newReactionsFound := gsub("^ | $","",newReactionsFound)] # remove trailing spaces

# merge and recalc completeness
gs_findP <- merge(gs_findP, pwyDB, by.x = "ID", by.y = "id")
gs_findP[,Nold := length(unlist(strsplit(ReactionsFound, " "))), by = "ID"]
gs_findP[,Nnew := length(unlist(strsplit(newReactionsFound, " "))), by = "ID"]

gs_findP[,C_old := (Nold + VagueReactions)/(reaNr - spontNr)*100]
gs_findP[,C_new := (Nold + Nnew + VagueReactions)/(reaNr - spontNr)*100]

I hope the code comments make it clear what happens at each step. The examples use gapseq data from here: https://github.com/Waschina/gapseq.tutorial.data/tree/master/yogurt

I noticed that there seem to be some inconsistencies in the way the pathway completeness is calculated in the ...-all-Pathways.tbl. I'll investigate this.

@Waschina
Copy link
Collaborator

Small addition: With the example above, I noticed that there is a small inconsistency in how the completeness reported in the ...-all-pathways.tbl is calculated. See issue #216 . We are working on a fix

@mgabriell1
Copy link
Author

Hi Silvio,
Thanks a lot for the code!
I've tried it with the tutorial file and one of my genomes (attached testData.zip) and I've noticed that a few pathways have a new completeness estimation above 100%. I found a couple of reasons while that was the case:

  • The spontaneous reactions were included both in the number of newly added reactions by gapfilling and while "discounting" them for pathway completeness estimation (as you wrote here: Possible inconsistencies in pathway completeness percentages. #216). My solution was to remove them from the count of newly added reactions during completeness estimation
  • The other issue was that in a few cases the column newReactionsFound included some of the ones present in ReactionsFound. I haven't understood completely why (maybe some mismatch caused by trailing whitespaces?), but I've implemented a loop to remove these duplicated reactions
  • Finally, I've noticed that some reactions attributed in mod_filled@react_attr and thus in ReactionsFound were not included in pwyDB (e..g., "ACETOACETATE-DECARBOXYLASE-RXN" is found in "|PWY-5451|" even though this is not found here (https://metacyc.org/pathway?orgid=META&id=PWY-5451) or in pwyDB). I've opted to remove those from ReactionsFound, but this might indicate that there are other issues in other parts of the pipeline

With these edits I've managed to reduce the number of pathways more than 100% complete, but in few cases this still occurred. All these pathways present vague reactions and their number (derived from -all-Pathways.tbl) is higher than the number of reactions in pwyDB (e.g., "|PWY-7695|" is indicated with 12 vague reactions, but only 7 reactions in pwyDB). Could it be an issue with the reactions database?

Here is the edited code (not the prettiest, but it seems to do the job):



###
library(sybil)
library(data.table)

# read model data
# mod_filled <- readRDS("/Users/gabriema/PostdocUMIK/Scripts/gapseq/gapseq.tutorial.data-master/yogurt/ldel.RDS")
# mod_draft  <- readRDS("/Users/gabriema/PostdocUMIK/Scripts/gapseq/gapseq.tutorial.data-master/yogurt/ldel-draft.RDS")
# gs_findR   <- fread("/Users/gabriema/PostdocUMIK/Scripts/gapseq/gapseq.tutorial.data-master/yogurt/ldel-all-Reactions.tbl")
# gs_findP   <- fread("/Users/gabriema/PostdocUMIK/Scripts/gapseq/gapseq.tutorial.data-master/yogurt/ldel-all-Pathways.tbl")

mod_filled <- readRDS("/Users/gabriema/PostdocUMIK/Activities/Legionome/Data/GSMM/gapseq_models/Legio00005_manualCheck-contigs.RDS")
mod_draft  <- readRDS("/Users/gabriema/PostdocUMIK/Activities/Legionome/Data/GSMM/gapseq_models/model_drafts/Legio00005_manualCheck-contigs-draft.RDS")
gs_findR   <- fread("/Users/gabriema/PostdocUMIK/Activities/Legionome/Data/GSMM/gapseq_reactions/Legio00005_manualCheck-contigs-all-Reactions.tbl.gz")
gs_findP   <- fread("/Users/gabriema/PostdocUMIK/Activities/Legionome/Data/GSMM/gapseq_pathways/Legio00005_manualCheck-contigs-all-Pathways.tbl")

# read and prepare pathway DB
pwyDB <- rbind(fread("/Users/gabriema/PostdocUMIK/Scripts/gapseq/dat/meta_pwy.tbl"),
               fread("/Users/gabriema/PostdocUMIK/Scripts/gapseq/dat/custom_pwy.tbl"))
pwyDB <- pwyDB[!duplicated(id)]
pwyDB <- pwyDB[, .(id, name, spont, reaId)]
pwyDB <- pwyDB[id %in% gs_findP$ID]
pwyDB[, reaNr := length(unlist(strsplit(reaId, ","))), by = id]
pwyDB[, spontNr := length(unlist(strsplit(spont, ","))), by = id]

# identify which reactions were added during gapfilling
rxnGF <- mod_filled@react_id[!(mod_filled@react_id %in% mod_draft@react_id)]
rxnGF <- rxnGF[!grepl("^EX_",rxnGF)] # exclude exchange reactions
rxnGF <- gsub("_c0$","",rxnGF)

# get BioCyc-IDs of added reactions
newBC <- lapply(rxnGF, function(x) gs_findR[grepl(x, dbhit), rxn])
newBC <- unique(unlist(newBC))

# get Pathways, in which die newBC participate
newBC_pwys <- lapply(newBC, function(x) gs_findR[rxn == x, unique(pathway)])
names(newBC_pwys) <- newBC

# add new reaction to pathway table
gs_findP$newReactionsFound <- ""
for(rxni in newBC) {
  for(pwyi in newBC_pwys[[rxni]]) {
    gs_findP[ID == pwyi, newReactionsFound := paste0(newReactionsFound,rxni, sep = " ")]
  }
}
gs_findP[, newReactionsFound := gsub("^ | $","",newReactionsFound)] # remove trailing spaces

# # Remove duplicated entries and ones not present in pwyDB
for (i in 1:nrow(gs_findP)){
  #print(c(i, gs_findP[i, "newReactionsFound"]))

  newReacts_tmp <- unlist(strsplit(as.character(gs_findP[i, "newReactionsFound"]), " "))
  oldReacts_tmp <- unlist(strsplit(as.character(gs_findP[i, "ReactionsFound"]), " "))
  gs_findP[i, "newReactionsFound"] <- paste0(newReacts_tmp[!(newReacts_tmp %in% oldReacts_tmp)], collapse  = " ")
  
  allReacts_tmp <- unlist(strsplit(pwyDB[ id == as.character(gs_findP[i, ID]), reaId], ","))
  gs_findP[i, "ReactionsFound"] <- paste0(oldReacts_tmp[(oldReacts_tmp %in% allReacts_tmp)], collapse  = " ")
}

# merge and recalc completeness
gs_findP <- merge(gs_findP, pwyDB, by.x = "ID", by.y = "id")
gs_findP[,Nold := length(unlist(strsplit(ReactionsFound, " "))), by = "ID"]
gs_findP[,Nnew := length(unlist(strsplit(newReactionsFound, " "))), by = "ID"]
gs_findP[,Nnew_spont := length(Reduce(intersect, list(unlist(strsplit(newReactionsFound, " ")), unlist(strsplit(spont, ","))))), by = "ID"]

gs_findP[,C_old := (Nold + VagueReactions)/(reaNr - spontNr)*100]
gs_findP[,C_new := (Nold + Nnew + VagueReactions)/(reaNr - spontNr)*100]

gs_findP[,diff_C := C_new - C_old]
gs_findP[,C_new_corr := (Nold + Nnew - Nnew_spont + VagueReactions)/(reaNr - spontNr)*100] # Remove new spontaneous reactions


Thanks again for the help!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants