Exploit gapfilled model for metabolic comparison #219
Comments
Hi @mgabriell1 Thank you for your interest in gapseq! What you are proposing is indeed an interesting application! My first guess would be to check out the reaction attributes table of the final model after gap-filling. Using that table, you can trace the origin of each reaction (added because of sequence homology, gap filling, etc.), which can then be linked back to the pathway level. Let me know if this can work for you.
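As a rough illustration of that idea, the sketch below tabulates reactions by origin and pathway on a toy table. The column names `origin` and `pathway`, the origin labels, and all IDs are placeholders invented for this example, not the actual columns or codes of a gapseq model. In a sybil model object the attributes table is usually reachable as `mod@react_attr`, and its real columns should be inspected before adapting this.

```r
# Toy stand-in for a model's reaction attributes table.
# Column names, origin labels and IDs are hypothetical placeholders.
react_attr <- data.frame(
  react_id = c("rxn00001_c0", "rxn00002_c0", "rxn00003_c0", "rxn00004_c0"),
  origin   = c("homology", "gapfill", "homology", "gapfill"),
  pathway  = c("PWY-A", "PWY-A", "PWY-B", "PWY-B"),
  stringsAsFactors = FALSE
)

# Count reactions per pathway, split by how they entered the model
origin_by_pwy <- table(react_attr$pathway, react_attr$origin)
print(origin_by_pwy)

# Reactions that were only added by gap filling
gapfilled <- react_attr$react_id[react_attr$origin == "gapfill"]
print(gapfilled)
```

A cross-table like this gives a quick view of which pathways rely mostly on gap-filled reactions rather than on sequence evidence.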
Hi, That table seems to contain all the info that I need, but I am not completely sure how to link it to MetaCyc pathways. Using the Looking at the source file I found the file Another question that I have: does gapseq estimate pathway completeness just by dividing the number of genes present by the total number of genes? I'm asking to understand whether I need to consider "parallel" pathway routes or whether that has already been taken into account in MetaCyc. I hope this was clear enough.
Hi @mgabriell1 I would not use the
library(sybil)
library(data.table)
# read model data
mod_filled <- readRDS("yogurt/ldel.RDS")
mod_draft <- readRDS("yogurt/ldel-draft.RDS")
gs_findR <- fread("yogurt/ldel-all-Reactions.tbl")
gs_findP <- fread("yogurt/ldel-all-Pathways.tbl")
# read and prepare pathway DB
pwyDB <- rbind(fread("~/Software/gapseq/dat/meta_pwy.tbl"),
fread("~/Software/gapseq/dat/custom_pwy.tbl"))
pwyDB <- pwyDB[!duplicated(id)]
pwyDB <- pwyDB[, .(id, name, spont, reaId)]
pwyDB <- pwyDB[id %in% gs_findP$ID]
pwyDB[, reaNr := length(unlist(strsplit(reaId, ","))), by = id]
pwyDB[, spontNr := length(unlist(strsplit(spont, ","))), by = id]
# identify which reactions were added during gapfilling
rxnGF <- mod_filled@react_id[!(mod_filled@react_id %in% mod_draft@react_id)]
rxnGF <- rxnGF[!grepl("^EX_",rxnGF)] # exclude exchange reactions
rxnGF <- gsub("_c0$","",rxnGF)
# get BioCyc-IDs of added reactions
newBC <- lapply(rxnGF, function(x) gs_findR[grepl(x, dbhit), rxn])
newBC <- unique(unlist(newBC))
# get the pathways in which the newBC reactions participate
newBC_pwys <- lapply(newBC, function(x) gs_findR[rxn == x, unique(pathway)])
names(newBC_pwys) <- newBC
# add new reaction to pathway table
gs_findP$newReactionsFound <- ""
for(rxni in newBC) {
  for(pwyi in newBC_pwys[[rxni]]) {
    # paste0() has no 'sep' argument, so the separating space is appended explicitly
    gs_findP[ID == pwyi, newReactionsFound := paste0(newReactionsFound, rxni, " ")]
  }
}
gs_findP[, newReactionsFound := gsub("^ | $","",newReactionsFound)] # remove leading/trailing spaces
# merge and recalc completeness
gs_findP <- merge(gs_findP, pwyDB, by.x = "ID", by.y = "id")
gs_findP[,Nold := length(unlist(strsplit(ReactionsFound, " "))), by = "ID"]
gs_findP[,Nnew := length(unlist(strsplit(newReactionsFound, " "))), by = "ID"]
gs_findP[,C_old := (Nold + VagueReactions)/(reaNr - spontNr)*100]
gs_findP[,C_new := (Nold + Nnew + VagueReactions)/(reaNr - spontNr)*100]
I hope the code comments make it clear what happens at each step. The examples use gapseq data from here: https://github.com/Waschina/gapseq.tutorial.data/tree/master/yogurt
I noticed that there seem to be some inconsistencies in the way the pathway completeness is calculated in the
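For readers without the tutorial data at hand, here is a minimal base-R sketch of the first steps of the script above (finding reactions added by gap filling and mapping them to pathways). All reaction IDs, pathway IDs and the toy `find_r` table are made up for illustration. Unlike the script, which uses `grepl()`, it assumes a space-separated `dbhit` column and matches whole IDs.

```r
# Hypothetical reaction IDs of a draft and a gap-filled model
draft_ids  <- c("rxn00001_c0", "rxn00002_c0", "EX_cpd00027_e0")
filled_ids <- c("rxn00001_c0", "rxn00002_c0", "rxn00003_c0",
                "EX_cpd00027_e0", "EX_cpd00001_e0")

# Reactions present only after gap filling
rxn_gf <- setdiff(filled_ids, draft_ids)
# Drop exchange reactions and strip the compartment suffix
rxn_gf <- rxn_gf[!grepl("^EX_", rxn_gf)]
rxn_gf <- sub("_c0$", "", rxn_gf)

# Toy stand-in for the *-all-Reactions.tbl table (hypothetical content)
find_r <- data.frame(
  rxn     = c("RXN-A", "RXN-B"),
  dbhit   = c("rxn00001 rxn00009", "rxn00003"),
  pathway = c("PWY-A", "PWY-B"),
  stringsAsFactors = FALSE
)

# Match whole IDs in the space-separated dbhit field
hit <- vapply(strsplit(find_r$dbhit, " "),
              function(ids) any(ids %in% rxn_gf), logical(1))
new_pwys <- unique(find_r$pathway[hit])
print(new_pwys)
```

Matching whole tokens avoids accidental substring hits, at the cost of depending on the separator used in `dbhit`.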
Small addition: With the example above, I noticed that there is a small inconsistency in how the completeness reported in the
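To make the completeness formula from the script concrete, the following sketch applies it to a toy table and flags pathways that end up above 100 %. All numbers are invented. The third pathway is deliberately built so that a vague reaction is also counted as newly found, which is only one conceivable way such values could arise, not a confirmed cause.

```r
# Toy pathway table; all numbers are made up for illustration
pwy <- data.frame(
  ID             = c("PWY-A", "PWY-B", "PWY-C"),
  Nold           = c(3, 2, 4),  # reactions found by the sequence search
  Nnew           = c(1, 0, 1),  # reactions added by gap filling
  VagueReactions = c(0, 1, 1),  # reactions counted as vague
  reaNr          = c(5, 4, 5),  # total reactions in the pathway
  spontNr        = c(0, 0, 0),  # spontaneous reactions
  stringsAsFactors = FALSE
)

# Same formula as C_old / C_new in the script above
denom <- pwy$reaNr - pwy$spontNr
pwy$C_old <- (pwy$Nold + pwy$VagueReactions) / denom * 100
pwy$C_new <- (pwy$Nold + pwy$Nnew + pwy$VagueReactions) / denom * 100

# Flag pathways whose recalculated completeness exceeds 100 %
over <- pwy$ID[pwy$C_new > 100]
print(pwy[, c("ID", "C_old", "C_new")])
print(over)
```

A check like `C_new > 100` is a cheap sanity test for double counting before the recalculated table is used in downstream comparisons.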
Hi Silvio,
With these edits I've managed to reduce the number of pathways that are more than 100% complete, but in a few cases this still occurred. All these pathways contain vague reactions and their number (derived from Here is the edited code (not the prettiest, but it seems to do the job):
Thanks again for the help!
Hi,
First of all, thanks for developing and maintaining gapseq!
From what I've understood, the *-Pathways.tbl file summarises the completeness of the MetaCyc pathways found in the analysed genome. After that, gap filling is performed to (as the name says ;)) fill the gaps potentially arising from incompleteness or missed genes.
If someone wanted to look at the diversity among different metabolic profiles (e.g., as done here: 10.1038/s43705-023-00221-z), incompleteness (either observed or hidden) might skew the results.
Given this, I was thinking that creating another version of *-Pathways.tbl after gap filling would address the issue, as this would in theory account for incompleteness in the pathway completeness percentages.
Do you think this makes sense? If so, what would be the best method to do so?
Thanks again!