Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Efficient way to compare 2 sets of multiple sequences #34

Closed
seb-mueller opened this issue Sep 30, 2023 · 4 comments
Closed

Efficient way to compare 2 sets of multiple sequences #34

seb-mueller opened this issue Sep 30, 2023 · 4 comments
Assignees
Labels

Comments

@seb-mueller
Copy link
Contributor

Really nice package, I've one question though.

Is there are parallel version of to compare to sets of sequences?
I know there is twoSeqSim to compare 2 sequences and parSeqSim which takes one set of sequences and compares them pairwise. However if I have a 2 sets, say set A with seq1 and seq2, as well as set B with seq3 and seq4. I want to compare each member of set A with each member of set B, but don't want to compare within the 2 sets.
Is there an efficient (also using parallel execution) way to do so without having to loop using twoSeqSim?

@nanxstats
Copy link
Owner

@seb-mueller Great question. You can probably reuse protr:::.seqPairSim() which is the backbone of parSeqSim(). This is a function designed to compute the similarity between two protein sequences. It takes in the indices of two sequences, the list of sequences, and parameters for alignment. The output is a similarity score.

Here is a simple code sketch - I haven't validated its correctness but you can easily verify if it works as expected:

crossSetSim <- function(
    set1, set2,
    type = "local",
    submat = "BLOSUM62",
    gap.opening = 10,
    gap.extension = 4) {
  combinations <- expand.grid(seq_along(set1), seq_along(set2))

  results <- foreach(
    i = seq_len(nrow(combinations)),
    .combine = c,
    .packages = c("Biostrings")
  ) %dopar% {
    idx1 <- combinations[i, 1]
    idx2 <- combinations[i, 2]

    protr:::.seqPairSim(
      c(idx1, idx2),
      c(set1, set2),
      type,
      submat,
      gap.opening,
      gap.extension
    )
  }

  matrix(
    results,
    nrow = length(set1),
    ncol = length(set2),
    byrow = TRUE
  )
}

Example:

library(protr)
library(foreach)
library(doParallel)

doParallel::registerDoParallel(2)

s1 <- readFASTA(system.file("protseq/P00750.fasta", package = "protr"))[[1]]
s2 <- readFASTA(system.file("protseq/P08218.fasta", package = "protr"))[[1]]
s3 <- readFASTA(system.file("protseq/P10323.fasta", package = "protr"))[[1]]
s4 <- readFASTA(system.file("protseq/P20160.fasta", package = "protr"))[[1]]
s5 <- readFASTA(system.file("protseq/Q9NZP8.fasta", package = "protr"))[[1]]

set.seed(42)
plist1 <- as.list(c(s1, s2, s3, s4, s5)[sample(1:5, 10, replace = TRUE)])
plist2 <- as.list(c(s1, s2, s3, s4, s5)[sample(1:5, 20, replace = TRUE)])

psimmat <- crossSetSim(plist1, plist2)

@nanxstats nanxstats self-assigned this Oct 2, 2023
@seb-mueller
Copy link
Contributor Author

Hi @nanxstats , thanks so much for you quick response and the great suggestion.
I've tested your solution, and whilst it did return numbers, there was something off. After some debugging, I suppose there was only one minor mistake:
Instead of
c(idx1, idx2),
it needs to be:
c(idx1, idx2 + length(set1)),

The index of set2 needs to be offset by the length of set1 to make it work in the combined vector. I'll have to run some more test, but it should be ok like this.
Maybe this can be even added into the package since I believe this is a common usecase? Happy to create a PR if so.

@nanxstats
Copy link
Owner

@seb-mueller Perfect. I agree that this will be a very useful feature. Please send a PR when you are ready. (Also, feel free to add yourself as ctb to DESCRIPTION.)

nanxstats added a commit that referenced this issue Oct 30, 2023
implemented pairwise protlist function (fixes #34)
@nanxstats
Copy link
Owner

This feature has been shipped with protr 1.7-0, now on CRAN: https://cran.r-project.org/package=protr

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

2 participants