scan_ag output #1

TS404 · 2018-05-16T02:50:55Z

The scan_ag and predict_hyp outputs are really nice.

It would also be good to have an output option with the same column names as "get_hmm" that simply lists the locations of the relevant prolines.

missuse · 2018-05-17T19:21:23Z

Thank you.

Could you elaborate a bit on this?

Do you mean something like a tidy data frame:

scan_ag - one row per P matched or one row per regex match?
From the perspective of plotting, I think it would be best if scan_ag had an output where each regex match is in one row, with columns: id, start, end.

predict_hyp - a trimmed prediction element where only predicted hydroxyprolines would be kept?
columns: id, location.

I could add an argument tidy = TRUE/FALSE to both functions, which could provide such an output.
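
To make the "one row per regex match" layout concrete, here is a minimal base R sketch. It is not how scan_ag works internally: the sequences, the ids, and the pattern are all made up for illustration, and the column names simply follow the id, start, end proposal above.

```r
# Hypothetical toy input; these sequences and ids are not from the package data.
sequences <- c(seq1 = "MAPAPSPTTTPAPA", seq2 = "MKLLV")

# Illustrative pattern only; this is NOT the regex used by scan_ag.
pattern <- "([AST]P)+"

# gregexpr() returns, per sequence, the match starts plus a
# "match.length" attribute (start is -1 when nothing matched).
matches <- gregexpr(pattern, sequences)

# Build one row per match; sequences without a match keep a single NA row.
tidy_matches <- do.call(rbind, lapply(seq_along(sequences), function(i) {
  m <- matches[[i]]
  if (m[1] == -1) {
    return(data.frame(id = names(sequences)[i],
                      start = NA_integer_,
                      end = NA_integer_,
                      stringsAsFactors = FALSE))
  }
  start <- as.integer(m)
  data.frame(id = names(sequences)[i],
             start = start,
             end = start + attr(m, "match.length") - 1L,
             stringsAsFactors = FALSE)
}))

print(tidy_matches)
```

Keeping an NA row for sequences without a match means every input id still appears in the result, which can be convenient when joining back to the original data.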

TS404 · 2018-05-17T23:29:07Z

Yes, that's the sort of thing. I think that tidy dataframes are the way to go. Something similar to this output:

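  # Per-sequence match locations from the list output
  agregions  <- scan_ag(sequence = sequences,
                        id = names(sequences),
                        dim = 3,
                        div = 6,
                        type = "extended", simplify = FALSE)$locations
  # Flatten the per-sequence location matrices into one two-column (start, end) matrix
  agregions2 <- matrix(unlist(lapply(agregions, t)), ncol = 2, byrow = TRUE)
  # Number of matches per sequence, used to repeat the ids below
  agregcount <- unlist(lapply(agregions, nrow))

  # One row per match: sequence index, id, start and end
  agregions3 <- data.frame(y           = rep(seq_along(sequences), agregcount),
                           id          = rep(names(sequences), agregcount),
                           align_start = agregions2[, 1],
                           align_end   = agregions2[, 2])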

missuse · 2018-05-21T22:05:02Z

scan_ag(sequence = at_nsp$sequence[c(1, 3, 16, 23)],
        id = at_nsp$Transcript.id[c(1, 3, 16, 23)],
        simplify = FALSE,
        tidy = TRUE)[,-1] #to omit the sequence column from showing here

output:


           id location.start location.end        P_pos length AG_aa
1 ATCG00660.1             NA           NA           NA     NA    NA
2 AT2G28410.1             26           41   27, 35, 40     16     8
3 AT2G28410.1             55           70   55, 67, 69     16     6
4 AT2G43620.1             62           76 63, 70, ....     15     8
5 AT2G43620.1            167          185 168, 176....     19     8
6 AT2G30933.1             36           51 37, 42, ....     16     9
7 AT2G30933.1             63           78 64, 68, ....     16     8

P_pos is a list column with P positions in the appropriate regex matches.
length is the length of the matched substring.
AG_aa is the number of amino acids that were identified in the matched substring.

No information is lost compared to the list output.

What do you think?
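
For anyone who wants a plain listing of proline positions from a list column like P_pos, here is a minimal base R sketch of expanding it to one row per proline. The data frame is rebuilt by hand from the first three rows of the output above; storing the no-match row as a single NA inside the list column is an assumption about the real return value, and `proline_long` is just an illustrative name.

```r
# Hand-built copy of the first three output rows shown above.
tidy <- data.frame(
  id = c("ATCG00660.1", "AT2G28410.1", "AT2G28410.1"),
  location.start = c(NA, 26L, 55L),
  location.end = c(NA, 41L, 70L),
  stringsAsFactors = FALSE
)

# List column: one integer vector of P positions per regex match.
# Assumption: a sequence without a match holds a single NA here.
tidy$P_pos <- list(NA_integer_, c(27L, 35L, 40L), c(55L, 67L, 69L))

# Repeat each id once per stored position, then flatten the list column.
n <- lengths(tidy$P_pos)
proline_long <- data.frame(
  id = rep(tidy$id, n),
  P_pos = unlist(tidy$P_pos),
  stringsAsFactors = FALSE
)

# Drop the placeholder row for the sequence without matches.
proline_long <- proline_long[!is.na(proline_long$P_pos), ]

print(proline_long)
```

The same reshaping can be done with tidyr::unnest() if the tidyverse is already in use; base R is used here only to keep the sketch free of dependencies.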

TS404 · 2018-05-21T23:56:11Z

Love it!

missuse · 2018-05-25T19:33:57Z

Added tidy output. Currently it is available when simplify = FALSE and tidy = TRUE are set in the function call.

missuse self-assigned this May 18, 2018

missuse closed this as completed May 25, 2018
