scan_ag output #1

Closed
TS404 opened this issue May 16, 2018 · 5 comments

TS404 commented May 16, 2018

The scan_ag and predict_hyp outputs are really nice.

It would also be good to have an output option with the same column names as "get_hmm" that simply lists the locations of the relevant prolines.

Owner

missuse commented May 17, 2018

Thank you.

Could you elaborate a bit on this?

Do you mean something like a tidy data frame:

scan_ag - one row per P matched, or one row per regex match?
From a plotting perspective, I think it would be best if scan_ag had an output where each regex match is in one row, with columns: id, start, end.

predict_hyp - a trimmed prediction element where only the predicted hydroxyprolines are kept?
Columns: id, location.

I could add an argument tidy = TRUE/FALSE to both functions to provide such an output.

Author

TS404 commented May 17, 2018

Yes, that's the sort of thing. I think that tidy dataframes are the way to go. Something similar to this output:

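  # `sequences` is assumed to be a named character vector of protein sequences
  agregions  <- scan_ag(sequence = sequences,
                        id = names(sequences),
                        dim = 3,
                        div = 6,
                        type = "extended", simplify = FALSE)$locations
  # stack the per-sequence (start, end) matrices into one two-column matrix
  agregions2 <- matrix(unlist(lapply(agregions, t)), ncol = 2, byrow = TRUE)
  # number of matched regions per sequence
  agregcount <- unlist(lapply(agregions, nrow))

  # one row per matched region, labelled with its sequence index and id
  agregions3 <- data.frame(y           = rep(seq_along(sequences), agregcount),
                           id          = rep(names(sequences), agregcount),
                           align_start = agregions2[, 1],
                           align_end   = agregions2[, 2])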
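
To make the reshaping above concrete, here is a minimal sketch on mock data. It assumes, as the snippet above implies, that the `locations` element is a list holding one two-column (start, end) matrix per sequence. The ids `seqA` and `seqB` and all coordinates are made up for illustration, and `do.call(rbind, ...)` is swapped in as an equivalent way of stacking the matrices; nothing here calls scan_ag itself.

```r
# Hypothetical stand-in for scan_ag(...)$locations: one (start, end) matrix
# per sequence. Ids and coordinates are invented for this example.
seq_ids <- c("seqA", "seqB")
agregions <- list(
  matrix(c(10, 20, 30, 45), ncol = 2, byrow = TRUE),  # seqA: two regions
  matrix(c(5, 18), ncol = 2, byrow = TRUE)            # seqB: one region
)

# Stack all matrices row-wise and count the regions per sequence.
flat   <- do.call(rbind, agregions)
counts <- vapply(agregions, nrow, integer(1))

# One row per matched region, with the same column names as the snippet above.
regions <- data.frame(
  y           = rep(seq_along(agregions), counts),
  id          = rep(seq_ids, counts),
  align_start = flat[, 1],
  align_end   = flat[, 2],
  stringsAsFactors = FALSE
)
print(regions)
```

Note that this sketch, like the snippet above, assumes every sequence has at least one matched region; an element that is not a matrix would need to be handled separately.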

missuse self-assigned this May 18, 2018
Owner

missuse commented May 21, 2018

scan_ag(sequence = at_nsp$sequence[c(1, 3, 16, 23)],
        id = at_nsp$Transcript.id[c(1, 3, 16, 23)],
        simplify = FALSE,
        tidy = TRUE)[, -1] # drop the first (sequence) column so the output fits here

output:


           id location.start location.end        P_pos length AG_aa
1 ATCG00660.1             NA           NA           NA     NA    NA
2 AT2G28410.1             26           41   27, 35, 40     16     8
3 AT2G28410.1             55           70   55, 67, 69     16     6
4 AT2G43620.1             62           76 63, 70, ....     15     8
5 AT2G43620.1            167          185 168, 176....     19     8
6 AT2G30933.1             36           51 37, 42, ....     16     9
7 AT2G30933.1             63           78 64, 68, ....     16     8

P_pos is a list column holding the P positions within each regex match.
length is the length of the matched substring.
AG_aa is the number of amino acids that were identified in the matched substring.

No information is lost compared to the list output.
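
Since P_pos is a list column, going from one row per regex match to one row per proline (the other layout discussed earlier in this thread) takes only a few lines of base R. The sketch below is an illustration under assumptions: `tidy_out` is a hand-built stand-in containing the two AT2G28410.1 rows from the printed output, not the result of a real scan_ag call, and the `match` and `position` column names are invented for the example.

```r
# Hand-built stand-in for two rows of the tidy output printed above
# (values copied from the table; this is not a call to scan_ag).
tidy_out <- data.frame(
  id             = c("AT2G28410.1", "AT2G28410.1"),
  location.start = c(26, 55),
  location.end   = c(41, 70),
  stringsAsFactors = FALSE
)
tidy_out$P_pos <- list(c(27, 35, 40), c(55, 67, 69))  # list column

# Expand to one row per proline: repeat each row's id by the number of
# P positions in its match, then flatten the list column.
n_p <- lengths(tidy_out$P_pos)
per_proline <- data.frame(
  id       = rep(tidy_out$id, n_p),
  match    = rep(seq_len(nrow(tidy_out)), n_p),  # hypothetical column: row index of the match
  position = unlist(tidy_out$P_pos),             # hypothetical column name
  stringsAsFactors = FALSE
)
print(per_proline)
```

A row without matches, such as ATCG00660.1 in the output above, would presumably carry a single NA in P_pos and so expand to one row with an NA position.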

What do you think?

Author

TS404 commented May 21, 2018

Love it!

Owner

missuse commented May 25, 2018

Added tidy output. It is currently available when simplify = FALSE and tidy = TRUE are set in the function call.

missuse closed this as completed May 25, 2018