OTU filtering #18

Open
mlbendall opened this issue Jul 20, 2016 · 5 comments

Comments

@mlbendall
Collaborator

For those not at BU, we had a conversation today about how different analyses have different filtering requirements for the data. For example, you should not filter low-abundance OTUs for alpha diversity calculations, but there are other situations where you might want to filter for analysis or visualization. So we concluded:

  1. The raw PathoID reports should be read in and stored in full.
  2. We need a general-purpose function for filtering the data. For example, get only the top 10 OTUs, or get all OTUs that account for >1% of the data, or remove OTUs that are only present in one sample.
  3. There will be an intermediate layer that performs this filtering. Analysis functions should assume that they are being handed a properly filtered object.
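
To make point 2 concrete, here is a rough sketch of what such a filter could look like. It is only an illustration under stated assumptions: `filter_otus` and its arguments are hypothetical names (nothing in PathoStat or phyloseq), and it works on a plain count matrix with OTUs in rows and samples in columns so that it runs without any extra packages.

```r
# Hypothetical helper, not part of PathoStat or phyloseq.
# `counts` is a numeric matrix with OTUs in rows and samples in columns.
filter_otus <- function(counts, top_n = NULL, min_fraction = NULL,
                        min_samples = NULL) {
  keep <- rep(TRUE, nrow(counts))
  totals <- rowSums(counts)

  if (!is.null(min_samples)) {
    # Keep OTUs detected (count > 0) in at least `min_samples` samples
    keep <- keep & (rowSums(counts > 0) >= min_samples)
  }
  if (!is.null(min_fraction)) {
    # Keep OTUs whose share of all reads exceeds `min_fraction`
    keep <- keep & ((totals / sum(totals)) > min_fraction)
  }
  if (!is.null(top_n)) {
    # Among the OTUs still kept, retain the `top_n` with the largest totals
    ranked <- order(totals, decreasing = TRUE)
    ranked <- ranked[keep[ranked]]
    keep <- seq_len(nrow(counts)) %in% head(ranked, top_n)
  }
  counts[keep, , drop = FALSE]
}

# Toy data, made up for this example
counts <- matrix(
  c(500, 300, 200,
    400,   0,   0,
     50,  60,  40,
      1,   0,   2),
  nrow = 4, byrow = TRUE,
  dimnames = list(paste0("OTU", 1:4), paste0("S", 1:3))
)

top2     <- filter_otus(counts, top_n = 2)           # the two most abundant OTUs
abundant <- filter_otus(counts, min_fraction = 0.01) # OTUs with >1% of all reads
shared   <- filter_otus(counts, min_samples = 2)     # drop single-sample OTUs
```

The raw matrix is never modified, so functions that need unfiltered data (alpha diversity, for instance) can still use it, while plots can be handed the filtered copy.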

There are other details that need to be sorted out, such as how to track if users upload pre-filtered data, etc.

@ecastron
Collaborator

The Santiago team agrees with this 100%!

I'd like to add that while PathoID writes a sorted .tsv file, it is sorted by Final Guess, and sometimes you want it sorted by Final Read Numbers instead.
If we read the full PathoID output without any cutoffs, then in phyloseq you can easily get the top X with something like:

top10 <- names(sort(taxa_sums(physeq), decreasing = TRUE)[1:10])

Someone may want to define the top X by proportions instead of counts, in which case a transformation is needed:

physeq <- transform_sample_counts(physeq, function(x) x / sum(x))
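
To show why the choice matters, here is a small sketch with made-up numbers. It uses a plain matrix (OTUs in rows, samples in columns) rather than a phyloseq object so it runs on its own; `rowSums` stands in for `taxa_sums`, and the `sweep` call stands in for the per-sample transformation above.

```r
# Toy data, made up for this example. S1 is sequenced far deeper than S2.
counts <- matrix(
  c(9000,  5,
    1000, 95),
  nrow = 2, byrow = TRUE,
  dimnames = list(c("OTU_A", "OTU_B"), c("S1", "S2"))
)

# Ranking by raw counts is dominated by the deeply sequenced sample
by_counts <- names(sort(rowSums(counts), decreasing = TRUE))

# Convert each sample (column) to proportions, then rank by mean proportion
props <- sweep(counts, 2, colSums(counts), "/")
by_props <- names(sort(rowMeans(props), decreasing = TRUE))
```

Here OTU_A ranks first by counts, but OTU_B ranks first by mean proportion, because every sample carries equal weight once the counts are normalized.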

Regarding point 3, I think users should be told to upload only unfiltered results, and PathoStat should decide when it's appropriate to filter.

BTW, @mlosada323 mentioned rarefaction for 16S data. That's also a one-liner in phyloseq:

physeq_rare <- rarefy_even_depth(physeq, sample.size = 1000, replace = FALSE, rngseed = TRUE); physeq_rare
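
For anyone unfamiliar with what rarefying to an even depth means, here is a minimal sketch of the idea in base R. It is not phyloseq's implementation: `rarefy_sample` is a hypothetical helper, the data are made up, and it assumes every sample has at least `depth` reads.

```r
# Hypothetical helper illustrating the idea; not phyloseq's implementation.
# `x` is a vector of integer counts for one sample, one entry per OTU.
rarefy_sample <- function(x, depth) {
  # Expand counts into one entry per read, labelled by OTU index
  reads <- rep(seq_along(x), times = x)
  # Draw `depth` reads without replacement and count them per OTU
  picked <- reads[sample.int(length(reads), size = depth, replace = FALSE)]
  tabulate(picked, nbins = length(x))
}

set.seed(1)  # fix the seed so the subsample is reproducible

# Toy data: OTUs in rows, samples in columns
counts <- matrix(
  c(1500, 2000,
     400, 3000,
     100,    0),
  nrow = 3, byrow = TRUE,
  dimnames = list(paste0("OTU", 1:3), c("S1", "S2"))
)

# Subsample every sample (column) down to 1000 reads
rarefied <- apply(counts, 2, rarefy_sample, depth = 1000)
rownames(rarefied) <- rownames(counts)
```

After this step every column sums to 1000, so samples with very different sequencing depths can be compared on the same footing.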

Cheers,

Eduardo

PS: The alluvial plot is almost done! @Sanrrone

[Image: screenshot, 2016-07-20 at 18:08:51]

@mlbendall
Collaborator Author

Wow, looks nice @Sanrrone!

@mlbendall
Collaborator Author

Can you make a remote branch and push up what you have currently? I'd like to look at how you are getting the sample condition.

@Sanrrone
Collaborator

I'm confused about how remote branches work. I did make a pull request; is that the same? Anyway, you can look at the changes in my fork: https://github.com/Sanrrone/PathoStat

@mlbendall
Collaborator Author

Oh, didn't know you were working on a fork.

A remote branch lives in the same repository, while a fork creates a new repository. There is ongoing debate about when to branch and when to fork, but it boils down to how closely you are involved with the original project and whether your changes will eventually be incorporated into it.

Just make sure to keep your fork in sync with master, and (ideally) merge the upstream master and test your code before making a pull request. Same goes for branches.
