Visualization overhaul (2 of 2) #164
I really like the direction this is going, and I think this is a useful reorganization you have here. There are some big changes with some new parts, though, so more docstrings, even on Mixins or components that aren't otherwise exposed, would be really helpful for understanding the overall architecture. Also, I have some comments/concerns about how we're bumping some of our dependencies.
```diff
@@ -2,31 +2,31 @@
 boto3>=1.4.2
 click>=6.6
 jsonschema>=2.4
 python-dateutil>=2.5.3
```
I'm fairly reluctant to pin with a `>=`; we should probably pin to specific minor-release versions of packages, e.g. using `~=`, to prevent unexpected downstream breakages from taking out this library.
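For illustration, a compatible-release pin (PEP 440 `~=`) applied to the packages in the diff above might look like this:

```
boto3~=1.4.2
click~=6.6
jsonschema~=2.4
python-dateutil~=2.5.3
```

For example, `boto3~=1.4.2` permits any 1.4.x release at or above 1.4.2 but excludes 1.5.0, so patch updates still flow through while potentially breaking releases are held back.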
Here are some comments/suggestions:

**Separate normalization steps from stats**

I feel pretty strongly that preprocessing/normalization steps should remain separate from statistics steps. This would read nicer and cut down on repetition. The user only has to transform their data once, and then pass the transformed object to multiple stats functions.

```python
# perform normalization
processed_analysis = normalize(agglomerate(analysis_object, rank='Phylum'))

# do some stats
alpha = alpha_diversity(processed_analysis)
beta = beta_diversity(processed_analysis)
```

**API design**

Since the entire API is changing, I think it's worthwhile to evaluate a few possible API designs.

Proposal A: method chaining. This would be pretty familiar to anyone used to phyloseq (or d3). I also think it reads well and is easy to type:

```python
experiment.agglomerate('Phylum').normalize().beta_diversity()
```

Proposal B: IDK what this is called, but it's familiar from scikit-learn, which is an API for processing DataFrames and doing stats/machine learning:

```python
transformer = ReadCountNormalizer(method="proportion")
summary_statistic = ShannonDiversityIndex()
stats = summary_statistic(transformer.transform(results_data_frame))
```

This API design can be made a little bit more readable using something like:

```python
# straight up ripoff of scikit-learn
my_analysis = Pipeline([
    ('transform', ReadCountNormalizer()),
    ('top5', SelectTopTaxa(n=5)),
    ('alpha_diversity', ShannonDiversityIndex())
])
my_analysis.transform(results_data_frame)
```

This design also lets you store artifacts from the transformations / stats functions (e.g., …).

It would even be possible to have multiple APIs. Again, these are just suggestions, but I think it's worthwhile. Also, it's possible to implement these APIs on top of what you've built already (and have multiple ones at the same time).

**Better naming**

"ResultsDataFrame" is kind of ambiguous? I'm bad at naming things though. I am thinking that there are going to be multiple things named "results" in users' notebooks and it might get confusing. The …

**Guessing normalization**

Even though the odds of a false positive are probably nil, a good alternative would be to store a bool on …

Lines 107 to 111 in 4e7a769
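A minimal sketch of that kind of guess, assuming rows are per-sample abundance values (`looks_normalized` is a hypothetical helper, not the library's actual check):

```python
def looks_normalized(rows, tol=1e-6):
    """Heuristic sketch: treat the data as already-normalized read
    proportions if every sample (row) sums to ~1.0. Storing a bool at
    normalization time would avoid this guess entirely."""
    return all(abs(sum(row) - 1.0) <= tol for row in rows)

looks_normalized([[10, 20, 70]])     # raw read counts: False
looks_normalized([[0.1, 0.2, 0.7]])  # proportions: True
```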
(playing around with the lib now. I'll give you some thoughts on my first impressions)

onecodex/onecodex/models/__init__.py, Line 238 in 4e7a769
I think it would be better to drop the internal caching mechanisms and just use something external and explicit like https://github.com/reclosedev/requests-cache, which automatically caches all HTTP requests and stores them in an sqlite3 database. As a user, I am unaware of when things are being pulled from the API or from the cache. This way, I am in control of the cache. We could make it simpler for the user (like …)
IIRC …

Updated
* New object, SampleCollection, which is a list of samples/analyses/classifications
* The base object contains functions to collate multiple results into taxonomy, metadata, and results dataframes
* Additional functionality is added through mixins which provide visualization methods, taxonomy manipulation, and distance calculations
* Update pandas version
* New subclassed pandas DataFrame, ResultsDataFrame, which holds taxonomy/metadata/results dfs
* Switch to skbio's to_data_frame() method

* VizDistanceMixin, rather than SampleCollection, now subclasses DistanceMixin
* Added ability to output altair chart when calling plot_heatmap
* Renamed skbio tree methods, removing skbio_ from names
* Added tree_prune_tax_ids method which keeps taxa and their parents
* OneCodexAccessor object now subsets _taxonomy and _metadata based on contents of ResultsDataFrame
* Auto rank finding must be aware of class type

* SampleAnalysis pulls in methods for analyzing results tables from mixins
* SampleCollection inherits these methods from SampleAnalysis
* Methods from SampleAnalysis are made accessible in the `ocx` accessor namespace via a pandas extension

* Use 'auto' rank everywhere
* Track rank used to subset data in ResultsDataFrame, but only before manipulations are done on it
* Use rank tracked in ResultsDataFrame where applicable
* Guess if data has already been normalized in diversity methods, and if so, raise
* Fix passing of metadata between ResultsDataFrame and ResultsSeries
* Use automatic ranks when calling unifrac
* Only do rangeStep in boxplot, not others
* Update docstrings to be more like numpy

* Relocate distance function calls for dryness
* Let users pass their own distance function
* Allow users to choose from pcoa or smacof MDS algos
* Renamed plot_pcoa to plot_mds
* Suppress skbio warning about negative eigenvalues
* Add ability to return altair chart object to all

* Renamed axes in PCA plots to PC1/PC2
* Fix package name specification in requirements.txt
* Fix bug in plot_metadata when haxis is numeric
* Don't bother searching taxa if query is too short
* Uses euclidean distances and average-linkage clustering, for now
* If inside notebook service nbconvert, render high-res PNGs
* If inside ipython, use the notebook renderer

* SampleCollection subclasses ResourceList and contains classifications/analyses/samples
* In ResourceList, rename oc_model to _oc_model since it really is private
* In ResourceList, call _update() whenever changes are made to underlying resource
* Rename helpers.SampleCollection to SampleCollectionAnalyses to reflect that it contains analysis methods
* Stop importing viz and distance into Api instances
* Add functionality to SampleCollection from viz, distance, and taxonomy if dependencies are available

* Enable __add__ and copy on SampleCollection
* Disallow duplicates in SampleCollection
* Stop calling append inside extend in ResourceList
* DRY up code and fix checking for resource validity
* Pass __repr__ and __len__ as properties in ResourceList
* Relocate functools import in subclassed pandas df/series
* SampleCollectionAnalyses is now EnhancedSampleCollection
* SampleAnalysis is now AnalysisMethods
* Reordered methods in ResourceList
* Remove unused ModuleAlias import
* Bypass loading of EnhancedSampleCollection with environment variable
* Fixed regression with string coercion in magic_metadata_fetch
* Fixed bug in the way the number of taxa is determined
* Stop checking for invalid taxonomy, fixed in mainline

* Enable testing of EnhancedSampleCollection in conftest
* Added new route for a where query on Samples
* Added tests for SampleCollection
* Change CircleCI tests to Python 3.6
* Fixes for Python 2.7 compat
* Fix failing coverage test

* Moved to_otu from Classifications instance to SampleCollection
* Moved magic_classifications_fetch from EnhancedSampleCollection to SampleCollection
* Changed test_where_clauses test to pass with additional requests being generated

* Remove 'magic_' from names and make methods private
* Make collate_metadata and collate_results private
* Only collate data if not cached and remove from cache if update needed
* Use @property for taxonomy, metadata, field, and _results
* Disallow dupes in ResourceList, not SampleCollection
* Move pandas-requiring methods into SampleCollection and raise on ImportError
* Remove EnhancedSampleCollection
* Change _classifications to new @property primary_classifications
* Get tests to pass

* Make them aware of pandas requirement and skip/fail nicely
* Remove ONE_CODEX_NO_ENHANCED var and update tests
* Remove test_api.py which is superseded by newer tests
* Remove test_minimal.py as it doesn't make sense anymore
* Updated dependency versions to match
* Updated CircleCI and Tox
* Lint

* Fixed a bug where empty lists aren't returned as ResourceList
* Fixed spelling error
* Stop collate_metadata failing on empty df

* Rename AnalysisMethods to AnalysisMixin
* Rename ResultsDataFrame to ClassificationsDataFrame
* Stop calling _update() in ResourceList on read
* Remove comparison NotImplementedErrors and associated test
* Codacy suggested changes
* Small tweaks and get tests passing
* Additional docstrings
* Enforce oc_model subclasses OneCodexBase
* Update CircleCI
* Fix non-deterministic test failure due to responses
Minor @polyatail – but this would probably have been a good PR to either rebase or squash before merging into master (the commit history starts fairly clean, but also includes a bunch of "More changes..."-style commits).
The PR is a substantial overhaul to how `Samples` and their associated `Classifications` are analyzed in a notebook environment.

The following issues were addressed:

- Renamed `plot_pcoa` to `plot_mds`.
- In `plot_heatmap`, the samples and taxa are ordered according to euclidean distance.
- In `plot_heatmap` or in `plot_distance`, the user can now select the clustering algorithm, e.g., single- or average-linkage.
- `Samples` or `Classifications` are now returned wrapped in a new `SampleCollection` object, which exposes lots of additional analysis methods that operate on the objects contained within.
- Changes to `ResourceList` objects were not being saved.

New functionality has been added as part of the `SampleCollection` object, some examples of which are outlined below.

**Retrieving a project and generating plots**
All plotting functions are now available as methods of the new `SampleCollection` object.

**Obtaining a DataFrame OTU table**
We introduce a new subclassed `ResultsDataFrame`, which is returned when the user requests an OTU table from a `SampleCollection`. This object transfers taxonomy and metadata along with it, automatically subset to only ever contain the taxonomy of taxa present in the `ResultsDataFrame` and the metadata belonging to the `Samples` present.

Metadata and taxonomy can be accessed in two ways:
The values in the `ResultsDataFrame` can be manipulated as usual. Analysis methods which are available to `SampleCollection` are also available in the `ResultsDataFrame.ocx` namespace via `pd.api.extensions`.
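The `pd.api.extensions` hook referred to here can be sketched with a toy accessor; `demo` and `n_taxa` are invented names for illustration, not part of the library's actual `ocx` namespace:

```python
import pandas as pd

@pd.api.extensions.register_dataframe_accessor("demo")
class DemoAccessor:
    """Toy accessor illustrating the mechanism; the real OneCodexAccessor
    additionally carries taxonomy/metadata subset to the frame's contents."""
    def __init__(self, df):
        self._df = df

    @property
    def n_taxa(self):
        # number of taxa columns in this frame
        return self._df.shape[1]

df = pd.DataFrame({"tax_1": [1, 2], "tax_2": [3, 4]})
df.demo.n_taxa  # 2
```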
**Manipulating contents of SampleCollection**
The `Samples` in a `SampleCollection` can be manipulated as if they were a list. A user might notice that one sample is "bad" and want to exclude it from their analyses. Or they might want to combine multiple projects together, or analyze their own project in the context of a larger public dataset.
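The list-like behavior described above can be sketched with a toy stand-in (`ToySampleCollection` and the sample names are invented; the real `SampleCollection` wraps API resources, supports `__add__`, and disallows duplicates):

```python
class ToySampleCollection(list):
    """Toy stand-in: collections concatenate with `+`, and duplicates
    are dropped, mirroring the behavior described above."""
    def __add__(self, other):
        merged = ToySampleCollection(self)
        for sample in other:
            if sample not in merged:
                merged.append(sample)
        return merged

project_a = ToySampleCollection(["sample_1", "sample_2"])
project_b = ToySampleCollection(["sample_2", "sample_3"])

# Exclude a "bad" sample via ordinary list slicing/filtering:
good = ToySampleCollection(s for s in project_a if s != "sample_2")

# Combine two projects; the shared sample appears only once:
combined = project_a + project_b
```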