
Visualization overhaul (2 of 2) #164

Merged (21 commits, roo-viz-part2 into master) on Jan 12, 2019
Conversation

@polyatail (Contributor) commented Jan 8, 2019

This PR is a substantial overhaul of how Samples and their associated Classifications are analyzed in a notebook environment. It addresses several open issues.

New functionality has been added as part of the SampleCollection object, some examples of which are outlined below.

Retrieving a project and generating plots
All plotting functions are now available as methods of the new SampleCollection object.

>>> project1 = ocx.Projects.get('d53ad03b010542e3')
>>> samples1 = ocx.Samples.where(project=project1.id, public=True, limit=10)
>>> type(samples1)
onecodex.models.SampleCollection
>>> samples1.plot_pca(color='Bacteroides')

Obtaining a DataFrame OTU table
We introduce a new subclassed ResultsDataFrame, which is returned when the user requests an OTU table from a SampleCollection. This object carries its taxonomy and metadata along with it, automatically subset so that it only ever contains taxonomy for the taxa present in the ResultsDataFrame and metadata for the Samples present.

results = samples1.results(rank='phylum', normalize=False)

# take the fourth root and plot a heatmap
(results ** 0.25).ocx.plot_heatmap()

# restrict to taxa with 100 reads or more and run PCA
results.loc[:, results.sum() > 100].ocx.plot_pca(color='Bacteroidetes')

# do PCoA with weighted Unifrac
results.ocx.plot_mds(method='pcoa', metric='weighted_unifrac', color='Bacteroidetes')

Metadata and taxonomy can be accessed in two ways:

samples1.metadata == samples1.results().ocx.metadata
samples1.taxonomy == samples1.results().ocx.taxonomy

The values in the ResultsDataFrame can be manipulated as usual. Analysis methods that are available on SampleCollection are also available in the ResultsDataFrame.ocx namespace via pd.api.extensions; a minimal sketch of how that namespace can be registered follows.
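For readers unfamiliar with pandas extension accessors, here is a minimal sketch of how an ocx-style namespace can be registered. The method body is illustrative only, not the exact wiring used in this PR:

import pandas as pd

@pd.api.extensions.register_dataframe_accessor("ocx")
class OneCodexAccessor:
    def __init__(self, pandas_obj):
        # pandas passes in the DataFrame the accessor was invoked on
        self._df = pandas_obj

    def plot_heatmap(self, **kwargs):
        # any method defined here becomes reachable as df.ocx.plot_heatmap(...)
        raise NotImplementedError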

Manipulating contents of SampleCollection
The Samples in a SampleCollection can be manipulated as if they were a list. A user might notice that one sample is "bad" and want to exclude it from their analyses. Or they might want to combine multiple projects together, or analyze their own project in the context of a larger public dataset.

project2 = ocx.Projects.get('be33784611fe4c19')
samples2 = ocx.Samples.where(project=project2.id, public=True, limit=10)

# analyze these two projects together
combined = samples1 + samples2

# exclude this sample
del combined[4]

combined.plot_pca(size='Firmicutes')

# get a combined OTU table and write to a CSV file
combined.results(rank='genus', normalize=True).to_csv('otu_table.csv')

# and in long format
combined.results(table_format='long').to_csv('otu_table.long.csv')

@coveralls commented Jan 9, 2019

Coverage decreased (-0.6%) to 84.078% when pulling 8c99d6d on roo-viz-part2 into 6473770 on master.

@bovee (Contributor) left a comment


I really like the direction this is going, and I think this is a useful reorganization. There are some big changes with new parts, though, so more docstrings, even on mixins or components that aren't otherwise exposed, would be really helpful for understanding the overall architecture. Also, I have some comments/concerns about how we're bumping some of our dependencies.

(Resolved review threads on onecodex/helpers.py, onecodex/distance.py, onecodex/taxonomy.py, and onecodex/viz/_distance.py.)
On requirements.txt (@@ -2,31 +2,31 @@):
boto3>=1.4.2
click>=6.6
jsonschema>=2.4
python-dateutil>=2.5.3

I'm fairly reluctant to pin with a >=; we should probably pin to specific minor-release versions of packages, e.g. using ~=, to prevent unexpected downstream breakages from taking out this library.
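For example, compatible-release pins on the requirements above would look roughly like this (the exact version floors to pin against are illustrative):

boto3~=1.4.2            # >= 1.4.2, < 1.5 (~= pins at the last component given)
click~=6.6.0            # >= 6.6.0, < 6.7
jsonschema~=2.4.0       # >= 2.4.0, < 2.5
python-dateutil~=2.5.3  # >= 2.5.3, < 2.6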

(Resolved review thread on tox.setup.ini.)
@audy (Contributor) commented Jan 9, 2019

Here are some comments/suggestions:

Separate normalization steps from stats

I feel pretty strongly that preprocessing/normalization steps should remain separate from statistics steps. This would read nicer and cut down on repetition. The user only has to transform their data once, and then pass the transformed object to multiple stats functions.

# perform normalization

processed_analysis = normalize(agglomerate(analysis_object, rank='Phylum'))

# do some stats
alpha = alpha_diversity(processed_analysis)

beta = beta_diversity(processed_analysis)

API Design

Since the entire API is changing, I think it's worthwhile to evaluate a few possible API designs.

Proposal A: method chaining. This would be pretty familiar to anyone used to phyloseq (or d3). I also think it reads well and is easy to type:

experiment.agglomerate('Phylum').normalize().beta_diversity()
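A minimal sketch of what could back Proposal A, assuming a hypothetical Experiment wrapper around a samples-by-taxa counts DataFrame: each step returns a new object of the same type, so calls compose left to right.

class Experiment:
    def __init__(self, results):
        # results: a samples-by-taxa DataFrame of read counts
        self._results = results

    def agglomerate(self, rank):
        # collapse counts to the given taxonomic rank (e.g., sum all columns
        # sharing an ancestor at `rank`) and wrap the result in a new Experiment
        raise NotImplementedError

    def normalize(self):
        # convert counts to per-sample proportions; returning a new Experiment
        # keeps the chain going
        return Experiment(self._results.div(self._results.sum(axis=1), axis=0))

    def beta_diversity(self):
        # terminal step: returns a value (e.g., a distance matrix) rather than
        # an Experiment, ending the chain
        raise NotImplementedError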

Proposal B: IDK what this pattern is called, but it's familiar from scikit-learn, which provides an API for processing DataFrames and doing stats/machine learning:

transformer = ReadCountNormalizer(method="proportion")

summary_statistic = ShannonDiversityIndex()

stats = summary_statistic(transformer.transform(results_data_frame))

This API design can be made a little bit more readable using something like Pipeline

# straight up ripoff of scikit-learn
my_analysis = Pipeline([
  ('transform', ReadCountNormalizer()),
  ('top5', SelectTopTaxa(n=5)),
  ('alpha_diversity', ShannonDiversityIndex())
])

my_analysis.transform(results_data_frame)

This design also lets you store artifacts from the transformations / stats functions (e.g., top5.kept_taxa_names)

It would even be possible to support multiple APIs at the same time. Again, these are just suggestions, but I think they're worth evaluating, and they could be implemented on top of what you've built already.

Better naming

"ResultsDataFrame" is kind of ambiguous? I'm bad at naming things though. I am thinking that there are going to be multiple things named "results" in users' notebooks and it might get confusing. The results function is also ambiguous. I think this should be renamed to transform maybe.

Guessing normalization

Even though the odds of a false positive are probably nil, a good alternative would be to store a bool on ResultsDataFrame and have any normalization function set it to True; see the sketch after the snippet below.

def _guess_normalized(self):
    # it's possible that the _results df has already been normalized, which can cause some
    # methods to fail. we must guess whether this is the case
    return bool((self._results.sum(axis=1).round(4) == 1.0).all())
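A sketch of that explicit-flag alternative (hypothetical code; note that a pandas DataFrame subclass would need to list the flag in its _metadata attribute for it to survive DataFrame operations):

def normalize(results):
    normalized = results.div(results.sum(axis=1), axis=0)
    # record that the transformation happened instead of inferring it later
    normalized._normalized = True
    return normalized

def alpha_diversity(results):
    if getattr(results, "_normalized", False):
        raise ValueError("alpha diversity requires raw read counts")
    # ... compute the metric on raw counts ...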

@audy (Contributor) commented Jan 9, 2019

(playing around with the lib now. I'll give you some thoughts on my first impressions)

def primary_classifications(self):

I think @property doesn't work well here because it's really a function that calls an API (and changes the object's state). As a user, I expect it to return immediately, when it's really doing 900 GET requests.

I think it would be better to drop the internal caching mechanisms and just use something external and explicit like https://github.com/reclosedev/requests-cache, which automatically caches all HTTP requests and stores them in an sqlite3 database. As a user, I am unaware of when things are being pulled from the API versus from the cache; this way, I'd be in control of the cache. We could make it simpler for the user (e.g., onecodex.enable_cache()).
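For reference, the requests-cache approach is roughly a one-liner, which a hypothetical onecodex.enable_cache() helper could simply wrap:

import requests_cache

# transparently cache every HTTP request made through `requests` in a local
# sqlite database; remove the file or call requests_cache.uninstall_cache()
# to invalidate
requests_cache.install_cache("onecodex_cache", backend="sqlite")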

@polyatail added this to the Sprint Kiwi milestone on Jan 10, 2019
@polyatail (Contributor, Author) commented:
IIRC primary_classifications doesn't actually trigger any API calls, but I get what you're saying.

@audy (Contributor) commented Jan 10, 2019

SampleCollection.primary_classification checks self._cached and if the result is False, dispatches _classification_fetch which will trigger an API request at least in the case of the model being Classifications.
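In other words, the flow is roughly the following (attribute names follow the discussion above; the actual implementation may differ):

class SampleCollection(object):
    @property
    def primary_classifications(self):
        if not self._cached:
            # first access dispatches the fetch; when the underlying model is
            # Classifications, this triggers API requests, so reading the
            # property is not free
            self._classification_fetch()
            self._cached = True
        return self._classifications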

@audy (Contributor) commented Jan 10, 2019

Updated SampleCollections.primary_classifications to only cache the results after all classifications have been fetched to prevent a potential gotcha.

@polyatail force-pushed the roo-viz-part2 branch 5 times, most recently from 0b9d912 to 82706a6 on January 11, 2019 at 02:53
* New object, SampleCollection, which is a list of samples/analyses/classifications
* The base object contains functions to collate multiple results into taxonomy, metadata, and results dataframes
* Additional functionality is added through mixins which provide visualization methods, taxonomy manipulation, and distance calculations
* Update pandas version
* New subclassed pandas DataFrame, ResultsDataFrame, which holds taxonomy/metadata/results dfs
* Switch to skbio's to_data_frame() method
* VizDistanceMixin, rather than SampleCollection, now subclasses DistanceMixin
* Added ability to output altair chart when calling plot_heatmap
* Renamed skbio tree methods, removing skbio_ from names
* Added tree_prune_tax_ids method which keeps taxa and their parents
* OneCodexAccessor object now subsets _taxonomy and _metadata based on contents of ResultsDataFrame
* Auto rank finding must be aware of class type
* SampleAnalysis pulls in methods for analyzing results tables from mixins
* SampleCollection inherits these methods from SampleAnalysis
* Methods from SampleAnalysis are made accessible in the `ocx` accessor namespace via a pandas extension
* Use 'auto' rank everywhere
* Track rank used to subset data in ResultsDataFrame, but only before manipulations are done on it
* Use rank tracked in ResultsDataFrame where applicable
* Guess if data has already been normalized in diversity methods, and if so, raise
* Fix passing of metadata between ResultsDataFrame and ResultsSeries
* Use automatic ranks when calling unifrac
* Only do rangeStep in boxplot, not others
* Update docstrings to be more like numpy
* Relocate distance function calls for dryness
* Let users pass their own distance function
* Allow users to choose from pcoa or smacof MDS algos
* Renamed plot_pcoa to plot_mds
* Suppress skbio warning about negative eigenvalues
* Add ability to return altair chart object to all
* Renamed axes in PCA plots to PC1/PC2
* Fix package name specification in requirements.txt
* Fix bug in plot_metadata when haxis is numeric
* Don't bother searching taxa if query is too short
* Uses euclidean distances and average-linkage clustering, for now
* If inside notebook service nbconvert, render high-res PNGs
* If inside ipython, use the notebook renderer
* Make OneCodexBase.__setattr__ aware of ResourceList objects
* Provide _constructor property that can be overloaded if necessary
* Lint
* SampleCollection subclasses ResourceList and contains classifications/analyses/samples.
* In ResourceList, rename oc_model to _oc_model since it really is private
* In ResourceList, call _update() whenever changes are made to underlying resource
* Rename helpers.SampleCollection to SampleCollectionAnalyses to reflect that it contains analysis methods
* Stop importing viz and distance into Api instances
* Add functionality to SampleCollection from viz, distance, and taxonomy if dependencies are available
* Enable __add__ and copy on SampleCollection
* Disallow duplicates in SampleCollection
* Stop calling append inside extend in ResourceList
* DRY up code and fix checking for resource validity
* Pass __repr__ and __len__ as properties in ResourceList
* Relocate functools import in subclassed pandas df/series
* SampleCollectionAnalyses is now EnhancedSampleCollection
* SampleAnalysis is now AnalysisMethods
* Reordered methods in ResourceList
* Remove unused ModuleAlias import
* Bypass loading of EnhancedSampleCollection with environment variable
* Fixed regression with string coercion in magic_metadata_fetch
* Fixed bug in way number of taxa determined
* Stop checking for invalid taxonomy, fixed in mainline
* Enable testing of EnhancedSampleCollection in conftest
* Added new route for a where query on Samples
* Added tests for SampleCollection
* Change CircleCI tests to Python 3.6
* Fixes for Python 2.7 compat
* Fix failing coverage test
* Moved to_otu from Classifications instance to SampleCollection
* Moved magic_classifications_fetch from EnhancedSampleCollection to SampleCollection
* Changed test_where_clauses test to pass with additional requests being generated
* Remove 'magic_' from names and make methods private
* Make collate_metadata and collate_results private
* Only collate data if not cached and remove from cache if update needed
* Use @property for taxonomy, metadata, field, and _results
* Disallow dupes in ResourceList, not SampleCollection
* Move pandas-requiring methods into SampleCollection and raise on ImportError
* Remove EnhancedSampleCollection
* Change _classifications to new @property primary_classifications
* Get tests to pass
* Make them aware of pandas requirement and skip/fail nicely
* Remove ONE_CODEX_NO_ENHANCED var and update tests
* Remove test_api.py which is superseded by newer tests
* Remove test_minimal.py as it doesn't make sense anymore
* Updated dependency versions to match
* Updated CircleCI and Tox
* Lint
* Fixed a bug where empty lists aren't returned as ResourceList
* Fixed spelling error
* Stop collate_metadata failing on empty df
* Rename AnalysisMethods to AnalysisMixin
* Rename ResultsDataFrame to ClassificationsDataFrame
* Stop calling _update() in ResourceList on read
* Remove comparison NotImplementedErrors and associated test
* Codacy suggested changes
* Small tweaks and get tests passing
@polyatail force-pushed the roo-viz-part2 branch 4 times, most recently from 013311c to 1e934f6 on January 11, 2019 at 18:54
@polyatail added the "done (ready to be merged)" label and removed the "in progress (not yet ready for review)" label on Jan 11, 2019
@polyatail force-pushed the roo-viz-part2 branch 5 times, most recently from 9fc3c3a to 741caff on January 12, 2019 at 02:19
* Additional docstrings
* Enforce oc_model subclasses OneCodexBase
* Update CircleCI
* Fix non-deterministic test failure due to responses
@polyatail merged commit e4c9e1f into master on Jan 12, 2019
@polyatail deleted the roo-viz-part2 branch on January 12, 2019 at 02:39
@boydgreenfield (Contributor) commented:

Minor @polyatail – but this would probably have been a good PR to either rebase or squash before merging into master (commit history starts fairly clean, but also includes a bunch of "More changes..." style commits).
