
Visualization overhaul (2 of 2) #164

Merged (21 commits, roo-viz-part2 into master) on Jan 12, 2019
Conversation

@polyatail (Contributor) commented Jan 8, 2019

This PR is a substantial overhaul of how Samples and their associated Classifications are analyzed in a notebook environment. It addresses several open issues.

New functionality has been added as part of the SampleCollection object, some examples of which are outlined below.

Retrieving a project and generating plots
All plotting functions are now available as methods of the new SampleCollection object.

>>> project1 = ocx.Projects.get('d53ad03b010542e3')
>>> samples1 = ocx.Samples.where(project=project1.id, public=True, limit=10)
>>> type(samples1)
onecodex.models.SampleCollection
>>> samples1.plot_pca(color='Bacteroides')

Obtaining a DataFrame OTU table
We introduce a new subclassed ResultsDataFrame, which is returned when the user requests an OTU table from a SampleCollection. This object carries its taxonomy and metadata along with it, automatically subset so that it only ever contains taxonomy for the taxa present in the ResultsDataFrame and metadata for the Samples present.

results = samples1.results(rank='phylum', normalize=False)

# take the fourth root and plot a heatmap
(results ** 0.25).ocx.plot_heatmap()

# restrict to taxa with 100 reads or more and run PCA
results.loc[:, results.sum() > 100].ocx.plot_pca(color='Bacteroidetes')

# do PCoA with weighted Unifrac
results.ocx.plot_mds(method='pcoa', metric='weighted_unifrac', color='Bacteroidetes')

Metadata and taxonomy can be accessed in two ways:

samples1.metadata == samples1.results().ocx.metadata
samples1.taxonomy == samples1.results().ocx.taxonomy

The values in the ResultsDataFrame can be manipulated as usual. Analysis methods that are available on SampleCollection are also available in the ResultsDataFrame.ocx namespace via pd.api.extensions; a minimal sketch of how that namespace can be registered follows.
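For readers unfamiliar with pandas extension accessors, here is a minimal sketch of how an ocx-style namespace can be registered. The method body is illustrative only, not the exact wiring used in this PR:

import pandas as pd

@pd.api.extensions.register_dataframe_accessor("ocx")
class OneCodexAccessor:
    def __init__(self, pandas_obj):
        # pandas passes in the DataFrame the accessor was invoked on
        self._df = pandas_obj

    def plot_heatmap(self, **kwargs):
        # any method defined here becomes reachable as df.ocx.plot_heatmap(...)
        raise NotImplementedError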

Manipulating contents of SampleCollection
The Samples in a SampleCollection can be manipulated as if they were a list. A user might notice that one sample is "bad" and want to exclude it from their analyses. Or they might want to combine multiple projects together, or analyze their own project in the context of a larger public dataset.

project2 = ocx.Projects.get('be33784611fe4c19')
samples2 = ocx.Samples.where(project=project2.id, public=True, limit=10)

# analyze these two projects together
combined = samples1 + samples2

# exclude this sample
del combined[4]

combined.plot_pca(size='Firmicutes')

# get a combined OTU table and write to a CSV file
combined.results(rank='genus', normalize=True).to_csv('otu_table.csv')

# and in long format
combined.results(table_format='long').to_csv('otu_table.long.csv')

@coveralls commented Jan 9, 2019

Coverage decreased (-0.6%) to 84.078% when pulling 8c99d6d on roo-viz-part2 into 6473770 on master.

@bovee (Contributor) left a comment


I really like the direction this is going, and I think this is a useful reorganization. There are some big changes with new parts, though, so more docstrings, even on mixins or components that aren't otherwise exposed, would be really helpful for understanding the overall architecture. Also, I have some comments/concerns about how we're bumping some of our dependencies.

(Resolved review threads on onecodex/helpers.py, onecodex/distance.py, onecodex/taxonomy.py, and onecodex/viz/_distance.py.)
On requirements.txt (@@ -2,31 +2,31 @@):
boto3>=1.4.2
click>=6.6
jsonschema>=2.4
python-dateutil>=2.5.3

I'm fairly reluctant to pin with a >=; we should probably pin to specific minor-release versions of packages, e.g. using ~=, to prevent unexpected downstream breakages from taking out this library.
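For example, compatible-release pins on the requirements above would look roughly like this (the exact version floors to pin against are illustrative):

boto3~=1.4.2            # >= 1.4.2, < 1.5 (~= pins at the last component given)
click~=6.6.0            # >= 6.6.0, < 6.7
jsonschema~=2.4.0       # >= 2.4.0, < 2.5
python-dateutil~=2.5.3  # >= 2.5.3, < 2.6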

(Resolved review thread on tox.setup.ini.)
@audy (Contributor) commented Jan 9, 2019

Here are some comments/suggestions:

Separate normalization steps from stats

I feel pretty strongly that preprocessing/normalization steps should remain separate from statistics steps. This would read nicer and cut down on repetition. The user only has to transform their data once, and then pass the transformed object to multiple stats functions.

# perform normalization

processed_analysis = normalize(agglomerate(analysis_object, rank='Phylum'))

# do some stats
alpha = alpha_diversity(processed_analysis)

beta = beta_diversity(processed_analysis)

API Design

Since the entire API is changing, I think it's worthwhile to evaluate a few possible API designs.

Proposal A: method chaining. This would be pretty familiar to anyone used to phyloseq (or d3). I also think it reads well and is easy to type:

experiment.agglomerate('Phylum').normalize().beta_diversity()
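A minimal sketch of what could back Proposal A, assuming a hypothetical Experiment wrapper around a samples-by-taxa counts DataFrame: each step returns a new object of the same type, so calls compose left to right.

class Experiment:
    def __init__(self, results):
        # results: a samples-by-taxa DataFrame of read counts
        self._results = results

    def agglomerate(self, rank):
        # collapse counts to the given taxonomic rank (e.g., sum all columns
        # sharing an ancestor at `rank`) and wrap the result in a new Experiment
        raise NotImplementedError

    def normalize(self):
        # convert counts to per-sample proportions; returning a new Experiment
        # keeps the chain going
        return Experiment(self._results.div(self._results.sum(axis=1), axis=0))

    def beta_diversity(self):
        # terminal step: returns a value (e.g., a distance matrix) rather than
        # an Experiment, ending the chain
        raise NotImplementedError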

Proposal B: IDK what this pattern is called, but it's familiar from scikit-learn, which provides an API for processing DataFrames and doing stats/machine learning:

transformer = ReadCountNormalizer(method="proportion")

summary_statistic = ShannonDiversityIndex()

stats = summary_statistic(transformer.transform(results_data_frame))

This API design can be made a little bit more readable using something like Pipeline

# straight up ripoff of scikit-learn
my_analysis = Pipeline([
  ('transform', ReadCountNormalizer()),
  ('top5', SelectTopTaxa(n=5)),
  ('alpha_diversity', ShannonDiversityIndex())
])

my_analysis.transform(results_data_frame)

This design also lets you store artifacts from the transformations / stats functions (e.g., top5.kept_taxa_names)

It would even be possible to support multiple APIs at the same time. Again, these are just suggestions, but I think they're worth evaluating, and they could be implemented on top of what you've built already.

Better naming

"ResultsDataFrame" is kind of ambiguous? I'm bad at naming things though. I am thinking that there are going to be multiple things named "results" in users' notebooks and it might get confusing. The results function is also ambiguous. I think this should be renamed to transform maybe.

Guessing normalization

Even though the odds of a false positive are probably nil, a good alternative would be to store a bool on ResultsDataFrame and have any normalization function set it to True; see the sketch after the snippet below.

def _guess_normalized(self):
    # it's possible that the _results df has already been normalized, which can cause some
    # methods to fail. we must guess whether this is the case
    return bool((self._results.sum(axis=1).round(4) == 1.0).all())
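A sketch of that explicit-flag alternative (hypothetical code; note that a pandas DataFrame subclass would need to list the flag in its _metadata attribute for it to survive DataFrame operations):

def normalize(results):
    normalized = results.div(results.sum(axis=1), axis=0)
    # record that the transformation happened instead of inferring it later
    normalized._normalized = True
    return normalized

def alpha_diversity(results):
    if getattr(results, "_normalized", False):
        raise ValueError("alpha diversity requires raw read counts")
    # ... compute the metric on raw counts ...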

@audy (Contributor) commented Jan 9, 2019

(playing around with the lib now. I'll give you some thoughts on my first impressions)

def primary_classifications(self):

I think @property doesn't work well here because it's really a function that calls an API (and changes the object's state). As a user, I expect it to return immediately, when it's really doing 900 GET requests.

I think it would be better to drop the internal caching mechanisms and just use something external and explicit like https://github.com/reclosedev/requests-cache, which automatically caches all HTTP requests and stores them in an sqlite3 database. As a user, I am unaware of when things are being pulled from the API versus from the cache; this way, I'd be in control of the cache. We could make it simpler for the user (e.g., onecodex.enable_cache()).
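For reference, the requests-cache approach is roughly a one-liner, which a hypothetical onecodex.enable_cache() helper could simply wrap:

import requests_cache

# transparently cache every HTTP request made through `requests` in a local
# sqlite database; remove the file or call requests_cache.uninstall_cache()
# to invalidate
requests_cache.install_cache("onecodex_cache", backend="sqlite")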

@polyatail added this to the Sprint Kiwi milestone on Jan 10, 2019
@polyatail (Contributor, Author) commented:
IIRC primary_classifications doesn't actually trigger any API calls, but I get what you're saying.

@audy (Contributor) commented Jan 10, 2019

SampleCollection.primary_classification checks self._cached and if the result is False, dispatches _classification_fetch which will trigger an API request at least in the case of the model being Classifications.
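In other words, the flow is roughly the following (attribute names follow the discussion above; the actual implementation may differ):

class SampleCollection(object):
    @property
    def primary_classifications(self):
        if not self._cached:
            # first access dispatches the fetch; when the underlying model is
            # Classifications, this triggers API requests, so reading the
            # property is not free
            self._classification_fetch()
            self._cached = True
        return self._classifications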

@audy (Contributor) commented Jan 10, 2019

Updated SampleCollections.primary_classifications to only cache the results after all classifications have been fetched to prevent a potential gotcha.

@polyatail force-pushed the roo-viz-part2 branch 5 times, most recently from 0b9d912 to 82706a6 on January 11, 2019 at 02:53
* New object, SampleCollection, which is a list of samples/analyses/classifications
* The base object contains functions to collate multiple results into taxonomy, metadata, and results dataframes
* Additional functionality is added through mixins which provide visualization methods, taxonomy manipulation, and distance calculations
* Update pandas version
* New subclassed pandas DataFrame, ResultsDataFrame, which holds taxonomy/metadata/results dfs
* Switch to skbio's to_data_frame() method
* VizDistanceMixin, rather than SampleCollection, now subclasses DistanceMixin
* Added ability to output altair chart when calling plot_heatmap
* Renamed skbio tree methods, removing skbio_ from names
* Added tree_prune_tax_ids method which keeps taxa and their parents
* OneCodexAccessor object now subsets _taxonomy and _metadata based on contents of ResultsDataFrame
* Auto rank finding must be aware of class type
* SampleAnalysis pulls in methods for analyzing results tables from mixins
* SampleCollection inherits these methods from SampleAnalysis
* Methods from SampleAnalysis are made accessible in the `ocx` accessor namespace via a pandas extension
* Use 'auto' rank everywhere
* Track rank used to subset data in ResultsDataFrame, but only before manipulations are done on it
* Use rank tracked in ResultsDataFrame where applicable
* Guess if data has already been normalized in diversity methods, and if so, raise
* Fix passing of metadata between ResultsDataFrame and ResultsSeries
* Use automatic ranks when calling unifrac
* Only do rangeStep in boxplot, not others
* Update docstrings to be more like numpy
* Relocate distance function calls for dryness
* Let users pass their own distance function
* Allow users to choose from pcoa or smacof MDS algos
* Renamed plot_pcoa to plot_mds
* Suppress skbio warning about negative eigenvalues
* Add ability to return altair chart object to all
* Renamed axes in PCA plots to PC1/PC2
* Fix package name specification in requirements.txt
* Fix bug in plot_metadata when haxis is numeric
* Don't bother searching taxa if query is too short
* Uses euclidean distances and average-linkage clustering, for now
* If inside notebook service nbconvert, render high-res PNGs
* If inside ipython, use the notebook renderer
* Make OneCodexBase.__setattr__ aware of ResourceList objects
* Provide _constructor property that can be overloaded if necessary
* Lint
* SampleCollection subclasses ResourceList and contains classifications/analyses/samples.
* In ResourceList, rename oc_model to _oc_model since it really is private
* In ResourceList, call _update() whenever changes are made to underlying resource
* Rename helpers.SampleCollection to SampleCollectionAnalyses to reflect that it contains analysis methods
* Stop importing viz and distance into Api instances
* Add functionality to SampleCollection from viz, distance, and taxonomy if dependencies are available
* Enable __add__ and copy on SampleCollection
* Disallow duplicates in SampleCollection
* Stop calling append inside extend in ResourceList
* DRY up code and fix checking for resource validity
* Pass __repr__ and __len__ as properties in ResourceList
* Relocate functools import in subclassed pandas df/series
* SampleCollectionAnalyses is now EnhancedSampleCollection
* SampleAnalysis is now AnalysisMethods
* Reordered methods in ResourceList
* Remove unused ModuleAlias import
* Bypass loading of EnhancedSampleCollection with environment variable
* Fixed regression with string coercion in magic_metadata_fetch
* Fixed bug in way number of taxa determined
* Stop checking for invalid taxonomy, fixed in mainline
* Enable testing of EnhancedSampleCollection in conftest
* Added new route for a where query on Samples
* Added tests for SampleCollection
* Change CircleCI tests to Python 3.6
* Fixes for Python 2.7 compat
* Fix failing coverage test
* Moved to_otu from Classifications instance to SampleCollection
* Moved magic_classifications_fetch from EnhancedSampleCollection to SampleCollection
* Changed test_where_clauses test to pass with additional requests being generated
* Remove 'magic_' from names and make methods private
* Make collate_metadata and collate_results private
* Only collate data if not cached and remove from cache if update needed
* Use @property for taxonomy, metadata, field, and _results
* Disallow dupes in ResourceList, not SampleCollection
* Move pandas-requiring methods into SampleCollection and raise on ImportError
* Remove EnhancedSampleCollection
* Change _classifications to new @property primary_classifications
* Get tests to pass
* Make them aware of pandas requirement and skip/fail nicely
* Remove ONE_CODEX_NO_ENHANCED var and update tests
* Remove test_api.py which is superseded by newer tests
* Remove test_minimal.py as it doesn't make sense anymore
* Updated dependency versions to match
* Updated CircleCI and Tox
* Lint
* Fixed a bug where empty lists aren't returned as ResourceList
* Fixed spelling error
* Stop collate_metadata failing on empty df
* Rename AnalysisMethods to AnalysisMixin
* Rename ResultsDataFrame to ClassificationsDataFrame
* Stop calling _update() in ResourceList on read
* Remove comparison NotImplementedErrors and associated test
* Codacy suggested changes
* Small tweaks and get tests passing
@polyatail force-pushed the roo-viz-part2 branch 4 times, most recently from 013311c to 1e934f6 on January 11, 2019 at 18:54
@polyatail added the "done (ready to be merged)" label and removed the "in progress (not yet ready for review)" label on Jan 11, 2019
@polyatail force-pushed the roo-viz-part2 branch 5 times, most recently from 9fc3c3a to 741caff on January 12, 2019 at 02:19
* Additional docstrings
* Enforce oc_model subclasses OneCodexBase
* Update CircleCI
* Fix non-deterministic test failure due to responses
@polyatail merged commit e4c9e1f into master on Jan 12, 2019
@polyatail deleted the roo-viz-part2 branch on January 12, 2019 at 02:39
@boydgreenfield (Contributor) commented:

Minor @polyatail – but this would probably have been a good PR to either rebase or squash before merging into master (commit history starts fairly clean, but also includes a bunch of "More changes..." style commits).
