
Quickstart

The plotting examples below use matplotlib:

python

import matplotlib.pyplot as plt

I want to use CCA or PLS. How many samples are required?

Simply import gemmr:

python

from gemmr import *

Then, for CCA, only the number of features in each of the two datasets, as well as the power-law decay constants for the within-set principal component spectra, need to be specified, e.g.

python

cca_sample_size(100, 50, -.8, -1.2)   # 100 and 50 features; power-law decay constants -.8 and -1.2

The result is a dictionary with keys indicating assumed ground-truth correlations and values giving the corresponding sample size estimates. For PLS, the sample size can be calculated similarly:

python

pls_sample_size(100, 50, -1.5, -.5)   # same argument structure as cca_sample_size
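Assuming both functions return such a dictionary, a minimal sketch of how the estimates might be inspected:

python

n_required = cca_sample_size(100, 50, -.8, -1.2)
for r_true, n in n_required.items():
    # each entry maps an assumed ground-truth correlation to a sample size estimate
    print('assumed true correlation', r_true, '-> required sample size', n)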

Note

Required sample sizes are calculated to obtain at least 90% power and less than 10% error in a number of other metrics. See the [gemmr] publication for more details.

More use cases and options of the sample size functions are discussed in sample_size_calculation_tutorial.

How can I generate synthetic data for CCA or PLS?

The functionality is provided in the module generative_model and requires two steps:

python

from gemmr.generative_model import GEMMR

First, a model needs to be specified. The required parameters are:

  • the number of features in X and Y
  • the assumed true correlation between scores in X and Y
  • the power-law exponents describing the within-set principal component spectra of X and Y

python

px, py = 3, 5        # number of features in X and Y
r_between = 0.3      # assumed true correlation between X and Y scores
ax, ay = -1, -.5     # power-law exponents of the within-set principal component spectra
gm = GEMMR('cca', wx=px, wy=py, ax=ax, ay=ay, r_between=r_between)

Analogously, if a model for PLS is desired, the first argument becomes 'pls'.
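As a minimal sketch, reusing the parameters defined above, a PLS model could thus be set up as:

python

gm_pls = GEMMR('pls', wx=px, wy=py, ax=ax, ay=ay, r_between=r_between)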

Second, data can be drawn from the model distribution:

python

X, Y = gm.generate_data(n=5000)   # draw 5000 samples from the model
X.shape, Y.shape                  # -> ((5000, 3), (5000, 5))

See the API reference for .generative_model.setup_model and .generative_model.generate_data for more details.

How do the provided CCA or PLS estimators work?

We assume two data arrays X and Y are given that are to be analyzed with CCA or PLS. The provided estimators work like those in sklearn. For example, to perform a CCA:

python

from gemmr.estimators import SVDCCA

cca = SVDCCA(n_components=1)
cca.fit(X, Y)

After fitting, several attributes become available. Estimated canonical correlations are stored in

python

cca.corrs_

weight (rotation) vectors in cca.x_rotations_ and cca.y_rotations_, and the attributes x_scores_ and y_scores_ provide the in-sample scores:

python

@savefig svdcca_scatter_scores.png width=4in
plt.scatter(cca.x_scores_, cca.y_scores_, s=1)

SVDPLS works analogously, but note that it finds maximal covariances instead of correlations, and correspondingly has an attribute covs_.
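As a sketch of the analogous PLS workflow (assuming SVDPLS accepts the same n_components argument as SVDCCA):

python

from gemmr.estimators import SVDPLS

pls = SVDPLS(n_components=1)
pls.fit(X, Y)
pls.covs_   # estimated covariances, the PLS counterpart of cca.corrs_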

For more information see the reference pages for .estimators.SVDCCA and .estimators.SVDPLS.

A sparse CCA estimator, based on the R package PMA, is implemented as .estimators.SparseCCA.
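A hypothetical usage sketch, assuming SparseCCA follows the same sklearn-like fit interface as the other estimators (its regularization parameters are documented on its reference page):

python

from gemmr.estimators import SparseCCA

scca = SparseCCA()   # hypothetical: default regularization settings
scca.fit(X, Y)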

How can I investigate parameter dependencies of CCA or PLS?

This can be done with the function .sample_analysis.analyze_model_parameters. A basic use case is shown here:

python

from gemmr.sample_analysis import *

results = analyze_model_parameters(
    'cca',
    pxs=(2, 5),
    rs=(.3, .5),
    n_per_ftrs=(2, 10, 30, 100),
    n_rep=10,
    n_perm=1,
)
results

The variable results contains a number of outcome metrics by default; further ones can be obtained through add-ons, specified via the keyword argument addons of .sample_analysis.analyze_model_parameters.
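To see which metrics were computed, the variables of results can be listed (a sketch assuming results is an xarray Dataset, consistent with the .sel and .mean calls below):

python

print(list(results.data_vars))   # names of the available outcome metrics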

The dependence of outcomes on, for example, sample size can then be inspected:

python

plt.loglog(results.n_per_ftr,
           results.between_assocs.sel(px=5, r=0.3, Sigma_id=0).mean('rep'),
           label='r=0.3')
plt.loglog(results.n_per_ftr,
           results.between_assocs.sel(px=5, r=0.5, Sigma_id=0).mean('rep'),
           label='r=0.5')

plt.xlabel('samples per feature')
plt.ylabel('canonical correlation')
plt.legend()

@savefig canonical_correlation_vs_n.png width=4in
plt.gcf().tight_layout()

See model_param_ana for a more extensive example and the reference page for .sample_analysis.analyzers.analyze_model_parameters for more details.