
Is PMA::CCA implemented in your package? #107

Closed
JohannesWiesner opened this issue Nov 26, 2021 · 17 comments
@JohannesWiesner
Contributor

Hi! First of all, thank you for the great work! I stumbled upon your package because I was looking for a python alternative to the CCA function as implemented in the PMA-package from Witten et al.:

https://github.com/bnaras/PMA/blob/e36a23c506b17be03d23c547075ac134861f7ba0/R/CCA.R#L197

I already found a package that can handle this, but I wonder whether cca-zoo has also implemented it? The only method from Witten et al. that I could find was PMD.

@jameschapman19
Owner

Yes - my PMD is sparse CCA by penalized matrix decomposition from Witten 09 (and the more-than-2-view extension described in the follow-up paper).

That being said your comment is interesting in that I obviously haven't made that clear. I guess because I've called it PMD it seems to refer to the Penalized PCA rather than CCA?

I have seen the method referred to interchangeably as Sparse CCA (because it was the first of the sparse CCA methods), Penalized Matrix Decomposition, and Sparse Partial Least Squares (the latter actually being my preferred name for mathematical reasons!).

@jameschapman19
Owner

Perhaps the actionable change is to make the docs explicit that this refers to PMD for CCA?

@JohannesWiesner
Contributor Author

Hi James, yeah, nomenclature issues seem to plague sparse CCA in general... And I must admit I'm not quite sure I've understood your answer correctly, so, just to get on the same page here:

PMA::CCA (page 4 of PMA docs) is not implemented in your package but only PMA::PMD (page 18 of PMA docs)?

I can definitely tell that the python version of PMA::CCA from @teekuningas is tested against the official R-package. Over the last week, the repository owner and I made sure that PMA::CCA gives the same results as sparsecca._cca_pmd.cca. Testing is done by running PMA::CCA with the example data from Witten et al. against sparsecca._cca_pmd.cca and then comparing the canonical weights. The testing script can be found here.

Is there a class in your repository that would give the same outputs (possibly with a different name, but with the same underlying algorithm)?
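(As an aside on such comparisons: canonical weights are only identified up to a per-column sign flip, which is why the tests below compare absolute values. A minimal sketch of a sign-invariant check; `weights_match` is a hypothetical helper, not part of either package:)

```python
import numpy as np

def weights_match(w_a, w_b, atol=1e-8):
    """Compare two weight matrices elementwise up to sign,
    since canonical weights are only identified up to a sign flip."""
    return np.allclose(np.abs(w_a), np.abs(w_b), atol=atol)
```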

@jameschapman19
Owner

Other way round :)

So my PMD class is [Sparse CCA by] Penalized Matrix Decomposition - and this thread is convincing me I should go with SCCA_PMD as the class name.

Looking through those docs, my PMD should be able to perform both CCA and MultiCCA from PMA in R.

I would expect results to match in all cases for the first component and then for further components I have the option of what I would call CCA (or projection) deflation and classical PLS deflation. PMA performs classical PLS deflation.

I could probably write a quick script based off of that one to check that.
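(For readers unfamiliar with the two deflation schemes being contrasted, a minimal numpy sketch might look like this. This is illustrative only, not cca-zoo's or PMA's internal code:)

```python
import numpy as np

def pls_deflation(X, w):
    """Classical PLS deflation (what PMA performs): subtract the
    rank-one component associated with the score vector t = X @ w."""
    t = X @ w
    return X - np.outer(t, t @ X) / (t @ t)

def projection_deflation(X, w):
    """Projection (CCA) deflation: remove the direction w itself,
    i.e. X @ (I - w w^T) for unit-norm w."""
    w = w / np.linalg.norm(w)
    return X - np.outer(X @ w, w)
```

After PLS deflation the old scores are orthogonal to the deflated data; after projection deflation the deflated data annihilates the old weight vector, which is the source of the small numerical differences mentioned above.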

@JohannesWiesner
Contributor Author

Ah okay so currently:

cca_zoo.models.PMD == PMA::CCA == sparsecca._cca_pmd.cca?

But are you really sure about that? Because I couldn't reproduce the results from PMA::CCA using cca_zoo.models.PMD. Besides, cca_zoo.models.PMD only allows an l1 regularisation parameter between 1 and sqrt(number of features) for each view, whereas PMA::CCA's regularization parameters "must be between 0 and 1 (larger L1 bound corresponds to less penalization)".

@jameschapman19
Owner

Yep - I use the scale from the original paper (1 to sqrt(number of features)); in PMA it looks like they convert it after the argument is passed.

That being said, there seem to be some numerical differences in the 2nd and 3rd mode in your example that I'll try to take a look at (the objective value is pretty close but it converges to a different place).

@jameschapman19
Owner

So the following passes a similar test:

import rpy2.robjects.packages as rpackages
import rpy2.robjects as robjects

import numpy as np
from cca_zoo.models import PMD

def test_compare_pmd_to_r():
    """Compare the output of the original R implementation (PMA) to this
    Python implementation using the example data.
    Thanks to JohannesWiesner for the R and Python code.
    """

    utils = rpackages.importr('utils')
    utils.chooseCRANmirror(ind=1)

    if not rpackages.isinstalled('PMA'):
        utils.install_packages('PMA', verbose=True)

    robjects.r('''
            ## Run example from PMA package ################################################

            library(PMA)

            # first, do CCA with type="standard"
            # A simple simulated example
            set.seed(3189)
            u <- matrix(c(rep(1,25),rep(0,75)),ncol=1)
            v1 <- matrix(c(rep(1,50),rep(0,450)),ncol=1)
            v2 <- matrix(c(rep(0,50),rep(1,50),rep(0,900)),ncol=1)
            x <- u%*%t(v1) + matrix(rnorm(100*500),ncol=500)
            z <- u%*%t(v2) + matrix(rnorm(100*1000),ncol=1000)

            # Can run CCA with default settings, and can get e.g. 3 components
            out <- CCA(x,z,typex="standard",typez="standard",K=1,penaltyx=0.3,penaltyz=0.3)
            out_u = out$u
            out_v = out$v

        ''')

    # Get the r data as numpy arrays
    out_u = np.array(robjects.globalenv['out_u'])
    out_v = np.array(robjects.globalenv['out_v'])
    x = np.array(robjects.globalenv['x'])
    z = np.array(robjects.globalenv['z'])

    # Compute CCA with the same data as in the R example above
    pmd = PMD(c=[0.3 * np.sqrt(x.shape[1]), 0.3 * np.sqrt(z.shape[1])], latent_dims=1,
              max_iter=15, deflation='pls', tol=1e-15)
    pmd.fit((x, z))

    assert np.allclose(np.abs(pmd.weights[0]), np.abs(out_u))
    assert np.allclose(np.abs(pmd.weights[1]), np.abs(out_v))

Note that I have changed the number of components to 1.

Differences in further components come down to the following differences in the initialization of the inner loop:

  • I deflate the data whereas PMA stacks an additional data point such that the objective is zero for the previous set of weights. This allows for projection/CCA or classical/PLS deflation. This can result in small numerical differences.
  • PMA/the code you linked to initializes with the outer SVD, whereas I initialize each inner loop with an unregularized PLS/SVD on the deflated data.

I've checked the latter point by passing the exact same initializations, and then they converge to the same place, so overall I'm satisfied that I'm implementing the algorithm in a valid way.
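(As an illustration of the second point above, a per-component unregularized PLS/SVD start could be sketched as follows. This mirrors the description in the thread, not cca-zoo's exact code; `svd_initialization` is my naming:)

```python
import numpy as np

def svd_initialization(X, Z):
    """Initialize the inner loop with the leading singular pair of the
    cross-covariance X^T Z. Recomputing this on the deflated data differs
    from reusing the outer SVD, one source of the differences discussed."""
    U, s, Vt = np.linalg.svd(X.T @ Z, full_matrices=False)
    return U[:, 0], Vt[0, :]
```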

With your permission I'll add your test for the first component as it's a very useful ground truth.

@JohannesWiesner
Contributor Author

Hi James, thanks for setting up the test. I asked @teekuningas for permission, since he finalized the test, and will let you know once he has responded :) Unfortunately I couldn't run it on my (Windows) machine (even with your latest version from today):

TypeError: __init__() got an unexpected keyword argument 'deflation'

Differences in further components come down to the following differences in the initialization of the inner loop:

I deflate the data whereas PMA stacks an additional data point such that the objective is zero for the previous set of weights. This allows for projection/CCA or classical/PLS deflation. This can result in small numerical differences.
PMA/the code you linked to initializes with the outer SVD, whereas I initialize each inner loop with an unregularized PLS/SVD on the deflated data.

Yeah, that's the problem, I am really lacking the math here, so I cannot really contribute anything meaningful to this discussion...

Running PMD() without the deflation keyword works (using my current project data). And you're right, the weights are pretty similar on the first variate. However, setting K=5 leads to quite dissimilar u weights when plotting the absolute deviations:

import numpy as np
import matplotlib.pyplot as plt

# plot distribution of deviations (ignore zeros for plotting purposes)
u, v = pmd.weights
deviations = np.abs(u) - np.abs(out_u)
deviations = deviations.flatten()
deviations = deviations[deviations != 0]
plt.hist(deviations)

[image: histogram of the absolute deviations in u weights]

Since I would also need to reproduce results from a former package (but would still like to use Python instead of R), I wonder whether one could tweak the class in such a way that you would get more similar results on the other variates too?

@JohannesWiesner
Contributor Author

Yep - I use the scale from the original paper (1 to sqrt(number of features)); in PMA it looks like they convert it after the argument is passed.

Ah, good to know! Just as a side note (really only a UX thing, so not that important): wouldn't it then be a good idea to adopt the PMA::CCA style and do [0.3 * np.sqrt(x.shape[1]), 0.3 * np.sqrt(z.shape[1])] behind the scenes? Then users would pass a value between 0 and 1, which would also be closer to what one is used to from sklearn estimators.
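(The suggested behind-the-scenes conversion could be sketched like this; `convert_penalty` and its validation are hypothetical, not cca-zoo's actual API:)

```python
import numpy as np

def convert_penalty(c, n_features):
    """Map a PMA-style penalty c in (0, 1] onto the 1..sqrt(n_features)
    scale from the original paper (hypothetical conversion helper)."""
    if not 0 < c <= 1:
        raise ValueError("penalty must be between 0 and 1")
    return max(1.0, c * np.sqrt(n_features))
```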

@JohannesWiesner
Contributor Author

I asked @teekuningas for permission, since he finalized the test and will let you know once he responded :)

@teekuningas agreed, so feel free to implement our test in your package :)

@jameschapman19
Owner

Thanks @JohannesWiesner

The deflation kwarg should be ok now (if not then do a pip install cca-zoo --upgrade or pull).

I'll take a look at the initialization problem some time today, most likely; I have an idea of how I'd achieve it. That being said, your result above is super interesting for my own work (and lots of other work using sparse CCA) in that it shows just how much of a problem the algorithm's lack of global convergence is, i.e. small differences in initialization result in substantial differences in weights (even though the value of the objective is similar!).

Your UI point is interesting. Do you feel it is more intuitive to have parameters entered between 0 and 1 (with an error if c*#features < 1 or c*#features > sqrt(#features)), or parameters entered between 1 and sqrt(#features)?

@JohannesWiesner
Contributor Author

The deflation kwarg should be ok now (if not then do a pip install cca-zoo --upgrade or pull).

Yup, works now! However, with my current project data and your settings I get the following UserWarnings:

C:\Users\Johannes.Wiesner\Anaconda3\envs\csp_wiesner_johannes\lib\site-packages\cca_zoo\models\iterative.py:103: UserWarning: Inner loop 0 did not converge or converged to nans
  warnings.warn(f"Inner loop {k} did not converge or converged to nans")
C:\Users\Johannes.Wiesner\Anaconda3\envs\csp_wiesner_johannes\lib\site-packages\cca_zoo\models\iterative.py:103: UserWarning: Inner loop 1 did not converge or converged to nans
  warnings.warn(f"Inner loop {k} did not converge or converged to nans")
C:\Users\Johannes.Wiesner\Anaconda3\envs\csp_wiesner_johannes\lib\site-packages\cca_zoo\models\iterative.py:103: UserWarning: Inner loop 2 did not converge or converged to nans
  warnings.warn(f"Inner loop {k} did not converge or converged to nans")
C:\Users\Johannes.Wiesner\Anaconda3\envs\csp_wiesner_johannes\lib\site-packages\cca_zoo\models\iterative.py:103: UserWarning: Inner loop 3 did not converge or converged to nans
  warnings.warn(f"Inner loop {k} did not converge or converged to nans")
C:\Users\Johannes.Wiesner\Anaconda3\envs\csp_wiesner_johannes\lib\site-packages\cca_zoo\models\iterative.py:103: UserWarning: Inner loop 4 did not converge or converged to nans
  warnings.warn(f"Inner loop {k} did not converge or converged to nans")

That being said, your result above is super interesting for my own work (and lots of other work using sparse CCA) in that it shows just how much of a problem the algorithm's lack of global convergence is, i.e. small differences in initialization result in substantial differences in weights (even though the value of the objective is similar!).

Ha, interesting, glad I could generate interesting research questions!

Your UI point is interesting. Do you feel that it is more intuitive to have parameters entered between 0 and 1 and an error if c*#features < 1 or c*#features>sqrt(#features) or parameters entered between 1 and sqrt(#features)?

Yup, exactly! It would not require calculating the number of features beforehand and would just use error handling in case users don't provide a number between 0 and 1.

@JohannesWiesner
Contributor Author

These are the deviations from PMA::CCA that sparsecca._cca_pmd.cca produces (using my current project data):

[image: histogram of the deviations between PMA::CCA and sparsecca._cca_pmd.cca weights]

So they are really tiny, as you can see.

@jameschapman19
Owner

jameschapman19 commented Dec 1, 2021

I've done some tinkering on the dev branch. The hidden dev release (pip install cca-zoo==1.10.2.dev1) contains my adjusted version. Now the initialization at the start of each loop is the same. There is still a bit of a difference, but the weights in all 3 modes are now correlated with the R package above 0.9, so a script of the form:

    # Compute CCA with the same data as in the R example
    pmd = PMD(
        c=[0.3, 0.3],
        latent_dims=3,
        max_iter=15,
        deflation="pls",
        tol=1e-6,
    )
    pmd.fit((x, z))

    # assert_array_less raises on failure, so no wrapping assert is needed
    np.testing.assert_array_less(
        0.9, np.diag(np.abs(np.corrcoef(pmd.weights[0], out_u, rowvar=False)), 3))
    np.testing.assert_array_less(
        0.9, np.diag(np.abs(np.corrcoef(pmd.weights[1], out_v, rowvar=False)), 3))

will work. This checks that each dimension has a correlation with the original weights greater than 0.9.

I'll probably keep the changes I've made in a future version, as they've actually sped things up a bit, but I've left them in dev for the moment until I get a chance to review.

@jameschapman19
Owner

This also contains the UI change - I'm torn on that change though!

@jameschapman19
Owner

Quick note to say I have now put all of these changes in the main version i.e. from 1.10.3.

@jameschapman19
Owner

Closing for now
