
Is PMA::CCA implemented in your package? #107

Closed
JohannesWiesner opened this issue Nov 26, 2021 · 17 comments
@JohannesWiesner
Contributor

Hi! First of all, thank you for the great work! I stumbled upon your package because I was looking for a python alternative to the CCA function as implemented in the PMA-package from Witten et al.:

https://github.com/bnaras/PMA/blob/e36a23c506b17be03d23c547075ac134861f7ba0/R/CCA.R#L197

I already found a package that can handle this, but I wonder whether cca-zoo has also implemented it? The only method from Witten et al. that I could find was PMD.

@jameschapman19
Owner

Yes - my PMD is sparse CCA by penalized matrix decomposition from Witten 09 (and the more-than-2-view extension described in the follow-up paper).

That being said your comment is interesting in that I obviously haven't made that clear. I guess because I've called it PMD it seems to refer to the Penalized PCA rather than CCA?

I have seen the method referred to interchangeably as Sparse CCA (because it was the first of the sparse CCA methods), Penalized Matrix Decomposition, and Sparse Partial Least Squares (the latter actually being my preferred name for mathematical reasons!).

@jameschapman19
Owner

Perhaps the actionable change is to make the docs explicit that this refers to PMD for CCA?

@JohannesWiesner
Contributor Author

Hi James, yeah, nomenclature issues seem to plague sparse CCA in general... And I must admit I'm not quite sure I've understood your answer correctly, so, just to get on the same page here:

PMA::CCA (page 4 of PMA docs) is not implemented in your package but only PMA::PMD (page 18 of PMA docs)?

I can definitely tell that the python version of PMA::CCA from @teekuningas is tested against the official R-package. Over the last week, the repository owner and I made sure that PMA::CCA gives the same results as sparsecca._cca_pmd.cca. Testing is done by running PMA::CCA with the example data from Witten et al. against sparsecca._cca_pmd.cca and then comparing the canonical weights. The testing script can be found here.

Is there a class in your repository that would give the same outputs (possibly with a different name, but with the same underlying algorithm)?
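(As an aside on such comparisons: canonical weights are only identified up to a per-column sign flip, which is why the tests below compare absolute values. A minimal sketch of a sign-invariant check; `weights_match` is a hypothetical helper, not part of either package:)

```python
import numpy as np

def weights_match(w_a, w_b, atol=1e-8):
    """Compare two weight matrices elementwise up to sign,
    since canonical weights are only identified up to a sign flip."""
    return np.allclose(np.abs(w_a), np.abs(w_b), atol=atol)
```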

@jameschapman19
Owner

Other way round :)

So my PMD class is [Sparse CCA by] Penalized Matrix Decomposition - and this thread is convincing me I should go with SCCA_PMD as the class name.

Looking through those docs, my PMD should be able to perform both CCA and MultiCCA from PMA in R.

I would expect results to match in all cases for the first component and then for further components I have the option of what I would call CCA (or projection) deflation and classical PLS deflation. PMA performs classical PLS deflation.

I could probably write a quick script based off of that one to check that.
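(For readers unfamiliar with the two deflation schemes being contrasted, a minimal numpy sketch might look like this. This is illustrative only, not cca-zoo's or PMA's internal code:)

```python
import numpy as np

def pls_deflation(X, w):
    """Classical PLS deflation (what PMA performs): subtract the
    rank-one component associated with the score vector t = X @ w."""
    t = X @ w
    return X - np.outer(t, t @ X) / (t @ t)

def projection_deflation(X, w):
    """Projection (CCA) deflation: remove the direction w itself,
    i.e. X @ (I - w w^T) for unit-norm w."""
    w = w / np.linalg.norm(w)
    return X - np.outer(X @ w, w)
```

After PLS deflation the old scores are orthogonal to the deflated data; after projection deflation the deflated data annihilates the old weight vector, which is the source of the small numerical differences mentioned above.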

@JohannesWiesner
Contributor Author

Ah okay so currently:

cca_zoo.models.PMD == PMA::CCA == sparsecca._cca_pmd.cca?

But are you really sure about that? Because I couldn't reproduce the results from PMA::CCA using cca_zoo.models.PMD. Besides, cca_zoo.models.PMD only allows an l1 regularisation parameter between 1 and sqrt(number of features) for each view, whereas PMA::CCA's regularization parameters "must be between 0 and 1 (larger L1 bound corresponds to less penalization)".

@jameschapman19
Owner

Yep - I use the scale from the original paper (1 to sqrt(number of features)); in PMA it looks like they convert it after the argument is passed.

That being said, there seem to be some numerical differences in the 2nd and 3rd mode in your example that I'll try to take a look at (the objective value is pretty close but it converges to a different place).

@jameschapman19
Owner

So the following passes a similar test:

import rpy2.robjects.packages as rpackages
import rpy2.robjects as robjects

import numpy as np
from cca_zoo.models import PMD

def test_compare_pmd_to_r():
    """Compare the output of the original R implementation (PMA) to this
    Python implementation using the example data.
    Thanks to JohannesWiesner for the R and Python code.
    """

    utils = rpackages.importr('utils')
    utils.chooseCRANmirror(ind=1)

    if not rpackages.isinstalled('PMA'):
        utils.install_packages('PMA', verbose=True)

    robjects.r('''
            ## Run example from PMA package ################################################

            library(PMA)

            # first, do CCA with type="standard"
            # A simple simulated example
            set.seed(3189)
            u <- matrix(c(rep(1,25),rep(0,75)),ncol=1)
            v1 <- matrix(c(rep(1,50),rep(0,450)),ncol=1)
            v2 <- matrix(c(rep(0,50),rep(1,50),rep(0,900)),ncol=1)
            x <- u%*%t(v1) + matrix(rnorm(100*500),ncol=500)
            z <- u%*%t(v2) + matrix(rnorm(100*1000),ncol=1000)

            # Can run CCA with default settings, and can get e.g. 3 components
            out <- CCA(x,z,typex="standard",typez="standard",K=1,penaltyx=0.3,penaltyz=0.3)
            out_u = out$u
            out_v = out$v

        ''')

    # Get the r data as numpy arrays
    out_u = np.array(robjects.globalenv['out_u'])
    out_v = np.array(robjects.globalenv['out_v'])
    x = np.array(robjects.globalenv['x'])
    z = np.array(robjects.globalenv['z'])

    # Compute CCA with the same data as in the R example above
    pmd = PMD(c=[0.3 * np.sqrt(x.shape[1]), 0.3 * np.sqrt(z.shape[1])], latent_dims=1,
              max_iter=15, deflation='pls', tol=1e-15)
    pmd.fit((x, z))

    assert np.allclose(np.abs(pmd.weights[0]), np.abs(out_u))
    assert np.allclose(np.abs(pmd.weights[1]), np.abs(out_v))

Note that I have changed the number of components to 1.

Differences in further components come down to the following differences in the initialization of the inner loop:

  • I deflate the data whereas PMA stacks an additional data point such that the objective is zero for the previous set of weights. This allows for projection/CCA or classical/PLS deflation. This can result in small numerical differences.
  • PMA/the code you linked to initializes with the outer SVD, whereas I initialize each inner loop with an unregularized PLS/SVD on the deflated data.

I've checked the latter point by passing the exact same initializations, and then they converge to the same place, so overall I'm satisfied that I'm implementing the algorithm in a valid way.
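(As an illustration of the second point above, a per-component unregularized PLS/SVD start could be sketched as follows. This mirrors the description in the thread, not cca-zoo's exact code; `svd_initialization` is my naming:)

```python
import numpy as np

def svd_initialization(X, Z):
    """Initialize the inner loop with the leading singular pair of the
    cross-covariance X^T Z. Recomputing this on the deflated data differs
    from reusing the outer SVD, one source of the differences discussed."""
    U, s, Vt = np.linalg.svd(X.T @ Z, full_matrices=False)
    return U[:, 0], Vt[0, :]
```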

With your permission I'll add your test for the first component as it's a very useful ground truth.

@JohannesWiesner
Contributor Author

Hi James, thanks for setting up the test. I asked @teekuningas for permission, since he finalized the test, and will let you know once he has responded :) Unfortunately I couldn't run it on my (Windows) machine (even with your latest version from today):

TypeError: __init__() got an unexpected keyword argument 'deflation'

Differences in further components come down to the following differences in the initialization of the inner loop:

I deflate the data whereas PMA stacks an additional data point such that the objective is zero for the previous set of weights. This allows for projection/CCA or classical/PLS deflation. This can result in small numerical differences.
PMA/the code you linked to initializes with the outer SVD, whereas I initialize each inner loop with an unregularized PLS/SVD on the deflated data.

Yeah, that's the problem, I am really lacking the math here, so I cannot really contribute anything meaningful to this discussion...

Running PMD() without the deflation keyword works (using my current project data). And you're right, the weights are pretty similar on the first variate. However, setting K=5 leads to quite dissimilar u weights when plotting the absolute deviations:

import numpy as np
import matplotlib.pyplot as plt

# plot distribution of deviations (ignore zeros for plotting purposes)
u, v = pmd.weights
deviations = np.abs(u) - np.abs(out_u)
deviations = deviations.flatten()
deviations = deviations[deviations != 0]
plt.hist(deviations)

[image: histogram of the absolute deviations in u weights]

Since I would also need to reproduce results from a former package (but would still like to use Python instead of R), I wonder whether one could tweak the class in such a way that you would get more similar results on the other variates too?

@JohannesWiesner
Contributor Author

Yep - I use the scale from the original paper (1 to sqrt(number of features)); in PMA it looks like they convert it after the argument is passed.

Ah, good to know! Just as a side note (really only a UX thing, so not that important): wouldn't it then be a good idea to adopt the PMA::CCA style and do [0.3 * np.sqrt(x.shape[1]), 0.3 * np.sqrt(z.shape[1])] behind the scenes? Then users would pass a value between 0 and 1, which would also be closer to what one is used to from sklearn estimators.
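(The suggested behind-the-scenes conversion could be sketched like this; `convert_penalty` and its validation are hypothetical, not cca-zoo's actual API:)

```python
import numpy as np

def convert_penalty(c, n_features):
    """Map a PMA-style penalty c in (0, 1] onto the 1..sqrt(n_features)
    scale from the original paper (hypothetical conversion helper)."""
    if not 0 < c <= 1:
        raise ValueError("penalty must be between 0 and 1")
    return max(1.0, c * np.sqrt(n_features))
```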

@JohannesWiesner
Contributor Author

I asked @teekuningas for permission, since he finalized the test and will let you know once he responded :)

@teekuningas agreed, so feel free to implement our test in your package :)

@jameschapman19
Owner

Thanks @JohannesWiesner

The deflation kwarg should be ok now (if not then do a pip install cca-zoo --upgrade or pull).

I'll take a look at the initialization problem some time today, most likely; I have an idea of how I'd achieve it. That being said, your result above is super interesting for my own work (and lots of other work using sparse CCA) in that it shows just how much of a problem the algorithm's lack of global convergence is, i.e. small differences in initialization result in substantial differences in weights (even though the value of the objective is similar!).

Your UI point is interesting. Do you feel it is more intuitive to have parameters entered between 0 and 1 (with an error if c*#features < 1 or c*#features > sqrt(#features)), or parameters entered between 1 and sqrt(#features)?

@JohannesWiesner
Contributor Author

The deflation kwarg should be ok now (if not then do a pip install cca-zoo --upgrade or pull).

Yup, works now! However, with my current project data and your settings I get the following UserWarnings:

C:\Users\Johannes.Wiesner\Anaconda3\envs\csp_wiesner_johannes\lib\site-packages\cca_zoo\models\iterative.py:103: UserWarning: Inner loop 0 did not converge or converged to nans
  warnings.warn(f"Inner loop {k} did not converge or converged to nans")
C:\Users\Johannes.Wiesner\Anaconda3\envs\csp_wiesner_johannes\lib\site-packages\cca_zoo\models\iterative.py:103: UserWarning: Inner loop 1 did not converge or converged to nans
  warnings.warn(f"Inner loop {k} did not converge or converged to nans")
C:\Users\Johannes.Wiesner\Anaconda3\envs\csp_wiesner_johannes\lib\site-packages\cca_zoo\models\iterative.py:103: UserWarning: Inner loop 2 did not converge or converged to nans
  warnings.warn(f"Inner loop {k} did not converge or converged to nans")
C:\Users\Johannes.Wiesner\Anaconda3\envs\csp_wiesner_johannes\lib\site-packages\cca_zoo\models\iterative.py:103: UserWarning: Inner loop 3 did not converge or converged to nans
  warnings.warn(f"Inner loop {k} did not converge or converged to nans")
C:\Users\Johannes.Wiesner\Anaconda3\envs\csp_wiesner_johannes\lib\site-packages\cca_zoo\models\iterative.py:103: UserWarning: Inner loop 4 did not converge or converged to nans
  warnings.warn(f"Inner loop {k} did not converge or converged to nans")

That being said, your result above is super interesting for my own work (and lots of other work using sparse CCA) in that it shows just how much of a problem the algorithm's lack of global convergence is, i.e. small differences in initialization result in substantial differences in weights (even though the value of the objective is similar!).

Ha, interesting, glad I could generate interesting research questions!

Your UI point is interesting. Do you feel that it is more intuitive to have parameters entered between 0 and 1 and an error if c*#features < 1 or c*#features>sqrt(#features) or parameters entered between 1 and sqrt(#features)?

Yup, exactly! It would not require calculating the number of features beforehand and would just use error handling in case users don't provide a number between 0 and 1.

@JohannesWiesner
Contributor Author

These are the deviations from PMA::CCA that sparsecca._cca_pmd.cca produces (using my current project data):

[image: histogram of the deviations between PMA::CCA and sparsecca._cca_pmd.cca weights]

So they are really tiny, as you can see.

@jameschapman19
Owner

jameschapman19 commented Dec 1, 2021

I've done some tinkering on the dev branch. The hidden dev release (pip install cca-zoo==1.10.2.dev1) contains my adjusted version. Now the initialization at the start of each loop is the same. There is still a bit of a difference, but the weights in all 3 modes are now correlated with the R package above 0.9, so a script of the form:

    # Compute CCA with the same data as in the R example
    pmd = PMD(
        c=[0.3, 0.3],
        latent_dims=3,
        max_iter=15,
        deflation="pls",
        tol=1e-6,
    )
    pmd.fit((x, z))

    # assert_array_less raises on failure, so no wrapping assert is needed
    np.testing.assert_array_less(
        0.9, np.diag(np.abs(np.corrcoef(pmd.weights[0], out_u, rowvar=False)), 3))
    np.testing.assert_array_less(
        0.9, np.diag(np.abs(np.corrcoef(pmd.weights[1], out_v, rowvar=False)), 3))

will work. This checks that each dimension has a correlation with the original weights greater than 0.9.

I'll probably keep the changes I've made in a future version, as they've actually sped things up a bit, but I've left them in dev for the moment until I get a chance to review.

@jameschapman19
Owner

This also contains the UI change - I'm torn on that change though!

@jameschapman19
Owner

Quick note to say I have now put all of these changes in the main version i.e. from 1.10.3.

@jameschapman19
Owner

Closing for now
