
xmipp_matrix_dimred gives wrong mapping matrix with PCA and pPCA #315

Closed
MohamadHarastani opened this issue Aug 8, 2020 · 14 comments · Fixed by #336

@MohamadHarastani
Collaborator

Hello,
xmipp_matrix_dimred has an option to save the mapping for linear methods.

   [--saveMapping <fn=>]
           Save mapping if available (PCA, LLTSA, LPP, pPCA, NPE) so that it can be reused later (Y=X*M) 

X is the input matrix, Y is the output matrix, and M is the mapping.
The program seems to give the correct output matrix (Y) but an incorrect mapping matrix (M) for two of the methods: PCA and pPCA.

Here is a sample (200 samples, 3 features per sample) for testing.
X.txt

The following commands generate all the matrices used below:

xmipp_matrix_dimred -i X.txt -o Y_PCA.txt -m PCA --dout 2 --saveMapping M_PCA.txt --din 3 --samples 200
xmipp_matrix_dimred -i X.txt -o Y_LLTSA.txt -m LLTSA --dout 2 --saveMapping M_LLTSA.txt --din 3 --samples 200
xmipp_matrix_dimred -i X.txt -o Y_LPP.txt -m LPP --dout 2 --saveMapping M_LPP.txt --din 3 --samples 200
xmipp_matrix_dimred -i X.txt -o Y_pPCA.txt -m pPCA --dout 2 --saveMapping M_pPCA.txt --din 3 --samples 200
xmipp_matrix_dimred -i X.txt -o Y_NPE.txt -m NPE --dout 2 --saveMapping M_NPE.txt --din 3 --samples 200

The following Python commands can be used to test the output:

from numpy import loadtxt, matmul, sum
X = loadtxt('X.txt')
M_PCA = loadtxt('M_PCA.txt')
M_LLTSA = loadtxt('M_LLTSA.txt')
M_LPP = loadtxt('M_LPP.txt')
M_pPCA = loadtxt('M_pPCA.txt')
M_NPE = loadtxt('M_NPE.txt')
print('PCA mapping verification (should be zero) ', sum(matmul(X,M_PCA) - loadtxt('Y_PCA.txt')))
print('LLTSA mapping verification (should be zero) ', sum(matmul(X,M_LLTSA) - loadtxt('Y_LLTSA.txt')))
print('LPP mapping verification (should be zero) ', sum(matmul(X,M_LPP) - loadtxt('Y_LPP.txt')))
print('pPCA mapping verification (should be zero) ', sum(matmul(X,M_pPCA) - loadtxt('Y_pPCA.txt')))
print('NPE mapping verification (should be zero) ', sum(matmul(X,M_NPE) - loadtxt('Y_NPE.txt')))

Here is what the output looks like:

PCA mapping verification (should be zero)  5008.6003077752175
LLTSA mapping verification (should be zero)  0.7019146815386402
LPP mapping verification (should be zero)  0.005094921908160148
pPCA mapping verification (should be zero)  58.372971159836005
NPE mapping verification (should be zero)  0.00780677463630014

Another problem occurs when the input is large: some of the methods fail with the message "Killed".

Regards,
Mohamad

@DStrelak
Collaborator

Hi @MohamadHarastani ,
Thanks for reporting this issue.
I have no experience with this program, but I guess I can at least debug the crash. Can you provide me with the big input, or tell me how to generate it?

@DStrelak
Collaborator

@cossorzano , can you please have a look at the output, or assign somebody who knows this program?

@MohamadHarastani
Collaborator Author

MohamadHarastani commented Aug 11, 2020

> Hi @MohamadHarastani ,
> Thanks for reporting this issue.
> I have no experience with this program, but I guess I can at least debug the crash. Can you provide me with the big input, or tell me how to generate it?

Hi @DStrelak ,
Thanks for looking into this. Here is the bigger input that I used as a test (I haven't tested with any other input):
X.txt
This is the command needed to reproduce the error:
xmipp_matrix_dimred -i X.txt -o Y.txt -m PCA --din 35832 --samples 100 --dout 2 --saveMapping M.txt
It fails for these methods, giving the message in parentheses:
PCA (KILLED)
LLTSA (KILLED)
LPP (KILLED)
pPCA (KILLED)
SPE (Segmentation fault (core dumped))
NPE (XMIPP_ERROR 30: Incorrect matrix dimensions)

It works for the remaining methods (LTSA, DM, kPCA, LE, HLLE).

@cossorzano
Contributor

I will have a look into this to see if I can find the problem

@DStrelak
Collaborator

Hi @MohamadHarastani ,
so, I had a look at the issue. It seems that the big input is indeed too big. On my machine, it crashes due to insufficient memory. Your example uses around 19GB of memory, and then fails at another allocation (of ~9.5GB). In theory, we can release 9.5GB during the process. Still, you will need at least 20GB of RAM.
I have prepared a hotfix for devel version of xmipp (xmippCore, ds_issue315_dimredCrashing), which fixes PCA.
We will discuss if we will include this change into release, and I will have a look at the other cases meanwhile.
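For context, the ~9.5 GB allocation is consistent with a dense din x din double-precision matrix (e.g. a covariance matrix) for din = 35832; a quick back-of-the-envelope check (my own estimate, not taken from the xmipp source):

```python
# Rough memory estimate for the failing run (din = 35832, samples = 100),
# assuming dense double-precision (8-byte) matrices. The din x din figure
# matches the ~9.5 GB allocation reported above.
din, samples = 35832, 100
bytes_per_double = 8

input_gib = samples * din * bytes_per_double / 2**30   # the data itself
din_sq_gib = din * din * bytes_per_double / 2**30      # e.g. a covariance matrix

print(f"input matrix (samples x din): {input_gib:.2f} GiB")
print(f"din x din matrix:             {din_sq_gib:.2f} GiB")  # ~9.57 GiB
```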

@cossorzano
Contributor

I have gone over the PCA problem, and it is indeed a memory problem. The solution proposed by David is not really useful, because it frees the memory in Xmipp inside the function firstEigs, which is certainly not pleasant, as in that function we do not know what that matrix is used for.

I am checking now for LLTSA

@DStrelak
Collaborator

> which is certainly not pleasant, as in that function we do not know what that matrix is used for.

Indeed.

@cossorzano
Contributor

Hi @MohamadHarastani , I have gone over LLTSA and it is also a memory problem. I guess it is the same for all the methods that get killed. Apart from the interesting stress problem, in which context in CryoEM do you have such a high input dimensionality?

I will check now about the mapping problem you reported first.

@cossorzano
Contributor

cossorzano commented Aug 12, 2020

Hi @MohamadHarastani ,

the reason for the difference between the projected PCA and the one in Python is that for PCA you first have to subtract the column means:

from numpy import loadtxt, matmul, sum, mean, outer, ones
X = loadtxt('X.txt')
M_PCA = loadtxt('M_PCA.txt')
M_pPCA = loadtxt('M_pPCA.txt')
Xp = X-outer(ones(X.shape[0]),mean(X, axis=0))
print('PCA mapping verification (should be zero) ', sum(matmul(Xp,M_PCA) - loadtxt('Y_PCA.txt')))
print('pPCA mapping verification (should be zero) ', sum(matmul(Xp,M_pPCA) - loadtxt('Y_pPCA.txt')))

The result is
('PCA mapping verification (should be zero) ', 9.5580000628398e-05)
('pPCA mapping verification (should be zero) ', -1.150100000465483e-06)

Cheers and thank you for such detailed issues, they are very easy to reproduce.
Carlos Oscar

@MohamadHarastani
Collaborator Author

Hi @cossorzano and @DStrelak ,
Thanks for looking into this.

> Apart from the interesting stress problem, in which context in CryoEM do you have such a high input dimensionality?

Each line of this data is an atomic structure (atom coordinates reshaped into a line).
I was able to use PCA from scikit-learn without specifying anything other than the output dimensionality (using this code on the big X.txt):

import numpy as np
import matplotlib.pyplot as plt
from sklearn import decomposition
X = np.loadtxt('X.txt')
pca = decomposition.PCA(n_components=3)
X = pca.fit_transform(X)
plt.figure()
plt.scatter(X[:, 0], X[:, 1])
plt.show()

I looked at how PCA is handled there, and they seem to use the method of Halko et al. (2009) for inputs bigger than 500x500; see the svd_solver parameter (str {'auto', 'full', 'arpack', 'randomized'}) in the PCA documentation. We can consider the issue with the big matrix a secondary problem.
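For reference, the randomized solver can also be requested explicitly instead of relying on 'auto'; a minimal sketch, with synthetic data standing in for the big X.txt:

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic stand-in for the big X.txt (few samples, high dimensionality).
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5000))

# svd_solver='randomized' uses the Halko et al. (2009) method, which never
# forms a din x din matrix, so memory stays proportional to the data size.
pca = PCA(n_components=2, svd_solver='randomized', random_state=0)
Y = pca.fit_transform(X)
print(Y.shape)  # (100, 2)
```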

Thanks for your efforts

@MohamadHarastani
Collaborator Author

> Hi @MohamadHarastani ,
>
> the reason for the difference between the projected PCA and the one in Python is that for PCA you first have to subtract the column means:
>
> from numpy import loadtxt, matmul, sum, mean, outer, ones
> X = loadtxt('X.txt')
> M_PCA = loadtxt('M_PCA.txt')
> M_pPCA = loadtxt('M_pPCA.txt')
> Xp = X-outer(ones(X.shape[0]),mean(X, axis=0))
> print('PCA mapping verification (should be zero) ', sum(matmul(Xp,M_PCA) - loadtxt('Y_PCA.txt')))
> print('pPCA mapping verification (should be zero) ', sum(matmul(Xp,M_pPCA) - loadtxt('Y_pPCA.txt')))
>
> The result is
> ('PCA mapping verification (should be zero) ', 9.5580000628398e-05)
> ('pPCA mapping verification (should be zero) ', -1.150100000465483e-06)
>
> Cheers and thank you for such detailed issues, they are very easy to reproduce.
> Carlos Oscar

Thanks a lot @cossorzano for the solution.
Honestly, it is not intuitive that the mean has to be subtracted for these two methods in particular. I was depending on an implementation of inverse PCA in our HEMNMA plugin (inverse PCA to generate animations) that was wrong for years.
Now I can fix it all.
But please add somewhere in the xmipp_matrix_dimred description that these two methods should be treated specially.

Thanks again,
Cheers,
Mohamad

@DStrelak
Collaborator

> Each line of this data is an atomic structure (atom coordinates reshaped into a line).

I strongly recommend NOT using xmipp_matrix_dimred for this. I tried it on your dataset; my patience ran out after 80 minutes :-D

@MohamadHarastani
Collaborator Author

> > Each line of this data is an atomic structure (atom coordinates reshaped into a line).
>
> I strongly recommend NOT using xmipp_matrix_dimred for this. I tried it on your dataset; my patience ran out after 80 minutes :-D

Oh, 80 minutes! I never got that far. If you try the sklearn PCA that I posted above, it works in seconds on the same data.
Thanks a lot @DStrelak for your patience,
Cheers

@MohamadHarastani
Collaborator Author

I wanted to close this issue, so I did another check.
Actually, all the methods require subtracting the mean from the original matrix, not only the two that I reported (PCA, pPCA).
To verify, I repeated the test following the instructions of @cossorzano:

from numpy import loadtxt, matmul, sum, mean, outer, ones
X = loadtxt('X.txt')
M_PCA = loadtxt('M_PCA.txt')
M_LLTSA = loadtxt('M_LLTSA.txt')
M_LPP = loadtxt('M_LPP.txt')
M_pPCA = loadtxt('M_pPCA.txt')
M_NPE = loadtxt('M_NPE.txt')

Xp = X-outer(ones(X.shape[0]),mean(X, axis=0))

print('PCA mapping verification (should be zero) ', sum(matmul(Xp,M_PCA) - loadtxt('Y_PCA.txt')))
print('LLTSA mapping verification (should be zero) ', sum(matmul(Xp,M_LLTSA) - loadtxt('Y_LLTSA.txt')))
print('LPP mapping verification (should be zero) ', sum(matmul(Xp,M_LPP) - loadtxt('Y_LPP.txt')))
print('pPCA mapping verification (should be zero) ', sum(matmul(Xp,M_pPCA) - loadtxt('Y_pPCA.txt')))
print('NPE mapping verification (should be zero) ', sum(matmul(Xp,M_NPE) - loadtxt('Y_NPE.txt')))

And here is the result:

PCA mapping verification (should be zero)  9.5580000628398e-05
LLTSA mapping verification (should be zero)  1.2249999997534415e-05
LPP mapping verification (should be zero)  0.5852513499999976
pPCA mapping verification (should be zero)  -1.150100000465483e-06
NPE mapping verification (should be zero)  -0.6755314400000025

Meaning, the description of xmipp_matrix_dimred should explain what X, Y, and M are, to clear up this confusion:

   [--saveMapping <fn=>]
           Save mapping if available (PCA, LLTSA, LPP, pPCA, NPE) so that it can be reused later (Y=X*M) 

I will add a little description to the file and open a PR.

Regards,
Mohamad
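For completeness, the centering convention can be demonstrated end to end with plain NumPy (a self-contained sketch with synthetic data; this is my own PCA computation, not the xmipp implementation):

```python
import numpy as np

# Sketch of the Y = (X - mean) * M convention the thread converges on,
# using a PCA mapping computed with plain NumPy.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))           # 200 samples, 3 features, like X.txt

mean = X.mean(axis=0)
Xc = X - mean                            # center first, as for xmipp's saved mappings
cov = Xc.T @ Xc / (len(X) - 1)
eigvals, eigvecs = np.linalg.eigh(cov)   # eigenpairs in ascending order
M = eigvecs[:, ::-1][:, :2]              # top-2 principal directions (din x dout)

Y = Xc @ M                               # projected data
# Reusing the mapping on the (centered) data reproduces Y exactly:
print(np.abs((X - mean) @ M - Y).max())  # 0.0
```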

@MohamadHarastani MohamadHarastani self-assigned this Sep 1, 2020
@MohamadHarastani MohamadHarastani linked a pull request Sep 1, 2020 that will close this issue