
xmipp_matrix_dimred gives wrong mapping matrix with PCA and pPCA #315

Closed
MohamadHarastani opened this issue Aug 8, 2020 · 14 comments · Fixed by #336

@MohamadHarastani
Collaborator

Hello,
xmipp_matrix_dimred has an option to save the mapping for linear methods.

   [--saveMapping <fn=>]
           Save mapping if available (PCA, LLTSA, LPP, pPCA, NPE) so that it can be reused later (Y=X*M) 

X is the input matrix, Y is the output matrix, and M is the mapping.
The program seems to give the correct output matrix (Y) but an incorrect mapping matrix (M) for two of the methods: PCA and pPCA.

Here is a sample (200 samples, 3 features per sample) for testing.
X.txt

The following commands generate all the matrices used below:

xmipp_matrix_dimred -i X.txt -o Y_PCA.txt -m PCA --dout 2 --saveMapping M_PCA.txt --din 3 --samples 200
xmipp_matrix_dimred -i X.txt -o Y_LLTSA.txt -m LLTSA --dout 2 --saveMapping M_LLTSA.txt --din 3 --samples 200
xmipp_matrix_dimred -i X.txt -o Y_LPP.txt -m LPP --dout 2 --saveMapping M_LPP.txt --din 3 --samples 200
xmipp_matrix_dimred -i X.txt -o Y_pPCA.txt -m pPCA --dout 2 --saveMapping M_pPCA.txt --din 3 --samples 200
xmipp_matrix_dimred -i X.txt -o Y_NPE.txt -m NPE --dout 2 --saveMapping M_NPE.txt --din 3 --samples 200

The following Python commands can be used to test the output:

from numpy import loadtxt, matmul, sum
X = loadtxt('X.txt')
M_PCA = loadtxt('M_PCA.txt')
M_LLTSA = loadtxt('M_LLTSA.txt')
M_LPP = loadtxt('M_LPP.txt')
M_pPCA = loadtxt('M_pPCA.txt')
M_NPE = loadtxt('M_NPE.txt')
print('PCA mapping verification (should be zero) ', sum(matmul(X,M_PCA) - loadtxt('Y_PCA.txt')))
print('LLTSA mapping verification (should be zero) ', sum(matmul(X,M_LLTSA) - loadtxt('Y_LLTSA.txt')))
print('LPP mapping verification (should be zero) ', sum(matmul(X,M_LPP) - loadtxt('Y_LPP.txt')))
print('pPCA mapping verification (should be zero) ', sum(matmul(X,M_pPCA) - loadtxt('Y_pPCA.txt')))
print('NPE mapping verification (should be zero) ', sum(matmul(X,M_NPE) - loadtxt('Y_NPE.txt')))

Here is what the output looks like:

PCA mapping verification (should be zero)  5008.6003077752175
LLTSA mapping verification (should be zero)  0.7019146815386402
LPP mapping verification (should be zero)  0.005094921908160148
pPCA mapping verification (should be zero)  58.372971159836005
NPE mapping verification (should be zero)  0.00780677463630014

Another problem occurs when the input is large: some of the methods fail with the message "Killed".

Regards,
Mohamad

@DStrelak
Collaborator

Hi @MohamadHarastani ,
Thanks for reporting this issue.
I have no experience with this program, but I guess I can at least debug the crash. Can you provide me with the big input, or tell me how to generate it?

@DStrelak
Collaborator

@cossorzano , can you please have a look at the output, or assign somebody who knows this program?

@MohamadHarastani
Collaborator Author

MohamadHarastani commented Aug 11, 2020

> Hi @MohamadHarastani ,
> Thanks for reporting this issue.
> I have no experience with this program, but I guess I can at least debug the crash. Can you provide me with the big input, or tell me how to generate it?

Hi @DStrelak ,
Thanks for looking into this. Here is the bigger input that I used as a test (I haven't tested with any other input):
X.txt
This is the command needed to reproduce the error:
xmipp_matrix_dimred -i X.txt -o Y.txt -m PCA --din 35832 --samples 100 --dout 2 --saveMapping M.txt
It fails for these methods, giving the message in parentheses:
PCA (KILLED)
LLTSA (KILLED)
LPP (KILLED)
pPCA (KILLED)
SPE (Segmentation fault (core dumped))
NPE (XMIPP_ERROR 30: Incorrect matrix dimensions)

It works for the remaining methods (LTSA, DM, kPCA, LE, HLLE).

@cossorzano
Contributor

I will have a look into this to see if I can find the problem

@DStrelak
Collaborator

Hi @MohamadHarastani ,
so, I had a look at the issue. It seems that the big input is indeed too big. On my machine, it crashes due to insufficient memory. Your example uses around 19GB of memory, and then fails at another allocation (of ~9.5GB). In theory, we can release 9.5GB during the process. Still, you will need at least 20GB of RAM.
I have prepared a hotfix for devel version of xmipp (xmippCore, ds_issue315_dimredCrashing), which fixes PCA.
We will discuss if we will include this change into release, and I will have a look at the other cases meanwhile.
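For context, the ~9.5 GB allocation is consistent with a dense din x din double-precision matrix (e.g. a covariance matrix) for din = 35832; a quick back-of-the-envelope check (my own estimate, not taken from the xmipp source):

```python
# Rough memory estimate for the failing run (din = 35832, samples = 100),
# assuming dense double-precision (8-byte) matrices. The din x din figure
# matches the ~9.5 GB allocation reported above.
din, samples = 35832, 100
bytes_per_double = 8

input_gib = samples * din * bytes_per_double / 2**30   # the data itself
din_sq_gib = din * din * bytes_per_double / 2**30      # e.g. a covariance matrix

print(f"input matrix (samples x din): {input_gib:.2f} GiB")
print(f"din x din matrix:             {din_sq_gib:.2f} GiB")  # ~9.57 GiB
```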

@cossorzano
Contributor

I have gone over the PCA problem, and it is indeed a memory problem. The solution proposed by David is not really useful, because it frees the memory in Xmipp inside the function firstEigs, which is certainly not pleasant, as in that function we do not know what that matrix is used for.

I am checking now for LLTSA

@DStrelak
Collaborator

> which is certainly not pleasant, as in that function we do not know what that matrix is used for.

Indeed.

@cossorzano
Contributor

Hi @MohamadHarastani , I have gone over LLTSA and it is also a memory problem. I guess it is the same for all the methods that get killed. Apart from the interesting stress problem, in which context in CryoEM do you have such a high input dimensionality?

I will check now about the mapping problem you reported first.

@cossorzano
Contributor

cossorzano commented Aug 12, 2020

Hi @MohamadHarastani ,

the reason for the difference between the projected PCA and the one in Python is that for PCA you first have to subtract the column means:

from numpy import loadtxt, matmul, sum, mean, outer, ones
X = loadtxt('X.txt')
M_PCA = loadtxt('M_PCA.txt')
M_pPCA = loadtxt('M_pPCA.txt')
Xp = X-outer(ones(X.shape[0]),mean(X, axis=0))
print('PCA mapping verification (should be zero) ', sum(matmul(Xp,M_PCA) - loadtxt('Y_PCA.txt')))
print('pPCA mapping verification (should be zero) ', sum(matmul(Xp,M_pPCA) - loadtxt('Y_pPCA.txt')))

The result is
('PCA mapping verification (should be zero) ', 9.5580000628398e-05)
('pPCA mapping verification (should be zero) ', -1.150100000465483e-06)

Cheers and thank you for such detailed issues, they are very easy to reproduce.
Carlos Oscar

@MohamadHarastani
Collaborator Author

Hi @cossorzano and @DStrelak ,
Thanks for looking into this.

> Apart from the interesting stress problem, in which context in CryoEM do you have such a high input dimensionality?

Each line of this data is an atomic structure (atom coordinates reshaped into a line).
I was able to use PCA from scikit-learn without specifying anything other than the output dimensionality (using this code on the big X.txt):

import numpy as np
import matplotlib.pyplot as plt
from sklearn import decomposition
X = np.loadtxt('X.txt')
pca = decomposition.PCA(n_components=3)
X = pca.fit_transform(X)
plt.figure()
plt.scatter(X[:, 0], X[:, 1])
plt.show()

I looked at how PCA is handled there, and they seem to use the method of Halko et al. (2009) for inputs bigger than 500x500; see the svd_solver parameter (str {'auto', 'full', 'arpack', 'randomized'}) in the PCA documentation. We can consider the issue with the big matrix a secondary problem.
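For reference, the randomized solver can also be requested explicitly instead of relying on 'auto'; a minimal sketch, with synthetic data standing in for the big X.txt:

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic stand-in for the big X.txt (few samples, high dimensionality).
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5000))

# svd_solver='randomized' uses the Halko et al. (2009) method, which never
# forms a din x din matrix, so memory stays proportional to the data size.
pca = PCA(n_components=2, svd_solver='randomized', random_state=0)
Y = pca.fit_transform(X)
print(Y.shape)  # (100, 2)
```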

Thanks for your efforts

@MohamadHarastani
Collaborator Author

> Hi @MohamadHarastani ,
>
> the reason for the difference between the projected PCA and the one in Python is that for PCA you first have to subtract the column means:
>
> from numpy import loadtxt, matmul, sum, mean, outer, ones
> X = loadtxt('X.txt')
> M_PCA = loadtxt('M_PCA.txt')
> M_pPCA = loadtxt('M_pPCA.txt')
> Xp = X-outer(ones(X.shape[0]),mean(X, axis=0))
> print('PCA mapping verification (should be zero) ', sum(matmul(Xp,M_PCA) - loadtxt('Y_PCA.txt')))
> print('pPCA mapping verification (should be zero) ', sum(matmul(Xp,M_pPCA) - loadtxt('Y_pPCA.txt')))
>
> The result is
> ('PCA mapping verification (should be zero) ', 9.5580000628398e-05)
> ('pPCA mapping verification (should be zero) ', -1.150100000465483e-06)
>
> Cheers and thank you for such detailed issues, they are very easy to reproduce.
> Carlos Oscar

Thanks a lot @cossorzano for the solution.
Honestly, it is not intuitive that the mean has to be subtracted for these two methods in particular. I was depending on an implementation of inverse PCA in our HEMNMA plugin (inverse PCA to generate animations) that was wrong for years.
Now I can fix it all.
But please add somewhere in the xmipp_matrix_dimred description that these two methods should be treated specially.

Thanks again,
Cheers,
Mohamad

@DStrelak
Collaborator

> Each line of this data is an atomic structure (atom coordinates reshaped into a line).

I strongly recommend NOT using xmipp_matrix_dimred for this. I tried it on your dataset; my patience ran out after 80 minutes :-D

@MohamadHarastani
Collaborator Author

> > Each line of this data is an atomic structure (atom coordinates reshaped into a line).
>
> I strongly recommend NOT using xmipp_matrix_dimred for this. I tried it on your dataset; my patience ran out after 80 minutes :-D

Oh, 80 minutes! I never got that far. If you try the sklearn PCA that I posted above, it works in seconds on the same data.
Thanks a lot @DStrelak for your patience,
Cheers

@MohamadHarastani
Collaborator Author

I wanted to close this issue, so I did another check.
Actually, all the methods require subtracting the mean from the original matrix, not only the two that I reported (PCA, pPCA).
To verify, I repeated the test following the instructions of @cossorzano:

from numpy import loadtxt, matmul, sum, mean, outer, ones
X = loadtxt('X.txt')
M_PCA = loadtxt('M_PCA.txt')
M_LLTSA = loadtxt('M_LLTSA.txt')
M_LPP = loadtxt('M_LPP.txt')
M_pPCA = loadtxt('M_pPCA.txt')
M_NPE = loadtxt('M_NPE.txt')

Xp = X-outer(ones(X.shape[0]),mean(X, axis=0))

print('PCA mapping verification (should be zero) ', sum(matmul(Xp,M_PCA) - loadtxt('Y_PCA.txt')))
print('LLTSA mapping verification (should be zero) ', sum(matmul(Xp,M_LLTSA) - loadtxt('Y_LLTSA.txt')))
print('LPP mapping verification (should be zero) ', sum(matmul(Xp,M_LPP) - loadtxt('Y_LPP.txt')))
print('pPCA mapping verification (should be zero) ', sum(matmul(Xp,M_pPCA) - loadtxt('Y_pPCA.txt')))
print('NPE mapping verification (should be zero) ', sum(matmul(Xp,M_NPE) - loadtxt('Y_NPE.txt')))

And here is the result:

PCA mapping verification (should be zero)  9.5580000628398e-05
LLTSA mapping verification (should be zero)  1.2249999997534415e-05
LPP mapping verification (should be zero)  0.5852513499999976
pPCA mapping verification (should be zero)  -1.150100000465483e-06
NPE mapping verification (should be zero)  -0.6755314400000025

Meaning, the description of xmipp_matrix_dimred should explain what X, Y, and M are, to clear up this confusion:

   [--saveMapping <fn=>]
           Save mapping if available (PCA, LLTSA, LPP, pPCA, NPE) so that it can be reused later (Y=X*M) 

I will add a little description to the file and open a PR.

Regards,
Mohamad
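For completeness, the centering convention can be demonstrated end to end with plain NumPy (a self-contained sketch with synthetic data; this is my own PCA computation, not the xmipp implementation):

```python
import numpy as np

# Sketch of the Y = (X - mean) * M convention the thread converges on,
# using a PCA mapping computed with plain NumPy.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))           # 200 samples, 3 features, like X.txt

mean = X.mean(axis=0)
Xc = X - mean                            # center first, as for xmipp's saved mappings
cov = Xc.T @ Xc / (len(X) - 1)
eigvals, eigvecs = np.linalg.eigh(cov)   # eigenpairs in ascending order
M = eigvecs[:, ::-1][:, :2]              # top-2 principal directions (din x dout)

Y = Xc @ M                               # projected data
# Reusing the mapping on the (centered) data reproduces Y exactly:
print(np.abs((X - mean) @ M - Y).max())  # 0.0
```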

@MohamadHarastani MohamadHarastani self-assigned this Sep 1, 2020
@MohamadHarastani MohamadHarastani linked a pull request Sep 1, 2020 that will close this issue