-
Notifications
You must be signed in to change notification settings - Fork 15
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Karcher mean #40
Karcher mean #40
Conversation
I do not like future right now as most users are not using it, would like to keep the parallel option |
Also if you add back in rotation and scale how is this different than |
I removed the future framework and used doParallel as you do in other functions. I compared the output of I think we need to document somewhere why there is need to call internally Also, I am thinking about bringing back in the mode as well. Is multidimensional function with first and last point equal to be considered a closed curve? To that end, I tried once to set Finally, once we agree on the above, I'll fix the median computation. |
So I do not agree with just comparing outputs. See Algorithm 39,40,41, and 46 in 2016 Book by Anuj and Klassen The point of moving the centroid is purely numerics as it makes sure all data is centered and you are not working with any input data that has an odd translation. |
Also what what do you mean it “errored”, your input curves have to be closed if you use that mode. We have been using it for a long time. |
I think the purpose of Then one can include re-parametrization and rotation invariance by minimising properly this distance over the correct set (of rotations or of diffeomorphisms). It then defines a semi-metric in the sense that d(f_1, f_2) = 0 does not imply that the two functions are equal but that they belong to the same class of equivalence (which I believe you refer to as an orbit). Also, from a package user perspective, I do not understand the need to have one function for Karcher mean of 1D functional data and another for the multidimensional case. We could have just one which picks the optimisation algorithm depending on the dimension of the co-domain.
Ok. So the centroid should be added back at the end of the iterative procedure then? If so, I see easily how to do so for the original functions but not that clear which center should be added to the Karcher mean. That is why currently the output functions are all centered according to |
Thanks for the book references. I definitely think then that we as in "people working with functional data" are not in this framework. I quickly looked at the references you pointed out and in your setting, you are projecting systematically SRSFs on an hypersphere so you are not exploiting the About the error, I could give you a reprex but maybe it is just my data which are not closed curves... I wanted to try because the first and last value of my curves are the same. Hence on a 3D plot, curves are "closed". But it might not be the definition of a closed curve. |
In fact, what bugs me (probably because of my poor understanding) is that the SRSFs are systematically normalised in the |
So you again are conflating theories. We are dealing with multivariate data so the 2013 paper does not apply. That is only for functions in R^1!. Also I am going to ask have you read the all the papers and the 2016 book by Anuj and Klassen? Your question is fundamental. We assume that the curves are absolutely continuous in R^n for the SRVF to exist. You can include rotation, but as was demonstrated in the papers in 2011 and 2012 by Anuj and Kurtek and the 2016 book by Anuj that one does not want to do that on L2 metric usually and we want an elastic metric and we go to the preshape space of the Hilbert sphere for open curves and \mathbb{S}_c for closed curves. This is WHY the package is coded this way. All of it is around the elastic metric in the proper space! I am not going to change this. From my perspective I want a separate function for multivariate data as it is slightly different from functions in R^1. The proper metric for the case of multivariate functional data (R^2) is L2 if you do mod out rotation and scale. I coded this function for this very explicit case. Per the users of the package and groups that I support we will keep it this way. We could do a complete refactor as you suggest, but this will be very large and a lot of breaking changes for users. If I was going to rewrite from the ground up after ALL the theory was developed I would have done that. This package was developed as the theory progressed. The centroid could be added back in, but we have decided that is left to the user and will stay that way. |
I would say some people are not in this framework, others are. And there is a difference between SRSF and SRVF. Hence the difference in the package. Generally for any curve in R^N where N > 1 this is a BAD idea. See comment above for a bit more. But the point of all this work is that we want to use the elastic metric in the proper space. I need to see your data to answer this question on closed curves |
We do this for the numerics of computing the optimal warping function. The c++ code expects a normalized SRVF which makes the numerics more accurate. Scale does not effect the solution of the warping function. |
I did read the papers from 2011 and 2012 (no not the entire book from 2016). Specifically, I am referring to Table 1 from the 2012 paper which defines the prespace as |
Yes table 1 in that paper points out the spaces. The focus of this package is on the bottom row. If there is any expansion I would only want to do the upper left. If you are wanting to do the entire top row, we need to be very explicit in the documentation as users will just use this based on file name and not understand which metrics and then will complain when the results don't match. Kurtek goes on in that paper to motivate the elastic approach and using Algorithm 1. I really want to avoid confusion between those spaces. |
The bottom row does not include rotation invariance. So I wonder what is done in the package when we use I am happy to create a separate package if you think it might clarify for the users. But I need to fully grasp the subtle points it seems I am missing first :-). In fact, I originally was misled by the name of the package I still think that a single package could handle all 4 feature spaces. |
It is dedicated to FDA, I am very offended by that remark actually and no we should not create another package, but should be careful about how we name things here if we go this route. Please see the book https://link.springer.com/book/10.1007/978-1-4939-4020-2 That is my entire research is in FDA. The bad part is most statisticians don't go beyond the simple definition of R^1 and multivariate functional data. Functional data goes way beyond that. See the work by Dryden, Marron, Srivastava, Tucker, Kurtek and more. Functional Data Analysis is the application of functional analysis and can go beyond functions in R^1!!! |
You are too focused on one paper and one table and the bottom row does talk about rotation that is the bottom right entry. |
I think we can handle all four, but its going to have to be a meta function that is documented well and goes out and calls the sub-functions. I have a lot of users that use the other functions so I need to keep functionality. Some people don't read documentation and just apply functions by names and gets them into trouble so I would like to avoid that as much as possible. This function is not going to solve this though. |
I apologise, I did not mean to offend you at all. I am really just trying to understand the whole picture here. In particular, my confusion started with the need to create 3 functions to handle 1D functional data, multidimensional data and curves separately. Why do we want to do this if all of these are functional data? This can be really confusing. The type of data is always functional data, which justifies the name of the package. Then, the choice from the user perspective should be the feature space, i.e. does he want to detect changes in scale, rotation, position, etc. Once the user has made this choice, then we as developers should internally use the appropriate space/representation and metric to do all kind of statistical analysis, be it simple distance computation, or PCA or Karcher mean or clustering or others. On the contrary, by naming one function
I think you are a bit harsh here. Statisticians have been using FDA to deal with multivariate functional data for a long time, so not just functions in |
My bad, yes indeed, you are right. But that (bottom right) means that you systematically assume that In any event, if I can try to explain where I wanted to go with |
Actually, after this discussion, I think the current implementation of If you agree, then, combined with |
I think your confusion is one of the first in this area. I am going to repeat this again. This project has grown as the theory developed, hence its structure. It also was set up to mirror the MATLAB codes first produced by Anuj, Sebastian, and I. If I would start over now and have a different structure, yes. But we are here now with a lot of users used to this structure and naming. This is a start on the top row, but needs better documentation, better initialization and is only for multivariate data not 1-D. There are specific numerical things in the 1-D code that we do to protect the user that is not covered in this function. But yes combined with Again I think at multivariate meta function to handle all four cases would be better, with good documentation on what each means. Doing a whole sale refactor of the package, while great is not a good idea given how this package is being used by multiple grad students and national labs. We can slowly work down that road, but it will have to be very slow. Also you are still limiting your view on FDA by just saying shape. We can also go to the space of [0,1]^2 and include images and surfaces. |
Equally offended by this remark. I have also dedicated most of my research to FDA and we as a group at the department of mathematics in France and also in my old lab at MOX, Politecnico di Milano, got confused by the namings and the multiple functions. To be more constructive, I think this is a classic case of package developed by theoretical mathematicians where each function does one mathematical operation in a given space with a given metric. On the contrary, R is primarily targeting applied statisticians for whom a bijection between package functions and statistical methods will generally work better, e.g. a single function for computation of mean for functional data using elastic distance. The applied statistician often does not necessarily have the underlying knowledge to pick the right function in the current API of the package. I mean by that the (s)he does know whether (s)he wants rotation invariance or scale invariance and so on, but (s)he does not necessarily know what kind of preshape space does this imply nor which metric should be used in that space. So I guess the API design mostly depends on the targeted audience, which, when it comes to R packages, are applied statisticians. I have a similar issue with two packages I maintain: rgudhi and rgeomstats, which are still at their infancy. They make the Python packages GUDHI (for topological data analysis) and GeomStats (for statistics on Riemannian manifolds) available to the R community. I originally (currently) designed the R packages to mimic the Python API. But as time goes by and feedback comes in, I feel the R community is mostly composed of data scientist who want to apply stats methods but do not always have the required maths background to correctly make the subtle choices. |
Many functions of the package require a huge update of their documentation I would say, indeed including this one. This is related to another issue posted by myself and @araiari. About the initialisation, I will push today the function As for multivariate VS 1D, well, aren't 1D function multivariate with dimension 1 somehow? Why would it not work for 1D functions? If you can explain, I can adapt the code but as I am not familiar with the numerical tricks, it is hard for me to grasp the difference. You confirm then that |
Aren't images and surfaces shapes? The difference is only the dimension of the domain that changes, isn't it? |
I would go farther. Since you said that the package focuses on the bottom row of Table 1 in the 2012 paper, then the optional argument |
I disagree that R is applied statistics, it is used by applied statisticians, research statisticians, mathematical statisticians. Yes the documentation needs improvement. Yes 1-D mathematically is dimension 1 of multidimensional. There are numerical tricks to handle integration error and choosing orbit elements, etc. I don't have time right now to explain all of it and its not for this PR. You are trying understand 14years of work in a PR. You could call images shape. But there are differences and the distinction is important and for shape we go to Kendall's definition. One could also argue the reverse that 1-D function and N-D functions are shape so it is all "shape". the option of Epifanio, I., Gimeno, V., Gual-Arnau, X., and Ibáñez-Gual, M. V. (2020), “A New Geometric Metric in the Shape and Size Space of Curves in Rn,” Mathematics, 8, 1691. https://doi.org/10.3390/math8101691. |
Sure, this PR is about Karcher mean but it is all related. My aim is to provide
I think I have the second function ready (called currently And this PR also should go up to the third point and provide a function for Karcher mean computation. So although the PR is named Karcher mean, all the points discussed here are important to get it right. |
…to initialize Karcher mean computation.
|
I just committed the function which should fulfil the second point. |
It also should be noted for your awareness, the curve functions do not implement all the optimization functions that are available for 1-D. See |
Thanks for merging. I am still working on all this but will create separate issues and PRs when the time comes. |
A new version for
multivariate_karcher_mean()
which brings back the possibility to use a rotation- and/or scale-invariant metric.The function is parallelized via the future framework. It is handy because you do not have to add any extra parameter to the function. It is parallelized automatically whenever the user chooses to parallelise calculations within his R session via
future::plan(future::multisession, workers = 8)
for example.One point to improve is the case of median under scale invariance. I am pretty sure I do not use the correct distance.