Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Implementing modified Twinspan according to Rolecek et al. (2009) #2

Open
kamapu opened this issue Dec 9, 2021 · 3 comments
Open
Labels
enhancement New feature or request

Comments

@kamapu
Copy link

kamapu commented Dec 9, 2021

As mentioned once, we have tested this package during a course using a small data set from Fujiwara et al. (2014). Now my question is if there is a possibility to implement the modified Twinspan according to Roleček et al. (2009).

During the course I also had contacted @zdealveindy, who is also interested in such implementation (maybe even able to support?).

Though the modified Twinspan is already implemented on Juice, this software is restricted to Windows and therefore an implementation on R will increase the accessibility of the assessment for people working in different operative systems, like me (I'm working on Linux).

@jarioksa
Copy link
Owner

jarioksa commented Dec 14, 2021

There are a couple of issues here. Firstly, the choice of the next step should be implemented in Fortran code which makes it all a bit trickier. In particular as I also want to maintain the original behaviour of TWINSPAN so that we need IF...ELSE... branches. Second issue is how to define the heterogeneity. I got to log in with my university credentials to read the linked paper, and therefore I am not sure how Roleček defined the heterogeneity, but I have a vague memory that it was some external dissimilarity index.

Then an alternative that we can easily implement. The twinspan package has now function twintotalchi which estimates the total heterogeneity of each division using the internal TWINSPAN criterion of heterogeneity: total Chi-square of downweighted data within each group. We can already plot dendrograms with this measure of heterogeneity (the attached graph uses the ahti data of the twinspan package).
twinlevel
The package also has a function cut that returns the classification at given level. The red line in the left graph cuts the quadrats in eight classes. There is no (yet) such a function for heterogeneity-based dendrogram. However, I worked out a simple function that can do this: it will return a requested number of groups picking groups by their heterogeneity. The red line in the right graph gives the cut level for eight groups when the tree is based on within-group heterogeneity (Chi-square). That would in effect be similar as doing these splits by heterogeneity. I think there is no guarantee that split groups are more homogeneous than their parent group – although this can be expected in many cases for Chi-square (but not for the first eigenvalue that is the criterion for a split).

The standard cut by division level gives this distribution of quadrats in groups 8…15

> table(cut(tw, 3))
 8  9 10 11 12 13 14 15 
14 25 18 30 22 18 32 11 

With the experimental function the distribution of quadrats to eight groups is:

> table(twinspan:::chicut(tw, 8)) # preliminary (also name which can change), not yet exported
 7  8  9 10 12 13 22 23 
43 14 25 18 22 18 19 11 

Group 7 was not split, and so groups 14 & 15 are missing. Group 11 is also missing because it was split to 22 & 23. Other groups are the same in both cuts

The first experimental version of function is available in commit 03bad03 in branch cut-by-chisquare.

@jarioksa jarioksa added the enhancement New feature or request label Dec 14, 2021
jarioksa added a commit that referenced this issue Dec 21, 2021
Implements Rolecek et al., JVS 20, 596-602 (2009) modified TWINSPAN
where divisions are picked by heterogeneity (but honouring hierarchy).
Basically there are two functions achieve this: cuth() that returns
a classification vector for quadrats or species for most heterogeneous
classes, and modification of as.hclust that now can display the
TWINSPAN tree using heterogeneity as height. The heterogeneity is
extracted from twinspan object, and it is the total chi-square (or
inertia or sum of all eigenvalues) of the matrix internally used by
twinspan in that division. Wish of github issue #2.
@jarioksa
Copy link
Owner

jarioksa commented Dec 21, 2021

I have essentially implemented this using "alternative method" of the Roleček paper. This involves one new user-level function and one new argument in another, plus some internal support functions. I kept the old TWINSPAN code untouched and only manipulate the result so that splitting heterogeneities are taken into account.

Function as.hclust.twinspan which draws the TWINSPAN classification tree can now use heterogeneity to set the tree heights or the branching points. This essentially draws the Roleček tree.

> library(twinspan)
> library(vegan) # ordilabel
> data(ahti)
> tw <- twinspan(ahti)
> cl0 <- as.hclust(tw)
> cl <- as.hclust(tw, height="chi")
> plot(cl0)
> ordilabel(cl0, "internal", cl0$nodelabels) # plot(tw) does all this, but does not do the rect.hclust
> rect.hclust(cl0, 8)
> plot(cl, main="Rolecek Tree")
> ordilabel(cl, "internal", cl$nodelabels)
> rect.hclust(cl, 8)

roletree
The tree has fixed number of division, but the modified tree can be recovered for a decent number of classes as long as there is enough depth.

The trees can also be displayed in the image plot:

> image(tw, height="chi", leadingspecies=TRUE)

roleiamage
As you see, the method applies to species, too.

I wrote a new function cuth that cuts the classification by height, and can be used to extract classes for a single class. This function takes as argument the desired number of classes, and will go down from most heterogeneous:

> cut(tw, level=3) # level 3 has 2^3 = 8 classes (at maximum)
  [1] 11 11 11 11 11 11 11 11 14 11 11 11 12 11 11 11 14 10 10 10 10 10 10 10 10
 [26] 10 10 10 10 10 10 10 10 10  8  8  8  8  8  8  8  8  8  8  8  8  8  9  8  9
 [51]  9  9  9  9  9  9  9  9  9 11 11 11  9  9  9  9  9 11 11 11 11 11  9 11 10
 [76]  9  9  9  9 11 11  9 11 11 11 11 11 15 14 14 14 14 15 14 14 14 14 14 14 14
[101] 14 14 14 14 14 14 14 14 14 12 12 12 15 15 15 15 15 14 14 14 14 15 14 15 14
[126] 14 15 13 12 14 13 14 13 13 12 13 13 13 14 12 12 12 12 12  9 12  9 12 12 12
[151] 13 13 13 12  9 13 12 13 13 12 12 12 12 12 13 13 13 13 13 15
> cuth(tw, n=8)
  [1] 23 23 23 22 23 23 23 23  7 22 22 23 12 22 23 22  7 10 10 10 10 10 10 10 10
 [26] 10 10 10 10 10 10 10 10 10  8  8  8  8  8  8  8  8  8  8  8  8  8  9  8  9
 [51]  9  9  9  9  9  9  9  9  9 22 23 23  9  9  9  9  9 22 22 22 22 22  9 22 10
 [76]  9  9  9  9 22 22  9 22 22 22 22 22  7  7  7  7  7  7  7  7  7  7  7  7  7
[101]  7  7  7  7  7  7  7  7  7 12 12 12  7  7  7  7  7  7  7  7  7  7  7  7  7
[126]  7  7 13 12  7 13  7 13 13 12 13 13 13  7 12 12 12 12 12  9 12  9 12 12 12
[151] 13 13 13 12  9 13 12 13 13 12 12 12 12 12 13 13 13 13 13  7

As the criterion of heterogeneity I used total Chi-square also known as Inertia also known as the sum of all eigenvalues. I use the same Chi-square of the same matrix (or submatrix) that is used in TWINSPAN: stacked pseudospecies with downweighting. For species, I also use the internally constructed species matrix – which is a bit trickier to construct. Functions twin2stack and twin2specstack and orditotalchi estimates the Chi-square. All these are old functions. It would easy to plug in other criteria (as the implementation is modular), but I'm not inclined to do so: it makes sense to use internal TWINSPAN criterion instead of criteria alien to the method.

The greatest complication was that the classification tree can be reversed: child group is more heterogeneous than her parent group. It seems that I can handle these cases. If you look carefully the image, you see that species tree has two such reversals, but they seem to work smoothly.

Finally about the direct alternative implementation following more closely Roleček's original implementation. Unlike I wrote, it could be implemented in R without need of modifying the Fortran code. A clever implementation would need refactoring current twinspan code. The code would essentially be calling `twinspan(..., levmax=1), getting the heterogeneities and recursively doing the same for the most heterogeneous group. Refactoring is needed, because we cannot have species clustering, and we do not want re-do data preparation before TWINSPAN call.

This is now merged, and I appreciate your comments, criticism and ideas of development @kamapu and @zdealveindy (and for your information @gavinsimpson).

@kamapu
Copy link
Author

kamapu commented Dec 22, 2021

Wow! Thank you, for considering the issue. I'll get through (I'm thinking to run a classification in Juice in parallel and compare the outputs).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

No branches or pull requests

2 participants