Add fjaccard #42

Open
andreanini opened this issue Jan 29, 2023 · 9 comments

Comments

@andreanini

Is the eJaccard in this package equivalent to the min-max similarity (aka Ruzicka Distance aka fuzzy Jaccard)?

@koheiw
Owner

koheiw commented Jan 29, 2023

@andreanini
Author

Dear Kohei, Thank you very much for this and for all the work you are doing for quanteda. Really amazing! I saw the vignette but I suppose I do not have enough sophistication with the notation to understand how that formula (which looks like the Tanimoto coefficient?) relates to those other variants of Jaccard. Min-max or Ruzicka is:

$\frac{\sum_i \min(x_i, y_i)}{\sum_i \max(x_i, y_i)}$

I suppose I was hoping for more clarifications on the mathematics rather than on the code implementation.
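
For reference, a minimal standalone sketch of how the two coefficients differ, written with plain `std::vector` rather than any code from proxy or proxyC. It puts the min-max formula above next to the Tanimoto form $\frac{\sum_i x_i y_i}{\sum_i x_i^2 + \sum_i y_i^2 - \sum_i x_i y_i}$, which is assumed here to be what the vignette's eJaccard corresponds to. The function names are made up for illustration, and returning 0 for all-zero input is an arbitrary convention.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Min-max (Ruzicka / fuzzy Jaccard) similarity: sum of pairwise minima
// divided by sum of pairwise maxima. Assumes non-negative weights.
double minmax_similarity(const std::vector<double>& x,
                         const std::vector<double>& y) {
    assert(x.size() == y.size());
    double num = 0.0, den = 0.0;
    for (std::size_t i = 0; i < x.size(); ++i) {
        num += std::min(x[i], y[i]);
        den += std::max(x[i], y[i]);
    }
    return den > 0.0 ? num / den : 0.0;  // convention chosen here for all-zero input
}

// Tanimoto-style coefficient: x.y / (x.x + y.y - x.y).
// Assumed (not confirmed by the thread) to match the vignette's eJaccard.
double tanimoto_similarity(const std::vector<double>& x,
                           const std::vector<double>& y) {
    assert(x.size() == y.size());
    double xy = 0.0, xx = 0.0, yy = 0.0;
    for (std::size_t i = 0; i < x.size(); ++i) {
        xy += x[i] * y[i];
        xx += x[i] * x[i];
        yy += y[i] * y[i];
    }
    const double den = xx + yy - xy;
    return den > 0.0 ? xy / den : 0.0;  // same all-zero convention as above
}
```

On 0/1 vectors both functions reduce to the ordinary Jaccard index: for (1, 1, 0, 1) and (1, 0, 1, 1) each gives 2/4. On weighted vectors they separate: for (2, 0, 1) and (1, 1, 1) the min-max similarity is 2/4 while the Tanimoto form is 3/5.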

@koheiw
Owner

koheiw commented Jan 29, 2023

This package was originally created to replicate the proxy package for text analysis (for textstat_simil()). I can find the above formula as "fjaccard" in their vignette. I have not implemented it, but I could.

https://cran.r-project.org/web/packages/proxy/vignettes/overview.pdf

@andreanini
Author

Oh yes, of course. They list both, so it makes sense that they are distinct coefficients. Sorry about that.
At the moment I'm using proxy to run "fjaccard" but, of course, I'd love to use textstat_simil() instead. Much faster!

koheiw changed the title from "eJaccard" to "Add fjaccard" on Jan 31, 2023
@andreanini
Author

andreanini commented Oct 11, 2023

Dear Kohei,

I forked your repo because I believed I could easily add this myself and then send a pull request. I think I made the change, but I'm not an expert in C++ and I'm stuck at loading the R package to test it. It seems a library is missing or not on the right path.

The change I made was to add the following function to pair.cpp:

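double simil_fjaccard(colvec& col_i, colvec& col_j) {
    // Element-wise min/max of the two vectors. join_cols() would stack them
    // into one long column, so min()/max() would reduce over all values at once.
    return accu(min(col_i, col_j)) / accu(max(col_i, col_j));
}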

and then of course added "fjaccard" as an option in the similarity functions. If the code above is correct and the change is quite small, would you mind adding it yourself? I'm not sure how long it would take for me to figure out what's wrong with my path.
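
One detail worth checking in a function of this shape is whether the minimum and maximum are taken pairwise across the two vectors, or over all values after the vectors have been stacked end to end. The sketch below contrasts the two reductions using `std::vector` instead of Armadillo. The helper names are hypothetical, and it is only meant to illustrate the difference, not to describe what pair.cpp does.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Pairwise reduction: min and max are taken per position i, then summed.
double ratio_pairwise(const std::vector<double>& x,
                      const std::vector<double>& y) {
    assert(x.size() == y.size());
    double num = 0.0, den = 0.0;
    for (std::size_t i = 0; i < x.size(); ++i) {
        num += std::min(x[i], y[i]);
        den += std::max(x[i], y[i]);
    }
    return den > 0.0 ? num / den : 0.0;
}

// Stacked reduction: the vectors are concatenated into one long vector
// first, so only a single global minimum and maximum remain.
double ratio_stacked(const std::vector<double>& x,
                     const std::vector<double>& y) {
    std::vector<double> stacked(x);
    stacked.insert(stacked.end(), y.begin(), y.end());
    assert(!stacked.empty());
    const auto mm = std::minmax_element(stacked.begin(), stacked.end());
    return *mm.second > 0.0 ? *mm.first / *mm.second : 0.0;
}
```

For (1, 2, 3) and (3, 2, 1) the pairwise version gives 4/8, which matches the min-max formula, while the stacked version gives 1/3, the smallest value divided by the largest.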

This similarity measure is very important in stylometry, and I am developing a stylometry package that depends on quanteda (https://github.com/andreanini/idiolect), so adding this to proxyC and/or quanteda would help many future users of my package and of quanteda.

@koheiw
Owner

koheiw commented Oct 13, 2023

I am developing the fuzzy Jaccard measure in issue-42, and found a disagreement between proxy::simil() and proxy::dist(). Only 1 - proxy::dist() looks correct. Which function are you using?

v1 <- c(0.1, 0.2, 0.3, 0.9)
v2 <- c(0.3, 0.1, 0.2, 0.4)

sum(pmin(v1, v2)) / sum(pmax(v1, v2))
#> [1] 0.4705882

proxyC::simil(v1, v2, method = "fjaccard", margin = 2) 
#> 1 x 1 sparse Matrix of class "dgTMatrix"
#>               
#> [1,] 0.4705882

proxy::simil(v1, v2, method = "fjaccard", by_rows = FALSE)
#>      [,1]     
#> [1,] 0.6538462
1 - proxy::dist(v1, v2, method = "fjaccard", by_rows = FALSE)
#>      [,1]     
#> [1,] 0.4705882
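
The hand computation in the first R call can also be cross-checked outside R. A minimal standalone sketch, with hypothetical function names and no dependency on proxy or proxyC, that evaluates the same two vectors and takes the distance as 1 minus the similarity:

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Fuzzy Jaccard similarity: sum of pairwise minima over sum of pairwise maxima.
double fjaccard_similarity(const std::vector<double>& x,
                           const std::vector<double>& y) {
    assert(x.size() == y.size());
    double num = 0.0, den = 0.0;
    for (std::size_t i = 0; i < x.size(); ++i) {
        num += std::min(x[i], y[i]);
        den += std::max(x[i], y[i]);
    }
    return den > 0.0 ? num / den : 0.0;  // all-zero input: convention chosen here
}

// Corresponding distance, taken as the complement of the similarity.
double fjaccard_distance(const std::vector<double>& x,
                         const std::vector<double>& y) {
    return 1.0 - fjaccard_similarity(x, y);
}
```

With v1 = (0.1, 0.2, 0.3, 0.9) and v2 = (0.3, 0.1, 0.2, 0.4), the minima sum to 0.8 and the maxima to 1.7, so the similarity is 0.8 / 1.7 ≈ 0.4705882 and the distance is ≈ 0.5294118.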

@andreanini
Author

I use proxy::dist(), as proxy lists the fuzzy Jaccard coefficient among the distances. I have previously found issues in the way proxy transforms similarities to distances. For example, proxy transformed the cosine similarity to a distance by computing 1 - similarity, which is incorrect. This has now been fixed after I reported it. There could be a similar bug here. Thanks for your help with this.

@koheiw
Owner

koheiw commented Oct 14, 2023

I think the C code for proxy::simil() is wrong, but the R code for proxy::dist() is correct. I reported it to the maintainer. I don't understand why there are two sets of code.

@andreanini
Author

Yeah, this should be an easy coefficient to transform. Thanks again!
