
Releases: jjlastra/HESML

Release_HESML_V2R1

19 Sep 21:17

This new release of HESML (http://hesml.lsi.uned.es) is a special version focused on sentence similarity methods in the biomedical domain. The main novelties introduced by HESML V2R1 are as follows: (1) the software implementation of a new package for the evaluation of sentence similarity methods; (2) the software implementation of most of the sentence similarity methods in the biomedical domain; (3) the implementation of a new package for sentence pre-processing together with a set of sentence pre-processing configurations; (4) the integration of the three main biomedical NER tools, MetaMap, MetaMapLite and cTAKES; (5) the software implementation of a parser based on the averaging Simple Word EMbeddings (SWEM) models introduced by Shen et al. for efficiently loading and evaluating FastText-based and other word embedding models; (6) the integration of Python wrappers for the evaluation of BERT, Universal Sentence Encoder (USE) and Flair models; and finally, (7) the software implementation of a new string-based sentence similarity method based on the aggregation of the Li et al. similarity and Block distance measures, called LiBlock, as well as eight new variants of the ontology-based methods proposed by Sogancioglu et al., and a new pre-trained word embedding model based on FastText and trained on the full text of the articles in the PMC-BioC corpus.
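As an illustration of the averaging idea behind the SWEM models mentioned in item (5), the following minimal sketch represents a sentence as the mean of its word vectors and compares two sentences with the cosine function. It is not the HESML implementation: the class name SwemAverageSketch, its method names and the toy two-dimensional vectors are assumptions made only for this example.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of averaging-based sentence embeddings (not the HESML API).
public class SwemAverageSketch {

    /** Returns the mean of the vectors of the known tokens; unknown tokens are skipped. */
    public static double[] averageVector(String[] tokens,
                                         Map<String, double[]> embeddings,
                                         int dimension) {
        double[] mean = new double[dimension];
        int known = 0;
        for (String token : tokens) {
            double[] vector = embeddings.get(token);
            if (vector == null) {
                continue; // out-of-vocabulary token
            }
            for (int i = 0; i < dimension; i++) {
                mean[i] += vector[i];
            }
            known++;
        }
        if (known > 0) {
            for (int i = 0; i < dimension; i++) {
                mean[i] /= known;
            }
        }
        return mean;
    }

    /** Standard cosine similarity; returns 0 when either vector has zero norm. */
    public static double cosine(double[] a, double[] b) {
        double dot = 0.0, normA = 0.0, normB = 0.0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        if (normA == 0.0 || normB == 0.0) {
            return 0.0;
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    public static void main(String[] args) {
        // Toy vectors invented for the example, not taken from any real model.
        Map<String, double[]> embeddings = new HashMap<>();
        embeddings.put("heart", new double[] {0.9, 0.1});
        embeddings.put("attack", new double[] {0.2, 0.8});
        embeddings.put("cardiac", new double[] {0.8, 0.2});
        embeddings.put("arrest", new double[] {0.3, 0.7});

        double[] first = averageVector("heart attack".split(" "), embeddings, 2);
        double[] second = averageVector("cardiac arrest".split(" "), embeddings, 2);
        System.out.println("similarity = " + cosine(first, second));
    }
}
```

Skipping out-of-vocabulary tokens is one possible design choice; a real pipeline would first apply the sentence pre-processing configuration selected for the experiment.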

Release_HESML_V1R5.0.2

29 Apr 21:30

HESML V1R5 implements most ontology-based semantic similarity measures and Information Content (IC) models reported in the literature which are based on WordNet, SNOMED-CT, MeSH, GO, and OBO-based ontologies. The main novelties provided by HESML V1R5 are as follows: (1) implementation of the SNOMED-CT, MeSH, GO and OBO-based ontologies for the biomedical domain; (2) software implementation of six new groupwise similarity measures as follows: SimUI [8], SimLP [8], SimGIC [9], Average [10], Maximum [11], Best-match-Average (BMA) [10]; and finally, (3) the introduction of a new family of efficient path-based semantic similarity measures based on the reformulation of path-based measures using the new AncSPL [7] algorithm for the real-time computation of the length of the shortest path between concepts.
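To make the Average, Maximum and Best-match-Average (BMA) groupwise measures from item (2) more concrete, the sketch below aggregates a matrix of pairwise concept similarities between two concept sets. It follows one common formulation of these measures and is not taken from the HESML source code: the class name GroupwiseSketch, its method names and the example matrix are assumptions, and the matrix is assumed to be non-empty and rectangular.

```java
import java.util.Arrays;

// Hypothetical sketch of three groupwise aggregations (not the HESML API).
// sim[i][j] holds the pairwise similarity between concept i of the first set
// and concept j of the second set.
public class GroupwiseSketch {

    /** Average: mean of all pairwise similarities. */
    public static double average(double[][] sim) {
        double sum = 0.0;
        int count = 0;
        for (double[] row : sim) {
            for (double value : row) {
                sum += value;
                count++;
            }
        }
        return sum / count;
    }

    /** Maximum: largest pairwise similarity. */
    public static double maximum(double[][] sim) {
        double best = Double.NEGATIVE_INFINITY;
        for (double[] row : sim) {
            for (double value : row) {
                best = Math.max(best, value);
            }
        }
        return best;
    }

    /** BMA: mean of the average row maxima and the average column maxima. */
    public static double bestMatchAverage(double[][] sim) {
        int rows = sim.length;
        int cols = sim[0].length;
        double[] rowMax = new double[rows];
        double[] colMax = new double[cols];
        Arrays.fill(rowMax, Double.NEGATIVE_INFINITY);
        Arrays.fill(colMax, Double.NEGATIVE_INFINITY);
        for (int i = 0; i < rows; i++) {
            for (int j = 0; j < cols; j++) {
                rowMax[i] = Math.max(rowMax[i], sim[i][j]);
                colMax[j] = Math.max(colMax[j], sim[i][j]);
            }
        }
        double rowMean = Arrays.stream(rowMax).average().orElse(0.0);
        double colMean = Arrays.stream(colMax).average().orElse(0.0);
        return (rowMean + colMean) / 2.0;
    }

    public static void main(String[] args) {
        // Pairwise similarities invented for the example.
        double[][] sim = {
            {1.0, 0.2},
            {0.4, 0.8}
        };
        System.out.println("Average = " + average(sim));
        System.out.println("Maximum = " + maximum(sim));
        System.out.println("BMA     = " + bestMatchAverage(sim));
    }
}
```

The pairwise values themselves would come from any concept-level similarity measure, such as the path-based or IC-based measures described in these release notes.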

HESML V1R5

27 Jul 18:36

HESML V1R5 implements most ontology-based semantic similarity measures and Information Content (IC) models reported in the literature which are based on WordNet, SNOMED-CT, MeSH, GO, and OBO-based ontologies. The main novelties provided by HESML V1R5 are as follows: (1) implementation of the SNOMED-CT, MeSH, GO and OBO-based ontologies for the biomedical domain; (2) software implementation of six new groupwise similarity measures as follows: SimUI [8], SimLP [8], SimGIC [9], Average [10], Maximum [11], Best-match-Average (BMA) [10]; and finally, (3) the introduction of a new family of efficient path-based semantic similarity measures based on the reformulation of path-based measures using the new AncSPL [7] algorithm for the real-time computation of the length of the shortest path between concepts.

HESML V1R4

19 Sep 15:31

HESML V1R4 is the fourth release of the Half-Edge Semantic Measures Library (HESML) [1], which is a new, scalable and efficient Java software library of ontology-based semantic similarity measures and Information Content (IC) models based on WordNet.

HESML V1R4 implements most ontology-based semantic similarity measures and Information Content (IC) models based on WordNet reported in the literature. In addition, it provides an XML-based input file format to specify the execution of reproducible experiments on WordNet-based similarity without writing any code.

HESML is introduced and detailed in a companion reproducibility paper [1] of the methods and experiments introduced in [2,3,4].

The main features of HESML are as follows: (1) it is based on an efficient and linearly scalable representation for taxonomies called PosetHERep introduced in [1], (2) its performance exhibits a linear scalability as regards the size of the taxonomy, and (3) it does not use any caching strategy of vertex sets.

Main novelties in HESML V1R4

The main novelties provided by HESML V1R4 are as follows:

(1) software implementation of a fast version of the Wu&Palmer [7] similarity measure defined by the formula sim(a,b) = 2*depth(LCS(a,b)) / (depth(a) + depth(b))

(2) software implementation of three new similarity measures based on the evaluation of pre-trained word embedding models in the EMB, UKB (ppv) and Nasari file formats. Each row of an EMB file contains a word vector with the raw coordinates of a word in a space of fixed dimension, whilst the coordinates of the word vectors in the UKB file format are defined by WordNet synsets, and Nasari vectors represent BabelNet synsets by a weighted and ranked set of other BabelNet synsets. In addition, Nasari provides an additional file with the BabelNet synsets corresponding to each word. The similarity measures based on the EMB and UKB files use the standard cosine function to compute the degree of similarity between words, whilst the similarity measure based on the Nasari files uses the weighted overlap function as detailed by the authors in [8].

The HESML V1R4 release distribution is also published in Mendeley. HESML V1R4 is distributed with a dozen pre-trained embedding models, which cannot be included in GitHub because of their large size. However, you can download the 'WordEmbeddings.zip' file from our Dataverse repository cited below (e-cienciaDatos) and extract its content into the 'HESML_Library\WordEmbeddings' directory.

Lastra-Díaz, Juan J.; Goikoetxea, Josu; Hadj Taieb, Mohamed; Agirre, Eneko; García-Serrano, Ana; Ben Aouicha, Mohamed (2018). "Word similarity benchmarks of recent word embedding models and ontology-based semantic similarity measures", https://doi.org/10.21950/AQ1CVX, e-cienciaDatos.

(3) software implementation of the IC model proposed by Cai et al. (2017)[6]

(4) software implementation of an IC-based similarity measure proposed by Cai et al. (2017)[6]
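The Wu&Palmer formula quoted in item (1), sim(a,b) = 2*depth(LCS(a,b)) / (depth(a) + depth(b)), can be illustrated with the following minimal sketch on a toy taxonomy. It is not the fast HESML implementation: the class name WuPalmerSketch and its methods are assumptions, the taxonomy is assumed to be a tree in which every concept has a single parent (WordNet itself allows multiple parents), and depth is counted in nodes so that the root has depth 1.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch of the Wu&Palmer measure on a single-parent tree (not the HESML API).
public class WuPalmerSketch {

    // Maps each concept to its parent; the root has no entry.
    private final Map<String, String> parent = new HashMap<>();

    public void addEdge(String child, String parentName) {
        parent.put(child, parentName);
    }

    /** Depth counted in nodes, so the root has depth 1 (an assumption of this sketch). */
    public int depth(String node) {
        int depth = 1;
        for (String p = parent.get(node); p != null; p = parent.get(p)) {
            depth++;
        }
        return depth;
    }

    /** Lowest common subsumer: the first ancestor of b (or b itself) that also subsumes a. */
    public String lcs(String a, String b) {
        Set<String> ancestorsOfA = new HashSet<>();
        for (String n = a; n != null; n = parent.get(n)) {
            ancestorsOfA.add(n);
        }
        for (String n = b; n != null; n = parent.get(n)) {
            if (ancestorsOfA.contains(n)) {
                return n;
            }
        }
        return null; // the concepts belong to disconnected trees
    }

    /** sim(a,b) = 2*depth(LCS(a,b)) / (depth(a) + depth(b)). */
    public double similarity(String a, String b) {
        String common = lcs(a, b);
        if (common == null) {
            return 0.0;
        }
        return 2.0 * depth(common) / (depth(a) + depth(b));
    }

    public static void main(String[] args) {
        // Toy taxonomy invented for the example.
        WuPalmerSketch taxonomy = new WuPalmerSketch();
        taxonomy.addEdge("animal", "entity");
        taxonomy.addEdge("plant", "entity");
        taxonomy.addEdge("dog", "animal");
        taxonomy.addEdge("cat", "animal");

        System.out.println("sim(dog, cat)   = " + taxonomy.similarity("dog", "cat"));
        System.out.println("sim(dog, plant) = " + taxonomy.similarity("dog", "plant"));
    }
}
```

In this toy taxonomy, "dog" and "cat" share the subsumer "animal" at depth 2, so their similarity is 2*2 / (3 + 3), whereas "dog" and "plant" only share the root, which yields 2*1 / (3 + 2).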

Licensing information

The HESML library is freely distributed for any non-commercial purpose under a CC BY-NC-SA 4.0 license, subject to citing the main HESML paper [1] as the attribution requirement. On the other hand, the commercial use of the similarity measures introduced in [2], as well as part of the intrinsic IC models introduced in [3] and [4], is protected by a patent application [5]. In addition, any user of HESML must fulfill the other licensing terms described in [1] related to other resources distributed with the library, such as WordNet and a dataset of corpus-based IC models, among others.

References:

[1] Lastra-Díaz, J. J., García-Serrano, A., Batet, M., Fernández, M. and Chirigati, F. (2017). HESML: a scalable ontology-based semantic similarity measures library with a set of reproducible experiments and a replication dataset. Information Systems 66, 97-118. http://dx.doi.org/10.1016/j.is.2017.02.002

[2] Lastra-Díaz, J. J., & García-Serrano, A. (2015). A novel family of IC-based similarity measures with a detailed experimental survey on WordNet. Engineering Applications of Artificial Intelligence Journal, 46, 140-153. http://dx.doi.org/10.1016/j.engappai.2015.09.006

[3] Lastra-Díaz, J. J., & García-Serrano, A. (2015). A new family of information content models with an experimental survey on WordNet. Knowledge-Based Systems, 89, 509-526. http://dx.doi.org/10.1016/j.knosys.2015.08.019

[4] Lastra-Díaz, J. J., & García-Serrano, A. (2016). A refinement of the well-founded Information Content models with a very detailed experimental survey on WordNet. Universidad Nacional de Educación a Distancia (UNED). http://e-spacio.uned.es/fez/view/bibliuned:DptoLSI-ETSI-Informes-Jlastra-refinement

[5] Lastra Díaz, J. J., & García Serrano, A. (2016). System and method for the indexing and retrieval of semantically annotated data using an ontology-based information retrieval model. United States Patent and Trademark Office (USPTO) Application, US2016/0179945 A1.

[6] Cai, Y., Zhang, Q., Lu, W., & Che, X. (2017). A hybrid approach for measuring semantic similarity based on IC-weighted path distance in WordNet. Journal of Intelligent Information Systems, 1–25.

[7] Wu, Z., & Palmer, M. (1994). Verbs Semantics and Lexical Selection. In Proceedings of the 32nd Annual Meeting on Association for Computational Linguistics (pp. 133–138). Stroudsburg, PA, USA: Association for Computational Linguistics.

[8] Camacho-Collados, J., Pilehvar, M. T., & Navigli, R. (2016). Nasari: Integrating explicit knowledge and corpus statistics for a multilingual representation of concepts and entities. Artificial Intelligence, 240, 36–64.

Steps to reproduce the library

HESML V1R3 is distributed as a Java class library (HESML-V1R3.jar) plus a test driver application (HESMLclient.jar), both developed using NetBeans 8.0.2 for Windows, although they have also been compiled and evaluated on Linux-based platforms using the corresponding NetBeans versions.

The HESML-V1R3.jar file is already included in the HESML_Library\HESML\dist folder of the HESML_Release_V1R3.zip distribution file. To compile HESML from its source files, follow these steps:

(1) Download the full distribution of HESML V1R3.

(2) Install Java 8, Java SE Dev Kit 8 and NetBeans 8.0.2 or higher on your workstation.

(3) Launch NetBeans IDE and open the HESML and HESMLclient projects contained in the root folder. NetBeans automatically detects the presence of a nbproject subfolder with the project files.

(4) Select the HESML and HESMLclient projects in turn in the project tree view. Then, invoke the "Clean and Build project (Shift + F11)" command to compile each project.

To stay up to date on new HESML versions, or to ask for technical support, we invite readers to subscribe to the HESML forum by sending an email to the following address:

hesml+subscribe@googlegroups.com

Steps to use the library

You can use the HESMLclient program to run reproducible experiments, or create your own client programs using the HESMLclient code as an example. For more information, including a detailed description of how to run the reproducible experiments and extend the library, we refer the reader to the paper [1] above.

HESML V1R3

02 Oct 20:10

HESML V1R3 Java software library of ontology-based semantic similarity measures and Information Content (IC) models

HESML V1R3 is the third release of the Half-Edge Semantic Measures Library (HESML) [1], which is a new, scalable and efficient Java software library of ontology-based semantic similarity measures and Information Content (IC) models based on WordNet.

HESML V1R3 implements most ontology-based semantic similarity measures and Information Content (IC) models based on WordNet reported in the literature. In addition, it provides an XML-based input file format to specify the execution of reproducible experiments on WordNet-based similarity without writing any code.

HESML is introduced and detailed in a companion reproducibility paper [1] of the methods and experiments introduced in [2,3,4].

The main features of HESML are as follows: (1) it is based on an efficient and linearly scalable representation for taxonomies called PosetHERep introduced in [1], (2) its performance exhibits a linear scalability as regards the size of the taxonomy, and (3) it does not use any caching strategy of vertex sets.

Main novelties in HESML V1R3

HESML V1R3 introduces two minor novelties as follows: (1) the vertex ID has been updated from Integer to Long type in order to support a larger number of vertexes, and (2) it includes five new similarity measures introduced in the papers [6-9] cited below.

Permanent GitHub repository

HESML permanent GitHub repository is available at https://github.com/jjlastra/HESML.git

The initial master code of this GitHub repository matches the HESML V1R2 version available as a Mendeley dataset at http://dx.doi.org/10.17632/t87s78dg78.2.

Licensing information

The HESML library is freely distributed for any non-commercial purpose under a CC BY-NC-SA 4.0 license, subject to citing the main HESML paper [1] as the attribution requirement. On the other hand, the commercial use of the similarity measures introduced in [2], as well as part of the intrinsic IC models introduced in [3] and [4], is protected by a patent application [5]. In addition, any user of HESML must fulfill the other licensing terms described in [1] related to other resources distributed with the library, such as WordNet and a dataset of corpus-based IC models, among others.

References:

[1] Lastra-Díaz, J. J., García-Serrano, A., Batet, M., Fernández, M. and Chirigati, F. (2017). HESML: a scalable ontology-based semantic similarity measures library with a set of reproducible experiments and a replication dataset. Information Systems 66, 97-118. http://dx.doi.org/10.1016/j.is.2017.02.002

[2] Lastra-Díaz, J. J., & García-Serrano, A. (2015). A novel family of IC-based similarity measures with a detailed experimental survey on WordNet. Engineering Applications of Artificial Intelligence Journal, 46, 140-153. http://dx.doi.org/10.1016/j.engappai.2015.09.006

[3] Lastra-Díaz, J. J., & García-Serrano, A. (2015). A new family of information content models with an experimental survey on WordNet. Knowledge-Based Systems, 89, 509-526. http://dx.doi.org/10.1016/j.knosys.2015.08.019

[4] Lastra-Díaz, J. J., & García-Serrano, A. (2016). A refinement of the well-founded Information Content models with a very detailed experimental survey on WordNet. Universidad Nacional de Educación a Distancia (UNED). http://e-spacio.uned.es/fez/view/bibliuned:DptoLSI-ETSI-Informes-Jlastra-refinement

[5] Lastra Díaz, J. J., & García Serrano, A. (2016). System and method for the indexing and retrieval of semantically annotated data using an ontology-based information retrieval model. United States Patent and Trademark Office (USPTO) Application, US2016/0179945 A1.

[6] Hao, D., Zuo, W., Peng, T., & He, F. (2011). An Approach for Calculating Semantic Similarity between Words Using WordNet. In Proc. of the Second International Conference on Digital Manufacturing Automation (pp. 177–180). IEEE.

[7] Liu, X. Y., Zhou, Y. M., & Zheng, R. S. (2007). Measuring Semantic Similarity in Wordnet. In Proc. of the 2007 International Conference on Machine Learning and Cybernetics (Vol. 6, pp. 3431–3435). IEEE.

[8] Pekar, V., & Staab, S. (2002). Taxonomy Learning: Factoring the Structure of a Taxonomy into a Semantic Classification Decision. In Proceedings of the 19th International Conference on Computational Linguistics (Vol. 1, pp. 1–7). Stroudsburg, PA, USA: Association for Computational Linguistics.

[9] Stojanovic, N., Maedche, A., Staab, S., Studer, R., & Sure, Y. (2001). SEAL: A Framework for Developing SEmantic PortALs. In Proceedings of the 1st International Conference on Knowledge Capture (pp. 155–162). New York, NY, USA: ACM.

Steps to reproduce the library

HESML V1R3 is distributed as a Java class library (HESML-V1R3.jar) plus a test driver application (HESMLclient.jar), both developed using NetBeans 8.0.2 for Windows, although they have also been compiled and evaluated on Linux-based platforms using the corresponding NetBeans versions.

The HESML-V1R3.jar file is already included in the HESML_Library\HESML\dist folder of the HESML_Release_V1R3.zip distribution file. To compile HESML from its source files, follow these steps:

(1) Download the full distribution of HESML V1R3.

(2) Install Java 8, Java SE Dev Kit 8 and NetBeans 8.0.2 or higher on your workstation.

(3) Launch NetBeans IDE and open the HESML and HESMLclient projects contained in the root folder. NetBeans automatically detects the presence of a nbproject subfolder with the project files.

(4) Select the HESML and HESMLclient projects in turn in the project tree view. Then, invoke the "Clean and Build project (Shift + F11)" command to compile each project.

To stay up to date on new HESML versions, or to ask for technical support, we invite readers to subscribe to the HESML forum by sending an email to the following address:

hesml+subscribe@googlegroups.com

Steps to use the library

You can use the HESMLclient program to run reproducible experiments, or create your own client programs using the HESMLclient code as an example. For more information, including a detailed description of how to run the reproducible experiments and extend the library, we refer the reader to the paper [1] above.

HESML V1R2

27 Mar 13:54

HESML V1R2 is the second release of the Half-Edge Semantic Measures Library (HESML) [1], which is a new, scalable and efficient Java software library of ontology-based semantic similarity measures and Information Content (IC) models based on WordNet.

HESML V1R2 implements most ontology-based semantic similarity measures and Information Content (IC) models based on WordNet reported in the literature. In addition, it provides an XML-based input file format to specify the execution of reproducible experiments on WordNet-based similarity without writing any code.

The V1R2 release significantly improves the performance of HESML V1R1. HESML is introduced and detailed in a companion reproducibility paper [1] of the methods and experiments introduced in [2,3,4].

The main features of HESML are as follows: (1) it is based on an efficient and linearly scalable representation for taxonomies called PosetHERep introduced in [1], (2) its performance exhibits a linear scalability as regards the size of the taxonomy, and (3) it does not use any caching strategy of vertex sets.

HESML V1R2 is freely distributed for any non-commercial purpose under a CC BY-NC-SA 4.0 license, subject to citing the main HESML paper [1] as the attribution requirement. On the other hand, the commercial use of the similarity measures introduced in [2], as well as part of the intrinsic IC models introduced in [3] and [4], is protected by a patent application [5]. In addition, any user of HESML must fulfill the other licensing terms described in [1] related to other resources distributed with the library, such as WordNet and a dataset of corpus-based IC models, among others.

References:

[1] Lastra-Díaz, J. J., García-Serrano, A., Batet, M., Fernández, M., & Chirigati, F. (2017). HESML: a scalable ontology-based semantic similarity measures library with a set of reproducible experiments and a replication dataset. Information Systems, 66, 97–118.
http://dx.doi.org/10.1016/j.is.2017.02.002

[2] Lastra-Díaz, J. J., & García-Serrano, A. (2015). A novel family of IC-based similarity measures with a detailed experimental survey on WordNet. Engineering Applications of Artificial Intelligence Journal, 46, 140–153.
http://dx.doi.org/10.1016/j.engappai.2015.09.006

[3] Lastra-Díaz, J. J., & García-Serrano, A. (2015). A new family of information content models with an experimental survey on WordNet. Knowledge-Based Systems, 89, 509–526.
http://dx.doi.org/10.1016/j.knosys.2015.08.019

[4] Lastra-Díaz, J. J., & García-Serrano, A. (2016). A refinement of the well-founded Information Content models with a very detailed experimental survey on WordNet. Universidad Nacional de Educación a Distancia (UNED).
http://e-spacio.uned.es/fez/view/bibliuned:DptoLSI-ETSI-Informes-Jlastra-refinement

[5] Lastra Díaz, J. J., & García Serrano, A. (2016). System and method for the indexing and retrieval of semantically annotated data using an ontology-based information retrieval model. United States Patent and Trademark Office (USPTO) Application, US2016/0179945 A1.