Jivnesh/EvalSan

Official code for the paper "Evaluating Neural Word Embeddings for Sanskrit".

EvalSan: Evaluation Toolkit for Sanskrit Embeddings

SanEval is a toolkit for evaluating the quality of Sanskrit embeddings. We assess their generalization power by using them as features on a broad and diverse set of tasks. We include a suite of 4 intrinsic tasks that evaluate which linguistic properties are encoded in word embeddings. Our goal is to ease the study and development of general-purpose fixed-size word representations for Sanskrit.

Dependencies

This code is written in Python. The dependencies are:

  • Python 3.6

Install the Python packages listed in requirements.txt:

pip install -r requirements.txt

Evaluation tasks

Intrinsic tasks

  • SanEval includes a series of intrinsic tasks that evaluate which linguistic properties are encoded in your word embeddings.
  • We use the SLP1 transliteration scheme for our data. You can change it to another scheme using this code.
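| Task                      | Metric   | #dev | #test |
|---------------------------|----------|------|-------|
| Relatedness               | F-score  | 4.5k | 9k    |
| Similarity                | Accuracy | n/a  | 3k    |
| Categorization, syntactic | Purity   | n/a  | 1.1k  |
| Categorization, semantic  | Purity   | n/a  | 150   |
| Analogy, syntactic        | Accuracy | n/a  | 10k   |
| Analogy, semantic         | Accuracy | n/a  | 6.4k  |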
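For readers unfamiliar with SLP1, the sketch below shows what a scheme conversion involves. It is not the conversion code referred to above. The function name `slp1_to_iast` and the choice of IAST as the target are assumptions made for illustration. SLP1 uses one ASCII character per sound, so a character-by-character lookup is enough in this direction.

```python
# Illustrative SLP1 -> IAST conversion (hypothetical helper, not part of SanEval).
# Only characters that differ between the two schemes are listed;
# everything else (a, i, u, k, g, t, ...) is passed through unchanged.
SLP1_TO_IAST = {
    "A": "ā", "I": "ī", "U": "ū",
    "f": "ṛ", "F": "ṝ", "x": "ḷ", "X": "ḹ",
    "E": "ai", "O": "au",
    "M": "ṃ", "H": "ḥ",
    "K": "kh", "G": "gh", "N": "ṅ",
    "C": "ch", "J": "jh", "Y": "ñ",
    "w": "ṭ", "W": "ṭh", "q": "ḍ", "Q": "ḍh", "R": "ṇ",
    "T": "th", "D": "dh",
    "P": "ph", "B": "bh",
    "S": "ś", "z": "ṣ",
}


def slp1_to_iast(text: str) -> str:
    """Convert an SLP1 string to IAST, leaving unknown characters as they are."""
    return "".join(SLP1_TO_IAST.get(ch, ch) for ch in text)


if __name__ == "__main__":
    for word in ("saMskftam", "kfzRa", "rAmaH"):
        print(word, "->", slp1_to_iast(word))
```

The reverse direction (IAST to SLP1) needs longest-match handling for digraphs such as "kh" and "ai", so a dedicated transliteration library is usually the safer choice there.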

Pretrained models

  • You can download the pretrained models from this link. A README.md is provided for each model.
  • Place the models folder in the parent directory.
  • Pretrained vectors can be downloaded from this link. Place this folder under EvalSan/evaluations/Intrinsic/. These vectors are used by the evaluation script.
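As a quick sanity check after downloading, the vectors can be inspected with a few lines of Python. This is a minimal sketch that assumes the files use the common word2vec text layout (an optional "count dimension" header, then one "word v1 v2 ..." line per word). The actual file format is described in each model's README.md and may differ. The helper names and the toy values are hypothetical.

```python
import io
import math


def load_text_vectors(handle):
    """Parse whitespace-separated 'word v1 v2 ...' lines into a dict of float lists.

    Lines with fewer than three fields are skipped, which drops blank lines
    and a 'count dimension' header (and assumes vectors have >= 2 dimensions).
    """
    vectors = {}
    for line in handle:
        parts = line.split()
        if len(parts) < 3:
            continue
        vectors[parts[0]] = [float(x) for x in parts[1:]]
    return vectors


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


if __name__ == "__main__":
    # Toy in-memory file standing in for a downloaded vector file (made-up values).
    toy_file = io.StringIO("3 2\nrAma 1.0 0.0\nsItA 2.0 0.0\nvana 0.0 1.0\n")
    vecs = load_text_vectors(toy_file)
    print(sorted(vecs))
    print(cosine(vecs["rAma"], vecs["sItA"]))
    print(cosine(vecs["rAma"], vecs["vana"]))
```

For a real file, replace the `io.StringIO` object with `open(path, encoding="utf-8")`.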

How to train the models

Please refer to the models folder for more details.

bash train_embeddings.sh

How to run evaluation

To evaluate your word embeddings, run the following command:

bash run_SanEval.sh
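To help interpret the Purity scores reported for the Categorization tasks, here is a short sketch of the standard clustering purity measure: each cluster is credited with its most frequent gold category, and the credited items are divided by the total number of items. Whether the evaluation script computes purity in exactly this way is an assumption. The function name and the toy labels are illustrative only.

```python
from collections import Counter, defaultdict


def purity(cluster_ids, gold_labels):
    """Standard clustering purity: share of items that belong to the
    majority gold category of their assigned cluster."""
    if len(cluster_ids) != len(gold_labels):
        raise ValueError("cluster_ids and gold_labels must have the same length")
    if not gold_labels:
        return 0.0
    # Count gold categories inside each predicted cluster.
    per_cluster = defaultdict(Counter)
    for cluster, label in zip(cluster_ids, gold_labels):
        per_cluster[cluster][label] += 1
    # Credit each cluster with the size of its largest gold category.
    majority_total = sum(max(counts.values()) for counts in per_cluster.values())
    return majority_total / len(gold_labels)


if __name__ == "__main__":
    # Toy example: six words, two predicted clusters, two gold categories.
    clusters = [0, 0, 0, 1, 1, 1]
    gold = ["noun", "noun", "verb", "verb", "verb", "noun"]
    print(purity(clusters, gold))
```

A purity of 1.0 means every cluster contains words from a single gold category. Purity grows as the number of clusters increases, so scores are only comparable at a fixed cluster count.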

Citation

If you use our tool, we would appreciate it if you cite the following paper:

@inproceedings{sandhan-etal-2023-evaluating,
    title = "Evaluating Neural Word Embeddings for {S}anskrit",
    author = "Sandhan, Jivnesh  and
      Paranjay, Om Adideva  and
      Digumarthi, Komal  and
      Behra, Laxmidhar  and
      Goyal, Pawan",
    booktitle = "Proceedings of the Computational {S}anskrit {\&} Digital Humanities: Selected papers presented at the 18th World {S}anskrit Conference",
    month = jan,
    year = "2023",
    address = "Canberra, Australia (Online mode)",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2023.wsc-csdh.2",
    pages = "21--37",
}

License

This project is licensed under the terms of the Apache License 2.0.
