An NLP system for detecting syntactic neologisms in the Latvian language.

pavelsivanovs/lv-neologism-detector

🇱🇻 Automatic Neologism Detector

Automatic neologism detector for the Latvian language. Author: Pavels Ivanovs

Description

This is Pavels Ivanovs' project for the bachelor's thesis Automatic Neologism Detection [1].

The goal of this project is to create an NLP tool that extracts from a submitted text the words that are most likely to be included in the vocabulary of the Latvian language, specifically Tēzaurs.lv, the largest publicly available thesaurus of the Latvian language.

Methodology

Two main approaches are used to achieve the goal of the project:

  1. Exclusion lists. Words from the input text are filtered out if their lemmas are found in the vocabulary. Lemmatization is provided by LVTagger and NLP-PIPE.
  2. Classification by a machine-learning model. A neural network performs the classification. Input features, such as word length and Levenshtein distance to the closest vocabulary entry, are extracted from each word and fed to the network, which outputs the probability that the word will be included in the vocabulary.
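
The two steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the project's code: `lemmatize` is a hypothetical stand-in for the lemmatization that LVTagger and NLP-PIPE provide, the vocabulary is a toy set rather than Tēzaurs.lv, and the function names `filter_candidates` and `extract_features` are invented for this example. Only the two features named above (word length and Levenshtein distance to the closest vocabulary entry) are computed.

```python
from typing import Callable, Iterable


def levenshtein(a: str, b: str) -> int:
    """Edit distance via dynamic programming (insert, delete, substitute)."""
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # delete ch_a
                current[j - 1] + 1,      # insert ch_b
                previous[j - 1] + cost,  # substitute (or match)
            ))
        previous = current
    return previous[-1]


def filter_candidates(
    words: Iterable[str],
    lemmatize: Callable[[str], str],
    vocabulary: set[str],
) -> list[str]:
    """Step 1: drop every word whose lemma is already in the vocabulary."""
    return [w for w in words if lemmatize(w) not in vocabulary]


def extract_features(word: str, vocabulary: set[str]) -> list[float]:
    """Step 2: [word length, distance to the closest vocabulary entry]."""
    closest = min(levenshtein(word, entry) for entry in vocabulary)
    return [float(len(word)), float(closest)]


# Toy data (hypothetical): a three-word vocabulary and a lookup-table lemmatizer.
vocabulary = {"māja", "suns", "grāmata"}
toy_lemmas = {"mājas": "māja", "suņi": "suns"}


def lemmatize(word: str) -> str:
    return toy_lemmas.get(word, word)


candidates = filter_candidates(["mājas", "suņi", "gūglēt"], lemmatize, vocabulary)
features = {word: extract_features(word, vocabulary) for word in candidates}
print(candidates, features)
```

In this sketch the feature vectors would then be passed to the classifier; the network itself is omitted because its architecture is not described here.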

Results

After training, the model performs as follows (x-axis: batch number; y-axis: metric):

Testing metrics of the model

  • Accuracy (Pareizība): 77.86%
  • Precision (Precizitāte): 40.56%
  • Recall (Pārklājums): 61.73%
  • F-score (F-mērs): 46.80%
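
For readers unfamiliar with these metrics, the sketch below shows their standard definitions in terms of confusion-matrix counts. It assumes the F-score is the F1 variant (harmonic mean of precision and recall), which the list above does not specify. The counts passed in are hypothetical and are not the project's test results.

```python
def classification_metrics(tp: int, fp: int, tn: int, fn: int) -> dict[str, float]:
    """Standard binary-classification metrics from confusion-matrix counts."""
    total = tp + fp + tn + fn
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    # F1: harmonic mean of precision and recall (assumed variant).
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {
        "accuracy": (tp + tn) / total if total else 0.0,
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


# Hypothetical counts, chosen only to exercise the function.
metrics = classification_metrics(tp=50, fp=50, tn=380, fn=20)
print(metrics)
```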

The test metrics show that the model's performance can still be improved. The two main options are dataset optimization (oversampling and increasing the overall number of records) and model optimization, including changes to the neural network structure and further experiments with the number of epochs and the learning rate.
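
As an illustration of the oversampling option, the sketch below performs simple random oversampling: records of the under-represented class are duplicated at random until every class has as many records as the largest one. Which oversampling method would suit this project is not stated, so this choice is an assumption, and the function name `oversample_minority` and the toy data are hypothetical.

```python
import random


def oversample_minority(records: list, labels: list, seed: int = 0) -> tuple[list, list]:
    """Randomly duplicate records so that all classes reach the same size."""
    rng = random.Random(seed)

    # Group records by class label.
    by_label: dict = {}
    for record, label in zip(records, labels):
        by_label.setdefault(label, []).append(record)

    target = max(len(group) for group in by_label.values())

    balanced = []
    for label, group in by_label.items():
        # Draw duplicates with replacement to fill the gap to the largest class.
        extra = [rng.choice(group) for _ in range(target - len(group))]
        balanced.extend((record, label) for record in group + extra)

    rng.shuffle(balanced)
    return [r for r, _ in balanced], [l for _, l in balanced]


# Toy data (hypothetical): label 1 = neologism, label 0 = not a neologism.
words = ["a", "b", "c", "d", "e", "f"]
labels = [0, 0, 0, 0, 1, 1]
new_words, new_labels = oversample_minority(words, labels)
print(new_labels.count(0), new_labels.count(1))
```

Oversampling of this kind should be applied to the training split only, so that duplicated records do not leak into the test data.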

Requirements

  • Python v3.10
  • Docker Compose

References

[1] P. Ivanovs, "Jaunvārdu automātiska atpazīšana," Bakalaura darbs, Datorikas fakultāte, Latvijas Universitāte, Rīga, Latvija, 2023.
