-
Notifications
You must be signed in to change notification settings - Fork 0
Repository for FlexiTerm: a software tool to automatically recognise multi-word terms in text documents.
License
ispasic/FlexiTerm-Python
This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository.
Folders and files
Name | Name | Last commit message | Last commit date | |
---|---|---|---|---|
Repository files navigation
FlexiTerm: a software tool to automatically recognise multi-word terms in text documents. FlexiTerm takes as input a corpus of ASCII documents and outputs a ranked list of automatically recognised multi-word terms. If you use FlexiTerm in your work/research, please cite the following papers: [1] Spasic I, Greenwood M, Preece A, Francis N, & Elwyn, G. (2013) FlexiTerm: a flexible term recognition method. Journal of Biomedical Semantics, 4(1), 27. (https://jbiomedsem.biomedcentral.com/articles/10.1186/2041-1480-4-27) [2] Spasic I. (2018) Acronyms as an integral part of multi-word term recognition - A token of appreciation. IEEE Access, 6, pp. 8351-8363 (https://ieeexplore.ieee.org/document/8293774) [3] Spasic I. (2021) FlexiTerm: A more efficient implementation of flexible multi-word term recognition. arXiv:2110.06981 [cs.CL] (https://arxiv.org/abs/2110.06981) For more information, please visit the FlexiTerm web site: http://users.cs.cf.ac.uk/I.Spasic/flexiterm/ Python requirements: Python version "3.7.4" Dependencies: jellyfish version "0.8.2" nltk version "3.4.5" numpy version "1.20.3" spacy version "3.0.6" Folders: config : System configuration files. out : Output files. text : Input files (plain text only). Files: flexiterm.py : The main python file. flexiterm.ipynb : Jupyter notebook version of flexiterm.py. flexiterm.sqlite : An sqlite database used by flexiterm.py. out/terminology.csv : A table of results: id | variant | c | f | df | c_idf out/terminology.html : A table of results: Term ID | Termhood | Term variant | Term variant frequency out/concordances.html : Concordances of terms listed in terminology.html. out/corpus.html : Input text annotated with occurrences of terms listed in terminology.html. out/annotations.json : Annotations of term occurrences in the input files using the spaCy format for training data: https://spacy.io/usage/training#training-data They can be used for visualisation or downstream processing by other applications. config/settings.txt : Specifies: * pattern : term formation pattern(s) * stoplist : the location of the stoplist * Smin : Jaro-Winkler similarity threshold * Amin : minimum (implicit) acronym frequency * Fmin : minimum term candidate frequency * Cmin : minimum C-value * acronyms : acronym recognition mode (implicit or explicit) Default settings: * pattern : "(((((NN|JJ) )*NN) IN (((NN|JJ) )*NN))|((NN|JJ )*NN POS (NN|JJ )*NN))|(((NN|JJ) )+NN( CD)?)" * stoplist : ./config/stoplist.txt * Smin : 0.962 * Amin : 5 * Fmin : 2 * Cmin : 1 * acronyms : explicit config/stoplist.txt : A list of stopwords. config/schema.sql : A schema of the database stored in flexiterm.sqlite. FlexiTerm takes as input a corpus of ASCII documents and outputs a ranked list of automatically recognised multi-word terms. To run FlexiTerm: 1. Place input files (plain text only) into a folder named "text". 2. OPTIONAL: Replace file config/stoplist.txt with your own if needed. 3. Execute flexiterm.py from the command line: python flexiterm.py OR run the following Jupyter notebook: flexiterm.ipynb 4. Check the results by double-clicking out/terminology.html from which you can navigate to out/concordances.html and then to out/corpus.html.
About
Repository for FlexiTerm: a software tool to automatically recognise multi-word terms in text documents.
Topics
Resources
License
Stars
Watchers
Forks
Releases
No releases published