GitHub - ispasic/FlexiTerm-Python: Repository for FlexiTerm: a software tool to automatically recognise multi-word terms in text documents.

FlexiTerm: a software tool to automatically recognise multi-word terms in text documents.

FlexiTerm takes as input a corpus of ASCII documents and outputs a ranked list of automatically recognised multi-word terms.

If you use FlexiTerm in your work/research, please cite the following papers:

[1] Spasic I, Greenwood M, Preece A, Francis N, Elwyn G. (2013) FlexiTerm: a flexible term recognition method. Journal of Biomedical Semantics, 4(1), 27. (https://jbiomedsem.biomedcentral.com/articles/10.1186/2041-1480-4-27)
[2] Spasic I. (2018) Acronyms as an integral part of multi-word term recognition - A token of appreciation. IEEE Access, 6, pp. 8351-8363 (https://ieeexplore.ieee.org/document/8293774)
[3] Spasic I. (2021) FlexiTerm: A more efficient implementation of flexible multi-word term recognition. arXiv:2110.06981 [cs.CL] (https://arxiv.org/abs/2110.06981)

For more information, please visit the FlexiTerm web site: http://users.cs.cf.ac.uk/I.Spasic/flexiterm/

Python requirements:

Python 		version "3.7.4"

Dependencies:

jellyfish 	version "0.8.2"
nltk 		version "3.4.5"
numpy 		version "1.20.3"
spacy 		version "3.0.6"
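
The pinned versions above can be compared against the current environment with a short script. This is a minimal sketch and not part of FlexiTerm: the names REQUIRED, installed_version and version_mismatches are hypothetical, and it assumes that an exact version match is what you want to check (other versions may well work).

```python
# Pinned versions copied from the dependency list in this README.
REQUIRED = {
    "jellyfish": "0.8.2",
    "nltk": "3.4.5",
    "numpy": "1.20.3",
    "spacy": "3.0.6",
}


def installed_version(package):
    """Return the installed version of a package, or None if it is missing."""
    try:
        from importlib.metadata import version, PackageNotFoundError  # Python 3.8+
    except ImportError:
        # Fallback for Python 3.7; pkg_resources ships with setuptools.
        import pkg_resources
        try:
            return pkg_resources.get_distribution(package).version
        except pkg_resources.DistributionNotFound:
            return None
    try:
        return version(package)
    except PackageNotFoundError:
        return None


def version_mismatches(required=REQUIRED, lookup=installed_version):
    """Map each package whose version differs to a (wanted, found) pair."""
    mismatches = {}
    for package, wanted in required.items():
        found = lookup(package)
        if found != wanted:
            mismatches[package] = (wanted, found)
    return mismatches


if __name__ == "__main__":
    for package, (wanted, found) in version_mismatches().items():
        print(f"{package}: README lists {wanted}, found {found or 'not installed'}")
```

The lookup argument exists only so that the comparison can be exercised without touching the real environment.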

Folders:

config : System configuration files.
out    : Output files.
text   : Input files (plain text only).

Files:

flexiterm.py          : The main Python file.
flexiterm.ipynb       : Jupyter notebook version of flexiterm.py.
flexiterm.sqlite      : An SQLite database used by flexiterm.py.
out/terminology.csv   : A table of results: id | variant | c | f | df | c_idf
out/terminology.html  : A table of results: Term ID | Termhood | Term variant | Term variant frequency
out/concordances.html : Concordances of terms listed in terminology.html.
out/corpus.html       : Input text annotated with occurrences of terms listed in terminology.html.
out/annotations.json  : Annotations of term occurrences in the input files using the spaCy format for 
                        training data: https://spacy.io/usage/training#training-data
                        They can be used for visualisation or downstream processing by other applications.
config/settings.txt   : Specifies:
                        * pattern  : term formation pattern(s)
                        * stoplist : the location of the stoplist
                        * Smin     : Jaro-Winkler similarity threshold
                        * Amin     : minimum (implicit) acronym frequency
                        * Fmin     : minimum term candidate frequency
                        * Cmin     : minimum C-value
                        * acronyms : acronym recognition mode (implicit or explicit)

                        Default settings:
                        * pattern  : "(((((NN|JJ) )*NN) IN (((NN|JJ) )*NN))|((NN|JJ )*NN POS (NN|JJ )*NN))|(((NN|JJ) )+NN( CD)?)"
                        * stoplist : ./config/stoplist.txt
                        * Smin     : 0.962
                        * Amin     : 5
                        * Fmin     : 2
                        * Cmin     : 1
                        * acronyms : explicit
config/stoplist.txt   : A list of stopwords.
config/schema.sql     : A schema of the database stored in flexiterm.sqlite.
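
The default pattern in config/settings.txt is a regular expression over part-of-speech tags (Penn Treebank tags: NN noun, JJ adjective, IN preposition, POS possessive ending, CD cardinal number). The sketch below illustrates what such a pattern selects. It assumes that the tags of a sentence are joined with single spaces and that the pattern is applied to the resulting string. How flexiterm.py itself normalises tags (for example NNS to NN) and applies the pattern is not described in this README, so treat this as an illustration only. The function name candidate_spans and the example sentences are hypothetical.

```python
import re

# Default term formation pattern, copied from the settings listed above.
PATTERN = (
    r"(((((NN|JJ) )*NN) IN (((NN|JJ) )*NN))|((NN|JJ )*NN POS (NN|JJ )*NN))"
    r"|(((NN|JJ) )+NN( CD)?)"
)


def candidate_spans(tagged, pattern=PATTERN):
    """Return the token sequences whose tags match the pattern.

    tagged is a list of (token, tag) pairs for one sentence.
    """
    tags = [tag for _, tag in tagged]
    tag_string = " ".join(tags)

    # Record where each tag starts and ends inside tag_string so that
    # regex matches can be mapped back to token positions.
    start_index, end_index = {}, {}
    offset = 0
    for i, tag in enumerate(tags):
        start_index[offset] = i
        end_index[offset + len(tag)] = i
        offset += len(tag) + 1  # +1 for the separating space

    spans = []
    for match in re.finditer(pattern, tag_string):
        # Skip matches that begin or end in the middle of a tag,
        # e.g. "NN" matched inside "NNS".
        if match.start() in start_index and match.end() in end_index:
            first = start_index[match.start()]
            last = end_index[match.end()]
            spans.append(" ".join(token for token, _ in tagged[first:last + 1]))
    return spans


sentence = [
    ("the", "DT"), ("posterior", "JJ"), ("cruciate", "JJ"), ("ligament", "NN"),
    ("of", "IN"), ("the", "DT"), ("knee", "NN"), ("joint", "NN"),
]
print(candidate_spans(sentence))
```

For the example sentence, the pattern selects "posterior cruciate ligament" and "knee joint". The determiner between "of" and "knee" prevents the longer preposition-based alternative from matching.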
                        
                        

To run FlexiTerm:

1. Place input files (plain text only) into a folder named "text".

2. OPTIONAL: Replace file config/stoplist.txt with your own if needed.

3. Execute flexiterm.py from the command line: python flexiterm.py
   OR run the following Jupyter notebook: flexiterm.ipynb

4. Check the results by opening out/terminology.html in a web browser, from 
   which you can navigate to out/concordances.html and then to out/corpus.html.
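
Besides the HTML pages, out/terminology.csv can be read programmatically for downstream processing. The following is a minimal sketch under several assumptions that this README does not confirm: that the file is comma-delimited, that it starts with a header row using the column names id, variant, c, f, df and c_idf, that c is the score by which terms are ranked, and that f is the frequency of a variant. The function name load_terms is hypothetical, and the sample rows are invented purely to show the shape of the data.

```python
import csv
import io
from collections import defaultdict


def load_terms(handle, delimiter=","):
    """Group term variants by term id and rank terms by their c score."""
    terms = defaultdict(lambda: {"c": 0.0, "variants": []})
    for row in csv.DictReader(handle, delimiter=delimiter):
        term = terms[row["id"]]
        term["c"] = float(row["c"])  # assumed to be identical for all variants of a term
        term["variants"].append((row["variant"], int(row["f"])))
    # Highest-scoring terms first.
    return sorted(terms.items(), key=lambda item: item[1]["c"], reverse=True)


# Invented sample standing in for out/terminology.csv (assumed layout).
SAMPLE = (
    "id,variant,c,f,df,c_idf\n"
    "1,knee joint,2.0,3,2,1.5\n"
    "2,anterior cruciate ligament,6.3,4,3,4.0\n"
    "2,ACL,6.3,5,3,4.0\n"
)

ranked = load_terms(io.StringIO(SAMPLE))
for term_id, term in ranked:
    print(term_id, term["c"], term["variants"])
```

To read the real file, pass an open file handle instead, for example open("out/terminology.csv", newline=""), and adjust the delimiter if the actual file uses a different one.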