AutoLexi

Automated extraction of specialised lexicon from text documents

AutoLexi is meant as a quick-and-dirty vocabulary-building tool for translation and similar tasks.

How it works

AutoLexi loads a plain-text file (.txt or similar) or a webpage from its URL, and compares the frequency of each word in it with that word's frequency in average English.

It then displays the words that are more common in the input text than in English. These words form the specialised lexicon of the text under analysis.

Options:

minoccurrences sets the minimum number of times a word must occur in the input text to be considered. For example, to include only words that appear at least twice in your text, set minoccurrences=2.
nshown sets the maximum number of specialised words to display. Note that the output is sorted by the ratio of a word's frequency in English to its frequency in the text, so raising nshown brings in progressively less specialised words.
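To make the effect of the two options concrete, here is a hedged sketch of how they could be applied. It is not taken from get_lexicon.py: the function name `specialised_lexicon`, the tokenizer, the `english_freq` dictionary and the `floor` fallback for unknown words are assumptions, and the ordering assumes the smallest English-to-text ratio (the most specialised word) comes first.

```python
import re
from collections import Counter


def specialised_lexicon(text, english_freq, minoccurrences=2, nshown=40,
                        floor=1e-9):
    """Return up to `nshown` words, most specialised first."""
    # Assumed tokenizer: lowercase runs of ASCII letters.
    words = re.findall(r"[a-z]+", text.lower())
    counts = Counter(words)
    total = sum(counts.values())

    ratios = {}
    for word, n in counts.items():
        if n < minoccurrences:
            continue  # too rare in the input text to be considered
        # Ratio of English frequency to frequency in the text;
        # words missing from the reference get the assumed `floor`.
        ratio = english_freq.get(word, floor) / (n / total)
        if ratio < 1:  # keep only words more common in the text
            ratios[word] = ratio

    # Smallest ratio first; ties are broken alphabetically.
    ranked = sorted(ratios, key=lambda w: (ratios[w], w))
    return ranked[:nshown]
```

Raising minoccurrences removes one-off words such as names or typos, while raising nshown simply extends the list further down the ranking.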

Example

Running AutoLexi with minoccurrences=3 and nshown=40 on a World Bank report about mental health interventions in Ukraine gives:

zaporizhia tintle onehealth bromet nonspecialized nonspecialists narcology narcologists narcologist narcological mhpss lekhan kostyuchenko gluzman mhgap pobratim cmds polyclinics oblasts poltava raion pinchuk yll ingos dalys psychotherapists lviv ncds ucu noncommunicable wbg idps informants polyclinic giz ceta kyiv kharkiv dispensary yld

