Skip to content

Automated extraction of specialised lexicon from text documents

License

Notifications You must be signed in to change notification settings

martinosorb/AutoLexi

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

7 Commits
 
 
 
 
 
 
 
 
 
 

Repository files navigation

AutoLexi

Automated extraction of specialised lexicon from text documents

AutoLexi is meant as a quick-and-dirty vocabulary-building tool for translation and similar tasks.

How it works

AutoLexi loads a pure text file (.txt or analogous), or a webpage URL, and compares the frequency of all the words contained in it with their frequencies in average English.

Then, it displays the ones that are more common in the input text compared to English, which form the specialised lexicon of the text under analysis.

Options:

  • minoccurrences enforces a threshold on the number of occurrences of the word in the input text. You may want to include words that appear at least twice in your text, in which case, set minoccurrences=2, and so on.
  • nshown sets the maximum number of specialised words to display. Note that the output is sorted based on the ratio of frequency of occurrence in english / frequency in the text. Showing more words is equivalent to including less and less specialised words.

Example

For example, running AutoLexi with minoccurrences=3 and nshown=40 on a World Bank report about mental health interventions in Ukraine gives:

zaporizhia tintle onehealth bromet nonspecialized nonspecialists narcology narcologists narcologist narcological mhpss lekhan kostyuchenko gluzman mhgap pobratim cmds polyclinics oblasts poltava raion pinchuk yll ingos dalys psychotherapists lviv ncds ucu noncommunicable wbg idps informants polyclinic giz ceta kyiv kharkiv dispensary yld

About

Automated extraction of specialised lexicon from text documents

Topics

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages