Keywords.py is a Python script that generates keywords from a text. It was developed during the [SPINDLE](http://openspires.oucs.ox.ac.uk/spindle/) project ([blog](http://blogs.oucs.ox.ac.uk/openspires/category/spindle/)), where it was used to [generate keywords from automatic transcriptions](http://blogs.oucs.ox.ac.uk/openspires/2012/09/12/spindle-automatic-keyword-generation-step-by-step/).
How to use it
python keywords.py text.txt
>>> from keywords import keywords_and_ngrams
The keywords.py script expects the text file to be in plain text format. If you have a transcription in another format, XMP for example, you should convert it to plain text before using keywords.py.
A list containing two lists of tuples. The first list holds (keyword, log-likelihood value) tuples. The second list holds (bigram, number of occurrences) tuples.
    keyword-0 ll-0
    keyword-1 ll-1
    keyword-2 ll-2

    bigram-0 n-occurrences-bigram-0
    bigram-1 n-occurrences-bigram-1
    bigram-2 n-occurrences-bigram-2
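As a minimal sketch of handling the module's return value described above, the following uses a hand-built `result` list in the documented shape instead of calling `keywords_and_ngrams`, whose arguments are not documented here. The sample values are copied from the example output in this README. Representing each bigram as a single string, and the variable names, are assumptions.

```python
# Hand-built stand-in for the value returned by keywords_and_ngrams:
# [list of (keyword, log-likelihood), list of (bigram, occurrences)].
# Storing each bigram as one string is an assumption.
result = [
    [("banks", 141.12175627), ("crisis", 73.3976004078)],
    [("interest rates", 5), ("financial crisis", 4)],
]

# Unpack the two inner lists.
keywords, bigrams = result

for word, ll in keywords:
    print("%s: %s" % (word, ll))

for bigram, count in bigrams:
    print("%s: %d" % (bigram, count))
```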
The repository includes the automatic transcription of the podcast [Global Recession: How Did it Happen?](http://podcasts.ox.ac.uk/global-recession-how-did-it-happen-audio) (Correct Words = 32.9%). We deliberately chose a poor automatic transcription to show that relevant keywords and bigrams can still be extracted automatically when the proportion of correctly transcribed words is low.
python keywords.py 01_crunch-medium-audio_automatic_transcription.txt
Keywords Generated (word: Log-likelihood)
- banks: 141.12175627
- crisis: 73.3976004078
- companies: 67.8498685789
- assets: 61.8910800051
- haiti: 47.7956942776
- interest: 41.3390170289
- credit: 39.6149918395
- crunch: 35.9334074944
- senate: 32.4501608202
- profited: 30.625124757
- sitcom: 30.625124757
- ansa: 30.625124757
- nineteen: 29.0864140753
- economy: 28.6440250819
- nineties: 27.5138518651
- haitian: 26.8069860979
- sanctioning: 26.8069860979
- center: 26.8069860979
- regulate: 25.4923775621
- hashing: 25.0818400138
- haitians: 25.0818400138
- stimulus: 24.5089608603
- united: 24.1102094531
- successful: 21.8091735308
- financial: 21.7481087661
- key: 21.6791751296
- caught: 21.1648006228
- eases: 21.0970376283
- bankruptcy: 21.0970376283
- rates: 21.0105869453
- kind: 20.8040324729
- cited: 20.6246470912
- backs: 19.9877139071
- borrowing: 19.9877139071
- crimes: 19.5817617075
- countries: 19.5490491082
- essentially: 19.334521352
- fiscal: 19.1532240523
Collocations Generated (collocation: #occurrences)
- interest rates: 5
- financial crisis: 4
- all street: 3
- nineteen nineties: 3
- credit crunch: 3
- british government: 3
Word Cloud (using Wordle)
- nKeywords: number of keywords generated by the script (default 100)
- thresholdLL: log-likelihood value threshold (default 19)
- nBigrams: number of bigrams generated by the script (default 25)
- thresholdBigrams: minimum number of occurrences of a bigram (default 2)
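One plausible reading of these four parameters is sketched below: keep the items whose score reaches the threshold, best first, up to the requested number. The helper `apply_limits` is hypothetical and is not taken from keywords.py. Whether the script treats its thresholds as inclusive is also an assumption, and the low-scoring sample entries are made up for illustration.

```python
def apply_limits(scored, threshold, limit):
    """Keep items scoring at least `threshold`, best first, at most `limit`."""
    kept = [(item, score) for item, score in scored if score >= threshold]
    kept.sort(key=lambda pair: pair[1], reverse=True)
    return kept[:limit]

# Illustrative candidates; the last entry of each list is invented
# to show an item being filtered out.
candidate_keywords = [("banks", 141.12), ("fiscal", 19.15), ("the", 3.2)]
candidate_bigrams = [("interest rates", 5), ("credit crunch", 3), ("one off", 1)]

# Defaults from the parameter list: thresholdLL=19, nKeywords=100,
# thresholdBigrams=2, nBigrams=25.
keywords = apply_limits(candidate_keywords, threshold=19, limit=100)
bigrams = apply_limits(candidate_bigrams, threshold=2, limit=25)

print(keywords)
print(bigrams)
```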
Keywords.py expects bnc.py to be in the same directory. The bnc.py script contains a list of word frequencies obtained from the spoken part of the British National Corpus (BNC) and the total number of words of the spoken part of the BNC.
You could use the script with any other word frequencies obtained from a different corpus.
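To illustrate how reference-corpus frequencies feed into a keyword score, here is a sketch of the log-likelihood statistic commonly used to compare a word's frequency in a text against a reference corpus. It is an assumption, not something this README confirms, that keywords.py computes an equivalent value. The function name and all counts below are made up for illustration and are not the contents of bnc.py. The code assumes Python 3 division.

```python
import math

def log_likelihood(freq_text, size_text, freq_ref, size_ref):
    """Log-likelihood of a word's frequency in a text versus a reference corpus."""
    total = freq_text + freq_ref
    # Expected frequencies if the word were equally common in both corpora.
    expected_text = size_text * total / (size_text + size_ref)
    expected_ref = size_ref * total / (size_text + size_ref)
    ll = 0.0
    if freq_text > 0:
        ll += freq_text * math.log(freq_text / expected_text)
    if freq_ref > 0:
        ll += freq_ref * math.log(freq_ref / expected_ref)
    return 2 * ll

# Made-up counts: a word seen 20 times in a 5,000-word text and
# 300 times in a 10,000,000-word reference corpus.
print(log_likelihood(20, 5000, 300, 10000000))

# A word with the same relative frequency in both corpora scores 0.
print(log_likelihood(5, 1000, 50, 10000))
```

Swapping in a different reference corpus only changes the `freq_ref` and `size_ref` inputs, which is why any other word frequency list can be used.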
Keywords.py expects stopwords.py to be in the same directory. The stopwords.py script contains a list of stopwords (words that are too common to be considered keywords). You can edit the list to add or remove words.
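A stopword list is typically applied as a simple membership filter before scoring. The sketch below uses a small inline set as a stand-in for the list in stopwords.py, whose variable name and contents are not shown here. Lowercasing and splitting on whitespace are also assumptions about tokenization.

```python
from collections import Counter

# Stand-in for the list in stopwords.py (name and contents assumed).
stopwords = {"the", "of", "a", "and", "in"}

text = "the credit crunch and the rise of interest rates in a crisis"
tokens = text.lower().split()

# Drop stopwords, then count the remaining candidate words.
candidates = [token for token in tokens if token not in stopwords]
counts = Counter(candidates)

print(candidates)
```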
Please check the following blog post: [SPINDLE Automatic Keyword Generation: Step by Step](http://blogs.oucs.ox.ac.uk/openspires/2012/09/12/spindle-automatic-keyword-generation-step-by-step/).
The keywords.py script can be used as a stand-alone application or imported as a module into your Python code. It can also be used within a web framework.
All files in the directory are licensed under an MIT License, with the exception of 01_crunch-medium-audio_automatic_transcription.txt, which is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 2.0 UK: England & Wales License.
#spindle #openspires #ukoer #oerri