Topic modelling of a large psychology corpus with LDA

ottmartens/psy-topic-models

Psychology topic models

Requires: Docker, Node.js, Python

1. Parsing and storing in the database

in pubmed-to-db:

1.1 Install dependencies: npm install
1.2 Start database: docker-compose up
1.3 Create database table structure: node db-setup.js
1.4 Download the XML results from the source: https://www.ncbi.nlm.nih.gov/pubmed/?term=psychology
1.5 Parse file: node parse-from-xml.js <path-to-xml-file>

(<path-to-xml-file> is the path to the XML file exported from PubMed)
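
To make the parsing step more concrete, here is a minimal Python sketch of pulling the PMID, title and abstract out of a PubMed XML export. It is not the repository's parse-from-xml.js (which runs on Node); the function name `parse_articles` and the sample records are hypothetical, and only the element names (`PubmedArticle`, `MedlineCitation`, `PMID`, `ArticleTitle`, `AbstractText`) follow the PubMed export layout.

```python
import xml.etree.ElementTree as ET

# Hypothetical, heavily trimmed sample in the PubMed XML export layout.
SAMPLE_XML = """<PubmedArticleSet>
  <PubmedArticle>
    <MedlineCitation>
      <PMID>1</PMID>
      <Article>
        <ArticleTitle>Working memory in adults</ArticleTitle>
        <Abstract>
          <AbstractText Label="BACKGROUND">First part.</AbstractText>
          <AbstractText Label="RESULTS">Second part.</AbstractText>
        </Abstract>
      </Article>
    </MedlineCitation>
  </PubmedArticle>
  <PubmedArticle>
    <MedlineCitation>
      <PMID>2</PMID>
      <Article>
        <ArticleTitle>A record without an abstract</ArticleTitle>
      </Article>
    </MedlineCitation>
  </PubmedArticle>
</PubmedArticleSet>"""


def parse_articles(xml_text):
    """Return one dict per PubmedArticle with pmid, title and abstract."""
    root = ET.fromstring(xml_text)
    articles = []
    for article in root.iter("PubmedArticle"):
        pmid = article.findtext("MedlineCitation/PMID")
        title = article.findtext("MedlineCitation/Article/ArticleTitle")
        # Structured abstracts are split over several AbstractText elements.
        parts = [
            "".join(node.itertext()).strip()
            for node in article.iterfind(
                "MedlineCitation/Article/Abstract/AbstractText"
            )
        ]
        articles.append(
            {"pmid": pmid, "title": title, "abstract": " ".join(parts)}
        )
    return articles


articles = parse_articles(SAMPLE_XML)
for a in articles:
    print(a["pmid"], a["title"], repr(a["abstract"]))
```

Records without an abstract yield an empty string here, so a later step can filter them out. For a large export, a streaming parser such as `ET.iterparse` would likely be a better fit than loading the whole file at once.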

2. Preprocessing

in topic-modelling:

2.1 Install modules: pip install nltk spacy gensim
2.2 Download nltk stopwords: python download_stopwords.py
2.3 Download spacy en module: python -m spacy download en
2.4 Preprocess the texts: python preprocess.py
2.5 Transform corpus to dictionary and bag-of-words structure: python -c "from transform_corpus import *; save_corpus_and_dictionary_to_file()"
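
As an illustration of what "dictionary and bag-of-words structure" means in step 2.5, the following standard-library sketch builds a token-to-id mapping and converts each document into (token id, count) pairs. It is an assumption-laden stand-in, not the code in transform_corpus.py: the helper names, the tiny stopword set and the sample sentences are hypothetical, and lemmatization (done with spaCy in the real pipeline) is left out. gensim's `corpora.Dictionary` and its `doc2bow` method produce the same kind of structure, though not necessarily with the same ids.

```python
import re
from collections import Counter

# Tiny stand-in stopword list; the pipeline itself downloads NLTK's list.
STOPWORDS = {"the", "of", "and", "in", "a", "on", "is"}


def tokenize(text):
    """Lowercase, keep alphabetic tokens, drop stopwords."""
    return [t for t in re.findall(r"[a-z]+", text.lower()) if t not in STOPWORDS]


def build_dictionary(tokenized_docs):
    """Map each distinct token to an integer id, in order of first appearance."""
    token2id = {}
    for doc in tokenized_docs:
        for token in doc:
            token2id.setdefault(token, len(token2id))
    return token2id


def to_bow(doc, token2id):
    """Convert one tokenized document to sorted (token_id, count) pairs."""
    counts = Counter(token2id[t] for t in doc if t in token2id)
    return sorted(counts.items())


# Hypothetical sample texts standing in for preprocessed abstracts.
docs = [
    "The effects of stress on memory",
    "Memory and attention in the ageing brain",
    "Stress, stress and sleep",
]
tokenized = [tokenize(d) for d in docs]
token2id = build_dictionary(tokenized)
corpus = [to_bow(doc, token2id) for doc in tokenized]

print(token2id)
print(corpus)
```

Word order is discarded in this representation; only which tokens occur in a document, and how often, is passed on to the topic model.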

3. LDA

in topic-modelling:

3.1 Download Mallet and set the MALLET_PATH environment variable
3.2 Generate a topic model: python generate_model.py <'gensim' | 'mallet'> ...topic_number_configurations
3.3 Calculate coherence scores: python get_coherence.py ...model_names
3.4 Extract topics to a csv file: python extract_topics.py <model-name> <number of topics>
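
To give a feel for what a coherence score in step 3.3 measures, here is a small standard-library sketch of the UMass coherence for a single topic. Which measure get_coherence.py actually computes is not stated here, so UMass is an assumption chosen for its simplicity (gensim's `CoherenceModel` offers it among others). The function name and the toy documents are hypothetical.

```python
import math
from itertools import combinations


def umass_coherence(top_words, tokenized_docs):
    """UMass coherence for one topic's top words, ordered most probable first.

    Assumes every top word occurs in at least one document, so the
    denominator below is never zero.
    """
    doc_sets = [set(doc) for doc in tokenized_docs]

    def doc_freq(*words):
        # Number of documents containing all the given words.
        return sum(1 for s in doc_sets if all(w in s for w in words))

    score = 0.0
    # combinations yields (earlier, later) pairs, so `earlier` ranks higher.
    for earlier, later in combinations(top_words, 2):
        score += math.log((doc_freq(earlier, later) + 1) / doc_freq(earlier))
    return score


# Hypothetical toy corpus of already-tokenized documents.
docs = [
    ["stress", "memory", "sleep"],
    ["stress", "memory"],
    ["attention", "brain"],
    ["stress", "sleep"],
]

coherent = umass_coherence(["stress", "memory"], docs)
mixed = umass_coherence(["stress", "brain"], docs)
print(coherent, mixed)
```

Words that tend to appear in the same documents score closer to zero, while words that never co-occur push the score further negative. Comparing such scores across models with different topic counts is one way to pick a configuration.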
