This repository shares some of the text mining work done at the Fund. It aims to provide a set of well-written code examples and tutorials that people with little text mining experience can easily grasp and apply to their own problems.
Ideally, we want to cover as many programming languages as possible. Contributors with R and MATLAB experience are especially needed.
- Intro to text analysis - introductions to some basic text analysis concepts (tokenizing, stemming, removing stop words, etc.)
- Download and process COM's XML data - basic cleanup of COM's XML database
- Basic keyword search - using IMF Staff Reports
- Word embedding - word2vec, doc2vec
- Topic modeling - such as LDA
- Sentiment analysis - both dictionary-based and machine-learning-based
- Document similarity measure [coming]
- Data visualization - word cloud, embedding projection, ldaViz, knowledge graph, etc.
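
To give a flavor of the first few topics, here is a minimal sketch of tokenizing, stop word removal, stemming, and a basic keyword count, using only the Python standard library. It is not taken from the tutorials themselves: the function names, the tiny `STOP_WORDS` set, the suffix-stripping rule, and the sample sentence are all assumptions made for illustration. A real project would more likely rely on a dedicated NLP library and a proper stemmer.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stop word list (assumption for illustration).
STOP_WORDS = {"the", "a", "an", "of", "and", "to", "in", "is", "are", "on"}


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def remove_stop_words(tokens):
    """Drop tokens that appear in the stop word list."""
    return [t for t in tokens if t not in STOP_WORDS]


def naive_stem(token):
    """Strip a few common English suffixes.

    This is a crude stand-in for a real stemmer, not the Porter algorithm.
    """
    for suffix in ("ing", "ed", "es", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def preprocess(text):
    """Tokenize, remove stop words, then stem each remaining token."""
    return [naive_stem(t) for t in remove_stop_words(tokenize(text))]


def keyword_counts(text, keywords):
    """Count how often each keyword (after stemming) appears in the text."""
    counts = Counter(preprocess(text))
    return {k: counts[naive_stem(k.lower())] for k in keywords}


# Made-up sample sentence, not drawn from any real report.
sample = "Inflation is rising and the risks to growth are increasing."
print(preprocess(sample))
print(keyword_counts(sample, ["inflation", "risk", "growth"]))
```

Stemming the keywords with the same function as the text lets "risk" match "risks". The trade-off is that a naive stemmer produces rough stems such as "ris" for "rising", which is one reason the tutorials cover these steps in more depth.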