
Project Closure

Activities to be completed as part of project closure, the phase from July to August 2015. Focus on the following areas:

  • Usability of the code
    • the code should function out of the box
    • provide examples and ready-made trees (pickled trees)
    • package this code for PyPI distribution
  • Usefulness of the code
    • Returns a result together with a measure of its quality, such as perplexity (see the sketch after this list).
  • Document the approach(es) taken to build the tree, along with the resulting graphs that capture statistical facts about the algorithm.
    • Approach 1 - word probability distribution, with Word2Vec. The tree mirrors the training corpus and contains more than 100,000 nodes.
    • Approach 2 - set probability distribution, with Word2Vec. The tree contains some multiple of 10,000 nodes, fewer than Approach 1.
    • Approach 3 - set probability distribution, with unsuPos.
    • Create unit test cases to capture the tree-structuring decision points (the depth of the tree, the reduction across nodes, the sum of probabilities in a peer group, etc.)
  • Document future work and why it is needed, and prioritize the future work tasks.
  • Document the limitations and advantages of the current work
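
As a rough illustration of the usability and perplexity goals above, here is a minimal sketch of out-of-the-box usage; the model.pickle file name and the tree.predict(context, word) method are hypothetical placeholders, not the project's actual API.

```python
import math
import pickle

# Load a ready-made (pickled) tree shipped with the package.
# "model.pickle" is a placeholder file name.
with open("model.pickle", "rb") as f:
    tree = pickle.load(f)

def perplexity(tree, test_ngrams):
    """Perplexity = 2 ** cross-entropy of the model on held-out
    (context, word) pairs; lower is better."""
    log_prob_sum = 0.0
    for context, word in test_ngrams:
        # tree.predict is assumed to return P(word | context).
        log_prob_sum += math.log(tree.predict(context, word), 2)
    cross_entropy = -log_prob_sum / len(test_ngrams)
    return 2 ** cross_entropy
```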

Code cleanup

  • Include this header in all .py files to declare the Python source code encoding: # -*- coding: utf-8 -*-
  • Include a copyright notice at the top of all files: # Copyright of the Indian Institute of Science's Speech and Audio group. View LICENSE file for details.
  • logging - set up a logger for trsl with defined parameters, and access it globally via the get_logger mechanism
  • generators - use generators when finding best_question
  • xrange - use xrange in place of range, so iteration is lazy (Python 2)
  • Counter - use collections.Counter for frequency counts (a combined sketch of these conventions follows this list)
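
A minimal sketch combining these conventions in one file. It targets Python 2 (hence xrange); get_logger is written here as a plausible implementation of the mechanism named above, and best_question's arguments are hypothetical.

```python
# -*- coding: utf-8 -*-
# Copyright of the Indian Institute of Science's Speech and Audio group.
# View LICENSE file for details.

import logging
from collections import Counter


def get_logger(name="trsl"):
    """Return the shared trsl logger, configuring it on first use."""
    logger = logging.getLogger(name)
    if not logger.handlers:
        handler = logging.StreamHandler()
        handler.setFormatter(logging.Formatter(
            "%(asctime)s %(name)s %(levelname)s %(message)s"))
        logger.addHandler(handler)
        logger.setLevel(logging.INFO)
    return logger


def best_question(questions, score):
    """The generator expression avoids materialising a list of
    (score, question) pairs while searching for the best question."""
    return max((score(q), q) for q in questions)[1]


words = ["the", "cat", "sat", "on", "the", "mat"]
frequencies = Counter(words)  # collections.Counter for frequency counts

logger = get_logger()
for i in xrange(3):  # xrange iterates lazily, unlike range (Python 2)
    logger.info("pass %d", i)
```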

Experiments

  • Research on Set building heuristics
  • Basic implementation of algorithm
  • Obtaining Word2vec vectors for the vocabulary
  • Identifying technologies that fit our implementation
  • Pseudocode for implementation
  • Basic tokenizing
  • Validation of our understanding of the Algorithm
  • Basic conditional entropy calculation
  • Native data structure (DS) implementation
  • Gathering information on clustering techniques for the vectors obtained from word2vec
  • Heuristic for reduction threshold
  • Smoothing
  • Pylons

Thoughts regarding implementation specifics that are still unclear

  • the parent's probability distribution becomes the child's distribution; the partitioned data is considered only from that level downward

  • check the leaf distribution and determine whether it is significant [ threshold -> statistical? how many levels? ] [ if the entropy reduction is less than 0.5 bit, stop there; see the sketch after this list ]

  • nouns, verbs, etc. -> grammar classification

  • HTK is grammar-specific [ certain sequences / word groups -> a fixed finite state machine [ CFG ] ]

  • Sphinx [ allows a grammar spec ] -> predefined English heuristics [ probably multiple ]

  • topic model -> document structuring [ classify documents ] [ LDA ]

  • stemming is necessary [ look at them uniquely [ removal of word ] ]

  • Smoothing -> [ rare events mean the probability distribution might not be computable at a node ] [ smooth the current level based on the root-level probability (convex combination); see the sketch after this list ]

  • 2 levels of clustering [ accuracy improves as you go down the levels ]
    • level 1: broad categories [ verb, adverb, adjective? ]
    • level 2: the tree spans words used in the past or future tense
  • If punctuation is a part of the corpus, should the punctuation symbols be a part of the sets being considered?
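
Two of the points above lend themselves to a small sketch: the 0.5-bit stopping check on a node's distribution, and convex-combination smoothing against the root distribution. Distributions are assumed to be plain word -> probability dicts, and the mixing weight lam is an assumed parameter, not a value the project has fixed.

```python
import math


def entropy(dist):
    """Shannon entropy in bits of a dict mapping word -> probability."""
    return -sum(p * math.log(p, 2) for p in dist.values() if p > 0)


def should_stop(parent_dist, child_dist, threshold=0.5):
    """Stop growing a branch when the entropy reduction from parent to
    child falls below the 0.5-bit threshold mentioned above."""
    return entropy(parent_dist) - entropy(child_dist) < threshold


def smooth(node_dist, root_dist, lam=0.9):
    """Convex combination of a node's distribution with the root's, so
    words unseen at the node still receive probability mass. lam = 0.9
    is an assumed mixing weight."""
    words = set(node_dist) | set(root_dist)
    return {
        w: lam * node_dist.get(w, 0.0) + (1 - lam) * root_dist.get(w, 0.0)
        for w in words
    }
```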

Important Statistical Considerations

  • size of clusters [ largest ]
  • reduction from root to current child
  • depth of tree
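
A minimal sketch of collecting these three statistics from a built tree. The node attributes (children, word_set, probability_distribution) are hypothetical names, and "reduction" is read here as entropy reduction in bits, matching the notes above.

```python
import math


def _entropy(dist):
    """Shannon entropy in bits of a dict mapping word -> probability."""
    return -sum(p * math.log(p, 2) for p in dist.values() if p > 0)


def tree_stats(root):
    """Walk the tree iteratively and collect the largest cluster size,
    the maximum entropy reduction from the root, and the tree depth."""
    root_entropy = _entropy(root.probability_distribution)
    largest_cluster, max_depth = 0, 0
    min_entropy = root_entropy
    stack = [(root, 0)]
    while stack:
        node, depth = stack.pop()
        max_depth = max(max_depth, depth)
        largest_cluster = max(largest_cluster, len(node.word_set))
        min_entropy = min(min_entropy,
                          _entropy(node.probability_distribution))
        stack.extend((child, depth + 1) for child in node.children)
    return {
        "largest_cluster_size": largest_cluster,
        "tree_depth": max_depth,
        "max_entropy_reduction": root_entropy - min_entropy,
    }
```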