TODO
Activities to be done as part of project closure, the phase from July to August 2015. Focus on the following areas:
- Usability of the code
  - the code should function out of the box
  - provide examples and ready-made trees (pickled trees)
  - package the code for PyPI distribution
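The "ready-made trees" above can be shipped with `pickle`. A minimal sketch of saving and restoring a trained tree so it works out of the box; the `Node` class and its fields here are hypothetical stand-ins, not the actual trsl tree structure:

```python
import pickle

# Hypothetical minimal tree node; the real trsl tree class may differ.
class Node:
    def __init__(self, question=None, children=None, dist=None):
        self.question = question       # set-membership question asked at this node
        self.children = children or []
        self.dist = dist or {}         # word -> probability at a leaf

# Build a toy tree and pickle it, so users can load a ready-made model.
root = Node(question="w in {the, a, an}?",
            children=[Node(dist={"cat": 0.6, "dog": 0.4})])

with open("tree.pkl", "wb") as f:
    pickle.dump(root, f)

# Loading the pickled tree restores the full structure.
with open("tree.pkl", "rb") as f:
    loaded = pickle.load(f)

print(loaded.children[0].dist["cat"])  # -> 0.6
```

Pickling whole objects keeps the node classes and tree shape together, so a packaged release only needs to bundle the `.pkl` files alongside the code.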
- Usefulness of the code
  - returns a result with a degree of accuracy (or perplexity)
- Document the approach(es) taken to build the tree - capture each approach, and the resulting graphs that capture statistical facts about the algorithm.
  - Approach 1 - word probability distribution, with Word2Vec. The tree mirrors the training corpus and contains more than 100,000 nodes.
  - Approach 2 - set probability distribution, with Word2Vec. The tree contains tens of thousands of nodes, fewer than Approach 1.
  - Approach 3 - set probability distribution, with unsuPos.
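For reporting the perplexity mentioned above, the standard definition is 2 raised to the cross-entropy of the model on held-out text. A small sketch, assuming the model supplies the per-word probabilities (how trsl exposes them is not specified here):

```python
import math

def perplexity(probs):
    """Perplexity over a test sequence, given the probability the
    model assigned to each word: 2 ** (average negative log2 prob)."""
    n = len(probs)
    cross_entropy = -sum(math.log2(p) for p in probs) / n
    return 2 ** cross_entropy

# A model that assigns uniform probability 1/4 has perplexity exactly 4.
print(perplexity([0.25, 0.25, 0.25, 0.25]))  # -> 4.0
```

Lower perplexity means the tree's leaf distributions concentrate more probability on the words that actually occur.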
- Create unit test cases that capture the tree-structuring decision points (the depth of the tree, the reduction across nodes, the sum of probabilities in a peer group, etc.)
- Document future work and why it is needed; also prioritize the future work tasks.
- Document the limitations and advantages of the current work.
- Include this header (the Python source code encoding) in all .py files:
  `# -*- coding: utf-8 -*-`
- Include a copyright notice at the top of all files:
  `# Copyright of the Indian Institute of Science's Speech and Audio group. View LICENSE file for details.`
- logging - set up a logger for trsl with defined parameters, and access it globally via the get_logger mechanism
- generators - use a generator when finding best_question
- xrange - use xrange in place of range
- Counter - use collections.Counter for frequency counts
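The logger and Counter items above can be sketched together. This is an assumed shape for the `get_logger` mechanism (the actual trsl interface may differ); `collections.Counter` replaces hand-rolled dict frequency counting:

```python
import logging
from collections import Counter

def get_logger(name="trsl"):
    """Return the shared trsl logger, configured once with defined
    parameters (level, format). Assumed interface, for illustration."""
    logger = logging.getLogger(name)
    if not logger.handlers:               # configure only on first access
        handler = logging.StreamHandler()
        handler.setFormatter(logging.Formatter(
            "%(asctime)s %(name)s %(levelname)s %(message)s"))
        logger.addHandler(handler)
        logger.setLevel(logging.INFO)
    return logger

# Counter gives word frequencies in one call.
tokens = "the cat sat on the mat the end".split()
freq = Counter(tokens)
get_logger().info("most common: %s", freq.most_common(2))
print(freq["the"])  # -> 3
```

Because `logging.getLogger` returns the same object for the same name, every module that calls `get_logger()` shares one configured logger.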
- Research on set-building heuristics
- Basic implementation of the algorithm
- Obtaining Word2Vec vectors for the vocabulary
- Identifying technologies that fit our implementation
- Pseudocode for the implementation
- Basic tokenizing
- Validation of our understanding of the algorithm
- Basic conditional entropy calculation
- Native DS implementation
- Obtaining info regarding clustering techniques for the vectors obtained from Word2Vec
- Heuristic for the reduction threshold
- Smoothing
- Pylons
- parent probability distribution becomes the child distribution; partitioned data is considered only down the level
- check the leaf distribution and determine whether it is significant [threshold -> statistical? how many levels] [if less than 0.5 bit, stop there]
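The 0.5-bit stopping rule above can be sketched as a check on the reduction in conditional entropy from a candidate split. This is an illustrative sketch only; the actual trsl criterion and data structures may differ:

```python
import math
from collections import Counter

def entropy(words):
    """Shannon entropy (bits) of the word distribution in a node."""
    counts = Counter(words)
    total = len(words)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def keep_splitting(parent_words, yes_words, no_words, threshold=0.5):
    """Stop growing the tree when a split reduces conditional entropy
    by less than `threshold` bits (the 0.5-bit rule in the note above)."""
    n = len(parent_words)
    cond = (len(yes_words) / n) * entropy(yes_words) \
         + (len(no_words) / n) * entropy(no_words)
    return entropy(parent_words) - cond >= threshold

# A perfect split of {a, a, b, b} gains a full bit, so splitting continues.
parent = ["a", "a", "b", "b"]
print(keep_splitting(parent, ["a", "a"], ["b", "b"]))  # -> True
```

A split that leaves both children with the parent's mix gains zero bits and would stop tree growth at that leaf.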
- nouns, verbs -> grammar classification
- HTK grammar specific [certain sequences, word groups -> fixed finite state machine [CFG]]
- Sphinx [allows grammar specification] -> English predefined heuristics [probably multiple]
- topic model -> document structuring [classify documents] [LDA]
- stemming is necessary [look at them uniquely [removal of word]]
- Smoothing -> [less frequent, so the probability distribution might not be computed] [smooth the current level based on the root-level probability (convex combination)]
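The convex-combination smoothing noted above can be sketched as interpolating a sparse node distribution with the root distribution. The interpolation weight `lam` is an assumed parameter for illustration, not a value from the source:

```python
def smooth(node_dist, root_dist, lam=0.8):
    """Convex combination of a node's sparse distribution with the root
    distribution, so words unseen at the node still get probability.
    `lam` is an assumed interpolation weight."""
    vocab = set(node_dist) | set(root_dist)
    return {w: lam * node_dist.get(w, 0.0) + (1 - lam) * root_dist.get(w, 0.0)
            for w in vocab}

root = {"the": 0.5, "cat": 0.3, "dog": 0.2}
leaf = {"the": 1.0}            # leaf saw too little data: only "the"
sm = smooth(leaf, root)
print(round(sm["dog"], 2))     # -> 0.04
```

Since both inputs sum to 1 and the weights sum to 1, the smoothed result is still a valid probability distribution.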
- 2 levels of clustering [accuracy improves as you go down the levels]
  - level 1: broad categories [verb, adverb, adjective?]
  - level 2: tree spanning words used in the past or future tense
- If punctuation is part of the corpus, should the punctuation symbols be part of the sets being considered?
- size of clusters [largest]
- reduction from root to current child
- depth of tree
Indian Institute of Science (IISc) speech and audio group.
http://sites.google.com/site/sagiisc/