
Project Closure

Activities to be completed as part of project closure, the phase from July to August 2015. Focus on the following areas:

  • Usability of the code
    • the code should function out of the box
    • provide examples and ready-made trees (pickled trees)
    • package this code for PyPI distribution
  • Usefulness of the code
    • Returns a result together with a measure of its quality, such as perplexity (see the sketch after this list).
  • Document the approach(es) taken to build the tree, along with the resulting graphs that capture statistical facts about the algorithm.
    • Approach 1 - word probability distribution, with Word2Vec. The tree mirrors the training corpus and contains more than 100,000 nodes.
    • Approach 2 - set probability distribution, with Word2Vec. The tree contains some multiple of 10,000 nodes, fewer than Approach 1.
    • Approach 3 - set probability distribution, with unsuPos.
    • Create unit test cases to capture the tree-structuring decision points (the depth of the tree, the reduction across nodes, the sum of probabilities in a peer group, etc.)
  • Document future work and why it is needed, and prioritize the future work tasks.
  • Document the limitations and advantages of the current work
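
As a rough illustration of the usability and perplexity goals above, here is a minimal sketch of out-of-the-box usage; the model.pickle file name and the tree.predict(context, word) method are hypothetical placeholders, not the project's actual API.

```python
import math
import pickle

# Load a ready-made (pickled) tree shipped with the package.
# "model.pickle" is a placeholder file name.
with open("model.pickle", "rb") as f:
    tree = pickle.load(f)

def perplexity(tree, test_ngrams):
    """Perplexity = 2 ** cross-entropy of the model on held-out
    (context, word) pairs; lower is better."""
    log_prob_sum = 0.0
    for context, word in test_ngrams:
        # tree.predict is assumed to return P(word | context).
        log_prob_sum += math.log(tree.predict(context, word), 2)
    cross_entropy = -log_prob_sum / len(test_ngrams)
    return 2 ** cross_entropy
```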

Code cleanup

  • Include this header in all .py files to declare the Python source code encoding: # -*- coding: utf-8 -*-
  • Include a copyright notice at the top of all files: # Copyright of the Indian Institute of Science's Speech and Audio group. View LICENSE file for details.
  • logging - set up a logger for trsl with defined parameters, and access it globally via the get_logger mechanism
  • generators - use generators when finding best_question
  • xrange - use xrange in place of range, so iteration is lazy (Python 2)
  • Counter - use collections.Counter for frequency counts (a combined sketch of these conventions follows this list)
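
A minimal sketch combining these conventions in one file. It targets Python 2 (hence xrange); get_logger is written here as a plausible implementation of the mechanism named above, and best_question's arguments are hypothetical.

```python
# -*- coding: utf-8 -*-
# Copyright of the Indian Institute of Science's Speech and Audio group.
# View LICENSE file for details.

import logging
from collections import Counter


def get_logger(name="trsl"):
    """Return the shared trsl logger, configuring it on first use."""
    logger = logging.getLogger(name)
    if not logger.handlers:
        handler = logging.StreamHandler()
        handler.setFormatter(logging.Formatter(
            "%(asctime)s %(name)s %(levelname)s %(message)s"))
        logger.addHandler(handler)
        logger.setLevel(logging.INFO)
    return logger


def best_question(questions, score):
    """The generator expression avoids materialising a list of
    (score, question) pairs while searching for the best question."""
    return max((score(q), q) for q in questions)[1]


words = ["the", "cat", "sat", "on", "the", "mat"]
frequencies = Counter(words)  # collections.Counter for frequency counts

logger = get_logger()
for i in xrange(3):  # xrange iterates lazily, unlike range (Python 2)
    logger.info("pass %d", i)
```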

Experiments

  • Research on Set building heuristics
  • Basic implementation of algorithm
  • Obtaining Word2vec vectors for the vocabulary
  • Identifying technologies that fit our implementation
  • Pseudocode for implementation
  • Basic tokenizing
  • Validation of our understanding of the Algorithm
  • Basic conditional entropy calculation
  • Native data structure (DS) implementation
  • Gathering information on clustering techniques for the vectors obtained from word2vec
  • Heuristic for reduction threshold
  • Smoothing
  • Pylons

Thoughts regarding implementation specifics that are still unclear

  • the parent's probability distribution becomes the child's distribution; the partitioned data is considered only from that level downward

  • check the leaf distribution and determine whether it is significant [ threshold -> statistical? how many levels? ] [ if the entropy reduction is less than 0.5 bit, stop there; see the sketch after this list ]

  • nouns, verbs, etc. -> grammar classification

  • HTK is grammar-specific [ certain sequences / word groups -> a fixed finite state machine [ CFG ] ]

  • Sphinx [ allows a grammar spec ] -> predefined English heuristics [ probably multiple ]

  • topic model -> document structuring [ classify documents ] [ LDA ]

  • stemming is necessary [ look at them uniquely [ removal of word ] ]

  • Smoothing -> [ rare events mean the probability distribution might not be computable at a node ] [ smooth the current level based on the root-level probability (convex combination); see the sketch after this list ]

  • 2 levels of clustering [ accuracy improves as you go down the levels ]
    • level 1: broad categories [ verb, adverb, adjective? ]
    • level 2: the tree spans words used in the past or future tense
  • If punctuation is a part of the corpus, should the punctuation symbols be a part of the sets being considered?
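
Two of the points above lend themselves to a small sketch: the 0.5-bit stopping check on a node's distribution, and convex-combination smoothing against the root distribution. Distributions are assumed to be plain word -> probability dicts, and the mixing weight lam is an assumed parameter, not a value the project has fixed.

```python
import math


def entropy(dist):
    """Shannon entropy in bits of a dict mapping word -> probability."""
    return -sum(p * math.log(p, 2) for p in dist.values() if p > 0)


def should_stop(parent_dist, child_dist, threshold=0.5):
    """Stop growing a branch when the entropy reduction from parent to
    child falls below the 0.5-bit threshold mentioned above."""
    return entropy(parent_dist) - entropy(child_dist) < threshold


def smooth(node_dist, root_dist, lam=0.9):
    """Convex combination of a node's distribution with the root's, so
    words unseen at the node still receive probability mass. lam = 0.9
    is an assumed mixing weight."""
    words = set(node_dist) | set(root_dist)
    return {
        w: lam * node_dist.get(w, 0.0) + (1 - lam) * root_dist.get(w, 0.0)
        for w in words
    }
```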

Important Statistical Considerations

  • size of clusters [ largest ]
  • reduction from root to current child
  • depth of tree
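
A minimal sketch of collecting these three statistics from a built tree. The node attributes (children, word_set, probability_distribution) are hypothetical names, and "reduction" is read here as entropy reduction in bits, matching the notes above.

```python
import math


def _entropy(dist):
    """Shannon entropy in bits of a dict mapping word -> probability."""
    return -sum(p * math.log(p, 2) for p in dist.values() if p > 0)


def tree_stats(root):
    """Walk the tree iteratively and collect the largest cluster size,
    the maximum entropy reduction from the root, and the tree depth."""
    root_entropy = _entropy(root.probability_distribution)
    largest_cluster, max_depth = 0, 0
    min_entropy = root_entropy
    stack = [(root, 0)]
    while stack:
        node, depth = stack.pop()
        max_depth = max(max_depth, depth)
        largest_cluster = max(largest_cluster, len(node.word_set))
        min_entropy = min(min_entropy,
                          _entropy(node.probability_distribution))
        stack.extend((child, depth + 1) for child in node.children)
    return {
        "largest_cluster_size": largest_cluster,
        "tree_depth": max_depth,
        "max_entropy_reduction": root_entropy - min_entropy,
    }
```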