
Arpa-reading #8

Merged
6 commits merged on Apr 28, 2017

Conversation

@keli78 commented Mar 28, 2017

No description provided.

if (tokens.size() == 0) continue;
if (tokens.size() == 2 && tokens[0] == "ngram") {
std::string substring = tokens[1].substr(2);
int32 count = std::stoi(substring); // get "123456" from "1=123456"

hainan-xv (Owner):
don't use stoi
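
A sketch of what stoi-free parsing can look like (in Kaldi itself, `ConvertStringToInteger()` from util/text-utils.h is the usual choice; `ParseInt32` below is an illustrative name, not the actual code). `std::stoi` throws on malformed input and silently ignores trailing junk, while `strtol` lets both conditions be checked explicitly:

```cpp
#include <cstdlib>
#include <string>

// Parse a base-10 integer with explicit error checking instead of std::stoi.
bool ParseInt32(const std::string &s, int *out) {
  char *end = NULL;
  long v = std::strtol(s.c_str(), &end, 10);
  if (end == s.c_str() || *end != '\0')
    return false;  // no digits consumed, or trailing non-digit characters
  *out = static_cast<int>(v);
  return true;
}
```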

std::cout << "OOV found in history: " << tokens[i] << std::endl;
}
}
assert (history.size() == order_current - 1);

hainan-xv (Owner):
KALDI_ASSERT

if (it != vocab_.end()) {
history.push_back(it->second);
} else {
std::cout << "OOV found in history: " << tokens[i] << std::endl;

hainan-xv (Owner):
KALDI_LOG


float NgramModel::GetProb(int32 order, const int32 word, const HistType& history) {
float prob = 0.0;
auto it = probs_[order - 1].find(history);

hainan-xv (Owner):
no auto.

OK, please check ALL my comments on your previous pull requests. You're making the same mistakes again.

void NgramModel::ReadARPAModel(char* file) {
std::ifstream data_input(file);
if (!data_input.is_open()) {
std::cerr << "error opening '" << file

hainan-xv (Owner):
KALDI_ERROR or something

@hainan-xv (Owner) left a comment:
OK, I have seen a lot of issues that I commented on before but that you're still doing. Please fix them and I will do another review.

if (u >= 0 && u < cdf[1].second) {
return cdf[0].first;
}
for (int32 i = 1; i < num_words_; i++) {

hainan-xv (Owner):
change this to a binary search. Check my SelectOne() function in rnnlm-utils.cc for reference if you need.
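
A hedged sketch of the binary search the reviewer asks for (illustrative names; the actual reference is `SelectOne()` in rnnlm-utils.cc). If `cdf[i]` holds the cumulative probability of words `0..i`, sampling reduces to finding the first index whose cumulative value exceeds `u`, which `std::upper_bound` does in O(log n) instead of the O(n) linear scan:

```cpp
#include <algorithm>
#include <vector>

// Sample a word id from a cumulative distribution with binary search.
// cdf must be non-decreasing and end at (approximately) 1.0.
int SampleWord(const std::vector<float> &cdf, float u) {
  std::vector<float>::const_iterator it =
      std::upper_bound(cdf.begin(), cdf.end(), u);  // first entry > u
  return static_cast<int>(it - cdf.begin());
}
```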

probs = std::make_pair(lower, upper);
cdf.push_back(probs);
}
float u = 1.0 * rand()/RAND_MAX;

hainan-xv (Owner):
rand
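
One way to replace `rand()/RAND_MAX`, as a sketch (illustrative names, not the actual Kaldi code): `std::mt19937` is a well-defined generator that can be seeded explicitly, and `std::uniform_real_distribution` yields `u` in `[0, 1)` without the granularity and low-bit quality problems of `rand()`:

```cpp
#include <random>

// Draw a uniform float in [0, 1) from an explicitly seeded generator.
float DrawUniform(std::mt19937 *rng) {
  std::uniform_real_distribution<float> dist(0.0f, 1.0f);
  return dist(*rng);
}
```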

@keli78 (Author) commented Mar 29, 2017

Hi Hainan, you can ignore the oldest commit; sorry, I cannot delete it.
I have fixed all the issues you mentioned above (except for the binary search) in later commits. You can check commits 1) "sample a word version 2 (use fst Symbol table and kaldi io)" and 2) "fix a bug in computing weights of histories".

@keli78 changed the title from Rnnlm shortcut to Arpa-reading on Apr 6, 2017
typedef int32_t int32;

/// A hashing function-object for vectors of ints.
struct IntVectorHasher { // hashing function for vector<Int>.

hainan-xv (Owner):
Is this copied from another file in kaldi?

keli78 (Author):
Oh this is just an implementation of the VectorHasher in Kaldi. I will just import that file.

typedef unordered_map<HistType, WordToProbsMap, IntVectorHasher> NgramType;
typedef unordered_map<HistType, BaseFloat, IntVectorHasher> HistWeightsType;

class ArpaSampling : public ArpaFileParser {

hainan-xv (Owner):
You need to move all testing-related function/variable out of the class. Write them in the test cc file.

hainan-xv (Owner):
What I mean is you should not write a testing function as a member function of a class.

keli78 (Author):
Sure, got it.

const char *usage = "";
ParseOptions po(usage);
po.Read(argc, argv);
std::string arpa_file = po.GetArg(1), history_file = po.GetArg(2);

hainan-xv (Owner):
this is probably OK now but you need to change it so it doesn't need cmd arguments


BaseFloat ArpaSampling::GetProb(int32 order, int32 word, const HistType& history) {
BaseFloat prob = 0.0;
auto it = probs_[order - 1].find(history);

hainan-xv (Owner):
auto?


namespace kaldi {

void ArpaSampling::ConsumeNGram(const NGram& ngram) {

hainan-xv (Owner):
please add comments to functions to describe what they do

BaseFloat bow = 0.0;
auto it = probs_[order - 1].find(history);
if (it != probs_[order - 1].end()) {
auto it2 = probs_[order - 1][history].find(word);

hainan-xv (Owner):
no auto please

}

void ArpaSampling::PrintHist(const HistType& h) {
KALDI_LOG << "Current hist is: ";

hainan-xv (Owner):
run this function and you'll see that it has a problem.

std::string unk_symbol_;

// Vocab
std::vector<std::pair<std::string, int32> > vocab_;

hainan-xv (Owner):
why do you need this instead of a SymbolTable?

keli78 (Author):
So should I use SymbolTable as vocab?

HistWeightsType hists_weights_;

// The given N Histories
std::vector<HistType> histories_;

hainan-xv (Owner):
you should NOT store histories_ in this class.

hainan-xv (Owner):
This class should only store information about the ngram model (read from the arpa file). Histories should just be a parameter you pass in order to get the prob-distributions.

void ArpaSampling::ComputeWeightedPdf(std::vector<std::pair<int32, BaseFloat> >* pdf_w) {
BaseFloat prob = 0;
(*pdf_w).clear();
(*pdf_w).resize(num_words_); // if do not do this, (*pdf_w)[word] += prob will get seg fault

hainan-xv (Owner):
delete this comment. it's so obvious

history.push_back(word);
}
if (history.size() >= ngram_order_) {
std::reverse(history.begin(), history.end());

hainan-xv (Owner):
this is an extremely inefficient way of doing things. please make it more efficient.

keli78 (Author):
Sure. Thanks for this comment.
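
One possible fix for the push_back-then-`std::reverse` pattern, as a sketch (illustrative names, not the actual code): keep the history in chronological order as it is built by appending the newest word at the back of a `std::deque` and dropping the oldest word from the front once the window exceeds the maximum history length:

```cpp
#include <deque>

// Maintain a sliding history window without ever reversing it.
void AddWordToHistory(int word, int max_hist_len, std::deque<int> *history) {
  history->push_back(word);          // newest word goes at the back
  while (static_cast<int>(history->size()) > max_hist_len)
    history->pop_front();            // drop the oldest word; O(1) in a deque
}
```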

KALDI_LOG << "Expected number of " << (i + 1) << "-grams: " << ngram_counts_[i];
for (auto it1 = probs_[i].begin(); it1 != probs_[i].end(); ++it1) {
HistType h(it1->first);
for (auto it2 = (probs_[i])[h].begin(); it2 != (probs_[i])[h].end(); ++it2) {

hainan-xv (Owner):
no need to do (v[i])[j] --- just use v[i][j]

if (it != probs_[order].end()) {
auto it2 = probs_[order][history].find(word);
if (it2 != probs_[order][history].end()) {
prob = pow(10, (it2->second).first);

hainan-xv (Owner):
no need to do (i->second).first -- just do i->second.first

if (it2 != probs_[order][history].end()) {
prob = pow(10, (it2->second).first);
(*pdf)[i].first = word;
(*pdf)[i].second += prob;

hainan-xv (Owner):
i'm very confused why you do += prob

int32 word_new = history.back();
HistType::const_iterator last_new = history.end() - 1;
HistType h_new(history.begin(), last_new);
prob = pow(10, GetBackoffWeight(order, word_new, h_new)) *

hainan-xv (Owner):
pow(10, a + b) would be better than pow(10, a) * pow(10, b)
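
The reviewer's point, as a tiny sketch (illustrative function name): because the stored values are log10 probabilities, summing them first and calling `pow` once gives the same product as multiplying two `pow` results, with one fewer transcendental call:

```cpp
#include <cmath>

// Combine a log10 backoff weight and a log10 probability in one pow() call.
double BackoffTimesProb(double log10_bow, double log10_prob) {
  return std::pow(10.0, log10_bow + log10_prob);
}
```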

static const int kPrime = 7853;
};

// Predefine some symbol values, because any integer is as good than any other.

hainan-xv (Owner):
??

typedef std::vector<int32> HistType;
typedef unordered_map<int32, std::pair<BaseFloat, BaseFloat> > WordToProbsMap;
typedef unordered_map<HistType, WordToProbsMap, IntVectorHasher> NgramType;
typedef unordered_map<HistType, BaseFloat, IntVectorHasher> HistWeightsType;

hainan-xv (Owner):
OK change this. I will tell you how.


// this function computes history weights for given histories
// the total weights of histories is 1
HistWeightsType ArpaSampling::ComputeHistoriesWeights(std::vector<HistType> histories) {

hainan-xv (Owner):
const &

}

// Read histories of integers from a file
std::vector<HistType> ArpaSampling::ReadHistories(std::istream &is, bool binary) {

hainan-xv (Owner):
need to change to a void function and move the return into the argument list as a pointer
could do this later
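
A hedged sketch of the requested signature change (illustrative, not the real ArpaSampling interface or its parsing logic): return the histories through an output pointer instead of by value, which matches Kaldi's usual convention for output parameters:

```cpp
#include <istream>
#include <sstream>
#include <string>
#include <vector>

// Read whitespace-separated integer histories, one history per line,
// into an output pointer rather than a returned vector.
void ReadHistories(std::istream &is,
                   std::vector<std::vector<int> > *histories) {
  histories->clear();
  std::string line;
  while (std::getline(is, line)) {
    std::istringstream ss(line);
    std::vector<int> history;
    int word;
    while (ss >> word)
      history.push_back(word);
    if (!history.empty())
      histories->push_back(history);
  }
}
```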

@hainan-xv (Owner) left a comment:
OK I will merge this now. Just remember there are a couple of TODOs:

  1. moving the test code out of the class
  2. making the test binary not require arguments
  3. having separate maps for n-gram probs and backoff weights

@hainan-xv hainan-xv merged commit 9506f72 into hainan-xv:wangs-update Apr 28, 2017
@keli78 keli78 deleted the rnnlm-shortcut branch August 30, 2017 21:35