
Arpa-reading #8

Merged
6 commits merged on Apr 28, 2017

Conversation

@keli78 commented Mar 28, 2017

No description provided.

if (tokens.size() == 0) continue;
if (tokens.size() == 2 && tokens[0] == "ngram") {
std::string substring = tokens[1].substr(2);
int32 count = std::stoi(substring); // get "123456" from "1=123456"

hainan-xv (Owner):
don't use stoi
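
A sketch of what stoi-free parsing can look like (in Kaldi itself, `ConvertStringToInteger()` from util/text-utils.h is the usual choice; `ParseInt32` below is an illustrative name, not the actual code). `std::stoi` throws on malformed input and silently ignores trailing junk, while `strtol` lets both conditions be checked explicitly:

```cpp
#include <cstdlib>
#include <string>

// Parse a base-10 integer with explicit error checking instead of std::stoi.
bool ParseInt32(const std::string &s, int *out) {
  char *end = NULL;
  long v = std::strtol(s.c_str(), &end, 10);
  if (end == s.c_str() || *end != '\0')
    return false;  // no digits consumed, or trailing non-digit characters
  *out = static_cast<int>(v);
  return true;
}
```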

std::cout << "OOV found in history: " << tokens[i] << std::endl;
}
}
assert (history.size() == order_current - 1);

hainan-xv (Owner):
KALDI_ASSERT

if (it != vocab_.end()) {
history.push_back(it->second);
} else {
std::cout << "OOV found in history: " << tokens[i] << std::endl;

hainan-xv (Owner):
KALDI_LOG


float NgramModel::GetProb(int32 order, const int32 word, const HistType& history) {
float prob = 0.0;
auto it = probs_[order - 1].find(history);

hainan-xv (Owner):
no auto.

OK, please check ALL my comments on your previous pull requests. You're making the same mistakes again.

void NgramModel::ReadARPAModel(char* file) {
std::ifstream data_input(file);
if (!data_input.is_open()) {
std::cerr << "error opening '" << file

hainan-xv (Owner):
KALDI_ERROR or something

@hainan-xv (Owner) left a comment:
OK, I have seen a lot of issues that I commented on before but that you're still doing. Please fix them and I will do another review.

if (u >= 0 && u < cdf[1].second) {
return cdf[0].first;
}
for (int32 i = 1; i < num_words_; i++) {

hainan-xv (Owner):
change this to a binary search. Check my SelectOne() function in rnnlm-utils.cc for reference if you need.
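
A hedged sketch of the binary search the reviewer asks for (illustrative names; the actual reference is `SelectOne()` in rnnlm-utils.cc). If `cdf[i]` holds the cumulative probability of words `0..i`, sampling reduces to finding the first index whose cumulative value exceeds `u`, which `std::upper_bound` does in O(log n) instead of the O(n) linear scan:

```cpp
#include <algorithm>
#include <vector>

// Sample a word id from a cumulative distribution with binary search.
// cdf must be non-decreasing and end at (approximately) 1.0.
int SampleWord(const std::vector<float> &cdf, float u) {
  std::vector<float>::const_iterator it =
      std::upper_bound(cdf.begin(), cdf.end(), u);  // first entry > u
  return static_cast<int>(it - cdf.begin());
}
```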

probs = std::make_pair(lower, upper);
cdf.push_back(probs);
}
float u = 1.0 * rand()/RAND_MAX;

hainan-xv (Owner):
rand
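
One way to replace `rand()/RAND_MAX`, as a sketch (illustrative names, not the actual Kaldi code): `std::mt19937` is a well-defined generator that can be seeded explicitly, and `std::uniform_real_distribution` yields `u` in `[0, 1)` without the granularity and low-bit quality problems of `rand()`:

```cpp
#include <random>

// Draw a uniform float in [0, 1) from an explicitly seeded generator.
float DrawUniform(std::mt19937 *rng) {
  std::uniform_real_distribution<float> dist(0.0f, 1.0f);
  return dist(*rng);
}
```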

@keli78 (Author) commented Mar 29, 2017

Hi Hainan, you can ignore the oldest commit; sorry, I cannot delete it.
I have fixed all the issues you mentioned above (except for the binary search) in later commits. You can check commits 1) "sample a word version 2 (use fst Symbol table and kaldi io)" and 2) "fix a bug in computing weights of histories".

@keli78 changed the title from Rnnlm shortcut to Arpa-reading on Apr 6, 2017
typedef int32_t int32;

/// A hashing function-object for vectors of ints.
struct IntVectorHasher { // hashing function for vector<Int>.

hainan-xv (Owner):
Is this copied from another file in kaldi?

keli78 (Author):
Oh this is just an implementation of the VectorHasher in Kaldi. I will just import that file.

typedef unordered_map<HistType, WordToProbsMap, IntVectorHasher> NgramType;
typedef unordered_map<HistType, BaseFloat, IntVectorHasher> HistWeightsType;

class ArpaSampling : public ArpaFileParser {

hainan-xv (Owner):
You need to move all testing-related function/variable out of the class. Write them in the test cc file.

hainan-xv (Owner):
What I mean is you should not write a testing function as a member function of a class.

keli78 (Author):
Sure, got it.

const char *usage = "";
ParseOptions po(usage);
po.Read(argc, argv);
std::string arpa_file = po.GetArg(1), history_file = po.GetArg(2);

hainan-xv (Owner):
this is probably OK now but you need to change it so it doesn't need cmd arguments


BaseFloat ArpaSampling::GetProb(int32 order, int32 word, const HistType& history) {
BaseFloat prob = 0.0;
auto it = probs_[order - 1].find(history);

hainan-xv (Owner):
auto?


namespace kaldi {

void ArpaSampling::ConsumeNGram(const NGram& ngram) {

hainan-xv (Owner):
please add comments to functions to describe what they do

BaseFloat bow = 0.0;
auto it = probs_[order - 1].find(history);
if (it != probs_[order - 1].end()) {
auto it2 = probs_[order - 1][history].find(word);

hainan-xv (Owner):
no auto please

}

void ArpaSampling::PrintHist(const HistType& h) {
KALDI_LOG << "Current hist is: ";

hainan-xv (Owner):
run this function and you'll see that it has a problem.

std::string unk_symbol_;

// Vocab
std::vector<std::pair<std::string, int32> > vocab_;

hainan-xv (Owner):
why do you need this instead of a SymbolTable?

keli78 (Author):
So should I use SymbolTable as vocab?

HistWeightsType hists_weights_;

// The given N Histories
std::vector<HistType> histories_;

hainan-xv (Owner):
you should NOT store histories_ in this class.

hainan-xv (Owner):
This class should only store information about the ngram model (read from the arpa file). Histories should just be a parameter you pass in order to get the prob-distributions.

void ArpaSampling::ComputeWeightedPdf(std::vector<std::pair<int32, BaseFloat> >* pdf_w) {
BaseFloat prob = 0;
(*pdf_w).clear();
(*pdf_w).resize(num_words_); // if do not do this, (*pdf_w)[word] += prob will get seg fault

hainan-xv (Owner):
delete this comment. it's so obvious

history.push_back(word);
}
if (history.size() >= ngram_order_) {
std::reverse(history.begin(), history.end());

hainan-xv (Owner):
this is an extremely inefficient way of doing things. please make it more efficient.

keli78 (Author):
Sure. Thanks for this comment.
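
One possible fix for the push_back-then-`std::reverse` pattern, as a sketch (illustrative names, not the actual code): keep the history in chronological order as it is built by appending the newest word at the back of a `std::deque` and dropping the oldest word from the front once the window exceeds the maximum history length:

```cpp
#include <deque>

// Maintain a sliding history window without ever reversing it.
void AddWordToHistory(int word, int max_hist_len, std::deque<int> *history) {
  history->push_back(word);          // newest word goes at the back
  while (static_cast<int>(history->size()) > max_hist_len)
    history->pop_front();            // drop the oldest word; O(1) in a deque
}
```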

KALDI_LOG << "Expected number of " << (i + 1) << "-grams: " << ngram_counts_[i];
for (auto it1 = probs_[i].begin(); it1 != probs_[i].end(); ++it1) {
HistType h(it1->first);
for (auto it2 = (probs_[i])[h].begin(); it2 != (probs_[i])[h].end(); ++it2) {

hainan-xv (Owner):
no need to do (v[i])[j] --- just use v[i][j]

if (it != probs_[order].end()) {
auto it2 = probs_[order][history].find(word);
if (it2 != probs_[order][history].end()) {
prob = pow(10, (it2->second).first);

hainan-xv (Owner):
no need to do (i->second).first -- just do i->second.first

if (it2 != probs_[order][history].end()) {
prob = pow(10, (it2->second).first);
(*pdf)[i].first = word;
(*pdf)[i].second += prob;

hainan-xv (Owner):
i'm very confused why you do += prob

int32 word_new = history.back();
HistType::const_iterator last_new = history.end() - 1;
HistType h_new(history.begin(), last_new);
prob = pow(10, GetBackoffWeight(order, word_new, h_new)) *

hainan-xv (Owner):
pow(10, a + b) would be better than pow(10, a) * pow(10, b)
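
The reviewer's point, as a tiny sketch (illustrative function name): because the stored values are log10 probabilities, summing them first and calling `pow` once gives the same product as multiplying two `pow` results, with one fewer transcendental call:

```cpp
#include <cmath>

// Combine a log10 backoff weight and a log10 probability in one pow() call.
double BackoffTimesProb(double log10_bow, double log10_prob) {
  return std::pow(10.0, log10_bow + log10_prob);
}
```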

static const int kPrime = 7853;
};

// Predefine some symbol values, because any integer is as good than any other.

hainan-xv (Owner):
??

typedef std::vector<int32> HistType;
typedef unordered_map<int32, std::pair<BaseFloat, BaseFloat> > WordToProbsMap;
typedef unordered_map<HistType, WordToProbsMap, IntVectorHasher> NgramType;
typedef unordered_map<HistType, BaseFloat, IntVectorHasher> HistWeightsType;

hainan-xv (Owner):
OK change this. I will tell you how.


// this function computes history weights for given histories
// the total weights of histories is 1
HistWeightsType ArpaSampling::ComputeHistoriesWeights(std::vector<HistType> histories) {

hainan-xv (Owner):
const &

}

// Read histories of integers from a file
std::vector<HistType> ArpaSampling::ReadHistories(std::istream &is, bool binary) {

hainan-xv (Owner):
need to change to a void function and move the return into the argument list as a pointer
could do this later
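
A hedged sketch of the requested signature change (illustrative, not the real ArpaSampling interface or its parsing logic): return the histories through an output pointer instead of by value, which matches Kaldi's usual convention for output parameters:

```cpp
#include <istream>
#include <sstream>
#include <string>
#include <vector>

// Read whitespace-separated integer histories, one history per line,
// into an output pointer rather than a returned vector.
void ReadHistories(std::istream &is,
                   std::vector<std::vector<int> > *histories) {
  histories->clear();
  std::string line;
  while (std::getline(is, line)) {
    std::istringstream ss(line);
    std::vector<int> history;
    int word;
    while (ss >> word)
      history.push_back(word);
    if (!history.empty())
      histories->push_back(history);
  }
}
```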

@hainan-xv (Owner) left a comment:
OK I will merge this now. Just remember there are a couple of TODOs:

  1. moving the test code out of the class
  2. making the test binary not require arguments
  3. having separate maps for n-gram probs and backoff weights

@hainan-xv hainan-xv merged commit 9506f72 into hainan-xv:wangs-update Apr 28, 2017
@keli78 keli78 deleted the rnnlm-shortcut branch August 30, 2017 21:35