
Multilingual using modified configs #1290

Merged
merged 62 commits into kaldi-asr:master on May 17, 2017

Conversation

@pegahgh (Contributor) commented Dec 27, 2016

This is a modified multilingual setup based on the new xconfig and training scripts.
In this setup, xconfig is used to create the network configuration for multilingual training.
Egs generation has been moved out of the training script, and the multilingual egs dir is passed to train_raw_dnn.py.
A new script has also been added for average-posterior computation and prior adjustment.

@pegahgh (Contributor Author) commented Dec 27, 2016

@danpovey
I added the modified multilingual training to this branch. I wonder if someone can help me tune the model.
I would be thankful if @vimalmanohar and @jtrmal could review this branch.

@danpovey (Contributor) left a comment:

This combines some code comments with some much more important comments about the use of scps and disk access order.
Please remind me if we had any discussion about this, and whether we agreed on any particular approach. Does the prepare-multilingual-egs script actually re-dump the egs?

I think maybe what I was thinking is that egs from different languages would be merged and randomized in memory and split to multiple archives while maintaining fairly linear access to files. What you are doing seems to have a lot of random-access.

I'd rather get the high-level issues sorted out before talking about the code-level issues again.

@@ -301,7 +125,16 @@ int main(int argc, char *argv[]) {
"feature left-context that we output.");
po.Register("right-context", &right_context, "Can be used to truncate the "
"feature right-context that we output.");

po.Register("weights", &weight_str,
Contributor:

Be more careful with line length.
Also, I really don't like the use of TableReader to read such a tiny file, it wasn't designed for that kind of use, and it could be confusing.
How about an option like
po.Register("scales", &scales_str,
"Can be used to scale the features on a subset of inputs or "
"outputs of the NnetExamples being copied. "
"e.g. --io-scales='output-1=0.4,output-2=1.2'.");

and you could have a function like this:

// Parses the scales for the --scales option, e.g. "output1=0.4,output2=0.9", and outputs
// it as a map. Dies with an error message if there was a problem doing this.
void ParseScales(const std::string scales_str, std::map<std::string, BaseFloat> *scales);

You will want to use the function SplitStringToVector two times in this function, with "," and with "=".
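
A minimal sketch of what that function could look like (illustrative only, not code from this PR; it assumes Kaldi's SplitStringToVector and ConvertStringToReal from util/text-utils.h):

// Sketch: parses e.g. "output1=0.4,output2=0.9" into a map, or dies
// with an error message if the string is malformed.
void ParseScales(const std::string &scales_str,
                 std::map<std::string, BaseFloat> *scales) {
  std::vector<std::string> pairs;
  SplitStringToVector(scales_str, ",", true, &pairs);   // split on ','
  for (size_t i = 0; i < pairs.size(); i++) {
    std::vector<std::string> fields;
    SplitStringToVector(pairs[i], "=", true, &fields);  // split on '='
    BaseFloat scale;
    if (fields.size() != 2 || !ConvertStringToReal(fields[1], &scale))
      KALDI_ERR << "Invalid --scales option: '" << scales_str << "'";
    (*scales)[fields[0]] = scale;
  }
}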

Please make a similar modification to the "outputs" option, like this:
po.Register("rename-io", &rename_io_str,
"This option can be used to rename a subset of inputs or "
"outputs of the NnetExamples being copied. For example: "
"--rename-io='input=input-1,output=output-1'. ");

Contributor:

cancel that previous comment, but can you please register these options with better strings?
E.g. specify that it's the Rspecifier of a table indexed by the key of the input examples.

// or crashes if it is not there.
// e.g. output-0, output-xent
int32 NumOutputIndexes(const NnetExample &eg) {
Contributor:

Actually there is no need to crash if nothing named "output" is found; the
--measure-output-frames option is always false now and will be removed within
a couple of weeks. Just return 1 in that case.
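
A sketch of the suggested behavior (assuming the usual NnetExample layout, where eg.io is a vector of NnetIo members; the exact body may differ):

int32 NumOutputIndexes(const NnetExample &eg) {
  for (size_t i = 0; i < eg.io.size(); i++)
    if (eg.io[i].name == "output")
      return eg.io[i].indexes.size();
  return 1;  // no NnetIo named "output": just return 1, don't crash.
}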

@@ -332,14 +168,61 @@ int main(int argc, char *argv[]) {
int32 index = (random ? Rand() : num_written) % num_outputs;
if (frame_str == "" && left_context == -1 && right_context == -1 &&
Contributor:

To avoid too much duplication of code, you can put the shared code that exists in both branches of the if-statement into a function, or two functions.
Also, if both of these options are not used, please avoid the redundant copy; you can have another if-statement where each branch has Write() on a different variable.


Also computes the minimum and maximum "t" values in the "input" and
"output" NnetIo members.
**/
Contributor:

this function is too special-purpose to be put in a header, as it combines a couple of different functionalities in one.

min_output_t or t > max_output_t.
Will crash if filtering removes all Indexes of "input" or "output".
*/
void FilterExample(const NnetExample &eg,
Contributor:

... I actually think it would be better to leave all these things in nnet3-copy-egs.cc for now... they are all of quite specialized use and they will bulk up the library for no advantage.

@@ -39,7 +39,7 @@ num_utts_subset=300 # number of utterances in validation and training
num_valid_frames_combine=0 # #valid frames for combination weights at the very end.
num_train_frames_combine=10000 # # train frames for the above.
num_frames_diagnostic=4000 # number of frames for "compute_prob" jobs
samples_per_iter=400000 # this is the target number of egs in each archive of egs
samples_per_iter=40000 # this is the target number of egs in each archive of egs
Contributor:

you shouldn't change defaults like this.

@@ -56,6 +56,7 @@ online_ivector_dir= # can be used if we are including speaker information as iV
cmvn_opts= # can be used for specifying CMVN options, if feature type is not lda (if lda,
# it doesn't make sense to use different options than were used as input to the
# LDA transform). This is used to turn off CMVN in the online-nnet experiments.
generate_egs_scp=false # If true, it will generate egs.JOB.*.scp per egs archive
Contributor:

er... did we discuss generating scps? This is rather dangerous from a disk-access-order perspective.

if [ $stage -le 0 ]; then
echo "$0: allocating multilingual examples for training."
# Generate egs.*.scp for multilingual setup.
$cmd $megs_dir/log/allocate_multilingual_examples_train.log \
Contributor:

I didn't realize that you were doing this using scps... not sure if we discussed this... If the training jobs have to effectively do random access, it will be terribly slow and thrash the disk like crazy (or at least use up too much of the memory on the NFS server). Even if you had them sorted nicely so it was just skipping between different archive files but accessing the individual files mostly in sequential order, the table-reading code really isn't optimized for this case; it would be opening and closing files like crazy, and who knows what that will do to NFS caching.

I think it would be better to arrange somehow for this script to merge and maybe shuffle multiple archives of egs in memory and split them to different outputs via nnet3-copy-egs, to create the final archives.
Any renaming and scales could be applied via language-dependent options to language-dependent instances of nnet3-copy-egs (maybe inside pipes in rspecifiers), which would avoid the rather ugly dependence on the key to decide the renaming options and the scale options in nnet3-copy-egs.

@vimalmanohar knows about this stuff... I'm surprised if he would have signed off on this approach.

scaled_mat.Scale(alpha);
cmat.Scale(alpha);
cmat.CopyToMat(&scaled_comp_mat);
scaled_comp_mat.ApproxEqual(scaled_mat, 1.0e-04);
Contributor:

ApproxEqual returns a bool and this would not check anything. You need to check the value.
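
For example, the check could be wrapped in an assertion (a sketch, assuming this is inside a test where KALDI_ASSERT is appropriate):

KALDI_ASSERT(scaled_comp_mat.ApproxEqual(scaled_mat, 1.0e-04));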

@@ -97,6 +97,11 @@ class NnetComputeProb {
// or NULL if there is no such info.
const SimpleObjectiveInfo *GetObjective(const std::string &output_name) const;

// return objective info for all outputs
Contributor:

I think it would be nicer to return a string that would summarize the objectives for different outputs, in a way that could be used more easily by the calling code (and would look nice in the normal case where there was just one output).
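
For instance, a hypothetical helper along these lines (the name, placement, and exact format are illustrative only):

// Hypothetical sketch: returns e.g. "'output-0': -1.23 per frame over 4000 frames."
std::string GetObjectiveSummary() const {
  std::ostringstream os;  // requires <sstream>
  unordered_map<std::string, SimpleObjectiveInfo, StringHasher>::const_iterator
      iter = objf_info_.begin(), end = objf_info_.end();
  for (; iter != end; ++iter) {
    double weight = iter->second.tot_weight,
        per_frame = (weight != 0.0 ? iter->second.tot_objective / weight : 0.0);
    os << "'" << iter->first << "': " << per_frame
       << " per frame over " << weight << " frames. ";
  }
  return os.str();
}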

@danpovey (Contributor):

Hm, I see now in the allocate-egs code that you try hard to keep things linear.
A solution to the problem of constantly opening and closing files is that, instead of individually allocating egs from particular source files, you could allocate them in fairly large blocks -- say, blocks of 50 egs.

Of course the scp files could be quite large in general; they may be 1/10th the size of the actual egs in some setups, or more depending on the path lengths. But this is probably OK: it's not even a common use-case, and 10% is manageable.

I hope your python script is smart about not reading more stuff into memory than it needs.

@danpovey (Contributor):

... After thinking about it, I'm probably OK with the basic pattern of allocating the per-example weights and the output-names-- I see that it's hard to do it efficiently any other way in this scenario, but I think it needs (a) much better and clearer documentation, and (b) better and clearer code.

@pegahgh (Contributor Author) commented Dec 28, 2016 via email

@danpovey (Contributor) commented Feb 1, 2017

has conflicts! Please resolve

@@ -222,11 +224,10 @@ void NnetTrainer::PrintMaxChangeStats() const {
void ObjectiveFunctionInfo::UpdateStats(
const std::string &output_name,
int32 minibatches_per_phase,
int32 minibatch_counter,
Contributor:

I think you should revert this change and do it another way if possible. It's too likely to cause problems elsewhere (e.g. merge conflicts with adversarial training). When you call it with num_minibatches_processed_++, can't you just delay incrementing num_minibatches_processed_ until later on?
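
That is, something like the following at the call site (a sketch of the suggested pattern, not the actual code in the PR):

// Pass the current value instead of post-incrementing in the call:
objf_info_[io.name].UpdateStats(io.name, config_.print_interval,
                                num_minibatches_processed_,
                                tot_weight, tot_objf);
// ... then, after all outputs of this minibatch have been processed:
num_minibatches_processed_++;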

Contributor Author:

I think the previous method was wrong, since it assumes a single output in the egs. The phase should be computed w.r.t. num_minibatches per output, not the total num_minibatches.

for (; iter != end; ++iter) {
const NnetIo &io = *iter;
io_count++;
Contributor:

it looks like this variable is never read.

from __future__ import print_function
import re, os, argparse, sys, math, warnings, random, io, imp
import numpy as np

Contributor:

this file should not have been included in your patch, I think something went wrong resolving the merge.

# The last dir corresponds to multilingual egs directory,
# that is generated using this script, but
# it requires single egs dirs to exist.
multi_egs_dir = args.egs_dir.split()
Contributor:

I don't really like this script having to deal with the multilingual egs. I was thinking of moving the egs-generation out of the train.py scripts anyway, as it's kind of unrelated to the rest of what they do.
Can't we just insist that in the multilingual case, the egs must already have been generated, and the dir must be supplied using the common-egs-dir option?

Contributor Author:

Sure, I think that is a better idea! I planned to do that!

Contributor Author:

It seems I already generate egs outside train_raw.py for the multilingual recipe local/nnet3/run_tdnn_multilingual.sh (I should revert these changes!) and pass it as egs-dir to train.py.

dest='use_multitask_egs',
default=False, choices=["true", "false"],
help='If true, it is assumed there are different examples '
'correspond to multiple tasks in output layer.')
Contributor:

Firstly, there is a merge problem here. This code has been refactored and now lives elsewhere, in steps/libs/. Vimal can help you.
But this help message is not great.
Actually I think a better thing would be to get rid of this option altogether, to insist that the egs be supplied externally, and to have the script itself work out somehow whether the egs directory supplied is of the "multilingual" type, or multitask type (let's have a consistent name at the script level, either would be OK).

Contributor Author:

OK, I can add a file num_langs in the multilingual egs dir! Would that be good?

@pegahgh (Contributor Author) commented Feb 2, 2017:

The idea is that we compute the average posterior and adjust the prior inside the train_*.py script, but in the multilingual setting we need to compute and adjust it per output, so I moved this computation outside train_raw_dnn.py!
I don't think it should really be part of the training script.
@vimalmanohar do you have any suggestions?

Contributor Author:

The comment is not about the multitask option; it is about adding the separate python script egs/wsj/s5/steps/nnet3/compute_and_adjust_priors.py.

@danpovey (Contributor) commented Feb 2, 2017 via email

@danpovey (Contributor) commented Feb 2, 2017 via email

@danpovey (Contributor) commented Feb 2, 2017 via email

@danpovey (Contributor) commented Feb 2, 2017 via email

@danpovey (Contributor) commented Feb 2, 2017 via email

. ./cmd.sh
set -e
stage=4
use_flp=false
@jtrmal (Contributor) commented May 3, 2017:

please delete use_flp (we don't use it anymore). Well, at least we should not.

multi_data_dirs[$lang_index]=data/${lang_list[$lang_index]}/train${suffix}${feat_suffix}
multi_egs_dirs[$lang_index]=exp/${lang_list[$lang_index]}/nnet3/egs${ivector_suffix}
multi_ali_dirs[$lang_index]=exp/${lang_list[$lang_index]}/${alidir}${suffix}
multi_ivector_dirs[$lang_index]=exp/${lang_list[$lang_index]}/nnet3/ivectors_train${suffix}${ivector_suffix}
Contributor:

this should probably be

multi_ivector_dirs[$lang_index]=exp/${lang_list[$lang_index]}/nnet3/ivectors_train${suffix}${feat_suffix}${ivector_suffix}

@jtrmal (Contributor) commented May 3, 2017

I'm getting an error:

./local/nnet3/run_tdnn_multilingual.sh: creating multilingual neural net configs using the xconfig parser
steps/nnet3/xconfig_to_configs.py --xconfig-file exp/nnet3/multi_bnf_sp/configs/network.xconfig --config-dir exp/nnet3/multi_bnf_sp/configs/ --nnet-edits=rename-node old-name=output-0 new-name=output
Traceback (most recent call last):
  File "steps/nnet3/xconfig_to_configs.py", line 309, in <module>
    main()
  File "steps/nnet3/xconfig_to_configs.py", line 304, in main
    check_model_contexts(args.config_dir, args.nnet_edits)
  File "steps/nnet3/xconfig_to_configs.py", line 287, in check_model_contexts
    if ((contexts['init']['left-context'] > contexts['ref']['left-context'])
KeyError: 'left-context'

When I run the script a second time, the error disappears. After deleting the directory exp/nnet3/multi_bnf_sp/configs/ and running the script again, the error resurfaces.

@jtrmal (Contributor) commented May 4, 2017 via email

fi

mfccdir=mfcc_hires/$lang
steps/make_mfcc.sh --nj $my_nj --mfcc-config conf/mfcc_hires.conf \
Contributor:

this should be make_mfcc or make_mfcc_pitch_online depending on the use_pitch variable

@jtrmal (Contributor) commented May 9, 2017

Pegah, this is still an issue:

steps/nnet3/xconfig_to_configs.py --xconfig-file exp/nnet3/multi_bnf_sp/configs/network.xconfig --config-dir exp/nnet3/multi_bnf_sp/configs/ --nnet-edits=rename-node old-name=output-0 new-name=output
Traceback (most recent call last):
  File "steps/nnet3/xconfig_to_configs.py", line 309, in <module>
    main()
  File "steps/nnet3/xconfig_to_configs.py", line 304, in main
    check_model_contexts(args.config_dir, args.nnet_edits)
  File "steps/nnet3/xconfig_to_configs.py", line 287, in check_model_contexts
    if ((contexts['init']['left-context'] > contexts['ref']['left-context'])
KeyError: 'left-context'

When running the same thing again, the error disappears.

@pegahgh (Contributor Author) commented May 10, 2017

@jtrmal I checked xconfig and the problem is fixed. I also fixed other issues, e.g. the decoding script, and reran the experiment.

@danpovey (Contributor) left a comment:

These comments are small; can you please address them right away? Vimal says that his speech-silence classification stuff depends on this.
I notice that you have removed egs/hkust/s5/local/ext/195k_chinese_word2char_map. I think this may be unintentional (probably that file shouldn't be in the repo, but that's a separate issue).

There will be a bunch of merge conflicts when we merge to kaldi_52 but it might be easiest for me to resolve them myself.

@@ -235,5 +235,16 @@ const SimpleObjectiveInfo* NnetComputeProb::GetObjective(
return NULL;
}

double NnetComputeProb::GetTotalObjective(double *tot_weight) const {
double tot_objectives = 0.0;
unordered_map<std::string, SimpleObjectiveInfo, StringHasher>::const_iterator
Contributor:

you should set *tot_weight = 0 at this point.
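
That is, the fixed function might look like this (a sketch based on the quoted code, assuming objf_info_ holds the per-output SimpleObjectiveInfo totals):

double NnetComputeProb::GetTotalObjective(double *tot_weight) const {
  double tot_objf = 0.0;
  *tot_weight = 0.0;  // initialize before accumulating.
  unordered_map<std::string, SimpleObjectiveInfo, StringHasher>::const_iterator
      iter = objf_info_.begin(), end = objf_info_.end();
  for (; iter != end; ++iter) {
    tot_objf += iter->second.tot_objective;
    *tot_weight += iter->second.tot_weight;
  }
  return tot_objf;
}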

// returns the objective-function info for this output name (e.g. "output"),
// or NULL if there is no such info.
const SimpleObjectiveInfo *GetObjective(const std::string &output_name) const;

// It returns the objf sum, and it outputs the corresponding total weight
// to '*tot_weight'.
// NnetCombiner::ComputeObjfAndDerivFromNnet() would use this function instead.
Contributor:

The sentence "NnetCombiner::ComputeObjfAndDerivFromNnet() would use this function instead" is not good documentation: (1) instead of what? (2) this documents the change, that's what git commit comments are for; it should document what the function does currently.

Change the documentation to:
This function returns the total objective over all output nodes recorded here, and outputs to 'tot_weight' the total weight (typically the number of frames) corresponding to it; you would divide the return value by 'tot_weight' to get a normalized objective that would be easily interpretable by a human.

@@ -283,6 +289,9 @@ class GeneralMatrix {
/// Implemented in ../cudamatrix/cu-sparse-matrix.cc
void AddToMat(BaseFloat alpha, CuMatrixBase<BaseFloat> *cu_mat,
MatrixTransposeType trans = kNoTrans) const;

/// scale each element of matrix with a scalar value.
Contributor:

change 'with a scalar value' -> 'by alpha'. scale->Scale.

@@ -37,8 +37,8 @@ int main(int argc, char *argv[]) {
"compute-kaldi-pitch-feats --sample-frequency=8000 scp:wav.scp ark:- \n"
"\n"
"See also: process-kaldi-pitch-feats, compute-and-process-kaldi-pitch-feats\n";

Contributor:

there are blank spaces added at the ends of lines in this file and in others. Can you please figure out how to automatically get rid of them? You should change your editor settings too.

Contributor Author:

My editor shows blank spaces at the ends of lines, and I removed them automatically from lots of files.

@@ -287,6 +287,20 @@ def train(args, run_opts, background_process_handler):
num_hidden_layers, num_archives_expanded,
args.max_models_combine, args.add_layers_period,
args.num_jobs_final)
use_multitask_egs = False
Contributor:

you can remove the line: use_multitask_egs = False

use_multitask_egs=False):
""" Generates egs option for multitask(or multilingual) training setup,
if {egs_prefix}output.*.ark or {egs_prefix}weight.*.ark files exists in egs_dir.
Each example line in {egs_prefix}*.scp has corresponding line containing
Contributor:

example line -> line
has -> has a

@pegahgh (Contributor Author) commented May 15, 2017

I will fix them right now.

@jtrmal (Contributor) commented May 16, 2017

BTW, Pegah, you still train the ivector extractor on the whole MFCC+pitch features when pitch=true

@pegahgh (Contributor Author) commented May 16, 2017

@jtrmal yes Yenda, I do that when pitch is true, and it doesn't degrade the performance. Did you get any degradation?

@danpovey (Contributor) commented May 16, 2017 via email

@pegahgh (Contributor Author) commented May 16, 2017

@danpovey I will remove it if you want, but it can improve the setup for some languages! We use online pitch extraction to extract pitch, and there should not be a compatibility issue!

@danpovey (Contributor) commented May 16, 2017 via email

@pegahgh (Contributor Author) commented May 16, 2017

@danpovey, my result is based on a multilingual system with two languages (Assamese and Cantonese), but I can try it on a setup with more languages.
Should I remove pitch for ivector extraction, or do more experiments before removing it?

/// scale each element of matrix with a scalar value.

/// Scale each element of matrix by apla.
Contributor:

you introduced a typo here.

// This function returns the total objective over all output nodes recorded here, and
// outputs to 'tot_weight' the total weight (typically the number of frames)
// corresponding to it.
// NnetCombiner::ComputeObjfAndDerivFromNnet() would use this function instead.
Contributor:

As per my previous comment -- please remove this line of the comment.

@danpovey (Contributor) left a comment:

Some small comments, but please address them right away along with my previous two comments.

@@ -22,7 +22,7 @@ echo "$0 $@" # Print the command line for logging
if [ -f path.sh ]; then . ./path.sh; fi
. parse_options.sh || exit 1;

if [ $# -ne 5 ]; then
if [ $# -lt 3 ] || [ $# -gt 5 ]; then
echo "usage: $0 [options] <selector> <src-data-dir> <dest-data-dir> <log-dir> <path-to-storage-dir>";
Contributor:

You have changed the usage but you have not adjusted the usage message appropriately.

if (objf_info == NULL)
double tot_weights,
tot_objf = prob_computer_->GetTotalObjective(&tot_weights);
if (tot_objf == 0.0)
Contributor:

This if-statement and the associated error message it prints are not suitable; this should not be an error. Please remove both.

@@ -57,10 +57,9 @@ bool IsSimpleNnet(const Nnet &nnet) {
// "input" and everything checks out.
if (NumInputNodes(nnet) == 1)
return true;
// Otherwise, there should be 2 inputs and one
// Otherwise, there should be input node with name input and one
Contributor:

the second "input" should be in double quotes.

@danpovey merged commit 7af2128 into kaldi-asr:master on May 17, 2017
@danpovey (Contributor):

Thanks! Merged.
