Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

bug fixes for KWS setup in babel recipe, generation of confusion matr… #3951

Closed
wants to merge 1 commit into from

Conversation

thoshith-s
Copy link
Contributor

bug fixes for KWS setup in babel recipe
1 . generation of confusion matrix (compute-wer compatibility issue) stats out is required.
2 . writing results local/kws_search.sh and utils/write_kwslist.pl

…ix(compute-wer compatibility issue) and writing results
@danpovey
Copy link
Contributor

Thanks... I think @jtrmal should check this.

@thoshith-s
Copy link
Contributor Author

thoshith-s commented Feb 22, 2020

Welcome. Below are the comments which are helpful in reviewing the above commit.
In babel recipe,

  1. local/generate_confusion_matrix.sh line number 67, compute-wer function is used which passes three arguments which leads to compatibility issue with current version of compute-wer

  2. In local/kws_search.sh, results file are in gzip format. But the current version expects it to be in text format

  3. In local/kws_search.sh, stage 2 and 3 map-utter is passed with utter_map file which is to be replaced with utter_id
    utter_map:
    utterance-id utterance-id
    utter-id:
    utterance-id seq-id
    eg: 1-2000-890.wav 29

  4. In utils/write_kwslist.pl, the utter mapping is read as --> which should be
    -->
    $utter_mapper{$col[0]} = $col[1];
    ... changed to
    $utter_mapper{$col[1]} = $col[0];

  5. In utils/write_kwslist.pl, line 196-217,
    Here, segment file is based on utterance-id not on the seq-id, when tried to get start time using seq-id, it will lead to error.
    Therefore first your seq-id is to be updated with utterance-id to continue processing.

@stale
Copy link

stale bot commented Jun 19, 2020

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@stale stale bot added the stale Stale bot on the loose label Jun 19, 2020
@stale
Copy link

stale bot commented Jul 19, 2020

This issue has been automatically closed by a bot strictly because of inactivity. This does not mean that we think that this issue is not important! If you believe it has been closed hastily, add a comment to the issue and mention @kkm000, and I'll gladly reopen it.

@stale stale bot closed this Jul 19, 2020
@kkm000 kkm000 reopened this Jul 19, 2020
@stale stale bot removed the stale Stale bot on the loose label Jul 19, 2020
@kkm000
Copy link
Contributor

kkm000 commented Jul 21, 2020

@jtrmal, could you PTAL? Dan though you were the right person to review this.

@kkm000 kkm000 requested a review from jtrmal July 21, 2020 21:21
@jtrmal
Copy link
Contributor

jtrmal commented Jul 22, 2020

put on my todo

@jtrmal
Copy link
Contributor

jtrmal commented Jul 22, 2020

yes, LGTM

@stale
Copy link

stale bot commented Sep 20, 2020

This issue has been automatically marked as stale by a bot solely because it has not had recent activity. Please add any comment (simply 'ping' is enough) to prevent the issue from being closed for 60 more days if you believe it should be kept open.

@stale stale bot added the stale Stale bot on the loose label Sep 20, 2020


template<typename T>
void PrintAlignmentStats(const std::vector<T> &ref,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't see this capability/function used in any external tool. Plus I'm sure we already do have this capability. So could you please remove this from this PR and eventually, if you think it's a "worthy" change, create a new PR, please?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jtrmal, could you please help us find the tool? Interesting, once I sent a PR to print similar data from the very same program at verbosity 2+, only in the raw form, and you then told me we already have that. I remember I could not find it.

@stale stale bot removed the stale Stale bot on the loose label Nov 24, 2020
@kkm000 kkm000 self-assigned this Nov 25, 2020
@kkm000 kkm000 added the waiting-for-feedback Reporter's feedback has been requested label Nov 25, 2020
@jtrmal
Copy link
Contributor

jtrmal commented Nov 25, 2020 via email

@stale
Copy link

stale bot commented Jan 24, 2021

This issue has been automatically marked as stale by a bot solely because it has not had recent activity. Please add any comment (simply 'ping' is enough) to prevent the issue from being closed for 60 more days if you believe it should be kept open.

@stale stale bot added the stale Stale bot on the loose label Jan 24, 2021
@stale stale bot removed the stale Stale bot on the loose label Sep 22, 2021
@kkm000 kkm000 added in progress Issue has been taken and is being worked on stale-exclude Stale bot ignore this issue and removed waiting-for-feedback Reporter's feedback has been requested labels Sep 22, 2021
@kkm000
Copy link
Contributor

kkm000 commented Sep 22, 2021

Looks like this is nearly there but abandoned. I'll take over, the changes seem trivial. @jtrmal, you ok with that?

@jtrmal
Copy link
Contributor

jtrmal commented Sep 22, 2021 via email

@kkm000
Copy link
Contributor

kkm000 commented Sep 22, 2021

Bingo!

I'm trying to grok the difference between en_US.UTF-8 and C.UTF-8. ATM the main dissimilarity is that I have the latter on my system but not the former, but I'm trying to bridge the gap...

@jtrmal
Copy link
Contributor

jtrmal commented Sep 22, 2021 via email

@kkm000
Copy link
Contributor

kkm000 commented Sep 22, 2021

@jtrmal ah, I was thinking about a note on the different PR. This is bizarre: https://github.com/kaldi-asr/kaldi/pull/4130/files#diff-683416a3e05ed8af8425a06a31b0f2e4566d71dff67b3aae0f970427f834d58dR130

grep using the C locale finds non-printable non-spaces in Turkish text, but when using the EN locale, doesn't. UTF-8 operates on codepoints, and Unicode categories are assigned to them, AFAIK, in a locale-independent way w.r.t spacing and being printable. E.g., the codepoint U+001F (called US, "terminate and justify line", like stretch it from left to right margin) is a generic Control (C), and also more specific Cc, which is non-printable, but I'm not 100% sure that it's a non-space. But this is all language-independent! Ll and Lu may be locale-dependent, but I'm unsure. Locale affects collation (C : codepoint -> integer for sorting) and case equivalence classes, but should not make codepoints invisible.

And no one provided a repro.

@jtrmal
Copy link
Contributor

jtrmal commented Sep 22, 2021 via email

@kkm000
Copy link
Contributor

kkm000 commented Sep 23, 2021

On my Debian 11 distro, there is no difference.

From /usr/share/i18n/locales/C:

LC_CTYPE
copy "i18n"
translit_start
include "translit_combining";""
translit_end
END LC_CTYPE

From /usr/share/i18n/locales/en_US:

LC_CTYPE
copy "en_GB"
END LC_CTYPE

From /usr/share/i18n/locales/en_GB:

LC_CTYPE
copy "i18n"
translit_start
include "translit_combining";""
translit_end
END LC_CTYPE

Both end up with copy "i18n" and include "translit_combining", the same files. LANG=C.UTF-8 is correct, its LC_TYPE is very generic. Their distro may be borked, but fixing that is beyond our control.

The translit_combining removes all diacritics. All like really all, including U+00011A36: ZANABAZAR SQUARE SIGN CANDRABINDU WITH ORNAMENT (remind of it to the ppl who think that Czech diacritics are weird). This should not affect Turkish printable/non-printable w.r.t. the LC_TYPE definition, whether the code is normalized or not.

FWIW, I see no problem with UTF-8 text in Turkish, unless I drop down to ASCII:

kkm@buba:~/.tmp$ LANG=en_US.UTF-8 grep '[^[:print:][:space:]]' futbol.txt
kkm@buba:~/.tmp$ LANG=C.UTF-8 grep '[^[:print:][:space:]]' futbol.txt
kkm@buba:~/.tmp$ LANG=C grep '[^[:print:][:space:]]' futbol.txt
Futbol, on birer oyuncudan olu▒▒an iki tak▒▒m aras▒▒nda, kendine ▒▒zg▒▒ k▒▒resel bir topla oynanan tak▒▒m
sporudur. 21. y▒▒zy▒▒l itibar▒▒yla 200'▒▒n ▒▒zerinde ▒▒lkede 250 milyonu a▒▒k▒▒n oyuncu taraf▒▒ndan
....

@kkm000
Copy link
Contributor

kkm000 commented Sep 23, 2021

@jtrmal, I was thinking about #4193. Sorry, this ticket is unrelated.

@jtrmal
Copy link
Contributor

jtrmal commented Sep 23, 2021 via email

@danpovey
Copy link
Contributor

I think the issue is that not all systems have all locales available, and it probably defaults to C if you choose one that is not available.
See this PR:
https://github.com/kaldi-asr/kaldi/pull/4612/files

@kkm000
Copy link
Contributor

kkm000 commented Sep 24, 2021

Every Linux system necessarily has at least one broken locale. I've seen LC_ALL=en_US.UTF-8 not working, with diagnostics like "cannot set locale LC_TYPE", while LANG=en_US.UTF-8 did work and set it just fine. I would rather use LANG= or LC_CTYPE= in this script than LC_ALL. The less you ask of a locale, the safer the code is. The complete list of locale facets include a lot of bizarre stuff, like postal address formatting and prefixes for "Mr.", "Ms." and unknown sex addressee. Later CLDR releases include locally adopted blood glucose measurement units.

en_US.UTF-8 is ambiguous in many ways, as it can be asked nonsensical questions. Is GREEK QUESTION MARK a printable character? American English does not have it. Up to the locale's designer, I guess. In Debian and therefore Ubuntu, it is: en_US and C both extend the category definitions to the full Unicode codepoint set. I'd say that if a system does not have the "neutral" C.UTF-8 locale extended with a complete Unicode character definition (which is not at all guaranteed), everything else is even a worse guesswork.

Oh, and amazingly, ISO 8859-7 Latin+Greek 8-bit encoding does define the Greek question mark...

Also, my locale -a prints C.UTF-8 and en_US.utf8. All case variants and both utf8 and utf-8 are accepted on input, but not canonicalized when printed. So the locale -a | grep is also error-prone in #4612.

jtrmal added a commit that referenced this pull request Sep 24, 2021
* bug fixes for KWS setup in babel recipe, generation of confusion matrix(compute-wer compatibility issue) and writing results

* removing compute-wer.cc changes

Co-authored-by: thoshith-s <thoshith.thoshi@gmail.com>
Co-authored-by: Jan 'Yenda' Trmal <jtrmal@gmail.com>
@jtrmal
Copy link
Contributor

jtrmal commented Sep 24, 2021

cresolved via #4633

@jtrmal jtrmal closed this Sep 24, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
in progress Issue has been taken and is being worked on stale-exclude Stale bot ignore this issue
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants