bug fixes for KWS setup in babel recipe, generation of confusion matrix(compute-wer compatibility issue) and writing results #3951

thoshith-s · 2020-02-22T11:48:12Z

Bug fixes for the KWS setup in the babel recipe:
1. Generation of the confusion matrix (compute-wer compatibility issue); the stats output is required.
2. Writing results in local/kws_search.sh and utils/write_kwslist.pl.

danpovey · 2020-02-22T12:36:04Z

Thanks... I think @jtrmal should check this.

thoshith-s · 2020-02-22T13:36:13Z

Welcome. The comments below should help in reviewing the above commit.
In the babel recipe:

In local/generate_confusion_matrix.sh, line 67, compute-wer is called with three arguments, which is incompatible with the current version of compute-wer.
In local/kws_search.sh, the results files are in gzip format, but the current version expects them to be in text format.
In local/kws_search.sh, stages 2 and 3, map-utter is passed the utter_map file; it should be passed utter_id instead.
utter_map:
utterance-id utterance-id
utter_id:
utterance-id seq-id
e.g.: 1-2000-890.wav 29
In utils/write_kwslist.pl, the utterance mapping is read with key and value swapped, so
$utter_mapper{$col[0]} = $col[1];
is changed to
$utter_mapper{$col[1]} = $col[0];
In utils/write_kwslist.pl, lines 196-217, the segments file is keyed by utterance-id, not by seq-id, so trying to get the start time by seq-id leads to an error.
Therefore the seq-id must first be mapped back to the utterance-id before processing continues.
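
The direction of the mapping can be illustrated with a minimal sketch, written in Python rather than the Perl of utils/write_kwslist.pl. It assumes an utter_id file with one "utterance-id seq-id" pair per line, as in the example above; the function name `read_utter_map`, the second sample line and the segments table are hypothetical and not taken from the recipe.

```python
import io

def read_utter_map(lines):
    """Build a seq-id -> utterance-id lookup from 'utterance-id seq-id' lines."""
    mapper = {}
    for line in lines:
        cols = line.split()
        if len(cols) != 2:
            raise ValueError(f"bad line in utter_id file: {line!r}")
        # Key on the seq-id (second column), mirroring
        # $utter_mapper{$col[1]} = $col[0];
        mapper[cols[1]] = cols[0]
    return mapper

# Hypothetical utter_id contents; the first line follows the example above.
utter_id = io.StringIO("1-2000-890.wav 29\n1-2000-891.wav 30\n")
utter_mapper = read_utter_map(utter_id)

# Hypothetical segments table keyed by utterance-id:
# (recording name, start time in seconds).
segments = {"1-2000-890.wav": ("rec1", 12.5)}

seq_id = "29"
utterance_id = utter_mapper[seq_id]         # map the seq-id back first
recording, start = segments[utterance_id]   # then look up the segment
print(utterance_id, recording, start)
```

With the original key order, `utter_mapper["29"]` would not exist and the segment lookup could not proceed, which matches the failure described above.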

stale · 2020-06-19T06:36:09Z

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

stale · 2020-07-19T06:23:57Z

This issue has been automatically closed by a bot strictly because of inactivity. This does not mean that we think that this issue is not important! If you believe it has been closed hastily, add a comment to the issue and mention @kkm000, and I'll gladly reopen it.

kkm000 · 2020-07-21T21:20:42Z

@jtrmal, could you PTAL? Dan thought you were the right person to review this.

jtrmal · 2020-07-22T08:33:03Z

put on my todo

jtrmal · 2020-07-22T13:12:42Z

yes, LGTM

stale · 2020-09-20T14:09:02Z

This issue has been automatically marked as stale by a bot solely because it has not had recent activity. Please add any comment (simply 'ping' is enough) to prevent the issue from being closed for 60 more days if you believe it should be kept open.

jtrmal · 2020-11-24T16:17:00Z

src/bin/compute-wer.cc

+
+
+template<typename T>
+void PrintAlignmentStats(const std::vector<T> &ref,


I don't see this capability/function used in any external tool. Plus I'm sure we already do have this capability. So could you please remove this from this PR and eventually, if you think it's a "worthy" change, create a new PR, please?

kkm000 · 2020-11-25

@jtrmal, could you please help us find the tool? Interestingly, I once sent a PR to print similar data from the very same program at verbosity 2+, only in raw form, and you told me then that we already have that. I remember I could not find it.

jtrmal · 2020-11-25T20:36:33Z

it's in steps/score_kaldi.sh
```
$cmd $dir/scoring_kaldi/log/stats1.log \
  cat $dir/scoring_kaldi/penalty_$best_wip/$best_lmwt.txt \| \
  align-text --special-symbol="'***'" ark:$dir/scoring_kaldi/test_filt.txt ark:- ark,t:- \| \
  utils/scoring/wer_per_utt_details.pl --special-symbol "'***'" \| tee $dir/scoring_kaldi/wer_details/per_utt \|\
  utils/scoring/wer_per_spk_details.pl $data/utt2spk \> $dir/scoring_kaldi/wer_details/per_spk || exit 1;
```

stale · 2021-01-24T21:08:57Z

This issue has been automatically marked as stale by a bot solely because it has not had recent activity. Please add any comment (simply 'ping' is enough) to prevent the issue from being closed for 60 more days if you believe it should be kept open.

kkm000 · 2021-09-22T21:03:44Z

Looks like this is nearly there but abandoned. I'll take over, the changes seem trivial. @jtrmal, you ok with that?

jtrmal · 2021-09-22T21:26:01Z

You know what? Let me handle it. I was asked to revisit the recipes so I will have to dig through my mess again anyway :) y.

kkm000 · 2021-09-22T21:30:29Z

Bingo!

I'm trying to grok the difference between en_US.UTF-8 and C.UTF-8. ATM the main dissimilarity is that I have the latter on my system but not the former, but I'm trying to bridge the gap...

jtrmal · 2021-09-22T21:32:32Z

the differences, IIRC are in chr(127) and up... But I don't recall exactly. Maybe currencies and stuff like that.

kkm000 · 2021-09-22T22:00:58Z

@jtrmal ah, I was thinking about a note on the different PR. This is bizarre: https://github.com/kaldi-asr/kaldi/pull/4130/files#diff-683416a3e05ed8af8425a06a31b0f2e4566d71dff67b3aae0f970427f834d58dR130

grep using the C locale finds non-printable non-spaces in Turkish text, but doesn't when using the EN locale. UTF-8 operates on codepoints, and Unicode categories are assigned to them, AFAIK, in a locale-independent way w.r.t. spacing and being printable. E.g., the codepoint U+001F (US, "unit separator") belongs to the general category Control (C), more specifically Cc, which is non-printable, but I'm not 100% sure that it's a non-space. But this is all language-independent! Ll and Lu may be locale-dependent, but I'm unsure. Locale affects collation (C: codepoint -> integer for sorting) and case equivalence classes, but should not make codepoints invisible.

And no one provided a repro.

jtrmal · 2021-09-22T22:04:49Z

yeah, no idea what that was fixing y.

kkm000 · 2021-09-23T00:05:03Z

On my Debian 11 distro, there is no difference.

From /usr/share/i18n/locales/C:

LC_CTYPE
copy "i18n"
translit_start
include "translit_combining";""
translit_end
END LC_CTYPE

From /usr/share/i18n/locales/en_US:

LC_CTYPE
copy "en_GB"
END LC_CTYPE

From /usr/share/i18n/locales/en_GB:

LC_CTYPE
copy "i18n"
translit_start
include "translit_combining";""
translit_end
END LC_CTYPE

Both end up with copy "i18n" and include "translit_combining", the same files. LANG=C.UTF-8 is correct; its LC_CTYPE is very generic. Their distro may be borked, but fixing that is beyond our control.

The translit_combining removes all diacritics. All like really all, including U+11A36 ZANABAZAR SQUARE SIGN CANDRABINDU WITH ORNAMENT (a reminder for the people who think that Czech diacritics are weird). This should not affect whether Turkish text counts as printable or non-printable w.r.t. the LC_CTYPE definition, whether the text is normalized or not.

FWIW, I see no problem with UTF-8 text in Turkish, unless I drop down to ASCII:

kkm@buba:~/.tmp$ LANG=en_US.UTF-8 grep '[^[:print:][:space:]]' futbol.txt
kkm@buba:~/.tmp$ LANG=C.UTF-8 grep '[^[:print:][:space:]]' futbol.txt
kkm@buba:~/.tmp$ LANG=C grep '[^[:print:][:space:]]' futbol.txt
Futbol, on birer oyuncudan olu▒▒an iki tak▒▒m aras▒▒nda, kendine ▒▒zg▒▒ k▒▒resel bir topla oynanan tak▒▒m
sporudur. 21. y▒▒zy▒▒l itibar▒▒yla 200'▒▒n ▒▒zerinde ▒▒lkede 250 milyonu a▒▒k▒▒n oyuncu taraf▒▒ndan
....
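
The output above is consistent with the plain C locale treating the input as single bytes, so that every byte of a multi-byte UTF-8 sequence falls outside the ASCII printable range. The following Python sketch only imitates that classification; it does not call grep or the C library, and the helper names and the sample string are made up for illustration.

```python
import string

# Bytes matched by [[:print:][:space:]] in the byte-oriented C locale:
# ASCII 0x20-0x7E plus the ASCII whitespace characters.
ASCII_OK = set(range(0x20, 0x7F)) | {ord(c) for c in string.whitespace}

def flagged_as_bytes(data: bytes) -> int:
    """Count bytes that a byte-wise [^[:print:][:space:]] would match."""
    return sum(1 for b in data if b not in ASCII_OK)

def flagged_as_codepoints(data: bytes) -> int:
    """Count codepoints that are neither printable nor whitespace
    once the data is decoded as UTF-8."""
    text = data.decode("utf-8")
    return sum(1 for ch in text if not (ch.isprintable() or ch.isspace()))

# Hypothetical Turkish sample with two non-ASCII letters.
sample = "oluşan iki takım".encode("utf-8")
print(flagged_as_bytes(sample))       # 4: two bytes for each non-ASCII letter
print(flagged_as_codepoints(sample))  # 0: nothing is flagged after decoding
```

Each non-ASCII letter in the sample occupies two bytes, which lines up with the pairs of placeholder glyphs in the LANG=C output.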

kkm000 · 2021-09-23T00:31:41Z

@jtrmal, I was thinking about #4193. Sorry, this ticket is unrelated.

jtrmal · 2021-09-23T00:34:41Z

I figured :)

danpovey · 2021-09-23T05:54:02Z

I think the issue is that not all systems have all locales available, and it probably defaults to C if you choose one that is not available.
See this PR:
https://github.com/kaldi-asr/kaldi/pull/4612/files

kkm000 · 2021-09-24T01:14:43Z

Every Linux system necessarily has at least one broken locale. I've seen LC_ALL=en_US.UTF-8 not working, with diagnostics like "cannot set locale LC_CTYPE", while LANG=en_US.UTF-8 did work and set it just fine. I would rather use LANG= or LC_CTYPE= in this script than LC_ALL. The less you ask of a locale, the safer the code is. The complete list of locale facets includes a lot of bizarre stuff, like postal address formatting and prefixes for "Mr.", "Ms." and an addressee of unknown sex. Later CLDR releases include locally adopted blood glucose measurement units.

en_US.UTF-8 is ambiguous in many ways, as it can be asked nonsensical questions. Is GREEK QUESTION MARK a printable character? American English does not have it. Up to the locale's designer, I guess. In Debian and therefore Ubuntu, it is: en_US and C both extend the category definitions to the full Unicode codepoint set. I'd say that if a system does not have the "neutral" C.UTF-8 locale extended with a complete Unicode character definition (which is not at all guaranteed), everything else is even worse guesswork.

Oh, and amazingly, ISO 8859-7 Latin+Greek 8-bit encoding does define the Greek question mark...

Also, my locale -a prints C.UTF-8 and en_US.utf8. All case variants and both utf8 and utf-8 are accepted on input, but the names are not canonicalized when printed. So the locale -a | grep check in #4612 is also error-prone.
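
One way to make such a check less fragile is to normalize the codeset spelling before comparing names. This is a minimal Python sketch, assuming the available names come from something like the output of locale -a; the helpers `canonical_locale_name` and `has_locale` and the sample list are hypothetical, and this is not the check used in #4612.

```python
def canonical_locale_name(name):
    """Normalize a locale name so that en_US.UTF-8, en_US.utf8 and
    en_us.Utf-8 all compare equal."""
    base, dot, codeset = name.strip().partition(".")
    # Drop hyphens and case from the codeset part:
    # 'UTF-8' and 'utf8' both become 'utf8'.
    codeset = codeset.replace("-", "").lower()
    return base.lower() + (dot + codeset if dot else "")

def has_locale(wanted, available):
    """Return True if `wanted` matches any available name after normalization."""
    target = canonical_locale_name(wanted)
    return any(canonical_locale_name(a) == target for a in available)

# Hypothetical list in the style printed by `locale -a` in the comment above.
available = ["C", "C.UTF-8", "POSIX", "en_US.utf8"]
print(has_locale("en_US.UTF-8", available))
print(has_locale("tr_TR.UTF-8", available))
```

A literal string match for en_US.UTF-8 against this list would fail even though the locale is installed, which is the pitfall described above.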

* bug fixes for KWS setup in babel recipe, generation of confusion matrix(compute-wer compatibility issue) and writing results
* removing compute-wer.cc changes

Co-authored-by: thoshith-s <thoshith.thoshi@gmail.com>
Co-authored-by: Jan 'Yenda' Trmal <jtrmal@gmail.com>

jtrmal · 2021-09-24T16:35:00Z

resolved via #4633

Commit b324c4e: bug fixes for KWS setup in babel recipe, generation of confusion matrix(compute-wer compatibility issue) and writing results

thoshith-s mentioned this pull request Feb 23, 2020: "fix bug in kws_search.sh for writing results" #3950 (merged)

stale bot added the stale label Jun 19, 2020

stale bot closed this Jul 19, 2020

kkm000 reopened this Jul 19, 2020

stale bot removed the stale label Jul 19, 2020

kkm000 requested a review from jtrmal July 21, 2020 21:21

jtrmal approved these changes Jul 22, 2020

stale bot added the stale label Sep 20, 2020

jtrmal reviewed Nov 24, 2020

stale bot removed the stale label Nov 24, 2020

kkm000 self-assigned this Nov 25, 2020

kkm000 added the waiting-for-feedback label Nov 25, 2020

stale bot added the stale label Jan 24, 2021

stale bot removed the stale label Sep 22, 2021

kkm000 added the in progress and stale-exclude labels and removed the waiting-for-feedback label Sep 22, 2021

jtrmal closed this Sep 24, 2021
