bug fixes for KWS setup in babel recipe, generation of confusion matrix (compute-wer compatibility issue) and writing results #3951
Thanks... I think @jtrmal should check this.
Welcome. Below are the comments which are helpful in reviewing the above commit.
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
This issue has been automatically closed by a bot strictly because of inactivity. This does not mean that we think that this issue is not important! If you believe it has been closed hastily, add a comment to the issue and mention @kkm000, and I'll gladly reopen it.
@jtrmal, could you PTAL? Dan thought you were the right person to review this.
put on my todo
yes, LGTM
This issue has been automatically marked as stale by a bot solely because it has not had recent activity. Please add any comment (simply 'ping' is enough) to prevent the issue from being closed for 60 more days if you believe it should be kept open.
template<typename T>
void PrintAlignmentStats(const std::vector<T> &ref,
I don't see this capability/function used in any external tool. Plus I'm sure we already do have this capability. So could you please remove this from this PR and eventually, if you think it's a "worthy" change, create a new PR, please?
@jtrmal, could you please help us find the tool? Interesting, once I sent a PR to print similar data from the very same program at verbosity 2+, only in the raw form, and you then told me we already have that. I remember I could not find it.
it's in steps/score_kaldi.sh
```
$cmd $dir/scoring_kaldi/log/stats1.log \
  cat $dir/scoring_kaldi/penalty_$best_wip/$best_lmwt.txt \| \
  align-text --special-symbol="'***'" ark:$dir/scoring_kaldi/test_filt.txt ark:- ark,t:- \| \
  utils/scoring/wer_per_utt_details.pl --special-symbol "'***'" \| tee $dir/scoring_kaldi/wer_details/per_utt \| \
  utils/scoring/wer_per_spk_details.pl $data/utt2spk \> $dir/scoring_kaldi/wer_details/per_spk || exit 1;
```
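As a rough illustration of what the per-utterance details amount to, the sketch below counts correct, substituted, inserted and deleted words along one minimum-edit-distance alignment. It is written in Python for brevity and is not the code of align-text or wer_per_utt_details.pl; the function name `alignment_stats` and the `cor`/`sub`/`ins`/`del` keys are assumptions made for this example.

```python
from typing import Dict, List


def alignment_stats(ref: List[str], hyp: List[str]) -> Dict[str, int]:
    """Count cor/sub/ins/del along one minimum-edit-distance alignment.

    Hypothetical helper for illustration; not part of Kaldi.
    """
    n, m = len(ref), len(hyp)
    # dist[i][j] is the edit distance between ref[:i] and hyp[:j].
    dist = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dist[i][0] = i
    for j in range(1, m + 1):
        dist[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dist[i][j] = min(
                dist[i - 1][j - 1] + cost,  # match or substitution
                dist[i - 1][j] + 1,         # deletion (word only in ref)
                dist[i][j - 1] + 1,         # insertion (word only in hyp)
            )

    # Walk back from the end, preferring the diagonal step when it is optimal.
    stats = {"cor": 0, "sub": 0, "ins": 0, "del": 0}
    i, j = n, m
    while i > 0 or j > 0:
        same = i > 0 and j > 0 and ref[i - 1] == hyp[j - 1]
        if i > 0 and j > 0 and dist[i][j] == dist[i - 1][j - 1] + (0 if same else 1):
            stats["cor" if same else "sub"] += 1
            i, j = i - 1, j - 1
        elif i > 0 and dist[i][j] == dist[i - 1][j] + 1:
            stats["del"] += 1
            i -= 1
        else:
            stats["ins"] += 1
            j -= 1
    return stats


print(alignment_stats("the cat sat on the mat".split(),
                      "the cat sit on mat".split()))
```

When several alignments share the minimum distance, the split between substitutions, insertions and deletions depends on the tie-breaking order, so counts from a different tool may differ even if the total error count agrees.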
Looks like this is nearly there but abandoned. I'll take over, the changes seem trivial. @jtrmal, you ok with that?
You know what? Let me handle it. I was asked to revisit the recipes so I
will have to dig through my mess again anyway :)
y.
Bingo! I'm trying to grok the difference between en_US.UTF-8 and C.UTF-8. ATM the main dissimilarity is that I have the latter on my system but not the former, but I'm trying to bridge the gap...
the differences, IIRC, are in chr(127) and up... But I don't recall exactly.
Maybe currencies and stuff like that.
@jtrmal ah, I was thinking about a note on the different PR. This is bizarre: https://github.com/kaldi-asr/kaldi/pull/4130/files#diff-683416a3e05ed8af8425a06a31b0f2e4566d71dff67b3aae0f970427f834d58dR130 grep using the C locale finds non-printable non-spaces in Turkish text, but when using the EN locale, doesn't. UTF-8 operates on codepoints, and Unicode categories are assigned to them, AFAIK, in a locale-independent way w.r.t. spacing and being printable. E.g., the codepoint U+001F (called US, "unit separator") is a generic Control (C), and also more specific Cc, which is non-printable, but I'm not 100% sure that it's a non-space. But this is all language-independent! Ll and Lu may be locale-dependent, but I'm unsure. Locale affects collation (C : codepoint -> integer for sorting) and case equivalence classes, but should not make codepoints invisible.
And no one provided a repro.
yeah, no idea what that was fixing
y.
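For what it's worth, the codepoint-level view can be checked with a short Python sketch. It uses Python's built-in Unicode tables rather than grep or the C library's locale-defined character classes, so it only illustrates the "categories belong to codepoints" argument and does not reproduce the grep behaviour. The helper names and the sample string are assumptions made for this example.

```python
import unicodedata
from typing import List


def nonprintable_nonspace(text: str) -> List[str]:
    """List codepoints that Python treats as neither printable nor whitespace."""
    return [f"U+{ord(ch):04X}" for ch in text
            if not ch.isprintable() and not ch.isspace()]


def non_ascii_byte_count(text: str) -> int:
    """Count bytes >= 0x80 in the UTF-8 encoding of text."""
    return sum(1 for b in text.encode("utf-8") if b >= 0x80)


turkish = "çğıöşü İĞŞ"  # sample Turkish letters, chosen for illustration

# Decoded as UTF-8, every character is a letter or a space.
print(nonprintable_nonspace(turkish))
print([unicodedata.category(ch) for ch in turkish])

# U+001F is a control character (general category Cc).
print(unicodedata.category("\x1f"))

# Viewed byte by byte, the same text has bytes outside the ASCII range,
# which is what a tool working in a single-byte ASCII locale would see.
print(non_ascii_byte_count(turkish))
```

This fits the observation that nothing looks non-printable until the text is treated as ASCII bytes instead of decoded UTF-8.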
On my Debian 11 distro, there is no difference. From /usr/share/i18n/locales/C:
From /usr/share/i18n/locales/en_US:
From /usr/share/i18n/locales/en_GB:
Both end up with The FWIW, I see no problem with UTF-8 text in Turkish, unless I drop down to ASCII:
I figured :)
On Wed, Sep 22, 2021 at 8:31 PM kkm000 wrote:
> @jtrmal, I was thinking about #4193. Sorry, this ticket is unrelated.
I think the issue is that not all systems have all locales available, and it probably defaults to C if you choose one that is not available.
Every Linux system necessarily has at least one broken locale. I've seen LC_ALL=en_US.UTF-8 not working, with diagnostics like "cannot set locale LC_CTYPE", while LANG=en_US.UTF-8 did work and set it just fine. I would rather use
Oh, and amazingly, ISO 8859-7 Latin+Greek 8-bit encoding does define the Greek question mark... Also, my
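One way to avoid depending on a locale that may be missing is to make the fallback explicit. The Python sketch below tries a list of candidate names in order and reports the first one the system accepts. The function name `first_available_locale` and the candidate list are assumptions for illustration, and the behaviour shown assumes a glibc-style system where `setlocale` rejects unknown locale names.

```python
import locale
from typing import Iterable, Optional


def first_available_locale(candidates: Iterable[str]) -> Optional[str]:
    """Return the first candidate that setlocale accepts, or None.

    Side effect: the process locale is left set to the returned name.
    """
    for name in candidates:
        try:
            locale.setlocale(locale.LC_ALL, name)
        except locale.Error:
            # The locale is not installed (or is broken); try the next one.
            continue
        return name
    return None


# "C" is always available, so it serves as the last resort.
chosen = first_available_locale(["en_US.UTF-8", "C.UTF-8", "C"])
print(chosen)
```

Which name gets printed depends on the locales installed on the machine, which is exactly the portability problem described above.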
* bug fixes for KWS setup in babel recipe, generation of confusion matrix(compute-wer compatibility issue) and writing results
* removing compute-wer.cc changes
Co-authored-by: thoshith-s <thoshith.thoshi@gmail.com>
Co-authored-by: Jan 'Yenda' Trmal <jtrmal@gmail.com>
resolved via #4633
Bug fixes for KWS setup in babel recipe:
1. Generation of confusion matrix (compute-wer compatibility issue): the stats output is required.
2. Writing results: local/kws_search.sh and utils/write_kwslist.pl