-
Notifications
You must be signed in to change notification settings - Fork 204
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Adding alignments to supervision #304
Conversation
Conflicts: lhotse/qa.py
Some questions/comments:
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good! I left some comments and suggestions.
Before you go on with adding more stuff, I suggest you add/extend unit tests to cover the new code paths that are being introduced. In particular you might want to see lhotse.testing.dummies
(https://github.com/lhotse-speech/lhotse/blob/master/lhotse/testing/dummies.py) and add alignments to supervision segments to see how many existing tests will break (if any). You can also extend lhotse.testing.fixtures
with it which performs randomized testing with PyTorch datasets (https://github.com/lhotse-speech/lhotse/blob/master/lhotse/testing/fixtures.py#L81).
Your questions:
I don't fully understand what you meant about using post_init in SupervisionSegment. Could you explain a little more?
When you store the manifest in JSON/sth else, it will convert AlignmentItem
to a 3-element list. When we read that JSON, we need to convert these 3-element lists back to AlignmentItem
s. We can do it in two ways: one is using a special member in dataclasses that is called __post_init__
, which is called after __init__
(this allows to use default __init__
in dataclasses and still customize it). But I think there's actually a better way to do it -- I think there is a method called from_dict
in supervision segment that you can adjust instead.
Where should from_ctm and to_ctm be implemented? CTM word supervisions are usually at the recording level, so a CTM would more naturally be associated with a SupervisionSet I think.
Yeah SupervisionSet
makes sense to me. Actually if we read with from_ctm
, we would not have the speaker etc. information anymore -- maybe we can instead make it sth like add_ctm_info
(not sure how to name it well...) that adds alignments to an existing object? Or we can read and merge two supervision sets. Or maybe there is another way -- it's up to you.
Some functions in cuts.py which prepare supervision masks can be modified to optionally use the alignments for mask generation. This would be particularly useful for VAD training etc.
Makes sense!
What's a natural workflow for adding alignments? Should they be included in Lhotse recipes, or do users get to add them to the prepared supervisions manifest?
I think users will typically extend existing supervision manifests, as not too many corpora provide alignments.
Guys, I haven't looked at this in detail, but I do want to mention something that's been on my mind recently RE snowfall and (eventual) Icefall. Firstly, the timeline for designing icefall is basically in the next few weeks we'll be figuring out how to start, and we'd like Icefall to be "officially releasable" in time for the Intespeech tutorial around September 1st. Think of Icefall as a "properly designed" version of snowfall. Anyway, the issue related to this is, I'd like to be able to deal with things like alignments in Icefall/Snowfall, e.g. for purposes of building trees and the like, or for segmenting training data, data-cleaning and analysis, and so on. I'm not saying the format we'd use for such things would necessarily be identical to the the way they are represented in Lhotse. But a workflow that I think might end up being fairly common, is we train some fairly basic neural net model that only sees limited acoustic context (like a TDNN), for purposes of data alignment. If we're using a model where there is a phone for silence or at least a unit that represents "end-of-word", it may be useful to have a blank-free topology here. I don't believe we currently have code for such a topology in snowfall. And we might want some way to represent those data alignments. E.g. one simple representation might just be the label sequences, indexed somehow by utterance-id. There are several different label-sequences that might be relevant here, depending on the type of system: the ilabel from the model, the phone-label "without repetitions", which we could store as a separate attribute on the graphs by using the "inner_labels=xxx" arg to compose in the appropriate stage of graph creation; and the olabel which is the word label. One possibility is to just store these as a dict indexed by utterance-id and then by 'ilabel', 'olabel' and 'phone_label' (for phone labels without repetitions) or something like that, and store it as a .pt file with torch.save(). |
That makes sense to me. In another project, where I prototyped some of the alignment-related stuff, I'm using this CTM-like format together with methods to convert to/from a frame-level alignment (int sequence). It handles conversions such as 10ms - 12ms frame shift etc. I think it's going to play well with what you're planning for Icefall. |
Something else the alignments might be useful for, if we can get word-level alignments, is in designing BPE-type units. It's desirable to have units where words that have a short pronunciation also have a short representation, so we can keep the frame rate slow. |
lhotse/supervision.py
Outdated
@@ -251,12 +254,49 @@ def from_segments(segments: Iterable[SupervisionSegment]) -> 'SupervisionSet': | |||
def from_dicts(data: Iterable[Dict]) -> 'SupervisionSet': | |||
return SupervisionSet.from_segments(SupervisionSegment.from_dict(s) for s in data) | |||
|
|||
def add_alignments_from_ctm(ctm) -> 'SupervisionSet': | |||
def add_alignments_from_ctm(self, ctm_file: Pathlike, type: str = 'word') -> 'SupervisionSet': |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd suggest with_alignments_from_ctm
, add
suggests we're mutating the original object
lhotse/supervision.py
Outdated
:return: A new SupervisionSet with AlignmentItem objects added to the segments. | ||
""" | ||
ctm_words = [] | ||
with open(ctm_file, 'r') as f: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
with open(ctm_file, 'r') as f: | |
with open(ctm_file) as f: |
lhotse/supervision.py
Outdated
with open(ctm_file, 'r') as f: | ||
for line in f: | ||
reco_id, channel, start, duration, symbol = line.strip().split() | ||
ctm_words.append((reco_id, channel, float(start), float(duration), symbol)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
shouldn't channel
be an int?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(at least in Lhotse we always map them to ints and expect int, so be careful there)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ideally, yes. But sometimes they are also denoted by A
, B
, etc. in CTM files.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes, channels do not have to be int
lhotse/supervision.py
Outdated
for line in f: | ||
reco_id, channel, start, duration, symbol = line.strip().split() | ||
ctm_words.append((reco_id, channel, float(start), float(duration), symbol)) | ||
ctm_words = sorted(ctm_words, key=lambda x:x[0]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is python's sorted
guaranteed to be stable? otherwise you might want to change the key to be a tuple of (reco_id, start)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, it is guaranteed (https://docs.python.org/3/library/functions.html#sorted). Also, we don't particularly need the alignments to be sorted by time I suppose? This sorting by reco_id
is just to enable groupby
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We probably don't, but I think it's a reasonable expectation that they are sorted (I know I'd be surprised if they're not). If it's guaranteed then it's cool :)
lhotse/supervision.py
Outdated
reco_to_ctm = defaultdict(list, {k: list(v) for k,v in groupby(ctm_words, key=lambda x:x[0])}) | ||
segments = [] | ||
for reco_id in reco_to_ctm: | ||
segs = [s for s in self if s.recording_id == reco_id] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think a faster variant is self.find(recording_id=reco_id)
which internally builds an index and caches it
lhotse/supervision.py
Outdated
for seg in segs: | ||
alignment = [AlignmentItem(word[4], word[2], word[3]) for word in ctm_words | ||
if overspans( | ||
TimeSpan(start=seg.start, end=seg.end), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it's OK to just pass seg
here due to duck typing
lhotse/supervision.py
Outdated
return SupervisionSet.from_segments(segments) | ||
|
||
|
||
def write_ctm(self, ctm_file: Pathlike, type: str = 'word') -> None: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
how about write_alignment_to_ctm
? it's un-ambiguous, as it's also possible to convert "normal" supervisions to a CTM
lhotse/supervision.py
Outdated
segs = [s for s in self if s.recording_id == reco_id] | ||
for seg in segs: | ||
alignment = [AlignmentItem(word[4], word[2], word[3]) for word in ctm_words | ||
if overspans( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what should happen if the alignment item overlaps a supervision segment? maybe we should issue some warning about potential mismatch? I'm not sure.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @desh2608! I left some comments, could you also add unit tests for the two new methods of import/export to CTM?
(you can add dummy CTMs to |
@pzelasko I have made the changes you suggested and also added tests for reading/writing CTM. Perhaps the PR can be merged now if it looks good to you (I will try to make the changes to supervision mask generation in Cuts later). |
OK I am merging as it is. We can try them out and adjust or optimize as needed. Thanks for all the work on this, great job! |
for item in ali | ||
] | ||
for ali_type, ali in self.alignment.items() | ||
} if self.alignment else None |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Line 192 already checks whether self.alignment
is None
, so no need to do an extra check here.
segments.append(fastcopy(seg, alignment={type: alignment})) | ||
else: | ||
segments.append([s for s in self.find(recording_id=reco_id)]) | ||
print (segments) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This kind of debugging code should not be checked into the final commit.
We use the recursive conversion only if alignments are present, since it may | ||
potentially be slower due to type checking of member objects. | ||
""" | ||
return asdict_nonull(self) if self.alignment is None else asdict_nonull_recursive(self) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The AlignmentItem
class is no longer a subclass of NamedTuple
, is asdict_nonull_recursive
really necessary? It makes the code slow, I believe.
Also, the code in cut.py
should also be updated to remove asdict_nonull_recursive
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good point, @desh2608 could you address Fangjuns comments?
Btw @csukuangfj we should make sure that Lhotse alignments and Snowfall alignments interact well together. If there are any additions/changes that are helpful on Lhotse side, please let us know.
@csukuangfj thanks for your review. I have made suggested changes in #313. |
This PR will add alignments to SupervisionSegment, as discussed in #298.