
Adding alignments to supervision #304

Merged: 38 commits into lhotse-speech:master on May 20, 2021
Conversation

@desh2608 (Collaborator) commented May 11, 2021

This PR will add alignments to SupervisionSegment, as discussed in #298.

@desh2608 (Collaborator, author):

Some questions/comments:

  1. I don't fully understand what you meant about using __post_init__ in SupervisionSegment. Could you explain a little more?
  2. Where should from_ctm and to_ctm be implemented? CTM word supervisions are usually at the recording level, so a CTM would more naturally be associated with a SupervisionSet I think.
  3. Some functions in cuts.py which prepare supervision masks can be modified to optionally use the alignments for mask generation. This would be particularly useful for VAD training etc.
  4. What's a natural workflow for adding alignments? Should they be included in Lhotse recipes, or do users get to add them to the prepared supervisions manifest?

@pzelasko (Collaborator) left a comment:

Looks good! I left some comments and suggestions.

Before you go on with adding more stuff, I suggest you add/extend unit tests to cover the new code paths being introduced. In particular, you might want to look at lhotse.testing.dummies (https://github.com/lhotse-speech/lhotse/blob/master/lhotse/testing/dummies.py) and add alignments to the dummy supervision segments to see how many existing tests break (if any). You could also extend lhotse.testing.fixtures, which performs randomized testing with PyTorch datasets (https://github.com/lhotse-speech/lhotse/blob/master/lhotse/testing/fixtures.py#L81).

Your questions:

I don't fully understand what you meant about using __post_init__ in SupervisionSegment. Could you explain a little more?

When you store the manifest in JSON (or another format), it will convert each AlignmentItem to a 3-element list. When we read that JSON back, we need to convert these 3-element lists into AlignmentItems. We can do it in two ways. One is a special dataclass method called __post_init__, which runs after __init__ (this lets you keep the default generated __init__ and still customize construction). But I think there's actually a better way: SupervisionSegment has a from_dict method that you can adjust instead.
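A minimal sketch of the __post_init__ approach, using simplified stand-in classes (not the actual Lhotse definitions) to show how deserialized 3-element lists could be converted back:

```python
from dataclasses import dataclass
from typing import Dict, List, Optional

@dataclass
class AlignmentItem:
    # Simplified stand-in for illustration; the real Lhotse fields may differ.
    symbol: str
    start: float
    duration: float

@dataclass
class SupervisionSegment:
    id: str
    alignment: Optional[Dict[str, List]] = None

    def __post_init__(self):
        # Runs automatically after the generated __init__: turn any plain
        # 3-element lists (as read back from JSON) into AlignmentItem objects.
        if self.alignment is not None:
            self.alignment = {
                ali_type: [
                    item if isinstance(item, AlignmentItem) else AlignmentItem(*item)
                    for item in items
                ]
                for ali_type, items in self.alignment.items()
            }

seg = SupervisionSegment(id='utt1', alignment={'word': [['hello', 0.0, 0.5]]})
print(type(seg.alignment['word'][0]).__name__)  # AlignmentItem
```

The from_dict route mentioned above would instead perform this conversion at deserialization time and leave construction untouched.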

Where should from_ctm and to_ctm be implemented? CTM word supervisions are usually at the recording level, so a CTM would more naturally be associated with a SupervisionSet I think.

Yeah SupervisionSet makes sense to me. Actually if we read with from_ctm, we would not have the speaker etc. information anymore -- maybe we can instead make it sth like add_ctm_info (not sure how to name it well...) that adds alignments to an existing object? Or we can read and merge two supervision sets. Or maybe there is another way -- it's up to you.

Some functions in cuts.py which prepare supervision masks can be modified to optionally use the alignments for mask generation. This would be particularly useful for VAD training etc.

Makes sense!

What's a natural workflow for adding alignments? Should they be included in Lhotse recipes, or do users get to add them to the prepared supervisions manifest?

I think users will typically extend existing supervision manifests, as not too many corpora provide alignments.

lhotse/supervision.py: several review threads (resolved)
@danpovey (Collaborator):
Guys, I haven't looked at this in detail, but I do want to mention something that's been on my mind recently RE snowfall and (eventual) Icefall.

Firstly, the timeline for designing Icefall is basically: in the next few weeks we'll be figuring out how to start, and we'd like Icefall to be "officially releasable" in time for the Interspeech tutorial around September 1st. Think of Icefall as a "properly designed" version of snowfall.

Anyway, the related issue is: I'd like to be able to deal with things like alignments in Icefall/Snowfall, e.g. for purposes of building trees, segmenting training data, data cleaning and analysis, and so on. I'm not saying the format we'd use for such things would necessarily be identical to the way they are represented in Lhotse. But a workflow that I think might end up being fairly common is: we train some fairly basic neural net model that only sees limited acoustic context (like a TDNN) for purposes of data alignment. If we're using a model where there is a phone for silence, or at least a unit that represents "end-of-word", it may be useful to have a blank-free topology here. I don't believe we currently have code for such a topology in snowfall.

And we might want some way to represent those data alignments. E.g., one simple representation might just be the label sequences, indexed somehow by utterance-id. There are several different label sequences that might be relevant here, depending on the type of system: the ilabel from the model; the phone label "without repetitions", which we could store as a separate attribute on the graphs by using the "inner_labels=xxx" arg to compose in the appropriate stage of graph creation; and the olabel, which is the word label. One possibility is to just store these as a dict indexed by utterance-id and then by 'ilabel', 'olabel' and 'phone_label' (for phone labels without repetitions), and store it as a .pt file with torch.save().
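As a rough illustration of that storage scheme (the utterance-id and label values are made up; torch.save is the suggestion above, but pickle is used here to keep the sketch stdlib-only):

```python
import pickle

# Hypothetical alignment store: a dict indexed by utterance-id, then by
# label-sequence type ('ilabel', 'olabel', 'phone_label').
alignments = {
    'utt-0001': {
        'ilabel': [0, 0, 5, 5, 5, 0, 8],  # per-frame model labels (0 = blank/silence)
        'phone_label': [5, 8],            # phone labels without repetitions
        'olabel': [42],                   # word labels
    },
}

# In practice this would be torch.save(alignments, 'ali.pt'); a pickle
# round-trip demonstrates the same serialize/restore idea with the stdlib.
restored = pickle.loads(pickle.dumps(alignments))
print(restored['utt-0001']['phone_label'])  # [5, 8]
```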

@pzelasko (Collaborator):
That makes sense to me. In another project, where I prototyped some of the alignment-related stuff, I'm using this CTM-like format together with methods to convert to/from a frame-level alignment (an int sequence). It handles conversions between frame shifts such as 10 ms and 12 ms, etc. I think it's going to play well with what you're planning for Icefall.
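A sketch of what such a CTM-to-frame-level conversion might look like (a hypothetical helper, not the actual code from that project):

```python
def ctm_to_frame_labels(items, frame_shift=0.01, blank=0, total_duration=None):
    """Convert (symbol_id, start, duration) items into a per-frame int sequence.

    Illustrative only: maps each item's time span onto frames of `frame_shift`
    seconds, filling uncovered frames with `blank`.
    """
    if total_duration is None:
        total_duration = max(start + dur for _, start, dur in items)
    num_frames = round(total_duration / frame_shift)
    labels = [blank] * num_frames
    for symbol_id, start, dur in items:
        begin = round(start / frame_shift)
        end = min(round((start + dur) / frame_shift), num_frames)
        labels[begin:end] = [symbol_id] * (end - begin)
    return labels

# Two words at a 10 ms frame shift: frames 0-9 get label 7, frames 10-19 get 3.
print(ctm_to_frame_labels([(7, 0.0, 0.1), (3, 0.1, 0.1)]))
```

The inverse direction (frame labels back to CTM-like items) would group runs of identical labels and multiply run lengths by the frame shift.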

@danpovey (Collaborator):
Something else the alignments might be useful for, if we can get word-level alignments, is in designing BPE-type units. It's desirable to have units where words that have a short pronunciation also have a short representation, so we can keep the frame rate slow.

@pzelasko pzelasko added this to the v0.7 milestone May 18, 2021
@@ -251,12 +254,49 @@ def from_segments(segments: Iterable[SupervisionSegment]) -> 'SupervisionSet':
def from_dicts(data: Iterable[Dict]) -> 'SupervisionSet':
return SupervisionSet.from_segments(SupervisionSegment.from_dict(s) for s in data)

-    def add_alignments_from_ctm(ctm) -> 'SupervisionSet':
+    def add_alignments_from_ctm(self, ctm_file: Pathlike, type: str = 'word') -> 'SupervisionSet':
Collaborator:
I'd suggest with_alignments_from_ctm, add suggests we're mutating the original object

    :return: A new SupervisionSet with AlignmentItem objects added to the segments.
    """
    ctm_words = []
    with open(ctm_file, 'r') as f:
Collaborator:
Suggested change:
-    with open(ctm_file, 'r') as f:
+    with open(ctm_file) as f:

    with open(ctm_file, 'r') as f:
        for line in f:
            reco_id, channel, start, duration, symbol = line.strip().split()
            ctm_words.append((reco_id, channel, float(start), float(duration), symbol))
Collaborator:
shouldn't channel be an int?

Collaborator:
(at least in Lhotse we always map them to ints and expect int, so be careful there)

Collaborator (author):
Ideally, yes. But sometimes they are also denoted by A, B, etc. in CTM files.

Collaborator:
yes, channels do not have to be int

    ctm_words = sorted(ctm_words, key=lambda x: x[0])
Collaborator:
is python's sorted guaranteed to be stable? otherwise you might want to change the key to be a tuple of (reco_id, start)

Collaborator (author):
Yes, it is guaranteed (https://docs.python.org/3/library/functions.html#sorted). Also, we don't particularly need the alignments to be sorted by time I suppose? This sorting by reco_id is just to enable groupby.
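The stability guarantee and the groupby pattern can be seen in a small sketch (dummy CTM tuples, not real data):

```python
from itertools import groupby

# Dummy CTM entries: (reco_id, channel, start, duration, symbol).
ctm_words = [
    ('reco2', 1, 0.0, 0.4, 'hi'),
    ('reco1', 1, 0.5, 0.3, 'world'),
    ('reco1', 1, 0.0, 0.5, 'hello'),
]
# sorted() is guaranteed stable, so sorting by reco_id alone preserves the
# original relative order within each recording, which is all groupby needs.
ctm_words = sorted(ctm_words, key=lambda x: x[0])
reco_to_ctm = {k: list(v) for k, v in groupby(ctm_words, key=lambda x: x[0])}
print([w[4] for w in reco_to_ctm['reco1']])  # ['world', 'hello'] -- input order kept
```

Using `key=lambda x: (x[0], x[2])` instead would additionally sort each recording's words by start time.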

Collaborator:
We probably don't, but I think it's a reasonable expectation that they are sorted (I know I'd be surprised if they're not). If it's guaranteed then it's cool :)

    reco_to_ctm = defaultdict(list, {k: list(v) for k, v in groupby(ctm_words, key=lambda x: x[0])})
    segments = []
    for reco_id in reco_to_ctm:
        segs = [s for s in self if s.recording_id == reco_id]
Collaborator:
I think a faster variant is self.find(recording_id=reco_id) which internally builds an index and caches it

        for seg in segs:
            alignment = [AlignmentItem(word[4], word[2], word[3]) for word in ctm_words
                         if overspans(
                             TimeSpan(start=seg.start, end=seg.end),
Collaborator:
I think it's OK to just pass seg here due to duck typing

    return SupervisionSet.from_segments(segments)


def write_ctm(self, ctm_file: Pathlike, type: str = 'word') -> None:
Collaborator:
how about write_alignment_to_ctm? it's un-ambiguous, as it's also possible to convert "normal" supervisions to a CTM

        segs = [s for s in self if s.recording_id == reco_id]
        for seg in segs:
            alignment = [AlignmentItem(word[4], word[2], word[3]) for word in ctm_words
                         if overspans(
Collaborator:
what should happen if the alignment item overlaps a supervision segment? maybe we should issue some warning about potential mismatch? I'm not sure.
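One possible policy for that case, sketched with simplified stand-ins for TimeSpan and overspans (not the actual Lhotse implementations): warn when an alignment item overlaps a segment but is not fully contained in it.

```python
import warnings

class Span:
    # Minimal stand-in for a time interval; anything exposing .start and .end
    # (segments, time spans) is duck-type compatible with the checks below.
    def __init__(self, start: float, end: float):
        self.start, self.end = start, end

def overspans(spanning, spanned) -> bool:
    # True when `spanning` fully contains `spanned`.
    return spanning.start <= spanned.start and spanned.end <= spanning.end

def overlaps(a, b) -> bool:
    # True when the two intervals share any time at all.
    return a.start < b.end and b.start < a.end

seg = Span(0.0, 2.0)
word = Span(1.8, 2.3)  # crosses the segment's right boundary
if overlaps(seg, word) and not overspans(seg, word):
    warnings.warn('Alignment item crosses a supervision segment boundary; '
                  'possible CTM/supervision mismatch.')
```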

@pzelasko (Collaborator) left a comment:
Thanks @desh2608! I left some comments, could you also add unit tests for the two new methods of import/export to CTM?

@pzelasko (Collaborator):
(you can add dummy CTMs to test/fixtures, don't forget to add them to git)

@desh2608 (Collaborator, author) commented May 19, 2021:

@pzelasko I have made the changes you suggested and also added tests for reading/writing CTM. Perhaps the PR can be merged now if it looks good to you (I will try to make the changes to supervision mask generation in Cuts later).

@pzelasko (Collaborator):
OK I am merging as it is. We can try them out and adjust or optimize as needed.

Thanks for all the work on this, great job!

@pzelasko pzelasko merged commit a3aa32e into lhotse-speech:master May 20, 2021
@pzelasko pzelasko changed the title [WIP] Adding alignments to supervision Adding alignments to supervision May 20, 2021
            for item in ali
        ]
        for ali_type, ali in self.alignment.items()
    } if self.alignment else None
Contributor:
Line 192 already checks whether self.alignment is None, so no need to do an extra check here.

                segments.append(fastcopy(seg, alignment={type: alignment}))
        else:
            segments.append([s for s in self.find(recording_id=reco_id)])
    print (segments)
Contributor:
This kind of debugging code should not be checked into the final commit.

    We use the recursive conversion only if alignments are present, since it may
    potentially be slower due to type checking of member objects.
    """
    return asdict_nonull(self) if self.alignment is None else asdict_nonull_recursive(self)
Contributor:
The AlignmentItem class is no longer a subclass of NamedTuple, is asdict_nonull_recursive really necessary? It makes the code slow, I believe.

The code in cut.py should also be updated to remove asdict_nonull_recursive.

Collaborator:
Good point. @desh2608, could you address Fangjun's comments?

Btw @csukuangfj we should make sure that Lhotse alignments and Snowfall alignments interact well together. If there are any additions/changes that are helpful on Lhotse side, please let us know.

@desh2608 (Collaborator, author) commented Jun 5, 2021:

@csukuangfj thanks for your review. I have made the suggested changes in #313.
