This repository has been archived by the owner on Oct 13, 2022. It is now read-only.

Building lexicons in Python #191

Open
pzelasko opened this issue May 11, 2021 · 16 comments

Comments

@pzelasko
Collaborator

The current setup inherits building lexicon FSTs from Kaldi. I think it makes sense to have the ability to build it directly in Python, which should make building new recipes easier, as well as (eventually) allow for some things like dynamic expansion of the lexicon without leaving Python.

The data structure would basically resemble that of Kaldi, e.g.:

from typing import List, Tuple

class Dict:
  # a list of (word, phone transcript) pairs, possibly with scores to resemble lexiconp.txt
  lexicon: List[Tuple[str, List[str]]]

  # OOV word symbol
  oov: str

  # optional silence phone symbol
  optional_silence: str

  # a list of silence phone symbols (maybe we should call them special symbols? spoken noise is not really silence)
  silence_phones: List[str]

  # a list of nonsilence phone symbols
  nonsilence_phones: List[str]

  @property
  def words(self) -> List[str]:
    """A sorted list of unique words in Dict. Includes <eps>, #0, <s> and </s>"""

  @property
  def phones(self) -> List[str]:
    """A sorted list of unique phones in Dict."""

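As a minimal sketch of how the two properties above could behave, the following uses a plain dataclass. The class name `LexiconDict`, the `SPECIAL_WORDS` constant, the default values, and the choice to include the OOV symbol in `words` are assumptions made for illustration; nothing here depends on k2 or Kaldi.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

# Special word symbols that the `words` docstring says are included.
SPECIAL_WORDS = ["<eps>", "#0", "<s>", "</s>"]


@dataclass
class LexiconDict:  # hypothetical name; avoids shadowing typing.Dict
    # (word, phone transcript) pairs
    lexicon: List[Tuple[str, List[str]]]
    # OOV word symbol
    oov: str
    # Assumption: None means "no optional silence"
    optional_silence: Optional[str] = None
    silence_phones: List[str] = field(default_factory=list)
    nonsilence_phones: List[str] = field(default_factory=list)

    @property
    def words(self) -> List[str]:
        """Sorted unique words, plus the special symbols and the OOV symbol."""
        unique = {word for word, _ in self.lexicon}
        return sorted(unique | set(SPECIAL_WORDS) | {self.oov})

    @property
    def phones(self) -> List[str]:
        """Sorted unique phones that occur in any transcript."""
        return sorted({phone for _, pron in self.lexicon for phone in pron})


example = LexiconDict(
    lexicon=[("hello", ["HH", "AH", "L", "OW"]), ("low", ["L", "OW"])],
    oov="<UNK>",
)
```

Note that a plain string sort does not put `<eps>` first, so a real implementation would probably assign symbol IDs separately from this ordering.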
and methods:

def save(self, path):
  """Save into a file or a directory (maybe same as Kaldi's data dir)"""

@classmethod
def load(cls, path) -> 'Dict':
  """Read all the information from a path"""

def compile_lexicon_fst(self) -> k2.Fsa:
  """Adds disambiguation symbols and compiles L.fst"""

def extend(self, lexicon: List[Tuple[str, List[str]]]) -> k2.Fsa:
  """Adds new words and their corresponding phone transcripts into Dict. Checks for compatibility with the phone set."""

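The "adds disambiguation symbols" step of `compile_lexicon_fst` could be sketched as below. It follows the usual rule that a pronunciation gets a `#k` symbol when it is shared by several words or is a proper prefix of another pronunciation; the function name `add_disambig_symbols` and the `Lexicon` alias are hypothetical, and `#0` is assumed to stay reserved for the word level.

```python
from collections import Counter
from typing import Dict, List, Set, Tuple

Lexicon = List[Tuple[str, List[str]]]  # hypothetical alias for illustration


def add_disambig_symbols(lexicon: Lexicon) -> Tuple[Lexicon, int]:
    """Return a copy of the lexicon with '#k' symbols appended where needed,
    together with the largest k that was used."""
    # How many entries share each phone sequence (homophones).
    counts = Counter(tuple(pron) for _, pron in lexicon)

    # Every proper prefix of every pronunciation.
    prefixes: Set[Tuple[str, ...]] = set()
    for _, pron in lexicon:
        for n in range(1, len(pron)):
            prefixes.add(tuple(pron[:n]))

    last_used: Dict[Tuple[str, ...], int] = {}
    max_disambig = 0
    result: Lexicon = []
    for word, pron in lexicon:
        key = tuple(pron)
        if counts[key] == 1 and key not in prefixes:
            # Unambiguous pronunciation: no symbol needed.
            result.append((word, list(pron)))
            continue
        # Start at #1 because #0 is assumed to be reserved.
        k = last_used.get(key, 0) + 1
        last_used[key] = k
        max_disambig = max(max_disambig, k)
        result.append((word, list(pron) + [f"#{k}"]))
    return result, max_disambig


toy_lexicon = [
    ("red", ["R", "EH", "D"]),
    ("read", ["R", "EH", "D"]),
    ("a", ["AH"]),
    ("about", ["AH", "B", "AW", "T"]),
]
disambiguated, max_k = add_disambig_symbols(toy_lexicon)
```

In this toy lexicon, "red" and "read" are homophones and "a" is a prefix of "about", so those three entries receive a symbol while "about" is left unchanged.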
Kaldi's prepare_lang.sh has accumulated a lot of options, so I'd like to get some feedback on which of them are useful to keep and which are not:

  • num sil/nonsil states and share_silence_phones are currently unused and probably not needed anymore?
  • position-dependent phones seem superfluous in our current setups; not sure whether they'll be useful?
  • could unk-fst be still useful?
  • silprob/sil_prob - is it worth supporting it?

We can of course start from something minimal and extend it... It does seem like a substantial amount of work but I think it's worth it and I can give it a shot, or at least lay some groundwork. What do you guys think? Also, I want to make sure I wouldn't be duplicating anybody's effort.

@danpovey
Contributor

danpovey commented May 12, 2021

num sil/nonsil states relates to the topo, so probably doesn't belong in the dict.

If we do use word-position dependent phones, we'd probably want to simplify them into 2 classes instead of 5.
I did some experiments with them but saw no clear gain, but this could be revisited. However, it can be done simply as a transformation on the dict, doubling the phone-set size, so may not need to be represented in the Dict itself.
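A transformation of this kind might look like the sketch below. The thread does not say which two classes are meant, so the split into "first phone of the word" versus "every other phone", the `_B`/`_I` suffixes, and the function name are all assumptions chosen for illustration.

```python
from typing import List, Tuple

Lexicon = List[Tuple[str, List[str]]]  # hypothetical alias for illustration


def add_word_position(
    lexicon: Lexicon, begin_suffix: str = "_B", other_suffix: str = "_I"
) -> Lexicon:
    """Mark the first phone of each word differently from the rest.

    Every phone ends up in one of two variants, so the phone set at most
    doubles in size. The input lexicon is not modified.
    """
    marked: Lexicon = []
    for word, pron in lexicon:
        new_pron = [
            phone + (begin_suffix if i == 0 else other_suffix)
            for i, phone in enumerate(pron)
        ]
        marked.append((word, new_pron))
    return marked


plain = [("hello", ["HH", "AH", "L", "OW"]), ("low", ["L", "OW"])]
position_dependent = add_word_position(plain)
```

Because the function only rewrites transcripts, it can be applied to a lexicon before FST compilation without the Dict needing to know about word positions.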

unk-fst may still be useful, I guess, but I think we can leave it separate from Dict, for now at least.

silprob: my feeling is we may not need it since it can just be absorbed into the probability of silence in the acoustic model (if we're training with LF-MMI and other sequence criteria, removing it shouldn't remove any modeling power).

There is even a question whether the silence_phones / nonsilence_phones belongs in the Dict. It's not clear what uses we have for that right now. We do need the opt_sil, though, so we can turn the Dict into an Fst (note: None should be allowable).

Also: turning the Dict into an Fsa may not be the most efficient method of graph-building (at least for supervisions). One possibility is to turn the Dict into an FsaVec and introduce a new indexing operation whereby an FsaVec can be indexed by an Fsa or FsaVec. The idea is this: an expression a[b] gives you something with the top-level structure of b, but where each arc in b with a label x is replaced by the Fsa a[x], with the start-state and final-state of a[x] being identified with the source-state and destination-state of the arc in b, and any additional states in a[x] being inserted somewhere in the result (e.g. just after the source-state of the arc).

I would propose that epsilon be treated as a normal symbol, with element 0 of a being what we replace epsilon arcs with (likely just a single arc from start-state to final-state); that the last element of a be used when the symbol in b is -1; and that -1 arcs in a be replaced with 0 if their destination-state in a[b] is not a final-state. This way, we could put the optional silence at the start of all the individual FSAs, and the final FSA in a would also have the optional silence which may be present at end-of-sentence.
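To make the proposed a[b] operation concrete, here is a pure-Python sketch on a toy FSA type. It does not use the k2 API: `SimpleFsa` and `index_fsa_vec` are hypothetical names, scores are omitted, and inserted states are simply numbered after b's non-final states rather than right after each arc's source state.

```python
from dataclasses import dataclass
from typing import List, Tuple

Arc = Tuple[int, int, int]  # (source state, destination state, label)


@dataclass
class SimpleFsa:
    """Toy FSA: state 0 is the start and state num_states - 1 is final.
    As in k2, arcs entering the final state carry label -1."""
    num_states: int
    arcs: List[Arc]


def index_fsa_vec(a: List[SimpleFsa], b: SimpleFsa) -> SimpleFsa:
    """Replace each arc of b labelled x by a[x]; label -1 selects a's last element."""
    b_final = b.num_states - 1
    subs = [a[-1] if label == -1 else a[label] for _, _, label in b.arcs]
    assert all(sub.num_states >= 2 for sub in subs)

    # Interior states of every substituted FSA become new states in the result.
    num_inserted = sum(sub.num_states - 2 for sub in subs)
    out_final = b_final + num_inserted  # the final state stays last
    next_free = b_final                 # first id available for inserted states

    out_arcs: List[Arc] = []
    for (src, dst, _), sub in zip(b.arcs, subs):
        dst_out = out_final if dst == b_final else dst
        # Identify sub's start/final with the arc's source/destination.
        state_map = {0: src, sub.num_states - 1: dst_out}
        for s in range(1, sub.num_states - 1):
            state_map[s] = next_free
            next_free += 1
        for s, d, label in sub.arcs:
            d_out = state_map[d]
            if label == -1 and d_out != out_final:
                label = 0  # no longer enters a final state: becomes epsilon
            out_arcs.append((state_map[s], d_out, label))
    return SimpleFsa(out_final + 1, sorted(out_arcs))


# a[0] replaces epsilon arcs, a[1] is a word with the single phone 5,
# and the last element (optional silence phone 9) is used for -1 arcs.
a = [
    SimpleFsa(2, [(0, 1, -1)]),
    SimpleFsa(3, [(0, 1, 5), (1, 2, -1)]),
    SimpleFsa(3, [(0, 1, 9), (0, 2, -1), (1, 2, -1)]),
]
b = SimpleFsa(3, [(0, 1, 1), (1, 2, -1)])  # the word sequence "1"
result = index_fsa_vec(a, b)
```

In the result, the word's closing -1 arc turns into an epsilon arc because it now ends at an interior state, while the -1 arcs of the last element keep their label because they enter the final state.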

@danpovey
Contributor

@csukuangfj I'll talk to Kangwei about doing this, it would be a good first project that can let him understand the basics of k2 C++ programming.

@csukuangfj
Collaborator

> @csukuangfj I'll talk to Kangwei about doing this,

Cool!

@francisr

> If we do use word-position dependent phones, we'd probably want to simplify them into 2 classes instead of 5.
> I did some experiments with them but saw no clear gain, but this could be revisited. However, it can be done simply as a transformation on the dict, doubling the phone-set size, so may not need to be represented in the Dict itself.

Even if there's no WER gain with position-dependent phones, they're useful for fast lattice alignment.

@pzelasko
Collaborator Author

> @csukuangfj I'll talk to Kangwei about doing this, it would be a good first project that can let him understand the basics of k2 C++ programming.

Cool! In that case, I won't start working on it to avoid duplicated effort.

@danpovey
Contributor

danpovey commented May 12, 2021 via email

@pzelasko
Collaborator Author

Could be a good opportunity to get more familiar with k2's C++ code. I'll start with the Python part and we'll see from there.

@jtrmal

jtrmal commented May 12, 2021 via email

@jtrmal

jtrmal commented May 12, 2021 via email

@danpovey
Contributor

danpovey commented May 13, 2021 via email

@danpovey
Contributor

danpovey commented May 13, 2021 via email

@pzelasko
Collaborator Author

@jtrmal I won't find the time to work on it this week -- if you want to, feel free to start (just let me know if you do).

@jtrmal

jtrmal commented May 17, 2021 via email

@danpovey
Contributor

danpovey commented May 28, 2021 via email

@jtrmal

jtrmal commented May 28, 2021 via email

@danpovey
Contributor

danpovey commented May 28, 2021 via email


5 participants