No Praat TextGrid file output #9

Closed
loretoparisi opened this issue Mar 21, 2013 · 18 comments
@loretoparisi

Running a training and a sample test alignment, the procedure ends up with no TextGrid data and no errors:

macbookproloreto:Prosodylab-Aligner loreto$ ./align.py -s 44010 -t data data
Nearest viable SR is 40000 Hz
Initializing...
Training...
Modeling silence...
More training...
Realigning...
WARNING [-8221] InitPronHolders: Total of 296 duplicate pronunciations removed in HVite
More training...
Final aligning...
Making TextGrids...
Alignment complete.

No TextGrid file was found in the ./data/ directory.

@kylebgorman
Contributor

Hi Loreto,

I haven't seen that WARNING (from HTK, not the script) before. It strikes me as a weird one, though: does your pronunciation dictionary have a lot of ambiguity (i.e., multiple pronunciations for a single orthographic word)? Presumably this could result in search failure (which results in no TextGrids). In general, if HTK is not completely silent, something probably went wrong once the data was handed over to HTK.
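
For reference, an HTK-style pronunciation dictionary is a plain-text file with one word and its phone sequence per line. A hypothetical fragment (made-up entries, not from your dictionary) showing both a verbatim duplicate and a genuinely ambiguous entry:

RIVER  R IH1 V ER0
RIVER  R IH1 V ER0
RIVER  R IY1 V ER0

The first two lines are identical entries of the kind HVite reports removing; the first and third give one orthographic word two distinct pronunciations, which is the ambiguity I mean.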

See the recently posted issue about error codes; if this warning gives a non-zero return code, a future version of Prosodylab-Aligner will catch it. I am working on that issue, and on making error messages more informative in general, later this week.
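
A minimal sketch of that kind of error handling (not the actual align.py code; just an illustration of the idea using Python's subprocess module):

import subprocess

def run_hvite(args):
    # Run HVite and surface a failure as an exception instead of
    # silently continuing on to the TextGrid-writing step.
    retcode = subprocess.call(["HVite"] + args)
    if retcode != 0:
        raise subprocess.CalledProcessError(retcode, "HVite")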

What's the data like? How much data (in minutes)? It takes a considerable amount of data to train a new acoustic model (as mentioned in the README).

Kyle

@loretoparisi
Author

When using the bash script, it ends up with:

macbookproloreto:Prosodylab-Aligner loreto$ ./align_ex.sh data/River.wav data/River.lab
Initializing...
Aligning...
ERROR [+8522] LatFromPaths: Align have dur<=0
FATAL ERROR - Terminating program HVite
Making TextGrids...
Alignment complete.
mv: rename .dat/River.TextGrid to ./River.TextGrid: No such file or directory
Output is in River.TextGrid.

I've found that the error

LatFromPaths: Align have dur<=0

was due to an error in a conditional check in HTKLib/HRec.c, lines 1626 and 1651, where

labid != splabid

should be replaced with

labpr != splabid

After patching and recompiling HTK, it worked in some cases, but it has now stopped working in almost all of my other alignment tests.

@kylebgorman
Contributor

So HTK's first HVite pass failed, which is apparently why no TextGrids were generated.

As for the labid/labpr change in HTKLib/HRec.c: is this a documented bug? If so, has it been accepted into HTK? Could you provide a citation?

@loretoparisi
Author

Hi Kyle,
regarding the bug on HTK, here is the citation:

http://speechtechie.wordpress.com/2009/06/12/using-htk-3-4-1-on-mac-os-10-5/

I'm new to HTK, so I cannot say whether it has been accepted, but I will look into it.

Regarding the pronunciation dictionary, I guess I'm missing something in my workflow. I'm going to build the dictionary again.

I will get back to you as soon as I have regenerated the dictionary.

Thanks!

@kylebgorman
Contributor

I don't see any reason to assume that that bug is real. If changing a line makes it work, but breaks other things, it's probably not a bug. (And who is Felix, and why didn't he submit the bug to the HTK bugtracker?).

How much audio do you have for training?

@loretoparisi
Author

Yes, I suppose you're right 👍 and I don't know who Felix is. The audio is about 4 minutes long.

@kylebgorman
Contributor

FYI I just pushed some new features: try the newest version when you get a chance.

@loretoparisi
Author

Thanks. I tried it, and now (without improving the pronunciation dictionary) it catches the error:

macbookproloreto:Prosodylab-Aligner loreto$ ./align_ex.sh data/River.wav data/River.lab
Initializing... done.
Aligning... ERROR [+8522] LatFromPaths: Align have dur<=0
FATAL ERROR - Terminating program HVite
Traceback (most recent call last):
File "./align.py", line 733, in
scores_txt))
File "./align.py", line 436, in align_and_score
raise CalledProcessError(retcode, 'HVite')
subprocess.CalledProcessError: Command 'HVite' returned non-zero exit status 74
Alignment failed.
macbookproloreto:

@kylebgorman
Contributor

Thanks, this is easier to understand. Could you send me, or post, the audio and transcription files?

Kyle

@kylebgorman
Contributor

It would also be worthwhile to see whether that hack attributed to Felix improves things, or whether using HTK 3.4.0 helps.

Kyle

@loretoparisi
Author

Hi Kyle,
as a test, I tried The Penn Phonetics Lab Forced Aligner, http://www.ling.upenn.edu/phonetics/p2fa/,
and it gave me a forced alignment similar to the one I managed to get in my first tests with Prosodylab, which I was not able to reproduce later. Actually, I achieved that first result with HTK 3.4.1 and the previous version of Prosodylab-Aligner, but I guess I was lucky, because when I changed something in the configuration I got the error from HVite.

Considering that P2FA seems to work (no errors coming from HVite), we could say that the patched version of HVite (Felix's patch) worked, but I cannot be sure of that in any case.

Going to send you these results by email.

@kylebgorman
Contributor

Oh, it's music, you should have said so! I am unaware of any working forced alignment for music. This technology currently only works well for loud, close-mic'ed recordings (preferably in an anechoic chamber) with a minimal amount of background noise. This is true of just about any speech technology outside of state-of-the-art academic and commercial research systems.

If you wanted to align this, what you'd want to do is to take a look at software for making "a cappellas" and "instrumentals", and do that as a preprocessing step. This is outside of my area of expertise but I'm sure others have ideas about this.

You also generally get better alignments by chopping up things into smaller pieces. While the technology is relatively robust to long files in general, you often get a suboptimal alignment that way, because alignments are computed using heuristic search with a narrow "beam". Also, it helps to put "sil" tokens into the label file where you expect silences (but they have to be real silences, not just pauses in singing…). This is presumably the source of your error.
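
For instance, a hypothetical one-line .lab transcript for a chopped-up chunk, with silences marked only where the singing really stops (made-up words, just to show where the sil tokens go):

sil THE RIVER RUNS ON sil AND THE NIGHT IS LONG sil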

I took a look at the P2FA alignment (I was vaguely involved in that project, but that aligner has very serious problems: no training, numerous stale bugs, and it doesn't deal with silence in a standard fashion; this is why we developed Prosodylab-Aligner). While I don't have a ready copy of "Blue", I can quickly see from the TextGrid that the P2FA alignments failed quite spectacularly. There is no way there is a single 27-second long nasal sound in that song, for instance.

Kyle

On Mar 22, 2013, at 4:36 PM, Loreto Parisi loretoparisi@gmail.com wrote:

Hi Kyle,
attached the results from HVite. The text is attached as well.
The audio is taken from the song "River" by Joni Mitchell; it's too big to attach here, I guess.

Here is a sample: http://www.youtube.com/watch?v=bVwo9IQMWM0

<River.txt><out.TextGrid>

@kylebgorman
Contributor

Looking at the TextGrid now, I realize that it's true!

Of course, I'm applying your technology to what we could call the worst case, and I just realized that you didn't know (since it's what I'm doing, I assumed the whole world was doing the same - you know the engineer's way of thinking...).

So you are right when saying "You also generally get better alignments by chopping up things into smaller pieces."

In fact, my aim was to start with exactly that: making a suboptimal alignment for small parts of the wav, one part per sentence (i.e., phrase), so splitting the wav into as many files as there are lines of lyrics. That was the basic idea of my approach.

I assume that an "sp" silence is a real silence, but as far as I know the model should also account for short and/or long pauses. In that case I could add a short pause between words where needed and a pause between sentences; that was an approach I saw in a different research work by Masataka Goto, Hiromasa Fujihara et al.

@loretoparisi
Author

Don't know why, but since I sent it by email, it was posted under your name 🎱

@kylebgorman
Contributor

I have a good guess why: GitHub passes around fancy reply-to headers and you happened to include the ones associated with my username when you replied. But interesting engineering!

@kylebgorman
Contributor

Loreto, okay if I close this issue?

@loretoparisi
Author

Hi Kyle,
yes, I suppose so.
My idea is to apply further analysis to build a good training set, and then apply the Prosodylab scripts to that training set. As far as I know, Goto et al. applied this kind of solution (HVite at the core, but with separate training sets for the phoneme HMMs for each cluster of artists/genres/etc.). They also apply mean/variance calculations over the harmonic melody's F0 contour to segment the audio source, in order to get better knowledge of the SIL intervals.

It could be interesting to fork this process and continue the work with Prosodylab for the specific case of music, if that makes sense of course.

Of course this comment is outside the specific problem I raised here; it could be a separate discussion somewhere else, maybe.

Thanks! Let's see what happens.

@kylebgorman
Contributor

Best of luck, and feel free to fork it!
