Punctuations missing in downloaded cvs #1

Closed
Sadam1195 opened this issue Apr 1, 2021 · 3 comments

Comments

Sadam1195 commented Apr 1, 2021

Hi Pandya, thanks for creating a great resource. Everything works well, except that this package removes punctuation from the subtitles. As you can probably understand, that is very bad for training: without punctuation, the attention model will fail to converge, which leads to bad output. Do you know of any fix for that?

I am generating a dataset for a Spanish video with lang='es'.

Sadam1195 changed the title from "Punctuation missing in downloaded cvs" to "Punctuations missing in downloaded cvs" on Apr 1, 2021
hetpandya (Owner) commented Apr 2, 2021

Hi @Sadam1195, thanks a lot for the appreciation.
Thank you for pointing out the issue. I do understand the importance of punctuation when building speech synthesis algorithms.

> Everything works great apart from the fact that this package removes punctuation in the subtitles which I suppose you could understand is very bad things for training as without punctuations attention model will fail to converge hence bad output. Do you know any fix for that?

This happens because my package uses youtube-dl to download subtitles, and youtube-dl somehow skips the punctuation while downloading them, although it seems to be present in the video's captions. Sadly, the dependency doesn't seem to document anything about punctuation. I will look into adding a feature to my library that automatically punctuates the captions using deep learning. Meanwhile, I have found a repo that might help you: https://github.com/chrisspen/punctuator2. It seems to be trained only on English data, though. Let me know if I can help you.
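Before reaching for a punctuation-restoration model, it may help to confirm how much punctuation actually survives in the downloaded text. The following is only a minimal diagnostic sketch, not part of this package or of youtube-dl: the helper names `punctuation_ratio` and `looks_unpunctuated`, the 1% threshold, and the sample sentences are all assumptions made for illustration. It includes the Spanish inverted marks because the report concerns lang='es'.

```python
import string

# Marks to count. ASCII punctuation plus the Spanish inverted marks;
# this set is an illustrative assumption, not an exhaustive list.
PUNCTUATION = set(string.punctuation) | {"¿", "¡"}


def punctuation_ratio(lines):
    """Return the fraction of non-space characters that are punctuation."""
    total = 0
    marks = 0
    for line in lines:
        for ch in line:
            if ch.isspace():
                continue
            total += 1
            if ch in PUNCTUATION:
                marks += 1
    # Avoid dividing by zero when there is no text at all.
    return marks / total if total else 0.0


def looks_unpunctuated(lines, threshold=0.01):
    """Flag caption text whose punctuation ratio is below `threshold`."""
    return punctuation_ratio(lines) < threshold


# Hypothetical sample data: stripped captions vs. a punctuated reference.
captions = ["hola como estas", "muy bien gracias"]
reference = ["¿Hola, cómo estás?", "Muy bien, gracias."]

print(looks_unpunctuated(captions))
print(looks_unpunctuated(reference))
```

Running a check like this over the caption column of the generated file would show whether punctuation is missing everywhere or only for some videos, which could narrow down where it gets lost.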

Sadam1195 (Author) commented

I have fixed the punctuation issue with additional code. I will open a PR later, and maybe you can merge it.

hetpandya (Owner) commented

Sure, suggestions are welcome.
