"This American Life" dataset recipe #1140

Merged 4 commits on Sep 13, 2023

flyingleafe
Contributor

From here:

"This dataset consists of transcripts for 663 podcasts from the This American Life radio program from 1995 to 2020, covering 637 hours of audio (57.7 minutes per conversation) and an average of 18 unique speakers per conversation.

We hope that this dataset can serve as a new benchmark for the difficult tasks of speech transcription, speaker diarization, and dialog modeling on long, open-domain, multi-speaker conversations."

The website has been updated since the transcripts were published, so I wrote a simple URL scraper that works with the new website.
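For readers unfamiliar with this kind of scraper, here is a minimal sketch of the general idea, not the code in this PR. It assumes, hypothetically, that transcript pages are linked as `/<episode number>/transcript`; the pattern `TRANSCRIPT_PATH`, the class `TranscriptLinkParser`, and the helper `extract_transcript_links` are illustrative names only. It uses just the Python standard library and parses HTML that has already been fetched.

```python
import re
from html.parser import HTMLParser
from urllib.parse import urljoin

# Hypothetical pattern: assumes transcript pages live at /<episode number>/transcript.
TRANSCRIPT_PATH = re.compile(r"^/(\d+)/transcript/?$")


class TranscriptLinkParser(HTMLParser):
    """Collects hrefs from <a> tags that look like transcript links."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = {}  # episode number -> absolute URL

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if not href:
            return
        match = TRANSCRIPT_PATH.match(href)
        if match:
            # Resolve the relative path against the site root.
            self.links[int(match.group(1))] = urljoin(self.base_url, href)


def extract_transcript_links(html, base_url):
    """Return {episode number: transcript URL}, ordered by episode number."""
    parser = TranscriptLinkParser(base_url)
    parser.feed(html)
    return dict(sorted(parser.links.items()))
```

Keying the result by episode number deduplicates repeated links and makes it easy to spot missing episodes; the real site layout may differ, so the pattern would need adjusting.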

I also mirrored the archive from Kaggle to IPFS, so that it can be downloaded automatically without logging in with a Kaggle account.

@flyingleafe
Contributor Author

@pzelasko ^ weird that every single CI action is currently successful, but GitHub nevertheless says some are not...

Collaborator

@pzelasko pzelasko left a comment

Looks good! Don't worry about codecov, it's just for informational purposes and doesn't need to pass.

Before we merge, can you add an entry in docs/corpus.rst? Thanks!

@pzelasko pzelasko added this to the v1.17 milestone Sep 12, 2023
@pzelasko
Collaborator

Oh, you also need to import the recipe in lhotse/recipes/__init__.py

@flyingleafe
Contributor Author

@pzelasko done

Collaborator

@pzelasko pzelasko left a comment

Thanks!

@pzelasko pzelasko merged commit de3f48e into lhotse-speech:master Sep 13, 2023
8 of 10 checks passed