Switchboard Dialog Act Corpus added under `datasets/swda` #1678

gmihaila · 2021-01-03T03:53:41Z

Switchboard Dialog Act Corpus

Intro:
The Switchboard Dialog Act Corpus (SwDA) extends the Switchboard-1 Telephone Speech Corpus, Release 2,
with turn/utterance-level dialog-act tags. The tags summarize syntactic, semantic, and pragmatic information
about the associated turn. The SwDA project was undertaken at the University of Colorado Boulder in the late 1990s.

Details:
homepage
repo

I believe this is an important dataset to have, since no dialogue-act dataset has been added yet.

I didn't find a pull request template. I hope this information is enough.

For any support please contact me.

lhoestq

Really cool, thank you!

I left a few comments.

After changing the feature type to ClassLabel, you'll need to regenerate the dataset_infos.json file:

datasets-cli test ./datasets/swda --save_infos --all_configs --ignore_verifications

datasets/swda/README.md

datasets/swda/swda.py

Co-authored-by: Quentin Lhoest <42851186+lhoestq@users.noreply.github.com>

gmihaila · 2021-01-04T22:31:19Z

@lhoestq Thank you for your detailed comments! I fixed everything you suggested.

Please let me know if I'm missing anything else.

lhoestq

Thanks!

datasets/swda/README.md

lhoestq · 2021-01-05T13:56:48Z

It looks like the Transcript and Utterance objects are missing. Maybe we can mention it in the README? Or just add them? @gmihaila @bhavitvyamalik

bhavitvyamalik · 2021-01-05T16:58:21Z

Hi @lhoestq,
I'm working on this to add the full dataset.

gmihaila · 2021-01-05T17:16:41Z

> It looks like the Transcript and Utterance objects are missing. Maybe we can mention it in the README? Or just add them? @gmihaila @bhavitvyamalik

@lhoestq Any info on how to add them?

bhavitvyamalik · 2021-01-05T17:45:22Z

@gmihaila, instead of using the current repo, you should look into this. You can use the CSV files uploaded in this repo (swda.zip) to access the other fields and include them in this dataset. It also has one dependency, swda.py, which you can download separately and include in your dataset's folder so it can be imported while reading the CSV files.

Almost all the attributes of the Transcript and Utterance objects are of type str, int, or list. As for the trees attribute of the Utterance object, you can simply store it as a string, and the user can later convert it to an nltk.Tree object if needed.
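The approach described above (reading the CSV files from swda.zip and keeping the trees field as a plain string) could be sketched roughly as follows, using only the standard library. The column names (`utterance_index`, `trees`) and the added `source_file` key are assumptions for illustration; the real CSV header may differ.

```python
import csv
import io
import zipfile
from typing import Dict, Iterator

# Hypothetical names of columns that should be cast to int.
INT_COLUMNS = ("utterance_index",)


def iter_utterance_rows(zip_path_or_file) -> Iterator[Dict[str, object]]:
    """Yield one dict per CSV row found in the archive.

    Accepts a path or a binary file-like object, as zipfile.ZipFile does.
    """
    with zipfile.ZipFile(zip_path_or_file) as archive:
        for name in sorted(archive.namelist()):
            if not name.endswith(".csv"):
                continue  # skip non-CSV members
            with archive.open(name) as raw:
                text = io.TextIOWrapper(raw, encoding="utf-8", newline="")
                for row in csv.DictReader(text):
                    record: Dict[str, object] = dict(row)
                    for column in INT_COLUMNS:
                        if record.get(column):
                            record[column] = int(record[column])
                    # The parse tree stays a plain string; a user could later
                    # convert it with nltk.Tree.fromstring if nltk is installed.
                    record["source_file"] = name
                    yield record
```

Keeping the tree as a string avoids adding nltk as a dependency of the loading script, which matches the suggestion to leave the conversion to the user.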

gmihaila · 2021-01-06T13:55:47Z

@bhavitvyamalik Thank you for the clarification!

I didn't use that because it doesn't have the splits. I think using it in combination with what I used would help.

Let me know if I can help! I can make those changes if you don't have the time.

bhavitvyamalik · 2021-01-07T09:42:11Z

I'm a bit busy for the next 2 weeks, so I'd only be able to complete it by the end of January. Maybe you can start on it and I'll help you?
Also, I looked into the official train/val/test splits, and not all the files are there in the repo I used. I think we'll either have to skip them or put all of that into just train.

gmihaila · 2021-01-08T18:09:21Z

Yes, I can start working on it and ask you to do a code review.

Yes, not all files are there. I'll try to find papers that have the correct, full splits; if not, I'll do as you suggested.

Thank you again for your help, @bhavitvyamalik!
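The two fallbacks discussed in this exchange (skipping files that are absent from the repo, or putting everything into a single train split) could be sketched as below. This is only an illustration under assumptions: `resolve_splits`, its parameters, and the conversation IDs in the usage note are hypothetical and do not come from the PR.

```python
from typing import Dict, Iterable, List, Tuple


def resolve_splits(
    available: Iterable[str],
    official: Dict[str, List[str]],
    fallback_to_train: bool = False,
) -> Tuple[Dict[str, List[str]], List[str]]:
    """Map available conversation IDs onto the official split lists.

    Returns the resolved splits and the sorted list of IDs that the
    official lists mention but that are not available.
    """
    available_list = sorted(set(available))
    available_set = set(available_list)
    missing = sorted(
        {cid for ids in official.values() for cid in ids if cid not in available_set}
    )
    if missing and fallback_to_train:
        # Option 2: give up on the official splits and use one train split.
        return {"train": available_list}, missing
    # Option 1: keep the official splits, skipping files that are missing.
    resolved = {
        split: [cid for cid in ids if cid in available_set]
        for split, ids in official.items()
    }
    return resolved, missing
```

For example, calling `resolve_splits(["sw01", "sw02"], {"train": ["sw01", "sw03"], "test": ["sw02"]})` keeps `sw01` in train and `sw02` in test, and reports `sw03` as missing.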

Switchboard Dialog Act Corpus added under datasets/swda

7f89f22

lhoestq reviewed Jan 4, 2021

gmihaila and others added 3 commits January 4, 2021 10:56

Update datasets/swda/README.md

9d27465

Co-authored-by: Quentin Lhoest <42851186+lhoestq@users.noreply.github.com>

Update datasets/swda/README.md

d83e979

Co-authored-by: Quentin Lhoest <42851186+lhoestq@users.noreply.github.com>

Used datasets.ClassLabel and updated README.md

89688c4

lhoestq approved these changes Jan 5, 2021

Apply suggestions from code review

0a7a321

lhoestq merged commit 5ae870c into huggingface:master Jan 5, 2021

gmihaila deleted the swda branch January 5, 2021 15:46

gmihaila mentioned this pull request Jan 18, 2021

Added metadata and correct splits for swda. #1749

Merged
