Add machine translated multilingual STS benchmark dataset #2090

PhilipMay · 2021-03-20T13:28:07Z

Also see https://github.com/PhilipMay/stsb-multi-mt

PhilipMay · 2021-03-24T08:06:17Z

Hello dear maintainer, are there any comments or questions about this PR?

lhoestq

Really cool thank you :)

The dataset script looks all good ! Good job.
The dummy data and the dataset_infos.json are also perfect :)

For the README.md, can you follow the template? You can find it here:
https://github.com/huggingface/datasets/tree/master/templates

Ideally it would be cool to fill the info for those sections at least:

- Dataset Summary
- Languages
- Data Instances
- Data Fields
- Data Splits

Let me know if you have questions about this !
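
The section checklist above could be checked mechanically before pushing. The following is a minimal sketch, not part of this PR or of the `datasets` tooling: the helper name `missing_sections` is hypothetical, and it assumes the README uses markdown `#` headings whose titles match the template section names exactly.

```python
import re

# Sections the review asks to fill in at minimum.
REQUIRED_SECTIONS = [
    "Dataset Summary",
    "Languages",
    "Data Instances",
    "Data Fields",
    "Data Splits",
]


def missing_sections(readme_text, required=REQUIRED_SECTIONS):
    """Return the required section titles that have no markdown heading.

    Hypothetical helper: assumes headings are written as '#', '##', ... lines.
    """
    # Collect the titles of all markdown headings, lowercased for comparison.
    headings = {
        match.group(1).strip().lower()
        for match in re.finditer(
            r"^#+[ \t]+(.*?)[ \t]*$", readme_text, flags=re.MULTILINE
        )
    }
    # Keep the order of the required list so the report is stable.
    return [name for name in required if name.lower() not in headings]
```

Calling `missing_sections(open("README.md", encoding="utf-8").read())` returns the section titles that still need a heading; an empty list means all five are present. It only checks that the headings exist, not that the sections are filled in.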

PhilipMay · 2021-03-24T19:24:33Z

@lhoestq thanks for the feedback. I did not see the template.
I improved it...

PhilipMay · 2021-03-24T19:46:18Z

Should be clean for merge IMO.

lhoestq

Thank you !
I just added the table of contents and the missing sections in the dataset card :)

PhilipMay · 2021-03-29T12:49:57Z

@lhoestq CI is green. ;-)

lhoestq · 2021-03-29T13:00:31Z

Thanks again ! this is awesome :)

PhilipMay · 2021-03-29T13:24:42Z

Thanks for merging. :-)

PhilipMay added 11 commits March 20, 2021 11:13

first running version

9b08f78

add description and languages

5a15b4c

clean comments

fa0581b

black format cde

4bca1bc

add formatting

d98351e

add dataset info

6c33227

add 1st batch of dummy data

4003cc4

add 2nd batch of dummy data

ae54bb6

add readme

4b86e2f

format with black

e699d7c

fix flake8 issues

48407b4

PhilipMay changed the title ~~Machine translated multilingual STS benchmark dataset.~~ Add machine translated multilingual STS benchmark dataset Mar 20, 2021

PhilipMay added 6 commits March 20, 2021 16:37

add explicit dialect to reader

c1bbb70

format black

2d27198

improved description

10d6ada

add more description

4817fa2

more description

6636cf8

fix readme formatting

e3c2338

lhoestq reviewed Mar 24, 2021

update readme with more infos

550d715

PhilipMay force-pushed the stsb-multi-mt-dataset branch from 3b8a4ff to 550d715 Compare March 24, 2021 18:39

PhilipMay added 4 commits March 24, 2021 20:17

readme json formatting fix

447dfd6

license reference change

3757d8f

fix python formatting to black style

28027c5

change json formatting

f702cef

improved table formatting

fa05eab

add table of contents and missing sections

a240f20

lhoestq approved these changes Mar 29, 2021

lhoestq merged commit c98e4b8 into huggingface:master Mar 29, 2021
