Write some nice, short sentences for people to read. #341

Closed
mikehenrty opened this issue Jul 23, 2017 · 99 comments

Contributor

@mikehenrty mikehenrty commented Jul 23, 2017

Due to bugs #23 and #319 in our current sentence collection, we are trying to diversify and refresh our sentences in #333. We would love your help! If you would like to contribute your own writing to the Common Voice sentences, please put your sentences (one per line) in a publicly linkable document (e.g. Pastebin), then add a comment to this bug with a link to those sentences.

Criteria:

  • please write them yourselves. don't copy and paste from somewhere else.
  • try to make the sentences conversational, i.e. easy to read out loud.
  • you must agree to release your sentences to the public domain with a cc-0 license.
  • more than 50 sentences per link, but fewer than 500 please.
  • be nice, don't use offensive language. we aren't collecting that kinda material.
  • i'll be reading each one, and i may remove some but i will let you know why.

Thanks!
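
The size and format criteria above lend themselves to a mechanical pre-check before posting a link. The following is a minimal sketch in Python, not part of the Common Voice tooling: the function name `check_submission` and the duplicate check are assumptions added for illustration, and the limits are taken from the criteria above. It cannot judge originality, tone, or licensing, which still need a human reader.

```python
MIN_SENTENCES = 50   # criteria: "more than 50 sentences per link"
MAX_SENTENCES = 500  # criteria: "but less than 500"


def check_submission(text: str) -> list[str]:
    """Return a list of problems found in a one-sentence-per-line submission.

    Hypothetical helper for illustration only; an empty list means the
    submission passes these mechanical checks.
    """
    # One sentence per line; ignore blank lines and surrounding whitespace.
    sentences = [line.strip() for line in text.splitlines() if line.strip()]
    problems = []

    if len(sentences) <= MIN_SENTENCES:
        problems.append(f"{len(sentences)} sentences; need more than {MIN_SENTENCES}")
    if len(sentences) >= MAX_SENTENCES:
        problems.append(f"{len(sentences)} sentences; need fewer than {MAX_SENTENCES}")

    # Assumption: repeated lines are unwanted (not stated in the criteria).
    duplicates = len(sentences) - len(set(sentences))
    if duplicates:
        problems.append(f"{duplicates} duplicate line(s)")

    return problems
```

Running `check_submission` on the text of a paste before linking it would catch a list that is too short, too long, or padded with repeats.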

@ftrotter ftrotter commented Jul 24, 2017

Since this is a specific bug, and therefore addressable as a software item, I will repeat what I have offered elsewhere: existing public domain resources are good sources for sentences.

https://en.wikipedia.org/wiki/Wikipedia:Public_domain_resources

As an open healthcare data person, I would obviously love the opportunity to inject layman's healthcare terms into your sentence data set. Assuming no one else is available to incorporate the various resources that I am discussing here, I might be able to devote some time to scraping the sources that have solid APIs and running them through per-source filters that ensure they contain no unusual acronyms or other confusing industry jargon.
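
A per-source filter of the kind described above might start as a simple pattern match. This is a hypothetical sketch, not code from this thread: it treats any run of two or more capital letters as an acronym, which is a crude heuristic (it also rejects ordinary words written in all caps), and the function names are made up for illustration. Filtering real industry jargon would likely need per-source word lists.

```python
import re

# Assumption: a token of two or more consecutive capitals is an acronym.
ACRONYM = re.compile(r"\b[A-Z]{2,}\b")


def has_acronym(sentence: str) -> bool:
    """Return True if the sentence contains an all-caps token such as 'MRI'."""
    return ACRONYM.search(sentence) is not None


def filter_sentences(sentences):
    """Keep only the sentences that contain no acronym-like tokens."""
    return [s for s in sentences if not has_acronym(s)]
```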

I would recommend that you develop a corpus that details the following information for each source of sentences:

  • The sentence itself
  • The url source for the sentence
  • A reference to the rule (or contribution agreement to the cc-0 license) that provides evidence that the sentence in question is available under the public domain.
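
As one possible shape for such a corpus, each entry could be a small record carrying the three fields listed above. This is only a sketch under assumptions: the class name `SentenceRecord`, the field names, the JSON Lines output, and the example values (including the `example.org` URL) are all hypothetical placeholders, not an agreed format.

```python
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class SentenceRecord:
    sentence: str          # the sentence itself
    source_url: str        # the URL the sentence came from
    license_evidence: str  # rule or CC0 agreement showing public-domain status


def to_json_lines(records) -> str:
    """Serialize records as JSON Lines: one JSON object per line."""
    return "\n".join(json.dumps(asdict(r), ensure_ascii=False) for r in records)


# Placeholder values for illustration only.
record = SentenceRecord(
    sentence="Drink plenty of water after exercise.",
    source_url="https://example.org/health-tips",
    license_evidence="placeholder: reference to the rule or agreement",
)
print(to_json_lines([record]))
```

Keeping the evidence next to each sentence means a reviewer can audit the licensing of any single line without tracing it back through a scraper.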

I think I could write some scripts that would generate a few million (at least) sentences that meet some basic rules and that I could demonstrate are public domain. Or someone else could, if I fail to deliver (which, I should warn, is a frequent occurrence; this is only tangentially related to my day job, after all).

Let me know if this would be helpful.

Regards,
-FT

@ajaydee ajaydee commented Jul 24, 2017

I'm back! Here's another hundred:

https://pastebin.com/1CU7GQYs

CC-0 or whatever license you need. There's a good few simple sentences.

@akshit13 akshit13 commented Jul 24, 2017

Hope this helps.
https://pastebin.com/BRsfLuNt

mikehenrty added a commit that referenced this issue Jul 24, 2017
Contributor Author

@mikehenrty mikehenrty commented Jul 24, 2017

Thanks @ftrotter, that is very helpful. Let's discuss that on Discourse, and see how many contributions we can get from this thread.

mikehenrty added a commit that referenced this issue Jul 24, 2017
mikehenrty added a commit that referenced this issue Jul 24, 2017
Contributor Author

@mikehenrty mikehenrty commented Jul 24, 2017

Thanks @akshit13 and @ajaydee, I added your contributions.

@orschiro

This comment has been minimized.

@michal-hradis michal-hradis commented Jul 25, 2017

This is a randomized set of sentences from some of my older "computer vision" reviews. I waive all rights to the text. Use it any way you like.
https://pastebin.com/DTezZ7rA

However, I would urge you to use some large diverse contemporary corpus.

mikehenrty added a commit that referenced this issue Jul 25, 2017
mikehenrty added a commit that referenced this issue Jul 25, 2017
mikehenrty added a commit that referenced this issue Jul 25, 2017
Contributor Author

@mikehenrty mikehenrty commented Jul 25, 2017

Thanks @orschiro and @michal-hradis!

> However, I would urge you to use some large diverse contemporary corpus.

We are looking into that on the PR below. In the meantime, we are experimenting with gathering our corpus as we go.
#304

Contributor

@jf99 jf99 commented Jul 25, 2017

Here are another 300 sentences:
https://pastebin.com/Qf7Ykcbz

Contributor Author

@mikehenrty mikehenrty commented Jul 26, 2017

@jf99, some good ones in there.

> The servers of the common voice project couldn't handle the heavy load.

We're working on it! :D

mikehenrty added a commit that referenced this issue Jul 26, 2017

@psullivan6 psullivan6 commented Jul 27, 2017

Some more for ya:
https://pastebin.com/tCWPrxZJ

@Sposito

This comment has been minimized.

mikehenrty added a commit that referenced this issue Jul 28, 2017
mikehenrty added a commit that referenced this issue Jul 28, 2017
mikehenrty added a commit that referenced this issue Jul 28, 2017
Contributor Author

@mikehenrty mikehenrty commented Jul 28, 2017

Thanks @psullivan6 and @Sposito!

@fastrizwaan fastrizwaan commented Jul 29, 2017

Contributor Author

@mikehenrty mikehenrty commented Jul 29, 2017

> why not take text from wikipedia? https://en.wikipedia.org/wiki/Wikipedia:Reusing_Wikipedia_content#Text_content

In my experience, it is really hard to find interesting, readable sentences on Wikipedia. The language there tends to be quite formal and sometimes awkward to speak. I built some tools to chop up text here, but never got around to adding Wikipedia sentences to this project. That said, I'm always open to accepting contributions. 👍

The other thing I like about asking people for sentence donations (which is just an experiment for now) is that we get a lot of clever, interesting messages. Then someone halfway around the world will read these messages out loud. I think that's so cool :)

Contributor

@hmitsch hmitsch commented Jul 25, 2018

Hi @jf99, if you reach out to me via email, I can probably help you. Feel free to contact me at hmitsch@mozilla.com

Thank you!
Best regards,
Henrik

@andrewkrug andrewkrug commented Aug 2, 2018

Hi @jf99,

Did you know that you can use GitHub with 2FA without providing a phone number?
https://help.github.com/articles/securing-your-account-with-two-factor-authentication-2fa/

This article seems to indicate that you have a choice of 2FA methods:

  • SMS
  • TOTP (an app like Google Authenticator or FreeOTP)
  • or even a U2F token like a YubiKey.

I know I personally use 2FA for every service on the web that I can because I consider it to be good personal operational security. A single account compromise can lead to an attacker pivoting into other parts of your life.

Further, when you do use 2FA to sign into things like GitHub, the session lifetimes are pretty long, so you don't have to pull out the 2FA device again for what often feels like months between sign-ins.

It sounds like @hmitsch is doing an awesome job getting you some help to go back to email as a single factor, and if that's what you decide works best for you, that is great! I just wanted to highlight the ways you can keep 2FA + GitHub a little more usable.

Thanks for reading.

Your friendly neighborhood security engineer,

Andrew

Contributor

@bagustris bagustris commented Aug 28, 2018

I am adding 100 sentences in the Indonesian language here. I will add more there.

@AcAntellAno AcAntellAno commented Sep 1, 2018

I created 50 more sentences. I hope it helps:

https://pastebin.com/J5mke3fz

Contributor

@YuetAu YuetAu commented Sep 15, 2018

zh-hk
General: https://pastebin.com/fzzesRfB
God of Gamblers II (A zh-hk movie)(Rewritten for better pronunciation): https://pastebin.com/GAEDrjrY
Hope this helps.

@plisieck plisieck commented Sep 16, 2018

50 Polish sentences:

https://pastebin.com/zJktPtau

@DonHege DonHege commented Sep 21, 2018

100 German sentences from me:

https://pastebin.com/rT5JtUUs

Contributor

@jf99 jf99 commented Sep 21, 2018

@DonHege Thanks for your contribution! I have reviewed your sentences and made a pull request. Have a look! #1481

@nsb-xps nsb-xps commented Oct 5, 2018

61 English sentences:
https://pastebin.com/8xWgCaeJ

@MichaelNMaggs MichaelNMaggs commented Oct 7, 2018

I submitted two sets of (British English) sentences on 29 Oct and 7 Nov 2017. So far as I'm aware they have not been used. Were they unsuitable in some way?

https://pastebin.com/ZpWty4LR
https://pastebin.com/vCLjK3DQ

Contributor Author

@mikehenrty mikehenrty commented Oct 8, 2018

Hi @MichaelNMaggs. Thanks again for your continued help with this project.

Right now we are building a tool to collect public domain sentences for Common Voice. The discussion around that tool happened here:
https://discourse.mozilla.org/t/we-want-your-feedback-improving-the-sentence-collection/30358

Since that discussion, we have started to create a "Sentence Collection tool" (basically a website to submit and review sentences). Once that tool is in good enough shape, we are going to go back through this entire thread (and other places) and get all the sentences into that tool for review and eventually into Common Voice itself to be spoken. This process is taking some time, but we are fully committed to it.

Thanks for your patience!

@nixczhou nixczhou commented Nov 18, 2018

I want to contribute 53 sentences (Traditional Chinese)
https://pastebin.com/7JTZ1ncy

Collaborator

@peiying2 peiying2 commented Nov 28, 2018

@Fatimuskii Thank you for your interest. Here are some guidelines regarding contributing to this project. You can put your questions here to get answers more quickly: https://discourse.mozilla.org/t/readme-how-to-see-my-language-on-common-voice/31530/10.

@davidak davidak commented Dec 4, 2018

I think this and similar issues can be closed, since a tool and guidelines are being developed, as discussed in https://discourse.mozilla.org/t/we-want-your-feedback-improving-the-sentence-collection/30358.

@train255 train255 commented Dec 6, 2018

I am contributing 50 Vietnamese sentences:
https://pastebin.com/Dmr2a3BP

@Fatimuskii Fatimuskii commented Dec 26, 2018

https://pastebin.com/8wq3cSm1
More than 50 sentences in Spanish

Collaborator

@nukeador nukeador commented Jan 2, 2019

Hello everyone and happy new year.

As we have been discussing on the Common Voice Discourse, we want to make sure sentences are properly reviewed and pass some automated quality checks so that they are useful for the algorithms.

During the next 6 days we are doing some quality control on the new sentence collection tool, and we hope to have it in beta form by the middle of this month so all of you can submit and review sentences yourselves:

https://discourse.mozilla.org/t/sentence-collection-tool-development-topic/33390/5

(This means we won't be using PRs or this issue to collect sentences. Please keep them in any format you can copy and paste; as soon as the tool is ready, we'll inform everyone.)

Cheers.

@Fatimuskii Fatimuskii commented Jan 3, 2019

50 more sentences in Spanish
https://pastebin.com/rFuBFfJb

@Fatimuskii Fatimuskii commented Jan 4, 2019

50 more sentences in Spanish
https://pastebin.com/wkRC2Xij

@Fatimuskii Fatimuskii commented Jan 4, 2019

50 more sentences
https://pastebin.com/G2G0imMR

@Fatimuskii Fatimuskii commented Jan 4, 2019

50 more in Spanish
https://pastebin.com/pYQpCFSH

@Fatimuskii Fatimuskii commented Jan 4, 2019

50 more sentences in Spanish https://pastebin.com/Mxsf1CJH

Collaborator

@nukeador nukeador commented Jan 4, 2019

Fatima, thanks for your efforts, please check my previous message.

Keep collecting these sentences and we will let you know once the sentence collector tool is ready :-)

Collaborator

@nukeador nukeador commented Jan 28, 2019

Hello everyone,

I'm super excited to announce that after a few months of intense work, today we launch the Sentence Collection Tool site for all Common Voice contributors. We consider this a first beta version, but it is fully functional after some weeks of testing.

All sentences submitted, reviewed and validated using this tool will be incorporated into the main Common Voice site. We will point to this as the way to submit sentences to the project moving forward.

What is this tool?

This tool facilitates submitting, reviewing and validating sentences in different locales so that they can be incorporated into the main Common Voice site, where people can read them and donate their voice.

Why this tool?

The previous process for gathering sentences was a bit unstructured, with too many places to go and unclear guidelines. In order for sentences to be useful for the Deep Speech algorithm, there are certain "hard requirements" that this tool enforces to avoid problems in the future.

We also aim to keep improving the tool to make the experience even easier for everyone!

How can I start using it?

Just go to the Sentence Collection tool site and start submitting and reviewing sentences in your locales. Make sure you check the How-to page to understand how to use the tool.

Where do I report issues or ideas?

Our GitHub project page is the best place to report any issues with the site. If you want to discuss an idea or new feature with the rest of the community, you can do that on our Discourse.

How can I help with the development?

This tool is developed by Common Voice volunteers. Anyone can get involved in the development; you just need to know React or Kinto and chime in on our GitHub project to learn more.

If you are not technical, don't worry! We usually open conversations on Discourse to give everyone the chance to influence the direction of the project.

Special thanks

I would like to extend special recognition and thanks to some of the people responsible for getting this tool launched.

Thank you everyone!
