Due to bugs #23 and #319 in our current sentence collection, we are trying to diversify and refresh our sentences in #333. We would love your help! If you would like to contribute your own writing to the Common Voice sentences, please put your sentences (one per line) in a publicly linkable document (e.g. Pastebin), then add a comment to this bug with a link to those sentences.
This is a specific bug, and is therefore addressable as a software item. Elsewhere, I have already offered up public domain resources as good sources for sentences:
As an open healthcare data person, I obviously would love the opportunity to inject layman's healthcare terms into your sentence data set. Assuming no one else is available to incorporate the various resources that I am discussing here, I might be able to devote some resources towards scraping the sources that have solid APIs, and running the results through per-source filters that ensure they do not contain unusual acronyms or other confusing industry jargon.
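To make the per-source filter idea concrete, here is a minimal sketch. Everything in it is an assumption for illustration: the function names, the word limit, the "two or more capital letters means acronym" rule, and the jargon list are all hypothetical, and none of it reflects rules the Common Voice project actually enforces.

```python
import re

# Hypothetical rules, chosen only for illustration.
ACRONYM = re.compile(r"\b[A-Z]{2,}\b")  # runs of capitals such as "MRI"
DIGIT = re.compile(r"\d")               # digits are ambiguous to read aloud
MAX_WORDS = 14                          # assumed upper bound on sentence length


def is_speakable(sentence, jargon=frozenset()):
    """Return True if the sentence passes the assumed basic rules."""
    words = sentence.split()
    if not words or len(words) > MAX_WORDS:
        return False
    if ACRONYM.search(sentence) or DIGIT.search(sentence):
        return False
    # Compare lower-cased words, stripped of punctuation, with the jargon list.
    lowered = {w.strip(".,;:!?\"'()").lower() for w in words}
    return lowered.isdisjoint(jargon)


def filter_sentences(lines, jargon=frozenset()):
    """Keep unique, non-empty lines that pass is_speakable, in input order."""
    seen = set()
    kept = []
    for line in lines:
        s = line.strip()
        if s and s not in seen and is_speakable(s, jargon):
            seen.add(s)
            kept.append(s)
    return kept
```

Each source would get its own jargon set, so a healthcare source could block terms that are perfectly fine in, say, a cooking source.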
I would recommend that you develop a corpus that details the following information for each source of sentences:
I think I could write some scripts that would generate a few million (at least) sentences that meet some basic rules and that I could demonstrate are public domain. Or someone else could, if I fail to deliver (which, I should warn, is a frequent occurrence; this is only tangentially related to my day job, after all).
Let me know if this would be helpful.
There you go. :-)
Why not take text from Wikipedia? https://en.wikipedia.org/wiki/Wikipedia:Reusing_Wikipedia_content#Text_content
In my experience, it is really hard to find interesting, readable sentences in Wikipedia. The language there tends to be quite formal and sometimes awkward to speak. I built some tools to chop up text here, but never got around to adding Wikipedia sentences to this project. That said, I'm always open to accepting contributions.
The other thing I like about asking people for sentence donations (which is just an experiment for now) is that we get a lot of clever, interesting messages. Then, someone halfway around the world will read these messages out loud. I think that's so cool :)
Hi @jf99 ,
Did you know that you can use GitHub with 2FA without providing a phone number?
I know I personally use 2FA for every service on the web that I can because I consider it to be good personal operational security. A single account compromise can lead to an attacker pivoting into other parts of your life.
Further, when you do use 2FA to sign in to things like GitHub, the session lifetimes are pretty long, so you don't have to pull out the 2FA device again for what often feels like months between sign-ins.
It sounds like @hmitsch is doing an awesome job getting you some help to go back to email as a single factor, and if that's what you decide works best for you, that is great! I just wanted to highlight the ways you can keep 2FA + GitHub a little more usable.
Thanks for reading.
Your friendly neighborhood security engineer,
Hi @MichaelNMaggs. Thanks again for your continued help with this project.
Right now we are building a tool to collect public domain sentences for Common Voice. The discussion around that tool happened here:
Since that discussion, we have started to create a "Sentence Collection tool" (basically a website to submit and review sentences). Once that tool is in good enough shape, we are going to go back through this entire thread (and other places) and get all the sentences into that tool for review and eventually into Common Voice itself to be spoken. This process is taking some time, but we are fully committed to it.
Thanks for your patience!
@Fatimuskii Thank you for your interest. Here are some guidelines regarding contributing to this project. You can put your questions here to get answers more quickly: https://discourse.mozilla.org/t/readme-how-to-see-my-language-on-common-voice/31530/10.
I think this and similar issues can be closed, since a tool and guidelines have been developed as discussed in https://discourse.mozilla.org/t/we-want-your-feedback-improving-the-sentence-collection/30358.
Hello everyone and happy new year.
As we have been discussing on the Common Voice Discourse, we want to make sure sentences are properly reviewed and that there are some automated quality checks, so they are useful for the algorithms.
Over the next 6 days we are doing some quality control on the new sentence collection tool, and we hope to have it in beta form by the middle of this month so all of you can submit and review sentences yourselves:
(This means we won't be using PRs or this issue to collect sentences. Please keep your sentences in any format you can copy and paste; as soon as the tool is ready, we'll inform everyone.)
I'm super excited to announce that after a few months of intense work, today we launch the Sentence Collection Tool site for all Common Voice contributors. We are considering this a first beta version, but fully functional after some weeks of testing.
All sentences submitted, reviewed and validated using this tool will be incorporated into the main Common Voice site. We will point to this as the way to submit sentences to the project moving forward.
What is this tool?
This tool facilitates the task of submitting, reviewing and validating sentences in different locales so they can be incorporated into the main Common Voice site, where people can read them and donate their voice.
Why this tool?
The previous process to gather sentences was a bit unstructured, with too many places to go and unclear guidelines. In order for sentences to be useful for the Deep Speech algorithm, there are certain "hard requirements" this tool enforces to avoid problems in the future.
We also aim to keep improving the tool to make the experience even easier for everyone!
How can I start using it?
Where do I report issues or ideas?
How can I help with the development?
This tool is developed by Common Voice volunteers. Anyone can be involved in the development; you just need to know React or Kinto and chime in on our GitHub project to learn more.
If you are not technical, don't worry! We usually open conversations on Discourse to give everyone the chance to influence the direction of the project.
I would like to extend special recognition and a thank you to some people who have been responsible for getting this tool launched.
Thank you everyone!