
We need a larger corpus of text input #23

Closed · mikehenrty opened this issue May 11, 2017 · 31 comments
@mikehenrty (Contributor) commented May 11, 2017

Right now, we only have 3K sentences or so.

Three possible new sources are:

  1. Wikimedia text
  2. BYU Corpus
  3. Leipzig Corpora
@geroter (Collaborator) commented May 11, 2017

Can we get a measure of success here? How big do we need?

@mikehenrty (Contributor, Author) commented May 11, 2017

This? http://wortschatz.uni-leipzig.de/en/download/

Added to list.

Can we get a measure of success here? How big do we need?

Right, good question. From what I understand from @kdavis-mozilla, it's not super useful to have a lot of different people repeating the same phrases over and over (although a few repeats are OK). We can put the number of required sentences at around the number of expected users. One possible projection is:

100K users = 100K sentences
@geroter (Collaborator) commented May 11, 2017

Shouldn't we work back from number of hours?

10K hours = 600K mins = 12M sentences (@ 3 secs per sentence)

@mikehenrty (Contributor, Author) commented May 11, 2017

10K hours = 600K mins = 12M sentences (@ 3 secs per sentence)

Yes, we will eventually need that many to fulfill our goal, but this bug is about June activation, so I'm not setting sights quite so high yet.

@kdavis-mozilla (Collaborator) commented May 12, 2017

One must also juggle the extra ball of legality.

For example the Leipzig Corpora terms of use state

Any data provided by Projekt Deutscher Wortschatz are subject to copyright. Permission for use is granted free of charge solely for non-commercial personal and scientific purposes

and further

...any commercial use of the data obtained is forbidden without explicit written permission by the copyright owner

So on our time scale Leipzig is a no. Similarly, the BYU Corpus is targeted at low-usage academics:

We are committed to keeping the BYU corpora free -- for those universities that have light to moderate use, and which cannot afford a license. As a result, there is no cost to use the corpora, as long as your class or department has less than 250 queries each day.

Wikipedia seems the best bet, but then one also has to worry about copyright[1] there too:

You are free to:
• Share and Reuse our articles and other media under free and open licenses.
• ..

Under the following conditions:
• Lawful Behavior – You do not violate copyright or other laws.
• ..

@geroter (Collaborator) commented May 12, 2017

I think this might be a place to ask for Brian's help in interpreting these terms. Let's add that to our next agenda with him -- maybe @mikehenrty you can ping him with an email in advance?

@kdavis-mozilla (Collaborator) commented May 12, 2017

@geroter I think he can help with other ToS, but these are pretty clear in forbidding our use, and I don't think we have time to parse subtleties.

The easiest way out is to simply use text for which the copyright has expired in all countries.

Gutenberg[1] houses texts for which the copyright has expired in the US[2], which makes it a good starting point.

For example, the Gutenberg[3] license, in its non-normative text, states with respect to texts for which the US copyright has expired:

If you strip the Project Gutenberg license and all references to Project Gutenberg from the text, you are left with a text unprotected by U.S. intellectual property law. You can do anything you want with that text in the United States and most of the rest of the world.

This is the kind of clarity we need.

We don't have time to parse Wikipedia into copyrighted and non-copyrighted texts, a task which, even if we had the time, would be hard to do.
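For illustration, a minimal sketch of that stripping step, assuming the usual "*** START OF ... PROJECT GUTENBERG EBOOK ***" / "*** END OF ..." marker lines (marker wording varies between books, so the matching here is deliberately loose):

```python
def strip_gutenberg_boilerplate(text: str) -> str:
    """Keep only the text between the START/END marker lines.

    Assumes the conventional Project Gutenberg markers; returns the
    input unchanged if no START marker is found.
    """
    lines = text.splitlines()
    start = end = None
    for i, line in enumerate(lines):
        upper = line.upper()
        if "PROJECT GUTENBERG" in upper and "START OF" in upper and start is None:
            start = i + 1  # body begins after the START marker
        elif "PROJECT GUTENBERG" in upper and "END OF" in upper and start is not None:
            end = i        # body ends before the END marker
            break
    return "\n".join(lines[start:end]) if start is not None else text
```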


@mbebenita (Contributor) commented May 15, 2017

For the moment, I've written a script to extract sentences from Project Gutenberg and curate them based on length and a reading-complexity metric. The current sentence set is from The War of the Worlds.

https://github.com/mozilla/voice-web/blob/master/tools/gen.js


@mbebenita (Contributor) commented May 15, 2017

The readability metric is Flesch–Kincaid: https://en.wikipedia.org/wiki/Flesch%E2%80%93Kincaid_readability_tests

I didn't actually test to see how well it works.
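For reference, a small sketch of that kind of curation in Python, using the Flesch reading-ease formula with a naive vowel-group syllable count (the thresholds and the heuristic are assumptions, not what gen.js actually does):

```python
import re

def count_syllables(word: str) -> int:
    # Naive heuristic: each run of vowels counts as one syllable.
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def flesch_reading_ease(sentence: str) -> float:
    words = re.findall(r"[A-Za-z']+", sentence)
    if not words:
        return 0.0
    syllables = sum(count_syllables(w) for w in words)
    # Flesch reading ease for a single sentence (higher = easier to read).
    return 206.835 - 1.015 * len(words) - 84.6 * (syllables / len(words))

def keep(sentence: str, min_words=3, max_words=14, min_ease=60.0) -> bool:
    words = re.findall(r"[A-Za-z']+", sentence)
    return min_words <= len(words) <= max_words and flesch_reading_ease(sentence) >= min_ease

print(keep("No one would have believed it."))  # short and easy -> True
```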

@reinhart1010 (Contributor) commented Jul 2, 2017

Hello, how about adding some example sentences from Wiktionary and the GNU Collaborative International Dictionary of English? They contain many short sentences, such as these from the GNU Collaborative International Dictionary of English (from the definition of "master"):

Little masters, certain German engravers of the 16th century, so called from the extreme smallness of their prints.

Master in chancery, an officer of courts of equity, who acts as an assistant to the chancellor or judge, by inquiring into various matters referred to him, and reporting thereon to the court.

Master of arts, one who takes the second degree at a university; also, the degree or title itself, indicated by the abbreviation M. A., or A. M.

Here are more example sentences from Wiktionary (from the definition of "whether"):

He chose the correct answer, but I don't know whether it was by luck or by skill.

Do you know whether he's coming?

He's coming, whether you like it or not.

@edunham commented Jul 19, 2017

The FAQ says that this corpus will be used to train voice recognition. One thing that I frequently say into voice recognition tools, but haven't seen in any of the example sentences, is numbers. It might be worth prioritizing the addition of sentences with spoken numbers in them so that the corpus is more useful to real-world AI.
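One hedged way to act on this: expand numerals in candidate sentences into words, so the written prompt matches what a reader would actually say. This sketch assumes the third-party num2words package; the project may well prefer a different normalization:

```python
import re
from num2words import num2words  # pip install num2words

def expand_numbers(sentence: str) -> str:
    # Replace each run of digits with its spoken English form.
    return re.sub(r"\d+", lambda m: num2words(int(m.group())), sentence)

print(expand_numbers("The train leaves at 9 and arrives at 12."))
# -> "The train leaves at nine and arrives at twelve."
```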

@rugk commented Jul 22, 2017

Indeed you need more sources. The current sentences all sound like "Arabs meet an Englishman in the desert" and could all belong to one fairy tale, or maybe a fable (considering the animals, which often act like humans).

mikehenrty added a commit that referenced this issue Jul 23, 2017

Merge pull request #340 from mozilla/sentences
refresh the text corpora, fixes #333, addresses #23 and #319
@12people commented Jul 23, 2017

Consider looking at CC0 (or, if allowed, CC-BY or CC-BY-SA) blogs, like http://dougbelshaw.com/blog/, https://people.gnome.org/~michael/, and http://blog.ninapaley.com/.

Unfortunately, it's hard to find such blogs, so perhaps also ask people to submit CC0-licensed blogs (or to license their blogs that way).

@Djfe (Contributor) commented Jul 24, 2017

Maybe you could get hold of:
http://catalog.elra.info/product_info.php?products_id=1032
http://catalog.elra.info/product_info.php?products_id=1033

Both datasets have been used by RASR in the past.

And ELDA has already been reaching out to you ;)
Discourse thread


@ftrotter commented Jul 24, 2017

Hi,
I left this comment on your discussion board, but I thought it might be helpful here as well.

Hello. Thank you for an amazingly simple implementation of a wonderful idea. Rather than pulling in writing at random, you might consider using some already-qualified public domain resources. There is a list of such resources here: https://en.wikipedia.org/wiki/Wikipedia:Public_domain_resources

Of course, while you could consider any resource from that page, I would ask that you specifically consider including healthcare content that uses simple medical terminology. Not the kind of things that doctors say (that is too technical and specific), but the kinds of things that patients might like to read and/or discuss in everyday terms.

An amazing resource for this, written in lay terms, is the MedlinePlus website. Not everything on MedlinePlus is public domain, but they specify what is covered and what is not here:
https://medlineplus.gov/copyright.html

Note that the MedlinePlus encyclopedia is licensed content and is therefore not public domain. But the Health Topics are public domain:
https://medlineplus.gov/acousticneuroma.html
As are the FAQ answers:
https://medlineplus.gov/faq/disease.html
And the MedlinePlus magazine:
https://medlineplus.gov/magazine/issues/summer17/articles/summer17pg13-14.html

Obviously, as a healthcare data journalist I have an ax to grind here, but there is a huge number of English sentences here that are not medically contextual. For instance, the sentence "The people who write the materials are the ones who decide if they are easy to read." is found on one of the FAQ pages. Moreover, while the terms in MedlinePlus are intended to be "layman's terms," they include words like "Alzheimer," which are common enough words that will likely show huge pronunciation differences.

I should note that the sections on women's health topics in MedlinePlus are likely to include more sentences with female pronouns. I will make that comment on the other GitHub page as well.

Given that you are obviously also interested in resources that are not medical, I would also suggest the Federal Register, which is likewise free of copyright.
Example text:
https://www.gpo.gov/fdsys/pkg/FR-2015-01-02/html/2014-30754.htm

It should be relatively simple to run a script that removes all sentences containing the gobbledygook internal reference system, as well as acronyms (NIST, NASA, etc.). Once that is done, this would be a huge corpus of relatively clear sentences. If you wanted to ensure the sentences were even more "common language-full," you might simply exclude everything except the contents of the executive summaries of the articles, which are intended to be relatively jargon-free.
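A rough sketch of such a filter, with made-up citation patterns (the real Federal Register reference formats would need checking):

```python
import re

# Hypothetical patterns: numbered citations like "80 FR 1234" or
# "42 U.S.C. 300", plus any all-caps acronym (NIST, NASA, ...).
CITATION = re.compile(r"\b\d+\s+(?:U\.S\.C\.|C\.?F\.?R\.?|FR)\b")
ACRONYM = re.compile(r"\b[A-Z]{2,}\b")

def is_plain_sentence(sentence: str) -> bool:
    # Keep only sentences with no citations and no acronyms.
    return not (CITATION.search(sentence) or ACRONYM.search(sentence))

candidates = [
    "The agency will accept comments until March.",
    "See 80 FR 1234 for details.",
    "NASA published the final rule.",
]
print([s for s in candidates if is_plain_sentence(s)])
# -> ['The agency will accept comments until March.']
```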

If that is still not enough, you should consider including the text of comments made on various regulations at regulations.gov. Most people are unaware that the comments they make on regulations become public domain themselves. See here: https://www.regulations.gov/userNotice

This data is available via an API, and here is an example:
https://www.regulations.gov/document?D=VA-2016-VHA-0011-184061

HTH,
-FT

@mikehenrty (Contributor, Author) commented Jul 24, 2017

From @rugk

Maybe you can use http://shtooka.net/ as another source.

@rugk commented Jul 28, 2017

However, http://shtooka.net/ already contains the voice examples, so it does not really belong to this issue…

@Djfe (Contributor) commented Jul 31, 2017

Yeah, but you can integrate the text together with the correlated voice samples, and then collect even more voice samples for the same text with this service; that's why it belongs to this particular issue/topic.

Mozilla will probably integrate our ideas from this issue from time to time, so it's helpful to have all suggestions in one place.

mikehenrty referenced this issue Aug 10, 2017: fix grammar #400 (merged)

mikehenrty added a commit that referenced this issue Oct 6, 2017

@Franck-Dernoncourt commented Oct 20, 2017

Regarding the use of Wikipedia, I'd point out that the Spoken Wikipedia Corpora exist: http://nats.gitlab.io/swc/ (CC BY-SA 4.0 license).

@bmilde commented Jan 12, 2018

In an ideal speech corpus, most utterances would be unique. Right now the small text corpus is a disaster for benchmarking Common Voice corpus v1, as the same sentences used for training also appear in the test and dev portions of the corpus. This means the "best"-performing models are the ones that overfit the most to the 7000 sentences used in the training data.

See also the discussion here: kaldi-asr/kaldi#2141 (Commonvoice results misleading, complete overlap of train/dev/test sentences) and here https://discourse.mozilla.org/t/common-voice-v1-corpus-design-problems-overlapping-train-test-dev-sentences/24288 (Common Voice v1 corpus design problems, overlapping train/test/dev sentences)
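A minimal sketch of the obvious fix, assuming clips are (audio, prompt) pairs: assign splits by hashing the sentence text, so every recording of a given sentence lands in the same portion and train/dev/test stay disjoint at the sentence level:

```python
import hashlib

def split_for(sentence: str) -> str:
    # Deterministic ~80/10/10 assignment keyed on the normalized text.
    h = int(hashlib.sha1(sentence.strip().lower().encode()).hexdigest(), 16) % 100
    return "train" if h < 80 else "dev" if h < 90 else "test"

clips = [("clip1.mp3", "Hello world."),
         ("clip2.mp3", "Hello world."),
         ("clip3.mp3", "A different prompt.")]
splits = {"train": [], "dev": [], "test": []}
for path, text in clips:
    splits[split_for(text)].append((path, text))
# Both "Hello world." recordings end up in the same split.
```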

The 1-billion-word corpus from Google is Apache-licensed; not sure whether that license would fit.
https://github.com/ciprian-chelba/1-billion-word-language-modeling-benchmark

Transcriptions of the European Parliament are, afaik, in the public domain (and also cover many more languages than English): http://www.statmt.org/europarl/

@missuniverse (Contributor) commented Jan 19, 2018

It would be nice if some project staff looked at Discourse: https://discourse.mozilla.org/t/help-wanted-write-some-nice-short-sentences-for-people-to-read/17317/33
There are multiple other speech corpora that have been donated and are still not used.

@missuniverse (Contributor) commented Apr 4, 2018

Just a thought: maybe we could use the sentences from pontoon.mozilla.org that fit the required criteria as a source of text. It's pretty massive, and I'm fairly sure we could use it considering the licensing.

@Djfe (Contributor) commented Jul 9, 2018

Yeah, the engagement strings seem suitable :)

I found another possible source: the CCC subtitles (Chaos Communication Congress and related events).
http://mirror.selfnet.de/c3subtitles/
CC BY 4.0 license
Lots of English, some German.

https://media.ccc.de/

Maybe they are even accurate enough that we can use the audio from those videos. Their goal is to make very accurate subtitles (even if the speaker repeats words, etc.; only fillers like "umm" are left out), so if the timestamps are accurate enough, they could perhaps be used for training.
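If anyone wants to experiment, here is a hedged sketch of pulling (start, end, text) triples out of a standard SRT subtitle file; the actual c3subtitles files may use a different format or need more robust parsing:

```python
import re

# One SRT block: index, "HH:MM:SS,mmm --> HH:MM:SS,mmm", then text lines.
SRT_BLOCK = re.compile(
    r"\d+\s+(\d{2}:\d{2}:\d{2}),\d{3}\s*-->\s*(\d{2}:\d{2}:\d{2}),\d{3}\s+(.+?)(?:\n\n|\Z)",
    re.DOTALL,
)

def parse_srt(text: str):
    # Yields (start, end, text); milliseconds are dropped for brevity.
    for start, end, body in SRT_BLOCK.findall(text):
        yield start, end, " ".join(body.split())

sample = "1\n00:00:01,000 --> 00:00:04,500\nWelcome to the talk.\n\n"
print(list(parse_srt(sample)))
# -> [('00:00:01', '00:00:04', 'Welcome to the talk.')]
```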

@mwamp commented Sep 6, 2018

Hi, I'm not sure these ideas are all new (or that any of them are), but I wanted to suggest using:

1. Transcripts from political speeches, court videos, official announcements, etc. In France we have a (cleaned-up) transcript of all debates of the house [http://www2.assemblee-nationale.fr/feeds/detail/crs]
2. Many publicly available podcasts today have transcripts of part or all of their content. It could be worth going around radio stations to find editorials that have complete transcripts.
E.g.: [https://www.franceculture.fr/emissions/le-tour-du-monde-des-idees/le-tour-du-monde-des-idees-du-mardi-04-septembre-2018]
3. It might also be possible to get transcripts from programs subtitled for deaf people.

Since these might not be perfectly clean, you might want to process them through your STT first, then match and correct discrepancies before adding them to the dataset.
It could also be okay to keep only the correctly matched parts (with a length condition, for instance); I'm not sure what bias that would induce, though.

Best,
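A sketch of that match-and-keep idea, using difflib on word sequences (purely illustrative; a real pipeline would also align against timestamps and handle punctuation):

```python
from difflib import SequenceMatcher

def matching_spans(transcript: str, stt_output: str, min_words: int = 4):
    """Yield transcript spans where the STT hypothesis agrees word-for-word."""
    a, b = transcript.split(), stt_output.split()
    matcher = SequenceMatcher(a=a, b=b, autojunk=False)
    for block in matcher.get_matching_blocks():
        if block.size >= min_words:
            yield " ".join(a[block.a : block.a + block.size])

ref = "the committee approved the budget for next year without debate"
hyp = "the committee approved the budget for next here without the bait"
print(list(matching_spans(ref, hyp)))
# -> ['the committee approved the budget for next']
```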

@Djfe (Contributor) commented Sep 6, 2018

Good suggestions! Very welcome :) Keep them coming.

More sentences are better; bias will be reduced by adding more over time.

We have to be careful with transcriptions, though, because of copyright. They can't always be used, since we want to publish the whole dataset/corpus (voice and corresponding text) as CC0/public domain.

Regards

@mwamp commented Sep 19, 2018

Hi,
To bump my previous post: it appears the French Senate actually has good tech initiatives. Not only do they provide cleaned-up transcripts, but they also bind sentences to the right video timing. What's more, they make it easy to download the audio file directly.
It appears the license is open; it is very explicit for the transcripts, a tad less so for the media.

Enclosed is a script that can extract all this data by crawling their website (you have to change the extension to .py). It downloads an "xml" file containing transcripts and per-sentence timestamps, along with the audio. Of course, use it with caution so as not to overload their server; it is fully synchronous, so it should be fine, but I wouldn't thread it too much. Also, I wouldn't tag their voices with names or ages or anything of the sort.
I counted 282 sessions, so I would expect between 500 and 1000 hours of speech.
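Without the actual schema at hand, a purely illustrative reader for that kind of timed-transcript XML might look like this (the element and attribute names are hypothetical, not the Senate's real format):

```python
import xml.etree.ElementTree as ET

def timed_sentences(xml_path: str):
    """Yield (start_seconds, sentence_text) pairs from a timed transcript."""
    root = ET.parse(xml_path).getroot()
    for node in root.iter("sentence"):         # hypothetical tag name
        start = float(node.get("start", "0"))  # hypothetical attribute
        text = " ".join((node.text or "").split())
        if text:
            yield start, text
```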

If you deem this usable, it could be worth opening a new category here to group international public institutions that provide CC0 transcribed speeches, and to give them credit. And maybe a bunch of people could write scripts to fetch the data (or kindly ask).

Edit: So I checked out the EU Parliament. Their site is a bit wacky, but on copyright they say:

As a general rule, the reuse (reproduction or use) of textual data and multimedia items which are the property of the European Union (identified by the words '© European Union, [year(s)] – Source: European Parliament' or '© European Union, [year(s)] – EP' ) or of third parties (© External source, [year(s)]), and for which the European Union holds the rights of use, is authorised, for personal use or for further non-commercial or commercial dissemination, provided that the entire item is reproduced and the source is acknowledged. However, the reuse of certain data may be subject to different conditions in some instances; in this case, the item concerned is accompanied by a mention of the specific conditions relating to it .

The 'entire item is reproduced' clause seems to be a hitch, but they actually do some extensive cutting up; see here. Aside from that, it looks like we may use it as long as we are careful to use the originals (not the simultaneous translations).

Gregoor added a commit that referenced this issue Dec 6, 2018

New phrases for Italian (#1562)
* Added sentences (#18)

New sentences, part 1

* Added sentences (#19)

New sentences, part 2

* Small correction (#20)

New sentences, part 3

* Common Voice sentence corrections (#21)

Fixed line 3296: pseudinimo -> pseudonimo
Fixed line 1827: c’è -> ce
Fixed line 3784: sopratutto -> soprattutto
Fixed line 8320: sopratutto -> soprattutto
Fixed line 8763: sopratutto -> soprattutto
Fixed line 9352: pò -> po'
Fixed line 4289: pò -> po'

* Fixed an idiom (#22)

- From "figlio di papa" to "figlio di papà"

* Added new sentences (#23)

First part of the "Software Open Source" thesis [pages 4 to 21]

* Revised sentences (by Dario) (#24)

- Removed double spaces
- Removed duplicate sentences (rule 8)
- Shortened overly long sentences (rule 1)
- Removed parentheses (rule 10)
- Derived additional sentences (from parentheses, from ":", etc.)
- Replaced directional quotes and apostrophes with plain ones (' and ") (rule 11)
- Fixed some sentences for better comprehension (rule 3)
- Fixed punctuation
- Shortened some sentences containing dates/numeric values (rule 7)
- Replaced English terms with Italian ones (where necessary) (rule 13)
- Removed "Che" and "E" at the start of sentences, where possible (rule 9)

* Added Italian sentences from pages 74-84 of Ravelli's degree thesis (#25)

Added sentences extrapolated from pages 74-84 of the thesis. Sentences have been adapted according to the [guidelines](https://github.com/Sav22999/Guide/blob/master/Mozilla%20Italia/Common%20Voice/Linee%20guida%20revisione%20Common%20Voice.md) in order to be suitable for usage in Common Voice.

* Sentences checked and corrected with common-voice-tool (#26)

Other improvements

* Fixed some errors (#28)

* Revised sentences (by Dario)

- Removed double spaces
- Removed duplicate sentences (rule 8)
- Shortened overly long sentences (rule 1)
- Removed parentheses (rule 10)
- Derived additional sentences (from parentheses, from ":", etc.)
- Replaced directional quotes and apostrophes with plain ones (' and ") (rule 11)
- Fixed some sentences for better comprehension (rule 3)
- Fixed punctuation
- Shortened some sentences containing dates/numeric values (rule 7)
- Replaced English terms with Italian ones (where necessary) (rule 13)
- Removed "Che" and "E" at the start of sentences, where possible (rule 9)

* Various other corrections (#29)

* Common Voice sentence corrections

Fixed line 3296: pseudinimo -> pseudonimo
Fixed line 1827: c’è -> ce
Fixed line 3784: sopratutto -> soprattutto
Fixed line 8320: sopratutto -> soprattutto
Fixed line 8763: sopratutto -> soprattutto
Fixed line 9352: pò -> po'
Fixed line 4289: pò -> po'

* Added 491 new sentences

First part of the "Software Open Source" thesis

- Every sentence was reread and revised
- Added a full stop at the end of every sentence
- Every sentence starts with a capital letter

* Proper update of the sentences

Removed superfluous sentences.

* Thesis revision #1

* Update frasi.txt

- Replaced quotation marks
- Added a full stop at the end of sentences where it was missing

* Update frasi.txt

* Update frasi.txt

Removed the full stop in sentences 3839 and 3846

* Update frasi.txt

Fixed lines randomly created by the merge

* Added a full stop to lines 10672 and 11026

* Sentence corrections

- Fixed spelling errors
- Fixed the logical sense of some sentences

@nukeador (Collaborator) commented Feb 7, 2019

Hello everyone,

In case you missed the announcement: after a few months of intense work, we launched the Sentence Collection Tool site for all Common Voice contributors. We are considering this a first beta version, but it is fully functional after some weeks of testing. The tool also includes a How-to page with ideas on where and how to find corpora (open to improvements).

All sentences submitted, reviewed, and validated using this tool will be incorporated into the main Common Voice site. We will point to this as the way to submit sentences to the project moving forward.

What is this tool?

This tool facilitates submitting, reviewing, and validating sentences in different locales so they can be incorporated into the main Common Voice site, where people can read them and donate their voice.

Why this tool?

The previous process for gathering sentences was a bit unstructured: too many places to go and unclear guidelines. For sentences to be useful to the Deep Speech algorithm, there are certain "hard requirements" this tool enforces to avoid problems in the future.

We also aim to keep improving the tool to make the experience even easier for everyone!

How can I start using it?

Just go to the Sentence Collection tool site and start submitting and reviewing sentences in your locales. Make sure you check the How-to page to understand how to use the tool.

Where do I report issues or ideas?

Our GitHub project page is the best place to report any issues with the site. If you want to discuss an idea or a new feature with the rest of the community, you can do that on our Discourse.

How can I help with the development?

This tool is developed by Common Voice volunteers. Anyone can get involved in the development; you just need to know React or Kinto. Chime in on our GitHub project to learn more.

If you are not technical, don't worry! We usually open conversations on Discourse to give everyone the chance to influence the direction of the project.

In order to centralize all discussions about where to find corpora, I would like to suggest we use the Common Voice Discourse instead. There is already a topic where the community is talking about this:

https://discourse.mozilla.org/t/problems-finding-public-domain-sentences/34790

Thanks for your contributions!

@nukeador closed this Feb 7, 2019
