Getting the same sentences multiple times #241

Closed
a3nm opened this issue Jun 24, 2017 · 11 comments
Labels
Enhancement: An idea to enhance an existing feature or process on Common Voice

Comments

@a3nm

a3nm commented Jun 24, 2017

As a user, when recording, I often get the same sentences multiple times -- sometimes you can get the same sentence twice in one batch of three recordings.

There should be some way to remember which sentences have already been served to a given session, and to avoid serving them again once they have been recorded.

@kdavis-mozilla kdavis-mozilla added the Triage and Enhancement labels Jun 24, 2017
@espadrine

espadrine commented Jun 24, 2017

One part of it is that the sentence dataset is really small, at 188 sentences (I expected thousands.) May I suggest the use of Wikisource? Something speech-heavy, like Agatha Christie, or modern speeches, maybe a play?

The second part is that it simply uses Math.random(). That does not guarantee an even spread, so you risk a large disparity in the number of recordings per sentence. One solution is to increment a global counter on the server and serve the sentence at that index, wrapping around at the end of the list. The counter should be initialized to a random value (or persisted) so that server restarts do not cause one part of the list to be repeated.
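
A minimal sketch of the counter idea described above. The function name and the optional start index are hypothetical and not taken from the Common Voice code; it also assumes a single server process and a non-negative integer start index.

```javascript
// Hypothetical helper: none of these names come from the Common Voice codebase.
function createRoundRobinPicker(sentences, startIndex) {
  if (sentences.length === 0) {
    throw new Error('sentence list must not be empty');
  }
  // Start at a random offset unless a (persisted) index is supplied,
  // so a restart does not always replay the beginning of the list.
  let next = startIndex !== undefined
    ? startIndex % sentences.length
    : Math.floor(Math.random() * sentences.length);

  return function pick() {
    const sentence = sentences[next];
    next = (next + 1) % sentences.length; // wrap around at the end
    return sentence;
  };
}

// Usage: every sentence is served once before any of them repeats.
const pick = createRoundRobinPicker(['a', 'b', 'c'], 1);
console.log(pick(), pick(), pick(), pick()); // b c a b
```

Because the counter is shared by all users, this evens out recordings per sentence globally, but it does not by itself guarantee that one user never sees the same sentence twice.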

@orschiro

I think this should be dealt with as a high priority. If you, as an avid and excited user, get to see and record the same sentences over and over again, it takes away your motivation to contribute.

@mikehenrty
Member

Is this still happening? We have increased our sentences to several thousand (with more coming soon).

(Note: this is not a duplicate of #260. That one is about listening to sentences; this one is about recording. They come from entirely different pools.)

@mikehenrty mikehenrty marked this as a duplicate of #260 Jul 17, 2017
@orschiro

@mikehenrty I am receiving a lot of new sentences now. Thanks!

@nmstoker
Contributor

I'm late to this, but if you're expanding the range of sentences, it might be worth considering phonetic pangrams, as by definition these cover a large chunk of sounds quickly (although they are typically unrealistic).

https://www.quora.com/Is-there-a-text-that-covers-the-entire-English-phonetic-range/answer/Sheetal-Srivastava-1

@a3nm
Author

a3nm commented Jul 18, 2017

I have tried a bit, and it seems like there are now sufficiently many different sentences to avoid getting the same ones multiple times. Thanks for fixing!

@a3nm a3nm closed this as completed Jul 18, 2017
@a3nm
Author

a3nm commented Jul 23, 2017

I'm reopening because the pool of sentences is not so large after all: you can still get the same sentences occasionally when you record a sufficient number of them (around 100), even in the same session.
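
A rough back-of-the-envelope check of why repeats still show up. It assumes sentences are drawn uniformly at random with replacement, and the pool size of 3000 is only an illustrative stand-in for "several thousand", not a figure from the project.

```javascript
// Probability of at least one repeated sentence when `draws` sentences are
// picked uniformly at random, with replacement, from a pool of `poolSize`.
function repeatProbability(poolSize, draws) {
  let allDistinct = 1;
  for (let i = 0; i < draws; i++) {
    // Chance that draw number i+1 avoids the i sentences already seen.
    allDistinct *= Math.max(0, (poolSize - i) / poolSize);
  }
  return 1 - allDistinct;
}

// 3000 is an assumed pool size, used only for illustration.
console.log(repeatProbability(3000, 100).toFixed(2)); // 0.81
```

Under these assumptions, a repeat within about 100 recordings is more likely than not, so enlarging the pool alone only delays the problem.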

I think this could be fixed (within one session) by remembering which sentences have already been recorded, and not asking for these same sentences again.
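
A minimal sketch of that per-session memory, assuming the session can hold the full sentence list in memory. The class and method names are hypothetical and not part of the project's code.

```javascript
// Hypothetical per-session picker; the names are illustrative only.
class SessionSentencePicker {
  constructor(sentences) {
    this.sentences = sentences;
    this.recorded = new Set(); // sentences already recorded in this session
  }

  // Returns a random sentence not yet recorded, or null when none are left.
  next() {
    const remaining = this.sentences.filter((s) => !this.recorded.has(s));
    if (remaining.length === 0) {
      return null;
    }
    return remaining[Math.floor(Math.random() * remaining.length)];
  }

  // Call once the user has actually submitted a recording of `sentence`.
  markRecorded(sentence) {
    this.recorded.add(sentence);
  }
}

const session = new SessionSentencePicker(['one', 'two', 'three']);
const first = session.next();
session.markRecorded(first);
console.log(session.next() !== first); // true
```

Filtering on every call is linear in the pool size, which should be fine for a few thousand sentences; a shuffled queue would be an alternative for larger pools.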

@a3nm a3nm reopened this Jul 23, 2017
@Reginhar
Contributor

This will probably be fixed once #304 gets accepted, but until then, remembering sentences in a session could be a temporary fix. Allowing users to skip sentences (#278) would probably be a good enough fix for now as well (and skipping would be useful anyway).

@a3nm
Author

a3nm commented Jul 23, 2017

Skipping sentences would help but it's a bit more tedious, and also as a user I'm not always sure whether I have already seen a sentence or not. (Did I see it when recording, or when validating? Was it that sentence, or another sentence from the same novel? etc.) So even if users can skip sentences there would probably be some frustration and some duplicate recordings (but I don't know whether having duplicate recordings of the same sentence by the same speaker pollutes the dataset).

@mikehenrty
Member

Let's close this bug as we are actively trying to gather new sentences in #341. We are increasing our sentences by the day.

@a3nm
Author

a3nm commented May 26, 2018

I think this should be reopened: while recording some sentences this morning I got the same one multiple times again. ("Gossips are frogs, they drink and talk", and "Where did he get it", if I remember correctly.)

7 participants