Commit

Merge pull request #163 from customprogrammingsolutions/refactor-readycorpus

[MRG] Refactor: remove ReadyCorpus
oadams committed Jun 4, 2018
2 parents 53f9575 + 268fc95 commit bc6fe93
Showing 9 changed files with 352 additions and 120 deletions.
11 changes: 4 additions & 7 deletions docs/api.rst
@@ -10,10 +10,10 @@ works::
# Create a corpus from data that has already been preprocessed.
# Among other things, this will divide the corpus into training,
# validation and test sets.
from persephone.corpus import Corpus, ReadyCorpus
corpus = ReadyCorpus(tgt_dir="/path/to/preprocessed/data",
feat_type="fbank",
label_type="phonemes")
from persephone.corpus import Corpus
corpus = Corpus(feat_type="fbank",
label_type="phonemes",
tgt_dir="/path/to/preprocessed/data")

# Create an object that reads the corpus data in batches.
from persephone.corpus_reader import CorpusReader
@@ -64,9 +64,6 @@ classes. `Utterance` instances comprise `Corpus` instances, which are loaded by
.. label_segmenter=something,
.. tier_prefixes=("xv", "rf"))
.. autoclass:: persephone.corpus.ReadyCorpus
:members: __init__, determine_labels

.. autoclass:: persephone.corpus_reader.CorpusReader
:members: __init__,

19 changes: 9 additions & 10 deletions docs/quickstart.rst
@@ -100,7 +100,7 @@ interpreter. Back to the terminal:

$ ipython
> from persephone import corpus
> corp = corpus.ReadyCorpus("data/na_example")
> corp = corpus.Corpus("fbank", "phonemes", "data/na_example")
> from persephone import run
> run.train_ready(corp)

@@ -149,7 +149,7 @@ format your data in the same way, you can create your own Persephone

.. code:: python
corp = corpus.ReadyCorpus("<your-corpus-directory>", label_type="extension")
corp = corpus.Corpus("fbank", "phonemes", "<your-corpus-directory>")
where extension is "txt", "phonemes", "tones", or whatever your file has
after the dot.
@@ -197,35 +197,34 @@ Current data formatting requirements:
* Spaces are used to delimit the units that the tool predicts. Typically these units are phonemes or tones, however they could also just be orthographic characters (though performance is likely to be a bit lower: consider trying to transcribe "$100"). The model can't tell the difference between digraphs and unigraphs as long as they're tokenized in this format, demarcated with spaces.

If your data observes this format then you can load it via the
``ReadyCorpus`` class. If your data does not observe this format, you
``Corpus`` class. If your data does not observe this format, you
have two options:

1. Do your own separate preprocessing to get the data in this format. If
you're not a programmer this is probably the best option for you. If
you have ELAN files, this probably means using
``persephone/scripts/split_eaf.py``.
2. Create a Python class that inherits from ``persephone.corpus.Corpus``
(as does ``ReadyCorpus``) and does all your preprocessing. The API
and does all your preprocessing. The API
(and thus documentation) for this is work in progress, but the key
point is that ``<corpusobject>.train_prefixes``,
``<corpusobject>.valid_prefixes``, and
``<corpusobject>.test_prefixes`` are lists of prefixes for the
relevant subset of the data. For now, look at ``ReadyCorpus`` in
``persephone/corpus.py`` for an example. For an example on a full
relevant subset of the data. For an example on a full
dataset, see ``persephone/datasets/na.py`` (beware: here be
dragons).

Creating validation and test sets
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Currently ``ReadyCorpus`` splits the supplied data into three sets
Currently ``Corpus`` splits the supplied data into three sets
(training, validation and test) in a 95:5:5 ratio. The training set is
what your model is exposed to during training. Validation is a held-out
set that is used to gauge during training how well the model is
performing. Testing is what is used to quantitatively assess model
performance after training is complete.

When you first load your corpus, ``ReadyCorpus`` randomly allocates
When you first load your corpus, ``Corpus`` randomly allocates
files to each of these subsets. If you'd like to change which
utterances are in each set, modify
``<your-corpus>/valid_prefixes.txt`` and
@@ -283,7 +282,7 @@ like to hear people's thoughts on this interface.
CorpusReaders and Models
^^^^^^^^^^^^^^^^^^^^^^^^

The ``Corpus`` object (of which ``ReadyCorpus`` is a subclass), is an
The ``Corpus`` object, is an
object that exposes the files in the corpus (among several other
things). Of relevance here is the ``.get_train_fns()``,
``.get_valid_fns()``, ``.get_test_fns()`` methods, which provide lists
@@ -304,7 +303,7 @@ example na\_corpus):
.. code:: python
from persephone import corpus
na_corpus = corpus.ReadyCorpus("data/na_example/")
na_corpus = corpus.Corpus("fbank", "phonemes", "data/na_example/")
from persephone import corpus_reader
na_reader = corpus_reader.CorpusReader(na_corpus, num_train=512, batch_size=16)
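Note on the quickstart text touched above: it says that ``Corpus`` randomly allocates utterance prefixes to training, validation and test subsets on first load. The following is only a rough, standalone sketch of that idea and is not Persephone's implementation. The function name ``split_prefixes``, the 5% defaults for the validation and test fractions, and the fixed seed are all assumptions made for illustration.

```python
import random
from typing import List, Sequence, Tuple


def split_prefixes(prefixes: Sequence[str],
                   valid_frac: float = 0.05,
                   test_frac: float = 0.05,
                   seed: int = 0) -> Tuple[List[str], List[str], List[str]]:
    """Randomly partition utterance prefixes into (train, valid, test).

    Hypothetical helper: the fractions and the seed are illustrative
    assumptions, not values taken from Persephone.
    """
    if valid_frac < 0 or test_frac < 0 or valid_frac + test_frac >= 1:
        raise ValueError("fractions must be non-negative and sum to less than 1")

    # Sort first so the result depends only on the seed, not on input order.
    shuffled = sorted(prefixes)
    random.Random(seed).shuffle(shuffled)

    # Held-out set sizes are rounded down; the remainder goes to training.
    n_valid = int(len(shuffled) * valid_frac)
    n_test = int(len(shuffled) * test_frac)

    valid = shuffled[:n_valid]
    test = shuffled[n_valid:n_valid + n_test]
    train = shuffled[n_valid + n_test:]
    return train, valid, test


# Usage: 100 made-up prefixes give 90 training, 5 validation, 5 test.
example_prefixes = [f"utt{i:03d}" for i in range(100)]
train, valid, test = split_prefixes(example_prefixes)
print(len(train), len(valid), len(test))
```

Seeding a private ``random.Random`` instance keeps the split reproducible without touching the global random state. The docs touched by this commit instead describe persisting the split in prefix files under the corpus directory.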
