Merge pull request #172 from nusdbsystem/master
Update dev branch from updates due to the fixed security alerts
Showing 4 changed files with 401 additions and 279 deletions.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -1,81 +1,142 @@ | ||
Dataset Types | ||
==================================================================== | ||
============= | ||
|
||
.. note:: | ||
**note** | ||
|
||
Refer to `./examples/datasets/ <https://github.com/nusdbsystem/singa-auto/tree/master/examples/datasets/>`_ for examples of pre-processing | ||
common dataset formats to conform to SINGA-Auto's own dataset formats. | ||
|
||
|
||
.. _`dataset-type:CORPUS`: | ||
Refer to | ||
`./examples/datasets/ <https://github.com/nusdbsystem/singa-auto/tree/master/examples/datasets/>`__ | ||
for examples of pre-processing common dataset formats to conform to | ||
SINGA-Auto's own dataset formats. | ||
|
||
CORPUS | ||
-------------------------------------------------------------------- | ||
------ | ||
|
||
The dataset file must be of the ``.zip`` archive format with a ``corpus.tsv`` at the root of the directory. | ||
The dataset file must be of the ``.zip`` archive format with a | ||
``corpus.tsv`` at the root of the directory. | ||
|
||
The ``corpus.tsv`` should be of a `.TSV <https://en.wikipedia.org/wiki/Tab-separated_values>`_ | ||
format with columns of ``token`` and ``N`` other variable column names (*tag columns*). | ||
The ``corpus.tsv`` should be of a | ||
`.TSV <https://en.wikipedia.org/wiki/Tab-separated_values>`__ format | ||
with columns of ``token`` and ``N`` other variable column names (*tag | ||
columns*). | ||
|
||
For each row, | ||
|
||
``token`` should be a string, a token (e.g. word) in the corpus. | ||
These tokens should appear in the same order as in the text of the corpus. | ||
To delimit sentences, ``token`` can take the value of ``\n``. | ||
|
||
The other ``N`` columns describe the corresponding token as part of the text of the corpus, *depending on the task*. | ||
|
||
|
||
.. _`dataset-type:IMAGE_FILES`: | ||
|
||
IMAGE_FILES | ||
-------------------------------------------------------------------- | ||
|
||
The dataset file must be of the ``.zip`` archive format with an ``images.csv`` at the root of the directory. | ||
|
||
The ``images.csv`` should be of a `.CSV <https://en.wikipedia.org/wiki/Comma-separated_values>`_ | ||
format with columns of ``path`` and ``N`` other variable column names (*tag columns*). | ||
``token`` should be a string, a token (e.g. word) in the corpus. | ||
These tokens should appear in the same order as in the text of the | ||
corpus. To delimit sentences, ``token`` can take the value of | ||
``\n``. | ||
|
||
The other ``N`` columns describe the corresponding token as part of | ||
the text of the corpus, *depending on the task*. | ||
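
The CORPUS layout described above can be sketched in Python as follows. This is a minimal illustration and not SINGA-Auto's own tooling: the function name ``build_corpus_zip``, the ``pos`` tag column and the ``delimiter_tags`` parameter are hypothetical, and the sketch assumes that the sentence delimiter is the two-character string ``\n`` (backslash followed by ``n``) rather than a real newline. The source does not say which tag values a delimiter row should carry, so the caller supplies them.

```python
import csv
import io
import zipfile


def build_corpus_zip(sentences, tag_names, delimiter_tags):
    """Return the bytes of a .zip archive with a corpus.tsv at its root.

    sentences      -- list of sentences; each sentence is a list of rows
                      shaped (token, tag_1, ..., tag_N)
    tag_names      -- the N tag column names (hypothetical, task-dependent)
    delimiter_tags -- tag values for sentence-delimiter rows
                      (assumption: the source does not specify these)
    """
    text = io.StringIO()
    writer = csv.writer(text, delimiter="\t", lineterminator="\n")
    writer.writerow(["token", *tag_names])
    for sentence in sentences:
        for row in sentence:
            writer.writerow(row)
        # Assumption: the delimiter token is backslash + "n", not a newline.
        writer.writerow(["\\n", *delimiter_tags])

    archive = io.BytesIO()
    with zipfile.ZipFile(archive, "w", zipfile.ZIP_DEFLATED) as zf:
        # corpus.tsv must sit at the root of the archive.
        zf.writestr("corpus.tsv", text.getvalue())
    return archive.getvalue()


corpus_zip = build_corpus_zip(
    sentences=[[("SINGA-Auto", "NOUN"), ("trains", "VERB"), ("models", "NOUN")]],
    tag_names=["pos"],
    delimiter_tags=[""],
)
```

Writing ``corpus_zip`` to a file such as ``corpus.zip`` would give an archive of the shape described in this section.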
|
||
SEGMENTATION\_IMAGES | ||
-------------------- | ||
|
||
- Inside the uploaded ``.zip`` file, the training and validation sets | ||
  must be placed in separate folders named exactly ``train`` and | ||
  ``val``. | ||
- Inside the ``train`` folder (and likewise the ``val`` folder), the images and | ||
  annotated masks must also be placed in separate folders named | ||
  exactly ``image`` and ``mask``. | ||
- The ``mask`` folder should contain only ``.png`` files, and each mask's file name | ||
  should match that of its corresponding image (e.g. for an | ||
  image named ``0001.jpg``, the corresponding mask should be named | ||
  ``0001.png``). | ||
- A JSON file named ``params.json`` must also be included in the | ||
  ``.zip`` file to indicate the essential training | ||
  parameters such as ``num_classes``, for example: | ||
|
||
.. code:: json | ||

    {
        "num_classes": 21
    }

An example of the uploaded ``.zip`` file structure: | ||
|
||
:: | ||
|
||
+ dataset.zip | ||
    + train | ||
        + image | ||
            + 0001.jpg | ||
            + 0002.jpg | ||
            + ... | ||
        + mask | ||
            + 0001.png | ||
            + 0002.png | ||
            + ... | ||
    + val | ||
        + image | ||
            + 0003.jpg | ||
            + ... | ||
        + mask | ||
            + 0003.png | ||
            + ... | ||
    + params.json | ||
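
The layout rules above can be checked before uploading with a short script. The following is a sketch under stated assumptions, not part of SINGA-Auto: ``check_segmentation_zip`` is a hypothetical helper, and it assumes that ``train``, ``val`` and ``params.json`` sit directly at the root of the archive, as the example structure suggests.

```python
import io
import json
import zipfile
from pathlib import PurePosixPath


def check_segmentation_zip(zf):
    """Return a list of layout problems found in an open zipfile.ZipFile."""
    problems = []
    files = [n for n in zf.namelist() if not n.endswith("/")]

    if "params.json" not in files:
        problems.append("missing params.json")
    elif "num_classes" not in json.loads(zf.read("params.json")):
        problems.append("params.json has no num_classes")

    for split in ("train", "val"):
        images = [n for n in files if n.startswith(f"{split}/image/")]
        masks = [n for n in files if n.startswith(f"{split}/mask/")]
        if not images:
            problems.append(f"{split}/image is missing or empty")
        # The mask folder may contain only .png files.
        for mask in masks:
            if PurePosixPath(mask).suffix != ".png":
                problems.append(f"{mask} is not a .png file")
        # Every image needs a mask with the same file name stem.
        mask_stems = {PurePosixPath(m).stem for m in masks}
        for image in images:
            if PurePosixPath(image).stem not in mask_stems:
                problems.append(f"{image} has no matching mask")
    return problems


# Demo on an in-memory archive whose val split lacks a mask.
demo = io.BytesIO()
with zipfile.ZipFile(demo, "w") as zf:
    zf.writestr("params.json", json.dumps({"num_classes": 21}))
    zf.writestr("train/image/0001.jpg", b"")
    zf.writestr("train/mask/0001.png", b"")
    zf.writestr("val/image/0003.jpg", b"")
with zipfile.ZipFile(demo) as zf:
    demo_problems = check_segmentation_zip(zf)
```

An empty result means the archive matches the rules listed in this section; the check does not open or decode the images themselves.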
|
||
IMAGE\_FILES | ||
------------ | ||
|
||
The dataset file must be of the ``.zip`` archive format with an | ||
``images.csv`` at the root of the directory. | ||
|
||
The ``images.csv`` should be of a | ||
`.CSV <https://en.wikipedia.org/wiki/Comma-separated_values>`__ format | ||
with columns of ``path`` and ``N`` other variable column names (*tag | ||
columns*). | ||
|
||
For each row, | ||
|
||
``path`` should be a file path to a ``.png``, ``.jpg`` or ``.jpeg`` image file within the archive, | ||
relative to the root of the directory. | ||
``path`` should be a file path to a ``.png``, ``.jpg`` or ``.jpeg`` | ||
image file within the archive, relative to the root of the | ||
directory. | ||
|
||
The other ``N`` columns describe the corresponding image, *depending on the task*. | ||
The other ``N`` columns describe the corresponding image, *depending | ||
on the task*. | ||
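
As an illustration of the IMAGE_FILES rules above, the sketch below reads ``images.csv`` from an archive and checks each ``path`` entry. It is an assumption-laden example and not SINGA-Auto code: ``check_image_files_zip`` is a hypothetical helper, the ``label`` tag column in the demo is invented, and the sketch assumes the CSV is UTF-8 encoded and that extensions are matched case-sensitively.

```python
import csv
import io
import zipfile
from pathlib import PurePosixPath

ALLOWED_SUFFIXES = {".png", ".jpg", ".jpeg"}


def check_image_files_zip(zf):
    """Return a list of problems found in an open zipfile.ZipFile."""
    names = set(zf.namelist())
    if "images.csv" not in names:
        return ["missing images.csv at the archive root"]

    with zf.open("images.csv") as raw:
        reader = csv.DictReader(io.TextIOWrapper(raw, encoding="utf-8", newline=""))
        rows = list(reader)
        columns = reader.fieldnames or []
    if "path" not in columns:
        return ["images.csv has no path column"]

    problems = []
    # Line 1 is the header, so data rows start at line 2.
    for line_no, row in enumerate(rows, start=2):
        path = row["path"] or ""
        if PurePosixPath(path).suffix not in ALLOWED_SUFFIXES:
            problems.append(f"line {line_no}: {path!r} is not a .png, .jpg or .jpeg file")
        elif path not in names:
            # Paths are relative to the root of the archive.
            problems.append(f"line {line_no}: {path!r} is not in the archive")
    return problems


# Demo: one valid row, one unsupported extension, one missing file.
demo = io.BytesIO()
with zipfile.ZipFile(demo, "w") as zf:
    zf.writestr(
        "images.csv",
        "path,label\nimages/a.png,cat\nimages/b.gif,dog\nimages/c.jpg,dog\n",
    )
    zf.writestr("images/a.png", b"")
    zf.writestr("images/b.gif", b"")
with zipfile.ZipFile(demo) as zf:
    demo_problems = check_image_files_zip(zf)
```

The tag columns are left unchecked here because their meaning depends on the task.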
|
||
.. _`dataset-type:QUESTION_ANSWERING_COVID19`: | ||
QUESTION\_ANSWERING\_COVID19 | ||
---------------------------- | ||
|
||
QUESTION_ANSWERING_COVID19 | ||
-------------------------------------------------------------------- | ||
The dataset file must be of the ``.zip`` archive format, containing | ||
`JSON <https://en.wikipedia.org/wiki/JSON>`__ files. JSON files at | ||
any folder depth in the archive are read together automatically. | ||
|
||
The dataset file must be of the ``.zip`` archive format, containing `JSON <https://en.wikipedia.org/wiki/JSON>`_ files. JSON files at any folder depth in the archive are read together automatically. | ||
Each JSON file is extracted from one paper. Its `JSON | ||
structure <https://en.wikipedia.org/wiki/JSON#Example>`__ contains a | ||
``body_text`` field, which is a list of ``{"text": <str>}`` blocks. Each | ||
``text`` block holds one paragraph of the corresponding paper. | ||
|
||
Each JSON file is extracted from one paper. Its `JSON structure <https://en.wikipedia.org/wiki/JSON#Example>`_ contains a `body_text` field, which is a list of `{"text": <str>}` blocks. Each `text` block holds one paragraph of the corresponding paper. | ||
A ``metadata.csv`` file at the root of the archive directory is | ||
optional. It provides the model with a ``publish_time`` column, in which each | ||
entry is in Date format, e.g. 2001-12-17. When it is provided, each | ||
metadata entry must also have a ``sha`` column in General format, | ||
and each JSON file must have a ``"sha": <str>`` field, so that the two sha | ||
values link the entry to its file. When neither ``metadata.csv`` nor a ``publish_time`` Date value is | ||
provided, the model does not check the timeliness of the corresponding JSON | ||
``body_text`` field. | ||
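
To make the QUESTION_ANSWERING_COVID19 layout concrete, the sketch below builds an archive with one JSON file per paper plus the optional ``metadata.csv``. It is only an illustration: ``build_covid19_zip``, the ``papers/`` folder name, the input dictionary shape and the sample sha value are hypothetical, and only the field and column names (``body_text``, ``text``, ``sha``, ``publish_time``) come from the description above.

```python
import csv
import io
import json
import zipfile


def build_covid19_zip(papers):
    """Return the bytes of a .zip archive for a list of papers.

    papers -- list of dicts with the keys sha, publish_time and
              paragraphs (a hypothetical input shape for this sketch)
    """
    metadata = io.StringIO()
    writer = csv.writer(metadata, lineterminator="\n")
    writer.writerow(["sha", "publish_time"])

    archive = io.BytesIO()
    with zipfile.ZipFile(archive, "w", zipfile.ZIP_DEFLATED) as zf:
        for paper in papers:
            document = {
                # The sha field links this file to its metadata.csv row.
                "sha": paper["sha"],
                # One {"text": ...} block per paragraph of the paper.
                "body_text": [{"text": p} for p in paper["paragraphs"]],
            }
            # The folder name is arbitrary: JSON files are read at any depth.
            zf.writestr(f"papers/{paper['sha']}.json", json.dumps(document))
            writer.writerow([paper["sha"], paper["publish_time"]])
        # metadata.csv is optional and sits at the root of the archive.
        zf.writestr("metadata.csv", metadata.getvalue())
    return archive.getvalue()


covid_zip = build_covid19_zip(
    [
        {
            "sha": "0a1b2c",
            "publish_time": "2001-12-17",
            "paragraphs": ["First paragraph.", "Second paragraph."],
        }
    ]
)
```

Leaving out the ``metadata.csv`` entry would still give a valid archive; by the description above, the model would then skip the timeliness check.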
|
||
A `metadata.csv` file at the root of the archive directory is optional. It provides the model with a `publish_time` column, in which each entry is in Date format, e.g. 2001-12-17. When it is provided, each metadata entry must also have a `sha` column in General format, and each JSON file must have a `"sha":<str>` field, so that the two sha values link the entry to its file. When neither `metadata.csv` nor a `publish_time` Date value is provided, the model does not check the timeliness of the corresponding JSON `body_text` field. | ||
QUESTION\_ANSWERING\_MEDQUAD | ||
---------------------------- | ||
|
||
The dataset file must be of the ``.zip`` archive format, containing | ||
`XML <https://en.wikipedia.org/wiki/XML#/media/File:XMLSample.png>`__ | ||
files. XML files at any folder depth in the archive are read together | ||
automatically. | ||
|
||
.. _`dataset-type:QUESTION_ANSWERING_MEDQUAD`: | ||
|
||
QUESTION_ANSWERING_MEDQUAD | ||
-------------------------------------------------------------------- | ||
|
||
The dataset file must be of the ``.zip`` archive format, containing `XML <https://en.wikipedia.org/wiki/XML#/media/File:XMLSample.png>`_ files. XML files at any folder depth in the archive are read together automatically. | ||
|
||
The model only reads the <Document> <QAPairs> ... </QAPairs> </Document> field, and this field contains multiple <QAPair> ... </QAPair> elements. Each QAPair has one <Question> ... </Question> and its corresponding <Answer> ... </Answer>. | ||
|
||
|
||
.. _`dataset-type:TABULAR`: | ||
The model only reads the <Document> <QAPairs> ... </QAPairs> | ||
</Document> field, and this field contains multiple <QAPair> ... | ||
</QAPair> elements. Each QAPair has one <Question> ... </Question> and its | ||
corresponding <Answer> ... </Answer>. | ||
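
The element nesting described above can be read with Python's standard ``xml.etree.ElementTree`` module. This is a sketch of how such a file might be parsed, not the model's actual loader: ``extract_qa_pairs`` and the sample document are hypothetical, and the sketch assumes that ``<Document>`` is the root element of each XML file.

```python
import xml.etree.ElementTree as ET

# A made-up sample that follows the nesting described in this section.
SAMPLE = """
<Document>
  <QAPairs>
    <QAPair>
      <Question>What is a dataset?</Question>
      <Answer>A collection of examples.</Answer>
    </QAPair>
    <QAPair>
      <Question>What is a model?</Question>
      <Answer>A function fitted to data.</Answer>
    </QAPair>
  </QAPairs>
</Document>
"""


def extract_qa_pairs(xml_text):
    """Return a list of (question, answer) tuples from one XML document."""
    root = ET.fromstring(xml_text)  # assumed to be the <Document> element
    pairs = []
    for qa_pair in root.findall("./QAPairs/QAPair"):
        # findtext returns None when the child element is absent.
        question = (qa_pair.findtext("Question") or "").strip()
        answer = (qa_pair.findtext("Answer") or "").strip()
        pairs.append((question, answer))
    return pairs


qa_pairs = extract_qa_pairs(SAMPLE)
```

Elements of the document outside ``<QAPairs>`` are ignored by this sketch, in line with the statement that only that field is read.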
|
||
TABULAR | ||
-------------------------------------------------------------------- | ||
|
||
The dataset file must be a tabular dataset of the ``.csv`` format with ``N`` columns. | ||
|
||
.. _`dataset-type:AUDIO_FILES`: | ||
------- | ||
|
||
AUDIO_FILES | ||
-------------------------------------------------------------------- | ||
The dataset file must be a tabular dataset of the ``.csv`` format with | ||
``N`` columns. | ||
|
||
The dataset file must be of the ``.zip`` archive format with an ``audios.csv`` at the root of the directory. | ||
AUDIO\_FILES | ||
------------ | ||
|
||
The dataset file must be of the ``.zip`` archive format with an | ||
``audios.csv`` at the root of the directory. |