diff --git a/docs/src/user/datasets.rst b/docs/src/user/datasets.rst
index 197e5926..9b6b1470 100644
--- a/docs/src/user/datasets.rst
+++ b/docs/src/user/datasets.rst
@@ -1,81 +1,142 @@
 Dataset Types
-====================================================================
+=============
 
 .. note::
 
-    Refer to `./examples/datasets/ `_ for examples on pre-processing
-    common dataset formats to conform to the SINGA-Auto's own dataset formats.
+    Refer to `./examples/datasets/ `__ for examples on pre-processing
+    common dataset formats to conform to SINGA-Auto's own dataset
+    formats.
 
 .. _`dataset-type:CORPUS`:
 
 CORPUS
---------------------------------------------------------------------
+------
 
-The dataset file must be of the ``.zip`` archive format with a ``corpus.tsv`` at the root of the directory.
+The dataset file must be of the ``.zip`` archive format with a
+``corpus.tsv`` at the root of the directory.
 
-The ``corpus.tsv`` should be of a `.TSV `_
-format with columns of ``token`` and ``N`` other variable column names (*tag columns*).
+The ``corpus.tsv`` should be of a
+`.TSV <https://en.wikipedia.org/wiki/Tab-separated_values>`__ format
+with columns of ``token`` and ``N`` other variable column names (*tag
+columns*).
 
 For each row,
 
-    ``token`` should be a string, a token (e.g. word) in the corpus.
-    These tokens should appear in the order as it is in the text of the corpus.
-    To delimit sentences, ``token`` can be take the value of ``\n``.
-
-    The other ``N`` columns describe the corresponding token as part of the text of the corpus, *depending on the task*.
-
-
-.. _`dataset-type:IMAGE_FILES`:
-
-IMAGE_FILES
---------------------------------------------------------------------
-
-The dataset file must be of the ``.zip`` archive format with a ``images.csv`` at the root of the directory.
-
-The ``images.csv`` should be of a `.CSV `_
-format with columns of ``path`` and ``N`` other variable column names (*tag columns*).
+    ``token`` should be a string, a token (e.g. word) in the corpus.
+    These tokens should appear in the same order as in the text of the
+    corpus. To delimit sentences, ``token`` can take the value of
+    ``\n``.
+
+    The other ``N`` columns describe the corresponding token as part of
+    the text of the corpus, *depending on the task*.
+
+.. _`dataset-type:SEGMENTATION_IMAGES`:
+
+SEGMENTATION_IMAGES
+-------------------
+
+- Inside the uploaded ``.zip`` file, the training and validation sets
+  should be wrapped separately, and be named strictly ``train`` and
+  ``val``.
+- Within the ``train`` folder (likewise for the ``val`` folder), the
+  images and annotated masks should also be wrapped separately, and be
+  named strictly ``image`` and ``mask``.
+- The ``mask`` folder should contain only ``.png`` files, and each file
+  name should match that of the mask's corresponding image (e.g. for an
+  image named ``0001.jpg``, its corresponding mask should be named
+  ``0001.png``).
+- A JSON file named ``params.json`` must also be included in the
+  ``.zip`` file, to indicate essential training parameters such as
+  ``num_classes``, for example:
+
+  .. code-block:: json
+
+      {
+          "num_classes": 21
+      }
+
+An example of the uploaded ``.zip`` file structure:
+
+::
+
+    + dataset.zip
+        + train
+            + image
+                + 0001.jpg
+                + 0002.jpg
+                + ...
+            + mask
+                + 0001.png
+                + 0002.png
+                + ...
+        + val
+            + image
+                + 0003.jpg
+                + ...
+            + mask
+                + 0003.png
+                + ...
+        + params.json
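+Such an archive can be assembled with a few lines of Python. A minimal,
+illustrative sketch follows (the ``make_dataset_zip`` helper and the
+source paths are hypothetical; only the archive layout follows the
+rules above):
+
+.. code-block:: python
+
+    import json
+    import zipfile
+    from pathlib import Path
+
+    def make_dataset_zip(src_dir: str, out_path: str, num_classes: int) -> None:
+        """Package a SEGMENTATION_IMAGES dataset (hypothetical helper).
+
+        Assumes `src_dir` already contains train/{image,mask} and
+        val/{image,mask} folders laid out as described above.
+        """
+        src = Path(src_dir)
+        with zipfile.ZipFile(out_path, "w") as zf:
+            for split in ("train", "val"):
+                for sub in ("image", "mask"):
+                    for f in sorted((src / split / sub).iterdir()):
+                        # Masks must be .png files named after their images
+                        zf.write(f, arcname=f"{split}/{sub}/{f.name}")
+            # params.json carries the essential training parameters
+            zf.writestr("params.json", json.dumps({"num_classes": num_classes}))
+
+    make_dataset_zip("my_data", "dataset.zip", num_classes=21)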
+.. _`dataset-type:IMAGE_FILES`:
+
+IMAGE_FILES
+-----------
+
+The dataset file must be of the ``.zip`` archive format with an
+``images.csv`` at the root of the directory.
+
+The ``images.csv`` should be of a
+`.CSV <https://en.wikipedia.org/wiki/Comma-separated_values>`__ format
+with columns of ``path`` and ``N`` other variable column names (*tag
+columns*).
 
 For each row,
 
-    ``path`` should be a file path to a ``.png``, ``.jpg`` or ``.jpeg`` image file within the archive,
-    relative to the root of the directory.
+    ``path`` should be a file path to a ``.png``, ``.jpg`` or ``.jpeg``
+    image file within the archive, relative to the root of the
+    directory.
 
-    The other ``N`` columns describe the corresponding image, *depending on the task*.
+    The other ``N`` columns describe the corresponding image,
+    *depending on the task*.
 
 .. _`dataset-type:QUESTION_ANSWERING_COVID19`:
 
 QUESTION_ANSWERING_COVID19
---------------------------------------------------------------------
+--------------------------
 
-The dataset file must be of the ``.zip`` archive format, containing `JSON `_ files. JSON files under different levels of folders will be automaticly read all together.
+The dataset file must be of the ``.zip`` archive format, containing
+`JSON <https://en.wikipedia.org/wiki/JSON>`__ files. JSON files under
+different levels of folders will all be read together automatically.
 
-Each JSON file is extracted from one paper. `JSON structure `_ contains field `body_text`, which is a list of `{"text": }` blocks. Each `text` block is namely each paragraph of corresponding paper.
+Each JSON file is extracted from one paper. The `JSON structure `__
+contains the field ``body_text``, which is a list of
+``{"text": <str>}`` blocks. Each ``text`` block corresponds to one
+paragraph of the paper.
 
-Meanwhile, a `metadata.csv` file, at the root of the archive directory, is optional. It is to provide the model with `publish_time` column, each entry is in Date format, e.g. 2001-12-17. In this condition, each metadata entry is required to have `sha` value column in General format, and each JSON file required to have `"sha":` field, while both sha values linked. When neither metadata.csv or `publish_time` Date value is provided, the model would not check the timeliness of corresponding JSON `body_text` field.
+Meanwhile, a ``metadata.csv`` file, at the root of the archive
+directory, is optional. It provides the model with a ``publish_time``
+column, where each entry is in Date format, e.g. 2001-12-17. In this
+case, each metadata entry is required to have a ``sha`` column value in
+General format, and each JSON file is required to have a
+``"sha": <str>`` field, with the two ``sha`` values linked. When
+neither ``metadata.csv`` nor a ``publish_time`` Date value is provided,
+the model will not check the timeliness of the corresponding JSON
+``body_text`` field.
 
 .. _`dataset-type:QUESTION_ANSWERING_MEDQUAD`:
 
 QUESTION_ANSWERING_MEDQUAD
---------------------------------------------------------------------
+--------------------------
 
-The dataset file must be of the ``.zip`` archive format, containing `xml `_ files. Xml files under different levels of folders will be automaticly read all together.
+The dataset file must be of the ``.zip`` archive format, containing
+`XML <https://en.wikipedia.org/wiki/XML>`__ files. XML files under
+different levels of folders will all be read together automatically.
 
-Model would only take ... field, and this filed contains multiple ... . Each QAPair has one ... and its ... combination.
+The model only takes the ``<QAPairs>`` field, and this field contains
+multiple ``<QAPair>`` elements. Each ``<QAPair>`` has one ``<Question>``
+and its ``<Answer>`` combination.
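+As a rough illustration of how such archives are consumed, the sketch
+below (a hypothetical helper, not SINGA-Auto's actual loader) walks
+every ``.json`` file under an extracted COVID19 archive, whatever the
+folder depth, and collects the ``body_text`` paragraphs into a pandas
+DataFrame:
+
+.. code-block:: python
+
+    import json
+    from pathlib import Path
+
+    import pandas as pd
+
+    def collect_body_text(extracted_dir: str) -> pd.DataFrame:
+        """Gather paragraphs from all JSON files, at any folder depth."""
+        rows = []
+        for json_path in Path(extracted_dir).rglob("*.json"):
+            with open(json_path) as f:
+                paper = json.load(f)
+            sha = paper.get("sha")  # present for JSON extracted from papers
+            for block in paper.get("body_text", []):
+                # Only the "text" field of each entry is read
+                rows.append({"sha": sha, "text": block["text"]})
+        return pd.DataFrame(rows)
+
+    df = collect_body_text("DATASET_NAME")  # hypothetical extracted folder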
 .. _`dataset-type:TABULAR`:
 
 TABULAR
---------------------------------------------------------------------
+-------
 
-The dataset file must be a tabular dataset of the ``.csv`` format with ``N`` columns.
+The dataset file must be a tabular dataset of the ``.csv`` format with
+``N`` columns.
 
 .. _`dataset-type:AUDIO_FILES`:
 
 AUDIO_FILES
---------------------------------------------------------------------
+-----------
 
-The dataset file must be of the ``.zip`` archive format with a ``audios.csv`` at the root of the directory.
+The dataset file must be of the ``.zip`` archive format with an
+``audios.csv`` at the root of the directory.
diff --git a/docs/src/user/tasks.rst b/docs/src/user/tasks.rst
index 31224f35..4dcd9e3e 100644
--- a/docs/src/user/tasks.rst
+++ b/docs/src/user/tasks.rst
@@ -1,37 +1,74 @@
 .. _`tasks`:
 
 Supported Tasks
-====================================================================
+===============
 
-Each task has an associated a *Dataset Format*, a *Query Format* and a *Prediction Format*.
+Each task has an associated *Dataset Format*, *Query Format* and
+*Prediction Format*.
 
 A task's *Dataset Format* specifies the format of the dataset files.
-Datasets are prepared by *Application Developers* when they create *Train Jobs*
-and received by *Model Developers* when they define :meth:`singa_auto.model.BaseModel.train` and :meth:`singa_auto.model.BaseModel.evaluate`.
+Datasets are prepared by *Application Developers* when they create
+*Train Jobs* and received by *Model Developers* when they define
+:meth:`singa_auto.model.BaseModel.train` and
+:meth:`singa_auto.model.BaseModel.evaluate`.
 
-A task's *Query Format* specifies the format of queries when they are passed to models.
-Queries are generated by *Application Users* when they send queries to *Inference Jobs*
-and received by *Model Developers* when they define :meth:`singa_auto.model.BaseModel.predict`.
+A task's *Query Format* specifies the format of queries when they are
+passed to models. Queries are generated by *Application Users* when
+they send queries to *Inference Jobs* and received by *Model
+Developers* when they define :meth:`singa_auto.model.BaseModel.predict`.
 
-A task's *Prediction Format* specifies the format of predictions made by models.
-Predictions are generated by *Model Developers* when they define :meth:`singa_auto.model.BaseModel.predict`
-and received by *Application Users* as predictions to their queries sent to *Inference Jobs*.
+A task's *Prediction Format* specifies the format of predictions made
+by models. Predictions are generated by *Model Developers* when they
+define :meth:`singa_auto.model.BaseModel.predict` and received by
+*Application Users* as predictions to their queries sent to *Inference
+Jobs*.
+
+IMAGE_SEGMENTATION
+------------------
+
+Dataset Format
+~~~~~~~~~~~~~~
+
+:ref:`dataset-type:SEGMENTATION_IMAGES`
+
+.. note::
+
+    We use the same annotation format as the `Pascal VOC segmentation
+    dataset `__.
+
+- An image and its corresponding mask should have the same width and
+  height, while their numbers of channels can differ. For example, an
+  image can have three channels representing ``RGB`` values, but its
+  mask should have only one grayscale channel.
+- In the mask image, each pixel's grayscale value represents its label,
+  while the value ``255`` marks a meaningless pixel, such as padding or
+  borders (see the sketch below).
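+A quick way to sanity-check a mask against these rules is sketched
+below (illustrative only; ``Pillow`` and ``numpy`` are assumed to be
+available, and the paths are examples):
+
+.. code-block:: python
+
+    import numpy as np
+    from PIL import Image
+
+    def check_mask(image_path: str, mask_path: str, num_classes: int) -> None:
+        """Verify a mask's size and label range (sketch)."""
+        image = Image.open(image_path)  # e.g. a 3-channel RGB image
+        mask = Image.open(mask_path)    # a single-channel label image
+        assert image.size == mask.size, "image and mask sizes must match"
+        labels = np.unique(np.array(mask))
+        # 255 marks meaningless pixels (padding/borders); all other
+        # values must be valid class labels
+        valid = (labels < num_classes) | (labels == 255)
+        assert valid.all(), f"unexpected label values: {labels[~valid]}"
+
+    check_mask("train/image/0001.jpg", "train/mask/0001.png", num_classes=21)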
+Query Format
+~~~~~~~~~~~~
+
+An image file, such as a ``.jpg`` or ``.png``.
+
+Prediction Format
+~~~~~~~~~~~~~~~~~
+
+A ``W x H`` single-channel mask image, with each pixel's grayscale
+value representing its label.
 
 IMAGE_CLASSIFICATION
---------------------------------------------------------------------
+--------------------
 
 Dataset Format
-^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+~~~~~~~~~~~~~~
 
 :ref:`dataset-type:IMAGE_FILES`
 
-- There is only 1 tag column of ``class``, corresponding to the class of the image as an integer from ``0`` to ``k - 1``, where ``k`` is the total no. of classes.
-- The train & validation dataset's images should be have the same dimensions ``W x H`` and same total no. of classes.
+- There is only 1 tag column, ``class``, corresponding to the class of
+  the image as an integer from ``0`` to ``k - 1``, where ``k`` is the
+  total no. of classes.
+- The train & validation datasets' images should have the same
+  dimensions ``W x H`` and the same total no. of classes.
 
 An example:
 
 .. code-block:: text
 
     path,class
     image-0-of-class-0.png,0
@@ -40,40 +77,42 @@ An example:
     image-0-of-class-1.png,1
     ...
     image-99-of-class-9.png,9
 
 .. note::
 
-    You can refer to and run `./examples/datasets/image_files/load_folder_format.py `_
-    for converting *directories of images* to SINGA-Auto's ``IMAGE_CLASSIFICATION`` format.
+    You can refer to and run
+    `./examples/datasets/image_files/load_folder_format.py `__
+    to convert *directories of images* to SINGA-Auto's
+    ``IMAGE_CLASSIFICATION`` format.
 
 Query Format
-^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+~~~~~~~~~~~~
 
-A ``W x H x 3`` 3D array representing a *RGB* version of the query image.
-The query image can be of *any dimensions*.
+A ``W x H x 3`` 3D array representing an *RGB* version of the query
+image. The query image can be of *any dimensions*.
 
 Prediction Format
-^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+~~~~~~~~~~~~~~~~~
 
-A size-``k`` array of floats, representing the probabilities of each class, by index, from ``0`` to ``k-1``.
-For example, the float at index 0 corresponds to the probability of class 0.
+A size-``k`` array of floats, representing the probabilities of each
+class, by index, from ``0`` to ``k-1``. For example, the float at index
+0 corresponds to the probability of class 0.
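+For intuition, a query can be built from an image file and the
+prediction reduced to a class label as in the sketch below (the
+``predict`` function is a placeholder for however the deployed model
+is invoked):
+
+.. code-block:: python
+
+    import numpy as np
+    from PIL import Image
+
+    def predict(image_array: np.ndarray) -> np.ndarray:
+        """Placeholder for the deployed model's predict call."""
+        k = 10
+        return np.full(k, 1.0 / k)  # uniform probabilities, for illustration
+
+    # Build a query: a W x H x 3 RGB array from an image of any dimensions
+    query = np.array(Image.open("query.jpg").convert("RGB"))
+
+    # The prediction is a size-k array of class probabilities (indices 0..k-1)
+    probs = predict(query)
+    predicted_class = int(np.argmax(probs))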
 
 POS_TAGGING
---------------------------------------------------------------------
+-----------
 
 Dataset Format
-^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+~~~~~~~~~~~~~~
 
 :ref:`dataset-type:CORPUS`
 
-- Sentences are delimited by ``\n`` tokens.
-- There is only 1 tag column of ``tag`` corresponding to the POS tag of the token as an integer from ``0`` to ``k-1``.
+- Sentences are delimited by ``\n`` tokens.
+- There is only 1 tag column, ``tag``, corresponding to the POS tag of
+  the token as an integer from ``0`` to ``k-1``.
 
 An example:
 
 .. code-block:: text
 
     token   tag
     Two     3
@@ -91,35 +130,34 @@ An example:
     .       4
     \n      0
 
 Query Format
-^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+~~~~~~~~~~~~
 
-An array of strings representing a sentence as a list of tokens in that sentence.
+An array of strings representing a sentence as a list of tokens in
+that sentence.
 
 Prediction Format
-^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+~~~~~~~~~~~~~~~~~
 
-A array of integers representing the list of predicted tag for each token, in sequence, for the sentence.
+An array of integers representing the predicted tags for each token,
+in sequence, for the sentence.
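+A toy query/prediction pair, with made-up tag ids, just to fix the
+shapes:
+
+.. code-block:: python
+
+    # Query: a sentence as a list of tokens
+    query = ["Two", "of", "them", "were", "executed", "."]
+
+    # Prediction: one POS tag id per token, in the same order
+    prediction = [3, 1, 2, 5, 6, 4]  # illustrative tag ids
+
+    for token, tag in zip(query, prediction):
+        print(f"{token}\t{tag}")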
 
 QUESTION_ANSWERING
---------------------------------------------------------------------
+------------------
 
 COVID19 Task Dataset Format
-^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
 :ref:`dataset-type:QUESTION_ANSWERING_COVID19`
 
-Dataset can be used to finetune the SQuAD pre-trained Bert model.
+The dataset can be used to fine-tune the SQuAD pre-trained BERT model.
 
-- The dataset zips folders containing JSON files. JSON files under different level folders will be automaticly read all together.
+- The dataset zips folders containing JSON files. JSON files under
+  different levels of folders will all be read together automatically.
 
 Dataset structure example:
 
 .. code-block:: text
 
     /DATASET_NAME.zip
    │
@@ -138,20 +176,29 @@ Dataset structure example:
    │
    └──metadata.csv    # if additional information is provided for above JSON files, user can add a metadata.csv
 
-- JSON file includes ``body_text``, providing list of paragraphs in full body which can be used for question answering. ``body_text`` can contain different entries, only the "text" field of each entry will be read.
+- Each JSON file includes ``body_text``, providing the list of
+  paragraphs of the full body, which can be used for question
+  answering. ``body_text`` can contain different entries; only the
+  "text" field of each entry will be read.
 
-1. For JSON files extracted from papers, it comes that one JSON file for one paper. And if additional information is given in metadata.csv for papers, each JSON file and each metadata.csv entries are linked via ``sha`` values of both.
-
-2. For dataset having their additional information paragraph, the ``body_text``> ``text`` entry is in `` + <\n> + `` string format. In this circumstance, there is no ``sha`` value nor metadata.csv file needed.
+#. For JSON files extracted from papers, there is one JSON file per
+   paper. If additional information is given in ``metadata.csv`` for
+   the papers, each JSON file is linked to its ``metadata.csv`` entry
+   via the ``sha`` values of both.
+#. For datasets whose additional information is already embedded as a
+   paragraph, the ``body_text`` > ``text`` entry is in
+   ``<question> + <\n> + <paragraph>`` string format. In this
+   circumstance, no ``sha`` value nor ``metadata.csv`` file is needed.
 
 Sample of JSON file:
 
 .. code-block:: text
 
     # JSON file 1
    # for example, a JSON file extracted from one paper
    {
        "sha": <str>,  # 40-character sha1 of the PDF; this field is only required for JSON extracted from papers. It will be read into the model as a string
 
        "body_text": [  # list of paragraphs in full body; this is a must-have
            {
                "text": <str>,  # text body of the first entry, i.e. one paragraph of this paper; this is a must-have. It will be read as a string into the model
            },
            ...  # other 'text' blocks, i.e. paragraph blocks, the same as above; all 'text' strings will be handled and processed into a pandas DataFrame
        ],
    }
 
    # ----------------------------------------------------------------------------------------------------------------------
 
    # JSON file 2
    # for example, a JSON file extracted from SQuAD2.0
    {
        "body_text": [  # list of paragraphs in full body; this is a must-have
            {
                "text": 'What are the treatments for Age-related Macular Degeneration ?\n If You Have Advanced AMD Once dry AMD reaches the advanced stage, no form of treatment can prevent vision loss...',  # text body of the first entry; this is a must-have
            },
            ...  # other 'text' blocks, i.e. paragraph blocks, the same as above
        ],
    }
 
-- ``metadata.csv`` is not strictly required. User can provide additional information with it, i.e. authors, title, journal and publish_time, mapping to each JSON files by every sha value. ``cord_uid`` serves unique values serve as the entry identity. Time sensitive entry, is advised to have ``publish_time`` value in Date format. Other values, General format is recommended.
+- ``metadata.csv`` is not strictly required. The user can provide
+  additional information with it, e.g. authors, title, journal and
+  ``publish_time``, mapped to each JSON file by its ``sha`` value.
+  ``cord_uid`` serves as the unique entry identity. Time-sensitive
+  entries are advised to have a ``publish_time`` value in Date format;
+  for other values, General format is recommended.
 
 Sample of ``metadata.csv`` entry:
 
-    ===================== =====================
-    Column Names          Column Values
-    --------------------- ---------------------
-    cord_uid              zjufx4fo
-    sha                   b2897e1277f56641193a6db73825f707eed3e4c9
-    source_x              PMC
-    title                 Sequence requirements for RNA strand transfer during nidovirus ...
-    doi                   10.1093/emboj/20.24.7220
-    pmcid                 PMC125340
-    pubmed_id             11742998
-    license               unk
-    abstract              Nidovirus subgenomic mRNAs contain a leader sequence derived ...
-    publish_time          2001-12-17
-    ===================== =====================
+    +-----------------+--------------------------------------------------------------------+
+    | Column Names    | Column Values                                                      |
+    +=================+====================================================================+
+    | cord_uid        | zjufx4fo                                                           |
+    +-----------------+--------------------------------------------------------------------+
+    | sha             | b2897e1277f56641193a6db73825f707eed3e4c9                           |
+    +-----------------+--------------------------------------------------------------------+
+    | source_x        | PMC                                                                |
+    +-----------------+--------------------------------------------------------------------+
+    | title           | Sequence requirements for RNA strand transfer during nidovirus ... |
+    +-----------------+--------------------------------------------------------------------+
+    | doi             | 10.1093/emboj/20.24.7220                                           |
+    +-----------------+--------------------------------------------------------------------+
+    | pmcid           | PMC125340                                                          |
+    +-----------------+--------------------------------------------------------------------+
+    | pubmed_id       | 11742998                                                           |
+    +-----------------+--------------------------------------------------------------------+
+    | license         | unk                                                                |
+    +-----------------+--------------------------------------------------------------------+
+    | abstract        | Nidovirus subgenomic mRNAs contain a leader sequence derived ...   |
+    +-----------------+--------------------------------------------------------------------+
+    | publish_time    | 2001-12-17                                                         |
+    +-----------------+--------------------------------------------------------------------+
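+When ``metadata.csv`` is present, the link between its entries and the
+JSON files is effectively a join on ``sha``. A sketch with pandas
+(the column handling is illustrative):
+
+.. code-block:: python
+
+    import pandas as pd
+
+    # metadata.csv sits at the archive root; sha links each row to a JSON file
+    metadata = pd.read_csv("metadata.csv")
+
+    # Map each paper's sha to its publish_time (Date format, e.g. 2001-12-17)
+    sha_to_date = dict(
+        zip(metadata["sha"], pd.to_datetime(metadata["publish_time"]))
+    )
+
+    # A paper without a publish_time is simply not checked for timeliness
+    print(sha_to_date.get("b2897e1277f56641193a6db73825f707eed3e4c9"))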
| + +-----------------+----------------------------------------------------------------------+ + | doi | 10.1093/emboj/20.24.7220 | + +-----------------+----------------------------------------------------------------------+ + | pmcid | PMC125340 | + +-----------------+----------------------------------------------------------------------+ + | pubmed\_id | 11742998 | + +-----------------+----------------------------------------------------------------------+ + | license | unk | + +-----------------+----------------------------------------------------------------------+ + | abstract | Nidovirus subgenomic mRNAs contain a leader sequence derived ... | + +-----------------+----------------------------------------------------------------------+ + | publish\_time | 2001-12-17 | + +-----------------+----------------------------------------------------------------------+ + +Query Format +~~~~~~~~~~~~ + + **note** + + - The pretrained model should be fine-tuned with a dataset first to + adapt to particular question domains when necessary. + - Otherwise, following the question, input should contain relevant + information (context paragraph or candidate answers, or both), + whether or not addresses the question. + - Optionally, while the relevant information as additional + paragraph are provided in query, the question always comes first, + followed by additional paragraph. We use “n” separators between + the question and its paragraph of the input. + +Query is in JSON format. It could be a \\ of a single question in +``questions`` field. Model will only read the ``questions`` field. + +.. code:: text { 'questions': ['Is individual's age considered a potential risk factor of COVID19? \n People of all ages can be infected by the new coronavirus (2019-nCoV). Older people, and people with pre-existing medical conditions (such as asthma, diabetes, heart disease) appear to be more vulnerable to becoming severely ill with the virus. WHO advises people of all ages to take steps to protect themselves from the virus, for example by following good hand hygiene and good respiratory hygiene.', @@ -215,27 +282,26 @@ Query is in JSON format. It could be a of a single question in ``ques ... # other fileds. fields, other than 'questions', won't be read into the model } -Prediction Format -^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ +Prediction Format +~~~~~~~~~~~~~~~~~ The output is in JSON format. -.. code-block:: text - - ['Given a higher mortality rate for older cases, in one study, li et al showed that more than 50% of early patients with covid-19 in wuhan were more than 60 years old', - 'cardiac involvement has been reported in patients with covid-19, which may be reflected by ecg changes.' - ... - ] # output field is a list of string +.. code:: text + ['Given a higher mortality rate for older cases, in one study, li et al showed that more than 50% of early patients with covid-19 in wuhan were more than 60 years old', + 'cardiac involvement has been reported in patients with covid-19, which may be reflected by ecg changes.' + ... + ] # output field is a list of string MedQuAD Task Dataset Format -^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ +~~~~~~~~~~~~~~~~~~~~~~~~~~~ -:ref:`dataset-type:QUESTION_ANSWERING_MEDQUAD` +dataset-type:QUESTION\_ANSWERING\_MEDQUAD Dataset structure example: -.. code-block:: text +.. code:: text /MedQuAD.zip │ @@ -252,41 +318,48 @@ Dataset structure example: │ ... ... + **note** -.. 
 
 MedQuAD Task Dataset Format
-^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
 :ref:`dataset-type:QUESTION_ANSWERING_MEDQUAD`
 
 Dataset structure example:
 
 .. code-block:: text
 
    /MedQuAD.zip
    │
@@ -252,41 +318,48 @@ Dataset structure example:
    │   ...
    ...
 
 .. note::
 
-    - For following `.xml` sample, model would only take `Question` and `Answer` fields into the question answering processing.
-    - Each xml file contains multiple . Each contains one question and its answer.
+    - For the following ``.xml`` sample, the model only takes the
+      ``<Question>`` and ``<Answer>`` fields into the question
+      answering processing.
+    - Each xml file contains multiple ``<QAPair>`` elements. Each
+      ``<QAPair>`` contains one question and its answer.
 
-Sample `.xml` file:
+Sample ``.xml`` file:
 
 .. code-block:: text
 
    <Document>
       <QAPairs>
         ...
        <QAPair>  # pair #1
           <Question>A question here ...</Question>  # question #1, will be read as a string by the model
           <Answer>An answer here ...</Answer>  # answer of question #1, will be read as a string by the model
        </QAPair>
        ...  # multiple subsequent <QAPair> blocks; each Question and its Answer pair will be combined into one string by the model, and the QAPair strings are then processed into a pandas DataFrame
      </QAPairs>
    </Document>
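+For intuition, extracting the question-answer strings from such files
+could look like the sketch below (tag names follow the sample above;
+the helper is illustrative, not SINGA-Auto's actual loader):
+
+.. code-block:: python
+
+    import xml.etree.ElementTree as ET
+    from pathlib import Path
+
+    def load_qa_pairs(extracted_dir: str) -> list:
+        """Collect combined question-answer strings from every .xml file."""
+        pairs = []
+        for xml_path in Path(extracted_dir).rglob("*.xml"):
+            root = ET.parse(xml_path).getroot()
+            for qa in root.iter("QAPair"):
+                question = qa.findtext("Question", default="")
+                answer = qa.findtext("Answer", default="")
+                # Each Question and its Answer are combined into one string
+                pairs.append(f"{question} {answer}".strip())
+        return pairs
+
+    qa_strings = load_qa_pairs("MedQuAD")  # hypothetical extracted folder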
 
 Query Format
-^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+~~~~~~~~~~~~
 
 .. note::
 
-    - The pretrained model should be fine-tuned with a dataset first to adapt to particular question domains when necessary.
-    - Otherwise, following the question, input should contain relevant information (context paragraph or candidate answers, or both), whether or not addresses the question.
-    - Optionally, while the relevant information as additional paragraph are provided in query, the question always comes first, followed by additional paragraph. We use “\n” separators between the question and its paragraph of the input.
+    - The pretrained model should first be fine-tuned with a dataset
+      to adapt it to particular question domains, when necessary.
+    - Otherwise, following the question, the input should contain
+      relevant information (a context paragraph or candidate answers,
+      or both), whether or not it addresses the question.
+    - Optionally, when the relevant information is provided in the
+      query as an additional paragraph, the question always comes
+      first, followed by the additional paragraph. We use ``\n``
+      separators between the question and its paragraph in the input.
 
-Query is in JSON format. It could be a of a single question in ``questions`` field. Model will only read the ``questions`` field.
+The query is in JSON format. It could be a list of a single question
+in the ``questions`` field. The model will only read the ``questions``
+field.
 
 .. code-block:: text
 
    {
    'questions': ['Who is at risk for Adult Acute Lymphoblastic Leukemia?',
@@ -295,40 +368,43 @@ Query is in JSON format. It could be a of a single question in ``ques
    ...  # other fields; fields other than 'questions' won't be read into the model
    }
 
 Prediction Format
-^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+~~~~~~~~~~~~~~~~~
 
 The output is in JSON format.
 
 .. code-block:: text
 
    {'answers': ['Past treatment with chemotherapy or radiation therapy. Having certain genetic disorders.',  # the output 'answers' field is a list of strings
    'Chemotherapy. Radiation therapy. Chemotherapy with stem cell transplant. Targeted therapy.'
    ...
    ]}
 
 SPEECH_RECOGNITION
---------------------------------------------------------------------
+------------------
 
 Speech recognition for the *English* language.
 
 Dataset Type
-^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+~~~~~~~~~~~~
 
 :ref:`dataset-type:AUDIO_FILES`
 
-The ``audios.csv`` should be of a `.CSV `_
-format with 3 columns of ``wav_filename``, ``wav_filesize`` and ``transcript``.
+The ``audios.csv`` should be of a
+`.CSV <https://en.wikipedia.org/wiki/Comma-separated_values>`__ format
+with 3 columns: ``wav_filename``, ``wav_filesize`` and ``transcript``.
 
 For each row,
 
-    ``wav_filename`` should be a file path to a ``.wav`` audio file within the archive, relative to the root of the directory.
-    Each audio file's sample rate must equal to 16kHz.
+    ``wav_filename`` should be a file path to a ``.wav`` audio file
+    within the archive, relative to the root of the directory. Each
+    audio file's sample rate must be 16kHz.
 
-    ``wav_filesize`` should be an integer representing the size of the ``.wav`` audio file, in number of bytes.
+    ``wav_filesize`` should be an integer representing the size of the
+    ``.wav`` audio file, in number of bytes.
 
-    ``transcript`` should be a string of the true transcript for the audio file. Transcripts should only contain the following alphabets:
+    ``transcript`` should be a string of the true transcript for the
+    audio file. Transcripts should only contain the following
+    characters:
 
     ::
 
       a
@@ -359,12 +435,12 @@ For each row,
      y
      z
-
+      '
 
    An example of ``audios.csv`` follows:
 
 .. code-block:: text
 
    wav_filename,wav_filesize,transcript
    6930-81414-0000.wav,412684,audio transcript one
@@ -374,72 +450,71 @@ For each row,
    ...
    1995-1837-0001.wav,279404,audio transcript three thousand
 
 Query Format
-^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+~~~~~~~~~~~~
 
-A `Base64-encoded `_ string of the bytes of the audio as a 16kHz `.wav` file
+A `Base64-encoded <https://en.wikipedia.org/wiki/Base64>`__ string of
+the bytes of the audio as a 16kHz ``.wav`` file.
 
 Prediction Format
-^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+~~~~~~~~~~~~~~~~~
 
 A string, representing the predicted transcript for the audio.
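+Encoding a query needs only the standard library; a sketch (the file
+is assumed to already be a 16kHz ``.wav``):
+
+.. code-block:: python
+
+    import base64
+
+    # Read the .wav file and Base64-encode its raw bytes for the query
+    with open("6930-81414-0000.wav", "rb") as f:
+        query = base64.b64encode(f.read()).decode("utf-8")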
 
 TABULAR_CLASSIFICATION
---------------------------------------------------------------------
+----------------------
 
 Dataset Type
-^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+~~~~~~~~~~~~
 
 :ref:`dataset-type:TABULAR`
 
 The following optional train arguments are supported:
 
-    ===================== =====================
-    **Train Argument**    **Description**
-    --------------------- ---------------------
-    ``features``          List of feature columns' names as a list of strings (defaults to first ``N-1`` columns in the CSV file)
-    ``target``            Target column name as a string (defaults to the *last* column in the CSV file)
-    ===================== =====================
+    +--------------------+--------------------------------------------------------------+
+    | **Train Argument** | **Description**                                              |
+    +====================+==============================================================+
+    | ``features``       | List of feature columns' names as a list of strings         |
+    |                    | (defaults to the first ``N-1`` columns in the CSV file)     |
+    +--------------------+--------------------------------------------------------------+
+    | ``target``         | Target column name as a string (defaults to the *last*      |
+    |                    | column in the CSV file)                                      |
+    +--------------------+--------------------------------------------------------------+
 
    The train & validation datasets should have the same columns.
 
 Query Format
-^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+~~~~~~~~~~~~
 
-An size-``N-1`` dictionary representing feature-value pairs.
+A size-``N-1`` dictionary representing feature-value pairs.
 
 Prediction Format
-^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+~~~~~~~~~~~~~~~~~
 
-A size-``k`` list of floats, representing the probabilities of each class from ``0`` to ``k-1`` for the target column.
+A size-``k`` list of floats, representing the probabilities of each
+class from ``0`` to ``k-1`` for the target column.
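+As a concrete, made-up illustration of the query and prediction shapes
+(the feature names are examples only):
+
+.. code-block:: python
+
+    # Query: feature-value pairs for the N-1 feature columns
+    query = {"age": 23, "weight": 154.25, "height": 67.75}
+
+    # Prediction: a size-k list of class probabilities for the target column
+    prediction = [0.05, 0.85, 0.10]  # k = 3 classes, illustrative values
+    predicted_class = max(range(len(prediction)), key=prediction.__getitem__)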
+ +----------------------+-----------------------------------------------------------------------------------------------------------+ + | **Train Argument** | **Description** | + +======================+===========================================================================================================+ + | ``features`` | List of feature columns' names as a list of strings (defaults to first ``N-1`` columns in the CSV file) | + +----------------------+-----------------------------------------------------------------------------------------------------------+ + | ``target`` | Target column name as a string (defaults to the *last* column in the CSV file) | + +----------------------+-----------------------------------------------------------------------------------------------------------+ + + The train & validation datasets should have the same columns. An example of the dataset follows: -.. code-block:: text +.. code:: text density,bodyfat,age,weight,height,neck,chest,abdomen,hip,thigh,knee,ankle,biceps,forearm,wrist 1.0708,12.3,23,154.25,67.75,36.2,93.1,85.2,94.5,59,37.3,21.9,32,27.4,17.1 @@ -447,12 +522,12 @@ An example of the dataset follows: 1.0414,25.3,22,154,66.25,34,95.8,87.9,99.2,59.6,38.9,24,28.8,25.2,16.6 ... -Query Format -^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ +Query Format +~~~~~~~~~~~~ -An size-``N-1`` dictionary representing feature-value pairs. +An size-\ ``N-1`` dictionary representing feature-value pairs. -Prediction Format -^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ +Prediction Format +~~~~~~~~~~~~~~~~~ A float, representing the value of the target column. diff --git a/web/package.json b/web/package.json index 2563f064..c5f26c50 100644 --- a/web/package.json +++ b/web/package.json @@ -10,7 +10,7 @@ "@testing-library/jest-dom": "^4.2.4", "@testing-library/react": "^9.3.2", "@testing-library/user-event": "^8.0.2", - "axios": "^0.19.2", + "axios": "^0.21.1", "connected-react-router": "^6.8.0", "echarts": "^4.8.0", "echarts-for-react": "^2.0.16", diff --git a/web/yarn.lock b/web/yarn.lock index 36c86c00..e7bba470 100644 --- a/web/yarn.lock +++ b/web/yarn.lock @@ -2420,12 +2420,12 @@ aws4@^1.8.0: resolved "https://registry.yarnpkg.com/aws4/-/aws4-1.10.0.tgz#a17b3a8ea811060e74d47d306122400ad4497ae2" integrity sha512-3YDiu347mtVtjpyV3u5kVqQLP242c06zwDOgpeRnybmXlYYsLbtTrUBUm8i8srONt+FWobl5aibnU1030PeeuA== -axios@^0.19.2: - version "0.19.2" - resolved "https://registry.yarnpkg.com/axios/-/axios-0.19.2.tgz#3ea36c5d8818d0d5f8a8a97a6d36b86cdc00cb27" - integrity sha512-fjgm5MvRHLhx+osE2xoekY70AhARk3a6hkN+3Io1jc00jtquGvxYlKlsFUhmUET0V5te6CcZI7lcv2Ym61mjHA== +axios@^0.21.1: + version "0.21.1" + resolved "https://registry.yarnpkg.com/axios/-/axios-0.21.1.tgz#22563481962f4d6bde9a76d516ef0e5d3c09b2b8" + integrity sha512-dKQiRHxGD9PPRIUNIWvZhPTPpl1rf/OxTYKsqKUDjBwYylTvV7SjSHJb9ratfyzM6wCdLCOYLzs73qpg5c4iGA== dependencies: - follow-redirects "1.5.10" + follow-redirects "^1.10.0" axobject-query@^2.0.2: version "2.2.0" @@ -3821,13 +3821,6 @@ debug@2.6.9, debug@^2.2.0, debug@^2.3.3, debug@^2.6.0, debug@^2.6.9: dependencies: ms "2.0.0" -debug@=3.1.0: - version "3.1.0" - resolved "https://registry.yarnpkg.com/debug/-/debug-3.1.0.tgz#5bb5a0672628b64149566ba16819e61518c67261" - integrity sha512-OX8XqP7/1a9cqkxYw2yXss15f26NKWBpDXQd0/uK/KPqdQhxbPa994hnzjcE2VqQpDslf55723cKPUOGSmMY3g== - dependencies: - ms "2.0.0" - debug@^3.1.1, debug@^3.2.5: version "3.2.6" resolved 
"https://registry.yarnpkg.com/debug/-/debug-3.2.6.tgz#e83d17de16d8a7efb7717edbe5fb10135eee629b" @@ -4947,17 +4940,10 @@ flush-write-stream@^1.0.0: inherits "^2.0.3" readable-stream "^2.3.6" -follow-redirects@1.5.10: - version "1.5.10" - resolved "https://registry.yarnpkg.com/follow-redirects/-/follow-redirects-1.5.10.tgz#7b7a9f9aea2fdff36786a94ff643ed07f4ff5e2a" - integrity sha512-0V5l4Cizzvqt5D44aTXbFZz+FtyXV1vrDN6qrelxtfYQKW0KO0W2T/hkE8xvGa/540LkZlkaUjO4ailYTFtHVQ== - dependencies: - debug "=3.1.0" - -follow-redirects@^1.0.0: - version "1.12.1" - resolved "https://registry.yarnpkg.com/follow-redirects/-/follow-redirects-1.12.1.tgz#de54a6205311b93d60398ebc01cf7015682312b6" - integrity sha512-tmRv0AVuR7ZyouUHLeNSiO6pqulF7dYa3s19c6t+wz9LD69/uSzdMxJ2S91nTI9U3rt/IldxpzMOFejp6f0hjg== +follow-redirects@^1.0.0, follow-redirects@^1.10.0: + version "1.13.1" + resolved "https://registry.yarnpkg.com/follow-redirects/-/follow-redirects-1.13.1.tgz#5f69b813376cee4fd0474a3aba835df04ab763b7" + integrity sha512-SSG5xmZh1mkPGyKzjZP8zLjltIfpW32Y5QpdNJyjcfGxK3qo3NDDkZOZSFiGn1A6SclQxY9GzEwAHQ3dmYRWpg== for-in@^0.1.3: version "0.1.8"