
dataset_format

Manlio Morini edited this page May 24, 2017 · 9 revisions

CSV DATASET

We follow the CSV standard with some additional conventions (and minor limitations):

  • no header row is allowed;

  • each line holds exactly one example. An example cannot contain newlines or span multiple lines;

  • columns are separated by commas. Commas inside a quoted string aren't column delimiters;

  • the first column holds the output value (numeric or string) for that example. If the first column:

    • is numeric, the model is a REGRESSION model;
    • is a string, the model is a CATEGORIZATION (i.e. classification) model.

    Each column must describe the same kind of information in every example;

  • the order of the feature columns does not affect the results. The first feature carries no more weight than the last;

  • TEXT STRINGS

    • place double quotes around all text strings;
    • text matching is case-sensitive: "wine" is different from "Wine";
    • if a string contains a double quote, escape it with another double quote, for example: "sentence with a ""double"" quote inside";

  • NUMERIC VALUES

    • both integer and decimal values are supported;
    • a quoted field that contains a single number and no whitespace is treated as a number, despite the quotation marks. Multiple numeric values inside the same quoted field are treated as a string. For example:
      • Numbers: "2", "12", "236"
      • Strings: "2 12", "a 23"
  • in a test set, the output value can be empty.
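
The conventions above can be illustrated with a minimal Python sketch built on the standard `csv` module. It is not the project's own parser. The helper names (`parse_field`, `load_examples`, `model_kind`) are hypothetical, and the sketch assumes that a number is an optionally signed integer or decimal literal and that an empty field maps to `None`.

```python
import csv
import io
import re

# Assumption: a number is an optionally signed integer or decimal literal.
_NUMBER = re.compile(r"[+-]?(\d+(\.\d*)?|\.\d+)")


def parse_field(text):
    """Return a float for a single number, None for an empty field, else the string."""
    if text == "":
        return None
    if _NUMBER.fullmatch(text):  # fails on any whitespace, so "2 12" stays a string
        return float(text)
    return text


def load_examples(stream):
    """Yield (output, features) pairs; every row is data because no header is allowed."""
    for row in csv.reader(stream):  # handles quoted commas and "" escapes
        if not row:
            continue  # skip blank lines
        values = [parse_field(field) for field in row]
        yield values[0], values[1:]


def model_kind(examples):
    """Guess the task from the first column; None if every output value is empty."""
    outputs = [out for out, _ in examples if out is not None]
    if not outputs:
        return None
    if all(isinstance(out, float) for out in outputs):
        return "regression"
    return "classification"


sample = (
    '"red","sentence with a ""double"" quote inside",12,"236"\n'
    '"white","2 12",3.5,"a 23"\n'
)
examples = list(load_examples(io.StringIO(sample)))
```

Because `csv.reader` removes the surrounding quotes before `parse_field` runs, `"236"` and `236` are handled identically, which matches the rule that quoted numbers are still numbers.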

As a best practice, remove punctuation (other than apostrophes) from your data. This is because commas, periods and other punctuation rarely add meaning to the training data, but are treated as meaningful elements by the learning engine. For example, "end." is not matched to "end".
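
As one possible way to apply this advice, the sketch below strips ASCII punctuation other than the apostrophe from a single text field. The function name `strip_punctuation` is hypothetical. The sketch is meant to run on the content of a text field before the CSV is written, not on whole CSV lines or numeric fields, since that would also remove delimiters, quotes and decimal points.

```python
import string

# Every ASCII punctuation character except the apostrophe.
_DROP = str.maketrans("", "", string.punctuation.replace("'", ""))


def strip_punctuation(text):
    """Remove ASCII punctuation other than apostrophes from one text field."""
    return text.translate(_DROP)


cleaned = strip_punctuation("don't stop, please!")
```

Characters are deleted rather than replaced with a space, so a hyphenated word such as "well-known" becomes "wellknown". Mapping punctuation to a space instead is an equally reasonable choice.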
