
improve I/O tools for working with data #146

@hawkrobe

Importing data from a file

We want a read() function for the command line that reads data from the given file and transforms it into an easily processed data structure for use in webppl functions. We expect this to be used primarily for CSV files, but it would be nice to support other input types as well.

The call would be read(filename, [opts]), where opts is an object containing any relevant parameters.

Here are some file types we might want to consider:

  • CSV: Needs to correctly infer the types of numeric values, strings, and booleans. opts could specify the delimiter (default to comma), whether there's a header (default to true), and so on. We could implement the internals using nodeCSV or babyparse.

    We thought the best output would be a list of objects, indexed by header names, with each element in the list corresponding to a row of the input csv. This would support post-processing steps like filtering, plucking a single column, and omitting irrelevant values.

    [{headerLabel1 : value1,
      headerLabel2 : value2,
      headerLabel3 : value3,
      ...},
     {...}]
  • JSON: Could be implemented using a combination of fs and JSON.parse:
    var fs = require('fs');
    var data = JSON.parse(fs.readFileSync('file', 'utf8'));
  • text: For NLP applications, sometimes you just want to pull in a bunch of text. opts could take the encoding to use.
  • images: For vision applications, it might be cool to be able to read in a .png or .jpg and parse it into a matrix of pixels. Could be implemented using node-opencv.
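To make the CSV case above concrete, here is a minimal sketch of the type inference and the list-of-objects output. It is not a proposed implementation: `inferType` and `parseCSV` are hypothetical names, the `delimiter` and `header` option names are assumptions, and the naive split does not handle quoted fields, which is where a real parser such as nodeCSV or babyparse would come in.

```javascript
// Hypothetical helper: turn one raw CSV cell into a boolean, number, or string.
function inferType(cell) {
  var s = cell.trim();
  if (s === 'true') return true;
  if (s === 'false') return false;
  // Number('') is 0, so guard against empty cells before converting.
  if (s !== '' && !isNaN(Number(s))) return Number(s);
  return s;
}

// Hypothetical sketch of the CSV branch of read(); option names are assumptions.
// Naive: splits on the delimiter only, so quoted fields that contain the
// delimiter or a newline are NOT handled.
function parseCSV(text, opts) {
  opts = opts || {};
  var delimiter = opts.delimiter || ',';
  var hasHeader = opts.header !== false; // default to true
  var rows = text.split(/\r?\n/)
    .filter(function(line) { return line.trim() !== ''; })
    .map(function(line) { return line.split(delimiter); });
  if (rows.length === 0) return [];
  // Without a header row, fall back to made-up column names col0, col1, ...
  var header = hasHeader
    ? rows.shift().map(function(h) { return h.trim(); })
    : rows[0].map(function(_, i) { return 'col' + i; });
  return rows.map(function(cells) {
    var obj = {};
    header.forEach(function(name, i) {
      obj[name] = inferType(cells[i] === undefined ? '' : cells[i]);
    });
    return obj;
  });
}

var data = parseCSV('name,age,happy\nalice,31,true\nbob,4.5,false\n');
// Post-processing on the list of objects: filter rows, then pluck one column.
var happyAges = data
  .filter(function(row) { return row.happy; })
  .map(function(row) { return row.age; });
```

With this shape, filtering and plucking reduce to ordinary `filter` and `map` calls over the rows.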

We can determine the file type from the file extension. If it is not .csv, .tsv, .json, or one of the image encodings we support, it could default to text.
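A rough sketch of that extension-based dispatch, using Node's `path.extname`. The function names and the particular list of image extensions are assumptions for illustration; the issue does not fix which image formats would be supported.

```javascript
var path = require('path');

// Hypothetical set of supported image extensions (an assumption, not decided above).
var IMAGE_EXTENSIONS = ['.png', '.jpg', '.jpeg'];

// Hypothetical helper: map a filename to one of 'csv', 'json', 'image', 'text'.
function detectFileType(filename) {
  var ext = path.extname(filename).toLowerCase();
  if (ext === '.csv' || ext === '.tsv') return 'csv';
  if (ext === '.json') return 'json';
  if (IMAGE_EXTENSIONS.indexOf(ext) !== -1) return 'image';
  return 'text'; // anything unrecognized, including no extension at all
}

// Hypothetical helper: a .tsv file implies a tab delimiter unless opts overrides it.
function defaultDelimiter(filename) {
  return path.extname(filename).toLowerCase() === '.tsv' ? '\t' : ',';
}
```

Treating .tsv as CSV with a different default delimiter keeps a single parsing path for both.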

Writing ERPs to file

We want a write() function that will write out one or more ERPs to a file for post-processing and analysis in other languages like R or Python. We want to write to a .csv in long form. There are two major issues to consider here:

  1. The results of many model-fitting exercises are ERPs with lists of lists or lists of objects in the support. To put this in long-form, we need to write one line for each of these internal lists or objects, which share the same probability. We'd also like to be able to support multiple sets of object keys: one element in the support may be {type : 'a', alphaVal : .5} and another may be {type: 'b', betaVal : 1.5}. We'd like to write a csv with 'type', 'alphaVal' and 'betaVal' as column headers and NAs filled for rows where they aren't specified.
  2. For model comparison applications, we often want to write multiple ERPs to the same file with one or more labels identifying which ERP is which. To put this in long-form, we need to prepend these labels to each row. We also need to be able to specify whether we're going to append to the given file or write a new one.
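Both issues can be illustrated with a small sketch that flattens a support into long-form rows. Everything here is an assumption made for illustration: the ERP is stood in for by a plain array of `{value, prob}` pairs rather than webppl's actual ERP interface, labels are passed as a name-to-value object, and `toLongForm` is a hypothetical name.

```javascript
// Hypothetical sketch: flatten a support into long-form rows.
// entries: array of {value, prob}, where value is an object or a list of objects.
// labels: object of labelName -> labelValue, prepended to every row.
function toLongForm(entries, labels) {
  labels = labels || {};
  var labelNames = Object.keys(labels);
  var keys = [];    // union of object keys, in order of first appearance
  var records = []; // one {obj, prob} per output row
  entries.forEach(function(entry) {
    var objs = Array.isArray(entry.value) ? entry.value : [entry.value];
    objs.forEach(function(obj) {
      Object.keys(obj).forEach(function(k) {
        if (keys.indexOf(k) === -1) keys.push(k);
      });
      // Every inner object of one support element shares that element's probability.
      records.push({obj: obj, prob: entry.prob});
    });
  });
  var header = labelNames.concat(keys, ['prob']);
  var rows = records.map(function(rec) {
    var labelCells = labelNames.map(function(name) { return labels[name]; });
    var valueCells = keys.map(function(k) {
      // Fill NA where this object does not define the column.
      return Object.prototype.hasOwnProperty.call(rec.obj, k) ? rec.obj[k] : 'NA';
    });
    return labelCells.concat(valueCells, [rec.prob]);
  });
  return {header: header, rows: rows};
}

var table = toLongForm(
  [{value: {type: 'a', alphaVal: 0.5}, prob: 0.25},
   {value: {type: 'b', betaVal: 1.5}, prob: 0.75}],
  {model: 'm1'});
// table.header is ['model', 'type', 'alphaVal', 'betaVal', 'prob']
```

Collecting the key union before building rows is what lets every row have the same columns even when objects in the support disagree on their keys.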

It seems like the right call for this is to first create a writer object:

var myWriter = csvWriter(filename, mode)

where mode is either 'w' (write) or 'a' (append), mirroring the file modes of Python's open(). Then pass this writer object into the write function with an object setting various options:

write(myWriter, opts)

The two options we had in mind are (1) parameterHeaders, which specifies which headers will be found in the list of lists or list of objects, and (2) additionalLabels, the list of labels to prepend to each line of the csv.
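As a rough sketch of how the writer and the two options could fit together, assuming Node's `fs` module. Several details are assumptions not settled above: the ERP is passed as an extra `entries` argument (an array of `{value, prob}` pairs whose `value` is a list of lists or list of objects), a header line is written only in 'w' mode, label columns get the made-up names `label1`, `label2`, ..., and `closeWriter` is a hypothetical helper.

```javascript
var fs = require('fs');
var os = require('os');
var path = require('path');

// Hypothetical writer object: wraps a file descriptor opened in 'w' or 'a' mode.
function csvWriter(filename, mode) {
  if (mode !== 'w' && mode !== 'a') throw new Error("mode must be 'w' or 'a'");
  // Assumption: only a freshly written file gets a header line.
  return {fd: fs.openSync(filename, mode), needsHeader: mode === 'w'};
}

// Hypothetical write(): entries stands in for the ERP's support and probabilities.
function write(writer, entries, opts) {
  var headers = opts.parameterHeaders;
  var labels = opts.additionalLabels || [];
  var lines = [];
  if (writer.needsHeader) {
    var labelNames = labels.map(function(_, i) { return 'label' + (i + 1); });
    lines.push(labelNames.concat(headers, ['prob']).join(','));
    writer.needsHeader = false;
  }
  entries.forEach(function(entry) {
    entry.value.forEach(function(item) {
      // Inner lists are written positionally; objects are looked up by header.
      var cells = Array.isArray(item) ? item : headers.map(function(h) {
        return Object.prototype.hasOwnProperty.call(item, h) ? item[h] : 'NA';
      });
      lines.push(labels.concat(cells, [entry.prob]).join(','));
    });
  });
  if (lines.length > 0) fs.writeSync(writer.fd, lines.join('\n') + '\n');
}

function closeWriter(writer) { fs.closeSync(writer.fd); }

// Usage: write one ERP to a new file, then append a second with a different label.
var file = path.join(os.tmpdir(), 'erp-sketch-' + process.pid + '.csv');
var opts1 = {parameterHeaders: ['type', 'alphaVal'], additionalLabels: ['model1']};
var opts2 = {parameterHeaders: ['type', 'alphaVal'], additionalLabels: ['model2']};
var w = csvWriter(file, 'w');
write(w, [{value: [{type: 'a', alphaVal: 0.5}], prob: 1}], opts1);
closeWriter(w);
var a = csvWriter(file, 'a');
write(a, [{value: [['b', 1.5]], prob: 1}], opts2);
closeWriter(a);
var contents = fs.readFileSync(file, 'utf8');
fs.unlinkSync(file); // clean up the temporary file
```

Keeping the open mode on the writer rather than in opts means the decision to append or overwrite is made once per file, not once per ERP.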
