Importing data from a file
We want a read() function for the command line that reads data from a given file and transforms it into an easily processed data structure for use in webppl functions. We expect it to be used primarily for CSV files, but it would be nice to support other input types as well.
The call would be read(filename, [opts]), where opts is an object containing any relevant parameters.
Here are some file types we might want to consider:
- CSV: Needs to correctly infer the types of numeric values, strings, and booleans. opts could specify the delimiter (defaulting to comma), whether there's a header (defaulting to true), and so on. We could implement the internals using nodeCSV or babyparse. We thought the best output would be a list of objects, keyed by header names, with each element in the list corresponding to a row of the input CSV. This would support post-processing steps like filtering, plucking a single column, and omitting irrelevant values.
[{headerLabel1 : value1,
headerLabel2 : value2,
headerLabel3 : value3,
...},
{...}]
- JSON: Could be implemented using a combination of fs and JSON.parse:
var data = JSON.parse(fs.readFileSync('file', 'utf8'));
- text: For NLP applications, sometimes you just want to pull in a bunch of text. opts could take the encoding to use.
- images: For vision applications, it might be cool to be able to read in a .png or .jpg and parse it into a matrix of pixels. Could be implemented using node-opencv.
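The CSV behaviour described in the list above could be sketched roughly as follows. This is only an illustration in plain JavaScript, not the proposed implementation: parseCSV and inferType are hypothetical names, the col0, col1, ... labels used when there is no header are an assumption, and the naive split does not handle quoted fields (a real version would delegate to a parser such as the ones mentioned above).

```javascript
// Hypothetical helper: infer the type of one raw CSV cell.
function inferType(raw) {
  var s = raw.trim();
  if (s === 'true') return true;
  if (s === 'false') return false;
  // Number('') is 0, so guard against empty cells before converting.
  if (s !== '' && !isNaN(Number(s))) return Number(s);
  return s;
}

// Hypothetical helper: turn CSV text into a list of row objects.
// Assumes no quoted fields and no delimiters or newlines inside cells.
function parseCSV(text, opts) {
  opts = opts || {};
  var delimiter = opts.delimiter || ',';
  var header = opts.header !== false; // header defaults to true
  var rows = text.split(/\r?\n/)
    .filter(function(line) { return line.trim() !== ''; })
    .map(function(line) { return line.split(delimiter); });
  if (rows.length === 0) return [];
  // Without a header, fall back to generated labels (an assumption).
  var labels = header
    ? rows.shift().map(function(label) { return label.trim(); })
    : rows[0].map(function(_, i) { return 'col' + i; });
  return rows.map(function(cells) {
    var obj = {};
    labels.forEach(function(label, i) {
      obj[label] = inferType(cells[i] === undefined ? '' : cells[i]);
    });
    return obj;
  });
}

var rows = parseCSV('name,score,passed\nalice,3.5,true\nbob,2,false\n');
// Post-processing then works with ordinary list operations:
var passedNames = rows
  .filter(function(row) { return row.passed; })
  .map(function(row) { return row.name; });
```

Because each row is an object keyed by header name, filtering and plucking a column need nothing beyond filter and map.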
We can determine the file type from the file extension. If it is not .csv, .tsv, .json, or one of the image formats we support, it could default to text.
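A minimal sketch of that extension check, assuming Node's path module is available. detectFileType is a hypothetical name, and the set of image extensions is an assumption (the proposal only mentions .png and .jpg; .jpeg is added here for illustration).

```javascript
var path = require('path');

// Assumed set of supported image extensions; the proposal does not fix this list.
var IMAGE_EXTENSIONS = ['.png', '.jpg', '.jpeg'];

// Hypothetical helper: map a filename to one of the proposed input types.
function detectFileType(filename) {
  var ext = path.extname(filename).toLowerCase();
  if (ext === '.csv') return 'csv';
  if (ext === '.tsv') return 'tsv';
  if (ext === '.json') return 'json';
  if (IMAGE_EXTENSIONS.indexOf(ext) !== -1) return 'image';
  // Anything else, including files with no extension, is treated as text.
  return 'text';
}
```

read() could then dispatch on the returned string, with .tsv presumably reusing the CSV path with a tab delimiter.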
Writing ERPs to file
We want a write() function that will write out one or more ERPs to a file for post-processing and analysis in other languages like R or Python. We want to write to a .csv in long form. There are two major issues to consider here:
- The results of many model-fitting exercises are ERPs with lists of lists or lists of objects in the support. To put this in long form, we need to write one line for each of these internal lists or objects, all sharing the same probability. We'd also like to support multiple sets of object keys: one element in the support may be {type : 'a', alphaVal : .5} and another may be {type: 'b', betaVal : 1.5}. We'd like to write a csv with 'type', 'alphaVal' and 'betaVal' as column headers and NA filled in for rows where a key isn't specified.
- For model comparison applications, we often want to write multiple ERPs to the same file with one or more labels identifying which ERP is which. To put this in long form, we need to prepend these labels to each row. We also need to be able to specify whether we're going to append to the given file or write a new one.
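To make the first point concrete, here is a rough sketch of the long-form conversion for a support made of objects (or lists of objects). It deliberately takes plain arrays rather than an ERP, since how the support and probabilities are extracted is not settled here. toLongForm and the 'prob' column name are assumptions, and cell values are not escaped.

```javascript
// Hypothetical helper: flatten a support into long-form CSV lines.
// `support` is a list of objects or lists of objects; `probs` holds the
// probability of each support element, in the same order.
function toLongForm(support, probs) {
  var headers = [];
  var records = [];
  support.forEach(function(value, i) {
    var items = Array.isArray(value) ? value : [value];
    items.forEach(function(item) {
      // Collect the union of keys, in order of first appearance.
      Object.keys(item).forEach(function(key) {
        if (headers.indexOf(key) === -1) headers.push(key);
      });
      // Every inner object shares the probability of its support element.
      records.push({ item: item, prob: probs[i] });
    });
  });
  var lines = [headers.concat(['prob']).join(',')];
  records.forEach(function(rec) {
    var cells = headers.map(function(h) {
      // Keys missing from this row are written as NA.
      return Object.prototype.hasOwnProperty.call(rec.item, h)
        ? String(rec.item[h])
        : 'NA';
    });
    lines.push(cells.concat([String(rec.prob)]).join(','));
  });
  return lines;
}

var lines = toLongForm(
  [{ type: 'a', alphaVal: 0.5 }, { type: 'b', betaVal: 1.5 }],
  [0.25, 0.75]
);
// lines[0] is the header row: 'type,alphaVal,betaVal,prob'
```

Lists of lists carry no key names of their own, so they would need headers supplied from outside; this sketch only covers the object case.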
It seems like the right call for this is to first create a writer object:
var myWriter = csvWriter(filename, mode)
where mode can be either w (write) or a (append). This mirrors the Python way of doing things. Then pass this writer object into the write function, along with an object setting various options:
write(myWriter, opts)
The two options we had in mind are (1) parameterHeaders, which specifies which headers will be found in the list of lists or list of objects, and (2) additionalLabels, the list of labels to prepend to each line of the csv.
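One way the writer object and the label-prepending step might fit together is sketched below. Only the names csvWriter, write, mode, and additionalLabels come from the proposal. Passing the data as opts.rows (a list of already-flattened rows) is purely an assumption for illustration, since the call above does not say where the ERP goes, and parameterHeaders is left out.

```javascript
var fs = require('fs');

// Sketch of the writer: opens the file once, so 'w' truncates a single
// time and later write() calls keep adding lines, as in Python.
function csvWriter(filename, mode) {
  if (mode !== 'w' && mode !== 'a') {
    throw new Error("mode must be 'w' or 'a'");
  }
  return { fd: fs.openSync(filename, mode) };
}

// Sketch of write(): opts.rows is a hypothetical option holding the rows
// to emit; opts.additionalLabels is prepended to every row.
function write(writer, opts) {
  var labels = opts.additionalLabels || [];
  var text = opts.rows.map(function(row) {
    return labels.concat(row).join(',') + '\n';
  }).join('');
  fs.writeSync(writer.fd, text);
}
```

With this shape, a model-comparison script would open one writer, call write once per ERP with a different additionalLabels value each time, and close the file descriptor with fs.closeSync(writer.fd) when done.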