Adding a new data source #55

retorquere · 2016-09-19T16:24:03Z

Is it possible to add a new data source (such as a database)? Where should I begin to look?

codeanticode · 2016-10-14T12:21:08Z

@retorquere sorry for the delay in my reply. It would certainly be possible to add a new data source to Mirador. You should look into miralib, which is the package that contains all the low-level data handling and statistical calculations.

In particular, look at how the DataSet class in miralib uses MiraTable, which in turn extends Table from Processing core. Table supports csv, tsv, ods, and bin formats, but it should not be too difficult to extend this code to use other sources. I'm happy to talk more about it.

retorquere · 2016-10-14T12:38:36Z

Does that mean I should expect to make changes to all three classes, or are DataSet and MiraTable just for reference, and would I only be changing Table? I've looked at Table, and in particular its csv parsing, since the DB would just deliver columnar data without having to sift through an XML format as with ODS, but I haven't yet figured out what the parsers do in response to their input. Do they call callbacks (doesn't look like it), construct a data structure (couldn't find it), or build a string (it sort of looks like it, but I don't know what it is expected to look like)?

codeanticode · 2016-10-14T15:20:42Z

I would say that some refactoring in MiraTable would be needed in order to switch (internally) between Table to read csv/tsv/ods, and other classes to handle SQL databases, etc., depending on the input source. MiraTable should encapsulate all this functionality so there is no need to know about it at the level of the DataSet API, which is what Mirador relies on.

What I would do is to implement some concept of parser to handle the appropriate source, either at the level of MiraTable, or at the level of DataSet, so the public API of DataSet does not change.
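A minimal sketch of that idea, assuming nothing about miralib's real classes: `RowBackend`, `FileBackend`, `DatabaseBackend`, `DataFacade`, and the `db:` source prefix are all hypothetical names invented for illustration. It only shows how a facade can pick a backend internally while its public methods stay the same.

```java
import java.util.List;

// Hypothetical row-oriented backend; a stand-in, not miralib's actual interface.
interface RowBackend {
    int getRowCount();
    List<String> getRow(int i);
}

// Serves rows that were already parsed from a delimited file (csv/tsv).
final class FileBackend implements RowBackend {
    private final List<List<String>> rows;
    FileBackend(List<List<String>> rows) { this.rows = rows; }
    @Override public int getRowCount() { return rows.size(); }
    @Override public List<String> getRow(int i) { return rows.get(i); }
}

// Placeholder for a database-backed source; here it simply wraps fixed rows.
final class DatabaseBackend implements RowBackend {
    private final List<List<String>> rows;
    DatabaseBackend(List<List<String>> rows) { this.rows = rows; }
    @Override public int getRowCount() { return rows.size(); }
    @Override public List<String> getRow(int i) { return rows.get(i); }
}

// Facade whose public methods do not change when a new backend is added.
final class DataFacade {
    private final RowBackend backend;

    private DataFacade(RowBackend backend) { this.backend = backend; }

    // Chooses the backend from the source name (the "db:" prefix is an invented convention).
    static DataFacade open(String source, List<List<String>> rows) {
        RowBackend b = source.startsWith("db:")
                ? new DatabaseBackend(rows)
                : new FileBackend(rows);
        return new DataFacade(b);
    }

    int rowCount() { return backend.getRowCount(); }
    String cell(int row, int col) { return backend.getRow(row).get(col); }
    String backendName() { return backend.getClass().getSimpleName(); }
}
```

Callers only ever touch `DataFacade`, so adding another backend later would not change their code.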

retorquere · 2016-10-20T12:38:16Z

The source in this case would be a database (InfluxDB in my case), so there wouldn't be parsing involved as such; the source knows what columns are present and what type they are, and would just hand you data row-by-row.

codeanticode · 2016-10-24T15:50:54Z

ok, I did some refactoring in miralib to make it easier to add support for other data sources: https://github.com/mirador/miralib/issues/16

You would need to write your InfluxDB wrapper as an implementation of the new DataSource interface. Use MiraTable as a reference, and let me know if you have any questions.
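As a rough illustration of what such a wrapper might look like, here is a sketch built on a hypothetical interface. `SimpleSource`, `ColType`, and `CursorSource` are invented names; only `getRowCount` and `getRow` are mentioned in this thread, and the real DataSource interface in miralib may declare different methods and return types (for example `TableRow` rather than `Object[]`). No InfluxDB client calls are shown; rows are fed in through `addRow` to simulate a cursor.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical column types; miralib and Processing define their own identifiers.
enum ColType { INT, DOUBLE, STRING }

// Hypothetical stand-in for the DataSource interface; method set is an assumption.
interface SimpleSource {
    int getColumnCount();
    String getColumnName(int col);
    ColType getColumnType(int col);
    int getRowCount();
    Object[] getRow(int i);
}

// Collects rows handed over one at a time by a database cursor (simulated by addRow).
final class CursorSource implements SimpleSource {
    private final String[] names;
    private final ColType[] types;
    private final List<Object[]> rows = new ArrayList<>();

    CursorSource(String[] names, ColType[] types) {
        if (names.length != types.length) {
            throw new IllegalArgumentException("names and types must have the same length");
        }
        this.names = names.clone();
        this.types = types.clone();
    }

    // Called once per row as the database delivers it; no text parsing is involved.
    void addRow(Object... values) {
        if (values.length != names.length) {
            throw new IllegalArgumentException("expected " + names.length + " values");
        }
        rows.add(values.clone());
    }

    @Override public int getColumnCount() { return names.length; }
    @Override public String getColumnName(int col) { return names[col]; }
    @Override public ColType getColumnType(int col) { return types[col]; }
    @Override public int getRowCount() { return rows.size(); }
    @Override public Object[] getRow(int i) { return rows.get(i).clone(); }
}
```

Because the database already knows its column names and types, they are passed to the constructor instead of being inferred from the data.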

retorquere · 2016-10-24T19:55:01Z

Super, just a few questions:

I see the interface has getRowCount, which means I'll have to load the (potentially quite big) dataset into memory. Is that OK? What kind of volumes is miralib equipped to handle?
I assume the enum that holds the column type identifiers lives in processing.data.TableRow, but I can't find its source
MiraTable inherits from Table, but I can't find its source
getRow returns TableRow, but I can't find its source to see how I should set it up.

codeanticode · 2016-10-25T00:57:52Z

You don't need to load the entire dataset in memory. As long as you can return any row i when it is requested with the getRow(int i) method, things should be OK. In order to generate pairwise plots and calculate correlations, miralib creates copies or "slices" of the data that are disposed of as soon as the plot is out of view, so memory consumption should be reasonable even for "large" datasets. By large, I mean on the order of a few million rows. I have tried Mirador with datasets that big, and it is usable.
processing.data.TableRow is an interface defined here. This is the implementation I'm using in MiraTable.
Table is defined here.

Note that miralib is built on top of the data classes from Processing.
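The point above about serving any row i on demand, without holding the whole dataset in memory, could be sketched with a small page cache. This is an illustration under assumptions, not miralib code: `PageLoader` and `PagedRows` are hypothetical names, rows are plain `String[]` rather than `TableRow`, and the loader stands in for whatever range query the database offers.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical hook that fetches `count` rows starting at `firstRow` from the backing store.
interface PageLoader {
    String[][] load(int firstRow, int count);
}

// Answers getRow(i) for any valid i while keeping at most maxPages pages in memory.
final class PagedRows {
    private final PageLoader loader;
    private final int rowCount;
    private final int pageSize;
    private final Map<Integer, String[][]> cache;
    int loads = 0; // number of page fetches so far, exposed for illustration only

    PagedRows(PageLoader loader, int rowCount, int pageSize, final int maxPages) {
        this.loader = loader;
        this.rowCount = rowCount;
        this.pageSize = pageSize;
        // An access-ordered LinkedHashMap evicts the least recently used page.
        this.cache = new LinkedHashMap<Integer, String[][]>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<Integer, String[][]> eldest) {
                return size() > maxPages;
            }
        };
    }

    int getRowCount() { return rowCount; }

    String[] getRow(int i) {
        if (i < 0 || i >= rowCount) {
            throw new IndexOutOfBoundsException("row " + i);
        }
        int page = i / pageSize;
        String[][] rows = cache.get(page);
        if (rows == null) {
            // Cache miss: fetch only the page that contains row i.
            int first = page * pageSize;
            rows = loader.load(first, Math.min(pageSize, rowCount - first));
            loads++;
            cache.put(page, rows);
        }
        return rows[i % pageSize];
    }
}
```

Sequential access stays cheap because neighbouring rows share a page, and random access costs at most one fetch per request.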

retorquere · 2016-10-25T09:00:35Z

So if I don't know the rowcount beforehand, what should I return from getRowCount?

codeanticode · 2016-10-25T10:57:14Z

Well, Mirador needs a fixed sample size (the row count) to generate all the plots (histograms and eikosograms), as well as to evaluate the mutual information and other statistics.

These plots and calculations are all dynamic, though, which means that if the row count itself varies, the new count will be used the next time they are generated. But I haven't tested such a situation. It could be feasible to add an internal timer in miralib to update the dataset at regular intervals, in order to support dynamic sources.

In any case, you would need to provide a row count greater than zero at any given moment in order to generate anything with Mirador.
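One way to reconcile a growing source with the need for a fixed sample size is to publish the row count only at refresh points. The sketch below is an assumption about how that could look, not something miralib provides: `SnapshotCount` and its methods are hypothetical, and `refresh()` stands in for whatever a periodic timer would call.

```java
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical view over a growing source: the visible row count is frozen
// between refreshes, so one round of plots and statistics sees one sample size.
final class SnapshotCount {
    private final AtomicInteger liveCount = new AtomicInteger();
    private volatile int visibleCount = 0;

    // Called whenever the underlying source delivers another row.
    void rowArrived() { liveCount.incrementAndGet(); }

    // Would be called by a timer at regular intervals; publishes the current count.
    void refresh() { visibleCount = liveCount.get(); }

    // The count reported to consumers; it changes only when refresh() runs.
    int getRowCount() { return visibleCount; }

    // Nothing can be generated until at least one row is visible.
    boolean isReady() { return visibleCount > 0; }
}
```

A consumer would check `isReady()` before generating anything and read `getRowCount()` once per round.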

retorquere · 2016-10-31T20:11:46Z

My Java just isn't good enough, I'm afraid. I don't know how to put things together so that I can compile and test them. I'm OK with closing this issue.

codeanticode · 2016-11-01T01:30:52Z

I wrote some brief notes on how to compile Mirador with ant from the command line in the wiki. I can add some more details if that helps.

retorquere · 2016-11-01T12:42:11Z

So I start the whole build from the mirador clone, not build the individual projects first?

codeanticode · 2016-11-01T12:47:03Z

I think so, since the main Mirador ant script uses the .class files from the dependencies to build the final package. Give it a try. In the meantime I will do some tests on my own (since I already have everything set up in Eclipse, it is easy to overlook problems with fresh installs) and update the wiki accordingly.

retorquere · 2016-11-01T12:49:50Z

I'm doing everything from the command line when I can. I'll set up Eclipse if it's necessary, but if command-line ant works I'll take that.

codeanticode · 2018-07-20T11:04:27Z

Closing as there are no updates on this issue.

codeanticode added the enhancement label Oct 14, 2016

codeanticode closed this as completed Jul 20, 2018
