Aggregator in datatable #1077
Comments
@st-pasha @oleksiyskononenko any thoughts on how to go about this? We will have to decide on an interface for how the datatable aggregator interacts with the Java visual server.
@nikhilshekhar The idea is to implement the Aggregator in C++ in datatable and make it accessible from Python. Sure, we can discuss the interaction with the Java Visual Server.
Here is a preliminary list of functions that are going to be implemented.

Exposed functions (Python):

Internal functions (C++):
Looks good, except the output should be the original grouped frame. We will also add convenience functions to extract the [exemplar row + nobs] from the grouped frame, as well as the list of indices of rows within each group.
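To make the proposed interface concrete, here is a minimal pure-Python sketch of what the exposed function could return: exemplar rows, per-exemplar observation counts (nobs), and the member row indices for each exemplar. The function name and the equal-width 1D binning are illustrative assumptions, not the actual datatable API.

```python
# Hypothetical sketch of the proposed aggregator output: exemplars, nobs,
# and member row indices. The name aggregate_1d and the equal-width binning
# are illustrative assumptions, not the real datatable API.

def aggregate_1d(values, n_bins=3):
    """Group values into n_bins equal-width bins; the first member of each
    bin serves as its exemplar, and nobs is the bin's row count."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins or 1.0
    members = {}
    for row, v in enumerate(values):
        b = min(int((v - lo) / width), n_bins - 1)
        members.setdefault(b, []).append(row)
    exemplars = [values[rows[0]] for rows in members.values()]
    nobs = [len(rows) for rows in members.values()]
    return exemplars, nobs, list(members.values())

exemplars, nobs, members = aggregate_1d([0.1, 0.2, 0.9, 0.5, 0.95])
# exemplars -> [0.1, 0.9, 0.5]; nobs -> [2, 2, 1]; members -> [[0, 1], [2, 4], [3]]
```

The convenience functions mentioned above would then simply be accessors over the `(exemplars, nobs, members)` triple kept inside the grouped frame.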
@oleksiyskononenko I would like to point out a couple of things:
Just to put things in perspective: to visualize one dataset, there can be more than a few hundred or thousand calls to the aggregator, depending on the number of columns in the dataset. And the latency from the first aggregator call to the response being picked up by the visual server, for all the subsequent aggregation calls, is expected to be in seconds.

Everything else looks good, but we need to finalize the interactions between the visual server (which is a Java library) and the aggregator. @oleksiyskononenko @lelandwilkinson @mmalohlava

The initial chain of calls that @mmalohlava and I discussed is as below: @lo5 @mmalohlava this is the newly proposed architecture of the visual server/aggregator interaction. Kindly have a look before we start development on this.
We already have a channel of communication between the Java VizServer and the Python DAI. Why can't we piggy-back on that? In terms of retrieving data on Java's end, the data has to get there from C/Python somehow, and I don't think there's a faster way than through the NFF Jay file.

As for doing multiple aggregations at once: it is certainly doable, but will require a separate method. Such a method could take a list of frames to aggregate, and produce in turn a list of aggregated/reduced frames. Under the hood, all those aggregations would be carried out in parallel via OMP.
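The "list of frames in, list of aggregated frames out" idea can be sketched in a few lines. Here a thread pool and a toy per-frame reducer stand in for the real C++/OMP implementation; `aggregate_one` is a hypothetical placeholder, not a datatable function.

```python
# Sketch of batch aggregation: a list of frames in, a list of reduced
# results out. The real implementation would parallelize in C++ via OMP;
# a thread pool and a toy reducer stand in for it here.
from concurrent.futures import ThreadPoolExecutor

def aggregate_one(frame):
    # Placeholder reduction: collapse a "frame" (a list of numbers)
    # to its (min, max, row count).
    return (min(frame), max(frame), len(frame))

def aggregate_many(frames):
    # All aggregations are carried out concurrently, one task per frame;
    # map() preserves the input order of the frames.
    with ThreadPoolExecutor() as pool:
        return list(pool.map(aggregate_one, frames))

results = aggregate_many([[3, 1, 2], [10, 20]])
# results -> [(1, 3, 3), (10, 20, 2)]
```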
@st-pasha The Visual Server as of today communicates with the Python DAI only via JSON-RPC request/response. There is no actual transfer of data happening there. And if we want to rely on the Visual Server to read data from the Jay file, it would need a Java parser for the Jay format, and that would need to be written and maintained.
@nikhilshekhar Yes, we can add a convenience function to return you the aggregated rows only. As @st-pasha said (I've updated my original comment according to his suggestion), we can keep the data internally as the grouped frame, so that we can also return the members or any other relevant information, if required. As for the way of communication between the visual server and …
My Comments:
The problem:
I would prefer (B), but (A) is easier in the short term. WDYT @oleksiyskononenko @st-pasha?
@mmalohlava
The call sequence is Browser -> DAI -> Procsy -> VisServer. DAI (Python) already knows what it wants, and hence can:
I don't see a strong reason to have Java read anything from disk.
From talking to @nikhilshekhar, it turns out that there is a need to aggregate all columns in a dataset for autoviz, so (2) above would possibly be via a file. @nikhilshekhar will check with @lelandwilkinson whether it's possible to eagerly aggregate all columns and hand the vis server this data, as opposed to the vis server initiating calls to pydatatable (convoluted).
Thanks for all these comments and deep thinking. A couple of points:

vis-data-server is not allowed to see any data object except an aggregated Table. Please take a look at my latest code, where an aggregation happens when you see this line: `aggregatedTable = dataSource.getAggregatedTable(continuousVariableIndices);` For all purposes, vis-data-server has no idea (and doesn't care) whether the data were aggregated or not. Take a look at where this happens in DatatableDataSource.

I don't have options like number of bins, radius, minimum number of rows to aggregate, maximum number of dimensions, or seed, because I don't need them and they add unnecessary complications to the algorithm. These are things we should consider for V2. Adding them at this point will only complicate the development process.

The whole design of Aggregator, if you look at the Java code, is to require only one pass through the data, unlike the one Arno and I did for H2O. I agree with Nikhil that we ought to consider parallelization inside datatable if it will help with performance. It's not a difficult algorithm to parallelize.

A fundamental component of the contract is that vis-data-server ought to be able to access a data table as fast as other parts of DAI do. This implies that the aggregator in datatable has to be super fast and, equally important, that vis-data-server cannot read or even know about NFF files. But aggregated Tables are tiny, so if you decide it is better to provide a storage mechanism for them, they shouldn't take a lot of resources. That would mean that datatable has some sort of buffering mechanism to handle repeated queries for aggregated Tables. Again, however, it is not the responsibility of vis-data-server to decide how it accesses an aggregated Table. There is only one way it can see data (see the call above). Take a look at the Table class.
All the expensive functions are inside there: getMemberIndices(), getCategories(), getRow(), getDataLimits(), getNumValues(). We don't want any of these inside datatable, because these functions are not expensive when applied to the small datasets that are typical for aggregations.

Recall that the entire design of vis-data-server is based on computing weighted statistics, where the weights are in the last column ("counts"). Keep in mind also that vis-data-server is row-oriented. I know column-oriented is all the rage today, but for statistical calculations (Mahalanobis distance, etc.), row-oriented is more efficient when applied to files containing a small number of rows. Keep in mind that rows contain elements which are Java Objects. These Objects can be dates, numbers, strings, etc. One needn't know their types in order to process them.

A small point: the statement "Whenever a VisServer needs to aggregate a data frame" should be changed to "Whenever vis-data-server needs to see data." Again, vis-data-server knows nothing about aggregation. It thinks all datasets have a "counts" column, or otherwise that all counts are 1. The Aggregator (inside datatable) will decide when a file has too few rows to merit aggregation.

The aggregation algorithm is not a cluster analysis, so we should avoid terminology like "observation clusters" and instead use the terminology in the paper, namely "exemplars" (aggregated rows) and "members" (lists of row indices for each exemplar).

The statement, "If multiple column names are passed in options, ideally the aggregation should be done over each of the columns in parallel and the result should be returned back to the visual server. For example, if I pass in the file on which aggregation needs to be done along with 100 column names in the file, the aggregator should be able to compute the 100 aggregations in parallel and return the respective aggregated frame," is not quite true.
You can't parallelize across columns, because the ND aggregation is row-wise and the Euclidean distance calculation has to be computed across columns. You CAN parallelize across rows, and that would gain us some traction with deep files.

The statement, "We already have a channel of communication between the java VizServer and the python DAI. Why can't we piggy-back on that? In terms of retrieving data on Java's end, the data has to get there from C/Python somehow, and I don't think there's faster way than through the NFF Jay file," is not quite true. Take a look at NFFDataSource, which Nikhil designed for improving performance in V2. We thought it would be faster, but it wasn't. I don't think you want to store NFF files and parse them every time vis-data-server wants to see a Table. Whatever DAI does to get data should be the mechanism for vis-data-server getting the same data, except vis-data-server gets an aggregated Table rather than whole column(s). If DAI deals with in-memory data objects, then vis-data-server has to do the same.

I liked the comment, "I don't see a strong reason to have Java read anything from disk." We've been talking Python throughout, and that's fine. But keep in mind that the primary visualization client is JavaScript, not Python. If we're obliged to make everything accessible to Python (not sure why), then let's be sure we pay no performance penalty for doing that.

Again, thanks for the thoughts. It would appear that our main task is figuring out how to talk to Java. Even if an object is in memory, we have to get the Java DataSource to see it as a Java Table. I suggested JNI, but this can be inefficient. Maybe one of you knows some magic that is even faster for letting C++ in datatable hand over a Table.
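Two of the points above, that the Euclidean distance spans all columns of a row (so the work splits across rows, not columns) and that downstream statistics are weighted by the trailing "counts" column, can be sketched as follows. The function names are illustrative, not part of any actual API.

```python
# Row-wise exemplar assignment: the distance for one row needs every
# column of that row, so parallelism has to go across rows, not columns.
import math

def nearest_exemplar(row, exemplars):
    # math.dist computes the Euclidean distance across all coordinates.
    dists = [math.dist(row, ex) for ex in exemplars]
    return dists.index(min(dists))

# Weighted statistic over exemplar rows, with the weights taken from the
# trailing "counts" column, as vis-data-server computes them.
def weighted_mean(values, counts):
    return sum(v * c for v, c in zip(values, counts)) / sum(counts)

idx = nearest_exemplar([0.0, 0.0], [[0.0, 1.0], [5.0, 5.0]])   # -> 0
m = weighted_mean([1.0, 3.0], [3, 1])                          # -> 1.5
```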
@lo5 I circled around with @lelandwilkinson and the code: eager aggregation is not possible for all visualizations.
Trying to work out what a good chain of calls would be after the discussion on this thread, in sync with @lelandwilkinson @lo5 @mmalohlava. Kindly have a look at the below and point out changes/fixes, or suggest an alternative.

Browser makes a request for visualization --> Python DAI --> procsy --> Visual Server calls the corresponding methods. Each of these methods would need some aggregations to be computed --> call goes back via procsy to --> Python DAI --> calls the datatable aggregator, which writes out CSV files (one per aggregation request) --> the response with the aggregated file path is returned via procsy --> the Visual Server now reads the aggregated CSV files and computes the values needed for visualization --> returns the computed values via procsy --> Python DAI --> the UI renders and plots them.

The above chain of calls will be invoked multiple times before all the visualizations can be rendered and shown in the UI.
Been thinking this through a bit more, especially after talking with @nikhilshekhar. My CSV parser inside vis-data-server is almost as fast as datatable and produces a Table. So it might be cleaner for datatable to aggregate and output a CSV file that can be read (from disk or memory) by vis-data-server. Because the aggregated file is so small, this should be fairly quick. The credit_card.CSV file in my tests is 2.9 MB and is read in less than a second. This file is considerably larger than a typical aggregated file. This approach would be the cleanest. Whether it is the fastest requires further testing. |
If datatable outputs an aggregated CSV file, it would also be more useful to those Python users who are not using it with DAI (because it's open source). |
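If datatable does emit an aggregated CSV, the payload is just the exemplar rows plus a trailing "counts" column, which keeps the file tiny regardless of the input size. A stdlib-only sketch (the column names are illustrative assumptions):

```python
# Sketch of the aggregated-CSV handoff: exemplar values plus a trailing
# "counts" column. Column names here are illustrative assumptions.
import csv
import io

def write_aggregated_csv(exemplars, counts, out):
    writer = csv.writer(out)
    writer.writerow(["x", "counts"])          # header: one data column + counts
    for value, count in zip(exemplars, counts):
        writer.writerow([value, count])

buf = io.StringIO()
write_aggregated_csv([0.1, 0.9], [3, 2], buf)
# buf now holds three short lines: header plus one row per exemplar
```

Since the exemplar count is bounded (hundreds of rows, not millions), parse time on the vis-data-server side should be negligible compared to the aggregation itself.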
@lo5 … @lelandwilkinson …
My comments:
@nikhilshekhar why do you need a call from VisDataServer that "goes back via procsy to --> python DAI --> calls datatable"?

@lelandwilkinson CSV is an option, but to avoid re-parsing, NFF/Jay still seems more suitable.

@lelandwilkinson I meant parallelization over multiple invocations of the aggregator (as @arnocandel describes above); the motivation is that datatable is the one accessing the "big data", so it can do a better job of parallelizing the computation than we can do in VisDataServer.
I think the problem still remains of figuring out exactly which aggregations to perform, and who is going to control them. If a user loads an …
Overall, that's …
@mmalohlava yes, the Visual Server can definitely call datatable via an exec Python wrapper. It will save a few method calls. But to preserve the server-client architecture, I thought it would be best if every call is routed via the server. Still, if calling datatable directly from the visual server is a better way, surely we can do that.
It probably always does a full aggregation and then, based on calculations on that, asks for up to ~100 more (hopefully at once).
Includes a general Python/C++ layout and implementations of:

- 1D continuous aggregation
- 2D continuous aggregation
- `count()` reduce function
- wrappers to enable the usage of the `first()` reducer

1D categorical aggregation can now be done directly from Python through `groupby/count`. It will be implemented in C++ along with the remaining 2D and ND aggregators.
A few things:
One more thing: …
Is there something datatable can't do just yet, but you think it'd be nice if it did?
Aggregate
Is it related to some problem you're trying to solve?
Solve slow reading of NFF format files.
What do you think the API for your feature should be?
See the API in the Java code. The required methods are in the base class DataSource.
See Java code in https://github.com/h2oai/vis-data-server/blob/master/library/src/main/java/com/h2o/data/Aggregator.java
Plus other classes in that package for support. All of this should be done in C++.