Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Allow each Column in a DataTable have its own RowIndex #1188

Closed
st-pasha opened this issue Jul 21, 2018 · 1 comment · Fixed by #1614
Closed

Allow each Column in a DataTable have its own RowIndex #1188

st-pasha opened this issue Jul 21, 2018 · 1 comment · Fixed by #1614
Assignees
Labels
High priority high-priority tasks improve Improvement of an existing functionality join refactor Internal code changes, clean-ups or reorganizations that are not externally visible
Milestone

Comments

@st-pasha
Copy link
Contributor

This is needed to implement efficient joins: the columns from the Left frame and the columns from the Right frame will have different RowIndices applied to them.
This feature may also be handy when selecting regular columns alongside with the computed ones.

This change will remove the rowindex property in DataTable object, possibly replacing it with nrowindices field counting the number of different rowindices for all columns in the frame.

@st-pasha st-pasha added improve Improvement of an existing functionality join labels Jul 21, 2018
@st-pasha st-pasha self-assigned this Jul 21, 2018
@st-pasha st-pasha added the High priority high-priority tasks label Jul 31, 2018
st-pasha added a commit that referenced this issue Aug 2, 2018
Issue #1188 appears to be much harder than it seemed at the first sight, so I'll attempt to implement it in several batches.

* Internal method `columns_from_array()` removed -- same functionality can be achieved via `columns_from_columns()`;
* `ColumnSet` objects can now contain columns with different `RowIndex`es (however `DataTable`s still can't);
* Fixed a bug in DataTable's constructor where the last column was not checked properly during construction (the bug never manifested itself though);
* Python-facing `_Column` object acquired new property `rowindex`, and 2 new methods: `replace_rowindex()` and `topython()` (the last one mostly useful for debug purposes);
* `DataTable.isview` semantics has changed: now it returns True if *any* of the columns has a RowIndex (this change is backward-compatible);
* Python-facing `_RowIndex` object acquired property `ptr`, which can be used in python to compare whether to`RowIndex`es are equal;
* Methods `make_*()` in `EvaluationEngine` were removed, since they served as a useless indirection level;
* `SliceCSNode` no longer uses `columns_from_slice()` method (preparing to remove the latter);
* Added methods to `EvaluationEngine` to distinguish between "source"/"final" RowIndexes -- these are still work in progress, and currently coexist with the old-style `.rowindex` property;
* Attempt to disentangle final/source RowIndex calculation in RowsNode and GroupbyNode -- still WIP;
* Several places where column's "meta" was mentioned were cleaned up -- there is no such concept anymore.
abal5 pushed a commit that referenced this issue Aug 27, 2018
Issue #1188 appears to be much harder than it seemed at the first sight, so I'll attempt to implement it in several batches.

* Internal method `columns_from_array()` removed -- same functionality can be achieved via `columns_from_columns()`;
* `ColumnSet` objects can now contain columns with different `RowIndex`es (however `DataTable`s still can't);
* Fixed a bug in DataTable's constructor where the last column was not checked properly during construction (the bug never manifested itself though);
* Python-facing `_Column` object acquired new property `rowindex`, and 2 new methods: `replace_rowindex()` and `topython()` (the last one mostly useful for debug purposes);
* `DataTable.isview` semantics has changed: now it returns True if *any* of the columns has a RowIndex (this change is backward-compatible);
* Python-facing `_RowIndex` object acquired property `ptr`, which can be used in python to compare whether to`RowIndex`es are equal;
* Methods `make_*()` in `EvaluationEngine` were removed, since they served as a useless indirection level;
* `SliceCSNode` no longer uses `columns_from_slice()` method (preparing to remove the latter);
* Added methods to `EvaluationEngine` to distinguish between "source"/"final" RowIndexes -- these are still work in progress, and currently coexist with the old-style `.rowindex` property;
* Attempt to disentangle final/source RowIndex calculation in RowsNode and GroupbyNode -- still WIP;
* Several places where column's "meta" was mentioned were cleaned up -- there is no such concept anymore.
abal5 pushed a commit that referenced this issue Sep 13, 2018
Issue #1188 appears to be much harder than it seemed at the first sight, so I'll attempt to implement it in several batches.

* Internal method `columns_from_array()` removed -- same functionality can be achieved via `columns_from_columns()`;
* `ColumnSet` objects can now contain columns with different `RowIndex`es (however `DataTable`s still can't);
* Fixed a bug in DataTable's constructor where the last column was not checked properly during construction (the bug never manifested itself though);
* Python-facing `_Column` object acquired new property `rowindex`, and 2 new methods: `replace_rowindex()` and `topython()` (the last one mostly useful for debug purposes);
* `DataTable.isview` semantics has changed: now it returns True if *any* of the columns has a RowIndex (this change is backward-compatible);
* Python-facing `_RowIndex` object acquired property `ptr`, which can be used in python to compare whether to`RowIndex`es are equal;
* Methods `make_*()` in `EvaluationEngine` were removed, since they served as a useless indirection level;
* `SliceCSNode` no longer uses `columns_from_slice()` method (preparing to remove the latter);
* Added methods to `EvaluationEngine` to distinguish between "source"/"final" RowIndexes -- these are still work in progress, and currently coexist with the old-style `.rowindex` property;
* Attempt to disentangle final/source RowIndex calculation in RowsNode and GroupbyNode -- still WIP;
* Several places where column's "meta" was mentioned were cleaned up -- there is no such concept anymore.
@st-pasha st-pasha mentioned this issue Dec 1, 2018
st-pasha added a commit that referenced this issue Dec 1, 2018
- `DataTable.resize_rows()` now works correctly for a DataTable where its columns have different row-indices;
- This method is now more efficient: for multi-column Frames a view is created, without having to modify the data in each column;
- Python `Frame.nrows` setter now always pads the Frame with NAs, even when it had only 1 row (previously 1-row frames were tiled);
- Method `dt.repeat(frame, n)` added to create a frame by repeating an existing frame multiple times.

This is a WIP for #1188
@st-pasha st-pasha added the refactor Internal code changes, clean-ups or reorganizations that are not externally visible label Jan 12, 2019
@st-pasha
Copy link
Contributor Author

Prerequisite: #1477

st-pasha added a commit that referenced this issue Jan 29, 2019
@st-pasha st-pasha added this to the Release 0.8.0 milestone Jan 29, 2019
st-pasha added a commit that referenced this issue Jan 29, 2019
This PR is the final step in implementing #1188: to allow each Column in a Frame to have its own distinct RowIndex. This allows us to: 
- cbind frames that have different rowindices without materializing individual columns;
- perform joins without materializing the resulting frame;
- Select filtered columns alongside the computed ones in `DT[i, j, ...]` expression.

Additional changes:
- Removed deprecated C API functions `datatable_get_column_data`, `datatable_unpack_slicerowindex`, `datatable_unpack_arrayrowindex` which are no longer used;
- `DataTable::rowindex` property was removed;
- Removed obsolete function `_datatable.get_internal_function_ptrs()`;
- Added function `datatable.internal.get_rowindex(frame, i)`;
- Removed `Frame.internal.rowindex` and `Frame.internal.rowindex_type`;
- Fixed sort function when applied to a 1-row view Frame.

Closes #1188
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
High priority high-priority tasks improve Improvement of an existing functionality join refactor Internal code changes, clean-ups or reorganizations that are not externally visible
Projects
None yet
Development

Successfully merging a pull request may close this issue.

1 participant