Add interpolate: interior forward-fill for numeric Polars columns #686
Agent-Logs-Url: https://github.com/Jebel-Quant/jquantstats/sessions/5461fa1e-d251-4b79-9984-b98ebb4ac04e Co-authored-by: tschm <2046079+tschm@users.noreply.github.com>
Pull request overview
Adds a new public Polars utility interpolate(df) to forward-fill numeric columns only within each column’s interior range (between first/last non-null), preserving leading/trailing nulls and leaving non-numeric columns unchanged.
Changes:
- Implement interpolate(df: pl.DataFrame) -> pl.DataFrame in src/jquantstats/data.py (with a temporary row-index column used for interior masking).
- Export interpolate from the package top-level in src/jquantstats/__init__.py.
- Add unit tests in tests/test_jquantstats/test_data.py covering expected filling/preservation behavior and ensuring the temp column isn’t present in output.
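To make the intended semantics concrete, here is a minimal pure-Python reference model of interior forward-fill on a single column. It is only a sketch of the behavior described in the overview, not the PR's Polars implementation; the function name interior_forward_fill is hypothetical.

```python
from typing import Optional, Sequence


def interior_forward_fill(values: Sequence[Optional[float]]) -> list[Optional[float]]:
    """Forward-fill None values strictly between the first and last non-null entries.

    Leading and trailing None values are left untouched.
    """
    valid = [i for i, v in enumerate(values) if v is not None]
    if not valid:
        # All-null (or empty) column: nothing to fill.
        return list(values)

    first, last = valid[0], valid[-1]
    out = list(values)
    # Positions first and last are non-null by construction, so only the
    # positions between them can need filling.
    for i in range(first + 1, last):
        if out[i] is None:
            out[i] = out[i - 1]
    return out


print(interior_forward_fill([None, 1.0, None, None, 4.0, None]))
# [None, 1.0, 1.0, 1.0, 4.0, None]
```

A plain forward-fill never touches leading nulls anyway, so in this model the interior restriction only changes the result for trailing nulls, which stay null instead of inheriting the last observed value.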
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
| tests/test_jquantstats/test_data.py | Adds TestInterpolate to validate interior forward-fill semantics and output cleanliness. |
| src/jquantstats/data.py | Introduces _NUMERIC_TYPES and the interpolate implementation using a temporary __row_idx__ mask. |
| src/jquantstats/__init__.py | Re-exports interpolate and includes it in __all__. |
_NUMERIC_TYPES: frozenset[type[pl.DataType]] = frozenset(
    {
        pl.Int8,
        pl.Int16,
        pl.Int32,
        pl.Int64,
        pl.UInt8,
        pl.UInt16,
        pl.UInt32,
        pl.UInt64,
        pl.Float32,
        pl.Float64,
    }
Numeric-column detection here uses a hard-coded _NUMERIC_TYPES set, which diverges from the repo’s existing convention of using Polars’ dtype.is_numeric() (and will also miss numeric dtypes like Decimal/Int128 if present). Consider removing _NUMERIC_TYPES and using df[col].dtype.is_numeric() / dtype.is_numeric() to decide whether to interpolate a column, which keeps behavior consistent and future-proof across Polars versions.
Done in f15c54a. Replaced _NUMERIC_TYPES with s.dtype.is_numeric(). Also fixed the __row_idx__ collision (unique temp column name via while tmp_col in df.columns) and the arg_true() efficiency issue (arg_max() on the boolean mask instead). Added test_existing_row_idx_column_is_preserved to cover the collision case.
            mask = (pl.col("__row_idx__") >= pl.lit(first_valid_idx)) & (
                pl.col("__row_idx__") <= pl.lit(last_valid_idx)
            )
            filled_col = pl.when(mask).then(pl.col(col).fill_null(strategy="forward")).otherwise(pl.col(col)).alias(col)
            out.append(filled_col)
        else:
            out.append(pl.col(col))

    return df.with_columns(pl.int_range(0, df.height).alias("__row_idx__")).select(out)


def _subtract_risk_free(dframe: pl.DataFrame, rf: float | pl.DataFrame, date_col: str) -> pl.DataFrame:
    """Subtract the risk-free rate from all numeric columns in the DataFrame.
The temporary __row_idx__ column name can collide with a user-provided column of the same name. In that case this implementation will overwrite the user column in with_columns(...), and select(out) will also include a (modified) __row_idx__ in the output because the loop iterates over all df.columns. Use a temp column name guaranteed not to exist in df.columns (e.g., generate a unique name), and ensure it’s excluded from out even if the input already had a __row_idx__ column.
Already applied in f15c54a — the implementation now picks a temp column name via while tmp_col in df.columns: tmp_col = f"_{tmp_col}_" so it can never collide with a user column, and test_existing_row_idx_column_is_preserved verifies that a pre-existing __row_idx__ column is both untouched and retained in the output.
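The renaming loop quoted in the reply can be illustrated in isolation. The wrapper function unique_temp_name and its parameters are hypothetical and exist only for this sketch; the loop body mirrors the snippet quoted above.

```python
def unique_temp_name(existing: list[str], base: str = "__row_idx__") -> str:
    """Return a column name derived from base that does not appear in existing."""
    tmp_col = base
    # Each pass wraps the candidate in one more pair of underscores, so the
    # name grows strictly longer and the loop ends once it outgrows every clash.
    while tmp_col in existing:
        tmp_col = f"_{tmp_col}_"
    return tmp_col


print(unique_temp_name(["date", "price"]))
# __row_idx__
print(unique_temp_name(["date", "__row_idx__"]))
# ___row_idx___
```

Because the temporary name is never equal to an input column name, a user column called __row_idx__ passes through the function like any other column.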
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
…ient first/last idx, collision test Agent-Logs-Url: https://github.com/Jebel-Quant/jquantstats/sessions/6b1a72b6-ed84-4742-a62c-25875b57d5d1 Co-authored-by: tschm <2046079+tschm@users.noreply.github.com>
…' into copilot/add-interpolate-function # Conflicts: # src/jquantstats/data.py # tests/test_jquantstats/test_data.py Co-authored-by: tschm <2046079+tschm@users.noreply.github.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Add interpolate function to src/jquantstats/data.py
- Export interpolate from src/jquantstats/__init__.py
- Replace _NUMERIC_TYPES with s.dtype.is_numeric() (future-proof, consistent with Polars)
- Fix __row_idx__ collision: generate a unique temp column name if __row_idx__ already exists in the input
- More efficient first/last index lookup (arg_max() on boolean mask instead of arg_true())
- Add test_existing_row_idx_column_is_preserved for the collision case