🔖 0.6.4 #95

Merged
merged 10 commits into from
Mar 23, 2022
2 changes: 1 addition & 1 deletion .gitignore
Original file line number Diff line number Diff line change
Expand Up @@ -101,7 +101,7 @@ export/
site/

# poetry
poetry.lock
# poetry.lock

# backup files
*.bak
Expand Down
54 changes: 48 additions & 6 deletions README.md
Original file line number Diff line number Diff line change
@@ -1,6 +1,6 @@
# datar

Port of [dplyr][2] and other related R packages in python, using [pipda][3].
A Grammar of Data Manipulation in python

<!-- badges -->
[![Pypi][6]][7] [![Github][8]][9] ![Building][10] [![Docs and API][11]][5] [![Codacy][12]][13] [![Codacy coverage][14]][13]
Expand All @@ -9,18 +9,20 @@ Port of [dplyr][2] and other related R packages in python, using [pipda][3].

<img width="30%" style="margin: 10px 10px 10px 30px" align="right" src="logo.png">

Unlike other similar packages in python that just mimic the piping syntax, `datar` follows the API designs from the original packages as much as possible, and is tested thoroughly with the cases from the original packages, so that minimal effort is needed for those who are familiar with those R packages to transition to python.
`datar` is a re-imagining of APIs of data manipulation libraries in python (currently only `pandas` supported) so that you can manipulate your data with it like with `dplyr` in `R`.

`datar` is an in-depth port of `tidyverse` packages, such as `dplyr`, `tidyr`, `forcats` and `tibble`, as well as some functions from `R` itself.

## Installation

```shell
pip install -U datar
# to make sure dependencies to be up-to-date
# pip install -U varname pipda datar
```

`datar` requires python 3.7.1+ and is backended by `pandas (1.3+)`.
or
```shell
conda install -c conda-forge datar
# mamba install -c conda-forge datar
```

## Example usage

Expand Down Expand Up @@ -103,6 +105,46 @@ iris >> pull(f.Sepal_Length) >> dist_plot()

![example](./example2.png)

See also some advanced examples from my answers on StackOverflow:

- [Compare 2 DataFrames and drop rows that do not contain corresponding ID variables](https://stackoverflow.com/a/71532167/5088165)
- [count by id with dynamic criteria](https://stackoverflow.com/a/71519157/5088165)
- [counting the frequency in python size vs count](https://stackoverflow.com/a/71516503/5088165)
- [Pandas equivalent of R/dplyr group_by summarise concatenation](https://stackoverflow.com/a/71490832/5088165)
- [ntiles over columns in python using R's "mutate(across(cols = ..."](https://stackoverflow.com/a/71490501/5088165)
- [Replicate R Solution in Python for Calculating Monthly CRR](https://stackoverflow.com/a/71490194/5088165)
- [Best/Concise Way to Conditionally Concat two Columns in Pandas DataFrame](https://stackoverflow.com/a/71443587/5088165)
- [how to transform R dataframe to rows of indicator values](https://stackoverflow.com/a/71443515/5088165)
- [Left join on multiple columns](https://stackoverflow.com/a/71443441/5088165)
- [Python: change column of strings with None to 0/1](https://stackoverflow.com/a/71429016/5088165)
- [Comparing 2 data frames and finding values are not in 2nd data frame](https://stackoverflow.com/a/71415818/5088165)
- [How to compare two Pandas DataFrames based on specific columns in Python?](https://stackoverflow.com/a/71413499/5088165)
- [expand.grid equivalent to get pandas data frame for prediction in Python](https://stackoverflow.com/a/71376414/5088165)
- [Python pandas equivalent to R's group_by, mutate, and ifelse](https://stackoverflow.com/a/70387267/5088165)
- [How to convert a list of dictionaries to a Pandas Dataframe with one of the values as column name?](https://stackoverflow.com/a/69094005/5088165)
- [Moving window on a Standard Deviation & Mean calculation](https://stackoverflow.com/a/69093067/5088165)
- [Python: creating new "interpolated" rows based on a specific field in Pandas](https://stackoverflow.com/a/69092696/5088165)
- [How would I extend a Pandas DataFrame such as this?](https://stackoverflow.com/a/69092067/5088165)
- [How to define new variable based on multiple conditions in Pandas - dplyr case_when equivalent](https://stackoverflow.com/a/69080870/5088165)
- [What is the Pandas equivalent of top_n() in dplyr?](https://stackoverflow.com/a/69080806/5088165)
- [Equivalent of fct_lump in pandas](https://stackoverflow.com/a/69080727/5088165)
- [pandas equivalent of fct_reorder](https://stackoverflow.com/a/69080638/5088165)
- [Is there a way to find out the 2 X 2 contingency table consisting of the count of values by applying a condition from two dataframe](https://stackoverflow.com/a/68674345/5088165)
- [Count if array in pandas](https://stackoverflow.com/a/68659334/5088165)
- [How to create a new column for transposed data](https://stackoverflow.com/a/68642891/5088165)
- [How to create new DataFrame based on conditions from another DataFrame](https://stackoverflow.com/a/68640494/5088165)
- [Refer to column of a data frame that is being defined](https://stackoverflow.com/a/68308077/5088165)
- [How to use regex in mutate dplython to add new column](https://stackoverflow.com/a/68308033/5088165)
- [Multiplying a row by the previous row (with a certain name) in Pandas](https://stackoverflow.com/a/68137136/5088165)
- [Create dataframe from rows under a row with a certain condition](https://stackoverflow.com/a/68137089/5088165)
- [pandas data frame, group by multiple cols and put other columns' contents in one](https://stackoverflow.com/a/68136982/5088165)
- [Pandas custom aggregate function with condition on group, is it possible?](https://stackoverflow.com/a/68136704/5088165)
- [multiply different values to pandas column with combination of other columns](https://stackoverflow.com/a/68136300/5088165)
- [Vectorized column-wise regex matching in pandas](https://stackoverflow.com/a/68124082/5088165)
- [Iterate through and conditionally append string values in a Pandas dataframe](https://stackoverflow.com/a/68123912/5088165)
- [Groupby mutate equivalent in pandas/python using tidydata principles](https://stackoverflow.com/a/68123753/5088165)
- [More ...](https://stackoverflow.com/search?q=user%3A5088165+and+%5Bpandas%5D)


[1]: https://tidyr.tidyverse.org/index.html
[2]: https://dplyr.tidyverse.org/index.html
Expand Down
2 changes: 1 addition & 1 deletion datar/__init__.py
Original file line number Diff line number Diff line change
Expand Up @@ -30,7 +30,7 @@
 )

 __all__ = ("f", "get_versions")
-__version__ = "0.6.3"
+__version__ = "0.6.4"


 def get_versions(prnt: bool = True) -> _VersionsTuple:
Expand Down
8 changes: 4 additions & 4 deletions datar/base/constants.py
Original file line number Diff line number Diff line change
Expand Up @@ -7,8 +7,8 @@

 pi = math.pi

-letters = np.array(list(ascii_letters[:26]), dtype=object)
-LETTERS = np.array(list(ascii_letters[26:]), dtype=object)
+letters = np.array(list(ascii_letters[:26]), dtype='<U1')
+LETTERS = np.array(list(ascii_letters[26:]), dtype='<U1')

 month_abb = np.array(
     [
Expand All @@ -25,7 +25,7 @@
"Nov",
"Dec",
     ],
-    dtype=object,
+    dtype='<U1',
 )
 month_name = np.array(
     [
Expand All @@ -42,5 +42,5 @@
"November",
"December",
     ],
-    dtype=object,
+    dtype='<U1',
 )
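One point worth checking in this hunk: NumPy's `<U1` is a fixed-width dtype that holds a single character per element. That fits `letters` and `LETTERS`, but casting multi-character strings to it truncates them silently. The sketch below uses a shortened, illustrative list rather than the full arrays from `constants.py`, and compares `<U1` with letting NumPy infer the width.

```python
import numpy as np

# Illustrative subset only; the real arrays live in datar/base/constants.py.
months = ["Jan", "Feb", "Mar"]

# Explicit one-character width: each entry is cut to its first character.
truncated = np.array(months, dtype="<U1")

# No dtype given: NumPy picks a width that fits the longest string.
inferred = np.array(months)

print(truncated.tolist())
print(inferred.dtype, inferred.tolist())
```

If the intent is "a unicode dtype instead of `object`", omitting the width (or passing `dtype=str`) keeps the full month names.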
2 changes: 1 addition & 1 deletion datar/base/seq.py
Original file line number Diff line number Diff line change
Expand Up @@ -285,7 +285,7 @@ def c(*elems):
         lambda row: Collection(*row),
         axis=1,
     )
-    if isinstance(out, DataFrame):
+    if isinstance(out, DataFrame):  # pragma: no cover
         # pandas < 1.3.2
         out = Series(out.values.tolist(), index=out.index, dtype=object)

Expand Down
26 changes: 0 additions & 26 deletions datar/base/string.py
Original file line number Diff line number Diff line change
Expand Up @@ -15,7 +15,6 @@
 from ..core.factory import func_factory, dispatching
 from ..core.utils import (
     arg_match,
-    ensure_nparray,
     logger,
     regcall,
 )
Expand All @@ -24,31 +23,6 @@
 from .logical import as_logical


-def _recycle_value(value, size, name=None):
-    """Recycle a value based on a dataframe
-    Args:
-        value: The value to be recycled
-        size: The size to recycle to
-    Returns:
-        The recycled value
-    """
-    name = name or "value"
-    value = ensure_nparray(value)
-
-    if value.size > 0 and size % value.size != 0:
-        raise ValueError(
-            f"Cannot recycle {name} (size={value.size}) to size {size}."
-        )
-
-    if value.size == size == 0:
-        return np.array([], dtype=object)
-
-    if value.size == 0:
-        value = np.array([np.nan], dtype=object)
-
-    return value.repeat(size // value.size)
-
-
 @register_func(None, context=Context.EVAL)
 def as_character(
     x,
Expand Down
6 changes: 5 additions & 1 deletion datar/core/broadcast.py
Original file line number Diff line number Diff line change
Expand Up @@ -525,7 +525,10 @@ def _(
     if isinstance(value, DataFrame) and value.index.size == 0:
         value.index = index

-    if not value.index.equals(index):
+    # if not value.index.equals(index):
+    if not value.index.equals(index) and frozenset(
+        value.index
+    ) != frozenset(index):
         raise ValueError("Value has incompatible index.")

     if isinstance(value, Series):
Expand Down Expand Up @@ -716,6 +719,7 @@ def _(value: SeriesGroupBy, name: str) -> Tibble:
 @init_tibble_from.register(DataFrameGroupBy)
 def _(value: Union[DataFrame, DataFrameGroupBy], name: str) -> Tibble:
     from ..tibble import as_tibble
+
     result = regcall(as_tibble, value)

     if name:
Expand Down
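The relaxed index check in `broadcast.py` can be read as "accept the value if its index matches exactly, or if it holds the same labels in any order". The following is a minimal sketch of that predicate in isolation. `has_compatible_index` is a hypothetical helper name used here for illustration, not a function in `datar`.

```python
import pandas as pd


def has_compatible_index(value_index: pd.Index, index: pd.Index) -> bool:
    """Mirror of the condition in the hunk: exact match, or same label set."""
    return value_index.equals(index) or frozenset(value_index) == frozenset(index)


target = pd.Index([0, 1, 2])
reordered = pd.Index([2, 0, 1])          # same labels, different order
shifted = pd.Index([1, 2, 3])            # different labels
with_duplicate = pd.Index([0, 0, 1, 2])  # same label set, different length

print(target.equals(reordered))                      # order-sensitive comparison
print(has_compatible_index(reordered, target))       # order-insensitive fallback
print(has_compatible_index(shifted, target))
print(has_compatible_index(with_duplicate, target))
```

Because `frozenset` discards both order and multiplicity, an index with repeated labels also passes the fallback; whether that is intended is a question for the surrounding broadcast logic.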
20 changes: 19 additions & 1 deletion datar/dplyr/across.py
Original file line number Diff line number Diff line change
Expand Up @@ -165,6 +165,24 @@ def across(
     The original API:
         https://dplyr.tidyverse.org/reference/across.html

+    Examples:
+        #
+        >>> iris >> mutate(across(c(f.Sepal_Length, f.Sepal_Width), round))
+            Sepal_Length  Sepal_Width  Petal_Length  Petal_Width   Species
+               <float64>    <float64>     <float64>    <float64>  <object>
+        0            5.0          4.0           1.4          0.2    setosa
+        1            5.0          3.0           1.4          0.2    setosa
+        ..           ...          ...           ...          ...       ...
+
+        >>> iris >> group_by(f.Species) >> summarise(
+        >>>     across(starts_with("Sepal"), mean)
+        >>> )
+              Species  Sepal_Length  Sepal_Width
+             <object>     <float64>    <float64>
+        0      setosa         5.006        3.428
+        1  versicolor         5.936        2.770
+        2   virginica         6.588        2.974
+
     Args:
         _data: The dataframe.
         *args: If given, the first 2 elements should be columns and functions
Expand Down Expand Up @@ -218,7 +236,7 @@ def c_across(
         _cols: The columns

     Returns:
-        A series
+        A rowwise tibble
     """
     _data = _context.meta.get("input_data", _data)

Expand Down
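For readers coming from plain `pandas`, the two `across()` examples added to the docstring correspond roughly to the operations below. This is a sketch on a small made-up frame (the values are not the real `iris` data), and it shows the `pandas` equivalent rather than how `datar` implements `across()` internally.

```python
import pandas as pd

# Toy stand-in for iris; values are invented for illustration.
toy = pd.DataFrame(
    {
        "Species": ["setosa", "setosa", "virginica", "virginica"],
        "Sepal_Length": [5.1, 4.9, 6.3, 5.8],
        "Sepal_Width": [3.6, 3.0, 3.3, 2.7],
        "Petal_Length": [1.4, 1.4, 6.0, 5.1],
    }
)

# Rough analogue of starts_with("Sepal")
sepal_cols = [col for col in toy.columns if col.startswith("Sepal")]

# Rough analogue of mutate(across(<sepal columns>, round)):
# round only the selected columns, leave the others untouched.
rounded = toy.copy()
rounded[sepal_cols] = toy[sepal_cols].round()

# Rough analogue of group_by(Species) >> summarise(across(<sepal columns>, mean))
means = toy.groupby("Species", as_index=False)[sepal_cols].mean()

print(rounded)
print(means)
```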
4 changes: 2 additions & 2 deletions datar/dplyr/lead_lag.py
Original file line number Diff line number Diff line change
Expand Up @@ -28,12 +28,12 @@ def _shift(x, n, default=None, order_by=None):
     newx = Series(x)

     if order_by is not None:
-        newx = newx.reset_index(drop=True)
+        # newx = newx.reset_index(drop=True)
         out = with_order(order_by, Series.shift, newx, n, fill_value=default)
     else:
         out = newx.shift(n, fill_value=default)

-    return out
+    return out if isinstance(x, Series) else out.values


 @register_func(None, context=Context.EVAL)
Expand Down
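The `order_by` path in `_shift` follows the usual `dplyr` idea: sort the values by `order_by`, shift them, then put the results back in the original row order. Below is a dependency-free sketch of that idea. `lag_with_order` is a hypothetical name, and the code assumes `n >= 0` and sortable `order_by` values; it is not the `with_order` implementation used by `datar`.

```python
def lag_with_order(x, order_by, n=1, default=None):
    """Lag x by n positions along the ordering given by order_by."""
    size = len(x)
    # Positions of x sorted by order_by (sorted() is stable for ties).
    order = sorted(range(size), key=lambda i: order_by[i])
    sorted_x = [x[i] for i in order]
    # Shift within the sorted sequence, padding the front with `default`.
    shifted = [default] * min(n, size) + sorted_x[: max(size - n, 0)]
    # Scatter the shifted values back to the original positions.
    out = [default] * size
    for pos, original_position in enumerate(order):
        out[original_position] = shifted[pos]
    return out


values = [10, 20, 30]
keys = [3, 1, 2]
print(lag_with_order(values, keys))
```

Here the row with the smallest key has no predecessor and receives `default`, while each other row receives the value of the row that precedes it in key order.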
24 changes: 24 additions & 0 deletions docs/CHANGELOG.md
Original file line number Diff line number Diff line change
@@ -1,3 +1,27 @@
## 0.6.4

### Breaking changes

- 🩹 Make `base.ntile()` labels 1-based (#92)

### Fixes

- 🐛 Fix `order_by` argument for `dplyr.lead-lag`

### Enhancements

- 🚑 Allow `base.paste/paste0()` to work with grouped data
- 🩹 Change dtypes of `base.letters/LETTERS/month_abb/month_name`

### Housekeeping

- 📝 Update and fix reference maps
- 📝 Add `environment.yml` for binder to work
- 📝 Update styles for docs
- 📝 Update styles for API doc in notebooks
- 📝 Update README for new description about the project and add examples from StackOverflow


## 0.6.3

- ✨ Allow `base.c()` to handle groupby data
Expand Down
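To make the breaking change in the 0.6.4 changelog entry concrete, the sketch below shows what 1-based `ntile` labels look like. It uses a simple rank-based bucketing rule as an assumption for illustration; `ntile_one_based` is a hypothetical helper, and `base.ntile()` in `datar` may bucket values differently.

```python
def ntile_one_based(x, n):
    """Assign each value to one of n rank-based buckets, labelled 1..n."""
    size = len(x)
    # Positions of x in ascending order of value.
    order = sorted(range(size), key=lambda i: x[i])
    out = [0] * size
    for rank, position in enumerate(order):
        # rank is 0-based; the trailing "+ 1" is what makes the labels 1-based.
        out[position] = n * rank // size + 1
    return out


print(ntile_one_based([5, 1, 3, 2, 4, 6], 3))
```

Code that indexed into per-bucket lookups with the old 0-based labels would need the corresponding offset after upgrading.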
33 changes: 9 additions & 24 deletions docs/notebooks/across.ipynb
Original file line number Diff line number Diff line change
Expand Up @@ -2,7 +2,7 @@
"cells": [
{
"cell_type": "code",
"execution_count": 1,
"execution_count": 5,
"metadata": {
"execution": {
"iopub.execute_input": "2021-07-16T22:27:57.831736Z",
Expand All @@ -12,26 +12,10 @@
}
},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"[2022-03-06 00:17:06][datar][WARNING] Builtin name \"min\" has been overriden by datar.\n",
"[2022-03-06 00:17:06][datar][WARNING] Builtin name \"max\" has been overriden by datar.\n",
"[2022-03-06 00:17:06][datar][WARNING] Builtin name \"sum\" has been overriden by datar.\n",
"[2022-03-06 00:17:06][datar][WARNING] Builtin name \"abs\" has been overriden by datar.\n",
"[2022-03-06 00:17:06][datar][WARNING] Builtin name \"round\" has been overriden by datar.\n",
"[2022-03-06 00:17:06][datar][WARNING] Builtin name \"all\" has been overriden by datar.\n",
"[2022-03-06 00:17:06][datar][WARNING] Builtin name \"any\" has been overriden by datar.\n",
"[2022-03-06 00:17:06][datar][WARNING] Builtin name \"re\" has been overriden by datar.\n",
"[2022-03-06 00:17:06][datar][WARNING] Builtin name \"filter\" has been overriden by datar.\n",
"[2022-03-06 00:17:06][datar][WARNING] Builtin name \"slice\" has been overriden by datar.\n"
]
},
{
"data": {
"text/html": [
"<div style=\"text-align: right; text-style: italic\">Try this notebook on <a target=\"_blank\" href=\"https://mybinder.org/v2/gh/pwwang/datar/93d069f3ca36711fc811c61dcf60e9fc3d1460a5?filepath=docs%2Fnotebooks%2Facross.ipynb\">binder</a>.</div>"
"<div style=\"text-align: right; text-style: italic\">Try this notebook on <a target=\"_blank\" href=\"https://mybinder.org/v2/gh/pwwang/datar/dev?filepath=docs%2Fnotebooks%2Facross.ipynb\">binder</a>.</div>"
],
"text/plain": [
"<IPython.core.display.HTML object>"
Expand All @@ -43,7 +27,7 @@
{
"data": {
"text/markdown": [
"### # across "
"### <div style=\"background-color: #EEE; padding: 5px 0 8px 0\">★ across</div>"
],
"text/plain": [
"<IPython.core.display.Markdown object>"
Expand Down Expand Up @@ -94,7 +78,7 @@
{
"data": {
"text/markdown": [
"### # if_any "
"### <div style=\"background-color: #EEE; padding: 5px 0 8px 0\">★ if_any</div>"
],
"text/plain": [
"<IPython.core.display.Markdown object>"
Expand Down Expand Up @@ -122,7 +106,7 @@
{
"data": {
"text/markdown": [
"### # if_all "
"### <div style=\"background-color: #EEE; padding: 5px 0 8px 0\">★ if_all</div>"
],
"text/plain": [
"<IPython.core.display.Markdown object>"
Expand Down Expand Up @@ -150,7 +134,7 @@
{
"data": {
"text/markdown": [
"### # c_across "
"### <div style=\"background-color: #EEE; padding: 5px 0 8px 0\">★ c_across</div>"
],
"text/plain": [
"<IPython.core.display.Markdown object>"
Expand All @@ -169,7 +153,7 @@
"&emsp;&emsp;`_cols`: The columns \n",
"\n",
"##### Returns:\n",
"&emsp;&emsp;A series \n"
"&emsp;&emsp;A rowwise tibble \n"
],
"text/plain": [
"<IPython.core.display.Markdown object>"
Expand All @@ -180,10 +164,11 @@
}
],
"source": [
"%run nb_helpers.py\n",
"\n",
"from datar.datasets import iris\n",
"from datar.all import *\n",
"\n",
"%run nb_helpers.py\n",
"nb_header(across, if_any, if_all, c_across)"
]
},
Expand Down
4 changes: 2 additions & 2 deletions docs/notebooks/add_column.ipynb
Original file line number Diff line number Diff line change
Expand Up @@ -15,7 +15,7 @@
{
"data": {
"text/html": [
"<div style=\"text-align: right; text-style: italic\">Try this notebook on <a target=\"_blank\" href=\"https://mybinder.org/v2/gh/pwwang/datar/93d069f3ca36711fc811c61dcf60e9fc3d1460a5?filepath=docs%2Fnotebooks%2Fadd_column.ipynb\">binder</a>.</div>"
"<div style=\"text-align: right; text-style: italic\">Try this notebook on <a target=\"_blank\" href=\"https://mybinder.org/v2/gh/pwwang/datar/dev?filepath=docs%2Fnotebooks%2Fadd_column.ipynb\">binder</a>.</div>"
],
"text/plain": [
"<IPython.core.display.HTML object>"
Expand All @@ -27,7 +27,7 @@
{
"data": {
"text/markdown": [
"### # add_column "
"### <div style=\"background-color: #EEE; padding: 5px 0 8px 0\">★ add_column</div>"
],
"text/plain": [
"<IPython.core.display.Markdown object>"
Expand Down