pwwang · pwwang · Mar 23, 2022 · Mar 17, 2022 · Mar 18, 2022 · Mar 18, 2022
diff --git a/.gitignore b/.gitignore
@@ -101,7 +101,7 @@ export/
 site/
 
 # poetry
-poetry.lock
+# poetry.lock
 
 # backup files
 *.bak

diff --git a/README.md b/README.md
@@ -1,6 +1,6 @@
 # datar
 
-Port of [dplyr][2] and other related R packages in python, using [pipda][3].
+A Grammar of Data Manipulation in Python
 
 <!-- badges -->
 [![Pypi][6]][7] [![Github][8]][9] ![Building][10] [![Docs and API][11]][5] [![Codacy][12]][13] [![Codacy coverage][14]][13]
@@ -9,18 +9,20 @@ Port of [dplyr][2] and other related R packages in python, using [pipda][3].
 
 <img width="30%" style="margin: 10px 10px 10px 30px" align="right" src="logo.png">
 
-Unlike other similar packages in python that just mimic the piping syntax, `datar` follows the API designs from the original packages as much as possible, and is tested thoroughly with the cases from the original packages. So that minimal effort is needed for those who are familar with those R packages to transition to python.
+`datar` is a re-imagining of the APIs of data manipulation libraries in Python (currently only `pandas` is supported), so that you can manipulate your data as you would with `dplyr` in `R`.
 
+`datar` is an in-depth port of `tidyverse` packages, such as `dplyr`, `tidyr`, `forcats` and `tibble`, as well as some functions from `R` itself.
 
 ## Installtion
 
 ```shell
 pip install -U datar
-# to make sure dependencies to be up-to-date
-# pip install -U varname pipda datar
 ```
-
-`datar` requires python 3.7.1+ and is backended by `pandas (1.3+)`.
+or
+```shell
+conda install -c conda-forge datar
+# mamba install -c conda-forge datar
+```
 
 ## Example usage
 
@@ -103,6 +105,46 @@ iris >> pull(f.Sepal_Length) >> dist_plot()
 
 ![example](./example2.png)
 
+See also some advanced examples from my answers on StackOverflow:
+
+- [Compare 2 DataFrames and drop rows that do not contain corresponding ID variables](https://stackoverflow.com/a/71532167/5088165)
+- [count by id with dynamic criteria](https://stackoverflow.com/a/71519157/5088165)
+- [counting the frequency in python size vs count](https://stackoverflow.com/a/71516503/5088165)
+- [Pandas equivalent of R/dplyr group_by summarise concatenation](https://stackoverflow.com/a/71490832/5088165)
+- [ntiles over columns in python using R's "mutate(across(cols = ..."](https://stackoverflow.com/a/71490501/5088165)
+- [Replicate R Solution in Python for Calculating Monthly CRR](https://stackoverflow.com/a/71490194/5088165)
+- [Best/Concise Way to Conditionally Concat two Columns in Pandas DataFrame](https://stackoverflow.com/a/71443587/5088165)
+- [how to transform R dataframe to rows of indicator values](https://stackoverflow.com/a/71443515/5088165)
+- [Left join on multiple columns](https://stackoverflow.com/a/71443441/5088165)
+- [Python: change column of strings with None to 0/1](https://stackoverflow.com/a/71429016/5088165)
+- [Comparing 2 data frames and finding values are not in 2nd data frame](https://stackoverflow.com/a/71415818/5088165)
+- [How to compare two Pandas DataFrames based on specific columns in Python?](https://stackoverflow.com/a/71413499/5088165)
+- [expand.grid equivalent to get pandas data frame for prediction in Python](https://stackoverflow.com/a/71376414/5088165)
+- [Python pandas equivalent to R's group_by, mutate, and ifelse](https://stackoverflow.com/a/70387267/5088165)
+- [How to convert a list of dictionaries to a Pandas Dataframe with one of the values as column name?](https://stackoverflow.com/a/69094005/5088165)
+- [Moving window on a Standard Deviation & Mean calculation](https://stackoverflow.com/a/69093067/5088165)
+- [Python: creating new "interpolated" rows based on a specific field in Pandas](https://stackoverflow.com/a/69092696/5088165)
+- [How would I extend a Pandas DataFrame such as this?](https://stackoverflow.com/a/69092067/5088165)
+- [How to define new variable based on multiple conditions in Pandas - dplyr case_when equivalent](https://stackoverflow.com/a/69080870/5088165)
+- [What is the Pandas equivalent of top_n() in dplyr?](https://stackoverflow.com/a/69080806/5088165)
+- [Equivalent of fct_lump in pandas](https://stackoverflow.com/a/69080727/5088165)
+- [pandas equivalent of fct_reorder](https://stackoverflow.com/a/69080638/5088165)
+- [Is there a way to find out the 2 X 2 contingency table consisting of the count of values by applying a condition from two dataframe](https://stackoverflow.com/a/68674345/5088165)
+- [Count if array in pandas](https://stackoverflow.com/a/68659334/5088165)
+- [How to create a new column for transposed data](https://stackoverflow.com/a/68642891/5088165)
+- [How to create new DataFrame based on conditions from another DataFrame](https://stackoverflow.com/a/68640494/5088165)
+- [Refer to column of a data frame that is being defined](https://stackoverflow.com/a/68308077/5088165)
+- [How to use regex in mutate dplython to add new column](https://stackoverflow.com/a/68308033/5088165)
+- [Multiplying a row by the previous row (with a certain name) in Pandas](https://stackoverflow.com/a/68137136/5088165)
+- [Create dataframe from rows under a row with a certain condition](https://stackoverflow.com/a/68137089/5088165)
+- [pandas data frame, group by multiple cols and put other columns' contents in one](https://stackoverflow.com/a/68136982/5088165)
+- [Pandas custom aggregate function with condition on group, is it possible?](https://stackoverflow.com/a/68136704/5088165)
+- [multiply different values to pandas column with combination of other columns](https://stackoverflow.com/a/68136300/5088165)
+- [Vectorized column-wise regex matching in pandas](https://stackoverflow.com/a/68124082/5088165)
+- [Iterate through and conditionally append string values in a Pandas dataframe](https://stackoverflow.com/a/68123912/5088165)
+- [Groupby mutate equivalent in pandas/python using tidydata principles](https://stackoverflow.com/a/68123753/5088165)
+- [More ...](https://stackoverflow.com/search?q=user%3A5088165+and+%5Bpandas%5D)
+
 
 [1]: https://tidyr.tidyverse.org/index.html
 [2]: https://dplyr.tidyverse.org/index.html

diff --git a/datar/__init__.py b/datar/__init__.py
@@ -30,7 +30,7 @@
 )
 
 __all__ = ("f", "get_versions")
-__version__ = "0.6.3"
+__version__ = "0.6.4"
 
 
 def get_versions(prnt: bool = True) -> _VersionsTuple:

diff --git a/datar/base/constants.py b/datar/base/constants.py
@@ -7,8 +7,8 @@
 
 pi = math.pi
 
-letters = np.array(list(ascii_letters[:26]), dtype=object)
-LETTERS = np.array(list(ascii_letters[26:]), dtype=object)
+letters = np.array(list(ascii_letters[:26]), dtype='<U1')
+LETTERS = np.array(list(ascii_letters[26:]), dtype='<U1')
 
 month_abb = np.array(
     [
@@ -25,7 +25,7 @@
         "Nov",
         "Dec",
     ],
-    dtype=object,
+    dtype='<U3',
 )
 month_name = np.array(
     [
@@ -42,5 +42,5 @@
         "November",
         "December",
     ],
-    dtype=object,
+    dtype='<U9',
 )
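
Review note on the dtype change in `datar/base/constants.py`: a fixed-width unicode dtype such as `'<U1'` stores exactly one character per element, which suits `letters`/`LETTERS` but not longer strings. The sketch below uses only plain NumPy calls; the variable names are illustrative and not part of `datar`.

```python
import numpy as np

# '<U1' is a fixed-width unicode dtype: one character per element.
single = np.array(list("abc"), dtype="<U1")

# Strings longer than the declared width are silently truncated.
truncated = np.array(["Jan", "Feb"], dtype="<U1")

# Passing `str` lets NumPy infer the width from the longest element.
inferred = np.array(["January", "September"], dtype=str)

print(single.tolist())     # ['a', 'b', 'c']
print(truncated.tolist())  # ['J', 'F']
print(inferred.tolist())   # ['January', 'September']
```

A width of at least 3 is therefore needed for the month abbreviations and at least 9 for the full month names ("September").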
diff --git a/datar/base/seq.py b/datar/base/seq.py
@@ -285,7 +285,7 @@ def c(*elems):
         lambda row: Collection(*row),
         axis=1,
     )
-    if isinstance(out, DataFrame):
+    if isinstance(out, DataFrame):  # pragma: no cover
         # pandas < 1.3.2
         out = Series(out.values.tolist(), index=out.index, dtype=object)
 

diff --git a/datar/base/string.py b/datar/base/string.py
@@ -15,7 +15,6 @@
 from ..core.factory import func_factory, dispatching
 from ..core.utils import (
     arg_match,
-    ensure_nparray,
     logger,
     regcall,
 )
@@ -24,31 +23,6 @@
 from .logical import as_logical
 
 
-def _recycle_value(value, size, name=None):
-    """Recycle a value based on a dataframe
-    Args:
-        value: The value to be recycled
-        size: The size to recycle to
-    Returns:
-        The recycled value
-    """
-    name = name or "value"
-    value = ensure_nparray(value)
-
-    if value.size > 0 and size % value.size != 0:
-        raise ValueError(
-            f"Cannot recycle {name} (size={value.size}) to size {size}."
-        )
-
-    if value.size == size == 0:
-        return np.array([], dtype=object)
-
-    if value.size == 0:
-        value = np.array([np.nan], dtype=object)
-
-    return value.repeat(size // value.size)
-
-
 @register_func(None, context=Context.EVAL)
 def as_character(
     x,

diff --git a/datar/core/broadcast.py b/datar/core/broadcast.py
@@ -525,7 +525,10 @@ def _(
         if isinstance(value, DataFrame) and value.index.size == 0:
             value.index = index
 
-        if not value.index.equals(index):
+        # Tolerate a reordered index: only fail when the label sets differ
+        if not value.index.equals(index) and frozenset(
+            value.index
+        ) != frozenset(index):
             raise ValueError("Value has incompatible index.")
 
         if isinstance(value, Series):
@@ -716,6 +719,7 @@ def _(value: SeriesGroupBy, name: str) -> Tibble:
 @init_tibble_from.register(DataFrameGroupBy)
 def _(value: Union[DataFrame, DataFrameGroupBy], name: str) -> Tibble:
     from ..tibble import as_tibble
+
     result = regcall(as_tibble, value)
 
     if name:
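
Review note on the relaxed index check in `datar/core/broadcast.py`: the new condition raises only when the two indexes differ both in order and as sets of labels. A minimal pure-Python sketch of that predicate follows, with plain lists standing in for pandas indexes; `index_compatible` is a hypothetical helper name, not a `datar` function.

```python
def index_compatible(value_index, target_index):
    """Return True when the relaxed check in the hunk would NOT raise.

    Plain lists stand in for pandas Index objects here (an assumption
    made to keep the sketch dependency-free).
    """
    value_labels = list(value_index)
    target_labels = list(target_index)
    # Same labels in the same order: the strict `equals` case.
    if value_labels == target_labels:
        return True
    # Otherwise accept any ordering of the same set of labels.
    return frozenset(value_labels) == frozenset(target_labels)


same_order = index_compatible([0, 1, 2], [0, 1, 2])      # True
reordered = index_compatible([0, 1, 2], [2, 0, 1])       # True
different = index_compatible([0, 1, 2], [0, 1, 3])       # False
# Caveat: sets collapse duplicates, so these also count as compatible.
duplicates = index_compatible([0, 0, 1], [0, 1, 1])      # True
```

The last case suggests the set comparison may be looser than intended when an index contains repeated labels.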

diff --git a/datar/dplyr/across.py b/datar/dplyr/across.py
@@ -165,6 +165,24 @@ def across(
     The original API:
     https://dplyr.tidyverse.org/reference/across.html
 
+    Examples:
+        #
+        >>> iris >> mutate(across(c(f.Sepal_Length, f.Sepal_Width), round))
+            Sepal_Length  Sepal_Width  Petal_Length  Petal_Width    Species
+               <float64>    <float64>     <float64>    <float64>   <object>
+        0            5.0          4.0           1.4          0.2     setosa
+        1            5.0          3.0           1.4          0.2     setosa
+        ..           ...          ...           ...          ...        ...
+
+        >>> iris >> group_by(f.Species) >> summarise(
+        ...     across(starts_with("Sepal"), mean)
+        ... )
+              Species  Sepal_Length  Sepal_Width
+             <object>     <float64>    <float64>
+        0      setosa         5.006        3.428
+        1  versicolor         5.936        2.770
+        2   virginica         6.588        2.974
+
     Args:
         _data: The dataframe.
         *args: If given, the first 2 elements should be columns and functions
@@ -218,7 +236,7 @@ def c_across(
         _cols: The columns
 
     Returns:
-        A series
+        A rowwise tibble
     """
     _data = _context.meta.get("input_data", _data)
 

diff --git a/datar/dplyr/lead_lag.py b/datar/dplyr/lead_lag.py
@@ -28,12 +28,12 @@ def _shift(x, n, default=None, order_by=None):
         newx = Series(x)
 
     if order_by is not None:
-        newx = newx.reset_index(drop=True)
+        # newx = newx.reset_index(drop=True)
         out = with_order(order_by, Series.shift, newx, n, fill_value=default)
     else:
         out = newx.shift(n, fill_value=default)
 
-    return out
+    return out if isinstance(x, Series) else out.values
 
 
 @register_func(None, context=Context.EVAL)
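
Review note on `order_by` in `datar/dplyr/lead_lag.py`: the intended semantics are to shift the values along the ordering given by `order_by` and then return them in their original positions. The following is a pure-Python sketch of that idea, assuming `n >= 0`; `lag_by` is a hypothetical helper, and the actual code delegates to `with_order` and `Series.shift`.

```python
def lag_by(values, order_by, n=1, default=None):
    """Lag `values` by `n` positions along the order given by `order_by`.

    Illustrative only: assumes n >= 0 and equal-length inputs.
    """
    size = len(values)
    # Original positions, listed in the order defined by order_by.
    order = sorted(range(size), key=lambda i: order_by[i])
    ordered = [values[i] for i in order]
    # Ordinary lag on the ordered values.
    shifted = [default] * min(n, size) + ordered[: max(size - n, 0)]
    # Write each shifted value back to its original position.
    out = [default] * size
    for pos, i in enumerate(order):
        out[i] = shifted[pos]
    return out


# Rows are scrambled with respect to year.
years = [2000, 2002, 2001]
vals = [10, 30, 20]
result = lag_by(vals, order_by=years)
print(result)  # [None, 20, 10]
```

Each row receives the value of the preceding year rather than the value of the preceding row, which is what the `order_by` argument is meant to provide.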

diff --git a/docs/CHANGELOG.md b/docs/CHANGELOG.md
@@ -1,3 +1,27 @@
+## 0.6.4
+
+### Breaking changes
+
+- 🩹 Make `base.ntile()` labels 1-based (#92)
+
+### Fixes
+
+- 🐛 Fix `order_by` argument for `dplyr.lead/lag()`
+
+### Enhancements
+
+- 🚑 Allow `base.paste/paste0()` to work with grouped data
+- 🩹 Change dtypes of `base.letters/LETTERS/month_abb/month_name`
+
+### Housekeeping
+
+- 📝 Update and fix reference maps
+- 📝 Add `environment.yml` for binder to work
+- 📝 Update styles for docs
+- 📝 Update styles for API doc in notebooks
+- 📝 Update README with a new project description and add examples from StackOverflow
+
+
 ## 0.6.3
 
 - ✨ Allow `base.c()` to handle groupby data

diff --git a/docs/notebooks/across.ipynb b/docs/notebooks/across.ipynb
@@ -2,7 +2,7 @@
  "cells": [
   {
    "cell_type": "code",
-   "execution_count": 1,
+   "execution_count": 5,
    "metadata": {
     "execution": {
      "iopub.execute_input": "2021-07-16T22:27:57.831736Z",
@@ -12,26 +12,10 @@
     }
    },
    "outputs": [
-    {
-     "name": "stderr",
-     "output_type": "stream",
-     "text": [
-      "[2022-03-06 00:17:06][datar][WARNING] Builtin name \"min\" has been overriden by datar.\n",
-      "[2022-03-06 00:17:06][datar][WARNING] Builtin name \"max\" has been overriden by datar.\n",
-      "[2022-03-06 00:17:06][datar][WARNING] Builtin name \"sum\" has been overriden by datar.\n",
-      "[2022-03-06 00:17:06][datar][WARNING] Builtin name \"abs\" has been overriden by datar.\n",
-      "[2022-03-06 00:17:06][datar][WARNING] Builtin name \"round\" has been overriden by datar.\n",
-      "[2022-03-06 00:17:06][datar][WARNING] Builtin name \"all\" has been overriden by datar.\n",
-      "[2022-03-06 00:17:06][datar][WARNING] Builtin name \"any\" has been overriden by datar.\n",
-      "[2022-03-06 00:17:06][datar][WARNING] Builtin name \"re\" has been overriden by datar.\n",
-      "[2022-03-06 00:17:06][datar][WARNING] Builtin name \"filter\" has been overriden by datar.\n",
-      "[2022-03-06 00:17:06][datar][WARNING] Builtin name \"slice\" has been overriden by datar.\n"
-     ]
-    },
     {
      "data": {
       "text/html": [
-       "<div style=\"text-align: right; text-style: italic\">Try this notebook on <a target=\"_blank\" href=\"https://mybinder.org/v2/gh/pwwang/datar/93d069f3ca36711fc811c61dcf60e9fc3d1460a5?filepath=docs%2Fnotebooks%2Facross.ipynb\">binder</a>.</div>"
+       "<div style=\"text-align: right; text-style: italic\">Try this notebook on <a target=\"_blank\" href=\"https://mybinder.org/v2/gh/pwwang/datar/dev?filepath=docs%2Fnotebooks%2Facross.ipynb\">binder</a>.</div>"
       ],
       "text/plain": [
        "<IPython.core.display.HTML object>"
@@ -43,7 +27,7 @@
     {
      "data": {
       "text/markdown": [
-       "### # across  "
+       "### <div style=\"background-color: #EEE; padding: 5px 0 8px 0\">★ across</div>"
       ],
       "text/plain": [
        "<IPython.core.display.Markdown object>"
@@ -94,7 +78,7 @@
     {
      "data": {
       "text/markdown": [
-       "### # if_any  "
+       "### <div style=\"background-color: #EEE; padding: 5px 0 8px 0\">★ if_any</div>"
       ],
       "text/plain": [
        "<IPython.core.display.Markdown object>"
@@ -122,7 +106,7 @@
     {
      "data": {
       "text/markdown": [
-       "### # if_all  "
+       "### <div style=\"background-color: #EEE; padding: 5px 0 8px 0\">★ if_all</div>"
       ],
       "text/plain": [
        "<IPython.core.display.Markdown object>"
@@ -150,7 +134,7 @@
     {
      "data": {
       "text/markdown": [
-       "### # c_across  "
+       "### <div style=\"background-color: #EEE; padding: 5px 0 8px 0\">★ c_across</div>"
       ],
       "text/plain": [
        "<IPython.core.display.Markdown object>"
@@ -169,7 +153,7 @@
        "&emsp;&emsp;`_cols`: The columns  \n",
        "\n",
        "##### Returns:\n",
-       "&emsp;&emsp;A series  \n"
+       "&emsp;&emsp;A rowwise tibble  \n"
       ],
       "text/plain": [
        "<IPython.core.display.Markdown object>"
@@ -180,10 +164,11 @@
     }
    ],
    "source": [
+    "%run nb_helpers.py\n",
+    "\n",
     "from datar.datasets import iris\n",
     "from datar.all import *\n",
     "\n",
-    "%run nb_helpers.py\n",
     "nb_header(across, if_any, if_all, c_across)"
    ]
   },

diff --git a/docs/notebooks/add_column.ipynb b/docs/notebooks/add_column.ipynb
@@ -15,7 +15,7 @@
     {
      "data": {
       "text/html": [
-       "<div style=\"text-align: right; text-style: italic\">Try this notebook on <a target=\"_blank\" href=\"https://mybinder.org/v2/gh/pwwang/datar/93d069f3ca36711fc811c61dcf60e9fc3d1460a5?filepath=docs%2Fnotebooks%2Fadd_column.ipynb\">binder</a>.</div>"
+       "<div style=\"text-align: right; text-style: italic\">Try this notebook on <a target=\"_blank\" href=\"https://mybinder.org/v2/gh/pwwang/datar/dev?filepath=docs%2Fnotebooks%2Fadd_column.ipynb\">binder</a>.</div>"
       ],
       "text/plain": [
        "<IPython.core.display.HTML object>"
@@ -27,7 +27,7 @@
     {
      "data": {
       "text/markdown": [
-       "### # add_column  "
+       "### <div style=\"background-color: #EEE; padding: 5px 0 8px 0\">★ add_column</div>"
       ],
       "text/plain": [
        "<IPython.core.display.Markdown object>"