diff --git a/Makefile b/Makefile
index c3e70b4d8..e6a98be58 100644
--- a/Makefile
+++ b/Makefile
@@ -52,7 +52,7 @@ build: ## Compile polars R package and generate Rd files
 all: fmt build test README.md ## build -> test -> Update README.md
 
 .PHONY: docs
-docs: build README.md ## Generate docs
+docs: build README.md docs/docs/reference_home.md ## Generate docs
 	cp docs/mkdocs.orig.yml docs/mkdocs.yml
 	Rscript -e 'altdoc::update_docs(custom_reference = "docs/make-docs.R")'
 	cd docs && ../$(VENV_BIN)/python3 -m mkdocs build
@@ -64,6 +64,9 @@ docs-preview: ## Preview docs on local server. Needs `make docs`
 README.md: README.Rmd build ## Update README.md
 	Rscript -e 'devtools::load_all(); rmarkdown::render("README.Rmd")'
 
+docs/docs/reference_home.md: docs/docs/reference_home.Rmd build ## Update the reference home page source
+	Rscript -e 'devtools::load_all(); rmarkdown::render("docs/docs/reference_home.Rmd")'
+
 .PHONY: test
 test: build ## Run fast unittests
 	Rscript -e 'devtools::load_all(); devtools::test()'
diff --git a/docs/.gitignore b/docs/.gitignore
index 67c947011..b316c02ab 100644
--- a/docs/.gitignore
+++ b/docs/.gitignore
@@ -4,5 +4,5 @@ mkdocs.yml
 !docs/stylesheets/extra.css
 !docs/polars-logo.png
-!docs/reference_home.md
+!docs/reference_home.Rmd
 !docs/about.md
diff --git a/docs/docs/reference_home.md b/docs/docs/reference_home.Rmd
similarity index 60%
rename from docs/docs/reference_home.md
rename to docs/docs/reference_home.Rmd
index efb1c320c..2d95e9d54 100644
--- a/docs/docs/reference_home.md
+++ b/docs/docs/reference_home.Rmd
@@ -1,17 +1,29 @@
+---
+output:
+  github_document:
+    html_preview: false
+---
+
+```{r, include = FALSE}
+knitr::opts_chunk$set(
+  collapse = TRUE,
+  comment = "#>"
+)
+```
+
 # Reference
-
+
 `polars` provides a large number of functions for numerous data types and this
 can sometimes be a bit overwhelming.
Overall, you should be able to do anything
-you want with `polars` by specifying the **data structure** you want to use and
+you want with `polars` by specifying the **data structure** you want to use and
 then by applying **expressions** in a particular **context**.
 
-
 ## Data structure
 
 As explained in some vignettes, one of `polars` biggest strengths is the ability
-to choose between eager and lazy evaluation, that require respectively a
+to choose between eager and lazy evaluation, that require respectively a
 `DataFrame` and a `LazyFrame` (with their counterparts `GroupBy` and `LazyGroupBy`
-for grouped data).
+for grouped data).
 
 We can apply functions directly on a `DataFrame` or `LazyFrame`, such as `rename()`
 or `drop()`. Most (but not all!) functions that can be applied to `DataFrame`s
@@ -19,7 +31,7 @@ can also be used on `LazyFrame`s.
 Calling `$lazy()` yields a `LazyFrame`. While calling `$collect()` starts a
 computation and yields a `DataFrame` as result.
 
-Another common data structure is the `Series`, which can be considered as the
+Another common data structure is the `Series`, which can be considered as the
 equivalent of R vectors in `polars`' world. Therefore, a `DataFrame` is a list of
 `Series`.
 
@@ -27,8 +39,6 @@ Operations on `DataFrame` or `LazyFrame` are useful, but many more operations
 can be applied on columns themselves by using various **expressions** in different
 **contexts**.
 
-
-
 ## Contexts
 
 A context simply is the type of data modification that is done. There are 3 types
@@ -43,58 +53,52 @@ be used in some contexts. For example, in `with_columns()`, you can only apply
 expressions that return either the same number of values or a single value that
 will be duplicated on all rows:
 
-```r
+```{r}
 test = pl$DataFrame(mtcars)
+```
 
+```{r}
 # this works
 test$with_columns(
   pl$col("mpg") + 1
 )
+```
 
+```r
 # this doesn't work because it returns only 2 values, while mtcars has 32 rows.
 test$with_columns(
   pl$col("mpg")$slice(0, 2)
 )
 ```
 
+
 By contrast, in an `agg` context, any number of return values are possible, as
-they are returned in a list, and only the new columns or the grouping columns
+they are returned in a list, and only the new columns or the grouping columns
 are returned.
 
-```r
+```{r}
 test$groupby(pl$col("cyl"))$agg(
   pl$col("mpg"), # varying number of values
   pl$col("mpg")$slice(0, 2)$suffix("_sliced"), # two values
   # aggregated to one value and implicitly unpacks list
   pl$col("mpg")$sum()$suffix("_summed")
 )
-
-shape: (3, 4)
-┌─────┬──────────────────────┬──────────────┬────────────┐
-│ cyl ┆ mpg                  ┆ mpg_sliced   ┆ mpg_summed │
-│ --- ┆ ---                  ┆ ---          ┆ ---        │
-│ f64 ┆ list[f64]            ┆ list[f64]    ┆ f64        │
-╞═════╪══════════════════════╪══════════════╪════════════╡
-│ 4.0 ┆ [22.8, 24.4, … 21.4] ┆ [22.8, 24.4] ┆ 293.3      │
-│ 8.0 ┆ [18.7, 14.3, … 15.0] ┆ [18.7, 14.3] ┆ 211.4      │
-│ 6.0 ┆ [21.0, 21.0, … 19.7] ┆ [21.0, 21.0] ┆ 138.2      │
-└─────┴──────────────────────┴──────────────┴────────────┘
 ```
 
 ## Expressions
 
 `polars` is quite verbose and requires you to be very explicit on the operations
-you want to perform. This can be seen in the way expressions work. All polars
+you want to perform. This can be seen in the way expressions work. All polars
 public functions (excluding methods) are accessed via the namespace handle `pl`.
 
-Two important expressions starters are `pl$col()` (names a column in the context)
+Two important expressions starters are `pl$col()` (names a column in the context)
 and `pl$lit()` (wraps a literal value or vector/series in an Expr). Most other
 expression starters are syntactic sugar derived from thereof, e.g. `pl$sum(_)` is
 actually `pl$col(_)$sum()`.
 
-Expressions can be chained with about 170 expression methods such as `$sum()`
+Expressions can be chained with about 170 expression methods such as `$sum()`
 which aggregates e.g. the column with summing.
 
-```r
+```{r}
 # two examples of starting, chaining and combining expressions
 pl$DataFrame(a = 1:4)$with_columns(
   # take col mpg, slice it, sum it, then cast it
@@ -104,30 +108,19 @@ pl$DataFrame(a = 1:4)$with_columns(
   # similar to above, but with `mul()`-method instead of `*`.
   pl$lit(1:3)$sum()$mul(pl$col("a"))$alias("lit_sum_add_mpg")
 )
-shape: (4, 4)
-┌─────┬──────────────────┬─────────────────┬─────────────────┐
-│ a   ┆ a_slice_sum_cast ┆ lit_sum_add_two ┆ lit_sum_add_mpg │
-│ --- ┆ ---              ┆ ---             ┆ ---             │
-│ i32 ┆ f32              ┆ i32             ┆ i32             │
-╞═════╪══════════════════╪═════════════════╪═════════════════╡
-│ 1   ┆ 3.0              ┆ 12              ┆ 6               │
-│ 2   ┆ 3.0              ┆ 12              ┆ 12              │
-│ 3   ┆ 3.0              ┆ 12              ┆ 18              │
-│ 4   ┆ 3.0              ┆ 12              ┆ 24              │
-└─────┴──────────────────┴─────────────────┴─────────────────┘
 ```
 
-Moreover there are subnamespaces with special methods only applicable for a
+Moreover there are subnamespaces with special methods only applicable for a
 specific type `dt`(datetime), `arr`(list), `str`(strings), `struct`(structs),
 `cat`(categoricals) and `bin`(binary). As a sidenote, there is also an exotic
 subnamespace called `meta` which is rarely used to manipulate the expressions
-themselves. Each subsection in the "Expressions" section lists all operations
+themselves. Each subsection in the "Expressions" section lists all operations
 available for a specific subnamespace.
 
-For a concrete example for `dt`, suppose we have a column containing dates and
+For a concrete example for `dt`, suppose we have a column containing dates and
 that we want to extract the year from these dates:
 
-```r
+```{r}
 # Create the DataFrame
 df = pl$DataFrame(
   date = pl$date_range(
@@ -137,47 +130,23 @@ df = pl$DataFrame(
   )
 )
 df
-
-shape: (4, 1)
-┌─────────────────────┐
-│ date                │
-│ ---                 │
-│ datetime[μs]        │
-╞═════════════════════╡
-│ 2020-01-01 00:00:00 │
-│ 2021-01-01 00:00:00 │
-│ 2022-01-01 00:00:00 │
-│ 2023-01-01 00:00:00 │
-└─────────────────────┘
 ```
 
 The function `year()` only makes sense for date-time data, so the type of input
-that can receive this function is `dt` (for **d**ate-**t**ime):
+that can receive this function is `dt` (for **d**ate-**t**ime):
 
-```r
+```{r}
 df$with_columns(
   pl$col("date")$dt$year()$alias("year")
 )
 ```
 
-Similarly, if we have text data that we want to convert text to uppercase, we
+Similarly, if we have text data that we want to convert text to uppercase, we
 use the `str` prefix before using `to_uppercase()`:
 
-
-```r
+```{r}
 # Create the DataFrame
 df = pl$DataFrame(foo = c("jake", "mary", "john peter"))
 df$select(pl$col("foo")$str$to_uppercase())
-
-shape: (3, 1)
-┌────────────┐
-│ foo        │
-│ ---        │
-│ str        │
-╞════════════╡
-│ JAKE       │
-│ MARY       │
-│ JOHN PETER │
-└────────────┘
 ```