diff --git a/docs/SUMMARY.md b/docs/SUMMARY.md
index 055ba99797a7..b2f1af343a67 100644
--- a/docs/SUMMARY.md
+++ b/docs/SUMMARY.md
@@ -10,6 +10,7 @@
* [Analytics Tools](tutorial/07-Analytics-Tools.ipynb)
* [Geospatial Analysis](tutorial/08-Geospatial-Analysis.ipynb)
* [Ibis for SQL Programmers](ibis-for-sql-programmers.ipynb)
+* [Ibis for pandas Users](ibis-for-pandas-users.ipynb)
* [User Guide](user_guide/)
* [Execution Backends](backends/)
* [How To Guide](how_to/)
diff --git a/docs/ibis-for-pandas-users.ipynb b/docs/ibis-for-pandas-users.ipynb
new file mode 100644
index 000000000000..849232feb286
--- /dev/null
+++ b/docs/ibis-for-pandas-users.ipynb
@@ -0,0 +1,4355 @@
+{
+ "cells": [
+ {
+ "cell_type": "markdown",
+ "id": "641af053-e938-4384-8ccd-6eff5b31833d",
+ "metadata": {
+ "tags": []
+ },
+ "source": [
+ "# Ibis for pandas Users\n",
+ "\n",
+ "Much of the syntax and many of the operations in Ibis are inspired\n",
+ "by the pandas DataFrame, however, the primary domain of Ibis is\n",
+ "SQL so there are some differences in how they operate. \n",
+ "\n",
+ "One primary\n",
+ "difference between Ibis tables and pandas `DataFrame`s are that many\n",
+ "of the pandas `DataFrame` operations do in-place operations, whereas\n",
+ "Ibis table operations always return a new table expression."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 1,
+ "id": "bd23750c-bc6b-46db-a6d3-017f73a1d436",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "import ibis\n",
+ "import pandas as pd"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "a0f6d6ea-66ec-4554-9cd5-d3dbd60d5ca2",
+ "metadata": {},
+ "source": [
+ "**Note that we'll be using Ibis' interactive mode to automatically execute queries at\n",
+ "the end of each cell in this notebook. If you are using similar code in a program,\n",
+ "you will have to add `.execute()` to each operation that you want to evaluate.**"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 2,
+ "id": "83cbc2e5-d2c2-42a2-9a13-fa360cd3a834",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "ibis.options.interactive = True"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "afd136cd-18b5-48a2-8359-c47cde181c7b",
+ "metadata": {},
+ "source": [
+ "We'll be using the pandas backend in Ibis in the examples below. First we'll create a simple `DataFrame`."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 3,
+ "id": "ba8432fc-8423-459a-8b56-c4720db65407",
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/html": [
+ "
\n",
+ "\n",
+ "
\n",
+ " \n",
+ " \n",
+ " | \n",
+ " one | \n",
+ " two | \n",
+ " three | \n",
+ "
\n",
+ " \n",
+ " \n",
+ " \n",
+ " 0 | \n",
+ " a | \n",
+ " 1 | \n",
+ " 2 | \n",
+ "
\n",
+ " \n",
+ " 1 | \n",
+ " b | \n",
+ " 3 | \n",
+ " 4 | \n",
+ "
\n",
+ " \n",
+ "
\n",
+ "
"
+ ],
+ "text/plain": [
+ " one two three\n",
+ "0 a 1 2\n",
+ "1 b 3 4"
+ ]
+ },
+ "execution_count": 3,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "df = pd.DataFrame([\n",
+ " ['a', 1, 2],\n",
+ " ['b', 3, 4]\n",
+ "], columns=['one', 'two', 'three'])\n",
+ "df"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "2a226ff0-f1b4-49d8-b6bd-8a8bf5cae109",
+ "metadata": {},
+ "source": [
+ "Now we can create an Ibis table from the above `DataFrame`."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 4,
+ "id": "62e08de9-ad4b-453a-b862-429d11b0432a",
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/html": [
+ "\n",
+ "\n",
+ "
\n",
+ " \n",
+ " \n",
+ " | \n",
+ " one | \n",
+ " two | \n",
+ " three | \n",
+ "
\n",
+ " \n",
+ " \n",
+ " \n",
+ " 0 | \n",
+ " a | \n",
+ " 1 | \n",
+ " 2 | \n",
+ "
\n",
+ " \n",
+ " 1 | \n",
+ " b | \n",
+ " 3 | \n",
+ " 4 | \n",
+ "
\n",
+ " \n",
+ "
\n",
+ "
"
+ ],
+ "text/plain": [
+ " one two three\n",
+ "0 a 1 2\n",
+ "1 b 3 4"
+ ]
+ },
+ "execution_count": 4,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "t = ibis.pandas.connect({'t': df}).table('t')\n",
+ "t"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "6b928082-2c35-4683-8f6c-4a800d063922",
+ "metadata": {},
+ "source": [
+ "## Data types\n",
+ "\n",
+ "The data types of columns in pandas are accessed using the `dtypes` attribute. This returns\n",
+ "a `Series` object."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 5,
+ "id": "94ddeaf0-efb9-4be8-9a60-af26dd657ff6",
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "one object\n",
+ "two int64\n",
+ "three int64\n",
+ "dtype: object"
+ ]
+ },
+ "execution_count": 5,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "df.dtypes"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "ec2c32a8-fc81-40e5-933a-410407167e80",
+ "metadata": {},
+ "source": [
+ "In Ibis, you use the `schema` method which returns an `ibis.Schema` object."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 6,
+ "id": "11770743-e66f-4001-988d-d5c7a5b40cd6",
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "ibis.Schema {\n",
+ " one string\n",
+ " two int64\n",
+ " three int64\n",
+ "}"
+ ]
+ },
+ "execution_count": 6,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "t.schema()"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "33a9a6a3-6b58-4227-9d6e-9ffcbb5c8ea8",
+ "metadata": {},
+ "source": [
+ "It is possible to convert the schema information to pandas data types using the `to_pandas` method, if needed."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 7,
+ "id": "68da325b-d549-4564-845a-20e3d99b5df7",
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "[('one', dtype('O')), ('two', dtype('int64')), ('three', dtype('int64'))]"
+ ]
+ },
+ "execution_count": 7,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "t.schema().to_pandas()"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "10a8e767-5a09-4bf8-afea-a2598b3d96f5",
+ "metadata": {},
+ "source": [
+ "## Table layout\n",
+ "\n",
+ "In pandas, the layout of the table is contained in the `shape` attribute which contains the number\n",
+ "of rows and number of columns in a tuple. The number of columns in an Ibis table can be gotten \n",
+ "from the length of the schema."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 8,
+ "id": "1d58bd48-1b20-4d06-917d-e10aa5ca5a62",
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "3"
+ ]
+ },
+ "execution_count": 8,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "len(t.schema())"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "be551ea9-a124-46b9-b73d-43743c4030c9",
+ "metadata": {},
+ "source": [
+ "To get the number of rows of a table, you use the `count` method."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 9,
+ "id": "b1c45c64-9c2f-4d74-8c6b-847e6e94bf59",
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "2"
+ ]
+ },
+ "execution_count": 9,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "t.count()"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "6b7753f1-fd33-41f0-bff5-c34f2eb3ffe2",
+ "metadata": {},
+ "source": [
+ "To mimic pandas' behavior, you would use the following code. Note that you need to use the `execute` method\n",
+ "after `count` to evaluate the expression returned by `count`."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 10,
+ "id": "21af7477-0f2a-4c3b-8d00-fcbe4a82bf71",
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "(2, 3)"
+ ]
+ },
+ "execution_count": 10,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "(t.count().execute(), len(t.schema()))"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 11,
+ "id": "6970d109-a4ad-4f5a-9263-6361f71f2b2f",
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "(2, 3)"
+ ]
+ },
+ "execution_count": 11,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "df.shape"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "953aadd7-1e7f-464b-8aab-df76c11a944c",
+ "metadata": {},
+ "source": [
+ "## Subsetting columns\n",
+ "\n",
+ "Selecting columns is very similar to in pandas. In fact, you can use the same syntax."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 12,
+ "id": "77d9bde4-8465-4ec4-b1a6-b69b32ea0398",
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/html": [
+ "\n",
+ "\n",
+ "
\n",
+ " \n",
+ " \n",
+ " | \n",
+ " one | \n",
+ " two | \n",
+ "
\n",
+ " \n",
+ " \n",
+ " \n",
+ " 0 | \n",
+ " a | \n",
+ " 1 | \n",
+ "
\n",
+ " \n",
+ " 1 | \n",
+ " b | \n",
+ " 3 | \n",
+ "
\n",
+ " \n",
+ "
\n",
+ "
"
+ ],
+ "text/plain": [
+ " one two\n",
+ "0 a 1\n",
+ "1 b 3"
+ ]
+ },
+ "execution_count": 12,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "t[['one', 'two']]"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "38f96e98-c2a6-4734-a405-cd3e97224dc4",
+ "metadata": {},
+ "source": [
+ "However, since row-level indexing is not supported in Ibis, the inner list is not necessary."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 13,
+ "id": "9e9770c3-9aae-4aec-b426-84a435e74159",
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/html": [
+ "\n",
+ "\n",
+ "
\n",
+ " \n",
+ " \n",
+ " | \n",
+ " one | \n",
+ " two | \n",
+ "
\n",
+ " \n",
+ " \n",
+ " \n",
+ " 0 | \n",
+ " a | \n",
+ " 1 | \n",
+ "
\n",
+ " \n",
+ " 1 | \n",
+ " b | \n",
+ " 3 | \n",
+ "
\n",
+ " \n",
+ "
\n",
+ "
"
+ ],
+ "text/plain": [
+ " one two\n",
+ "0 a 1\n",
+ "1 b 3"
+ ]
+ },
+ "execution_count": 13,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "t['one', 'two']"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "45740901-33f6-4f7a-b282-073d8d1fb7c1",
+ "metadata": {},
+ "source": [
+ "## Selecting columns\n",
+ "\n",
+ "Selecting columns is done using the same syntax as in pandas `DataFrames`. You can use either \n",
+ "the indexing syntax or attribute syntax."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 14,
+ "id": "7ce3128c-a722-4948-be5d-a2f38873acf5",
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/html": [
+ "\n",
+ "\n",
+ "
\n",
+ " \n",
+ " \n",
+ " | \n",
+ " one | \n",
+ "
\n",
+ " \n",
+ " \n",
+ " \n",
+ " 0 | \n",
+ " a | \n",
+ "
\n",
+ " \n",
+ " 1 | \n",
+ " b | \n",
+ "
\n",
+ " \n",
+ "
\n",
+ "
"
+ ],
+ "text/plain": [
+ "0 a\n",
+ "1 b\n",
+ "Name: one, dtype: object"
+ ]
+ },
+ "execution_count": 14,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "t['one']"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "39c935c7-216f-4086-8ce0-62c5532ef0d5",
+ "metadata": {},
+ "source": [
+ "or:"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 15,
+ "id": "7af612e6-dfee-40a5-a3df-788566fdd4eb",
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/html": [
+ "\n",
+ "\n",
+ "
\n",
+ " \n",
+ " \n",
+ " | \n",
+ " one | \n",
+ "
\n",
+ " \n",
+ " \n",
+ " \n",
+ " 0 | \n",
+ " a | \n",
+ "
\n",
+ " \n",
+ " 1 | \n",
+ " b | \n",
+ "
\n",
+ " \n",
+ "
\n",
+ "
"
+ ],
+ "text/plain": [
+ "0 a\n",
+ "1 b\n",
+ "Name: one, dtype: object"
+ ]
+ },
+ "execution_count": 15,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "t.one"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "2db743ac-0c64-4e15-aa3e-267d93ee303f",
+ "metadata": {},
+ "source": [
+ "## Adding, removing, and modifying columns\n",
+ "\n",
+ "Modifying the columns of an Ibis table is a bit different than doing the same operations in\n",
+ "a pandas `DataFrame`. This is primarily due to the fact that in-place operations are not \n",
+ "supported on Ibis tables. Each time you do a column modification to a table, a new table\n",
+ "expression is returned."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "2307d018-6f25-4c0e-a7e4-de3d5428ed46",
+ "metadata": {},
+ "source": [
+ "### Adding columns\n",
+ "\n",
+ "Adding columns is done through the `mutate` method. "
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 16,
+ "id": "6f1f91bb-ea6c-4d0f-9344-1dae83d8e40e",
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/html": [
+ "\n",
+ "\n",
+ "
\n",
+ " \n",
+ " \n",
+ " | \n",
+ " one | \n",
+ " two | \n",
+ " three | \n",
+ " new_col | \n",
+ "
\n",
+ " \n",
+ " \n",
+ " \n",
+ " 0 | \n",
+ " a | \n",
+ " 1 | \n",
+ " 2 | \n",
+ " 4 | \n",
+ "
\n",
+ " \n",
+ " 1 | \n",
+ " b | \n",
+ " 3 | \n",
+ " 4 | \n",
+ " 8 | \n",
+ "
\n",
+ " \n",
+ "
\n",
+ "
"
+ ],
+ "text/plain": [
+ " one two three new_col\n",
+ "0 a 1 2 4\n",
+ "1 b 3 4 8"
+ ]
+ },
+ "execution_count": 16,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "mutated = t.mutate(new_col=t.three * 2)\n",
+ "mutated"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "9ffd0202-3d64-4b55-98d1-fa231b0e8e58",
+ "metadata": {},
+ "source": [
+ "Notice that the original table object remains unchanged. Only the `mutated` object that was returned\n",
+ "contains the new column."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 17,
+ "id": "84a0722d-2f84-4997-a4b6-ee44bb3dbcac",
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/html": [
+ "\n",
+ "\n",
+ "
\n",
+ " \n",
+ " \n",
+ " | \n",
+ " one | \n",
+ " two | \n",
+ " three | \n",
+ "
\n",
+ " \n",
+ " \n",
+ " \n",
+ " 0 | \n",
+ " a | \n",
+ " 1 | \n",
+ " 2 | \n",
+ "
\n",
+ " \n",
+ " 1 | \n",
+ " b | \n",
+ " 3 | \n",
+ " 4 | \n",
+ "
\n",
+ " \n",
+ "
\n",
+ "
"
+ ],
+ "text/plain": [
+ " one two three\n",
+ "0 a 1 2\n",
+ "1 b 3 4"
+ ]
+ },
+ "execution_count": 17,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "t"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "6cc7fb9f-6720-454c-87e3-e0aad3baa3a5",
+ "metadata": {},
+ "source": [
+ "It is also possible to create a column in isolation. This is similar to a `Series` in pandas. \n",
+ "In this situation, the name of the column must be added using the `name` method."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 18,
+ "id": "cbb0190d-b9af-4d94-a395-254ef31854fe",
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/html": [
+ "\n",
+ "\n",
+ "
\n",
+ " \n",
+ " \n",
+ " | \n",
+ " three | \n",
+ "
\n",
+ " \n",
+ " \n",
+ " \n",
+ " 0 | \n",
+ " 4 | \n",
+ "
\n",
+ " \n",
+ " 1 | \n",
+ " 8 | \n",
+ "
\n",
+ " \n",
+ "
\n",
+ "
"
+ ],
+ "text/plain": [
+ "0 4\n",
+ "1 8\n",
+ "Name: three, dtype: int64"
+ ]
+ },
+ "execution_count": 18,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "new_col = (t.three * 2).name('new_col')\n",
+ "new_col"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "46c142ba-34a2-4547-bd86-a7425bf1b0de",
+ "metadata": {},
+ "source": [
+ "You can then add this column to the table using a projection."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 19,
+ "id": "28dbf1e2-f6db-48d1-a61c-e174a8be79bf",
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/html": [
+ "\n",
+ "\n",
+ "
\n",
+ " \n",
+ " \n",
+ " | \n",
+ " one | \n",
+ " two | \n",
+ " new_col | \n",
+ "
\n",
+ " \n",
+ " \n",
+ " \n",
+ " 0 | \n",
+ " a | \n",
+ " 1 | \n",
+ " 4 | \n",
+ "
\n",
+ " \n",
+ " 1 | \n",
+ " b | \n",
+ " 3 | \n",
+ " 8 | \n",
+ "
\n",
+ " \n",
+ "
\n",
+ "
"
+ ],
+ "text/plain": [
+ " one two new_col\n",
+ "0 a 1 4\n",
+ "1 b 3 8"
+ ]
+ },
+ "execution_count": 19,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "proj = t['one', 'two', new_col]\n",
+ "proj"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "a047475c-6a4f-4b77-b11a-b17790383ad6",
+ "metadata": {},
+ "source": [
+ "### Removing columns\n",
+ "\n",
+ "Removing a column is done using the `drop` method."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 20,
+ "id": "7abe3596-8eb7-480e-86f9-7303953c02ae",
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "['one', 'two', 'three']"
+ ]
+ },
+ "execution_count": 20,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "t.columns"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 21,
+ "id": "e9ea485e-f433-4487-8c32-dac86fa19db2",
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "['one', 'three']"
+ ]
+ },
+ "execution_count": 21,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "subset = t.drop('two')\n",
+ "subset.columns"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "8f69886e-b250-4719-945a-46099504503f",
+ "metadata": {},
+ "source": [
+ "Multiple column names can also be given."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 22,
+ "id": "982ce8a1-3b2d-4c12-ba1b-4b02bcaca629",
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "['three']"
+ ]
+ },
+ "execution_count": 22,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "subset = t.drop(['one', 'two'])\n",
+ "subset.columns"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "0240415a-928a-4110-9cc9-3b3533c6de2b",
+ "metadata": {},
+ "source": [
+ "It is also possible to drop columns by selecting the columns you want to remain."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 23,
+ "id": "b7d2b972-aa19-4647-8fb5-511062cfc8fd",
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "['two', 'three']"
+ ]
+ },
+ "execution_count": 23,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "subset = t['two', 'three']\n",
+ "subset.columns"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "cc6793c5-d2c7-4960-af3c-90c3829b76e8",
+ "metadata": {},
+ "source": [
+ "### Modifying columns\n",
+ "\n",
+ "Replacing existing columns is done using the `mutate` method just like adding columns. You simply\n",
+ "add a column of the same name to replace it."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 24,
+ "id": "5ba58de0-78a6-4c59-ab72-6a55355ef2cb",
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/html": [
+ "\n",
+ "\n",
+ "
\n",
+ " \n",
+ " \n",
+ " | \n",
+ " one | \n",
+ " two | \n",
+ " three | \n",
+ "
\n",
+ " \n",
+ " \n",
+ " \n",
+ " 0 | \n",
+ " a | \n",
+ " 1 | \n",
+ " 2 | \n",
+ "
\n",
+ " \n",
+ " 1 | \n",
+ " b | \n",
+ " 3 | \n",
+ " 4 | \n",
+ "
\n",
+ " \n",
+ "
\n",
+ "
"
+ ],
+ "text/plain": [
+ " one two three\n",
+ "0 a 1 2\n",
+ "1 b 3 4"
+ ]
+ },
+ "execution_count": 24,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "t"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 25,
+ "id": "90bdae99-f2d1-4d81-bbe6-bc3de62923a0",
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/html": [
+ "\n",
+ "\n",
+ "
\n",
+ " \n",
+ " \n",
+ " | \n",
+ " one | \n",
+ " two | \n",
+ " three | \n",
+ "
\n",
+ " \n",
+ " \n",
+ " \n",
+ " 0 | \n",
+ " a | \n",
+ " 2 | \n",
+ " 2 | \n",
+ "
\n",
+ " \n",
+ " 1 | \n",
+ " b | \n",
+ " 6 | \n",
+ " 4 | \n",
+ "
\n",
+ " \n",
+ "
\n",
+ "
"
+ ],
+ "text/plain": [
+ " one two three\n",
+ "0 a 2 2\n",
+ "1 b 6 4"
+ ]
+ },
+ "execution_count": 25,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "mutated = t.mutate(two=t.two * 2)\n",
+ "mutated"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "665623b1-d038-4bff-9bd8-5368b36e5f57",
+ "metadata": {},
+ "source": [
+ "### Renaming columns\n",
+ "\n",
+ "In addition to replacing columns, you can simply rename them as well. This is done with the `relabel` method\n",
+ "which takes a dictionary containing the name mappings."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 26,
+ "id": "8d5b4242-fb10-4574-88c6-d341826b8f6c",
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/html": [
+ "\n",
+ "\n",
+ "
\n",
+ " \n",
+ " \n",
+ " | \n",
+ " a | \n",
+ " b | \n",
+ " three | \n",
+ "
\n",
+ " \n",
+ " \n",
+ " \n",
+ " 0 | \n",
+ " a | \n",
+ " 1 | \n",
+ " 2 | \n",
+ "
\n",
+ " \n",
+ " 1 | \n",
+ " b | \n",
+ " 3 | \n",
+ " 4 | \n",
+ "
\n",
+ " \n",
+ "
\n",
+ "
"
+ ],
+ "text/plain": [
+ " a b three\n",
+ "0 a 1 2\n",
+ "1 b 3 4"
+ ]
+ },
+ "execution_count": 26,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "relabeled = t.relabel(dict(\n",
+ " one='a',\n",
+ " two='b',\n",
+ "))\n",
+ "relabeled"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "7db908d3-d18c-4f40-86ec-17e39f09eb1a",
+ "metadata": {},
+ "source": [
+ "## Selecting rows\n",
+ "\n",
+ "There are several methods that can be used to select rows of data in various ways. These are described\n",
+ "in the sections below. We'll use the ubiquitous iris dataset for these examples."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 27,
+ "id": "8ce004ee-7723-4e97-a4c5-dff42693d6a0",
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/html": [
+ "\n",
+ "\n",
+ "
\n",
+ " \n",
+ " \n",
+ " | \n",
+ " sepal_length | \n",
+ " sepal_width | \n",
+ " petal_length | \n",
+ " petal_width | \n",
+ " species | \n",
+ "
\n",
+ " \n",
+ " \n",
+ " \n",
+ " 0 | \n",
+ " 5.1 | \n",
+ " 3.5 | \n",
+ " 1.4 | \n",
+ " 0.2 | \n",
+ " setosa | \n",
+ "
\n",
+ " \n",
+ " 1 | \n",
+ " 4.9 | \n",
+ " 3.0 | \n",
+ " 1.4 | \n",
+ " 0.2 | \n",
+ " setosa | \n",
+ "
\n",
+ " \n",
+ " 2 | \n",
+ " 4.7 | \n",
+ " 3.2 | \n",
+ " 1.3 | \n",
+ " 0.2 | \n",
+ " setosa | \n",
+ "
\n",
+ " \n",
+ " 3 | \n",
+ " 4.6 | \n",
+ " 3.1 | \n",
+ " 1.5 | \n",
+ " 0.2 | \n",
+ " setosa | \n",
+ "
\n",
+ " \n",
+ " 4 | \n",
+ " 5.0 | \n",
+ " 3.6 | \n",
+ " 1.4 | \n",
+ " 0.2 | \n",
+ " setosa | \n",
+ "
\n",
+ " \n",
+ "
\n",
+ "
"
+ ],
+ "text/plain": [
+ " sepal_length sepal_width petal_length petal_width species\n",
+ "0 5.1 3.5 1.4 0.2 setosa\n",
+ "1 4.9 3.0 1.4 0.2 setosa\n",
+ "2 4.7 3.2 1.3 0.2 setosa\n",
+ "3 4.6 3.1 1.5 0.2 setosa\n",
+ "4 5.0 3.6 1.4 0.2 setosa"
+ ]
+ },
+ "execution_count": 27,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "df = pd.read_csv('https://raw.githubusercontent.com/mwaskom/seaborn-data/master/iris.csv')\n",
+ "df.head()"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "f53898f6-28d5-4791-9de5-f8d3e464757b",
+ "metadata": {},
+ "source": [
+ "Create an Ibis table from the `DataFrame` above."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 28,
+ "id": "3dd069d5-c2c6-4241-a648-76dcdf20cc37",
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/html": [
+ "\n",
+ "\n",
+ "
\n",
+ " \n",
+ " \n",
+ " | \n",
+ " sepal_length | \n",
+ " sepal_width | \n",
+ " petal_length | \n",
+ " petal_width | \n",
+ " species | \n",
+ "
\n",
+ " \n",
+ " \n",
+ " \n",
+ " 0 | \n",
+ " 5.1 | \n",
+ " 3.5 | \n",
+ " 1.4 | \n",
+ " 0.2 | \n",
+ " setosa | \n",
+ "
\n",
+ " \n",
+ " 1 | \n",
+ " 4.9 | \n",
+ " 3.0 | \n",
+ " 1.4 | \n",
+ " 0.2 | \n",
+ " setosa | \n",
+ "
\n",
+ " \n",
+ " 2 | \n",
+ " 4.7 | \n",
+ " 3.2 | \n",
+ " 1.3 | \n",
+ " 0.2 | \n",
+ " setosa | \n",
+ "
\n",
+ " \n",
+ " 3 | \n",
+ " 4.6 | \n",
+ " 3.1 | \n",
+ " 1.5 | \n",
+ " 0.2 | \n",
+ " setosa | \n",
+ "
\n",
+ " \n",
+ " 4 | \n",
+ " 5.0 | \n",
+ " 3.6 | \n",
+ " 1.4 | \n",
+ " 0.2 | \n",
+ " setosa | \n",
+ "
\n",
+ " \n",
+ " ... | \n",
+ " ... | \n",
+ " ... | \n",
+ " ... | \n",
+ " ... | \n",
+ " ... | \n",
+ "
\n",
+ " \n",
+ " 145 | \n",
+ " 6.7 | \n",
+ " 3.0 | \n",
+ " 5.2 | \n",
+ " 2.3 | \n",
+ " virginica | \n",
+ "
\n",
+ " \n",
+ " 146 | \n",
+ " 6.3 | \n",
+ " 2.5 | \n",
+ " 5.0 | \n",
+ " 1.9 | \n",
+ " virginica | \n",
+ "
\n",
+ " \n",
+ " 147 | \n",
+ " 6.5 | \n",
+ " 3.0 | \n",
+ " 5.2 | \n",
+ " 2.0 | \n",
+ " virginica | \n",
+ "
\n",
+ " \n",
+ " 148 | \n",
+ " 6.2 | \n",
+ " 3.4 | \n",
+ " 5.4 | \n",
+ " 2.3 | \n",
+ " virginica | \n",
+ "
\n",
+ " \n",
+ " 149 | \n",
+ " 5.9 | \n",
+ " 3.0 | \n",
+ " 5.1 | \n",
+ " 1.8 | \n",
+ " virginica | \n",
+ "
\n",
+ " \n",
+ "
\n",
+ "
150 rows × 5 columns
\n",
+ "
"
+ ],
+ "text/plain": [
+ " sepal_length sepal_width petal_length petal_width species\n",
+ "0 5.1 3.5 1.4 0.2 setosa\n",
+ "1 4.9 3.0 1.4 0.2 setosa\n",
+ "2 4.7 3.2 1.3 0.2 setosa\n",
+ "3 4.6 3.1 1.5 0.2 setosa\n",
+ "4 5.0 3.6 1.4 0.2 setosa\n",
+ ".. ... ... ... ... ...\n",
+ "145 6.7 3.0 5.2 2.3 virginica\n",
+ "146 6.3 2.5 5.0 1.9 virginica\n",
+ "147 6.5 3.0 5.2 2.0 virginica\n",
+ "148 6.2 3.4 5.4 2.3 virginica\n",
+ "149 5.9 3.0 5.1 1.8 virginica\n",
+ "\n",
+ "[150 rows x 5 columns]"
+ ]
+ },
+ "execution_count": 28,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "t = ibis.pandas.connect({'t': df}).table('t')\n",
+ "t"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "889fd311-5a26-4eb0-a933-dae0082fcabc",
+ "metadata": {},
+ "source": [
+ "### Head, tail and limit\n",
+ "\n",
+ "The `head` method works the same ways as in pandas. Note that some Ibis backends may not have an \n",
+ "inherent ordering of their rows and using `head` may not return deterministic results. In those\n",
+ "cases, you can use sorting before calling `head` to ensure a stable result."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 29,
+ "id": "c668f96b-db88-4407-92da-06867024988b",
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/html": [
+ "\n",
+ "\n",
+ "
\n",
+ " \n",
+ " \n",
+ " | \n",
+ " sepal_length | \n",
+ " sepal_width | \n",
+ " petal_length | \n",
+ " petal_width | \n",
+ " species | \n",
+ "
\n",
+ " \n",
+ " \n",
+ " \n",
+ " 0 | \n",
+ " 5.1 | \n",
+ " 3.5 | \n",
+ " 1.4 | \n",
+ " 0.2 | \n",
+ " setosa | \n",
+ "
\n",
+ " \n",
+ " 1 | \n",
+ " 4.9 | \n",
+ " 3.0 | \n",
+ " 1.4 | \n",
+ " 0.2 | \n",
+ " setosa | \n",
+ "
\n",
+ " \n",
+ " 2 | \n",
+ " 4.7 | \n",
+ " 3.2 | \n",
+ " 1.3 | \n",
+ " 0.2 | \n",
+ " setosa | \n",
+ "
\n",
+ " \n",
+ " 3 | \n",
+ " 4.6 | \n",
+ " 3.1 | \n",
+ " 1.5 | \n",
+ " 0.2 | \n",
+ " setosa | \n",
+ "
\n",
+ " \n",
+ " 4 | \n",
+ " 5.0 | \n",
+ " 3.6 | \n",
+ " 1.4 | \n",
+ " 0.2 | \n",
+ " setosa | \n",
+ "
\n",
+ " \n",
+ "
\n",
+ "
"
+ ],
+ "text/plain": [
+ " sepal_length sepal_width petal_length petal_width species\n",
+ "0 5.1 3.5 1.4 0.2 setosa\n",
+ "1 4.9 3.0 1.4 0.2 setosa\n",
+ "2 4.7 3.2 1.3 0.2 setosa\n",
+ "3 4.6 3.1 1.5 0.2 setosa\n",
+ "4 5.0 3.6 1.4 0.2 setosa"
+ ]
+ },
+ "execution_count": 29,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "t.head(5)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "987d4d1c-ae5c-4c31-b989-7712564e29f7",
+ "metadata": {},
+ "source": [
+ "However, the tail method is not implemented since it is not supported in all databases.\n",
+ "It is possible to emulate the `tail` method if you use sorting in your table to do a \n",
+ "reverse sort then use the `head` method to retrieve the \"top\" rows."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 30,
+ "id": "684563af-1db7-464c-9d65-ddeb5e7cc854",
+ "metadata": {},
+ "outputs": [
+ {
+ "ename": "AttributeError",
+ "evalue": "tail",
+ "output_type": "error",
+ "traceback": [
+ "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m",
+ "\u001b[0;31mAttributeError\u001b[0m Traceback (most recent call last)",
+ "\u001b[0;32m/tmp/ipykernel_281650/1485850447.py\u001b[0m in \u001b[0;36m\u001b[0;34m\u001b[0m\n\u001b[0;32m----> 1\u001b[0;31m \u001b[0mt\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mtail\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;36m1\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m",
+ "\u001b[0;32m~/.pyenv/versions/3.9.4/lib/python3.9/site-packages/ibis/expr/types/relations.py\u001b[0m in \u001b[0;36m__getattr__\u001b[0;34m(self, key)\u001b[0m\n\u001b[1;32m 162\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 163\u001b[0m \u001b[0;32mif\u001b[0m \u001b[0mkey\u001b[0m \u001b[0;32mnot\u001b[0m \u001b[0;32min\u001b[0m \u001b[0mschema\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m--> 164\u001b[0;31m \u001b[0;32mraise\u001b[0m \u001b[0mAttributeError\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mkey\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 165\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 166\u001b[0m \u001b[0;32mtry\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n",
+ "\u001b[0;31mAttributeError\u001b[0m: tail"
+ ]
+ }
+ ],
+ "source": [
+ "t.tail(1)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "757db52c-f8cd-47cc-ab67-fa4a68998827",
+ "metadata": {},
+ "source": [
+ "Another way to limit the number of retrieved rows is using the `limit` method. The following will return\n",
+ "the same result as `head(5)`. This is often used in conjunction with other filtering techniques that we\n",
+ "will cover later."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 31,
+ "id": "92455eef-6f15-48b9-b49b-2d3f8b38c6a7",
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/html": [
+ "\n",
+ "\n",
+ "
\n",
+ " \n",
+ " \n",
+ " | \n",
+ " sepal_length | \n",
+ " sepal_width | \n",
+ " petal_length | \n",
+ " petal_width | \n",
+ " species | \n",
+ "
\n",
+ " \n",
+ " \n",
+ " \n",
+ " 0 | \n",
+ " 5.1 | \n",
+ " 3.5 | \n",
+ " 1.4 | \n",
+ " 0.2 | \n",
+ " setosa | \n",
+ "
\n",
+ " \n",
+ " 1 | \n",
+ " 4.9 | \n",
+ " 3.0 | \n",
+ " 1.4 | \n",
+ " 0.2 | \n",
+ " setosa | \n",
+ "
\n",
+ " \n",
+ " 2 | \n",
+ " 4.7 | \n",
+ " 3.2 | \n",
+ " 1.3 | \n",
+ " 0.2 | \n",
+ " setosa | \n",
+ "
\n",
+ " \n",
+ " 3 | \n",
+ " 4.6 | \n",
+ " 3.1 | \n",
+ " 1.5 | \n",
+ " 0.2 | \n",
+ " setosa | \n",
+ "
\n",
+ " \n",
+ " 4 | \n",
+ " 5.0 | \n",
+ " 3.6 | \n",
+ " 1.4 | \n",
+ " 0.2 | \n",
+ " setosa | \n",
+ "
\n",
+ " \n",
+ "
\n",
+ "
"
+ ],
+ "text/plain": [
+ " sepal_length sepal_width petal_length petal_width species\n",
+ "0 5.1 3.5 1.4 0.2 setosa\n",
+ "1 4.9 3.0 1.4 0.2 setosa\n",
+ "2 4.7 3.2 1.3 0.2 setosa\n",
+ "3 4.6 3.1 1.5 0.2 setosa\n",
+ "4 5.0 3.6 1.4 0.2 setosa"
+ ]
+ },
+ "execution_count": 31,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "t.limit(5)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "9ead0537-6b29-4ac9-94b6-9097e3e157eb",
+ "metadata": {},
+ "source": [
+ "The starting position of the returned rows can be specified using the `offset` parameter."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 32,
+ "id": "8b1baecc-a88f-4415-9d64-9bc94cabdc20",
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/html": [
+ "\n",
+ "\n",
+ "
\n",
+ " \n",
+ " \n",
+ " | \n",
+ " sepal_length | \n",
+ " sepal_width | \n",
+ " petal_length | \n",
+ " petal_width | \n",
+ " species | \n",
+ "
\n",
+ " \n",
+ " \n",
+ " \n",
+ " 0 | \n",
+ " 5.0 | \n",
+ " 3.6 | \n",
+ " 1.4 | \n",
+ " 0.2 | \n",
+ " setosa | \n",
+ "
\n",
+ " \n",
+ " 1 | \n",
+ " 5.4 | \n",
+ " 3.9 | \n",
+ " 1.7 | \n",
+ " 0.4 | \n",
+ " setosa | \n",
+ "
\n",
+ " \n",
+ " 2 | \n",
+ " 4.6 | \n",
+ " 3.4 | \n",
+ " 1.4 | \n",
+ " 0.3 | \n",
+ " setosa | \n",
+ "
\n",
+ " \n",
+ " 3 | \n",
+ " 5.0 | \n",
+ " 3.4 | \n",
+ " 1.5 | \n",
+ " 0.2 | \n",
+ " setosa | \n",
+ "
\n",
+ " \n",
+ " 4 | \n",
+ " 4.4 | \n",
+ " 2.9 | \n",
+ " 1.4 | \n",
+ " 0.2 | \n",
+ " setosa | \n",
+ "
\n",
+ " \n",
+ "
\n",
+ "
"
+ ],
+ "text/plain": [
+ " sepal_length sepal_width petal_length petal_width species\n",
+ "0 5.0 3.6 1.4 0.2 setosa\n",
+ "1 5.4 3.9 1.7 0.4 setosa\n",
+ "2 4.6 3.4 1.4 0.3 setosa\n",
+ "3 5.0 3.4 1.5 0.2 setosa\n",
+ "4 4.4 2.9 1.4 0.2 setosa"
+ ]
+ },
+ "execution_count": 32,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "t.limit(5, offset=4)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "e9f5a6b4-1288-4aea-8760-c4f0cdab9564",
+ "metadata": {},
+ "source": [
+ "### Filtering rows\n",
+ "\n",
+ "In addition to simply limiting the number of rows that are returned, it is possible to filter the \n",
+ "rows using expressions. Expressions are constructed very similarly to the way they are in pandas.\n",
+ "Ibis expressions are constructed from operations on colunms in a table which return a boolean result.\n",
+ "This result is then used to filter the table."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 33,
+ "id": "ccf4e5da-06cd-4331-8ccf-897acc4405e2",
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/html": [
+ "\n",
+ "\n",
+ "
\n",
+ " \n",
+ " \n",
+ " | \n",
+ " sepal_width | \n",
+ "
\n",
+ " \n",
+ " \n",
+ " \n",
+ " 0 | \n",
+ " False | \n",
+ "
\n",
+ " \n",
+ " 1 | \n",
+ " False | \n",
+ "
\n",
+ " \n",
+ " 2 | \n",
+ " False | \n",
+ "
\n",
+ " \n",
+ " 3 | \n",
+ " False | \n",
+ "
\n",
+ " \n",
+ " 4 | \n",
+ " False | \n",
+ "
\n",
+ " \n",
+ " ... | \n",
+ " ... | \n",
+ "
\n",
+ " \n",
+ " 145 | \n",
+ " False | \n",
+ "
\n",
+ " \n",
+ " 146 | \n",
+ " False | \n",
+ "
\n",
+ " \n",
+ " 147 | \n",
+ " False | \n",
+ "
\n",
+ " \n",
+ " 148 | \n",
+ " False | \n",
+ "
\n",
+ " \n",
+ " 149 | \n",
+ " False | \n",
+ "
\n",
+ " \n",
+ "
\n",
+ "
150 rows × 1 columns
\n",
+ "
"
+ ],
+ "text/plain": [
+ "0 False\n",
+ "1 False\n",
+ "2 False\n",
+ "3 False\n",
+ "4 False\n",
+ " ... \n",
+ "145 False\n",
+ "146 False\n",
+ "147 False\n",
+ "148 False\n",
+ "149 False\n",
+ "Name: sepal_width, Length: 150, dtype: bool"
+ ]
+ },
+ "execution_count": 33,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "expr = t.sepal_width > 3.8\n",
+ "expr"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "ec4f85db-ecea-43f1-a299-5313c56705b1",
+ "metadata": {},
+ "source": [
+ "We can evaluate the value counts to see how many rows we will expect to get back after filtering."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 34,
+ "id": "180b2d01-84a0-4042-9d2e-07cbdbe5c8b5",
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/html": [
+ "\n",
+ "\n",
+ "
\n",
+ " \n",
+ " \n",
+ " | \n",
+ " unnamed | \n",
+ " count | \n",
+ "
\n",
+ " \n",
+ " \n",
+ " \n",
+ " 0 | \n",
+ " False | \n",
+ " 144 | \n",
+ "
\n",
+ " \n",
+ " 1 | \n",
+ " True | \n",
+ " 6 | \n",
+ "
\n",
+ " \n",
+ "
\n",
+ "
"
+ ],
+ "text/plain": [
+ " unnamed count\n",
+ "0 False 144\n",
+ "1 True 6"
+ ]
+ },
+ "execution_count": 34,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "expr.value_counts()"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "573e9e1a-062b-4e10-81b3-56c6d1ae17e6",
+ "metadata": {},
+ "source": [
+ "Now we apply the filter to the table. Since there are 6 True values in the expression, we should\n",
+ "get 6 rows back."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 35,
+ "id": "63ff3e0d-2775-432a-8ecb-947d2c8cc0a5",
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/html": [
+ "\n",
+ "\n",
+ "
\n",
+ " \n",
+ " \n",
+ " | \n",
+ " sepal_length | \n",
+ " sepal_width | \n",
+ " petal_length | \n",
+ " petal_width | \n",
+ " species | \n",
+ "
\n",
+ " \n",
+ " \n",
+ " \n",
+ " 0 | \n",
+ " 5.4 | \n",
+ " 3.9 | \n",
+ " 1.7 | \n",
+ " 0.4 | \n",
+ " setosa | \n",
+ "
\n",
+ " \n",
+ " 1 | \n",
+ " 5.8 | \n",
+ " 4.0 | \n",
+ " 1.2 | \n",
+ " 0.2 | \n",
+ " setosa | \n",
+ "
\n",
+ " \n",
+ " 2 | \n",
+ " 5.7 | \n",
+ " 4.4 | \n",
+ " 1.5 | \n",
+ " 0.4 | \n",
+ " setosa | \n",
+ "
\n",
+ " \n",
+ " 3 | \n",
+ " 5.4 | \n",
+ " 3.9 | \n",
+ " 1.3 | \n",
+ " 0.4 | \n",
+ " setosa | \n",
+ "
\n",
+ " \n",
+ " 4 | \n",
+ " 5.2 | \n",
+ " 4.1 | \n",
+ " 1.5 | \n",
+ " 0.1 | \n",
+ " setosa | \n",
+ "
\n",
+ " \n",
+ " 5 | \n",
+ " 5.5 | \n",
+ " 4.2 | \n",
+ " 1.4 | \n",
+ " 0.2 | \n",
+ " setosa | \n",
+ "
\n",
+ " \n",
+ "
\n",
+ "
"
+ ],
+ "text/plain": [
+ " sepal_length sepal_width petal_length petal_width species\n",
+ "0 5.4 3.9 1.7 0.4 setosa\n",
+ "1 5.8 4.0 1.2 0.2 setosa\n",
+ "2 5.7 4.4 1.5 0.4 setosa\n",
+ "3 5.4 3.9 1.3 0.4 setosa\n",
+ "4 5.2 4.1 1.5 0.1 setosa\n",
+ "5 5.5 4.2 1.4 0.2 setosa"
+ ]
+ },
+ "execution_count": 35,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "filtered = t[expr]\n",
+ "filtered"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "335f6010-d92f-4484-83d5-eadded3d7527",
+ "metadata": {},
+ "source": [
+ "Of course, the filtering expression can be applied inline as well."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 36,
+ "id": "240ca8c8-b8ba-4826-831e-e2390cb6fbbb",
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/html": [
+ "\n",
+ "\n",
+ "
\n",
+ " \n",
+ " \n",
+ " | \n",
+ " sepal_length | \n",
+ " sepal_width | \n",
+ " petal_length | \n",
+ " petal_width | \n",
+ " species | \n",
+ "
\n",
+ " \n",
+ " \n",
+ " \n",
+ " 0 | \n",
+ " 5.4 | \n",
+ " 3.9 | \n",
+ " 1.7 | \n",
+ " 0.4 | \n",
+ " setosa | \n",
+ "
\n",
+ " \n",
+ " 1 | \n",
+ " 5.8 | \n",
+ " 4.0 | \n",
+ " 1.2 | \n",
+ " 0.2 | \n",
+ " setosa | \n",
+ "
\n",
+ " \n",
+ " 2 | \n",
+ " 5.7 | \n",
+ " 4.4 | \n",
+ " 1.5 | \n",
+ " 0.4 | \n",
+ " setosa | \n",
+ "
\n",
+ " \n",
+ " 3 | \n",
+ " 5.4 | \n",
+ " 3.9 | \n",
+ " 1.3 | \n",
+ " 0.4 | \n",
+ " setosa | \n",
+ "
\n",
+ " \n",
+ " 4 | \n",
+ " 5.2 | \n",
+ " 4.1 | \n",
+ " 1.5 | \n",
+ " 0.1 | \n",
+ " setosa | \n",
+ "
\n",
+ " \n",
+ " 5 | \n",
+ " 5.5 | \n",
+ " 4.2 | \n",
+ " 1.4 | \n",
+ " 0.2 | \n",
+ " setosa | \n",
+ "
\n",
+ " \n",
+ "
\n",
+ "
"
+ ],
+ "text/plain": [
+ " sepal_length sepal_width petal_length petal_width species\n",
+ "0 5.4 3.9 1.7 0.4 setosa\n",
+ "1 5.8 4.0 1.2 0.2 setosa\n",
+ "2 5.7 4.4 1.5 0.4 setosa\n",
+ "3 5.4 3.9 1.3 0.4 setosa\n",
+ "4 5.2 4.1 1.5 0.1 setosa\n",
+ "5 5.5 4.2 1.4 0.2 setosa"
+ ]
+ },
+ "execution_count": 36,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "filtered = t[t.sepal_width > 3.8]\n",
+ "filtered"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "8b475d75-e63b-4a5f-9af1-3eb5858a8ace",
+ "metadata": {},
+ "source": [
+ "Multiple filtering expressions can be combined into a single expression or chained onto existing\n",
+ "table expressions."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 37,
+ "id": "f971ed4e-c0de-4ebc-8be6-60eff433399f",
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/html": [
+ "\n",
+ "\n",
+ "
\n",
+ " \n",
+ " \n",
+ " | \n",
+ " sepal_length | \n",
+ " sepal_width | \n",
+ " petal_length | \n",
+ " petal_width | \n",
+ " species | \n",
+ "
\n",
+ " \n",
+ " \n",
+ " \n",
+ " 0 | \n",
+ " 5.8 | \n",
+ " 4.0 | \n",
+ " 1.2 | \n",
+ " 0.2 | \n",
+ " setosa | \n",
+ "
\n",
+ " \n",
+ " 1 | \n",
+ " 5.7 | \n",
+ " 4.4 | \n",
+ " 1.5 | \n",
+ " 0.4 | \n",
+ " setosa | \n",
+ "
\n",
+ " \n",
+ "
\n",
+ "
"
+ ],
+ "text/plain": [
+ " sepal_length sepal_width petal_length petal_width species\n",
+ "0 5.8 4.0 1.2 0.2 setosa\n",
+ "1 5.7 4.4 1.5 0.4 setosa"
+ ]
+ },
+ "execution_count": 37,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "filtered = t[(t.sepal_width > 3.8) & (t.sepal_length > 5.5)]\n",
+ "filtered"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "63772a0e-8c89-4d9e-ae3b-a916d5b17d30",
+ "metadata": {},
+ "source": [
+ "The code above will return the same rows as the code below."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 38,
+ "id": "827c5756-a770-4106-9fb5-8eb98050aed9",
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/html": [
+ "\n",
+ "\n",
+ "
\n",
+ " \n",
+ " \n",
+ " | \n",
+ " sepal_length | \n",
+ " sepal_width | \n",
+ " petal_length | \n",
+ " petal_width | \n",
+ " species | \n",
+ "
\n",
+ " \n",
+ " \n",
+ " \n",
+ " 0 | \n",
+ " 5.8 | \n",
+ " 4.0 | \n",
+ " 1.2 | \n",
+ " 0.2 | \n",
+ " setosa | \n",
+ "
\n",
+ " \n",
+ " 1 | \n",
+ " 5.7 | \n",
+ " 4.4 | \n",
+ " 1.5 | \n",
+ " 0.4 | \n",
+ " setosa | \n",
+ "
\n",
+ " \n",
+ "
\n",
+ "
"
+ ],
+ "text/plain": [
+ " sepal_length sepal_width petal_length petal_width species\n",
+ "0 5.8 4.0 1.2 0.2 setosa\n",
+ "1 5.7 4.4 1.5 0.4 setosa"
+ ]
+ },
+ "execution_count": 38,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "filtered = t[t.sepal_width > 3.8][t.sepal_length > 5.5]\n",
+ "filtered"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "7673373b-8cc3-4190-9207-fc5606e749af",
+ "metadata": {},
+ "source": [
+ "Aggregation has not been discussed yet, but aggregate values can be used in expressions\n",
+ "to return things such as all of the rows in a data set where the value in a column\n",
+ "is greater than the mean."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 39,
+ "id": "b33a1e2b-e297-46c5-a163-91ae8e62f7de",
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/html": [
+ "\n",
+ "\n",
+ "
\n",
+ " \n",
+ " \n",
+ " | \n",
+ " sepal_length | \n",
+ " sepal_width | \n",
+ " petal_length | \n",
+ " petal_width | \n",
+ " species | \n",
+ "
\n",
+ " \n",
+ " \n",
+ " \n",
+ " 0 | \n",
+ " 5.1 | \n",
+ " 3.5 | \n",
+ " 1.4 | \n",
+ " 0.2 | \n",
+ " setosa | \n",
+ "
\n",
+ " \n",
+ " 1 | \n",
+ " 4.7 | \n",
+ " 3.2 | \n",
+ " 1.3 | \n",
+ " 0.2 | \n",
+ " setosa | \n",
+ "
\n",
+ " \n",
+ " 2 | \n",
+ " 4.6 | \n",
+ " 3.1 | \n",
+ " 1.5 | \n",
+ " 0.2 | \n",
+ " setosa | \n",
+ "
\n",
+ " \n",
+ " 3 | \n",
+ " 5.0 | \n",
+ " 3.6 | \n",
+ " 1.4 | \n",
+ " 0.2 | \n",
+ " setosa | \n",
+ "
\n",
+ " \n",
+ " 4 | \n",
+ " 5.4 | \n",
+ " 3.9 | \n",
+ " 1.7 | \n",
+ " 0.4 | \n",
+ " setosa | \n",
+ "
\n",
+ " \n",
+ " ... | \n",
+ " ... | \n",
+ " ... | \n",
+ " ... | \n",
+ " ... | \n",
+ " ... | \n",
+ "
\n",
+ " \n",
+ " 62 | \n",
+ " 6.7 | \n",
+ " 3.1 | \n",
+ " 5.6 | \n",
+ " 2.4 | \n",
+ " virginica | \n",
+ "
\n",
+ " \n",
+ " 63 | \n",
+ " 6.9 | \n",
+ " 3.1 | \n",
+ " 5.1 | \n",
+ " 2.3 | \n",
+ " virginica | \n",
+ "
\n",
+ " \n",
+ " 64 | \n",
+ " 6.8 | \n",
+ " 3.2 | \n",
+ " 5.9 | \n",
+ " 2.3 | \n",
+ " virginica | \n",
+ "
\n",
+ " \n",
+ " 65 | \n",
+ " 6.7 | \n",
+ " 3.3 | \n",
+ " 5.7 | \n",
+ " 2.5 | \n",
+ " virginica | \n",
+ "
\n",
+ " \n",
+ " 66 | \n",
+ " 6.2 | \n",
+ " 3.4 | \n",
+ " 5.4 | \n",
+ " 2.3 | \n",
+ " virginica | \n",
+ "
\n",
+ " \n",
+ "
\n",
+ "
67 rows × 5 columns
\n",
+ "
"
+ ],
+ "text/plain": [
+ " sepal_length sepal_width petal_length petal_width species\n",
+ "0 5.1 3.5 1.4 0.2 setosa\n",
+ "1 4.7 3.2 1.3 0.2 setosa\n",
+ "2 4.6 3.1 1.5 0.2 setosa\n",
+ "3 5.0 3.6 1.4 0.2 setosa\n",
+ "4 5.4 3.9 1.7 0.4 setosa\n",
+ ".. ... ... ... ... ...\n",
+ "62 6.7 3.1 5.6 2.4 virginica\n",
+ "63 6.9 3.1 5.1 2.3 virginica\n",
+ "64 6.8 3.2 5.9 2.3 virginica\n",
+ "65 6.7 3.3 5.7 2.5 virginica\n",
+ "66 6.2 3.4 5.4 2.3 virginica\n",
+ "\n",
+ "[67 rows x 5 columns]"
+ ]
+ },
+ "execution_count": 39,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "filtered = t[t.sepal_width > t.sepal_width.mean()]\n",
+ "filtered"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "7acc9ac2-3915-4ff4-8685-9b515ff8e882",
+ "metadata": {},
+ "source": [
+ "## Sorting rows\n",
+ "\n",
+ "Sorting rows in Ibis uses a somewhat different API than in pandas. In pandas, you would use the\n",
+ "`sort_values` method to order rows by values in specified columns. Ibis uses a method called\n",
+ "`sort_by`. To specify ascending or descending orders, pandas uses an `ascending=` argument\n",
+ "to `sort_values` that indicates the order for each sorting column. Ibis allows you to tag the\n",
+ "column name in the `sort_by` list as ascending or descending by wrapping it with `ibis.asc` or\n",
+ "`ibis.desc`.\n",
+ "\n",
+ "Here is an example of sorting a `DataFrame` using two sort keys. One key is sorting in ascending\n",
+ "order and the other is in descending order."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 40,
+ "id": "3fca2921-2f93-403e-aabd-ba941f374a02",
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/html": [
+ "\n",
+ "\n",
+ "
\n",
+ " \n",
+ " \n",
+ " | \n",
+ " sepal_length | \n",
+ " sepal_width | \n",
+ " petal_length | \n",
+ " petal_width | \n",
+ " species | \n",
+ "
\n",
+ " \n",
+ " \n",
+ " \n",
+ " 13 | \n",
+ " 4.3 | \n",
+ " 3.0 | \n",
+ " 1.1 | \n",
+ " 0.1 | \n",
+ " setosa | \n",
+ "
\n",
+ " \n",
+ " 42 | \n",
+ " 4.4 | \n",
+ " 3.2 | \n",
+ " 1.3 | \n",
+ " 0.2 | \n",
+ " setosa | \n",
+ "
\n",
+ " \n",
+ " 38 | \n",
+ " 4.4 | \n",
+ " 3.0 | \n",
+ " 1.3 | \n",
+ " 0.2 | \n",
+ " setosa | \n",
+ "
\n",
+ " \n",
+ " 8 | \n",
+ " 4.4 | \n",
+ " 2.9 | \n",
+ " 1.4 | \n",
+ " 0.2 | \n",
+ " setosa | \n",
+ "
\n",
+ " \n",
+ " 41 | \n",
+ " 4.5 | \n",
+ " 2.3 | \n",
+ " 1.3 | \n",
+ " 0.3 | \n",
+ " setosa | \n",
+ "
\n",
+ " \n",
+ "
\n",
+ "
"
+ ],
+ "text/plain": [
+ " sepal_length sepal_width petal_length petal_width species\n",
+ "13 4.3 3.0 1.1 0.1 setosa\n",
+ "42 4.4 3.2 1.3 0.2 setosa\n",
+ "38 4.4 3.0 1.3 0.2 setosa\n",
+ "8 4.4 2.9 1.4 0.2 setosa\n",
+ "41 4.5 2.3 1.3 0.3 setosa"
+ ]
+ },
+ "execution_count": 40,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "df.sort_values(['sepal_length', 'sepal_width'], ascending=[True, False]).head(5)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "da3b0a11-513a-4445-9a14-b05394fd2d2d",
+ "metadata": {},
+ "source": [
+ "The same operation in Ibis would look like the following. Note that the index values of the\n",
+ "resulting `DataFrame` start from zero and count up, whereas in the example above, they retain\n",
+ "their original index value. This is simply due to the fact that rows in tables don't necessarily\n",
+ "have a stable index in database backends, so the index is just generated on the result."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 41,
+ "id": "b984cc38-d235-431f-98bf-5cd3eaa401fd",
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/html": [
+ "\n",
+ "\n",
+ "
\n",
+ " \n",
+ " \n",
+ " | \n",
+ " sepal_length | \n",
+ " sepal_width | \n",
+ " petal_length | \n",
+ " petal_width | \n",
+ " species | \n",
+ "
\n",
+ " \n",
+ " \n",
+ " \n",
+ " 0 | \n",
+ " 4.3 | \n",
+ " 3.0 | \n",
+ " 1.1 | \n",
+ " 0.1 | \n",
+ " setosa | \n",
+ "
\n",
+ " \n",
+ " 1 | \n",
+ " 4.4 | \n",
+ " 3.2 | \n",
+ " 1.3 | \n",
+ " 0.2 | \n",
+ " setosa | \n",
+ "
\n",
+ " \n",
+ " 2 | \n",
+ " 4.4 | \n",
+ " 3.0 | \n",
+ " 1.3 | \n",
+ " 0.2 | \n",
+ " setosa | \n",
+ "
\n",
+ " \n",
+ " 3 | \n",
+ " 4.4 | \n",
+ " 2.9 | \n",
+ " 1.4 | \n",
+ " 0.2 | \n",
+ " setosa | \n",
+ "
\n",
+ " \n",
+ " 4 | \n",
+ " 4.5 | \n",
+ " 2.3 | \n",
+ " 1.3 | \n",
+ " 0.3 | \n",
+ " setosa | \n",
+ "
\n",
+ " \n",
+ "
\n",
+ "
"
+ ],
+ "text/plain": [
+ " sepal_length sepal_width petal_length petal_width species\n",
+ "0 4.3 3.0 1.1 0.1 setosa\n",
+ "1 4.4 3.2 1.3 0.2 setosa\n",
+ "2 4.4 3.0 1.3 0.2 setosa\n",
+ "3 4.4 2.9 1.4 0.2 setosa\n",
+ "4 4.5 2.3 1.3 0.3 setosa"
+ ]
+ },
+ "execution_count": 41,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "sorted = t.sort_by(['sepal_length', ibis.desc('sepal_width')]).head(5)\n",
+ "sorted"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "ad5157cd-e42d-4349-9a6c-f293b3cab1b4",
+ "metadata": {},
+ "source": [
+ "## Aggregation\n",
+ "\n",
+ "Aggregation in pandas is typically done by computing columns based on an aggregate function."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 42,
+ "id": "fbd2705f-1fbb-4ba4-9273-ed6e9e89ec9c",
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/html": [
+ "\n",
+ "\n",
+ "
\n",
+ " \n",
+ " \n",
+ " | \n",
+ " total_sepal_width | \n",
+ " avg.sepal_length | \n",
+ "
\n",
+ " \n",
+ " \n",
+ " \n",
+ " 0 | \n",
+ " 458.6 | \n",
+ " 5.843333 | \n",
+ "
\n",
+ " \n",
+ "
\n",
+ "
"
+ ],
+ "text/plain": [
+ " total_sepal_width avg.sepal_length\n",
+ "0 458.6 5.843333"
+ ]
+ },
+ "execution_count": 42,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "stats = [df.sepal_width.sum(), df.sepal_length.mean()]\n",
+ "pd.DataFrame([stats], columns=['total_sepal_width', 'avg.sepal_length'])"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "1c59c0d6-823b-4c2a-85f9-732a0468ef2b",
+ "metadata": {},
+ "source": [
+ "In Ibis, you construct aggregate expressions then apply them to the table using the `aggregate` method."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 43,
+ "id": "04090b90-3e2b-4cfe-ac57-52a652e45609",
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/html": [
+ "\n",
+ "\n",
+ "
\n",
+ " \n",
+ " \n",
+ " | \n",
+ " total_sepal_width | \n",
+ " avg_sepal_length | \n",
+ "
\n",
+ " \n",
+ " \n",
+ " \n",
+ " 0 | \n",
+ " 458.6 | \n",
+ " 5.843333 | \n",
+ "
\n",
+ " \n",
+ "
\n",
+ "
"
+ ],
+ "text/plain": [
+ " total_sepal_width avg_sepal_length\n",
+ "0 458.6 5.843333"
+ ]
+ },
+ "execution_count": 43,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "stats = [t.sepal_width.sum().name('total_sepal_width'), t.sepal_length.mean().name('avg_sepal_length')]\n",
+ "agged = t.aggregate(stats)\n",
+ "agged"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "9670ba4d-f870-4b55-b948-c57a4933bf07",
+ "metadata": {},
+ "source": [
+ "You can also combine both operations into one and pass the aggregate expressions using keyword parameters."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 44,
+ "id": "ac9dd0ba-fbe9-4a5f-b4d6-ab2b127e5a89",
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/html": [
+ "\n",
+ "\n",
+ "
\n",
+ " \n",
+ " \n",
+ " | \n",
+ " total_sepal_width | \n",
+ " avg_sepal_length | \n",
+ "
\n",
+ " \n",
+ " \n",
+ " \n",
+ " 0 | \n",
+ " 458.6 | \n",
+ " 5.843333 | \n",
+ "
\n",
+ " \n",
+ "
\n",
+ "
"
+ ],
+ "text/plain": [
+ " total_sepal_width avg_sepal_length\n",
+ "0 458.6 5.843333"
+ ]
+ },
+ "execution_count": 44,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "agged = t.aggregate(\n",
+ " total_sepal_width=t.sepal_width.sum(),\n",
+ " avg_sepal_length=t.sepal_length.mean(),\n",
+ ")\n",
+ "agged"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "7338f6a8-9be9-41a5-bb73-2786d67c1f91",
+ "metadata": {},
+ "source": [
+ "### Group by\n",
+ "\n",
+ "Aggregations can also be done across groupings using the `by=` parameter."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 45,
+ "id": "3e773759-77ca-4eb5-a858-81d489399059",
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/html": [
+ "\n",
+ "\n",
+ "
\n",
+ " \n",
+ " \n",
+ " | \n",
+ " species | \n",
+ " total_sepal_width | \n",
+ " avg_sepal_length | \n",
+ "
\n",
+ " \n",
+ " \n",
+ " \n",
+ " 0 | \n",
+ " setosa | \n",
+ " 171.4 | \n",
+ " 5.006 | \n",
+ "
\n",
+ " \n",
+ " 1 | \n",
+ " versicolor | \n",
+ " 138.5 | \n",
+ " 5.936 | \n",
+ "
\n",
+ " \n",
+ " 2 | \n",
+ " virginica | \n",
+ " 148.7 | \n",
+ " 6.588 | \n",
+ "
\n",
+ " \n",
+ "
\n",
+ "
"
+ ],
+ "text/plain": [
+ " species total_sepal_width avg_sepal_length\n",
+ "0 setosa 171.4 5.006\n",
+ "1 versicolor 138.5 5.936\n",
+ "2 virginica 148.7 6.588"
+ ]
+ },
+ "execution_count": 45,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "agged = t.aggregate(\n",
+ " by='species',\n",
+ " total_sepal_width=t.sepal_width.sum(),\n",
+ " avg_sepal_length=t.sepal_length.mean(),\n",
+ ")\n",
+ "agged"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "78035eba-5c94-4e67-9d56-58fa36dcd657",
+ "metadata": {},
+ "source": [
+ "Alternatively, by groups can be computed using a grouped table."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 46,
+ "id": "ca4ed3a8-d788-40f1-ae8e-52f691eaeb40",
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/html": [
+ "\n",
+ "\n",
+ "
\n",
+ " \n",
+ " \n",
+ " | \n",
+ " species | \n",
+ " total_sepal_width | \n",
+ " avg_sepal_length | \n",
+ "
\n",
+ " \n",
+ " \n",
+ " \n",
+ " 0 | \n",
+ " setosa | \n",
+ " 171.4 | \n",
+ " 5.006 | \n",
+ "
\n",
+ " \n",
+ " 1 | \n",
+ " versicolor | \n",
+ " 138.5 | \n",
+ " 5.936 | \n",
+ "
\n",
+ " \n",
+ " 2 | \n",
+ " virginica | \n",
+ " 148.7 | \n",
+ " 6.588 | \n",
+ "
\n",
+ " \n",
+ "
\n",
+ "
"
+ ],
+ "text/plain": [
+ " species total_sepal_width avg_sepal_length\n",
+ "0 setosa 171.4 5.006\n",
+ "1 versicolor 138.5 5.936\n",
+ "2 virginica 148.7 6.588"
+ ]
+ },
+ "execution_count": 46,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "agged = t.group_by('species').aggregate(\n",
+ " total_sepal_width=t.sepal_width.sum(),\n",
+ " avg_sepal_length=t.sepal_length.mean(),\n",
+ ")\n",
+ "agged"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "ecc585d8-d6b4-40bf-bb77-2a35f2998c8d",
+ "metadata": {},
+ "source": [
+ "## Dropping rows with `NULL`s\n",
+ "\n",
+ "Both pandas and Ibis allow you to drop rows from a table based on whether a set of columns\n",
+ "contains a `NULL` value. This method is called `dropna` in both packages. The common set\n",
+ "of parameters in the two are `subset=` and `how=`. The `subset=` parameter indicates which\n",
+ "columns to inspect for `NULL` values. The `how=` parameter specifies whether 'any' or 'all'\n",
+ "of the specified columns must be `NULL` in order for the row to be dropped."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 47,
+ "id": "170fd8e7-4c2d-4981-8969-e064b2d8b176",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "no_null_t = t.dropna(['sepal_width', 'sepal_length'], how='any')"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "ff5a2ba3-0164-47ef-b687-47c6dd19903c",
+ "metadata": {},
+ "source": [
+ "## Filling `NULL` values\n",
+ "\n",
+ "Both pandas and Ibis allow you to fill `NULL` values in a table. In Ibis, the replacement value can only\n",
+ "be a scalar value of a dictionary of values. If it is a dictionary, the keys of the dictionary specify\n",
+ "the column name for the value to apply to."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 48,
+ "id": "a5eb4117-c843-401d-8ca6-3b29c00f62da",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "no_null_t = t.fillna(dict(sepal_width=0, sepal_length=0))"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "ea5f0377-5c1f-4b3d-b0ef-6f1e92533171",
+ "metadata": {},
+ "source": [
+ "## Common column expressions\n",
+ "\n",
+ "See the full API documentation for all of the available value methods and tools for creating value expressions. We mention a few common ones here as they relate to common SQL queries."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "e37b0ef6-9653-46a3-ab30-cffd079f081d",
+ "metadata": {},
+ "source": [
+ "## Type casts\n",
+ "\n",
+ "Type casting in pandas is done using the `astype` method on columns."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 49,
+ "id": "dc002f3f-efaf-43e5-9f1f-e3a0bf683baf",
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "0 3.5\n",
+ "1 3.0\n",
+ "2 3.2\n",
+ "3 3.1\n",
+ "4 3.6\n",
+ " ... \n",
+ "145 3.0\n",
+ "146 2.5\n",
+ "147 3.0\n",
+ "148 3.4\n",
+ "149 3.0\n",
+ "Name: sepal_width, Length: 150, dtype: object"
+ ]
+ },
+ "execution_count": 49,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "df.sepal_width.astype(str)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "3a790e6c-ae3b-4166-b451-4826b3290c61",
+ "metadata": {},
+ "source": [
+ "In Ibis, you cast the column type using the `cast` method."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 50,
+ "id": "4537f3eb-7b94-4605-b409-8926e16608de",
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/html": [
+ "\n",
+ "\n",
+ "
\n",
+ " \n",
+ " \n",
+ " | \n",
+ " sepal_width | \n",
+ "
\n",
+ " \n",
+ " \n",
+ " \n",
+ " 0 | \n",
+ " 3 | \n",
+ "
\n",
+ " \n",
+ " 1 | \n",
+ " 3 | \n",
+ "
\n",
+ " \n",
+ " 2 | \n",
+ " 3 | \n",
+ "
\n",
+ " \n",
+ " 3 | \n",
+ " 3 | \n",
+ "
\n",
+ " \n",
+ " 4 | \n",
+ " 3 | \n",
+ "
\n",
+ " \n",
+ " ... | \n",
+ " ... | \n",
+ "
\n",
+ " \n",
+ " 145 | \n",
+ " 3 | \n",
+ "
\n",
+ " \n",
+ " 146 | \n",
+ " 2 | \n",
+ "
\n",
+ " \n",
+ " 147 | \n",
+ " 3 | \n",
+ "
\n",
+ " \n",
+ " 148 | \n",
+ " 3 | \n",
+ "
\n",
+ " \n",
+ " 149 | \n",
+ " 3 | \n",
+ "
\n",
+ " \n",
+ "
\n",
+ "
150 rows × 1 columns
\n",
+ "
"
+ ],
+ "text/plain": [
+ "0 3\n",
+ "1 3\n",
+ "2 3\n",
+ "3 3\n",
+ "4 3\n",
+ " ..\n",
+ "145 3\n",
+ "146 2\n",
+ "147 3\n",
+ "148 3\n",
+ "149 3\n",
+ "Name: sepal_width, Length: 150, dtype: int64"
+ ]
+ },
+ "execution_count": 50,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "t.sepal_width.cast('int')"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "1b94fbb6-f1d3-4f84-afcb-ffcb0e8103f5",
+ "metadata": {},
+ "source": [
+ "Casted columns can be assigned back to the table using the `mutate` method described earlier."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 51,
+ "id": "68dcb03c-6b97-4278-bb8a-13dda3da70dc",
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "ibis.Schema {\n",
+ " sepal_length int64\n",
+ " sepal_width int64\n",
+ " petal_length float64\n",
+ " petal_width float64\n",
+ " species string\n",
+ "}"
+ ]
+ },
+ "execution_count": 51,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "casted = t.mutate(\n",
+ " sepal_width=t.sepal_width.cast('int'),\n",
+ " sepal_length=t.sepal_length.cast('int'),\n",
+ ")\n",
+ "casted.schema()"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "7768f8f1-44a7-44e6-a433-c33eff2b57cb",
+ "metadata": {},
+ "source": [
+ "### Replacing `NULL`s\n",
+ "\n",
+ "Both pandas and Ibis have `fillna` methods which allow you to specify a replacement value\n",
+ "for `NULL` values."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 52,
+ "id": "1a280ac1-0639-4757-9941-fad657ee04a9",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "sepal_length_no_nulls = t.sepal_length.fillna(0)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "090d19c5-b961-49ad-a4e9-ba490521e785",
+ "metadata": {},
+ "source": [
+ "### Set membership\n",
+ "\n",
+ "pandas set membership uses the `in` and `not in` operators such as `'a' in df.species`. Ibis uses\n",
+ "`isin` and `notin` methods. In addition to testing membership in a set, these methods allow you to\n",
+ "specify an else case to assign a value when the value isn't in the set."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 53,
+ "id": "58f0795e-43d9-4497-b484-4f74f60b3d3a",
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/html": [
+ "\n",
+ "\n",
+ "
\n",
+ " \n",
+ " \n",
+ " | \n",
+ " species | \n",
+ " count | \n",
+ "
\n",
+ " \n",
+ " \n",
+ " \n",
+ " 0 | \n",
+ " setosa | \n",
+ " 50 | \n",
+ "
\n",
+ " \n",
+ " 1 | \n",
+ " versicolor | \n",
+ " 50 | \n",
+ "
\n",
+ " \n",
+ " 2 | \n",
+ " virginica | \n",
+ " 50 | \n",
+ "
\n",
+ " \n",
+ "
\n",
+ "
"
+ ],
+ "text/plain": [
+ " species count\n",
+ "0 setosa 50\n",
+ "1 versicolor 50\n",
+ "2 virginica 50"
+ ]
+ },
+ "execution_count": 53,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "t.species.value_counts()"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 54,
+ "id": "a942ae68-8cc6-42b6-9676-2e2c5bbc6633",
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/html": [
+ "\n",
+ "\n",
+ "
\n",
+ " \n",
+ " \n",
+ " | \n",
+ " unnamed | \n",
+ " count | \n",
+ "
\n",
+ " \n",
+ " \n",
+ " \n",
+ " 0 | \n",
+ " False | \n",
+ " 50 | \n",
+ "
\n",
+ " \n",
+ " 1 | \n",
+ " True | \n",
+ " 100 | \n",
+ "
\n",
+ " \n",
+ "
\n",
+ "
"
+ ],
+ "text/plain": [
+ " unnamed count\n",
+ "0 False 50\n",
+ "1 True 100"
+ ]
+ },
+ "execution_count": 54,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "refined = t.species.isin(['versicolor', 'virginica'])\n",
+ "refined.value_counts()"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "d2488ae0-933d-43fa-853f-b39359f4c2f8",
+ "metadata": {},
+ "source": [
+ "## Merging tables\n",
+ "\n",
+ "While pandas uses the `merge` method to combine data from multiple `DataFrames`, Ibis uses the\n",
+ "`join` method. They both have similar capabilities. The signature for the `join` method in Ibis\n",
+ "is: `join(right, predicates=(), how='inner', *, suffixes=('_x', '_y'))`. The `merge` method on\n",
+ "pandas' `DataFrame` allows many more parameters, but the signature with the corresponding\n",
+ "parameters would be: `join(right, on=(), how='inner', *, suffixes=('_x', '_y'))`. The valid values\n",
+ "of the `how=` parameter will vary depending on the backend, but common values are 'inner', 'outer',\n",
+ "'left', and 'right'.\n",
+ "\n",
+ "The biggest difference between Ibis' `join` method and pandas' `merge` method is that pandas only\n",
+ "accepts column names or index levels to join on, whereas Ibis can merge on expressions.\n",
+ "\n",
+ "Here are some examples of merging using pandas."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 55,
+ "id": "1d244188-0e5d-441d-b9ed-e6bed6ebf287",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "df_left = pd.DataFrame([\n",
+ " ['a', 1, 2],\n",
+ " ['b', 3, 4],\n",
+ " ['c', 4, 6],\n",
+ "], columns=['name', 'x', 'y'])\n",
+ "\n",
+ "df_right = pd.DataFrame([\n",
+ " ['a', 100, 200],\n",
+ " ['m', 300, 400],\n",
+ " ['n', 400, 600],\n",
+ "], columns=['name', 'x_100', 'y_100'])"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 56,
+ "id": "25e4c74f-178b-4c77-877d-2057e91f393a",
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/html": [
+ "\n",
+ "\n",
+ "
\n",
+ " \n",
+ " \n",
+ " | \n",
+ " name | \n",
+ " x | \n",
+ " y | \n",
+ " x_100 | \n",
+ " y_100 | \n",
+ "
\n",
+ " \n",
+ " \n",
+ " \n",
+ " 0 | \n",
+ " a | \n",
+ " 1 | \n",
+ " 2 | \n",
+ " 100 | \n",
+ " 200 | \n",
+ "
\n",
+ " \n",
+ "
\n",
+ "
"
+ ],
+ "text/plain": [
+ " name x y x_100 y_100\n",
+ "0 a 1 2 100 200"
+ ]
+ },
+ "execution_count": 56,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "df_left.merge(df_right, on='name')"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 57,
+ "id": "f28d2c44-1ec6-4e38-af40-527936db6a6d",
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/html": [
+ "\n",
+ "\n",
+ "
\n",
+ " \n",
+ " \n",
+ " | \n",
+ " name | \n",
+ " x | \n",
+ " y | \n",
+ " x_100 | \n",
+ " y_100 | \n",
+ "
\n",
+ " \n",
+ " \n",
+ " \n",
+ " 0 | \n",
+ " a | \n",
+ " 1.0 | \n",
+ " 2.0 | \n",
+ " 100.0 | \n",
+ " 200.0 | \n",
+ "
\n",
+ " \n",
+ " 1 | \n",
+ " b | \n",
+ " 3.0 | \n",
+ " 4.0 | \n",
+ " NaN | \n",
+ " NaN | \n",
+ "
\n",
+ " \n",
+ " 2 | \n",
+ " c | \n",
+ " 4.0 | \n",
+ " 6.0 | \n",
+ " NaN | \n",
+ " NaN | \n",
+ "
\n",
+ " \n",
+ " 3 | \n",
+ " m | \n",
+ " NaN | \n",
+ " NaN | \n",
+ " 300.0 | \n",
+ " 400.0 | \n",
+ "
\n",
+ " \n",
+ " 4 | \n",
+ " n | \n",
+ " NaN | \n",
+ " NaN | \n",
+ " 400.0 | \n",
+ " 600.0 | \n",
+ "
\n",
+ " \n",
+ "
\n",
+ "
"
+ ],
+ "text/plain": [
+ " name x y x_100 y_100\n",
+ "0 a 1.0 2.0 100.0 200.0\n",
+ "1 b 3.0 4.0 NaN NaN\n",
+ "2 c 4.0 6.0 NaN NaN\n",
+ "3 m NaN NaN 300.0 400.0\n",
+ "4 n NaN NaN 400.0 600.0"
+ ]
+ },
+ "execution_count": 57,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "df_left.merge(df_right, on='name', how='outer')"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "61fe995c-9414-47d4-963d-a09b146f8106",
+ "metadata": {},
+ "source": [
+ "We can now convert `DataFrames` to Ibis tables to do `join`s."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 58,
+ "id": "e1cbaa65-7c76-4e2f-bf29-dff02c05dc25",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "pd_ibis = ibis.pandas.connect({'t_left': df_left, 't_right': df_right})\n",
+ "t_left = pd_ibis.table('t_left')\n",
+ "t_right = pd_ibis.table('t_right')"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 59,
+ "id": "45cc37a6-601f-48e5-a266-39380c165bff",
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/html": [
+ "\n",
+ "\n",
+ "
\n",
+ " \n",
+ " \n",
+ " | \n",
+ " name_x | \n",
+ " x | \n",
+ " y | \n",
+ " name_y | \n",
+ " x_100 | \n",
+ " y_100 | \n",
+ "
\n",
+ " \n",
+ " \n",
+ " \n",
+ " 0 | \n",
+ " a | \n",
+ " 1 | \n",
+ " 2 | \n",
+ " a | \n",
+ " 100 | \n",
+ " 200 | \n",
+ "
\n",
+ " \n",
+ "
\n",
+ "
"
+ ],
+ "text/plain": [
+ " name_x x y name_y x_100 y_100\n",
+ "0 a 1 2 a 100 200"
+ ]
+ },
+ "execution_count": 59,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "t_left.join(t_right, t_left.name == t_right.name)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "4597efd7-5bcf-4c78-9fa9-f60cbd893ea4",
+ "metadata": {},
+ "source": [
+ "You may notice that in Ibis joins, even if the predicate is an equality expression and both tables\n",
+ "have the same column name, you will still get multiple output columns with suffixes added.\n",
+ "This may change in a future version to match the pandas behavior.\n",
+ "\n",
+ "Below is an outer join where missing values are filled with `NaN`."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 60,
+ "id": "6142e659-d60b-4f0c-8d37-7ac53d4a7a5b",
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/html": [
+ "\n",
+ "\n",
+ "
\n",
+ " \n",
+ " \n",
+ " | \n",
+ " name_x | \n",
+ " x | \n",
+ " y | \n",
+ " name_y | \n",
+ " x_100 | \n",
+ " y_100 | \n",
+ "
\n",
+ " \n",
+ " \n",
+ " \n",
+ " 0 | \n",
+ " a | \n",
+ " 1.0 | \n",
+ " 2.0 | \n",
+ " a | \n",
+ " 100.0 | \n",
+ " 200.0 | \n",
+ "
\n",
+ " \n",
+ " 1 | \n",
+ " b | \n",
+ " 3.0 | \n",
+ " 4.0 | \n",
+ " b | \n",
+ " NaN | \n",
+ " NaN | \n",
+ "
\n",
+ " \n",
+ " 2 | \n",
+ " c | \n",
+ " 4.0 | \n",
+ " 6.0 | \n",
+ " c | \n",
+ " NaN | \n",
+ " NaN | \n",
+ "
\n",
+ " \n",
+ " 3 | \n",
+ " m | \n",
+ " NaN | \n",
+ " NaN | \n",
+ " m | \n",
+ " 300.0 | \n",
+ " 400.0 | \n",
+ "
\n",
+ " \n",
+ " 4 | \n",
+ " n | \n",
+ " NaN | \n",
+ " NaN | \n",
+ " n | \n",
+ " 400.0 | \n",
+ " 600.0 | \n",
+ "
\n",
+ " \n",
+ "
\n",
+ "
"
+ ],
+ "text/plain": [
+ " name_x x y name_y x_100 y_100\n",
+ "0 a 1.0 2.0 a 100.0 200.0\n",
+ "1 b 3.0 4.0 b NaN NaN\n",
+ "2 c 4.0 6.0 c NaN NaN\n",
+ "3 m NaN NaN m 300.0 400.0\n",
+ "4 n NaN NaN n 400.0 600.0"
+ ]
+ },
+ "execution_count": 60,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "t_left.join(t_right, t_left.name == t_right.name, how='outer')"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "9cce73d2-9d49-48fe-bbe6-8f06982c4f8a",
+ "metadata": {},
+ "source": [
+ "## Concatenating tables\n",
+ "\n",
+ "Concatenating `DataFrame`s in pandas is done with the `concat` top-level function. It takes multiple `DataFrames`\n",
+ "and concatenates the rows of one `DataFrame` to the next. If the columns are mis-matched, it extends the\n",
+ "list of columns to include the full set of columns and inserts `NaN`s and `None`s into the missing values.\n",
+ "\n",
+ "Concatenating tables in Ibis can only be done on tables with matching schemas. The concatenation is done\n",
+ "using the top-level `union` function or the `union` method on a table.\n",
+ "\n",
+ "We'll demonstrate a pandas `concat` first."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 61,
+ "id": "0aa28ac6-bf75-4b47-8986-0ecb1e7ff28e",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "df_1 = pd.DataFrame([\n",
+ " ['a', 1, 2],\n",
+ " ['b', 3, 4],\n",
+ " ['c', 4, 6],\n",
+ "], columns=['name', 'x', 'y'])\n",
+ "\n",
+ "df_2 = pd.DataFrame([\n",
+ " ['a', 100, 200],\n",
+ " ['m', 300, 400],\n",
+ " ['n', 400, 600],\n",
+ "], columns=['name', 'x', 'y'])"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 62,
+ "id": "19482224-2a86-4925-be2a-c5412f3d488d",
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/html": [
+ "\n",
+ "\n",
+ "
\n",
+ " \n",
+ " \n",
+ " | \n",
+ " name | \n",
+ " x | \n",
+ " y | \n",
+ "
\n",
+ " \n",
+ " \n",
+ " \n",
+ " 0 | \n",
+ " a | \n",
+ " 1 | \n",
+ " 2 | \n",
+ "
\n",
+ " \n",
+ " 1 | \n",
+ " b | \n",
+ " 3 | \n",
+ " 4 | \n",
+ "
\n",
+ " \n",
+ " 2 | \n",
+ " c | \n",
+ " 4 | \n",
+ " 6 | \n",
+ "
\n",
+ " \n",
+ " 0 | \n",
+ " a | \n",
+ " 100 | \n",
+ " 200 | \n",
+ "
\n",
+ " \n",
+ " 1 | \n",
+ " m | \n",
+ " 300 | \n",
+ " 400 | \n",
+ "
\n",
+ " \n",
+ " 2 | \n",
+ " n | \n",
+ " 400 | \n",
+ " 600 | \n",
+ "
\n",
+ " \n",
+ "
\n",
+ "
"
+ ],
+ "text/plain": [
+ " name x y\n",
+ "0 a 1 2\n",
+ "1 b 3 4\n",
+ "2 c 4 6\n",
+ "0 a 100 200\n",
+ "1 m 300 400\n",
+ "2 n 400 600"
+ ]
+ },
+ "execution_count": 62,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "pd.concat([df_1, df_2])"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "20ad6803-4cbd-4fb8-87c6-c5a05092f12f",
+ "metadata": {},
+ "source": [
+ "Now we can convert the `DataFrame`s to Ibis tables and combine the tables using a union."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 63,
+ "id": "efbbb4b4-ef7e-4e8a-a8e2-359ffc70db58",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "pd_ibis = ibis.pandas.connect({'t_1': df_1, 't_2': df_2})\n",
+ "t_1 = pd_ibis.table('t_1')\n",
+ "t_2 = pd_ibis.table('t_2')"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 64,
+ "id": "56f6ab13-be06-4779-8289-dd838bde2ece",
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/html": [
+ "\n",
+ "\n",
+ "
\n",
+ " \n",
+ " \n",
+ " | \n",
+ " name | \n",
+ " x | \n",
+ " y | \n",
+ "
\n",
+ " \n",
+ " \n",
+ " \n",
+ " 0 | \n",
+ " a | \n",
+ " 1 | \n",
+ " 2 | \n",
+ "
\n",
+ " \n",
+ " 1 | \n",
+ " b | \n",
+ " 3 | \n",
+ " 4 | \n",
+ "
\n",
+ " \n",
+ " 2 | \n",
+ " c | \n",
+ " 4 | \n",
+ " 6 | \n",
+ "
\n",
+ " \n",
+ " 3 | \n",
+ " a | \n",
+ " 100 | \n",
+ " 200 | \n",
+ "
\n",
+ " \n",
+ " 4 | \n",
+ " m | \n",
+ " 300 | \n",
+ " 400 | \n",
+ "
\n",
+ " \n",
+ " 5 | \n",
+ " n | \n",
+ " 400 | \n",
+ " 600 | \n",
+ "
\n",
+ " \n",
+ "
\n",
+ "
"
+ ],
+ "text/plain": [
+ " name x y\n",
+ "0 a 1 2\n",
+ "1 b 3 4\n",
+ "2 c 4 6\n",
+ "3 a 100 200\n",
+ "4 m 300 400\n",
+ "5 n 400 600"
+ ]
+ },
+ "execution_count": 64,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "unioned = ibis.union(t_1, t_2)\n",
+ "unioned"
+ ]
+ }
+ ],
+ "metadata": {
+ "kernelspec": {
+ "display_name": "Python 3 (ipykernel)",
+ "language": "python",
+ "name": "python3"
+ },
+ "language_info": {
+ "codemirror_mode": {
+ "name": "ipython",
+ "version": 3
+ },
+ "file_extension": ".py",
+ "mimetype": "text/x-python",
+ "name": "python",
+ "nbconvert_exporter": "python",
+ "pygments_lexer": "ipython3",
+ "version": "3.9.4"
+ }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 5
+}