diff --git a/docs/SUMMARY.md b/docs/SUMMARY.md index 055ba99797a7..b2f1af343a67 100644 --- a/docs/SUMMARY.md +++ b/docs/SUMMARY.md @@ -10,6 +10,7 @@ * [Analytics Tools](tutorial/07-Analytics-Tools.ipynb) * [Geospatial Analysis](tutorial/08-Geospatial-Analysis.ipynb) * [Ibis for SQL Programmers](ibis-for-sql-programmers.ipynb) +* [Ibis for pandas Users](ibis-for-pandas-users.ipynb) * [User Guide](user_guide/) * [Execution Backends](backends/) * [How To Guide](how_to/) diff --git a/docs/ibis-for-pandas-users.ipynb b/docs/ibis-for-pandas-users.ipynb new file mode 100644 index 000000000000..849232feb286 --- /dev/null +++ b/docs/ibis-for-pandas-users.ipynb @@ -0,0 +1,4355 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "id": "641af053-e938-4384-8ccd-6eff5b31833d", + "metadata": { + "tags": [] + }, + "source": [ + "# Ibis for pandas Users\n", + "\n", + "Much of the syntax and many of the operations in Ibis are inspired\n", + "by the pandas `DataFrame`. However, the primary domain of Ibis is\n", + "SQL, so there are some differences in how they operate.\n", + "\n", + "One primary\n", + "difference between Ibis tables and pandas `DataFrame`s is that many\n", + "of the pandas `DataFrame` operations modify the object in place, whereas\n", + "Ibis table operations always return a new table expression." + ] + }, + { + "cell_type": "code", + "execution_count": 1, + "id": "bd23750c-bc6b-46db-a6d3-017f73a1d436", + "metadata": {}, + "outputs": [], + "source": [ + "import ibis\n", + "import pandas as pd" + ] + }, + { + "cell_type": "markdown", + "id": "a0f6d6ea-66ec-4554-9cd5-d3dbd60d5ca2", + "metadata": {}, + "source": [ + "**Note that we'll be using Ibis' interactive mode to automatically execute queries at\n", + "the end of each cell in this notebook. 
If you are using similar code in a program,\n", + "you will have to add `.execute()` to each operation that you want to evaluate.**" + ] + }, + { + "cell_type": "code", + "execution_count": 2, + "id": "83cbc2e5-d2c2-42a2-9a13-fa360cd3a834", + "metadata": {}, + "outputs": [], + "source": [ + "ibis.options.interactive = True" + ] + }, + { + "cell_type": "markdown", + "id": "afd136cd-18b5-48a2-8359-c47cde181c7b", + "metadata": {}, + "source": [ + "We'll be using the pandas backend in Ibis in the examples below. First we'll create a simple `DataFrame`." + ] + }, + { + "cell_type": "code", + "execution_count": 3, + "id": "ba8432fc-8423-459a-8b56-c4720db65407", + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
" + ], + "text/plain": [ + " one two three\n", + "0 a 1 2\n", + "1 b 3 4" + ] + }, + "execution_count": 3, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "df = pd.DataFrame([\n", + " ['a', 1, 2],\n", + " ['b', 3, 4]\n", + "], columns=['one', 'two', 'three'])\n", + "df" + ] + }, + { + "cell_type": "markdown", + "id": "2a226ff0-f1b4-49d8-b6bd-8a8bf5cae109", + "metadata": {}, + "source": [ + "Now we can create an Ibis table from the above `DataFrame`." + ] + }, + { + "cell_type": "code", + "execution_count": 4, + "id": "62e08de9-ad4b-453a-b862-429d11b0432a", + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
" + ], + "text/plain": [ + " one two three\n", + "0 a 1 2\n", + "1 b 3 4" + ] + }, + "execution_count": 4, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "t = ibis.pandas.connect({'t': df}).table('t')\n", + "t" + ] + }, + { + "cell_type": "markdown", + "id": "6b928082-2c35-4683-8f6c-4a800d063922", + "metadata": {}, + "source": [ + "## Data types\n", + "\n", + "The data types of columns in pandas are accessed using the `dtypes` attribute. This returns\n", + "a `Series` object." + ] + }, + { + "cell_type": "code", + "execution_count": 5, + "id": "94ddeaf0-efb9-4be8-9a60-af26dd657ff6", + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "one object\n", + "two int64\n", + "three int64\n", + "dtype: object" + ] + }, + "execution_count": 5, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "df.dtypes" + ] + }, + { + "cell_type": "markdown", + "id": "ec2c32a8-fc81-40e5-933a-410407167e80", + "metadata": {}, + "source": [ + "In Ibis, you use the `schema` method which returns an `ibis.Schema` object." + ] + }, + { + "cell_type": "code", + "execution_count": 6, + "id": "11770743-e66f-4001-988d-d5c7a5b40cd6", + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "ibis.Schema {\n", + " one string\n", + " two int64\n", + " three int64\n", + "}" + ] + }, + "execution_count": 6, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "t.schema()" + ] + }, + { + "cell_type": "markdown", + "id": "33a9a6a3-6b58-4227-9d6e-9ffcbb5c8ea8", + "metadata": {}, + "source": [ + "It is possible to convert the schema information to pandas data types using the `to_pandas` method, if needed." 
+ ] + }, + { + "cell_type": "code", + "execution_count": 7, + "id": "68da325b-d549-4564-845a-20e3d99b5df7", + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "[('one', dtype('O')), ('two', dtype('int64')), ('three', dtype('int64'))]" + ] + }, + "execution_count": 7, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "t.schema().to_pandas()" + ] + }, + { + "cell_type": "markdown", + "id": "10a8e767-5a09-4bf8-afea-a2598b3d96f5", + "metadata": {}, + "source": [ + "## Table layout\n", + "\n", + "In pandas, the layout of a table is given by the `shape` attribute, a tuple containing the number\n", + "of rows and the number of columns. The number of columns in an Ibis table can be obtained\n", + "from the length of the schema." + ] + }, + { + "cell_type": "code", + "execution_count": 8, + "id": "1d58bd48-1b20-4d06-917d-e10aa5ca5a62", + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "3" + ] + }, + "execution_count": 8, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "len(t.schema())" + ] + }, + { + "cell_type": "markdown", + "id": "be551ea9-a124-46b9-b73d-43743c4030c9", + "metadata": {}, + "source": [ + "To get the number of rows of a table, you use the `count` method." + ] + }, + { + "cell_type": "code", + "execution_count": 9, + "id": "b1c45c64-9c2f-4d74-8c6b-847e6e94bf59", + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "2" + ] + }, + "execution_count": 9, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "t.count()" + ] + }, + { + "cell_type": "markdown", + "id": "6b7753f1-fd33-41f0-bff5-c34f2eb3ffe2", + "metadata": {}, + "source": [ + "To mimic pandas' behavior, you would use the following code. Note that you need to use the `execute` method\n", + "after `count` to evaluate the expression returned by `count`." 
+ ] + }, + { + "cell_type": "code", + "execution_count": 10, + "id": "21af7477-0f2a-4c3b-8d00-fcbe4a82bf71", + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "(2, 3)" + ] + }, + "execution_count": 10, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "(t.count().execute(), len(t.schema()))" + ] + }, + { + "cell_type": "code", + "execution_count": 11, + "id": "6970d109-a4ad-4f5a-9263-6361f71f2b2f", + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "(2, 3)" + ] + }, + "execution_count": 11, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "df.shape" + ] + }, + { + "cell_type": "markdown", + "id": "953aadd7-1e7f-464b-8aab-df76c11a944c", + "metadata": {}, + "source": [ + "## Subsetting columns\n", + "\n", + "Selecting columns works much as it does in pandas. In fact, you can use the same syntax." + ] + }, + { + "cell_type": "code", + "execution_count": 12, + "id": "77d9bde4-8465-4ec4-b1a6-b69b32ea0398", + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
" + ], + "text/plain": [ + " one two\n", + "0 a 1\n", + "1 b 3" + ] + }, + "execution_count": 12, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "t[['one', 'two']]" + ] + }, + { + "cell_type": "markdown", + "id": "38f96e98-c2a6-4734-a405-cd3e97224dc4", + "metadata": {}, + "source": [ + "However, since row-level indexing is not supported in Ibis, the inner list is not necessary." + ] + }, + { + "cell_type": "code", + "execution_count": 13, + "id": "9e9770c3-9aae-4aec-b426-84a435e74159", + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
" + ], + "text/plain": [ + " one two\n", + "0 a 1\n", + "1 b 3" + ] + }, + "execution_count": 13, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "t['one', 'two']" + ] + }, + { + "cell_type": "markdown", + "id": "45740901-33f6-4f7a-b282-073d8d1fb7c1", + "metadata": {}, + "source": [ + "## Selecting columns\n", + "\n", + "Selecting columns is done using the same syntax as in pandas `DataFrames`. You can use either \n", + "the indexing syntax or attribute syntax." + ] + }, + { + "cell_type": "code", + "execution_count": 14, + "id": "7ce3128c-a722-4948-be5d-a2f38873acf5", + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
" + ], + "text/plain": [ + "0 a\n", + "1 b\n", + "Name: one, dtype: object" + ] + }, + "execution_count": 14, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "t['one']" + ] + }, + { + "cell_type": "markdown", + "id": "39c935c7-216f-4086-8ce0-62c5532ef0d5", + "metadata": {}, + "source": [ + "or:" + ] + }, + { + "cell_type": "code", + "execution_count": 15, + "id": "7af612e6-dfee-40a5-a3df-788566fdd4eb", + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
" + ], + "text/plain": [ + "0 a\n", + "1 b\n", + "Name: one, dtype: object" + ] + }, + "execution_count": 15, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "t.one" + ] + }, + { + "cell_type": "markdown", + "id": "2db743ac-0c64-4e15-aa3e-267d93ee303f", + "metadata": {}, + "source": [ + "## Adding, removing, and modifying columns\n", + "\n", + "Modifying the columns of an Ibis table is a bit different than doing the same operations in\n", + "a pandas `DataFrame`. This is primarily due to the fact that in-place operations are not \n", + "supported on Ibis tables. Each time you do a column modification to a table, a new table\n", + "expression is returned." + ] + }, + { + "cell_type": "markdown", + "id": "2307d018-6f25-4c0e-a7e4-de3d5428ed46", + "metadata": {}, + "source": [ + "### Adding columns\n", + "\n", + "Adding columns is done through the `mutate` method. " + ] + }, + { + "cell_type": "code", + "execution_count": 16, + "id": "6f1f91bb-ea6c-4d0f-9344-1dae83d8e40e", + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
" + ], + "text/plain": [ + " one two three new_col\n", + "0 a 1 2 4\n", + "1 b 3 4 8" + ] + }, + "execution_count": 16, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "mutated = t.mutate(new_col=t.three * 2)\n", + "mutated" + ] + }, + { + "cell_type": "markdown", + "id": "9ffd0202-3d64-4b55-98d1-fa231b0e8e58", + "metadata": {}, + "source": [ + "Notice that the original table object remains unchanged. Only the `mutated` object that was returned\n", + "contains the new column." + ] + }, + { + "cell_type": "code", + "execution_count": 17, + "id": "84a0722d-2f84-4997-a4b6-ee44bb3dbcac", + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
" + ], + "text/plain": [ + " one two three\n", + "0 a 1 2\n", + "1 b 3 4" + ] + }, + "execution_count": 17, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "t" + ] + }, + { + "cell_type": "markdown", + "id": "6cc7fb9f-6720-454c-87e3-e0aad3baa3a5", + "metadata": {}, + "source": [ + "It is also possible to create a column in isolation. This is similar to a `Series` in pandas. \n", + "In this situation, the name of the column must be added using the `name` method." + ] + }, + { + "cell_type": "code", + "execution_count": 18, + "id": "cbb0190d-b9af-4d94-a395-254ef31854fe", + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
" + ], + "text/plain": [ + "0 4\n", + "1 8\n", + "Name: three, dtype: int64" + ] + }, + "execution_count": 18, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "new_col = (t.three * 2).name('new_col')\n", + "new_col" + ] + }, + { + "cell_type": "markdown", + "id": "46c142ba-34a2-4547-bd86-a7425bf1b0de", + "metadata": {}, + "source": [ + "You can then add this column to the table using a projection." + ] + }, + { + "cell_type": "code", + "execution_count": 19, + "id": "28dbf1e2-f6db-48d1-a61c-e174a8be79bf", + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
" + ], + "text/plain": [ + " one two new_col\n", + "0 a 1 4\n", + "1 b 3 8" + ] + }, + "execution_count": 19, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "proj = t['one', 'two', new_col]\n", + "proj" + ] + }, + { + "cell_type": "markdown", + "id": "a047475c-6a4f-4b77-b11a-b17790383ad6", + "metadata": {}, + "source": [ + "### Removing columns\n", + "\n", + "Removing a column is done using the `drop` method." + ] + }, + { + "cell_type": "code", + "execution_count": 20, + "id": "7abe3596-8eb7-480e-86f9-7303953c02ae", + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "['one', 'two', 'three']" + ] + }, + "execution_count": 20, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "t.columns" + ] + }, + { + "cell_type": "code", + "execution_count": 21, + "id": "e9ea485e-f433-4487-8c32-dac86fa19db2", + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "['one', 'three']" + ] + }, + "execution_count": 21, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "subset = t.drop('two')\n", + "subset.columns" + ] + }, + { + "cell_type": "markdown", + "id": "8f69886e-b250-4719-945a-46099504503f", + "metadata": {}, + "source": [ + "Multiple column names can also be given." + ] + }, + { + "cell_type": "code", + "execution_count": 22, + "id": "982ce8a1-3b2d-4c12-ba1b-4b02bcaca629", + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "['three']" + ] + }, + "execution_count": 22, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "subset = t.drop(['one', 'two'])\n", + "subset.columns" + ] + }, + { + "cell_type": "markdown", + "id": "0240415a-928a-4110-9cc9-3b3533c6de2b", + "metadata": {}, + "source": [ + "It is also possible to drop columns by selecting the columns you want to remain." 
+ ] + }, + { + "cell_type": "code", + "execution_count": 23, + "id": "b7d2b972-aa19-4647-8fb5-511062cfc8fd", + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "['two', 'three']" + ] + }, + "execution_count": 23, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "subset = t['two', 'three']\n", + "subset.columns" + ] + }, + { + "cell_type": "markdown", + "id": "cc6793c5-d2c7-4960-af3c-90c3829b76e8", + "metadata": {}, + "source": [ + "### Modifying columns\n", + "\n", + "Replacing existing columns is done using the `mutate` method just like adding columns. You simply\n", + "add a column of the same name to replace it." + ] + }, + { + "cell_type": "code", + "execution_count": 24, + "id": "5ba58de0-78a6-4c59-ab72-6a55355ef2cb", + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
" + ], + "text/plain": [ + " one two three\n", + "0 a 1 2\n", + "1 b 3 4" + ] + }, + "execution_count": 24, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "t" + ] + }, + { + "cell_type": "code", + "execution_count": 25, + "id": "90bdae99-f2d1-4d81-bbe6-bc3de62923a0", + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
" + ], + "text/plain": [ + " one two three\n", + "0 a 2 2\n", + "1 b 6 4" + ] + }, + "execution_count": 25, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "mutated = t.mutate(two=t.two * 2)\n", + "mutated" + ] + }, + { + "cell_type": "markdown", + "id": "665623b1-d038-4bff-9bd8-5368b36e5f57", + "metadata": {}, + "source": [ + "### Renaming columns\n", + "\n", + "In addition to replacing columns, you can simply rename them as well. This is done with the `relabel` method\n", + "which takes a dictionary containing the name mappings." + ] + }, + { + "cell_type": "code", + "execution_count": 26, + "id": "8d5b4242-fb10-4574-88c6-d341826b8f6c", + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
" + ], + "text/plain": [ + " a b three\n", + "0 a 1 2\n", + "1 b 3 4" + ] + }, + "execution_count": 26, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "relabeled = t.relabel(dict(\n", + " one='a',\n", + " two='b',\n", + "))\n", + "relabeled" + ] + }, + { + "cell_type": "markdown", + "id": "7db908d3-d18c-4f40-86ec-17e39f09eb1a", + "metadata": {}, + "source": [ + "## Selecting rows\n", + "\n", + "There are several methods that can be used to select rows of data in various ways. These are described\n", + "in the sections below. We'll use the ubiquitous iris dataset for these examples." + ] + }, + { + "cell_type": "code", + "execution_count": 27, + "id": "8ce004ee-7723-4e97-a4c5-dff42693d6a0", + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
" + ], + "text/plain": [ + " sepal_length sepal_width petal_length petal_width species\n", + "0 5.1 3.5 1.4 0.2 setosa\n", + "1 4.9 3.0 1.4 0.2 setosa\n", + "2 4.7 3.2 1.3 0.2 setosa\n", + "3 4.6 3.1 1.5 0.2 setosa\n", + "4 5.0 3.6 1.4 0.2 setosa" + ] + }, + "execution_count": 27, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "df = pd.read_csv('https://raw.githubusercontent.com/mwaskom/seaborn-data/master/iris.csv')\n", + "df.head()" + ] + }, + { + "cell_type": "markdown", + "id": "f53898f6-28d5-4791-9de5-f8d3e464757b", + "metadata": {}, + "source": [ + "Create an Ibis table from the `DataFrame` above." + ] + }, + { + "cell_type": "code", + "execution_count": 28, + "id": "3dd069d5-c2c6-4241-a648-76dcdf20cc37", + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
" + ], + "text/plain": [ + " sepal_length sepal_width petal_length petal_width species\n", + "0 5.1 3.5 1.4 0.2 setosa\n", + "1 4.9 3.0 1.4 0.2 setosa\n", + "2 4.7 3.2 1.3 0.2 setosa\n", + "3 4.6 3.1 1.5 0.2 setosa\n", + "4 5.0 3.6 1.4 0.2 setosa\n", + ".. ... ... ... ... ...\n", + "145 6.7 3.0 5.2 2.3 virginica\n", + "146 6.3 2.5 5.0 1.9 virginica\n", + "147 6.5 3.0 5.2 2.0 virginica\n", + "148 6.2 3.4 5.4 2.3 virginica\n", + "149 5.9 3.0 5.1 1.8 virginica\n", + "\n", + "[150 rows x 5 columns]" + ] + }, + "execution_count": 28, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "t = ibis.pandas.connect({'t': df}).table('t')\n", + "t" + ] + }, + { + "cell_type": "markdown", + "id": "889fd311-5a26-4eb0-a933-dae0082fcabc", + "metadata": {}, + "source": [ + "### Head, tail and limit\n", + "\n", + "The `head` method works the same way as in pandas. Note that some Ibis backends may not have an\n", + "inherent ordering of their rows, so `head` may not return deterministic results. In those\n", + "cases, you can sort the table before calling `head` to ensure a stable result." + ] + }, + { + "cell_type": "code", + "execution_count": 29, + "id": "c668f96b-db88-4407-92da-06867024988b", + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
" + ], + "text/plain": [ + " sepal_length sepal_width petal_length petal_width species\n", + "0 5.1 3.5 1.4 0.2 setosa\n", + "1 4.9 3.0 1.4 0.2 setosa\n", + "2 4.7 3.2 1.3 0.2 setosa\n", + "3 4.6 3.1 1.5 0.2 setosa\n", + "4 5.0 3.6 1.4 0.2 setosa" + ] + }, + "execution_count": 29, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "t.head(5)" + ] + }, + { + "cell_type": "markdown", + "id": "987d4d1c-ae5c-4c31-b989-7712564e29f7", + "metadata": {}, + "source": [ + "However, the tail method is not implemented since it is not supported in all databases.\n", + "It is possible to emulate the `tail` method if you use sorting in your table to do a \n", + "reverse sort then use the `head` method to retrieve the \"top\" rows." + ] + }, + { + "cell_type": "code", + "execution_count": 30, + "id": "684563af-1db7-464c-9d65-ddeb5e7cc854", + "metadata": {}, + "outputs": [ + { + "ename": "AttributeError", + "evalue": "tail", + "output_type": "error", + "traceback": [ + "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m", + "\u001b[0;31mAttributeError\u001b[0m Traceback (most recent call last)", + "\u001b[0;32m/tmp/ipykernel_281650/1485850447.py\u001b[0m in \u001b[0;36m\u001b[0;34m\u001b[0m\n\u001b[0;32m----> 1\u001b[0;31m \u001b[0mt\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mtail\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;36m1\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m", + "\u001b[0;32m~/.pyenv/versions/3.9.4/lib/python3.9/site-packages/ibis/expr/types/relations.py\u001b[0m in \u001b[0;36m__getattr__\u001b[0;34m(self, key)\u001b[0m\n\u001b[1;32m 162\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 163\u001b[0m \u001b[0;32mif\u001b[0m \u001b[0mkey\u001b[0m \u001b[0;32mnot\u001b[0m \u001b[0;32min\u001b[0m \u001b[0mschema\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m--> 164\u001b[0;31m \u001b[0;32mraise\u001b[0m 
\u001b[0mAttributeError\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mkey\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 165\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 166\u001b[0m \u001b[0;32mtry\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n", + "\u001b[0;31mAttributeError\u001b[0m: tail" + ] + } + ], + "source": [ + "t.tail(1)" + ] + }, + { + "cell_type": "markdown", + "id": "757db52c-f8cd-47cc-ab67-fa4a68998827", + "metadata": {}, + "source": [ + "Another way to limit the number of retrieved rows is using the `limit` method. The following will return\n", + "the same result as `head(5)`. This is often used in conjunction with other filtering techniques that we\n", + "will cover later." + ] + }, + { + "cell_type": "code", + "execution_count": 31, + "id": "92455eef-6f15-48b9-b49b-2d3f8b38c6a7", + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
" + ], + "text/plain": [ + " sepal_length sepal_width petal_length petal_width species\n", + "0 5.1 3.5 1.4 0.2 setosa\n", + "1 4.9 3.0 1.4 0.2 setosa\n", + "2 4.7 3.2 1.3 0.2 setosa\n", + "3 4.6 3.1 1.5 0.2 setosa\n", + "4 5.0 3.6 1.4 0.2 setosa" + ] + }, + "execution_count": 31, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "t.limit(5)" + ] + }, + { + "cell_type": "markdown", + "id": "9ead0537-6b29-4ac9-94b6-9097e3e157eb", + "metadata": {}, + "source": [ + "The starting position of the returned rows can be specified using the `offset` parameter." + ] + }, + { + "cell_type": "code", + "execution_count": 32, + "id": "8b1baecc-a88f-4415-9d64-9bc94cabdc20", + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
" + ], + "text/plain": [ + " sepal_length sepal_width petal_length petal_width species\n", + "0 5.0 3.6 1.4 0.2 setosa\n", + "1 5.4 3.9 1.7 0.4 setosa\n", + "2 4.6 3.4 1.4 0.3 setosa\n", + "3 5.0 3.4 1.5 0.2 setosa\n", + "4 4.4 2.9 1.4 0.2 setosa" + ] + }, + "execution_count": 32, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "t.limit(5, offset=4)" + ] + }, + { + "cell_type": "markdown", + "id": "e9f5a6b4-1288-4aea-8760-c4f0cdab9564", + "metadata": {}, + "source": [ + "### Filtering rows\n", + "\n", + "In addition to simply limiting the number of rows that are returned, it is possible to filter the\n", + "rows using expressions. Expressions are constructed very similarly to the way they are in pandas.\n", + "Ibis expressions are constructed from operations on columns in a table that return a boolean result.\n", + "This result is then used to filter the table." + ] + }, + { + "cell_type": "code", + "execution_count": 33, + "id": "ccf4e5da-06cd-4331-8ccf-897acc4405e2", + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
" + ], + "text/plain": [ + "0 False\n", + "1 False\n", + "2 False\n", + "3 False\n", + "4 False\n", + " ... \n", + "145 False\n", + "146 False\n", + "147 False\n", + "148 False\n", + "149 False\n", + "Name: sepal_width, Length: 150, dtype: bool" + ] + }, + "execution_count": 33, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "expr = t.sepal_width > 3.8\n", + "expr" + ] + }, + { + "cell_type": "markdown", + "id": "ec4f85db-ecea-43f1-a299-5313c56705b1", + "metadata": {}, + "source": [ + "We can evaluate the value counts to see how many rows we will expect to get back after filtering." + ] + }, + { + "cell_type": "code", + "execution_count": 34, + "id": "180b2d01-84a0-4042-9d2e-07cbdbe5c8b5", + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
unnamedcount
0False144
1True6
\n", + "
" + ], + "text/plain": [ + " unnamed count\n", + "0 False 144\n", + "1 True 6" + ] + }, + "execution_count": 34, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "expr.value_counts()" + ] + }, + { + "cell_type": "markdown", + "id": "573e9e1a-062b-4e10-81b3-56c6d1ae17e6", + "metadata": {}, + "source": [ + "Now we apply the filter to the table. Since there are 6 True values in the expression, we should\n", + "get 6 rows back." + ] + }, + { + "cell_type": "code", + "execution_count": 35, + "id": "63ff3e0d-2775-432a-8ecb-947d2c8cc0a5", + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
sepal_lengthsepal_widthpetal_lengthpetal_widthspecies
05.43.91.70.4setosa
15.84.01.20.2setosa
25.74.41.50.4setosa
35.43.91.30.4setosa
45.24.11.50.1setosa
55.54.21.40.2setosa
\n", + "
" + ], + "text/plain": [ + " sepal_length sepal_width petal_length petal_width species\n", + "0 5.4 3.9 1.7 0.4 setosa\n", + "1 5.8 4.0 1.2 0.2 setosa\n", + "2 5.7 4.4 1.5 0.4 setosa\n", + "3 5.4 3.9 1.3 0.4 setosa\n", + "4 5.2 4.1 1.5 0.1 setosa\n", + "5 5.5 4.2 1.4 0.2 setosa" + ] + }, + "execution_count": 35, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "filtered = t[expr]\n", + "filtered" + ] + }, + { + "cell_type": "markdown", + "id": "335f6010-d92f-4484-83d5-eadded3d7527", + "metadata": {}, + "source": [ + "Of course, the filtering expression can be applied inline as well." + ] + }, + { + "cell_type": "code", + "execution_count": 36, + "id": "240ca8c8-b8ba-4826-831e-e2390cb6fbbb", + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
sepal_lengthsepal_widthpetal_lengthpetal_widthspecies
05.43.91.70.4setosa
15.84.01.20.2setosa
25.74.41.50.4setosa
35.43.91.30.4setosa
45.24.11.50.1setosa
55.54.21.40.2setosa
\n", + "
" + ], + "text/plain": [ + " sepal_length sepal_width petal_length petal_width species\n", + "0 5.4 3.9 1.7 0.4 setosa\n", + "1 5.8 4.0 1.2 0.2 setosa\n", + "2 5.7 4.4 1.5 0.4 setosa\n", + "3 5.4 3.9 1.3 0.4 setosa\n", + "4 5.2 4.1 1.5 0.1 setosa\n", + "5 5.5 4.2 1.4 0.2 setosa" + ] + }, + "execution_count": 36, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "filtered = t[t.sepal_width > 3.8]\n", + "filtered" + ] + }, + { + "cell_type": "markdown", + "id": "8b475d75-e63b-4a5f-9af1-3eb5858a8ace", + "metadata": {}, + "source": [ + "Multiple filtering expressions can be combined into a single expression or chained onto existing\n", + "table expressions." + ] + }, + { + "cell_type": "code", + "execution_count": 37, + "id": "f971ed4e-c0de-4ebc-8be6-60eff433399f", + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
sepal_lengthsepal_widthpetal_lengthpetal_widthspecies
05.84.01.20.2setosa
15.74.41.50.4setosa
\n", + "
" + ], + "text/plain": [ + " sepal_length sepal_width petal_length petal_width species\n", + "0 5.8 4.0 1.2 0.2 setosa\n", + "1 5.7 4.4 1.5 0.4 setosa" + ] + }, + "execution_count": 37, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "filtered = t[(t.sepal_width > 3.8) & (t.sepal_length > 5.5)]\n", + "filtered" + ] + }, + { + "cell_type": "markdown", + "id": "63772a0e-8c89-4d9e-ae3b-a916d5b17d30", + "metadata": {}, + "source": [ + "The code above will return the same rows as the code below." + ] + }, + { + "cell_type": "code", + "execution_count": 38, + "id": "827c5756-a770-4106-9fb5-8eb98050aed9", + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
sepal_lengthsepal_widthpetal_lengthpetal_widthspecies
05.84.01.20.2setosa
15.74.41.50.4setosa
\n", + "
" + ], + "text/plain": [ + " sepal_length sepal_width petal_length petal_width species\n", + "0 5.8 4.0 1.2 0.2 setosa\n", + "1 5.7 4.4 1.5 0.4 setosa" + ] + }, + "execution_count": 38, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "filtered = t[t.sepal_width > 3.8][t.sepal_length > 5.5]\n", + "filtered" + ] + }, + { + "cell_type": "markdown", + "id": "7673373b-8cc3-4190-9207-fc5606e749af", + "metadata": {}, + "source": [ + "Aggregation has not been discussed yet, but aggregate values can be used in expressions\n", + "to return things such as all of the rows in a data set where the value in a column\n", + "is greater than the mean." + ] + }, + { + "cell_type": "code", + "execution_count": 39, + "id": "b33a1e2b-e297-46c5-a163-91ae8e62f7de", + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
sepal_lengthsepal_widthpetal_lengthpetal_widthspecies
05.13.51.40.2setosa
14.73.21.30.2setosa
24.63.11.50.2setosa
35.03.61.40.2setosa
45.43.91.70.4setosa
..................
626.73.15.62.4virginica
636.93.15.12.3virginica
646.83.25.92.3virginica
656.73.35.72.5virginica
666.23.45.42.3virginica
\n", + "

67 rows × 5 columns

\n", + "
" + ], + "text/plain": [ + " sepal_length sepal_width petal_length petal_width species\n", + "0 5.1 3.5 1.4 0.2 setosa\n", + "1 4.7 3.2 1.3 0.2 setosa\n", + "2 4.6 3.1 1.5 0.2 setosa\n", + "3 5.0 3.6 1.4 0.2 setosa\n", + "4 5.4 3.9 1.7 0.4 setosa\n", + ".. ... ... ... ... ...\n", + "62 6.7 3.1 5.6 2.4 virginica\n", + "63 6.9 3.1 5.1 2.3 virginica\n", + "64 6.8 3.2 5.9 2.3 virginica\n", + "65 6.7 3.3 5.7 2.5 virginica\n", + "66 6.2 3.4 5.4 2.3 virginica\n", + "\n", + "[67 rows x 5 columns]" + ] + }, + "execution_count": 39, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "filtered = t[t.sepal_width > t.sepal_width.mean()]\n", + "filtered" + ] + }, + { + "cell_type": "markdown", + "id": "7acc9ac2-3915-4ff4-8685-9b515ff8e882", + "metadata": {}, + "source": [ + "## Sorting rows\n", + "\n", + "Sorting rows in Ibis uses a somewhat different API than in pandas. In pandas, you would use the\n", + "`sort_values` method to order rows by values in specified columns. Ibis uses a method called\n", + "`sort_by`. To specify ascending or descending orders, pandas uses an `ascending=` argument\n", + "to `sort_values` that indicates the order for each sorting column. Ibis allows you to tag the\n", + "column name in the `sort_by` list as ascending or descending by wrapping it with `ibis.asc` or\n", + "`ibis.desc`.\n", + "\n", + "Here is an example of sorting a `DataFrame` using two sort keys. One key is sorting in ascending\n", + "order and the other is in descending order." + ] + }, + { + "cell_type": "code", + "execution_count": 40, + "id": "3fca2921-2f93-403e-aabd-ba941f374a02", + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
sepal_lengthsepal_widthpetal_lengthpetal_widthspecies
134.33.01.10.1setosa
424.43.21.30.2setosa
384.43.01.30.2setosa
84.42.91.40.2setosa
414.52.31.30.3setosa
\n", + "
" + ], + "text/plain": [ + " sepal_length sepal_width petal_length petal_width species\n", + "13 4.3 3.0 1.1 0.1 setosa\n", + "42 4.4 3.2 1.3 0.2 setosa\n", + "38 4.4 3.0 1.3 0.2 setosa\n", + "8 4.4 2.9 1.4 0.2 setosa\n", + "41 4.5 2.3 1.3 0.3 setosa" + ] + }, + "execution_count": 40, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "df.sort_values(['sepal_length', 'sepal_width'], ascending=[True, False]).head(5)" + ] + }, + { + "cell_type": "markdown", + "id": "da3b0a11-513a-4445-9a14-b05394fd2d2d", + "metadata": {}, + "source": [ + "The same operation in Ibis would look like the following. Note that the index values of the\n", + "resulting `DataFrame` start from zero and count up, whereas in the example above, they retain\n", + "their original index value. This is simply due to the fact that rows in tables don't necessarily\n", + "have a stable index in database backends, so the index is just generated on the result." + ] + }, + { + "cell_type": "code", + "execution_count": 41, + "id": "b984cc38-d235-431f-98bf-5cd3eaa401fd", + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
sepal_lengthsepal_widthpetal_lengthpetal_widthspecies
04.33.01.10.1setosa
14.43.21.30.2setosa
24.43.01.30.2setosa
34.42.91.40.2setosa
44.52.31.30.3setosa
\n", + "
" + ], + "text/plain": [ + " sepal_length sepal_width petal_length petal_width species\n", + "0 4.3 3.0 1.1 0.1 setosa\n", + "1 4.4 3.2 1.3 0.2 setosa\n", + "2 4.4 3.0 1.3 0.2 setosa\n", + "3 4.4 2.9 1.4 0.2 setosa\n", + "4 4.5 2.3 1.3 0.3 setosa" + ] + }, + "execution_count": 41, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "sorted = t.sort_by(['sepal_length', ibis.desc('sepal_width')]).head(5)\n", + "sorted" + ] + }, + { + "cell_type": "markdown", + "id": "ad5157cd-e42d-4349-9a6c-f293b3cab1b4", + "metadata": {}, + "source": [ + "## Aggregation\n", + "\n", + "Aggregation in pandas is typically done by computing columns based on an aggregate function." + ] + }, + { + "cell_type": "code", + "execution_count": 42, + "id": "fbd2705f-1fbb-4ba4-9273-ed6e9e89ec9c", + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
total_sepal_widthavg.sepal_length
0458.65.843333
\n", + "
" + ], + "text/plain": [ + " total_sepal_width avg.sepal_length\n", + "0 458.6 5.843333" + ] + }, + "execution_count": 42, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "stats = [df.sepal_width.sum(), df.sepal_length.mean()]\n", + "pd.DataFrame([stats], columns=['total_sepal_width', 'avg.sepal_length'])" + ] + }, + { + "cell_type": "markdown", + "id": "1c59c0d6-823b-4c2a-85f9-732a0468ef2b", + "metadata": {}, + "source": [ + "In Ibis, you construct aggregate expressions then apply them to the table using the `aggregate` method." + ] + }, + { + "cell_type": "code", + "execution_count": 43, + "id": "04090b90-3e2b-4cfe-ac57-52a652e45609", + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
total_sepal_widthavg_sepal_length
0458.65.843333
\n", + "
" + ], + "text/plain": [ + " total_sepal_width avg_sepal_length\n", + "0 458.6 5.843333" + ] + }, + "execution_count": 43, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "stats = [t.sepal_width.sum().name('total_sepal_width'), t.sepal_length.mean().name('avg_sepal_length')]\n", + "agged = t.aggregate(stats)\n", + "agged" + ] + }, + { + "cell_type": "markdown", + "id": "9670ba4d-f870-4b55-b948-c57a4933bf07", + "metadata": {}, + "source": [ + "You can also combine both operations into one and pass the aggregate expressions using keyword parameters." + ] + }, + { + "cell_type": "code", + "execution_count": 44, + "id": "ac9dd0ba-fbe9-4a5f-b4d6-ab2b127e5a89", + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
total_sepal_widthavg_sepal_length
0458.65.843333
\n", + "
" + ], + "text/plain": [ + " total_sepal_width avg_sepal_length\n", + "0 458.6 5.843333" + ] + }, + "execution_count": 44, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "agged = t.aggregate(\n", + " total_sepal_width=t.sepal_width.sum(),\n", + " avg_sepal_length=t.sepal_length.mean(),\n", + ")\n", + "agged" + ] + }, + { + "cell_type": "markdown", + "id": "7338f6a8-9be9-41a5-bb73-2786d67c1f91", + "metadata": {}, + "source": [ + "### Group by\n", + "\n", + "Aggregations can also be done across groupings using the `by=` parameter." + ] + }, + { + "cell_type": "code", + "execution_count": 45, + "id": "3e773759-77ca-4eb5-a858-81d489399059", + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
speciestotal_sepal_widthavg_sepal_length
0setosa171.45.006
1versicolor138.55.936
2virginica148.76.588
\n", + "
" + ], + "text/plain": [ + " species total_sepal_width avg_sepal_length\n", + "0 setosa 171.4 5.006\n", + "1 versicolor 138.5 5.936\n", + "2 virginica 148.7 6.588" + ] + }, + "execution_count": 45, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "agged = t.aggregate(\n", + " by='species',\n", + " total_sepal_width=t.sepal_width.sum(),\n", + " avg_sepal_length=t.sepal_length.mean(),\n", + ")\n", + "agged" + ] + }, + { + "cell_type": "markdown", + "id": "78035eba-5c94-4e67-9d56-58fa36dcd657", + "metadata": {}, + "source": [ + "Alternatively, by groups can be computed using a grouped table." + ] + }, + { + "cell_type": "code", + "execution_count": 46, + "id": "ca4ed3a8-d788-40f1-ae8e-52f691eaeb40", + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
speciestotal_sepal_widthavg_sepal_length
0setosa171.45.006
1versicolor138.55.936
2virginica148.76.588
\n", + "
" + ], + "text/plain": [ + " species total_sepal_width avg_sepal_length\n", + "0 setosa 171.4 5.006\n", + "1 versicolor 138.5 5.936\n", + "2 virginica 148.7 6.588" + ] + }, + "execution_count": 46, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "agged = t.group_by('species').aggregate(\n", + " total_sepal_width=t.sepal_width.sum(),\n", + " avg_sepal_length=t.sepal_length.mean(),\n", + ")\n", + "agged" + ] + }, + { + "cell_type": "markdown", + "id": "ecc585d8-d6b4-40bf-bb77-2a35f2998c8d", + "metadata": {}, + "source": [ + "## Dropping rows with `NULL`s\n", + "\n", + "Both pandas and Ibis allow you to drop rows from a table based on whether a set of columns\n", + "contains a `NULL` value. This method is called `dropna` in both packages. The common set\n", + "of parameters in the two are `subset=` and `how=`. The `subset=` parameter indicates which\n", + "columns to inspect for `NULL` values. The `how=` parameter specifies whether 'any' or 'all'\n", + "of the specified columns must be `NULL` in order for the row to be dropped." + ] + }, + { + "cell_type": "code", + "execution_count": 47, + "id": "170fd8e7-4c2d-4981-8969-e064b2d8b176", + "metadata": {}, + "outputs": [], + "source": [ + "no_null_t = t.dropna(['sepal_width', 'sepal_length'], how='any')" + ] + }, + { + "cell_type": "markdown", + "id": "ff5a2ba3-0164-47ef-b687-47c6dd19903c", + "metadata": {}, + "source": [ + "## Filling `NULL` values\n", + "\n", + "Both pandas and Ibis allow you to fill `NULL` values in a table. In Ibis, the replacement value can only\n", + "be a scalar value of a dictionary of values. If it is a dictionary, the keys of the dictionary specify\n", + "the column name for the value to apply to." 
+ ] + }, + { + "cell_type": "code", + "execution_count": 48, + "id": "a5eb4117-c843-401d-8ca6-3b29c00f62da", + "metadata": {}, + "outputs": [], + "source": [ + "no_null_t = t.fillna(dict(sepal_width=0, sepal_length=0))" + ] + }, + { + "cell_type": "markdown", + "id": "ea5f0377-5c1f-4b3d-b0ef-6f1e92533171", + "metadata": {}, + "source": [ + "## Common column expressions\n", + "\n", + "See the full API documentation for all of the available value methods and tools for creating value expressions. We mention a few common ones here as they relate to common pandas operations." + ] + }, + { + "cell_type": "markdown", + "id": "e37b0ef6-9653-46a3-ab30-cffd079f081d", + "metadata": {}, + "source": [ + "### Type casts\n", + "\n", + "Type casting in pandas is done using the `astype` method on columns." + ] + }, + { + "cell_type": "code", + "execution_count": 49, + "id": "dc002f3f-efaf-43e5-9f1f-e3a0bf683baf", + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "0 3.5\n", + "1 3.0\n", + "2 3.2\n", + "3 3.1\n", + "4 3.6\n", + " ... \n", + "145 3.0\n", + "146 2.5\n", + "147 3.0\n", + "148 3.4\n", + "149 3.0\n", + "Name: sepal_width, Length: 150, dtype: object" + ] + }, + "execution_count": 49, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "df.sepal_width.astype(str)" + ] + }, + { + "cell_type": "markdown", + "id": "3a790e6c-ae3b-4166-b451-4826b3290c61", + "metadata": {}, + "source": [ + "In Ibis, you cast the column type using the `cast` method." + ] + }, + { + "cell_type": "code", + "execution_count": 50, + "id": "4537f3eb-7b94-4605-b409-8926e16608de", + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
sepal_width
03
13
23
33
43
......
1453
1462
1473
1483
1493
\n", + "

150 rows × 1 columns

\n", + "
" + ], + "text/plain": [ + "0 3\n", + "1 3\n", + "2 3\n", + "3 3\n", + "4 3\n", + " ..\n", + "145 3\n", + "146 2\n", + "147 3\n", + "148 3\n", + "149 3\n", + "Name: sepal_width, Length: 150, dtype: int64" + ] + }, + "execution_count": 50, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "t.sepal_width.cast('int')" + ] + }, + { + "cell_type": "markdown", + "id": "1b94fbb6-f1d3-4f84-afcb-ffcb0e8103f5", + "metadata": {}, + "source": [ + "Casted columns can be assigned back to the table using the `mutate` method described earlier." + ] + }, + { + "cell_type": "code", + "execution_count": 51, + "id": "68dcb03c-6b97-4278-bb8a-13dda3da70dc", + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "ibis.Schema {\n", + " sepal_length int64\n", + " sepal_width int64\n", + " petal_length float64\n", + " petal_width float64\n", + " species string\n", + "}" + ] + }, + "execution_count": 51, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "casted = t.mutate(\n", + " sepal_width=t.sepal_width.cast('int'),\n", + " sepal_length=t.sepal_length.cast('int'),\n", + ")\n", + "casted.schema()" + ] + }, + { + "cell_type": "markdown", + "id": "7768f8f1-44a7-44e6-a433-c33eff2b57cb", + "metadata": {}, + "source": [ + "### Replacing `NULL`s\n", + "\n", + "Both pandas and Ibis have `fillna` methods which allow you to specify a replacement value\n", + "for `NULL` values." + ] + }, + { + "cell_type": "code", + "execution_count": 52, + "id": "1a280ac1-0639-4757-9941-fad657ee04a9", + "metadata": {}, + "outputs": [], + "source": [ + "sepal_length_no_nulls = t.sepal_length.fillna(0)" + ] + }, + { + "cell_type": "markdown", + "id": "090d19c5-b961-49ad-a4e9-ba490521e785", + "metadata": {}, + "source": [ + "### Set membership\n", + "\n", + "pandas set membership uses the `in` and `not in` operators such as `'a' in df.species`. Ibis uses\n", + "`isin` and `notin` methods. 
Each returns a boolean column expression, which can be used to filter a table or\n", + "summarized with `value_counts`, as in the cells below." + ] + }, + { + "cell_type": "code", + "execution_count": 53, + "id": "58f0795e-43d9-4497-b484-4f74f60b3d3a", + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
speciescount
0setosa50
1versicolor50
2virginica50
\n", + "
" + ], + "text/plain": [ + " species count\n", + "0 setosa 50\n", + "1 versicolor 50\n", + "2 virginica 50" + ] + }, + "execution_count": 53, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "t.species.value_counts()" + ] + }, + { + "cell_type": "code", + "execution_count": 54, + "id": "a942ae68-8cc6-42b6-9676-2e2c5bbc6633", + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
unnamedcount
0False50
1True100
\n", + "
" + ], + "text/plain": [ + " unnamed count\n", + "0 False 50\n", + "1 True 100" + ] + }, + "execution_count": 54, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "refined = t.species.isin(['versicolor', 'virginica'])\n", + "refined.value_counts()" + ] + }, + { + "cell_type": "markdown", + "id": "d2488ae0-933d-43fa-853f-b39359f4c2f8", + "metadata": {}, + "source": [ + "## Merging tables\n", + "\n", + "While pandas uses the `merge` method to combine data from multiple `DataFrames`, Ibis uses the\n", + "`join` method. They both have similar capabilities. The signature for the `join` method in Ibis\n", + "is: `join(right, predicates=(), how='inner', *, suffixes=('_x', '_y'))`. The `merge` method on\n", + "pandas' `DataFrame` allows many more parameters, but the signature with the corresponding\n", + "parameters would be: `join(right, on=(), how='inner', *, suffixes=('_x', '_y'))`. The valid values\n", + "of the `how=` parameter will vary depending on the backend, but common values are 'inner', 'outer',\n", + "'left', and 'right'.\n", + "\n", + "The biggest difference between Ibis' `join` method and pandas' `merge` method is that pandas only\n", + "accepts column names or index levels to join on, whereas Ibis can merge on expressions.\n", + "\n", + "Here are some examples of merging using pandas." + ] + }, + { + "cell_type": "code", + "execution_count": 55, + "id": "1d244188-0e5d-441d-b9ed-e6bed6ebf287", + "metadata": {}, + "outputs": [], + "source": [ + "df_left = pd.DataFrame([\n", + " ['a', 1, 2],\n", + " ['b', 3, 4],\n", + " ['c', 4, 6],\n", + "], columns=['name', 'x', 'y'])\n", + "\n", + "df_right = pd.DataFrame([\n", + " ['a', 100, 200],\n", + " ['m', 300, 400],\n", + " ['n', 400, 600],\n", + "], columns=['name', 'x_100', 'y_100'])" + ] + }, + { + "cell_type": "code", + "execution_count": 56, + "id": "25e4c74f-178b-4c77-877d-2057e91f393a", + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
namexyx_100y_100
0a12100200
\n", + "
" + ], + "text/plain": [ + " name x y x_100 y_100\n", + "0 a 1 2 100 200" + ] + }, + "execution_count": 56, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "df_left.merge(df_right, on='name')" + ] + }, + { + "cell_type": "code", + "execution_count": 57, + "id": "f28d2c44-1ec6-4e38-af40-527936db6a6d", + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
namexyx_100y_100
0a1.02.0100.0200.0
1b3.04.0NaNNaN
2c4.06.0NaNNaN
3mNaNNaN300.0400.0
4nNaNNaN400.0600.0
\n", + "
" + ], + "text/plain": [ + " name x y x_100 y_100\n", + "0 a 1.0 2.0 100.0 200.0\n", + "1 b 3.0 4.0 NaN NaN\n", + "2 c 4.0 6.0 NaN NaN\n", + "3 m NaN NaN 300.0 400.0\n", + "4 n NaN NaN 400.0 600.0" + ] + }, + "execution_count": 57, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "df_left.merge(df_right, on='name', how='outer')" + ] + }, + { + "cell_type": "markdown", + "id": "61fe995c-9414-47d4-963d-a09b146f8106", + "metadata": {}, + "source": [ + "We can now convert `DataFrames` to Ibis tables to do `join`s." + ] + }, + { + "cell_type": "code", + "execution_count": 58, + "id": "e1cbaa65-7c76-4e2f-bf29-dff02c05dc25", + "metadata": {}, + "outputs": [], + "source": [ + "pd_ibis = ibis.pandas.connect({'t_left': df_left, 't_right': df_right})\n", + "t_left = pd_ibis.table('t_left')\n", + "t_right = pd_ibis.table('t_right')" + ] + }, + { + "cell_type": "code", + "execution_count": 59, + "id": "45cc37a6-601f-48e5-a266-39380c165bff", + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
name_xxyname_yx_100y_100
0a12a100200
\n", + "
" + ], + "text/plain": [ + " name_x x y name_y x_100 y_100\n", + "0 a 1 2 a 100 200" + ] + }, + "execution_count": 59, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "t_left.join(t_right, t_left.name == t_right.name)" + ] + }, + { + "cell_type": "markdown", + "id": "4597efd7-5bcf-4c78-9fa9-f60cbd893ea4", + "metadata": {}, + "source": [ + "You may notice that in Ibis joins, even if the predicate is an equality expression and both tables\n", + "have the same column name, you will still get multiple output columns with suffixes added.\n", + "This may change in a future version to match the pandas behavior.\n", + "\n", + "Below is an outer join where missing values are filled with `NaN`." + ] + }, + { + "cell_type": "code", + "execution_count": 60, + "id": "6142e659-d60b-4f0c-8d37-7ac53d4a7a5b", + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
name_xxyname_yx_100y_100
0a1.02.0a100.0200.0
1b3.04.0bNaNNaN
2c4.06.0cNaNNaN
3mNaNNaNm300.0400.0
4nNaNNaNn400.0600.0
\n", + "
" + ], + "text/plain": [ + " name_x x y name_y x_100 y_100\n", + "0 a 1.0 2.0 a 100.0 200.0\n", + "1 b 3.0 4.0 b NaN NaN\n", + "2 c 4.0 6.0 c NaN NaN\n", + "3 m NaN NaN m 300.0 400.0\n", + "4 n NaN NaN n 400.0 600.0" + ] + }, + "execution_count": 60, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "t_left.join(t_right, t_left.name == t_right.name, how='outer')" + ] + }, + { + "cell_type": "markdown", + "id": "9cce73d2-9d49-48fe-bbe6-8f06982c4f8a", + "metadata": {}, + "source": [ + "## Concatenating tables\n", + "\n", + "Concatenating `DataFrame`s in pandas is done with the `concat` top-level function. It takes multiple `DataFrames`\n", + "and concatenates the rows of one `DataFrame` to the next. If the columns are mis-matched, it extends the\n", + "list of columns to include the full set of columns and inserts `NaN`s and `None`s into the missing values.\n", + "\n", + "Concatenating tables in Ibis can only be done on tables with matching schemas. The concatenation is done\n", + "using the top-level `union` function or the `union` method on a table.\n", + "\n", + "We'll demonstrate a pandas `concat` first." + ] + }, + { + "cell_type": "code", + "execution_count": 61, + "id": "0aa28ac6-bf75-4b47-8986-0ecb1e7ff28e", + "metadata": {}, + "outputs": [], + "source": [ + "df_1 = pd.DataFrame([\n", + " ['a', 1, 2],\n", + " ['b', 3, 4],\n", + " ['c', 4, 6],\n", + "], columns=['name', 'x', 'y'])\n", + "\n", + "df_2 = pd.DataFrame([\n", + " ['a', 100, 200],\n", + " ['m', 300, 400],\n", + " ['n', 400, 600],\n", + "], columns=['name', 'x', 'y'])" + ] + }, + { + "cell_type": "code", + "execution_count": 62, + "id": "19482224-2a86-4925-be2a-c5412f3d488d", + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
<div>\n",
+ "<table border=\"1\" class=\"dataframe\">\n",
+ "  <thead>\n",
+ "    <tr style=\"text-align: right;\"><th></th><th>name</th><th>x</th><th>y</th></tr>\n",
+ "  </thead>\n",
+ "  <tbody>\n",
+ "    <tr><th>0</th><td>a</td><td>1</td><td>2</td></tr>\n",
+ "    <tr><th>1</th><td>b</td><td>3</td><td>4</td></tr>\n",
+ "    <tr><th>2</th><td>c</td><td>4</td><td>6</td></tr>\n",
+ "    <tr><th>0</th><td>a</td><td>100</td><td>200</td></tr>\n",
+ "    <tr><th>1</th><td>m</td><td>300</td><td>400</td></tr>\n",
+ "    <tr><th>2</th><td>n</td><td>400</td><td>600</td></tr>\n",
+ "  </tbody>\n",
+ "</table>\n",
+ "</div>
" + ], + "text/plain": [ + " name x y\n", + "0 a 1 2\n", + "1 b 3 4\n", + "2 c 4 6\n", + "0 a 100 200\n", + "1 m 300 400\n", + "2 n 400 600" + ] + }, + "execution_count": 62, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "pd.concat([df_1, df_2])" + ] + }, + { + "cell_type": "markdown", + "id": "20ad6803-4cbd-4fb8-87c6-c5a05092f12f", + "metadata": {}, + "source": [ + "Now we can convert the `DataFrame`s to Ibis tables and combine the tables using a union." + ] + }, + { + "cell_type": "code", + "execution_count": 63, + "id": "efbbb4b4-ef7e-4e8a-a8e2-359ffc70db58", + "metadata": {}, + "outputs": [], + "source": [ + "pd_ibis = ibis.pandas.connect({'t_1': df_1, 't_2': df_2})\n", + "t_1 = pd_ibis.table('t_1')\n", + "t_2 = pd_ibis.table('t_2')" + ] + }, + { + "cell_type": "code", + "execution_count": 64, + "id": "56f6ab13-be06-4779-8289-dd838bde2ece", + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
<div>\n",
+ "<table border=\"1\" class=\"dataframe\">\n",
+ "  <thead>\n",
+ "    <tr style=\"text-align: right;\"><th></th><th>name</th><th>x</th><th>y</th></tr>\n",
+ "  </thead>\n",
+ "  <tbody>\n",
+ "    <tr><th>0</th><td>a</td><td>1</td><td>2</td></tr>\n",
+ "    <tr><th>1</th><td>b</td><td>3</td><td>4</td></tr>\n",
+ "    <tr><th>2</th><td>c</td><td>4</td><td>6</td></tr>\n",
+ "    <tr><th>3</th><td>a</td><td>100</td><td>200</td></tr>\n",
+ "    <tr><th>4</th><td>m</td><td>300</td><td>400</td></tr>\n",
+ "    <tr><th>5</th><td>n</td><td>400</td><td>600</td></tr>\n",
+ "  </tbody>\n",
+ "</table>\n",
+ "</div>
" + ], + "text/plain": [ + " name x y\n", + "0 a 1 2\n", + "1 b 3 4\n", + "2 c 4 6\n", + "3 a 100 200\n", + "4 m 300 400\n", + "5 n 400 600" + ] + }, + "execution_count": 64, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "unioned = ibis.union(t_1, t_2)\n", + "unioned" + ] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3 (ipykernel)", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.9.4" + } + }, + "nbformat": 4, + "nbformat_minor": 5 +}