SciPy 2024 - Ibis Tutorial #22

@gforsyth

Title

Intro to Ibis: blazing fast analytics with DuckDB, Polars, Snowflake, and more, from the comfort of your Python repl.

Abstract

Tabular data is ubiquitous, and pandas has been the de facto tool in Python for analyzing it. However, as data size scales, analysis using pandas may become untenable. Luckily, modern analytical databases (like DuckDB) can analyze this same tabular data, often orders of magnitude faster than pandas and with less memory. Many of these systems provide only a SQL interface, though, which is far different from pandas’ dataframe interface and requires a rewrite of your analysis code.

This is where Ibis comes in. Ibis is a pure-Python open-source library that provides a dataframe interface to many popular databases and analytics tools (DuckDB, Polars, Snowflake, Spark, etc.). This lets users analyze data using the same consistent API, regardless of which backend they’re using, and without ever having to learn SQL. No more painful rewrites of pandas code when you run into performance issues; write your code once using Ibis and run it on any supported backend.

Description

This tutorial is open to all. If you

  • have ever been thwarted by SQL or data stored somewhere, or
  • have ever been stuck trying to translate a pandas POC to PySpark for "production", or
  • are interested in how to write blazing fast analytics code that uses all of the cores on your laptop without running into memory limits (and without writing any SQL)

then this tutorial is for you!

We’ll cover:

  • The basic operations of Ibis (select, filter, group_by, order_by, join, and aggregate), and how these operations may be composed to form more complicated queries.
  • How Ibis may be used on a number of different local and remote backend engines to execute the same queries on different systems.
  • How to quickly compare performance between different backends without changing your code.
  • How Ibis integrates into the larger Python data ecosystem, including tools like Scikit-Learn, Matplotlib, PyArrow, pandas, Shapely, Altair, hvPlot, and VegaFusion.

Notes

Prior teaching experience

Gil Forsyth

I'm an experienced instructor, having led tutorials at several PyData conferences, PyCon, and SciPy. I also ran internal training on Python and distributed data analysis at Capital One for several years.
I am one of the core maintainers of Ibis.

Naty Clementi

Naty Clementi is an experienced educator. She has presented tutorials at meetups and conferences such as SciPy, PyData NYC, and Women Who Code DC. Her latest presentations include an advanced tutorial on Dask at SciPy 2023 (recording: youtube.com/watch?v=ZMwpK6KVj3o) and recurring online Dask tutorials until July 2023 (recordings at youtube.com/watch?v=32w33L7hseQ and youtube.com/watch?v=8bd7DswSxw4). In addition, she has taught multiple (unrecorded) Python courses to graduate and undergraduate students at the George Washington University.

Recent Tutorials:

  • Advanced Dask Tutorial (SciPy 2023): https://www.youtube.com/watch?v=ZMwpK6KVj3o
  • Dask Futures Tutorial (recurring online until 07/23): recordings at https://www.youtube.com/watch?v=32w33L7hseQ
  • Dask Dataframes Tutorial (recurring online until 07/23): https://www.youtube.com/watch?v=8bd7DswSxw4
  • How to contribute to open source (Women Who Code DC - 2022): https://www.youtube.com/watch?v=eAIYrnguV8c
  • Intro to Dask (Women Who Code DC - 2021): https://www.youtube.com/watch?v=ZKw7PdoS7YA

Recent Talks:

  • Open Source meets Enterprise: The right way (PyData Seattle 2023): https://www.youtube.com/watch?v=hSyLEuNrU5Y

Jim Crist-Harif

Jim is also a core maintainer of Ibis, as well as one of the original contributors to Dask and a long-time maintainer of it. He has presented many talks and tutorials; links are available on his website: https://jcristharif.com/talks.html

Phillip Cloud

I have been speaking in public in the software industry since around 2015, all of which has been about Python and analytics.

In a past life I taught undergraduate statistics to psychology students, as well as experimental design.

In addition to working full time on Ibis, I've given a large number of talks on it, nearly all of them public. Here are a few places you can see what I've done:

• Ibis @ Trino Fest: https://youtu.be/JMUtPl-cMRc
• Live stream of an introduction to Ibis: https://www.youtube.com/live/rMeDeSNY8yI
• EuroSciPy 2023: https://youtu.be/-p6SRufakjI
• My YouTube channel with a large variety of Ibis content: https://www.youtube.com/@cpcloud

Prerequisites

This is a hands-on tutorial, with numerous examples to get your hands dirty. Participants should ideally have some experience using Python and pandas, but no SQL experience is necessary.

Outline

0:00 - Intro and Setup “Going beyond pandas”

Get attendees up and running in a GitHub Codespace or on their laptops. A bit of motivation about the kinds of problems where Ibis can help, and a general survey of attendees to find out what their existing pain points and experiences are.

0:15 - Introduction to Ibis basics

A follow-along notebook introducing the basic verbs of Ibis data analysis (select, filter, group_by, order_by, and aggregate), with hands-on exercises throughout.

1:00 - Coffee Break (5 minutes that definitely takes 10 minutes)

1:10 - In-memory tables, joins, and data analysis

Building on the previous notebook, we'll explore how to join in-memory data (from a pandas DataFrame, Python dictionary, or PyArrow Table) with existing tables in a local database and continue analysis on the join result.

We'll touch on the Ibis deferred operator for specifying predicates in chained joins, and demonstrate read_parquet and other read_* methods for loading local data into existing databases.

Then we'll continue with a series of hands-on exercises, building up an analysis pipeline for some IMDB ratings data, but only operating on a 5% sample of the original dataset.

Afterwards, we show how the same expression can be computed on the full dataset without any code changes, either locally or by bursting to a cloud (or other hosted) database.

2:00 - Coffee Break (5-10 minutes) + Q&A in the room

2:10 - Selectors

Continuing on from joins, we'll introduce selectors as a means of quickly renaming and cleaning datasets, a powerful feature inspired by dplyr.

2:30 - UDFs and SQL passthrough

Demonstrate using UDFs to add custom operations.

Explain and demonstrate various "escape hatches" when you really need to use SQL directly.

3:00 - Coffee Break (5-10 minutes) + Q&A in the room

3:10 - PyPI data exploration and integration with the broader Python ecosystem

Demonstrate projection pushdown and column pruning when operating on remote datasets. Explore questions about PyPI maintainers, search for typo-squatters, and try to find explanations for outliers using data from https://py-code.org/datasets.

Feed Ibis expressions into common plotting tools to look for outliers and demonstrate interoperability.

(Note: depending on conference wifi, even with column pruning and parquet files this may be untenable. We have backup exercises that perform the same analysis but make use of either the ClickHouse Playground or a sponsored Snowflake account, so only basic internet connectivity will be required. Bonus: using Ibis means shifting to these backup options is a one-line operation!)

3:30 - Intro to geospatial workflows

Introduction to using Ibis geospatial with supported backends. Short hands-on exercises using NYC subway and bike-share data to try to find the fastest way to get a pizza from one borough to another.

3:50 - Wrap-up

Additional Information

We have given versions of this tutorial (although shorter) at EuroSciPy 2023 and PyData NYC 2023.

EuroSciPy 2023: https://youtu.be/tkejUD5Uq40
PyData NYC 2023: https://youtu.be/TyopbrmlZx8

The content is significantly expanded from previous offerings and is structured for an approximately 3.5 to 4 hour block.

We are looking at swapping out the PyPI data exploration exercise for a set of problems more applicable to the SciPy audience and are currently vetting available datasets. The purpose of those exercises is to bring together all of the various methodologies covered in the previous sections and to demonstrate more realistic end-to-end data analysis problems. Our goal is that even if the particular problem set isn't a perfect match with an attendee's field of study, the lessons learned will be easily transferable to other data domains.

Pre-existing material and continued tutorial development are all happening at:
https://github.com/ibis-project/ibis-tutorial
