# Welcome to the Kedro-Ibis tutorial!

## Outline

- Introduction
  - Who we are
  - Workshop material
  - Setup
  - Motivation
- Expressive analytics at any scale: Introduction to Ibis
- From prototype to production: Introduction to Kedro
- Conclusion

These are the notebooks for the tutorial: 👇

1. [Getting Started with Ibis](./01%20-%20Getting%20Started%20with%20Ibis.ipynb)
2. [Switching Backends](./02%20-%20Switching%20Backends.ipynb)
3. [First steps with Kedro](./03%20-%20First%20steps%20with%20Kedro.ipynb)

## Introduction

### Who we are

|  |  |
|--------|------|
| ![Deepyaman](static/deepyaman.jpg) | **Deepyaman Datta**<br><br>Deepyaman is a software engineer at Voltron Data. Before Voltron Data acquired Claypot AI, he was a Founding Machine Learning Engineer there, working on its real-time feature engineering platform. Prior to that, he led data engineering teams and asset development across a range of industries at QuantumBlack, AI by McKinsey. |
| ![Juan Luis](static/juanluis.png) | **Juan Luis Cano Rodríguez**<br><br>Juan Luis (he/him/él) is an Aerospace Engineer with a passion for tech communities, outreach, and sustainability. He works at QuantumBlack, AI by McKinsey, as Product Manager for Kedro, an opinionated Python framework for creating reproducible, maintainable, and modular data science code. He has worked as a Developer Advocate at Read the Docs, as a software engineer in the space, consulting, and banking industries, and as a Python trainer for several private and public entities. |

### Workshop material

**https://github.com/ibis-project/kedro-ibis-tutorial**

![QR Code](static/qr.png)

_Note: This is a lot of material for a 90-minute tutorial. We'll move quickly and won't go into much depth, but we're happy to answer questions afterward._

1. Open the URL above.
2. Click 🟩 "Create codespace on main".
3. Open the `00 - Welcome.ipynb` notebook and follow the instructions.

<img src="static/codespaces.png" width="400" alt="Codespaces">

### Setup

Let's start by downloading the [nycflights13 data](https://github.com/hadley/nycflights13); we'll use this dataset throughout the tutorial.

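```python
import ibis

# Create (or open) a local DuckDB database file.
con = ibis.connect("duckdb://nycflights13.ddb")

# Fetch the example tables and write them to DuckDB, replacing any existing copies.
con.create_table(
    "flights", ibis.examples.nycflights13_flights.fetch().to_pyarrow(), overwrite=True
)
con.create_table(
    "weather", ibis.examples.nycflights13_weather.fetch().to_pyarrow(), overwrite=True
)

# Release the file so other processes (such as the `duckdb` CLI) can open it.
con.disconnect()
```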

Next, we'll load the data into a local PostgreSQL database using DuckDB—[yes, you can do that](https://duckdb.org/docs/extensions/postgres.html#writing-data-to-postgres)!

```python
!psql < sql/create_nycflights13.sql
```

```python
!duckdb nycflights13.ddb < sql/load_nycflights13.sql
```
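
The contents of `sql/load_nycflights13.sql` are not reproduced here. As a rough sketch of the technique, and not necessarily what that script does, DuckDB's `postgres` extension can attach a PostgreSQL database and write tables into it. The helper names, the `pg` alias, and the `dbname=nycflights13` connection string below are hypothetical.

```python
def build_load_statements(tables, pg_dsn="dbname=nycflights13", alias="pg"):
    """Return DuckDB SQL that copies local tables into an attached PostgreSQL database.

    `pg_dsn` and `alias` are illustrative assumptions, not values from the tutorial.
    The CREATE TABLE statements assume the target tables do not exist yet.
    """
    statements = [
        "INSTALL postgres",
        "LOAD postgres",
        f"ATTACH '{pg_dsn}' AS {alias} (TYPE postgres)",
    ]
    for table in tables:
        # Read from the local DuckDB table, write to the attached PostgreSQL database.
        statements.append(f"CREATE TABLE {alias}.{table} AS SELECT * FROM {table}")
    return statements


def load_into_postgres(duckdb_path, tables, pg_dsn="dbname=nycflights13"):
    """Run the generated statements; needs DuckDB and a reachable PostgreSQL server."""
    import duckdb  # imported lazily so the statement builder works without DuckDB

    con = duckdb.connect(duckdb_path)
    try:
        for statement in build_load_statements(tables, pg_dsn):
            con.execute(statement)
    finally:
        con.close()


# Only build and print the SQL here; nothing is executed.
statements = build_load_statements(["flights", "weather"])
print(";\n".join(statements) + ";")
```

Calling `load_into_postgres("nycflights13.ddb", ["flights", "weather"])` would run these statements, assuming DuckDB is installed and a PostgreSQL server accepts that connection string.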

We can confirm that our PostgreSQL database contains the tables we just populated.

```python
!psql < sql/verify_nycflights13.sql
```

### Motivation

In your experience doing data analytics/building data pipelines, have you ever...

- ...slurped up large amounts of data into memory, instead of pushing execution down to the source database/engine?

- ...prototyped code in pandas, and then rewritten it in PySpark/Snowpark/some other native dataframe API?

- ...implemented a proof-of-concept solution on data extracts, and then struggled when you had to run it against the production databases and scale out?

- ...insisted on using Python across the full data engineering/data science workflow for consistency (fair enough), even though dbt would have been a much better fit for your non-ML pipelines, which essentially needed a SQL workflow?
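
To make the first question concrete, here is a minimal sketch that contrasts pulling every row into Python with pushing the aggregation down to the database. It uses the standard-library `sqlite3` module as a stand-in for the source database, and the four rows are made up for illustration; neither is part of the tutorial material.

```python
import sqlite3

# In-memory SQLite database standing in for a real source database.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE flights (carrier TEXT, dep_delay REAL)")
con.executemany(
    "INSERT INTO flights VALUES (?, ?)",
    [("AA", 10.0), ("AA", 20.0), ("UA", 5.0), ("UA", None)],  # made-up rows
)

# Slurping: transfer every row to the client, then aggregate in Python.
rows = con.execute("SELECT carrier, dep_delay FROM flights").fetchall()
totals = {}
for carrier, delay in rows:
    if delay is not None:  # skip missing delays, as SQL's AVG does
        total, count = totals.get(carrier, (0.0, 0))
        totals[carrier] = (total + delay, count + 1)
in_memory = {carrier: total / count for carrier, (total, count) in totals.items()}

# Pushdown: the database aggregates and returns one row per carrier.
pushed_down = dict(
    con.execute(
        "SELECT carrier, AVG(dep_delay) FROM flights GROUP BY carrier"
    ).fetchall()
)
con.close()

print(in_memory == pushed_down)
```

Both approaches give the same averages, but the first moves the whole table to the client while the second moves only the aggregated result, which matters once the table no longer fits in memory.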