# DuckDB in Jupyter Notebooks
A streamlined workflow for SQL analysis with DuckDB and Jupyter

## Library Import and Configuration

In [None]:
!pip install --quiet duckdb
!pip install --quiet pandas
!pip install --quiet ipython-sql 
!pip install --quiet SQLAlchemy
!pip install --quiet duckdb-engine


In [None]:
import duckdb
import pandas as pd
import sqlalchemy
# No need to import duckdb_engine:
# SQLAlchemy auto-detects the required driver from the connection string.

# Import ipython-sql Jupyter extension to create SQL cells
%load_ext sql

We configure ipython-sql to return query results as Pandas dataframes and to produce less verbose output.

In [None]:
%config SqlMagic.autopandas = True
%config SqlMagic.feedback = False
# %config SqlMagic.displaycon = False # Available in newer releases of ipython-sql than the version on Colab

## Connecting to DuckDB
Connect ipython-sql to DuckDB using a SQLAlchemy-style connection string. You may connect either to an in-memory DuckDB database or to a file-backed one.

In [None]:
%sql duckdb:///:memory:
# %sql duckdb:///path/to/file.db

'Connected: @:memory:'

## Querying DuckDB
Single-line SQL queries can be run by placing `%sql` at the start of a line. Query results are displayed as a Pandas dataframe. Note the SQL syntax highlighting!

In [None]:
%sql SELECT 'Off and flying!' as a_duckdb_column

 * duckdb:///:memory:


| | a_duckdb_column |
|---|---|
| 0 | Off and flying! |


An entire Jupyter cell can be used as a SQL cell by placing `%%sql` at the start of the cell. Query results are displayed as a Pandas dataframe.

In [None]:
%%sql
SELECT
    schema_name,
    function_name
FROM duckdb_functions()
ORDER BY ALL DESC
LIMIT 5

 * duckdb:///:memory:


| | schema_name | function_name |
|---|---|---|
| 0 | pg_catalog | shobj_description |
| 1 | pg_catalog | pg_typeof |
| 2 | pg_catalog | pg_type_is_visible |
| 3 | pg_catalog | pg_ts_template_is_visible |
| 4 | pg_catalog | pg_ts_parser_is_visible |


To store query results in a Pandas dataframe for later use, use `<<` as an assignment operator. This works with both the `%sql` and `%%sql` Jupyter magics.

In [None]:
%sql my_df << SELECT 'Off and flying!' as a_duckdb_column
my_df

 * duckdb:///:memory:
Returning data to local variable my_df


| | a_duckdb_column |
|---|---|
| 0 | Off and flying! |


## Querying Pandas Dataframes
DuckDB is able to find and query any dataframe stored as a variable in the Jupyter notebook.

In [None]:
input_df = pd.DataFrame.from_dict({"i":[1, 2, 3],
                                  "j":["one", "two", "three"]})

The dataframe being queried can be specified just like any other table in the `FROM` clause.

In [None]:
%sql output_df << SELECT sum(i) as total_i FROM input_df
output_df

 * duckdb:///:memory:
Returning data to local variable output_df


| | total_i |
|---|---|
| 0 | 6 |


## Summary
You can now alternate between SQL and Pandas in a simple and performant way: dataframes can be read as tables in SQL, and SQL results can be written out to dataframes. You also benefit from SQL syntax highlighting. Happy analyzing!