# Tutorial 1: Getting Started

In this tutorial, we will walk through how you can get started running pandas on DuckDB using Ponder.

In [1]:
import ponder
ponder.init()



### Uploading Example Datasets

We will use a few example datasets in this tutorial. Run the Python script below to load them into your database. It adds three tables populated with example datasets:
- [PONDER_TAXI](https://raw.githubusercontent.com/ponder-org/ponder-datasets/main/yellow_tripdata_2015-01.csv)
- [PONDER_CITIBIKE](https://raw.githubusercontent.com/ponder-org/ponder-datasets/main/citibike_trial.csv)
- [PONDER_BOOKS](https://github.com/ponder-org/ponder-datasets/blob/main/books.csv)

Note that you only need to run the following script once to populate the tables.

In [None]:
!python populate_datasets.py > /dev/null 2>&1

### Connecting to DuckDB

Ponder uses your database as an engine, so we need to establish a connection with DuckDB in order to start querying the data. The code below shows how you can configure the database connection.

In [2]:
import duckdb
duckdb_con = duckdb.connect("../ponder.db")
ponder.configure(default_connection=duckdb_con)

First, let's take a look at the tables in this database:

In [3]:
duckdb_con.execute('SELECT * FROM duckdb_tables;').df()

Unnamed: 0,database_name,database_oid,schema_name,schema_oid,table_name,table_oid,internal,temporary,has_primary_key,estimated_size,column_count,index_count,check_constraint_count,sql
0,ponder,4,main,856,PONDER_CITIBIKE,878,False,False,False,118865,15,0,0,"CREATE TABLE ""PONDER_CITIBIKE""(tripduration DO..."
1,ponder,4,main,856,PONDER_BOOKS,876,False,False,False,11123,12,0,0,"CREATE TABLE ""PONDER_BOOKS""(""bookID"" BIGINT, t..."
2,ponder,4,main,856,PONDER_TAXI,874,False,False,False,210035,17,0,0,"CREATE TABLE ""PONDER_TAXI""(""VENDORID"" BIGINT, ..."
3,ponder,4,main,856,PONDER_CUSTOMER,872,False,False,False,100,8,0,0,"CREATE TABLE ""PONDER_CUSTOMER""(""C_CUSTKEY"" BIG..."
4,ponder,4,main,856,PONDER_ORDERS,870,False,False,False,145,9,0,0,"CREATE TABLE ""PONDER_ORDERS""(""O_ORDERKEY"" BIGI..."
5,ponder,4,main,856,PONDER_PART,868,False,False,False,3893,9,0,0,"CREATE TABLE ""PONDER_PART""(""P_PARTKEY"" BIGINT,..."
6,ponder,4,main,856,PONDER_SUPPLIER,866,False,False,False,3255,7,0,0,"CREATE TABLE ""PONDER_SUPPLIER""(""S_SUPPKEY"" BIG..."


### Starting Pondering 🎉

Now that the connection is initialized, let's read the **PONDER_BOOKS** table that already exists in your database. This dataset comes from the [Goodreads dataset from Kaggle](https://www.kaggle.com/datasets/jealousleopard/goodreadsbooks) and contains books and their review information.

In [4]:
import modin.pandas as pd

In [5]:
df = pd.read_sql("PONDER_BOOKS", duckdb_con)

Let's first print out the dataframe and take a look at the data. 

In [6]:
df

Unnamed: 0,bookID,title,authors,average_rating,isbn,isbn13,language_code,num_pages,ratings_count,text_reviews_count,publication_date,publisher
0,1,Harry Potter and the Half-Blood Prince (Harry ...,J.K. Rowling/Mary GrandPré,4.57,0439785960,9780439785969,eng,652,2095690,27591,9/16/2006,Scholastic Inc.
1,2,Harry Potter and the Order of the Phoenix (Har...,J.K. Rowling/Mary GrandPré,4.49,0439358078,9780439358071,eng,870,2153167,29221,9/1/2004,Scholastic Inc.
2,4,Harry Potter and the Chamber of Secrets (Harry...,J.K. Rowling,4.42,0439554896,9780439554893,eng,352,6333,244,11/1/2003,Scholastic
3,5,Harry Potter and the Prisoner of Azkaban (Harr...,J.K. Rowling/Mary GrandPré,4.56,043965548X,9780439655484,eng,435,2339585,36325,5/1/2004,Scholastic Inc.
4,8,Harry Potter Boxed Set Books 1-5 (Harry Potte...,J.K. Rowling/Mary GrandPré,4.78,0439682584,9780439682589,eng,2690,41428,164,9/13/2004,Scholastic
...,...,...,...,...,...,...,...,...,...,...,...,...
11118,45631,Expelled from Eden: A William T. Vollmann Reader,William T. Vollmann/Larry McCaffery/Michael He...,4.06,1560254416,9781560254416,eng,512,156,20,12/21/2004,Da Capo Press
11119,45633,You Bright and Risen Angels,William T. Vollmann,4.08,0140110879,9780140110876,eng,635,783,56,12/1/1988,Penguin Books
11120,45634,The Ice-Shirt (Seven Dreams #1),William T. Vollmann,3.96,0140131965,9780140131963,eng,415,820,95,8/1/1993,Penguin Books
11121,45639,Poor People,William T. Vollmann,3.72,0060878827,9780060878825,eng,434,769,139,2/27/2007,Ecco


Now we can start hacking away with pandas! Note that every operation you perform here with pandas is run directly on DuckDB.

First, let's take a look at the basic statistics around the numerical columns in our dataset.

In [7]:
df.describe()

Unnamed: 0,bookID,average_rating,isbn13,num_pages,ratings_count,text_reviews_count
count,11123.0,11123.0,11123.0,11123.0,11123.0,11123.0
mean,21310.856963,3.934075,9759880000000.0,336.405556,17942.85,542.048099
std,13094.727252,0.350485,442975800000.0,241.152626,112499.2,2576.619589
min,1.0,0.0,8987060000.0,0.0,0.0,0.0
25%,10277.5,3.77,9780345000000.0,192.0,104.0,9.0
50%,20287.0,3.96,9780582000000.0,299.0,745.0,47.0
75%,32104.5,4.14,9780873000000.0,416.0,5000.5,238.0
max,45641.0,5.0,9790008000000.0,6576.0,4597666.0,94265.0


Let's say we want to normalize the numerical columns by doing a standard z-score normalization (where $\mu$ is the mean and $\sigma$ is the standard deviation). 

$$ x' = \frac{x-\mu}{\sigma}$$

In [8]:
x = df.select_dtypes(include='number').columns
(df[x] - df[x].mean())/df[x].std()

Unnamed: 0,bookID,average_rating,isbn13,num_pages,ratings_count,text_reviews_count
0,-1.627362,1.814412,0.046412,1.308692,18.469003,10.497845
1,-1.627285,1.586157,0.046411,2.212684,18.979913,11.130456
2,-1.627133,1.386434,0.046412,0.064666,-0.103199,-0.115674
3,-1.627056,1.785880,0.046412,0.408847,20.636974,13.887557
4,-1.626827,2.413581,0.046412,9.759771,0.208758,-0.146723
...,...,...,...,...,...,...
11118,1.857247,0.359287,0.048942,0.728147,-0.158107,-0.202610
11119,1.857400,0.416350,0.045736,1.238197,-0.152533,-0.188638
11120,1.857476,0.073968,0.045736,0.325912,-0.152204,-0.173502
11121,1.857858,-0.610797,0.045557,0.404700,-0.152658,-0.156425
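
As a quick sanity check of the formula, the sketch below applies the same expression to a tiny hand-made frame using plain pandas rather than Ponder. The `toy` frame and its values are made up for illustration. After normalization, each numeric column should have a mean of roughly 0 and a standard deviation of roughly 1. Note that pandas' `std()` defaults to the sample standard deviation (`ddof=1`).

```python
import pandas as pd

# Hypothetical toy data, not taken from PONDER_BOOKS.
toy = pd.DataFrame({
    "num_pages": [100, 200, 300, 400],
    "average_rating": [3.0, 3.5, 4.0, 4.5],
    "title": ["a", "b", "c", "d"],
})

# Pick the numeric columns, then apply x' = (x - mean) / std column-wise.
num_cols = toy.select_dtypes(include="number").columns
z = (toy[num_cols] - toy[num_cols].mean()) / toy[num_cols].std()

# Each normalized column should now have mean ~0 and sample std ~1.
col_means = z.mean()
col_stds = z.std()
```

The non-numeric `title` column is left out by `select_dtypes`, which is why the normalized output in the cell above only contains the six numeric columns.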


Next, let's look at all the columns that are non-numerical:

In [9]:
df.select_dtypes(include='object').head()

Unnamed: 0,title,authors,isbn,language_code,publication_date,publisher
0,Harry Potter and the Half-Blood Prince (Harry ...,J.K. Rowling/Mary GrandPré,0439785960,eng,9/16/2006,Scholastic Inc.
1,Harry Potter and the Order of the Phoenix (Har...,J.K. Rowling/Mary GrandPré,0439358078,eng,9/1/2004,Scholastic Inc.
2,Harry Potter and the Chamber of Secrets (Harry...,J.K. Rowling,0439554896,eng,11/1/2003,Scholastic
3,Harry Potter and the Prisoner of Azkaban (Harr...,J.K. Rowling/Mary GrandPré,043965548X,eng,5/1/2004,Scholastic Inc.
4,Harry Potter Boxed Set Books 1-5 (Harry Potte...,J.K. Rowling/Mary GrandPré,0439682584,eng,9/13/2004,Scholastic


The `language_code` column contains 27 distinct language codes in this dataset, some of which are regional variants such as `en-US` and `en-GB`.

In [10]:
df.language_code.unique()

array(['eng', 'en-US', 'fre', 'spa', 'en-GB', 'mul', 'grc', 'enm',
       'en-CA', 'ger', 'jpn', 'ara', 'nl', 'zho', 'lat', 'por', 'srp',
       'ita', 'rus', 'msa', 'glg', 'wel', 'swe', 'nor', 'tur', 'gla',
       'ale'], dtype=object)
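
If you only need the number of distinct codes, or the number of rows per code, `nunique()` and `value_counts()` are the usual pandas calls. The sketch below uses plain pandas on a made-up Series. That these calls behave the same way under Ponder is an assumption, since this tutorial does not demonstrate them.

```python
import pandas as pd

# Hypothetical sample of language codes, for illustration only.
codes = pd.Series(
    ["eng", "en-US", "eng", "fre", "eng", "en-US"], name="language_code"
)

distinct = codes.unique()      # distinct values, in order of first appearance
n_distinct = codes.nunique()   # number of distinct values
counts = codes.value_counts()  # rows per value, most frequent first
```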

To feed this into a machine learning model, we want to [one-hot encode](https://en.wikipedia.org/wiki/One-hot) this categorical column into a set of binary features.

In [11]:
# `columns` expects a list-like of column names to encode
encoded_df = pd.get_dummies(df, columns=["language_code"])
encoded_df

Unnamed: 0,bookID,title,authors,average_rating,isbn,isbn13,num_pages,ratings_count,text_reviews_count,publication_date,...,language_code_nl,language_code_nor,language_code_por,language_code_rus,language_code_spa,language_code_srp,language_code_swe,language_code_tur,language_code_wel,language_code_zho
0,1,Harry Potter and the Half-Blood Prince (Harry ...,J.K. Rowling/Mary GrandPré,4.57,0439785960,9780439785969,652,2095690,27591,9/16/2006,...,0,0,0,0,0,0,0,0,0,0
1,2,Harry Potter and the Order of the Phoenix (Har...,J.K. Rowling/Mary GrandPré,4.49,0439358078,9780439358071,870,2153167,29221,9/1/2004,...,0,0,0,0,0,0,0,0,0,0
2,4,Harry Potter and the Chamber of Secrets (Harry...,J.K. Rowling,4.42,0439554896,9780439554893,352,6333,244,11/1/2003,...,0,0,0,0,0,0,0,0,0,0
3,5,Harry Potter and the Prisoner of Azkaban (Harr...,J.K. Rowling/Mary GrandPré,4.56,043965548X,9780439655484,435,2339585,36325,5/1/2004,...,0,0,0,0,0,0,0,0,0,0
4,8,Harry Potter Boxed Set Books 1-5 (Harry Potte...,J.K. Rowling/Mary GrandPré,4.78,0439682584,9780439682589,2690,41428,164,9/13/2004,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
11118,45631,Expelled from Eden: A William T. Vollmann Reader,William T. Vollmann/Larry McCaffery/Michael He...,4.06,1560254416,9781560254416,512,156,20,12/21/2004,...,0,0,0,0,0,0,0,0,0,0
11119,45633,You Bright and Risen Angels,William T. Vollmann,4.08,0140110879,9780140110876,635,783,56,12/1/1988,...,0,0,0,0,0,0,0,0,0,0
11120,45634,The Ice-Shirt (Seven Dreams #1),William T. Vollmann,3.96,0140131965,9780140131963,415,820,95,8/1/1993,...,0,0,0,0,0,0,0,0,0,0
11121,45639,Poor People,William T. Vollmann,3.72,0060878827,9780060878825,434,769,139,2/27/2007,...,0,0,0,0,0,0,0,0,0,0
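
To make the encoding concrete, here is a minimal sketch on a three-row frame using plain pandas. The `toy` frame is hypothetical, and `dtype=int` is an assumption added here so that plain pandas produces 0/1 columns like the output above (recent pandas versions default to booleans).

```python
import pandas as pd

# Hypothetical toy data, for illustration only.
toy = pd.DataFrame({
    "title": ["A", "B", "C"],
    "language_code": ["eng", "fre", "eng"],
})

# `columns` takes a list of column names to encode.
# Columns not listed (here, `title`) pass through unchanged.
encoded = pd.get_dummies(toy, columns=["language_code"], dtype=int)
```

Each distinct value becomes its own `language_code_<value>` column, and every row has exactly one 1 across those columns.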


We select only the columns whose names match "language". This leaves us with all the converted binary columns, which together are often referred to as the indicator matrix. This matrix can serve as input to a machine learning model.

In [12]:
indicator_matrix = encoded_df.filter(regex="language")
indicator_matrix

Unnamed: 0,language_code_ale,language_code_ara,language_code_en-CA,language_code_en-GB,language_code_en-US,language_code_eng,language_code_enm,language_code_fre,language_code_ger,language_code_gla,...,language_code_nl,language_code_nor,language_code_por,language_code_rus,language_code_spa,language_code_srp,language_code_swe,language_code_tur,language_code_wel,language_code_zho
0,0,0,0,0,0,1,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,1,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,1,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,1,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,1,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
11118,0,0,0,0,0,1,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
11119,0,0,0,0,0,1,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
11120,0,0,0,0,0,1,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
11121,0,0,0,0,0,1,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
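
One detail worth knowing: `filter(regex=...)` keeps every column whose name matches anywhere (it uses `re.search`), so `"language"` would also pick up unrelated columns that happen to contain that word. The sketch below, in plain pandas with a hypothetical `original_language` column, shows how anchoring the pattern to the `language_code_` prefix avoids this.

```python
import pandas as pd

# Hypothetical toy data, for illustration only.
toy = pd.DataFrame({
    "title": ["A", "B"],
    "original_language": ["eng", "fre"],  # made-up extra column
    "language_code_eng": [1, 0],
    "language_code_fre": [0, 1],
})

loose = toy.filter(regex="language")           # matches anywhere in the name
strict = toy.filter(regex=r"^language_code_")  # only the one-hot columns

# A plain NumPy array is the form most ML libraries accept.
X = strict.to_numpy()
```

For the books table in this tutorial both patterns select the same columns, so the anchored form is only a precaution.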


### Summary

In this tutorial, we saw how you can get started running common data science operations in pandas directly on your `PONDER_BOOKS` table in DuckDB.

That means every operation you performed in this tutorial was executed directly in your database! The only data pulled out of your database is the few lines of results printed in the notebook!

Note that if you were to write the equivalent SQL to run these pandas commands on DuckDB, it would take many lines of code to express the same logic. If you're interested in learning why, check out this [blog post](https://ponder.io/pandas-vs-sql-part-2-pandas-is-more-concise/#:~:text=the%20window%20function.-,Conclusion,and%20dropping%20sparsely%20populated%20features.).
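
To give a rough feel for the difference, the sketch below hand-writes a one-hot encoding in SQL. It is not the SQL that Ponder generates. It uses Python's built-in `sqlite3` instead of DuckDB so that it is self-contained, and the `books` table is a hypothetical three-row stand-in. The point is that SQL needs a separate query to discover the categories, plus one `CASE` expression per category, where pandas needs a single `get_dummies` call.

```python
import sqlite3

# Hypothetical in-memory stand-in for the books table.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE books (title TEXT, language_code TEXT)")
con.executemany(
    "INSERT INTO books VALUES (?, ?)",
    [("A", "eng"), ("B", "fre"), ("C", "eng")],
)

# Step 1: a separate query to discover the categories.
codes = [
    row[0]
    for row in con.execute(
        "SELECT DISTINCT language_code FROM books ORDER BY language_code"
    )
]

# Step 2: generate one CASE expression per category.
# String interpolation is acceptable for this toy data only;
# real code would need to escape the values.
case_exprs = ",\n  ".join(
    f"CASE WHEN language_code = '{c}' THEN 1 ELSE 0 END AS \"language_code_{c}\""
    for c in codes
)
query = f"SELECT title,\n  {case_exprs}\nFROM books ORDER BY title"
rows = con.execute(query).fetchall()
```

With 27 language codes, the generated query would contain 27 such `CASE` expressions.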

In our next tutorial, we will share more details on how Ponder works and how you can leverage Ponder to scale up your data science workflow!