# Tutorial 1: Getting Started

In this tutorial, we will walk through how you can get started running pandas on BigQuery using Ponder.

### BigQuery Connection Credentials

To run Ponder on your data warehouse, you must first load the credentials used to connect to it. The cell below reads your BigQuery credentials from a JSON key file.

In [1]:
import json
import os

# Load the BigQuery credentials from a JSON key file
with open(os.path.expanduser("../../../doris-bigquery-381416-497e94001f15.json")) as f:
    creds = json.load(f)
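
Before handing the credentials to Ponder, it can help to confirm that the loaded dictionary looks complete. The following is a minimal sketch, assuming the file is a Google Cloud service-account key: `missing_key_fields` is a hypothetical helper (not part of Ponder or any Google library), and the sample dictionary contains placeholder values only.

```python
# Hypothetical helper, not part of Ponder or the Google client libraries.
# Assumption: the JSON file is a Google Cloud service-account key.
REQUIRED_FIELDS = ("type", "project_id", "private_key", "client_email")


def missing_key_fields(creds: dict) -> list:
    """Return the expected service-account fields that are absent or empty."""
    return [field for field in REQUIRED_FIELDS if not creds.get(field)]


# Placeholder values for illustration only; a real key file holds real secrets.
example_creds = {
    "type": "service_account",
    "project_id": "my-project",
    "client_email": "ponder@my-project.iam.gserviceaccount.com",
}

print(missing_key_fields(example_creds))  # ['private_key']
```

An empty list suggests the key file has the fields a connection would normally need; it does not prove the credentials are valid.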

### Connecting to BigQuery

Ponder uses your data warehouse as an engine, so we need to establish a connection with BigQuery in order to start querying the data. The code below shows how you can configure the database connection.

In [3]:
import ponder.bigquery

# Create a Ponder BigQuery connection object
bigquery_con = ponder.bigquery.connect(creds, schema="TEST")

# Initialize the BigQuery connection
ponder.bigquery.init(bigquery_con)

2023-03-22 18:36:37,767 - INFO - Establishing connection to pushdown.ponder-internal.io



Connected to
       ___               __
      / _ \___  ___  ___/ /__ ____
     / ___/ _ \/ _ \/ _  / -_) __/
    /_/___\___/_//_/\_,_/\__/_/
      / __/__ _____  _____ ____
     _\ \/ -_) __/ |/ / -_) __/
    /___/\__/_/  |___/\__/_/



If the connection was established successfully, you should see output like the banner above.

### Uploading Example Datasets

In [5]:
import modin.pandas as pd

We will be using a few example datasets for the tutorial. You can run the Python script below to load them into your database. It adds three tables, each populated with an example dataset:
- [PONDER_TAXI](https://raw.githubusercontent.com/ponder-org/ponder-datasets/main/yellow_tripdata_2015-01.csv)
- [PONDER_CITIBIKE](https://raw.githubusercontent.com/ponder-org/ponder-datasets/main/citibike_trial.csv)
- [PONDER_BOOK](https://github.com/ponder-org/ponder-datasets/blob/main/books.csv)

Note that you only need to run the following script once for the tables to get populated.

In [4]:
# !python populate_datasets.py > /dev/null 2>&1

### Starting Pondering 🎉

Now that the connection is initialized, let's read the **PONDER_CUSTOMER** table that already exists in your database. This dataset is a sample of the `CUSTOMER` table in the [TPCH dataset](https://www.tpc.org/tpch/).

In [6]:
df = pd.read_sql("PONDER_CUSTOMER", bigquery_con)

Let's first print out the dataframe and take a look at the data. 

In [7]:
df

| | C_CUSTKEY | C_NAME | C_ADDRESS | C_NATIONKEY | C_PHONE | C_ACCTBAL | C_MKTSEGMENT | C_COMMENT |
|---|---|---|---|---|---|---|---|---|
| 0 | 60082 | Customer#000060082 | x3V6vEbLSeUjYdjS1MvR2,u4gB0S 9d8UEJ | 0 | 10-729-863-1818 | 3645.47 | BUILDING | the accounts. furiously unusual |
| 1 | 60080 | Customer#000060080 | g7cKdEj2mzUQLSKFFnWsmL,3GaOIrBmfi | 0 | 10-192-161-6631 | 689.24 | BUILDING | slyly pending, permanent packages. special fo... |
| 2 | 60018 | Customer#000060018 | lQ8PB9FGW53C36XQX2uq0 | 0 | 10-310-354-8579 | 5759.83 | BUILDING | ckly bold deposits. carefully bold accounts in... |
| 3 | 60062 | Customer#000060062 | 1SI,x4F9 zO22 F7OGksMBSUWu5AUpP | 0 | 10-604-525-3386 | 6210.99 | FURNITURE | ons cajole blithely. bold theodolites along |
| 4 | 60022 | Customer#000060022 | I2XoZQLC,63R3zIG z6i3VMCS | 0 | 10-513-498-1045 | -759.74 | FURNITURE | across the blithely ironic sentiments. thinly... |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 95 | 60058 | Customer#000060058 | X9NS,0Ddki | 23 | 33-146-680-6559 | 6672.12 | MACHINERY | ess requests. special requests wake blit |
| 96 | 60079 | Customer#000060079 | dwwsJWhDr0fnRJnyhe6gtls | 24 | 34-197-192-3607 | 3329.55 | BUILDING | ly special somas poach carefully. furiously un... |
| 97 | 60059 | Customer#000060059 | dZISBokE9NWaz13 b5WbOHrd8DifA,e2yict0 | 24 | 34-348-323-9173 | 2337.46 | HOUSEHOLD | ndencies. excuses sleep. quickly daring dugout... |
| 98 | 60033 | Customer#000060033 | fwvb5ua8ZcB | 24 | 34-142-708-2404 | -493.59 | MACHINERY | lithely final packages. quickly regular reques... |


Now we can start hacking away with pandas! Note that every operation you perform here with pandas runs directly on BigQuery.

First, let's take a look at the summary statistics for the numerical columns in our dataset.

In [8]:
df.describe()

| | C_CUSTKEY | C_NATIONKEY | C_ACCTBAL |
|---|---|---|---|
| count | 100.0 | 100.0 | 100.0 |
| mean | 60050.5 | 11.38 | 4419.6958 |
| std | 29.011492 | 7.710389 | 3317.751305 |
| min | 60001.0 | 0.0 | -924.45 |
| 25% | 60025.75 | 4.0 | 1582.3775 |
| 50% | 60050.5 | 12.0 | 4486.89 |
| 75% | 60075.25 | 17.25 | 7201.275 |
| max | 60100.0 | 24.0 | 9957.56 |
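
To make the `describe()` output less opaque, here is a small sketch in stock pandas (not running through Ponder) that reproduces the `C_CUSTKEY` column. It rests on an assumption that is not stated in the tutorial: that the keys are the 100 consecutive integers 60001 to 60100, which is consistent with the count, min, max, and mean shown above.

```python
import pandas as pd

# Assumption: C_CUSTKEY holds the consecutive integers 60001..60100.
keys = pd.Series(range(60001, 60101), name="C_CUSTKEY")
stats = keys.describe()

# The 25% quartile sits at position 0.25 * (n - 1) = 24.75 in the sorted
# values, so it is linearly interpolated between neighbouring keys.
n = len(keys)
q25_by_hand = keys.min() + 0.25 * (n - 1)

print(stats["mean"], round(stats["std"], 6), stats["25%"], q25_by_hand)
```

Under that assumption the mean, standard deviation, and quartiles agree with the table, and the fractional quartile values (such as 60025.75) are explained by linear interpolation. Note that `std` here is the sample standard deviation (dividing by n - 1).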


Let's say we want to normalize the numerical columns by doing a standard z-score normalization (where $\mu$ is the mean and $\sigma$ is the standard deviation). 

$$ x' = \frac{x-\mu}{\sigma}$$

In [9]:
x = df.select_dtypes(include='number').columns
(df[x] - df[x].mean())/df[x].std()

| | C_CUSTKEY | C_NATIONKEY | C_ACCTBAL |
|---|---|---|---|
| 0 | 1.085777 | -1.475931 | -0.233359 |
| 1 | 1.016838 | -1.475931 | -1.124393 |
| 2 | -1.120246 | -1.475931 | 0.403928 |
| 3 | 0.396395 | -1.475931 | 0.539912 |
| 4 | -0.982369 | -1.475931 | -1.561128 |
| ... | ... | ... | ... |
| 95 | 0.258518 | 1.507058 | 0.678901 |
| 96 | 0.982369 | 1.636753 | -0.328580 |
| 97 | 0.292987 | 1.636753 | -0.627605 |
| 98 | -0.603209 | 1.636753 | -1.480908 |
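
As a quick sanity check of the normalization, the sketch below applies the same expression to a tiny made-up DataFrame in stock pandas (not through Ponder). The data and the name `toy` are illustrative only. Subtracting `mean()` and dividing by `std()` align on column labels, so each column is normalized with its own statistics.

```python
import pandas as pd

# Toy data for illustration; the column names mimic the tutorial's table.
toy = pd.DataFrame(
    {
        "C_NAME": ["a", "b", "c", "d"],
        "C_NATIONKEY": [0, 8, 16, 24],
        "C_ACCTBAL": [100.0, 200.0, 300.0, 400.0],
    }
)

# Select the numerical columns, then apply x' = (x - mean) / std per column.
numeric_cols = toy.select_dtypes(include="number").columns
z = (toy[numeric_cols] - toy[numeric_cols].mean()) / toy[numeric_cols].std()

# After normalization each column has mean ~0 and sample std ~1.
print(z.mean().abs().max() < 1e-12, (z.std() - 1).abs().max() < 1e-12)
```

Because pandas' `std()` uses the sample standard deviation by default, the normalized columns have a sample standard deviation of 1 rather than a population standard deviation of 1.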


Next, let's look at all the columns that are non-numerical:

In [10]:
df.select_dtypes(include='object').head()

Unnamed: 0,C_NAME,C_ADDRESS,C_PHONE,C_MKTSEGMENT,C_COMMENT
0,Customer#000060082,"x3V6vEbLSeUjYdjS1MvR2,u4gB0S 9d8UEJ",10-729-863-1818,BUILDING,the accounts. furiously unusual
1,Customer#000060080,"g7cKdEj2mzUQLSKFFnWsmL,3GaOIrBmfi",10-192-161-6631,BUILDING,"slyly pending, permanent packages. special fo..."
2,Customer#000060018,lQ8PB9FGW53C36XQX2uq0,10-310-354-8579,BUILDING,ckly bold deposits. carefully bold accounts in...
3,Customer#000060062,"1SI,x4F9 zO22 F7OGksMBSUWu5AUpP",10-604-525-3386,FURNITURE,ons cajole blithely. bold theodolites along
4,Customer#000060022,"I2XoZQLC,63R3zIG z6i3VMCS",10-513-498-1045,FURNITURE,across the blithely ironic sentiments. thinly...


The `C_MKTSEGMENT` column contains 5 distinct market segments:

In [13]:
df.C_MKTSEGMENT.unique()

array(['BUILDING', 'FURNITURE', 'HOUSEHOLD', 'MACHINERY', 'AUTOMOBILE'],
      dtype=object)

To feed this into a machine learning model, we want to [one-hot encode](https://en.wikipedia.org/wiki/One-hot) this categorical column into a set of binary features.

In [16]:
encoded_df = pd.get_dummies(df, columns=["C_MKTSEGMENT"])
encoded_df

| | C_CUSTKEY | C_NAME | C_ADDRESS | C_NATIONKEY | C_PHONE | C_ACCTBAL | C_COMMENT | C_MKTSEGMENT_AUTOMOBILE | C_MKTSEGMENT_BUILDING | C_MKTSEGMENT_FURNITURE | C_MKTSEGMENT_HOUSEHOLD | C_MKTSEGMENT_MACHINERY |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 60082 | Customer#000060082 | x3V6vEbLSeUjYdjS1MvR2,u4gB0S 9d8UEJ | 0 | 10-729-863-1818 | 3645.47 | the accounts. furiously unusual | 0 | 1 | 0 | 0 | 0 |
| 1 | 60080 | Customer#000060080 | g7cKdEj2mzUQLSKFFnWsmL,3GaOIrBmfi | 0 | 10-192-161-6631 | 689.24 | slyly pending, permanent packages. special fo... | 0 | 1 | 0 | 0 | 0 |
| 2 | 60018 | Customer#000060018 | lQ8PB9FGW53C36XQX2uq0 | 0 | 10-310-354-8579 | 5759.83 | ckly bold deposits. carefully bold accounts in... | 0 | 1 | 0 | 0 | 0 |
| 3 | 60062 | Customer#000060062 | 1SI,x4F9 zO22 F7OGksMBSUWu5AUpP | 0 | 10-604-525-3386 | 6210.99 | ons cajole blithely. bold theodolites along | 0 | 0 | 1 | 0 | 0 |
| 4 | 60022 | Customer#000060022 | I2XoZQLC,63R3zIG z6i3VMCS | 0 | 10-513-498-1045 | -759.74 | across the blithely ironic sentiments. thinly... | 0 | 0 | 1 | 0 | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 95 | 60058 | Customer#000060058 | X9NS,0Ddki | 23 | 33-146-680-6559 | 6672.12 | ess requests. special requests wake blit | 0 | 0 | 0 | 0 | 1 |
| 96 | 60079 | Customer#000060079 | dwwsJWhDr0fnRJnyhe6gtls | 24 | 34-197-192-3607 | 3329.55 | ly special somas poach carefully. furiously un... | 0 | 1 | 0 | 0 | 0 |
| 97 | 60059 | Customer#000060059 | dZISBokE9NWaz13 b5WbOHrd8DifA,e2yict0 | 24 | 34-348-323-9173 | 2337.46 | ndencies. excuses sleep. quickly daring dugout... | 0 | 0 | 0 | 1 | 0 |
| 98 | 60033 | Customer#000060033 | fwvb5ua8ZcB | 24 | 34-142-708-2404 | -493.59 | lithely final packages. quickly regular reques... | 0 | 0 | 0 | 0 | 1 |
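
To see what one-hot encoding does on a scale small enough to check by eye, here is a minimal sketch in stock pandas (not through Ponder) on made-up data. The frame `toy` and its values are illustrative. In stock pandas, `columns` is passed as a list of column names, and `dtype=int` is used here to get 0/1 integers like the output above, since recent pandas versions return booleans by default.

```python
import pandas as pd

# Toy data for illustration; values mimic the C_MKTSEGMENT column.
toy = pd.DataFrame(
    {
        "C_CUSTKEY": [1, 2, 3, 4],
        "C_MKTSEGMENT": ["BUILDING", "FURNITURE", "BUILDING", "MACHINERY"],
    }
)

# Replace C_MKTSEGMENT with one 0/1 column per distinct segment.
encoded = pd.get_dummies(toy, columns=["C_MKTSEGMENT"], dtype=int)

print(encoded.columns.tolist())
print(encoded["C_MKTSEGMENT_BUILDING"].tolist())  # [1, 0, 1, 0]
```

The original column is dropped, and one new column named `<column>_<value>` appears for each distinct value, which matches the shape of the output shown above.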


Next, we select only the columns whose names match "C_MKTSEGMENT". This leaves us with the converted binary columns, which together are often referred to as the indicator matrix and can serve as input to a machine learning model.

In [17]:
indicator_matrix = encoded_df.filter(regex="C_MKTSEGMENT")
indicator_matrix

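| | C_MKTSEGMENT_AUTOMOBILE | C_MKTSEGMENT_BUILDING | C_MKTSEGMENT_FURNITURE | C_MKTSEGMENT_HOUSEHOLD | C_MKTSEGMENT_MACHINERY |
|---|---|---|---|---|---|
| 0 | 0 | 1 | 0 | 0 | 0 |
| 1 | 0 | 1 | 0 | 0 | 0 |
| 2 | 0 | 1 | 0 | 0 | 0 |
| 3 | 0 | 0 | 1 | 0 | 0 |
| 4 | 0 | 0 | 1 | 0 | 0 |
| ... | ... | ... | ... | ... | ... |
| 95 | 0 | 0 | 0 | 0 | 1 |
| 96 | 0 | 1 | 0 | 0 | 0 |
| 97 | 0 | 0 | 0 | 1 | 0 |
| 98 | 0 | 0 | 0 | 0 | 1 |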
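
One detail worth knowing about `filter(regex=...)` is that the pattern may match anywhere in a column name. The sketch below, in stock pandas on made-up data, contrasts an unanchored pattern with one anchored on the generated prefix; the column `OLD_C_MKTSEGMENT` is hypothetical and exists only to show the difference. In this tutorial both patterns would select the same columns, because the original `C_MKTSEGMENT` column no longer exists after encoding.

```python
import pandas as pd

# Toy frame with an extra column whose name merely contains the pattern.
encoded = pd.DataFrame(
    {
        "C_MKTSEGMENT_BUILDING": [1, 0],
        "C_MKTSEGMENT_FURNITURE": [0, 1],
        "OLD_C_MKTSEGMENT": ["BUILDING", "FURNITURE"],  # hypothetical column
    }
)

# An unanchored regex matches anywhere in the column name...
loose = encoded.filter(regex="C_MKTSEGMENT")
# ...while anchoring on the generated prefix keeps only the indicator columns.
strict = encoded.filter(regex="^C_MKTSEGMENT_")

# Each row of the indicator matrix contains exactly one 1.
features = strict.to_numpy()
print(loose.shape, strict.shape, features.sum(axis=1).tolist())  # (2, 3) (2, 2) [1, 1]
```

Converting with `to_numpy()` gives a plain array, which is the form many machine learning libraries accept as input.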


### Summary

In this tutorial, we saw how you can get started running common data science operations in pandas directly on the `PONDER_CUSTOMER` table in BigQuery.

That means every operation you performed in this tutorial was executed directly in your data warehouse! The only data pulled out of the warehouse is the few lines of results printed in the notebook!

Note that if you were to write the equivalent SQL to run these pandas commands on BigQuery, it would take many more lines of code to express the same operations. If you're interested in learning why, check out this [blog post](https://ponder.io/pandas-vs-sql-part-2-pandas-is-more-concise/#:~:text=the%20window%20function.-,Conclusion,and%20dropping%20sparsely%20populated%20features.).
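
To give a feel for the difference, here is a rough sketch of one-hot encoding written by hand in SQL. It is only an illustration: it uses a local in-memory SQLite database as a stand-in for BigQuery, the table and its contents are made up, and the query is not the SQL that Ponder generates.

```python
import sqlite3

# Toy table in a local in-memory SQLite database (a stand-in, not BigQuery).
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE customer (c_custkey INTEGER, c_mktsegment TEXT)")
con.executemany(
    "INSERT INTO customer VALUES (?, ?)",
    [(1, "BUILDING"), (2, "FURNITURE"), (3, "BUILDING")],
)

# Step 1: the categories must be discovered with a separate query.
segments = [
    row[0]
    for row in con.execute(
        "SELECT DISTINCT c_mktsegment FROM customer ORDER BY c_mktsegment"
    )
]

# Step 2: build one CASE expression per category.
# (String interpolation is acceptable here only because the values are ours;
# untrusted values would need proper escaping.)
cases = ",\n  ".join(
    f"CASE WHEN c_mktsegment = '{s}' THEN 1 ELSE 0 END AS c_mktsegment_{s.lower()}"
    for s in segments
)
query = f"SELECT c_custkey,\n  {cases}\nFROM customer ORDER BY c_custkey"

rows = con.execute(query).fetchall()
print(query)
print(rows)  # [(1, 1, 0), (2, 0, 1), (3, 1, 0)]
```

The SQL version needs two round trips and one `CASE` expression per category, whereas the pandas version is a single `pd.get_dummies` call regardless of how many categories there are.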

In our next tutorial, we will share more details on how Ponder works and how you can leverage Ponder to scale up your data science workflow!
