# Tutorial 1: Getting Started

In this tutorial, we will walk through how you can get started running pandas on BigQuery using Ponder.

### BigQuery Connection Credentials

To run Ponder on your data warehouse, you must first establish a database connection to your warehouse. The cell below loads the BigQuery credentials for that connection from a JSON key file.

In [1]:
import json
import os
# Load the BigQuery credentials from a JSON key file (replace the path with your own)
with open(os.path.expanduser("/Users/dorislee/Desktop/access_keys/doris-bigquery-381416-497e94001f15.json")) as f:
    creds = json.load(f)
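
If you would rather not hard-code the key path in the notebook, one option is to read it from an environment variable. The sketch below is only an illustration: the helper `load_key` and the variable name `BIGQUERY_KEY_PATH` are hypothetical and not part of Ponder, and the demo writes a throwaway placeholder file so it can run anywhere.

```python
import json
import os
import tempfile


def load_key(path=None):
    """Load a JSON key file and return its contents as a dict.

    If no path is given, fall back to the BIGQUERY_KEY_PATH environment
    variable (a hypothetical name chosen for this sketch).
    """
    path = path or os.environ["BIGQUERY_KEY_PATH"]
    with open(os.path.expanduser(path)) as f:
        return json.load(f)


# Demo with a placeholder file; a real key file has more fields.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "demo_key.json")
    with open(demo_path, "w") as f:
        json.dump({"type": "service_account", "project_id": "demo-project"}, f)

    os.environ["BIGQUERY_KEY_PATH"] = demo_path
    demo_creds = load_key()

print(demo_creds["project_id"])
```

The returned dictionary plays the same role as `creds` in the cell above.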

### Connecting to BigQuery

Ponder uses your data warehouse as an engine, so we need to establish a connection with BigQuery in order to start querying the data. The code below shows how you can configure the database connection.

In [2]:
import ponder.bigquery

# Create a Ponder BigQuery connection object
bigquery_con = ponder.bigquery.connect(creds, schema="TEST")

# Initialize the BigQuery connection
ponder.bigquery.init(bigquery_con, enable_ssl=True)

2023-03-22 11:13:35,659 - INFO - Establishing connection to service.ponder.io


Connection encrypted with TLSv1.3
Connected to
       ___               __
      / _ \___  ___  ___/ /__ ____
     / ___/ _ \/ _ \/ _  / -_) __/
    /_/___\___/_//_/\_,_/\__/_/
      / __/__ _____  _____ ____
     _\ \/ -_) __/ |/ / -_) __/
    /___/\__/_/  |___/\__/_/




If the connection was established successfully, you should see output like the above.

### Uploading Example Datasets

We will be using a few example datasets for the tutorial. You can run the Python script `populate_datasets.py` to load them into your database. This adds three tables, each populated with an example dataset:
- [PONDER_TAXI](https://raw.githubusercontent.com/ponder-org/ponder-datasets/main/yellow_tripdata_2015-01.csv)
- [PONDER_CITIBIKE](https://raw.githubusercontent.com/ponder-org/ponder-datasets/main/citibike_trial.csv)
- [PONDER_BOOK](https://github.com/ponder-org/ponder-datasets/blob/main/books.csv)

Note that you only need to run the following script once for the tables to get populated.

In [19]:
# !python populate_datasets.py > /dev/null 2>&1

### Starting Pondering 🎉

Now that the connection is initialized, let's read the **PONDER_CUSTOMER** table that already exists in your database. This dataset is a sample of the `CUSTOMER` table in the [TPC-H dataset](https://www.tpc.org/tpch/).

In [17]:
import modin.pandas as pd

In [28]:
df = pd.read_sql("PONDER_CUSTOMER", bigquery_con)

Let's first print out the dataframe and take a look at the data. 

In [22]:
df

| | O_ORDERKEY | O_CUSTKEY | O_ORDERSTATUS | O_TOTALPRICE | O_ORDERDATE | O_ORDERPRIORITY | O_CLERK | O_SHIPPRIORITY | O_COMMENT |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 888608 | 60047 | F | 121724.69 | 9/30/1992 | 5-LOW | Clerk#000000326 | 0 | express packages boost slyly quickly bold asy... |
| 1 | 1050464 | 60040 | F | 320535.89 | 1/27/1993 | 5-LOW | Clerk#000000646 | 0 | s dolphins are carefully at the quickly iron |
| 2 | 246241 | 60028 | F | 19117.05 | 5/30/1992 | 5-LOW | Clerk#000000711 | 0 | ly among the quickly silent p |
| 3 | 162594 | 60049 | F | 27063.41 | 3/16/1992 | 5-LOW | Clerk#000000949 | 0 | nts use furiously. quickly regular accounts |
| 4 | 859619 | 60020 | F | 191464.85 | 8/16/1994 | 5-LOW | Clerk#000000171 | 0 | pinto beans breach. f |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 140 | 916005 | 60067 | P | 93197.61 | 3/20/1995 | 2-HIGH | Clerk#000000349 | 0 | eposits are across the slyly ironic instructio... |
| 141 | 919652 | 60100 | P | 237607.25 | 5/9/1995 | 2-HIGH | Clerk#000000052 | 0 | . carefully furious excuses according to th |
| 142 | 219908 | 60052 | P | 88290.20 | 3/29/1995 | 1-URGENT | Clerk#000000487 | 0 | he unusual packages. regular platelets use car... |
| 143 | 668835 | 60089 | P | 326256.63 | 5/9/1995 | 3-MEDIUM | Clerk#000000721 | 0 | he slyly ironic packages inte |


Now we can start hacking away with pandas! Note that every operation you perform here with pandas runs directly on BigQuery.

First, let's take a look at the basic statistics around the numerical columns in our dataset.

In [None]:
df.describe()

Let's say we want to normalize the numerical columns by doing a standard z-score normalization (where $\mu$ is the mean and $\sigma$ is the standard deviation). 

$$ x' = \frac{x-\mu}{\sigma}$$

In [None]:
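# Pick out the numerical columns, then apply the z-score formula column by column
num_cols = df.select_dtypes(include='number').columns
(df[num_cols] - df[num_cols].mean()) / df[num_cols].std()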
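
To make the formula concrete, here is a small worked example using plain pandas on a made-up three-row DataFrame (the column names and values are illustrative only, not taken from the tutorial tables). It assumes the same expression behaves the same way under Ponder, which this sketch does not verify. Note that `std()` in pandas computes the sample standard deviation (`ddof=1`) by default.

```python
import pandas as pd

# Made-up data: two numerical columns and one text column.
toy = pd.DataFrame({
    "price": [10.0, 20.0, 30.0],
    "pages": [100, 200, 300],
    "title": ["a", "b", "c"],
})

# Select only the numerical columns; "title" is left out.
num_cols = toy.select_dtypes(include="number").columns

# z-score: subtract each column's mean, divide by its standard deviation.
z = (toy[num_cols] - toy[num_cols].mean()) / toy[num_cols].std()
print(z)
```

For `price`, the mean is 20 and the sample standard deviation is 10, so the normalized values are -1, 0, and 1. After normalization, each numerical column has mean 0 and standard deviation 1.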

Next, let's look at all the columns that are non-numerical:

In [None]:
df.select_dtypes(include='object').head()

We can look at the number of distinct values in each of these non-numerical columns:

In [None]:
df.select_dtypes(include='object').nunique()
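
As a quick illustration of what this call returns, the sketch below runs the same expression with plain pandas on a made-up DataFrame (column names and values are illustrative only). The text columns are given an explicit `object` dtype so that the selection behaves the same across pandas versions.

```python
import pandas as pd

# Made-up data: two text columns and one numerical column.
toy = pd.DataFrame({
    "title": pd.Series(["A", "B", "C", "D"], dtype="object"),
    "language_code": pd.Series(["eng", "eng", "spa", "fre"], dtype="object"),
    "num_pages": [100, 250, 320, 90],
})

# Keep only the object-typed (text) columns, then count distinct values per column.
text_cols = toy.select_dtypes(include="object")
distinct_counts = text_cols.nunique()
print(distinct_counts)
```

The result is a Series indexed by column name: here `title` has 4 distinct values and `language_code` has 3, while `num_pages` is excluded because it is numerical.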

We see that there are 27 different languages represented by `language_code` in this dataset.

To feed this into a machine learning model, we want to [one-hot encode](https://en.wikipedia.org/wiki/One-hot) this categorical column into a set of binary features.

In [None]:
# `columns` must be list-like, so the column name is wrapped in a list
encoded_df = pd.get_dummies(df, columns=["language_code"])
encoded_df

Next, we select only the columns whose names match "language". This leaves us with the newly created binary columns, which together are often referred to as the indicator matrix and can serve as input to a machine learning model.

In [None]:
indicator_matrix = encoded_df.filter(regex="language")
indicator_matrix
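
The two steps above can be sketched end to end with plain pandas on a made-up DataFrame (names and values are illustrative only; whether Ponder produces identical dtypes for the dummy columns is not checked here). Each new column is named `language_code_<value>`, which is why a regex filter on the name recovers exactly the encoded columns.

```python
import pandas as pd

# Made-up data with one categorical column.
toy = pd.DataFrame({
    "title": ["A", "B", "C"],
    "language_code": ["eng", "spa", "eng"],
})

# One-hot encode the categorical column; `columns` takes a list of names.
encoded = pd.get_dummies(toy, columns=["language_code"])

# Keep only the generated binary columns (anchored on the column prefix).
indicator = encoded.filter(regex="^language_code_")
print(indicator)
```

Every row of the indicator matrix has exactly one active column, corresponding to that row's original `language_code` value.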

### Summary

In this tutorial, we saw how to get started running common data science operations in pandas directly on the `PONDER_BOOK` table in BigQuery.

Every operation you performed in this tutorial was executed directly in your data warehouse! The only data pulled out of the warehouse is the few lines of results that are printed in the notebook.

Note that if you were to write the equivalent SQL to run these pandas commands on BigQuery, it would take many more lines of code to express the same query. If you're interested in learning why, check out this [blog post](https://ponder.io/pandas-vs-sql-part-2-pandas-is-more-concise/#:~:text=the%20window%20function.-,Conclusion,and%20dropping%20sparsely%20populated%20features.).

In our next tutorial, we will share more details on how Ponder works and how you can leverage Ponder to scale up your data science workflow!
