# Tutorial 2: Primer to Ponder

In [1]:
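import json
import os
import ponder.bigquery
import modin.pandas as pd

os.chdir("..")  # credential.json is read relative to the parent directory
with open("credential.json") as f:
    creds = json.load(f)  # load the BigQuery credentials
# Connect to BigQuery and register the connection with Ponder.
bigquery_con = ponder.bigquery.connect(creds, schema="TEST")
ponder.bigquery.init(bigquery_con)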

2023-03-22 20:05:03,175 - INFO - Establishing connection to pushdown.ponder-internal.io



Connected to
       ___               __
      / _ \___  ___  ___/ /__ ____
     / ___/ _ \/ _ \/ _  / -_) __/
    /_/___\___/_//_/\_,_/\__/_/
      / __/__ _____  _____ ____
     _\ \/ -_) __/ |/ / -_) __/
    /___/\__/_/  |___/\__/_/



## What is Ponder?

Ponder lets you run your pandas code directly in your data warehouse. This means that you can continue to write pandas, but with the scalability and security benefits of a modern data warehouse. 

### Key Features

- **Data science at all scales:** With Ponder's technology, the same pandas workflows can be run at all scales, from megabytes to terabytes, without changing a single line of code.

- **No change to user workflow:** Data scientists can keep running their existing pandas workflows and writing pandas code in their favorite IDE, while benefiting from seamless scalability improvements.

- **Simplify your data infrastructure:** There is no need to set up and maintain the compute infrastructure required by other parallel processing frameworks (e.g., Spark, Ray, Dask) to perform large-scale data analysis with pandas.

- **Guaranteed security:** All your pandas workflows will be executed in BigQuery, thus benefiting from the rigorous security guarantees offered by BigQuery.

In the following sections, we will showcase some examples of how Ponder works and how it can be used in your work.

### Demo 1: Write SQL no more, Ponder in action!

Under the hood, pandas operations are automatically compiled down to SQL queries that get pushed to BigQuery. The queries are executed directly on BigQuery, so users get the performance, scalability, and security of BigQuery as the computation engine.

Here is an architecture diagram of how Ponder works:

<img src="https://ponder.io/wp-content/uploads/2023/01/Group-362.png" width="75%"></img>


To confirm that this is actually running in the data warehouse, you can log in to your [BigQuery web interface](https://console.cloud.google.com/bigquery). The pandas operations you execute with Ponder correspond to the SQL queries shown on the `Project History` page of the BigQuery web interface.

In [2]:
df = pd.read_sql("PONDER_CUSTOMER", bigquery_con)

To see the SQL queries that correspond to the pandas operations run in Ponder, go to `Project History` in your BigQuery web interface. The history page lets you view and drill into the details of queries executed in your BigQuery account.

<img src="img/bigquery_history.png" width="150%"></img>

In this case, you can see that connecting to the table via `pd.read_sql` generated the following SQL query:

```sql
CREATE TEMP TABLE `Ponder_scbxxccwrd` AS SELECT *,  ROW_NUMBER() OVER (ORDER BY 1) -1  AS _PONDER_ROW_NUMBER_,  ROW_NUMBER() OVER (ORDER BY 1) -1  AS  _PONDER_ROW_LABELS_  FROM TEST.`PONDER_CUSTOMER`
```
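
The two extra columns in this query appear to give every row a zero-based position, much like the default integer index of a pandas DataFrame. The following is a minimal sketch of that idea, not Ponder's implementation: it uses Python's built-in `sqlite3` module instead of BigQuery, a made-up `customer` table, and orders the window by a key column so the result is deterministic (the generated query above uses `ORDER BY 1`).

```python
import sqlite3

# In-memory SQLite database standing in for the warehouse (assumption:
# the bundled SQLite is 3.25 or newer, which supports window functions).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customer (c_custkey INTEGER, c_name TEXT)")
conn.executemany(
    "INSERT INTO customer VALUES (?, ?)",
    [(30, "Customer#30"), (10, "Customer#10"), (20, "Customer#20")],  # toy rows
)

# ROW_NUMBER() starts at 1, so subtracting 1 yields zero-based positions.
rows = conn.execute(
    """
    SELECT c_custkey,
           ROW_NUMBER() OVER (ORDER BY c_custkey) - 1 AS _PONDER_ROW_NUMBER_,
           ROW_NUMBER() OVER (ORDER BY c_custkey) - 1 AS _PONDER_ROW_LABELS_
    FROM customer
    ORDER BY _PONDER_ROW_NUMBER_
    """
).fetchall()

for row in rows:
    print(row)
```

Judging by the column names, one column presumably tracks row position and the other the index labels; both start out identical, as they do for a freshly loaded pandas DataFrame.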

You might recall that in the last tutorial, we performed z-score normalization on all the numerical columns. 

In [3]:
x = df.select_dtypes(include='number').columns
(df[x] - df[x].mean())/df[x].std()

|     | C_CUSTKEY | C_NATIONKEY | C_ACCTBAL |
|-----|-----------|-------------|-----------|
| 0   | 1.085777  | -1.475931   | -0.233359 |
| 1   | 1.016838  | -1.475931   | -1.124393 |
| 2   | -1.120246 | -1.475931   | 0.403928  |
| 3   | 0.396395  | -1.475931   | 0.539912  |
| 4   | -0.982369 | -1.475931   | -1.561128 |
| ... | ...       | ...         | ...       |
| 95  | 0.258518  | 1.507058    | 0.678901  |
| 96  | 0.982369  | 1.636753    | -0.328580 |
| 97  | 0.292987  | 1.636753    | -0.627605 |
| 98  | -0.603209 | 1.636753    | -1.480908 |
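
As a quick check of what this expression computes, here is a minimal sketch in plain pandas, with no Ponder or BigQuery connection, on a small made-up DataFrame (the values are hypothetical and only reuse two column names from the table above). After z-score normalization, every numeric column should have a mean of about 0 and a standard deviation of about 1.

```python
import pandas as pd

# Hypothetical toy data; not the real PONDER_CUSTOMER table.
df = pd.DataFrame(
    {
        "C_CUSTKEY": [1, 2, 3, 4],
        "C_NAME": ["a", "b", "c", "d"],  # non-numeric, so it is skipped
        "C_ACCTBAL": [10.0, 20.0, 30.0, 40.0],
    }
)

# Select the numeric columns, then subtract the column mean and divide by
# the column standard deviation (pandas uses the sample std, ddof=1).
num_cols = df.select_dtypes(include="number").columns
z = (df[num_cols] - df[num_cols].mean()) / df[num_cols].std()

print(z)
print(z.mean())  # approximately 0 for every column
print(z.std())   # approximately 1 for every column
```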


Now take a look at `Project History` in BigQuery: the corresponding SQL query is more than 150 lines long!

```sql
SELECT 
  _PONDER_ROW_LABELS_, 
  C_CUSTKEY, 
  C_NATIONKEY, 
  C_ACCTBAL 
FROM 
  (
    SELECT 
      * 
    FROM 
      (
        SELECT 
          _PONDER_ROW_NUMBER_, 
          _PONDER_ROW_LABELS_, 
          C_CUSTKEY / C_CUSTKEY_ponder_right AS C_CUSTKEY, 
          C_NATIONKEY / C_NATIONKEY_ponder_right AS C_NATIONKEY, 
          C_ACCTBAL / C_ACCTBAL_ponder_right AS C_ACCTBAL 
        FROM 
          (
            SELECT 
              _PONDER_ROW_NUMBER_, 
              _PONDER_ROW_LABELS_, 
              C_CUSTKEY - C_CUSTKEY_ponder_right AS C_CUSTKEY, 
              C_NATIONKEY - C_NATIONKEY_ponder_right AS C_NATIONKEY, 
              C_ACCTBAL - C_ACCTBAL_ponder_right AS C_ACCTBAL 
            FROM 
              (
                SELECT 
                  _PONDER_ROW_NUMBER_, 
                  _PONDER_ROW_LABELS_, 
                  C_CUSTKEY, 
                  C_NATIONKEY, 
                  C_ACCTBAL 
                FROM 
                  (
                    SELECT 
                      C_CUSTKEY, 
                      C_NAME, 
                      C_ADDRESS, 
                      C_NATIONKEY, 
                      C_PHONE, 
                      C_ACCTBAL, 
                      C_MKTSEGMENT, 
                      C_COMMENT, 
                      _PONDER_ROW_NUMBER_, 
                      _PONDER_ROW_LABELS_ 
                    FROM 
                      Ponder_scbxxccwrd 
                    ORDER BY 
                      _PONDER_ROW_NUMBER_
                  )
              ) AS _PONDER_LEFT_ CROSS 
              JOIN (
                SELECT 
                  C_CUSTKEY AS C_CUSTKEY_ponder_right, 
                  C_NATIONKEY AS C_NATIONKEY_ponder_right, 
                  C_ACCTBAL AS C_ACCTBAL_ponder_right 
                FROM 
                  (
                    SELECT 
                      0 AS _PONDER_ROW_NUMBER_, 
                      0 AS _PONDER_ROW_LABELS_, 
                      AVG(C_CUSTKEY) AS C_CUSTKEY, 
                      AVG(C_NATIONKEY) AS C_NATIONKEY, 
                      AVG(C_ACCTBAL) AS C_ACCTBAL 
                    FROM 
                      (
                        SELECT 
                          CAST(C_CUSTKEY AS FLOAT64) AS C_CUSTKEY, 
                          CAST(C_NATIONKEY AS FLOAT64) AS C_NATIONKEY, 
                          CAST(C_ACCTBAL AS FLOAT64) AS C_ACCTBAL, 
                          _PONDER_ROW_LABELS_, 
                          _PONDER_ROW_NUMBER_ 
                        FROM 
                          (
                            SELECT 
                              _PONDER_ROW_NUMBER_, 
                              _PONDER_ROW_LABELS_, 
                              C_CUSTKEY, 
                              C_NATIONKEY, 
                              C_ACCTBAL 
                            FROM 
                              (
                                SELECT 
                                  C_CUSTKEY, 
                                  C_NAME, 
                                  C_ADDRESS, 
                                  C_NATIONKEY, 
                                  C_PHONE, 
                                  C_ACCTBAL, 
                                  C_MKTSEGMENT, 
                                  C_COMMENT, 
                                  _PONDER_ROW_NUMBER_, 
                                  _PONDER_ROW_LABELS_ 
                                FROM 
                                  Ponder_scbxxccwrd 
                                ORDER BY 
                                  _PONDER_ROW_NUMBER_
                              )
                          )
                      ) 
                    LIMIT 
                      1
                  )
              ) AS _PONDER_RIGHT_
          ) AS _PONDER_LEFT_ CROSS 
          JOIN (
            SELECT 
              C_CUSTKEY AS C_CUSTKEY_ponder_right, 
              C_NATIONKEY AS C_NATIONKEY_ponder_right, 
              C_ACCTBAL AS C_ACCTBAL_ponder_right 
            FROM 
              (
                SELECT 
                  0 AS _PONDER_ROW_NUMBER_, 
                  0 AS _PONDER_ROW_LABELS_, 
                  STDDEV(C_CUSTKEY) AS C_CUSTKEY, 
                  STDDEV(C_NATIONKEY) AS C_NATIONKEY, 
                  STDDEV(C_ACCTBAL) AS C_ACCTBAL 
                FROM 
                  (
                    SELECT 
                      _PONDER_ROW_NUMBER_, 
                      _PONDER_ROW_LABELS_, 
                      C_CUSTKEY, 
                      C_NATIONKEY, 
                      C_ACCTBAL 
                    FROM 
                      (
                        SELECT 
                          C_CUSTKEY, 
                          C_NAME, 
                          C_ADDRESS, 
                          C_NATIONKEY, 
                          C_PHONE, 
                          C_ACCTBAL, 
                          C_MKTSEGMENT, 
                          C_COMMENT, 
                          _PONDER_ROW_NUMBER_, 
                          _PONDER_ROW_LABELS_ 
                        FROM 
                          Ponder_scbxxccwrd 
                        ORDER BY 
                          _PONDER_ROW_NUMBER_
                      )
                  ) 
                LIMIT 
                  1
              )
          ) AS _PONDER_RIGHT_
      ) 
    WHERE 
      _PONDER_ROW_NUMBER_ IN (
        0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 
        14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 
        24, 25, 26, 27, 28, 29, 30, 69, 70, 71, 
        72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 
        82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 
        92, 93, 94, 95, 96, 97, 98, 99
      )
  ) 
ORDER BY 
  _PONDER_ROW_NUMBER_ 
LIMIT 
  10001
```

In this example, we saw how an operation that takes a single line of pandas can take *many* lines of SQL to write.

Using Ponder can save a lot of time, since you can think and work natively in pandas when interacting with your data warehouse.
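
To make the shape of the generated query easier to follow, here is a hand-condensed sketch of the same pattern: each row is cross-joined with a one-row subquery holding the column mean and another holding the column standard deviation. This is an illustration under assumptions, not the query Ponder emits. It runs on Python's built-in `sqlite3` rather than BigQuery, uses a made-up `customer` table, and registers a hypothetical `STDDEV` aggregate because SQLite does not ship one (in BigQuery, `STDDEV` is the sample standard deviation, matching the pandas default).

```python
import math
import sqlite3
import statistics


class SampleStddev:
    """Sample standard deviation (n - 1 denominator) as a SQLite aggregate."""

    def __init__(self):
        self.values = []

    def step(self, value):
        if value is not None:
            self.values.append(value)

    def finalize(self):
        return statistics.stdev(self.values) if len(self.values) > 1 else None


balances = [10.0, 20.0, 30.0, 40.0]  # hypothetical account balances

conn = sqlite3.connect(":memory:")
conn.create_aggregate("STDDEV", 1, SampleStddev)
conn.execute("CREATE TABLE customer (row_num INTEGER, c_acctbal REAL)")
conn.executemany("INSERT INTO customer VALUES (?, ?)", list(enumerate(balances)))

# Each CROSS JOIN attaches a single-row aggregate to every input row.
query = """
SELECT l.row_num,
       (l.c_acctbal - m.avg_bal) / s.std_bal AS c_acctbal
FROM customer AS l
CROSS JOIN (SELECT AVG(c_acctbal) AS avg_bal FROM customer) AS m
CROSS JOIN (SELECT STDDEV(c_acctbal) AS std_bal FROM customer) AS s
ORDER BY l.row_num
"""
z_sql = [row[1] for row in conn.execute(query)]

# Reference computation in pure Python for comparison.
mean, std = statistics.mean(balances), statistics.stdev(balances)
z_py = [(b - mean) / std for b in balances]

print(all(math.isclose(a, b) for a, b in zip(z_sql, z_py)))
```

Even in this condensed form, the SQL needs two subqueries and two joins for one column, whereas the pandas expression handles every numeric column at once.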

### Summary

In this tutorial, we saw how Ponder lets you run pandas on BigQuery. 

Ponder simplifies working with data by translating your pandas operations into the corresponding SQL queries and running them on your data warehouse. This gives you the flexibility of working directly in pandas, where many operations are easier to express than in hundreds of lines of hand-crafted SQL.

As we have seen, using the pandas API on your data warehouse has several benefits over writing SQL directly, as summarized in this table.

|               | pandas | SQL | Ponder |
|---------------|--------|-----|--------|
| Easy to use   | ✅      | ❌   | ✅      |
| Flexible      | ✅      | ❌   | ✅      |
| Scalable      | ❌      | ✅   | ✅      |
| Secure access | ❌      | ✅   | ✅      |


To learn more about Ponder, check out our [product blog post](https://ponder.io/run-pandas-on-1tb-directly-in-your-data-warehouse/).