# Tutorial 2: Primer to Ponder

<div class="alert alert-block alert-info"> <b>Before we get started: </b> 
    <ul style="list-style-type: none;margin: 0;padding: 0;">
        <li>✍️ To run this notebook, you need to have Ponder installed and set up on your machine. If you have not done so already, please refer to our <a href="https://docs.ponder.io/getting_started/quickstart.html">Quickstart guide</a> to get started.</li>
        <li>📖 Otherwise, if you're just interested in browsing through the tutorial, keep reading below!</li>
    </ul>
</div>

In [1]:
import ponder; ponder.init()
import modin.pandas as pd
from google.cloud import bigquery
from google.cloud.bigquery import dbapi
from google.oauth2 import service_account
import json
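# Load the service-account key and build credentials scoped to BigQuery.
with open("../credential.json") as f:
    credentials = service_account.Credentials.from_service_account_info(
        json.load(f),
        scopes=["https://www.googleapis.com/auth/bigquery"],
    )
# Wrap a BigQuery client in a DB-API connection that pd.read_sql can use.
bigquery_con = dbapi.Connection(bigquery.Client(credentials=credentials))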

2023-05-11 18:59:05 - Creating session yETePRbnYQDX4ScLqp2C6WmwnxqbcfvpZ81o8uZURt


## What is Ponder?

Ponder lets you run your pandas code directly in your data warehouse. This means that you can continue to write pandas, but with the scalability and security benefits of a modern data warehouse. 

### Key Features

- **Data science at all scales**: With Ponder's technology, the same pandas workflows can be run at all scales, from megabytes to terabytes, without changing a single line of code. 

- **No change to user workflow:** Data scientists can continue running their existing pandas workflows and writing pandas code in their IDE of choice, while benefiting from seamless scalability improvements.

- **Simplify your data infrastructure:** There is no need to set up and maintain the compute infrastructure required by other parallel processing frameworks (e.g., Spark, Ray, Dask) to perform large-scale data analysis with pandas.

- **Guaranteed security:** All your pandas workflows are executed in BigQuery, so they benefit from the rigorous security guarantees that BigQuery offers.

In the following sections, we will showcase some examples of how Ponder works and how it can be used in your work.

### Demo: Write SQL no more, Ponder in action!

Under the hood, pandas operations are automatically compiled down to SQL queries that get pushed to BigQuery. Queries are executed directly on BigQuery, so you get the performance, scalability, and security of BigQuery as the computation engine.

Here is an architecture diagram showing how Ponder works:

<img src="https://ponder.io/wp-content/uploads/2023/04/ponder_architecture.png" width="75%"></img>


To see that this is actually running in the data warehouse, you can log in to the [BigQuery web interface](https://console.cloud.google.com/bigquery). The pandas operations you execute with Ponder correspond to the SQL queries shown on the `Project History` page of the BigQuery web interface.

In [2]:
df = pd.read_sql("TEST.PONDER_BOOKS", bigquery_con)

You can look at the SQL queries that correspond to the pandas operations run in Ponder by going to `Project History` in your BigQuery web interface. The history page lets you view and drill into the details of queries executed in your BigQuery account in the last 14 days.

<img src="img/bigquery_history.png" width="150%"></img>

In this case, you can see that when we connected to the table via `pd.read_sql`, the following SQL query was generated:

```sql
CREATE TEMP TABLE `Ponder_rnzdmeknkw` AS SELECT *, ROW_NUMBER() OVER (ORDER BY 1) -1 AS _PONDER_ROW_NUMBER_ FROM `TEST.PONDER_BOOKS` 
```
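The `ROW_NUMBER() ... - 1` expression in this query is worth a closer look. `ROW_NUMBER()` is 1-based, so subtracting 1 yields 0-based row positions, which appears to mirror the default positional index of a pandas DataFrame. The following minimal sketch illustrates the pattern with an in-memory SQLite database standing in for BigQuery; the `books` table and its titles are made up for illustration. Unlike the generated query, which orders by the constant `1`, the sketch orders by SQLite's `rowid` so that the output is deterministic.

```python
import sqlite3

# In-memory SQLite stands in for BigQuery here (an assumption for illustration).
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE books (title TEXT)")
con.executemany(
    "INSERT INTO books VALUES (?)",
    [("Emma",), ("Ulysses",), ("Dracula",)],  # hypothetical rows
)

# ROW_NUMBER() starts at 1, so subtracting 1 gives 0-based positions.
rows = con.execute(
    """
    SELECT title,
           ROW_NUMBER() OVER (ORDER BY rowid) - 1 AS _PONDER_ROW_NUMBER_
    FROM books
    ORDER BY _PONDER_ROW_NUMBER_
    """
).fetchall()

for title, row_number in rows:
    print(row_number, title)
```

Each row comes back paired with a position starting at 0, just like the labels pandas assigns by default.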

You might recall that in the last tutorial, we performed z-score normalization on all the numerical columns. 

In [5]:
x = df.select_dtypes(include='number').columns
(df[x] - df[x].mean())/df[x].std()

|       | bookID    | average_rating | isbn13   | num_pages | ratings_count | text_reviews_count |
|-------|-----------|----------------|----------|-----------|---------------|--------------------|
| 0     | -1.228041 | 0.701669       | 0.065788 | 1.690193  | -0.158898     | -0.206879          |
| 1     | 1.733609  | 1.215243       | 0.046199 | 0.728147  | -0.158587     | -0.204162          |
| 2     | -1.297916 | -1.095839      | 0.067484 | 0.052226  | -0.158409     | -0.205715          |
| 3     | -1.528009 | -11.224651     | 0.047099 | -0.930554 | -0.159493     | -0.210372          |
| 4     | 0.345952  | -11.224651     | 0.046189 | -0.864206 | -0.159493     | -0.210372          |
| ...   | ...       | ...            | ...      | ...       | ...           | ...                |
| 11118 | 1.373999  | 0.872860       | 0.046125 | 0.408847  | -0.137742     | -0.185921          |
| 11119 | 0.186804  | 0.872860       | 0.048999 | 0.587157  | -0.158115     | -0.208431          |
| 11120 | -0.193884 | 0.872860       | 0.047306 | -0.171699 | -0.158880     | -0.206879          |
| 11121 | -0.884009 | 1.586157       | 0.047147 | 3.979200  | -0.143538     | -0.187085          |
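To make the operation concrete, here is a minimal sketch of the same normalization on a tiny made-up DataFrame. It uses plain pandas as a local stand-in, on the assumption that `modin.pandas` under Ponder exposes the same API; the `toy` frame and its values are hypothetical and are not taken from the books table.

```python
import pandas as pd

# Tiny made-up frame: one numeric column and one text column.
toy = pd.DataFrame({"num_pages": [100, 200, 300], "title": ["a", "b", "c"]})

# Keep only the numeric columns, then subtract the mean and divide by the
# sample standard deviation (pandas uses ddof=1 by default).
num_cols = toy.select_dtypes(include="number").columns
z = (toy[num_cols] - toy[num_cols].mean()) / toy[num_cols].std()

print(z)
```

Here the mean is 200 and the sample standard deviation is 100, so the three values normalize to -1, 0, and 1. The text column is dropped by `select_dtypes`, just as `title` and `authors` are absent from the output of the cell above.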


Now take a look at BigQuery's `Project History`: the corresponding SQL query is more than 200 lines long!

```sql
SELECT 
  _PONDER_ROW_LABELS_, 
  bookID, 
  average_rating, 
  isbn13, 
  num_pages, 
  ratings_count, 
  text_reviews_count 
FROM 
  (
    SELECT 
      * 
    FROM 
      (
        SELECT 
          _PONDER_ROW_NUMBER_, 
          _PONDER_ROW_LABELS_, 
          bookID / bookID_ponder_right AS bookID, 
          average_rating / average_rating_ponder_right AS average_rating, 
          isbn13 / isbn13_ponder_right AS isbn13, 
          num_pages / num_pages_ponder_right AS num_pages, 
          ratings_count / ratings_count_ponder_right AS ratings_count, 
          text_reviews_count / text_reviews_count_ponder_right AS text_reviews_count 
        FROM 
          (
            SELECT 
              _PONDER_ROW_NUMBER_, 
              _PONDER_ROW_LABELS_, 
              bookID - bookID_ponder_right AS bookID, 
              average_rating - average_rating_ponder_right AS average_rating, 
              isbn13 - isbn13_ponder_right AS isbn13, 
              num_pages - num_pages_ponder_right AS num_pages, 
              ratings_count - ratings_count_ponder_right AS ratings_count, 
              text_reviews_count - text_reviews_count_ponder_right AS text_reviews_count 
            FROM 
              (
                SELECT 
                  _PONDER_ROW_NUMBER_, 
                  _PONDER_ROW_LABELS_, 
                  bookID, 
                  average_rating, 
                  isbn13, 
                  num_pages, 
                  ratings_count, 
                  text_reviews_count 
                FROM 
                  (
                    SELECT 
                      bookID, 
                      title, 
                      authors, 
                      average_rating, 
                      isbn, 
                      isbn13, 
                      language_code, 
                      num_pages, 
                      ratings_count, 
                      text_reviews_count, 
                      publication_date, 
                      publisher, 
                      _PONDER_ROW_NUMBER_, 
                      _PONDER_ROW_NUMBER_ AS _PONDER_ROW_LABELS_ 
                    FROM 
                      Ponder_rnzdmeknkw
                  )
              ) AS _PONDER_LEFT_ CROSS 
              JOIN (
                SELECT 
                  bookID AS bookID_ponder_right, 
                  average_rating AS average_rating_ponder_right, 
                  isbn13 AS isbn13_ponder_right, 
                  num_pages AS num_pages_ponder_right, 
                  ratings_count AS ratings_count_ponder_right, 
                  text_reviews_count AS text_reviews_count_ponder_right 
                FROM 
                  (
                    SELECT 
                      0 AS _PONDER_ROW_NUMBER_, 
                      0 AS _PONDER_ROW_LABELS_, 
                      AVG(bookID) AS bookID, 
                      AVG(average_rating) AS average_rating, 
                      AVG(isbn13) AS isbn13, 
                      AVG(num_pages) AS num_pages, 
                      AVG(ratings_count) AS ratings_count, 
                      AVG(text_reviews_count) AS text_reviews_count 
                    FROM 
                      (
                        SELECT 
                          CAST(bookID AS FLOAT64) AS bookID, 
                          CAST(average_rating AS FLOAT64) AS average_rating, 
                          CAST(isbn13 AS FLOAT64) AS isbn13, 
                          CAST(num_pages AS FLOAT64) AS num_pages, 
                          CAST(ratings_count AS FLOAT64) AS ratings_count, 
                          CAST(text_reviews_count AS FLOAT64) AS text_reviews_count, 
                          _PONDER_ROW_LABELS_, 
                          _PONDER_ROW_NUMBER_ 
                        FROM 
                          (
                            SELECT 
                              _PONDER_ROW_NUMBER_, 
                              _PONDER_ROW_LABELS_, 
                              bookID, 
                              average_rating, 
                              isbn13, 
                              num_pages, 
                              ratings_count, 
                              text_reviews_count 
                            FROM 
                              (
                                SELECT 
                                  bookID, 
                                  title, 
                                  authors, 
                                  average_rating, 
                                  isbn, 
                                  isbn13, 
                                  language_code, 
                                  num_pages, 
                                  ratings_count, 
                                  text_reviews_count, 
                                  publication_date, 
                                  publisher, 
                                  _PONDER_ROW_NUMBER_, 
                                  _PONDER_ROW_NUMBER_ AS _PONDER_ROW_LABELS_ 
                                FROM 
                                  Ponder_rnzdmeknkw
                              )
                          )
                      ) 
                    LIMIT 
                      1
                  )
              ) AS _PONDER_RIGHT_
          ) AS _PONDER_LEFT_ CROSS 
          JOIN (
            SELECT 
              bookID AS bookID_ponder_right, 
              average_rating AS average_rating_ponder_right, 
              isbn13 AS isbn13_ponder_right, 
              num_pages AS num_pages_ponder_right, 
              ratings_count AS ratings_count_ponder_right, 
              text_reviews_count AS text_reviews_count_ponder_right 
            FROM 
              (
                SELECT 
                  0 AS _PONDER_ROW_NUMBER_, 
                  0 AS _PONDER_ROW_LABELS_, 
                  STDDEV(bookID) AS bookID, 
                  STDDEV(average_rating) AS average_rating, 
                  STDDEV(isbn13) AS isbn13, 
                  STDDEV(num_pages) AS num_pages, 
                  STDDEV(ratings_count) AS ratings_count, 
                  STDDEV(text_reviews_count) AS text_reviews_count 
                FROM 
                  (
                    SELECT 
                      _PONDER_ROW_NUMBER_, 
                      _PONDER_ROW_LABELS_, 
                      bookID, 
                      average_rating, 
                      isbn13, 
                      num_pages, 
                      ratings_count, 
                      text_reviews_count 
                    FROM 
                      (
                        SELECT 
                          bookID, 
                          title, 
                          authors, 
                          average_rating, 
                          isbn, 
                          isbn13, 
                          language_code, 
                          num_pages, 
                          ratings_count, 
                          text_reviews_count, 
                          publication_date, 
                          publisher, 
                          _PONDER_ROW_NUMBER_, 
                          _PONDER_ROW_NUMBER_ AS _PONDER_ROW_LABELS_ 
                        FROM 
                          Ponder_rnzdmeknkw
                      )
                  ) 
                LIMIT 
                  1
              )
          ) AS _PONDER_RIGHT_
      ) 
    WHERE 
      _PONDER_ROW_NUMBER_ IN (
        0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 
        14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 
        24, 25, 26, 27, 28, 29, 30, 11092, 11093, 
        11094, 11095, 11096, 11097, 11098, 
        11099, 11100, 11101, 11102, 11103, 
        11104, 11105, 11106, 11107, 11108, 
        11109, 11110, 11111, 11112, 11113, 
        11114, 11115, 11116, 11117, 11118, 
        11119, 11120, 11121, 11122
      )
  ) 
ORDER BY 
  _PONDER_ROW_NUMBER_ 
LIMIT 
  10001
```
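Despite its length, the query follows a simple pattern: compute a single row of averages, `CROSS JOIN` it onto every data row and subtract, then do the same with a single row of standard deviations and divide. The sketch below replays that structure step by step in plain pandas on a small made-up frame. It is one reading of the generated SQL, not a description of Ponder's internals; the `toy` data is hypothetical, and `merge(how="cross")` assumes pandas 1.2 or later.

```python
import pandas as pd

# Small made-up frame standing in for the numeric columns of the books table.
toy = pd.DataFrame(
    {"num_pages": [100.0, 200.0, 300.0], "ratings_count": [10.0, 20.0, 60.0]}
)

# One-row frames of aggregates, renamed like the *_ponder_right columns.
means = toy.mean().to_frame().T.add_suffix("_ponder_right")
stds = toy.std().to_frame().T.add_suffix("_ponder_right")

# Step 1: cross join the row of means onto every data row, then subtract.
joined = toy.merge(means, how="cross")
centered = pd.DataFrame(
    {c: joined[c] - joined[f"{c}_ponder_right"] for c in toy.columns}
)

# Step 2: cross join the row of standard deviations, then divide.
joined = centered.merge(stds, how="cross")
z_sql_style = pd.DataFrame(
    {c: joined[c] / joined[f"{c}_ponder_right"] for c in toy.columns}
)

# The step-by-step result matches the pandas one-liner.
z_pandas = (toy - toy.mean()) / toy.std()
pd.testing.assert_frame_equal(z_sql_style, z_pandas)
```

pandas broadcasting hides both joins behind a single expression, which is why the pandas version stays so short. Note also that BigQuery's `STDDEV` is an alias for the sample standard deviation, which matches the default behavior of `std()` in pandas.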

In this example, we saw how an operation that takes a single line of pandas can in fact take *many* lines of SQL to write.

Using Ponder can save significant time, since you can think and work natively in pandas when interacting with your data warehouse.

### Summary

In this tutorial, we saw how Ponder lets you run pandas on BigQuery by translating your pandas operations into corresponding SQL queries that run in your data warehouse. This gives you the flexibility of working directly in pandas, where many operations are far easier to express than in hundreds of lines of hand-written SQL!

As we can see, there are many benefits to using the pandas API on your data warehouse instead of writing SQL directly, as summarized in this table.

|               | pandas | SQL | Ponder |
|---------------|--------|-----|--------|
| Easy to use   | ✅      | ❌   | ✅      |
| Flexible      | ✅      | ❌   | ✅      |
| Scalable      | ❌      | ✅   | ✅      |
| Secure access | ❌      | ✅   | ✅      |


To learn more about Ponder, check out our product blog post [here](https://ponder.io/run-pandas-on-1tb-directly-in-your-data-warehouse/).

In our [next tutorial](https://github.com/ponder-org/ponder-notebooks/blob/main/bigquery/tutorial/03-reading-data.ipynb), you will learn how to use pandas's I/O methods to read from a database, a CSV file, or a Parquet file.