# Tutorial 1: Getting Started

<div class="alert alert-block alert-info"> <b>Before we get started: </b> 
    <ul style="list-style-type: none;margin: 0;padding: 0;">
        <li>✍️ To run this notebook, you need to have Ponder installed and set up on your machine. If you have not done so already, please refer to our <a href="https://docs.ponder.io/getting_started/quickstart.html">Quickstart guide</a> to get started.</li>
        <li>📖 Otherwise, if you're just interested in browsing through the tutorial, keep reading below!</li>
    </ul>
</div>

In this tutorial, we will walk through how you can get started running pandas on BigQuery using Ponder.

### BigQuery Connection Credentials

To run Ponder on your data warehouse, you must first establish a database connection to it. Move your BigQuery service account key to your `bigquery/` directory and rename it to `credential.json`. We will use this file throughout the tutorial series.

<div class="alert alert-block alert-info"> <b>Note: </b> <span> If you cannot find the BigQuery account information you need to set up your database connection, please follow our <a href="https://docs.ponder.io/resources/bigquery_setup.html">step-by-step guide</a> for more information. </span></div>
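
Before connecting, it can help to confirm that the key file is complete. The sketch below is not part of Ponder; the helper name `missing_credential_fields` is hypothetical, and the field list is only a subset of what a Google Cloud service account key file normally contains.

```python
import json

# A subset of the fields found in a Google Cloud service account key file.
REQUIRED_FIELDS = ("type", "project_id", "private_key", "client_email")


def missing_credential_fields(path):
    """Return the expected fields that are absent from the key file at `path`."""
    with open(path) as f:
        info = json.load(f)
    return [field for field in REQUIRED_FIELDS if field not in info]
```

For a complete key file, `missing_credential_fields("../credential.json")` should return an empty list.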

In [1]:
import ponder
ponder.init()

We will be using a few example datasets for the tutorial. You can run the Python script below to load them into your database. It adds the following tables, each populated from the linked CSV file:

- [PONDER_CITIBIKE](https://raw.githubusercontent.com/ponder-org/ponder-datasets/main/citibike_trial.csv)
- [PONDER_BOOKS](https://github.com/ponder-org/ponder-datasets/blob/main/books.csv)
- [PONDER_CUSTOMER](https://raw.githubusercontent.com/ponder-org/ponder-datasets/main/tpch/customer.csv)
- [PONDER_ORDER](https://raw.githubusercontent.com/ponder-org/ponder-datasets/main/tpch/orders.csv)
- [PONDER_PART](https://raw.githubusercontent.com/ponder-org/ponder-datasets/main/tpch/part.csv)
- [PONDER_SUPPLIER](https://raw.githubusercontent.com/ponder-org/ponder-datasets/main/tpch/supplier.csv)

Note that you only need to run the following script once for the tables to get populated.

In [None]:
!python populate_datasets.py

### Connecting to BigQuery

Ponder uses your data warehouse as an engine, so we need to establish a connection with BigQuery in order to start querying the data. The code below shows how you can configure the database connection.

In [2]:
from google.cloud import bigquery
from google.cloud.bigquery import dbapi
from google.oauth2 import service_account
import json

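# Load the service account key saved earlier as credential.json
with open("../credential.json") as f:
    credential_info = json.load(f)

# Wrap a BigQuery client in a DB-API connection that Ponder can use
bigquery_con = dbapi.Connection(
    bigquery.Client(
        credentials=service_account.Credentials.from_service_account_info(
            credential_info,
            scopes=["https://www.googleapis.com/auth/bigquery"],
        )
    )
)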
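
The `dbapi` module used above follows the Python DB-API (PEP 249) interface. As a rough local stand-in that needs no cloud credentials, the sketch below uses the standard library's `sqlite3`, which follows the same interface, to show the connect, cursor, execute, and fetch pattern. The in-memory `books` table and its two rows are hypothetical and only borrow values from the dataset for illustration; this is not how Ponder talks to BigQuery internally.

```python
import sqlite3

# Local stand-in: sqlite3 also implements the DB-API (PEP 249) interface.
con = sqlite3.connect(":memory:")
cur = con.cursor()

# Hypothetical miniature of a books table, for illustration only.
cur.execute("CREATE TABLE books (title TEXT, average_rating REAL)")
cur.executemany(
    "INSERT INTO books VALUES (?, ?)",
    [("Vergeef me", 4.18), ("Zaat", 3.55)],
)
con.commit()

# Run a query through the cursor and collect the matching titles.
cur.execute("SELECT title FROM books WHERE average_rating > 4 ORDER BY title")
high_rated = [row[0] for row in cur.fetchall()]
print(high_rated)
con.close()
```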

### Starting Pondering 🎉

Now that the connection is initialized, let's read the **PONDER_BOOKS** table that already exists in your database (note that `PONDER` is the [BigQuery Dataset](https://cloud.google.com/bigquery/docs/datasets-intro) name here). This dataset comes from the [Goodreads dataset from Kaggle](https://www.kaggle.com/datasets/jealousleopard/goodreadsbooks) and contains books and their review information.

In [3]:
import modin.pandas as pd

In [4]:
df = pd.read_sql("PONDER.PONDER_BOOKS", bigquery_con)

Let's first print out the dataframe and take a look at the data. 

In [5]:
df

Unnamed: 0,bookID,title,authors,average_rating,isbn,isbn13,language_code,num_pages,ratings_count,text_reviews_count,publication_date,publisher
0,5230,Vergeef me,Wally Lamb/Inge de Heer,4.18,9022530078,9789022530078,nl,744,67,9,7/1/2001,De Boekerij
1,44012,Shield of Thunder (Troy #2),David Gemmell,4.36,0345477014,9780345477019,ale,512,102,16,3/27/2007,Ballantine Books
2,4315,Zaat,Sonallah Ibrahim/صنع الله إبراهيم/Anthony Cald...,3.55,9774248449,9789774248443,ara,349,122,12,3/15/2004,American University in Cairo Press
3,19257,Canopy: A Work for Voice and Light in Harvard ...,David Ward/Parveen Adams/Seamus Heaney/Ivan ...,0.00,0916724948,9780916724948,eng,63,0,0,12/31/1997,Arts Publications
4,15186,American Film Guide,Frank N. Magill,0.00,0893562505,9780893562502,eng,5,0,0,1/1/1983,Salem Press Inc
...,...,...,...,...,...,...,...,...,...,...,...,...
11118,43068,True Blue: The Oxford Boat Race Mutiny,Daniel Topolski/Patrick Robinson,4.24,0553400037,9780553400038,en-US,320,69,13,2/23/1990,Bantam
11119,30250,Sensual Phrase Vol. 17,Mayu Shinjō,4.24,1421505622,9781421505626,en-US,208,489,4,12/1/2006,Viz Media
11120,22406,The Invisibles Vol. 6: Kissing Mister Quimper,Grant Morrison/Chris Weston/Ivan Reis,4.24,1563896001,9781563896002,en-US,224,3852,63,2/1/2000,DC Comics Vertigo
11121,27550,Walking with the Wind: A Memoir of the Movement,John Lewis/Michael D'Orso,4.49,0156007088,9780156007085,en-US,496,2052,253,10/18/1999,Mariner Books


Now we can start hacking away with pandas! Note that every single operation you perform here with pandas runs directly on BigQuery.

First, let's take a look at the summary statistics for the numerical columns in our dataset.

In [6]:
df.describe()

Unnamed: 0,bookID,average_rating,isbn13,num_pages,ratings_count,text_reviews_count
count,11123.0,11123.0,11123.0,11123.0,11123.0,11123.0
mean,21310.856963,3.934075,9759880000000.0,336.405556,17942.85,542.048099
std,13094.727252,0.350485,442975800000.0,241.152626,112499.2,2576.619589
min,1.0,0.0,8987060000.0,0.0,0.0,0.0
25%,10277.5,3.77,9780345000000.0,192.0,104.0,9.0
50%,20287.0,3.96,9780582000000.0,299.0,745.0,47.0
75%,32104.5,4.14,9780872000000.0,416.0,5000.5,238.0
max,45641.0,5.0,9790008000000.0,6576.0,4597666.0,94265.0


Let's say we want to normalize the numerical columns by doing a standard z-score normalization (where $\mu$ is the mean and $\sigma$ is the standard deviation). 

$$ x' = \frac{x-\mu}{\sigma}$$

In [7]:
x = df.select_dtypes(include='number').columns
(df[x] - df[x].mean())/df[x].std()

Unnamed: 0,bookID,average_rating,isbn13,num_pages,ratings_count,text_reviews_count
0,-1.228041,0.701669,0.065788,1.690193,-0.158898,-0.206879
1,1.733609,1.215243,0.046199,0.728147,-0.158587,-0.204162
2,-1.297916,-1.095839,0.067484,0.052226,-0.158409,-0.205715
3,-0.156846,-11.224651,0.047489,-1.133745,-0.159493,-0.210372
4,-0.467735,-11.224651,0.047437,-1.374256,-0.159493,-0.210372
...,...,...,...,...,...,...
11118,1.661519,0.872860,0.046669,-0.068030,-0.158880,-0.205326
11119,0.682652,0.872860,0.048629,-0.532466,-0.155146,-0.208819
11120,0.083632,0.872860,0.048950,-0.466118,-0.125253,-0.185921
11121,0.476462,1.586157,0.045772,0.661798,-0.141253,-0.112181
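
To make the formula concrete, here is a minimal plain-Python sketch of the same arithmetic. It runs locally rather than on BigQuery, the `pages` list is hypothetical, and `statistics.stdev` is used because it computes the sample standard deviation, which is also the pandas default for `.std()`. The spot check at the end plugs the `average_rating` mean and standard deviation from the `describe()` output above into the formula.

```python
from statistics import mean, stdev


def zscore(values):
    """Standardize values using the mean and the sample standard deviation."""
    mu = mean(values)
    sigma = stdev(values)
    return [(v - mu) / sigma for v in values]


# Hypothetical page counts, for illustration only.
pages = [100.0, 200.0, 300.0]
normalized = zscore(pages)
print(normalized)  # [-1.0, 0.0, 1.0]

# Spot check with the average_rating statistics reported by describe().
z_row0 = (4.18 - 3.934075) / 0.350485  # rating of row 0
z_row3 = (0.00 - 3.934075) / 0.350485  # rating of row 3
print(round(z_row0, 3), round(z_row3, 2))  # 0.702 -11.22
```

The two spot-check values agree with the `average_rating` entries for rows 0 and 3 in the normalized output, up to rounding of the published mean and standard deviation.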


Next, let's look at all the columns that are non-numerical:

In [8]:
df.select_dtypes(include='object').head()

Unnamed: 0,title,authors,isbn,language_code,publication_date,publisher
0,Vergeef me,Wally Lamb/Inge de Heer,9022530078,nl,7/1/2001,De Boekerij
1,Shield of Thunder (Troy #2),David Gemmell,345477014,ale,3/27/2007,Ballantine Books
2,Zaat,Sonallah Ibrahim/صنع الله إبراهيم/Anthony Cald...,9774248449,ara,3/15/2004,American University in Cairo Press
3,Canopy: A Work for Voice and Light in Harvard ...,David Ward/Parveen Adams/Seamus Heaney/Ivan ...,916724948,eng,12/31/1997,Arts Publications
4,American Film Guide,Frank N. Magill,893562505,eng,1/1/1983,Salem Press Inc


The `language_code` column contains 27 distinct language codes, including several regional variants of English.

In [9]:
df.language_code.unique()

array(['nl', 'ale', 'ara', 'eng', 'enm', 'fre', 'ger', 'gla', 'glg',
       'grc', 'ita', 'jpn', 'lat', 'msa', 'mul', 'nor', 'por', 'rus',
       'spa', 'srp', 'swe', 'tur', 'wel', 'zho', 'en-CA', 'en-GB',
       'en-US'], dtype=object)

BigQuery doesn't [support special characters](https://cloud.google.com/bigquery/docs/schemas#column_names) such as `-` in column names, and the one-hot encoding in the next step turns each language code into a column name. We therefore clean up the `en-*` entries by replacing the `-` with an underscore.

In [10]:
df.language_code = df.language_code.str.replace("-","_")
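
As a quick local illustration of what this replacement does, the plain-Python sketch below applies the same substitution to the codes returned by `unique()` above. It does not run on BigQuery; it only shows that the three `en-*` codes are rewritten and that the number of distinct codes is unchanged.

```python
# Language codes as returned by df.language_code.unique() above.
codes = [
    "nl", "ale", "ara", "eng", "enm", "fre", "ger", "gla", "glg",
    "grc", "ita", "jpn", "lat", "msa", "mul", "nor", "por", "rus",
    "spa", "srp", "swe", "tur", "wel", "zho", "en-CA", "en-GB", "en-US",
]

# Replace the hyphen with an underscore in every code.
cleaned = [code.replace("-", "_") for code in codes]

print(len(set(cleaned)))  # 27
print([code for code in cleaned if code.startswith("en_")])  # ['en_CA', 'en_GB', 'en_US']
```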

To feed this into a machine learning model, we want to [one-hot encode](https://en.wikipedia.org/wiki/One-hot) this categorical column into a set of binary features.

In [11]:
encoded_df = pd.get_dummies(df, columns=["language_code"])
encoded_df

Unnamed: 0,bookID,title,authors,average_rating,isbn,isbn13,num_pages,ratings_count,text_reviews_count,publication_date,...,language_code_nl,language_code_nor,language_code_por,language_code_rus,language_code_spa,language_code_srp,language_code_swe,language_code_tur,language_code_wel,language_code_zho
0,5230,Vergeef me,Wally Lamb/Inge de Heer,4.18,9022530078,9789022530078,744,67,9,7/1/2001,...,True,False,False,False,False,False,False,False,False,False
1,44012,Shield of Thunder (Troy #2),David Gemmell,4.36,0345477014,9780345477019,512,102,16,3/27/2007,...,False,False,False,False,False,False,False,False,False,False
2,4315,Zaat,Sonallah Ibrahim/صنع الله إبراهيم/Anthony Cald...,3.55,9774248449,9789774248443,349,122,12,3/15/2004,...,False,False,False,False,False,False,False,False,False,False
3,19257,Canopy: A Work for Voice and Light in Harvard ...,David Ward/Parveen Adams/Seamus Heaney/Ivan ...,0.00,0916724948,9780916724948,63,0,0,12/31/1997,...,False,False,False,False,False,False,False,False,False,False
4,15186,American Film Guide,Frank N. Magill,0.00,0893562505,9780893562502,5,0,0,1/1/1983,...,False,False,False,False,False,False,False,False,False,False
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
11118,43068,True Blue: The Oxford Boat Race Mutiny,Daniel Topolski/Patrick Robinson,4.24,0553400037,9780553400038,320,69,13,2/23/1990,...,False,False,False,False,False,False,False,False,False,False
11119,30250,Sensual Phrase Vol. 17,Mayu Shinjō,4.24,1421505622,9781421505626,208,489,4,12/1/2006,...,False,False,False,False,False,False,False,False,False,False
11120,22406,The Invisibles Vol. 6: Kissing Mister Quimper,Grant Morrison/Chris Weston/Ivan Reis,4.24,1563896001,9781563896002,224,3852,63,2/1/2000,...,False,False,False,False,False,False,False,False,False,False
11121,27550,Walking with the Wind: A Memoir of the Movement,John Lewis/Michael D'Orso,4.49,0156007088,9780156007085,496,2052,253,10/18/1999,...,False,False,False,False,False,False,False,False,False,False
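
For readers new to one-hot encoding, the plain-Python sketch below shows the shape of the result on four of the codes from this dataset. The `one_hot` helper is hypothetical and only mirrors the naming pattern seen in the output above (`language_code_<value>`, with columns in sorted order); it is not how pandas or Ponder implements `get_dummies`.

```python
def one_hot(values, prefix):
    """Return (column_names, rows) with one boolean column per distinct value."""
    categories = sorted(set(values))
    columns = [f"{prefix}_{category}" for category in categories]
    rows = [[value == category for category in categories] for value in values]
    return columns, rows


# Four of the language codes from the dataset, after the hyphen cleanup.
codes = ["nl", "ale", "eng", "en_US"]
columns, rows = one_hot(codes, "language_code")

print(columns)
for code, row in zip(codes, rows):
    print(code, row)
```

Each row contains exactly one `True`, in the column that corresponds to that row's language code.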


Next, we select only the columns whose names match "language". This leaves us with all the converted binary columns, which together are often referred to as the indicator matrix. This matrix can serve as input to a machine learning model.

In [12]:
indicator_matrix = encoded_df.filter(regex="language")
indicator_matrix

Unnamed: 0,language_code_ale,language_code_ara,language_code_en_CA,language_code_en_GB,language_code_en_US,language_code_eng,language_code_enm,language_code_fre,language_code_ger,language_code_gla,...,language_code_nl,language_code_nor,language_code_por,language_code_rus,language_code_spa,language_code_srp,language_code_swe,language_code_tur,language_code_wel,language_code_zho
0,False,False,False,False,False,False,False,False,False,False,...,True,False,False,False,False,False,False,False,False,False
1,True,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
2,False,True,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
3,False,False,False,False,False,True,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
4,False,False,False,False,False,True,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
11118,False,False,False,False,True,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
11119,False,False,False,False,True,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
11120,False,False,False,False,True,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
11121,False,False,False,False,True,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
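
In pandas, `filter(regex=...)` keeps the labels for which a regular expression search finds a match. The plain-Python sketch below mimics that selection on a handful of column names taken from the encoded dataframe; the `original_language` name is hypothetical and is included only to show that an unanchored pattern matches anywhere in the name.

```python
import re

# A few column names from encoded_df, plus one hypothetical name.
columns = [
    "bookID",
    "title",
    "publication_date",
    "language_code_en_US",
    "language_code_nl",
    "original_language",  # hypothetical, not in the dataset
]

# Unanchored pattern: matches "language" anywhere in the name.
loose = [name for name in columns if re.search("language", name)]

# Anchored pattern: matches only the one-hot encoded columns.
strict = [name for name in columns if re.search("^language_code_", name)]

print(loose)
print(strict)
```

For this dataset the unanchored pattern is sufficient, since no other column name contains "language". An anchored pattern such as `^language_code_` is a stricter alternative if that ever changes.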


### Summary

In this tutorial, we saw how you can get started running common data science operations in pandas directly on your `PONDER_BOOKS` table in BigQuery.

That means that every single operation you performed in this tutorial was executed directly in your data warehouse! The only data pulled out of the warehouse is the few lines of results that are printed in the notebook.

In our [next tutorial](https://github.com/ponder-org/ponder-notebooks/blob/main/bigquery/tutorial/02-primer-to-ponder.ipynb), we will share more details on how Ponder works and how you can leverage Ponder to accelerate your data science workflow!
