# 5-min Quickstart Guide to Ponder
Follow along with this notebook or [watch a video](https://www.youtube.com/watch?v=StA8zv607fk) to see how you can use Ponder today!

**What is Ponder?** Ponder lets you run your data science workflows (pandas, NumPy) directly in your database, be it Snowflake or BigQuery. With Ponder, you get the same Python experience you love, but with the power and scalability of data warehouses. Learn more about Ponder [here](https://ponder.io/ponder-in-public-beta-data-science-in-your-data-warehouse/).

## Step 0: Create an Account

Before we get started, you need a Ponder account.
If you don't already have one, you can create a free account by signing up [here](https://app.ponder.io/signup).


## Step 1: Setting up Ponder

You can use Ponder by simply installing it as a library on your own machine. With this flexible and lightweight approach, you can keep working in your own environment with your existing notebook/IDE setup.

To install the library, run the following command:

In [None]:
!pip install "ponder[bigquery]" # Install Ponder dependencies and BigQuery connector

## Step 2: Initialize Ponder and Authenticate

Now we are ready to start using Ponder! To get started, you first need to initialize Ponder.

You will need to register your product key. Your product key can be found in your [Account Settings](https://app.ponder.io/account-settings) after you sign up for an account (following Step 0).

<img src="https://docs.ponder.io/_images/api_token.png" width="60%"></img>

In [None]:
import ponder
ponder.init(api_key="<Enter-Your-Product-Key-Here>")

If you are setting up Ponder on your own machine and prefer to go through a one-time setup process, check out the instructions [here](https://docs.ponder.io/getting_started/quickstart.html#step-2-login-to-authenticate). After the setup, you can run the following command to initialize Ponder without passing the key each time.

In [1]:
import ponder
ponder.init()
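
If you pass the key explicitly, one way to keep it out of a shared notebook is to read it from an environment variable. The snippet below is only a sketch of that idea: the variable name `PONDER_API_KEY` and the helper `get_product_key` are hypothetical names chosen for this example, not part of Ponder. Only `ponder.init(api_key=...)` comes from this guide.

```python
import os


def get_product_key(env_var="PONDER_API_KEY"):
    """Return the product key stored in an environment variable.

    The variable name is a hypothetical choice for this sketch,
    not a convention defined by Ponder.
    """
    key = os.environ.get(env_var, "").strip()
    if not key:
        raise RuntimeError(
            f"Set the {env_var} environment variable to your Ponder product key."
        )
    return key


# Usage (requires the ponder package installed in Step 1):
# import ponder
# ponder.init(api_key=get_product_key())
```

Failing early with a clear message is a design choice here: it avoids calling `ponder.init` with an empty string when the variable was never set.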

## Step 3: Connect to a Database

Next, configure your connection to BigQuery.

To establish a connection to BigQuery, we leverage Google Cloud’s [Python client for Google BigQuery](https://cloud.google.com/python/docs/reference/bigquery/latest/index.html).


In [2]:
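import json

from google.cloud import bigquery
from google.cloud.bigquery import dbapi
from google.oauth2 import service_account

# Load the service account key (adjust the path to your own credential file)
with open("../credential.json") as f:
    service_account_info = json.load(f)

credentials = service_account.Credentials.from_service_account_info(
    service_account_info,
    scopes=["https://www.googleapis.com/auth/bigquery"],
)

# Wrap a BigQuery client in a DB-API connection that Ponder can use
db_con = dbapi.Connection(bigquery.Client(credentials=credentials))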

For more information about how to set up the connection, please check out [this guide](https://docs.ponder.io/getting_started/connection.html).
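
A malformed or incomplete credential file is a common reason for the connection step to fail. As a rough sketch, the key file can be sanity-checked with the standard library before it is handed to the Google client. The helper name `load_service_account_info` and the list of fields checked here are assumptions made for this example, not an exhaustive validation.

```python
import json

# Fields that a Google service account key file is expected to contain.
# This is a minimal subset chosen for the sketch, not the full schema.
REQUIRED_FIELDS = ("type", "client_email", "private_key", "token_uri")


def load_service_account_info(path):
    """Read a service account key file and check that it looks complete."""
    with open(path) as f:
        info = json.load(f)
    missing = [field for field in REQUIRED_FIELDS if field not in info]
    if missing:
        raise ValueError(f"{path} is missing fields: {', '.join(missing)}")
    if info["type"] != "service_account":
        raise ValueError(
            f"{path} has type {info['type']!r}, expected 'service_account'"
        )
    return info
```

The returned dictionary can then be passed to `service_account.Credentials.from_service_account_info`, as in the cell above.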

## Step 4: Selecting Your Data Source


Ponder currently supports `read_csv` and `read_parquet` for operating on CSV and Parquet files, and `read_sql` for operating on tables that are already stored in BigQuery.
To go beyond ``read_sql``, we need to configure Ponder to use the BigQuery connection that we established earlier.

In [3]:
ponder.configure(default_connection=db_con, bigquery_dataset="TEST")

Then, we can use the ``read_csv`` command to feed in the file path to the CSV file.

In [4]:
import modin.pandas as pd

df = pd.read_csv("https://github.com/ponder-org/ponder-datasets/blob/main/books.csv?raw=True")

Your CSV data is now loaded into a temporary table in BigQuery. You can work on your DataFrame ``df`` just like you would with any pandas dataframe, with all the computation happening on BigQuery!

Now we can start hacking away with pandas! Note that every single operation you perform here with pandas is run directly on your database!

First, let's take a look at the basic statistics around the numerical columns in our dataset.

In [5]:
df.describe()

Unnamed: 0,bookID,average_rating,isbn13,num_pages,ratings_count,text_reviews_count
count,11123.0,11123.0,11123.0,11123.0,11123.0,11123.0
mean,21310.856963,3.934075,9759880000000.0,336.405556,17942.85,542.048099
std,13094.727252,0.350485,442975800000.0,241.152626,112499.2,2576.619589
min,1.0,0.0,8987060000.0,0.0,0.0,0.0
25%,10277.5,3.77,9780345000000.0,192.0,104.0,9.0
50%,20287.0,3.96,9780582000000.0,299.0,745.0,47.0
75%,32104.5,4.14,9780872000000.0,416.0,5000.5,238.0
max,45641.0,5.0,9790008000000.0,6576.0,4597666.0,94265.0


Let's say we want to normalize the numerical columns by doing a standard z-score normalization (where $\mu$ is the mean and $\sigma$ is the standard deviation). 

$$ x' = \frac{x-\mu}{\sigma}$$

In [6]:
x = df.select_dtypes(include='number').columns
(df[x] - df[x].mean())/df[x].std()

Unnamed: 0,bookID,average_rating,isbn13,num_pages,ratings_count,text_reviews_count
0,-1.627362,1.814412,0.046412,1.308692,18.469003,10.497845
1,-1.627285,1.586157,0.046411,2.212684,18.979913,11.130456
2,-1.627133,1.386434,0.046412,0.064666,-0.103199,-0.115674
3,-1.627056,1.785880,0.046412,0.408847,20.636974,13.887557
4,-1.626827,2.413581,0.046412,9.759771,0.208758,-0.146723
...,...,...,...,...,...,...
11118,1.857247,0.359287,0.048942,0.728147,-0.158107,-0.202610
11119,1.857400,0.416350,0.045736,1.238197,-0.152533,-0.188638
11120,1.857476,0.073968,0.045736,0.325912,-0.152204,-0.173502
11121,1.857858,-0.610797,0.045557,0.404700,-0.152658,-0.156425
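
As a quick sanity check of the formula, the same expression can be run on a tiny made-up table. The sketch below uses plain pandas so that it runs without a database connection (with Ponder you would import `modin.pandas` instead, as in the cells above). The toy values are invented for illustration and are not taken from the books dataset.

```python
import pandas as pd

# Toy stand-in for two numerical columns (made-up values).
toy = pd.DataFrame(
    {
        "num_pages": [100, 200, 300, 400],
        "average_rating": [3.5, 4.0, 4.5, 5.0],
    }
)

cols = toy.select_dtypes(include="number").columns
z = (toy[cols] - toy[cols].mean()) / toy[cols].std()

# After normalization, every column has mean ~0 and
# (sample) standard deviation ~1.
print(z.mean().round(6))
print(z.std().round(6))
```

Note that pandas' `std()` computes the sample standard deviation (`ddof=1`) by default, so the normalized columns have a sample standard deviation of 1, not a population standard deviation of 1.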


Next, let's look at all the columns that are non-numerical:

In [7]:
df.select_dtypes(include='object').head()

Unnamed: 0,title,authors,isbn,language_code,publication_date,publisher
0,Harry Potter and the Half-Blood Prince (Harry ...,J.K. Rowling/Mary GrandPré,0439785960,eng,9/16/2006,Scholastic Inc.
1,Harry Potter and the Order of the Phoenix (Har...,J.K. Rowling/Mary GrandPré,0439358078,eng,9/1/2004,Scholastic Inc.
2,Harry Potter and the Chamber of Secrets (Harry...,J.K. Rowling,0439554896,eng,11/1/2003,Scholastic
3,Harry Potter and the Prisoner of Azkaban (Harr...,J.K. Rowling/Mary GrandPré,043965548X,eng,5/1/2004,Scholastic Inc.
4,Harry Potter Boxed Set Books 1-5 (Harry Potte...,J.K. Rowling/Mary GrandPré,0439682584,eng,9/13/2004,Scholastic


Filter to all the books that are written by J.R.R. Tolkien.

In [19]:
df[df.authors.str.contains('J.R.R. Tolkien')]

Unnamed: 0,bookID,title,authors,average_rating,isbn,isbn13,language_code,num_pages,ratings_count,text_reviews_count,publication_date,publisher
21,30,J.R.R. Tolkien 4-Book Boxed Set: The Hobbit an...,J.R.R. Tolkien,4.59,0345538374,9780345538376,eng,1728,101233,1550,9/25/2012,Ballantine Books
22,31,The Lord of the Rings (The Lord of the Rings ...,J.R.R. Tolkien,4.5,0618517650,9780618517657,eng,1184,1710,91,10/21/2004,Houghton Mifflin Harcourt
23,34,The Fellowship of the Ring (The Lord of the Ri...,J.R.R. Tolkien,4.36,0618346252,9780618346257,eng,398,2128944,13670,9/5/2003,Houghton Mifflin Harcourt
24,35,The Lord of the Rings (The Lord of the Rings ...,J.R.R. Tolkien/Alan Lee,4.5,0618260587,9780618260584,en_US,1216,1618,140,10/1/2002,Houghton Mifflin Harcourt
721,2327,The Letters of J.R.R. Tolkien,J.R.R. Tolkien/Humphrey Carpenter/Christopher ...,4.15,0618056998,9780618056996,eng,502,4689,171,6/6/2000,Mariner Books
722,2329,The History of the Lord of the Rings (The Hist...,J.R.R. Tolkien/Christopher Tolkien,4.38,0618083553,9780618083558,en_US,1680,237,3,9/1/2000,Mariner Books
723,2330,The Languages of Tolkien's Middle-Earth,Ruth S. Noel/J.R.R. Tolkien,3.98,0395291305,9780395291306,eng,207,4685,74,5/28/1980,Houghton Mifflin Company
724,2331,The Lord of the Rings- 3 volumes set (The Lord...,J.R.R. Tolkien,4.5,0618574999,9780618574995,en_US,1438,232,9,6/1/2005,Mariner Books
725,2333,Farmer Giles of Ham,J.R.R. Tolkien/Christina Scull/Wayne G. Hammond,3.85,0618009361,9780618009367,eng,127,5526,225,11/15/1999,Houghton Mifflin Harcourt
1695,5898,The Lord of the Rings (The Lord of the Rings ...,J.R.R. Tolkien,4.5,0007136587,9780007136582,eng,1200,682,43,9/16/2002,Not Avail
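
One detail worth knowing about this filter: in pandas, `str.contains` treats its pattern as a regular expression by default, so each `.` in `'J.R.R. Tolkien'` matches any character. That is harmless for this dataset, but it can over-match in general. The sketch below shows the difference using plain pandas on made-up names. Whether Ponder pushes `regex=False` down to BigQuery in the same way has not been verified here, so treat that part as an assumption.

```python
import re

import pandas as pd

# Made-up author names; the second one is deliberately odd.
authors = pd.Series(["J.R.R. Tolkien", "JXRXRX Tolkien", "J.K. Rowling"])

# Default: the pattern is a regular expression, so "." matches any character.
as_regex = authors.str.contains("J.R.R. Tolkien")

# Literal match: either disable regex handling or escape the pattern.
literal = authors.str.contains("J.R.R. Tolkien", regex=False)
escaped = authors.str.contains(re.escape("J.R.R. Tolkien"))

print(as_regex.tolist())  # [True, True, False]
print(literal.tolist())   # [True, False, False]
print(escaped.tolist())   # [True, False, False]
```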


We see that there are 27 different languages represented by `language_code` in this dataset.

In [8]:
df.language_code.unique()

array(['eng', 'en-US', 'fre', 'spa', 'en-GB', 'mul', 'grc', 'enm',
       'en-CA', 'ger', 'jpn', 'ara', 'nl', 'zho', 'lat', 'por', 'srp',
       'ita', 'rus', 'msa', 'glg', 'wel', 'swe', 'nor', 'tur', 'gla',
       'ale'], dtype=object)

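The values in `language_code` will become part of column names when we one-hot encode this column in the next step. Since BigQuery doesn't [support special characters](https://cloud.google.com/bigquery/docs/schemas#column_names) such as `-` in column names, we clean up the `en-*` entries by replacing the `-` with an underscore.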

In [9]:
df.language_code = df.language_code.str.replace("-","_")

To feed this into a machine learning model, we want to [one-hot encode](https://en.wikipedia.org/wiki/One-hot) this categorical column into a set of binary features.

In [10]:
encoded_df = pd.get_dummies(df, columns=["language_code"])
encoded_df

Unnamed: 0,bookID,title,authors,average_rating,isbn,isbn13,num_pages,ratings_count,text_reviews_count,publication_date,...,language_code_nl,language_code_nor,language_code_por,language_code_rus,language_code_spa,language_code_srp,language_code_swe,language_code_tur,language_code_wel,language_code_zho
0,1,Harry Potter and the Half-Blood Prince (Harry ...,J.K. Rowling/Mary GrandPré,4.57,0439785960,9780439785969,652,2095690,27591,9/16/2006,...,False,False,False,False,False,False,False,False,False,False
1,2,Harry Potter and the Order of the Phoenix (Har...,J.K. Rowling/Mary GrandPré,4.49,0439358078,9780439358071,870,2153167,29221,9/1/2004,...,False,False,False,False,False,False,False,False,False,False
2,4,Harry Potter and the Chamber of Secrets (Harry...,J.K. Rowling,4.42,0439554896,9780439554893,352,6333,244,11/1/2003,...,False,False,False,False,False,False,False,False,False,False
3,5,Harry Potter and the Prisoner of Azkaban (Harr...,J.K. Rowling/Mary GrandPré,4.56,043965548X,9780439655484,435,2339585,36325,5/1/2004,...,False,False,False,False,False,False,False,False,False,False
4,8,Harry Potter Boxed Set Books 1-5 (Harry Potte...,J.K. Rowling/Mary GrandPré,4.78,0439682584,9780439682589,2690,41428,164,9/13/2004,...,False,False,False,False,False,False,False,False,False,False
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
11118,45631,Expelled from Eden: A William T. Vollmann Reader,William T. Vollmann/Larry McCaffery/Michael He...,4.06,1560254416,9781560254416,512,156,20,12/21/2004,...,False,False,False,False,False,False,False,False,False,False
11119,45633,You Bright and Risen Angels,William T. Vollmann,4.08,0140110879,9780140110876,635,783,56,12/1/1988,...,False,False,False,False,False,False,False,False,False,False
11120,45634,The Ice-Shirt (Seven Dreams #1),William T. Vollmann,3.96,0140131965,9780140131963,415,820,95,8/1/1993,...,False,False,False,False,False,False,False,False,False,False
11121,45639,Poor People,William T. Vollmann,3.72,0060878827,9780060878825,434,769,139,2/27/2007,...,False,False,False,False,False,False,False,False,False,False
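
To see more clearly what the cleanup and the encoding do, here is the same pair of steps on a four-row toy table. This is a minimal sketch in plain pandas with made-up titles. It shows how each distinct value of `language_code` becomes its own `language_code_<value>` column, which is why the hyphens had to be replaced first.

```python
import pandas as pd

toy = pd.DataFrame(
    {
        "title": ["A", "B", "C", "D"],  # made-up titles
        "language_code": ["eng", "en-US", "fre", "en-US"],
    }
)

# Replace "-" so the values are safe to use in column names.
toy["language_code"] = toy["language_code"].str.replace("-", "_")

# Each distinct language code becomes one binary column.
encoded = pd.get_dummies(toy, columns=["language_code"])

print(encoded.columns.tolist())
# ['title', 'language_code_en_US', 'language_code_eng', 'language_code_fre']
```

Each row has exactly one of the new columns set, so the original category can always be recovered from the encoded frame.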


Every single operation that you performed above is executed directly in your database! The only data pulled out of your database is the few lines of results that are printed in the notebook!

In this tutorial, we took a look at a quick example of how you can use pandas to work with data directly in your database. Next, take a look at the different ways you can [work with a data source in Ponder](https://docs.ponder.io/getting_started/reading.html). If you are looking to learn more about how you can use Ponder, check out [this tutorial series](https://app.ponder.io/resources).