# Tutorial 1: Getting Started

In this tutorial, we will walk through how you can get started running pandas on BigQuery using Ponder.

### BigQuery Connection Credentials

To run Ponder on your data warehouse, you must first establish a database connection to it. 
Please edit the `credential.py` file to fill in your connection information. We will use the same connection information throughout the tutorial series. 

<div class="alert alert-block alert-info"> <b>Note: </b> <span> If you cannot find the BigQuery account information you need to set up your database connection, please follow our <a href="https://docs.ponder.io/resources/BigQueryInfo.html">step-by-step guide</a> for more information. </span></div>

In [None]:
import os; os.chdir("..")
import credential
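
The `import credential` statement above expects a module that exposes a dictionary named `params`. As a minimal sketch, such a module might look like the following. The key names are taken from the lookups in the connection code later in this tutorial. Every value is a placeholder (an assumption for illustration), and what each entry should contain depends on your own account.

```python
# Sketch of a credential module. All values are placeholders, not real
# settings; replace each one with the information for your own account.
params = {
    "user": "<your-user>",
    "password": "<your-password>",
    "account": "<your-account>",
    "role": "<your-role>",
    "database": "<your-database>",
    "schema": "<your-schema>",
    "warehouse": "<your-warehouse>",
}

# Quick sanity check that no placeholder was left empty.
missing = [key for key, value in params.items() if not value]
print("Missing entries:", missing)
```

Keeping these values in a separate module means the notebooks never contain secrets directly. Avoid committing the filled-in file to version control.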

### Uploading Example Datasets

We will be using a few example datasets for the tutorial. You can run this Python script to load the required datasets into your database. This adds three tables populated with example data: 
- [PONDER_TAXI](https://raw.githubusercontent.com/ponder-org/ponder-datasets/main/yellow_tripdata_2015-01.csv)
- [PONDER_CITIBIKE](https://raw.githubusercontent.com/ponder-org/ponder-datasets/main/citibike_trial.csv)
- [PONDER_BOOKS](https://github.com/ponder-org/ponder-datasets/blob/main/books.csv)

Note that you only need to run the following script once for the tables to get populated.

In [None]:
!python populate_datasets.py > /dev/null 2>&1

### Connecting to BigQuery

Ponder uses your data warehouse as an engine, so we need to establish a connection with BigQuery in order to start querying the data. The code below shows how you can configure the database connection.

In [None]:
import ponder.bigquery

# Create a Ponder BigQuery Connections object
bigquery_con = ponder.bigquery.connect(
    user=credential.params["user"],
    password=credential.params["password"],
    account=credential.params["account"],
    role=credential.params["role"],
    database=credential.params["database"],
    schema=credential.params["schema"],
    warehouse=credential.params["warehouse"]
)
# Initialize the BigQuery connection
ponder.bigquery.init(bigquery_con, enable_ssl=True)

If you have successfully established the connection, you should see the following output

### Starting Pondering 🎉

Now that we have the connection initialized, let's read the **PONDER_BOOKS** table that already exists in your database. This dataset comes from the [Goodreads dataset from Kaggle](https://www.kaggle.com/datasets/jealousleopard/goodreadsbooks) and contains books and their review information.

In [None]:
import modin.pandas as pd

In [None]:
df = pd.read_sql("PONDER_BOOKS", bigquery_con)

Let's first print out the dataframe and take a look at the data. 

In [None]:
df

Now we can start hacking away with pandas! Note that every single operation you perform here with pandas runs directly on BigQuery.

First, let's take a look at the basic statistics around the numerical columns in our dataset.

In [None]:
df.describe()

Let's say we want to normalize the numerical columns by doing a standard z-score normalization (where $\mu$ is the mean and $\sigma$ is the standard deviation). 

$$ x' = \frac{x-\mu}{\sigma}$$

In [None]:
x = df.select_dtypes(include='number').columns
(df[x] - df[x].mean())/df[x].std()
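
To make the formula concrete, here is a small worked sketch in plain Python for a single column. The `ratings` values and the `z_scores` helper are made up for illustration and are not part of the Goodreads data or of Ponder. It uses the sample standard deviation (dividing by n - 1), which is also the pandas default for `.std()`.

```python
from statistics import mean, stdev


def z_scores(values):
    """Return (x - mean) / std for each x, using the sample standard deviation."""
    mu = mean(values)
    sigma = stdev(values)  # sample standard deviation: divides by n - 1
    return [(x - mu) / sigma for x in values]


# Made-up example values standing in for one numerical column.
ratings = [1.0, 2.0, 3.0, 4.0, 5.0]
normalized = z_scores(ratings)

print([round(z, 3) for z in normalized])
```

After normalization the values have a mean of 0 and a sample standard deviation of 1, which is what the dataframe expression does for every numerical column at once.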

Next, let's look at all the columns that are non-numerical:

In [None]:
df.select_dtypes(include='object').head()

We can look at the number of distinct values in each of these non-numerical columns:

In [None]:
df.select_dtypes(include='object').nunique()

We see that there are 27 different languages represented by `language_code` in this dataset.

To feed this into a machine learning model, we want to [one-hot encode](https://en.wikipedia.org/wiki/One-hot) this categorical column into a set of binary features. 

In [None]:
encoded_df = pd.get_dummies(df, columns=["language_code"])
encoded_df

We keep only the columns whose names match "language". This leaves all the converted binary columns, which together are often called the indicator matrix. This matrix can serve as input to a machine learning model. 

In [None]:
indicator_matrix = encoded_df.filter(regex="language")
indicator_matrix
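
The encoding and filtering steps can also be tried on a tiny, made-up dataframe. The sketch below uses plain pandas (an assumption made so that it runs without a warehouse connection), and `toy_df` with its values is invented for illustration. Note that pandas expects `columns` to be list-like, so the column name is passed inside a list.

```python
import pandas as pd

# Made-up rows standing in for the books table.
toy_df = pd.DataFrame(
    {
        "title": ["A", "B", "C", "D"],
        "language_code": ["eng", "spa", "eng", "fre"],
    }
)

# One-hot encode the categorical column: one new column per distinct value,
# named "<original column>_<value>".
toy_encoded = pd.get_dummies(toy_df, columns=["language_code"])

# Keep only the generated indicator columns.
toy_indicator = toy_encoded.filter(regex="language")

print(list(toy_indicator.columns))
print(toy_indicator.sum(axis=1).tolist())
```

Each row has exactly one active indicator, because every book has exactly one language code.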

### Summary

In this tutorial, we saw how to run common data science operations in pandas directly on the `PONDER_BOOKS` table in BigQuery.

Every single operation that you performed in this tutorial was executed directly in your data warehouse! The only data pulled out of the warehouse is the few lines of results printed in the notebook!

Note that if you were to write the equivalent SQL query to run these pandas commands on BigQuery, it would take many lines of code to express the same query. If you're interested in learning why, check out this [blog post](https://ponder.io/pandas-vs-sql-part-2-pandas-is-more-concise/#:~:text=the%20window%20function.-,Conclusion,and%20dropping%20sparsely%20populated%20features.).

In our next tutorial, we will share more details on how Ponder works and how you can leverage Ponder to scale up your data science workflow!
