# Tutorial 1: Getting Started

<div class="alert alert-block alert-info"> <b>Before we get started: </b> 
    <ul style="list-style-type: none;margin: 0;padding: 0;">
        <li>✍️ To run this notebook, you need to have Ponder installed and set up on your machine. If you have not done so already, please refer to our <a href="https://docs.ponder.io/getting_started/quickstart.html">Quickstart guide</a> to get started.</li>
        <li>📖 Otherwise, if you're just interested in browsing through the tutorial, keep reading below!</li>
    </ul>
</div>

In this tutorial, we will walk through how you can get started running pandas on Snowflake using Ponder.

### Snowflake Connection Credentials

To run Ponder on your data warehouse, you must first establish a database connection to your warehouse.
Please edit the `credential.py` file to fill in your connection information. We will use the same connection information throughout the tutorial series.

<div class="alert alert-block alert-info"> <b>Note: </b> <span> If you cannot find the Snowflake account information you need to set up your database connection, please follow our <a href="https://docs.ponder.io/resources/SnowflakeInfo.html">step-by-step guide</a> for more information. </span></div>
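
For orientation, a `credential.py` along the following lines would satisfy the connection code used later in this tutorial. This is a sketch, not the file shipped with the tutorial: the seven key names are the ones read from `credential.params` in the connection cell, while the `SNOWFLAKE_*` environment variable names and the `missing_keys` helper are hypothetical.

```python
import os

# Key names match the lookups on credential.params in the connection cell.
REQUIRED_KEYS = ("user", "password", "account", "role", "database", "schema", "warehouse")

# Hypothetical convention: read each value from an environment variable such as
# SNOWFLAKE_USER so that secrets stay out of the file itself.
params = {key: os.environ.get(f"SNOWFLAKE_{key.upper()}", "") for key in REQUIRED_KEYS}


def missing_keys(p):
    """Return the required keys that are absent or empty in p."""
    return [key for key in REQUIRED_KEYS if not p.get(key)]
```

Calling `missing_keys(params)` before connecting gives a quick list of any values that still need to be filled in.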

In [1]:
import os; os.chdir("..")
import credential

### Uploading Example Datasets

We will be using a few example datasets for the tutorial. You can run the Python script below to load them into your database. This adds the following tables, each populated with an example dataset:

- [PONDER_CITIBIKE](https://raw.githubusercontent.com/ponder-org/ponder-datasets/main/citibike_trial.csv)
- [PONDER_BOOKS](https://github.com/ponder-org/ponder-datasets/blob/main/books.csv)
- [PONDER_CUSTOMER](https://raw.githubusercontent.com/ponder-org/ponder-datasets/main/tpch/customer.csv)
- [PONDER_ORDER](https://raw.githubusercontent.com/ponder-org/ponder-datasets/main/tpch/orders.csv)
- [PONDER_PART](https://raw.githubusercontent.com/ponder-org/ponder-datasets/main/tpch/part.csv)
- [PONDER_SUPPLIER](https://raw.githubusercontent.com/ponder-org/ponder-datasets/main/tpch/supplier.csv)

Note that you only need to run the following script once for the tables to get populated.

In [None]:
!python populate_datasets.py

### Connecting to Snowflake

Ponder uses your data warehouse as an engine, so we need to establish a connection with Snowflake in order to start querying the data. The code below shows how you can configure the database connection.

In [2]:
import ponder
ponder.init()



In [3]:
import snowflake.connector
# Create a Snowflake connection from the values in credential.params
snowflake_con = snowflake.connector.connect(
    user=credential.params["user"],
    password=credential.params["password"],
    account=credential.params["account"],
    role=credential.params["role"],
    database=credential.params["database"],
    schema=credential.params["schema"],
    warehouse=credential.params["warehouse"]
)

### Starting Pondering 🎉

Now that the connection is initialized, let's read the **PONDER_BOOKS** table that already exists in your database. This dataset comes from the [Goodreads dataset from Kaggle](https://www.kaggle.com/datasets/jealousleopard/goodreadsbooks) and contains books and their review information.

In [5]:
import modin.pandas as pd

In [6]:
df = pd.read_sql("PONDER_BOOKS", snowflake_con)

Let's first print out the dataframe and take a look at the data. 

In [7]:
df

Unnamed: 0,bookID,title,authors,average_rating,isbn,isbn13,language_code,num_pages,ratings_count,text_reviews_count,publication_date,publisher
0,1,Harry Potter and the Half-Blood Prince (Harry ...,J.K. Rowling/Mary GrandPré,4.57,0439785960,9780439785969,eng,652,2095690,27591,9/16/2006,Scholastic Inc.
1,2,Harry Potter and the Order of the Phoenix (Har...,J.K. Rowling/Mary GrandPré,4.49,0439358078,9780439358071,eng,870,2153167,29221,9/1/2004,Scholastic Inc.
2,4,Harry Potter and the Chamber of Secrets (Harry...,J.K. Rowling,4.42,0439554896,9780439554893,eng,352,6333,244,11/1/2003,Scholastic
3,5,Harry Potter and the Prisoner of Azkaban (Harr...,J.K. Rowling/Mary GrandPré,4.56,043965548X,9780439655484,eng,435,2339585,36325,5/1/2004,Scholastic Inc.
4,8,Harry Potter Boxed Set Books 1-5 (Harry Potte...,J.K. Rowling/Mary GrandPré,4.78,0439682584,9780439682589,eng,2690,41428,164,9/13/2004,Scholastic
...,...,...,...,...,...,...,...,...,...,...,...,...
11114,45631,Expelled from Eden: A William T. Vollmann Reader,William T. Vollmann/Larry McCaffery/Michael He...,4.06,1560254416,9781560254416,eng,512,156,20,12/21/2004,Da Capo Press
11115,45633,You Bright and Risen Angels,William T. Vollmann,4.08,0140110879,9780140110876,eng,635,783,56,12/1/1988,Penguin Books
11116,45634,The Ice-Shirt (Seven Dreams #1),William T. Vollmann,3.96,0140131965,9780140131963,eng,415,820,95,8/1/1993,Penguin Books
11117,45639,Poor People,William T. Vollmann,3.72,0060878827,9780060878825,eng,434,769,139,2/27/2007,Ecco


Now we can start hacking away with pandas! Note that every operation you perform here with pandas is run directly on Snowflake.

First, let's take a look at the basic statistics for the numerical columns in our dataset.

In [8]:
df.describe()

Unnamed: 0,bookID,average_rating,isbn13,num_pages,ratings_count,text_reviews_count
count,11119.0,11119.0,11119.0,11119.0,11119.0,11119.0
mean,21308.966184,3.934135,9759873000000.0,336.439788,17948.32,542.167371
std,13093.071002,0.350384,443055400000.0,241.177969,112519.0,2577.069549
min,1.0,0.0,8987060000.0,0.0,0.0,0.0
25%,10277.5,3.77,9780345000000.0,192.0,104.0,9.0
50%,20287.0,3.96,9780586000000.0,299.0,745.0,47.0
75%,32103.5,4.135,9780873000000.0,416.0,5000.5,238.0
max,45641.0,5.0,9790008000000.0,6576.0,4597666.0,94265.0


Let's say we want to normalize the numerical columns by doing a standard z-score normalization (where $\mu$ is the mean and $\sigma$ is the standard deviation). 

$$ x' = \frac{x-\mu}{\sigma}$$

In [9]:
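# Names of the numerical columns
numeric_cols = df.select_dtypes(include='number').columns
# Subtract each column's mean and divide by its standard deviation
(df[numeric_cols] - df[numeric_cols].mean()) / df[numeric_cols].std()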

Unnamed: 0,bookID,average_rating,isbn13,num_pages,ratings_count,text_reviews_count
0,-1.627423,1.814764,0.046421,1.308412,18.465698,10.495965
1,-1.627347,1.586444,0.046420,2.212309,18.976518,11.128467
2,-1.627194,1.386663,0.046420,0.064518,-0.103230,-0.115700
3,-1.627118,1.786224,0.046420,0.408662,20.633287,13.885086
4,-1.626888,2.414107,0.046420,9.758604,0.208673,-0.146743
...,...,...,...,...,...,...
11114,1.857626,0.359219,0.048950,0.727928,-0.158127,-0.202621
11115,1.857779,0.416299,0.045744,1.237925,-0.152555,-0.188651
11116,1.857855,0.073818,0.045744,0.325735,-0.152226,-0.173518
11117,1.858237,-0.611144,0.045565,0.404515,-0.152679,-0.156444
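
To make the formula concrete, the same calculation can be done by hand on a few values. The sketch below uses only the Python standard library on the `num_pages` values of the first five books; it illustrates the arithmetic and says nothing about how Ponder executes the operation. `statistics.stdev` computes the sample standard deviation, which matches the pandas default for `std()` (`ddof=1`). Because the mean and standard deviation here come from five rows rather than the full table, the resulting numbers differ from the output above.

```python
from statistics import mean, stdev

# Page counts of the first five books shown in the dataframe above.
num_pages = [652, 870, 352, 435, 2690]

mu = mean(num_pages)      # the mean, mu in the formula
sigma = stdev(num_pages)  # sample standard deviation (divides by n - 1)

# Apply x' = (x - mu) / sigma to every value.
z_scores = [(x - mu) / sigma for x in num_pages]

# By construction, the normalized values have mean 0 and
# sample standard deviation 1 (up to floating-point error).
z_mean = mean(z_scores)
z_std = stdev(z_scores)
```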


Next, let's look at all the columns that are non-numerical:

In [10]:
df.select_dtypes(include='object').head()

Unnamed: 0,title,authors,isbn,language_code,publication_date,publisher
0,Harry Potter and the Half-Blood Prince (Harry ...,J.K. Rowling/Mary GrandPré,0439785960,eng,9/16/2006,Scholastic Inc.
1,Harry Potter and the Order of the Phoenix (Har...,J.K. Rowling/Mary GrandPré,0439358078,eng,9/1/2004,Scholastic Inc.
2,Harry Potter and the Chamber of Secrets (Harry...,J.K. Rowling,0439554896,eng,11/1/2003,Scholastic
3,Harry Potter and the Prisoner of Azkaban (Harr...,J.K. Rowling/Mary GrandPré,043965548X,eng,5/1/2004,Scholastic Inc.
4,Harry Potter Boxed Set Books 1-5 (Harry Potte...,J.K. Rowling/Mary GrandPré,0439682584,eng,9/13/2004,Scholastic


We can look at the number of distinct values in each of these non-numerical columns:

In [11]:
df.select_dtypes(include='object').nunique()

title               10344
authors              6635
isbn                11119
language_code          27
publication_date     3675
publisher            2289
Name: 0, dtype: int32

We see that there are 27 different languages represented by `language_code` in this dataset.

To feed this into a machine learning model, we want to [one-hot encode](https://en.wikipedia.org/wiki/One-hot) this categorical column into a set of binary features.

In [12]:
encoded_df = pd.get_dummies(df, columns=["language_code"])
encoded_df

Unnamed: 0,bookID,title,authors,average_rating,isbn,isbn13,num_pages,ratings_count,text_reviews_count,publication_date,...,language_code_nl,language_code_nor,language_code_por,language_code_rus,language_code_spa,language_code_srp,language_code_swe,language_code_tur,language_code_wel,language_code_zho
0,1,Harry Potter and the Half-Blood Prince (Harry ...,J.K. Rowling/Mary GrandPré,4.57,0439785960,9780439785969,652,2095690,27591,9/16/2006,...,0,0,0,0,0,0,0,0,0,0
1,2,Harry Potter and the Order of the Phoenix (Har...,J.K. Rowling/Mary GrandPré,4.49,0439358078,9780439358071,870,2153167,29221,9/1/2004,...,0,0,0,0,0,0,0,0,0,0
2,4,Harry Potter and the Chamber of Secrets (Harry...,J.K. Rowling,4.42,0439554896,9780439554893,352,6333,244,11/1/2003,...,0,0,0,0,0,0,0,0,0,0
3,5,Harry Potter and the Prisoner of Azkaban (Harr...,J.K. Rowling/Mary GrandPré,4.56,043965548X,9780439655484,435,2339585,36325,5/1/2004,...,0,0,0,0,0,0,0,0,0,0
4,8,Harry Potter Boxed Set Books 1-5 (Harry Potte...,J.K. Rowling/Mary GrandPré,4.78,0439682584,9780439682589,2690,41428,164,9/13/2004,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
11114,45631,Expelled from Eden: A William T. Vollmann Reader,William T. Vollmann/Larry McCaffery/Michael He...,4.06,1560254416,9781560254416,512,156,20,12/21/2004,...,0,0,0,0,0,0,0,0,0,0
11115,45633,You Bright and Risen Angels,William T. Vollmann,4.08,0140110879,9780140110876,635,783,56,12/1/1988,...,0,0,0,0,0,0,0,0,0,0
11116,45634,The Ice-Shirt (Seven Dreams #1),William T. Vollmann,3.96,0140131965,9780140131963,415,820,95,8/1/1993,...,0,0,0,0,0,0,0,0,0,0
11117,45639,Poor People,William T. Vollmann,3.72,0060878827,9780060878825,434,769,139,2/27/2007,...,0,0,0,0,0,0,0,0,0,0


We now select only the columns whose names match "language". This leaves all the converted binary columns, which together are often referred to as the indicator matrix and can serve as input to a machine learning model.

In [13]:
indicator_matrix = encoded_df.filter(regex="language")
indicator_matrix

Unnamed: 0,language_code_ale,language_code_ara,language_code_en-CA,language_code_en-GB,language_code_en-US,language_code_eng,language_code_enm,language_code_fre,language_code_ger,language_code_gla,...,language_code_nl,language_code_nor,language_code_por,language_code_rus,language_code_spa,language_code_srp,language_code_swe,language_code_tur,language_code_wel,language_code_zho
0,0,0,0,0,0,1,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,1,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,1,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,1,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,1,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
11114,0,0,0,0,0,1,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
11115,0,0,0,0,0,1,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
11116,0,0,0,0,0,1,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
11117,0,0,0,0,0,1,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
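
One detail worth knowing about the `filter` call: in pandas, `DataFrame.filter(regex=...)` keeps the labels for which `re.search` finds a match, so the pattern can match anywhere in a column name. The sketch below reproduces that matching rule with the standard library, assuming Ponder follows the same pandas semantics. The `matching` helper and the `original_language` column are hypothetical and only serve to show the difference between a loose and an anchored pattern.

```python
import re

# A few column names from encoded_df, plus one hypothetical column
# ("original_language") that does not exist in the dataset.
columns = ["bookID", "title", "language_code_eng", "language_code_spa", "original_language"]


def matching(pattern, names):
    """Keep the names in which the pattern matches anywhere (re.search)."""
    return [name for name in names if re.search(pattern, name)]


# The loose pattern also picks up the unrelated column.
loose = matching("language", columns)

# Anchoring on the prefix keeps only the one-hot encoded columns.
strict = matching(r"^language_code_", columns)
```

In this dataset the only columns containing "language" are the encoded ones, so both patterns would select the same indicator matrix; the anchored form simply guards against unrelated columns being added later.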


### Summary

In this tutorial, we saw how you can get started running common data science operations in pandas directly on the `PONDER_BOOKS` table in Snowflake.

That means that every single operation you performed in this tutorial was executed directly in your data warehouse! The only data pulled out of the warehouse is the few lines of results printed in the notebook!

Note that if you were to write the equivalent SQL queries to run these pandas commands on Snowflake, it would take many lines of code to express the same logic. If you're interested in learning why, check out this [blog post](https://ponder.io/pandas-vs-sql-part-2-pandas-is-more-concise/#:~:text=the%20window%20function.-,Conclusion,and%20dropping%20sparsely%20populated%20features.).
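
As a rough illustration of that verbosity, the sketch below writes the one-hot encoding step as SQL. It makes several assumptions: SQLite (from the Python standard library) stands in for Snowflake, the `books` table holds four made-up rows rather than the real dataset, and the query is hand-written. The SQL that Ponder actually generates is not shown in this tutorial and may look quite different.

```python
import sqlite3

# In-memory SQLite database with a toy table (made-up rows, not the real data).
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE books (bookID INTEGER, language_code TEXT)")
con.executemany(
    "INSERT INTO books VALUES (?, ?)",
    [(1, "eng"), (2, "spa"), (3, "eng"), (4, "fre")],
)

# SQL needs one CASE expression per distinct value, so the categories
# must be looked up first and the query text generated from them.
codes = [
    row[0]
    for row in con.execute(
        "SELECT DISTINCT language_code FROM books ORDER BY language_code"
    )
]

# Values are interpolated directly because this toy data is trusted;
# real code should not build SQL from untrusted strings this way.
case_lines = [
    f"    CASE WHEN language_code = '{code}' THEN 1 ELSE 0 END AS \"language_code_{code}\""
    for code in codes
]
query = "SELECT\n    bookID,\n" + ",\n".join(case_lines) + "\nFROM books ORDER BY bookID"

rows = con.execute(query).fetchall()
```

With three distinct codes the generated query already contains three `CASE` expressions; for the 27 language codes in `PONDER_BOOKS` it would contain 27, compared with the single `pd.get_dummies` call used in this tutorial.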

In our next tutorial, we will share more details on how Ponder works and how you can leverage Ponder to scale up your data science workflow!
