# Tutorial 3: Connecting to Your Data Source

In [1]:
import ponder.bigquery
import modin.pandas as pd
import json
import os

os.chdir("..")  # credential.json is read from the parent directory
with open("credential.json") as f:
    creds = json.load(f)
bigquery_con = ponder.bigquery.connect(creds, schema="TEST")
ponder.bigquery.init(bigquery_con)

2023-03-22 21:02:30,149 - INFO - Establishing connection to pushdown.ponder-internal.io



Connected to
       ___               __
      / _ \___  ___  ___/ /__ ____
     / ___/ _ \/ _ \/ _  / -_) __/
    /_/___\___/_//_/\_,_/\__/_/
      / __/__ _____  _____ ____
     _\ \/ -_) __/ |/ / -_) __/
    /___/\__/_/  |___/\__/_/



Before we can start our analysis, we first need to connect to a data source. Ponder currently supports `read_csv` for operating on CSV files and `read_sql` for operating on tables that are already stored in BigQuery.
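The setup cell at the top loads a credentials file and passes its contents to `ponder.bigquery.connect`. As a minimal sketch, assuming `credential.json` is a standard Google Cloud service-account key file (the tutorial does not say so explicitly), a small helper can fail early with a readable message when the file is incomplete. The name `load_credentials` is hypothetical and not part of Ponder.

```python
import json
from pathlib import Path

# Fields found in a standard Google Cloud service-account key file.
# Assumption: credential.json follows this format.
REQUIRED_KEYS = ("type", "project_id", "private_key", "client_email")


def load_credentials(path):
    """Load a JSON key file and fail early if expected fields are missing."""
    creds = json.loads(Path(path).read_text(encoding="utf-8"))
    missing = [key for key in REQUIRED_KEYS if key not in creds]
    if missing:
        raise ValueError(f"{path} is missing fields: {', '.join(missing)}")
    return creds
```

The returned dictionary could then be passed to the connect call in place of the raw `json.load` result.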

## ``read_sql``: Working with existing tables

To work with data stored in an existing BigQuery table, use the ``read_sql`` command: provide the name of the table, ``PONDER_CUSTOMER``, and pass the connection object ``bigquery_con`` that we created earlier as the connection parameter.

In [2]:
df = pd.read_sql("PONDER_CUSTOMER", bigquery_con)

Now that we have a Ponder DataFrame that points to the ``PONDER_CUSTOMER`` table in your data warehouse, you can work with ``df`` just as you would with any pandas DataFrame, with all the computation happening in your warehouse!

In [3]:
df

Unnamed: 0,C_CUSTKEY,C_NAME,C_ADDRESS,C_NATIONKEY,C_PHONE,C_ACCTBAL,C_MKTSEGMENT,C_COMMENT
0,60082,Customer#000060082,"x3V6vEbLSeUjYdjS1MvR2,u4gB0S 9d8UEJ",0,10-729-863-1818,3645.47,BUILDING,the accounts. furiously unusual
1,60080,Customer#000060080,"g7cKdEj2mzUQLSKFFnWsmL,3GaOIrBmfi",0,10-192-161-6631,689.24,BUILDING,"slyly pending, permanent packages. special fo..."
2,60018,Customer#000060018,lQ8PB9FGW53C36XQX2uq0,0,10-310-354-8579,5759.83,BUILDING,ckly bold deposits. carefully bold accounts in...
3,60062,Customer#000060062,"1SI,x4F9 zO22 F7OGksMBSUWu5AUpP",0,10-604-525-3386,6210.99,FURNITURE,ons cajole blithely. bold theodolites along
4,60022,Customer#000060022,"I2XoZQLC,63R3zIG z6i3VMCS",0,10-513-498-1045,-759.74,FURNITURE,across the blithely ironic sentiments. thinly...
...,...,...,...,...,...,...,...,...
95,60058,Customer#000060058,"X9NS,0Ddki",23,33-146-680-6559,6672.12,MACHINERY,ess requests. special requests wake blit
96,60079,Customer#000060079,dwwsJWhDr0fnRJnyhe6gtls,24,34-197-192-3607,3329.55,BUILDING,ly special somas poach carefully. furiously un...
97,60059,Customer#000060059,"dZISBokE9NWaz13 b5WbOHrd8DifA,e2yict0",24,34-348-323-9173,2337.46,HOUSEHOLD,ndencies. excuses sleep. quickly daring dugout...
98,60033,Customer#000060033,fwvb5ua8ZcB,24,34-142-708-2404,-493.59,MACHINERY,lithely final packages. quickly regular reques...


<div class="alert alert-block alert-info"> <b>Note: </b> <span> Unlike in pandas, the data ingestion (read_*) commands in Ponder do not actually load the data into an in-memory dataframe. Instead, you can think of the Ponder DataFrame as a pointer to the BigQuery table that stores the data: it relays every operation to BigQuery, where the operation is performed on the table. </span></div>
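To make the pointer idea in the note concrete, here is a toy sketch. It is not Ponder's implementation or API: the `TablePointer` class and its methods are hypothetical, and an in-memory SQLite database stands in for BigQuery. The object only records which operations were requested, and the database does the work when results are fetched.

```python
import sqlite3


class TablePointer:
    """Toy stand-in for a warehouse-backed DataFrame (hypothetical, not Ponder's API)."""

    def __init__(self, con, table, filters=()):
        self.con = con
        self.table = table
        self.filters = tuple(filters)  # (column, value) pairs, not yet executed

    def where_equals(self, column, value):
        # Record the operation and return a new pointer; no rows are read here.
        return TablePointer(self.con, self.table, self.filters + ((column, value),))

    def to_sql(self):
        # Translate the recorded operations into a single SQL query.
        sql = f'SELECT * FROM "{self.table}"'
        if self.filters:
            sql += " WHERE " + " AND ".join(f'"{col}" = ?' for col, _ in self.filters)
        return sql

    def fetch(self):
        # Only now does the database do any work.
        params = [value for _, value in self.filters]
        return self.con.execute(self.to_sql(), params).fetchall()


con = sqlite3.connect(":memory:")  # SQLite stands in for BigQuery
con.execute("CREATE TABLE PONDER_CUSTOMER (C_CUSTKEY INTEGER, C_MKTSEGMENT TEXT)")
con.executemany(
    "INSERT INTO PONDER_CUSTOMER VALUES (?, ?)",
    [(60082, "BUILDING"), (60062, "FURNITURE"), (60058, "MACHINERY")],
)

customers = TablePointer(con, "PONDER_CUSTOMER")
furniture = customers.where_equals("C_MKTSEGMENT", "FURNITURE")
print(furniture.to_sql())  # SELECT * FROM "PONDER_CUSTOMER" WHERE "C_MKTSEGMENT" = ?
print(furniture.fetch())   # [(60062, 'FURNITURE')]
```

Creating `furniture` touches no data. The filter runs inside the database when `fetch` is called, which is the same division of labor the note describes between a Ponder DataFrame and BigQuery.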

## ``read_csv``: Working with CSV files

### Working with remote CSV files
To work with ``CSV`` files, pass the path of the CSV file to the ``read_csv`` command. If the path is remote (e.g., a path to S3 or GCS, or a public dataset URL), you can enter it directly as follows. Ponder will automatically process your CSV file and load it into a temporary table in your data warehouse account for analysis.

In [4]:
df = pd.read_csv("https://github.com/ponder-org/ponder-datasets/blob/main/tpch/orders.csv?raw=True", header=0)

2023-03-22 21:02:48,447 - INFO - Determining schema for file
2023-03-22 21:02:48,758 - INFO - Finished determining schema for file


Creating table in BigQuery...
Finished creating table in BigQuery...
Finished loading data into BigQuery.
Dataframe loading complete.


Now that your data is loaded into a temporary table in your data warehouse and the Ponder DataFrame points to that table, you can work with ``df`` just as you would with any pandas DataFrame, with all the computation happening in your warehouse!

In [5]:
df

Unnamed: 0,O_ORDERKEY,O_CUSTKEY,O_ORDERSTATUS,O_TOTALPRICE,O_ORDERDATE,O_ORDERPRIORITY,O_CLERK,O_SHIPPRIORITY,O_COMMENT
0,603014,60040,O,102891.88,2/15/1998,5-LOW,Clerk#000000337,0,egular theodolites. always special ideas sleep...
1,611105,60011,F,85107.80,6/28/1992,2-HIGH,Clerk#000000423,0,platelets; dependencies
2,612353,60085,O,174365.24,9/3/1997,5-LOW,Clerk#000000685,0,g pending pinto beans according to the deposit...
3,613283,60002,F,81616.21,10/6/1992,2-HIGH,Clerk#000000298,0,ly unusual requests wake furiously atop the pa...
4,617699,60022,F,253288.15,10/6/1994,5-LOW,Clerk#000000421,0,ly express excuses sleep furiously packages. s...
...,...,...,...,...,...,...,...,...,...
140,242343,60067,O,130940.07,2/24/1996,1-URGENT,Clerk#000000792,0,ackages haggle fluffily against
141,242722,60064,F,82821.03,4/7/1992,5-LOW,Clerk#000000514,0,according to the silent
142,243297,60085,O,279667.51,7/16/1997,4-NOT SPECIFIED,Clerk#000000985,0,. regularly special packages
143,244579,60085,P,159397.20,6/11/1995,4-NOT SPECIFIED,Clerk#000000404,0,realms haggle blithely slyly permanent ideas. ...
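The log output of `read_csv` above mentions "Determining schema for file" before the table is created. The tutorial does not describe how Ponder does this, so the following is only a toy sketch of what schema inference on a CSV sample can look like, using the standard library. The functions `infer_type` and `infer_schema` and the type names are hypothetical, and the sample rows are taken from the orders table shown above.

```python
import csv
import io


def infer_type(values):
    """Return the narrowest of INTEGER, FLOAT, STRING that fits every value."""
    for name, cast in (("INTEGER", int), ("FLOAT", float)):
        try:
            for value in values:
                cast(value)
            return name
        except ValueError:
            continue  # at least one value did not parse; try the next type
    return "STRING"


def infer_schema(csv_text):
    rows = list(csv.reader(io.StringIO(csv_text)))
    header, data = rows[0], rows[1:]
    columns = zip(*data)  # transpose rows into columns
    return {name: infer_type(col) for name, col in zip(header, columns)}


sample = """O_ORDERKEY,O_ORDERSTATUS,O_TOTALPRICE,O_ORDERDATE
603014,O,102891.88,2/15/1998
611105,F,85107.80,6/28/1992
"""
print(infer_schema(sample))
# {'O_ORDERKEY': 'INTEGER', 'O_ORDERSTATUS': 'STRING', 'O_TOTALPRICE': 'FLOAT', 'O_ORDERDATE': 'STRING'}
```

This sketch leaves dates as strings. A real loader may well detect date columns, handle missing values, and sample only part of a large file.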


### Working with your own local CSV files

If you have a CSV file locally that you want to analyze with Ponder, we provide an interface that allows you to stage the file for analysis.
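Whether a file needs staging depends on whether its path is already remote. As a minimal sketch, the check below classifies a path by its URL scheme. The helper `is_remote_path` and the list of schemes are assumptions made for illustration and are not part of Ponder.

```python
from urllib.parse import urlparse

# URL schemes treated as remote in this sketch (an assumption, not Ponder's list).
REMOTE_SCHEMES = {"http", "https", "s3", "gs"}


def is_remote_path(path):
    """Return True if the path looks like a URL that can be read without staging."""
    return urlparse(path).scheme.lower() in REMOTE_SCHEMES


print(is_remote_path("https://github.com/ponder-org/ponder-datasets/blob/main/tpch/orders.csv?raw=True"))  # True
print(is_remote_path("supplier.csv"))  # False
```

A plain file name such as `supplier.csv` has no scheme, so it is classified as local and would go through the upload and staging steps described next.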

**1. Uploading to Ponder:** If you have a CSV file on your local machine, you must first upload it through the notebook interface. You can upload files to your Jupyter directory using the file upload functionality provided by Jupyter Notebook.

<img src="https://docs.ponder.io/_images/upload2.png" width="50%"></img>

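**2. Staging the CSV file to a remote path:** After uploading your file to the Jupyter directory, you need to stage it to a remote path so that ``read_csv`` can access it. In the example below, the first cell downloads a sample ``supplier.csv`` into the Jupyter directory in place of an uploaded file, and the second cell stages it: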

In [6]:
!wget -q "https://raw.githubusercontent.com/ponder-org/ponder-datasets/main/tpch/supplier.csv"

In [7]:
from ponder.utils.core import Teleporter
t = Teleporter()
remote_path = t.depulso("supplier.csv")

2023-03-22 21:03:00,979 - depulso - INFO - Compression took 0.0036573410034179688s
2023-03-22 21:03:00,985 - depulso - INFO - Establishing connection to remot host
2023-03-22 21:03:00,995 - depulso - INFO - Connection to remote host establised
2023-03-22 21:03:01,832 - depulso - INFO - Transfer took 0.8470602035522461s


**3. Reading your CSV file with ``read_csv``:** Once the file is staged to ``remote_path``, you can load it via ``pd.read_csv`` as usual.

In [8]:
df = pd.read_csv(remote_path, header=0)

2023-03-22 21:03:02,228 - INFO - Determining schema for file
2023-03-22 21:03:02,270 - INFO - Finished determining schema for file


Creating table in BigQuery...
Finished creating table in BigQuery...
Finished loading data into BigQuery.
Dataframe loading complete.


In [9]:
df

Unnamed: 0,S_SUPPKEY,S_NAME,S_ADDRESS,S_NATIONKEY,S_PHONE,S_ACCTBAL,S_COMMENT
0,2,Supplier#000000002,"89eJ5ksX3ImxJQBvxObC,",5,15-679-861-2259,4032.68,slyly bold instructions. idle dependen
1,3,Supplier#000000003,"q1,G3Pj6OjIuUYfUoH18BFTKP5aU9bEV3",1,11-383-516-1199,4192.40,blithely silent requests after the express dep...
2,4,Supplier#000000004,Bk7ah4CK8SYQTepEmvMkkgMwg,15,25-843-787-7479,4641.08,riously even requests above the exp
3,6,Supplier#000000006,tQxuVm7s7CnK,14,24-696-997-4969,1365.79,final accounts. regular dolphins use against t...
4,8,Supplier#000000008,9Sq4bBH2FQEmaFOocY45sRTxo6yuoG,17,27-498-742-3860,7627.85,al pinto beans. asymptotes haggl
...,...,...,...,...,...,...,...
3250,1982,Supplier#000001982,q5g5cl4V2Ssk6vsVTtPFBo8lT8gLcQrbojDyGsN,14,24-307-672-7764,2518.95,ss accounts. furiously bold accounts affix sly...
3251,1985,Supplier#000001985,iNpX5StxnUW8DlgToWvv9kZ Uk,24,34-968-184-3570,9542.91,sly regular dependencies against the bli
3252,1986,Supplier#000001986,"D2d8InHEo5MjZHcD,Ru",9,19-165-166-7955,5721.91,regular deposits wake at the silent asymptote...
3253,1990,Supplier#000001990,"DSDJkCgBJzuPg1yuM,CUdLnsRliOxkkHezTCA",3,13-430-427-6190,204.32,instructions use at the quickly regular packag...
