# Tutorial 3: Connecting to Your Data Source

<div class="alert alert-block alert-info"> <b>Before we get started: </b> 
    <ul style="list-style-type: none;margin: 0;padding: 0;">
        <li>✍️ To run this notebook, you need to have Ponder installed and set up on your machine. If you have not done so already, please refer to our <a href="https://docs.ponder.io/getting_started/quickstart.html">Quickstart guide</a> to get started.</li> 
        <li>📁 This tutorial makes use of the <code>ponder.db</code> database that we created in <a href="https://github.com/ponder-org/ponder-notebooks/blob/main/duckdb/tutorial/01-getting-started.ipynb">Tutorial #1</a>. You can also download the file <a href="https://github.com/ponder-org/ponder-datasets/raw/main/ponder.db">here</a>.</li> 
        <li>📖 Otherwise, if you're just interested in browsing through the tutorial, keep reading below!</li>
    </ul>
</div>

In [1]:
import ponder; ponder.init()
import modin.pandas as pd
import duckdb
duckdb_con = duckdb.connect("../ponder.db")

2023-05-19 17:05:38 - Creating session 4OjWgQsoYnOH6i4mWuzmw8T0-H9KvEJAnyGGBRXt8u


2023-05-19 17:05:38,079 - authenticate_and_verify - INFO - Ponder package successfully imported


Before we can start our analysis, we first need to connect to a data source. Ponder currently supports `read_sql` for operating on tables that are already stored in DuckDB, and `read_csv` and `read_parquet` for operating on CSV and Parquet files.

## ``read_sql:`` Working with existing tables

To work with data stored in an existing DuckDB table, we call ``read_sql`` with the name of the table, ``PONDER_CUSTOMER``, and the connection object we created earlier.

In [2]:
df = pd.read_sql("PONDER_CUSTOMER", duckdb_con)

2023-05-19 17:05:39 - Ponder DataFrame successfully configured in DuckDB


Now that we have a Ponder DataFrame that points to the ``PONDER_CUSTOMER`` table in your database, you can work with ``df`` just as you would with any pandas dataframe, with all the computation happening in DuckDB!

In [3]:
df

Unnamed: 0,C_CUSTKEY,C_NAME,C_ADDRESS,C_NATIONKEY,C_PHONE,C_ACCTBAL,C_MKTSEGMENT,C_COMMENT
0,60001,Customer#000060001,9Ii4zQn9cX,14,24-678-784-9652,9957.56,HOUSEHOLD,l theodolites boost slyly at the platelets: pe...
1,60002,Customer#000060002,ThGBMjDwKzkoOxhz,15,25-782-500-8435,742.46,BUILDING,beans. fluffily regular packages
2,60003,Customer#000060003,"Ed hbPtTXMTAsgGhCr4HuTzK,Md2",16,26-859-847-7640,2526.92,BUILDING,fully pending deposits sleep quickly. blithely...
3,60004,Customer#000060004,"NivCT2RVaavl,yUnKwBjDyMvB42WayXCnky",10,20-573-674-7999,7975.22,AUTOMOBILE,furiously above the ironic packages. slyly br...
4,60005,Customer#000060005,"1F3KM3ccEXEtI, B22XmCMOWJMl",12,22-741-208-1316,2504.74,MACHINERY,express instructions sleep quickly. ironic bra...
...,...,...,...,...,...,...,...,...
95,60096,Customer#000060096,T9KQ0gc6NvnTSSsFkJOk,12,22-822-538-4011,4620.25,AUTOMOBILE,ial platelets wake carefully express theodolit...
96,60097,Customer#000060097,I55jg art2HQL8YEHwh8FgEx,21,31-526-630-1617,1626.61,FURNITURE,. even asymptotes sleep even dependencies. bli...
97,60098,Customer#000060098,"2y,ZeGm0u0 LYJ7waqsZkmWqmU8vn",0,10-972-910-3772,1449.68,AUTOMOBILE,al requests; packages cajole accounts; idly ev...
98,60099,Customer#000060099,Zc1GskAO8ANH8yGchAqhs31MrKzHbAlhpyy3,21,31-696-159-3613,8767.65,HOUSEHOLD,ns detect slyly quickly bold fox
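As a rough illustration of what "just like pandas" means, here is a minimal sketch of two everyday operations: a filter and a group-by aggregation. It uses plain pandas on a small in-memory frame built from the first five rows shown above, so that it runs without Ponder. In the notebook you would apply the same calls to ``df`` directly, assuming these operations are among those Ponder supports. The names ``sample``, ``high_balance``, and ``mean_by_segment`` are ours and are only for illustration.

```python
import pandas as pd  # plain pandas here; the notebook uses `import modin.pandas as pd`

# Two columns from the first five rows of the output above.
sample = pd.DataFrame(
    {
        "C_MKTSEGMENT": ["HOUSEHOLD", "BUILDING", "BUILDING", "AUTOMOBILE", "MACHINERY"],
        "C_ACCTBAL": [9957.56, 742.46, 2526.92, 7975.22, 2504.74],
    }
)

# Keep only customers whose account balance exceeds 2500.
high_balance = sample[sample["C_ACCTBAL"] > 2500]

# Average account balance per market segment.
mean_by_segment = sample.groupby("C_MKTSEGMENT")["C_ACCTBAL"].mean()

print(high_balance)
print(mean_by_segment)
```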


<div class="alert alert-block alert-info"> <b>Note: </b> <span> Unlike in pandas, the data ingestion (read_*) command in Ponder does not actually load the data into a dataframe in memory. Instead, you can think of the Ponder DataFrame as a pointer to the DuckDB table that stores the data: it relays all operations to DuckDB, where they are performed on that table. </span></div>
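To make the "pointer" idea concrete, here is a toy sketch. It is not how Ponder is implemented. It uses Python's built-in ``sqlite3`` module as a stand-in for DuckDB, and the ``TablePointer`` class and the ``customer`` table are hypothetical names made up for this example. The object stores only a table name and a connection, and each operation is translated into SQL that runs inside the database.

```python
import sqlite3


class TablePointer:
    """Toy stand-in for a lazy DataFrame: it holds a table name, not rows."""

    def __init__(self, con, table):
        self.con = con
        self.table = table  # only the name is stored; no rows are fetched here

    def __len__(self):
        # The row count is computed inside the database, not in Python.
        query = f'SELECT COUNT(*) FROM "{self.table}"'
        return self.con.execute(query).fetchone()[0]

    def mean(self, column):
        # The aggregation is relayed to the database as SQL.
        query = f'SELECT AVG("{column}") FROM "{self.table}"'
        return self.con.execute(query).fetchone()[0]


# Set up a small in-memory table to point at.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE customer (c_custkey INTEGER, c_acctbal REAL)")
con.executemany(
    "INSERT INTO customer VALUES (?, ?)",
    [(60001, 9957.56), (60002, 742.46), (60003, 2526.92)],
)

ptr = TablePointer(con, "customer")  # cheap: nothing is read yet
n_rows = len(ptr)                    # runs SELECT COUNT(*) in the database
avg_balance = ptr.mean("c_acctbal")  # runs SELECT AVG(...) in the database
print(n_rows, avg_balance)
con.close()
```

Only the single aggregated value travels back to Python in each call; the rows themselves stay in the database.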

To go beyond ``read_sql``, we first need to configure Ponder to use the DuckDB connection that we established earlier as its default connection.

In [4]:
ponder.configure(default_connection=duckdb_con)

## ``read_csv:`` Working with CSV files

With the default connection configured, we can use the ``read_csv`` command and pass in the path (or, as here, the URL) of the CSV file.

In [5]:
df = pd.read_csv("https://github.com/ponder-org/ponder-datasets/blob/main/tpch/orders.csv?raw=True", header=0)

2023-05-19 17:05:44 - Preparing table in DuckDB using CSV file...
2023-05-19 17:05:45 - Configuring Ponder DataFrame in DuckDB...
2023-05-19 17:05:45 - Ponder DataFrame successfully configured in DuckDB


Now that your data is loaded into a temporary table in your database and the Ponder DataFrame points to that table, you can work with ``df`` just as you would with any pandas dataframe, with all the computation happening in DuckDB!

## ``read_parquet:`` Working with Parquet files

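To work with Parquet files, use the ``read_parquet`` command and pass in the path to the file that you'd like to work with. Because the file in this example is read from a URL, we first install DuckDB's ``httpfs`` extension, which adds support for reading files over HTTP(S).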

In [6]:
duckdb_con.execute("INSTALL httpfs")

<duckdb.DuckDBPyConnection at 0x1740fd2b0>

In [7]:
df = pd.read_parquet("https://github.com/ponder-org/ponder-datasets/blob/main/userdatasample.parquet?raw=True", header=0)

2023-05-19 17:05:48 - Preparing table in DuckDB using Parquet file(s)...
2023-05-19 17:05:48 - Configuring Ponder DataFrame in DuckDB...
2023-05-19 17:05:49 - Ponder DataFrame successfully configured in DuckDB


In [8]:
df

Unnamed: 0,registration_dttm,id,first_name,last_name,email,gender,ip_address,cc,country,birthdate,salary,title,comments
0,2016-02-03 07:55:29,1,Amanda,Jordan,ajordan0@com.com,Female,1.197.201.2,6759521864920116,Indonesia,3/8/1971,49756.53,Internal Auditor,1E+02
1,2016-02-03 17:04:03,2,Albert,Freeman,afreeman1@is.gd,Male,218.111.175.34,,Canada,1/16/1968,150280.17,Accountant IV,
2,2016-02-03 01:09:31,3,Evelyn,Morgan,emorgan2@altervista.org,Female,7.161.136.94,6767119071901597,Russia,2/1/1960,144972.51,Structural Engineer,
3,2016-02-03 00:36:21,4,Denise,Riley,driley3@gmpg.org,Female,140.35.109.83,3576031598965625,China,4/8/1997,90263.05,Senior Cost Accountant,
4,2016-02-03 05:05:31,5,Carlos,Burns,cburns4@miitbeian.gov.cn,,169.113.235.40,5602256255204850,South Africa,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...
995,2016-02-03 10:30:59,996,Dennis,Harris,dharrisrn@eepurl.com,Male,178.180.111.236,374288806662929,Greece,7/8/1965,263399.54,Editor,
996,2016-02-03 17:16:53,997,Gloria,Hamilton,ghamiltonro@rambler.ru,Female,71.50.39.137,,China,4/22/1975,83183.54,VP Product Management,
997,2016-02-03 05:02:20,998,Nancy,Morris,nmorrisrp@ask.com,,6.188.121.221,3553564071014997,Sweden,5/1/1979,,Junior Executive,
998,2016-02-03 02:41:32,999,Annie,Daniels,adanielsrq@squidoo.com,Female,97.221.132.35,30424803513734,China,10/9/1991,18433.85,Editor,​


Ponder will automatically process your Parquet file and load it into a temporary table in your database for analysis.

In [9]:
duckdb_con.close()

## Summary

In this tutorial, we learned how to use the same pandas `pd.read_*` API to work with database tables, CSV files, and Parquet files.

In our [next tutorial](https://github.com/ponder-org/ponder-notebooks/blob/main/duckdb/tutorial/04-writing-data.ipynb), we will discuss how you can use `pd.to_*` to save your dataframes with Ponder.