# Step 3: Data Access & Exploration
Now for the really fun part: let's connect to our Starburst instance and pull data from multiple, distinct data sources as if they were all part of the same warehouse. 

## Environment Setup
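You know what to do: install the requirements, import the libraries, and read the connection settings from the environment.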

In [None]:
!pip install -U -r requirements.txt

In [None]:
import json
import os 
import warnings
import pandas
import trino 
from helper import get_sql

warnings.simplefilter('ignore')

In [None]:
TRINO_HOSTNAME = os.environ.get('TRINO_HOSTNAME')
TRINO_USERNAME = os.environ.get('TRINO_USERNAME')
TRINO_PORT = os.environ.get('TRINO_PORT')
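
`os.environ.get` returns `None` for an unset variable and a string for everything else, including the port. A small guard can catch a missing setting before the connection is attempted. The following is a minimal sketch: `read_trino_settings` is a hypothetical helper written for illustration, not part of the `trino` package or of this project's `helper` module.

```python
import os


def read_trino_settings(env=os.environ):
    """Return host, user and port from the environment, failing fast if any is unset.

    Hypothetical helper for illustration; not part of `trino` or `helper`.
    """
    names = ("TRINO_HOSTNAME", "TRINO_USERNAME", "TRINO_PORT")
    missing = [name for name in names if not env.get(name)]
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    return {
        "host": env["TRINO_HOSTNAME"],
        "user": env["TRINO_USERNAME"],
        # Environment values are always strings, so convert the port explicitly.
        "port": int(env["TRINO_PORT"]),
    }
```

The returned dictionary could then be unpacked into the connection call, for example `trino.dbapi.connect(**read_trino_settings(), catalog='kafka', schema='default')`.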

## Making Connections 
Let's make sure we can connect to Starburst using the environment variables we just assigned.

In [None]:
conn = trino.dbapi.connect(
    host=TRINO_HOSTNAME,
    port=TRINO_PORT,
    user=TRINO_USERNAME,
    catalog='kafka',
    schema='default',
)

In [None]:
sql = 'SHOW CATALOGS'
df = get_sql(sql, conn)
df.head()

You should see a list of _catalogs_ (your data sources) displayed here. 
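
The source of `get_sql` is not shown in this notebook. As a rough idea of what such a helper might do, here is a sketch that runs a statement on any DB-API (PEP 249) connection and returns the rows as a `DataFrame`. The name `get_sql_sketch` is hypothetical, and the real `helper.get_sql` may work differently. The demo uses an in-memory SQLite database only so that it runs without a cluster.

```python
import sqlite3

import pandas


def get_sql_sketch(sql, conn):
    """Run `sql` on a DB-API connection and return the rows as a DataFrame.

    Hypothetical stand-in for `helper.get_sql`, whose source is not shown here.
    """
    cursor = conn.cursor()
    try:
        cursor.execute(sql)
        rows = cursor.fetchall()
        # Per PEP 249, cursor.description holds one entry per column, name first.
        columns = [col[0] for col in cursor.description]
    finally:
        cursor.close()
    return pandas.DataFrame(rows, columns=columns)


# Demonstrated against an in-memory SQLite database so it runs without a cluster.
demo_conn = sqlite3.connect(":memory:")
demo_df = get_sql_sketch("SELECT 'kafka' AS Catalog UNION ALL SELECT 'system'", demo_conn)
```

Because the function relies only on the standard cursor interface, the same code should accept the `conn` object created above, though that is untested here.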

## Simple Queries
Let's query our Kafka catalog. 
To do so, we write a SQL statement requesting data from the _messages_ table, which is part of the _default_ schema within the _kafka_ catalog: `kafka.default.messages`. 

**Remember:** the messages we dropped on the queue are still in their raw format. 
For now, we can clean them up with methods of the pandas `DataFrame` [object](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.html). 

In [None]:
kafka_sql = 'select * from kafka.default.messages'

kafka_raw_df = get_sql(kafka_sql, conn)
kafka_df = kafka_raw_df.join(kafka_raw_df._message.apply(json.loads).apply(pandas.Series))
kafka_df = kafka_df.drop(columns=['_message'])

kafka_df.head()
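
The cell above parses each raw `_message` string as JSON and spreads its keys into columns. The toy example below shows the same idea on two invented messages, using `pandas.json_normalize` as an alternative to `.apply(pandas.Series)`. The JSON keys (`customerid`, `event`) and the values are made up for this demo, so the real messages may look different.

```python
import json

import pandas

# Toy stand-in for the Kafka query result; the JSON keys are made up for this demo.
raw_df = pandas.DataFrame(
    {
        "_partition_id": [0, 0],
        "_message": [
            '{"customerid": 1, "event": "login"}',
            '{"customerid": 2, "event": "purchase"}',
        ],
    }
)

# Parse every JSON string into a dict, then expand the dicts into columns.
parsed = pandas.json_normalize(raw_df["_message"].map(json.loads).tolist())
parsed.index = raw_df.index  # keep rows aligned even if the index is not 0..n-1

expanded_df = raw_df.drop(columns=["_message"]).join(parsed)
```

Building one frame from a list of dicts avoids constructing a separate `Series` per row, which tends to matter once the topic holds many messages.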

**Now, for the cool part!** 
Let's query two separate, completely distinct databases to get our customer and financial data, respectively. 

In [None]:
cust_sql = 'select * from "customer-domain".public.customer'

cust_df = get_sql(cust_sql, conn)
cust_df.head()

In [None]:
fin_sql = 'select * from "finance-domain".public.transactions'

fin_df = get_sql(fin_sql, conn)
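# clean some data: strip the literal "$" (regex=False keeps it from being read as a regex anchor), then cast to float
fin_df["amount"] = fin_df["amount"].str.replace("$", "", regex=False)
fin_df["amount"] = fin_df["amount"].astype(float)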
fin_df.head()
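
The cleaning step above strips the dollar sign before casting to `float`. If the amounts could also contain thousands separators or stray whitespace, a small parsing function is one way to handle them. This is a sketch under that assumption: `parse_amount` is a hypothetical helper, and the source data may never contain commas at all.

```python
def parse_amount(text):
    """Convert a currency string such as '$1,234.50' to a float.

    Hypothetical helper; the thousands-separator handling is an assumption,
    since the source data may not contain commas.
    """
    cleaned = text.strip().replace("$", "").replace(",", "")
    return float(cleaned)
```

Applied to the raw string column, this would look like `fin_df["amount"].map(parse_amount)`.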

Perfect. 
See how simple it was to get data from three _(previously)_ siloed data sources? 

Let's explore "our" (fake) customer and transaction data too.

In [None]:
fin_df[fin_df['customerid'] == 42]

In [None]:
cust_df[cust_df['id'] == 42]

Let's join our finance data to our customer reference table so we can explore, starting with total spend by customer marketing segment. 

In [None]:
# Sum spend per customer, then match the customer `id` column against the customerid index
total_spend = fin_df.groupby("customerid")["amount"].sum()
df = cust_df.join(total_spend, on="id").dropna()
df.sort_values('amount', ascending=False).head()
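
Joining an aggregated `Series` to a reference table works by matching a key column in the table against the index of the `Series`. The toy example below illustrates this with the column names used in this notebook (`id`, `customerid`, `amount`, `mktsegment`). All values are invented.

```python
import pandas

# Toy data using the column names seen above; the values are invented.
customers = pandas.DataFrame(
    {"id": [41, 42, 43], "mktsegment": ["AUTOMOBILE", "BUILDING", "AUTOMOBILE"]}
)
transactions = pandas.DataFrame(
    {"customerid": [42, 42, 41], "amount": [10.0, 5.5, 3.0]}
)

# Total spend per customer, as a Series indexed by customerid.
spend = transactions.groupby("customerid")["amount"].sum()

# Match the customer's `id` column against that index; customers with no
# transactions get NaN and are dropped.
spend_by_customer = customers.join(spend, on="id").dropna(subset=["amount"])
```

Without `on="id"`, `DataFrame.join` aligns on the row index of `customers`, which only lines up with `customerid` by coincidence.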

In [None]:
df.plot.scatter(x="mktsegment", y="amount", figsize=(12, 6))
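
The scatter plot shows one point per customer. To read off spend per marketing segment as numbers, one option is to aggregate the joined frame by `mktsegment`. The sketch below uses a tiny invented frame with the same two columns, and the segment labels `A` and `B` are placeholders.

```python
import pandas

# Invented stand-in for the joined customer/spend frame.
joined = pandas.DataFrame(
    {"mktsegment": ["A", "B", "A"], "amount": [3.0, 15.5, 1.5]}
)

# Total, average and number of customers per segment, largest total first.
segment_totals = (
    joined.groupby("mktsegment")["amount"]
    .agg(["sum", "mean", "count"])
    .sort_values("sum", ascending=False)
)
```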

## Recap
We pulled data from **three** separate sources without having to move or replicate anything. 
That's pretty impressive. 

## Next
**But**, Starburst is capable of much more than simplifying data access. 
Let's have Starburst build the features we'll need for our model: [4_build_features.ipynb](4_build_features.ipynb)