In [1]:
# | output: false
# | echo: false
import builtins
import rich

::: {.callout-note}

This page leverages some basic transform syntax from later parts of the walkthrough. Don't worry too much about it for now: the core of this page is to understand how relations work in Vinyl.

:::

## Properties

As in SQL, relations are the core data model. But relations in Vinyl have some unique properties:


### Relations are column aware

Take the `Seattle weather` dataset in vinyl's examples. 

<Card>

In [2]:
from vinyl.examples import seattle_weather

weather = seattle_weather()

</Card>

`weather` is a `VinylTable` object, which carries far more information than a standard SQL table or CTE. The schema, for example, can be pulled easily without running a query against the database.


<Card>

In [3]:
# | output: asis
print(weather.schema())

</Card>

Columns are attributes of the class, so the code below returns the `wind` column.


In [4]:
weather.wind

<vinyl.lib.column.VinylColumn at 0x28b0cf450>

Column information is still available after a <Tooltip tip="more on this in the next section">basic transform</Tooltip>:


In [5]:
wind_doubled = weather.select(weather.wind * 2)
print(wind_doubled)

This means you can access column information at any point in your queries. For example,


In [6]:
temps = weather.select([col for col in weather.columns if col.startswith("temp")])
print(temps.schema())

will pull all columns whose names start with `temp` from `weather`.
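
To make the idea of column awareness concrete, here is a minimal pure-Python sketch. `ToyTable` and `ToyColumn` are hypothetical names invented for this illustration and are not part of Vinyl. The sketch only shows how a table object could expose its columns as attributes and answer schema questions without touching a database.

```python
class ToyColumn:
    """A named, typed column belonging to a table (illustrative only)."""

    def __init__(self, table_name, name, dtype):
        self.table_name = table_name
        self.name = name
        self.dtype = dtype

    def __repr__(self):
        return f"ToyColumn({self.table_name}.{self.name}: {self.dtype})"


class ToyTable:
    """A table that carries its schema in memory (hypothetical, not Vinyl)."""

    def __init__(self, name, schema):
        self._name = name
        self._schema = dict(schema)  # column name -> type name

    @property
    def columns(self):
        # Column names, in schema order.
        return list(self._schema)

    def schema(self):
        return dict(self._schema)

    def __getattr__(self, item):
        # Only called when normal attribute lookup fails, so real
        # attributes such as _name and _schema are unaffected.
        schema = self.__dict__.get("_schema", {})
        if item in schema:
            return ToyColumn(self._name, item, schema[item])
        raise AttributeError(item)


weather = ToyTable(
    "seattle_weather",
    {
        "date": "timestamp",
        "precipitation": "float64",
        "temp_max": "float64",
        "temp_min": "float64",
        "wind": "float64",
        "weather": "string",
    },
)

# Column names are plain strings here, so they can be filtered by prefix.
temp_columns = [col for col in weather.columns if col.startswith("temp")]

print(weather.wind)
print(temp_columns)
```

Because the schema lives on the object, both the attribute lookup and the prefix filter run without executing any query.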


### Relations are lazy

Vinyl tracks transforms lazily and only compiles a query when a variable is executed or its SQL is generated.

By default, Vinyl returns a string representation of the query plan associated with a variable. For example:


In [7]:
temps

r0 := DatabaseTable: seattle_weather
  date          timestamp(6)
  precipitation float64
  temp_max      float64
  temp_min      float64
  wind          float64
  weather       string

Project[r0]
  temp_max: r0.temp_max
  temp_min: r0.temp_min

If you'd like to see a graphical representation, use `.visualize()`:


In [8]:
temps.visualize()

To execute this, you only need to run:


In [9]:
temps.execute()

Unnamed: 0,temp_max,temp_min
0,12.8,5.0
1,10.6,2.8
2,11.7,7.2
3,12.2,5.6
4,8.9,2.8
...,...,...
1456,4.4,1.7
1457,5.0,1.7
1458,7.2,0.6
1459,5.6,-1.0


By default, this returns a pandas DataFrame. You can also return a text or pyarrow representation by passing the format name:


In [10]:
temps.execute("text")

In [11]:
temps.execute("pyarrow")

pyarrow.Table
temp_max: double
temp_min: double
----
temp_max: [[12.8,10.6,11.7,12.2,8.9,...,4.4,5,7.2,5.6,5.6]]
temp_min: [[5,2.8,7.2,5.6,2.8,...,1.7,1.7,0.6,-1,-2.1]]

You can also save the result to various formats (CSV, JSON, etc.) using `.save()`.
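
The lazy behavior described in this section can be illustrated with a small pure-Python sketch. `LazyRows` is a hypothetical class written for this example and does not reflect how Vinyl or Ibis is implemented. It records each transform as a step in a plan and only touches the data when `execute()` is called. The three sample rows reuse the first `temp_max` and `temp_min` values shown above.

```python
class LazyRows:
    """Records transforms and applies them only on execute() (toy example)."""

    def __init__(self, rows, steps=()):
        self._rows = rows
        self._steps = tuple(steps)  # (description, function) pairs

    def _extend(self, description, function):
        # Return a new object; the existing plan is never modified.
        return LazyRows(self._rows, self._steps + ((description, function),))

    def filter(self, description, predicate):
        return self._extend(
            f"Filter[{description}]",
            lambda rows: [row for row in rows if predicate(row)],
        )

    def select(self, *names):
        return self._extend(
            f"Project[{', '.join(names)}]",
            lambda rows: [{name: row[name] for name in names} for row in rows],
        )

    def plan(self):
        # Describing the plan does not run any step.
        return ["Source"] + [description for description, _ in self._steps]

    def execute(self):
        rows = self._rows
        for _, step in self._steps:
            rows = step(rows)
        return rows


rows = [
    {"temp_max": 12.8, "temp_min": 5.0},
    {"temp_max": 10.6, "temp_min": 2.8},
    {"temp_max": 11.7, "temp_min": 7.2},
]

calls = []  # records every row the predicate actually inspects


def is_warm(row):
    calls.append(row)
    return row["temp_max"] > 11


query = LazyRows(rows).filter("temp_max > 11", is_warm).select("temp_max")

plan = query.plan()
calls_before_execute = len(calls)  # still zero: nothing has run yet
result = query.execute()

print(plan)
print(calls_before_execute, len(calls))
print(result)
```

Building the query and printing its plan never calls the predicate; the work happens once, inside `execute()`.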

### Relations are selectively mutable

By default, VinylTables are **immutable**. For example, running a transform on `weather` without assigning the result leaves it unchanged: the table AST printed below is the same as the original `weather`.


In [12]:
weather.select(weather.wind * 2)
weather

DatabaseTable: seattle_weather
  date          timestamp(6)
  precipitation float64
  temp_max      float64
  temp_min      float64
  wind          float64
  weather       string

That said, there are two key cases where VinylTables are considered mutable. This allows for a more fluent, ergonomic syntax, especially when you are chaining several transforms together.

The two cases are:

1. Within specially decorated functions (i.e., those with the `@model` or `@metric` decorator)
2. Within context managers

The first case is designed to support <Tooltip tip="more on this in the quickstart">pipelines</Tooltip>:


In [13]:
from vinyl import T, model


@model(deps=[seattle_weather])
def weather(w: T) -> T:
    w.select(w.wind * 2)
    return w


weather()

r0 := DatabaseTable: seattle_weather
  date          timestamp(6)
  precipitation float64
  temp_max      float64
  temp_min      float64
  wind          float64
  weather       string

Project[r0]
  Multiply(wind, 2): r0.wind * 2

The second case is designed primarily for analysis:


In [14]:
with seattle_weather() as w:
    w.select(w.wind * 2)

w

r0 := DatabaseTable: seattle_weather
  date          timestamp(6)
  precipitation float64
  temp_max      float64
  temp_min      float64
  wind          float64
  weather       string

Project[r0]
  Multiply(wind, 2): r0.wind * 2

A context manager can also be used inside Vinyl pipeline functions to create a sort of CTE, in the sense that it makes a copy of the original object.
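
One way to picture selective mutability is the toy class below. `ToyRelation` is a hypothetical name, and the mechanism (a private flag switched on for a working copy inside a `with` block) is an assumption made for illustration, not a description of Vinyl's internals. Locking the copy again when the block exits is likewise a choice of this sketch.

```python
class ToyRelation:
    """Immutable by default; mutable on a working copy inside `with`."""

    def __init__(self, source, steps=()):
        self.source = source
        self.steps = list(steps)
        self._mutable = False
        self._working_copy = None

    def select(self, expression):
        step = f"Project[{expression}]"
        if self._mutable:
            # Inside a `with` block: record the step in place.
            self.steps.append(step)
            return self
        # Default: leave this object alone and return a new one.
        return ToyRelation(self.source, self.steps + [step])

    def __enter__(self):
        # Hand out a copy so the original relation is never modified.
        self._working_copy = ToyRelation(self.source, self.steps)
        self._working_copy._mutable = True
        return self._working_copy

    def __exit__(self, exc_type, exc, tb):
        # Lock the copy again once the block ends (a choice of this sketch).
        self._working_copy._mutable = False
        return False


weather = ToyRelation("seattle_weather")

weather.select("wind * 2")  # result discarded, so `weather` is unchanged

with weather as w:
    w.select("wind * 2")  # recorded on the working copy

print(weather.steps)
print(w.steps)
```

The original `weather` keeps an empty plan, while `w` carries the projection added inside the block.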

### Relations are dialect independent

Vinyl uses the Ibis library to generate SQL. This means that you can write your queries in a dialect-agnostic way. For example, the following code works across the dialects currently supported by Vinyl:

- BigQuery
- Snowflake
- DuckDB
- Postgres

Ibis itself supports almost 20 dialects, so we plan to add more over time.

For this table:

In [15]:
with seattle_weather() as w:
    w.aggregate({"temp_max": w.temp_max.collect()}, by=w.date.dt.floor(months=1))

temp_by_month = w
temp_by_month.execute("text")

Here's how Vinyl translates the query to each dialect:

::: {.panel-tabset}

#### BigQuery

In [16]:
# | echo: false

from vinyl import original_print

original_print(temp_by_month.to_sql("bigquery"))

SELECT
  TIMESTAMP_TRUNC(`t0`.`date`, MONTH) AS `TimestampTruncate_date_ MONTH`,
  ARRAY_AGG(`t0`.`temp_max` IGNORE NULLS) AS `temp_max`
FROM `seattle_weather` AS `t0`
GROUP BY
  1


#### Snowflake

In [17]:
# | echo: false
from rich.syntax import Syntax

Syntax(temp_by_month.to_sql("snowflake"), "sql")

#### DuckDB

In [18]:
# | echo: false
Syntax(temp_by_month.to_sql("duckdb"), "sql")

#### Postgres

In [19]:
# | echo: false
print(temp_by_month.to_sql("postgres"))

:::

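
As a rough picture of what dialect independence involves, the sketch below renders one logical operation (truncating a timestamp column to the month) as a SQL fragment for each dialect. The `quote` and `truncate_to_month` helpers are hypothetical and hand-written for this illustration; they are not how Ibis compiles expressions, and the strings they produce are not the exact SQL that Vinyl emits.

```python
DIALECTS = ("bigquery", "snowflake", "duckdb", "postgres")


def quote(identifier, dialect):
    """Quote an identifier: backticks for BigQuery, double quotes otherwise."""
    mark = "`" if dialect == "bigquery" else '"'
    return f"{mark}{identifier}{mark}"


def truncate_to_month(column, dialect):
    """Render 'truncate this timestamp to the month' for one dialect."""
    if dialect not in DIALECTS:
        raise ValueError(f"unsupported dialect: {dialect}")
    quoted = quote(column, dialect)
    if dialect == "bigquery":
        # BigQuery takes the column first and a bare date part second.
        return f"TIMESTAMP_TRUNC({quoted}, MONTH)"
    # Snowflake, DuckDB and Postgres accept DATE_TRUNC('month', <column>).
    return f"DATE_TRUNC('month', {quoted})"


# One logical expression, rendered once per dialect.
rendered = {dialect: truncate_to_month("date", dialect) for dialect in DIALECTS}

for dialect, sql in rendered.items():
    print(f"{dialect}: {sql}")
```

The calling code states the operation once; only the rendering step knows about dialect differences, which is the separation a query compiler provides at a much larger scale.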



- Relations:
    - column-aware
    - lazy
    - selectively mutable
    - cross-dialect
- Transforms: 
    - Core
        - Select
        - Derive
        - Aggregate
        - Filter
    - `_all` variants
    - Row operations: distinct, limit
        - where, having, and qualify -> filter
- Set operators: join, union, difference
- Column operators
- Metric: dynamically generated relations