Merge pull request #26 from sumedhsakdeo/doc_updates
Doc fixes: Trim Info, Formatting, Link fix, etc
sumedhsakdeo committed Feb 26, 2024
2 parents 3870ef5 + a6985cb commit 6a154d3
Showing 7 changed files with 112 additions and 102 deletions.
1 change: 1 addition & 0 deletions .gitignore
Expand Up @@ -14,6 +14,7 @@
.env.development.local
.env.test.local
.env.production.local
.idea

npm-debug.log*
yarn-debug.log*
Expand Down
3 changes: 1 addition & 2 deletions blog/2023-07-19-introduction-oh.mdx
Expand Up @@ -3,7 +3,6 @@ slug: introduction-oh
title: Taking Charge of Tables, Introducing OpenHouse for Big Data Management
authors:
name: Sumedh Sakdeo
title: OpenHouse @ LinkedIn, ex-Head of Data Lyft Self Driving Division
url: https://www.linkedin.com/in/sumedhsakdeo/
image_url: https://github.com/sumedhsakdeo.png
tags: [openhouse, introduction, big data]
Expand All @@ -13,7 +12,7 @@ hide_reading_time: true
import Link from '@docusaurus/Link';

<Link
  className='button button--primary button--small'
to="https://www.linkedin.com/blog/engineering/data-management/taking-charge-of-tables--introducing-openhouse-for-big-data-mana">
<small> Read More </small>
</Link>
52 changes: 36 additions & 16 deletions docs/User Guide/Catalog/SQL.md
Expand Up @@ -5,13 +5,17 @@ tags:
- SQL
- API
- OpenHouse
- Iceberg
---

## Data Definition Language (DDL)

OpenHouse tables support [Apache Iceberg](https://iceberg.apache.org/) as the underlying table format. You can use native
Spark syntax to create, alter, and drop tables, though OpenHouse imposes some constraints.

### CREATE TABLE

To create an OpenHouse table, run the following SQL in Spark:
```sql
CREATE TABLE openhouse.db.table (id bigint COMMENT 'unique id', data string);
```
Expand All @@ -20,23 +24,25 @@ OpenHouse supports following Create clauses:
- `PARTITIONED BY`
- `TBLPROPERTIES ('key'='value', ...)`

The list of supported data types is the same as in [Iceberg Spark Types](https://iceberg.apache.org/docs/latest/spark-writes/#spark-type-to-iceberg-type).

:::warning
OpenHouse does not support the following CREATE clauses:
- `LOCATION (*)`
- `CLUSTERED BY (*)`
- `COMMENT 'table documentation'`

:::

#### PARTITIONED BY

OpenHouse supports a single timestamp column in the partitioning scheme. It also supports partitioning on up to three string- or integer-type columns.
To partition your table, you can use the following SQL syntax:
```sql
CREATE TABLE openhouse.db.table(datepartition string, epoch_ts timestamp)
PARTITIONED BY (
days(epoch_ts),
datepartition
)
```
- `days(epoch_ts)`: partitions data by applying day-granularity on the timestamp-type column "epoch_ts".

Expand All @@ -47,11 +53,14 @@ Other granularities supported are:

You can also partition your data based on a string column by using identity partitioning (for example: `datepartition`).

:::warning
- Iceberg transforms such as `bucket` and `truncate` are not supported on the timestamp column.
- No transformation is supported on string- or integer-type partition columns.
:::
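
As a sketch of the timestamp granularity transforms, an hour-partitioned table might look like the following (table and column names are illustrative, and this assumes hour-level granularity is among the supported options):

```sql
-- Hypothetical table partitioned at hour granularity on a timestamp column
CREATE TABLE openhouse.db.events (event_ts timestamp, payload string)
PARTITIONED BY (hours(event_ts));
```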

#### TBLPROPERTIES

To set table properties, you can use the following SQL syntax:
```sql
CREATE TABLE openhouse.db.table(
data string
Expand All @@ -63,18 +72,21 @@ TBLPROPERTIES (
```
:::warning
Keys with the prefix “openhouse.” (for example: “openhouse.tableId”) are preserved and cannot be set/modified.
Additionally, all Iceberg [TableProperties](https://github.com/apache/iceberg/blob/master/core/src/main/java/org/apache/iceberg/TableProperties.java) are also preserved.
The catalog service may set these preserved properties as it sees fit.
:::

#### CREATE TABLE AS SELECT (CTAS)

To create an OpenHouse table with some data, run the following SQL in Spark:
```sql
CREATE TABLE openhouse.db.table
AS
SELECT * FROM hive.srcDb.srcTable WHERE data = 'v1';
```

:::warning
`CREATE TABLE LIKE tableName` is not supported. You can use `CREATE TABLE A AS SELECT * FROM B LIMIT 0` to achieve the same effect.
:::
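
As a sketch, the `LIMIT 0` workaround might look like the following (table names are illustrative): it copies the schema of the source table without copying any rows.

```sql
-- Creates an empty table with the same schema as db.source_table
CREATE TABLE openhouse.db.empty_copy
AS
SELECT * FROM openhouse.db.source_table LIMIT 0;
```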

Expand All @@ -92,48 +104,53 @@ DROP TABLE openhouse.db.table;
```

### ALTER TABLE

OpenHouse supports the following ALTER statements:
- Setting or removing table properties.
- Schema evolution:
  - Adding columns at both top and nested levels.
  - Widening the type of int, float, and decimal fields.
  - Making required columns optional.

:::warning
OpenHouse doesn’t allow the following:
- Schema evolution: dropping, renaming, and reordering columns.
- Renaming a table.
- Adding, removing, and changing partitioning.
- Other Iceberg ALTERs such as `WRITE ORDERED BY` / `WRITE DISTRIBUTED BY`.
:::

#### ALTER TABLE ... SET TBLPROPERTIES

To set table properties, you can use the following SQL syntax:
```sql
ALTER TABLE openhouse.db.table SET TBLPROPERTIES (
'key1' = 'value1',
'key2' = 'value2'
)
```
To unset table properties, you can use the following SQL syntax:
```sql
ALTER TABLE openhouse.db.table UNSET TBLPROPERTIES ('key1', 'key2')
```
:::warning
Keys with the prefix “openhouse.” (for example: “openhouse.tableId”) are preserved and cannot be set/modified.
Additionally, all Iceberg [TableProperties](https://github.com/apache/iceberg/blob/master/core/src/main/java/org/apache/iceberg/TableProperties.java) are also preserved.
The catalog service may set these preserved properties as it sees fit.
:::

#### ALTER TABLE ... ADD COLUMN

Adding a column is a supported schema evolution. To add a new column:
```sql
ALTER TABLE openhouse.db.table
ADD COLUMNS (
new_column string comment 'new_column docs'
)
```

Multiple columns can be added, separated by commas.
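
For example, two columns could be added in a single statement as follows (column names are illustrative):

```sql
-- Hypothetical columns: both are added in one ALTER statement
ALTER TABLE openhouse.db.table
ADD COLUMNS (
    col_a string comment 'first new column',
    col_b bigint comment 'second new column'
)
```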
Nested columns can be added as follows:

```sql
-- create a struct column
ALTER TABLE openhouse.db.table
Expand All @@ -153,6 +170,7 @@ ADD COLUMN points array<struct<x: double, y: double>>;
ALTER TABLE openhouse.db.table
ADD COLUMN points.element.z double
```

```sql
-- create a map column of struct key and struct value
ALTER TABLE openhouse.db.table
Expand Down Expand Up @@ -254,7 +272,9 @@ Note: response of the table is cached for the duration of the session. In order
:::

### SELECT FROM w/ Time-Travel
OpenHouse uses Iceberg as the table format. Iceberg generates a version/snapshot for every update to the table. A
snapshot captures the state of the table at a specific point in time. We can query Iceberg tables using a snapshot ID
or timestamp from the past. Unlike Hive, Iceberg guarantees query reproducibility when querying historical data.

Time-travel is supported through the following syntax.
```sql
Expand Down
28 changes: 14 additions & 14 deletions docs/User Guide/Catalog/Scala.md
Expand Up @@ -5,36 +5,36 @@ tags:
- Scala
- API
- OpenHouse
- Iceberg
---

## Data Definition Language (DDL)
OpenHouse tables support [Apache Iceberg](https://iceberg.apache.org/) as the underlying table format. You can use native
Spark syntax to create, alter, and drop tables, though OpenHouse imposes some constraints.

For DDLs such as `CREATE TABLE`, `ALTER TABLE`, `GRANT/REVOKE`, etc., use `.sql()` in SparkSession. You can also use the
native Spark Scala API to create a table if it does not exist:

```scala
// Your code preparing a DataFrame named df

df.writeTo("openhouse.<dbName>.<tableName>").create()
```
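
Beyond `create()`, the Spark DataFrameWriterV2 API also offers `append()` for writing into an existing table. A minimal sketch, assuming the table already exists and the DataFrame schema matches (table name is illustrative):

```scala
// Append rows from df to an existing OpenHouse table
df.writeTo("openhouse.db.table").append()
```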

[SQL DDL commands](SQL.md#data-definition-language-ddl)

## Reads
To query a table, run the following:

```scala
```scala
val df = spark.table("openhouse.db.table")
```

You can also filter data using custom filters as follows:
```scala
```sql
val filtered_df = df.filter(col("datepartition") > "2022-05-10")
```

Expand Down
