Merge pull request delta-io#10 from delta-io/master
fork update
JassAbidi committed Jun 23, 2021
2 parents eee16c7 + b76e231 commit cad3c3d
Showing 281 changed files with 5,668 additions and 824 deletions.
1 change: 1 addition & 0 deletions .circleci/config.yml
Original file line number Diff line number Diff line change
@@ -23,6 +23,7 @@ jobs:
pipenv --python 3.7 install
pipenv run pip install pyspark==3.1.1
pipenv run pip install flake8==3.5.0 pypandoc==1.3.3
pipenv run pip install importlib_metadata==3.10.0
- run:
name: Run Scala/Java and Python tests
command: |
76 changes: 76 additions & 0 deletions CODE_OF_CONDUCT.md
@@ -0,0 +1,76 @@
# Delta Lake Code of Conduct

## Our Pledge

In the interest of fostering an open and welcoming environment, we as
contributors and maintainers pledge to making participation in our project and
our community a harassment-free experience for everyone, regardless of age, body
size, disability, ethnicity, sex characteristics, gender identity and expression,
level of experience, education, socio-economic status, nationality, personal
appearance, race, religion, or sexual identity and orientation.

## Our Standards

Examples of behavior that contributes to creating a positive environment
include:

* Using welcoming and inclusive language
* Being respectful of differing viewpoints and experiences
* Gracefully accepting constructive criticism
* Focusing on what is best for the community
* Showing empathy towards other community members

Examples of unacceptable behavior by participants include:
* The use of sexualized language or imagery and unwelcome sexual attention or
advances
* Trolling, insulting/derogatory comments, and personal or political attacks
* Public or private harassment
* Publishing others' private information, such as a physical or electronic
address, without explicit permission
* Other conduct which could reasonably be considered inappropriate in a
professional setting
## Our Responsibilities

Project maintainers are responsible for clarifying the standards of acceptable
behavior and are expected to take appropriate and fair corrective action in
response to any instances of unacceptable behavior.

Project maintainers have the right and responsibility to remove, edit, or
reject comments, commits, code, wiki edits, issues, and other contributions
that are not aligned to this Code of Conduct, or to ban temporarily or
permanently any contributor for other behaviors that they deem inappropriate,
threatening, offensive, or harmful.

## Scope

This Code of Conduct applies both within project spaces and in public spaces
when an individual is representing the project or its community. Examples of
representing a project or community include using an official project e-mail
address, posting via an official social media account, or acting as an appointed
representative at an online or offline event. Representation of a project may be
further defined and clarified by project maintainers.

## Enforcement

Instances of abusive, harassing, or otherwise unacceptable behavior may be
reported by contacting the Technical Steering Committee defined [here](https://github.com/delta-io/delta/blob/master/CONTRIBUTING.md#governance). All
complaints will be reviewed and investigated and will result in a response that
is deemed necessary and appropriate to the circumstances. The project team is
obligated to maintain confidentiality with regard to the reporter of an incident.
Further details of specific enforcement policies may be posted separately.

Project maintainers who do not follow or enforce the Code of Conduct in good
faith may face temporary or permanent repercussions as determined by other
members of the project's leadership.

## Attribution

This Code of Conduct is adapted from the [Contributor Covenant][homepage], version 1.4,
available at https://www.contributor-covenant.org/version/1/4/code-of-conduct.html

[homepage]: https://www.contributor-covenant.org

For answers to common questions about this code of conduct, see
https://www.contributor-covenant.org/faq
2 changes: 1 addition & 1 deletion LICENSE.txt
@@ -1,4 +1,4 @@
Copyright (2020) The Delta Lake Project Authors. All rights reserved.
Copyright (2021) The Delta Lake Project Authors. All rights reserved.


Apache License
2 changes: 1 addition & 1 deletion NOTICE.txt
@@ -1,5 +1,5 @@
Delta Lake
Copyright (2020) The Delta Lake Project Authors.
Copyright (2021) The Delta Lake Project Authors.

Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
38 changes: 24 additions & 14 deletions PROTOCOL.md
@@ -18,6 +18,16 @@
- [Commit Provenance Information](#commit-provenance-information)
- [Action Reconciliation](#action-reconciliation)
- [Requirements for Writers](#requirements-for-writers)
- [Creation of New Log Entries](#creation-of-new-log-entries)
- [Consistency Between Table Metadata and Data Files](#consistency-between-table-metadata-and-data-files)
- [Delta Log Entries](#delta-log-entries-1)
- [Checkpoints](#checkpoints-1)
- [Checkpoint Format](#checkpoint-format)
- [Data Files](#data-files-1)
- [Append-only Tables](#append-only-tables)
- [Column Invariants](#column-invariants)
- [Generated Columns](#generated-columns)
- [Writer Version Requirements](#writer-version-requirements)
- [Appendix](#appendix)
- [Per-file Statistics](#per-file-statistics)
- [Partition Value Serialization](#partition-value-serialization)
@@ -416,25 +426,25 @@ When the table property `delta.appendOnly` is set to `true`:
- New log entries may rearrange data (i.e. `add` and `remove` actions where `dataChange=false`).

## Column Invariants
- The schema for a given column MAY contain the metadata `delta.invariants`.
- This column SHOULD be parsed as a boolean SQL expression.
- Writers MUST abort any transaction that adds a row to the table, where a present invariant evaluates to `false` or `null`.
- The `metadata` for a column in the table schema MAY contain the key `delta.invariants`.
- The value of `delta.invariants` SHOULD be parsed as a boolean SQL expression.
- Writers MUST abort any transaction that adds a row to the table, where an invariant evaluates to `false` or `null`.
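The abort rule above can be modeled in a few lines. This is an illustrative sketch in plain Python, not Delta Lake code: `None` stands in for SQL `NULL`, and the function and variable names are hypothetical.

```python
# Sketch of the writer-side rule for column invariants: a transaction
# must abort when any invariant evaluates to false OR null (None here).

def check_invariants(rows, invariants):
    """rows: list of dicts; invariants: dict mapping column name to a
    predicate returning True, False, or None (SQL NULL)."""
    for row in rows:
        for column, predicate in invariants.items():
            result = predicate(row.get(column))
            if result is not True:  # False and None both force an abort
                raise ValueError(
                    f"invariant violated on column {column!r}: {row!r}")
    return True

# Example invariant `id > 0`: comparing NULL yields NULL (None) in SQL.
gt_zero = lambda v: None if v is None else v > 0
check_invariants([{"id": 1}, {"id": 5}], {"id": gt_zero})  # passes
```

Note that a `None` result aborts the transaction just like `False`, matching the "`false` or `null`" wording of the requirement.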

## Generated Columns

- The `metadata` for a column in the table schema MAY contain the key `delta.generationExpression`.
- The value of `delta.generationExpression` SHOULD be parsed as a SQL expression.
- Writers MUST enforce that any data written to the table satisfies the condition `(<value> <=> <generation expression>) IS TRUE`. `<=>` is the NULL-safe equal operator: it performs an equality comparison like the `=` operator but returns `TRUE` rather than `NULL` when both operands are `NULL`.
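The NULL-safe semantics of `<=>` can be modeled directly. A minimal sketch in plain Python, with `None` standing in for SQL `NULL` (the function names are illustrative, not a Delta Lake API):

```python
# Model of the NULL-safe equal operator `<=>` used by the
# generated-columns rule above.

def null_safe_eq(a, b):
    if a is None and b is None:
        return True          # NULL <=> NULL is TRUE, unlike `=`
    if a is None or b is None:
        return False         # NULL <=> non-NULL is FALSE, not NULL
    return a == b

def satisfies_generation(value, generated):
    # Writers must require (value <=> generation_expression) IS TRUE.
    return null_safe_eq(value, generated) is True
```

So a stored value of `NULL` is accepted only when the generation expression also evaluates to `NULL`.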

## Writer Version Requirements

The requirements of the writers according to the protocol versions are summarized in the table below. Each row inherits the requirements from the preceding row.

+------------------+----------------------------------------------+
| | Reader Version 1 |
+------------------+----------------------------------------------+
| Writer Version 2 | - Support `delta.appendOnly` |
| | - Support column invariants |
+------------------+----------------------------------------------+
| Writer Version 3 | - Enforce: |
| | - `delta.checkpoint.writeStatsAsJson` |
| | - `delta.checkpoint.writeStatsAsStruct` |
| | - `CHECK` constraints |
+------------------+----------------------------------------------+
<br> | Reader Version 1
-|-
Writer Version 2 | - Support [`delta.appendOnly`](#append-only-tables)<br>- Support [Column Invariants](#column-invariants)
Writer Version 3 | Enforce:<br>- `delta.checkpoint.writeStatsAsJson`<br>- `delta.checkpoint.writeStatsAsStruct`<br>- `CHECK` constraints
Writer Version 4 | - Support Change Data Feed<br>- Support [Generated Columns](#generated-columns)
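Because each row of the table inherits the requirements of the preceding row, the feature set for a writer version is cumulative. A small sketch of that inheritance (feature names are paraphrased from the table; this is not an official API):

```python
# Cumulative writer-version requirements, mirroring the table above:
# a writer at version N must support every feature at versions <= N.

WRITER_FEATURES = {
    2: ["delta.appendOnly", "column invariants"],
    3: ["delta.checkpoint.writeStatsAsJson",
        "delta.checkpoint.writeStatsAsStruct",
        "CHECK constraints"],
    4: ["change data feed", "generated columns"],
}

def required_features(writer_version):
    """All features a writer at this protocol version must support."""
    feats = []
    for version in sorted(WRITER_FEATURES):
        if version <= writer_version:
            feats.extend(WRITER_FEATURES[version])
    return feats
```

For example, a version-4 writer must still support column invariants from version 2.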

# Appendix

37 changes: 10 additions & 27 deletions README.md
@@ -10,25 +10,7 @@ See the [Quick Start Guide](https://docs.delta.io/latest/quick-start.html) to ge

## Latest Binaries

### Maven

Starting from 0.7.0, Delta Lake is only available with Scala version 2.12.

```xml
<dependency>
<groupId>io.delta</groupId>
<artifactId>delta-core_2.12</artifactId>
<version>0.8.0</version>
</dependency>
```

### SBT

You include Delta Lake in your SBT project by adding the following line to your build.sbt file:

```scala
libraryDependencies += "io.delta" %% "delta-core" % "0.8.0"
```
See the [online documentation](https://docs.delta.io/latest/) for the latest release.

## API Documentation

@@ -40,13 +22,14 @@ libraryDependencies += "io.delta" %% "delta-core" % "0.8.0"

### Compatibility with Apache Spark Versions

Delta Lake currently requires Apache Spark 3.0.0
See the [online documentation](https://docs.delta.io/latest/releases.html) for the releases and their compatibility with Apache Spark versions.

### API Compatibility

The only stable public APIs, currently provided by Delta Lake, are through the `DataFrameReader`/`Writer` (i.e. `spark.read`, `df.write`, `spark.readStream` and `df.writeStream`). Options to these APIs will remain stable within a major release of Delta Lake (e.g., 1.x.x).
There are two types of APIs provided by the Delta Lake project.

All other interfaces in this library are considered internal, and they are subject to change across minor/patch releases.
- Spark-based APIs - You can read Delta tables through the `DataFrameReader`/`Writer` (i.e. `spark.read`, `df.write`, `spark.readStream` and `df.writeStream`). Options to these APIs will remain stable within a major release of Delta Lake (e.g., 1.x.x).
- Direct Java/Scala/Python APIs - The classes and methods documented in the [API docs](https://docs.delta.io/latest/delta-apidoc.html) are considered stable public APIs. All other classes, interfaces, and methods that may be directly accessible in code are considered internal, and they are subject to change across releases.

### Data Storage Compatibility

@@ -56,11 +39,11 @@ Breaking changes in the protocol are indicated by incrementing the minimum reade

## Roadmap

Delta Lake is a recent open-source project based on technology developed at Databricks. We plan to open-source all APIs that are required to correctly run Spark programs that read and write Delta tables. For a detailed timeline on this effort see the [project roadmap](https://github.com/delta-io/delta/milestones).
For a detailed timeline, see the [project roadmap](https://github.com/delta-io/delta/milestones).

# Building

Delta Lake Core is compiled using [SBT](https://www.scala-sbt.org/1.x/docs/Command-Line-Reference.html).
Delta Lake is compiled using [SBT](https://www.scala-sbt.org/1.x/docs/Command-Line-Reference.html).

To compile, run

@@ -88,9 +71,7 @@ Delta Lake ACID guarantees are predicated on the atomicity and durability guaran
2. **Mutual exclusion**: Only one writer must be able to create (or rename) a file at the final destination.
3. **Consistent listing**: Once a file has been written in a directory, all future listings for that directory must return that file.

Given that storage systems do not necessarily provide all of these guarantees out-of-the-box, Delta Lake transactional operations typically go through the [LogStore API](https://github.com/delta-io/delta/blob/master/src/main/scala/org/apache/spark/sql/delta/storage/LogStore.scala) instead of accessing the storage system directly. We can plug in custom `LogStore` implementations in order to provide the above guarantees for different storage systems. Delta Lake has built-in `LogStore` implementations for HDFS, Amazon S3 and Azure storage services. Please see [Delta Lake Storage Configuration](https://docs.delta.io/latest/delta-storage.html) for more details. If you are interested in adding a custom `LogStore` implementation for your storage system, you can start discussions in the community mailing group.

As an optimization, storage systems can also allow _partial listing of a directory, given a start marker_. Delta Lake can use this ability to efficiently discover the latest version of a table, without listing all of the files in the transaction log.
See the [online documentation on Storage Configuration](https://docs.delta.io/latest/delta-storage.html) for details.
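The "mutual exclusion" guarantee listed above can be illustrated on a local filesystem, where `O_CREAT | O_EXCL` makes file creation atomic so only one writer can create a given log entry. This is a sketch of the requirement, not Delta Lake's actual `LogStore` implementation; the function name is hypothetical.

```python
import os

# Sketch: atomically create a commit file; only one concurrent writer
# can succeed for a given path, so the loser must re-read the log,
# rebase its transaction, and retry at the next version.

def write_commit(path, content):
    try:
        # Fails with FileExistsError if another writer got there first.
        fd = os.open(path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
    except FileExistsError:
        return False                  # lost the race: caller must retry
    with os.fdopen(fd, "w") as f:
        f.write(content)
    return True
```

Object stores that lack an atomic put-if-absent primitive need extra coordination to provide this guarantee, which is what pluggable `LogStore` implementations are for.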

## Concurrency Control

Expand All @@ -103,6 +84,8 @@ We use [GitHub Issues](https://github.com/delta-io/delta/issues) to track commun
# Contributing
We welcome contributions to Delta Lake. See our [CONTRIBUTING.md](https://github.com/delta-io/delta/blob/master/CONTRIBUTING.md) for more details.

We also adhere to the [Delta Lake Code of Conduct](https://github.com/delta-io/delta/blob/master/CODE_OF_CONDUCT.md).

# License
Apache License 2.0, see [LICENSE](https://github.com/delta-io/delta/blob/master/LICENSE.txt).

45 changes: 43 additions & 2 deletions build.sbt
@@ -1,5 +1,5 @@
/*
* Copyright (2020) The Delta Lake Project Authors.
* Copyright (2021) The Delta Lake Project Authors.
*
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
@@ -58,7 +58,7 @@ lazy val core = (project in file("core"))
listPythonFiles(baseDirectory.value.getParentFile / "python"),

antlr4Settings,
antlr4Version in Antlr4 := "4.7",
antlr4Version in Antlr4 := "4.8",
antlr4PackageName in Antlr4 := Some("io.delta.sql.parser"),
antlr4GenListener in Antlr4 := true,
antlr4GenVisitor in Antlr4 := true,
@@ -95,6 +95,47 @@
(compile in Compile) := ((compile in Compile) dependsOn createTargetClassesDir).value
)

lazy val contribs = (project in file("contribs"))
.dependsOn(core % "compile->compile;test->test;provided->provided")
.settings (
name := "delta-contribs",
commonSettings,
scalaStyleSettings,
releaseSettings,
(mappings in (Compile, packageBin)) := (mappings in (Compile, packageBin)).value ++
listPythonFiles(baseDirectory.value.getParentFile / "python"),

testOptions in Test += Tests.Argument("-oDF"),
testOptions in Test += Tests.Argument(TestFrameworks.JUnit, "-v", "-a"),

// Don't execute in parallel since we can't have multiple Sparks in the same JVM
parallelExecution in Test := false,

scalacOptions ++= Seq(
"-target:jvm-1.8"
),

javaOptions += "-Xmx1024m",

// Configurations to speed up tests and reduce memory footprint
javaOptions in Test ++= Seq(
"-Dspark.ui.enabled=false",
"-Dspark.ui.showConsoleProgress=false",
"-Dspark.databricks.delta.snapshotPartitions=2",
"-Dspark.sql.shuffle.partitions=5",
"-Ddelta.log.cacheSize=3",
"-Dspark.sql.sources.parallelPartitionDiscovery.parallelism=5",
"-Xmx1024m"
),

// Hack to avoid errors related to missing repo-root/target/scala-2.12/classes/
createTargetClassesDir := {
val dir = baseDirectory.value.getParentFile / "target" / "scala-2.12" / "classes"
Files.createDirectories(dir.toPath)
},
(compile in Compile) := ((compile in Compile) dependsOn createTargetClassesDir).value
)

/**
* Get list of python files and return the mapping between source files and target paths
* in the generated package JAR.
2 changes: 1 addition & 1 deletion build/sbt
@@ -23,7 +23,7 @@
#

#
# Copyright (2020) The Delta Lake Project Authors.
# Copyright (2021) The Delta Lake Project Authors.
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at