``dask-sql`` understands SQL in (mostly) Presto SQL syntax.
So far, not every valid SQL operator and keyword is implemented in ``dask-sql``, but a large fraction of them are.
Have a look into our issue tracker to find out what is still missing.
``dask-sql`` understands queries for data retrieval (``SELECT``), queries on metadata information (``SHOW`` and ``DESCRIBE``), queries for table creation (``CREATE TABLE``) and machine learning (``CREATE MODEL`` and ``PREDICT``).
In the following, general information on these queries is given - the sub-pages give details on each of the implemented keywords or operations.
The information on these pages applies to all ways SQL queries can be handed over to ``dask-sql``: via Python (:ref:`api`), the SQL server (:ref:`server`) or the command line (:ref:`cmd`).
Data in ``dask-sql`` is - similar to most SQL systems - grouped in named tables, which consist of columns (with names and data types) and rows.
The tables are in turn grouped into schemas.
For simplicity, there only exists a single schema, named "schema".
Many queries need to refer to a schema, table or column. Identifiers can be specified with double quotes or without quotes (if there is no ambiguity with SQL keywords). The casing is kept in both cases.
.. code-block:: sql

    SELECT "date", "name" FROM "df"
Here, ``"date"`` definitely needs quotation marks (as ``DATE`` is also an SQL keyword), whereas ``name`` and ``df`` could also be specified without quotation marks.
To prevent ambiguities, the full table identifier can be used:
.. code-block:: sql

    SELECT "df"."date" FROM "schema"."df"
In many cases, however, the bare name is enough:
.. code-block:: sql

    SELECT "date" FROM "df"
String literals get single quotes:

.. code-block:: sql

    SELECT 'string literal'
.. note::

    ``dask-sql`` can only understand a single SQL statement per call to ``Context.sql``.
    Therefore, do not add a semicolon after the query.
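If you have a script with several semicolon-separated statements, you can split it yourself and hand each single statement to ``Context.sql``. A minimal sketch in plain Python - ``split_statements`` is our own illustrative helper, not part of dask-sql, and it does not handle semicolons inside string literals or comments:

```python
def split_statements(sql_script: str) -> list[str]:
    """Naively split a SQL script into single statements.

    Illustration only: semicolons inside string literals or
    SQL comments are not handled.
    """
    statements = [s.strip() for s in sql_script.split(";")]
    # drop empty fragments, e.g. after a trailing semicolon
    return [s for s in statements if s]


script = "SELECT 1; SELECT 2;"
for statement in split_statements(script):
    # each single statement could now be passed to Context.sql(statement)
    print(statement)
```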
Some SQL statements, like ``CREATE MODEL WITH`` and ``CREATE TABLE WITH``, expect a list of key-value arguments, which resembles (not accidentally) a Python dictionary. It has the form

.. code-block:: none

    ( key = value [ , ... ] )

with an arbitrary number of key-value pairs and is always enclosed in brackets. The keys are unquoted (similar to Python's ``dict`` constructor).
A value can be any valid SQL literal (e.g. ``3``, ``4.2``, ``'string'``), a key-value parameter list itself, a list (``ARRAY``), a set (``MULTISET``) or another way of writing dictionaries (``MAP``).
This means the following is a valid key-value parameter list:

.. code-block:: none

    (
        first_argument = 3,
        second_argument = MULTISET [ 1, 1, 2, 3 ],
        third_argument = (
            sub_argument_1 = ARRAY [ 1, 2, 3 ],
            sub_argument_2 = 'a string'
        )
    )
Please note that, in contrast to Python, no comma is allowed after the last argument.
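The resemblance to a Python ``dict`` can be made explicit. The following sketch is our own illustration (not a dask-sql API); it renders a nested dictionary in the key-value syntax described above, covering literals and nested parameter lists but not ``ARRAY``/``MULTISET``/``MAP`` values:

```python
def to_sql_kwargs(d: dict) -> str:
    """Render a (possibly nested) dict as a SQL key-value parameter list.

    Illustrative helper only: handles ints, floats, strings and
    nested dicts, not ARRAY/MULTISET/MAP values.
    """
    def render(value):
        if isinstance(value, dict):
            return to_sql_kwargs(value)
        if isinstance(value, str):
            # string literals get single quotes in SQL
            return f"'{value}'"
        return str(value)

    # str.join naturally avoids the trailing comma that SQL forbids
    pairs = ", ".join(f"{key} = {render(value)}" for key, value in d.items())
    return f"({pairs})"


print(to_sql_kwargs({"first_argument": 3, "third_argument": {"sub_argument_2": "a string"}}))
# (first_argument = 3, third_argument = (sub_argument_2 = 'a string'))
```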
.. toctree::
   :maxdepth: 1

   sql/select.rst
   sql/creation.rst
   sql/ml.rst
   sql/describe.rst
``dask-sql`` needs to map between SQL and ``dask`` (Python) types.
For this, it uses the following mapping:
====================  =============
From Python Type      To SQL Type
====================  =============
``np.bool8``          ``BOOLEAN``
``np.datetime64``     ``TIMESTAMP``
``np.float32``        ``FLOAT``
``np.float64``        ``DOUBLE``
``np.int16``          ``SMALLINT``
``np.int32``          ``INTEGER``
``np.int64``          ``BIGINT``
``np.int8``           ``TINYINT``
``np.object_``        ``VARCHAR``
``np.uint16``         ``SMALLINT``
``np.uint32``         ``INTEGER``
``np.uint64``         ``BIGINT``
``np.uint8``          ``TINYINT``
``pd.BooleanDtype``   ``BOOLEAN``
``pd.Int16Dtype``     ``SMALLINT``
``pd.Int32Dtype``     ``INTEGER``
``pd.Int64Dtype``     ``BIGINT``
``pd.Int8Dtype``      ``TINYINT``
``pd.StringDtype``    ``VARCHAR``
``pd.UInt16Dtype``    ``SMALLINT``
``pd.UInt32Dtype``    ``INTEGER``
``pd.UInt64Dtype``    ``BIGINT``
``pd.UInt8Dtype``     ``TINYINT``
====================  =============
================  =======================
From SQL Type     To Python Type
================  =======================
``BIGINT``        ``pd.Int64Dtype``
``BOOLEAN``       ``pd.BooleanDtype``
``CHAR(*)``       ``pd.StringDtype``
``DATE``          ``np.dtype("<M8[ns]")``
``DECIMAL(*)``    ``np.float64``
``DOUBLE``        ``np.float64``
``FLOAT``         ``np.float32``
``INTEGER``       ``pd.Int32Dtype``
``INTERVAL``      ``np.dtype("<m8[ns]")``
``SMALLINT``      ``pd.Int16Dtype``
``TIME(*)``       ``np.dtype("<M8[ns]")``
``TIMESTAMP(*)``  ``np.dtype("<M8[ns]")``
``TINYINT``       ``pd.Int8Dtype``
``VARCHAR``       ``pd.StringDtype``
``VARCHAR(*)``    ``pd.StringDtype``
================  =======================
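The Python-to-SQL direction of this mapping can be pictured as a plain dictionary keyed by dtype. The sketch below is our own illustration for a few entries of the table - dask-sql performs the full mapping internally:

```python
import numpy as np
import pandas as pd

# A few entries of the Python-to-SQL mapping from the table above.
# Illustration only - dask-sql keeps its own, complete internal mapping.
PYTHON_TO_SQL = {
    np.dtype("bool"): "BOOLEAN",
    np.dtype("float32"): "FLOAT",
    np.dtype("float64"): "DOUBLE",
    np.dtype("int32"): "INTEGER",
    np.dtype("int64"): "BIGINT",
    np.dtype("object"): "VARCHAR",
    pd.Int64Dtype(): "BIGINT",
    pd.StringDtype(): "VARCHAR",
}

df = pd.DataFrame({"x": [1, 2], "y": [1.5, 2.5]})
for column, dtype in df.dtypes.items():
    # look up the SQL type a column of this frame would get
    print(column, "->", PYTHON_TO_SQL[dtype])
```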
``dask-sql`` is still in early development, therefore some limitations exist:

* Not all operations and aggregations are implemented yet.

  .. note::

     Whenever you find an operation, keyword or functionality which is not yet implemented, please raise an issue at our issue tracker describing your use-case.

* Dask/pandas and SQL treat null-values (or NaN) differently when sorting, grouping and joining.
  ``dask-sql`` tries to follow the SQL standard as much as possible, so results might differ from what you expect from Dask/pandas.
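The grouping difference can be seen with pandas alone: by default, ``groupby`` drops rows with a missing key, whereas SQL's ``GROUP BY`` keeps a NULL group. A small pandas-only illustration (not dask-sql code):

```python
import pandas as pd

df = pd.DataFrame({"key": ["a", "a", None], "value": [1, 2, 3]})

# pandas default: rows with a missing key silently drop out of the result
pandas_style = df.groupby("key")["value"].sum()

# SQL keeps a NULL group; pandas can mimic this with dropna=False,
# which is closer to what dask-sql aims for
sql_style = df.groupby("key", dropna=False)["value"].sum()

print(len(pandas_style))  # 1 - only the "a" group
print(len(sql_style))     # 2 - the "a" group and the NULL group
```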
Apart from those functional limitations, there is one operation which needs special care: ``ORDER BY``.
Normally, ``dask-sql`` calls create a ``dask`` data frame, which gets only computed when you call ``.compute()``.
Due to internal constraints, this is currently not the case for ``ORDER BY``: including this operation will trigger a computation of the full data frame already when calling ``Context.sql()``.
.. warning::

    There is a subtle but important difference between adding ``LIMIT 10`` to your SQL query and calling ``sql(...).head(10)``.
    The data inside ``dask`` is partitioned to distribute it over the cluster.
    ``head`` will only return the first N elements from the first partition - even if N is larger than the partition size.
    As a benefit, calling ``.head(N)`` is typically faster than calculating the full data sample with ``.compute()``.
    ``LIMIT``, on the other hand, will always return the first N elements - no matter over how many partitions they are scattered -
    but it also needs to precompute the first partition to find out whether it has to look into all the data or not.
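The mechanics of ``head`` looking only into the first partition can be mimicked with plain pandas, treating a list of frames as partitions. This is an illustration of the behaviour, not actual dask code:

```python
import pandas as pd

# simulate a partitioned data frame: three partitions of 4 rows each
partitions = [pd.DataFrame({"x": range(i * 4, (i + 1) * 4)}) for i in range(3)]

# what .head(10) does by default: look only into the first partition,
# so at most 4 rows come back even though 10 were requested
head_like = partitions[0].head(10)

# what LIMIT 10 does: collect rows across partitions until 10 are found
limit_like = pd.concat(partitions).head(10)

print(len(head_like))   # 4 - capped by the first partition's size
print(len(limit_like))  # 10
```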