As described in :ref:`quickstart`, it is possible to register an already
created dask dataframe with a call to c.create_table
.
However, it is also possible to load data directly from disk (or s3, hdfs, URL, hive, ...)
and register it as a table in dask_sql
.
Behind the scenes, a call to one of the read_<format>
of the dask.dataframe
will be executed.
Additionally, queries can be materialized into new tables for caching or faster access.
CREATE [ OR REPLACE ] TABLE [ IF NOT EXISTS ] <table-name> WITH ( <key> = <value> [ , ... ] ) CREATE [ OR REPLACE ] TABLE [ IF NOT EXISTS ] <table-name> AS ( SELECT ... ) CREATE [ OR REPLACE ] VIEW [ IF NOT EXISTS ] <table-name> AS ( SELECT ... ) DROP TABLE | VIEW [ IF EXISTS ] <table-name>
See :ref:`sql` for information on how to reference tables correctly. Please note, that there can only ever exist a single view or table with the same name.
Note
As there is only a single schema "schema" in dask-sql
,
table names should not include a separator "." in CREATE
calls.
By default, if a table with the same name does already exist, dask-sql
will raise an exception
(and in turn will raise an exception if you try to delete a table which is not present).
With the flags IF [NOT] EXISTS
and OR REPLACE
, this behavior can be controlled:
CREATE OR REPLACE TABLE | VIEW
will override an already present table/view with the same name without raising an exception.CREATE TABLE IF NOT EXISTS
will not create the table/view if it already exists (and will also not raise an exception).DROP TABLE | VIEW IF EXISTS
will only drop the table/view if it exists and will not do anything otherwise.
This will create and register a new table "df" with the data under the specified location
and format.
For information on how to specify key-value arguments properly, see :ref:`sql`.
With the persist
parameter, it can be controlled if the data should be cached
or re-read for every SQL query.
The additional parameters are passed to the particular data loading functions.
If you omit the format argument, it will be deduced from the file name extension.
More ways to load data can be found in :ref:`data_input`.
Example:
CREATE TABLE df WITH ( location = "/some/file/path", format = "csv/parquet/json/...", persist = True, additional_parameter = value, ... )
Using a similar syntax, it is also possible to create a (materialized) view of a (maybe complicated) SQL query.
With the command, you give the result of the SELECT
query a name, that you can use
in subsequent calls.
The SELECT
can also contain a call to PREDICT
, see :ref:`ml`.
Example:
CREATE TABLE my_table AS (
SELECT
a, b, SUM(c)
FROM data
GROUP BY a, b
...
)
SELECT * FROM my_table
Instead of using CREATE TABLE
it is also possible to use CREATE VIEW
.
The result is very similar, the only difference is, when the result will be computed: a view is recomputed on every usage,
whereas a table is only calculated once on creation (also known as a materialized view).
This means, if you e.g. read data from a remote file and the file changes, a query containing a view will
be updated whereas a query with a table will stay as it is.
To update a table, you need to recreate it.
Hint
Use views to simplify complicated queries (like a "shortcut") and tables for caching.
Note
The update of the view only works, if your primary data source (the files you were reading in), are not persisted during reading.
Example:
CREATE VIEW my_table AS (
SELECT
a, b, SUM(c)
FROM data
GROUP BY a, b
...
)
SELECT * FROM my_table
Remove a table or view with the given name.
Please note again, that views and tables are treated equally, so CREATE TABLE
will also delete the view with the given name and vise versa.