Skip to content

Commit

Permalink
Merge pull request #236 from mabel-dev/FEATURE/#234
Browse files Browse the repository at this point in the history
Feature/#234
  • Loading branch information
joocer committed Jun 26, 2022
2 parents cbcd0ee + 586ad20 commit 633aa7b
Show file tree
Hide file tree
Showing 17 changed files with 85 additions and 21 deletions.
6 changes: 5 additions & 1 deletion README.md
Original file line number Diff line number Diff line change
Expand Up @@ -35,7 +35,7 @@ If a cluster, region or datacentre is unavailable, if you have instances able to

Opteryx supports many popular data formats, including Parquet, ORC, Feather and JSONL, stored on local disk or on Cloud Storage. You can mix-and-match formats, so one dataset can be Parquet and another JSONL, and Opteryx will be able to JOIN across them.

**Consumption-Based Billing**
**Consumption-Based Billing Friendly**

Opteryx is perfect for deployments to environments which are pay-as-you-use, like Google Cloud Run. Great for situations where you low-volume usage, or many environments, where the costs of many traditional database deployment can quickly add up.

Expand All @@ -47,6 +47,10 @@ Opteryx is an Open Source Python library, it quickly and easily integrates into

Designed for data analytics in environments where decisions need to be replayable, Opteryx allows you to query data as at a point in time in the past to replay decision algorithms against facts as they were known in the past. _(data must be structured to enable temporal queries)_

**Schema Evolution**

Changes to schemas and paritioning can be made without requiring any existing data to be updated. _(data types can only be changed to compatitble types)_

## How Can I Contribute?

All contributions, [bug reports](https://github.com/mabel-dev/opteryx/issues/new/choose), bug fixes, documentation improvements, enhancements, and [ideas](https://github.com/mabel-dev/opteryx/discussions) are welcome.
Expand Down
4 changes: 2 additions & 2 deletions bench/multiprocess_wrapper.py
Original file line number Diff line number Diff line change
Expand Up @@ -8,14 +8,14 @@
import os
import time
import logging
import psutil
import pyarrow
from queue import Empty
import multiprocessing


TERMINATE_SIGNAL = -1
MAXIMUM_SECONDS_PROCESSES_CAN_RUN = 3600
CPUS = psutil.cpu_count(logical=False)
CPUS = pyarrow.io_thread_count()


def _inner_process(func, source_queue, reply_queue, channel): # pragma: no cover
Expand Down
3 changes: 3 additions & 0 deletions docs/Features/Schema Evolution.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,3 @@
# Schema Evolution

[Opteryx](https://mabel-dev.github.io/opteryx/)
3 changes: 3 additions & 0 deletions docs/Features/Time Travel.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,3 @@
# Time Travel

[Opteryx](https://mabel-dev.github.io/opteryx/)
2 changes: 2 additions & 0 deletions docs/Release Notes/Change Log.md
Original file line number Diff line number Diff line change
Expand Up @@ -17,6 +17,8 @@ The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/),
- Not Regular Expression match operator, `!~` added to supported set of operators. ([@joocer](https://github.com/joocer))
- [[#226](https://github.com/mabel-dev/opteryx/issues/226)] Implement `DATE_TRUNC` function. ([@joocer](https://github.com/joocer))
- [[#230](https://github.com/mabel-dev/opteryx/issues/230)] Allow addressing fields as numbers. ([@joocer](https://github.com/joocer))
- [[#234](https://github.com/mabel-dev/opteryx/issues/234)] Implement `SEARCH` function. ([@joocer](https://github.com/joocer))


**Changed**

Expand Down
1 change: 0 additions & 1 deletion docs/Release Notes/Notices.md
Original file line number Diff line number Diff line change
Expand Up @@ -13,7 +13,6 @@ Component | Disposition | Copyright | Licence
[orjson](https://github.com/ijl/orjson) | Installed | . | [Apache 2.0](https://github.com/ijl/orjson/blob/master/LICENSE-APACHE)
[pyarrow](https://github.com/apache/arrow/) | Installed | . | [Apache 2.0](https://github.com/apache/arrow/blob/master/LICENSE.txt)
[pyarrow_ops](https://github.com/TomScheffers/pyarrow_ops) | Integrated | TomScheffers (assumed) | [Apache 2.0](https://github.com/TomScheffers/pyarrow_ops/blob/main/LICENSE)
[psutil](https://github.com/giampaolo/psutil) | Installed | . | [BSD-3](https://github.com/giampaolo/psutil/blob/master/LICENSE)
[pyyaml](https://pyyaml.org/) | Installed | . | [MIT](https://github.com/yaml/pyyaml/blob/master/LICENSE)
[sqloxide](https://github.com/wseaton/sqloxide) | Installed | . | [MIT](https://github.com/wseaton/sqloxide/blob/master/LICENSE)
[sqlparser-rs](https://github.com/sqlparser-rs/sqlparser-rs) | Transitive | . | [Apache 2.0](https://github.com/sqlparser-rs/sqlparser-rs/blob/main/LICENSE.TXT)
Expand Down
12 changes: 9 additions & 3 deletions docs/SQL Reference/06 Functions.md
Original file line number Diff line number Diff line change
@@ -1,5 +1,8 @@
# Functions

!!! note
:fontawesome-solid-asterisk: indicates a function has more than one entry on this page.

## Numeric Functions

Function | Description | Example
Expand All @@ -17,10 +20,12 @@ Functions for examining and manipulating string values.

Function | Description | Example
--------------- | ------------------------------------------------- | ---------------------------
`GET(str, n)` :fontawesome-solid-asterisk: | Gets the nth element in a string, also `str[n]` | `GET('hello', 2) -> 'e'`
`LEFT(str, n)` | Extract the left-most n characters | `LEFT('hello', 2) -> 'he'`
`LEN(str)` | Number of characters in string, also `LENGTH` | `LEN('hello') -> 5`
`LOWER(str)` | Convert string to lower case | `LOWER('Hello') -> 'hello'`
`RIGHT(str, n)` | Extract the right-most n characters | `RIGHT('hello', 2) -> 'lo'`
`SEARCH(str, val)` :fontawesome-solid-asterisk: | Return True if str contains val | `SEARCH('hello', 'lo') -> TRUE`
`STRING(any)` | Alias of `VARCHAR()` | `STRING(22) -> '22'`
`TRIM(str)` | Removes any spaces from either side of the string | `TRIM(' hello ') -> 'hello'`
`UPPER(str)` | Convert string to upper case | `UPPER('Hello') -> 'HELLO'`
Expand Down Expand Up @@ -66,15 +71,16 @@ Function | Description | Exampl
------------------- | ------------------------------------------------- | ---------------------------
`BOOLEAN(str)` | Convert input to a Boolean | `BOOLEAN('true') -> True`
`CAST(any AS type)` | Cast any to type, calls `type(any)` | `CAST(state AS BOOLEAN) -> False`
`GET(struct, a)` | Gets the element called 'a' from a struct, also `struct[a]` | `GET(dict, 'key') -> 'value'`
`GET(list, n)` | Gets the nth element in a list, also `list[n]` | `GET(names, 2) -> 'Joe'`
`GET(str, n)` | Gets the nth element in a string, also `str[n]` | `GET('hello', 2) -> 'e'`
`GET(list, n)` :fontawesome-solid-asterisk: | Gets the nth element in a list, also `list[n]` | `GET(names, 2) -> 'Joe'`
`GET(struct, a)` :fontawesome-solid-asterisk: | Gets the element called 'a' from a struct, also `struct[a]` | `GET(dict, 'key') -> 'value'`
`HASH(str)` | Calculate the [CityHash](https://opensource.googleblog.com/2011/04/introducing-cityhash.html) (64 bit) of a value | `HASH('hello') -> 'B48BE5A931380CE8'`
`LIST_CONTAINS(list, val)` | Test if a list field contains a value | `LIST_CONTAINS(letters, '1') -> false`
`LIST_CONTAINS_ANY(list, vals)` | Test if a list field contains any of a list of values | `LIST_CONTAINS_ANY(letters, ('1', 'a')) -> true`
`LIST_CONTAINS_ALL(list, vals)` | Test is a list field contains all of a list of values | `LIST_CONTAINS_ALL(letters, ('1', 'a')) -> false`
`MD5(str)` | Calculate the MD5 hash of a value | `MD5('hello') -> '5d41402abc4b2a76b9719d911017c592'`
`RANDOM()` | Random number between 0.000 and 0.999 | `RANDOM() -> 0.234`
`SEARCH(list, val)` :fontawesome-solid-asterisk: | Return True if val is an item in list | `SEARCH(names, 'John') -> TRUE`
`SEARCH(struct, val)` :fontawesome-solid-asterisk: | Return True if any of the keys or values in struct is val | `SEARCH(dict, 'key') -> TRUE`
`UNNEST(list)` | Create a virtual table with a row for each element in the LIST | `UNNEST((TRUE,FALSE)) AS Booleans`
`VERSION()` | Return the version of Opteryx | `VERSION() -> 0.1.0`

Original file line number Diff line number Diff line change
Expand Up @@ -15,6 +15,7 @@ In Opteryx a list is an ordered collection of zero or more values of the same da

### Functions

`SEARCH`
`LIST_CONTAINS`
`LIST_CONTAINS_ANY`
`LIST_CONTAINS_ALL`
Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -6,3 +6,7 @@ In Opteryx a struct is a collection of zero or more key, value pairs. Keys must

`struct[key]`
`MAP(struct, key)`

## Functions

`SEARCH`
2 changes: 1 addition & 1 deletion docs/index.md
Original file line number Diff line number Diff line change
@@ -1,6 +1,6 @@
# Overview

[Opteryx](https://mabel-dev.github.io/opteryx/) is a SQL query engine to query large data sets designed to run in low-cost serverless environments.
is a SQL query engine to query large data sets designed to run in low-cost serverless environments.

## Use Cases

Expand Down
4 changes: 2 additions & 2 deletions opteryx/config.py
Original file line number Diff line number Diff line change
Expand Up @@ -10,7 +10,7 @@
# See the License for the specific language governing permissions and
# limitations under the License.

import psutil
import pyarrow
import yaml

from pathlib import Path
Expand All @@ -31,7 +31,7 @@
# The maximum number of records to create in a CROSS JOIN frame
MAX_JOIN_SIZE: int = int(_config.get("MAX_JOIN_SIZE", 1000000))
# The maximum number of processors to use for multi processing
MAX_SUB_PROCESSES: int = int(_config.get("MAX_SUB_PROCESSES", psutil.cpu_count(logical=False)))
MAX_SUB_PROCESSES: int = int(_config.get("MAX_SUB_PROCESSES", pyarrow.io_thread_count()))
# The number of bytes to allocate for each processor
BUFFER_PER_SUB_PROCESS: int = int(_config.get("BUFFER_PER_SUB_PROCESS", 100000000))
# The number of seconds before forcably killing processes
Expand Down
18 changes: 9 additions & 9 deletions opteryx/engine/functions/__init__.py
Original file line number Diff line number Diff line change
Expand Up @@ -14,6 +14,7 @@
These are a set of functions that can be applied to data.
"""
import datetime
import os

from cityhash import CityHash64
from pyarrow import compute
Expand All @@ -27,8 +28,6 @@

def get_random():
"""get a random number between 0 and 1, three decimal places"""
import os

range_min, range_max = 0, 1000
random_int = int.from_bytes(os.urandom(2), "big")
try:
Expand All @@ -40,7 +39,7 @@ def get_random():
def get_md5(item):
"""calculate MD5 hash of a value"""
# this is slow but expected to not have a lot of use
import hashlib
import hashlib # delay the import - it's rarely needed

return hashlib.md5(str(item).encode()).hexdigest() # nosec - meant to be MD5

Expand Down Expand Up @@ -70,19 +69,19 @@ def _get(value, item):
}


def cast(type):
def cast(_type):
"""cast a column to a specified type"""
if type in VECTORIZED_CASTERS:
return lambda a: compute.cast(a, VECTORIZED_CASTERS[type])
if type in ITERATIVE_CASTERS:
if _type in VECTORIZED_CASTERS:
return lambda a: compute.cast(a, VECTORIZED_CASTERS[_type])
if _type in ITERATIVE_CASTERS:

def _inner(arr):
caster = ITERATIVE_CASTERS[type]
caster = ITERATIVE_CASTERS[_type]
for i in arr:
yield [caster(i)]

return _inner
raise SqlError(f"Unable to cast to type {type}")
raise SqlError(f"Unable to cast values in column to `{_type}`")


def _iterate_no_parameters(func):
Expand Down Expand Up @@ -160,6 +159,7 @@ def get_len(obj):
"LIST_CONTAINS": _iterate_double_parameter(other_functions._list_contains),
"LIST_CONTAINS_ANY": _iterate_double_parameter(other_functions._list_contains_any),
"LIST_CONTAINS_ALL": _iterate_double_parameter(other_functions._list_contains_all),
"SEARCH": other_functions._search,
# NUMERIC
"ROUND": compute.round,
"FLOOR": compute.floor,
Expand Down
31 changes: 31 additions & 0 deletions opteryx/engine/functions/other_functions.py
Original file line number Diff line number Diff line change
Expand Up @@ -10,6 +10,11 @@
# See the License for the specific language governing permissions and
# limitations under the License.

import numpy
import pyarrow

from pyarrow import compute


def _list_contains(array, item):
if array is None:
Expand All @@ -27,3 +32,29 @@ def _list_contains_all(array, items):
if array is None:
return False
return set(array).issuperset(items)


def _search(array, item):
"""
`search` provides a way to look for values across different field types, rather
than doing a LIKE on a string, IN on a list, `search` adapts to the field type.
"""
if len(array) > 0:
array_type = type(array[0])
else:
return None
if array_type == str:
# return True if the value is in the string
res = compute.find_substring(array, pattern=item, ignore_case=True)
res = ~(res.to_numpy() < 0)
return ([r] for r in res)
if array_type == numpy.ndarray:
return ([False] if record is None else [item in record] for record in array)
if array_type == dict:
return (
[False]
if record is None
else [item in record.keys() or item in record.values()]
for record in array
)
return [False] * array.shape[0]
2 changes: 1 addition & 1 deletion opteryx/version.py
Original file line number Diff line number Diff line change
Expand Up @@ -16,4 +16,4 @@
2) we can import it in setup.py for the same reason
"""

__version__ = "0.0.3-beta.19"
__version__ = "0.0.3-beta.20"
1 change: 0 additions & 1 deletion requirements.txt
Original file line number Diff line number Diff line change
Expand Up @@ -4,5 +4,4 @@ orjson
cityhash
sqloxide
pyarrow
psutil
pyyaml
9 changes: 9 additions & 0 deletions tests/sql_battery/test_battery_shape.py
Original file line number Diff line number Diff line change
Expand Up @@ -354,6 +354,15 @@
("SELECT DISTINCT * FROM (SELECT DATE_TRUNC('year', birth_date) AS BIRTH_YEAR FROM $astronauts)", 54, 1),
("SELECT DISTINCT * FROM (SELECT DATE_TRUNC('month', birth_date) AS BIRTH_YEAR_MONTH FROM $astronauts)", 247, 1),

("SELECT SEARCH(name, 'al'), name FROM $satellites", 177, 2),
("SELECT name FROM $satellites WHERE SEARCH(name, 'al')", 18, 1),
("SELECT SEARCH(missions, 'Apollo 11'), missions FROM $astronauts", 357, 2),
("SELECT name FROM $astronauts WHERE SEARCH(missions, 'Apollo 11')", 3, 1),
("SELECT name, SEARCH(birth_place, 'Italy') FROM $astronauts", 357, 2),
("SELECT name, birth_place FROM $astronauts WHERE SEARCH(birth_place, 'Italy')", 1, 2),
("SELECT name, birth_place FROM $astronauts WHERE SEARCH(birth_place, 'Rome')", 1, 2),
("SELECT name, birth_place FROM $astronauts WHERE SEARCH(birth_place, 'town')", 357, 2),

# These are queries which have been found to return the wrong result or not run correctly
# FILTERING ON FUNCTIONS
("SELECT DATE(birth_date) FROM $astronauts FOR TODAY WHERE DATE(birth_date) < '1930-01-01'", 14, 1),
Expand Down
3 changes: 3 additions & 0 deletions tests/sql_battery/test_expect_to_fail.py
Original file line number Diff line number Diff line change
Expand Up @@ -22,6 +22,9 @@
# DECLARE isn't supported
("DELARE @variable AS NUMERIC = 3 SELECT * FROM $planets WHERE ID = @variable"),

# SELECT EXCEPT isn't supported
# https://towardsdatascience.com/4-bigquery-sql-shortcuts-that-can-simplify-your-queries-30f94666a046
("SELECT * EXCEPT id FROM $satellites"),
]
# fmt:on

Expand Down

0 comments on commit 633aa7b

Please sign in to comment.