Skip to content

italia/daf-srv-storage

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

50 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

Storage Manager

Deployment

requirements

  1. you should be logged into the internal nexus repository

Since the registry uses https but without a valid certificate, you need to add the registry to the insecure registries. Then you can run the following command.

$ docker login https://10.98.74.120:5000

Publish Image

  1. sbt clean compile
  2. sbt docker:publish

Deploy application

Now we can deploy the application in Kubernetes, with the following steps:

  1. ssh in the edge1 or edge2 and check the folder /glusterfs/volume1/storage-manager/conf exists
  2. place a file production.conf with all the production configuration overrides.
  3. cd kubernetes
  4. execute $ deploy.sh. it creates a file output.yml that replaces the variable STORAGE_MANAGER_VERSION with 1.0.0-SNAPSTHOT, calls kubectl delete -f output.yml and kubectl create -f output.yml

Capabilities

This component exposes APIs to query and download data from HDFS.

In particular it allows to:

  1. download a full dataset by a logical_uri
  2. get the schema of a dataset by a logical_uri
  3. get the results of a query operated on a dataset, referred to by its logical_uri

Query API

The query API allows filtering of a dataset by application of a subset of functionality as derived from traditional SQL. The exposed functions come in the form of nested JSON objects, as explained below, with an obvious tradeoff between abstraction, verbosity and functionality.

Representation

Below you can find an example of body that can be used to make queries.

http://[host]:[port]/dataset-manager/v1/dataset/[logical-url]/search

where logical-url would be a URL-encoded URI as defined by the catalog-manager APIs, e.g. daf%3A%2F%2Fdataset%2Ford%2Falessandro%2Fdefault_org%2FAGRI%2Forganizzazioni%2Fagency_infer_ale

A query is made up of clauses, of which the following are currently supported:

  1. select
  2. where
  3. groupBy
  4. having
  5. limit
  6. join
  7. union
1. select

In the example below, we are selecting a column named col1 and a column named col2 with an alias of alias_col2.

{
  "select": [
    { "name": "col1" }, 
    { "name": "col2", "alias": "alias_col2" }
  ]
}

This is equivalent to an SQL of

SELECT col1, col2 AS alias_col2
FROM table

Note that the select clause is optional, where the wildcard * is assumed in case it is omitted.

2. where

In the example below, we are applying a filter where col1 must be greater than col2.

{
  "where": {
    "gt": { "left": "col1", "right": "col2" }
  }
}

This is equivalent to an SQL of

SELECT *
FROM table
WHERE col1 > col2

More complex filters are also possible by nesting logical operators such as and, or and not

{
  "where": {
    "not": {
      "and": [{
        "or": [{
          "gt": { "left": "col1", "right": "col2" }
        }, {
          "eq": { "left": "col3", "right": false }
        }]
      }, {
        "neq": { "left": "col4", "right": "'string'" }
      }]
    }
  }
}

This is equivalent to an SQL of

SELECT *
FROM table
WHERE NOT(
      col1 > col2 
   OR col3 == false
  AND col4 <> 'string'
)

Note how string is enclosed in single quotes to represent the constant string in the JSON example.

3. groupBy

In this example, we are grouping by col1 and col2, aggregating col3 using max and counting all in-group occurrences, applying no filters.

{
  "select": [
    { "name": "col1" },
    { "name": "col2" },
    {
      "max": { "name": "col3", "alias": "max_col3" }
    }, {
      "count": { "name": "*" }
    }
  ],
  "groupBy": [
    { "name": "col1" }, 
    { "name": "col2" }
  ]
}

This is equivalent to an SQL of

SELECT col1, col2, MAX(col3) AS max_col3, COUNT(*)
FROM table
GROUP BY col1, col2

Currently, the validation for groupBy queries is pushed down to the query-engine, which validates the query before applying it to the dataset. This relieves he web-layer from complex validations, but in turn gives up control over the error messages that can be returned in case of invalid group-by syntax.

4. having

In this example, we extend the one we've seen above in 3., adding a having clause, which is used to filter out group-by result by aggregations.

{
  "select": [
    { "name": "col1" },
    { "name": "col2" },
    {
      "max": { "name": "col3" },
      "alias": "max_col3"
    }, {
      "count": { "name": "*" }
    }
  ],
  "groupBy": [
    { "name": "col1" }, 
    { "name": "col2" }
  ],
  "having": [
    { 
      "gt": { "left": "max_col3", "right": 50 } 
    }
  ]
}

This is equivalent to an SQL of

SELECT col1, col2, MAX(col3) AS max_col3, COUNT(*)
FROM table
GROUP BY col1, col2
HAVING max_col3 > 50 

Note that the application will not allow a query to have a having clause without a groupBy.

5. limit

The limit clause will simply return at most the requested number of rows in the result.

{
  "limit": 5
}

This is equivalent to an SQL of

SELECT * 
FROM table
LIMIT 5
6. join

The join clause will accepts any number of join clauses whose type is one of left, right, inner or outer. All join types, share the same representation, with the joined table being represented as a URI that would be used in other APIs to fetch catalog data. This URI is resolved by the engine, its permissions validated and aliases applied as shown below.

{
  "select": [
    { "name": "T1.col1", "alias": "t1_col1" },
    { "name": "JT1.col2", "alias": "t2_col2" }
  ],
  "join": [
    {
      "inner": {
        "uri": "daf://dataset/test/otherTable",
        "on": { 
          "eq": { "left": "JT1.col3", "right": "T1.col3" } 
        } 
      }
    }
  ],
  "where": { 
    "eq": { "left": "JT1.col4", "right": 100 }
  }
}

This is equivalent to an SQL of

SELECT T1.col1 AS t1_col1, T2.col2 AS t2_col2
FROM table T1
JOIN otherTable JT1 
  ON JT1.col3 = T1.col3 
WHERE JT1.col4 = 100

Note how the aliases are applied to the tables in the SQL statement; the engine will internally apply an alias of T1 to the main table in the relation, i.e. the table in the FROM part of the query. The other tables in any of the subsequent join clauses will have aliases starting from JT1 going up. So, for instance, a table which is joined two ways would produce an alias T1 for the main table, and then another two JT1 and JT2 for the joined tables.

7. union

The union clause accepts what is essentially a sub-query which supports only select and where clauses that are operated on a uri. Note that both where and select are optional, where no select clause produces an SQL equivalent of SELECT *.

{
  "union": [
    {
      "uri": "daf://dataset/test",
      "where": { 
        "eq": { "left": "col2", "right": 150 }
    }
    }
  ],
  "where": { 
    "eq": { "left": "col3", "right": false }
  }
}

This is equivalent to an SQL of

SELECT *
FROM table
WHERE col3 = false
UNION ALL 
SELECT * 
FROM otherTable 
WHERE col2 = 150

Note that for performance reasons, only UNION ALL is supported. The hight cost of applying DISTINCT over a large union might be prohibitive for most practical cases.

Appendix

This section will outline the different types of representations exposed by the Query API, attempting to make parallels with SQL where possible for simplicity.

Column Representation

A Column can take one various shapes, where each is possible to use in different contexts.

named-column

Represents a column in a table, referred to by name and possibily given an alias.

{ "name": "col1" } 
{ "name": "col2", "alias": "col2_alias" }

If used in a select, in SQL this would be equivalent to

SELECT col1, col2 AS col2_alias
constant-value-column

Represents a column with a constant and possibly an alias.

{ "value": "string" } 
{ "value": 1, "alias": "one_alias" }

If used in a select, in SQL this would be equivalent to

SELECT 'string', 1 AS one_alias
aggregated-column

Represents a column to which an aggregation function is applies, possibly given an alias

{ 
  "max": { "name": "col1", "alias": "max_col1" }
} 

In a select in SQL, this would be equivalent to

SELECT MAX(col1) AS max_col1

The API currently supports only min, max, count, sum and avg, which may be extended to support more aggregation in the near future.

Filter Operator Representation

Filter operators can be either comparison operators, which compare columns to values or other column, or logical, which act to affect comparison operators.

An example of a comparison operator would be gt, which can be represented as

{ 
  "gt": { "left": "col3", "right": 50 }
}
{
  "eq": { "left": "col1", "right": "col2" }
}

Here we see first a gt comparison between a column col3 and a constant value 50, whereas in the second we see an eq comparison between two columns col1 and col2. The API current supports only gt, gte, lt, lte, eq and neq, including support for in and like (with limitations) in the future.

logical operators can be used to decorate or group a number of comparisons or other filters. This means that logical operators can be nested to create complex filters.

{
  "not": { 
    "gt": { "left": "col3", "right": 50 }
  }
}
{
  "and": [{
    "gt": { "left": "col3", "right": 50 }
  }, {
    "eq": { "left": "col1", "right": "col2" }
  }]  
}
{
  "or": [{
    "gt": { "left": "col3", "right": 50 }
  }, {
    "eq": { "left": "col1", "right": "col2" }
  }]  
}