Semantic Query Support

macan edited this page May 26, 2011 · 5 revisions

Features

Mainstream file systems lack native semantic query support. Users who want to run semantic queries have to index the file system metadata or extended attributes themselves.

For example, to answer the query "give me a list of the files created in the last week", you may have to do a brute-force search of the whole file system. If the file system supported native semantic queries, you could simply issue a query such as "search whole_fs 'range: ctime in LAST_WEEK'". A semantic query engine answers this by searching the 'ctime' indexes in a distributed fashion.
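The difference between the two approaches can be sketched in a few lines of Python. This is only a single-machine toy, not PomegranateFS code: the names `brute_force_range` and `CtimeIndex` are hypothetical, and the real engine searches distributed indexes rather than an in-memory list.

```python
import bisect


def brute_force_range(files, low, high):
    """Check every file's ctime: cost grows with the total number of files."""
    return sorted(path for path, ctime in files.items() if low <= ctime <= high)


class CtimeIndex:
    """Toy in-memory index that keeps ctime values sorted (illustrative only)."""

    def __init__(self):
        self._ctimes = []  # sorted ctime values
        self._paths = []   # paths, kept parallel to self._ctimes

    def add(self, path, ctime):
        # Insert at the position that keeps both lists ordered by ctime.
        pos = bisect.bisect_right(self._ctimes, ctime)
        self._ctimes.insert(pos, ctime)
        self._paths.insert(pos, path)

    def range(self, low, high):
        """Return paths with low <= ctime <= high using two binary searches."""
        start = bisect.bisect_left(self._ctimes, low)
        end = bisect.bisect_right(self._ctimes, high)
        return self._paths[start:end]


files = {"/a": 100, "/b": 200, "/c": 300}
index = CtimeIndex()
for path, ctime in files.items():
    index.add(path, ctime)

print(brute_force_range(files, 150, 300))
print(index.range(150, 300))
```

Both calls select the same files, but the index only touches the entries inside the requested range once it has located the two boundaries.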

Native semantic query in PomegranateFS has the following features:

  • A stream indexer which can be reconfigured at any time;
  • User-defined indexers which can index many different standard or extended attributes;
  • Some predefined analysis operators which provide statistical information;
  • An integrated framework, built into the file system, that indexes files automatically.

Query Interface

There is no standard for semantic queries in file systems, so we have to build our own 'standard'. We chose to reuse the POSIX interface for extended attributes: we define a special (operational) namespace "pfs" within the extended attributes, and operations in this namespace are automatically transformed into semantic queries.

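| Class  | Column | Operation | Other region                     | Note or Example                                  |
|--------|--------|-----------|----------------------------------|--------------------------------------------------|
| native | [0-5]  | read      | .offset.len                      | If len == -1, read whole content.                |
|        |        | write     | [.len]                           | Length is optional.                              |
|        |        | lookup    |                                  | {return column info}; triple: "itbid.len.offset" |
| dt     | ignore | create    | .type.where.priority.local_path  |                                                  |
|        |        | cat       |                                  |                                                  |
|        |        | clear     |                                  |                                                  |
| branch | ignore | create    | .name.tag.level.op_list          |                                                  |
|        |        | delete    | .name                            |                                                  |
| tag    | [0-5]  | set       | .B.kv_list                       |                                                  |
|        |        | delete    | .B.key                           |                                                  |
|        |        | update    | .B.key.value                     |                                                  |
|        |        | test      | .B.key                           |                                                  |
|        |        | search    | .B.dbname.prefix.search_expr     |                                                  |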
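As an illustration of how such an operation might be issued from a program, here is a minimal Python sketch. It assumes that the attribute name is formed by joining "pfs", the class, the column, the operation and the "other region" fields with dots, in the column order of the table above; that layout is an assumption, not something this page confirms. The helper names `pfs_xattr_name` and `read_column` are hypothetical. `os.getxattr` is the standard Python call for reading an extended attribute and is only available on Linux.

```python
import os

# Classes whose column field is listed as [0-5] in the table above.
COLUMN_CLASSES = ("native", "tag")


def pfs_xattr_name(cls, column, operation, *fields):
    """Build a 'pfs' attribute name (assumed layout, see the text above)."""
    if cls in COLUMN_CLASSES and not 0 <= column <= 5:
        raise ValueError("column must be in the range 0-5 for class %r" % cls)
    parts = ["pfs", cls, str(column), operation]
    parts.extend(str(field) for field in fields)
    return ".".join(parts)


def read_column(path, column, offset=0, length=-1):
    """Read a native column through the xattr interface.

    Assumes `path` lives on a mounted PomegranateFS; length == -1 asks for
    the whole content, as noted in the table.
    """
    name = pfs_xattr_name("native", column, "read", offset, length)
    return os.getxattr(path, name)


print(pfs_xattr_name("native", 0, "read", 0, -1))
print(pfs_xattr_name("native", 2, "write"))
```

Keeping the name construction separate from the system call makes it easy to adjust the layout if the real naming scheme differs from the one assumed here.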

Some of the atomic placeholders (B, search_expr, etc.) are defined below:

Name: op_list
Definition:
  filter:id:rid:[l|r]:reg
  sum:id:rid:[l|r]:reg:[left|right|all|match]
  count:id:rid:[l|r]:reg:[left|right|all|match]
  avg:id:rid:[l|r]:reg:[left|right|all|match]
  max:id:rid:[l|r]:reg:[left|right|all|match]
  min:id:rid:[l|r]:reg:[left|right|all|match]
  knn:id:rid:[l|r]:reg:[left|right|all|match]:[linear|xlinear]:center:+/-distance
  groupby:id:rid:[l|r]:reg:[left|right|all|match]:sum/avg/max/min/count
  indexer:id:rid:[l|r]:[plain|bdb]:dbname:prefix
Example:
  filter:1:0:l:.*; count:2:0:r:.*:all; avg:3:1:l:.*:all; sum:4:1:r:.*:all; max:5:2:l:.*:all; min:6:2:r:.*:all; knn:7:3:l:.*:linear:100:+-10; groupby:8:3:r:.*:all:sum/avg/max/min; indexer:9:4:l:bdb:DB:00;

Name: kv_list
Definition:
  key=value;key2=value2;...
Example:
  type=png;@ctime=10000;

Name: B
Definition:
  B[:branch_name[:key1[:key2[...]]]]
Example:
  B:hello_branch:type:@ctime

Name: search_expr
Definition:
  [r|p]: key [=<>] value [&|] key2 [=<>] value2
  r => range query; p => point query
Example:
  r: type=png & @ctime > 100
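To make the kv_list, B and search_expr placeholders more concrete, the following Python sketch parses the three example strings into plain data structures. It is a minimal reading of the definitions as written here, not the parser used by PomegranateFS; the function names and the returned shapes are hypothetical, and values are kept as strings because this page does not describe their types.

```python
import re


def parse_kv_list(text):
    """'type=png;@ctime=10000;' -> {'type': 'png', '@ctime': '10000'}"""
    pairs = {}
    for item in text.split(";"):
        item = item.strip()
        if not item:
            continue  # tolerate the trailing ';'
        key, sep, value = item.partition("=")
        if not sep or not key.strip():
            raise ValueError("bad kv_list item: %r" % item)
        pairs[key.strip()] = value.strip()
    return pairs


def parse_b(text):
    """'B:hello_branch:type:@ctime' -> ('hello_branch', ['type', '@ctime'])"""
    parts = text.split(":")
    if parts[0] != "B":
        raise ValueError("B placeholder must start with 'B'")
    branch = parts[1] if len(parts) > 1 else None
    return branch, parts[2:]


# One 'key op value' term; op is one of '=', '<', '>'.
_TERM = re.compile(r"\s*([^\s=<>&|]+)\s*([=<>])\s*([^\s=<>&|]+)\s*")


def parse_search_expr(text):
    """Split '[r|p]: key op value [&|] ...' into its kind, terms and connectors."""
    kind, sep, body = text.partition(":")
    kind = kind.strip()
    if not sep or kind not in ("r", "p"):
        raise ValueError("search_expr must start with 'r:' or 'p:'")
    terms, connectors, pos = [], [], 0
    while True:
        match = _TERM.match(body, pos)
        if match is None:
            raise ValueError("bad term at offset %d" % pos)
        terms.append(match.groups())
        pos = match.end()
        if pos == len(body):
            break
        if body[pos] not in "&|":
            raise ValueError("expected '&' or '|' at offset %d" % pos)
        connectors.append(body[pos])
        pos += 1
    return {"kind": "range" if kind == "r" else "point",
            "terms": terms, "connectors": connectors}


print(parse_kv_list("type=png;@ctime=10000;"))
print(parse_b("B:hello_branch:type:@ctime"))
print(parse_search_expr("r: type=png & @ctime > 100"))
```

The sketch leaves op_list out on purpose: its "reg" field is a regular expression that may itself contain ':' or ';', so a simple split is not a safe way to parse it.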

How It Works

Gather Information

Process Events

Index Events

Reference

Analyzing, Indexing and Searching Streams of File System triggered Events