
Extend filter vector ability (focus on off-line featurestore) #1604

Closed
george0st opened this issue Jan 3, 2022 · 6 comments

george0st commented Jan 3, 2022

It would be very useful to support rich filtering on the feature vector, e.g.:

High priority

  • support logical conditions, e.g. 'fn2 > 500 and (fn3 <= 500 or fn4 == 500)'

Medium priority

  • support the like operator, e.g. 'fn5 like %sdsd%'
  • support the between and in operators

Low priority

  • fuzzy matching for strings

BTW: get_offline_features currently supports only an exact match, see this part of the code:

import pandas as pd
import mlrun.feature_store as fstore

data = pd.DataFrame({"fn0": [39560793709, 35392257080], "fn1": [27203050525, 13749105613]})
resp = fstore.get_offline_features(vector, entity_rows=data)
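
For comparison, this is roughly how the requested filtering has to be done by hand today, on the result set returned by get_offline_features. This is a sketch only: it assumes the response can be materialized as a pandas DataFrame via to_dataframe(), and it reuses the fn2-fn5 names from the examples above.

df = resp.to_dataframe()  # materialize the offline vector (assumed API)

# logical conditions via pandas query syntax
filtered = df.query("fn2 > 500 and (fn3 <= 500 or fn4 == 500)")

# rough equivalent of "fn5 like %sdsd%"
like = df[df["fn5"].str.contains("sdsd", na=False)]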
yaronha commented Jan 3, 2022

@george0st I assume this filter should be done on the result set (after the join)?
As for the syntax, it would be best to align the filter string with the engine's (pandas/Spark) filter/where syntax; a short sketch of both is below.
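
For illustration, a sketch of how roughly the same condition maps onto each engine's native filter/where API (column names and values are invented):

import pandas as pd
from pyspark.sql import SparkSession

pdf = pd.DataFrame({"fn2": [600, 400], "fn3": [100, 900], "fn4": [500, 1]})

# pandas: DataFrame.query takes a Python-like boolean expression
print(pdf.query("fn2 > 500 and (fn3 <= 500 or fn4 == 500)"))

# Spark: DataFrame.filter / where takes a SQL-like condition string
spark = SparkSession.builder.getOrCreate()
sdf = spark.createDataFrame(pdf)
sdf.filter("fn2 > 500 AND (fn3 <= 500 OR fn4 == 500)").show()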

george0st commented Jan 3, 2022

@yaronha,

  1. I do not think that applying the filter only after the join is the best strategy. Filter conditions can affect the amount of data entering the join (and I can imagine different join types, see Extend join vector ability (focus on on/off-line featurestore) #1605), so a view into the feature-set statistics could improve the execution strategy and the total time/CPU spent (see the typical scenario of building execution plans in databases).
    BTW: I saw that data ingestion can be done with statistics switched off, but I did not see how to force a recalculation of statistics (this may be a gap on my side).
    BTW: I can also imagine that the first version of the implementation is not the final one and works with only a few logical rules (without detailed statistics such as the number of null items, etc.).

  2. I mentioned a sample engine-independent 'SQL-like' syntax only to make the requirement easier to understand, but it is important to mention that Spark also supports standard SQL (ANSI SQL), see https://spark.apache.org/docs/latest/sql-ref-syntax.html. It would be useful to evaluate the best way (an SQL-like syntax, a .NET LINQ-like syntax, or another one are all fully acceptable).

  3. The last point, fuzzy matching, can be optional only (I used it for identification of fraudulent payment transactions via black lists, checks based on transcriptions from different languages into English, etc., see the legal and payment regulations around SWIFT/SEPA, SWIFT CSP, AML - KYC/KYCC, Black/PEP lists; it is a very relevant scenario for the financial and banking environment). A rough sketch of such a check is below.
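
For illustration only, a minimal sketch of such a fuzzy check against a watch list, using the Python standard library (the names, the threshold, and the fuzzy_match helper are all made up for this example):

from difflib import SequenceMatcher

blacklist = ["Ivan Petrov", "Jon Doe"]  # hypothetical watch-list entries

def fuzzy_match(name, candidates, threshold=0.85):
    # True if `name` is close enough to any candidate (simple similarity ratio)
    return any(SequenceMatcher(None, name.lower(), c.lower()).ratio() >= threshold
               for c in candidates)

print(fuzzy_match("Jon Dow", blacklist))    # True  - catches a near transcription
print(fuzzy_match("Alice Rae", blacklist))  # False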

What do you think?

BTW: it is a nice issue in case of a deep dive :-)

george0st commented Jan 4, 2022

@yaronha, to be honest:

  • it is a very hard and never-ending topic for databases (see MS SQL, Teradata, Oracle, ... where you can see a lot of optimization at the level of indexes, hints, tuning of what to keep in memory, etc.)
  • it does not make sense to build this topic from scratch; rather reuse an efficient solution from open source (it can be a win-win with a quick maturity improvement, but I do not know your strategy/direction on this point)
  • I am more than sure that you know the complexity of this, and from this point of view my issue/business description probably seems very funny ;-) (I know)

yaronha commented Jan 4, 2022


@george0st if the filter is done before the join, a user would need to specify a different filter per source feature-set, so logically the filtering should be post-join. Depending on the engine, lazy evaluation and query compilation (e.g. in Spark & Dask) may in practice result in some of the filtering being done before the join anyway.

I agree that SQL semantics are the best, but as you know SQL has different sub-dialects, and since we pass the filter argument as a where clause to the engine, it would use that engine's dialect (Spark, pandas, ...) and the capabilities that engine supports (like, ...). The sketch below illustrates both points on Spark.
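
A minimal sketch of both points on Spark: the filter is written after the join and in Spark's SQL dialect, yet the optimizer pushes the predicates below the join (columns and data are invented):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
left = spark.createDataFrame([(1, 600), (2, 300)], ["key", "fn2"])
right = spark.createDataFrame([(1, 100), (2, 900)], ["key", "fn3"])

# filter expressed post-join, in Spark's SQL dialect
result = left.join(right, "key").filter("fn2 > 500 AND fn3 <= 500")
result.explain()  # the physical plan shows the fn2/fn3 predicates pushed below the join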

george0st commented Jan 4, 2022

@yaronha, I understand the logic: if you unify the filter logic directly at the level of the output results (independent of the targets, since each target can have a different filter language), the situation will not be so complicated. You only need to support filtering at the level of a pandas DataFrame and a Spark DataFrame.

It makes sense to do performance tests for bigger feature sets; a rough sketch of such a test is below.
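
A minimal sketch of such a measurement, assuming a plain pandas merge stands in for the vector join (sizes, columns, and data are arbitrary):

import time
import numpy as np
import pandas as pd

n = 2_000_000
left = pd.DataFrame({"key": np.arange(n), "fn2": np.random.randint(0, 1000, n)})
right = pd.DataFrame({"key": np.arange(n), "fn3": np.random.randint(0, 1000, n)})

t0 = time.perf_counter()
post = left.merge(right, on="key").query("fn2 > 500 and fn3 <= 500")        # filter after the join
t1 = time.perf_counter()
pre = left.query("fn2 > 500").merge(right.query("fn3 <= 500"), on="key")    # filter before the join
t2 = time.perf_counter()

print(f"post-join filter: {t1 - t0:.2f}s, pre-join filter: {t2 - t1:.2f}s")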

@george0st

See the relation to #1956
