
Extend filter vector ability (focus on off-line featurestore) #1604

Closed
george0st opened this issue Jan 3, 2022 · 6 comments

george0st commented Jan 3, 2022

It would be very useful to support rich filtering on the feature vector, e.g.:

High priority

  • support logical conditions, e.g. 'fn2 > 500 and (fn3 <= 500 or fn4 == 500)'

Medium priority

  • support the like operator, e.g. 'fn5 like %sdsd%'
  • support the between and in operators

Low priority

  • fuzzy matching for strings

BTW: get_offline_features currently supports only an exact match, see this part of the code:

import pandas as pd
import mlrun.feature_store as fstore

data = pd.DataFrame({"fn0": [39560793709, 35392257080], "fn1": [27203050525, 13749105613]})
resp = fstore.get_offline_features(vector, entity_rows=data)
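
For comparison, this is roughly how the requested filtering has to be done by hand today, on the result set returned by get_offline_features. This is a sketch only: it assumes the response can be materialized as a pandas DataFrame via to_dataframe(), and it reuses the fn2-fn5 names from the examples above.

df = resp.to_dataframe()  # materialize the offline vector (assumed API)

# logical conditions via pandas query syntax
filtered = df.query("fn2 > 500 and (fn3 <= 500 or fn4 == 500)")

# rough equivalent of "fn5 like %sdsd%"
like = df[df["fn5"].str.contains("sdsd", na=False)]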
yaronha commented Jan 3, 2022

@george0st I assume this filter should be done on the result set (after the join)?
As for the syntax, it would be best to align the filter string with the engine's (pandas/Spark) filter/where syntax; a short sketch of both is below.
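
For illustration, a sketch of how roughly the same condition maps onto each engine's native filter/where API (column names and values are invented):

import pandas as pd
from pyspark.sql import SparkSession

pdf = pd.DataFrame({"fn2": [600, 400], "fn3": [100, 900], "fn4": [500, 1]})

# pandas: DataFrame.query takes a Python-like boolean expression
print(pdf.query("fn2 > 500 and (fn3 <= 500 or fn4 == 500)"))

# Spark: DataFrame.filter / where takes a SQL-like condition string
spark = SparkSession.builder.getOrCreate()
sdf = spark.createDataFrame(pdf)
sdf.filter("fn2 > 500 AND (fn3 <= 500 OR fn4 == 500)").show()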

george0st commented Jan 3, 2022

@yaronha,

  1. I do not think that applying the filter only after the join is the best strategy. Filter conditions can affect the amount of data entering the join (and I can imagine different join types, see Extend join vector ability (focus on on/off-line featurestore) #1605), so a view into the feature-set statistics could improve the execution strategy and the total time/CPU spent (see the typical scenario of building execution plans in databases).
    BTW: I saw that data ingestion can be done with statistics switched off, but I did not see how to force a recalculation of statistics (this may be a gap on my side).
    BTW: I can also imagine that the first version of the implementation is not the final one and works with only a few logical rules (without detailed statistics such as the number of null items, etc.).

  2. I mentioned a sample engine-independent 'SQL-like' syntax only to make the requirement easier to understand, but it is important to mention that Spark also supports standard SQL (ANSI SQL), see https://spark.apache.org/docs/latest/sql-ref-syntax.html. It would be useful to evaluate the best way (an SQL-like syntax, a .NET LINQ-like syntax, or another one are all fully acceptable).

  3. The last point, fuzzy matching, can be optional only (I used it for identification of fraudulent payment transactions via black lists, checks based on transcriptions from different languages into English, etc., see the legal and payment regulations around SWIFT/SEPA, SWIFT CSP, AML - KYC/KYCC, Black/PEP lists; it is a very relevant scenario for the financial and banking environment). A rough sketch of such a check is below.
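
For illustration only, a minimal sketch of such a fuzzy check against a watch list, using the Python standard library (the names, the threshold, and the fuzzy_match helper are all made up for this example):

from difflib import SequenceMatcher

blacklist = ["Ivan Petrov", "Jon Doe"]  # hypothetical watch-list entries

def fuzzy_match(name, candidates, threshold=0.85):
    # True if `name` is close enough to any candidate (simple similarity ratio)
    return any(SequenceMatcher(None, name.lower(), c.lower()).ratio() >= threshold
               for c in candidates)

print(fuzzy_match("Jon Dow", blacklist))    # True  - catches a near transcription
print(fuzzy_match("Alice Rae", blacklist))  # False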

What do you think?

BTW: it is a nice issue in case of a deep dive :-)

george0st commented Jan 4, 2022

@yaronha, to be honest:

  • it is a very hard and never-ending topic for databases (see MS SQL, Teradata, Oracle, ... where you can see a lot of optimization at the level of indexes, hints, tuning of what to keep in memory, etc.)
  • it does not make sense to build this topic from scratch; rather reuse an efficient solution from open source (it can be a win-win with a quick maturity improvement, but I do not know your strategy/direction on this point)
  • I am more than sure that you know the complexity of this, and from this point of view my issue/business description probably seems very funny ;-) (I know)

yaronha commented Jan 4, 2022


@george0st if the filter is done before the join, a user would need to specify a different filter per source feature-set, so logically the filtering should be post-join. Depending on the engine, lazy evaluation and query compilation (e.g. in Spark & Dask) may in practice result in some of the filtering being done before the join anyway.

I agree that SQL semantics are the best, but as you know SQL has different sub-dialects, and since we pass the filter argument as a where clause to the engine, it would use that engine's dialect (Spark, pandas, ...) and the capabilities that engine supports (like, ...). The sketch below illustrates both points on Spark.
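
A minimal sketch of both points on Spark: the filter is written after the join and in Spark's SQL dialect, yet the optimizer pushes the predicates below the join (columns and data are invented):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
left = spark.createDataFrame([(1, 600), (2, 300)], ["key", "fn2"])
right = spark.createDataFrame([(1, 100), (2, 900)], ["key", "fn3"])

# filter expressed post-join, in Spark's SQL dialect
result = left.join(right, "key").filter("fn2 > 500 AND fn3 <= 500")
result.explain()  # the physical plan shows the fn2/fn3 predicates pushed below the join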

george0st commented Jan 4, 2022

@yaronha, I understand the logic: if you unify the filter logic directly at the level of the output results (independent of the targets, since each target can have a different filter language), the situation will not be so complicated. You only need to support filtering at the level of a pandas DataFrame and a Spark DataFrame.

It makes sense to do performance tests for bigger feature sets; a rough sketch of such a test is below.
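
A minimal sketch of such a measurement, assuming a plain pandas merge stands in for the vector join (sizes, columns, and data are arbitrary):

import time
import numpy as np
import pandas as pd

n = 2_000_000
left = pd.DataFrame({"key": np.arange(n), "fn2": np.random.randint(0, 1000, n)})
right = pd.DataFrame({"key": np.arange(n), "fn3": np.random.randint(0, 1000, n)})

t0 = time.perf_counter()
post = left.merge(right, on="key").query("fn2 > 500 and fn3 <= 500")        # filter after the join
t1 = time.perf_counter()
pre = left.query("fn2 > 500").merge(right.query("fn3 <= 500"), on="key")    # filter before the join
t2 = time.perf_counter()

print(f"post-join filter: {t1 - t0:.2f}s, pre-join filter: {t2 - t1:.2f}s")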

@george0st

See the relation to #1956
