## Imports

In [1]:
import io.hops.util.Hops

Starting Spark application


ID,YARN Application ID,Kind,State,Spark UI,Driver log,Current session?
15,application_1541960591245_0046,spark,idle,Link,Link,✔


SparkSession available as 'spark'.
import io.hops.util.Hops


## Get Project Featurestore

Each project with the featurestore enabled gets its own Hive database for the featurestore. The database is named 'projectname_featurestore', and the name can be retrieved from the hops-util featurestore API:

In [2]:
Hops.getProjectFeaturestore

res1: String = test_featurestore
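
The naming convention can be sketched in a few lines. This is only an illustration: the helper `featurestoreName` is hypothetical and not part of hops-util, and it assumes the database name is simply the project name followed by `_featurestore`, without modeling any normalization the platform may apply to project names.

```scala
// Hypothetical helper, not part of hops-util: builds the featurestore
// database name from a project name, following the convention above.
def featurestoreName(projectName: String): String =
  s"${projectName}_featurestore"

// For a project named "test" this matches the value returned above.
val name = featurestoreName("test")
println(name) // test_featurestore
```

In practice, prefer `Hops.getProjectFeaturestore`, which asks the platform for the name instead of reconstructing it.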


## Get all Featurestores Accessible in the Current Project

Feature stores can be shared across projects just like other Hopsworks datasets. You can use this API function to list all the featurestores accessible in the project programmatically.

In [3]:
Hops.getProjectFeaturestores

res2: java.util.List[String] = [test_featurestore]
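
The call returns a `java.util.List[String]`, so it can be handy to convert it to a Scala collection before processing it. The sketch below uses a stand-in list rather than the real call so that it runs on its own; the variable names are illustrative.

```scala
import scala.collection.JavaConverters._

// Stand-in for the value returned by Hops.getProjectFeaturestores;
// in the notebook you would use the real call instead.
val featurestores: java.util.List[String] =
  java.util.Arrays.asList("test_featurestore")

// Convert to a Scala collection to use the usual collection operations.
val scalaFeaturestores: List[String] = featurestores.asScala.toList
scalaFeaturestores.foreach(fs => println(s"accessible featurestore: $fs"))
```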


## Get Individual Feature

When retrieving a single feature from the featurestore, the hops-util library infers which featuregroup the feature belongs to by querying the metastore, but you can also explicitly specify which featuregroup and version to query. If several features in the featurestore share the same name, you must supply enough information to identify the feature uniquely (e.g. which featuregroup and which version). If no featurestore is provided, the call defaults to the project's featurestore.

Without specifying featuregroup:

In [4]:
Hops.getFeature(spark, "action", Hops.getProjectFeaturestore).show(5)

+------+
|action|
+------+
|     0|
|     0|
|     0|
|     0|
|     0|
+------+
only showing top 5 rows



With specified featuregroup and version:

In [5]:
Hops.getFeature(spark, "action", Hops.getProjectFeaturestore, "web_logs_features", 1).show(5)

+------+
|action|
+------+
|     0|
|     0|
|     0|
|     0|
|     0|
+------+
only showing top 5 rows



## Get Featuregroup

You can get an entire featuregroup from the API. If no featurestore is provided, the API defaults to the project's featurestore; if no version is provided, it defaults to version 1 of the featuregroup.

In [6]:
Hops.getFeaturegroup(spark, "trx_summary_features", Hops.getProjectFeaturestore, 1).show(5)

+-------+---------+---------+---------+---------+
|cust_id|  min_trx|  max_trx|  avg_trx|count_trx|
+-------+---------+---------+---------+---------+
|    148| 390.4109|2094.9958| 1090.509|       16|
|    496| 9.235389|1464.5397| 738.1404|       16|
|    463|33.797318|1828.2426|899.89594|       30|
|    471|578.16833|636.18713|607.17773|        4|
|    243|119.73669| 1582.427| 698.5791|       28|
+-------+---------+---------+---------+---------+
only showing top 5 rows



## Get Set of Features

When retrieving a list of features from the featurestore, the hops-util library infers which featuregroups the features belong to by querying the metastore. If the features reside in different featuregroups, the library will also **try** to infer how to join the features together based on common columns. If the JOIN query cannot be inferred, either because several features share the same name or because the JOIN is not obvious, the user needs to supply enough information in the API call to make the query unambiguous. A user who already knows the JOIN query can also run `Hops.queryFeaturestore(joinQuery)` directly (an example of this approach is shown further down in this notebook). If no featurestore is provided, the call defaults to the project's featurestore.

In [7]:
import scala.collection.JavaConversions._
val features = List("pagerank", "triangle_count", "avg_trx")

import scala.collection.JavaConversions._
features: List[String] = List(pagerank, triangle_count, avg_trx)


In [8]:
Hops.getFeatures(spark, features, Hops.getProjectFeaturestore).show(5)

+--------+--------------+---------+
|pagerank|triangle_count|  avg_trx|
+--------+--------------+---------+
|     1.0|           3.0|963.64233|
|     1.0|          12.0| 746.5783|
|     1.0|           7.0|687.91376|
|     1.0|          12.0| 732.6695|
|     1.0|           4.0|  641.785|
+--------+--------------+---------+
only showing top 5 rows



Without specifying the join key but specifying featuregroups:

In [9]:
val featuregroupsMap = Map[String, Integer]("trx_graph_summary_features"->1,"trx_summary_features"->1)
val javaFeaturegroupsMap = new java.util.HashMap[String, Integer](featuregroupsMap)

featuregroupsMap: scala.collection.immutable.Map[String,Integer] = Map(trx_graph_summary_features -> 1, trx_summary_features -> 1)
javaFeaturegroupsMap: java.util.HashMap[String,Integer] = {trx_summary_features=1, trx_graph_summary_features=1}
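
The cell above relies on the implicit conversions from `scala.collection.JavaConversions`, which newer Scala versions deprecate and later remove. As a hedged alternative, the same Java map can be built with the explicit `.asJava` conversion from `scala.collection.JavaConverters`. The variable names below are illustrative and chosen so they do not clash with the notebook's.

```scala
import scala.collection.JavaConverters._

// Featuregroup name -> version, as in the cell above.
val featuregroupVersions = Map[String, Integer](
  "trx_graph_summary_features" -> 1,
  "trx_summary_features" -> 1
)

// Explicit conversion with .asJava instead of the implicit conversion.
val javaFeaturegroupVersions =
  new java.util.HashMap[String, Integer](featuregroupVersions.asJava)
```

The explicit form makes it visible at the call site where a Scala collection crosses into the Java API.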


In [10]:
Hops.getFeatures(spark, features, Hops.getProjectFeaturestore, javaFeaturegroupsMap).show(5)

+--------+--------------+---------+
|pagerank|triangle_count|  avg_trx|
+--------+--------------+---------+
|     1.0|           3.0|963.64233|
|     1.0|          12.0| 746.5783|
|     1.0|           7.0|687.91376|
|     1.0|          12.0| 732.6695|
|     1.0|           4.0|  641.785|
+--------+--------------+---------+
only showing top 5 rows



Specifying both featuregroups and join key:

In [11]:
Hops.getFeatures(spark, features, Hops.getProjectFeaturestore, javaFeaturegroupsMap, "cust_id").show(5)

+--------+--------------+---------+
|pagerank|triangle_count|  avg_trx|
+--------+--------------+---------+
|     1.0|           3.0|963.64233|
|     1.0|          12.0| 746.5783|
|     1.0|           7.0|687.91376|
|     1.0|          12.0| 732.6695|
|     1.0|           4.0|  641.785|
+--------+--------------+---------+
only showing top 5 rows
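
To make the role of the featuregroups and the join key concrete, here is a sketch of the kind of explicit SQL those arguments describe. The helper `buildJoinQuery` is hypothetical and not part of hops-util; it assumes tables are named `<featuregroup>_<version>` (as in the queries elsewhere in this notebook) and uses a plain `JOIN`, which may differ from the query the library actually generates.

```scala
// Hypothetical helper, not part of hops-util: builds an explicit JOIN query
// over featuregroup tables named "<featuregroup>_<version>".
def buildJoinQuery(
    features: Seq[String],
    featuregroups: Seq[(String, Int)],
    joinKey: String): String = {
  require(featuregroups.nonEmpty, "at least one featuregroup is required")
  val tables = featuregroups.map { case (name, version) => s"${name}_$version" }
  val first = tables.head
  // Join every remaining table to the first one on the shared key.
  val joins = tables.tail
    .map(t => s" JOIN $t ON $first.$joinKey = $t.$joinKey")
    .mkString
  s"SELECT ${features.mkString(", ")} FROM $first$joins"
}

val joinQuery = buildJoinQuery(
  Seq("pagerank", "triangle_count", "avg_trx"),
  Seq("trx_graph_summary_features" -> 1, "trx_summary_features" -> 1),
  "cust_id"
)
println(joinQuery)
```

A string built this way could be passed to `Hops.queryFeaturestore(spark, joinQuery, Hops.getProjectFeaturestore)`, the three-argument form used in the free text query section of this notebook.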



### Advanced examples

Getting 10 features from two different featuregroups without specifying the featuregroups:

In [12]:
val features1 = List("pagerank", "triangle_count", "avg_trx", "count_trx", "max_trx", "min_trx", "balance", "birthdate", "join_date", "number_of_accounts")
Hops.getFeatures(spark, features1, Hops.getProjectFeaturestore).show(5)

features1: List[String] = List(pagerank, triangle_count, avg_trx, count_trx, max_trx, min_trx, balance, birthdate, join_date, number_of_accounts)
+--------+--------------+---------+---------+---------+---------+---------+-------------------+-------------------+------------------+
|pagerank|triangle_count|  avg_trx|count_trx|  max_trx|  min_trx|  balance|          birthdate|          join_date|number_of_accounts|
+--------+--------------+---------+---------+---------+---------+---------+-------------------+-------------------+------------------+
|     1.0|           5.0| 1090.509|       16|2094.9958| 390.4109|12920.496|2003-04-12 00:00:00|1998-09-06 00:00:00|                10|
|     1.0|           5.0| 738.1404|       16|1464.5397| 9.235389| 11096.28|1985-09-14 00:00:00|2016-07-06 00:00:00|                 7|
|     1.0|           6.0|899.89594|       30|1828.2426|33.797318|1868.0168|2006-09-07 00:00:00|1973-02-13 00:00:00|                14|
|     1.0|           4.0|607.17773|        4

If you try to get features that exist in multiple featuregroups, the library cannot infer which featuregroup to get them from, so you must specify the featuregroups explicitly as an argument:

In [13]:
val features2 = List("pagerank", "triangle_count", "avg_trx", "count_trx", "max_trx", "min_trx", "balance", "birthdate", "join_date", "number_of_accounts", "pep")
Hops.getFeatures(spark, features2, Hops.getProjectFeaturestore).show(5)

java.lang.IllegalArgumentException: Found the feature with name: pep in more than one of the featuregroups of the featurestore test_featurestore please specify featuregroup that you want to get the feature from. The matched featuregroups are: pep_lookup_1, customer_type_lookup_1, gender_lookup_1, trx_type_lookup_1, country_lookup_1, industry_sector_lookup_1, alert_type_lookup_1, rule_name_lookup_1, web_address_lookup_1, browser_action_lookup_1, demographic_features_1, trx_graph_edge_list_1, trx_graph_summary_features_1, trx_features_1, trx_summary_features_1, hipo_features_1, alert_features_1, police_report_features_1, web_logs_features_1
  at io.hops.util.featurestore.FeaturestoreHelper.findFeature(FeaturestoreHelper.java:177)
  at io.hops.util.featurestore.FeaturestoreHelper.findFeaturegroupsThatContainsFeatures(FeaturestoreHelper.java:134)
  at io.hops.util.Hops.getFeatures(Hops.java:1231)
  ... 52 elided



If we specify the featuregroup to get the feature that exists in multiple featuregroups, the library can infer how to get the features:

In [14]:
val featuregroupsMap1 = Map[String, Integer](
    "trx_graph_summary_features"->1,
    "trx_summary_features"->1,
    "demographic_features" ->1
)
val javaFeaturegroupsMap1 = new java.util.HashMap[String, Integer](featuregroupsMap1)
Hops.getFeatures(spark, features2, Hops.getProjectFeaturestore, javaFeaturegroupsMap1).show(5)

featuregroupsMap1: scala.collection.immutable.Map[String,Integer] = Map(trx_graph_summary_features -> 1, trx_summary_features -> 1, demographic_features -> 1)
javaFeaturegroupsMap1: java.util.HashMap[String,Integer] = {demographic_features=1, trx_summary_features=1, trx_graph_summary_features=1}
+--------+--------------+---------+---------+---------+---------+---------+-------------------+-------------------+------------------+-------------+
|pagerank|triangle_count|  avg_trx|count_trx|  max_trx|  min_trx|  balance|          birthdate|          join_date|number_of_accounts|          pep|
+--------+--------------+---------+---------+---------+---------+---------+-------------------+-------------------+------------------+-------------+
|     1.0|           5.0| 1090.509|       16|2094.9958| 390.4109|12920.496|2003-04-12 00:00:00|1998-09-06 00:00:00|                10| 309237645312|
|     1.0|           5.0| 738.1404|       16|1464.5397| 9.235389| 11096.28|1985-09-14 00:00:00|2016-07-06 0

Example of getting 19 features from 5 different featuregroups:

In [15]:
val features3 = List("pagerank", "triangle_count", "avg_trx", "count_trx", "max_trx", "min_trx",
    "balance", "birthdate", "join_date", "number_of_accounts", "pep", "customer_type", "gender", "web_id",
    "time_spent_seconds", "address", "action", "report_date", "report_id")
val featuregroupsMap2 = Map[String, Integer](
    "trx_graph_summary_features"->1,
    "trx_summary_features"->1,
    "demographic_features" ->1,
    "web_logs_features" -> 1,
    "police_report_features" -> 1
)
val javaFeaturegroupsMap2 = new java.util.HashMap[String, Integer](featuregroupsMap2)
Hops.getFeatures(spark, features3, Hops.getProjectFeaturestore, javaFeaturegroupsMap2).show(5)

features3: List[String] = List(pagerank, triangle_count, avg_trx, count_trx, max_trx, min_trx, balance, birthdate, join_date, number_of_accounts, pep, customer_type, gender, web_id, time_spent_seconds, address, action, report_date, report_id)
featuregroupsMap2: scala.collection.immutable.Map[String,Integer] = Map(police_report_features -> 1, web_logs_features -> 1, trx_graph_summary_features -> 1, trx_summary_features -> 1, demographic_features -> 1)
javaFeaturegroupsMap2: java.util.HashMap[String,Integer] = {demographic_features=1, police_report_features=1, web_logs_features=1, trx_summary_features=1, trx_graph_summary_features=1}
+--------+--------------+---------+---------+---------+---------+---------+-------------------+-------------------+------------------+------------+-------------+------------+------+------------------+-------+------+-------------------+---------+
|pagerank|triangle_count|  avg_trx|count_trx|  max_trx|  min_trx|  balance|          birthdate|          join_date

Sometimes you want a feature that exists in several featuregroups, and you want to include all of those featuregroups in your query. In that case you can specify which featuregroup to take the feature from by prefixing the feature name with the featuregroup name and '_version', e.g. 'demographic_features_1.cust_id'. If you don't, the query fails because the library cannot know which of the specified featuregroups to take the feature from:

In [16]:
val features4 = List("pagerank", "triangle_count", "avg_trx", "count_trx", "max_trx", "min_trx",
    "balance", "birthdate", "join_date", "number_of_accounts", "pep", "customer_type", "gender", "web_id",
    "time_spent_seconds", "address", "action", "report_date", "report_id", "cust_id")
Hops.getFeatures(spark, features4, Hops.getProjectFeaturestore, javaFeaturegroupsMap2).show(5)

org.apache.spark.sql.AnalysisException: Reference 'cust_id' is ambiguous, could be: demographic_features_1.cust_id, police_report_features_1.cust_id, web_logs_features_1.cust_id, trx_summary_features_1.cust_id, trx_graph_summary_features_1.cust_id.; line 1 pos 219
  at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolve(LogicalPlan.scala:213)
  at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolveChildren(LogicalPlan.scala:97)
  at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$$anonfun$37.apply(Analyzer.scala:826)
  at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$$anonfun$37.apply(Analyzer.scala:828)
  at org.apache.spark.sql.catalyst.analysis.package$.withPosition(package.scala:53)
  at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$.org$apache$spark$sql$catalyst$analysis$Analyzer$ResolveReferences$$resolve(Analyzer.scala:825)
  at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$$ano

If we change 'cust_id' to 'featuregroupname_version.cust_id' the library knows where to get the feature from and the query works:

In [17]:
val features5 = List("pagerank", "triangle_count", "avg_trx", "count_trx", "max_trx", "min_trx",
    "balance", "birthdate", "join_date", "number_of_accounts", "pep", "customer_type", "gender", "web_id",
    "time_spent_seconds", "address", "action", "report_date", "report_id", "demographic_features_1.cust_id")
Hops.getFeatures(spark, features5, Hops.getProjectFeaturestore, javaFeaturegroupsMap2).show(5)

features5: List[String] = List(pagerank, triangle_count, avg_trx, count_trx, max_trx, min_trx, balance, birthdate, join_date, number_of_accounts, pep, customer_type, gender, web_id, time_spent_seconds, address, action, report_date, report_id, demographic_features_1.cust_id)
+--------+--------------+---------+---------+---------+---------+---------+-------------------+-------------------+------------------+------------+-------------+------------+------+------------------+-------+------+-------------------+---------+-------+
|pagerank|triangle_count|  avg_trx|count_trx|  max_trx|  min_trx|  balance|          birthdate|          join_date|number_of_accounts|         pep|customer_type|      gender|web_id|time_spent_seconds|address|action|        report_date|report_id|cust_id|
+--------+--------------+---------+---------+---------+---------+---------+-------------------+-------------------+------------------+------------+-------------+------------+------+------------------+-------+------+--
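
The qualified-name convention can be wrapped in a small helper to avoid typos when building feature lists. This is a sketch only: `qualifyFeature` is a hypothetical name and not part of hops-util; it just concatenates the pieces in the `featuregroupname_version.feature` format described above.

```scala
// Hypothetical helper, not part of hops-util: prefixes a feature name with
// "<featuregroup>_<version>." so that it is unambiguous.
def qualifyFeature(featuregroup: String, version: Int, feature: String): String =
  s"${featuregroup}_$version.$feature"

// Unambiguous features stay as they are; only the shared column is qualified.
val baseFeatures = List("pagerank", "triangle_count", "avg_trx")
val withCustId =
  baseFeatures :+ qualifyFeature("demographic_features", 1, "cust_id")
println(withCustId)
```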

## Free Text Query from Feature Store

For complex queries that cannot be inferred by the helper functions, pass the SQL directly to the method `Hops.queryFeaturestore()`. It defaults to the project-specific featurestore, but you can also specify the featurestore explicitly.

Without specifying the featurestore (here by passing `null`), the query runs against the project-specific featurestore:

In [18]:
Hops.queryFeaturestore(
    spark,
    "SELECT * FROM trx_graph_summary_features_1 WHERE triangle_count > 5",
    null
).show(5)

+-------+--------+--------------+
|cust_id|pagerank|triangle_count|
+-------+--------+--------------+
|     29|     1.0|          12.0|
|    474|     1.0|           7.0|
|     65|     1.0|          12.0|
|    222|     1.0|          13.0|
|    270|     1.0|           8.0|
+-------+--------+--------------+
only showing top 5 rows



You can also specify the featurestore to query explicitly:

In [19]:
Hops.queryFeaturestore(
    spark,
    "SELECT * FROM trx_graph_summary_features_1 WHERE triangle_count > 5",
    Hops.getProjectFeaturestore
).show(5)

+-------+--------+--------------+
|cust_id|pagerank|triangle_count|
+-------+--------+--------------+
|     29|     1.0|          12.0|
|    474|     1.0|           7.0|
|     65|     1.0|          12.0|
|    222|     1.0|          13.0|
|    270|     1.0|           8.0|
+-------+--------+--------------+
only showing top 5 rows
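
When the same query is needed with different parameters, the SQL string can be assembled in plain Scala before it is handed to `Hops.queryFeaturestore`. The helpers below are hypothetical and not part of hops-util; they assume the `<featuregroup>_<version>` table naming visible in the queries above.

```scala
// Hypothetical helpers, not part of hops-util.
// Table name convention seen in the queries above: "<featuregroup>_<version>".
def featuregroupTable(featuregroup: String, version: Int): String =
  s"${featuregroup}_$version"

// Builds the query used above for an arbitrary triangle_count threshold.
def triangleCountQuery(minTriangles: Int): String = {
  val table = featuregroupTable("trx_graph_summary_features", 1)
  s"SELECT * FROM $table WHERE triangle_count > $minTriangles"
}

println(triangleCountQuery(5))
```

Because the threshold is an `Int`, interpolating it is harmless; interpolating free-form strings into SQL would need more care.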



## Write to the Feature Store

Let's first create some sample data to insert:

In [20]:
val sampleDataMap = Map("hops_customer_1"-> 3, "hops_customer_2"-> 4)
val sampleDataDf = sampleDataMap.toSeq.toDF("customer_type", "id")

sampleDataMap: scala.collection.immutable.Map[String,Int] = Map(hops_customer_1 -> 3, hops_customer_2 -> 4)
sampleDataDf: org.apache.spark.sql.DataFrame = [customer_type: string, id: int]


In [21]:
sampleDataDf.show()

+---------------+---+
|  customer_type| id|
+---------------+---+
|hops_customer_1|  3|
|hops_customer_2|  4|
+---------------+---+



Let's inspect the contents of the featuregroup 'customer_type_lookup' that we are going to insert the sample data into:

In [22]:
val sparkDf = Hops.getFeaturegroup(spark, "customer_type_lookup", Hops.getProjectFeaturestore, 1)

sparkDf: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [customer_type: string, id: bigint]


In [23]:
sparkDf.show()

+-------------+------------+
|customer_type|          id|
+-------------+------------+
|    corporate|420906795008|
|      private|893353197568|
+-------------+------------+



In [24]:
sparkDf.count()

res19: Long = 2


Now we can insert the sample data and verify the new contents of the featuregroup. By default, the insert mode is "append", the featurestore is the project's featurestore, and the version is 1:

In [25]:
Hops.insertIntoFeaturegroup(
    sampleDataDf, 
    spark, 
    "customer_type_lookup",
    Hops.getProjectFeaturestore,
    1,
    "append"
)

In [26]:
Hops.getFeaturegroup(spark, "customer_type_lookup", Hops.getProjectFeaturestore, 1).show()

+---------------+------------+
|  customer_type|          id|
+---------------+------------+
|hops_customer_1|           3|
|hops_customer_2|           4|
|      corporate|420906795008|
|        private|893353197568|
+---------------+------------+



In [27]:
Hops.getFeaturegroup(spark, "customer_type_lookup", Hops.getProjectFeaturestore, 1).count

res22: Long = 4


The two supported insert modes are "append" and "overwrite":

In [28]:
Hops.insertIntoFeaturegroup(
    sampleDataDf, 
    spark, 
    "customer_type_lookup",
    Hops.getProjectFeaturestore,
    1,
    "overwrite"
)

In [29]:
Hops.getFeaturegroup(spark, "customer_type_lookup", Hops.getProjectFeaturestore, 1).show()

+---------------+---+
|  customer_type| id|
+---------------+---+
|hops_customer_1|  3|
|hops_customer_2|  4|
+---------------+---+



In [30]:
Hops.getFeaturegroup(spark, "customer_type_lookup", Hops.getProjectFeaturestore, 1).count

res25: Long = 2
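
The difference between the two modes can be modeled with plain Scala collections. The sketch below is a simplified illustration, not the library's implementation: `InsertMode`, `Append`, `Overwrite` and `applyInsert` are hypothetical names, and row ordering and schema handling are not modeled.

```scala
// Hypothetical model, not part of hops-util: insert modes as a sealed trait
// plus a function mimicking their effect on a featuregroup's rows.
sealed trait InsertMode { def name: String }
case object Append extends InsertMode { val name = "append" }
case object Overwrite extends InsertMode { val name = "overwrite" }

def applyInsert[A](existing: Seq[A], incoming: Seq[A], mode: InsertMode): Seq[A] =
  mode match {
    case Append    => existing ++ incoming // keep old rows, add new ones
    case Overwrite => incoming             // replace old rows entirely
  }

val existingRows = Seq("corporate" -> 420906795008L, "private" -> 893353197568L)
val newRows = Seq("hops_customer_1" -> 3L, "hops_customer_2" -> 4L)

val afterAppend = applyInsert(existingRows, newRows, Append)
val afterOverwrite = applyInsert(afterAppend, newRows, Overwrite)
println(afterAppend.size)    // 4
println(afterOverwrite.size) // 2
```

The row counts mirror the ones observed above: 4 after the append and 2 after the overwrite. A mode's `name` could be passed as the last argument of `Hops.insertIntoFeaturegroup` to avoid misspelled mode strings.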


## Get Featurestore Metadata

To explore the contents of the featurestore, we recommend the featurestore page in the Hopsworks UI, but you can also get the metadata programmatically from the REST API with the following method:

In [31]:
Hops.getFeaturestoreMetadata(Hops.getProjectFeaturestore)

res26: java.util.List[io.hops.util.featurestore.FeaturegroupDTO] = [io.hops.util.featurestore.FeaturegroupDTO@448bc589, io.hops.util.featurestore.FeaturegroupDTO@780247f0, io.hops.util.featurestore.FeaturegroupDTO@40ec1f32, io.hops.util.featurestore.FeaturegroupDTO@35b297a6, io.hops.util.featurestore.FeaturegroupDTO@2f0fab14, io.hops.util.featurestore.FeaturegroupDTO@45d844c2, io.hops.util.featurestore.FeaturegroupDTO@65edd6c4, io.hops.util.featurestore.FeaturegroupDTO@44c37278, io.hops.util.featurestore.FeaturegroupDTO@164626ff, io.hops.util.featurestore.FeaturegroupDTO@113f9ea1, io.hops.util.featurestore.FeaturegroupDTO@1a6f31b4, io.hops.util.featurestore.FeaturegroupDTO@4560d033, io.hops.util.featurestore.FeaturegroupDTO@62b677be, io.hops.util.featurestore.FeaturegroupDTO@f9137df, io...