## Imports

In [1]:
from hops import featurestore

Starting Spark application


ID,YARN Application ID,Kind,State,Spark UI,Driver log,Current session?
4,application_1544974908167_0026,pyspark,idle,Link,Link,✔


SparkSession available as 'spark'.


## Get Project Featurestore

Each project with the featurestore enabled gets its own Hive database for the featurestore. The database is named 'projectname_featurestore', and the name can be retrieved through the hops-util-py featurestore API.

In [None]:
featurestore.project_featurestore()

## Get all Featurestores Accessible in the Current Project

Feature stores can be shared across projects just like other Hopsworks datasets. You can use this API function to list all the featurestores accessible in the project programmatically.

In [None]:
featurestore.get_project_featurestores()

## Get Individual Feature

When retrieving a single feature from the featurestore, the hops-util-py library infers which featuregroup the feature belongs to by querying the metastore, but you can also explicitly specify which featuregroup and version to query. If several features in the featurestore share the same name, you must specify enough information to uniquely identify the feature (e.g. which featuregroup and which version). If no featurestore is provided, the call defaults to the project's featurestore.

Without specifying featuregroup:

In [None]:
featurestore.get_feature("action").show(5)

With specified featuregroup:

In [None]:
featurestore.get_feature("action", 
                         featurestore=featurestore.project_featurestore(), 
                         featuregroup="web_logs_features", 
                         featuregroup_version = 1).show(5)
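
The lookup rule described above can be illustrated without a cluster. The following is a minimal sketch in plain Python, not the hops-util-py implementation: the `FEATUREGROUPS` dictionary, its column lists, and the helper `find_featuregroup` are hypothetical and only show why an ambiguous feature name needs extra information.

```python
# Hypothetical metadata: (featuregroup name, version) -> feature names.
# The column lists are assumptions made for illustration only.
FEATUREGROUPS = {
    ("web_logs_features", 1): ["cust_id", "web_id", "action"],
    ("demographic_features", 1): ["cust_id", "balance", "pep"],
}


def find_featuregroup(feature, featuregroup=None, version=None):
    """Return the single (featuregroup, version) pair that contains `feature`."""
    matches = [
        (name, ver)
        for (name, ver), features in FEATUREGROUPS.items()
        if feature in features
        and (featuregroup is None or name == featuregroup)
        and (version is None or ver == version)
    ]
    if not matches:
        raise ValueError("feature %r not found" % feature)
    if len(matches) > 1:
        raise ValueError("feature %r is ambiguous: %s" % (feature, matches))
    return matches[0]


# A unique feature name resolves without extra arguments.
print(find_featuregroup("action"))
# A shared feature name resolves only when the featuregroup is given.
print(find_featuregroup("cust_id", featuregroup="web_logs_features", version=1))
```

Calling `find_featuregroup("cust_id")` without a featuregroup raises a `ValueError` in this sketch, which mirrors the requirement to identify the feature uniquely.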

## Get Featuregroup

You can get an entire featuregroup from the API. If no featurestore is provided, the API defaults to the project's featurestore; if no version is provided, it defaults to version 1 of the featuregroup.

In [None]:
featurestore.get_featuregroup("trx_summary_features").show(5)

In [None]:
featurestore.get_featuregroup("trx_summary_features", 
                              featurestore=featurestore.project_featurestore(), 
                              featuregroup_version = 1).show(5)

## Get Set of Features

When retrieving a list of features from the featurestore, the hops-util-py library infers which featuregroups the features belong to by querying the metastore. If the features reside in different featuregroups, the library will also **try** to infer how to join the features together based on common columns. If the JOIN query cannot be inferred, either because multiple features share the same name or because the join is not obvious, the user needs to supply enough information in the API call for the library to query the featurestore. A user who already knows the JOIN query can also run `featurestore.sql(joinQuery)` directly (an example of this is shown further down in this notebook). If no featurestore is provided, the call defaults to the project's featurestore.

Without specifying featuregroups and join key:

In [None]:
featurestore.get_features(["pagerank", "triangle_count", "avg_trx"], 
             featurestore=featurestore.project_featurestore()).show(5)

Without specifying the join key but specifying featuregroups:

In [None]:
featurestore.get_features(["pagerank", "triangle_count", "avg_trx"], 
             featurestore=featurestore.project_featurestore(), 
             featuregroups_version_dict={
                 "trx_graph_summary_features": 1, 
                "trx_summary_features": 1
             }).show(5)

Specifying both featuregroups and join key:

In [None]:
featurestore.get_features(["pagerank", "triangle_count", "avg_trx"], 
             featurestore=featurestore.project_featurestore(), 
             featuregroups_version_dict={
                 "trx_graph_summary_features": 1, 
                "trx_summary_features": 1
             }, 
             join_key="cust_id").show(5)
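
To make the idea of inferring a join from common columns concrete, here is a small sketch in plain Python. It is an assumption about how such inference could work, not the library's actual algorithm: the helpers `table_name`, `infer_join_key`, and `build_join_query` and the column lists are hypothetical. The only detail taken from this notebook is that a featuregroup is stored as a table named `featuregroup_version`.

```python
def table_name(featuregroup, version):
    """Table name for a featuregroup, following the 'name_version' pattern."""
    return "%s_%d" % (featuregroup, version)


def infer_join_key(columns_by_group):
    """Return the one column shared by all featuregroups, or raise."""
    common = set.intersection(*(set(cols) for cols in columns_by_group.values()))
    if len(common) != 1:
        raise ValueError("cannot infer join key, candidates: %s" % sorted(common))
    return next(iter(common))


def build_join_query(features, columns_by_group, versions, join_key=None):
    """Build a SELECT that joins every featuregroup to the first one on the key."""
    key = join_key or infer_join_key(columns_by_group)
    tables = [table_name(group, versions[group]) for group in columns_by_group]
    query = "SELECT %s FROM %s" % (", ".join(features), tables[0])
    for table in tables[1:]:
        query += " JOIN %s ON %s.%s = %s.%s" % (table, tables[0], key, table, key)
    return query


# Hypothetical column lists for two featuregroups.
columns_by_group = {
    "trx_graph_summary_features": ["cust_id", "pagerank", "triangle_count"],
    "trx_summary_features": ["cust_id", "avg_trx", "count_trx"],
}
versions = {"trx_graph_summary_features": 1, "trx_summary_features": 1}

print(build_join_query(["pagerank", "triangle_count", "avg_trx"],
                       columns_by_group, versions))
```

When the featuregroups share more than one column, or none, `infer_join_key` raises in this sketch; that corresponds to the case where `join_key` has to be passed explicitly.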

### Advanced examples of Reading Features

Getting 10 features from two different featuregroups without specifying the featuregroups:

In [None]:
featurestore.get_features(
    ["pagerank", "triangle_count", "avg_trx", "count_trx", "max_trx", "min_trx",
    "balance", "birthdate", "join_date", "number_of_accounts"], 
             featurestore=featurestore.project_featurestore()).show(5)

If you try to get features that exist in multiple featuregroups, the library cannot infer which featuregroup to read them from, so you must specify the featuregroups explicitly as an argument. The call below omits that argument:

In [None]:
featurestore.get_features(
    ["pagerank", "triangle_count", "avg_trx", "count_trx", "max_trx", "min_trx",
    "balance", "birthdate", "join_date", "number_of_accounts", "pep"], 
             featurestore=featurestore.project_featurestore()).show(5)

If we specify the featuregroups for a feature that exists in multiple featuregroups, the library can infer how to get the features:

In [None]:
featurestore.get_features(
    ["pagerank", "triangle_count", "avg_trx", "count_trx", "max_trx", "min_trx",
    "balance", "birthdate", "join_date", "number_of_accounts", "pep"], 
             featurestore=featurestore.project_featurestore(),
    featuregroups_version_dict={
                "trx_graph_summary_features": 1, 
                "trx_summary_features": 1,
                "demographic_features": 1
             }).show(5)

Example of getting 19 features from 5 different featuregroups: 

In [None]:
featurestore.get_features(
    ["pagerank", "triangle_count", "avg_trx", "count_trx", "max_trx", "min_trx",
    "balance", "birthdate", "join_date", "number_of_accounts", "pep", "customer_type", "gender", "web_id",
    "time_spent_seconds", "address", "action", "report_date", "report_id"], 
             featurestore=featurestore.project_featurestore(),
    featuregroups_version_dict={
                "trx_graph_summary_features": 1, 
                "trx_summary_features": 1,
                "demographic_features": 1,
                "web_logs_features": 1,
                "police_report_features": 1
             }).show(5)

Sometimes you want a feature that exists in multiple featuregroups, and you want to include all of these featuregroups in your query. In that case you can specify which featuregroup to get the feature from by prepending the feature name with the featuregroup name + '\_version', e.g. 'demographic\_features\_1.cust\_id'. If you don't specify this, the query will fail, because the library won't know from which of your specified featuregroups to get the feature:

In [None]:
featurestore.get_features(
    ["pagerank", "triangle_count", "avg_trx", "count_trx", "max_trx", "min_trx",
    "balance", "birthdate", "join_date", "number_of_accounts", "pep", "customer_type", "gender", "web_id",
    "time_spent_seconds", "address", "action", "report_date", "report_id", "cust_id"], 
             featurestore=featurestore.project_featurestore(),
    featuregroups_version_dict={
                "trx_graph_summary_features": 1, 
                "trx_summary_features": 1,
                "demographic_features": 1,
                "web_logs_features": 1,
                "police_report_features": 1
             }).show(5)

If we change 'cust\_id' to 'featuregroupname\_version.cust\_id' it works: 

In [None]:
featurestore.get_features(
    ["pagerank", "triangle_count", "avg_trx", "count_trx", "max_trx", "min_trx",
    "balance", "birthdate", "join_date", "number_of_accounts", "pep", "customer_type", "gender", "web_id",
    "time_spent_seconds", "address", "action", "report_date", "report_id", "demographic_features_1.cust_id"], 
             featurestore=featurestore.project_featurestore(),
    featuregroups_version_dict={
                "trx_graph_summary_features": 1, 
                "trx_summary_features": 1,
                "demographic_features": 1,
                "web_logs_features": 1,
                "police_report_features": 1
             }).show(5)
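
The qualified naming scheme `featuregroupname_version.feature` can be taken apart mechanically. The sketch below is a hypothetical helper, `split_qualified_feature`, written in plain Python to show the structure of such a name; it is not part of hops-util-py.

```python
import re


def split_qualified_feature(feature):
    """Split 'featuregroup_version.feature' into (featuregroup, version, feature).

    Unqualified names are returned as (None, None, feature).
    """
    if "." not in feature:
        return (None, None, feature)
    prefix, name = feature.split(".", 1)
    # The prefix must end in '_<digits>', which is read as the version.
    match = re.match(r"^(.+)_(\d+)$", prefix)
    if match is None:
        raise ValueError("expected 'featuregroup_version' before the dot: %r" % prefix)
    return (match.group(1), int(match.group(2)), name)


print(split_qualified_feature("demographic_features_1.cust_id"))
print(split_qualified_feature("pagerank"))
```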

## Free Text Query from Feature Store

For complex queries that cannot be inferred by the helper functions, pass the SQL directly to the method `featurestore.sql()`. It defaults to the project-specific feature store, but you can also specify the feature store explicitly. If you are proficient in SQL, this is the most efficient and preferred way to query the featurestore.

Without specifying the featurestore it will default to the project-specific featurestore:

In [None]:
featurestore.sql("SELECT * FROM trx_graph_summary_features_1 WHERE triangle_count > 5").show(5)

You can also specify the featurestore to query explicitly:

In [None]:
featurestore.sql("SELECT * FROM trx_graph_summary_features_1 WHERE triangle_count > 5", 
                 featurestore=featurestore.project_featurestore()).show(5)
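
A JOIN across featuregroups can be written by hand in the same way. The sketch below only assembles the query string, so it runs without a Spark session. It assumes that both tables contain a `cust_id` column and that `pagerank` and `triangle_count` live in `trx_graph_summary_features_1` while `avg_trx` lives in `trx_summary_features_1`; adjust the names to your own schema.

```python
# Hand-written JOIN over two featuregroup tables (table = featuregroup_version).
# The column-to-table assignment and the cust_id join column are assumptions.
join_query = (
    "SELECT g.pagerank, g.triangle_count, s.avg_trx "
    "FROM trx_graph_summary_features_1 g "
    "JOIN trx_summary_features_1 s ON g.cust_id = s.cust_id"
)

# In the notebook session the string would be passed on like this:
# featurestore.sql(join_query).show(5)
print(join_query)
```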

## Insert Into the Feature Store

Let's first get some sample data to insert:

In [None]:
from pyspark.sql.types import StructType, StructField, StringType, LongType

# Schema of the sample rows: a numeric id and the categorical customer type
schema = StructType([StructField("id", LongType(), True),
                     StructField("customer_type", StringType(), True)])
# `spark` is the SparkSession made available by the notebook kernel
sampleDf = spark.createDataFrame([(3, "hops_customer_1"), (4, "hops_customer_2")], schema)

In [None]:
sampleDf.show()

Let's inspect the contents of the featuregroup 'customer_type_lookup' that we are going to insert the sample data into:

In [None]:
sparkDf = featurestore.get_featuregroup("customer_type_lookup")

In [None]:
sparkDf.show()

In [None]:
sparkDf.count()

Now we can insert the sample data and verify the new contents of the featuregroup. By default the insert mode is "append", the featurestore is the project's featurestore, and the version is 1:

In [None]:
featurestore.insert_into_featuregroup(sampleDf, "customer_type_lookup")

In [None]:
featurestore.get_featuregroup("customer_type_lookup").show()

In [None]:
featurestore.get_featuregroup("customer_type_lookup").count()

You can also explicitly specify featurestore, featuregroup version, and the insert mode:

In [None]:
featurestore.insert_into_featuregroup(sampleDf, 
                         "customer_type_lookup", 
                         featurestore=featurestore.project_featurestore(), 
                         featuregroup_version=1, 
                         mode="append")

In [None]:
featurestore.get_featuregroup("customer_type_lookup").show()

The two supported insert modes are "append" and "overwrite". The next call uses "overwrite":

In [None]:
featurestore.insert_into_featuregroup(sampleDf, 
                         "customer_type_lookup", 
                         featurestore=featurestore.project_featurestore(), 
                         featuregroup_version=1, 
                         mode="overwrite")

In [None]:
featurestore.get_featuregroup("customer_type_lookup").show()
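
The difference between the two insert modes can be shown on plain Python lists. This is only a conceptual sketch: the helper `insert_rows` and the pre-existing rows are hypothetical, and the real insert operates on the Hive table behind the featuregroup.

```python
def insert_rows(existing_rows, new_rows, mode="append"):
    """Return the table contents after inserting `new_rows` with the given mode."""
    if mode == "append":
        # Keep what is already there and add the new rows at the end.
        return list(existing_rows) + list(new_rows)
    if mode == "overwrite":
        # Discard the old contents and keep only the new rows.
        return list(new_rows)
    raise ValueError("unsupported insert mode: %r" % mode)


# Hypothetical rows already stored in the featuregroup.
existing = [(1, "type_a"), (2, "type_b")]
# The sample rows used in this notebook.
new = [(3, "hops_customer_1"), (4, "hops_customer_2")]

print(insert_rows(existing, new, mode="append"))
print(insert_rows(existing, new, mode="overwrite"))
```

Appending the same rows twice duplicates them in this sketch, so it is worth checking the row count after repeated inserts.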

## Compute Featuregroup Statistics

Statistics about a featuregroup can be useful in the stage of feature engineering and when deciding which features to use for training. 

To compute statistics about an existing, non-empty featuregroup, you can use the API call `update_featuregroup_stats`. By default it computes all statistics (descriptive, feature correlation, histograms, and cluster analysis), uses the project's featurestore, uses version 1 of the featuregroup, and uses all columns for computing statistics:

In [None]:
featurestore.update_featuregroup_stats("trx_summary_features")

You can also explicitly specify featuregroup details and which statistics to compute:

In [None]:
featurestore.update_featuregroup_stats(
    "trx_summary_features", 
    featuregroup_version=1, 
    featurestore=featurestore.project_featurestore(), 
    descriptive_statistics=True,
    feature_correlation=True, 
    feature_histograms=True,
    cluster_analysis=True,
    stat_columns=None)

If you only want to compute statistics for a certain set of columns, for example to exclude surrogate key columns, you can use the optional argument `stat_columns` to specify which columns to include:

In [None]:
featurestore.update_featuregroup_stats(
    "trx_summary_features", 
    featuregroup_version=1, 
    featurestore=featurestore.project_featurestore(), 
    descriptive_statistics=True,
    feature_correlation=True, 
    feature_histograms=True,
    cluster_analysis=True,
    stat_columns=['avg_trx', 'count_trx', 'max_trx', 'min_trx'])
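
The effect of restricting statistics to selected columns can be sketched with the standard library. The helper `descriptive_statistics` and the sample rows below are hypothetical and compute only a few summary values; they are not what the library computes internally.

```python
import statistics


def descriptive_statistics(rows, stat_columns=None):
    """Compute count, mean, min and max per column, optionally for a subset."""
    # With stat_columns=None every column is included, as with the default above.
    columns = stat_columns if stat_columns is not None else list(rows[0].keys())
    stats = {}
    for col in columns:
        values = [row[col] for row in rows]
        stats[col] = {
            "count": len(values),
            "mean": statistics.mean(values),
            "min": min(values),
            "max": max(values),
        }
    return stats


# Hypothetical rows; cust_id plays the role of a surrogate key.
rows = [
    {"cust_id": 1, "avg_trx": 10.0, "count_trx": 2},
    {"cust_id": 2, "avg_trx": 30.0, "count_trx": 4},
]

print(descriptive_statistics(rows, stat_columns=["avg_trx", "count_trx"]))
```

Leaving out `cust_id` keeps the key column, whose mean or histogram carries no meaning, out of the result.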

## Create a Featuregroup From a Spark Dataframe

In most cases it is recommended that featuregroups are created in the UI on Hopsworks and that care is taken in documenting the featuregroup. However, sometimes it is practical to create a featuregroup directly from a Spark dataframe and fill in the metadata about the featuregroup later in the UI. This can be done through the `create_featuregroup` API function.

Let's create a new featuregroup with the same contents as the featuregroup `trx_summary_features`, except that the column `count_trx` is dropped:

In [None]:
trx_summary_df = featurestore.get_featuregroup("trx_summary_features")
trx_summary_df1 = trx_summary_df.drop("count_trx")

In [None]:
trx_summary_df1.show(5)

Let's now create a new featuregroup using the transformed dataframe:

In [None]:
featurestore.create_featuregroup(
    trx_summary_df1,
    "trx_summary_features_2",
    description="trx_summary_features without the column count_trx"
)

By default the new featuregroup will be created in the project's featurestore and the statistics for the new featuregroup will be computed based on the provided spark dataframe. You can configure this behaviour by modifying the default arguments and filling in extra metadata:

In [None]:
featurestore.create_featuregroup(
    trx_summary_df1,
    "trx_summary_features_2_2",
    description="trx_summary_features without the column count_trx",
    featurestore=featurestore.project_featurestore(),
    featuregroup_version=1,
    job_id=None,
    dependencies=[],
    descriptive_statistics=False,
    feature_correlation=False,
    feature_histograms=False,
    cluster_analysis=False,
    stat_columns=None
)

## Create Training Datasets from a Set of Features

After you have found the features you need in the featurestore, you can materialize them into a training dataset so that you can train a machine learning model on them. Just as for featuregroups, it is useful to version and document training datasets. For this reason HopsML supports **managed training datasets**, which enable you to easily version, document, and automate the materialization of training datasets.

Metadata for a training dataset can be created from the Hopsworks UI or directly from the API with the function `create_training_dataset`. The training datasets in a project are stored in a top-level dataset called `Training_Datasets` (i.e. `hdfs:///Projects/<ProjectName>/Training_Datasets`).

Once a training dataset has been created, you can find it in the featurestore UI in Hopsworks under the tab `Training datasets`; from there you can also edit the metadata if necessary. After a training dataset has been created with the necessary metadata, you can save the actual data in the training dataset by using the API function `insert_into_training_dataset`.

Let's create a dataset called `AML_dataset` by using a set of relevant features from the featurestore.

First we select the features (and/or labels) that we want:

In [None]:
dataset_df = featurestore.get_features(
    ["pagerank", "triangle_count", "avg_trx", "count_trx", "max_trx", "min_trx",
    "balance", "number_of_accounts", "pep"], 
             featurestore=featurestore.project_featurestore(),
    featuregroups_version_dict={
                "trx_graph_summary_features": 1, 
                "trx_summary_features": 1,
                "demographic_features": 1
             })

In [None]:
dataset_df.show(5)

Now we can create a training dataset from the dataframe with some extended metadata such as schema (automatically inferred). By default, a newly created training dataset is in "tfrecords" format and statistics are computed for all features. After the dataset has been created, you can view and/or update its metadata from the Hopsworks featurestore UI.

In [None]:
featurestore.create_training_dataset(dataset_df, "AML_dataset")

You can override the default configuration if necessary:

In [None]:
featurestore.create_training_dataset(
    dataset_df, "TestDataset",
    description="",
    featurestore=featurestore.project_featurestore(),
    data_format="csv",
    training_dataset_version=1,
    job_id=None,
    dependencies=[],
    descriptive_statistics=False,
    feature_correlation=False,
    feature_histograms=False,
    cluster_analysis=False,
    stat_columns=None)

## Inserting Into an Existing Training Dataset

Once a dataset has been created, its metadata is browsable in the featurestore registry in the Hopsworks UI. If you don't want to create a new training dataset but just overwrite or insert new data into an existing training dataset, you can use the API function `insert_into_training_dataset`:

In [None]:
featurestore.insert_into_training_dataset(dataset_df, "TestDataset")

By default, `insert_into_training_dataset` uses the project's featurestore and version 1 of the training dataset, and updates the training dataset statistics. This configuration can be overridden:

In [None]:
featurestore.insert_into_training_dataset(
    dataset_df,
    "TestDataset",
    featurestore=featurestore.project_featurestore(),
    training_dataset_version=1,
    descriptive_statistics=False,
    feature_correlation=False,
    feature_histograms=False,
    cluster_analysis=False,
    stat_columns=None
)

## Get Training Dataset Path

After a **managed dataset** has been created, it is easy to share it and re-use it for training various models. For example, if the dataset has been materialized in tfrecords format, you can call the method `get_training_dataset_path(training_dataset)` to get the HDFS path and read it directly in your TensorFlow code.

In [None]:
featurestore.get_training_dataset_path("AML_dataset")

By default the library will look for the training dataset in the project's featurestore and use version 1, but this can be overridden if required:

In [None]:
featurestore.get_training_dataset_path(
    "AML_dataset", 
    featurestore=featurestore.project_featurestore(),
    training_dataset_version=1
)

## Get Featurestore Metadata

To explore the contents of the featurestore we recommend using the featurestore page in the Hopsworks UI, but you can also get the metadata programmatically from the REST API.

### List all Feature Stores Accessible In the Project

In [2]:
featurestore.get_project_featurestores()

['fs_demo_featurestore']

### List all Feature Groups in a Feature Store

In [3]:
featurestore.get_featuregroups()

['customer_type_lookup', 'pep_lookup', 'gender_lookup', 'trx_type_lookup', 'country_lookup', 'alert_type_lookup', 'industry_sector_lookup', 'rule_name_lookup', 'web_address_lookup', 'browser_action_lookup', 'demographic_features', 'hipo_features', 'trx_graph_summary_features', 'trx_features', 'trx_summary_features', 'trx_graph_edge_list', 'alert_features', 'web_logs_features', 'police_report_features']

By default `get_featuregroups()` will use the project's feature store, but this can also be specified with the optional argument `featurestore`:

In [4]:
featurestore.get_featuregroups(featurestore=featurestore.project_featurestore())

['customer_type_lookup', 'pep_lookup', 'gender_lookup', 'trx_type_lookup', 'country_lookup', 'alert_type_lookup', 'industry_sector_lookup', 'rule_name_lookup', 'web_address_lookup', 'browser_action_lookup', 'demographic_features', 'hipo_features', 'trx_graph_summary_features', 'trx_features', 'trx_summary_features', 'trx_graph_edge_list', 'alert_features', 'web_logs_features', 'police_report_features']

### List all Training Datasets in a Feature Store

In [5]:
featurestore.get_training_datasets()

['AML_dataset', 'TestDataset']

By default `get_training_datasets()` will use the project's feature store, but this can also be specified with the optional argument `featurestore`:

In [8]:
featurestore.get_training_datasets(featurestore=featurestore.project_featurestore())

['AML_dataset', 'TestDataset']

### Get All Metadata (Features, Feature groups, Training Datasets) for a Feature Store

In [9]:
featurestore.get_featurestore_metadata()

{'featuregroups': [{'clusterAnalysis': None, 'created': '2018-12-17T10:18:46Z', 'creator': 'admin@kth.se', 'dependencies': [{'path': '/Projects/fs_demo/sample_data/kyc.csv', 'modification': '2018-12-17T07:11:12.234Z', 'inodeId': 101078, 'dir': False}], 'description': 'lookup table for id to customer type, used when converting from numeric to categrorical representation and vice verse', 'descriptiveStatistics': None, 'featureCorrelationMatrix': None, 'features': [{'name': 'customer_type', 'type': 'string', 'description': 'The categorical customer_type', 'primary': False}, {'name': 'id', 'type': 'bigint', 'description': 'The numeric id of the customer_type', 'primary': True}], 'featuresHistogram': None, 'featurestoreId': 1, 'featurestoreName': 'fs_demo_featurestore', 'id': 45, 'inodeId': 103829, 'jobId': 1, 'jobName': 'customer_type_lookup_job', 'jobStatus': 'Succeeded', 'lastComputed': '2018-12-17T07:25:27Z', 'name': 'customer_type_lookup', 'version': 1, 'hdfsStorePaths': ['hdfs://10.0.

By default `get_featurestore_metadata` will use the project's feature store, but this can also be specified with the optional argument `featurestore`:

In [10]:
featurestore.get_featurestore_metadata(featurestore=featurestore.project_featurestore())

{'featuregroups': [{'clusterAnalysis': None, 'created': '2018-12-17T10:18:46Z', 'creator': 'admin@kth.se', 'dependencies': [{'path': '/Projects/fs_demo/sample_data/kyc.csv', 'modification': '2018-12-17T07:11:12.234Z', 'inodeId': 101078, 'dir': False}], 'description': 'lookup table for id to customer type, used when converting from numeric to categrorical representation and vice verse', 'descriptiveStatistics': None, 'featureCorrelationMatrix': None, 'features': [{'name': 'customer_type', 'type': 'string', 'description': 'The categorical customer_type', 'primary': False}, {'name': 'id', 'type': 'bigint', 'description': 'The numeric id of the customer_type', 'primary': True}], 'featuresHistogram': None, 'featurestoreId': 1, 'featurestoreName': 'fs_demo_featurestore', 'id': 45, 'inodeId': 103829, 'jobId': 1, 'jobName': 'customer_type_lookup_job', 'jobStatus': 'Succeeded', 'lastComputed': '2018-12-17T07:25:27Z', 'name': 'customer_type_lookup', 'version': 1, 'hdfsStorePaths': ['hdfs://10.0.
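
The returned metadata is a nested dictionary, so it can be explored with ordinary Python. The sketch below works on a hand-trimmed sample that keeps only a few of the keys visible in the output above (`featuregroups`, `name`, `version`, `features`, `primary`); the helper `primary_keys` is hypothetical, and the full structure may contain more fields than shown here.

```python
# Trimmed sample shaped like the output above; most keys are omitted.
metadata = {
    "featuregroups": [
        {
            "name": "customer_type_lookup",
            "version": 1,
            "features": [
                {"name": "customer_type", "type": "string", "primary": False},
                {"name": "id", "type": "bigint", "primary": True},
            ],
        }
    ]
}


def primary_keys(metadata):
    """Map each 'featuregroup_version' to the names of its primary-key features."""
    return {
        "%s_%d" % (fg["name"], fg["version"]): [
            feature["name"] for feature in fg["features"] if feature["primary"]
        ]
        for fg in metadata["featuregroups"]
    }


print(primary_keys(metadata))
```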