---
title: "Online transformation functions"
date: 2021-05-18
type: technical_note
draft: false
---

## Create connection to hsfs

In [1]:
import hsfs
connection = hsfs.connection()
# Get a reference to the feature store. To access a shared feature store, pass its name.
fs = connection.get_feature_store()

Starting Spark application



SparkSession available as 'spark'.
Connected. Call `.close()` to terminate connection gracefully.

# Define online transformation
#### An online transformation function has to be part of a library installed in Hopsworks. Please refer to the [documentation](https://hopsworks.readthedocs.io/en/stable/user_guide/hopsworks/python.html?highlight=install#installing-libraries) for how to install Python libraries in Hopsworks. For this demo we installed the library from [this](https://github.com/davitbzh/hsfs-transformer-template) repository.
#### When defining a transformation function, don't decorate it with Spark `@udf` or `@pandas_udf`, and don't use any Spark dependencies. HSFS applies the decorator itself when the function is used inside a Spark application.
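
As an illustration, a module in such a library could look like the sketch below. The function bodies mirror the source that HSFS prints later in this notebook; the file name and layout are assumptions, not the actual contents of the template repository.

```python
# transformer.py: plain Python only, no Spark imports or decorators.
from datetime import datetime


def plus_one(value):
    """Return the input incremented by one."""
    return value + 1


def date_string_to_timestamp(input_date):
    """Convert a 'YYYYmmddHHMMSS' string to epoch milliseconds.

    The string is parsed as a naive datetime, so the result depends on
    the local time zone of the machine that runs the function.
    """
    date_format = "%Y%m%d%H%M%S"
    return int(datetime.strptime(input_date, date_format).timestamp() * 1000)


if __name__ == "__main__":
    print(plus_one(30))
    print(date_string_to_timestamp("20200101010101"))
```

Keeping the functions free of Spark imports means the same code can run both inside a Spark job and in a plain Python process.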

In [2]:
from hsfs_transformers import transformer
plus_one_float = fs.create_transformation_function(transformation_function=transformer.plus_one, 
                                                   output_type=float, 
                                                   version=1)
plus_one_float.save()

In [3]:
plus_one_int = fs.create_transformation_function(transformation_function=transformer.plus_one, 
                                                 output_type=int, 
                                                 version=2)
plus_one_int.save()

In [4]:
plus_one_double = fs.create_transformation_function(transformation_function=transformer.plus_one, 
                                                    output_type="double", version=3)
plus_one_double.save()

In [5]:
date_string_to_timestamp = fs.create_transformation_function(
    transformation_function=transformer.date_string_to_timestamp,
    output_type="long", version=1)
date_string_to_timestamp.save()

In [6]:
print(plus_one_float.name)
print(plus_one_int.name)
print(date_string_to_timestamp.name)

plus_one
plus_one
date_string_to_timestamp

## Get all online transformations available in the feature store

In [7]:
fs.get_transformation_functions()

[<hsfs.transformation_function.TransformationFunction object at 0x7f0988f6c910>, <hsfs.transformation_function.TransformationFunction object at 0x7f0988f6c390>, <hsfs.transformation_function.TransformationFunction object at 0x7f0988f6c790>, <hsfs.transformation_function.TransformationFunction object at 0x7f0988f6c090>]
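
The default object representations above do not show which functions were returned. A small helper such as the one below can list them by name and version. It is a sketch that only assumes each returned object exposes `.name` and `.version`, as used elsewhere in this notebook; `summarize` is a hypothetical helper and the stand-in objects replace a live connection.

```python
from types import SimpleNamespace


def summarize(functions):
    """Return sorted (name, version) pairs for transformation function objects."""
    return sorted((f.name, f.version) for f in functions)


# Stand-in objects for demonstration; with a live connection you would
# pass the result of fs.get_transformation_functions() instead.
registered = [
    SimpleNamespace(name="plus_one", version=2),
    SimpleNamespace(name="date_string_to_timestamp", version=1),
    SimpleNamespace(name="plus_one", version=1),
    SimpleNamespace(name="plus_one", version=3),
]

for name, version in summarize(registered):
    print(f"{name} (version {version})")
```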

## Get online transformation by name and version

In [8]:
plus_one = fs.get_transformation_function(name="plus_one")
print(plus_one.name)
print(plus_one.version)

<hsfs.transformation_function.TransformationFunction object at 0x7f0988f6ccd0>

In [9]:
plus_one_float = fs.get_transformation_function(name="plus_one", version=1)
print(plus_one_float.name)
print(plus_one_float.version)

plus_one
1

In [10]:
plus_one_int = fs.get_transformation_function(name="plus_one", version=2)
print(plus_one_int.name)
print(plus_one_int.version)

plus_one
2

In [11]:
date_string_to_timestamp = fs.get_transformation_function(name="date_string_to_timestamp", version=1)
print(date_string_to_timestamp.name)
print(date_string_to_timestamp.version)

date_string_to_timestamp
1

# View online transformation source code
##### Because we are using the PySpark kernel, HSFS adds the `udf` decorator

In [12]:
print(plus_one_float.transformer_code)

import numpy as np
import pandas as pd
from datetime import datetime
from pyspark.sql.functions import udf
from pyspark.sql.types import *

@udf(FloatType())
def plus_one(value):
    return value + 1

In [13]:
print(plus_one_int.transformer_code)

import numpy as np
import pandas as pd
from datetime import datetime
from pyspark.sql.functions import udf
from pyspark.sql.types import *

@udf(IntegerType())
def plus_one(value):
    return value + 1

In [14]:
print(date_string_to_timestamp.transformer_code)

import numpy as np
import pandas as pd
from datetime import datetime
from pyspark.sql.functions import udf
from pyspark.sql.types import *

@udf(LongType())
def date_string_to_timestamp(input_date):
    date_format = "%Y%m%d%H%M%S"
    return int(float(datetime.strptime(input_date, date_format).timestamp()) * 1000)
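
To make the relationship between `output_type` and the generated decorator concrete, here is a minimal sketch that prepends a decorator line to plain function source. It is not the HSFS implementation: `decorate_source` and `SPARK_TYPES` are hypothetical, and the mapping covers only the pairs visible in this notebook, plus `"double"`, which is an assumption.

```python
import textwrap

# Hypothetical mapping from output_type to a Spark SQL type name.
SPARK_TYPES = {
    float: "FloatType",
    int: "IntegerType",
    "long": "LongType",
    "double": "DoubleType",  # assumption: not shown in the notebook output
}


def decorate_source(plain_source, output_type):
    """Return plain_source with a Spark @udf decorator line prepended."""
    spark_type = SPARK_TYPES[output_type]
    body = textwrap.dedent(plain_source).strip()
    return f"@udf({spark_type}())\n{body}"


PLUS_ONE_SOURCE = '''
def plus_one(value):
    return value + 1
'''

print(decorate_source(PLUS_ONE_SOURCE, int))
```

The sketch only manipulates text, so it runs without Spark installed; the resulting string would still need the `pyspark` imports shown in the printed source before it could be executed.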

## Delete transformation function

In [None]:
plus_one_double = fs.get_transformation_function(name="plus_one", version=3)
plus_one_double.delete()

# Create training dataset with online transformation
### To use an online transformation function with a training dataset, the dataset must be created from an hsfs `Query` object.

In [15]:
economy_fg = fs.get_feature_group("economy_fg", version=2)
demography_fg = fs.get_feature_group("demography_fg", version=2)

In [16]:
economy_fg.read().show()

+---------+----------+---+--------------+---------+-----+------+--------+----+
|     loan|commission| id|      datetime|   salary|  car|hyears|  hvalue|year|
+---------+----------+---+--------------+---------+-----+------+--------+----+
| 354724.2|       0.0|  1|20200101010101|110499.73|car15|    30|235000.0|2020|
|395015.34|       0.0|  2|20200102010101|140893.77|car20|     2|135000.0|2020|
|122025.08|       0.0|  3|20200103010101|119159.65| car1|    22|145000.0|2020|
| 99629.62|  52593.63|  4|20200104010101|  20000.0| car9|    30|185000.0|2020|
+---------+----------+---+--------------+---------+-----+------+--------+----+

In [17]:
economy_fg.read().printSchema()

root
 |-- loan: float (nullable = true)
 |-- commission: float (nullable = true)
 |-- id: integer (nullable = true)
 |-- datetime: string (nullable = true)
 |-- salary: float (nullable = true)
 |-- car: string (nullable = true)
 |-- hyears: integer (nullable = true)
 |-- hvalue: float (nullable = true)
 |-- year: integer (nullable = true)

## The training dataset needs to be created from an hsfs `Query` object

In [18]:
query = demography_fg.select(['age','elevel','zipcode']).join(economy_fg.select_all())

#### Provide the transformation functions as a dict, where each key is a feature name and each value is the online transformation function to apply to it

In [19]:
td = fs.create_training_dataset(name="economy_td",
                               description="Dataset to train some model",
                               data_format="csv",
                               transformation_functions={"hyears":plus_one_int, 
                                                         "loan":plus_one_float, 
                                                         "datetime": date_string_to_timestamp},
                               statistics_config=None, 
                               version=1)

In [20]:
td._transformation_functions

{'hyears': <hsfs.transformation_function.TransformationFunction object at 0x7f0988f842d0>, 'loan': <hsfs.transformation_function.TransformationFunction object at 0x7f0988f840d0>, 'datetime': <hsfs.transformation_function.TransformationFunction object at 0x7f0988f845d0>}

In [21]:
td._transformation_functions['datetime'].name

'date_string_to_timestamp'

In [22]:
td.transformation_functions

{'hyears': <hsfs.transformation_function.TransformationFunction object at 0x7f0988f842d0>, 'loan': <hsfs.transformation_function.TransformationFunction object at 0x7f0988f840d0>, 'datetime': <hsfs.transformation_function.TransformationFunction object at 0x7f0988f845d0>}

In [None]:
td.save(query)

### Online transformation functions are now attached to the training dataset as metadata and contain information about which feature groups they will be applied to

In [3]:
td = fs.get_training_dataset("economy_td")



In [4]:
td.transformation_functions

{'loan': <hsfs.transformation_function.TransformationFunction object at 0x7f5f21090510>, 'datetime': <hsfs.transformation_function.TransformationFunction object at 0x7f5f21093710>, 'hyears': <hsfs.transformation_function.TransformationFunction object at 0x7f5f21096890>}

In [5]:
td.read().show()

+---+------+--------+---------+----------+---+-------------+---------+-----+------+--------+----+
|age|elevel| zipcode|     loan|commission| id|     datetime|   salary|  car|hyears|  hvalue|year|
+---+------+--------+---------+----------+---+-------------+---------+-----+------+--------+----+
| 49|level2|zipcode4|122026.08|       0.0|  3|1578013261000|119159.65| car1|    23|145000.0|2020|
| 56|level0|zipcode2| 99630.62|  52593.63|  4|1578099661000|  20000.0| car9|    31|185000.0|2020|
| 54|level3|zipcode5| 354725.2|       0.0|  1|1577840461000|110499.73|car15|    31|235000.0|2020|
| 44|level4|zipcode8|395016.34|       0.0|  2|1577926861000|140893.77|car20|     3|135000.0|2020|
+---+------+--------+---------+----------+---+-------------+---------+-----+------+--------+----+
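
The transformed columns can be checked against the original `economy_fg` rows by applying the same per-feature mapping in plain Python. This is only a sketch of the semantics, not how HSFS applies the functions: `apply_transformations` is a hypothetical helper, and the timestamp conversion assumes dates were interpreted as UTC, which is consistent with the values shown above.

```python
from datetime import datetime, timezone


def plus_one(value):
    return value + 1


def date_string_to_timestamp(input_date):
    # Assumption: the date string is interpreted as UTC.
    parsed = datetime.strptime(input_date, "%Y%m%d%H%M%S")
    return int(parsed.replace(tzinfo=timezone.utc).timestamp() * 1000)


def apply_transformations(row, transformation_functions):
    """Return a copy of row with each mapped feature transformed."""
    transformed = dict(row)
    for feature, func in transformation_functions.items():
        transformed[feature] = func(row[feature])
    return transformed


# Row with id 1 from economy_fg, restricted to the transformed features.
row = {"id": 1, "hyears": 30, "loan": 354724.2, "datetime": "20200101010101"}
mapping = {
    "hyears": plus_one,
    "loan": plus_one,
    "datetime": date_string_to_timestamp,
}

result = apply_transformations(row, mapping)
print(result["hyears"])    # 31
print(result["datetime"])  # 1577840461000
```

Features that are not keys in the mapping, such as `id`, pass through unchanged, which matches the untransformed columns in the table.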