## Tags

The feature store lets users attach tags to feature groups and training datasets. Tags are additional metadata attached to your artifacts, so they can be used for enhanced full-text search (by default Hopsworks makes all metadata searchable; users can opt out for particular feature stores if they want to keep them private). Adding tags to a feature group gives users more dynamic metadata that can be used both to store information and to improve artifact discoverability.

A tag is a {<b>key</b>: <b>value</b>} association that provides additional information about the data, such as its geographic origin. This is useful in an organization because it adds context to your data, making it easier to share and discover data and artifacts. Tagging is only available in the enterprise version.

### Tag Schemas
The first step is to define the schemas of the tags that can later be attached. These schemas follow JSON Schema (https://json-schema.org). A schema defines which JSON values are legal for a tag: primitives such as integer, float (number), boolean, and string, as well as JSON objects and arrays. The schemas themselves are also written as JSON.

A schema of a <b>primitive</b> type looks like: 
```
{ "type" : "string" }
```
Allowed primitive types are:
* string
* boolean
* integer
* number (float)
    
A schema of an <b>object</b> type looks like:
```
{
  "type" : "object", 
  "properties" : 
  {
    "field1": { "type" : "string" }, 
    "field2": { "type" : "integer" }
  }
}
```  

Object is the default type for schemas, so you can omit the type if you want to keep the schema short. The properties field defines a dictionary of field names and their types.

A schema of an <b>array</b> of strings looks like:
```
{
  "type" : "array", 
  "items" : { "type" : "string" }
}
```

The array and object types can be nested arbitrarily.
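
To make the nesting concrete, here is a minimal sketch of a nested schema together with a small checker for the subset of JSON Schema described above. The `conforms` helper and the field names `name` and `regions` are hypothetical and are not part of Hopsworks; the checker ignores every JSON Schema keyword other than `type`, `properties`, and `items`, and it assumes, as stated above, that a missing `type` means object.

```python
import json


def conforms(value, schema):
    """Hypothetical helper: check a value against the schema subset above."""
    # "object" is assumed to be the default when "type" is omitted.
    kind = schema.get("type", "object")
    if kind == "string":
        return isinstance(value, str)
    if kind == "boolean":
        return isinstance(value, bool)
    if kind == "integer":
        # bool is a subclass of int in Python, so exclude it explicitly.
        return isinstance(value, int) and not isinstance(value, bool)
    if kind == "number":
        return isinstance(value, (int, float)) and not isinstance(value, bool)
    if kind == "array":
        items = schema.get("items")
        return isinstance(value, list) and (
            items is None or all(conforms(v, items) for v in value)
        )
    if kind == "object":
        if not isinstance(value, dict):
            return False
        props = schema.get("properties", {})
        # Only fields that are present are checked; none is treated as required.
        return all(conforms(value[k], s) for k, s in props.items() if k in value)
    return False


# An array of objects, each of which holds a string and an array of strings.
nested_schema = {
    "type": "array",
    "items": {
        "properties": {
            "name": {"type": "string"},
            "regions": {"type": "array", "items": {"type": "string"}},
        }
    },
}

good = [{"name": "sales", "regions": ["eu", "us"]}]
bad = [{"name": "sales", "regions": ["eu", 7]}]

print(json.dumps(nested_schema))
print(conforms(good, nested_schema), conforms(bad, nested_schema))
```

The second value fails because one element of `regions` is an integer rather than a string. For anything beyond this subset, a full JSON Schema validator is the better choice.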

#### Creating schemas
Schemas are defined at a cluster level, so they are available to all projects. They can only be defined by a user with admin rights.

![Create tag schemas using the UI](images/create_schemas.gif)

#### Attach tags using the UI
Tags can be attached using the feature store UI or programmatically using the API. The rest of this notebook describes the API.

![Attach tags using the UI](images/attach_tags.gif)

### Notebook required schema setup

For this notebook to work properly, the following schemas must be set up:

* Primitive string
    * name: <b>test_string</b> 
    * value: <b>{"type":"string"}</b>
* Array (of strings)
    * name: <b>test_array</b>
    * value: <b>{"type":"array","items":{"type":"string"}}</b>
* JSON object
    * name: <b>test_obj</b>
    * value: <b>{"properties":{"f":{"type":"string"}}}</b>
* Array of objects
    * name: <b>test_obj2</b>
    * value: <b>{"type":"array","items":{"properties":{"f":{"type":"string"}}}}</b>
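
Before entering the schema values listed above in the admin UI, it can help to confirm that each one is well-formed JSON. The following is a minimal sketch using only the standard library; the `required_schemas` and `parsed` names are illustrative, and nothing here talks to Hopsworks.

```python
import json

# The four schemas listed above, keyed by tag name, written exactly as they
# would be typed into the schema value field.
required_schemas = {
    "test_string": '{"type":"string"}',
    "test_array": '{"type":"array","items":{"type":"string"}}',
    "test_obj": '{"properties":{"f":{"type":"string"}}}',
    "test_obj2": '{"type":"array","items":{"properties":{"f":{"type":"string"}}}}',
}

# json.loads raises json.JSONDecodeError on malformed input, so a typo in a
# schema string is caught here rather than in the UI.
parsed = {name: json.loads(text) for name, text in required_schemas.items()}

for name, schema in parsed.items():
    # A schema without "type" is an object schema, as described earlier.
    print(name, "->", schema.get("type", "object"))
```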
    

#### Feature store name
Change the name of the feature store to match the project you are running from. The example was written within the project named <b>demo_fs_meb10000</b>, which is the feature store demo tour.

In [1]:
import hsfs
connection = hsfs.connection()
fs = connection.get_feature_store(name="demo_fs_meb10000_featurestore")

Starting Spark application


ID,YARN Application ID,Kind,State,Spark UI,Driver log
5,application_1617826195233_0007,pyspark,idle,Link,Link


SparkSession available as 'spark'.
Connected. Call `.close()` to terminate connection gracefully.

#### Creating a feature group and a training dataset
The cells that create the feature group and the training dataset might fail if the artifacts already exist from a previous run of this notebook.

In [2]:
fg_name = 'fg1'
td_name = 'td1'

Create the feature group used in this notebook to attach tags to.

In [3]:
fg_data = []
fg_data.append((1, 1, 1))
fg_col_1 = fg_name + '_col1'
fg_col_2 = fg_name + '_col2'

fg_spark_df = spark.createDataFrame(fg_data, ['id', fg_col_1, fg_col_2])
fg_description = "synthetic " + fg_name

fg_write = fs.create_feature_group(name=fg_name, version=1, description=fg_description, primary_key=['id'], time_travel_format=None, statistics_config=False)
fg_write.save(fg_spark_df)

<hsfs.feature_group.FeatureGroup object at 0x7fe95eefec90>

In [4]:
fg_read = fs.get_feature_group(fg_name)



Create the training dataset used in this notebook to attach tags to.

In [5]:
td_query = fg_read.select_all()
td_description = "synthetic " + td_name
td = fs.create_training_dataset(name=td_name, description=td_description, data_format="csv", version=1)
td.save(td_query)

<hsfs.training_dataset.TrainingDataset object at 0x7fe95ee98a10>

In [6]:
td_read = fs.get_training_dataset(td_name, 1)

#### Working with tags on feature groups

##### Attaching tags

Attaching a simple key-value (string) tag to your feature group.

<b>Note</b>: You can attach only one value per tag name, so calling the add operation on the same tag multiple times performs an update.
If you need to attach multiple values to a tag, such as a sequence, consider changing the tag type to an array of the type you just defined.
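
Because adding a tag overwrites the previous value, growing an array-typed tag is a read-modify-write operation. The sketch below illustrates that pattern. `append_tag_value` is a hypothetical helper, and `InMemoryTaggable` is a hypothetical stand-in that only mimics the `add_tag` and `get_tags` calls used in this notebook so that the example runs without a cluster; it is not the hsfs class.

```python
class InMemoryTaggable:
    """Hypothetical stand-in for a feature group; keeps tags in a dict."""

    def __init__(self):
        self._tags = {}

    def add_tag(self, name, value):
        # One value per tag name: a second call replaces the first value.
        self._tags[name] = value

    def get_tags(self):
        return dict(self._tags)


def append_tag_value(entity, name, value):
    """Hypothetical helper: add one element to an array-typed tag."""
    current = entity.get_tags().get(name, [])
    updated = list(current) + [value]
    entity.add_tag(name, updated)  # overwrites the previous array
    return updated


fg = InMemoryTaggable()
fg.add_tag("test_string", "test")
fg.add_tag("test_string", "other")  # update, not append

append_tag_value(fg, "test_array", "test")
append_tag_value(fg, "test_array", "not")
print(fg.get_tags())
```

In the notebook itself, the same helper could be called with `fg_read` and the `test_array` tag, assuming that schema has been created.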

In [7]:
tag1_name="test_string"
tag1_value="test"

In [8]:
fg_read.add_tag(tag1_name, tag1_value)

##### Listing tags
To read a tag value, use the tag key.

In [9]:
fg_read.get_tag(tag1_name)

'test'

Reading all the tags attached to a feature group.

In [10]:
fg_read.get_tags()

{'test_string': 'test'}

##### Deleting tags

In [11]:
fg_read.delete_tag(tag1_name)

The tag is no longer in the list of attached tags, but it can be re-attached later.

In [12]:
fg_read.get_tags()

{}

##### Using tags with more complex values
Attaching a simple JSON object tag.

In [13]:
tag2_name="test_obj"

In [14]:
tag2_value={}
tag2_value['f']='test'
fg_read.add_tag(tag2_name, tag2_value)

In [15]:
fg_read.get_tag(tag2_name)

{'f': 'test'}

Attaching repeated values of the same type using arrays.

In [16]:
tag3_name="test_array"

In [17]:
tag3_value=["test", "not"]
fg_read.add_tag(tag3_name, tag3_value)

In [18]:
fg_read.get_tag(tag3_name)

['test', 'not']

More on arrays of objects.

In [19]:
tag4_name="test_obj2"

In [20]:
tag4_value_1={}
tag4_value_1['f']='test'
tag4_value_2={}
tag4_value_2['f']='not'
tag4_value=[tag4_value_1, tag4_value_2]
fg_read.add_tag(tag4_name, tag4_value)

In [21]:
fg_read.get_tag(tag4_name)

[{'f': 'test'}, {'f': 'not'}]

Get all tags attached to a feature group.

In [22]:
fg_read.get_tags()

{'test_obj': {'f': 'test'}, 'test_array': ['test', 'not'], 'test_obj2': [{'f': 'test'}, {'f': 'not'}]}

#### Working with tags on training datasets
The API calls for attaching, reading and deleting tags are exactly the same on training datasets as they are on feature groups.
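
Since both artifact types expose the same tag calls, helper code can be written once and used with either. The following sketch copies every tag from one artifact to another. `copy_tags` is a hypothetical helper, and `InMemoryTaggable` is a hypothetical stand-in that mimics `add_tag`, `get_tags`, and `delete_tag` so that the example runs without a cluster; it is not the hsfs class.

```python
class InMemoryTaggable:
    """Hypothetical stand-in exposing the tag methods used in this notebook."""

    def __init__(self, tags=None):
        self._tags = dict(tags or {})

    def add_tag(self, name, value):
        self._tags[name] = value

    def get_tags(self):
        return dict(self._tags)

    def delete_tag(self, name):
        del self._tags[name]


def copy_tags(source, target):
    """Hypothetical helper: attach every tag found on source to target."""
    copied = source.get_tags()
    for name, value in copied.items():
        target.add_tag(name, value)
    # Return the copied tag names in a stable order for easy inspection.
    return sorted(copied)


feature_group = InMemoryTaggable(
    {"test_string": "test", "test_array": ["test", "not"]}
)
training_dataset = InMemoryTaggable()

print(copy_tags(feature_group, training_dataset))
print(training_dataset.get_tags())
```

With real artifacts, the call would look like `copy_tags(fg_read, td_read)`, assuming the schemas for all copied tags exist.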

##### Attaching tags

In [23]:
td_read.add_tag(tag1_name, tag1_value)

##### Listing tags

In [24]:
td_read.get_tags()

{'test_string': 'test'}

In [25]:
td_read.get_tag(tag1_name)

'test'

##### Deleting tags

In [26]:
td_read.delete_tag(tag1_name)

In [27]:
td_read.get_tags()

{}

#### Search with Tags
Once tags are attached, feature groups are also searchable by their tags, both keys and values.

![Search by tags using the UI](images/search_by_tags.gif)

#### Cleaning up

If you want to rerun the notebook without any failed cells, you will need to delete the feature group <b>fg1</b> and the training dataset <b>td1</b>.

In [28]:
connection.close()

Connection closed.