# Data validation 

We can define validation rules so that bad data is not inserted into the feature store.

First, let's just connect and get a reference to the feature store.

In [5]:
import hsfs
from hsfs.rule import Rule

# Replace with appropriate values
HOST = "791bb4a0-bb1c-11ec-8721-7bd8cdac0b54.cloud.hopsworks.ai"
PROJECT = "fraud_tutorial"
API_KEY_FILE = "api-key.txt"

# If you are trying this in a Hopsworks Jupyter notebook, you can just do
# conn = hsfs.connection() here

conn = hsfs.connection(
    host=HOST,
    project=PROJECT,
    hostname_verification=False,
    secrets_store="local",
    api_key_file=API_KEY_FILE,
    engine='hive'
)

fs = conn.get_feature_store()  

Connected. Call `.close()` to terminate connection gracefully.


## Create a rule 

As a simple example, let's create a rule that enforces that transaction amounts should be nonnegative.

You can find a list of the possible rules in the [documentation](https://docs.hopsworks.ai/feature-store-api/2.5.4/generated/feature_validation/). In this case it seems reasonable to use the rule named `IS_POSITIVE`.

Besides the rule name, the `Rule` definition takes several parameters. The `level` parameter sets the severity level: "WARNING" or "ERROR". With "ERROR", any attempt to update the feature store with data that violates the rule throws an error. With "WARNING", the outcome of an attempt to insert such data depends on the `validation_type` setting chosen when creating the feature group (see below). You also need to set either the `min` or the `max` parameter; in this case it does not matter which one you use, but it would matter if you required the value to lie between defined limits.

In [7]:
nonneg_rule = fs.create_expectation('nonneg_rule',
                          description='Check that transaction amount is not negative',
                          features=['amount'], 
                          rules=[Rule(name="IS_POSITIVE", level="ERROR", min=0)])
                                        
nonneg_rule.save()

#### Attaching an expectation to a feature group

TODO: Check which feature group we should attach it to

In [8]:
# If you already created the expectation before and did not run the cell above, use
# nonneg_rule = fs.get_expectation('nonneg_rule')

trans_fg = fs.get_feature_group('transactions', version=2)
trans_fg.attach_expectation(nonneg_rule)

In [12]:
trans_fg.avro_schema

'{"type":"record","name":"transactions_2","namespace":"fraud_tutorial_featurestore.db","fields":[{"name":"tid","type":["null","string"]},{"name":"datetime","type":["null",{"type":"long","logicalType":"timestamp-micros"}]},{"name":"cc_num","type":["null","long"]},{"name":"category","type":["null","string"]},{"name":"amount","type":["null","double"]},{"name":"latitude","type":["null","double"]},{"name":"longitude","type":["null","double"]},{"name":"city","type":["null","string"]},{"name":"country","type":["null","string"]},{"name":"fraud_label","type":["null","long"]}]}'
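The schema comes back as a JSON string, so it can be read with the standard library to confirm which type the `amount` feature has before relying on a numeric rule. The sketch below is only an illustration: it pastes the schema output from above into a literal so that it runs on its own, and the helper name `describe_type` is hypothetical, not part of `hsfs`.

```python
import json

# Copy of the schema string printed above; in the notebook you could use
# trans_fg.avro_schema directly instead of this literal.
schema_json = (
    '{"type":"record","name":"transactions_2","namespace":"fraud_tutorial_featurestore.db","fields":['
    '{"name":"tid","type":["null","string"]},'
    '{"name":"datetime","type":["null",{"type":"long","logicalType":"timestamp-micros"}]},'
    '{"name":"cc_num","type":["null","long"]},'
    '{"name":"category","type":["null","string"]},'
    '{"name":"amount","type":["null","double"]},'
    '{"name":"latitude","type":["null","double"]},'
    '{"name":"longitude","type":["null","double"]},'
    '{"name":"city","type":["null","string"]},'
    '{"name":"country","type":["null","string"]},'
    '{"name":"fraud_label","type":["null","long"]}]}'
)


def describe_type(avro_type):
    """Hypothetical helper: reduce an Avro type entry to a short label."""
    # Nullable fields are unions such as ["null", "double"]; drop the "null" branch.
    if isinstance(avro_type, list):
        avro_type = [t for t in avro_type if t != "null"][0]
    # Complex types are dicts; prefer the logical type when one is given.
    if isinstance(avro_type, dict):
        return avro_type.get("logicalType", avro_type["type"])
    return avro_type


schema = json.loads(schema_json)
field_types = {f["name"]: describe_type(f["type"]) for f in schema["fields"]}

for name, type_label in field_types.items():
    print(f"{name}: {type_label}")
```

Here `amount` is a nullable double, which is the kind of column a numeric rule such as `IS_POSITIVE` is meant for.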

#### What happens if we try to insert faulty data?

Let's construct a very small data frame where one of the rows has -5.0 as the amount. This is negative and the datapoint should therefore be rejected.

In [9]:
import pandas as pd

# Create a fake transaction
fake_trans = pd.DataFrame.from_dict({'tid': ['1234567', '222121'],
                                  'datetime': [pd.to_datetime('2021-01-01'), pd.to_datetime('2021-01-02')],
                                  'cc_num': [1234567, 678943],
                                  'category': ['Unknown', 'Groceries'],
                                  'amount': [-5.0, 18.0], 
                                  'latitude': [0.00, 1.0],
                                  'longitude': [0.0, 2.0],
                                  'city': ['Austin', 'NYC'],
                                  'country': ['US', 'Russia'], 'fraud_label': [1, 0]}, 
                                 orient='columns')
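Before launching an ingestion job, it can be handy to check locally which rows would break the nonnegativity requirement. The sketch below uses only pandas and a reduced copy of the fake data; the helper `negative_amount_rows` is a hypothetical name, and this is not how the feature store performs its validation, just a quick sanity check under the assumption that "faulty" means an amount below zero.

```python
import pandas as pd


def negative_amount_rows(df: pd.DataFrame, column: str = "amount") -> pd.DataFrame:
    """Hypothetical helper: return the rows whose value in `column` is below zero."""
    return df[df[column] < 0]


# Reduced copy of the fake transactions: only the columns needed for the check.
sample = pd.DataFrame({"tid": ["1234567", "222121"], "amount": [-5.0, 18.0]})

violations = negative_amount_rows(sample)
print(violations)
```

If `violations` is non-empty, those are the rows you would expect the expectation to reject.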

Now we try to ingest the faulty data. The command below should trigger an ingestion job and print a link where you can follow the job's progress. The job should fail with an error message such as `FeatureStoreException: The Hopsworks Job failed, use the Hopsworks UI to access the job logs`. Follow the job progress link and look at the stdout log. It should show an error like:

```
HTTP code: 417, HTTP reason: Expectation Failed, error code: 270149, error msg: Feature group validation checks did not pass, will not persist the data., user msg: Results: [ExpectationResult{status=Failure, results=[ValidationResult{status=Failure, message='Value: 0.5 does not meet the constraint requirement! IS_POSITIVE', value='0.5', features='[amount]', rule=Rule{name=IS_POSITIVE, level=ERROR, min=0.0, max=0.0, pattern='null', acceptedType=null, feature=null, legalValues=[]}}], expectation=Expectation{name='nonneg_rule', features=[amount], rules=[Rule{name=IS_POSITIVE, level=ERROR, min=0.0, max=0.0, pattern='null', acceptedType=null, feature=null, legalValues=[]}]}}]
```

This lets you know that an expectation was violated.

In [10]:
trans_fg.insert(fake_trans)

Configuring ingestion job...
Uploading Pandas dataframe...
Launching ingestion job...
Ingestion Job started successfully, you can follow the progress at https://791bb4a0-bb1c-11ec-8721-7bd8cdac0b54.cloud.hopsworks.ai/p/120/jobs/named/transactions_2_insert_fg_19042022121143/executions
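Once the job has failed, the error line in the stdout log is long and hard to scan. As a rough sketch, the snippet below pulls the failing rule, its level, and the affected feature out of such a line with a regular expression. It assumes the log keeps the exact layout of the example message quoted earlier (the string here is a shortened copy of it); the pattern is an illustration and not an official log format.

```python
import re

# Shortened copy of the error line quoted earlier in this section.
log_line = (
    "HTTP code: 417, HTTP reason: Expectation Failed, error code: 270149, "
    "error msg: Feature group validation checks did not pass, will not persist the data., "
    "user msg: Results: [ExpectationResult{status=Failure, results=[ValidationResult{status=Failure, "
    "message='Value: 0.5 does not meet the constraint requirement! IS_POSITIVE', value='0.5', "
    "features='[amount]', rule=Rule{name=IS_POSITIVE, level=ERROR, min=0.0, max=0.0, "
    "pattern='null', acceptedType=null, feature=null, legalValues=[]}}]"
)

# Assumed layout: one ValidationResult{...} entry per checked rule.
pattern = re.compile(
    r"ValidationResult\{status=(?P<status>\w+), "
    r"message='(?P<message>[^']*)', value='(?P<value>[^']*)', "
    r"features='\[(?P<features>[^\]]*)\]', "
    r"rule=Rule\{name=(?P<rule>\w+), level=(?P<level>\w+)"
)

failures = [m.groupdict() for m in pattern.finditer(log_line)]
for failure in failures:
    print(f"{failure['rule']} ({failure['level']}) on {failure['features']}: {failure['status']}")
```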
