In [2]:
%pip install feast[ge]

Note: you may need to restart the kernel to use updated packages.


In [3]:
import pyarrow.parquet
import pandas as pd

from feast import FeatureView, Entity, FeatureStore, Field, BatchFeatureView
from feast.types import Float64, Int64
from feast.value_type import ValueType
from feast.data_format import ParquetFormat
from feast.on_demand_feature_view import on_demand_feature_view
from feast.infra.offline_stores.file_source import FileSource
from feast.infra.offline_stores.file import SavedDatasetFileStorage
from datetime import timedelta

import numpy as np

from feast.dqm.profilers.ge_profiler import ge_profiler

from great_expectations.core.expectation_suite import ExpectationSuite
from great_expectations.dataset import PandasDataset

from feast.dqm.errors import ValidationFailed

# Data Validation with Feast Saved Datasets & Great Expectations

This notebook shows how to validate data with Feast's Saved Dataset feature, using Great Expectations as the validation engine.

## The Data

### trip_stats.parquet

Driver statistics for a taxi service. Each row holds one driver's statistics for one day on which they made at least one trip. The columns are: `'taxi_id', 'day', 'total_miles_travelled', 'total_trip_seconds', 'total_earned', 'trip_count'`

### trip_stats_featurized.parquet

This is my attempt at featurizing the trip stats data into something a logistic regression model could use. I have no data science background, so the features were chosen arbitrarily and the code that produces them (in demo.ipynb) is far from ideal. The columns are: `'taxi_id', 'day', 'trip_count_1_to_10', 'trip_count_11_to_25', 'trip_count_26_to_50', 'trip_count_51_to_100', 'trip_count_101_or_higher', 'total_miles_0_to_10', 'total_miles_10_to_25', 'total_miles_25_to_50', 'total_miles_50_to_100', 'total_miles_100_to_250', 'total_miles_250_to_500', 'total_miles_500_to_1000', 'total_miles_1000_or_higher', 'seconds_0_3600', 'seconds_3600_to_7200', 'seconds_7200_to_14400', 'seconds_14400_to_28800', 'seconds_28800_to_43200', 'seconds_43200_or_higher', 'earned_0_to_50', 'earned_50_to_100', 'earned_100_to_250', 'earned_250_to_500', 'earned_500_to_1000', 'earned_1000_or_higher'`

## The Goal

The goal is to show how Feast & Great Expectations can be used to:

1. Validate the structure and content of pre-featurization data. This will break the assumption that Feast stores only feature data, but it will show the power of Great Expectations.
2. Track the drift of distributions of features over time.



# Part 1: Validating the schema of pre-featurization data

I am first going to initialize a Feast feature store using the `feature_store.yaml` config file.

Then I am going to load the `trip_stats.parquet` file and register it as a `FileSource` for Feast.

`taxi_id` is the identifier for different drivers, so I will register an Entity for the `taxi_id`s.

I'll also make a feature view that corresponds with the columns of this table. This isn't featurized data yet, but it is needed in the Feast context. Great Expectations (when configured outside of Feast) does not need this, but I will do so for ease of demo.

I then apply these to the store object, saving these objects' metadata in the Feast Registry.

In [17]:
store = FeatureStore(repo_path=".", fs_yaml_file="feature_store.yaml")

raw_data_source = FileSource(
    path="trips_stats.parquet",
    name="raw_data",
    file_format=ParquetFormat(),
    description="Raw trip stat data - not featurized.",
    owner="Malcolm Keyes",
    timestamp_field="day"
)

taxi_entity = Entity(
    name='taxi_id',
    join_keys=['taxi_id'],
    value_type=ValueType.STRING,
    description="The ID unique to every driver.",
    owner="Malcolm Keyes"
)

raw_data_fv = FeatureView(
    name="raw_data_fv",
    source= raw_data_source,
    entities=[taxi_entity],
    description="The columns of the raw data.",
    owner="Malcolm Keyes"
)

store.apply([raw_data_source, taxi_entity, raw_data_fv])

In [18]:
store.get_feature_view("raw_data_fv")

<FeatureView(name = raw_data_fv, entities = ['taxi_id'], ttl = 0:00:00, stream_source = None, batch_source = {
  "type": "BATCH_FILE",
  "timestampField": "day",
  "fileOptions": {
    "fileFormat": {
      "parquetFormat": {}
    },
    "uri": "trips_stats.parquet"
  },
  "name": "raw_data",
  "description": "Raw trip stat data - not featurized.",
  "owner": "Malcolm Keyes"
}, entity_columns = [taxi_id-String], features = [total_miles_travelled-Float64, total_trip_seconds-Int64, total_earned-Float64, trip_count-Int64], description = The columns of the raw data., tags = {}, owner = Malcolm Keyes, projection = FeatureViewProjection(name='raw_data_fv', name_alias=None, desired_features=[], features=[total_miles_travelled-Float64, total_trip_seconds-Int64, total_earned-Float64, trip_count-Int64], join_key_map={}), created_timestamp = 2023-08-11 19:39:45.382344, last_updated_timestamp = 2023-08-11 19:39:45.382344, online = True, materialization_intervals = [])>

I have registered the raw data source in the Feast Registry, along with the `taxi_id` entity and the `raw_data_fv` feature view.

The data in the `trip_stats.parquet` file spans from January 1st, 2019 to December 31st, 2020. Before we do our data cleaning and featurization, let's create a suite of Expectations to describe the expected schema of the data.

This will be a little hacky, but we're going to set aside a chunk of the data, from Feb 1, 2019 to Mar 1, 2019, and save it as a dataset. We will then run a profiler on it to create a profile, and save the dataset and profile together as a reference dataset. Once that is done, we can use the reference dataset to validate a `RetrievalJob` (the result of a `get_historical_features()` call) on some other chunk of the data, such as the month of July 2019.

In [23]:
# let's generate an entity dataframe with the timestamps we want,
# and the entities we want (in this case, all of them)

timestamps = pd.DataFrame()
timestamps["event_timestamp"] = pd.date_range("2019-02-01", "2019-03-01", freq="D")

# an array of the taxi_ids has already been saved to entities.parquet
taxi_ids = pyarrow.parquet.read_table("entities.parquet").to_pandas()

# Cross merge (Cartesian product) produces an entity dataframe with each taxi_id repeated for each timestamp:
entity_df = pd.merge(taxi_ids, timestamps, how='cross')

# retrieve the data
february_job = store.get_historical_features(
    entity_df=entity_df,
    features=[ # could have also made a feature service here instead of listing individual features
        "raw_data_fv:total_miles_travelled",
        "raw_data_fv:total_trip_seconds",
        "raw_data_fv:total_earned",
        "raw_data_fv:trip_count"
    ]
)

# turn this retrieved data into a saved dataset, persisting this query
# to a file and the metadata to the feast registry
store.create_saved_dataset(
    from_=february_job,
    name="february_saved_dataset",
    storage=SavedDatasetFileStorage(path="february_saved_ds.parquet")
)

<SavedDataset(name = february_saved_dataset, features = ['raw_data_fv:total_miles_travelled', 'raw_data_fv:total_trip_seconds', 'raw_data_fv:total_earned', 'raw_data_fv:trip_count'], join_keys = ['taxi_id'], storage = <feast.infra.offline_stores.file_source.SavedDatasetFileStorage object at 0x1376ba080>, full_feature_names = False, tags = {}, feature_service_name = None, _retrieval_job = <feast.infra.offline_stores.file.FileRetrievalJob object at 0x138018490>, min_event_timestamp = 2019-02-01 00:00:00+00:00, max_event_timestamp = 2019-03-01 00:00:00+00:00, created_timestamp = 2023-08-11 20:30:38.170170, last_updated_timestamp = 2023-08-11 20:30:38.170170)>
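
The entity dataframe built above is a Cartesian product of taxi IDs and daily timestamps. The sketch below isolates that step so the resulting shape is easy to see. The two IDs (`taxi_a`, `taxi_b`) are placeholders and not values from `entities.parquet`. It also shows that `pd.date_range` includes both endpoints, so March 1 is part of the February chunk.

```python
import pandas as pd

# Daily timestamps; pd.date_range includes both endpoints by default.
timestamps = pd.DataFrame(
    {"event_timestamp": pd.date_range("2019-02-01", "2019-03-01", freq="D")}
)

# Placeholder IDs standing in for the contents of entities.parquet.
taxi_ids = pd.DataFrame({"taxi_id": ["taxi_a", "taxi_b"]})

# how="cross" pairs every taxi_id with every timestamp (needs pandas >= 1.2).
entity_df = pd.merge(taxi_ids, timestamps, how="cross")

print(len(timestamps))  # 29: the 28 days of February 2019 plus March 1
print(entity_df.shape)  # (58, 2): 2 taxi IDs x 29 timestamps
```

With the real `entities.parquet`, the row count is the number of taxi IDs times the number of timestamps, which is why the retrieved datasets in this notebook are large.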

Now that we have our saved dataset, it's time to create a profile of Great Expectations "Expectations". Each Expectation is an assertion about the data.

There is a gallery of all the different expectations (there are over 300) here: [Great Expectations - Expectation Gallery](https://greatexpectations.io/expectations/?filterType=Backend+support&viewType=Summary&showFilters=true&subFilterValues=pandas)

Let's make sure that the data meets the structure we expect.

The following columns should be present and have the following datatypes:

- `total_miles_travelled` : `float64`
- `total_trip_seconds` : `int64`
- `total_earned` : `float64`
- `trip_count` : `int64`

In [26]:
@ge_profiler
def structure_profiler(ds: PandasDataset) -> ExpectationSuite:

    ds.expect_column_to_exist(
        column="total_miles_travelled"
    )

    ds.expect_column_values_to_be_of_type(
        column="total_miles_travelled",
        type_="float64"
    )

    ds.expect_column_to_exist(
        column="total_trip_seconds"
    )

    ds.expect_column_values_to_be_of_type(
        column="total_trip_seconds",
        type_="int64"
    )

    ds.expect_column_to_exist(
        column="total_earned"
    )

    ds.expect_column_values_to_be_of_type(
        column="total_earned",
        type_="float64"
    )

    ds.expect_column_to_exist(
        column="trip_count"
    )

    ds.expect_column_values_to_be_of_type(
        column="trip_count",
        type_="int64"
    )

    return ds.get_expectation_suite()
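
For intuition, the two kinds of expectation used in this profiler (a column exists, a column has a given dtype) correspond roughly to the plain-pandas check below. This is only an illustrative sketch, not how Great Expectations or Feast implement validation. `EXPECTED_DTYPES` and `check_schema` are hypothetical names, and the sample frame is a small hand-written stand-in for the real data.

```python
import pandas as pd

# Expected schema, taken from the list above the profiler.
EXPECTED_DTYPES = {
    "total_miles_travelled": "float64",
    "total_trip_seconds": "int64",
    "total_earned": "float64",
    "trip_count": "int64",
}


def check_schema(df: pd.DataFrame, expected: dict) -> list:
    """Return a list of human-readable schema problems (empty if none)."""
    problems = []
    for column, dtype in expected.items():
        if column not in df.columns:
            problems.append(f"missing column: {column}")
        elif str(df[column].dtype) != dtype:
            problems.append(f"{column}: expected {dtype}, got {df[column].dtype}")
    return problems


# Hand-written stand-in rows (hypothetical sample, not the real dataset).
sample = pd.DataFrame({
    "total_miles_travelled": [26.00, 62.91],
    "total_trip_seconds": [5820, 13007],
    "total_earned": [103.25, 194.50],
    "trip_count": [11, 11],
})

print(check_schema(sample, EXPECTED_DTYPES))  # [] when the schema matches
```

The profiler version has the advantage that the expectations are stored with the reference dataset in the Feast Registry, so the same checks can be replayed against any later retrieval.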

In [28]:
# Load saved dataset from earlier and show contents
ds = store.get_saved_dataset('february_saved_dataset')
ds.to_df()

# Now let's print out the profile that will be generated - this should match
# the expectations we included

profile = ds.get_profile(profiler=structure_profiler)
profile

<GEProfile with expectations: [
  {
    "expectation_type": "expect_column_to_exist",
    "kwargs": {
      "column": "total_miles_travelled"
    },
    "meta": {}
  },
  {
    "expectation_type": "expect_column_values_to_be_of_type",
    "kwargs": {
      "column": "total_miles_travelled",
      "type_": "float64"
    },
    "meta": {}
  },
  {
    "expectation_type": "expect_column_to_exist",
    "kwargs": {
      "column": "total_trip_seconds"
    },
    "meta": {}
  },
  {
    "expectation_type": "expect_column_values_to_be_of_type",
    "kwargs": {
      "column": "total_trip_seconds",
      "type_": "int64"
    },
    "meta": {}
  },
  {
    "expectation_type": "expect_column_to_exist",
    "kwargs": {
      "column": "total_earned"
    },
    "meta": {}
  },
  {
    "expectation_type": "expect_column_values_to_be_of_type",
    "kwargs": {
      "column": "total_earned",
      "type_": "float64"
    },
    "meta": {}
  },
  {
    "expectation_type": "expect_column_to_exist",
    

This output isn't super interesting by itself - mostly because this profiler didn't pull any information from the dataset. But - stick with me - the next one will pull values from the characteristics of the dataset, which will be used to generate values for quantitative expectations.

Let's register this dataset and profile as a reference dataset in Feast.

In [29]:
structural_validation_reference = ds.as_reference(
    name="structural_validation_reference_dataset",
    profiler=structure_profiler
)

store.apply([structural_validation_reference])

Now let's pull down the data for July 2019, and see that it meets our structural expectations!

In [30]:
timestamps = pd.DataFrame()
timestamps["event_timestamp"] = pd.date_range("2019-07-01", "2019-08-01", freq="D")

# an array of the taxi_ids has already been saved to entities.parquet
taxi_ids = pyarrow.parquet.read_table("entities.parquet").to_pandas()

# Cross merge (Cartesian product) produces an entity dataframe with each taxi_id repeated for each timestamp:
entity_df = pd.merge(taxi_ids, timestamps, how='cross')

# retrieve the data
july_job = store.get_historical_features(
    entity_df=entity_df,
    features=[ # could have also made a feature service here instead of listing individual features
        "raw_data_fv:total_miles_travelled",
        "raw_data_fv:total_trip_seconds",
        "raw_data_fv:total_earned",
        "raw_data_fv:trip_count"
    ]
)

In [32]:
july = july_job.to_df(
    validation_reference=structural_validation_reference
)
july

| | taxi_id | event_timestamp | total_miles_travelled | total_trip_seconds | total_earned | trip_count |
|---|---|---|---|---|---|---|
| 0 | f58be545baf2a44a9575194c9d6d8276e32781fae982ac... | 2019-08-01 00:00:00+00:00 | 26.00 | 5820 | 103.25 | 11 |
| 1 | f58be545baf2a44a9575194c9d6d8276e32781fae982ac... | 2019-07-31 00:00:00+00:00 | 26.00 | 5820 | 103.25 | 11 |
| 2 | f58be545baf2a44a9575194c9d6d8276e32781fae982ac... | 2019-07-30 00:00:00+00:00 | 26.00 | 5820 | 103.25 | 11 |
| 3 | f58be545baf2a44a9575194c9d6d8276e32781fae982ac... | 2019-07-29 00:00:00+00:00 | 26.00 | 5820 | 103.25 | 11 |
| 4 | f58be545baf2a44a9575194c9d6d8276e32781fae982ac... | 2019-07-28 00:00:00+00:00 | 26.00 | 5820 | 103.25 | 11 |
| ... | ... | ... | ... | ... | ... | ... |
| 154592 | fe1501e37a281f36abea1503071c99fd6611d42b465739... | 2019-08-01 00:00:00+00:00 | 62.91 | 13007 | 194.50 | 11 |
| 154593 | 41a48c4607a84559d8760d511527018e3da952a7b33a82... | 2019-08-01 00:00:00+00:00 | 54.80 | 6622 | 142.75 | 4 |
| 154594 | a33e818f102d2dc8ba6ec21c582cf1aa1cd034a85db091... | 2019-08-01 00:00:00+00:00 | 94.18 | 22111 | 243.75 | 5 |
| 154595 | b968bad5a2daed924a10e8ec4fb35513e060a076c575f7... | 2019-08-01 00:00:00+00:00 | 105.14 | 24266 | 382.25 | 37 |


As Feast pulls the data into a dataframe, it uses our reference to validate the new data against our expectations.

Since we didn't encounter any exceptions, we know our expectations were met - the columns we specified are present and the data types are as expected!

Now let's break the July data and see what happens.

In [37]:
# let's rename a column
july1 = july.rename(
    columns={"total_miles_travelled":"total_miles"}
)

# and also change the datatype of a column
july2 = july1.astype({"total_earned":str})

# then save as a parquet
july2.to_parquet(
    path="broken_july.parquet"
)
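
A small stand-in frame makes the effect of these two changes visible without the full July data. The sketch does not involve Feast. It only shows that after the rename the original column name is gone, and that casting to `str` replaces the `float64` dtype with a string-like one (the exact dtype name depends on the pandas version). `july_sample` is a hypothetical two-row frame with values copied from the July output.

```python
import pandas as pd

# Two rows with values copied from the July output shown earlier.
july_sample = pd.DataFrame({
    "total_miles_travelled": [26.00, 62.91],
    "total_earned": [103.25, 194.50],
})

# Apply the same two changes as in the cell above.
broken = (
    july_sample
    .rename(columns={"total_miles_travelled": "total_miles"})
    .astype({"total_earned": str})
)

# The expected column no longer exists under its original name.
print("total_miles_travelled" in broken.columns)  # False

# The earnings are now strings rather than floats.
print(broken["total_earned"].dtype)
print(broken["total_earned"].tolist())
```

Both changes target expectations from `structure_profiler`: the rename should violate `expect_column_to_exist` for `total_miles_travelled`, and the cast should violate the `float64` type expectation on `total_earned`.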

In [None]:
# Let's pull it back in as a saved dataset
