In [1]:
%pip install feast[ge]

Looking in indexes: https://gitlab.redchimney.com/api/v4/projects/5663/packages/pypi/simple
Note: you may need to restart the kernel to use updated packages.


In [2]:
import pyarrow.parquet
import pandas as pd

from feast import FeatureView, Entity, FeatureStore, FeatureService, Field, BatchFeatureView
from feast.types import Float64, Int64
from feast.value_type import ValueType
from feast.data_format import ParquetFormat
from feast.on_demand_feature_view import on_demand_feature_view
from feast.infra.offline_stores.file_source import FileSource
from feast.infra.offline_stores.file import SavedDatasetFileStorage
from datetime import timedelta

import numpy as np

from feast.dqm.profilers.ge_profiler import ge_profiler

from great_expectations.core.expectation_suite import ExpectationSuite
from great_expectations.dataset import PandasDataset

from feast.dqm.errors import ValidationFailed

# Data Validation with Feast Saved Datasets & Great Expectations

The purpose of this notebook is to showcase the potential for data validation with Feast's Saved Dataset feature, harnessing the power of the Great Expectations validation engine.

## The Data

### trips_stats.parquet

Driver statistics for a taxi service. Each row corresponds to a driver's statistics for one day, given that they had at least one trip that day. The columns are: `'taxi_id', 'day', 'total_miles_travelled', 'total_trip_seconds', 'total_earned', 'trip_count'`

### trips_stats_featurized.parquet

This is my attempt at featurizing the trip stats data into something a logistic regression model could use. I have no data science background, so the features are chosen arbitrarily, and the code that produces them (in `demo.ipynb`) is far from ideal. The columns are: `'taxi_id', 'day', 'trip_count_1_to_10', 'trip_count_11_to_25', 'trip_count_26_to_50', 'trip_count_51_to_100', 'trip_count_101_or_higher', 'total_miles_0_to_10', 'total_miles_10_to_25', 'total_miles_25_to_50', 'total_miles_50_to_100', 'total_miles_100_to_250', 'total_miles_250_to_500', 'total_miles_500_to_1000', 'total_miles_1000_or_higher', 'seconds_0_3600', 'seconds_3600_to_7200', 'seconds_7200_to_14400', 'seconds_14400_to_28800', 'seconds_28800_to_43200', 'seconds_43200_or_higher', 'earned_0_to_50', 'earned_50_to_100', 'earned_100_to_250', 'earned_250_to_500', 'earned_500_to_1000', 'earned_1000_or_higher'`

## The Goal

The goal is to show how Feast & Great Expectations can be used to:

1. Validate the structure and content of pre-featurization data. This will break the assumption that Feast stores only feature data, but it will show the power of Great Expectations.
2. Track the drift of distributions of features over time.



# Part 1: Validating the schema of pre-featurization data

I am first going to initialize a Feast feature store using the `feature_store.yaml` config file.

Then I am going to load the `trips_stats.parquet` file and register it as a `FileSource` for Feast.

`taxi_id` is the identifier for different drivers, so I will register an Entity for the `taxi_id`s.

I'll also make a feature view that corresponds with the columns of this table. This isn't featurized data yet, but it is needed in the Feast context. Great Expectations (when configured outside of Feast) does not need this, but I will do so for ease of demo.

I then apply these to the store object, saving these objects' metadata in the Feast Registry.

In [3]:
store = FeatureStore(repo_path=".", fs_yaml_file="feature_store.yaml")

raw_data_source = FileSource(
    path="trips_stats.parquet",
    name="raw_data",
    file_format=ParquetFormat(),
    description="Raw trip stat data - not featurized.",
    owner="Malcolm Keyes",
    timestamp_field="day"
)

taxi_entity = Entity(
    name='taxi_id',
    join_keys=['taxi_id'],
    value_type=ValueType.STRING,
    description="The ID unique to every driver.",
    owner="Malcolm Keyes"
)

raw_data_fv = FeatureView(
    name="raw_data_fv",
    source= raw_data_source,
    entities=[taxi_entity],
    description="The columns of the raw data.",
    owner="Malcolm Keyes"
)

store.apply([raw_data_source, taxi_entity, raw_data_fv])

In [18]:
store.get_feature_view("raw_data_fv")

<FeatureView(name = raw_data_fv, entities = ['taxi_id'], ttl = 0:00:00, stream_source = None, batch_source = {
  "type": "BATCH_FILE",
  "timestampField": "day",
  "fileOptions": {
    "fileFormat": {
      "parquetFormat": {}
    },
    "uri": "trips_stats.parquet"
  },
  "name": "raw_data",
  "description": "Raw trip stat data - not featurized.",
  "owner": "Malcolm Keyes"
}, entity_columns = [taxi_id-String], features = [total_miles_travelled-Float64, total_trip_seconds-Int64, total_earned-Float64, trip_count-Int64], description = The columns of the raw data., tags = {}, owner = Malcolm Keyes, projection = FeatureViewProjection(name='raw_data_fv', name_alias=None, desired_features=[], features=[total_miles_travelled-Float64, total_trip_seconds-Int64, total_earned-Float64, trip_count-Int64], join_key_map={}), created_timestamp = 2023-08-11 19:39:45.382344, last_updated_timestamp = 2023-08-11 19:39:45.382344, online = True, materialization_intervals = [])>

I have registered the raw data source to the Feast Registry, along with the `taxi_id` entity and the `raw_data_fv` feature view.

The data in the `trips_stats.parquet` file spans from January 1st, 2019 to December 31st, 2020. Before we do our data cleaning and featurization, let's create a suite of Expectations to describe the expected schema of the data.

This will be a little hacky, but we're going to set aside a chunk of the data, from Feb 1, 2019 to Mar 1, 2019, and save it as a dataset. Then we will run a profiler on it and create a profile. Then we will save the dataset and profile together, as a reference dataset. Once this is done, then we can use this reference dataset to validate a `RetrievalJob` (the result of a `get_historical_features()` call) on some other chunk of the data - like the month of July 2019, for example.

In [23]:
# let's generate an entity dataframe with the timestamps we want,
# and the entities we want (in this case, all of them)

timestamps = pd.DataFrame()
timestamps["event_timestamp"] = pd.date_range("2019-02-01", "2019-03-01", freq="D")

# an array of the taxi_ids has already been saved to entities.parquet
taxi_ids = pyarrow.parquet.read_table("entities.parquet").to_pandas()

# Cross merge (Cartesian product) produces an entity dataframe with each taxi_id repeated for each timestamp:
entity_df = pd.merge(taxi_ids, timestamps, how='cross')

# retrieve the data
february_job = store.get_historical_features(
    entity_df=entity_df,
    features=[ # could have also made a feature service here instead of listing individual features
        "raw_data_fv:total_miles_travelled",
        "raw_data_fv:total_trip_seconds",
        "raw_data_fv:total_earned",
        "raw_data_fv:trip_count"
    ]
)

# turn this retrieved data into a saved dataset, persisting this query
# to a file and the metadata to the feast registry
store.create_saved_dataset(
    from_=february_job,
    name="february_saved_dataset",
    storage=SavedDatasetFileStorage(path="february_saved_ds.parquet")
)

<SavedDataset(name = february_saved_dataset, features = ['raw_data_fv:total_miles_travelled', 'raw_data_fv:total_trip_seconds', 'raw_data_fv:total_earned', 'raw_data_fv:trip_count'], join_keys = ['taxi_id'], storage = <feast.infra.offline_stores.file_source.SavedDatasetFileStorage object at 0x1376ba080>, full_feature_names = False, tags = {}, feature_service_name = None, _retrieval_job = <feast.infra.offline_stores.file.FileRetrievalJob object at 0x138018490>, min_event_timestamp = 2019-02-01 00:00:00+00:00, max_event_timestamp = 2019-03-01 00:00:00+00:00, created_timestamp = 2023-08-11 20:30:38.170170, last_updated_timestamp = 2023-08-11 20:30:38.170170)>
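
To make the cross merge a bit more concrete, here is a minimal stdlib-only sketch of what the entity dataframe contains. The three `taxi_id` values are made up for illustration (the real ones come from `entities.parquet`), and plain dicts stand in for dataframe rows.

```python
from datetime import date, timedelta
from itertools import product

# Hypothetical stand-ins for the real taxi_ids in entities.parquet
taxi_ids = ["taxi_a", "taxi_b", "taxi_c"]

# Daily timestamps from 2019-02-01 to 2019-03-01, both ends included,
# which is the same range the pd.date_range(..., freq="D") call covers
start, end = date(2019, 2, 1), date(2019, 3, 1)
days = [start + timedelta(days=i) for i in range((end - start).days + 1)]

# Cartesian product: one (taxi_id, event_timestamp) row per pairing
entity_rows = [
    {"taxi_id": taxi_id, "event_timestamp": day}
    for taxi_id, day in product(taxi_ids, days)
]

print(len(days))         # 29 (28 days of February plus March 1)
print(len(entity_rows))  # 87 = 3 taxi_ids * 29 days
```

Note that the range includes March 1st, so the "February" chunk actually carries 29 timestamps per driver.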

Now that we have our saved dataset, it's time to create a profile of Great Expectations "Expectations". Each Expectation is an assertion about the data.

There is a gallery of all the different expectations (there are over 300) here: [Great Expectations - Expectation Gallery](https://greatexpectations.io/expectations/?filterType=Backend+support&viewType=Summary&showFilters=true&subFilterValues=pandas)

Let's make sure that the data meets the structure we expect.

The following columns should be present and have the following datatypes:

- `total_miles_travelled` : `float64`
- `total_trip_seconds` : `int64`
- `total_earned` : `float64`
- `trip_count` : `int64`

In [8]:
@ge_profiler
def structure_profiler(ds: PandasDataset) -> ExpectationSuite:

    ds.expect_column_to_exist(
        column="total_miles_travelled"
    )

    ds.expect_column_values_to_be_of_type(
        column="total_miles_travelled",
        type_="float64"
    )

    ds.expect_column_to_exist(
        column="total_trip_seconds"
    )

    ds.expect_column_values_to_be_of_type(
        column="total_trip_seconds",
        type_="int64"
    )

    ds.expect_column_to_exist(
        column="total_earned"
    )

    ds.expect_column_values_to_be_of_type(
        column="total_earned",
        type_="float64"
    )

    ds.expect_column_to_exist(
        column="trip_count"
    )

    ds.expect_column_values_to_be_of_type(
        column="trip_count",
        type_="int64"
    )

    return ds.get_expectation_suite()

In [9]:
# Load saved dataset from earlier and show contents
ds = store.get_saved_dataset('february_saved_dataset')
ds.to_df()

# Now let's print out the profile that will be generated - this should match
# the expectations we included

profile = ds.get_profile(profiler=structure_profiler)
profile

<GEProfile with expectations: [
  {
    "expectation_type": "expect_column_to_exist",
    "kwargs": {
      "column": "total_miles_travelled"
    },
    "meta": {}
  },
  {
    "expectation_type": "expect_column_values_to_be_of_type",
    "kwargs": {
      "column": "total_miles_travelled",
      "type_": "float64"
    },
    "meta": {}
  },
  {
    "expectation_type": "expect_column_to_exist",
    "kwargs": {
      "column": "total_trip_seconds"
    },
    "meta": {}
  },
  {
    "expectation_type": "expect_column_values_to_be_of_type",
    "kwargs": {
      "column": "total_trip_seconds",
      "type_": "int64"
    },
    "meta": {}
  },
  {
    "expectation_type": "expect_column_to_exist",
    "kwargs": {
      "column": "total_earned"
    },
    "meta": {}
  },
  {
    "expectation_type": "expect_column_values_to_be_of_type",
    "kwargs": {
      "column": "total_earned",
      "type_": "float64"
    },
    "meta": {}
  },
  {
    "expectation_type": "expect_column_to_exist",
    

This output isn't super interesting by itself - mostly because this profiler didn't pull any information from the dataset. But - stick with me - the next one will pull values from the characteristics of the dataset, which will be used to generate values for quantitative expectations.

Let's register this dataset and profile as a reference dataset in Feast.

In [10]:
structural_validation_reference = ds.as_reference(
    name="structural_validation_reference_dataset",
    profiler=structure_profiler
)

store.apply([structural_validation_reference])

Now let's pull down the data for July 2019, and see that it meets our structural expectations!

In [4]:
timestamps = pd.DataFrame()
timestamps["event_timestamp"] = pd.date_range("2019-07-01", "2019-08-01", freq="D")

# an array of the taxi_ids has already been saved to entities.parquet
taxi_ids = pyarrow.parquet.read_table("entities.parquet").to_pandas()

# Cross merge (Cartesian product) produces an entity dataframe with each taxi_id repeated for each timestamp:
entity_df = pd.merge(taxi_ids, timestamps, how='cross')

# retrieve the data
july_job = store.get_historical_features(
    entity_df=entity_df,
    features=[ # could have also made a feature service here instead of listing individual features
        "raw_data_fv:total_miles_travelled",
        "raw_data_fv:total_trip_seconds",
        "raw_data_fv:total_earned",
        "raw_data_fv:trip_count"
    ]
)

In [12]:
july = july_job.to_df(
    validation_reference=structural_validation_reference
)
july.head()

| | taxi_id | event_timestamp | total_miles_travelled | total_trip_seconds | total_earned | trip_count |
|---|---|---|---|---|---|---|
| 0 | f58be545baf2a44a9575194c9d6d8276e32781fae982ac... | 2019-08-01 00:00:00+00:00 | 26.0 | 5820 | 103.25 | 11 |
| 1 | f58be545baf2a44a9575194c9d6d8276e32781fae982ac... | 2019-07-31 00:00:00+00:00 | 26.0 | 5820 | 103.25 | 11 |
| 2 | f58be545baf2a44a9575194c9d6d8276e32781fae982ac... | 2019-07-30 00:00:00+00:00 | 26.0 | 5820 | 103.25 | 11 |
| 3 | f58be545baf2a44a9575194c9d6d8276e32781fae982ac... | 2019-07-29 00:00:00+00:00 | 26.0 | 5820 | 103.25 | 11 |
| 4 | f58be545baf2a44a9575194c9d6d8276e32781fae982ac... | 2019-07-28 00:00:00+00:00 | 26.0 | 5820 | 103.25 | 11 |


As Feast pulls the data into a dataframe, it uses our reference to validate the new data against our expectations.

Since we didn't encounter any exceptions, we know our expectations were met - the columns we specified are present and the data types are as expected!

Now let's break a subset of the July data and see what happens.

In [60]:
# let's pull from the actual data source - not what Feast returned

trip_stats_parquet = pd.read_parquet("trips_stats.parquet")

july_subset = trip_stats_parquet.loc[trip_stats_parquet['day'].isin(["2019-07-01 00:00:00+00:00"])].head(5)

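# rename a column so it no longer matches the schema we profiled
# (the raw file's timestamp column is already named "day", so it needs no rename)
july_subset = july_subset.rename(
    columns={
        "total_miles_travelled":"total_miles"
    }
)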

# and also change the datatype of a column
july_subset = july_subset.astype({"total_earned":str})

# then save as a parquet
july_subset.to_parquet(
    path="broken_july.parquet",
    index=False
)

In [62]:
# Let's pull it back in as a saved dataset
# ... which means we need to set it up as a FileSource
# and add a FeatureView, which breaks our Feast assumptions.

broken_july_raw_data_source = FileSource(
    path="broken_july.parquet",
    name="broken_july_raw_data",
    file_format=ParquetFormat(),
    description="(Broken) raw trip stat data for July - not featurized.",
    owner="Malcolm Keyes",
    timestamp_field="day"
)

broken_july_raw_data_fv = FeatureView(
    name="broken_july_raw_data_fv",
    source= broken_july_raw_data_source,
    entities=[taxi_entity],
    description="The columns of the raw data.",
    owner="Malcolm Keyes"
)

store.apply([broken_july_raw_data_source, broken_july_raw_data_fv])


In [14]:
# Now let's try to pull the data and validate it
# We'll build a small entity dataframe: July 1st only, for five taxi_ids

july_first_timestamps = pd.DataFrame()
july_first_timestamps["event_timestamp"] = pd.date_range("2019-07-01", "2019-07-01", freq="D")

taxi_ids_subset = pd.DataFrame()
taxi_ids_subset["taxi_id"] = ['3d5fcccd2f2e4fb12eba0d94f8cdf657b20d405fece7c8a26571a7a0db7d8f8cf4f257a86ad49ad864417c6d8ceb0763b2207fbb57a78950f2566409e1b93fcd',
 'd133de68d7dfb2069cd26b91cc2e0a934b2e3d125f7ce76dcbe456015776df5e688727e0abdd8b0016cdff2072ef3f65f6e10b7694560473c0bea15f75d1d562',
 '617d878f0486c82431cd73a6c1cee0992947dea1140244ca6e4ad1e1e167c0e7f1f8437e3b03c2a0ba0fd5b3f3eed869ee7e1ba9c3e0afe008e819cd4aa477d4',
 'c531f081cad817a366cbae2254ce7a3bb370394b3b0f1399dfc8781333c3565ca188035bbff4d1d7652181b79712c103d01ab168cb6ca2580f2da4b324cb00a5',
 '59cfe5aef1ecdd4418c373fe7848fb1f9defcbcfe70f17f70bf11f05711c380494e26d4f4a1b9e1c1d873a2f741318864d34f23735ee3b20f07fc10af300d66a']

# Cross merge (Cartesian product) produces an entity dataframe with each taxi_id repeated for each timestamp:
entity_df = pd.merge(taxi_ids_subset, july_first_timestamps, how='cross')

broken_july_job = store.get_historical_features(
    entity_df=entity_df,
    features=[ # could have also made a feature service here instead of listing individual features
        "broken_july_raw_data_fv:total_miles_travelled",
        "broken_july_raw_data_fv:total_trip_seconds",
        "broken_july_raw_data_fv:total_earned",
        "broken_july_raw_data_fv:trip_count"
    ]
)

KeyError: 'Feature total_miles_travelled not found in projection broken_july_raw_data_fv'

Ok, awesome! Feast caught that the features we requested didn't match the columns found in our broken table. We can check that box off for Feast.

Let's try again, passing in the correct features this time.

In [73]:
# Now let's try to pull the data and validate it
# We'll use the previously made entity dataframe

broken_july_job = store.get_historical_features(
    entity_df=entity_df,
    features=[ # could have also made a feature service here instead of listing individual features
        "broken_july_raw_data_fv:total_miles", # this is the name we changed
        "broken_july_raw_data_fv:total_trip_seconds",
        "broken_july_raw_data_fv:total_earned",
        "broken_july_raw_data_fv:trip_count"
    ]
)

This was successful! Now let's try to pull it into a Pandas dataframe and validate it against the reference dataset we made earlier.

In [74]:
try:
    broken_july = broken_july_job.to_df(
        validation_reference=structural_validation_reference
    )
except ValidationFailed as exc:
    print(exc.validation_report)

[
  {
    "success": false,
    "expectation_config": {
      "expectation_type": "expect_column_to_exist",
      "kwargs": {
        "column": "total_miles_travelled",
        "result_format": "COMPLETE"
      },
      "meta": {}
    },
    "result": {},
    "meta": {},
    "exception_info": {
      "raised_exception": false,
      "exception_message": null,
      "exception_traceback": null
    }
  },
  {
    "success": false,
    "expectation_config": {
      "expectation_type": "expect_column_values_to_be_of_type",
      "kwargs": {
        "column": "total_miles_travelled",
        "type_": "float64",
        "result_format": "COMPLETE"
      },
      "meta": {}
    },
    "result": {},
    "meta": {},
    "exception_info": {
      "raised_exception": true,
      "exception_traceback": "Traceback (most recent call last):\n  File \"/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/pandas/core/indexes/base.py\", line 3802, in get_loc\n    return self._e

If you read through the full validation report, you'll see:
* the `total_miles_travelled` column is missing
* because the `total_miles_travelled` column is missing, the datatype check also failed (it raised an exception) - so the separate existence check is redundant
* values in the `total_earned` column were of the wrong type, and Great Expectations lists the offending values. Some basic stats are also included.

Validating against the Great Expectations profile raised a `ValidationFailed` exception, and the attached report tells us which expectations were broken!
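
Since the raw report is long, here is a rough sketch of how it could be boiled down to one line per failed expectation. The `report` list below is hand-written to mirror the shape of the output printed above; I'm assuming the real `exc.validation_report` can be turned into a list of dicts like this, and the passing `trip_count` entry is hypothetical, included only to show the filter.

```python
# Hand-written stand-in mirroring the shape of the printed report
# (assumption: the real report carries more fields per entry)
report = [
    {
        "success": False,
        "expectation_config": {
            "expectation_type": "expect_column_to_exist",
            "kwargs": {"column": "total_miles_travelled"},
        },
        "exception_info": {"raised_exception": False},
    },
    {
        "success": False,
        "expectation_config": {
            "expectation_type": "expect_column_values_to_be_of_type",
            "kwargs": {"column": "total_miles_travelled", "type_": "float64"},
        },
        "exception_info": {"raised_exception": True},
    },
    {  # hypothetical passing entry, only here to exercise the filter
        "success": True,
        "expectation_config": {
            "expectation_type": "expect_column_to_exist",
            "kwargs": {"column": "trip_count"},
        },
        "exception_info": {"raised_exception": False},
    },
]


def summarize_failures(entries):
    """Return (column, expectation_type, raised_exception) for each failed entry."""
    return [
        (
            entry["expectation_config"]["kwargs"].get("column"),
            entry["expectation_config"]["expectation_type"],
            entry["exception_info"]["raised_exception"],
        )
        for entry in entries
        if not entry["success"]
    ]


for column, expectation, raised in summarize_failures(report):
    note = " (the check itself raised an exception)" if raised else ""
    print(f"{column}: {expectation}{note}")
```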

# Part 2: Validating the distributions of featurized data

The previous example was not truly Feast-esque in its design. We wouldn't make more sources and feature views for testing new data. It's also possible these types of errors would be caught by transformation logic before the data even got pushed to the offline store. Lastly, we want to store featurized data in the offline store - not raw data. However, it served to show part of the power of Great Expectations.

Now on to a much better demonstration of Great Expectations in combination with Feast and featurized data.

I have created a "featurized" version of the `trips_stats.parquet` file:

In [76]:
featurized = pd.read_parquet("trips_stats_featurized.parquet")
featurized.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1647134 entries, 0 to 1647133
Data columns (total 27 columns):
 #   Column                      Non-Null Count    Dtype              
---  ------                      --------------    -----              
 0   taxi_id                     1647134 non-null  object             
 1   day                         1647134 non-null  datetime64[ns, UTC]
 2   trip_count_1_to_10          1647134 non-null  int64              
 3   trip_count_11_to_25         1647134 non-null  int64              
 4   trip_count_26_to_50         1647134 non-null  int64              
 5   trip_count_51_to_100        1647134 non-null  int64              
 6   trip_count_101_or_higher    1647134 non-null  int64              
 7   total_miles_0_to_10         1647134 non-null  int64              
 8   total_miles_10_to_25        1647134 non-null  int64              
 9   total_miles_25_to_50        1647134 non-null  int64              
 10  total_miles_50_to_100       16

In [16]:
# Now let's add a source & feature view

feature_data_source = FileSource(
    path="trips_stats_featurized.parquet",
    name="feature_data",
    file_format=ParquetFormat(),
    description="Featurized trip stat data.",
    owner="Malcolm Keyes",
    timestamp_field="day"
)

feature_data_fv = FeatureView(
    name="feature_data_fv",
    source= feature_data_source,
    entities=[taxi_entity],
    description="The columns of the featurized data.",
    owner="Malcolm Keyes"
)

# Let's also make a feature service so we don't need to indicate every feature individually
feature_data_fs = FeatureService(
    name="driver_activity",
    features=[feature_data_fv]
)

store.apply([feature_data_source, feature_data_fv, feature_data_fs])

Feast's Saved Dataset / Data Quality Monitoring feature works best with a single saved dataset that can be used as a reference for other retrieval jobs. By analyzing that one saved dataset, you generate a profile, based on its characteristics, that can then be applied to other retrieval jobs. However, this does not bode well for comparisons where you also want to adjust for the characteristics of the retrieval job being compared.

Nonetheless, here is an example of comparing a day of feature data against the previous 7 days, 30 days, and 90 days.

In [17]:
# save july 1st, 2019 as saved dataset

# what should I use for entities? all entities in the dataset?
# all for 2019 (which is what is in entities.parquet)? all in
# the last 90 days? idk man

# Ideally I wouldn't be generating these point-in-time joins at all
# and just looking at the source feature data. I don't care about the
# entities - I just want to look at the general distributions

# I guess I'll use the included 2019 entities and see how it works

taxi_ids = pyarrow.parquet.read_table("entities.parquet").to_pandas()

entity_df = pd.merge(taxi_ids, july_first_timestamps, how='cross')

july_first_features = store.get_historical_features(
    entity_df=entity_df,
    features=feature_data_fs
)

july_first_df = july_first_features.to_df()
july_first_df.head()

| | taxi_id | event_timestamp | trip_count_1_to_10 | trip_count_11_to_25 | trip_count_26_to_50 | trip_count_51_to_100 | trip_count_101_or_higher | total_miles_0_to_10 | total_miles_10_to_25 | total_miles_25_to_50 | ... | seconds_7200_to_14400 | seconds_14400_to_28800 | seconds_28800_to_43200 | seconds_43200_or_higher | earned_0_to_50 | earned_50_to_100 | earned_100_to_250 | earned_250_to_500 | earned_500_to_1000 | earned_1000_or_higher |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | e45a4dfa53b65fb37c627b50cb466d6c7b14f87c5d991a... | 2019-07-01 00:00:00+00:00 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
| 1 | f58be545baf2a44a9575194c9d6d8276e32781fae982ac... | 2019-07-01 00:00:00+00:00 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
| 2 | 51a62d1739477e4b43c9a98c33788451f01eedde48c0fe... | 2019-07-01 00:00:00+00:00 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
| 3 | e67eb932b76835a3010b0a51a8d0624d72714fee077c59... | 2019-07-01 00:00:00+00:00 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | ... | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
| 4 | f219c2491cd2a30a788a0e4f1ce3797ef607a3ffe98ecd... | 2019-07-01 00:00:00+00:00 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |


In [95]:
# next, grab and check out the 7, 30, and 90 day descriptions

# previous 7
last_week_of_june_timestamps = pd.DataFrame()
last_week_of_june_timestamps["event_timestamp"] = pd.date_range("2019-06-24", "2019-06-30", freq="D")

entity_df = pd.merge(taxi_ids, last_week_of_june_timestamps, how='cross')

last_week_of_june_features = store.get_historical_features(
    entity_df=entity_df,
    features=feature_data_fs
)

last_week_of_june_df = last_week_of_june_features.to_df()

In [97]:
# previous 30
june_timestamps = pd.DataFrame()
june_timestamps["event_timestamp"] = pd.date_range("2019-06-01", "2019-06-30", freq="D")

entity_df = pd.merge(taxi_ids, june_timestamps, how='cross')

june_features = store.get_historical_features(
    entity_df=entity_df,
    features=feature_data_fs
)

june_df = june_features.to_df()

In [99]:
# previous 90
apr_to_july_timestamps = pd.DataFrame()
apr_to_july_timestamps["event_timestamp"] = pd.date_range("2019-04-01", "2019-06-30", freq="D")

entity_df = pd.merge(taxi_ids, apr_to_july_timestamps, how='cross')

apr_to_july_features = store.get_historical_features(
    entity_df=entity_df,
    features=feature_data_fs
)

apr_to_july_df = apr_to_july_features.to_df()

Now that we've retrieved the data for the last 7, 30, and 90 days from July 1st, 2019, we want to see how July 1st compares to these different time scales.
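
As a sanity check on the window boundaries, here is a small sketch that computes the trailing 7, 30, and 90 day ranges from a target day using only `datetime`. The helper name `trailing_window` is my own. One thing it surfaces: a strict 90-day window before July 1st starts on April 2nd, so the April 1st start used in the cell above actually covers 91 days.

```python
from datetime import date, timedelta


def trailing_window(target_day, n_days):
    """Return (first_day, last_day) of the n_days immediately before target_day."""
    last_day = target_day - timedelta(days=1)
    first_day = target_day - timedelta(days=n_days)
    return first_day, last_day


target = date(2019, 7, 1)
windows = {n: trailing_window(target, n) for n in (7, 30, 90)}

for n, (first, last) in windows.items():
    print(n, first.isoformat(), last.isoformat())
# 7 2019-06-24 2019-06-30
# 30 2019-06-01 2019-06-30
# 90 2019-04-02 2019-06-30
```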

At first I thought this might be doable by creating a saved dataset for July 1st and building a profile from it, then using that profile to validate the previous time periods. However, profile parameters are derived from the reference dataset, so this would measure how the earlier periods differ from the new day of data, when I really want to know whether the new day has changed relative to the earlier periods. For example, I may want to check whether the mean of `trip_count_1_to_10` for the day differs by more than 10% from its mean over the last 30 days. With a profile based on the single day of data, I would instead be testing whether the 30-day mean is more than 10% away from the single-day mean, which is not the same test.
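
To see why the direction of the comparison matters, here is a tiny worked example. The two means are made-up numbers, not values from the dataset, and `relative_change` is just a helper I wrote for the illustration.

```python
def relative_change(value, baseline):
    """Absolute change of value from baseline, as a fraction of baseline."""
    return abs(value - baseline) / abs(baseline)


# Hypothetical means of a 0/1 feature such as trip_count_1_to_10
mean_30_day = 0.50
mean_today = 0.452

# What I want: how far is today from the 30-day baseline?
today_vs_window = relative_change(mean_today, baseline=mean_30_day)

# What a single-day reference profile would test instead
window_vs_today = relative_change(mean_30_day, baseline=mean_today)

print(round(today_vs_window, 4))  # 0.096  -> within a 10% threshold
print(round(window_vs_today, 4))  # 0.1062 -> outside a 10% threshold
```

The same pair of means passes one framing and fails the other, simply because the denominator changes.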

So I would need to create a saved dataset and a profiler for every time scale. That means a lot of data being re-saved to S3 and later thrown out, plus multiple profiler functions that are nearly identical apart from their parameters. The multiple profilers could be handled easily with partial functions, but creating multiple saved datasets takes time and requires creating and deleting files.
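
Here is a rough idea of the partial-function approach, written as a plain-Python stand-in rather than a real profiler. The function `mean_within_tolerance`, the reference means, and the sample day are all hypothetical, and I have not tried combining `functools.partial` with `@ge_profiler`, so treat this as a sketch of the pattern only.

```python
from functools import partial
from statistics import mean


def mean_within_tolerance(values, *, reference_mean, tolerance):
    """True if mean(values) is within `tolerance` (a fraction) of reference_mean."""
    return abs(mean(values) - reference_mean) <= tolerance * abs(reference_mean)


# Hypothetical reference means for one 0/1 feature over each window
reference_means = {"7d": 0.40, "30d": 0.50, "90d": 0.55}

# One pre-configured check per window, all sharing the same logic
checks = {
    window: partial(mean_within_tolerance, reference_mean=ref, tolerance=0.10)
    for window, ref in reference_means.items()
}

today = [1, 0, 1, 0, 0, 1, 0, 1, 0, 1]  # hypothetical day of data, mean 0.5
results = {window: check(today) for window, check in checks.items()}
print(results)  # {'7d': False, '30d': True, '90d': True}
```

This removes the duplicated function bodies, but it does nothing about the cost of creating one saved dataset per window.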

Additionally, I am not certain that we actually want to do data quality monitoring on the returned `RetrievalJob`s, as they are point-in-time joins, and we really just want to look at the raw feature data. What makes this more annoying is that calls to `get_historical_features` require a set of entities, and we have no interest in the entity IDs, just the general distributions of the data.

Lastly, even if we pursued this option, there is no way to adjust the parameters based on the size of the non-reference dataset. If we receive a day of data that is really small, its distributions could be thrown way off simply due to noise and the overall lack of data. In effect, without being able to adjust for the size of the data, the constraints on distribution changes would have to be either loose enough to avoid false alarms, which risks missing real drift, or tight enough to catch real drift, which causes false alarms whenever the data size is smaller than normal.
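
For illustration, this is roughly what a size-aware constraint could look like outside of Feast. Since the features here are 0/1 flags, the mean of a column is a proportion, so one option is to scale the allowed deviation by the standard error of a proportion. This is a sketch under assumptions of my own (independent rows, a normal approximation, an arbitrary `z` of 3, and a made-up reference value), not something the Feast profiler supports as far as I can tell.

```python
from math import sqrt


def proportion_tolerance(reference_p, n, z=3.0):
    """Allowed absolute deviation for the mean of n 0/1 values around reference_p."""
    return z * sqrt(reference_p * (1.0 - reference_p) / n)


def has_drifted(sample_mean, n, reference_p, z=3.0):
    """Flag drift only when the deviation exceeds what sampling noise would explain."""
    return abs(sample_mean - reference_p) > proportion_tolerance(reference_p, n, z)


reference_p = 0.50  # hypothetical long-run share of rows with the flag set

# The same 0.05 gap is plausible noise on a small day but a real shift on a large one
print(has_drifted(0.55, n=100, reference_p=reference_p))     # False (tolerance 0.15)
print(has_drifted(0.55, n=10_000, reference_p=reference_p))  # True (tolerance 0.015)
```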